[Equations 4.2–4.4: the slow adaptation of the threshold h_i(t), with 0 ≤ c ≤ 1 (4.2); the decay of the potassium current (4.3); and spike generation (4.4). These dynamics are described in the text below.]
Effects of Pair-wise and Higher-order Correlations
161
Figure 2: (A–C) Probability of N clusters (p) as a function of the pair-wise correlation r, for three firing probabilities, f1 = 0.05 (A), 0.146 (B), and 0.225 (C). Probabilities are clipped at a value of 10^-4. (D–F) Contour plots of the same data show that the separation between the two peaks becomes larger with increasing r. Contour levels indicate probabilities of 0.001, 0.01, and 0.1.
Without input, the membrane potential E_i(t) is driven toward its resting value E_0 with a time constant τ_E. An influx of potassium drives the potential toward the potassium equilibrium potential E_K. Excitatory input moves the potential toward the ionic equilibrium value E_ex. In the simulations, the
162
S. M. Bohte, H. Spekreijse, and P. R. Roelfsema
values used by Deppisch et al. (1993) were adopted: E_0 = 0, τ_E = 2.5 msec, E_K = −1, E_ex = 7. The membrane time constant τ_E may appear to be rather small, although some have argued that it may be within the biologically plausible range (König et al., 1996; Bernander, Douglas, Martin, & Koch, 1991). In the discussion we will address the dependence of our results on this particular choice. Input to the cells was strictly excitatory, with synaptic delay t_ij = 0.5 msec. External input is provided by external stochastically spiking units and is treated equivalently to internal input. Internal noise is added by the additive noise term g_i(t) in equation 4.1, with a standard deviation of 0.06 · E_i(t). Equation 4.2 describes the slow adaptation of the threshold h_i to the membrane potential, modeling an adaptation of the neuron to excitation (h_0 = 1, τ_h = 10 msec, c = 0.3). Equation 4.3 governs the potassium current decay dynamics, which is driven toward g_0 = 0 with time constant τ_g (5 msec). In the case of a spike, the potassium current rises by an amount b (4.0), corresponding to the outward potassium current. A spike is generated each time the membrane potential exceeds the threshold (see equation 4.4), after which the neuron is in a refractory state for another 1.5 msec. In the actual numerical implementation of the neuron model, a discrete-time analog of equations 4.1–4.4 was used. These equations were iterated with time steps corresponding to 0.5 msec of real time. Each neuron was connected to every other neuron with a fixed homogeneous synaptic weight w_ij = w_int / N > 0. Here, w_int denotes the sum of the strengths of all synapses from within the network impinging on a single cell. External input consisted of N independent stochastic elements generating spikes with p = 0.05 per msec (50 Hz), connected in a one-to-one fashion with the network neurons with a constant weight w_ext = 0.8. The parameter w_int was varied during the simulation.
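The model is iterated in discrete time with 0.5-msec steps. Equation 4.1 itself is not reproduced in this section, so the update of E_i in the sketch below is a plausible reading of the dynamics rather than the paper's exact rule; the parameter values are the ones quoted above.

```python
import numpy as np

# Sketch of one discrete-time update step (dt = 0.5 msec) of the
# Deppisch et al. (1993)-style network described in the text.
# The exact form of equation 4.1 is an assumption here.
E0, TAU_E = 0.0, 2.5            # resting value, membrane time constant (msec)
EK, EEX = -1.0, 7.0             # potassium and excitatory equilibrium values
H0, TAU_H, C = 1.0, 10.0, 0.3   # threshold adaptation (eq. 4.2)
G0, TAU_G, B = 0.0, 5.0, 4.0    # potassium current dynamics (eq. 4.3)
DT, REFRAC = 0.5, 1.5           # time step and refractory period (msec)

def step(E, h, g, refrac_left, exc_input, rng):
    """One 0.5-msec update of N model neurons (hypothetical discretization)."""
    noise = rng.normal(0.0, 0.06 * np.abs(E))   # additive noise, sd = 0.06*E_i
    # leak toward E0, potassium drive toward EK, excitation toward EEX
    dE = (-(E - E0) / TAU_E + g * (EK - E) + exc_input * (EEX - E)) * DT
    E = E + dE + noise
    h = h + ((H0 + C * E) - h) * DT / TAU_H     # slow threshold adaptation (eq. 4.2)
    g = g + (G0 - g) * DT / TAU_G               # potassium current decay (eq. 4.3)
    spikes = (E > h) & (refrac_left <= 0.0)     # spike when E exceeds threshold (eq. 4.4)
    g = g + B * spikes                          # outward K+ current after a spike
    refrac_left = np.where(spikes, REFRAC, refrac_left - DT)
    return E, h, g, refrac_left, spikes
```

In a full simulation, `exc_input` would combine the recurrent input (weight w_int / N per synapse, delayed by 0.5 msec) with the external 50-Hz units at weight 0.8.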
Increasing the value of the weights drives the network from a stochastic mode into an oscillatory mode (Deppisch et al., 1993). Somewhere within this range of synaptic weights, the network alternates between the stochastic and oscillatory modes. All results are based on simulations with a network of N = 150 neurons, with the exception of the results in section 4.5. This was a compromise between a realistic number of neurons and a network consideration: the absence of inhibitory neurons makes a network very quickly prone to saturation, a state in which the constant excitation induces a very sharp oscillation. A further increase in the number of neurons also narrows the weight range in which the network is in the alternating state. The network output was obtained by recording the output of all 150 neurons. A sample of the output of 50 neurons is shown in Figure 3.

4.2 Network Simulations. In order to vary the average pair-wise correlation, simulations were performed with different values of the synaptic weight w_int. As was noted by Deppisch et al. (1993), the network exhibits three distinct modes, which depend on the value of the synaptic weight.
Figure 3: (Upper panel) Activity of 50 of the 150 neurons. Time is displayed on the horizontal axis (sampling time, 0.5 msec). The vertical ticks indicate action potentials. Oscillatory episodes alternate with periods of more random activity. (Lower panel) Summed activity of all 150 neurons.
The first, at low values of the synaptic weights, is the stochastic mode, in which the average correlation between pairs of neurons is near zero. At the other end of the scale is the mode with high values of the internal weight. In this mode, the network activity is highly oscillatory. In between these extremes are synaptic weight values for which the network exhibits episodes in which many neurons fire synchronously, alternating with episodes in which neurons are synchronized to a lesser degree. Typical activity of the neurons in these three network modes is shown in Figures 4A–C. Also plotted are the cross-correlations between a pair of neurons in the network (see Figures 4D–F). As observed by Deppisch et al. (1993), these cross-correlograms show a qualitative resemblance to data obtained from electrophysiological recordings (e.g., König et al., 1995). Remarkably, the network with intermediate synaptic strength (w = 3.875, r = 0.03, for a 2-msec bin size) exhibits occasional population bursts, which barely show up in the correlation function (see Figures 4B and 4E). Thus, a network state in which a number of highly synchronous population bursts occur can be associated with correlation functions indicative of weak pair-wise coupling. When investigating the relationship between occurrences of N clusters and the pair-wise correlation, an additional assumption has to be made with regard to the maximal time difference between two spikes that are considered to be synchronous. As a first approach, we used a time window of 2 msec. The maximal firing rate of the neurons is determined by the refractory period (1.5 msec) and the duration of an action potential (0.5 msec). During network bursts, neurons reach this maximal firing rate, and a single spike occurs in each 2-msec bin. The effect of changing the bin width on the distribution of cluster sizes will be investigated in section 4.4.

Figure 4: Network activity for three values of the synaptic weight, w_int. (A–C) Summed activity of the 150 neurons. The synaptic weights were 3.5 (A, r = 0.003), 3.875 (B, r = 0.03), and 4.375 (C, r = 0.20). Pair-wise correlations were determined for coincidences within a window of 2 msec. (D–F) Cross-correlation functions of two randomly selected neurons for the same synaptic weights as are plotted on the left. Even for a low pair-wise correlation (r = 0.03), highly synchronous network bursts occur, but these barely show up in the cross-correlogram.
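The pair-wise correlation r for a given coincidence window can be computed directly from the binned spike trains. The text does not spell out the estimator, so the sketch below assumes one conventional choice: the Pearson correlation of binary binned trains, averaged over all distinct pairs.

```python
import numpy as np

def pairwise_correlation(spikes, bin_steps=4):
    """Average Pearson correlation between pairs of binned spike trains.

    spikes: (n_neurons, n_timesteps) boolean array at 0.5-msec resolution.
    bin_steps: time steps per coincidence bin (4 steps = 2 msec).
    This estimator is an assumption; the paper's exact procedure
    is not given in this section.
    """
    n, t = spikes.shape
    t = t - t % bin_steps                       # drop a ragged tail, if any
    binned = spikes[:, :t].reshape(n, -1, bin_steps).any(axis=2).astype(float)
    c = np.corrcoef(binned)                     # n x n correlation matrix
    iu = np.triu_indices(n, k=1)
    return c[iu].mean()                         # mean over distinct pairs
```

Perfectly synchronized trains give r near 1, and independent trains give r near 0, so the function interpolates the regimes described above.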
Figure 5 shows the relationship between the occurrence of N clusters and the pair-wise correlation. Plotted is the probability P_i (see also equation 3.8) of a particular number of coincident spikes, for simulations with different pair-wise correlations (different values of w_int). As can be seen in Figure 5A, most 2-msec bins are occupied by N clusters containing a fairly low number of spikes. This corresponds to the stochastic activity between bursts, and the probability of these events is approximated by a binomial distribution. Synchronous bursts are represented by the second peak in the probability distribution, as can be seen more clearly in the logarithmic plot of Figure 5B. Between these two peaks is a plateau of time bins in which an intermediate number of neurons fire. These events are caused by time bins that are aligned on the onset or end of a burst. Increases in the pair-wise correlation are associated with a slightly smaller first peak and an enhanced second peak. For very large, and biologically implausible, pair-wise correlations, a third peak containing intermediate cluster sizes is visible, which can be attributed to the onset and offset of bursts. Let us now consider the impact of this distribution on a postsynaptic cell receiving input from all network neurons. For such a cell, the exact number of synchronous spikes (the quantity plotted in Figures 5A and 5B) is not important. Rather, for a postsynaptic neuron with a firing threshold of m excitatory postsynaptic potentials (EPSPs), the incidence of m or more coincident spikes is a much more significant quantity, since this determines its firing rate to a large extent. Therefore, we computed a cumulative probability distribution: the probability that m or more neurons fire synchronously. The cumulative probability distributions (the cumulatives of Figure 5B) are plotted in Figure 5C.
Suppose that the membrane potential of a postsynaptic neuron, which receives input from all cells in our network, equals the resting potential at time t. The probability that this neuron fires at time t + 1 is equal to the probability of h′ or more coincident spikes, where h′ is the ratio between the postsynaptic threshold (h) and the EPSP amplitude. In other words, the curves in Figure 5C illustrate the relation between the threshold and firing probability of a neuron that receives input from all neurons in the network, in the case of a very short membrane time constant. The curve for a very low pair-wise correlation coefficient (r = 0.003) shows the rapidly declining probability of higher-order events. The curves for network activity with a larger pair-wise correlation exhibit a plateau before their decline at very high numbers of synchronized spikes. This plateau results from the second peak in the N-cluster probability distribution. It is remarkable that even small changes in the strength of the pair-wise correlations, which are not physiologically implausible (e.g., Gray et al., 1989; Livingstone, 1996), have a strong effect on the firing probability in the case of an intermediate threshold h′. Indeed, the firing probability of a neuron with a threshold of, say, 50 EPSPs is raised by an order of magnitude by an increase in the pair-wise correlation as small as 0.08 (compare the distributions for r = 0.03 and r = 0.12).
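The N-cluster and cumulative distributions used above follow directly from the binned network output. A minimal sketch (the function name is ours):

```python
import numpy as np

def cluster_distribution(binned):
    """Probability P_i that exactly i neurons fire in the same bin, plus the
    cumulative probability of i-or-more coincident spikes (the quantity
    relevant to a postsynaptic cell with threshold h')."""
    n_neurons, n_bins = binned.shape
    sizes = binned.sum(axis=0)                        # cluster size per bin
    p = np.bincount(sizes, minlength=n_neurons + 1) / n_bins
    cumulative = p[::-1].cumsum()[::-1]               # P(cluster size >= i)
    return p, cumulative
```

For a neuron with threshold h′, `cumulative[h_prime]` then approximates its firing probability per bin, in the short-membrane-time-constant limit described in the text.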
The question remains how closely the firing rate of a postsynaptic neuron with a nonzero membrane time constant is approximated by the cumulative probability distribution. There would obviously be a one-to-one relationship for a postsynaptic neuron that has "zero" memory, that is, a neuron that is influenced only by the input in the previous time step (the bin size for which spikes are considered synchronous, as discussed above). However, typical neurons have parameters with larger time constants, including the refractory period and the time constant of the membrane. In order to assess the impact of this "nonzero memory," an additional, 151st neuron was included in the network. This cell was identical to the other 150 neurons, from which it received input, but it did not project back to them. The pair-wise correlation was fixed at 0.12, and simulations were performed with different values of the postsynaptic threshold h′ by varying w_i,151 (1 ≤ i ≤ 150), the strength of the synapses projecting onto neuron 151. The dependence of the firing probability on h′ (h / w_i,151) is shown in Figure 5D. Superimposed on this graph is the distribution of N clusters. As expected, some minor differences between the firing probability and the cumulative probability distribution are observed. Most of these differences can be explained by the fact that the nonzero average activity in a pool of neurons keeps the membrane potential of individual neurons at a somewhat higher level than the resting potential. This lowers the average number of coincident spikes the neuron requires to cross threshold. On the whole, however, the firing probability of the postsynaptic neuron with a nonzero membrane constant is predicted with good accuracy by the cumulative probability distribution. A remarkable feature of Figure 5D is that the firing probability of a neuron with a threshold of, for example, 40 does not differ much from that of a neuron with a much higher threshold (e.g., 130).
Most of the spikes of a postsynaptic cell with a threshold larger than 40 are triggered by synchronous bursts in which the majority of neurons participate.
Figure 5: Facing page. (A, B) Probability distribution of N clusters, plotted on a normal scale (A) and a logarithmic scale (B). (C) Cumulative probability of N clusters. Abscissa, cluster size (h′). Ordinate, probability of a cluster with a size that is equal to or larger than h′. (D) Comparison of the cumulative N-cluster distribution (solid line) to the firing probability of a postsynaptic neuron that receives input from all 150 neurons in the network. The threshold h′ is defined as h / w_i,151. (E) Dependence of the firing probability of a neuron receiving input from all 150 neurons on its relative threshold h′ and the pair-wise correlation in the network. Different curves correspond to different values of h′. Triangles indicate the dependence of the average firing probability f1 on the pair-wise correlation. (F) Same as E, but the firing probability of the postsynaptic cell and f1 are plotted as a function of w_int.
Using this interpretation of the cumulative probability distribution, the effect of an increase in the pair-wise correlation in the network on postsynaptic neurons was investigated by varying the synaptic weight (w_int). Figure 5E shows the relationship between the average pair-wise correlation and the firing probability of a hypothetical postsynaptic neuron with threshold h′. The firing probabilities were calculated for thresholds h′ = 30, 50, and 100. It can be seen that in the biologically relevant range of pair-wise correlations (r = 0–0.2), there is a monotonic relationship between the postsynaptic firing probability and the pair-wise correlation. For values of r below 0.2, changes in the pair-wise correlation are associated with a relatively large increase in the firing probability of the postsynaptic neuron. Importantly, higher values of r are also associated with an enhanced firing rate of the presynaptic network neurons (see Figure 5E). This increase in activity undoubtedly contributes to the enhanced probability of higher-order events. However, it should be noted that the probability of higher-order events exhibits a steeper dependence on r than the firing probability of the network neurons (see Figure 5E). Figure 5F illustrates the dependence of the firing probability of the postsynaptic neuron on w_int, the parameter that was actually varied during the simulations.

4.3 Estimation of the N-Cluster Distribution by Entropy Maximization. In most electrophysiological experiments, only data on firing probabilities and pair-wise correlations are available. Using the mathematical framework developed in section 3, we investigated whether the observed relationship between the probability of N clusters and the correlation coefficient could have been predicted from these two types of measurements. For the network behavior at different values of w_int (3.5, 3.875, and 4.125), distributions of N clusters were calculated that maximized entropy under the constraints of the observed firing probability and pair-wise correlations. A comparison between the distributions based on the maximal entropy calculation and the experimental distributions is shown in Figures 6A–C. The distribution that maximizes the entropy deviates somewhat from the observed distribution for larger values of r. The maximal entropy calculation underestimates the incidence of clusters between 60 and 100 spikes, which occur during the start and end of population bursts, as was discussed above.
In addition, the second peak in the maximal entropy distribution is located at a cluster size of 130, whereas the actual location of this peak is 150. Typically, all neurons fire within a 2-msec window during a network burst. In order to estimate the effect of these discrepancies on the firing probability of a neuron receiving input from the network, the cumulative probability distributions are plotted in Figures 6D–F. It can be seen that the underestimation of the incidence of intermediate cluster sizes by the maximal entropy calculation is compensated by the overestimation of the incidence of clusters with a size between 100 and 140. For values of h′ smaller than 130, the largest deviation is about a factor of 2. This approximation is reasonable, since the maximal entropy calculation depends on only two parameters that are estimated from the network activity: the firing probability and the pair-wise correlation. Nevertheless, large deviations occur for values of h′ larger than 130. However, these high threshold values are physiologically implausible, because they would imply that a postsynaptic neuron fires only when almost every input is active within a narrow time window.
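The entropy-maximization step itself can be sketched compactly. For a homogeneous network, maximizing S = −Σ_i C(n, i) D_i ln D_i under the three constraints gives, by Lagrange multipliers, D_i = exp(a + b·i + c·i(i−1)). The sketch below solves for the multipliers numerically; the paper's own numerical procedure is not specified in this section, so scipy's `fsolve` stands in as an assumed root finder.

```python
import numpy as np
from math import comb
from scipy.optimize import fsolve

def maxent_cluster_distribution(n, f1, f2):
    """Maximal-entropy D_i under the constraints of eqs. 3.1, 3.2, and 3.4.

    The Lagrange conditions give D_i = exp(a + b*i + c*i*(i-1));
    a, b, c are found by root finding (one convenient choice, not
    necessarily the paper's own procedure).
    """
    i = np.arange(n + 1)
    binom_n = np.array([comb(n, k) for k in i], dtype=float)
    binom_n1 = np.array([comb(n - 1, k - 1) if k >= 1 else 0 for k in i], dtype=float)
    binom_n2 = np.array([comb(n - 2, k - 2) if k >= 2 else 0 for k in i], dtype=float)

    def residuals(params):
        a, b, c = params
        D = np.exp(a + b * i + c * i * (i - 1))
        return [binom_n @ D - 1.0,      # eq. 3.1: probabilities sum to 1
                binom_n1 @ D - f1,      # eq. 3.2: firing probability
                binom_n2 @ D - f2]      # eq. 3.4: pair-wise firing probability

    a, b, c = fsolve(residuals, x0=(-n * np.log(2), 0.0, 0.0))
    return np.exp(a + b * i + c * i * (i - 1))
```

A useful sanity check: when f2 = f1², the constraints describe independent neurons, and the result reduces to the binomial case D_i = f1^i (1 − f1)^(n−i).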
Figure 6: Comparison of the probability of N clusters in the network simulation and their probability based on maximal entropy. (A–C) Probabilities of N clusters in the simulation are shown as solid curves for three values of the synaptic weights. Dashed line: the corresponding maximal entropy probability distributions with the same firing probability and pair-wise correlation. (D–F) Relationship between firing probability (ordinate) and threshold (abscissa) of a neuron with an integration window of 2 msec that receives input from all 150 neurons of the network (cumulatives of A–C).
4.4 Effects of Varying Bin Width on the Maximal Entropy Distribution. The results of the maximal entropy calculation described so far were obtained with a bin width of 2 msec, which equals the minimal interspike interval. Figure 7 shows the dependence of the N-cluster distribution and the maximal entropy estimation on the bin width. Spike trains obtained with a synaptic weight (w_ij) of 4.125 were rebinned, using bin widths of 0.5, 1.0, 2.0, and 4.0 msec. The rebinning process influenced the pair-wise correlation and f1, the probability of firing in a time bin (see equation 3.2). Smaller bin widths reduce f1. This results in a leftward shift in the location of the first peak in the distribution of cluster sizes, which represents stochastic activity between bursts (compare Figures 7A and 7B to Figure 7C). A second effect of reducing the bin width is the disappearance of clusters with sizes larger than 110. This disappearance is related to the refractory period, which prevents neurons from firing in consecutive bins during population bursts. Spikes fired by different neurons during these bursts are therefore divided between successive time bins, and no bins remain in which all neurons fire simultaneously. This modification of the distribution of cluster sizes is not captured by the maximal entropy estimation, which underestimates the incidence of clusters with sizes between 20 and 110 and overestimates the incidence of clusters with a size larger than 130 (see Figures 7A–B). In other words, the refractory period adds structure to the spike trains, which therefore have less than maximal entropy. When a bin width is used that is longer than the refractory period, this additional structure is lost, and the maximal entropy calculation may provide a reasonable estimate of the distribution of cluster sizes. For bin sizes that are longer than the minimal interspike interval, there are time windows during which individual cells fire more than a single spike.
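Allowing multiple spikes per neuron per bin enlarges the space of per-bin configurations rapidly. The distribution variables become the numbers of neurons firing exactly 1, ..., k times, so counting them is a weak-composition count, C(n + k, k). A small helper (the function is hypothetical, but its k = 1 value matches the 151 variables D0 to D150 of the homogeneous case):

```python
from math import comb

def n_variables(n, max_spikes_per_bin):
    """Number of distribution variables for a homogeneous network of n neurons
    when each neuron may fire up to max_spikes_per_bin spikes in a bin.

    One variable per tuple (m_1, ..., m_k), where m_j >= 0 is the number of
    neurons firing exactly j times and m_1 + ... + m_k <= n; the count of
    such tuples is the binomial coefficient C(n + k, k)."""
    k = max_spikes_per_bin
    return comb(n + k, k)
```

For n = 150, k = 1 gives 151 variables, and k = 2 already gives 11,476, which is the "more than 10,000" growth discussed below; k = 3 pushes the count into the hundreds of thousands.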
It is possible, in principle, to adapt the maximal entropy estimation to this situation. One approach, in the case of a bin size that may include two spikes, is to compute the probability that N neurons fire once and M neurons fire twice in a bin, for each combination of N and M. Unfortunately, this increases the number of variables that have to be calculated from 151 (D0 to D150) to more than 10,000. The computational requirements increase further if three or more spikes can occur in a single bin. Therefore, we took an alternative approach, in which the state of a neuron was labeled "off" in the case of no spike and labeled "on" in the case of one or more spikes within a time bin. This keeps the computational requirements within bounds, but at the cost of losing spikes in the rebinning process. Figure 7D compares the distribution of cluster sizes to the maximal entropy distribution for a bin width of 4 msec. The experimental distribution is largely described by a broad first peak, which is shifted to the right, and a very narrow second peak at a cluster size of 150. An increase in f1 also shifts the first peak of the maximal entropy distribution to the right. However, larger values of f1 are also associated with a leftward shift of the second peak in the distribution of maximal entropy, as was discussed in section 3. This causes large deviations
Figure 7: (A–D) Comparison of the maximal entropy distribution to the distribution of N clusters in the simulation for bin sizes ranging from 0.5 msec to 4 msec. The quality of the maximal entropy approximation depends on bin size, that is, on what is considered coincident. In our simulations, a bin size of 2 msec yields the best approximation (C). For smaller bin sizes (A: 0.5 msec, B: 1 msec), the actual and predicted distributions deviate considerably. (E–H) Impact of these deviations on a postsynaptic neuron with threshold h′.
for cluster sizes between 60 and 120, the incidence of which is overestimated by the maximal entropy estimation. Moreover, the narrow peak at a cluster size of 150 is absent in the distribution of maximal entropy. In summary, reasonable estimates of the N-cluster distribution are obtained only for a bin width that is equal to the minimal interspike interval. In the discussion, we will address the question of whether these results can be generalized to other network architectures.

4.5 Effects of Network Scaling. In order to investigate how the network behavior and the goodness of fit of the maximal entropy distribution depend on network size, simulations were run with networks composed of 75, 150, and 200 neurons. Figure 8A shows the cumulative distribution of N clusters for a network of 75 neurons with a w_int of 3.55, which exhibited a pair-wise correlation of 0.07. It should be noted that w_int represents the sum of the synaptic weights w_ij of all inputs converging onto a single neuron (section 4.1). When the network size is doubled, each cell receives input from twice as many neurons, and the strength of the individual synapses w_ij was reduced accordingly, in order to maintain a constant value of w_int. Nevertheless, a doubling of the network size resulted in a clear leftward shift of the distribution of N clusters and a reduction of the pair-wise correlation to 0.003. This indicates that a constant value of w_int is not sufficient to guarantee a qualitatively similar network behavior when the network size is increased. Indeed, a larger number of synapses with reduced weight impinging on a network unit results in a reduction of the fluctuations in the input, as long as the network is in the stochastic mode. This reduces the probability of bursts in the network (Tsodyks & Sejnowski, 1995).
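The fluctuation argument can be illustrated numerically: halving the weights while doubling the number of inputs shrinks the standard deviation of the summed input by about √2, whereas halving the release probability at a fixed weight roughly preserves it. A sketch (illustrative only; the independent inputs assumed here ignore the recurrent correlations of the actual network):

```python
import numpy as np

def input_sd(n_pre, w_each, p_fire, release_prob=1.0, n_trials=20000, seed=0):
    """Standard deviation of the summed synaptic input in one time bin.

    Each of n_pre presynaptic neurons fires independently with probability
    p_fire, and each spike is transmitted with probability release_prob
    (stochastic release in the style of Tsodyks & Sejnowski, 1995)."""
    rng = np.random.default_rng(seed)
    spikes = rng.random((n_trials, n_pre)) < p_fire
    released = spikes & (rng.random((n_trials, n_pre)) < release_prob)
    return (w_each * released.sum(axis=1)).std()
```

Doubling `n_pre` with halved `w_each` keeps the mean input constant but reduces its standard deviation, which is the burst-suppressing effect described above; doubling `n_pre` with `release_prob=0.5` keeps both roughly constant.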
Tsodyks and Sejnowski (1995) have suggested that the variance in the input to network units may be kept approximately constant by reducing the release probability of the synapses, rather than their strength w_ij, when the network size is increased. Figure 8C shows the cumulative distribution of N clusters for a network of 150 neurons, in which the effective input strength was kept constant by reducing the release probability to 50%. The pair-wise correlation was 0.12, and the cumulative distribution of N clusters was qualitatively similar to that of the smaller network, in accordance with Tsodyks and Sejnowski (1995). Figure 8D shows a similar result for a network composed of 200 neurons. In all cases, the estimate based on maximal entropy was reasonable (dashed lines in Figures 8A–D), which indicates that the quality of the maximal entropy calculation does not depend critically on the size of the network.

Figure 8: Effect of scaling the network size. (A) Continuous line: probability of N clusters with a minimal size of h′, as a function of h′, for a network with 75 neurons. Dashed line: distribution of maximal entropy with the same firing probability and pair-wise correlation. Parameters of the simulation: w_ij = 4.73 × 10^-2, r = 0.07. (B) N-cluster distribution for a larger network with 150 neurons. The total input converging on a neuron, w_int, was kept constant by reducing w_ij to 2.37 × 10^-2. Nevertheless, the distribution of N clusters was shifted to the left, and the pair-wise correlation was reduced to 0.003. (C) Same as B, but the effective w_int was kept constant by reducing the release probability to 50% rather than by reducing w_ij. The synaptic strength w_ij was 4.73 × 10^-2, and r = 0.12. (D) N-cluster distribution for a network of 200 neurons, with an r of 0.15. The synaptic weight w_ij was identical to that in A, but the release probability was reduced to 37.5%. Note the similarity of the distributions in C and D to the distribution in A.

5 Discussion

In most physiological studies on the synchronization behavior of cortical neurons, recordings are obtained from pairs of neurons or pairs of cell clusters. Our results illustrate that these data can supply only limited information about the probability of higher-order events, for example, the probability that 30 or more neurons that project to a target neuron fire simultaneously. As an approximation for the probability of higher-order events, we used the distribution of maximal entropy. This method provides the most unstructured distribution, given the constraints supplied by the firing probabilities and the pair-wise correlations. The N-cluster distribution with maximal entropy for a homogeneous network without higher-order
correlations exhibits two peaks. The first peak represents stochastic activity and resembles a binomial distribution. This peak also occurs in the absence of correlations. If the firing probability is moderate (< 0.25), a second peak occurs at a relatively large cluster size. The magnitude of this second peak depends on the strength of the pair-wise correlation. Thus, an absence of higher-order correlations dictates that the network should generate coincidences in a number of tightly synchronized network bursts. The N-cluster distribution observed in the simulations also exhibited two peaks, although the position of the second peak differed from that of the second peak in the maximal entropy distribution. Questions about the generality of these results, in particular about their dependence on the details of the network implementation, will have to await further experimentation. Previous studies on the impact of correlations among neurons that project to a common postsynaptic target have used N-cluster distributions with a drastically different shape (Bernander et al., 1994; Murthy & Fetz, 1994). In these studies, correlations were introduced in the input by forcing a subset of the presynaptic neurons to fire in perfect synchrony, but independently of the other presynaptic neurons. Since the resulting N-cluster distribution is relatively devoid of highly synchronous events and has less than maximal entropy, the generality of the results obtained in these earlier studies may also be limited. It is obvious that network bursts in which a large number of neurons participate are rather effective in driving a postsynaptic neuron, especially in the case of a high postsynaptic threshold. Indeed, a recent study (Alonso et al., 1996) uncovered tight correlations, with a peak width in the cross-correlogram of less than 1 msec, among pairs of neurons in the lateral geniculate nucleus (LGN).
In cases in which the LGN neurons projected to a common cortical neuron, the impact of synchronous events was stronger than that of asynchronous spikes. We observed an orderly relationship between the incidence of highly synchronous network bursts and the pair-wise correlation. Relatively small increases in the pair-wise correlation, which are not physiologically implausible, may raise the incidence of highly synchronous events, and thereby the firing rate of a postsynaptic cell, by an order of magnitude (see Figures 5C and 5E). We will now discuss the limitations of the maximal entropy estimation and the ways in which these limitations may be overcome by future studies. First, the quality of the approximation by the distribution of maximal entropy exhibited a strong dependence on the choice of the bin size. Reasonable results were obtained only for bin sizes larger than the refractory period of the neurons. If the bin size was smaller than the refractory period, structure was added to the N-cluster distribution, which therefore had less than maximal entropy. This limitation is a direct consequence of the way in which the entropy was defined. The entropy was determined by the distribution of N clusters within individual time bins and was independent of the order of N clusters in successive bins. The equivalent situation for
a cross-correlation study would be to calculate only the center bin of the cross-correlation functions, that is, only the probability that two neurons fire at exactly the same time. A possible extension of the method would be to reformulate the entropy in order to include second-order correlations with a time delay (i.e., the noncentral bins in the auto- and cross-correlation functions) in the calculation. However, such an extension is likely to result in a substantial increase in the number of variables and equations. Second, the quality of the approximation by the maximal entropy distribution was also degraded if a bin width was chosen that was larger than the minimal interspike interval (larger than 2 msec in our simulations). In this situation, multiple spikes of a single neuron occurred within a single time bin. The membrane time constant of the postsynaptic neuron provides a natural temporal window over which input is integrated, that is, the natural coincidence window. Currently, the value of the effective membrane time constant in cortical neurons is a topic of considerable debate (Bernander et al., 1991; König et al., 1996; Shadlen & Newsome, 1995). A coincidence window of 2 msec, which we used, is presumably at the lower end of the physiologically plausible range. However, it seems likely that an increase to 5 or even 10 msec in the width of the bin in which spikes are considered to be coincident may be unproblematic in the case of cortical neurons. Typical peak firing rates of cortical neurons are around 100 Hz, which implies that the probability of multiple spikes within a 5- or 10-msec bin will still be rather low. Alternatively, the calculation of the maximal entropy may be adapted to allow for multiple spikes in a single bin, as was discussed in section 4.4. If a very large bin size is chosen, the distribution of the number of spikes fired by a single neuron in a bin may approach a normal distribution.
In this case, it is relatively easy to calculate the distribution of N clusters if the only correlations are of second order. Third, the method according to which we derived the distribution of maximal entropy is valid only for a homogeneous network. For a nonhomogeneous network, the number of variables grows exponentially with the number of neurons (see section 3). We have incorporated a single inhibitory neuron in our network, which received input from all excitatory cells and provided strong inhibitory feedback (data not shown). The inhibitory neuron added structure to the spike trains by curtailing population bursts. We did not explicitly incorporate the firing pattern of the inhibitory neuron in our calculations, which caused a considerable discrepancy between the maximal entropy distribution and the actual distribution of N clusters. Taken together, these limitations imply that it will be rather problematic to predict the probability of highly synchronous events from firing rates and pair-wise correlations in physiological experiments. In a physiological study on higher-order events among neurons of the frontal cortex, Martignon et al. (1995) obtained discrepancies between the probability of actually occurring events and their probability as predicted by entropy maximization. Unfortunately, even in this study, in which simultaneous recordings from
six neurons were studied, sampling problems occurred that were caused by the exponentially growing number of spike configurations and the limited recording time available during such experiments. Another way in which the probability of highly synchronous events might be estimated is by direct, in vivo measurement of the distribution of the postsynaptic potential. A comparison of the postsynaptic potential to the size of individual EPSPs could provide valuable insight into the degree of synchronicity among neurons that project to a common target cell.

Appendix: Maximizing the Entropy in a Homogeneous Network

For N neurons, three equations are derived from the pair-wise correlation, the firing probability, and the fact that all probabilities should add up to 1. First, the probabilities should add up to 1:

1 = \sum_{i=0}^{n} \binom{n}{i} D_i.   (3.1)
Second, the firing probability f_1 is fixed. For example, in the case that N = 7, it is easy to see that f_1 equals:

f_1 = \sum_{i=0}^{1} \sum_{j=0}^{1} \sum_{k=0}^{1} \sum_{l=0}^{1} \sum_{m=0}^{1} \sum_{n=0}^{1} G_{1ijklmn} = \sum_{i=1}^{7} \binom{6}{i-1} D_i.   (A.1)

In the general case of N = n, this reads:

f_1 = \sum_{i=1}^{n} \binom{n-1}{i-1} D_i.   (3.2)
Third, the pair-wise correlations are fixed, and this implies that the probability that two neurons fire simultaneously is also fixed. Analogously to equation 3.2, we find for f_2:

f_2 = \sum_{i=2}^{n} \binom{n-2}{i-2} D_i.   (3.4)
These equations can be rewritten as:

D_0 = 1 - \sum_{i=1}^{n} \binom{n}{i} D_i = 1 - nD_1 - \binom{n}{2} D_2 - \sum_{i=3}^{n} \binom{n}{i} D_i

D_1 = f_1 - \sum_{i=2}^{n} \binom{n-1}{i-1} D_i = f_1 - (n-1)D_2 - \sum_{i=3}^{n} \binom{n-1}{i-1} D_i

D_2 = f_2 - \sum_{i=3}^{n} \binom{n-2}{i-2} D_i.   (A.2)
Solving for D_0, D_1, and D_2:

D_0 = 1 - nf_1 + \left[ n(n-1) - \binom{n}{2} \right] f_2 + \left[ \binom{n}{2} - n(n-1) \right] \sum_{i=3}^{n} \binom{n-2}{i-2} D_i + n \sum_{i=3}^{n} \binom{n-1}{i-1} D_i - \sum_{i=3}^{n} \binom{n}{i} D_i

D_1 = f_1 - (n-1) f_2 + (n-1) \sum_{i=3}^{n} \binom{n-2}{i-2} D_i - \sum_{i=3}^{n} \binom{n-1}{i-1} D_i

D_2 = f_2 - \sum_{i=3}^{n} \binom{n-2}{i-2} D_i,   (A.3a–c)
with the entropy defined as

S = -\sum_{i=0}^{n} \binom{n}{i} D_i \ln(D_i).   (3.5)
This is a concave function (which is easily verified), and therefore it has a single, unique maximum. To obtain the maximum, we calculate the derivative with respect to D_i:

\frac{\partial S}{\partial D_i} = 0 \quad \text{for } i = 3, \ldots, n   (3.6)

\frac{\partial S}{\partial D_i} = -\left\{ \left[ \binom{n}{2} - n(n-1) \right] \binom{n-2}{i-2} + n \binom{n-1}{i-1} - \binom{n}{i} \right\} \ln(D_0) - \left\{ n(n-1) \binom{n-2}{i-2} - n \binom{n-1}{i-1} \right\} \ln(D_1) + \binom{n}{2} \binom{n-2}{i-2} \ln(D_2) - \binom{n}{i} \ln(D_i) = 0.   (A.4)
We note that equations A.3a–c are linear in D_i and that a concave function remains concave on a linear subspace. In other words, equations A.3a–c and 3.6, taken together, also have a unique solution. Equation A.4 can be
rewritten as:

\ln(D_i) = \tfrac{1}{2}(i-1)(i-2) \ln(D_0) - i(i-2) \ln(D_1) + \tfrac{1}{2} i(i-1) \ln(D_2) \quad \text{for } i = 3, \ldots, n.   (3.7)
Inserting these values for D_i into equations A.3a–c yields three equations with three unknowns and a unique solution, which is found using the Newton–Raphson method as described by Press et al. (1986). Use of this algorithm in Maple V v4.0 allowed us to solve the equations for up to 150 neurons in about 2 hours on a Pentium 150 MHz.

Acknowledgments

We thank Holger Bosch for helpful comments and a critical reading of the manuscript for this article. The research of P.R.R. was funded by a fellowship of the Royal Netherlands Academy of Arts and Sciences.

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1658.
Alonso, J-M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neurosci., 14, 2870–2892.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Bernander, Ö., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Comp., 6, 622–641.
Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3, 146–158.
Deppisch, J., Bauer, H-U., Schillen, T., König, P., Pawelzik, K., & Geisel, T. (1993). Alternating oscillatory and stochastic states in a network of spiking neurons. Network, 4, 243–257.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci., 15, 218–226.
Gokhale, D. V., & Kullback, S. (1978). The information in contingency tables. New York: Dekker.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19, 130–137.
Livingstone, M. S. (1996). Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex. J. Neurophysiol., 75, 2467–2485.
MacGregor, R. J., & Oliver, R. M. (1974). A model for repetitive firing in neurons. Kybernetik, 16, 53–64.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Comp., 6, 1111–1126.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Curr. Opin. Neurobiol., 5, 248–250.
Singer, W. (1995). Development and plasticity of cortical processing architectures. Science, 270, 758–764.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. R. (1995). Simple codes versus efficient codes. Curr. Opin. Neurobiol., 5, 239–247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Neural Comp., 6, 111–124.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received May 29, 1998; accepted January 19, 1999.
LETTER
Communicated by Peter König
Effects of Spike Timing on Winner-Take-All Competition in Model Cortical Circuits

Erik D. Lumer
Wellcome Department of Cognitive Neurology, London, WC1N 3BG, U.K.
Synaptic interactions in cortical circuits involve strong recurrent excitation between nearby neurons and lateral inhibition that is more widely spread. This architecture is commonly thought to promote a winner-take-all competition, in which a small fraction of neuronal responses is selected for further processing. Here I report that such a competition is remarkably sensitive to the timing of neuronal action potentials. This is shown using simulations of model neurons and synaptic connections representing a patch of cortical tissue. In the simulations, uncorrelated discharge among neuronal units results in patterns of response dominance and suppression, that is, in a winner-take-all competition. Synchronization of firing, however, prevents such competition. These results demonstrate a novel property of recurrent cortical-like circuits, suggesting that the temporal patterning of cortical activity may play an important part in selection among stimuli competing for the control of attention and motor action.

1 Introduction

The brain constantly receives a welter of sensory stimuli, only a small fraction of which seems to reach awareness or influence behavior. Consequently, in the central nervous system, mechanisms must exist to filter out unwanted activity. This selectivity typically operates at the level of perceptual groups rather than simple features, biasing processing in favor of objects that are perceptually salient or currently relevant to behavior (Humphreys, Romani, Olson, Riddoch, & Duncan, 1994). Such a selectivity probably reflects contentions for limited processing capacity and the control of motor action. Thus, it is central to the organization of adaptive responses. Neurophysiological experiments in primates suggest that this selection process involves a suppression of neuronal responses to irrelevant stimuli (Moran & Desimone, 1985; Motter, 1993; Treue & Maunsell, 1996; Chelazzi, Miller, Duncan, & Desimone, 1993; Schall & Hanes, 1993).
Neural Computation 12, 181–194 (2000)    © 2000 Massachusetts Institute of Technology

For example, attention-dependent modulation of discharge rates has been observed in the visual cortex of monkeys trained to attend to stimuli at one location and ignore stimuli placed at another location (Moran & Desimone, 1985; Motter, 1993). Similarly, responses of direction-selective neurons in middle temporal visual areas are transiently suppressed when attention is diverted away from their activating stimuli (Treue & Maunsell, 1996). This modulation of neuronal firing is commonly thought to reflect the effect of attentional biases on intrinsic mechanisms of competition within circumscribed cortical areas, mediated by mutual inhibition (Kastner, De Weerd, Desimone, & Ungerleider, 1998; Chelazzi et al., 1993; Schall & Hanes, 1993). How such mechanisms of competition interact with those underlying basic perceptual processes, including segmentation and grouping, and with top-down attentional biases, however, remains uncertain. Here, I describe a novel property of cortical circuits that may in part mediate such an interaction. Using simulations of spiking neurons grouped in recurrent networks that conform to the basic principles of organization of local cortical circuitry, I show that the competition that arises naturally in such networks exhibits a remarkable sensitivity to the relative timing of action potentials. Specifically, synchronization of neuronal activity can switch the network dynamics from a winner-take-all regime, in which a small subset of neurons can sustain elevated firing rates, to one in which most cells are active concurrently. Hence, this effect links local mechanisms of response selection to temporal aspects of cortical activity that have been previously implicated in perceptual organization (Milner, 1974; von der Malsburg, 1981; Gray, König, Engel, & Singer, 1989; Singer, 1993). Previous modeling studies have suggested that the combined influences of recurrent excitation between nearby cells and more widely spread inhibition within cortical areas provide an intrinsic mechanism by which unwanted signals could be suppressed (Stemmler, Usher, & Niebur, 1995; Salinas & Abbott, 1996). When stimuli activating different groups of neurons are presented simultaneously, the recurrent connections generate a nonlinear behavior, yielding neuronal responses that are restricted to the cell group with the strongest input.
This winner-take-all competition reflects a synaptic organization that is common to cortical circuits (Kisvárday & Eysel, 1993; Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Such a mechanism has been proposed as a basic computational strategy for learning and action selection (Hertz, Krogh, & Palmer, 1991), and underlies several models of brain function, ranging from the development of cortical maps (von der Malsburg, 1973) to the selection of targets during saccadic eye movements (Salinas & Abbott, 1996) and visual search (Koch & Ullman, 1985), and to the suppression of redundancies in visual scenes (Stemmler et al., 1995). However, most models of winner-take-all mechanisms in neural circuits focus on the modulation of mean discharge rates due to competitive interactions. Accordingly, these simulation studies often fail to model individual neurons or their precise spike timing. Yet recent evidence in animals suggests that the temporal coordination of neuronal activity may play an important role in neural network operations. In particular, synchronization of action potentials on a millisecond timescale can occur selectively between cortical neurons that signal related features (Gray et al., 1989). It
has been suggested that such synchronization may serve to integrate distributed neural events into coherent percepts (Milner, 1974; von der Malsburg, 1981; Singer, 1993). Moreover, it has been shown, in both the hippocampus and neocortex, that networks of recurrently connected inhibitory neurons can both suppress neuronal responses and affect their temporal patterning (Cobb, Buhl, Halasy, Paulsen, & Somogyi, 1995; Whittington, Traub, & Jefferys, 1995). It remains unclear, however, whether the temporal coordination of neuronal activity can influence the ability of recurrent circuits to induce response suppression. To investigate this possibility, I model neural network interactions in a small patch of cortex in which the relative spike timing between competing units is varied systematically.

2 Methods

The model consists of populations of excitatory and inhibitory cells that are grouped in local recurrent circuits similar to those in previous simulation studies of neocortex (Douglas et al., 1995; Somers, Nelson, & Sur, 1995; Stemmler et al., 1995; Salinas & Abbott, 1996). Synaptic interactions within a group are both excitatory and inhibitory, whereas interactions between cell groups are restricted to inhibition (see Figure 1). This synaptic organization mimics that found in a local sector of cortex, for example, a hypercolumn of V1, in which massively recurrent excitatory connections dominate among nearby cells, whereas inhibitory interactions extend over larger distances. Given the topographic nature of cortical maps, this lateral inhibition interconnects adjacent groups of cells with dissimilar response specificity, for example, with an orthogonal orientation preference in V1 (Kisvárday & Eysel, 1993). In most simulations presented in section 3, two cell groups coupled through mutual inhibition are considered (target groups). Simulations involving three competing target groups are also reported.
Target cells are driven by a population of neurons with similar intrinsic connectivity (input groups). In the model, each input group either projects to a single target group or provides common (divergent) activation to several target groups. Each group of cells comprises 800 excitatory units and 200 inhibitory units. Units of each type receive 80 excitatory connections and either 20 (input groups) or 12 (target groups) inhibitory connections from cells within the same group. Intergroup connections consist of 12 inhibitory contacts per cell between target groups. Input groups contribute 40 excitatory projections onto each unit in the target layer, with the proportion of common input varied systematically in the simulations. These specifications yield a ratio of excitatory to inhibitory units, as well as proportions of excitatory and inhibitory synapses, that are consistent with anatomical data (Beaulieu, Kisvárday, Somogyi, Cynader, & Cowey, 1992). Individual connections are established at random within and between interconnected populations of cells. These connections incorporate transmission delays with a mean of 2 ms.
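The wiring statistics above can be sketched as follows (an illustrative reconstruction with an assumed array layout and an assumed delay distribution, not the author's code):

```python
# Sketch: random wiring of one target group per the counts in section 2
# (800 excitatory + 200 inhibitory units; 80 excitatory and 12 inhibitory
# inputs per unit drawn from within the group).
import numpy as np

rng = np.random.default_rng(1)
N_EX, N_INH = 800, 200

def sample_sources(n_post, pool, k):
    """Draw k distinct presynaptic sources for each of n_post units."""
    return np.array([rng.choice(pool, size=k, replace=False)
                     for _ in range(n_post)])

# Excitatory sources are unit indices 0..799, inhibitory 800..999.
exc_sources = sample_sources(N_EX + N_INH, np.arange(N_EX), 80)
inh_sources = sample_sources(N_EX + N_INH, np.arange(N_EX, N_EX + N_INH), 12)

# Transmission delays with a 2-ms mean; the text does not specify the
# distribution, so a clipped normal is assumed purely for illustration.
delays = rng.normal(2.0, 0.5, size=exc_sources.shape).clip(min=0.1)
```

Intergroup inhibitory contacts and the 40 feedforward projections per target unit would be sampled the same way, with the source pool taken from the other group.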
Figure 1: Model architecture. Spiking units are grouped in cortical-like recurrent circuits (boxes), each comprising 800 excitatory (Ex) and 200 inhibitory (Inh) cells. Input groups can provide either independent or common drive to two target groups (1 and 2) that compete through mutual inhibition. Excitatory and inhibitory connections are represented by lines with arrowheads and round terminals, respectively.
Neurons are modeled as single-voltage compartments with realistic membrane time constants and firing characteristics. Specifically, each cell's membrane potential obeys the equation

\tau_m \, dV/dt = -V + E_0 - g_{Ex}(t)(V - E_{Ex}) - g_{Inh}(t)(V - E_{Inh}),

where \tau_m is the membrane time constant, E_0 is the cell's resting potential, g_{Ex} and g_{Inh} are, respectively, time-dependent excitatory and inhibitory synaptic conductances expressed in units of a leak conductance, and E_{Ex} and E_{Inh} are the reversal potentials of the corresponding channels. Membrane time constants are set to 16 ms for excitatory cortical units and 8 ms for inhibitory units. These passive membrane time constants are consistent with empirical data (Connors, Gutnick, & Prince, 1982; Bloomfield, Hamos, & Sherman, 1987). E_0 is −60 mV for both cell types, in agreement with in vivo measurements (Ferster & Jagadeesh, 1992). E_{Ex} and E_{Inh} are 0 mV and −70 mV, respectively. Spiking is modeled by a standard integrate-and-fire mechanism, in which units that reach a threshold are reset to the potassium reversal potential (−90 mV). Thresholds are −51 mV for excitatory units and −52 mV for inhibitory units. These threshold values fall in the in vivo range (Baranyi, Szente, & Woody, 1993). The choice of different firing thresholds for these two cell types makes it easier to reproduce experimental levels of spontaneous activity, as described below. Absolute refractory periods are 1.5 ms. Relative refractory periods are generated in excitatory units by adding 10 mV to their firing threshold after each spike; relaxation to baseline occurs exponentially with a 10-ms time constant. Similar cellular parameters have been justified in previous simulations of striate cortex
Table 1: Synaptic Channel Parameters.

Receptor   g_peak (units of g_leak)   τ1 (ms)   τ2 (ms)   E (mV)
AMPA       0.05                       0.5       2.4       0
GABA_A     0.1                        1         7         −70
(e.g., Somers et al., 1995; Lumer, Edelman, & Tononi, 1997). For simplicity, spike rate adaptation is not incorporated in excitatory units. Preliminary simulations showed that this choice does not affect the processes studied in this article. When a unit fires, a spiking event is relayed to its postsynaptic targets. Synaptic activation is modeled as a transient conductance change, using dual exponential functions defined as (Wilson & Bower, 1989):

g(t) = g_{peak} \, N \, (e^{-t/\tau_1} - e^{-t/\tau_2}),

where \tau_1 and \tau_2 are the rising and decay time constants, N is a normalizing factor, and g_{peak} is the peak conductance elicited by a single spike. In the model, synapses provide fast inhibition and excitation, with time constants of activation consistent with empirical data in vitro (Otis & Mody, 1992; Stern, Edwards, & Sakmann, 1992) (see Table 1). At rest, peak postsynaptic potentials (PSPs) elicited by a single spike are 0.5 and 0.25 mV for excitatory and inhibitory synaptic channels, respectively. These PSPs are consistent with the higher end of experimental values (Mason, Nicoll, & Stratford, 1991; Komatsu, Nakajima, Toyama, & Fetz, 1988); large, single-event PSPs must be used in the model because it includes fewer synapses per cell compared with in vivo connectivity. Background synaptic activity is simulated by a random (Poisson-distributed) activation of both types of synaptic channels in each unit. This results in a balance of mean excitation and inhibition and a spontaneous discharge of individual units at a rate of about 3 Hz for excitatory units and 15 Hz for inhibitory units. In addition to this background activity, extrinsic activation of the network is achieved by applying a stochastic (Poisson-distributed, λ = 0.4 kHz) synaptic excitation independently to each cell in the input groups. This activation occurs with an amplitude equivalent to five synchronous excitatory postsynaptic potentials (EPSPs) per synaptic event. To clarify the influence of relative spike timing on competition, the temporal coordination between neuronal responses in the two target groups is manipulated selectively. This is implemented by varying systematically the proportion of common versus independent activation that each target group receives. It must be emphasized that this manipulation of feedforward connectivity is not meant to represent any physiological mechanism, but rather to provide a simple control over the degree of coherence between competing populations of neurons.
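A minimal sketch of a single excitatory model unit, combining the membrane equation and the dual-exponential synapse described above (function names and event handling are my assumptions; refractory dynamics and the adaptive threshold are omitted for brevity):

```python
# Sketch: forward-Euler integration of one excitatory unit (section 2
# parameters) driven by AMPA-type input events; greatly simplified
# relative to the full network model.
import numpy as np

DT = 0.1  # integration step (ms)

def dual_exp_kernel(gpeak, t_rise, t_decay, t):
    """Dual-exponential conductance for a spike at t = 0, peak = gpeak."""
    tpk = np.log(t_decay / t_rise) / (1 / t_rise - 1 / t_decay)
    norm = gpeak / (np.exp(-tpk / t_decay) - np.exp(-tpk / t_rise))
    return norm * (np.exp(-t / t_decay) - np.exp(-t / t_rise))

def run_unit(ampa_spikes, t_end=200.0):
    """Return output spike times (ms) of one integrate-and-fire unit."""
    tm, e0, e_ex = 16.0, -60.0, 0.0   # tau_m (ms), rest and AMPA reversal (mV)
    theta, v = -51.0, -60.0           # firing threshold, initial potential
    out = []
    for step in range(int(t_end / DT)):
        t = step * DT
        dt_since = t - ampa_spikes[ampa_spikes <= t]   # elapsed event times
        g_ex = dual_exp_kernel(0.05, 0.5, 2.4, dt_since).sum()
        v += DT / tm * (-v + e0 - g_ex * (v - e_ex))
        if v >= theta:                # spike: reset to potassium reversal
            out.append(t)
            v = -90.0
    return out
```

A single event depolarizes the cell by well under 1 mV, on the order of the single-spike PSP quoted above, whereas a synchronous volley of many events drives it across threshold.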
The question of what mechanisms can
produce dynamic changes in neuronal synchronization is beyond the scope of this study. Suffice it to note that changes in relative spike timing do take place in a task- and stimulus-dependent manner in vivo (Gray et al., 1989; Roelfsema, Engel, König, & Singer, 1997). A number of aggregate measures are used to characterize the model's behavior. The average membrane potential is computed within each cell group. This variable, which is reminiscent of a local field potential, serves as a useful indicator of within-group coherence. Instantaneous population-averaged firing rates are computed by accumulating all of the spikes emitted by excitatory units within a given cell group and a sliding 100-ms time window and by normalizing for cell number and window size. Finally, the competition index of two groups of neurons is defined as the mean absolute value of the difference between their instantaneous firing rates divided by the highest instantaneous rate among the two cell groups. This index varies between 0 for noncompeting groups of cells and 1 for groups of neurons with mutually exclusive patterns of activity (i.e., winner-take-all regime). Thus, it provides a useful measure of the degree of competition in the target layer.

3 Results

Mutual inhibition in cortical circuits can both suppress neuronal responses and affect their temporal patterning by inducing synchronous oscillations. This is illustrated in Figure 2A, in which the population-averaged membrane potential in input and target groups is plotted, along with single-unit activity from representative target cells. As shown by the autocorrelation function, input populations exhibit weak rhythmic activity in the 60-Hz range (see Figure 2B). By contrast, target populations exhibit strong coherent oscillations with a frequency at around 30 Hz (see Figure 2C). The population oscillations can be shown to arise as an emergent property of recurrent synaptic coupling.
This is because individual inhibitory neurons target several excitatory cells, forcing them to pause in concert; the silencing of excitatory neurons in turn causes a drop in inhibition, resulting in a depolarizing rebound and a synchronous volley of action potentials. The sequence repeats itself periodically at a pace that depends in part on the time constants of synaptic inhibition. Because this aspect of the neural dynamics has been investigated in detail in previous studies (Bush & Douglas, 1991; Whittington et al., 1995; Lumer et al., 1997), it is not further characterized here. The simulations also reveal that changes in spike coordination modify the effects of lateral inhibition that is not matched by excitation. We illustrate this by comparing the responses in a model with two target groups when they sample their inputs from either a common neuronal pool or from independent groups of cells. Although the mean activation of each target group remains identical in both cases, the two conditions yield different correlation patterns. In the first instance, the common input provided to the target populations causes them to fire in synchrony. With independent inputs, however, the
Figure 2: Coherent activity in target groups. (A) The three upper plots show the population-averaged membrane potential of excitatory units in an input group (In) and in each target group (Pop1 and Pop2) over a period of 250 ms before and 750 ms during application of extrinsic activation to the model circuit. The remaining plots show the corresponding single-unit activity for representative cells in each target group. The arrow at the bottom marks the stimulation onset. Note the oscillations in the population-averaged responses, despite somewhat variable interspike intervals evidenced by the single-unit traces. (B) Autocorrelogram of the mean membrane potential in the input group reveals a weak oscillation at around 60 Hz. (C) Autocorrelogram of the mean membrane potential in one target group reveals synchronous oscillations at around 30 Hz.
target populations exhibit little correlation in the timing of their action potentials. As shown in Figure 3A, these two input conditions yield markedly different behavior in the target layer. Independent activation of the target groups results in a winner-take-all competition, in which one group of cells exhibits elevated firing rates, whereas responses in the other neuronal group are suppressed. This contrasts with the absence of response suppression in the target layer when target groups are driven by a common (synchronizing) input. Thus, synchronization of firing prevents competition in recurrent neural circuits, despite the presence of strong mutual inhibition. The influence exerted by spike coordination on winner-take-all mechanisms operates very rapidly. This is demonstrated in Figure 3B, in which a histogram of the mean absolute difference in firing rate between the two target groups is shown immediately before and after a change from independent to common afferent input. The histogram reveals that a significant modulation of differential firing levels is detectable within the first 50 ms following the change in spike coordination. To quantify the sensitivity of neural competition to the degree of zero-lag synchronization, I compute the competition index (CI), as defined in section 2, between two target groups for different proportions of common
Figure 3: Temporal control of the winner-take-all mechanism. (A) Instantaneous firing rates, as defined in section 2, are plotted for excitatory units in two competing target groups, using black and gray traces, respectively. The inputs to the two target groups are changed from common (sync) to separate (async) once every 500 ms. Uncorrelated inputs yield a pattern of response dominance and suppression, which is abolished when common inputs are used instead. Input conditions are shown below the activity traces. (B) Differential discharge rate histogram for the target groups during successive suppressed and coordinated time intervals (bin = 50 ms).
input. As shown in Figure 4A, the degree of competition between target groups rises rapidly as the proportion of common input, and hence of their spike correlation, decreases. Because competition in the model depends on the presence of strong mutual inhibition between target groups, the influence of spike coordination can be interpreted as a modulation of the effective connectivity between these groups (Gerstein & Perkel, 1969; Friston, Frith, Liddle, & Frackowiak, 1993). This view is consistent with the results of simulations in which, for each setting of input sources, we determine the strength of intergroup synaptic inhibition that yields an equivalent degree of competition in the target layer (in this instance, CI = 0.4). This functionally equivalent inhibitory strength is plotted as a function of the proportion of common input provided to the target populations (see Figure 4B). To a first approximation, it establishes a boundary between regions of the parameter space yielding, on the one hand, a winner-take-all regime (above the boundary) and, on the other, a regime in which both target groups are activated simultaneously (below the boundary). Accordingly, physiological factors that transiently alter the efficacy of inhibitory postsynaptic potentials (e.g., neuromodulation) or input coherence (e.g., stimulus configuration) could switch the network dynamics between these two regimes. Finally, it must be noted that similar effects are seen when more than two
Figure 4: Sensitivity of competition to spike timing. (A) The competition index (CI) is plotted for increasing proportions of common input to the two target groups. The CI for uncorrelated inputs to the two groups corresponds to the leftmost point in the figure. (B) The plot shows the peak conductance of intergroup inhibitory synapses yielding a constant CI (0.4), as a function of the proportion of common input to the two target groups. This plot separates the parameter space into a region of increased competition (above) and one of coordination (below).
competing groups are considered. This is illustrated in simulations of three competing populations of neurons that are driven by independent input groups (see Figure 5A) or by inputs that are common to two (see Figure 5C) or to all of the target groups (see Figure 5B). As shown in the figures, target groups that receive a common input, and hence fire synchronously, do not suppress each other's activity. This contrasts with the winner-take-all competition between populations with uncorrelated synaptic inputs. These simulations illustrate how dynamic changes in input correlation can reconfigure the competition among target cells; neurons can switch their membership from one coactivated assembly to another on the basis of changes in relative spike timing. Thus, these results are consistent with the notion that the same group of neurons can be involved in the representation of multiple objects. At any one moment, the specific binding of neuronal activity to single objects can be determined by a buildup of synchronization patterns and selection of neuronal responses that conform to such patterns.
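The aggregate measures defined in section 2 can be sketched as follows (the array conventions and the pointwise form of the competition index are assumptions on my part, since the verbal definition leaves the exact averaging ambiguous):

```python
# Sketch: instantaneous population rate over a sliding 100-ms window and a
# competition index between two groups (0 = co-active, 1 = winner-take-all).
import numpy as np

def instantaneous_rate(spike_times, n_cells, t_grid, window=100.0):
    """Spikes per cell per second within +/- window/2 of each grid time (ms)."""
    counts = np.array([np.sum(np.abs(spike_times - t) <= window / 2)
                       for t in t_grid])
    return counts / (n_cells * window / 1000.0)

def competition_index(r1, r2, eps=1e-12):
    """Mean of |r1 - r2| / max(r1, r2), taken pointwise over time."""
    peak = np.maximum(r1, r2)
    return float(np.mean(np.abs(r1 - r2) / (peak + eps)))
```

Two groups with mutually exclusive activity give an index near 1; identical rate traces give 0, matching the verbal definition of the CI.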
Figure 5: Instantaneous firing rates for three competing groups. (A) Independent activation of each group yields a winner-take-all regime, in which a single group maintains elevated firing rates at any one moment. (B) Competition is prevented by the introduction of common inputs to all target populations. (C) Common input to two groups prevents them from competing with each other but not with a third, uncorrelated group. Activity in the third group is suppressed in this example, but it could also dominate competition in some runs.
4 Discussion

A prominent feature of the local circuitry in neocortex is its topographic organization in clusters, with massively recurrent excitation between neighboring cells and lateral inhibition that is more widely spread. Recent simulation studies suggest that this architecture yields stereotyped responses, in which weak afferent inputs undergo nonlinear amplification due to intracortical recurrent excitation and the strongest inputs are differentially activated (i.e., selected) as a result of lateral inhibitory interactions. The models show that such response properties can account for the emergence of direction and orientation selectivity in striate cortex and for the multiplicative gain modulation of parietal neurons by eye and head position signals (Douglas et al., 1995; Somers et al., 1995; Salinas & Abbott, 1996). The simulations here demonstrate that the basic design principles of cortical architectures can also produce stereotyped temporal patterns of neuronal activity, in particular synchronous and oscillatory population responses in the high-frequency range (20–70 Hz). More important, it has been shown that the winner-take-all competition that arises naturally in such circuits exhibits a remarkable sensitivity to the temporal coordination of neuronal action potentials. Specifically, the onset of a winner-take-all behavior can be prevented by a zero-lag synchronization of discharges among pools of neurons competing through mutual inhibition. Transitions
Effects of Spike Timing on Winner-Take-All Competition
191
from competitive to cooperative ring regimes occur very rapidly—within roughly 50 msec of a change from uncorrelated to synchronous population activity. This novel network property reects the sensitivity of reciprocal inhibitory interactions to the relative timing of action potentials. Indeed, the simulations reported in this article show that the effective connectivity, of an inhibitory nature, mediating lateral interactions responsible for competition is decreased in the context of synchronization. This is an interesting and counterintuitive observation; increased functional connectivity (dened as the correlation between neurophysiological events) is, in this instance, accompanied by a decrease in (inhibitory) effective connectivity (dened as the inuence one neuronal system exerts over another). The effects of relative spike timing on competition described here are robust to changes in model parameters. In particular, these effects are maintained when the total number or relative proportion of excitatory and inhibitory units in the model are varied by a twofold factor with respect to the values used in the reported simulations. The network organization underlying these effects—a center of recurrent excitation with a inhibitory surround— does not take into account the long-range horizontal connections between pyramidal cells in upper and lower cortical laminae. However, these distal horizontal connections are organized in discrete patches placed outside the afferent region of inhibitory cells and restricted to cells of similar response specicity (Lund, Yoshioka, & Levitt, 1993). Moreover, physiological evidence suggests that these long- range connections have a modulatory rather than a driving inuence on their neuronal targets (Hirsh & Gilbert, 1991). Thus, in rst approximation these nonlocal connections are not relevant for the local competition mechanisms considered in this article. 
These results may help in understanding how the brain selects among the welter of stimuli constantly competing for the control of attention and motor action. Physiologically, selective sensory processing is often accompanied by a suppression of neuronal responses to contextually irrelevant stimuli (Moran & Desimone, 1985; Motter, 1993; Treue & Maunsell, 1996). A candidate mechanism for the selective suppression of neuronal responses is mutual inhibition between cortical cells responsive to relevant and irrelevant stimuli (Chellazzi et al., 1993; Schall & Sanes, 1993). Because selection typically operates at the level of perceptual groups rather than simple features, however, such competition must be confined to cells belonging to different object representations. The results here suggest a mechanism whereby this specificity of competitive interactions can be achieved dynamically. It exploits the fact that the relative timing of action potentials in sensory cortices can vary in a context-dependent manner, with synchronization of activity arising selectively among cells representing related features of an object (Gray et al., 1989; Singer, 1993). Thus, according to the effects reported in this article, the neuronal responses associated with various attributes of an object emerge with mutual immunity from competition by virtue of their synchronization. By contrast, the patterns of activity associated with different objects or alternative stimulus interpretations may be uncorrelated and hence subject to competition. It has recently been proposed that such effects may underlie the competition between alternative stimulus representations during binocular rivalry (Lumer, 1998). More generally, they link, within a common computational framework, early processes of perceptual organization based on spike timing and subsequent selection among perceptual groups competing for the control of behavior. These newly described effects of spike coordination are functionally distinct from the ability of neuronal synchronization to augment the responses of target cells through maximal temporal summation of synaptic events (Abeles, 1982).

5 Conclusion

What emerges from these simulations is the notion of a parameter space for cortical operation that spans two domains. One domain supports activity in multiple populations of neurons for only a short period of time, after which one population is selected by competitive lateral interactions. In the other domain, neuronal responses in multiple populations can coexist, provided that they are expressed synchronously. The parameters defining this space can be divided into those that dynamically modulate synaptic efficacy (e.g., classical neuromodulation and voltage-dependent interactions) and the temporal patterning of activity among the interacting populations. I have focused on the latter here and have demonstrated how temporal coordination can move the network dynamics from one domain to the other. The role of temporal coordination in facilitating or attenuating competitive interactions in cortical circuits may represent a fundamental component of perceptual synthesis.

Acknowledgments

I thank Karl Friston and Richard Frackowiak for their encouragement and for helpful discussions and comments on the manuscript for this article. This research was supported by the Wellcome Trust.

References

Abeles, M. (1982).
Role of cortical neuron: Integrator or coincidence detector? Isr J Med Sci, 18, 83–92.
Baranyi, A., Szente, M., & Woody, C. (1993). Electrophysiological characterization of different types of neurons recorded in vivo in motor cortex of the cat. II. Membrane parameters, action potentials, current-induced voltage responses and electrotonic structures. J Neurophysiol, 69, 1865–1879.
Beaulieu, C., Kisvárday, Z., Somogyi, P., Cynader, M., & Cowey, A. (1992). Quantitative distribution of GABA-immunonegative neurons and synapses in the monkey striate cortex (area 17). Cerebral Cortex, 2, 295–309.
Bloomfield, S. A., Hamos, J. E., & Sherman, S. M. (1987). Passive cable properties and morphological correlates of neurones in the lateral geniculate nucleus of the cat. J Physiol, 383, 653–692.
Bush, P., & Douglas, R. (1991). Synchronization of bursting action potentials in a model network of neocortical neurons. Neural Comp, 3, 19–30.
Chellazzi, L., Miller, E. K., Duncan, J., & Desimone, R. (1993). A neural basis for visual search in inferior temporal cortex. Nature, 363, 345–347.
Cobb, S. R., Buhl, E., Halasy, K., Paulsen, O., & Somogyi, P. (1995). Synchronization of neuronal activity in hippocampus by individual GABAergic interneurons. Nature, 378, 75–78.
Connors, B., Gutnick, N. I., & Prince, D. (1982). Electrophysiological properties of neocortical neurons in vitro. J Neurophysiol, 48, 1302–1320.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Ferster, D., & Jagadeesh, B. (1992). EPSP-IPSP interactions in cat visual cortex studied with in vivo whole-cell patch recording. J Neurosci, 12, 1262–1274.
Friston, K. J., Frith, C. D., Little, P. F., & Frackowiak, R. S. J. (1993). Functional connectivity: The principal component analysis of large (PET) data sets. J Cereb Blood Flow Metab, 13, 5–14.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science, 164, 828–830.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hirsh, J., & Gilbert, C. (1991). Synaptic physiology of horizontal connections in the cat visual cortex. J Neurosci, 11, 1800–1809.
Humphreys, G. W., Romani, C., Olson, A., Riddoch, M. J., & Duncan, J. (1994). Non-spatial extinction following lesions of the parietal lobe in humans. Nature, 372, 357–359.
Kastner, S., De Weerd, P., Desimone, R., & Ungerleider, L. G. (1998). Mechanisms of directed attention in the human extrastriate cortex as revealed by functional MRI. Science, 282, 108–111.
Kisvárday, Z. F., & Eysel, U. T. (1993). Functional and structural topography of horizontal inhibitory connections in cat visual cortex. Europ J Neurosci, 5, 1558–1572.
Koch, C., & Ullman, S. (1985). Shifts in selective attention: Towards an underlying neural circuitry. Human Neurobiology, 4, 219–227.
Komatsu, Y., Nakajima, S., Toyama, K., & Fetz, E. (1988). Intracortical connectivity revealed by spike-triggered averaging in slice preparation of cat visual cortex. Brain Res, 442, 359–362.
Lumer, E. D. (1998). A neural model of binocular integration and rivalry based on the coordination of action-potential timing in primary visual cortex. Cerebral Cortex, 8, 553–561.
Lumer, E. D., Edelman, G. M., & Tononi, G. (1997). Neural dynamics in a model of the thalamocortical system. I. Layers, loops, and the emergence of fast
synchronous rhythms. Cerebral Cortex, 7, 207–227.
Lund, J. S., Yoshioka, T., & Levitt, J. B. (1993). Comparison of intrinsic connectivity in different areas of macaque monkey cerebral cortex. Cerebral Cortex, 3, 148–162.
Mason, A., Nicholl, A., & Stratford, K. (1991). Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J Neurosci, 11, 72–84.
Milner, P. M. (1974). A model of visual shape recognition. Psychol Rev, 81, 521–535.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. J Neurophysiol, 70, 909–919.
Otis, T. S., & Mody, I. (1992). Differential activation of GABAa and GABAb receptors by spontaneously released transmitter. J Neurophysiol, 67, 227–235.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero-lag synchronization among cortical areas. Nature, 385, 157–161.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc Natl Acad Sci USA, 93, 11956–11961.
Schall, J. D., & Sanes, D. P. (1993). Neural basis of saccade target selection in frontal eye field during visual search. Nature, 366, 467–469.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing. Annu Rev Physiol, 55, 349–374.
Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J Neurosci, 15, 5448–5465.
Stemmler, M., Usher, M., & Niebur, E. (1995). Lateral interactions in primary visual cortex: A model bridging physiology and psychophysics. Science, 269, 1877–1880.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. J Physiol (London), 449, 247–278.
Treue, S., & Maunsell, J. H. R. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in primary visual cortex. Kybernetik, 14, 85–100.
von der Malsburg, C. (1981). The correlation theory of the brain (Internal Rep. No. 81-2). Göttingen: Max Planck Institute for Biophysical Chemistry.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wilson, M. A., & Bower, J. M. (1989). In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 291–333). Cambridge, MA: MIT Press.

Received April 21, 1998; accepted February 17, 1999.
LETTER
Communicated by George Gerstein
Model Dependence in Quantification of Spike Interdependence by Joint Peri-Stimulus Time Histogram

Hiroyuki Ito
Satoshi Tsuji
Department of Information and Communication Sciences, Faculty of Engineering, Kyoto Sangyo University, Kita-ku, Kyoto 603-8555, Japan, and CREST, Japan Science and Technology
Multineuronal recordings have enabled us to examine context-dependent changes in the relationship between the activities of multiple cells. The joint peri-stimulus time histogram (JPSTH) is a much-used method for investigating the dynamics of the interdependence of spike events between pairs of cells. Its results are often taken as an estimate of interaction strength between cells, independent of modulations in the cells' firing rates. We evaluate the adequacy of this estimate by examining the mathematical structure of how the JPSTH quantifies an interaction strength after excluding the contribution of firing rates. We introduce a simple probabilistic model of interacting point processes to generate simulated spike data and show that the normalized JPSTH incorrectly infers the temporal structure of variations in the interaction parameter strength. This occurs because, in our model, the correct normalization of firing-rate contributions is different from that used in Aertsen, Gerstein, Habib, and Palm's (1989) effective connectivity model. This demonstrates that firing-rate modulations cannot be corrected for in a model-independent manner, and therefore the effective connectivity does not represent a universal characteristic that is independent of modulation of the firing rates. Aertsen et al.'s (1989) effective connectivity may still be used in the analysis of experimental data, provided we are aware that this is simply one of many ways of describing the structure of interdependence. We also discuss some measure-independent characteristics of the structure of interdependence.

1 Introduction

Recently, a great amount of attention has been focused on understanding the network mechanisms of information processing in the brain. The advancement of multineuronal recording techniques has enabled us to record simultaneously the spike activities of multiple single units from spatially distant sites as well as from local cortical sites.
Neural Computation 12, 195–217 (2000)
© 2000 Massachusetts Institute of Technology

The availability of multiple neuron spike data has led to a novel coding paradigm called the relational
code. In this paradigm, the information or function is not necessarily localized at the single-cell level. This information or function is dynamically encoded by the context of the relationship among the spike activities of multiple cells. Such a relationship can be either a coherence in the mean firing rates on a psychological timescale (Hebb, 1949) or a fine temporal interdependence among spike events (Abeles, 1982a, 1991; von der Malsburg, 1981, 1986; Gerstein, Bedenbaugh, & Aertsen, 1989; Engel, König, Kreiter, Schillen, & Singer, 1992; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996). In order to understand the network mechanisms involved in information processing, we investigate the relationship among multiple cells and discuss its relation to the context of information processing (stimulus presentation or task performance). Thus, we need a powerful analytical tool that visualizes the spatiotemporal statistical structure in the activities of the neuronal network based on the spike data of multiple single units.

Traditionally, we have computed peri-stimulus time histograms (PSTH) for all of the recorded cells to examine coherence in the firing rates. A cross-correlogram has then been applied to a pair of cells to detect the temporal interdependence among spike activities. However, the cross-correlogram is limited in its ability to resolve rapid changes in the structure of interdependence because it averages over the sample interval. Recent experimental reports of time-domain analyses suggest that the spike interdependence between two cells evolves in a nonstationary way even within a trial duration of a few seconds (Gray, Engel, König, & Singer, 1992; Vaadia et al., 1995). Aertsen, Gerstein, Habib, and Palm (1989) introduced a novel statistical method for analyzing the spike trains of a pair of cells, the joint peri-stimulus time histogram (JPSTH). The JPSTH was designed to quantify the nonstationary modulation of the spike interdependence within the trial duration.
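For concreteness, the conventional cross-correlogram mentioned above can be sketched as follows. This is a minimal illustration under my own conventions (binary spike trains binned at width Δ, coincidence counts accumulated per lag), not code from the paper:

```python
import numpy as np

def cross_correlogram(spikes1, spikes2, max_lag):
    """Count coincidences of spikes2 at each lag relative to spikes1.

    spikes1, spikes2: equal-length 0/1 arrays (one trial, one bin per entry).
    Returns lags in bins and the coincidence count at each lag."""
    n = len(spikes1)
    lags = np.arange(-max_lag, max_lag + 1)
    counts = np.zeros(len(lags), dtype=int)
    for i, lag in enumerate(lags):
        if lag >= 0:
            # spike in train 2 occurs `lag` bins after a spike in train 1
            counts[i] = int(np.sum(spikes1[:n - lag] * spikes2[lag:]))
        else:
            counts[i] = int(np.sum(spikes1[-lag:] * spikes2[:n + lag]))
    return lags, counts
```

Averaging such counts over a long sample interval is precisely what limits the cross-correlogram's temporal resolution and motivates the JPSTH.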
Currently, the JPSTH is considered the most advanced tool for the statistical analysis of spike interdependence. The number of experimental reports applying the JPSTH has been increasing, and this increase reflects the interest in the dynamical structure of interdependence within a cell assembly. Here we examine the mathematical structure of how the JPSTH quantifies a degree of interdependence after excluding the contribution of firing rates. In section 2, we provide a basic description of the JPSTH method. Then, after introducing two different ways of using the JPSTH, we discuss how Aertsen et al. (1989) defined the effective connectivity to quantify the degree of interdependence. We present a statistical evaluation of the JPSTH in section 3. We introduce a simple probabilistic model of point processes in which the interaction strength is characterized by a parameter different from the effective connectivity. Through the normalization procedure of the JPSTH of the simulated spike data, we examine how the temporal structure of interdependence described by our interaction parameter is related to that described by the effective connectivity. We discuss the results of
the simulations in section 4. Detailed mathematical derivations of the formulas used in the text are summarized in the appendix. In this article, we consistently use the term interdependence instead of the more common correlation, because the latter is often confused with the correlation coefficient, which is one particular measure of interdependence. The essence of this study has been presented elsewhere (Ito & Tsuji, 1997; Ito, 1998).

2 J-PSTH

2.1 Method. By simultaneously recording the spike activities of two cells, cell 1 and cell 2, for a trial duration T over K trials under the same experimental condition, we obtain K pairs of spike trains. Here we assume that the spike trains of each trial are time-locked to some trigger event such as the stimulus onset. We divide the trial duration into multiple bins of a small width Δ so that there is at most a single spike within any bin in each trial. Let n_i^(k)(u) be the spike count of cell i (i = 1, 2) in the u-th bin of the k-th trial. The variable n_i^(k)(u) can then take the value 0 or 1. The average spike count,

⟨n_i(u)⟩ = (1/K) Σ_{k=1}^{K} n_i^(k)(u),   (2.1)
represents the probability of the occurrence of a spike event at cell i in bin u. The histogram of ⟨n_i(u)⟩ over the index u is the PSTH. In the computation of the raw-JPSTH, a square panel whose side equals the trial duration T is prepared, and time axes are set along the bottom and left sides. The two time axes are locked to the stimulus onset at the bottom-left corner. The square panel is subdivided into small compartments of size Δ, each specified by a pair of integer indices (u, v). To each compartment we assign a matrix element that quantifies the probability of the occurrence of a joint event, in which cell 1 has a spike event in bin u and cell 2 has one in bin v:

⟨n12(u, v)⟩ = (1/K) Σ_{k=1}^{K} n12^(k)(u, v),   (2.2)
where

n12^(k)(u, v) = n1^(k)(u) n2^(k)(v).   (2.3)
An example of the raw-JPSTH is shown in the left panel in Figure 2a. The magnitude of each matrix element is represented by a color scale. At the bottom of the square, the PSTH of cell 1 is plotted upside down. The PSTH of cell 2 is plotted on the left axis.
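The construction in equations 2.1–2.3 can be written compactly with trial-by-bin arrays. A minimal sketch (function and variable names are my own; spike counts are binary (K, U) arrays of trials by bins):

```python
import numpy as np

def raw_jpsth(trains1, trains2):
    """PSTHs <n_i(u)> (eq. 2.1) and raw-JPSTH matrix <n12(u, v)> (eq. 2.2)
    from binary spike-count arrays of shape (K trials, U bins)."""
    K = trains1.shape[0]
    psth1 = trains1.mean(axis=0)  # <n1(u)>
    psth2 = trains2.mean(axis=0)  # <n2(v)>
    # <n12(u, v)> = (1/K) * sum_k n1^(k)(u) * n2^(k)(v), eqs. 2.2-2.3
    jpsth = np.einsum('ku,kv->uv', trains1, trains2) / K
    return psth1, psth2, jpsth
```

The diagonal jpsth[u, u] gives the PST coincidence histogram, and sums along para-diagonals recover the ordinary cross-correlogram, as the text goes on to describe.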
198
Hiroyuki Ito and Satoshi Tsuji
The matrix elements on the main diagonal represent the coincident spike events of the two cells. These elements are used for the PST coincidence histogram, which is plotted along the diagonal line in the right panel of Figure 2a. The summation within para-diagonal bins, on the other hand, is used to estimate the ordinary cross-correlogram, which is plotted in the upper-right corner of Figure 2a. The central peak of the cross-correlogram represents the cumulative frequency of coincident spike events over the entire trial duration. The PST coincidence histogram visualizes fine temporal modulation of the frequency of coincident events within the trial duration, that is, the time domains of more frequent coincident events.

In the raw-JPSTH, the statistical structure of the spike data is represented by three quantities: ⟨n1(u)⟩, ⟨n2(v)⟩, and ⟨n12(u, v)⟩. The goal of the JPSTH is to provide a statistical measure extracting the net degree of interdependence from the joint probability ⟨n12(u, v)⟩ after excluding the contribution of the firing rates. The element of the predicted-JPSTH is given by

⟨predicted-JPSTH⟩ = ⟨n1(u)⟩⟨n2(v)⟩,   (2.4)

which represents the probability of accidentally correlated events that appear even when the two spike trains are statistically independent (null hypothesis). The departure of the test data from the null hypothesis is estimated by the covariance, the subtraction of the predicted-JPSTH from the raw-JPSTH at each matrix element:

⟨(raw − predicted)-JPSTH⟩ = ⟨n12(u, v)⟩ − ⟨n1(u)⟩⟨n2(v)⟩.   (2.5)
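Equations 2.4 and 2.5 amount to subtracting the outer product of the two PSTHs from the raw-JPSTH. A minimal numerical sketch (my own names and synthetic data; for statistically independent trains the covariance should fluctuate around zero):

```python
import numpy as np

def jpsth_covariance(trains1, trains2):
    """raw-JPSTH, predicted-JPSTH (eq. 2.4), and their difference (eq. 2.5)
    from binary (K trials, U bins) spike-count arrays."""
    K = trains1.shape[0]
    raw = np.einsum('ku,kv->uv', trains1, trains2) / K      # <n12(u, v)>
    predicted = np.outer(trains1.mean(axis=0), trains2.mean(axis=0))
    return raw, predicted, raw - predicted

# Independent trains (null hypothesis): the covariance stays near zero.
rng = np.random.default_rng(0)
t1 = (rng.random((2000, 20)) < 0.1).astype(float)
t2 = (rng.random((2000, 20)) < 0.1).astype(float)
raw, predicted, cov = jpsth_covariance(t1, t2)
```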
In general, there are modulations of the firing rates within the trial duration. To compare the net degree of spike interdependence across different compartments, we must further normalize the covariance at each compartment by a factor depending on the firing rates of the two cells. This normalization procedure is the critical issue in the formulation of the JPSTH. The main concern of this study is how to introduce an adequate normalization.

2.2 Standpoint of Analysis. There are two different ways of using the JPSTH to extract the degree of interdependence. In the first, we consider the JPSTH as an abstract statistical measure (statistic) of the spike interdependence between the two cells. The element of the raw-JPSTH is transformed into some statistical measure, such as the correlation coefficient. The measure is defined purely for statistical purposes and has nothing to do with the mechanism of the point processes generating the spike data. In this sense, this approach is essentially model free. The choice of a particular measure of interdependence is not unique, and various statistical measures with different characteristics (e.g., efficacy, contribution, correlation density, correlation coefficient, and so on) have been introduced (Brillinger, 1975; Palm, Aertsen, & Gerstein, 1988; Aertsen et al., 1989).
The second way starts from an explicit definition of a simple probabilistic model of the point processes, which is characterized by the firing rates of the two cells and a parameter controlling the interaction strength between the spike events at the two cells. Let this parameter be called the interaction mechanism strength. We first solve a forward problem to obtain the relationship between these model parameters and the covariance of the spike data generated by this model. With this relation, the normalization of the covariance is defined so as to reproduce the model parameter of interaction mechanism strength. The JPSTH therefore works as a tool to facilitate the inference of the interaction mechanism strength from the given spike data. In this model-based approach, the model is introduced as a conventional tool (an equivalent circuit) used for our description of the structure of spike interdependence. We attempt to interpret the structure of interdependence in the data through its correspondence to the phenomena in the particular model. Here again the choices of the model and of the parameter of interaction mechanism strength are not unique.

Aertsen et al. (1989) took a somewhat intermediate approach between the two. They first defined a specific probabilistic spike model (Aertsen & Gerstein, 1985). The model was then used to generate spike data in which the temporal modulations of both the firing rates and the interaction mechanism strength were intermingled. Subsequently, they searched for a way to normalize the JPSTH so that the modulation profile of the interaction mechanism strength of their model was qualitatively reproduced. After various normalization trials, they finally adopted the one that divided the (raw − predicted)-JPSTH (covariance) by the product of the standard deviations of the firing rates of the two cells. This normalization procedure coincided with the transformation of the raw-JPSTH into the correlation coefficient.
Their choice of normalization was made through empirical trial and error, not through the forward problem of their original spike model. Aertsen et al. (1989) calibrated their normalization by positively interdependent spike data from two cells with the same low firing rate. As we will show in section 3, the interaction mechanism strength of their original model is well approximated by the correlation coefficient only in such a limited parameter region. Aertsen et al. (1989) called the elements of their normalized-JPSTH the effective connectivity between the two cells. Logically there are two possible interpretations of the effective connectivity. In the model-free approach, the effective connectivity is nothing but a correlation coefficient. It is an abstract statistical measure of interdependence and has nothing to do with any model of point processes. In the model-based approach, the effective connectivity in the full parameter regime represents a parameter of interaction mechanism strength of some model other than the original model of Aertsen and Gerstein (1985). The model, which is defined implicitly by their normalization, and its interaction mechanism strength (effective connectivity) coincide with those in the original model setup only in the limited parameter
Figure 1: Algorithm for imposing interaction mechanism strength a on independent spike trains with firing rates r1 and r2.
region. It is important to keep the two standpoints clearly separate to avoid confusion in our discussion. Aertsen et al. (1989) stressed a model-based approach by regarding the effective connectivity as "a minimum neuronal circuit model that would reproduce such correlation measurement." In the following study, we also apply the model-based approach to examine the characteristics of the effective connectivity.

3 Statistical Evaluation of the JPSTH

Our main concern with the JPSTH is how well the effective connectivity estimates the temporal structure of the interaction parameter in the spike data. Its results are often taken as a universal estimate of interaction strength, independent of modulations in the firing rates. In order to evaluate the adequacy of this estimate, we introduce another spike model in which the interaction mechanism strength is characterized by a parameter different from the effective connectivity. We then examine how the temporal structure of interdependence described by our parameter is related to that described by the effective connectivity.

3.1 Model. We introduce a simple model of two cells. The generation of interdependent spike trains by the model is schematically depicted in Figure 1. First, spike trains with Poisson randomness are generated for cell 1 and cell 2 with mean firing rates of r1 and r2, respectively (0 ≤ r1, r2 ≤ 1).¹ The two spike trains do not have any significant interdependence beyond the accidental component.

¹ In actual numerical simulations, the trial duration is divided into multiple bins of width Δ. Initially we assign the value 0 to each bin, representing the absence of a spike event. The value is replaced by 1 when the algorithm imposes a spike event at the corresponding bin. Thus, the spike timing is quantized in units of Δ. The mean firing rate r is the probability of finding a spike event in each bin.

Then a positive interdependence is imposed
on the spike trains by the following algorithm. Let t1(i) be the occurrence time of the i-th spike event in the spike train of cell 1. Consider the event at time t1(i) + τ in the spike train of cell 2, where τ is the delay interval of the interdependent spike events. When there is already a spike event at that time, leave it unaffected. However, when there is no spike, add a spike event with probability a (0 ≤ a ≤ 1). In Figure 1, the inserted spike events are represented by thick lines. Repeat this procedure for all the spike events in the spike train of cell 1. As a result of these procedures, we can impose spike interdependence of the specified strength with time delay τ. Pairs of spike trains with the same statistical properties are generated over multiple trials for statistical averaging.

A negative interdependence between the two spike trains can be imposed by reversing the above procedure. When cell 2 has a spike event at time t1(i) + τ, remove it with probability |a|, where −1 ≤ a ≤ 0. When there is no spike event at that instant, do nothing. The degree of interdependence is quantified by the parameter of interaction mechanism strength a in our model. Our model is identical to that of Aertsen and Gerstein (1985), except that they did not consider the case in which cell 2 already has a spike event at the insertion of the interdependent spike. Thus, the two models become truly identical when the firing rate of cell 2 is low.

3.2 Simulation. We show a typical example of the JPSTH analysis of the simulated spike trains. Figure 2a shows the raw-JPSTH of spike trains with a trial duration of 300 msec. As seen in the two PSTHs, both cells had a lower firing rate (r1 = r2 = 0.14) in the first and last 100-msec intervals of the trial duration. The firing rates of the two cells increased to a higher value (r1 = r2 = 0.32) in the intermediate 100-msec interval.
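The insertion and removal procedures of section 3.1 can be sketched as follows (a minimal implementation of my own; `alpha` stands for the interaction mechanism strength a, and `tau` for the delay τ in bins):

```python
import numpy as np

def interacting_trains(r1, r2, alpha, tau, n_bins, rng):
    """Generate one trial pair: independent binary trains at rates r1, r2,
    then impose interdependence of strength alpha at delay tau (in bins)."""
    s1 = (rng.random(n_bins) < r1).astype(int)
    s2 = (rng.random(n_bins) < r2).astype(int)
    for i in np.flatnonzero(s1):
        j = i + tau
        if not 0 <= j < n_bins:
            continue
        if alpha >= 0:
            if s2[j] == 0 and rng.random() < alpha:
                s2[j] = 1   # insert an interdependent spike
        elif s2[j] == 1 and rng.random() < -alpha:
            s2[j] = 0       # remove a coincident spike (negative a)
    return s1, s2
```

Repeating this over K trials and feeding the resulting pairs into the JPSTH computation reproduces the kind of simulation setup shown in Figure 2.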
However, the interaction mechanism strength was held at a constant negative value a = −0.8, and the time delay τ was set to 0 msec throughout the trial duration. Thus, cell 2 fired coincident spikes with cell 1 much less frequently than the purely accidental component. In the computation of the PST coincidence histograms in Figures 2a–c, we smoothed the diagonal matrix elements along the diagonal line by a moving average using a Gaussian filter with σ = 2 msec. Figures 2b and 2c, respectively, show the predicted-JPSTH and the (raw − predicted)-JPSTH. The degree of departure from the null hypothesis (covariance) in the PST coincidence histogram in Figure 2c remains clearly influenced by the nonstationary modulation of the firing rates, even though the interaction mechanism strength was kept constant throughout the trial duration.

3.3 Normalization. We attempt to separate the modulation in the (raw − predicted)-JPSTH into two components: the modulation derived purely from the firing rates and that of the interaction mechanism strength in the individual pairs of spikes of the two cells. This procedure requires normalization of the matrix elements of the (raw − predicted)-JPSTH by a factor depending
Figure 2: JPSTH of simulated spike data. (a) raw-JPSTH. (b) predicted-JPSTH. (c) (raw − predicted)-JPSTH. The color-code scales of the JPSTH, the PST coincidence histogram, and the cross-correlogram differ among the three plots. Minimum and maximum values of the matrix elements in the three JPSTHs are (a) 0.00, 0.220; (b) 0.00, 0.176; (c) −0.104, 0.124. The firing rates of the two cells, r1 and r2, had a nonstationary modulation (0.14 → 0.32 → 0.14). The interaction mechanism strength was constant (a = −0.8, τ = 0 msec). PST coincidence histograms were smoothed using a Gaussian filter with σ = 2 msec. (T = 300 msec, Δ = 1 msec, K = 50.)
Model Dependence
203
on the firing rates of the two cells. First, we apply the normalization procedure of Aertsen et al. (1989), which transforms the matrix element into the effective connectivity:

$$\langle \text{normalized-JPSTH} \rangle = \frac{\langle (\text{raw} - \text{predicted})\text{-JPSTH} \rangle}{\sqrt{r_1(1-r_1)\,r_2'(1-r_2')}}, \qquad (3.1)$$
where r1 = ⟨n1(u)⟩ and r2' = ⟨n2(v)⟩ are the firing rates of cell 1 at the uth bin and of cell 2 at the vth bin, respectively. The prime is added to r2' so that it is not confused with r2, the firing rate of the initial spike train. Figure 3a shows the normalized-JPSTH computed by equation 3.1. We omit the display of the matrix elements. It is evident that this normalization does not work with our data. The inferred interaction mechanism strength in the PST coincidence histogram still has an apparent modulation by the firing rates. This discrepancy derives from the fact that our interaction mechanism strength, the parameter a, is different from that in the model setup of Aertsen et al. (1989), the effective connectivity. Recall that Aertsen et al. (1989) introduced the effective connectivity as a parameter of interaction mechanism strength in their implicit model setup. The temporal structure of modulations in the model-based parameter is different when we apply a different assumption about the model and the interaction mechanism strength that led to the data. In order to find the proper normalization of the JPSTH that reproduces our model parameter, we need to know the relationship between the (raw − predicted)-JPSTH (covariance) and the model parameters, that is, how sensitively the raw-JPSTH departs from the predicted-JPSTH when the interaction mechanism strength a is imposed on two independent spike trains with firing rates r1 and r2. This relationship is obtained only from the forward problem of the spike model. Thus, the computation of the normalized-JPSTH inevitably becomes model dependent. As summarized in the appendix, straightforward calculations give the following expressions for our spike model:

$$\langle (\text{raw} - \text{predicted})\text{-JPSTH} \rangle = r_1(1-r_1)(1-r_2)\,a \qquad \text{(for positive interdependence)}. \qquad (3.2)$$
By adding the interdependent spikes, the firing rate of cell 2 is increased to

$$r_2' = r_2 + r_1(1-r_2)\,a. \qquad (3.3)$$
Note that only the quantities r1, r2', and ⟨(raw − predicted)-JPSTH⟩ are available in the analysis of experimental data. We need to compute the two unknown parameters, r2 and a, by solving the simultaneous equations 3.2 and 3.3. Substitution of equation 3.3 into equation 3.2 leads to

$$\langle (\text{raw} - \text{predicted})\text{-JPSTH} \rangle = \frac{a}{1 - a r_1}\, r_1(1-r_1)(1-r_2'). \qquad (3.4)$$
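Equations 3.2 to 3.4 are easy to check by a Monte Carlo simulation of the positive-interdependence procedure; the following sketch (parameter values and variable names are ours) pools many bins of two stationary trains:

```python
import random

rnd = random.Random(1)
r1, r2, a, n = 0.2, 0.1, 0.5, 200_000

t1 = [1 if rnd.random() < r1 else 0 for _ in range(n)]
t2 = [1 if rnd.random() < r2 else 0 for _ in range(n)]

# positive interdependence, delay tau = 0: where cell 1 fires and cell 2
# is silent, insert a spike into cell 2 with probability a
t2p = []
for s, b in zip(t1, t2):
    if s == 1 and b == 0 and rnd.random() < a:
        b = 1
    t2p.append(b)

r2p = sum(t2p) / n                                  # empirical r2' (eq. 3.3)
raw = sum(x * y for x, y in zip(t1, t2p)) / n       # raw-JPSTH element
cov = raw - (sum(t1) / n) * r2p                     # (raw - predicted)

pred_32 = r1 * (1 - r1) * (1 - r2) * a                  # eq. 3.2
pred_34 = a / (1 - a * r1) * r1 * (1 - r1) * (1 - r2p)  # eq. 3.4
```

Both predictions agree with the simulated covariance to within sampling error.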
Figure 3: Normalized-JPSTH of simulated spike data applying (a) the form of Aertsen et al. (1989) and (b) our form. The scales of the PST coincidence histogram and cross-correlogram differ between the two plots. The display of matrix elements is omitted, and PST coincidence histograms are smoothed using a Gaussian filter with σ = 2 msec.
The normalization of the (raw − predicted)-JPSTH is attained by solving the above algebraic equation for a. For negative interdependence (−1 ≤ a ≤ 0), we get

$$\langle (\text{raw} - \text{predicted})\text{-JPSTH} \rangle = r_1(1-r_1)\,r_2\,a \qquad \text{(for negative interdependence)}. \qquad (3.5)$$
Substitution of the relationship

$$r_2' = r_2 + r_1 r_2\,a \qquad (3.6)$$

results in

$$\langle (\text{raw} - \text{predicted})\text{-JPSTH} \rangle = \frac{a}{1 + a r_1}\, r_1(1-r_1)\,r_2'. \qquad (3.7)$$
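In practice, the normalization amounts to inverting equation 3.4 (positive covariance) or equation 3.7 (negative covariance) for a, given the measured covariance and the observable rates r1 and r2'. A sketch of that inversion (the closed-form rearrangement and the function name are ours):

```python
def interaction_strength(cov, r1, r2p):
    """Solve eq. 3.4 (cov >= 0) or eq. 3.7 (cov < 0) for a, given the
    (raw - predicted)-JPSTH element cov and the measured rates r1, r2'."""
    if cov >= 0:
        k = r1 * (1 - r1) * (1 - r2p)   # cov = a k / (1 - a r1)
        return cov / (k + r1 * cov)
    m = r1 * (1 - r1) * r2p             # cov = a m / (1 + a r1)
    return cov / (m - r1 * cov)
```

Applied bin by bin to the PST coincidence histogram, such an inversion yields a normalized-JPSTH in the sense of Figure 3b.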
The normalized-JPSTH applying equation 3.7 is shown in Figure 3b. Now the PST coincidence histogram shows an almost constant interaction mechanism strength throughout the trial duration. The mean value averaged over the whole trial duration, −0.78, is in good quantitative agreement with the value of the original spike data, −0.8.

3.4 Comparison to Aertsen et al.'s (1989) Normalization. In our model-based formulation, the firing rates before adding the interdependent spikes, r1 and r2, and the interaction mechanism strength a are the independent variables, which we can set arbitrarily. By equations 3.2 and 3.5, the (raw − predicted)-JPSTH is expressed in a simpler form using these variables. In Aertsen et al.'s (1989) formulation with an implicit model setup, on the other hand, the firing rates of the interdependent spike data, r1 and r2', and the effective connectivity are the independent variables. Their normalization factor is given by

$$(\text{Aertsen et al.'s (1989) normalization}) = \sqrt{r_1(1-r_1)\,r_2'(1-r_2')}, \qquad (3.8)$$
for both positive and negative interdependence. The different independent variables of the two formulations prevent a straightforward comparison. From equations 3.4 and 3.7, we can obtain the corresponding factors in our formulation:

$$(\text{Our normalization for positive interdependence}) = \frac{r_1(1-r_1)(1-r_2')}{1 - a r_1}, \qquad (3.9)$$
$$(\text{Our normalization for negative interdependence}) = \frac{r_1(1-r_1)\,r_2'}{1 + a r_1}. \qquad (3.10)$$
In this format, the normalization factor itself depends on the interaction mechanism strength a. This is not a problem when we consider the parameter a as the solution of equation 3.4 or 3.7 rather than as an unknown variable.
Figure 4: Dependence of the normalization factor on the firing rate r for (a) positive interdependence and (b) negative interdependence. The firing rates of the two cells, r1 and r2', are constrained to be r. Graphs are plotted for different values of a: (a) 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; (b) 0.0, −0.2, −0.4, −0.6, −0.8, −1.0. (Thick lines represent the normalization factor of Aertsen et al. (1989); the two panels have different scales on their ordinates.)
To show the difference between the normalization factors of the two formulations, we plot them as functions of the firing rate r in Figure 4 for positive and negative interdependence, respectively. Here, we consider only the special case r1 = r2' = r for simplicity. In each panel, the graphs are plotted for different values of a, and the thick line represents the normalization factor of Aertsen et al. (1989), that is, r(1 − r). Note that the ordinate has a different scale in the two panels. Since the (raw − predicted)-JPSTH is an estimate of the covariance between the probabilities of a spike event at the two cells, it becomes zero when the cells either
always fire (r = 1) or never fire (r = 0), and so does the normalization.² For positive interdependence, the graphs are rather similar at lower firing rates (r ≤ 0.1). Our normalization coincides with that of Aertsen et al. (1989) when a = 1. For negative interdependence, on the other hand, they are very different even at lower firing rates. Therefore, the difference between our normalized-JPSTH and that of Aertsen et al. (1989) becomes more prominent with negative interdependence. In the general parameter region of lower firing rates (r1, r2' ≪ 1), the normalization factors are approximated as

$$(\text{Aertsen et al.'s (1989) normalization}) \simeq \sqrt{r_1 r_2'}, \qquad (3.11)$$
and

$$(\text{Our normalization for positive interdependence}) \simeq r_1, \qquad (3.12)$$

$$(\text{Our normalization for negative interdependence}) \simeq r_1 r_2'. \qquad (3.13)$$
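For the special case r1 = r2' = r plotted in Figure 4, the factors of equations 3.8 to 3.10 and their low-rate limits 3.11 to 3.13 can be compared directly; a small sketch (function names are ours):

```python
def aertsen_factor(r):
    # eq. 3.8 with r1 = r2' = r: sqrt(r(1-r) * r(1-r)) = r(1-r)
    return r * (1 - r)

def ours_positive(r, a):
    return r * (1 - r) * (1 - r) / (1 - a * r)   # eq. 3.9 with r1 = r2' = r

def ours_negative(r, a):
    return r * (1 - r) * r / (1 + a * r)         # eq. 3.10 with r1 = r2' = r
```

At a = 1 the positive-interdependence factor reduces to r(1 − r), the point where the two normalizations coincide; for r ≪ 1 the three functions reproduce the limits of equations 3.11 to 3.13.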
Our spike model is identical to that of Aertsen and Gerstein (1985) in this parameter region. Thus, we see that Aertsen et al.'s (1989) normalization (effective connectivity) can reproduce the interaction mechanism strength of their original spike model only when r1 = r2' ≪ 1 for positive interdependence. This was exactly the parameter region that they used for simulating spike data to calibrate their normalization procedure (Aertsen et al., 1989). For negative interdependence, their normalization always underestimates the interaction mechanism strength of the original model. This asymmetry between positive and negative interdependence is an intrinsic property of the point processes of the spike events (Palm et al., 1988) and was the main conclusion of the study by Aertsen and Gerstein (1985) on the cross-correlogram.

3.5 Simulation of Nonstationary Interdependent Structure. We examined another example in which the interaction mechanism strength showed a nonstationary modulation but the firing rates of the two cells were kept constant. The initial Poisson random spike trains were generated with a firing rate of r1 = r2 = 0.32 throughout the trial duration of 300 msec. Then, in the intermediate 100-msec interval, positively interdependent spike events were added to the spike train of cell 2 by imposing a strength of a = 0.3. The time delay was set at τ = 0 msec. The interdependent spikes were not added during the first and last 100-msec intervals, where a = 0. Only the

² When a = −1, our normalization factor has a singularity at r = 1, which derives from our expressing it in terms of r2'. Since r1 = 1 and a = −1, every initial spike event in cell 2 should be removed by the algorithm. This is impossible under the constraint r2' = 1.
Figure 5: Normalized-JPSTH of simulated spike data applying our normalization. The data have stationary firing rates of the two cells (r1 = r2 = 0.32) and a nonstationary modulation of the interaction mechanism strength within the trial duration (a: 0 → 0.3 → 0, τ = 0 msec). The display of matrix elements is omitted, and the PST coincidence histogram is smoothed using a Gaussian filter with σ = 2 msec. (T = 300 msec, Δ = 1 msec, K = 50.)
PST coincidence histogram of the normalized-JPSTH applying our normalization is shown in Figure 5 to demonstrate that the nonstationary structure of interdependence is faithfully reproduced. The mean value of the interaction mechanism strength averaged over the intermediate 100-msec interval is 0.32, close to the original value.

4 Discussion

4.1 Model Dependence. We could correctly normalize the JPSTH in Figure 3b so as to reproduce our model parameter. This was because we knew that the given spike data had been generated by the same model. However, if we start our analysis only from the spike data without knowing the spike model, as in the analysis of experimental data, the model selected is no longer unique. There are other possible models that generate spike data with the same statistical structure as the given data. The parameter of interaction mechanism strength quantifying the degree of interdependence could be defined differently in each model, so we would need a different normalization procedure to reproduce each parameter. Thus, when we apply a different assumption about the model or mechanism that led to the data, the temporal structure of modulations in that model-based parameter (interaction mechanism strength) is different. The quantification of the degree
of interdependence through the inference of the model-based parameter is an ill-posed inverse problem without a unique solution. Aertsen et al. (1989) explicitly cautioned against the misunderstanding that the temporal modulation of the effective connectivity represents a universal characteristic, independent of the modulation of the firing rates. However, this common misunderstanding still seems to persist. Like our parameter a, the effective connectivity is not a universal measure of spike interdependence. It is just one of many possible ways of describing the degree of interdependence.

4.2 Logic in the Determination of Measures of Interdependence. Up to this point, we retained a model-based approach in order to examine the characteristics of the effective connectivity. On the other hand, as explained in section 2, a variety of model-free measures of interdependence have been suggested. Although each abstract statistical measure is mathematically well defined, there is a nonuniqueness in the choice of the particular measure adopted for characterizing the structure of interdependence in the given spike data. Of course, when model-free measures with different characteristics (e.g., correlation density and correlation coefficient) are applied to the same data, we get qualitatively different modulation profiles of the degree of interdependence (Aertsen et al., 1989). Those descriptions can be used to study different aspects of the structure of spike interdependence. The logical structure of the nonuniqueness of the measure of interdependence can be clearly understood by applying an elementary concept borrowed from information geometry (Amari, 1985, 1998). At each compartment (u, v) of the JPSTH, a set of joint probabilities {pij} = (p00, p01, p10, p11) can be computed from the spike data. The quantity pij represents the joint probability of the event that the spike count at the uth bin of cell 1 is i (0 or 1) and that at the vth bin of cell 2 is j.
A particular set of probabilities {pij} can be regarded as a point in the space of probability distributions under the constraint p00 + p01 + p10 + p11 = 1. Characterization of the statistical structure of the set {pij} is performed by a transformation into a new coordinate set composed of statistical measures. Initially, the firing rates of the two cells, r1 = p10 + p11 and r2' = p01 + p11, are selected as universal measures. Then the set {pij} is represented by the new coordinate set (r1, r2', p11), which is exactly the raw-JPSTH. In this article, we formulated our measure of interdependence through the normalization of the covariance, p11 − r1 r2'. However, a measure ξ can be introduced in a more general form by imposing a particular relation, ξ = ξ(r1, r2', p11), satisfying the mathematical condition of a one-to-one mapping.³ Then the set {pij} can be represented by the coordinate

³ Actually, the measure proposed by Amari (1998) in the framework of information geometry has the form ξ = log(p00 p11 / (p01 p10)).
system (r1, r2', ξ), which corresponds to the normalized-JPSTH. Just as there is nonuniqueness in the selection of a coordinate system in Euclidean space, there are infinite possibilities in the selection of such a measure of interdependence. Any measure is equally adequate as long as the coordinate system (r1, r2', ξ) uniquely indicates the universal point {pij}. The value of the coordinate ξ (degree of interdependence) has only a relative meaning with respect to the selected coordinate system. We can discuss the variations of the degree of interdependence within the trial duration by applying a particular measure. However, the description does not have a universal meaning independent of the choice of measure. The interrelationship between the descriptions given by different measures can be examined, as we did with the numerical simulation in the last section. Although the measure of interdependence is a model parameter of interaction mechanism strength in the model-based approach, it is an abstract statistical measure (statistic) in the model-free approach. However, the two approaches share the same logical structure: for the determination of a measure ξ, we need to impose some constraint or criterion on its value, leading to a particular relation ξ = ξ(r1, r2', p11). In the model-based approach, the introduction of a particular model of the point processes imposes a specific parameter of interaction mechanism strength ξ. Aertsen et al. (1989) introduced an implicit model and defined their model-based measure (effective connectivity) under the condition that, in a limited parameter regime, it approximated the interaction mechanism strength of another explicit model they had proposed previously. We also introduced a similar explicit model and formulated a JPSTH normalization procedure that reproduces our parameter of interaction mechanism strength over the entire parameter regime.
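The coordinate change discussed above, from {pij} to (r1, r2', ξ), can be sketched as follows; we take ξ to be the log-odds measure of footnote 3 (our assumption on the exact form), and the function name is ours:

```python
import math

def jpsth_coordinates(p00, p01, p10, p11):
    """Transform the joint probabilities {p_ij} of one (u, v) compartment
    into the coordinates (r1, r2', xi)."""
    r1 = p10 + p11                             # firing rate of cell 1 at bin u
    r2p = p01 + p11                            # firing rate of cell 2 at bin v
    xi = math.log((p00 * p11) / (p01 * p10))   # Amari's (1998) measure
    return r1, r2p, xi
```

For statistically independent trains, pij factorizes and ξ = 0, the null value shared by all measures of interdependence.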
In the model-free approach, on the other hand, the selection of a particular measure should be guided by purely statistical considerations (Brillinger, 1975; Palm et al., 1988). Among the various statistical measures having a symmetrical dependence on the firing rates of the two cells, the correlation coefficient is most often used because of its established popularity. Amari (1998) suggested that a measure can be derived in a less subjective manner in the mathematical framework of information geometry by imposing a particular measure of distance between points in the space of the probability distributions {pij}.

4.3 Universal Characteristics of the Structure of Interdependence. Although the modulation profile of the degree of interdependence is itself measure dependent, we can discuss universal characteristics in special cases. The first case is when there is a significant modulation in the degree of interdependence within the trial duration, although the firing rates of the two cells show little modulation. We simulated such a case in Figure 5. Another case is when the modulation profile of the degree of interdependence shows a significant change between different contexts of the experiment (different stimuli or tasks), but the PSTH of each cell remains unchanged.
In these cases, at least the fact that there exists a change in the degree of interdependence is a measure-independent characteristic. An example was introduced by Vaadia et al. (1995) in recordings of two cells in the prefrontal cortex of a monkey. In both of the two paradigms of the delayed localization task, there was very little modulation of the firing rate of either cell throughout the trial duration. Nevertheless, the JPSTH with Aertsen et al.'s (1989) normalization revealed a nonstationary modulation of the effective connectivity on the order of a few tens of milliseconds. Interestingly, the modulation showed significantly different profiles between the two paradigms. The modulation of the firing rates and its influence on the degree of interdependence have the same timescale, as seen in Figure 3. Thus, if there is a statistically significant modulation of the degree of interdependence at a different (possibly faster) timescale than that of the firing rates, it would be an indication of a universal statistical characteristic of the structure of interdependence. In any measure of interdependence, statistical independence between the two spike trains is assigned a null value. Then the sign of the interdependence (positive or negative) has a universal meaning. By adopting any measure of interdependence, we can statistically test whether the test data depart significantly from the null hypothesis of statistical independence. Thus, the standard cross-correlogram and unitary event analysis (Riehle, Grün, Diesmann, & Aertsen, 1997), by which we look for significantly interdependent events, are basically model-free analyses. However, the same difficulty as with the JPSTH arises whenever we attempt to normalize the correlogram for a comparison of the degree of interdependence between data with different firing rates (Abeles, 1982b).
Nevertheless, even if we apply a different normalization procedure, it simply changes the overall scale of the frequency in the histogram, so that the structure of the cross-correlogram remains qualitatively the same. Different normalization procedures in the JPSTH, however, can lead to qualitatively different results.

4.4 Future Problems. We have avoided introducing a large-scale model composed of physiologically plausible units, such as a detailed compartment model. Our simple probabilistic model was nothing more than an abstract algorithm for imposing spike interdependence. In addition, the model parameters used in the simulations were not necessarily within the physiologically relevant range of the actual cortical system, which has low firing rates and weak neuronal connectivity. This is because we introduced our model as a convenient tool for describing the structure of interdependence rather than as a simulator of realistic cortical dynamics. The following two tasks can be carried out independently: the characterization of the statistical structure of spike data, and the solution of the inverse problem of determining what actual neuronal network mechanism realizes the observed statistical characteristics.
For future research, it would be valuable to simulate an in virtu experiment (Aertsen, 1988; Aertsen, Erb, & Palm, 1994) using a physiologically plausible network model on a larger scale. We could numerically simulate the dynamics of the network and perform a virtual recording of the spike activities of only two cells. Then we would infer the effective connectivity with the JPSTH, as we do for actual experimental data. In contrast to an actual experiment, all numerical data would be available on the activities of the other, "nonrecorded" cells. We could then discuss how the global activities of the network influence the temporal modulation of the degree of interdependence between the two cells. Such knowledge is important for the inference of network activities from real experimental data, where the activities of the other, nonrecorded cells are unknown. It would also be valuable to execute systematic evaluations of which measures are relevant for detecting particular network events. Melssen and Epping (1987) investigated the adequacy of cross-correlation procedures for detecting and estimating neuronal connectivity by computer simulations of small networks composed of fairly realistic model neurons. They demonstrated that besides the strength of connectivity and the firing rate, the neurons' internal state (the nonlinear spike-generating characteristics) also influenced the visibility and detectability of neuronal connectivity. In their in virtu experiment, they examined how nonlinear superposition of the stimulus-driven activities degraded the reliability of estimates of neuronal connectivity. Their results suggested that there is no universal procedure that estimates neuronal connectivity adequately for all modes of neuronal activity. This conclusion is supported by our current study as well. Throughout this study, we assumed that the bin width of the JPSTH is small enough that no bin ever holds more than one spike event per trial.
However, in many analyses of experimental data, this assumption is not satisfied. In actual recordings, the firing rates are low, and neurons are recorded during only finite numbers of trials. Neurons nevertheless sometimes fire in bursts. Therefore, if bins small enough to guarantee the above condition were used, most bins would be empty, and the structure of the JPSTH would be very scattered and difficult to visualize. Broader bins are used for smoothing the scattered structure of the JPSTH. We stress that our conclusion about the JPSTH has a broader nature that does not depend on this assumption about the bin width.

5 Conclusion

We examined the mathematical structure of how the JPSTH quantifies the degree of interdependence between the spike events of two cells after excluding the contribution of the firing rates. We introduced a simple probabilistic model of interacting point processes to generate simulated spike data and showed that the normalized JPSTH incorrectly inferred the temporal structure of the variations in the interaction mechanism strength. This occurred
because, in our model, the correct normalization of the firing rate contributions was different from that used in Aertsen et al.'s (1989) "effective connectivity" model. This demonstrates that firing-rate modulations cannot be corrected for in a model-independent manner. Finally, our purpose is not to encourage using our normalization form instead of Aertsen et al.'s (1989) standard normalization. It is to raise awareness that no model parameter of spike interdependence, neither Aertsen et al.'s (1989) effective connectivity nor ours, represents a universal characteristic that is independent of modulations of the firing rates. Aertsen et al.'s (1989) effective connectivity may still be used in the analysis of experimental data, provided that one is aware that it is simply one of many ways of describing the structure of interdependence. We recommend first characterizing the statistical structure of interdependence by applying an a priori model-free measure (e.g., the correlation coefficient). Its results can then be used to infer the interaction mechanism strength of a particular model after assumptions about the interaction mechanism are made.

Appendix: Expression of the (raw − predicted)-JPSTH
The proper normalization of the (raw − predicted)-JPSTH can be obtained from the relationship between the JPSTH and the model parameters.

A.1 Positive Interdependence. We consider the case in which the algorithm imposes spike interdependence with a strength a and a delay τ on two independent random spike trains with mean firing rates r1 and r2. Owing to the constraint that a temporal bin should contain no more than one spike event, the mean firing rate represents the probability of finding a spike event in each bin. Let ni(u) be a variable representing the occurrence of a spike event at cell i (i = 1, 2) in the uth bin. It takes either 1 or 0 depending on whether there is a spike. The matrix element of the raw-JPSTH at (u, u + τ) is given by a probability, P[n1(u) = 1 ∧ n2(u + τ) = 1], where P[A ∧ B] represents the joint probability of two events A and B. This is computed as

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = P[n_1(u) = 1]\, P[n_2(u+\tau) = 1 \mid n_1(u) = 1], \qquad (A.1)$$

where P[A | B] represents the conditional probability of event A given event B. In the spike trains before the interdependent spikes are added, the two events n1(u) = 1 and n2(u + τ) = 1 are independent, so that P[n1(u) = 1] = r1 and P[n2(u + τ) = 1 | n1(u) = 1] = P[n2(u + τ) = 1] = r2. Then,

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1 r_2 \qquad \text{(before imposing interdependence)}.$$
On the other hand, after the algorithm adds interdependent spikes, there are two possibilities for the occurrence of the event n2(u + τ) = 1 under the condition n1(u) = 1:

1. n2(u + τ) = 1 already in the initial spike train.
2. n2(u + τ) = 0 in the initial spike train; then a spike is added with probability a.

The conditional probability of the first event is r2 and that of the second event is (1 − r2)a. Since the two events are exclusive, we get

$$P[n_2(u+\tau) = 1 \mid n_1(u) = 1] = r_2 + (1 - r_2)\,a. \qquad (A.2)$$
Substitution of this result into equation A.1 gives the matrix element of the raw-JPSTH,

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1\{r_2 + (1-r_2)\,a\}, \qquad \langle\text{raw-JPSTH}\rangle. \qquad (A.3)$$
By adding interdependent spikes, the algorithm increased the firing rate of cell 2, which can be computed from the relation

$$P[n_2(u+\tau) = 1] = P[n_1(u) = 0]\, P[n_2(u+\tau) = 1 \mid n_1(u) = 0] + P[n_1(u) = 1]\, P[n_2(u+\tau) = 1 \mid n_1(u) = 1].$$

Substituting P[n2(u + τ) = 1 | n1(u) = 0] = r2 and equation A.2 into the above relation results in

$$P[n_2(u+\tau) = 1] = (1-r_1)\,r_2 + r_1\{r_2 + (1-r_2)\,a\} = r_2 + r_1(1-r_2)\,a.$$
The matrix element of the predicted-JPSTH is given by the probability of an accidental joint event between two independent spike trains with rates r1 and r2' = r2 + r1(1 − r2)a, that is,

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1 r_2' = r_1\{r_2 + r_1(1-r_2)\,a\}, \qquad \langle\text{predicted-JPSTH}\rangle. \qquad (A.4)$$
Finally, the (raw − predicted)-JPSTH is given by subtracting equation A.4 from equation A.3:

$$\delta P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1(1-r_1)(1-r_2)\,a, \qquad \langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle.$$
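The algebra of section A.1 can be double-checked numerically; a minimal sketch (function name and parameter values are ours):

```python
def jpsth_elements_positive(r1, r2, a):
    """Raw, predicted, and (raw - predicted) JPSTH elements for positive
    interdependence, following section A.1."""
    raw = r1 * (r2 + (1 - r2) * a)      # eq. A.3
    r2p = r2 + r1 * (1 - r2) * a        # increased firing rate of cell 2
    predicted = r1 * r2p                # eq. A.4
    return raw, predicted, raw - predicted
```

For any 0 < r1, r2 < 1 and 0 ≤ a ≤ 1, the difference equals r1(1 − r1)(1 − r2)a, the final result above.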
A.2 Negative Interdependence. After the algorithm removes interdependent spikes from the spike train of cell 2, there is a unique possibility for the occurrence of the event n2(u + τ) = 1 under the condition n1(u) = 1: n2(u + τ) = 1 in the initial spike train, and the spike is not removed, which happens with probability 1 − |a|, where −1 ≤ a ≤ 0. The probability of this event is

$$P[n_2(u+\tau) = 1 \mid n_1(u) = 1] = r_2(1 - |a|). \qquad (A.5)$$
Substitution of this result into equation A.1 gives the matrix element of the raw-JPSTH,

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1 r_2 (1 - |a|), \qquad \langle\text{raw-JPSTH}\rangle. \qquad (A.6)$$
The firing rate of cell 2 was decreased by the removal of interdependent spikes, so that

$$P[n_2(u+\tau) = 1] = (1-r_1)\,r_2 + r_1 r_2(1-|a|) = r_2 - r_1 r_2\,|a|.$$
The matrix element of the predicted-JPSTH is then

$$P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = r_1(r_2 - r_1 r_2\,|a|), \qquad \langle\text{predicted-JPSTH}\rangle. \qquad (A.7)$$
Subtraction of equation A.7 from equation A.6 gives the final result

$$\delta P[n_1(u) = 1 \wedge n_2(u+\tau) = 1] = -r_1(1-r_1)\,r_2\,|a|, \qquad \langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle.$$
Acknowledgments

We thank S. Amari for his important comments. We also express our thanks to M. Abeles for his comments and encouragement. We thank L. Kay for critical readings of the manuscript. Finally, we are grateful to the anonymous referees for numerous instructive comments that helped us to improve the focus of the article.

References

Abeles, M. (1982a). Local cortical circuits. Berlin: Springer-Verlag.
Abeles, M. (1982b). Quantification, smoothing, and confidence limits for single units' histograms. Journal of Neuroscience Methods, 5, 317–325.
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A. (1988). Discourse on current issues in multi-neuron studies. In W. von Seelen, G. Shaw, & U. M. Leinhos (Eds.), Organization of neural networks (pp. 87–107). Weinheim: VHC Verlag.
Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.
Aertsen, A. M. H. J., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61, 900–918.
Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1998). Information geometry of neuro-manifolds. In S. Usui & T. Omori (Eds.), Proceedings of ICONIP'98 (pp. 591–593). Amsterdam: IOS Press.
Brillinger, D. R. (1975). The identification of point process systems. Ann. Prob., 3, 909–929.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15, 218–226.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—Theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Transactions in Biomedical Engineering, 36, 4–14.
Gray, C. M., Engel, A. K., König, P., & Singer, W. (1992). Synchronization of oscillatory neuronal responses in cat striate cortex: Temporal properties. Visual Neurosciences, 8, 337–347.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Ito, H. (1998).
Need for development of statistical analysis tools of dynamical neuronal code—Statistical evaluation of joint-PSTH. In S. Usui & T. Omori (Eds.), Proceedings of ICONIP'98 (pp. 175–178). Amsterdam: IOS Press.
Ito, H., & Tsuji, S. (1997). Statistical evaluation of joint peri-stimulus time histogram for analysis of dynamical neuronal code. Soc. Neurosci. Abstr., 23, 1400.
Melssen, W. J., & Epping, W. J. M. (1987). Detection and estimation of neural connectivity based on crosscorrelation analysis. Biol. Cybern., 57, 403–414.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. M. H. J. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C. (1986). Am I thinking assemblies? In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 161–176). Berlin: Springer-Verlag.

Received March 30, 1998; accepted February 4, 1999.
LETTER    Communicated by Peter Dayan

Reinforcement Learning in Continuous Time and Space

Kenji Doya
ATR Human Information Processing Research Laboratories, Soraku, Kyoto 619-0288, Japan
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, a continuous actor-critic method and a value-gradient-based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
Neural Computation 12, 219-245 (2000)  © 2000 Massachusetts Institute of Technology

1 Introduction

The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto, Sutton, & Anderson, 1983; Sutton, 1988; Sutton & Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites & Barto, 1996; Zhang & Dietterich, 1996; Singh & Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling, Littman, & Moore, 1996, and Sutton & Barto, 1998, for reviews). The progress of RL research so far, however, has been mostly constrained to the formulation of the problem in which discrete actions are taken in discrete time steps based on the observation of the discrete state of the system. Many interesting real-world control tasks, such as driving a car or riding a snowboard, require smooth, continuous actions taken in response to high-dimensional, real-valued sensory input. In applications of RL to continuous problems, the most common approach has been first to discretize time, state, and action and then to apply an RL algorithm for a discrete stochastic system. However, this discretization approach has the following drawbacks:

- When a coarse discretization is used, the control output is not smooth, resulting in poor performance.
- When a fine discretization is used, the number of states and the number of iteration steps become huge, which necessitates not only large memory storage but also many learning trials.
- In order to keep the number of states manageable, an elaborate partitioning of the variables has to be found using prior knowledge.

Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis & Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh, Jaakkola, & Jordan, 1995; Asada, Noda, & Hosoda, 1996; Pareigis, 1998), and multiple timescale methods (Sutton, 1995). In this article, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, state, and action.
The continuous framework has the following possible advantages:

- A smooth control performance can be achieved.
- An efficient control policy can be derived using the gradient of the value function (Werbos, 1990).
- There is no need to guess how to partition the state, action, and time. It is the task of the function approximation and numerical integration algorithms to find the right granularity.

There have been several attempts at extending RL algorithms to continuous cases. Bradtke (1993) showed convergence results for Q-learning algorithms for discrete-time, continuous-state systems with linear dynamics and quadratic costs. Bradtke and Duff (1995) derived a TD algorithm for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" method by extending Q-learning to be used for continuous-time, continuous-state problems. When we consider optimization problems in continuous-time systems, the Hamilton-Jacobi-Bellman (HJB) equation, which is a continuous-time counterpart of the Bellman equation for discrete-time systems, provides a sound theoretical basis (see, e.g., Bertsekas, 1995, and Fleming & Soner, 1993). Methods for learning the optimal value function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993), and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos & Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state-space. An HJB-based method that uses function approximators was presented by Dayan and Singh (1996). They proposed the learning of the gradients of the value function without learning the value function itself, but the method is applicable only to nondiscounted reward problems. This article presents a set of RL algorithms for nonlinear dynamical systems based on the HJB equation for infinite-horizon, discounted-reward problems. A series of simulations are devised to evaluate their effectiveness when used with continuous function approximators. We first consider methods for learning the value function on the basis of minimizing a continuous-time form of the TD error. The update algorithms are derived either by using a single step or exponentially weighted eligibility traces. The relationships of these algorithms with the residual gradient (Baird, 1995), TD(0), and TD(λ) algorithms (Sutton, 1988) for discrete cases are also shown.
Next, we formulate methods for improving the policy using the value function: the continuous actor-critic method and a value-gradient-based policy. Specifically, when a model is available for the input gain of the system dynamics, we derive a closed-form feedback policy that is suitable for real-time implementation. Its relationship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized gaussian basis function networks for representing the value function, the policy, and the model. We test (1) the performance of the discrete actor-critic, continuous actor-critic, and value-gradient-based methods; (2) the performance of the value function update methods; and (3) the effects of the learning parameters, including the action cost, exploration noise, and landscape of the reward function. Then we test the algorithms in a more challenging task, cart-pole swing-up (Doya, 1997), in which the state-space is higher-dimensional and the system input gain is state dependent.
2 The Optimal Value Function for a Discounted Reward Task

In this article, we consider the continuous-time deterministic system

    \dot{x}(t) = f(x(t), u(t)),     (2.1)

where x ∈ X ⊂ R^n is the state and u ∈ U ⊂ R^m is the action (control input). We denote the immediate reward for the state and the action as

    r(t) = r(x(t), u(t)).     (2.2)

Our goal is to find a policy (control law),

    u(t) = \mu(x(t)),     (2.3)

that maximizes the cumulative future rewards,

    V^\mu(x(t)) = \int_t^\infty e^{-(s-t)/\tau} r(x(s), u(s)) \, ds     (2.4)

for any initial state x(t). Note that x(s) and u(s) (t ≤ s < ∞) follow the system dynamics (2.1) and the policy (2.3). V^μ(x) is called the value function of the state x, and τ is the time constant for discounting future rewards. An important feature of this infinite-horizon formulation is that the value function and the optimal policy do not depend explicitly on time, which is convenient in estimating them using function approximators. The discounted reward makes it unnecessary to assume that the state is attracted to a zero-reward state.

The value function V* for the optimal policy μ* is defined as

    V^*(x(t)) = \max_{u[t,\infty)} \left[ \int_t^\infty e^{-(s-t)/\tau} r(x(s), u(s)) \, ds \right],     (2.5)

where u[t, ∞) denotes the time course u(s) ∈ U for t ≤ s < ∞. According to the principle of optimality, the condition for the optimal value function at time t is given by

    \frac{1}{\tau} V^*(x(t)) = \max_{u(t) \in U} \left[ r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)) \right],     (2.6)

which is a discounted version of the HJB equation (see appendix A). The optimal policy is given by the action that maximizes the right-hand side of the HJB equation:

    u(t) = \mu^*(x(t)) = \arg\max_{u \in U} \left[ r(x(t), u) + \frac{\partial V^*(x)}{\partial x} f(x(t), u) \right].     (2.7)
Reinforcement learning can be formulated as the process of bringing the current policy μ and its value function estimate V closer to the optimal policy μ* and the optimal value function V*. It generally involves two components:

1. Estimate the value function V based on the current policy μ.
2. Improve the policy μ by making it greedy with respect to the current estimate of the value function V.

We will consider the algorithms for these two processes in the following two sections.

3 Learning the Value Function

For the learning of the value function in a continuous state-space, it is mandatory to use some form of function approximator. We denote the current estimate of the value function as

    V^\mu(x(t)) \simeq V(x(t); w),     (3.1)

where w is a parameter of the function approximator, or simply, V(t). In the framework of TD learning, the estimate of the value function is updated using a self-consistency condition that is local in time and space. This is given by differentiating definition 2.4 by t as

    \dot{V}^\mu(x(t)) = \frac{1}{\tau} V^\mu(x(t)) - r(t).     (3.2)

This should hold for any policy, including the optimal policy given by equation 2.7. If the current estimate V of the value function is perfect, it should satisfy the consistency condition \dot{V}(t) = \frac{1}{\tau} V(t) - r(t). If this condition is not satisfied, the prediction should be adjusted to decrease the inconsistency,

    \delta(t) \equiv r(t) - \frac{1}{\tau} V(t) + \dot{V}(t).     (3.3)

This is the continuous-time counterpart of the TD error (Barto, Sutton, & Anderson, 1983; Sutton, 1988).

3.1 Updating the Level and the Slope. In order to bring the TD error (3.3) to zero, we can tune the level of the value function V(t), its time derivative \dot{V}(t), or both, as illustrated in Figures 1A-C. Now we consider the objective function (Baird, 1993),

    E(t) = \frac{1}{2} |\delta(t)|^2.     (3.4)
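The self-consistency condition of equation 3.2 can be checked numerically: if V(t) is computed as the discounted integral of a sampled reward profile (definition 2.4), its finite-difference time derivative should match (1/τ)V(t) − r(t) up to discretization error. The following is a minimal sketch; the reward profile is an arbitrary choice for illustration, not one used in the article.

```python
import numpy as np

# Numerical check of the self-consistency condition (eq. 3.2):
# if V(t) = int_t^inf exp(-(s-t)/tau) r(s) ds (definition 2.4),
# then V_dot(t) = (1/tau) V(t) - r(t).
tau, dt = 1.0, 0.001
t = np.arange(0.0, 20.0, dt)
r = np.exp(-(t - 3.0) ** 2)          # an arbitrary smooth reward profile

# Evaluate the discounted integral by a backward recursion:
# V(t) ~ r(t) dt + exp(-dt/tau) V(t + dt).
V = np.zeros_like(r)
for k in range(len(t) - 2, -1, -1):
    V[k] = r[k] * dt + np.exp(-dt / tau) * V[k + 1]

V_dot = np.gradient(V, dt)
err = np.max(np.abs(V_dot - (V / tau - r))[10:-10])   # should be O(dt)
```

The residual `err` shrinks as dt is reduced, which is exactly the inconsistency that the TD error of equation 3.3 measures for an imperfect estimate.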
Figure 1: Possible updates for the value function estimate V̂(t) for an instantaneous TD error \delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t). (A) A positive TD error at t = t_0 can be corrected by (B) an increase in V(t), or (C) a decrease in the time derivative \dot{V}(t), or (D) an exponentially weighted increase in V(t) (t < t_0).
From definition 3.3 and the chain rule \dot{V}(t) = \frac{\partial V}{\partial x} \dot{x}(t), the gradient of the objective function with respect to a parameter w_i is given by

    \frac{\partial E(t)}{\partial w_i} = \delta(t) \frac{\partial}{\partial w_i} \left[ r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \right]
                                       = \delta(t) \left[ -\frac{1}{\tau} \frac{\partial V(x; w)}{\partial w_i} + \frac{\partial}{\partial w_i} \left( \frac{\partial V(x; w)}{\partial x} \dot{x}(t) \right) \right].

Therefore, the gradient descent algorithm is given by

    \dot{w}_i = -\eta \frac{\partial E}{\partial w_i} = \eta \delta(t) \left[ \frac{1}{\tau} \frac{\partial V(x; w)}{\partial w_i} - \frac{\partial}{\partial w_i} \left( \frac{\partial V(x; w)}{\partial x} \dot{x}(t) \right) \right],     (3.5)

where η is the learning rate. A potential problem with this update algorithm is its symmetry in time. Since the boundary condition for the value function is given at t → ∞, it would be more appropriate to update the past estimates without affecting the future estimates. Below, we consider methods for implementing the backup of TD errors.

3.2 Backward Euler Differentiation: Residual Gradient and TD(0). One way of implementing the backup of TD errors is to use the backward Euler
approximation of the time derivative \dot{V}(t). By substituting \dot{V}(t) = (V(t) - V(t - \Delta t)) / \Delta t into equation 3.3, we have

    \delta(t) = r(t) + \frac{1}{\Delta t} \left[ \left( 1 - \frac{\Delta t}{\tau} \right) V(t) - V(t - \Delta t) \right].     (3.6)

Then the gradient of the squared TD error (see equation 3.4) with respect to the parameter w_i is given by

    \frac{\partial E(t)}{\partial w_i} = \delta(t) \frac{1}{\Delta t} \left[ \left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial V(x(t); w)}{\partial w_i} - \frac{\partial V(x(t - \Delta t); w)}{\partial w_i} \right].

A straightforward gradient descent algorithm is given by

    \dot{w}_i = \eta \delta(t) \left[ -\left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial V(x(t); w)}{\partial w_i} + \frac{\partial V(x(t - \Delta t); w)}{\partial w_i} \right].     (3.7)

An alternative way is to update only V(t - Δt) without explicitly changing V(t), by

    \dot{w}_i = \eta \delta(t) \frac{\partial V(x(t - \Delta t); w)}{\partial w_i}.     (3.8)

The Euler-discretized TD error, equation 3.6, coincides with the conventional TD error,

    \delta_t = r_t + \gamma V_t - V_{t-1},

by taking the discount factor \gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\Delta t / \tau} and rescaling the values as V_t = \frac{1}{\Delta t} V(t). The update schemes, equations 3.7 and 3.8, correspond to the residual-gradient (Baird, 1995; Harmon, Baird, & Klopf, 1996) and TD(0) algorithms, respectively. Note that the time step Δt of the Euler differentiation does not have to be equal to the control cycle of the physical system.

3.3 Exponential Eligibility Trace: TD(λ). Now let us consider how an instantaneous TD error should be corrected by a change in the value V as a function of time. Suppose an impulse of reward is given at time t = t_0. Then, from definition 2.4, the corresponding temporal profile of the value function is

    V^\mu(t) = \begin{cases} e^{-(t_0 - t)/\tau} & t \le t_0, \\ 0 & t > t_0. \end{cases}

Because the value function is linear with respect to the reward, the desired correction of the value function for an instantaneous TD error δ(t_0) is

    \hat{V}(t) = \begin{cases} \delta(t_0) \, e^{-(t_0 - t)/\tau} & t \le t_0, \\ 0 & t > t_0, \end{cases}
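Equations 3.6 and 3.8 can be sketched in a few lines. The following is an illustrative implementation, not the article's code, for the special case of a linear function approximator V(x; w) = w · φ(x) (the article itself uses normalized gaussian networks); the function name and defaults are assumptions.

```python
import numpy as np

def td0_update(w, phi_prev, phi_now, r, dt=0.02, tau=1.0, eta=0.1):
    """One TD(0)-style update (eqs. 3.6 and 3.8) for a linear value
    function V(x; w) = w . phi(x), using backward Euler differentiation.

    phi_prev, phi_now: feature vectors for x(t - dt) and x(t).
    Returns the updated weights and the TD error delta(t)."""
    V_prev = w @ phi_prev
    V_now = w @ phi_now
    # Euler-discretized TD error (eq. 3.6); (1 - dt/tau) plays the role
    # of the discount factor gamma of discrete-time TD learning.
    delta = r + ((1.0 - dt / tau) * V_now - V_prev) / dt
    # TD(0): adjust only the past estimate V(t - dt) (eq. 3.8).
    w = w + eta * delta * phi_prev
    return w, delta
```

The residual-gradient variant of equation 3.7 would instead add the term −(1 − dt/τ)·φ_now inside the update, adjusting both the past and the current estimate.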
as illustrated in Figure 1D. Therefore, the update of w_i given δ(t_0) should be made as

    \dot{w}_i = \eta \int_{-\infty}^{t_0} \hat{V}(t) \frac{\partial V(x(t); w)}{\partial w_i} \, dt = \eta \, \delta(t_0) \int_{-\infty}^{t_0} e^{-(t_0 - t)/\tau} \frac{\partial V(x(t); w)}{\partial w_i} \, dt.     (3.9)

We can consider the exponentially weighted integral of the derivatives as the eligibility trace e_i for the parameter w_i. Then a class of learning algorithms is derived as

    \dot{w}_i = \eta \, \delta(t) \, e_i(t), \qquad \dot{e}_i(t) = -\frac{1}{\kappa} e_i(t) + \frac{\partial V(x(t); w)}{\partial w_i},     (3.10)

where 0 < κ ≤ τ is the time constant of the eligibility trace. If we discretize equation 3.10 with time step Δt, it coincides with the eligibility trace update in TD(λ) (see, e.g., Sutton & Barto, 1998),

    e_i(t + \Delta t) = \lambda \gamma \, e_i(t) + \frac{\partial V_t}{\partial w_i},

with \lambda = \frac{1 - \Delta t / \kappa}{1 - \Delta t / \tau}.
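A minimal Euler-integrated sketch of the trace algorithm of equation 3.10, again for an illustrative linear approximator (the article uses normalized gaussian networks; function name and defaults are assumptions):

```python
import numpy as np

def value_update_with_trace(w, e, phi, V_dot, r, dt=0.02,
                            tau=1.0, kappa=0.1, eta=1.0):
    """One Euler step of the eligibility-trace algorithm (eq. 3.10)
    for a linear value function V(x; w) = w . phi(x).

    e:     eligibility trace vector (same shape as w)
    phi:   feature vector, equal to dV/dw for a linear approximator
    V_dot: estimate of the time derivative of V(x(t))
    """
    V = w @ phi
    delta = r - V / tau + V_dot      # continuous TD error (eq. 3.3)
    w = w + dt * eta * delta * e     # w_dot = eta * delta * e
    e = e + dt * (-e / kappa + phi)  # e_dot = -e/kappa + dV/dw
    return w, e, delta
```

Because the trace decays with time constant κ, each TD error is backed up over the exponentially weighted window of equation 3.9 rather than only one Euler step into the past.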
4 Improving the Policy

Now we consider ways for improving the policy u(t) = μ(x(t)) using its associated value function V(x). One way is to improve the policy stochastically using the actor-critic method, in which the TD error is used as the effective reinforcement signal. Another way is to take a greedy policy with respect to the current value function,

    u(t) = \mu(x(t)) = \arg\max_{u \in U} \left[ r(x(t), u) + \frac{\partial V(x)}{\partial x} f(x(t), u) \right],     (4.1)

using knowledge about the reward and the system dynamics.

4.1 Continuous Actor-Critic. First, we derive a continuous version of the actor-critic method (Barto et al., 1983). By comparing equations 3.3 and 4.1, we can see that the TD error is maximized by the greedy action u(t). Accordingly, in the actor-critic method, the TD error is used as the reinforcement signal for policy improvement. We consider the policy implemented by the actor as

    u(t) = s\!\left( A(x(t); w^A) + \sigma n(t) \right),     (4.2)
where A(x(t); w^A) ∈ R^m is a function approximator with a parameter vector w^A, n(t) ∈ R^m is noise, and s(·) is a monotonically increasing output function. The parameters are updated by the stochastic real-valued (SRV) unit algorithm (Gullapalli, 1990) as

    \dot{w}_i^A = \eta^A \, \delta(t) \, n(t) \, \frac{\partial A(x(t); w^A)}{\partial w_i^A}.     (4.3)
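One actor step of equations 4.2 and 4.3 can be sketched as follows for an illustrative linear actor and the article's sigmoid s(x) = (2/π) arctan((π/2) x). For simplicity the sketch draws white gaussian noise, whereas the article's simulations use low-pass filtered noise; the function name and defaults are assumptions.

```python
import numpy as np

def actor_step(wA, phi, delta, sigma, dt=0.02, etaA=5.0, rng=None):
    """One Euler step of the continuous actor-critic for a linear actor
    A(x; wA) = wA . phi(x) with a sigmoid output (eqs. 4.2 and 4.3).

    delta is the TD error supplied by the critic.
    Returns the action u and the updated actor parameters."""
    rng = rng or np.random.default_rng(0)
    n = rng.standard_normal()                   # exploration noise n(t)
    s = lambda a: (2 / np.pi) * np.arctan((np.pi / 2) * a)
    u = s(wA @ phi + sigma * n)                 # stochastic policy (eq. 4.2)
    # SRV update: correlate the exploration noise with the TD error (eq. 4.3)
    wA = wA + dt * etaA * delta * n * phi
    return u, wA
```

The key point is that the noise term n(t) both perturbs the action and gates the parameter update, so actions whose perturbation happened to increase the TD error are reinforced.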
4.2 Value-Gradient-Based Policy. In discrete problems, a greedy policy can be found by a one-ply search for an action that maximizes the sum of the immediate reward and the value of the next state. In the continuous case, the right-hand side of equation 4.1 has to be maximized over a continuous set of actions at every instant, which in general can be computationally expensive. However, when the cost of the action is convex with respect to the action u and the system dynamics f(x, u) is linear with respect to the action u, the optimization problem in equation 4.1 has a unique solution, and we can derive a closed-form expression of the greedy policy. Here, we assume that the reward r(x, u) can be separated into two parts: the reward for the state, R(x), which is given by the environment and unknown, and the cost for the action, S(u), which can be chosen as a part of the learning strategy. We specifically consider the case

    r(x, u) = R(x) - \sum_{j=1}^{m} S_j(u_j),     (4.4)

where S_j(·) is a cost function for the action variable u_j. In this case, the condition for the greedy action, equation 4.1, is given by

    -S_j'(u_j) + \frac{\partial V(x)}{\partial x} \frac{\partial f(x, u)}{\partial u_j} = 0 \qquad (j = 1, \ldots, m),

where \frac{\partial f(x, u)}{\partial u_j} is the jth column vector of the n × m input gain matrix \frac{\partial f(x, u)}{\partial u} of the system dynamics. We now assume that the input gain \frac{\partial f(x, u)}{\partial u} is not dependent on u, that is, the system is linear with respect to the input, and that the action cost function S_j(·) is convex. Then the above equation has a unique solution,

    u_j = S_j'^{-1}\!\left( \frac{\partial V(x)}{\partial x} \frac{\partial f(x, u)}{\partial u_j} \right),

where S_j'^{-1}(·) is a monotonic function. Accordingly, the greedy policy is represented in vector notation as

    u = S'^{-1}\!\left( \frac{\partial f(x, u)}{\partial u}^{T} \frac{\partial V(x)}{\partial x}^{T} \right),     (4.5)
where \frac{\partial V(x)}{\partial x}^{T} represents the steepest ascent direction of the value function, which is then transformed by the "transpose" model \frac{\partial f(x, u)}{\partial u}^{T} into a direction in the action space, and the actual amplitude of the action is determined by the gain function S'^{-1}(·). Note that the gradient \frac{\partial V(x)}{\partial x} can be calculated by backpropagation when the value function is represented by a multilayer network. The assumption of linearity with respect to the input is valid in most Newtonian mechanical systems (e.g., the acceleration is proportional to the force), and the gain matrix \frac{\partial f(x, u)}{\partial u} can be calculated from the inertia matrix. When the dynamics is linear and the reward is quadratic, the value function is also quadratic, and equation 4.5 coincides with the optimal feedback law for a linear quadratic regulator (LQR; see, e.g., Bertsekas, 1995).

4.2.1 Feedback Control with a Sigmoid Output Function. A common constraint in control tasks is that the amplitude of the action, such as the force or torque, is bounded. Such a constraint can be incorporated into the above policy with an appropriate choice of the action cost. Suppose that the amplitude of the action is limited as |u_j| ≤ u_j^{max} (j = 1, ..., m). We then define the action cost as

    S_j(u_j) = c_j \int_0^{u_j} s^{-1}\!\left( \frac{u}{u_j^{max}} \right) du,     (4.6)

where s(·) is a sigmoid function that saturates as s(±∞) = ±1. In this case, the greedy feedback policy, equation 4.5, results in feedback control with a sigmoid output function,

    u_j = u_j^{max} \, s\!\left( \frac{1}{c_j} \frac{\partial f(x, u)}{\partial u_j}^{T} \frac{\partial V(x)}{\partial x}^{T} \right).     (4.7)

In the limit of c_j → 0, the policy becomes a "bang-bang" control law,

    u_j = u_j^{max} \, \mathrm{sign}\!\left[ \frac{\partial f(x, u)}{\partial u_j}^{T} \frac{\partial V(x)}{\partial x}^{T} \right].     (4.8)
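The closed-form policy of equation 4.7 reduces to a few array operations once the value gradient and the input gain matrix are available. A minimal sketch, using the article's sigmoid s(x) = (2/π) arctan((π/2) x); the function name and argument shapes are assumptions.

```python
import numpy as np

def greedy_action(grad_V, gain, u_max, c):
    """Closed-form greedy policy with bounded action amplitude (eq. 4.7).

    grad_V: dV/dx, shape (n,)    -- value gradient
    gain:   df/du, shape (n, m)  -- input gain matrix (assumed u-independent)
    u_max:  action bounds, shape (m,)
    c:      action cost coefficients, shape (m,)
    """
    s = lambda a: (2 / np.pi) * np.arctan((np.pi / 2) * a)
    # The transpose model maps the steepest-ascent direction of V
    # into the action space; the sigmoid bounds the amplitude.
    return u_max * s((gain.T @ grad_V) / c)

# In the limit c -> 0 this approaches the bang-bang law of eq. 4.8:
# u = u_max * sign(gain.T @ grad_V).
```

As the cost coefficient c_j shrinks, the sigmoid argument grows and the output saturates toward ±u_j^{max}, which is the bang-bang limit noted above.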
4.3 Advantage Updating. When the model of the dynamics is not available, as in Q-learning (Watkins, 1989), we can select a greedy action by directly learning the term to be maximized in the HJB equation,

    r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)).

This idea has been implemented in the advantage updating method (Baird, 1993; Harmon et al., 1996), in which both the value function V(x) and the advantage function A(x, u) are updated. The optimal advantage function A*(x, u) is represented in the current HJB formulation as

    A^*(x, u) = r(x, u) - \frac{1}{\tau} V^*(x) + \frac{\partial V^*(x)}{\partial x} f(x, u),     (4.9)

which takes the maximum value of zero for the optimal action u. The advantage function A(x, u) is updated by

    A(x, u) \leftarrow \max_u [A(x, u)] + r(x, u) - \frac{1}{\tau} V(x) + \dot{V}(x) = \max_u [A(x, u)] + \delta(t)     (4.10)
under the constraint max_u [A(x, u)] = 0. The main difference between advantage updating and the value-gradient-based policy described above is that while the value V and the advantage A are updated in the former, the value V and the model f are updated and their derivatives are used in the latter. When the input gain model \frac{\partial f(x, u)}{\partial u} is known or easy to learn, the use of the closed-form policy, equation 4.5, in the latter approach is advantageous because it simplifies the process of maximizing the right-hand side of the HJB equation.

5 Simulations

We tested the performance of the continuous RL algorithms in two nonlinear control tasks: a pendulum swing-up task (n = 2, m = 1) and a cart-pole swing-up task (n = 4, m = 1). In each of these tasks, we compared the performance of three control schemes:

1. Actor-critic: control by equation 4.2 and learning by equation 4.3.
2. Value-gradient-based policy, equation 4.7, with an exact gain matrix.
3. Value-gradient-based policy, equation 4.7, with concurrent learning of the input gain matrix.

The value functions were updated using the exponential eligibility trace (see equation 3.10) except in the experiments of Figure 6. Both the value and policy functions were implemented by normalized gaussian networks, as described in appendix B. A sigmoid output function s(x) = (2/π) arctan((π/2) x) (Hopfield, 1984) was used in equations 4.2 and 4.7. In order to promote exploration, we incorporated a noise term σn(t) in both policies, equations 4.2 and 4.7 (see equations B.2 and B.3 in appendix B). We used low-pass filtered noise τ_n \dot{n}(t) = -n(t) + N(t), where N(t) denotes normal gaussian noise. The size of the perturbation σ was tapered off as the performance improved (Gullapalli, 1990). We took the modulation scheme σ = σ_0 min[1, max[0, (V_1 - V(t)) / (V_1 - V_0)]], where V_0 and V_1 are the minimal and maximal levels of the expected reward. The physical systems were simulated by a fourth-order Runge-Kutta method, and the learning dynamics was simulated by a Euler method, both with a time step of 0.02 sec.

Figure 2: Control of a pendulum with limited torque. The dynamics were given by \dot{\theta} = \omega and m l^2 \dot{\omega} = -\mu\omega + m g l \sin\theta + u. The physical parameters were m = l = 1, g = 9.8, μ = 0.01, and u_{max} = 5.0. The learning parameters were τ = 1.0, κ = 0.1, c = 0.1, τ_n = 1.0, σ_0 = 0.5, V_0 = 0, V_1 = 1, η = 1.0, η^A = 5.0, and η^M = 10.0, in the following simulations unless otherwise specified.

5.1 Pendulum Swing-Up with Limited Torque. First, we tested the continuous-time RL algorithms in the task of a pendulum swinging upward with limited torque (see Figure 2) (Atkeson, 1994; Doya, 1996). The control of this one-degree-of-freedom system is nontrivial if the maximal output torque u_{max} is smaller than the maximal load torque mgl. The controller has to swing the pendulum several times to build up momentum and also has to decelerate the pendulum early enough to prevent the pendulum from falling over. The reward was given by the height of the tip of the pendulum, R(x) = cos θ. The policy and value functions were implemented by normalized gaussian networks with 15 × 15 basis functions to cover the two-dimensional state-space x = (θ, ω). In modeling the system dynamics, 15 × 15 × 2 bases were used for the state-action space (θ, ω, u). Each trial was started from an initial state x(0) = (θ(0), 0), where θ(0) was selected randomly in [-π, π]. A trial lasted for 20 seconds unless the pendulum was over-rotated (|θ| > 5π). Upon such a failure, the trial was terminated with a reward r(t) = -1 for 1 second. As a measure of the swing-up performance, we defined the time in which the pendulum stayed up (|θ| < π/4) as t_{up}. A trial was regarded as "successful" when t_{up} > 10 seconds. We used the number of trials made before achieving 10 successful trials as the measure of the learning speed.
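The pendulum dynamics of Figure 2 can be integrated with a fourth-order Runge-Kutta step, as used for the physical systems in the simulations. A minimal sketch with the stated physical parameters (θ = 0 is the upright position, consistent with R(x) = cos θ); the function names are assumptions.

```python
import numpy as np

# Pendulum dynamics of Figure 2: theta_dot = omega,
# m*l^2*omega_dot = -mu*omega + m*g*l*sin(theta) + u.
m, l, g, mu, u_max = 1.0, 1.0, 9.8, 0.01, 5.0

def f(x, u):
    theta, omega = x
    return np.array([omega,
                     (-mu * omega + m * g * l * np.sin(theta) + u) / (m * l ** 2)])

def rk4_step(x, u, dt=0.02):
    """One fourth-order Runge-Kutta step with the control u held constant."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

Since the maximal load torque mgl = 9.8 exceeds u_max = 5.0, no constant torque can lift the pendulum directly, which is why the controller must pump energy over several swings.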
Figure 3: Landscape of the value function V(θ, ω) for the pendulum swing-up task. The white line shows an example of a swing-up trajectory. The state-space was a cylinder with θ = ±π connected. The 15 × 15 centers of normalized gaussian basis functions are located on a uniform grid that covers the area [-π, π] × [-5/4 π, 5/4 π].
Figure 3 illustrates the landscape of the value function and a typical trajectory of swing-up using the value-gradient-based policy. The trajectory starts from the bottom of the basin, which corresponds to the pendulum hanging downward, and spirals up the hill along the ridge of the value function until it reaches the peak, which corresponds to the pendulum standing upward.

5.1.1 Actor-Critic, Value Gradient, and Physical Model. We first compared the performance of the three continuous RL algorithms with the discrete actor-critic algorithm (Barto et al., 1983). Figure 4 shows the time course of learning in five simulation runs, and Figure 5 shows the average number of trials needed until 10 successful swing-ups. The discrete actor-critic algorithm took about five times more trials than the continuous actor-critic. Note that the continuous algorithms were simulated with the same time step as the discrete algorithm. Consequently, the performance difference was due to a better spatial generalization with the normalized gaussian networks. Whereas the continuous algorithms performed well with 15 × 15 basis functions, the discrete algorithm did not achieve the task using 15 × 15, 20 × 20, or 25 × 25 grids. The result shown here was obtained by 30 × 30 grid discretization of the state.
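Appendix B is not included in this excerpt, so the following is only a sketch of a standard normalized gaussian basis network of the kind described, with a 15 × 15 grid of centers over (θ, ω) as in Figure 3. The choice of basis widths (one grid spacing) is an assumption for illustration.

```python
import numpy as np

def ngnet_features(x, centers, widths):
    """Normalized gaussian basis activations b_k(x):
    a_k(x) = exp(-||(x - c_k) / s_k||^2),  b_k(x) = a_k(x) / sum_l a_l(x).
    A value function is then the linear combination V(x; w) = w . b(x)."""
    a = np.exp(-np.sum(((x - centers) / widths) ** 2, axis=1))
    return a / np.sum(a)

# A 15 x 15 grid of centers over the (theta, omega) state-space:
th = np.linspace(-np.pi, np.pi, 15)
om = np.linspace(-1.25 * np.pi, 1.25 * np.pi, 15)
centers = np.array([[t, o] for t in th for o in om])
widths = np.array([th[1] - th[0], om[1] - om[0]])   # assumed: one grid spacing

phi = ngnet_features(np.array([0.0, 0.0]), centers, widths)
```

The normalization makes the activations form a partition of unity, which is the spatial generalization property credited above for the advantage of the continuous methods over grid discretization.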
Figure 4: Comparison of the time course of learning with different control schemes: (A) discrete actor-critic, (B) continuous actor-critic, (C) value-gradient-based policy with an exact model, (D) value-gradient policy with a learned model (note the different scales). t_{up}: time in which the pendulum stayed up. In the discrete actor-critic, the state-space was evenly discretized into 30 × 30 boxes and the action was binary (u = ±u_{max}). The learning parameters were γ = 0.98, λ = 0.8, η = 1.0, and η^A = 0.1.
Figure 5: Comparison of learning speeds with the discrete and continuous actor-critic and the value-gradient-based policies with exact and learned physical models. The ordinate is the number of trials made until 10 successful swing-ups.
Among the continuous algorithms, the learning was fastest with the value-gradient-based policy using an exact input gain. Concurrent learning of the input gain model resulted in slower learning. The actor-critic was the slowest. This was due to more effective exploitation of the value function in the gradient-based policy (see equation 4.7) compared to the stochastically improved policy (see equation 4.2) in the actor-critic.

5.1.2 Methods of Value Function Update. Next, we compared the value function update algorithms (see equations 3.7, 3.8, and 3.10) using the greedy policy with an exact gain model (see Figure 6). Although the three algorithms attained comparable performance with the optimal settings for Δt and κ, the method with the exponential eligibility trace performed well in the widest range of the time constant and the learning rate. We also tested the purely symmetric update method (see equation 3.5), but its performance was very unstable even when the learning rates for the value and its gradient were tuned carefully.

5.1.3 Action Cost, Graded Reward, and Exploration. We then tested how the performance depended on the action cost c, the shape of the reward function R(x), and the size of the exploration noise σ_0. Figure 7A compares the performance with different action costs c = 0, 0.01, 0.1, and 1.0. The learning was slower with a large cost for the torque (c = 1) because of the weak output in the early stage. The results with bang-bang control (c = 0) tended to be less consistent than those with sigmoid control with the small costs (c = 0.01, 0.1). Figure 7B summarizes the effects of the reward function and exploration noise. When the binary reward function

    R(x) = \begin{cases} 1 & |\theta| < \pi/4, \\ 0 & \text{otherwise}, \end{cases}

was used instead of cos θ, the task was more difficult to learn. However, a better performance was observed with the use of the negative binary reward function

    R(x) = \begin{cases} 0 & |\theta| < \pi/4, \\ -1 & \text{otherwise}. \end{cases}

The difference was more drastic with a fixed initial state x(0) = (π, 0) and no noise (σ = 0), for which no success was achieved with the positive binary reward. The better performance with the negative reward was due to the initialization of the value function as V(x) = 0. As the value function near θ = π is learned as V(x) ≃ -1, the value-gradient-based policy drives the state to unexplored areas, which are assigned higher values V(x) ≃ 0 by default.
Figure 6: Comparison of different value function update methods with different settings for the time constants. (A) Residual gradient (see equation 3.7). (B) Single-step eligibility trace (see equation 3.8). (C) Exponential eligibility trace (see equation 3.10). The learning rate η was roughly optimized for each method and each setting of the time step Δt of the Euler approximation or the time constant κ of the eligibility trace. The performance was very unstable with Δt = 1.0 in the discretization-based methods.
Figure 7: Effects of parameters of the policy. (A) Control-cost coefficient c. (B) Reward function R(x) and perturbation size σ_0.
5.2 Cart-Pole Swing-Up Task. Next, we tested the learning schemes in a more challenging task of cart-pole swing-up (see Figure 8), which is a strongly nonlinear extension of the common cart-pole balancing task (Barto et al., 1983). The physical parameters of the cart-pole were the same as in Barto et al. (1983), but the pole had to be swung up from an arbitrary angle and balanced. Marked differences from the previous task were that the dimension of the state-space was higher and the input gain was state dependent. The state vector was x = (x, v, θ, ω), where x and v are the position and the velocity of the cart. The value and policy functions were implemented
Figure 8: Examples of cart-pole swing-up trajectories. The arrows indicate the initial position of the pole. (A) A typical swing-up from the bottom position. (B) When a small perturbation v > 0 is given, the cart moves to the right and keeps the pole upright. (C) When a larger perturbation is given, the cart initially tries to keep the pole upright, but then brakes to avoid collision with the end of the track and swings the pole up on the left side. The learning parameters were τ = 1.0, κ = 0.5, c = 0.01, τ_n = 0.5, σ_0 = 0.5, V_0 = 0, V_1 = 1, η = 5.0, η^A = 10.0, and η^M = 10.0.
by normalized gaussian networks with 7 × 7 × 15 × 15 bases. A 2 × 2 × 4 × 2 × 2 basis network was used for modeling the system dynamics. The reward was given by R(x) = (cos θ − 1)/2. When the cart bumped into the end of the track or when the pole overrotated (|θ| > 5π), a terminal reward r(t) = −1 was given for 0.5 second. Otherwise a trial lasted for 20 seconds.

Figure 8 illustrates the control performance after 1000 learning trials with the greedy policy using the value gradient and a learned input gain model. Figure 9A shows the value function in the 4D state-space x = (x, v, θ, ω). Each of the 3 × 3 squares represents a subspace (θ, ω) with different values of (x, v). We can see the ridge of the value function, which is similar to the one seen in the pendulum swing-up task (see Figure 3). Also note the lower values when x and v are both positive or both negative, which signal the danger of bumping into the end of the track. Figure 9B shows the most critical component of the input gain vector, ∂ω̇/∂u. The gain represents how the force applied to the cart is transformed into the angular acceleration of the pole. The gain model could successfully capture the change of the sign between the upward (|θ| < π/2) orientation and the downward (|θ| > π/2) orientation of the pole.

Figure 10 is a comparison of the number of trials necessary for 100 successful swing-ups. The value-gradient-based greedy policies performed about three times faster than the actor-critic. The performances with the exact and the learned input gains were comparable in this case. This was because the learning of the physical model was relatively easy compared to the learning of the value function.

6 Discussion

The results of the simulations can be summarized as follows:

1. The swing-up task was accomplished by the continuous actor-critic in a number of trials several times fewer than by the conventional discrete actor-critic (see Figures 4 and 5).

2. Among the continuous methods, the value-gradient-based policy with a known or learned dynamic model performed significantly better than the actor-critic (see Figures 4, 5, and 10).

3. The value function update methods using exponential eligibility traces were more efficient and stable than the methods based on Euler approximation (see Figure 6).

4. Reward-related parameters, such as the landscape and the baseline level of the reward function, greatly affect the speed of learning (see Figure 7).

5. The value-gradient-based method worked well even when the input gain was state dependent (see Figures 9 and 10).
[Figure 9 appears here. Panels A and B each show a 3 × 3 grid over (x, v) ∈ [−2.4, 2.4]², with each square a (θ, ω) subspace (θ ∈ [−π, π], ω ∈ [−4π, 4π]); the color scale spans −0.715 to +0.021 for V in panel A and −1.551 to +1.423 for ∂f/∂u in panel B.]
Figure 9: (A) Landscape of the value function for the cart-pole swing-up task. (B) Learned gain model ∂ω̇/∂u.
Figure 10: Comparison of the number of trials until 100 successful swing-ups with the actor-critic and the value-gradient-based policy with exact and learned physical models.
Among the three major RL methods—the actor-critic, Q-learning, and model-based look-ahead—only Q-learning has been extended to the continuous-time case, as advantage updating (Baird, 1993). This article presents continuous-time counterparts for all three of the methods based on the HJB equation 2.6 and therefore provides a more complete repertoire of continuous RL methods. A major contribution of this article is the derivation of the closed-form policy (see equation 4.5) using the value gradient and the dynamic model. One critical issue in advantage updating is the need for finding the maximum of the advantage function on every control cycle, which can be computationally expensive except in special cases like linear quadratic problems (Harmon et al., 1996). As illustrated by simulation, the value-gradient-based policy (see equation 4.5) can be applied to a broad class of physical control problems using a priori or learned models of the system dynamics.

The usefulness of value gradients in RL was considered by Werbos (1990) for discrete-time cases. The use of value gradients was also proposed by Dayan and Singh (1996), with the motivation of eliminating the need for updating the value function in advantage updating. Their method, however, in which the value gradients ∂V/∂x are updated without updating the value function V(x) itself, is applicable only to nondiscounted problems.

When the system or the policy is stochastic, HJB equation 2.6 will include second-order partial derivatives of the value function,

$$\frac{1}{\tau} V^*(x(t)) = \max_{u(t)\in U}\left[ r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)) + \operatorname{tr}\!\left(\frac{\partial^2 V^*(x)}{\partial x^2}\, C\right) \right], \qquad (6.1)$$
where C is the covariance matrix of the system noise (see, e.g., Fleming & Soner, 1993). In our simulations, the methods based on the deterministic HJB equation 2.6 worked well, although we incorporated noise terms in the policies to promote exploration. One reason is that the noise was small enough that the contribution of the second-order term was minor. Another reason could be that the second-order term has a smoothing effect on the value function, and this was implicitly achieved by the use of the smooth function approximator. This point needs further investigation.

The convergence properties of HJB-based RL algorithms were recently shown for deterministic (Munos, 1997) and stochastic (Munos & Bourgine, 1998) cases using a grid-based discretization of space and time. However, the convergence properties of continuous RL algorithms combined with function approximators remain to be studied. When a continuous RL algorithm is numerically implemented with a finite time step, as shown in sections 3.2 and 3.3, it becomes equivalent to a discrete-time TD algorithm, for which some convergence properties have been shown with the use of function approximators (Gordon, 1995; Tsitsiklis & Van Roy, 1997). For example, the convergence of TD algorithms has been shown with the use of a linear function approximator and on-line sampling (Tsitsiklis & Van Roy, 1997), which was the case in our simulations. However, the above result considers value function approximation for a given policy and does not guarantee the convergence of the entire RL process to a satisfactory solution. For example, in our swing-up tasks, the learning sometimes got stuck in a locally optimal solution of endless rotation of the pendulum when a penalty for over-rotation was not given. The use of fixed smooth basis functions has a limitation in that steep cliffs in the value or policy functions cannot be achieved.
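The equivalence between the finite-time-step implementation and discrete-time TD can be checked directly. Using the continuous TD error δ(t) = r(t) − (1/τ)V(t) + V̇(t) from section 3, a small numerical sketch (variable names are ours) verifies that the Euler-discretized error equals the discrete-time TD error with discount factor γ = 1 − Δt/τ, up to a factor of 1/Δt:

```python
def td_errors(r, v_now, v_prev, dt, tau):
    """Continuous TD error (Euler-discretized) vs. discrete TD error.

    Continuous (section 3): delta = r - V/tau + Vdot, with Vdot
    approximated by the backward difference (V(t) - V(t - dt)) / dt.
    Discrete analog: delta_d = r*dt + gamma*V(t) - V(t - dt),
    with discount gamma = 1 - dt/tau.
    """
    delta_cont = r - v_now / tau + (v_now - v_prev) / dt
    gamma = 1.0 - dt / tau
    delta_disc = r * dt + gamma * v_now - v_prev
    return delta_cont, delta_disc

dc, dd = td_errors(r=0.3, v_now=1.2, v_prev=1.0, dt=0.02, tau=1.0)
# The two errors agree up to the factor 1/dt: dc == dd / dt.
assert abs(dc - dd / 0.02) < 1e-9
```

The identity holds for any values of r, V, Δt, and τ, which is why the discrete-time convergence results cited above carry over to the finite-time-step implementation.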
Despite some negative didactic examples (Tsitsiklis & Van Roy, 1997), methods that dynamically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto & Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully—for example, the properties of the function approximators and the methods for exploration—remains the subject of future empirical and theoretical studies.

Appendix A: HJB Equation for Discounted Reward

According to the optimality principle, we divide the integral in equation 2.5 into two parts, [t, t + Δt] and [t + Δt, ∞), and then solve a short-term optimization problem,

$$V^*(x(t)) = \max_{u[t,\,t+\Delta t]}\left[ \int_t^{t+\Delta t} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds + e^{-\frac{\Delta t}{\tau}}\, V^*(x(t+\Delta t)) \right]. \qquad (A.1)$$

For a small Δt, the first term is approximated as r(x(t), u(t))Δt + o(Δt), and the second term is Taylor expanded as

$$V^*(x(t+\Delta t)) = V^*(x(t)) + \frac{\partial V^*}{\partial x} f(x(t), u(t))\,\Delta t + o(\Delta t).$$

By substituting these into equation A.1 and collecting V*(x(t)) on the left-hand side, we have an optimality condition for [t, t + Δt],

$$\left(1 - e^{-\frac{\Delta t}{\tau}}\right) V^*(x(t)) = \max_{u[t,\,t+\Delta t]}\left[ r(x(t), u(t))\,\Delta t + e^{-\frac{\Delta t}{\tau}}\, \frac{\partial V^*}{\partial x} f(x(t), u(t))\,\Delta t + o(\Delta t) \right]. \qquad (A.2)$$

By dividing both sides by Δt and taking Δt to zero, we have the condition for the optimal value function,

$$\frac{1}{\tau} V^*(x(t)) = \max_{u(t)\in U}\left[ r(x(t), u(t)) + \frac{\partial V^*}{\partial x} f(x(t), u(t)) \right]. \qquad (A.3)$$
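The limit taken in the last step uses the first-order expansion of the discount factor; explicitly,

```latex
e^{-\Delta t/\tau} = 1 - \frac{\Delta t}{\tau} + o(\Delta t)
\quad\Longrightarrow\quad
\frac{1 - e^{-\Delta t/\tau}}{\Delta t}\, V^*(x(t))
\;\xrightarrow{\;\Delta t \to 0\;}\;
\frac{1}{\tau}\, V^*(x(t)),
```

while the factor e^{−Δt/τ} multiplying the right-hand-side terms tends to 1, yielding equation A.3.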
Appendix B: Normalized Gaussian Network

A value function is represented by

$$V(x; w) = \sum_{k=1}^{K} w_k b_k(x), \qquad (B.1)$$

where

$$b_k(x) = \frac{a_k(x)}{\sum_{l=1}^{K} a_l(x)}, \qquad a_k(x) = e^{-\|s_k^T (x - c_k)\|^2}.$$

The vectors c_k and s_k define the center and the size of the kth basis function. Note that the basis functions located on the ends of the grids are extended like sigmoid functions by the effect of normalization. In the current simulations, the centers are fixed in a grid, which is analogous to the “boxes” approach (Barto et al., 1983) often used in discrete RL. Grid allocation of the basis functions enables efficient calculation of their
activation as the outer product of the activation vectors for individual input variables.

In the actor-critic method, the policy is implemented as

$$u(t) = u^{\max}\, s\!\left( \sum_k w_k^A b_k(x(t)) + \sigma n(t) \right), \qquad (B.2)$$

where s is a component-wise sigmoid function and n(t) is the noise. In the value-gradient-based methods, the policy is given by

$$u(t) = u^{\max}\, s\!\left( \frac{1}{c} \frac{\partial f(x, u)}{\partial u}^{T} \sum_k w_k \frac{\partial b_k(x)}{\partial x}^{T} + \sigma n(t) \right). \qquad (B.3)$$

To implement the input gain model, a network is trained to predict the time derivative of the state from x and u,

$$\dot{x}(t) \simeq \hat{f}(x, u) = \sum_k w_k^M b_k(x(t), u(t)). \qquad (B.4)$$

The weights are updated by

$$\dot{w}_k^M(t) = \eta_M \left( \dot{x}(t) - \hat{f}(x(t), u(t)) \right) b_k(x(t), u(t)), \qquad (B.5)$$

and the input gain of the system dynamics is given by

$$\frac{\partial f(x, u)}{\partial u} \simeq \sum_k w_k^M \left.\frac{\partial b_k(x, u)}{\partial u}\right|_{u=0}. \qquad (B.6)$$
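Equations B.1 and B.4 through B.6 can be sketched compactly for a one-dimensional state and scalar control. The grid, widths, learning rate, and the toy plant ẋ = −x + u below are our own choices, and the analytic basis derivative in B.6 is replaced by a finite difference for brevity:

```python
import math
import random

# Grid of basis centers over (x, u); s_k is an isotropic inverse width.
CENTERS = [(a / 2.0, b) for a in range(-2, 3) for b in (-1.0, 0.0, 1.0)]
INV_WIDTH = 2.0

def bases(x, u):
    """Normalized gaussian activations b_k(x, u) (equation B.1)."""
    a = [math.exp(-((INV_WIDTH * (x - cx)) ** 2 + (INV_WIDTH * (u - cu)) ** 2))
         for cx, cu in CENTERS]
    s = sum(a)
    return [ak / s for ak in a]

def predict(w, b):
    return sum(wk * bk for wk, bk in zip(w, b))

# Train the dynamics model xdot ~ f_hat(x, u) = sum_k w_k^M b_k(x, u)  (B.4)
# on the toy plant xdot = -x + u, using the gradient rule (B.5).
random.seed(0)
wM = [0.0] * len(CENTERS)
ETA_M = 0.5                       # model learning rate (our choice)
for _ in range(4000):
    x, u = random.uniform(-1, 1), random.uniform(-1, 1)
    b = bases(x, u)
    err = (-x + u) - predict(wM, b)
    wM = [wk + ETA_M * err * bk for wk, bk in zip(wM, b)]   # (B.5)

def gain(x, eps=1e-3):
    """Input gain (B.6), here estimated by finite differences at u = 0."""
    return (predict(wM, bases(x, eps)) - predict(wM, bases(x, -eps))) / (2 * eps)
```

For this plant the true gain is ∂ẋ/∂u = 1, so the learned gain at u = 0 should come out positive; the normalization also guarantees that the activations b_k sum to 1 everywhere, which is what extends the edge bases like sigmoids.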
Acknowledgments

I thank Mitsuo Kawato, Stefan Schaal, Chris Atkeson, and Jun Morimoto for their helpful discussions.

References

Asada, M., Noda, S., & Hosoda, K. (1996). Action-based sensor space categorization for robot learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 1502–1509).
Atkeson, C. G. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 663–670). San Mateo, CA: Morgan Kaufmann.
Baird, L. C. (1993). Advantage updating (Tech. Rep. No. WL-TR-93-1146). Wright Laboratory, Wright-Patterson Air Force Base, OH.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine learning: Proceedings of the Twelfth International Conference. San Mateo, CA: Morgan Kaufmann.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Cambridge, MA: MIT Press.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 295–302). San Mateo, CA: Morgan Kaufmann.
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 393–400). Cambridge, MA: MIT Press.
Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1017–1023). Cambridge, MA: MIT Press.
Dayan, P., & Singh, S. P. (1996). Improving policies without measuring merits. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1059–1065). Cambridge, MA: MIT Press.
Doya, K. (1996). Temporal difference learning in continuous time and space. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1073–1079). Cambridge, MA: MIT Press.
Doya, K. (1997). Efficient nonlinear control with actor-tutor architecture. In M. C. Mozer, M. I. Jordan, & T. P. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 1012–1018). Cambridge, MA: MIT Press.
Fleming, W. H., & Soner, H. M. (1993). Controlled Markov processes and viscosity solutions. New York: Springer-Verlag.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In A. Prieditis & S. Russell (Eds.), Machine learning: Proceedings of the Twelfth International Conference. San Mateo, CA: Morgan Kaufmann.
Gordon, G. J. (1996). Stable fitted reinforcement learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1052–1058). Cambridge, MA: MIT Press.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Harmon, M. E., Baird, L. C., III, & Klopf, A. H. (1996). Reinforcement learning applied to a differential game. Adaptive Behavior, 4, 3–28.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Mataric, M. J. (1994). Reward functions for accelerated learning. In W. W. Cohen & H. Hirsh (Eds.), Proceedings of the 11th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Moore, A. W. (1994). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 711–718). San Mateo, CA: Morgan Kaufmann.
Morimoto, J., & Doya, K. (1998). Reinforcement learning of dynamic motor sequence: Learning to stand up. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 3, 1721–1726.
Munos, R. (1997). A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. In Proceedings of the International Joint Conference on Artificial Intelligence (pp. 826–831).
Munos, R., & Bourgine, P. (1998). Reinforcement learning for continuous stochastic control problems. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1029–1035). Cambridge, MA: MIT Press.
Pareigis, S. (1998). Adaptive choice of grid and time in reinforcement learning. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1036–1042). Cambridge, MA: MIT Press.
Peterson, J. K. (1993). On-line estimation of the optimal value function: HJB-estimators. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 319–326). San Mateo, CA: Morgan Kaufmann.
Schaal, S. (1997). Learning from demonstration. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 1040–1046). Cambridge, MA: MIT Press.
Singh, S., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 974–980). Cambridge, MA: MIT Press.
Singh, S. P., Jaakkola, T., & Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 361–368). Cambridge, MA: MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the 12th International Conference on Machine Learning (pp. 531–539). San Mateo, CA: Morgan Kaufmann.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1038–1044). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.
Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In W. T. Miller, R. S. Sutton, & P. J. Werbos (Eds.), Neural networks for control (pp. 67–95). Cambridge, MA: MIT Press.
Zhang, W., & Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Received January 18, 1998; accepted January 13, 1999.
Neural Computation, Volume 12, Number 2

CONTENTS

Article
  Minimizing Binding Errors Using Learned Conjunctive Features
    Bartlett W. Mel and József Fiser ... 247

Notes
  Relationship Between Phase and Energy Methods for Disparity Computation
    Ning Qian and Samuel Mikaelian ... 279
  N-tuple Network, CART, and Bagging
    Aleksander Kołcz ... 293
  Improving the Practice of Classifier Performance Assessment
    N. M. Adams and D. J. Hand ... 305

Letters
  Do Simple Cells in Primary Visual Cortex Form a Tight Frame?
    Emilio Salinas and L. F. Abbott ... 313
  Learning Overcomplete Representations
    Michael S. Lewicki and Terrence J. Sejnowski ... 337
  Noise in Integrate-and-Fire Neurons: From Stochastic Input to Escape Rates
    Hans E. Plesser and Wulfram Gerstner ... 367
  Modeling Synaptic Plasticity in Conjunction with the Timing of Pre- and Postsynaptic Action Potentials
    Werner M. Kistler and J. Leo van Hemmen ... 385
  On-line EM Algorithm for the Normalized Gaussian Network
    Masa-aki Sato and Shin Ishii ... 407
  A General Probability Estimation Approach for Neural Computation
    Maxim Khaikine and Klaus Holthausen ... 433
  On the Synthesis of Brain-State-in-a-Box Neural Models with Application to Associative Memory
    Fation Sevrani and Kennichi Abe ... 451
ARTICLE
Communicated by Shimon Edelman
Minimizing Binding Errors Using Learned Conjunctive Features

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

József Fiser
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627-0268, U.S.A.

We have studied some of the design trade-offs governing visual representations based on spatially invariant conjunctive feature detectors, with an emphasis on the susceptibility of such systems to false-positive recognition errors—Malsburg's classical binding problem. We begin by deriving an analytical model that makes explicit how recognition performance is affected by the number of objects that must be distinguished, the number of features included in the representation, the complexity of individual objects, and the clutter load, that is, the amount of visual material in the field of view in which multiple objects must be simultaneously recognized, independent of pose, and without explicit segmentation. Using the domain of text to model object recognition in cluttered scenes, we show that with corrections for the nonuniform probability and nonindependence of text features, the analytical model achieves good fits to measured recognition rates in simulations involving a wide range of clutter loads, word sizes, and feature counts. We then introduce a greedy algorithm for feature learning, derived from the analytical model, which grows a representation by choosing those conjunctive features that are most likely to distinguish objects from the cluttered backgrounds in which they are embedded. We show that the representations produced by this algorithm are compact, decorrelated, and heavily weighted toward features of low conjunctive order. Our results provide a more quantitative basis for understanding when spatially invariant conjunctive features can support unambiguous perception in multiobject scenes, and lead to several insights regarding the properties of visual representations optimized for specific recognition tasks.

1 Introduction
Neural Computation 12, 247–278 (2000)
© 2000 Massachusetts Institute of Technology

The problem of classifying objects in visual scenes has remained a scientific and a technical holy grail for several decades. In spite of intensive research in
the field of computer vision and a large body of empirical data from the fields of visual psychophysics and neurophysiology, the recognition competence of a two-year-old human child remains unexplained as a theoretical matter and well beyond the technical state of the art. This fact is surprising given that (1) the remarkable speed of recognition in the primate brain allows for only very brief processing times at each stage in a pipeline containing only a few stages (Potter, 1976; Oram & Perrett, 1992; Heller, Hertz, Kjær, & Richmond, 1995; Thorpe, Fize, & Marlot, 1996; Fize, Boulanouar, Ranjeva, Fabre-Thorpe, & Thorpe, 1998), (2) the computations that can be performed within each neural processing stage are strongly constrained by the structure of the underlying neural tissue, about which a great deal is known (Hubel & Wiesel, 1968; Szentagothai, 1977; Jones, 1981; Gilbert, 1983; Van Essen, 1985; Douglas & Martin, 1998), (3) the response properties of neurons in each of the relevant brain areas have been well studied, and evolve from stage to stage in systematic ways (Oram & Perrett, 1994; Logothetis & Sheinberg, 1996; Tanaka, 1996), and (4) computer systems may already be powerful enough to emulate the functions of these neural processing stages, were it only known what exactly to do.

A number of neuromorphic approaches to visual recognition have been proposed over the years (Pitts & McCulloch, 1947; Fukushima, Miyake, & Ito, 1983; Sandon & Uhr, 1988; Zemel, Mozer, & Hinton, 1990; Le Cun et al., 1990; Mozer, 1991; Swain & Ballard, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Schiele & Crowley, 1996; Mel, 1997; Weng, Ahuja, & Huang, 1997; Lang & Seitz, 1997; Wallis & Rolls, 1997; Edelman & Duvdevani-Bar, 1997). Many of these neurally inspired systems involve constructing banks of feature detectors, often called receptive fields (RFs), each of which is sensitive to some localized spatial configuration of image cues but invariant to one or more spatial transformations of its preferred stimulus—typically including invariance to translation, rotation, scale, or spatial distortion, as specified by the task.¹ Since the goal to extract useful invariant features is common to most conventional approaches to computer vision as well, the "neuralness" of a recognition system lies primarily in its emphasis on feedforward, hierarchical construction of the invariant features, where the computations are usually restricted to simple spatial conjunction and disjunction operations (e.g., Fukushima et al., 1983), and/or the inclusion of a relatively large number of invariant features in the visual representation (see Mozer, 1991; Califano & Mohan, 1994; Mel, 1997)—anywhere from hundreds to millions. Neural approaches typically also involve some form of learning, whether supervised by category labels at the output layer (Le Cun et al., 1990), supervised directly within

¹ A number of the above-mentioned systems have emphasized other mechanisms instead of or in addition to straightforward image filtering operations (Zemel et al., 1990; Mozer, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Edelman & Duvdevani-Bar, 1997).
intermediate network layers (Fukushima et al., 1983), or involving purely unsupervised learning principles that home in on features that occur frequently in target objects (Fukushima et al., 1983; Zemel et al., 1990; Wallis & Rolls, 1997; Weng et al., 1997). While architectures of this general type have performed well in a variety of difficult, though limited, recognition problems, it has yet to be proved that a few stages of simple feedforward filtering operations can explain the remarkable recognition and classification capacities of the human visual system, which is commonly confronted with scenes acquired from varying viewpoints, containing multiple objects drawn from thousands of categories, and appearing in an infinite panoply of spatial configurations. In fact, there are reasons for pessimism.

1.1 The Binding Problem. One of the most persistent worries concerning this class of architecture is due to Malsburg (1981), who noted the potential for ambiguity in visual representations constructed from spatially invariant filters. A spatial binding problem arises, for example, when each detector in the visual representation reports only the presence of some elemental object attribute but not its spatial location (or, more generally, its pose). Under these unfavorable conditions, an object is hallucinated (i.e., detected when not actually present) whenever all of its elemental features are present in the visual field, even when embedded piecemeal in improper configurations within a scattered coalition of distractor objects (see Figure 1A). This type of failure mode for a visual representation emanates from the dual pressures to cope with uncontrolled variation in object pose, which forces the use of detectors with excessive spatial invariances, and the necessity to process multiple objects simultaneously, which overstimulates the visual representation and increases the probability of ambiguous perception.
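The hallucination failure mode is easy to reproduce in miniature. In the text domain introduced in section 2, suppose each word is represented only by its set of position-invariant letter bigrams; a small sketch (the example words are our own) then shows a word being detected in a scene that does not contain it:

```python
def bigrams(word):
    """Position-invariant conjunctive features of order 2."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def scene_features(scene):
    """Union of bigrams over every word in the scene: spatial
    location is integrated out, as in an invariant detector bank."""
    feats = set()
    for w in scene.split():
        feats |= bigrams(w)
    return feats

def detected(word, scene):
    """The word's representation is fully active in the scene."""
    return bigrams(word) <= scene_features(scene)

# 'rat' is hallucinated: 'ra' comes from 'brain' and 'at' from 'late',
# although 'rat' itself appears nowhere in the scene.
assert detected("rat", "brain late")
assert not detected("rat", "brain line")
```

The more cluttered the scene, the more likely its pooled feature set covers some absent word's features, which is exactly the clutter-load pressure analyzed below.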
One approach to the binding problem is to build a separate, full-order conjunctive detector for every possible view (e.g., spatial pose, state of occlusion, distortion, degradation) of every possible object, and then for each object provide a huge disjunction that pools over all of the views of that object. Malsburg (1981) pointed out that while the binding problem could thus be solved, the remedy is a false one, since it leads to a combinatorial explosion in the number of needed visual processing operations (see Figure 1B). Other courses of action include (1) strategies for image preprocessing designed to segment scenes into regions containing individual objects, thus reducing the clutter load confronting the visual representation, and (2) explicit normalization procedures to reduce pose uncertainty (e.g., centering, scaling, warping), thereby reducing the invariance load that must be sustained by the individual receptive fields. Both strategies reduce the probability that any given object's feature set will be activated by a spurious collection of features drawn from other objects.

Figure 1: (A) A binding problem occurs when an object's representation can be falsely activated by multiple (or inappropriate) other objects. (B) A combinatorial explosion arises when all possible views of all possible objects are represented by explicit conjunctions. (C) Compromise representation containing an intermediate number of receptive fields that bind image features to intermediate order.

1.2 Wickelsystems. Preprocessing strategies aside, various workers have explored the middle ground between "binding breakdown" and "combinatorial catastrophe," where a visual representation contains an intermediate number of detectors, each of which binds together an intermediate-sized subset of feature elements in their proper relations before integrating out the irrelevant spatial transformations (see Mozer, 1991, for a thoughtful advocacy of this position). Under this compromise, each detector becomes a spatially invariant template for a specific minipattern, which is more complex than an elemental feature, such as an oriented edge, but simpler than a full-blown view of an object (see Figure 1C). A set of spatially invariant features of intermediate complexity can be likened to a box of jumbled jigsaw puzzle pieces: although the final shape of the composite object—the puzzle—is not explicitly contained in this peculiar "representation," it is nonetheless unambiguously specified: the collection of parts (features) can interlock in only one way.

An important early version of this idea is due to Wickelgren (1969), who proposed a scheme for representing the phonological structure of spoken language involving units sensitive to contiguous triples of phonological features but not to the absolute position of the tuples in the speech stream. Though not phrased in this language, "Wickelfeatures" were essentially translation-invariant detectors of phonological minipatterns of intermediate complexity, analyzing the one-dimensional speech signal. Representational schemes inspired by Wickelgren's proposal have since cropped up in other influential models of language processing (McClelland & Rumelhart, 1981; Mozer, 1991). The use of invariant conjunctive feature schemes to address representational problems in fields other than vision highlights the fact that the binding problem is not an inherently visuospatial one. Rather, it can arise whenever the set of features representing an object (visual, auditory, olfactory, etc.)
can be fully, and thus inappropriately, activated by input "scenes" that do not contain the object. In general, a binding problem has the potential to exist whenever a representation lacks features that conjoin an object's elemental attributes (e.g., parts, spatial relations, surface properties) to sufficiently high conjunctive order. On the solution side, binding ambiguity can be mitigated as in the scenario of Figure 1 by building conjunctive features that are increasingly object specific, until the probability of false-positive activation for any object is driven below an acceptable threshold. Given that this conjunctive order-boosting prescription to reduce ambiguity is algorithmically trivial, it shifts emphasis away from the question as to how to eliminate binding ambiguity, toward the question as to whether, for a given recognition problem, ambiguity can be eliminated at an acceptable cost.

1.3 A Missing Theory. In spite of the importance of spatially invariant conjunctive visual representations as models of neural function, and the demonstrated practical successes of computer vision systems constructed along these lines, many of the design trade-offs that govern the performance of visual "Wickelsystems" in response to cluttered input images have remained poorly understood. For example, it is unknown in general how recognition performance depends on (1) the number of object categories that must be distinguished, (2) the similarity structure of the object category distribution (i.e., whether object categories are intrinsically very similar or very different from each other), (3) the featural complexity of individual objects, (4) the number and conjunctive order of features included in the representation, (5) the clutter load (i.e., the amount of visual material in the field of view from which multiple objects must be recognized without explicit segmentation), and (6) the invariance load (i.e., the set of spatial transformations that do not affect the identities of objects and that must be ignored by each individual detector).

In a previous analysis that touched on some of these issues, Califano and Mohan (1994) showed that large, parameterized families of complex invariant features (e.g., containing 10^6 elements or more) could indeed support recognition of multiple objects in complex scenes without prior segmentation, or discrimination among a large number of highly similar objects, in a spatially invariant fashion. However, these authors were primarily concerned with the mathematical advantages of large versus small feature sets in a generic sense, and thus did not test their analytical predictions using a large object database and varying object complexity, category structure, clutter load, and other factors. Beyond our lack of understanding regarding the trade-offs influencing system performance, the issue as to which invariant conjunctive features, and of what complexity, should be incorporated into a visual representation through learning has also not been well worked out as a conceptual matter.
Previous approaches to feature learning in neuromorphic visual systems have invoked (1) generic gradient-based supervised learning principles (e.g., Le Cun et al., 1990), which offer no direct insight into the qualities of a good representation, or (2) straightforward unsupervised learning principles (e.g., Wallis & Rolls, 1997), which tend to discover frequently occurring features or principal components rather than those needed to distinguish objects from cluttered backgrounds on a task-specific basis. Network approaches to learning have also typically operated within a fixed network architecture, which largely predefines and homogenizes the complexity of top-level features to be used for recognition, though optimal representations may actually require features drawn from a range of complexities and could vary in their composition on a per-object basis. All this suggests that further work on the principles of feature learning is needed. In this article, we describe our work to test a simple analytical model that captures several trade-offs governing the performance of visual recognition systems based on spatially invariant conjunctive features. In addition, we introduce a supervised greedy algorithm for feature learning that grows a visual representation in such a way as to minimize false-positive recognition errors. Finally, we consider some of the surprising properties of “good” representations and the implications of our results for more realistic visual recognition problems.
Minimizing Binding Errors
2 Methods

2.1 Text World as a Model for Object Recognition. The studies described here were carried out in text world, a domain with many of the complexities of vision in general (see Figure 2): target objects are numerous (more than 40,000 words in the database), are highly nonuniform in their prior probabilities (relative probabilities range from 1 to 400,000), are constructed from a set of relatively few underlying parts (26 letters), and individually can contain from 1 to 20 parts. Furthermore, the parts from which words are built are highly nonuniform in their relative frequencies and contain strong statistical dependencies. Finally, word objects occur in highly cluttered visual environments—embedded in input arrays in which many other words are simultaneously present—but must nonetheless be recognized regardless of position in the visual field. Although text world lacks several additional complexities characteristic of real object vision (see section 6), it nonetheless provides a rich but tractable domain in which to quantify binding trade-offs and to facilitate the testing of analytical predictions and the development of feature-learning algorithms. An important caveat, however, is that the use of text as a substrate for developing and testing our analytical model does not imply that our conclusions bear directly on any aspect of human reading behavior. Two databases were used in the studies. The word database contained 44,414 entries, representing all lowercase punctuation-free words and their relative frequencies found in 5 million words of the Wall Street Journal (WSJ) (available online from the Association for Computational Linguistics at http://morph.ldc.upenn.edu/). The English text database consisted of approximately 1 million words drawn from a variety of sources (Kučera & Francis, 1967).

2.2 Recognition Using n-Grams.
A natural class of visual features in text world is the position-invariant n-gram, defined here as a binary detector that responds when a specific spatial configuration of n letters is found anywhere (one or more times) in the input array.2 The value of n is termed the conjunctive or binding order of the n-gram, and the diameter of the n-gram is the maximum span between characters specified within it. For example, th** is a 3-gram of diameter 5, since it specifies the relative locations of three characters (including the space character but not the wild card characters *) and spans a field five characters in width. The concept of an n-gram is used also in computational linguistics, though usually referring to n-tuples of consecutive words (Charniak, 1993).
2 An n-gram could also be multivalued, that is, respond in a graded fashion depending on the number of occurrences of the target feature in the input; the binary version was used here to facilitate analysis.
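As a concrete illustration of the definition above, a binary position-invariant n-gram can be implemented as a sliding-window match. This is a sketch of ours, not code from the paper; the function name and the example patterns are hypothetical:

```python
def ngram_detector(pattern, text):
    """Binary position-invariant n-gram detector.

    `pattern` is a string in which '*' is a wild card matching any
    character; all other positions (letters or the space character)
    must match exactly.  Returns 1 if the pattern occurs anywhere in
    `text` (one or more times), else 0.
    """
    n = len(pattern)  # n == diameter; the non-'*' count is the binding order
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        if all(p == '*' or p == ch for p, ch in zip(pattern, window)):
            return 1
    return 0

# A 3rd-order feature of diameter 4: t, h, and n with one free position.
print(ngram_detector("th*n", "python"))   # -> 1 ('o' fills the wild card)
print(ngram_detector("th*n", "the den"))  # -> 0
```

The binary (rather than graded) response matches the version used in the paper's analysis.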
Figure 2: Text world was used as a surrogate for visual recognition in multiobject scenes. (A) Word frequencies in the 5-million-word Wall Street Journal (WSJ) corpus are highly nonuniform, following an approximately exponential decay according to Zipf’s law. (B) Words range from 1 to 20 characters in length; 7-letter words are most numerous. (C, D) Relative frequencies of individual letters or adjacent letter pairs are highly nonuniform. Ordinate values represent the number of times the feature was found in the WSJ corpus. The dashed rectangle in D represents a simplifying assumption regarding the 2-gram frequency distribution used for quantitative predictions below. (E) The parts of words show strong statistical dependencies. Columns show the information gain about the identity of the adjacent letter (or space character) to the left or right, given the presence of the letter indicated on the x-axis. The dashed horizontal line indicates perfect knowledge—log2(27) = 4.75 bits.
In relation to conventional visual domains, n-grams may be viewed as complex visual feature detectors, each of which responds to a specific conjunction of subparts in a specific spatial configuration, but with built-in invariance to a set of spatial transformations predefined by the recognition task. In relation to a shape-based visual domain, a 2-gram is analogous to a corner detector that signals the co-occurrence of a vertical and a horizontal edge, with appropriate spatial offsets, anywhere within an extended region of the image. Let R be a representation containing a set of position-invariant n-grams analyzing the input array, and W be a set of word detectors analyzing R, with one detector for every entry in the word database. Each word detector receives input from some or all of the n-gram units in R that are contained in the respective word. However, not all of a word’s n-grams are necessarily contained in R, and not all of the word’s n-grams that are contained in R necessarily provide input to the word’s detector. For example, the representation might contain four spatially invariant features R = {a, c, o, s}, while the detector for the word cows receives input from features c and s, but not from the o feature, which is present in R, or from the w feature, which is not contained in R. Word detectors are binary-valued conjunctions, responding 1 when all of their inputs are active and 0 otherwise. The collection of n-grams feeding a word detector is hereafter referred to as the word’s minimal feature set. A correct recognition is scored when an input array of characters is presented, and each word detector is activated if and only if its associated word is present in the input. Any word detector activated when its corresponding word is not actually present in the input array is counted as a hallucination, and the recognition trial is scored as an error.
Recognition errors for a particular representation R are reported as the percentage of input arrays drawn randomly from the English text database that generated one or more word hallucinations.
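The detector-and-scoring scheme just described can be sketched in a few lines. The substring features, helper names, and toy lexicon below are our own illustrative choices, not the paper's implementation:

```python
def active_features(R, text):
    """Subset of representation R activated by an input array; plain
    substrings stand in for position-invariant n-grams."""
    return {f for f in R if f in text}

def detected_words(R, minimal_feature_sets, text):
    """A word detector fires iff every feature in its minimal feature
    set is active (a binary conjunction)."""
    active = active_features(R, text)
    return {w for w, feats in minimal_feature_sets.items() if feats <= active}

def hallucinated_words(R, minimal_feature_sets, text):
    """Detected words that are not actually in the input; any such word
    makes the recognition trial an error."""
    return {w for w in detected_words(R, minimal_feature_sets, text)
            if w not in text}

# The cows example from the text: R = {a, c, o, s}, and the cows
# detector listens only to c and s.
R = {"a", "c", "o", "s"}
mfs = {"cows": {"c", "s"}}
print(hallucinated_words(R, mfs, "cats sleep"))  # -> {'cows'}
print(hallucinated_words(R, mfs, "cows graze"))  # -> set()
```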
3 Results

3.1 Word-Word Collisions in a Simple 1-Gram Representation. We begin by considering the case without visual clutter and ask how often pairs of isolated words collide with each other within a binary 1-gram representation, where each word is represented as an unordered set of individual letters (with repeats ignored). This is the situation in which the binding problem is most evident, as schematized in Figure 1A. Results for the approximately 2 billion distinct pairs of words in the database are summarized in Table 1. About half of the 44,414 words contained in the database collide with at least one other word (e.g., cerebrum ↔ cucumber, calculation ↔ unconstitutional). The largest cohort of 1-gram-identical words contained 28 members.
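The 1-gram collision test amounts to grouping words by their letter sets. A minimal sketch (the word list is a small illustrative sample, not the WSJ database):

```python
from collections import defaultdict

def one_gram_cohorts(words):
    """Group words by their unordered set of letters (repeats ignored),
    i.e., by their binary 1-gram representation."""
    cohorts = defaultdict(list)
    for w in words:
        cohorts[frozenset(w)].append(w)
    return cohorts

words = ["stare", "tears", "rarest", "easter", "cerebrum", "cucumber", "dog"]
for cohort in one_gram_cohorts(words).values():
    if len(cohort) > 1:       # words indistinguishable in this code
        print(sorted(cohort))
```

Run over the full word database, this grouping yields the collision counts reported in Table 1.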
Table 1: Performance of a Binary 1-Gram Representation in Word-Word Comparisons.

Quantity                                      Value        Comment
Number of 1-grams in R                        27           a, b, ..., z, plus the space character
Number of distinct words                      44,414       From 5 million words in the WSJ
Word-word comparisons                         ~2 billion
Number of ambiguous words                     24,488       analogies ↔ gasoline, suction ↔ continuous, scientists ↔ nicest
Largest self-similar cohort                   28 words     stare, arrest, tears, restates, reassert, rarest, easter, ...
Baseline entropy in word-frequency
  distribution                                9.76 bits    < log2(44,414) = 15.4 bits
Residual uncertainty about identity of a
  randomly drawn word, knowing its 1-grams    1.4 bits     Narrows field to ~3 possible words
As an aside, we noted that if the two words compared were required also to contain the same number of letters, the number of ambiguously represented words fell to 11,377, and if the words were compared within a multivalued 1-gram representation (where words matched only if they contained the identical multiset of letters, i.e., in which repeats were taken into account), the ambiguous word count fell further to 5,483, or about 12% of English words according to this sample. The largest self-similar cohort in this case contained only 5 words (traces, reacts, crates, caters, caster). Although a 1-gram representation contains no explicit part-part relations and fails to pin down uniquely the identity of more than half of the English words in the database, it nonetheless provides most of the needed information to distinguish individually presented words. The baseline entropy in the WSJ word-frequency distribution is 9.76 bits, significantly less than the 15.4 = log2(44,414) bits for an equiprobable word set. The average residual uncertainty about the identity of a word drawn from the WSJ word-frequency distribution, given its 1-gram representation, is only 1.4 bits, meaning that for individually presented words, the 1-gram representation narrows the set of possibilities to about three words on average.

3.2 Word-Word Collisions in an Adjacent 2-Gram Representation.
Since many word objects have identical 1-gram representations even when full multisets of letters are considered, we examined the collision rate when an adjacent binary-valued 2-gram representation was used (i.e., coding the
Table 2: Performance of a Binary Adjacent 2-Gram Representation in Word-Word Comparisons.

Quantity                                      Value        Comment
Number of adjacent 2-grams                    729          [aa], [ab], ..., [a ], ..., [ z], [  ]
Number of words                               44,414
Word-word comparisons                         ~2 billion
Number of ambiguous words                     57           ohhh, ahhh, shhh, hmmm, whoosh, whirr, ...
Largest self-similar cohort                   5 words      ohhh ↔ ohhhh ↔ ohhhhh ↔ ohhhhhh ↔ ohhhhhhh
Ambiguous word pairs that are
  linguistically distinct                     4 pairs      asses ↔ assess, possess ↔ possesses, seamstress ↔ seamstresses, intended ↔ indented
Words with identical adjacent
  2-gram multisets                            2 words      intended ↔ indented
set of all adjacent letter pairs, including the space character). This representation is significant in that adjacent 2-grams are the simplest features that explicitly encode spatial relations between parts. The results of this test are summarized in Table 2. There were only 46 word-pair collisions in this case, comprising 57 distinct words. Forty-two of the 46 problematic word pairs differed only in the length of a string of repeated letters (e.g., ahhh ↔ ahhhh ↔ ahhhhh . . . , similarly for ohhh, ummm, zzz, wheee, whirrr) or fell into sets representing varying spellings or misspellings of the same word, also involving repeated letters (e.g., tepees ↔ teepees, llama ↔ lllama). Only four pairs of distinct words collided that had different morphologies in the linguistic sense (see Table 2). When words were constrained to be the same length or to contain the identical multiset of adjacent 2-grams, only a single collision persisted: intended ↔ indented. At the level of word-word comparisons, therefore, the binding of adjacent letter pairs is sufficient practically to eliminate the ambiguity in a large database of English words. We noted empirically that the inclusion of fewer than 20 additional nonadjacent 2-grams in the representation entirely eliminated the word-word ambiguity in this database, without recourse to multiset comparisons or word-length constraints.

4 An Analytical Model
The results of these word-word comparisons suggest that under some circumstances, a modest number of low-order n-grams can disambiguate a
large, complex object set. However, one of the most dire challenges to a spatially invariant feature representation, object clutter, remains to be dealt with. To this end, we developed a simple probabilistic model to predict recognition performance when a cluttered input array containing multiple objects is presented to a bank of feature detectors. To permit analysis of this problem, we make two simplifying assumptions. First, we assume that the features contained in R are statistically independent. This assumption is false for English n-grams: for example, [th] predicts that [he] will follow about one in three times, whereas the baseline rate for [he] is only 0.5%. Second, we assume that all the features in R are activated with equal probability. This is also false in English: for example, [ed] occurs approximately 90,000 times more often than [mj] (as in ramjet). Proceeding nonetheless under these two assumptions, it is straightforward to calculate the probability that no object in the database is hallucinated in response to a randomly drawn input image. We first calculate the probability o that all of the features feeding a given object detector are activated by an arbitrary input image:
o = \binom{d-w}{c-w} \Big/ \binom{d}{c} = \frac{(d-w)!\,c!}{(c-w)!\,d!} \approx \left(\frac{c}{d}\right)^{w},    (4.1)
where d is the total number of feature detectors in R, w is the size of the object’s minimal feature set (i.e., the number of features required by the target object (word) detector), and c is the number of features activated by a “cluttered” multiobject input image. The left-most combinatorial ratio term in equation 4.1 counts the number of ways of choosing c from the set of d detectors, including all w belonging to the target object, as a fraction of the total number of ways of choosing c of the d detectors in any fashion. The approximation in equation 4.1 holds when c and d are large compared to w, as is generally the case. Equation 4.1 is, incidentally, the value of a “hypergeometric” random variable in the special case where all of the target elements are chosen; this distribution is used, for example, to calculate the probability of picking a winning lottery ticket. The probability that an object is hallucinated (i.e., perceived but not actually present) is given by

h_i = \frac{o_i - q_i}{1 - q_i},    (4.2)
where q_i is the probability that the object does in fact appear in the input image, which by definition means it cannot be hallucinated. According to Zipf’s law, however, which tells us that most objects in the world are quite uncommon (see Figure 2A), we may in some cases make the further simplification that q_i ≪ o_i, giving h_i ≈ o_i. We adopt this assumption in the following. (This simplification would result also from the assumption that inputs consist of object-like noise structured to trigger false-positive perceptions.) We may now write the probability of veridical perception—the probability that no object in the database is hallucinated,

p_v = (1 - h)^{N} \approx \left[1 - \left(\frac{c}{d}\right)^{w}\right]^{N},    (4.3)
where N is the number of objects in the database. This expression assumes a homogeneous object database, where all N objects require exactly w features in R. Given the independence assumption, the expression for a heterogeneous object database consists of a product of object-specific probabilities, p_v = \prod_i (1 - (c/d)^{w_i}), where w_i is the number of features required by object i. Note that objects (words) containing the same number of parts (letters) do not necessarily activate the same number of features (n-grams) in R, depending on both the object and the composition of R. From a first-order expansion of equation 4.3 around (c/d)^w = 0, which gives p_v ≈ 1 − N(c/d)^w, it may be seen that perception is veridical (i.e., p_v ≈ 1) when (c/d)^w ≪ 1/N. Thus, recognition errors are expected to be small when (1) a cluttered input scene activates only a small fraction of the detectors in R, (2) individual objects depend on as many detectors in R as possible, and (3) the number of objects to be discriminated is not too large. The first two effects generate pressure to grow large representations, since if d is scaled up and c and w scale with it (as would be expected for statistically independent features), recognition performance improves rapidly. Conveniently, the size of the representation can always be increased by including new features of higher binding order or larger diameters, or both. It is also interesting to note that item 1 in the previous paragraph appears to militate for a sparse visual representation, while item 2 militates for a dense representation. This apparent contradiction reflects the fundamental trade-off between the need for individual objects to have large minimal feature sets in order to ward off hallucinations and the need for input scenes containing many objects to activate as few features as possible, again in order to minimize hallucinations.
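Equations 4.1 and 4.3 are simple to evaluate numerically. The following sketch uses illustrative parameter values of our own choosing (not fitted values from the paper) together with the rare-object assumption h ≈ o:

```python
from math import comb

def p_minimal_set_active(d, c, w):
    """Equation 4.1: probability that all w features in an object's
    minimal feature set are among the c features activated by a random
    input, with d detectors in total (a hypergeometric special case)."""
    return comb(d - w, c - w) / comb(d, c)

def p_veridical(d, c, w, N):
    """Equation 4.3: probability that none of the N objects in the
    database is hallucinated, using the (c/d)**w approximation."""
    return (1 - (c / d) ** w) ** N

d, c, w, N = 1000, 100, 8, 40000
exact = p_minimal_set_active(d, c, w)
approx = (c / d) ** w    # valid when c and d are large compared to w
print(exact, approx)
print(p_veridical(d, c, w, N))
```

Note that increasing w by one unit multiplies the hallucination probability by roughly c/d, which is the pressure toward large minimal feature sets and sparsely activated representations discussed above.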
This trade-off is not captured within the standard conception of a feature space, since the notion of “similarity as distance” in a feature space does not naturally extend to the situation in which inputs represent multiobject scenes and require multiple output labels.

4.1 Fitting the Model to Recognition Performance in Text World. We ran a series of experiments to test the analytical model. Input arrays varying in width from 10 to 50 characters were drawn at random from the text database and presented to R. The representation contained either all adjacent 2-grams, called R{2} to signify a diameter of 2, or the set of all 2-grams of diameter 2 or 3, called R{2,3}, where R{2,3} was twice the size of R{2}.
In individual runs, the word database was restricted to words of a single length—two-letter words (N = 253), five-letter words (N = 4024), or seven-letter words (N = 7159)—in order for the value of w to be kept relatively constant within a trial.3 For each word size and input size, and for each of the two representations, average values for w and c were measured from samples drawn from the word or text database. Not surprisingly, given the assumption violations noted above (feature independence and equiprobability), predicting recognition performance using these measured average values led to gross underestimates of the hallucination errors actually generated by the representation. Two kinds of corrections to the model were made. The first correction involved estimating a better value for d, the number of n-grams contained in the representation, since some of the detectors nominally present in R{2} and R{2,3} were infrequently, if ever, actually activated. The value for d was thus down-corrected to reflect the number of detectors that would account for more than 99.9% of the probability density over the detector set: for R{2}, the representation size dropped from d = 729 to d′ = 409, while for R{2,3}, d was cut from 1,458 to 948. The second correction was based on the fact that the n-grams activated by English words and phrases are highly statistically redundant, meaning that a set of d′ n-grams has many fewer than d′ underlying degrees of freedom. We therefore introduced a redundancy factor r, which allowed us to “pretend” that the representation contained only d′/r virtual n-grams, and a word or an input image activated only w/r or c/r virtual n-grams in the representation, respectively. (Note that since c and d appear only as a ratio in the approximation of equation 4.3, their r-corrections cancel out.)
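Because c and d enter equation 4.3 only as a ratio, the redundancy correction reduces to replacing w with w/r. A sketch with hypothetical parameter values (ours, not the paper's fitted numbers):

```python
def p_veridical_r(d, c, w, N, r=1.0):
    """Equation 4.3 with redundancy factor r: pretend there are only
    d/r virtual n-grams, with w/r per word and c/r per input; the c and
    d corrections cancel in the ratio, leaving only the w -> w/r change."""
    return (1 - (c / d) ** (w / r)) ** N

# A larger r (more redundant features) lowers the effective conjunctive
# order w/r, and with it the predicted probability of veridical perception.
for r in (1.0, 2.0, 3.45):
    print(r, p_veridical_r(d=948, c=150, w=12, N=7159, r=r))
```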
By systematically adjusting the redundancy factor depending on test condition (ranging from 1.60 to 3.45), qualitatively good fits to the empirical data could be generated, as shown in Figure 3. The redundancy factor was generally larger for long words relative to short words, reflecting the fact that long English words have more internal predictability (i.e., fill a smaller proportion of the space of long strings than short words fill in the space of short strings). The r-factor was also larger for R{2,3} than for R{2}, reflecting the fact that the increased feature overlap due to the inclusion of larger feature diameters in R{2,3} leads to stronger interfeature correlations, and hence fewer underlying degrees of freedom. Thus, with the inclusion of correction factors for nonuniform activation levels and nonindependence of the features in R, the remarkably simple relation in equation 4.3 captures the main trends that describe recognition performance in scenarios with database sizes ranging from 253 to 7159 words, words containing between
3 Since R{2} and R{2,3} contained all n-grams of their respective sizes, the value of w could vary only from word to word to the extent that words contained one or more repeated n-grams.
Figure 3: Fits of equation 4.3 to recognition performance in text world. (A) The representation, designated R{2}, contained the complete set of 729 adjacent 2-grams (including the space character). The x-axis shows the width of a text window presented to the n-gram representation, and the y-axis shows the probability that the input generates one or more hallucinated words. Analytical predictions are shown as dashed lines using the redundancy factors specified in the legend. (B) Same as A, with results for R{2,3}.
2 and 7 letters, input arrays containing from 10 to 50 characters, and representations varying in size by a factor of 2. We did not attempt to fit larger ranges of these experimental variables. Regarding the sensitivity of the model fits to the choice of d′ and r, we noted first that fits were not strongly affected by the cumulative probability cutoff used to choose d′—qualitatively reasonable fits could also be generated using a cutoff value of 95%, for example, for which the d′ values for R{2} and R{2,3} shrank to 220 and 504, respectively. In contrast, we noted that a change of one part per thousand in the value of r could result in a significant change in the shape of a curve generated by equation 4.3, and with it significant degradation of the quality of a fit. However, the systematic increase in the optimal value of r with increasing word size and representation size indicates that this redundancy correction factor is not an entirely free parameter. It is also worth emphasizing that a redundancy “factor,” which uniformly scales w, c, and d, is a very crude model of the complex statistical dependencies that exist among English n-grams; the fact that any set of small r values can be found that tracks empirical recognition curves over wide variations in the several other task variables is therefore significant.

5 Greedy Feature Learning
Several lessons from the analysis and experiments are that (1) larger representations can lead to better recognition performance, (2) only those features that are actually used should be included in a representation, (3) considerable statistical redundancy exists in a fully enumerated (or randomly drawn) set of conjunctive features, which should be suppressed if possible, and (4) features of low binding order, while potentially adequate to represent isolated objects, are insufficient for veridical perception of scenes containing multiple objects. In addition, we infer that different objects are likely to have different representational requirements depending on their complexity and that the composition of an ideal representation is likely to depend heavily on the particular recognition task. Taken together, these points suggest that a representation should be learned that includes higher-order features only as needed. In contrast, blind enumeration of an ever-increasing number of higher-order features is an expensive and relatively inefficient way to proceed. Standard network approaches to supervised learning could in principle address all of the points raised above, including finding appropriate weights for features of various orders depending on their utility and helping to decorrelate the input representation. However, the potential need to acquire features of high order—so high that enumeration and testing of the full cohort of high-order features is impractical—compels the use of some form of explicit search as part of the learning process. The use of greedy, incremental, or heuristic techniques to grow a neural representation through increasing nonlinear orders is not new (e.g., Barron, Mucciardi, Cook, Craig,
& Barron, 1984; Fahlman & Lebiere, 1990), though such techniques have been relatively little discussed in the context of vision. To cope with the many constraints that must be satisfied by a good visual representation, we developed a greedy supervised algorithm for feature learning that (1) selects the order in which features should be added to the representation, and (2) selects which features added to the representation should contribute to the minimal feature sets of each object in the database. By differentiating the log probability u_v = log(p_v) = log \prod_i (1 - h_i) with respect to the hallucination probability of the ith object, we find that

\frac{du_v}{dh_i} = \frac{-1}{1 - h_i},    (5.1)
indicating that the largest multiplicative increments to p_v are realized by squashing the largest values of h_i = (c/d)^{w_i}. Practically, this is achieved by incrementing the w-values (i.e., growing the minimal feature sets) of the most frequently hallucinated objects. The following learning algorithm aims to do this in an efficient way:

1. Start with a small bootstrap representation. (We typically used 50 2-grams found to be useful in pilot experiments.)

2. Present a large number of text “images” to the representation.

3. Collect hallucinated words—any word that was ever detected but not actually present in an input image.

4. For each hallucinated word and the input array that generated it, increment a global frequency table for every n-gram (up to a maximum order of 5 and diameter of 6) that is (i) contained in the hallucinated word, (ii) not contained in the offending input, and (iii) not currently included in the representation.

5. Choose the most frequent n-gram in this table of any order and diameter, and add it to the current representation.

6. Build a connection from the newly added n-gram to any word detector involved in a successful vote for this feature in step 4. Inclusion of this n-gram in these words’ minimal feature sets will eliminate the largest number of word hallucination events encountered in the set of training images.

7. Go to step 2.

We ran the algorithm using input arrays 50 characters in width (often containing 10 or more words) and plotted the number of hallucinated words (see Figure 4A) and the proportion of input arrays causing hallucinations (see Figure 4B) as a function of increasing representation size during learning. The periodic ratcheting in the performance curves (most visible in the lower curve of Figure 4B) was due to staging of the training process for efficiency
reasons: batches of 1000 input arrays were processed until a fixed hallucination error criterion of 1% was reached, when a new batch of 1000 random input arrays was drawn, and so on. The envelope of the oscillating performance curves provides an estimate of the true performance curve that would result from an infinitely large training corpus. Learning was terminated when the average initial error rate for three consecutive new training batches fell below 1%. Given that this termination condition was typically reached in fewer than 50 training epochs (consisting of 50,000 50-character input images), no more than half the text database was visited by random drawings of training-testing sets. We found that for 50-character input arrays, the hallucination error rate fell below 1% after the inclusion of 2287 n-grams in R. These results were typical. The inferior performance of a simple unsupervised algorithm, which adds n-grams to the representation in order of their raw frequencies in English, is shown in Figures 4A and 4B for comparison (upper dashed curves). The counts of total words hallucinated (see Figure 4A), with initial values indicating many tens of hallucinated words per input, can be seen to fall far faster than the hallucination error rate (see Figure 4B), since a sentence was scored as an error until its very last hallucinated word was quashed.
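The training loop in steps 1 through 7 can be sketched as follows. This is a simplified toy version of ours, not the paper's implementation: plain substrings stand in for n-grams of bounded order and diameter, and the two-word lexicon and single training input are purely illustrative:

```python
from collections import Counter

def candidate_ngrams(word, max_n=5):
    """All substrings of the word up to order max_n (a stand-in for the
    paper's n-grams of bounded order and diameter)."""
    return {word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def greedy_learn(lexicon, inputs, R, mfs, epochs=50):
    """Greedy feature-learning loop: collect hallucinations, vote for
    corrective n-grams, add the winner to R, and wire it into the
    minimal feature sets (mfs) of the words that voted for it."""
    for _ in range(epochs):
        votes = Counter()
        voters = {}
        for text in inputs:
            active = {f for f in R if f in text}
            for w in lexicon:
                if w not in text and mfs[w] <= active:   # hallucinated
                    for g in candidate_ngrams(w):
                        # Step 4: in the word, not in the input, not in R.
                        if g not in text and g not in R:
                            votes[g] += 1
                            voters.setdefault(g, set()).add(w)
        if not votes:
            break                       # no hallucinations remain
        best, _ = votes.most_common(1)[0]
        R.add(best)                     # step 5: grow the representation
        for w in voters[best]:          # step 6: grow minimal feature sets
            mfs[w].add(best)
    return R, mfs

lexicon = ["cows", "cast"]
mfs = {w: {"c", "s"} for w in lexicon}   # bootstrap: both words share c, s
R = {"c", "s"}
R, mfs = greedy_learn(lexicon, ["cats nap"], R, mfs)

active = {f for f in R if f in "cats nap"}
print([w for w in lexicon if w not in "cats nap" and mfs[w] <= active])  # -> []
```

Each added n-gram is absent from the offending input but present in the hallucinated word, so wiring it into that word's minimal feature set is guaranteed to suppress the hallucination it was voted for.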
Figure 4: Results of the greedy n-gram learning algorithm using input arrays 50 characters in width and a 50-element bootstrap representation. Training was staged in batches of 1000 input arrays until the steady-state hallucination error rate fell below 1%. (A) Growth of the representation is plotted along the x-axis, and the number of words hallucinated at least once in the current 1000-input training batch is plotted on the y-axis. The drop is far faster for the greedy algorithm (lower curve) than for a frequency-based unsupervised algorithm (upper curve). The unsupervised bootstrap representation also included the 27 1-grams (i.e., individual letters plus the space character) that were excluded from the bootstrap representation used in the greedy learning run. Without these 1-grams, performance of the unsupervised algorithm was even more severely hampered; with these 1-grams incorporated in the greedy learning runs, the w-values of every object were unduly inflated. (B) Same as (A), but the y-axis shows the proportion of input arrays producing at least one hallucination error. Jaggies occur at transitions to new training batches each time the 1% error criterion was reached. The asymptote is reached in this run after the inclusion of 2287 n-grams. (C) A breakdown of the learned representation showing relative numbers of n-grams of each order and diameter. (D) Column height shows the minimal feature set size after learning, that is, the average number of n-grams feeding word conjunctions for words of various lengths. The dark lower portion of each column shows initial counts based on full connectivity to the bootstrap representation; the light upper portions show the increment in w due to learning. Nearly constant w values after learning reflect the tendency of the algorithm to grow the smallest minimal feature sets. The dashed line corresponds to one n-gram per letter in a word.
One of the most striking outcomes of the greedy learning process is that the learned representation is heavily weighted toward low-order features, as shown in Figure 4C. Thus, while the learned representation includes n-grams up to order 5, the number of higher-order features required to remedy perceptual errors falls off precipitously with increasing order—far faster than the explosive growth in the number of available features at each order. Quantitatively, the ratios of n-grams included in R to n-grams contained in the word database for orders 2, 3, 4, and 5 were 0.42, 0.0095, 0.00017, and 0.0000055, respectively. The rapidly diminishing ratios of useful features to available features at higher orders confirm the impracticality of learning on a fully enumerated high-order feature set and illustrate the natural bias in the learning procedure to choose the lowest-order features that “get the job done.” The dominance of low-order features is of considerable practical significance, since higher-order features are more expensive to compute, are less robust to image degradation, operate with lower duty cycles, and provide a more limited basis for generalization. The relatively broad distribution of
[Figure 4, panels A–D (plots omitted). (A) "Fewer Words are Hallucinated as Representation Grows": words hallucinated vs. n-grams included in representation, with greedy and unsupervised curves. (B) "Fewer Inputs Cause Hallucinations as Representation Grows": inputs causing hallucinations (%) vs. n-grams included in representation. (C) "Sizes of N-grams in Final Learned Representation": number of n-grams vs. n-gram order (2–5), by diameter. (D) "Number of N-grams Feeding Words of Varying Lengths": mean number of n-grams vs. word size (1–20 letters), initial vs. learned.]
266
Bartlett W. Mel and József Fiser
diameters of the learned feature set is also shown in Figure 4C for n-grams of each order. In spite of its relatively small size, it is likely that the n-gram representation produced by the greedy algorithm could be significantly compacted without loss of recognition performance, by alternately pruning the least useful n-grams from R (which lead to minimal increments in error rate) and retraining back to criterion. However, such a scheme was not tried. After learning to the 1% error criterion, a 50-character input array activated an average of 168 (7.3%) of the 2287 detectors in R, while the average minimal feature set size for an individual word was 11.4. By comparison, the initial c/d ratio for 50-character inputs projected onto the 50-element bootstrap representation was a much less favorable 44%. Thus, a primary outcome of the learning algorithm in this case involving heavy clutter is a larger, more sparsely activated representation, which minimizes collisions between input images and the feature sets associated with individual target words. Further, the estimated r value from this run was 1.95, indicating significantly less redundancy in the learned representation than was estimated using seven-letter words (r = 3.45) in the fully enumerated 2-gram representations reported in Figure 3; this is in spite of the fact that the learned representation contained more than twice the number of active features as the fully enumerated 2-gram representation. The total height of each column in Figure 4D shows the w value—the size of the minimal feature sets after learning for words of various lengths. The dark lower segment of each column counts initial connections to the word conjunctions from the 50-element bootstrap representation, while the average number of new connections gained during learning for words of each length is given by the height of each gray upper segment.
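The redundancy value quoted for this run can be reproduced by inverting the first-order error model associated with equation 4.3, error ≈ N(c/d)^(w/r), and solving for r. This is a sketch, assuming that first-order form; the function name is illustrative, not the authors' code:

```python
# Sketch: recover the redundancy correction r by solving
# N * (c/d)**(w/r) = error for r (first-order model assumed).
import math

def fit_redundancy(c, d, w, error, N):
    """r such that N * (c/d)**(w/r) equals the observed error rate."""
    return w * math.log(c / d) / math.log(error / N)

# Post-learning measurements from the Figure 4 run: 168 of 2287
# detectors active on a 50-character input, mean minimal feature set
# size 11.4, trained to a 1% hallucination criterion over 44,414 words.
r = fit_redundancy(168, 2287, 11.4, 0.01, 44_414)
print(f"estimated r = {r:.2f}")  # close to the r = 1.95 reported in the text
```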
Given the algorithm's tendency to expand the smallest minimal feature sets, the w values for short words grew proportionally much more than those for longer words, producing a final distribution of w values that was remarkably independent of word length for all but the shortest words. The w values for one-, two-, and three-letter words were lower either because they reached their maximum possible values (4 for one-letter words) or because these words contained rarer features, on average, derived from a relatively high concentration in the database of unpronounceable abbreviations, acronyms, and so forth. The dashed line indicates a minimal feature set size equal to the number of letters contained in the word; the longest words can be seen to depend on substantially fewer n-grams than they contain letters. The pressures influencing both the choice of n-grams for inclusion in R and the assignment of n-grams to the minimal feature sets of individual words were very different from those associated with a frequency-based learning scheme. Thus, the features included in R were not necessarily the most common ones, and the most commonly occurring English words were not necessarily the most heavily represented.
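The greedy repair loop implied by this discussion can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: contiguous n-grams stand in for the paper's more general gapped n-grams, and the repair heuristic (add the lowest-order n-gram the hallucinated word contains but the offending input lacks) is an assumption consistent with the text:

```python
# Illustrative sketch of the greedy repair loop: when a word's minimal
# feature set is fully active in an input that does not actually contain
# the word (a hallucination), grow that word's feature set with a
# discriminating n-gram.

def ngrams_of(text, max_n=3):
    """All contiguous n-grams of order 1..max_n, with boundary spaces."""
    padded = f" {text} "
    return {padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)}

def greedy_update(minimal_sets, input_array, words_in_input):
    """Repair every hallucination produced by one training input."""
    active = ngrams_of(input_array)
    for word, feats in minimal_sets.items():
        if feats <= active and word not in words_in_input:
            # hallucination: add the lowest-order n-gram that the word
            # contains but the current cluttered input does not
            repair = ngrams_of(word) - active
            if repair:
                feats.add(min(repair, key=len))

# toy run: "cat" is hallucinated from an input containing "vacated"
sets = {"cat": {"ca", "at"}}
greedy_update(sets, "a vacated lot", {"a", "vacated", "lot"})
print(sets["cat"])  # now also contains the boundary 2-gram " c"
```

Note how the repair chosen here is a boundary-anchored n-gram, echoing the observation below that most early additions include a space character.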
Table 3: First 10 n-Grams to Be Added to R During a Learning Run with 50-Character Input Arrays, Initialized with a 200-Element Bootstrap Representation.

Order of Inclusion    n-Gram      Relative Frequency    Cumulative %
1                     [z ]        417                   8
2                     [ j]        14,167                55
3                     [k* ]       39,590                71
4                     [x ]        6090                  42
5                     [ki]        20,558                62
6                     [ q]        8771                  47
7                     [ *v]       24,278                64
8                     [ **z]      3184                  31
9                     [o*i]       37,562                70
10                    [p*** ]     55,567                76

Note: The " " character represents the space character, and "*" matches any character. A value of k in the cumulative distribution column indicates that the total probability summed over all less common features is k%.
First, in lieu of choosing the most commonly occurring n-grams, the greedy algorithm prefers features that are relatively rare in English text as a whole but relatively common in those words most likely to be hallucinated: short words and words containing relatively common English structure. The paradoxical need to find features that distinguish objects from common backgrounds, but do so as often as possible, leads to a representation containing features that occur with moderate frequency. Table 3 shows the first 10 n-grams added to a 200-element bootstrap representation during learning in one experimental run. The features' relative frequencies in English are shown, which jump about erratically, but are mostly from the middle ranges of the cumulative probability distribution for English n-grams. Most of these first 10 features include a space character, indicating that failure to represent elemental features properly in relation to object boundaries is a major contributor to hallucination errors in this domain. Specifically, we found that hallucination errors are frequently caused by the presence in the input of multiple "distractor" objects that together contain most or all of the features of a target object, for example, in the sense that national and vacation together contain nearly all the features of the word nation. To distinguish nation from the superposition of these two distractors requires the inclusion in the representation of a 2-gram such as [n***** ] whose diameter of 7 exceeds the maximum value of 6 arbitrarily established in the present experiments. Having established that the features included in R are not keyed in a simple way to their relative frequencies in English, we also observed that no simple relation exists between word frequency and the per-word representational cost, that is, the minimal feature set size established during
Table 4: Correlation Between Word Frequency in English and Minimal Feature Set Size After Learning Was Only 0.1.

Word       Relative Frequency    Minimal Feature Set Size (a)
resting    100                   25
leading    315                   24
buffalo    188                   8
unhappy    144                   7
thereat    2                     25
fording    1                     23
bedknob    1                     7
jackdaw    1                     6

(a) Number of n-grams in R that feed the word's conjunction, that is, which must be present in order for a word to be detected. Table shows examples of common and uncommon words with large and small minimal feature sets.
learning. In fact, if attention is restricted to the set of all words of a given length, with nominally equivalent representational requirements, the correlation between relative frequency of the word in English and its postlearning minimal feature set size was only 0.1. Thus some common words require a conjunction of many n-grams in R, while others require few, and some rare words require many n-grams while others require few (see Table 4).

5.1 Sensitivity of the Learned Representation to Object Category Structure. Words are unlike common objects in that every word forms its own
category, which must be distinguished from all other words. In text world, therefore, the critical issue of generalization to novel class exemplars cannot be directly studied. To address this shortcoming partially and assess the impact of broader category structures on the composition of the n-gram representation, we modified the learning algorithm so that a word was considered to be hallucinated only if no similar—rather than identical—word was present in the input when the target word was declared recognized. In this case, similar was defined as different up to any single letter replacement—for example, dog is similar to {fog, dig, . . .}, but not to {fig, og, dogs, . . .}. The resulting category structure imposed on words was thus a heavily overlapping one in which most words were similar to several other words.4

4. This learning procedure, combined with the assumption that every word is represented by only a simple conjunction of features in R (disjunctions of multiple prototypes per word were not supported), cannot guarantee that every word detector is activated
[Figure 5 bar chart (plot omitted): "Decrease in Representational Costs for Broader, Overlapping Object Categories"; number of n-grams vs. n-gram order (2–5), with bars for tolerance = 0 and tolerance = 1.]
Figure 5: Breakdown of representations produced by two learning runs differing in breadth of category structure. Fifty-character input arrays were used as before to train to an asymptotic 1% error criterion, this time beginning with a 200-element bootstrap representation. The run with tolerance = 0 was as in Figure 4, where a hallucination was scored whenever a word's minimal feature set was activated, but the word was not identically present in the input. In the run with tolerance = 1, a hallucination was not scored if any similar word was present in the input, up to a single character replacement. The latter run, with a more lenient category structure, led to a representation that was significantly smaller, since it was not required to discriminate as frequently among highly similar objects.
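The tolerance = 1 similarity criterion used in this run (identical up to a single letter replacement, i.e., same length and Hamming distance at most 1) can be written as a small predicate; the function name is illustrative:

```python
# Sketch of the tolerance = 1 category structure: two words are similar
# iff they have the same length and differ in at most one position.

def similar(a: str, b: str) -> bool:
    """True if b is a legal variant of a under tolerance = 1."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1

# examples from the text
assert similar("dog", "fog") and similar("dog", "dig")
assert not similar("dog", "fig")   # two replacements
assert not similar("dog", "dogs")  # length change not allowed
```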
in response to any of the legal variants of the word. Rather, the procedure allows a word detector to be activated by any of the legal variants without penalty.

The main result of using this more tolerant similarity metric during learning is that fewer word hallucinations are encountered per input image, leading to a final representation that is smaller by more than half (see Figure 5). In short, the demands on a conjunctive n-gram representation are lessened when the object category structure is broadened, since fewer distinctions among objects must be made.

5.2 Scaling of Learned Representation with Input Clutter. As is confirmed by the results of Figure 3, one of the most pernicious threats to a spatially invariant representation is that of clutter, defined here as the quantity of input material that must be simultaneously processed by the representation. The difficulty may be easily seen in equation 4.3, where an unopposed increase in the value of c leads to a precipitous drop in recognition performance. On the other hand, equation 4.3 also indicates that increases in c can be counteracted by appropriate increases in d and w, which conserve the value of (c/d)^w. To examine this issue, we systematically increased the clutter load in a series of three learning runs from 25 to 75 characters, training always to a fixed 3% error criterion, and recorded for each run the postlearning values of c, d, and w. Although the WSJ database contained a wide range of word sizes (from 1 to 20 characters in length), the measurement of a single value for w averaged over all words was justified since, as shown in Figure 4D, the learning algorithm tended to equalize the minimal feature set sizes for words independent of their lengths. Under the assumption of uniformly probable, statistically independent features in the learned representations, the value of h = (c/d)^w for all three runs should be equal to approximately 7 × 10^-7, the value that predicts a 3% error rate for a 44,414-object database using equation 4.3. However, corrections (w → w/r) were needed to factor out unequal amounts of statistical redundancy in representations of various sizes, where larger representations contained more redundancy. The measured values of c, d, and w and the empirically determined r values are shown for each run in Table 5. As in the runs of Figure 3, redundancy values were again systematically greater for the larger representations generated, and the redundancy values calculated here for the learned representation were again significantly smaller than those seen for the fully enumerated 2-gram representation.

Table 5: Measured Average Values of c, d, and w and Calculated Values of r in a Series of Three Learning Runs with Input Clutter Load Increased from 25 to 75 Characters, Trained to a Fixed Error Criterion of 3%.

Input Width    c      d      w      r
25             55     816    7.6    1.45
50             152    1768   10.7   1.85
75             253    2634   12.8   2.11

Note: The 50-element bootstrap representation was used to initialize all three runs. Equation 4.3 was used to find values of r for each run, such that with w → w/r, the measured values of c and d, and N = 44,414, a 3% error rate was predicted.

6 Discussion
Using a simple analytical model as a taking-off point and a variety of simulations in text world, we have shown that low-ambiguity perception in un-
segmented multiobject scenes can occur whenever the probability of fully activating any single object's minimal feature set, by a randomly drawn scene, is kept sufficiently low. One prescription for achieving this condition is straightforward: conjunctive features are mined from hallucinated objects until false-positive recognition errors are suppressed to criterion. Even in a complex domain containing a large number of highly similar object categories and severe background clutter, this prescription can produce a remarkably compact representation. This confirms that brute-force enumeration of large families or, worse, all high-order conjunctive features—the combinatorial explosion recognized by Malsburg—is not inevitable. The analysis and experiments reported here have provided several insights into the properties of feature sets that support hallucination-proof recognition and into learning procedures capable of building these feature sets:

1. Efficient feature sets are likely to (1) contain features that span the range from very simple to very complex, (2) contain relatively many simple features and relatively few complex features, (3) emphasize features that are only moderately common (giving a representation that is neither sparse nor dense) in response to the conflicting constraints that features should appear frequently in objects but not in backgrounds also composed of objects, and (4) in spatial domains, emphasize features that encode the relations of parts to object boundaries.

2. An efficient learning algorithm works to drive toward zero, and therefore to equalize, the false-positive recognition rates for all objects considered individually. Thus, frequently hallucinated objects—objects with few parts or common internal structure, or both—demand the most attention during learning.
Two consequences of this focus of effort on frequently hallucinated objects are that (1) the average value of w, the size of the minimal feature set required to activate an object, becomes nearly independent of the number of parts contained in the object, so that simpler objects are relatively more intensively represented than complex objects, and (2) among objects of the same part complexity, the minimal conjunctive feature sets grow largest for objects containing the most common substructures, though these are not necessarily the most common objects. A curious implication of the pressure to represent frequently hallucinated objects heavily is that the backgrounds in which objects are typically embedded can strongly influence the composition of the optimal feature set for recognition.

3. The demands on a visual representation are heavily dependent on the object category structure imposed by the task. Where object classes are large and diffuse, the required representation is smaller and weighted
to features of lower conjunctive order, whereas for a category structure like words, in which every object forms its own category that must often be distinguished from a large number of highly similar objects (e.g., cat from cats, cut, rat), the representation must be larger and depend more heavily on features of higher conjunctive order.

6.1 Further Implications of the Analytical Model. Equation 4.3 provides an explicit mathematical relation among several quantities relating to multiple-object recognition. Since the equation assumes statistical independence and uniform activation probability of the features contained in the representation, neither of which is a valid assumption in most practical situations, we were unable to predict recognition errors using measured values for d, c, and w. However, we found that error rates in each case examined could be fitted using a small correction factor for statistical redundancy, which uniformly scaled d, c, and w and whose magnitude grew systematically for larger collections of features activated by larger—more redundant—units of text. From this we conclude that equation 4.3 at least crudely captures the quantitative trade-offs governing recognition performance in receptive-field-based visual systems. One of the pithier features of equation 4.3 is that it provides a criterion for good recognition performance: (c/d)^w ≪ 1/N. The exponential dependence on the minimal feature set size, w, means that modest increases in w can counter a large increase in the number of object categories, N, or unfavorable increases in the activation density, c/d, which could be due to either an increase in the clutter load or a collapse in the size of the representation.
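The compensation arithmetic implied by this criterion can be checked numerically against the values reported in this section; the first-order form error ≈ N(c/d)^(w/r) is assumed, and the numbers below are the Table 5 measurements and the Figure 4 run's post-learning values:

```python
# Sketch: first-order hallucination-rate estimate error ≈ N*(c/d)**(w/r),
# where c of d detectors are active, w is the minimal feature set size,
# and r is the empirical redundancy correction (N = 44,414 words).

N = 44_414

def predicted_error(c, d, w, r):
    """First-order false-positive (hallucination) rate estimate."""
    return N * (c / d) ** (w / r)

# (input width, c, d, w, r) for the three clutter-scaling runs of Table 5
table5 = [(25, 55, 816, 7.6, 1.45),
          (50, 152, 1768, 10.7, 1.85),
          (75, 253, 2634, 12.8, 2.11)]

for width, c, d, w, r in table5:
    # each run was trained to a fixed 3% criterion; the fitted r values
    # bring the prediction back to roughly 0.03 in every case
    print(f"{width}-char input: predicted error = {predicted_error(c, d, w, r):.3f}")

# Exponential dependence on w: with c/d ≈ 0.07 and r ≈ 2 (Figure 4 run),
# raising w from 10 to 12 shrinks (c/d)**(w/r) by a factor of about 14,
# enough to absorb a more-than-10-fold increase in N at constant error.
headroom = (0.07 ** (10 / 2)) / (0.07 ** (12 / 2))
print(f"N headroom from w: 10 -> 12: {headroom:.1f}x")

# Halving the clutter (c -> c/2) at w/r = 5 cuts the error 2**5 = 32-fold.
print(f"error reduction from halving clutter: {2 ** 5}x")
```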
For example, using roughly the postlearning values of c/d ≈ 0.07 and r ≈ 2 from the run of Figure 4, we find that increasing w from 10 to 12 can compensate for a more than 10-fold increase in N, or a nearly 60% increase in the c/d ratio, while maintaining the same recognition error rate. The first-order expansion of equation 4.3 also states that recognition errors grow as the wth power of c/d, where w is generally much larger than 1. A visual system constructed to minimize hallucinations therefore abhors uncompensated increases in clutter or reductions in the size of the representation. These effects are exactly those that have motivated the various compensatory strategies discussed in section 1. The abhorrence of clutter generates strong pressure to invoke any readily available segmentation strategies that limit processing to one or a small number of objects at a time. For example, cutting out just half the visual material in the input array when w/r = 5, as in the above example, knocks down the expected error rate by a factor of 32. The pressure to reduce the number of objects simultaneously presented to the visual system could account in part for the presence in biological visual systems of (1) a fovea, which leads to a huge overrepresentation of the center of fixation and marginalization of the surround, (2) covert attentional mechanisms, which selectively up- or down-modulate sensitivity to different portions of the visual field, and
(3) dynamical processes that segment "good" figures (in the Gestalt sense) from background. The second pressure involved in maintaining a low c/d ratio is the need to maintain a visual representation of adequate size, as measured by d. One situation in which this is particularly difficult arises when the task mandates that objects be distinguished under excessively wide ranges of spatial variation, an invariance load that must be borne in turn by each of the individual feature detectors in the representation. For example, consider a relatively simple task in which the orientation of image features is a valid cue to object identity, such as a task in which objects usually appear in a standard orientation. In this case a bank of several orientation-specific variants of each feature can be separately maintained within the representation. In contrast, in a more difficult task that requires that objects be recognized in any orientation, each bank of orientation-specific detectors must be pooled together to create a single, orientation-invariant detector for that conjunctive feature. The inclusion of this invariance in the task definition can thus lead to an order-of-magnitude reduction in the size of the representation, which, unopposed, could produce a catastrophic breakdown of recognition according to equation 4.3. As a rule of thumb, therefore, the inclusion of an additional order-of-magnitude invariance must be countered by an order-of-magnitude increase in the number of distinct feature conjunctions included in the representation, drawn from the reservoir of higher-order features contained in the objects in question. The main cost in this approach lies in the additional hardware, requiring components that are more numerous, expensive, prone to failure, and used at a far lower rate.

6.2 Space, Shape, Invariance, and the Binding Problem.
While one emphasis in this work has been the issue of spatial invariance in a visual representation, we reiterate that the mathematical relation expressed by equation 4.3 knows nothing of space or shape or invariance, but rather views objects and scenes as collections of statistically independent features. What, then, is the role of space? From the point of view of predicting recognition performance and with regard to the quantitative trade-offs discussed here, we can find no special status for worlds in which spatial relations among parts help to distinguish objects, since spatial relations may be viewed simply as additional sources of information regarding object identity. (Not all worlds have this property; an example is olfactory identification.) Further, the issue of spatial invariance plays into the mathematical relation of equation 4.3 only indirectly, in that it specifies the degree of spatial pooling needed to construct the invariant features included in the representation; once again, the details relating to the internal construction of these features are not visible to the analysis. In this sense, the spatial binding problem, which appears at first to derive from an underspecification of the spatial relations needed to glue an object together, in fact reduces to the more mundane problem of ambiguity arising
from overstimulation of the visual representation by input images—that is, a c/d ratio that is simply too large. The hallucination of objects based on parts contained in background clutter can thus occur in any world, even one in which there is no notion of space (e.g., matching bowls of alphabet soup). However, the fact that object recognition in real-world situations typically depends on spatial relations between parts and the fact that recognition invariances are also very often spatial in nature together strongly influence the way in which a visual representation based on spatially invariant receptive fields can be most efficiently constructed. In particular, many economies of design can be realized through the use of hierarchy, as exemplified by the seminal architecture of Fukushima et al. (1983). The key observation is that a population of spatially invariant, higher-order conjunctive features generally shares many underlying processing operations. The present analysis is, however, mute on this issue.

6.3 Relations to "Real" Vision. The quantitative relations given by equation 4.3 cast the problem of ambiguous perception in cluttered scenes in the simple terms of set intersection probabilities: a cluttered scene activates c of the d feature detectors at random and is correctly perceived with high probability as long as the set of c activated features only rarely includes all w features associated with any one of the N known objects. The analysis is thus heavily abstracted from the details of the visual recognition problem, most obviously ignoring the particular selectivities and invariances that parameterize the detectors contained in the representation.
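The set-intersection reading just described can be checked directly by simulation: activate c of d detectors at random and count how often all w features of one fixed object are included. The toy parameter values below are illustrative, not the paper's:

```python
# Monte Carlo check of the set-intersection picture: the probability
# that a random c-subset of d detectors contains a fixed w-element
# minimal feature set, compared with the exact hypergeometric value
# and the (c/d)**w approximation.
import random
from math import comb

d, c, w, trials = 100, 20, 3, 200_000
target = set(range(w))          # one object's minimal feature set
detectors = list(range(d))

hits = sum(target <= set(random.sample(detectors, c))
           for _ in range(trials))

exact = comb(d - w, c - w) / comb(d, c)   # hypergeometric probability
print(f"simulated {hits / trials:.4f}, exact {exact:.4f}, "
      f"(c/d)^w approx {(c / d) ** w:.4f}")
```

The exact probability (about 0.007 here) sits close to the (c/d)^w approximation (0.008), which is the form used throughout the analysis.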
On the other hand, the analysis makes it possible to understand in a straightforward way how the simultaneous challenges of clutter and invariance conspire to create binding ambiguity and the degree to which compensatory increases in the size of the representation, through conjunctive order boosting, can be used to suppress this ambiguity. These basic trade-offs have operated close to their predicted levels in the text world experiments carried out in this work, in spite of violations of the simplifying assumptions of feature independence and equiprobability. Since the application of our analysis to any particular recognition problem involves estimating the domain-specific quantities c, d, w, and N, we do not expect predictions relating to levels of binding ambiguity or recognition performance to carry over directly from the problem of word recognition in blocks of symbolic text to other visual recognition domains, such as viewpoint-independent recognition of real three-dimensional objects embedded in cluttered scenes. Nevertheless, we expect the basic representational trade-offs to persist. The most important way that recognition in more realistic visual domains differs from recognition in text world lies in the invariance load, which in most interesting cases extends well beyond simple translation invariance. Basic-level object naming in humans, for example, is largely invariant to
translation, rotation, scale, and various forms of distortion, occlusion, and degradation—invariances that persist even for certain classes of nonsense objects (Biederman, 1995). Following the logic of our probabilistic analysis, the daunting array of invariances that must be maintained in this and many other natural visual tasks would seem to present a huge challenge to biological visual systems. In keeping with this, performance limitations in human perception suggest that our visual systems have responded in predictable ways to the pressures of the various tasks with which they are confronted; in particular, recognition in a variety of visual domains appears to include only those invariances that are absolutely necessary and those for which simple compensatory strategies are not available. For example: (1) recognition is restricted to a highly localized foveal acceptance window for text or other detailed figures, where by "giving up" on translation invariance, this substream of the visual system frees hardware resources for the several other essential invariances that cannot be so easily compensated for (e.g., size, font, kerning, orientation, distortion), (2) face recognition operates under strong restrictions on the image-plane orientation and direction of lighting of faces (Yin, 1969; Johnston, Hill, & Carman, 1992), a reasonable compromise given that faces almost always present themselves to our visual systems upright and with positive contrast and lighting from above, and (3) unlike the case for common objects, reliable discrimination of complex three-dimensional nonsense objects (e.g., bent paper clips, crumpled pieces of paper) is restricted to a very limited range of three-dimensional orientations in the vicinity of familiar object views (Biederman & Gerhardstein, 1995), though this deficiency can be overcome in some cases by extensive training (Logothetis & Pauls, 1995).
When viewed within our framework, the exceptional difficulty of this latter task arises from the need for, or more precisely the lack of, the very large number of very-high-order conjunctive features—essentially full object views locked to specific orientations (see Figure 1B)—that are necessary to support reliable viewpoint-invariant discrimination among a large set of objects with highly overlapping part structures. In summary, the primate visual system manifests a variety of domain-specific design compromises and performance limitations, which can be interpreted as attempts to achieve simultaneously the needed perceptual invariances, acceptably low levels of binding ambiguity, and reasonable hardware costs. In considering the astounding overall performance of the human visual system in comparison to the technical state of the art, it is also worth noting that human recognition performance is based on neural representations that could contain tens of thousands of task-appropriate visual features (extrapolated from the monkey; see Tanaka, 1996) spanning a wide range of conjunctive orders and finely tuned invariances. In continuing work, we are exploring further implications of the trade-offs discussed here in relation to the neurobiological underpinnings of spatially invariant conjunctive receptive fields in visual cortex (Mel, Ruderman & Archie, 1998) and on the development of more efficient supervised
and unsupervised hierarchical learning procedures needed to build high-performance, task-specific visual recognition machines.

Acknowledgments
Thanks to Gary Holt for many helpful comments on this work. This work was funded by the Office of Naval Research.

References

Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and applications in the United States of algorithms related to GMDH. In S. Farrow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.
Biederman, I. (1995). Visual object recognition. In S. Kosslyn & D. Osherson (Eds.), An invitation to cognitive science (2nd ed.) (pp. 121–165). Cambridge, MA: MIT Press.
Biederman, I., & Gerhardstein, P. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). J. Exp. Psychol. (Human Perception and Performance), 21, 1506–1514.
Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.
Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.
Douglas, R., & Martin, K. (1998). Neocortex. In G. Shepherd (Ed.), The synaptic organization of the brain (pp. 567–509). Oxford: Oxford University Press.
Edelman, S., & Duvdevani-Bar, S. (1997). A model of visual recognition and categorization. Phil. Trans. R. Soc. Lond. [Biol], 352, 1191–1202.
Fahlman, S., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Fize, D., Boulanouar, K., Ranjeva, J., Fabre-Thorpe, M., & Thorpe, S. (1998). Brain activity during rapid scene categorization—a study using event-related fMRI. J. Cog. Neurosci., suppl. S, 72–72.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man & Cybernetics, SMC-13, 826–834.
Gilbert, C. (1983). Microcircuitry of the visual cortex. Ann. Rev. Neurosci, 89, 8366–8370.
Heller, J., Hertz, J., Kjær, T., & Richmond, B. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.
Hummel, J., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.
Johnston, A., Hill, H., & Carman, N. (1992). Recognizing faces: Effects of lighting direction, inversion, and brightness reversal. Perception, 21, 365–375.
Jones, E. (1981). Anatomy of cerebral cortex: Columnar input-output relations. In F. Schmitt, F. Worden, G. Adelman, & S. Dennis (Eds.), The organization of cerebral cortex. Cambridge, MA: MIT Press.
Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.
Lang, G. K., & Seitz, P. (1997). Robust classification of arbitrary object classes based on hierarchical spatial feature-matching. Machine Vision and Applications, 10, 123–135.
Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Science Press.
Logothetis, N., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 3, 270–288.
Logothetis, N., & Sheinberg, D. (1996). Visual object recognition. Ann. Rev. Neurosci., 19, 577–621.
Malsburg, C. (1994). The correlation theory of brain function (reprint from 1981). In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 95–119). Berlin: Springer.
McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception. Psych. Rev., 88, 375–407.
Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9, 777–804.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334.
Mozer, M. (1991). The perception of multiple objects. Cambridge, MA: MIT Press.
Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84.
Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.
Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147.
Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522.
Sandon, P., & Uhr, L. (1988). An adaptive model for viewpoint-invariant object recognition. In Proc. of the 10th Ann. Conf. of the Cog. Sci. Soc. (pp. 209–215). Hillsdale, NJ: Erlbaum.
Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2, pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press.
Swain, M., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.
278
Bartlett W. Mel and J´ozsef Fiser
Szentagothai, J. (1977). The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. Lond. B, 201, 219–248. Tanaka, K. (1996). Inferotemporal cortex and object vision. Ann. Rev. Neurosci, 19, 109–139. Thorpe, S., Fize, D., & Marlot, C. (1996).Speed of processing in the human visual system. Nature, 381, 520–522. Van Essen, D. (1985). Functional organization of primate visual cortex. In A. Peters & E. Jones (Eds.), Cerebral cortex (pp. 259–329). New York: Plenum Publishing. Wallis, G., & Rolls, E. T. (1997).Invariant face and object recognition in the visual system. Prog. Neurobiol., 51, 167–194. Weng, J., Ahuja, N., & Huang, T. S. (1997). Learning recognition and segmentation using the cresceptron. Int. J. Comp. Vis., 25(2), 109–143. Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15. Yin, R. (1969). Looking at upside down faces. J. Exp. Psychol., 81, 141–145. Zemel, R., Mozer, M., & Hinton, G. (1990). TRAFFIC: Recognizing objects using hierarchical reference frame transformations. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 266–273). San Mateo, CA: Morgan Kaufmann. Received September 29, 1998; accepted March 5, 1999.
NOTE
Communicated by Terence Sanger
Relationship Between Phase and Energy Methods for Disparity Computation

Ning Qian
Center for Neurobiology and Behavior, Columbia University, New York, NY 10032, U.S.A.

Samuel Mikaelian
Center for Neural Science, New York University, New York, NY 10003, U.S.A.

The phase and energy methods for computing binocular disparity maps from stereograms are motivated differently, have different physiological relevances, and involve different computational steps. Nevertheless, we demonstrate that at the final stages where disparity values are made explicit, the simplest versions of the two methods are exactly equivalent. The equivalence also holds when the quadrature-pair construction in the energy method is replaced with a more physiologically plausible phase-averaging step. The equivalence fails, however, when the phase-difference receptive field model is replaced by the position-shift model. Additionally, intermediate results from the two methods are always quite distinct. In particular, the energy method generates a distributed disparity representation similar to that found in the visual cortex, while the phase method does not. Finally, more elaborate versions of the two methods are in general not equivalent. We also briefly compare these two methods with some other stereo models in the literature.

1 Introduction
Neural Computation 12, 279–292 (2000) © 2000 Massachusetts Institute of Technology

Binocular disparity is defined as the positional difference between the two retinal projections of a given point in space. It is well known that the horizontal component of disparity provides the sensory cue for stereoscopic depth perception. Many computational models for disparity estimation have been proposed. Here we compare two computational models that appear to have the strongest biological relevance: the phase method and the energy method. The phase method (Sanger, 1988; Fleet, Jepson, & Jenkin, 1991) for disparity computation is based on the mathematical result that the displacement of a function generates a proportional phase shift in its complex Fourier transform. The binocular disparity at each location is therefore proportional to the difference of the Fourier phases of the corresponding left and right image patches. Sanger (1988) used sine and cosine Gabor filters to estimate the local Fourier phases of both left and right images, and then calculated the Fourier phase difference at each location to find disparity. The algorithm has been tested on both random dot stereograms and natural images (Sanger, 1988).

The energy method for disparity computation (Qian, 1994, 1997; Zhu & Qian, 1996; Qian & Zhu, 1997; Qian & Andersen, 1997; Fleet, Wagner, & Heeger, 1996) was derived from the energy model of binocular cell responses in the cat's primary visual cortex (Ohzawa, DeAngelis, & Freeman, 1990, 1996, 1997; Freeman & Ohzawa, 1990; DeAngelis, Ohzawa, & Freeman, 1991). Through quantitative physiological experiments, Freeman and coworkers found that a typical binocular simple cell can be described by two Gabor functions, one for its left and the other for its right receptive field. There is a relative phase difference (or positional shift, or both) between the two receptive fields that generates disparity sensitivity. The activities of such simple cells were determined by first convolving the left and right images with the left and right receptive fields, respectively, and then summing the two contributions. Freeman et al. further found that responses of binocular complex cells in cat's primary visual cortex can be well modeled by summing the squared outputs of a quadrature pair of simple cells. We analyzed this so-called energy model for binocular complex cells and found that a population of such complex cells can be used to extract stimulus disparity (see Qian, 1997, for a review). The resulting energy method for disparity computation has been tested on random dot stereograms (Qian, 1994; Qian & Zhu, 1997). These two methods are different in several ways.
First, although both methods use Gabor filters at the front end, the phase method applies two monocular Gabor filters separately and delays binocular comparison until the last step, when the Fourier phases of the left and right image patches are subtracted, while the energy method uses a set of binocular filters that combines contributions from the two eyes at the first step. In addition, the phase method explicitly computes and represents the complex Fourier phases of the left and right image patches, while the energy method uses the quadrature-pair construction (or the equivalent phase averaging; see section 2) at the complex cell stage to remove the simple cells' dependence on image Fourier phases. Furthermore, complex cell responses in the energy method form a distributed representation of disparity; such a representation is absent in the phase method. Despite these differences, we show that the simplest versions of the two methods are exactly equivalent at the final steps where disparity values are made explicit. The intermediate steps, however, differ in the two methods and have different physiological implications. In addition, the two methods can be elaborated in different ways, resulting in nonequivalent algorithms.
2 Results
Before we start, a few potential confusions in terminology should be clarified. First, the phase method should not be confused with the phase-difference (also called phase-parameter or phase-shift) receptive field model. The former specifies a method for disparity estimation from stereograms, while the latter is a model for describing receptive field profiles of binocular simple cells often used in the energy method. Second, the phase parameters in the phase-difference receptive field model (φ_l and φ_r in equations 2.1 and 2.2) should not be confused with the image Fourier phases. The former determine the positions of the excitatory-inhibitory bands relative to the receptive field center, while the latter are the phases of the complex Fourier transforms of retinal images. Finally, mathematically complex quantities, with real and imaginary components, should not be confused with complex cell properties.

We now demonstrate the exact equivalence of the simplest versions of the phase and the energy methods. Elaborations of the methods such as frequency pooling (for both methods), confidence measure (for the phase method), and spatial pooling (for the energy method) are ignored in this section (but see section 3). We start by reformulating the energy method (Qian, 1994) and convert it into a form identical to that for the phase method (Sanger, 1988). Consider a binocular simple cell centered at x = 0, with the left and right receptive fields given by the following Gabor functions (Marcelja, 1980; Daugman, 1985; McLean & Palmer, 1989; Ohzawa et al., 1990):

    f_l(x) = g(x) cos(ωx + φ_l),                                    (2.1)
    f_r(x) = g(x) cos(ωx + φ_r),                                    (2.2)

with

    g(x) = (1 / (√(2π) σ)) exp(−x² / (2σ²)),                        (2.3)
where ω is the preferred (angular) horizontal spatial frequency of the cell, σ is the gaussian width determining the receptive field size, and φ_l and φ_r are the left and right phase parameters. Its response to a stereo image pair I_l(x) and I_r(x) is given by:

    r_s = ∫_{−∞}^{∞} [f_l(x) I_l(x) + f_r(x) I_r(x)] dx.            (2.4)
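As a minimal numerical sketch, equations 2.1 through 2.4 can be discretized directly. The Python fragment below is illustrative only: all stimulus and parameter values are chosen for the example and are not taken from the text.

```python
import numpy as np

def gabor(x, omega, phi, sigma):
    """Gabor receptive field g(x) cos(omega*x + phi) of equations 2.1-2.3."""
    g = np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return g * np.cos(omega * x + phi)

def simple_cell_response(I_left, I_right, x, omega, phi_l, phi_r, sigma):
    """Binocular simple cell response r_s of equation 2.4 (Riemann sum)."""
    dx = x[1] - x[0]
    f_l = gabor(x, omega, phi_l, sigma)
    f_r = gabor(x, omega, phi_r, sigma)
    return float(np.sum(f_l * I_left + f_r * I_right) * dx)

# Illustrative stimulus: the right image is a shifted copy of the left.
x = np.linspace(-5.0, 5.0, 2001)
D = 0.4                           # true disparity (arbitrary units)
I_l = np.exp(-x**2)               # a bright blob
I_r = np.exp(-(x + D)**2)         # same blob, shifted: I_r(x) = I_l(x + D)
r = simple_cell_response(I_l, I_r, x, omega=2.0, phi_l=0.0, phi_r=0.8, sigma=1.0)
```

Because equation 2.4 is linear in the images, the binocular response to identical left and right inputs is exactly twice the corresponding monocular response, which provides a quick sanity check on the discretization.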
(For a layer of topographically arranged cells of the same type, the convolution operation between images and receptive fields should be used instead.) The simple cell with a quadrature phase relationship to the above cell has
receptive fields of the same form but with the cosine functions replaced by sines (Pollen, 1981; Adelson & Bergen, 1985; Watson & Ahumada, 1985; Ohzawa et al., 1990; Qian, 1994). It is easy to see that the responses of these two simple cells are the real and imaginary parts of

    R₊ ≡ ∫_{−∞}^{∞} g(x) e^{iωx} [e^{iφ_l} I_l(x) + e^{iφ_r} I_r(x)] dx.   (2.5)
According to the quadrature-pair construction, the complex cell response is given by the sum of the squared responses of these two simple cells (or, equivalently, of four half-wave rectified simple cells (Ohzawa et al., 1990)), and can thus be written as

    r_q = |R₊|² = |∫_{−∞}^{∞} g(x) e^{iωx} [I_l(x) + e^{iφ₋} I_r(x)] dx|²,   (2.6)

where

    φ₋ ≡ φ_r − φ_l                                                  (2.7)
is the phase parameter difference. Therefore, complex cell responses depend on only φ₋ instead of on φ_r and φ_l individually. For a given binocular stimulus, complex cells with different φ₋ will give different responses. These cells at a given spatial location form a distributed representation of the stimulus disparity at that location. According to the energy method (Qian, 1994), the stimulus disparity can be explicitly estimated from the distribution as:

    D = φ̂₋ / ω,                                                    (2.8)
where φ̂₋ is the phase difference φ₋ that maximizes the complex cell response r_q. Let

    ∫_{−∞}^{∞} g(x) I_l(x) e^{iωx} dx ≡ Î_l(ω),                    (2.9)
    ∫_{−∞}^{∞} g(x) I_r(x) e^{iωx} dx ≡ Î_r(ω),                    (2.10)

so that Î_l(ω) and Î_r(ω) are the Fourier transforms of the left and right image patches under the gaussian envelope g(x), respectively. Equation 2.6 then becomes:

    r_q = |Î_l(ω) + e^{iφ₋} Î_r(ω)|².                              (2.11)
To find φ̂₋, note that the maximum of r_q is attained when the two terms inside the norm of the above expression are along the same direction in the
complex plane, that is,

    φ̂₋ = arg(Î_l) − arg(Î_r)                                      (2.12)
        = arctan[ ∫_{−∞}^{∞} I_l(x) g(x) sin(ωx) dx / ∫_{−∞}^{∞} I_l(x) g(x) cos(ωx) dx ]
        − arctan[ ∫_{−∞}^{∞} I_r(x) g(x) sin(ωx) dx / ∫_{−∞}^{∞} I_r(x) g(x) cos(ωx) dx ].   (2.13)
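The equivalence can be checked numerically. The sketch below uses illustrative parameters and a smoothed random stimulus; the maximization over φ₋ in the energy method is done by a brute-force grid search, and the phase difference of equation 2.12 is taken as the principal value of arg(Î_l Î_r*), which equals arg(Î_l) − arg(Î_r) modulo 2π.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
omega, sigma = 2.0, 1.5
g = np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Illustrative stimulus: smoothed noise; the right image is the left
# shifted by D_true, i.e. I_r(x) = I_l(x + D_true).
rng = np.random.default_rng(0)
xi = np.linspace(-12.0, 12.0, 6001)
kernel = np.exp(-np.linspace(-3.0, 3.0, 601)**2)
noise = np.convolve(rng.standard_normal(xi.size), kernel, mode="same")
D_true = 0.3
I_l = np.interp(x, xi, noise)
I_r = np.interp(x + D_true, xi, noise)

# Windowed Fourier transforms of equations 2.9 and 2.10.
chi_l = np.sum(g * I_l * np.exp(1j * omega * x)) * dx
chi_r = np.sum(g * I_r * np.exp(1j * omega * x)) * dx

# Phase method (equations 2.12-2.13 with 2.8).
D_phase = np.angle(chi_l * np.conj(chi_r)) / omega

# Energy method: maximize the complex cell response of equation 2.11
# over phi_minus by grid search, then apply equation 2.8.
phis = np.linspace(-np.pi, np.pi, 20001)
r_q = np.abs(chi_l + np.exp(1j * phis) * chi_r)**2
D_energy = phis[np.argmax(r_q)] / omega
# The two estimates coincide up to the grid resolution; both should lie
# near D_true when the stimulus has power near omega.
```

Since r_q as a function of φ₋ is exactly a cosine with its peak at arg(Î_l) − arg(Î_r), the agreement between the two estimates is limited only by the search grid, not by the stimulus.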
Equation 2.13 combined with equation 2.8 is precisely the expression for the phase method proposed by Sanger (1988). The two terms in equation 2.13 are the local complex Fourier phases of the left and right image patches, respectively, estimated through the sine and cosine Gabor filters. This establishes the equivalence between the two methods.

It is interesting to note that φ₋ simply compensates for the different Fourier phases of the right and left image patches in equation 2.11. Scaling the relative contrast of the image pair while preserving the contrast polarity will not affect the disparity estimates because only the relative amplitudes (but not phases) of Î_l and Î_r will be changed by this manipulation (Qian, 1994). When the contrast polarity of one image in the pair is reversed, however, φ₋ has to be shifted by π to compensate, resulting in an inverted complex cell tuning curve. This prediction is also contained in equation 2.13 of Qian (1994) and has been verified experimentally (Ohzawa et al., 1990; Cumming & Parker, 1997; Masson, Busettini, & Miles, 1997).

We have previously discussed the similarities and differences between the energy method and the cross-correlator (Qian & Zhu, 1997). With the current formulation, it is easy to see that finding φ̂₋, which maximizes the quadrature-pair response, is equivalent to determining the Fourier phase of the cross-correlation between the left and right image patches under the gaussian envelope g(x). This is because the Fourier transform of the cross-correlation,

    C_lr(x) = ∫_{−∞}^{∞} g(x′) I_l(x′) g(x + x′) I_r(x + x′) dx′,   (2.14)
is simply Î_l(ω) Î_r*(ω), where * denotes complex conjugation. Therefore, the phase of this cross-correlator, or its normalized version, is just the expression for φ̂₋ in equation 2.12. Correlation methods usually use the peak location of C_lr(x) to estimate disparity. If C_lr(x) is sharply peaked at x_o, its Fourier phase is approximately ωx_o. It is in deviations from this approximation that the cross-correlator and the energy method can yield different estimates.

We used the quadrature-pair construction for obtaining complex cell responses above. As we pointed out previously (Qian & Zhu, 1997), a less demanding and physiologically more plausible alternative is to integrate
squared responses of many simple cells all with the same φ₋ but with

    φ₊ ≡ φ_r + φ_l                                                  (2.15)

uniformly distributed in the entire 2π range (see equation 2.19). The energy method with this uniform phase averaging approach is still equivalent to the phase method because the phase averaging gives exactly the same complex cell expression as the quadrature-pair construction. To demonstrate, define:

    R₋ ≡ R₊* = ∫_{−∞}^{∞} g(x) e^{−iωx} [e^{−iφ_l} I_l(x) + e^{−iφ_r} I_r(x)] dx,   (2.16)
and rewrite the simple cell response as:

    r_s = (1/2)(R₊ + R₋).                                           (2.17)

Since r_s is real, we have:

    r_s² = |r_s|² = (1/4)(|R₊|² + |R₋|² + 2 Re R₊R₋*) = (1/2)(|R₊|² + Re R₊²).   (2.18)
According to the phase averaging approach, the complex cell response is given by the integration:

    r_a = (1/2π) ∫₀^{2π} r_s² dφ₊ = (1/2π) ∫₀^{2π} (1/2)(|R₊|² + Re R₊²) dφ₊,   (2.19)
while φ₋ is kept constant. Since |R₊|² is not a function of φ₊ (see equation 2.6), the averaging leaves the first term unchanged. The second term integrates to zero because:

    Re R₊² = Re{ [∫_{−∞}^{∞} dx g(x) e^{iωx} (e^{−iφ₋/2} I_l(x) + e^{iφ₋/2} I_r(x))]² e^{iφ₊} }   (2.20)

and

    ∫₀^{2π} e^{iφ₊} dφ₊ = 0.                                        (2.21)
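The averaging identity can also be verified numerically: with φ₋ held fixed and φ₊ sampled uniformly over [0, 2π), the mean of the squared simple cell responses r_s = Re R₊ keeps only the |R₊|²/2 term, as equations 2.19 through 2.21 imply. A sketch with illustrative parameters and an arbitrary random stimulus:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
omega, sigma, phi_minus = 2.0, 1.5, 0.7      # illustrative parameters
g = np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

rng = np.random.default_rng(1)
I_l = rng.standard_normal(x.size)            # arbitrary stimulus
I_r = rng.standard_normal(x.size)

def R_plus(phi_plus):
    """R_+ of equation 2.5, parameterized by phi_plus = phi_r + phi_l."""
    phi_l = (phi_plus - phi_minus) / 2.0
    phi_r = (phi_plus + phi_minus) / 2.0
    integrand = g * np.exp(1j * omega * x) * (
        np.exp(1j * phi_l) * I_l + np.exp(1j * phi_r) * I_r)
    return np.sum(integrand) * dx

# Quadrature-pair complex cell response (equation 2.6); |R_+|^2 does not
# depend on phi_plus, so any value of phi_plus may be used here.
r_q = abs(R_plus(0.0))**2

# Phase averaging (equation 2.19): average the squared simple cell
# responses r_s = Re R_+ over phi_plus, with phi_minus held fixed.
phi_plus_grid = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
r_a = np.mean([R_plus(p).real**2 for p in phi_plus_grid])
# r_a equals r_q / 2, since the Re R_+^2 term averages to zero.
```

Because the φ₊-dependence of r_s² is exactly sinusoidal, even a coarse uniform grid over [0, 2π) averages the oscillating term to zero to machine precision.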
Therefore, the phase averaging approach generates a complex cell response proportional to that of the quadrature pair in equation 2.6,

    r_a = (1/2)|R₊|² = (1/2) r_q,                                   (2.22)

where the proportionality constant is immaterial for disparity computation. Intuitively, one can imagine dividing the 2π range for φ₊ into many small
intervals in equation 2.19. Then each pair of intervals differing by π/2 can be considered a quadrature pair. Therefore, the above phase averaging can be viewed as averaging together a continuum of quadrature-pair responses, all with the same φ₋.

Two types of receptive field models for binocular simple cells have been proposed: the position-shift model (Bishop, Henry, & Smith, 1971) and the phase-difference model (Ohzawa et al., 1990; DeAngelis et al., 1991). The latter was used in the above derivations. Real cortical cells may use a combination of both mechanisms for coding disparity (Jacobson, Gaska, & Pollen, 1993; Zhu & Qian, 1996; Fleet et al., 1996; Anzai, Ohzawa, & Freeman, 1997). We showed previously that although there are important differences between them, under reasonable assumptions, both models (or their hybrid) can be used with the energy method for computing disparity maps from stereograms (Zhu & Qian, 1996; Qian & Zhu, 1997). However, the exact equivalence between the phase and the energy methods does not hold when the position-shift receptive field model is used in the energy method. To see this, note that according to the position-shift receptive field model, equation 2.2 should be replaced by

    f_r(x) = f_l(x + d) = g(x + d) cos[ω(x + d) + φ_l],             (2.23)

where d represents the amount of positional shift between the left and right receptive field centers. The complex cell response in equation 2.6 should then be written as:

    r_q = |∫_{−∞}^{∞} e^{iωx} [g(x) I_l(x) + e^{iωd} g(x + d) I_r(x)] dx|²,   (2.24)

and the stimulus disparity is given by (Zhu & Qian, 1996):

    D = d̂,                                                         (2.25)

where d̂ is the positional shift that maximizes the complex cell response. Equation 2.24 can be rewritten as

    r_q = |Î_l + e^{iωd} Î_r(d)|²,                                  (2.26)

with the definitions:

    ∫_{−∞}^{∞} g(x) I_l(x) e^{iωx} dx ≡ Î_l,                       (2.27)
    ∫_{−∞}^{∞} g(x + d) I_r(x) e^{iωx} dx ≡ Î_r(d).                (2.28)
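With the position-shift model, the maximization in equation 2.25 is typically carried out by explicit search over candidate shifts d. A minimal sketch with illustrative parameters; a grating at the cell's preferred frequency is used as the stimulus so that the location of the maximum is easy to verify:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
omega, sigma = 2.0, 1.5

def g(u):
    """Gaussian envelope of equation 2.3."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

# Illustrative stimulus: a grating at the preferred frequency, with the
# right image shifted so that I_r(x) = I_l(x + D_true).
D_true = 0.25
I_l = np.cos(omega * x)
I_r = np.cos(omega * (x + D_true))

def r_q(d):
    """Complex cell response with position shift d (equation 2.24)."""
    integrand = np.exp(1j * omega * x) * (
        g(x) * I_l + np.exp(1j * omega * d) * g(x + d) * I_r)
    return np.abs(np.sum(integrand) * dx)**2

# Equation 2.25: the disparity estimate is the shift d maximizing r_q,
# found here by brute-force search over a grid of candidate shifts.
ds = np.linspace(-1.0, 1.0, 801)
D_hat = ds[np.argmax([r_q(d) for d in ds])]
# For this grating stimulus, D_hat recovers D_true.
```

At d = D_true the two terms inside the norm align exactly for this stimulus, so the grid search peaks at the true shift; for broadband stimuli the peak must still be located numerically, for the reason given in the text.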
It is obvious that a closed-form expression similar to equation 2.13 cannot be obtained for d̂ because Î_r is also a function of d, and the maximum of
r_q as a function of d may not occur when the two terms inside the norm of equation 2.26 have the same direction on the complex plane. Therefore, the energy method with the position-shift receptive field model is not equivalent to the phase method. However, under the oft-invoked assumption that the disparity is small compared to the receptive field size, the d-dependence of Î_r in equation 2.28 is negligible and ωd in equation 2.26 plays the same role as φ₋ in equation 2.11. Clearly, in this approximation the energy method with the position-shift model corresponds to the phase method in the same way as the energy method with the phase-difference model does.

3 Discussion
We have demonstrated that the simplest versions of the energy and the phase methods for disparity computation are exactly equivalent at the final stages where the disparity values are made explicit. The equivalence still holds when the quadrature-pair construction in the energy method is replaced by a less demanding phase-averaging procedure. However, when the position-shift type of simple cell receptive field model is used to replace the phase-difference type in the energy method, the two methods are no longer exactly equivalent. This result is consistent with the fact that both the phase method and the energy method combined with the phase-difference receptive field model predict a relationship between the spatial scale ω and the computed disparity range (the so-called size-disparity correlation), while the methods with the position-shift receptive field model do not make such a prediction (Sanger, 1988; Smallman & MacLeod, 1994; Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996).

Regardless of the receptive field models used, the intermediate results in the two methods are always very different. Indeed, the energy method contains simple and complex cell stages that simulate binocular interactions observed in simple and complex cortical cells, while none of the steps in the phase method resembles actual binocular cells. More importantly, the complex cells in the energy method form a distributed representation of binocular disparity at each location; such a representation is absent in the phase method. The distributed representation is useful because it could directly guide motor behavior such as vergence eye movement (Masson et al., 1997) without first generating an explicit disparity map. In fact, the final explicit disparity extraction in both methods does not seem to happen in the brain, since "grandmother cells" for disparity coding have never been recorded from the visual cortex.
While an explicit, grandmother-cell type of representation is more convenient for us to comprehend, the brain appears to rely more on implicit, distributed representations for perception and control.

The simplest versions of the phase and the energy methods can also be elaborated in many ways. For example, both methods can be extended to combine results from different spatial scales (i.e., filters with different ω's).
Obviously, if the same extension is made after the final stages in both methods, they remain equivalent. Any elaborations at intermediate steps, or those that can only be applied to one method, will in general render the two methods nonequivalent. One example is the spatial pooling procedure in the energy method, which greatly improves the algorithm by making complex cell responses more independent of image Fourier phases (Zhu & Qian, 1996; Qian & Zhu, 1997). This procedure cannot be easily incorporated into the phase method because image Fourier phases are precisely the useful information in the phase method. Similarly, the confidence measure in the phase method for appropriately weighting different spatial scales (Sanger, 1988) cannot be readily extended to the energy method, since the measure is derived from the Fourier amplitudes of the left and right image patches; these monocular amplitudes are not computed in the energy method, which starts with binocular filters. Of course, a different spatial pooling step could be incorporated into the phase method. For example, one could average the Fourier phases estimated from several nearby spatial locations. Similarly, a different confidence measure could be added to the energy method. One possibility would be using the normalized range of the complex cell responses,

    [r_q(φ̂₋) − r_q(φ̂₋ + π)] / [r_q(φ̂₋) + r_q(φ̂₋ + π)] = 2 Re(e^{−iφ̂₋} Î_l Î_r*) / (|Î_l|² + |Î_r|²) = 2 |Î_l| |Î_r| / (|Î_l|² + |Î_r|²),   (3.1)

as the confidence measure, because a larger modulation would allow a more reliable estimation of φ̂₋.

In conclusion, the equivalence between the simplest versions of the phase and energy methods demonstrated in this article indicates that the two methods can be viewed as different implementations of the same underlying mathematical principle. The differences in implementation, however, allow the methods to be elaborated, interpreted, and used in different ways by subsequent processes. Our results are reminiscent of the relationship between various motion models that have been documented in the literature (Adelson & Bergen, 1985; van Santen & Sperling, 1985; Borst & Egelhaaf, 1989; Simoncelli & Adelson, 1991; Heeger & Simoncelli, 1992).

3.1 Other Methods. One of the reviewers pointed out that disparity can also be computed by calculating the magnitude of the bandpass-filtered sum of the left and right image patches. This magnitude is equivalent to setting φ₋ to zero in equation 2.6 (or setting d to zero in equation 2.24) and dropping the immaterial squaring outside the norm. For a filter well tuned to horizontal frequency ω and for left and right image patches given by I_l = I(x) and I_r = I(x + D), this magnitude is approximately proportional to |cos(ωD/2)| |Ĩ(ω)|. If the image spectrum |Ĩ(ω)| does not vary much with ω, the disparity D can be estimated by using two or more filters with different
ω. We will refer to this algorithm as the frequency method because it applies identical filters to the left and right images and consequently has to rely on more than one frequency band for disparity computation. In essence, the frequency method fixes φ₋ (or d) to zero and varies ω, while the energy method fixes ω and varies φ₋ (or d). Obviously the frequency method is not equivalent to the energy (or the phase) method because the former cannot compute disparity within a single frequency band, while the latter can. Existing psychophysical evidence is against the frequency method in its pure form: human subjects do perceive stereoscopic depth in band-limited stereograms (Julesz, 1971). Psychophysical observations also indicate that different frequency channels interact with each other in human stereo vision (Wilson, Blake, & Halpern, 1991; Rohaly & Wilson, 1993, 1994; Smallman, 1995; Mallot, Gillner, & Arndt, 1996). It is thus possible that the brain may combine the frequency and the energy methods by using cells with different ω and different φ₋ (or d) simultaneously in disparity computation. Alternatively, the brain could first apply the energy method to obtain one disparity estimate from each frequency channel and then pool the estimates across channels at a later stage (Sanger, 1988; Qian & Zhu, 1997). Further studies are needed for differentiating these two possibilities.

In addition to the phase and the energy methods, many other stereo models have been proposed over the years. Similar to the phase method, which first estimates Fourier phases of the left and right image patches separately, most of the other models also start with a monocular detection of some matching primitives. For example, several studies (Marr & Poggio, 1976; Pollard, Mayhew, & Frisby, 1985; Prazdny, 1985; Qian & Sejnowski, 1989; Marshall, Kalarickal, & Graves, 1996) let individual dots in random-dot stereograms be the matching primitives.
Marr and Poggio (1979) later used zero crossings at different spatial scales as the primitives. For convenience, we classify these methods as initially monocular. In contrast, the energy method is initially binocular because it starts with binocular filtering without performing any monocular feature detection. In this aspect it is similar to the cepstral method (Yeshurun & Schwartz, 1989) and the frequency method discussed above. The cepstral and the energy methods are very different in other aspects, however. For instance, the former contains a complex logarithmic operation and complex Fourier transforms, while the latter does not. (Note that the Fourier transform used in this article and elsewhere in connection with the energy method is only for mathematically analyzing the method; it is not a step in the method.) In addition, the cepstral method does not use binocular receptive fields from the physiological experiments.

The initially monocular algorithms generally contain a subsequent step that matches the monocular primitives to solve the correspondence problem. Most methods (Marr & Poggio, 1976, 1979; Pollard et al., 1985; Prazdny, 1985; Qian & Sejnowski, 1989; Marshall et al., 1996) do so by introducing explicit constraints or rules that determine correct matches between the left and right primitives among all possible matches within a certain disparity
range. These methods will be referred to as the explicit methods. In contrast, the phase method is implicit (Sanger, 1988) because it finds disparity by simply subtracting the two monocular phases without performing any explicit matching. (If the monocular Fourier amplitudes are used to calculate a confidence weighting factor [Sanger, 1988], then the matching is slightly more explicit.) The initially binocular algorithms defined above are necessarily implicit since there are no monocular primitives for explicit matching in the first place. An advantage of the implicit methods is that they can have subpixel disparity resolution without the burden of sorting through a huge number of potential matches (Sanger, 1988; Qian & Zhu, 1997). As we discussed elsewhere (Qian, 1997), the implicit methods are also more physiologically plausible. To summarize, most stereo models are initially monocular for feature extraction and are then explicit in binocular matching. The cepstral, the energy, and the frequency methods, on the other hand, are initially binocular and implicit. The phase method is somewhere in between: it is initially monocular and is implicit.

Obviously, there are many other ways of classifying stereo models. For example, some models measure and rely on disparity gradients, while others do not; some implement the uniqueness constraint through inhibitory interactions, while others emphasize facilitation and allow multiple matches of monocular primitives. The current formulation of the energy method assumes that the disparity is constant over the small image patches covered by the receptive fields (Qian, 1994; Qian & Zhu, 1997). Because of this assumption, the performance tends to degrade when there are large disparity gradients in the images, such as across disparity boundaries. However, the energy method can be extended to encompass disparity gradients.
Specifically, it can be shown that a disparity gradient of D′ (= dD(x)/dx) can be best measured by binocular cells whose left and right preferred spatial frequencies (ω_l and ω_r) are related by ω_r = (1 + D′)ω_l. Therefore, the maximum measurable disparity gradient depends on the differences between ω_l and ω_r among real binocular cells. If ω_l and ω_r are equal, then the cells' frequency tuning widths have to be at least (1 + D′)ω_l in order to detect D′ (Sanger, 1988). This extension of the energy method can be viewed as an implicit counterpart of the gradient limit rule employed explicitly by Pollard et al. (1985). It is also worth pointing out that the energy method in its current form (Qian, 1994; Qian & Zhu, 1997) already contains an implicit implementation of the uniqueness constraint (used by many explicit models in various forms). Because of the approximately cosine tuning behavior, there is only one maximum of the complex cell response as a function of φ₋ within the [−π, π) range, even when there are two overlapping stimulus disparities. The stereo transparency problem could still be solved by an interdigitating representation of the two disparities by cells tuned to different spatial locations, similar to a previous motion-transparency model (Qian, Andersen, & Adelson, 1994). This approach is also related to those explicit methods
that use the uniqueness constraint to model stereo transparency (Qian & Sejnowski, 1989; Pollard & Frisby, 1990). It thus appears that despite the major differences between the explicit and implicit methods, they could share certain conceptual similarities.

Acknowledgments
This work was supported by a Sloan Research Fellowship and NIH grant MH54125, both to N. Q. We are grateful to the anonymous reviewers for their helpful comments.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2), 284–299.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Nat. Acad. Sci. USA, 94, 5438–5443.
Bishop, P. O., Henry, G. H., & Smith, C. J. (1971). Binocular interaction fields of single units in the cat striate cortex. J. Physiol., 216, 39–68.
Borst, A., & Egelhaaf, M. (1989). Principles of visual motion detection. Trends Neurosci., 12, 297–306.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2, 1160–1169.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159.
Fleet, D. J., Jepson, A. D., & Jenkin, M. (1991). Phase-based disparity measurement. Comp. Vis. Graphics Image Proc., 53, 198–210.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1858.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Res., 30, 1661–1676.
Heeger, D. J., & Simoncelli, E. P. (1992). Model of visual motion sensing. In L. Harris & M. Jenkins (Eds.), Spatial vision in humans and robots. Cambridge: Cambridge University Press.
Jacobson, L., Gaska, J. P., & Pollen, D. A. (1993). Phase, displacement and hybrid models for disparity coding. Invest. Ophthalmol. and Vis. Sci. Suppl. (ARVO), 34, 908.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.
Mallot, H. A., Gillner, S., & Arndt, P. A. (1996). Is correspondence search in human stereo vision a coarse-to-fine process? Biol. Cybern., 74, 95–106.
Phase and Energy Methods for Disparity Computation
Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. A, 70, 1297–1300.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204, 301–328.
Marshall, J. A., Kalarickal, G. J., & Graves, E. B. (1996). Neural model of visual stereomatching: Slant, transparency, and clouds. Network: Comput. Neural Sys., 7, 635–639.
Masson, G. S., Busettini, C., & Miles, F. A. (1997). Vergence eye movements in response to binocular disparity without depth perception. Nature, 389, 283–286.
McLean, J., & Palmer, L. A. (1989). Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vision Res., 29, 675–679.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat's visual cortex. J. Neurophysiol., 75, 1779–1805.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77, 2879–2909.
Pollard, S. B., & Frisby, J. P. (1990). Transparency and the uniqueness constraint in human and computer stereo vision. Nature, 347, 553–556.
Pollard, S. B., Mayhew, J. E., & Frisby, J. P. (1985). PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14, 449–470.
Pollen, D. A. (1981). Phase relationship between adjacent simple cells in the visual cortex. Nature, 212, 1409–1411.
Prazdny, K. (1985). Detection of binocular disparities. Biol. Cybern., 52, 93–99.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404.
Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18, 359–368.
Qian, N., & Andersen, R. A. (1997). A physiological model for motion-stereo integration and a unified explanation of Pulfrich-like phenomena. Vision Res., 37, 1683–1698.
Qian, N., Andersen, R. A., & Adelson, E. H. (1994). Transparent motion perception as detection of unbalanced motion signals III: Modeling. J. Neurosci., 14, 7381–7392.
Qian, N., & Sejnowski, T. J. (1989). Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 435–443). San Mateo, CA: Morgan Kaufmann.
Ning Qian and Samuel Mikaelian
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Res., 37, 1811–1827.
Rohaly, A. M., & Wilson, H. R. (1993). Nature of coarse-to-fine constraints on binocular fusion. J. Opt. Soc. Am. A, 10, 2433–2441.
Rohaly, A. M., & Wilson, H. R. (1994). Disparity averaging across spatial scales. Vision Res., 34, 1315–1325.
Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59, 405–418.
Simoncelli, E. P., & Adelson, E. H. (1991). Relationship between gradient, spatiotemporal energy, and regression models for motion perception. Invest. Ophthalmol. and Vis. Sci. Suppl. (ARVO), 32, 893.
Smallman, H. S. (1995). Fine-to-coarse scale disambiguation in stereopsis. Vision Res., 35, 1047–1060.
Smallman, H. S., & MacLeod, D. I. (1994). Size-disparity correlation in stereopsis at contrast threshold. J. Opt. Soc. Am. A, 11, 2169–2183.
van Santen, J. P. H., & Sperling, G. (1985). Elaborated Reichardt detectors. J. Opt. Soc. Am. A, 2, 300–321.
Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2, 322–342.
Wilson, H. R., Blake, R., & Halpern, D. L. (1991). Coarse spatial scales constrain the range of binocular fusion on fine scales. J. Opt. Soc. Am. A, 8, 229–236.
Yeshurun, Y., & Schwartz, E. L. (1989). Cepstral filtering on a columnar image architecture—A fast algorithm for binocular stereo segmentation. IEEE Pat. Anal. Mach. Intell., 11, 759–767.
Zhu, Y., & Qian, N. (1996). Binocular receptive fields, disparity tuning, and characteristic disparity. Neural Comput., 8, 1647–1677.

Received September 30, 1998; accepted February 12, 1999.
NOTE
Communicated by Leo Breiman
N-tuple Network, CART, and Bagging

Aleksander Kołcz
Electrical and Computer Engineering Department, University of Colorado at Colorado Springs, Colorado Springs, CO 80918, U.S.A.

Similarities between bootstrap aggregation (bagging) and N-tuple sampling are explored to propose a retina-free, data-driven version of the N-tuple network, whose close analogies to aggregated regression trees, such as classification and regression trees (CART), lead to further architectural enhancements. Performance of the proposed algorithms is compared with the traditional versions of the N-tuple and CART networks on a number of regression problems. The architecture significantly outperforms conventional N-tuple networks while leading to more compact solutions and avoiding certain implementational pitfalls of the latter.
1 Introduction
Recent years have seen a reemergence of interest in the N-tuple network (NTN) architecture (Bledsoe & Browning, 1959). Despite numerous analyses (Bledsoe & Bisson, 1962; Steck, 1962; Roy & Sherman, 1967) and significant progress (Luttrell, 1992; Rohwer, 1995; Kołcz & Allinson, 1996; Jung, Krishnamoorthy, Nagy, & Shapira, 1996; Bradshaw, 1997; Rohwer & Morciniec, 1998), the computational capabilities of this architecture are still not very well understood. An NTN combines multiple models that randomly partition the input space. In this article we show that a single NTN node can be represented by a binary tree, which can benefit from learning algorithms similar to those used by classification and regression trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984). This abstraction, together with analogies between N-tuple sampling and bootstrap aggregation (bagging) (Breiman, 1996a), is used to propose a data-driven method of creating a set of NTN tree-based nodes, resulting in a network related to bagged CART. The article is organized as follows. In section 2 we demonstrate a basic similarity between N-tuple sampling and bagging, and propose a retina-free version of the NTN. Section 3 introduces two tree architectures combining the attributes of CART and NTN. In section 4, numerical experiments are used to demonstrate the advantages of the proposed tree versions of the NTN when compared to the original architecture and to CART on a number of regression problems. Section 5 contains the conclusion.

Neural Computation 12, 293–304 (2000)
© 2000 Massachusetts Institute of Technology
2 Retina-Free NTN and Bagging Estimators
Assuming a binary input pattern space, usually represented in the form of a two-dimensional "retina," an NTN performs N-tuple sampling: a set of K memory nodes, each having an N-bit-long address word, randomly distribute their address lines among the retina bits, with each address line being associated with a particular random feature of the binary input space. The NTN mapping is thus an aggregate of K random low-complexity models. In continuous-variable environments, real-valued inputs are converted into a binary format suitable for N-tuple sampling. In particular, when thermometer encoding (Aleksander & Stonham, 1979) is used, an instance of N-tuple sampling becomes equivalent to placing N quantization thresholds (constrained by a uniform quantization grid in each dimension) within the range of the D input variables (Tattersall, Foster, & Johnston, 1991), leading to hyper-rectangular quantization of the input space. The network is thereby biased by an assumption of uniformity imposed on the input space. The alternative we consider is to bypass the input-conversion stage altogether and perform N-tuple sampling by directly distributing the quantization thresholds within the ranges of the input variables, a process that can be guided by the empirical input-space distribution provided by a training set.

In light of this idea, certain similarities between the nature of NTN design and some recent advances in learning algorithms become apparent. Breiman (1996a) showed that significant improvements in network prediction performance can be gained by using the concept of bagging, where the variance component of the generalization error is reduced by averaging several variants of the network. Each variant is designed with a different version of the training set, with each version created by sampling with repetition from the original collection.
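The thermometer encoding and random N-tuple sampling described above can be sketched as follows (a minimal illustration with our own function names; the 8-level grid and 3-bit tuple are arbitrary choices, not values from the paper):

```python
import numpy as np

def thermometer_encode(x, levels=8, lo=0.0, hi=1.0):
    """Encode a scalar in [lo, hi] as a unary 'thermometer' bit string:
    bit j is 1 iff x exceeds the j-th uniform quantization threshold."""
    thresholds = lo + (hi - lo) * np.arange(1, levels + 1) / (levels + 1)
    return (x > thresholds).astype(int)

def sample_tuple(n_retina_bits, n, rng):
    """One N-tuple memory node: N address lines drawn at random (without
    replacement) from the retina bit positions."""
    return rng.choice(n_retina_bits, size=n, replace=False)

rng = np.random.default_rng(0)
bits = thermometer_encode(0.6)            # -> [1 1 1 1 1 0 0 0]
address_lines = sample_tuple(len(bits), 3, rng)
address = bits[address_lines]             # the node's 3-bit address word
print(bits, address_lines, address)
```

With a uniform grid, a larger input simply raises more leading bits; sampling those bits at random is what ties each memory node to a random set of quantization thresholds.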
To see the similarity between bagging and NTN design, let us first note that each node of an NTN represents a crude model of the desired mapping, which is highly dependent on the positioning of the N-tuple thresholds. Averaging a large number of such randomly designed nodes reduces the variance of the estimate irrespective of the data set considered. However, when N-tuple sampling is constrained to choose the quantization thresholds from within the set of training-point coordinates (according to the sampling-with-repetition regimen), the process of selecting a set of K NTN nodes becomes data driven and effectively corresponds to bagging. Although such an approach should lead to performance improvements by itself, we propose to structure the process of training individual NTN nodes further by taking advantage of their inherent similarity to binary trees.

3 NTN Analogies to CART
CART is a well-known binary tree network based on hyper-rectangular quantization of the input space. In the regression variant, its output to an
input x is given by an arithmetic average of the training points falling into the same tree leaf as x. During the tree-growing process, which starts with a root node containing all training points, CART chooses the next node to be partitioned such that the split of the corresponding cell maximizes the decrease of the overall training-set error. A split is performed parallel to one of the axes by choosing the threshold point out of the set of coordinate values of the training points falling into that node. Node splitting continues until a sufficiently large tree is created, at which point a pruning process is initiated to optimize the tree and avoid the problem of overfitting. Although a node of the NTN is created differently from CART, certain commonalities can be seen. Starting with the unpartitioned input space, the placement of the first quantization threshold of an NTN node is equivalent to splitting the space at the root node of CART. The placement of any additional threshold results in splitting all of the current leaf nodes of the tree at that threshold, instead of splitting exactly one node, as is the case in CART. Despite this important difference, we can create an equivalent tree corresponding to a single node of the NTN, given that the order of threshold-point placement is fixed (i.e., each N-tuple is ordered). The value of N corresponds to the depth of the equivalent tree.

3.1 Bagged-Tree Implementation of an NTN. In view of the similarities between NTN and CART, a question arises whether the structured approach of tree design could be applied to the NTN in order to enhance the network performance. One obvious problem is that CART constructs essentially one tree, whereas NTN uses K nodes to achieve satisfactory performance levels. However, the correspondence between bagging and N-tuple sampling allows a collection of bagged trees to be substituted for the collection of randomly designed N-tuple nodes.
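The bagging loop for deriving the K nodes is itself simple; a sketch in which `grow_tree`/`predict` are placeholder hooks for any tree learner (the stub below just predicts the training mean, so the example is runnable):

```python
import numpy as np

def bagged_predict(X_train, y_train, X_test, grow_tree, predict, K, rng):
    """Average the predictions of K models, each grown on a bootstrap
    resample (sampling with repetition) of the training set."""
    preds = np.zeros(len(X_test))
    for _ in range(K):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # resample
        model = grow_tree(X_train[idx], y_train[idx])
        preds += predict(model, X_test)
    return preds / K

# Stub learner standing in for a CART/NTN-node tree: predicts the mean.
grow = lambda X, y: float(np.mean(y))
pred = lambda model, X: np.full(len(X), model)

rng = np.random.default_rng(0)
X, y = rng.uniform(size=(50, 2)), rng.uniform(size=50)
yhat = bagged_predict(X, y, X[:5], grow, pred, K=25, rng=rng)
print(yhat)
```

Substituting a tree-growing routine for `grow` yields the bagged-tree network discussed in the text; the resampling step is what makes each of the K nodes data driven rather than purely random.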
The K bootstrapped CART trees can also be interpreted as nonidentical quantizations of the input space, where each point resides in exactly K quantization cells, as in the NTN. Taking these observations into account, if a structured (i.e., data-driven) node design were to be applied to the NTN, bagging could be used to derive the K network nodes.

3.2 Adapting the CART Node-Splitting Criterion. At each step in the CART tree-growing algorithm, all current tree leaves are evaluated for the possibility of a split, and the leaf leading to the highest reduction in error is then chosen. The best-split selection process can be expressed as
(n, d, t)_best = argmax_{(n,d,t)} [ E(n) − ( E(n, d, t)_l + E(n, d, t)_r ) ],    (3.1)

where E(n) represents the error estimate at node n, d represents an input variable (d = 1, ..., D), t is a particular threshold value, and E(n, d, t)_l and E(n, d, t)_r correspond to the error estimates for the points at node n residing on the two (i.e., "left" and "right") sides of the partitioning threshold x_d = t.
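A direct (unoptimized) implementation of this search, and of its aggregated NTN variant in equation 3.2 below, can be sketched as follows, using the sum of squared deviations from the cell mean as the error estimate E (a sketch under our own naming; the paper does not prescribe this exact code):

```python
import numpy as np

def sse(y):
    """Error estimate E: sum of squared deviations from the cell mean."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split_cart(X, y, leaves):
    """Equation 3.1: search every (leaf n, dimension d, threshold t) and
    return the single split with the largest error decrease."""
    best, best_gain = None, 0.0
    for n, idx in enumerate(leaves):            # idx: training points in leaf n
        e_n = sse(y[idx])
        for d in range(X.shape[1]):
            for t in np.unique(X[idx, d]):
                left, right = idx[X[idx, d] <= t], idx[X[idx, d] > t]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = e_n - (sse(y[left]) + sse(y[right]))
                if gain > best_gain:
                    best, best_gain = (n, d, t), gain
    return best, best_gain

def best_split_ntn(X, y, leaves):
    """Equation 3.2: a candidate (d, t) splits *all* current leaves at once;
    the error decrease is summed over the leaves before maximizing."""
    best, best_gain = None, 0.0
    for d in range(X.shape[1]):
        for t in np.unique(X[:, d]):
            gain = sum(
                sse(y[idx])
                - (sse(y[idx[X[idx, d] <= t]]) + sse(y[idx[X[idx, d] > t]]))
                for idx in leaves)
            if gain > best_gain:
                best, best_gain = (d, t), gain
    return best, best_gain

rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 2))
y = (X[:, 0] > 0.5).astype(float)   # a step in x_0: the obvious best split
leaves = [np.arange(len(y))]        # a single root node
(n, d, t), gain = best_split_cart(X, y, leaves)
(d2, t2), gain2 = best_split_ntn(X, y, leaves)
print(d, d2, gain > 0)              # with one leaf the two criteria agree
```

With only the root leaf, the two criteria coincide; they diverge as soon as more than one leaf exists, since equation 3.2 forces the same (d, t) on every leaf.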
Generally, splitting any node at any point will reduce the error (provided that it results in a division of the points at that node), and the procedure chooses the best overall split. In the case of the NTN, the splitting occurs for all current terminal nodes. Consequently, it makes sense to choose the best split on the basis of the aggregated decrease of the training-set error, taking all nodes into account, that is,

(d, t)_best = argmax_{(d,t)} Σ_n [ E(n) − ( E(n, d, t)_l + E(n, d, t)_r ) ].    (3.2)

Note that at some nodes the gain might be zero if all points in that node lie on the same side of the threshold. In such cases, when the split is carried out, the tree will have some empty terminal nodes, which might be split again as the process continues. To avoid this undesirable effect, splits are allowed only if the node contains more than one element and the split does not result in an empty cell. In this way, no empty cells are added during the training process.

3.3 Pruning.
3.3.1 Cost-Complexity Pruning. Since every split results in an error reduction, the greedy algorithm outlined above can, in principle, continue until each leaf contains just one element. Clearly, such a network provides a good fit to the training set but can be expected to have poor generalization properties. To overcome this effect, the CART algorithm utilizes cost-complexity pruning, where the magnitude of the training-set error is weighed against the complexity of the tree, measured in terms of the number of its terminal nodes:

E = E_R + α · T̃,

where E_R is the mean squared error over the training set and T̃ is the tree size (the number of terminal nodes). After a cross-validated estimate of α is obtained, the tree is pruned by merging pairs of terminal nodes, one pair at a time, until the desired tree complexity is reached. Cost-complexity pruning can be applied directly to the tree created by the NTN algorithm. The resulting tree will no longer correspond to an NTN, since a particular threshold may be implemented for some nodes but omitted for others, even when carrying out the split would not result in the creation of an empty cell. A tree grown according to the NTN strategy and pruned according to the cost-complexity approach will be called CART-NTN. An example input-space quantization characteristic of CART-NTN is illustrated in Figure 1 (left).

3.3.2 Optimal-Depth Pruning. Cost-complexity pruning makes the resulting tree analogous to CART, since the tree can have branches of different lengths not necessarily dictated by the lack of data in the terminal nodes. To keep within the NTN framework, however, each pruning step should merge all cells at the current level. A simple pruning method conforming to the NTN methodology is to choose the depth of the tree that
Figure 1: Input-space quantization due to (left) CART-NTN and (right) NTN-CART. In CART-NTN, some of the initial quantization thresholds (indicated by heavier lines) affect the entire input domain, while others are more local, due to the CART-like pruning procedure. In NTN-CART, most of the partitioning thresholds have a global scope, just as in the NTN, but in certain local cases the splits are not implemented (the shaded regions) due to the lack of training data.
results in the best generalization error, obtained through cross-validation. Note that this type of pruning is significantly simpler than cost-complexity pruning. The order of node removal is exactly the same as the order of node generation, since we remove all nodes residing at a current level in each stage. Therefore, while the original tree is grown, each time a split is made, a cross-validated estimate of the generalization error is computed and associated with that level. After the original tree is grown, the depth corresponding to the minimum cross-validation error is chosen as the optimal depth of the final tree. This value corresponds essentially to the value of N in the case of an NTN. A tree grown according to the NTN strategy and pruned according to the optimal-depth approach will be called NTN-CART. An example input-space quantization typical of NTN-CART is shown in Figure 1 (right).

4 Numerical Experiments
In view of the extensions proposed, CART-NTN and NTN-CART can be considered intermediate architectures, bridging CART with NTN. Performance of CART, NTN, and the intermediate CART-NTN and NTN-CART networks was evaluated in a regression setting using one real and three simulated data sets, which have been used in Breiman (1996a). One hundred random variants of each data set were considered. For the simulated data, the different variants were generated using the formulas given below; for
the real data set, the different samples were obtained by a random split of the whole set into its training and test components. In all experiments with bagged networks, 25 bagging repetitions were used. The real-world data set Boston corresponds to the Boston Housing experiment (Belsley, Kuh, & Welsh, 1980), where a mix of 13 social and environmental variables is used to estimate real estate prices. The total data set consists of 506 elements; in each random case, 481 points were used to train a network and the remaining 25 were used for testing. The three artificial data sets are due to Friedman (1991). In each random case, a training set consisted of 200 points and a test set contained 1000 points. The data set Friedman-1 is generated according to the formula

y = 10·sin(π·x1·x2) + 20·(x3 − 0.5)^2 + 10·x4 + 5·x5 + e,    e ~ N(0, 1),

where [x1, ..., x10] is a 10-dimensional input vector whose elements represent independent variables distributed uniformly within [0, 1], and e represents a normally distributed noise component. Note that only 5 of the 10 variables have any influence on the output. In the case of data sets Friedman-2 and Friedman-3, the variables of the four-dimensional input space are independent and distributed uniformly in the ranges x1 ∈ [0, 100], x2 ∈ [40π, 560π], x3 ∈ [0, 1], x4 ∈ [1, 11]. For the data set Friedman-2, the output variable is generated according to

y = sqrt( x1^2 + (x2·x3 − 1/(x2·x4))^2 ) + e,    e ~ N(0, 127),

while for the Friedman-3 data set, the output variable is given by

y = arctan( (x2·x3 − 1/(x2·x4)) / x1 ) + e,    e ~ N(0, 0.106).

4.1 Network Implementation. CART was implemented according to the description provided in Breiman et al. (1984). The initial tree was grown by successive terminal-node splitting (see equation 3.1) while at least one tree leaf contained more than five points and a split reducing the training-set error was possible. Cost-complexity pruning reduced the number of terminal nodes once the original tree had been grown; tenfold cross-validation was used to estimate the optimum pruning parameter α. Both CART-NTN and NTN-CART followed a similar initial tree-growing process, except that all current terminal nodes were split at each step, if possible, according to equation 3.2. In the case of CART-NTN, cost-complexity pruning identical to the one used in CART was then applied, while NTN-CART found the optimal tree depth, minimizing the estimated generalization error obtained via 10-fold cross-validation. In the case of the NTN, in all experiments all inputs were mapped to the range 0, ..., 255 and thermometer encoded to form a binary retina of size R = 256·D, where D is the number of input variables. In one set of experiments, the number K of N-tuple memories was chosen to allow complete sampling of the retina bits for a given value of N, that is, K ≈ R/N.
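For reference, the three Friedman generators described above can be sketched as follows (a sketch; the text does not state whether the quoted noise parameters are standard deviations or variances, so they are treated as standard deviations here):

```python
import numpy as np

def friedman1(n, rng):
    """y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e, e ~ N(0, 1);
    x is 10-dimensional, but only the first five inputs affect the output."""
    x = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3] + 5 * x[:, 4] + rng.normal(0, 1, n))
    return x, y

def _inputs23(n, rng):
    """Shared input distribution for Friedman-2 and Friedman-3."""
    return (rng.uniform(0, 100, n), rng.uniform(40 * np.pi, 560 * np.pi, n),
            rng.uniform(0, 1, n), rng.uniform(1, 11, n))

def friedman2(n, rng):
    x1, x2, x3, x4 = _inputs23(n, rng)
    y = np.sqrt(x1 ** 2 + (x2 * x3 - 1 / (x2 * x4)) ** 2) + rng.normal(0, 127, n)
    return np.column_stack([x1, x2, x3, x4]), y

def friedman3(n, rng):
    x1, x2, x3, x4 = _inputs23(n, rng)
    y = np.arctan((x2 * x3 - 1 / (x2 * x4)) / x1) + rng.normal(0, 0.106, n)
    return np.column_stack([x1, x2, x3, x4]), y

rng = np.random.default_rng(0)
X, y = friedman1(200, rng)   # a training set of the size used in the text
print(X.shape, y.shape)
```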
Another set of experiments used a generally larger but fixed number of N-tuple nodes (K = 500). For each variant of the data set, 25 random N-tuple retina samplings were considered, and the test-set errors obtained in each case were then averaged. The N-tuple memories were implemented using hashing (Knuth, 1973), with each memory node containing 5000 locations. In this way, large N-tuple sizes could be readily considered, and since the training-set sizes were in all cases fewer than 500 points, the probability of hash collisions was negligibly small. To reduce the generalization error due to empty memory locations, the NTN mapping was modeled as y = b + NTN(x), with the bias b estimated as the average of the training-set values. The network response formation was considered in two variants. In one, the output response was calculated according to Kołcz and Allinson (1996),

NTN(x) = ( Σ_{k=1}^{K} w_k(x) ) / ( Σ_{k=1}^{K} c_k(x) ) = ( Σ_{t=1}^{T} y^t · κ(x, x^t) ) / ( Σ_{t=1}^{T} κ(x, x^t) ),    (4.1)

where T is the size of the training set, c_k(x) is the counter associated with weight w_k(x), (x^t, y^t) represents the t-th input-output training pair, and κ(x, x^t) is the equivalent kernel function of the NTN. In a one-pass training procedure, w_k(x) is set to the sum of the training values of the points addressing that memory location, and c_k(x) is set to their number. The other form of NTN response was produced according to

NTN(x) = (1/K) Σ_{k=1}^{K} w_k(x) / c_k(x),    (4.2)
with the weights and counters set in the same way. The latter variant directly corresponds to the CART network, since the response of each memory location represents an average of the training-point values that address that location and hence fall into the corresponding quantization cell. Since equation 4.2 makes NTN directly related to the remaining networks, it has been chosen as the basis for performance comparison.

4.2 Discussion of Results. The combined results for all networks and data sets considered are presented in Table 1. The complete results for both variants of the NTN and all values of N considered are shown in Figure 2.

Table 1: Combined Mean Squared Error (MSE) Results for All Data Sets.

Friedman-1 data set:
  Network    MSE     MSE-bag  Decrease  ATD  ATD-bag
  CART       13.1    9.0      31%       5    7
  CART-NTN   12.2    8.7      29%       4    4
  NTN-CART   11.4    6.9      40%       4    9
  NTN        12.7    12.3     3%        25   25

Friedman-2 data set:
  Network    MSE     MSE-bag  Decrease  ATD  ATD-bag
  CART       36,438  24,840   32%       8    9
  CART-NTN   37,930  26,297   31%       4    4
  NTN-CART   36,510  24,911   32%       5    14
  NTN        32,437  30,797   5%        20   20

Friedman-3 data set:
  Network    MSE     MSE-bag  Decrease  ATD  ATD-bag
  CART       0.0600  0.0370   38%       4    6
  CART-NTN   0.0544  0.0346   36%       4    4
  NTN-CART   0.0449  0.0281   37%       5    14
  NTN        0.045   0.044    2%        20   20

Boston data set:
  Network    MSE     MSE-bag  Decrease  ATD  ATD-bag
  CART       27.8    16.2     42%       6    7
  CART-NTN   31.4    16.2     48%       5    5
  NTN-CART   22.6    13.1     42%       8    25
  NTN        16.7    15.9     5%        40   45

Notes: MSE-bag refers to the error obtained for the bagged variants of CART, CART-NTN, and NTN-CART, and for the variant of NTN with the number of N-tuple nodes fixed to K = 500 (in the MSE column, the number of nodes of an NTN is just large enough to cover the retina with N-tuples). The decrease in the test-set error due to bagging is indicated in column 4, and the average tree depths (ATD) (and the value of N for the NTN) for the standard and bagged variants of the networks are given in columns 5 and 6, respectively. The NTN results correspond to the optimal values of N.

The accuracy of CART and CART-NTN was closely matched in all cases (with similar depths of the trees produced), which is not surprising since the same pruning algorithm was applied in both cases. The advantage of one network over the other proved to be data dependent: for the Friedman-1 and Friedman-3 experiments, CART-NTN was slightly better than CART, while for the Friedman-2 and Boston experiments, the situation was reversed. It appears that the depths of the trees produced by CART-NTN were unaffected by bagging, whereas for both CART and NTN-CART, bagging led to deeper trees. NTN-CART proved to be superior to both CART and CART-NTN for all data sets considered, except for the Friedman-2 experiment, where its performance was essentially equal to that of CART. This can probably be attributed to the optimal-depth pruning used by NTN-CART, which resulted in trees of higher complexity than those used by the other networks. This may indicate that the cost-complexity pruning applied to CART and CART-NTN can prune the trees too much in certain cases. Interestingly, the distribution-based trees proposed in Breiman (1996b) also combined higher tree complexity with better prediction accuracy. All three tree networks demonstrated similar levels of sensitivity to the training data, ranging between 29% and 48% for the data sets considered, where the sensitivity is measured by the decrease of the test-set error due
[Figure 2 appears here: four panels plotting prediction error against N, titled NTN: BOSTON (Kmin=67, Kmax=666), NTN: FRIEDMAN 1 (Kmin=52, Kmax=513), NTN: FRIEDMAN 2 (Kmin=21, Kmax=205), and NTN: FRIEDMAN 3 (Kmin=21, Kmax=205), each showing "tree," "kernel," "tree 500," and "kernel 500" curves.]

Figure 2: Combined NTN results for the data sets considered and different values of N. The continuous graph lines correspond to NTN with a retina-size-dependent K, while the marker graphs refer to the fixed-K cases. The "kernel" label refers to the response of an NTN according to equation 4.1, and "tree" according to equation 4.2.
to bagging. This suggests that although CART-NTN and NTN-CART have more data at their disposal than CART when making node-splitting decisions, the splitting process is inherently less stable than in CART, since each split affects the whole of the input space. As far as the N-tuple networks are concerned, their performance proved to be comparable to, and in some cases better than, the performance of single-node CART, CART-NTN, and NTN-CART, but generally worse than the bagged versions of these networks. This is largely expected and can be attributed to the random nature of N-tuple sampling, which is far from optimal unless the joint distribution of the input-output space is actually uniform. Increasing the resolution of the input retina by using more than 256 quantization levels per input variable did not result in a significant
improvement of NTN performance. In agreement with the results published in Rohwer and Lamb (1993) and Kołcz and Allinson (1996), the optimal values of N tend to be rather large compared to the ones usually used in implementations of the NTN. Figure 2 confirms that increasing the number of N-tuple nodes improves the network performance. The improvement is often small, which may be attributed to the random nature of N-tuple sampling, which does not take full advantage of the distribution-related information provided by the training data. The performance increase due to large K is more pronounced for the synthetic data sets, for which the input-space distribution is truly uniform. The results obtained for the NTN indicate a clear advantage of combining the individual node responses according to equation 4.2 when compared to the regression-NTN method of equation 4.1. The explanation lies in the higher bias induced by equation 4.1. Let {w_1, ..., w_K} denote the sums of training values in the cells selected for a particular input, and let {a_1, ..., a_K} and {c_1, ..., c_K} correspond to the averages and numbers of training points in these cells, respectively. The node-aggregation method analogous to bagged CART (see equation 4.2) produces the response

(1/K) Σ_{k=1}^{K} w_k / c_k = (1/K) Σ_{k=1}^{K} a_k,

while the response of the regression NTN (see equation 4.1) is given by

( Σ_{k=1}^{K} w_k ) / ( Σ_{k=1}^{K} c_k ) = ( Σ_{k=1}^{K} c_k · a_k ) / ( Σ_{k=1}^{K} c_k ) = Σ_{k=1}^{K} β_k · a_k,

where β_k is proportional to the number of training points in the k-th cell and Σ_{k=1}^{K} β_k = 1. Thus, cell averages generated by a larger number of points have more influence on the output in this case. But a larger number of training points falling into a cell is likely to be associated with a larger "footprint" of the cell in the input space, in which many of the points influencing the cell average a_k are likely to be farther away from the actual input point for which the network response is computed. Consequently, a higher bias will result. For larger values of N, the partitioning cells are small enough for the diluting effects of equation 4.1 to become insignificant (see Figure 2). Note that if all cells selected by x contain an identical number of points, the responses produced according to equations 4.2 and 4.1 will be identical.
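The bias argument above can be checked numerically on a toy example (made-up cell contents: w_k are sums of targets, c_k counts, a_k = w_k/c_k):

```python
# Toy contents of the K = 3 cells selected by some input x:
w = [2.0, 9.0, 1.0]                    # sums of training targets per cell
c = [1, 9, 1]                          # numbers of training points per cell
a = [wk / ck for wk, ck in zip(w, c)]  # cell averages a_k = [2.0, 1.0, 1.0]

tree_style = sum(a) / len(a)           # equation 4.2: plain mean of averages
kernel_style = sum(w) / sum(c)         # equation 4.1: count-weighted mean
beta = [ck / sum(c) for ck in c]       # implicit weights beta_k, summing to 1

# Equation 4.1 is exactly the beta-weighted combination of cell averages:
assert abs(kernel_style - sum(b * ak for b, ak in zip(beta, a))) < 1e-12

print(tree_style)    # 4/3: every cell counts equally
print(kernel_style)  # 12/11: the 9-point cell dominates the response
```

Setting all counts equal makes the two responses coincide, matching the closing remark of this section.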
5 Conclusion
The analogy between a single N-tuple memory node and a tree network, as well as between N-tuple sampling and bagging, was used to propose a tree network (in two variants, CART-NTN and NTN-CART) combining the data-driven, tree-growing strategy of CART with the structure of input-space quantization of an NTN node. The concept of bagging was applied to create a number of such network nodes in a data-driven instead of a purely random way. A simple tree-pruning method maintaining a close similarity to the original NTN architecture has also been suggested. NTN-CART, combining all these features, provides a data-driven, tree-based version of an NTN using the concept of bagging for the generation of multiple network nodes. Experimental results show that bagged NTN-CART provides an attractive alternative to the NTN, since its data-driven approach to network design is able to match the placement of quantization thresholds to the empirical data distribution provided by the training set. The performance increase is accompanied by a smaller number of nodes, a smaller value of N, and elimination of the problem of empty memory locations characteristic of N-tuple networks. At the same time, the simpler pruning proposed for NTN-CART can also have advantages for this network when compared with the traditional cost-complexity pruning of CART.

References

Aleksander, I., & Stonham, T. J. (1979). A guide to pattern recognition using random-access memories. IEE Proceedings—E Computers and Digital Techniques, 2(1), 29–40.
Belsley, D., Kuh, E., & Welsh, R. (1980). Regression diagnostics. New York: Wiley.
Bledsoe, W. W., & Bisson, C. L. (1962). Improved memory matrices for the N-tuple pattern recognition method. IRE Transactions on Electronic Computers, EC-11, 414–415.
Bledsoe, W., & Browning, I. (1959). Pattern recognition and reading by machine. IRE Joint Computer Conference, pp. 225–232.
Bradshaw, N. P. (1997). The effective VC dimension of the N-tuple classifier. In W. Gerstner, A. Germond, M. Hasler, & J. D. Nicoud (Eds.), Artificial neural networks—ICANN '97. 7th International Conference Proceedings (pp. 511–516). Berlin: Springer-Verlag.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996b). Distribution based trees are more accurate. In S. I. Amari, L. Xu, L. W. Chan, I. King, & K. S. Leung (Eds.), Progress in neural information processing: Proceedings of the International Conference on Neural Information Processing (pp. 133–138). Berlin: Springer-Verlag.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks/Cole.
304
Aleksander Ko lcz
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1), 1–141. Jung, D. M., Krishnamoorthy, M. S., Nagy, G., & Shapira, A. (1996). N-tuple features for OCR revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), 734–745. Knuth, D. E. (1973). The art of computer programming (Vol. 3). Reading, MA: Addison-Wesley. Ko lcz, A., & Allinson, N. M. (1996).N-tuple regression network. Neural Networks, 9, 855–869. Luttrell, S. P. (1992). Gibbs distribution theory of adaptive N-tuple networks. In I. Aleksander & J. Taylor (Eds.), Articial neural networks 2: Proceedings of the 1992 International Conference on Articial Neural Networks (pp. 313–316). New York: Elsevier. Rohwer, R. J. (1995).Two Bayesian treatments of the N-tuple recognition method. Proceedings of the 4th International Conference on Articial Neural Networks (pp. 171–176). London: IEE. Rohwer, R. J., & Lamb, A. (1993). An exploration of the effect of super large N-tuples on single layer RAMnets. In N. M. Allinson (Ed.), Proceedings of Weightless Neural Network Workshop ’93 (pp. 33–37). York, UK: University of York. Rohwer, R., & Morciniec, M. (1998). The theoretical and experimental status of the N-tuple classier. Neural Networks, 11(1), 1–14. Roy, R. J., & Sherman, J. (1967). Two viewpoints of K-tuple pattern recognition. IEEE Transactions on System Science and Cybernetics, 3(2), 117–120. Steck, G. P. (1962). Stochastic model for the Bledsoe-Browning pattern recognition scheme. IRE Transactions on Electronic Computers, 11, 274–282. Tattersall, G. D., Foster, S., & Johnston, R. D. (1991). Single-layer lookup perceptrons. IEE Proc. F, 138, 46–54. Received May 1, 1998; accepted February 14, 1999.
NOTE
Communicated by Robert Tibshirani
Improving the Practice of Classifier Performance Assessment

N. M. Adams
D. J. Hand
Department of Mathematics, Imperial College, Huxley Building, 180 Queen's Gate, London, SW7 2BZ, U.K.

In this note we use examples from the literature to illustrate some poor practices in assessing the performance of supervised classification rules, and we suggest guidelines for better methodology. We also describe a new assessment criterion that is suitable for the needs of many practical problems.

1 Introduction
The construction of classification tools (Hand, 1997) in real applications can involve a sophisticated design stage, but this is often followed by a much less sophisticated, and sometimes inadequate, performance assessment stage. The consequence of poor assessment methodology is that an inferior rule may be chosen. In order to assess performance, one must first decide on the criteria for measuring it. However, in many applications, assessment criteria are chosen that do not match the problem very well. Restricting ourselves to the two-class case, two criteria are in common use. One is error rate, or misclassification rate, which assumes that misallocation costs (the respective costs, or utilities, of classifying a class 0 object as class 1, and the reverse) are equal. The other is the area under a receiver operating characteristic (ROC) curve, or the equivalent Gini index, which assumes nothing whatever is known about misallocation costs. Both of these assumptions are unrealistic: costs are unlikely to be equal, and though they are seldom known precisely, it is very rare that nothing is known about them. Using real examples from the literature, our purpose here is to describe shortcomings of routine practice in assessing classifier performance and present a set of guidelines that may be useful. In addition, we describe a new loss-based method for comparative analysis of pairs of classifiers. We begin with the confusion matrix, which shows the cross-classification of predicted class by true class:
                 True
                 0    1
   Predicted 0   a    b
             1   c    d

Neural Computation 12, 305–311 (2000)
© 2000 Massachusetts Institute of Technology
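As a concrete illustration of the quantities derived from this matrix later in the note (sensitivity, specificity, and expected loss), here is a minimal sketch; the counts, priors, and costs below are hypothetical, not taken from the note:

```python
# Hypothetical confusion-matrix counts (rows: predicted class, columns:
# true class). a = predict 0 & true 0, b = predict 0 & true 1,
# c = predict 1 & true 0, d = predict 1 & true 1.
a, b, c, d = 80, 20, 10, 90
n = a + b + c + d

se = a / (a + c)   # sensitivity, Se = a / (a + c)
sp = d / (b + d)   # specificity, Sp = d / (b + d)

# Hypothetical priors and misallocation costs: c0 is the cost of
# classifying a diseased (class 0) person as healthy, c1 the reverse.
pi0, pi1 = 0.3, 0.7
c0, c1 = 5.0, 1.0

# Expected loss: (1 - Se) * pi0 * c0 + (1 - Sp) * pi1 * c1.
expected_loss = (1 - se) * pi0 * c0 + (1 - sp) * pi1 * c1
```

With equal costs c0 = c1, this loss reduces (up to a constant) to the error rate; unequal costs change which classifier minimizes it.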
where a + b + c + d = n and n is the total number of objects to which the classifier has been applied. In order to obtain unbiased performance estimates, one would normally apply the classifier to an independent test set or use more sophisticated methods, but our purpose here is to discuss performance assessment measures rather than issues of estimation. Suppose that class 0 represents diseased people and class 1 represents healthy people. Then a/(a + c) is called sensitivity, denoted Se, and d/(b + d) is called specificity, denoted Sp. If we knew that the exact cost of misclassifying a diseased individual as healthy was c0, and the reverse misallocation cost was c1, then the overall expected loss from using the rule would be (1 − Se)π0c0 + (1 − Sp)π1c1, where πi refers to the prior probability of class i. Loss is the appropriate criterion for many classification problems. Misclassification rate corresponds to equal costs. However, eliciting misallocation costs in real situations can be extremely difficult, and instead people often adopt the area under an ROC curve. An ROC curve is (normally) a plot of sensitivity on the vertical axis and (1 − specificity) on the horizontal axis. The area under this curve (denoted AUC) is a popular summary measure of performance that incorporates all possible costs, so that one does not have to choose a particular cost pair (Bradley, 1997). However, this goes to the other extreme from the explicit loss value above and assumes that nothing can be elicited about costs. This is just as unrealistic as assuming the costs are known exactly.

2 Examples from the Literature
In this section, we discuss five examples from the applications literature that could have used better assessment methodology. Our purpose is not to criticize these authors, since many of these practices seem endemic, but rather to demonstrate scope for improvement in the practice of classifier assessment.

2.1 Assuming Equal Costs. Koziol, Adams, and Frutos (1996) were interested in seeing if certain indicators were useful for classifying interstitial cystitis cases into ulcer and nonulcer categories. They compared LDA (linear discriminant analysis), logistic regression, and recursive partitioning using error rate. However, it is very rare in a medical context that the two kinds of misclassification are equally serious. Typically (though not always) allocating a case as a noncase is more serious (i.e., incurs a greater cost) than the reverse.
2.2 Integrating over Costs. Page et al. (1996) discuss the use of neural networks to discriminate between the brain scans of healthy subjects and Alzheimer's patients. They conclude that for this purpose, neural nets are better than the conventional statistical method (LDA) on the basis of the AUC. The AUC summarizes classifier performance across the entire cost range. It is difficult to believe that in this problem, the entire range of possible relative costs is equally legitimate. This would mean, for example, that one was prepared to entertain the possibility that misclassifying an Alzheimer's patient as healthy was vastly more serious than the reverse, and also to entertain the possibility that misclassifying a healthy subject as suffering from Alzheimer's was vastly more serious than the reverse. It is unlikely that both cost allocations would be regarded as legitimate. It is much more likely that a narrower range would be appropriate: one type of misclassification is more serious than the other, even if one cannot say by how much.

2.3 Crossing ROC Curves. Comparing the AUCs of two classifiers is appropriate only if one curve entirely dominates the other (one classifier is better than the other for all costs). If ROC curves cross, as they do in Page et al. (1996, p. 199), different classifiers are better for different values of the costs, and summarizing over all possible costs could lead to completely incorrect conclusions.

2.4 Fixing Costs. There are very few studies in the applications literature that use explicit misallocation costs. This is because there are normally intrinsic uncertainties surrounding the outcome (just how serious is one type of misdiagnosis compared with the other?). Even in domains where one might have expected precise costs to be available, such as commercial and banking domains, they are rare.
Michie, Spiegelhalter, and Taylor (1994) give precise costs for a problem of classifying loan applicants into good and bad classes, but in our experience such misallocation costs are seldom available (Hand & Henley, 1997). The costs depend on profitability, which involves issues such as marketing and advertising costs. Moreover, the relative misclassification costs are likely to change over time as the economy evolves. Although it is probable that a range of likely costs can be given, it is improbable that exact ones can be given.

2.5 Variability. Dybowski, Weller, Chang, and Gant (1996) discuss the decision-making process involved in admitting patients to intensive care. They conducted a study to assess the performance of various neural network models and logistic discrimination in predicting death during admission, specifically in-hospital mortality among patients with systemic inflammatory response syndrome and hemodynamic shock. Comparison
was made using both AUC and Brier score (Hand, 1997). However, no standard error is quoted for any of the summary statistics for the competing classifiers, and no significance test results are given. The chosen architecture may have had the highest AUC purely by chance.

3 Guidelines for Comparing Classifiers
Our examples illustrate that equal costs (error rate) are rarely an appropriate assumption, that assuming nothing is known about the costs is unrealistic, that assuming they are known exactly is often unrealistic, that AUC may be entirely inappropriate if ROC curves cross, and that in comparing rules, the variation in the performance statistics must be taken into account. We could easily have picked other articles to demonstrate these points. In general, the assessment criteria must be chosen to match the objectives. Methodological and simulation studies outside any explicit context may have little bearing on real problems. In particular, it is important to take into account what is known about the relative misclassification costs, because this can radically influence the choice of method. In section 4 we describe a performance measure that does this, making use of what can be said about the costs (since something can always be said) even if precise values cannot be given.

4 LC Assessment Method
In this section we outline the LC assessment method (described in detail in Adams and Hand, 1999), using the German credit data from the STATLOG project (Michie, Spiegelhalter, & Taylor, 1994). These data consist of 24 variables measured on 1000 people: 700 regarded as good credit risks (class 0) and 300 regarded as bad credit risks (class 1). The data were divided into training and test sets of 800 and 200 observations, respectively. The ROC curves for linear and quadratic discriminant analysis (QDA), estimated on the training set and applied to the test set, are given in Figure 1. In terms of the shortcomings described in section 2, this comparison involves crossing ROC curves. The AUCs (and associated standard errors) for LDA and QDA are 0.8371 (0.0268) and 0.8135 (0.0288), respectively. These summaries provide no evidence to favor either classifier, although published studies may ignore the estimates of standard error. Such a comparison also involves integrating over all possible costs and ignoring estimates of variability, both shortcomings described above. In fact, the German credit data come with a misallocation matrix, which states that the severity of misclassifying a bad risk is five times that of misclassifying a good risk. With this extra information, standard ROC curve analysis is certainly inappropriate. However, our experience (Hand & Henley, 1997;
[Figure 1 plots F̂(t|0) on the vertical axis against F̂(t|1) on the horizontal axis, both running from 0.0 to 1.0, with one curve each for LDA and QDA.]
Figure 1: ROC curves for LDA and QDA applied to the German credit data.
Hand & Jacka, 1998) suggests that such precision in the definition of misallocation costs for credit problems is misplaced, and we therefore specify a range of 3 to 7 for this misclassification cost, with the most probable value remaining at the quoted value of 5. It is not meaningful, for a given classifier, to compare the losses arising from two different cost pairs (c0, c1) and (c0', c1'). All one can do is compare the losses from different classifiers at a fixed cost pair. Since the losses due to different cost pairs cannot be compared, the only useful information in comparing different classifiers is whether one classifier has a smaller loss than the other for particular cost pairs. It follows that the values of a pair (c0, c1) are not important, only the relative sizes of c0 and c1. For practical reasons outlined in Adams and Hand (1999), it is convenient to impose the constraint c0 + c1 = 1 and work just in terms of c1. Domain experts can normally specify upper and lower bounds on the possible values c1 can take (although in our experience they do this most comfortably in terms of the cost ratio c1/c0), and they can also specify a most likely value for c1. We use these three values to define a belief distribution over the possible values of c1. For reasons explained in Adams and Hand (1999), we take a triangular function with support defined by the upper and lower bounds and maximum at the most likely value. This belief function, L(c1), tells us how probable the expert thinks it is that the different values of c1 will occur.
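The triangular belief function L(c1), and the classifier comparison it feeds into, can be sketched numerically. The support, mode, priors, and the (Se, Sp) pairs of the two classifiers below are illustrative assumptions, not the German credit estimates:

```python
import numpy as np

# Triangular belief density over c1 with support [lo, hi] and peak at
# the most likely value. All numbers here are hypothetical.
lo, mode, hi = 0.6, 0.8, 0.9
pi0, pi1 = 0.7, 0.3            # illustrative class priors

def tri_density(c1):
    c1 = np.asarray(c1, dtype=float)
    up = 2 * (c1 - lo) / ((hi - lo) * (mode - lo))
    down = 2 * (hi - c1) / ((hi - lo) * (hi - mode))
    return np.where(c1 < mode, up, down) * ((c1 >= lo) & (c1 <= hi))

def loss(se, sp, c1):
    # Expected loss with the constraint c0 + c1 = 1.
    return (1 - se) * pi0 * (1 - c1) + (1 - sp) * pi1 * c1

# Indicator of which classifier has the smaller loss at each c1, then
# LC = integral of I(c1) * L(c1) dc1, approximated on a fine grid.
grid = np.linspace(lo, hi, 20001)
dx = grid[1] - grid[0]
loss_a = loss(0.85, 0.70, grid)   # hypothetical classifier A
loss_b = loss(0.75, 0.80, grid)   # hypothetical classifier B
lc_a = float(np.sum((loss_a < loss_b) * tri_density(grid)) * dx)
lc_b = float(np.sum((loss_b < loss_a) * tri_density(grid)) * dx)
```

A bootstrap estimate of the standard error of this index, as the note goes on to describe, would simply recompute lc_a and lc_b on resamples of the test set.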
For each possible value of c1, we can see which classifier incurs the smallest loss. This allows us to define an indicator function I(c1) for each classifier that takes the value 0 if the classifier does not have the smallest loss for a given value of c1 and the value 1 if it does. For each classifier we combine I and L as

LC = \int I(c_1) L(c_1) \, dc_1

to give an overall measure of expected performance. The best classifier is that which gives the largest value of the LC index. (Adams & Hand, 1999, are concerned with only two classifiers and work with the difference of the two LC values.) Returning to the example, we have the predictions of LDA and QDA on the test set and the belief distribution defined earlier. The estimates in this case are LC_LDA = 1 and LC_QDA = 0. Superficially, then, this is strong evidence in favor of LDA. Estimates of standard error for the LC index are not as readily obtained as estimates of standard error for more conventional summaries like error rate or AUC. The approach we adopt in Adams and Hand (2000) is to obtain a bootstrap estimate (Efron & Tibshirani, 1993; Davison & Hinkley, 1997) of the standard error of the LC index. This involves drawing random resamples (of the same size as the test set) with replacement from the test set. The LC index for the two classifiers is computed on each resample. The standard deviation of the resampled indexes is the bootstrap estimate of standard error. Since this example involves two classifiers, the LC index is a comparative measure (complementary results are obtained for the two classifiers), so it is not necessary to use a paired test: the ratio of LC index to bootstrap standard error can serve as a test statistic. In this case, the standard error (common to both classifiers) is 0.0224. Using LC analysis in this case has come to a firm conclusion using an appropriate cost interval and incorporating variability estimates.

5 Conclusion
Efforts to optimize classifier performance can be wasted by the use of an inappropriate assessment criterion: a classifier that is suboptimal with respect to the characteristics of the problem may be installed. We have presented some guidelines and a tool that should help reduce these problems.

Acknowledgments
The work of N. A. on this project was supported by grant GR/K55219 from the EPSRC, under its “Neural Networks—The Key Questions” initiative.
References

Adams, N. M., & Hand, D. J. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139–1147.
Adams, N. M., & Hand, D. J. (2000). An improved method for comparing diagnostic tests. To appear in Computers in Biology and Medicine.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Dybowski, R., Weller, P., Chang, R., & Gant, V. (1996). Prediction of outcome in critically ill patients using artificial neural networks synthesized by genetic algorithm. Lancet, 347, 1146–1150.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Hand, D. J. (1997). Construction and assessment of classification rules. Chichester: Wiley.
Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society, Series A, 160(3), 523–541.
Hand, D. J., & Jacka, S. D. (1998). Statistics in finance. London: Arnold.
Koziol, J. A., Adams, H. P., & Frutos, A. (1996). Discrimination between the ulcerous and the nonulcerous forms of interstitial cystitis by noninvasive findings. Journal of Urology, 155(1), 89–90.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural networks and statistical classification. New York: Ellis Horwood.
Page, M. P. A., Howard, R. J., O'Brien, J. T., Buxton-Thomas, M. S., & Pickering, A. D. (1996). Use of neural networks in brain SPECT to diagnose Alzheimer's disease. Journal of Nuclear Medicine, 37(2), 195–200.

Received August 24, 1998; accepted March 1, 1999.
LETTER
Communicated by Alexandre Pouget
Do Simple Cells in Primary Visual Cortex Form a Tight Frame?

Emilio Salinas
Instituto de Fisiología Celular, UNAM, Ciudad Universitaria S/N, 04510 México D.F., México

L. F. Abbott
Volen Center for Complex Systems and Department of Biology, Brandeis University, Waltham, MA 02254-9110, U.S.A.

Sets of neuronal tuning curves, which describe the responses of neurons as functions of a stimulus, can serve as a basis for approximating other functions of stimulus parameters. In a function-approximating network, synaptic weights determined by a correlation-based Hebbian rule are closely related to the coefficients that result when a function is expanded in an orthogonal basis. Although neuronal tuning curves typically are not orthogonal functions, the relationship between function approximation and correlation-based synaptic weights can be retained if the tuning curves satisfy the conditions of a tight frame. We examine whether the spatial receptive fields of simple cells in cat and monkey primary visual cortex (V1) form a tight frame, allowing them to serve as a basis for constructing more complicated extrastriate receptive fields using correlation-based synaptic weights. Our calculations show that the set of V1 simple cell receptive fields is not tight enough to account for the acuity observed psychophysically.

1 Introduction
There are a number of parallels between representation and learning in neural networks and methods of functional expansion in mathematical analysis. Populations of neurons with responses tuned to properties of a stimulus may be well suited for approximating functions of stimulus parameters (Poggio, 1990; Poggio & Girosi, 1990; Girosi, Jones, & Poggio, 1995). If certain conditions are satisfied, such a population can act as a highly efficient substrate for function representation. Accurate representation of a given function depends critically on the synaptic weights connecting input and output neurons. These weights are often adjusted with a delta-type learning rule that is efficient but not biologically plausible. However, if the tuning curves of the input neurons form what is called a tight frame (Daubechies, Grossmann, & Meyer, 1986; Daubechies, 1990, 1992), function approximation can be achieved using a more biologically reasonable Hebbian or correlation-

Neural Computation 12, 313–335 (2000)
© 2000 Massachusetts Institute of Technology
Figure 1: A feedforward network for function approximation. The input neurons at the bottom of the figure respond with firing rates that are functions of the stimulus variable x. The firing rate of input neuron j is fj(x). The upper output neuron is driven by the input neurons through synaptic connections; wj represents the strength, or weight, of the synapse from input neuron j to the output neuron. The task is to make the firing rate R of the output neuron approximate a given function F(x).
based rule for the synaptic weights. We investigate here whether the simple cells of primary visual cortex provide such a basis. To introduce the hypothesis to be tested, we begin with a simple example. Consider a population of N neurons that respond to a stimulus by firing at rates that depend on a single stimulus parameter x. The firing rate of neuron j as a function of x, also known as the response tuning curve, is fj(x). We wish to examine whether such a population can generate an approximation of another function F(x) that is not equal to any of the input tuning curves. Figure 1 shows a simple network for function approximation. In this network, a population of neurons acting as an input layer drives a single output neuron through a set of synapses. The target function F is represented by the firing rate of the output neuron. The weight of the synapse connecting input neuron j to the output neuron is denoted by wj. Following standard neural network modeling practice, we write the firing rate R of the output neuron in response to the input x as a synaptically weighted sum of the input firing rates,

R = \sum_{j=1}^{N} w_j f_j(x).   (1.1)
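A minimal numerical sketch of this weighted-sum readout (equation 1.1); the gaussian tuning curves, their centers, and their width are illustrative choices, not taken from the text:

```python
import numpy as np

# N input neurons with gaussian tuning curves f_j(x); the output rate is
# R(x) = sum_j w_j f_j(x), as in equation 1.1. Parameters are arbitrary.
N = 20
centers = np.linspace(-np.pi, np.pi, N)
sigma = 0.5

def f(j, x):
    # Tuning curve of input neuron j.
    return np.exp(-(x - centers[j]) ** 2 / (2 * sigma ** 2))

def R(x, w):
    # Equation 1.1: synaptically weighted sum of input firing rates.
    return sum(w[j] * f(j, x) for j in range(N))
```

With the weights w chosen appropriately, R(x) can be made to approximate a target function F(x) over the range of the centers.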
If R = F(x), or at least R ≈ F(x) to a sufficiently high degree of accuracy, the tuning curve of the output neuron will provide a representation of the target function. The problem is to find a set of input tuning curves, and the synaptic weights corresponding to these tuning curves, that produces an output response close to the target function. The expansion of a function as a weighted sum of other functions is a standard problem in mathematical analysis. In the limit N → ∞, any set of functions that provides a complete basis over the range of x values considered can be used. Completeness is not very restrictive, but an additional condition is usually applied. The weights wj that make

F(x) = \sum_{j=1}^{\infty} w_j f_j(x)   (1.2)
can be computed easily if the basis functions fj are orthogonal and normalized so that

\int dy \, f_i(y) f_j(y) = \delta_{ij},   (1.3)

where δ_ii = 1 for all i and δ_ij = 0 for all i ≠ j. In this case the weights are given by

w_j = \int dy \, f_j(y) F(y).   (1.4)

Equation 1.4 for the weights is closely related to forms of Hebbian and correlation-based learning commonly used in neural networks. A correlation-based rule sets the synaptic weights to values given by

w_j = \langle f_j R \rangle,   (1.5)

where the brackets indicate an average over stimuli. Equation 1.5 is equivalent to the statement that the synaptic weight is proportional to the correlation of the firing rates of the pre- and postsynaptic neurons. The idea that synaptic weights are related to correlations in pre- and postsynaptic firing is a standard hypothesis going back to the work of Hebb (1949). Often this principle is expressed in the form of a learning rule that adjusts the weights incrementally toward their desired values during a training period. We can also think of equation 1.5 as a consistency condition required for the maintenance of weights at their correct values. Suppose that a network has been set up to approximate a given function, so that R = F(x) to a high degree of accuracy. In order for the network to function properly over an extended period of time, some mechanism must maintain the weights at values that produce this desired output. If synaptic weights are maintained by a process that uses a correlation-based rule, condition 1.5 can be interpreted as a consistency condition on maintainable weights. In this interpretation, the Hebbian rule is applied continuously; there is no distinct training period. The average over stimuli in equation 1.5 is then over all stimuli typically encountered, and the output firing rate R can be approximated by the function F(x). Thus, the consistency condition is

w_j = \langle f_j(x) F(x) \rangle.   (1.6)
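Equations 1.3 and 1.4 can be checked numerically for a concrete orthonormal basis; the cosine basis on [0, 2π] and the target F(y) = cos(2y) below are illustrative choices, not from the text:

```python
import numpy as np

# Orthonormal cosine basis on [0, 2*pi]: f_j(y) = cos(j*y) / sqrt(pi).
y = np.linspace(0.0, 2 * np.pi, 20000, endpoint=False)
dy = y[1] - y[0]
basis = [np.cos(j * y) / np.sqrt(np.pi) for j in range(1, 6)]
F = np.cos(2 * y)   # target function to be expanded

# Equation 1.3: the Gram matrix of the basis should be the identity.
gram = np.array([[np.sum(fi * fj) * dy for fj in basis] for fi in basis])

# Equation 1.4: weights are inner products with the target, and the
# resulting truncated expansion (equation 1.2) reconstructs F.
w = [np.sum(fj * F) * dy for fj in basis]
F_hat = sum(wj * fj for wj, fj in zip(w, basis))
```

Because F lies in the span of the basis, only the j = 2 weight is nonzero and the reconstruction is essentially exact; for a general target, the residual is whatever lies outside the span.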
If all stimulus values x occur with equal probability, the average over stimuli is proportional to an integral over x, and the weights given by equations 1.4 and 1.6 are proportional to each other. Thus, the Hebbian consistency condition is equivalent to the formula for the coefficients of an orthogonal expansion. The relationship between equation 1.4 and correlation-based synaptic weights suggests a close relationship between Hebbian ideas about synaptic plasticity and standard mathematical approaches to functional expansion. Unfortunately, a direct link of this type is inappropriate because most sets of neuronal tuning curves are highly overlapping and far from orthogonal. However, all is not lost. What happens if the values of the weights given by equation 1.4 are used in equation 1.1 even when the input tuning functions are not orthogonal? This question can be answered simply by substituting these weights into equation 1.1:

R = \sum_{j=1}^{N} \int dy \, f_j(y) F(y) \, f_j(x).   (1.7)
If we interchange the order of summation and integration and define

D(x, y) \equiv \sum_{j=1}^{N} f_j(x) f_j(y),   (1.8)

we find that the output firing rate is a convolution of the target function with D,

R = \int dy \, D(x, y) F(y).   (1.9)

By computing the D function for a given set of input tuning curves, we can determine how accurately a set of neurons can represent a function using synaptic weights given by equation 1.4. The output firing rate in equation 1.9 will be exactly equal to the target function if D(x, y) = δ(x − y), since

R = \int dy \, \delta(x - y) F(y) = F(x),   (1.10)

by the definition of the Dirac δ function. The condition D(x, y) = δ(x − y) is less restrictive than the orthogonality condition. A set of functions fj that satisfy D(x, y) = δ(x − y) is called a tight frame (Daubechies et al., 1986; Daubechies, 1990, 1992). A finite number of well-behaved tuning curves can never generate a true δ function in equation 1.8, but they can create a good approximation to one. If the integral of D(x, y) over y is one for all values of x and if
D(x, y) is significantly different from zero only for values of y near x, the integral in equation 1.9 is approximately equal to F(x), provided that F varies more slowly than D. Thus, finite sets of tuning curves can act as approximate tight frames. The importance of D(x, y) is that, as a function of y, its width determines the resolution with which an output function can be approximated using a correlation rule like equation 1.6 for the synaptic strengths. For this reason, it is interesting to examine D functions for real neuronal tuning curves. The D function provides a useful characterization of the computational capabilities of a population of neurons for function expansion. To apply ideas about tight frames to real neural circuits, we must find areas where function approximation may be taking place. Primary visual cortex seems to be a good candidate. Neurons in primary visual cortex provide the first cortical representation of the visual world, and their receptive fields provide a basis set from which the more complicated receptive fields of extrastriate neurons are constructed. If the tuning curves of simple cells form a tight frame, the construction of more complicated representations using an architecture like that in Figure 1 can be based on Hebbian synapses whose efficacy is maintained by correlation-based rules of the form given by equation 1.6. To investigate these ideas, we will compute the D function (and a related function K) for the receptive fields of simple cells in the primary visual cortices of cats and monkeys. This will allow us to determine how well the receptive field functions of simple cells collectively approximate a tight frame. Lee (1996) has explored the conditions under which sets of filters with properties like those of simple cells satisfy the mathematical definition of a tight frame.
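As a toy illustration of these ideas (not the V1 calculation performed below), the D function of equation 1.8 can be computed for a population of gaussian tuning curves with evenly spaced centers; all parameter values are arbitrary:

```python
import numpy as np

# D(x, y) = sum_j f_j(x) f_j(y) for N gaussian tuning curves with
# evenly spaced centers. Parameters are illustrative, not fit to data.
N, sigma = 60, 0.15
centers = np.linspace(-3.0, 3.0, N)

def D(x, y):
    fx = np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))
    fy = np.exp(-(y - centers) ** 2 / (2 * sigma ** 2))
    return float(np.sum(fx * fy))

# D is sharply peaked around x = y and near zero elsewhere, so this
# population acts as an approximate (unnormalized) tight frame; the
# width of the peak sets the resolution of the function approximation.
peak, off = D(0.0, 0.0), D(0.0, 1.0)
```

Narrower tuning curves (smaller sigma) sharpen the peak of D and so improve the achievable resolution, at the cost of needing more neurons to cover the same range.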
We will use the experimentally measured filter distributions to determine explicitly the resolution with which simple cells can generate arbitrary visual filters for downstream neurons using correlation-based synaptic weights.

2 Function Approximation by Simple Cells

Representation of the visual world is more complicated than the simplified scenario described in section 1, so we start by recasting the basic problem in terms that are appropriate for this application. Simple cell responses can be expressed approximately through spatiotemporal linear filters acting on visual image intensity, but we will focus exclusively on the spatial aspects of these responses. If I(x) is the luminance (relative to background) of an image that appears at the point x on the retina, where x is a two-dimensional vector, the firing rate of simple cell j is approximated as

r_j = g \int dx \, f_j(x) I(x),   (2.1)
where g is an overall scale factor determining the magnitude of the firing rate. Because this expression can be positive or negative, we must interpret it as the difference between the firing rates of two neurons having receptive field filters +f_j and −f_j. The filter functions f_j for simple cells are often fit using plane waves multiplied by gaussian envelopes, known as Gabor functions (Daugman, 1985; Jones & Palmer, 1987; Webster & De Valois, 1985),

f_j(x) = \frac{1}{2\pi\sigma_j^2} \exp\left( -\frac{|x - a_j|^2}{2\sigma_j^2} \right) \cos(k_j \cdot x - \varphi_j).   (2.2)

These filters are parameterized by the receptive field center a, receptive field envelope width σ, preferred spatial frequency k ≡ |k|, preferred orientation θ (the direction perpendicular to k), and spatial phase φ. For simplicity, we restrict ourselves to circular rather than elliptical receptive field envelopes. We have verified that for values in the physiologically measured range, elliptical receptive fields give essentially the same results reported below for circular receptive fields. We have included a factor of 1/σ² in the normalization of the Gabor function so that neurons with different filters give similar maximum firing rates when the optimum images are presented with a given overall amplitude. In the case of visual responses, the filter functions fill the role that the tuning curves played in the example given in section 1. Similarly, the target function we wish to represent in this case is a filter function. In other words, we attempt to construct a downstream neuron that is selective for some target function or target image F(x). The target response is given by

R = G \int dz \, F(z) I(z),   (2.3)

with G the scale factor for the output rate. For fixed image power, the maximum value of R is evoked when I is equal to the target function F. This output neuron is driven, through synaptic weights, by the responses of a population of simple cells, so that

R = \sum_{j=1}^{N} w_j r_j = g \sum_{j=1}^{N} w_j \int dx \, f_j(x) I(x),   (2.4)

where we have used equation 2.1. Again we impose the condition that synaptic weights are equal to the average product of pre- and postsynaptic firing rates, assuming that the postsynaptic neuron fires as prescribed by equation 2.3. The weights are then given by

w_j = \langle r_j R \rangle = Gg \int dy \, dz \, f_j(y) F(z) \langle I(y) I(z) \rangle.   (2.5)
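A discrete sketch of the Gabor filter of equation 2.2 and the linear response of equation 2.1 on a pixel grid; the parameter values and the grating image are illustrative assumptions, not the experimentally measured distributions used later:

```python
import numpy as np

def gabor(X, Y, a=(0.0, 0.0), sigma=0.5, k=(4.0, 0.0), phi=0.0):
    # Equation 2.2: gaussian envelope (normalized by 1/(2*pi*sigma^2))
    # times a plane-wave carrier cos(k . x - phi).
    r2 = (X - a[0]) ** 2 + (Y - a[1]) ** 2
    env = np.exp(-r2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return env * np.cos(k[0] * X + k[1] * Y - phi)

# Equation 2.1 on a pixel grid: r_j = g * integral of f_j(x) I(x) dx.
grid = np.linspace(-2.0, 2.0, 201)
X, Y = np.meshgrid(grid, grid)
dA = (grid[1] - grid[0]) ** 2
g = 1.0

f = gabor(X, Y)
r_pref = g * np.sum(f * np.cos(4.0 * X)) * dA   # grating matched to k
r_orth = g * np.sum(f * np.sin(4.0 * X)) * dA   # 90-degree phase shift
```

The matched grating drives a strong response while the phase-shifted grating drives essentially none, which is why quadrature pairs of such filters (introduced in section 3) are needed to cover all spatial phases.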
In this case the average over stimuli, denoted by angle brackets, is an average over the ensemble of visual images. Substituting these weights into equation 2.4, we find

R = Gg^2 \int dx\, dz\, K(x, z) F(z) I(x), \qquad (2.6)

with

K(x, z) = \int dy\, D(x, y) \langle I(y) I(z) \rangle \qquad (2.7)

and

D(x, y) = \sum_{j=1}^{N} f_j(x) f_j(y). \qquad (2.8)
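Before specializing to measured parameter distributions, the behavior of D can be illustrated with a toy one-dimensional family of Gabor-like filters. All parameters below (grids of centers, frequencies, quadrature phases) are invented for the sketch and are not the fitted cortical distributions used later in the paper:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 801)          # 1D "visual field" in degrees
i0 = len(x) // 2                          # index of the point x0 = 0

# Toy 1D analog of the Gabor filters of equation 2.2 (illustrative values only).
def gabor(a, k, phi, sigma):
    envelope = np.exp(-(x - a) ** 2 / (2.0 * sigma ** 2)) / sigma ** 2
    return envelope * np.cos(k * (x - a) - phi)

centers = np.linspace(-1.5, 1.5, 61)
freqs = np.geomspace(2.0, 40.0, 30)       # preferred frequencies, rad/deg
filters = np.array([gabor(a, k, phi, 2.0 / k)
                    for a in centers
                    for k in freqs
                    for phi in (0.0, np.pi / 2)])   # quadrature pairs

# One row of equation 2.8: D(x0, y) = sum_j f_j(x0) f_j(y).
D_row = filters[:, i0] @ filters

# For a dense family, D is sharply peaked at y = x0 and small elsewhere.
peak = D_row[i0]
far = np.abs(D_row[np.abs(x) > 0.5]).max()
```

Even this crude family concentrates D near the diagonal; the rest of the paper asks how close the actual cortical population comes to this ideal.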
Equation 2.6 tells us that the firing rate of the output neuron is proportional to a filtered integral of the image,

R = Gg^2 \int dx\, F_{eff}(x) I(x). \qquad (2.9)

The effective filter F_{eff} is a smeared version of the desired filter F and is given by

F_{eff}(x) = \int dy\, K(x, y) F(y). \qquad (2.10)
The condition for perfect function representation is then K(x, y) = δ(x − y), and the resolution with which a given filter function can be represented is related to the degree to which K approximates a δ function. To determine how well simple cells approximate a tight frame, we first compute both the D and K functions. An interesting aspect of the computation of D from equation 2.8 is that it involves not only the form of the receptive field filters of simple cells, but also the distribution of their parameters across the population of cortical neurons. Once D is determined, we compute K from equation 2.7 by including the correlation function ⟨I(y) I(z)⟩ for natural images (Field, 1987; Ruderman & Bialek, 1994; Dong & Atick, 1995). Then we discuss whether D and K behave like a δ function.

3 Calculation of D(x, y)
Equation 2.8, which defines D, is a sum over simple cell filters. To compute it, we reparameterize the filters 2.2 in terms of quantities whose distributions
Emilio Salinas and L. F. Abbott
have been measured. We then approximate the sum over cells by integrals over those parameters using mathematical fits to reported distributions. To simplify the calculation, we group the simple cell filters into pairs with spatial phases 90 degrees apart (quadrature pairs), corresponding to sine and cosine terms. Each pair of filters can then be described by a single complex exponential filter g_j,

g_j(x) = \frac{1}{2\pi\sigma_j^2} \exp\left(-\frac{|x - a_j|^2}{2\sigma_j^2}\right) \exp\left(i(k_j \cdot x - \varphi_j)\right), \qquad (3.1)
where i ≡ √−1. Neurophysiological experiments have found pairs of adjacent simple cells roughly 90 degrees out of phase with each other (Pollen & Ronner, 1981), and we assume that combinations as in equation 3.1 can be formed for all cells. Using complex exponential filters, the sum that defines D becomes

D(x, y) = \sum_{j=1}^{N} g_j(x) g_j^*(y) = \sum_{j=1}^{N} \frac{1}{4\pi^2\sigma_j^4} \exp\left(-\frac{|x - a_j|^2 + |y - a_j|^2}{2\sigma_j^2}\right) \exp\left(i k_j \cdot (x - y)\right), \qquad (3.2)
where g_j^* denotes the complex conjugate of g_j. Note that the spatial phase cancels out in the expression for D. Simple cell receptive field filters are characterized by location, envelope width, and preferred spatial wavelength, phase, and orientation. The envelope width and preferred spatial frequency are correlated; larger receptive fields tend to have lower preferred spatial frequencies. To eliminate this dependence, we will characterize the receptive fields by bandwidth rather than by envelope width. The bandwidth is the width of the spatial frequency tuning curve measured in octaves. It is defined as b ≡ log₂(v₂/v₁), where v₂ and v₁ are the higher and lower spatial frequencies, respectively, at which the neuron gives half of its maximum response. The bandwidth is related to the envelope width σ and the preferred frequency k by

\sigma = \frac{\sqrt{2\ln 2}}{k}\, \frac{2^b + 1}{2^b - 1}. \qquad (3.3)

We use b and k as independent parameters, with the tuning curve width σ given by equation 3.3. To compute D, we use an integral over receptive field parameters as an approximation to the sum over neurons, making the replacement

\sum_{j=1}^{N} \rightarrow \int da\, dk\, db\, d\theta\, d\varphi\, \rho_a(a)\, \rho_k(k)\, \rho_b(b)\, \rho_\theta(\theta)\, \rho_\varphi(\varphi), \qquad (3.4)
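Equation 3.3 is simple to apply in code; a small helper function, evaluated here at roughly the mean monkey foveal parameters reported below (b ≈ 1.49 octaves, k ≈ 22 rad/deg), chosen purely for illustration:

```python
import numpy as np

def envelope_width(b, k):
    """Envelope width sigma (degrees) from bandwidth b (octaves) and
    preferred spatial frequency k (rad/deg), following equation 3.3."""
    return np.sqrt(2.0 * np.log(2.0)) / k * (2.0 ** b + 1.0) / (2.0 ** b - 1.0)

sigma = envelope_width(1.49, 22.0)      # roughly 0.11 degree
```

Doubling k halves σ at fixed bandwidth, which is the inverse relation exploited throughout this section.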
where the different ρ functions are the neuronal coverage densities (the number of neurons per unit parameter range) for the different receptive field parameters. The validity of this approximation is addressed at the end of this section. Preferred orientations and spatial phases appear to be uniformly distributed over neurons in primary visual cortex (Hubel & Wiesel, 1974; Field & Tolhurst, 1986; Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993), so we assume that ρ_θ(θ) and ρ_φ(φ) are constants. Over a small region of the cortex, ρ_a(a) can also be approximated as a constant. This means that we are considering a small enough region of the visual field so that the dependence of the cortical magnification factor on eccentricity can be ignored. Since the D function we compute will be quite narrow, this is a reasonable approximation. With these assumptions, we find

D(x, y) = \int da\, dk\, db\, d\theta\, d\varphi\, \rho_k(k)\, \rho_b(b)\, \frac{c}{\sigma^4} \exp\left(-\frac{|x - a|^2 + |y - a|^2}{2\sigma^2}\right) \exp\left(i k |x - y| \cos\theta\right). \qquad (3.5)
In this and all the following expressions, c denotes a constant that absorbs all of the factors that affect the amplitude but not the functional form of D. To simplify the above expression, we have made a shift in the definition of the angle θ, which is allowed because the θ integral is over all values. The integrals over a and φ in equation 3.5 can be done immediately to obtain

D(|x - y|) = \int dk\, db\, d\theta\, \rho_k(k)\, \rho_b(b)\, \frac{c}{\sigma^2} \exp\left(-\frac{|x - y|^2}{4\sigma^2}\right) \exp\left(i k |x - y| \cos\theta\right). \qquad (3.6)
Note that D depends on only the magnitude of x − y. To proceed further, we need to determine the density functions ρ_k(k) and ρ_b(b). We have done this by fitting three different data sets. We first performed fits on the data reported by Movshon, Thompson, and Tolhurst (1978) for neurons with 5 degree eccentricity or less in cat area 17. The data and fits are shown in Figure 2 (top). The mathematical fits are given by

\rho_b(b) = \frac{16.6}{\sqrt{2\pi}\, 0.44} \exp\left(-\frac{(b - 1.39)^2}{2(0.44)^2}\right) \qquad (3.7)

and

\rho_k(k) = \frac{1.7\, k^2}{(1 + k/5.2)^5}, \qquad (3.8)
Figure 2: Distributions of optimal spatial frequencies (left) and bandwidths (right) for simple cells. The black circles correspond to experimental data, and the solid curves are mathematical fits. (Top) Cat area 17 simple cells with eccentricities smaller than 5 degrees. Data from Movshon et al. (1978). (Middle) Monkey V1 simple cells within 1.5 degrees of the fovea. Data from De Valois et al. (1982). (Bottom) Monkey V1 simple cells with eccentricities from 3 to 5 degrees (parafoveal sample). Data from De Valois et al. (1982). Notice the different scales in the frequency axes of cat and monkey data. Parameters in the fitting equations were found by minimizing the χ² statistic (Press et al., 1992). All fits had P > 0.3, except for the preferred frequency distribution in the monkey parafoveal sample (lower left), which had P > 0.01.
and we assume that ρ_k(k) = 0 for k > 20 radians per degree. We also fit two data sets of De Valois, Albrecht, and Thorell (1982) from area V1 in macaque monkeys. These correspond to a region from 0 to 1.5 degrees eccentricity
(foveal) and another from 3 to 5 degrees eccentricity (parafoveal). The fits for the region closest to the fovea are shown in Figure 2 (middle), and are given by

\rho_b(b) = \frac{31.7}{\sqrt{2\pi}\, 0.62} \exp\left(-\frac{(b - 1.49)^2}{2(0.62)^2}\right) \qquad (3.9)

and

\rho_k(k) = \frac{0.17\, k^2}{(1 + k/22.2)^5}. \qquad (3.10)
We assume that ρ(k) = 0 for k > 90 radians per degree. The fits for the parafoveal region are shown in Figure 2 (bottom) and are given by

\rho_b(b) = \frac{16.2}{\sqrt{2\pi}\, 0.49} \exp\left(-\frac{(b - 1.3)^2}{2(0.49)^2}\right) \qquad (3.11)

and

\rho_k(k) = \frac{0.22\, k^2}{(1 + k/14.6)^5}. \qquad (3.12)
Here we assume that ρ(k) = 0 for k > 50 radians per degree. The distributions we have extracted exhibit an interesting property: the bandwidth distributions from all three data sets are very similar, and the preferred spatial frequency distributions are scaled versions of each other to a high degree of accuracy. Because the receptive field width σ is inversely proportional to the preferred spatial frequency k (see equation 3.3), the integrand in equation 3.6 depends on |x − y| only through the combination k|x − y|. This means that scaling the distribution ρ_k(k) by a factor s will have the effect of scaling the argument of the resulting D function to give D(s|x − y|). As a result, we can compute the D function for any one of the distribution pairs and determine D for the others simply by scaling the spatial argument of the function. We have verified through numerical integration that the three bandwidth distributions are equivalent and that this approximation is practically exact (compare Figures 3A and 4A). We will use the experimental results for cells nearest to the fovea in the monkey. The D function we compute and show in the figures corresponds to this case, but the results for the parafoveal region of the monkey can be obtained by broadening these curves by a factor of 1.52, and for the cat data with a broadening factor of 4.27. Results for any eccentricity in either animal can similarly be obtained by scaling the curves by appropriate factors.
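That numerical integration can be sketched directly from equations 3.3, 3.6, 3.9, and 3.10. The sketch below uses a plain Riemann sum on uniform grids rather than the trapezoidal routine used in the paper, and drops all overall constants since the profile is normalized to D(0):

```python
import numpy as np

def rho_b(b):                             # equation 3.9, prefactor dropped
    return np.exp(-(b - 1.49) ** 2 / (2.0 * 0.62 ** 2))

def rho_k(k):                             # equation 3.10
    return 0.17 * k ** 2 / (1.0 + k / 22.2) ** 5

def sigma(b, k):                          # equation 3.3
    return np.sqrt(2.0 * np.log(2.0)) / k * (2.0 ** b + 1.0) / (2.0 ** b - 1.0)

def D(r, nk=50, nb=50, nth=30):
    """Riemann-sum estimate of the real part of equation 3.6, up to a constant."""
    k = np.linspace(0.5, 90.0, nk)
    b = np.linspace(0.1, 3.0, nb)
    th = np.linspace(0.0, 2.0 * np.pi, nth, endpoint=False)
    K, B, TH = np.meshgrid(k, b, th, indexing="ij")
    S = sigma(B, K)
    f = (rho_k(K) * rho_b(B) / S ** 2
         * np.exp(-r ** 2 / (4.0 * S ** 2))
         * np.cos(K * r * np.cos(TH)))
    return f.sum()

r = np.linspace(0.0, 0.3, 61)
profile = np.array([D(ri) for ri in r]) / D(0.0)
```

The resulting profile is a narrow central peak with small sidebands, in line with Figure 3A; the exact half-width depends slightly on the grids and cutoffs chosen here.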
Figure 3: D function for monkey simple cells. All D curves shown are normalized to a maximum height of one. (A) Result when the experimentally measured neuronal distributions, equations 3.9 and 3.10, are used in the calculation of D. The filled circles correspond to numerical integration of equation 3.6 for different values of |x − y|; the continuous line is a fit to the numerical results using a combination of three gaussian functions (see equation 3.13). The width of the curve at half-maximum height, which determines the resolution for function approximation, is 0.06 degree. (B) Results obtained when the distributions of bandwidths and preferred frequencies are manipulated. The dotted line is the D function when only neurons near the peaks of both distributions are used; the width at half-height is 0.13 degree. The continuous line is the result when only neurons with optimal frequencies lower than 50 radians per degree are included in the calculation; the width at half-height is 0.086 degree. The dashed line is the result when only neurons with optimal frequencies higher than 50 radians per degree are included; the width at half-height is 0.042 degree. In these last two cases the full bandwidth distribution was used.
Substituting equations 3.3, 3.9, and 3.10 into 3.6, the triple integral can be computed numerically for different values of |x − y| to yield the function D. The result is shown in Figure 3A, where D has been normalized to a maximum value of one. The continuous line is a fit to the numerical result, indicated by dots. The fit was obtained by combining three gaussians,

D_{fit}(|x - y|) = 0.8590 \exp\left(-\frac{|x - y|^2}{2(0.024)^2}\right) - 0.2579 \exp\left(-\frac{|x - y|^2}{2(0.085)^2}\right) + 0.3590 \exp\left(-\frac{|x - y|^2}{2(0.059)^2}\right). \qquad (3.13)
The figure shows that the full width at half-maximum height is 0.06 degree. It also shows that the sidebands around the central peak are very small; the function does not oscillate. This suggests that D can approximate a δ function to a resolution of roughly 4 minutes of arc.
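Both quoted properties of the fit, the 0.06 degree width and the small sidebands, follow directly from equation 3.13 and are easy to check numerically:

```python
import numpy as np

def D_fit(r):
    """Three-gaussian fit to the numerically computed D (equation 3.13)."""
    return (0.8590 * np.exp(-r ** 2 / (2.0 * 0.024 ** 2))
            - 0.2579 * np.exp(-r ** 2 / (2.0 * 0.085 ** 2))
            + 0.3590 * np.exp(-r ** 2 / (2.0 * 0.059 ** 2)))

r = np.linspace(0.0, 0.3, 3001)
vals = D_fit(r) / D_fit(0.0)

fwhm = 2.0 * r[np.argmax(vals < 0.5)]   # about 0.06 degree
sideband = vals.min()                    # about -0.05: sidebands are small
```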
Figure 4: D function for simple cells and retinal ganglion cells of the cat. Both curves are normalized to a maximum height of one. (A) The D function for simple cells of the cat primary visual cortex obtained by numerical integration of equation 3.6. This curve is practically indistinguishable from the result in Figure 3A scaled by a factor of 4.27. For comparison, the x-axes in this figure have been scaled precisely by this amount. The full width at half-maximum height is 0.26 degree. (B) The D function for a uniformly distributed population of difference-of-gaussians filters resembling those of retinal ganglion cells (see the appendix). The full width at half-maximum height is 0.5 degree.
In Figure 3B we examine how the D function depends on the distributions of preferred spatial frequencies and bandwidths used in the calculation. First, we investigate what happens if only neurons with parameters near the peaks of the distributions in Figure 2 are considered. We include bandwidths ranging between 1.4 and 1.6 octaves and optimal frequencies ranging between 21 and 23 radians per degree, with no neurons outside these ranges. The D function that results is shown in Figure 3B (dotted line). It has large inhibitory sidebands, and its width at half-height is more than twice the width of the original function. When the full bandwidth distribution is used but the same restricted range of preferred frequencies is included, the result is similar. The neurons at the tail of the preferred frequency distribution are important in setting the width of D, even though proportionally they represent a small fraction of the total. When the distribution is truncated, allowing only optimal frequencies below 50 radians per degree, the resulting D function has a width of 0.086 degree. Conversely, when only optimal frequencies above 50 radians per degree are included, the resulting D function has a width of 0.042 degree. Thus, the resolution of the full set of simple cells seems to be limited by the highest frequencies included. The 1/σ² factor that appears in equation 2.2 is partly responsible for this effect.

It is interesting to compare the D function of simple cells with that of neurons at earlier stages in the visual pathway. In the case of retinal ganglion cells, the computation is straightforward because their filters can be described by differences of gaussians (see the appendix). The D functions for simple cells and retinal ganglion cells corresponding to data collected in cats are shown in Figure 4. The full widths at half-maximum height are 0.26 and 0.5 degree, respectively. The resolution of simple cells for function approximation in the same animal is better, roughly by a factor of 2. This is related to the fact that some simple cells have high preferred spatial frequencies or, equivalently, filters that vary more rapidly than those of retinal ganglion cells.

Because the integral in equation 3.6 was computed numerically using the trapezoidal rule (Press, Flannery, Teukolsky, & Vetterling, 1992), the D function was actually computed as a sum. This is appropriate, since the biological system involves individual neurons, not a continuum. The number of terms in the sum is proportional to the number of neurons with receptive fields overlapping the same spatial location. Examining how the number of terms affects the sum reveals how many neurons are needed to obtain a smooth D function. The results shown in Figure 3A were obtained with 30 steps in orientation θ (in the range 0–2π), 50 steps in preferred frequency k (in the range 0–90 radians per degree), and 50 steps in bandwidth b (in the range 0.1–3.0 octaves). With a much coarser grid of 8 orientations, 10 frequencies, and a single bandwidth (same ranges), the resulting curve was almost identical to the one in Figure 3A. In general, the distribution of bandwidths had little impact on the results. Hence, for a given point in visual space, sampling at 8 orientations and 10 frequencies per orientation is enough to construct a D function with a 0.06 degree resolution. Are the neurophysiological data consistent with these requirements?
Within the dimensions of a hypercolumn, preferred orientation changes smoothly and uniformly throughout 180 degrees (Hubel & Wiesel, 1974; Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993). In fact, in orientation maps obtained with optical recordings, 8 orientation bins in the 0–180 degree range can easily be distinguished by eye (Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993). Thus, sampling in orientation space is not a limitation. On the other hand, the 10 frequency steps do not correspond to 10 neurons, for two reasons. First, each complex filter represents a sine-cosine pair, where each element is the difference between +f_j and −f_j, which means 4 neurons per filter. Second, ρ_k(k) is not uniform. To get at least 4 complex filters with optimal frequencies inside the 90 radians per degree bin, the 10 frequency bins must involve a total of about 300 filters, or 1200 neurons. This means that for each point in space, at least 1200 neurons are needed in each of the orientation bins to achieve the resolution implied by the D function. De Valois et al. (1982) estimated that in monkeys, on the order of 100,000 neurons respond to stimuli at a given location in visual space. Assuming that half of those are simple cells and that frequencies are uniformly distributed in the 8 orientation bins (De Valois et al., 1982; Field & Tolhurst, 1986; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997) leaves about 6000 neurons per
orientation bin. This number, although only a rough figure, suggests that the sampling density in spatial frequency is sufficient to achieve the resolution implied by the D functions that we have computed.

4 Calculation of K(x, y)
To compute the function K from D, we need to include the two-point correlation function of ensembles of natural images, Q ≡ ⟨I(x) I(y)⟩. This function has been characterized in the frequency domain (Field, 1987; Ruderman & Bialek, 1994; Dong & Atick, 1995), so we will use its Fourier transform Q̃ to compute the convolution 2.7 through which the K function is defined (Bracewell, 1965). Using a minimal set of assumptions, Dong and Atick (1995) derived an analytic expression for the full spatiotemporal power spectrum of natural images and fit it to data from movies and video recordings. Taking k as the spatial frequency, in radians per degree, and ω as the temporal frequency, in radians per second, they proposed that

\tilde{Q}(k, \omega) = \frac{A}{k^{m-1}\omega^2} \int_{r_1\omega/ck}^{r_2\omega/ck} dv\, v P(v), \qquad (4.1)

where A is a constant that absorbs the amplitude factors, c = 360/2π, m has been measured from snapshot images to be about 2.3, r₁ and r₂ are, respectively, the minimum and maximum viewing distances at which objects are resolved, and P(v) is the probability distribution of velocities of objects that are seen. Dong and Atick (1995) suggested that

P(v) \sim \frac{1}{(v + v_0)^n}, \qquad (4.2)
which fit their two data sets extremely well using n = 3.7, v₀ = 1.02 meters per second, r₁ equal to 2 or 4 meters, and r₂ either 23 or 40 meters (depending on the data set). The parameter v₀ is related to the mean speed by v̄ = v₀/(n − 2). We will use the above expression in our calculation of K, but to investigate whether the particular choice for P(v) is critical to our results, we also consider another distribution with a different shape:

P(v) \sim v \exp(-a v). \qquad (4.3)
Because we are interested only in the spatial component of Q̃(k, ω), we need to average over ω, or consider it as an additional parameter. Given typical ocular movements, a reasonable choice is to set ω equal to the mean saccade frequency, about 4 per second, or ω = 8π radians per second. We will use this value and consider Q̃ as a function of k only.
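With ω fixed in this way, equation 4.1 reduces to a one-dimensional integral in v for each k; a sketch with the parameter values listed in the Figure 5 caption and the overall constant A set to 1:

```python
import numpy as np

n, m = 3.7, 2.3
r1, r2 = 0.2, 100.0                     # viewing distances, meters
v0 = 1.02                               # speed scale of P(v), m/s
w = 8.0 * np.pi                         # temporal frequency, rad/s
c = 360.0 / (2.0 * np.pi)

def Q_tilde(k):
    """Spatial spectrum of equation 4.1 at fixed w, P(v) from equation 4.2."""
    v = np.linspace(r1 * w / (c * k), r2 * w / (c * k), 4000)
    integrand = v / (v + v0) ** n       # v * P(v), unnormalized
    return integrand.sum() * (v[1] - v[0]) / (k ** (m - 1) * w ** 2)
```

The spectrum falls off monotonically with k; its slowly decaying tail is what broadens K relative to D in the next step.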
Figure 5: K functions for monkey and cat simple cells. The curves shown correspond to numerical integration of equation 4.5 for different values of |x − y|. (A) Results for monkey simple cells based on the D_fit function of equation 3.13 and shown in Figure 3A. The thick line is the result when the following parameters were used in equations 4.1 and 4.2: n = 3.7, m = 2.3, r₁ = 0.2 m, r₂ = 100 m, v₀ = 1.02 m/s, ω = 8π rad/s. The full width at half-height is 0.112 degree. The thin lines were obtained with the same parameters, except ω = 32π (narrower curve; width at half-height is 0.105 degree) and ω = 2π (wider curve; width at half-height is 0.135 degree). (B) Results for cat simple cells, also based on D_fit in equation 3.13 but scaled by a factor of 4.27. The x-axis in this figure was scaled by the same amount. The width at half-height is 0.45 degree.
We first compute the Fourier transform of K. This is equal to the product of the Fourier transforms of D and Q, since K is defined by a convolution (Bracewell, 1965; Press et al., 1992). We actually use the Fourier transform of D_fit, which is simply the sum of the Fourier transforms of three gaussians. The Fourier transform of K is K̃ = D̃_fit Q̃. Transforming back to the spatial representation, we have

K(x, y) = \frac{1}{4\pi^2} \int dk\, \exp\left(i k \cdot (x - y)\right) \tilde{D}_{fit}(k)\, \tilde{Q}(k). \qquad (4.4)

Writing k in polar coordinates and shifting the angle variable, we obtain

K(|x - y|) = \frac{1}{4\pi^2} \int dk\, d\theta\, k \exp\left(i k |x - y| \cos\theta\right) \tilde{D}_{fit}(k)\, \tilde{Q}(k). \qquad (4.5)
Given values for the parameters of Q̃, the above integral can be computed numerically for each value of |x − y|. Figure 5 shows the results for the data collected in monkeys (see Figure 5A) and cats (see Figure 5B), integrating equation 4.1 using equation 4.2. For the monkey data, equation 3.13 was used for D_fit, and for the cat data this same curve was scaled by a factor of
4.27. Comparison between Figures 5A and 5B shows that the K functions are approximately scaled versions of each other; the K function for cats is just slightly narrower than predicted by scaling. The thick curve in Figure 5A is the result for a standard set of parameters, whose values are listed in the caption of Figure 5. The full width at half-height of the curve is 0.112 degree, which is about twice the width of the original D function. Thus, averaging over images worsens the resolution. The thin curves in the same figure were obtained by multiplying the standard value of ω by 1/4 (wider curve) and by 4 (narrower curve). This has the same effect as multiplying r₁ and r₂ or v₀ by these same factors, because these parameters multiply or divide ω. The thin curves are quite close to the thick one, which means that the parameters in equations 4.1 and 4.2 have relatively little importance in defining the shape and width of K. The precise functional form of P(v) did not seem to be crucial either. When equation 4.3 was used (parameter a was set so that the mean speed was the same as before), the results were indistinguishable from those in Figure 5. It is the tail of Q̃, which goes as 1/f^{m−1}, that makes K broader than D and defines its shape; the detailed form of Q̃ near k = 0, which is where the parameters have an influence, does not seem to matter much.

5 Assessment of Tightness
We have compared the computed D and K kernels to a δ function based only on the width of their central peaks, but the tightness of a set of basis functions can be more easily grasped by plotting the Fourier transform of its associated D function. In this way, frequency ranges in which D̃ is approximately flat can be promptly identified. This is important because the class of functions that are band-limited to such frequency ranges will be accurately represented by the f_j basis; that is, the f_j functions will behave as a tight frame for the class of functions whose Fourier transform is zero outside the identified ranges. Flatness is quantified by the ratio max(D̃)/min(D̃), which is commonly used as a rigorous measure of tightness (Lee, 1996). Figure 6 shows the Fourier transforms of several D and K functions. Because K is related to D through a convolution with the correlation of natural images (see equation 2.7), in all cases K̃ is the product of D̃ times the Fourier transform of the natural image correlation, Q̃. Here D̃ reveals the tightness of the original set of functions, when no averaging over images takes place; in contrast, K̃ reveals the tightness of the set when averaging over multiple images does take place (see section 6). Figures 6E and 6F show an ideal case in which the basis set provides perfect approximation of functions with frequency components below 30 cycles per degree; the associated D̃ function is flat in this range. This corresponds, for example, to a set of wavelets (Daubechies, 1990, 1992). For this set K̃ is equal to Q̃, so Figure 6F simply shows the Fourier transform of the natural image correlation function. This function is not at all flat; although it looks fairly narrow in x space (not shown, but very similar to Figure 5A), its slowly decaying tail makes it a poor approximation to a δ function. Figures 6A and 6B show the results for monkey V1 simple cells (foveal). The K̃ function is very similar to Q̃, suggesting that the ensemble of V1 neurons is not compensating for the average over natural images. On the other hand, D̃ is flat up to about 1 cycle per degree, indicating that simple cell receptive fields provide an excellent approximation of a tight frame for functions with frequency cutoffs no higher than 1 cycle per degree. Even up to 10 cycles per degree, where simple cell responses start to attenuate significantly, D̃ varies only by a factor between 4 and 5, not too far from the factor of 2 or less that corresponds to a standard definition of a tight frame (Lee, 1996). As a comparison, Figures 6C and 6D show the results for retinal ganglion neurons from the cat (see the appendix). In this case K̃ is much flatter than D̃, albeit with a cutoff between 1 and 2 cycles per degree.

Figure 6: Fourier transforms of different D (left) and K (right) functions. In all cases the plots on the right were obtained by multiplying the plots on the left by the Fourier transform of the natural image correlation function, Q̃ (shown in F). The top row shows the results for monkey V1 foveal neurons. (A) Fourier transform of the D function shown in Figure 3A. (B) Fourier transform of the K function shown in Figure 5A. The middle row shows the results for retinal ganglion cells of the cat (see the appendix). (C) Fourier transform of the D function shown in Figure 4B. (D) Fourier transform of the K function associated with retinal ganglion cells of the cat. The bottom row shows an ideal case of a set of basis functions whose associated D function is exactly a δ function. Thus, the K̃ function is equal to Q̃. The Q̃ and D̃ curves were scaled to a maximum of one. Note that in these plots the units of spatial frequency are cycles per degree.

6 Discussion
We have evaluated the capacity of populations of simple cells in primary visual cortex to provide a basis for constructing more complicated receptive fields through Hebbian synaptic weights. There are some similarities between the conditions we have studied, in particular the suggestion that K is proportional to a δ function, and the work of Atick and Redlich (1990, 1992) concerning ideas proposed by Barlow (1989) regarding redundancy in neuronal representations. However, there are substantial differences between our approach and theirs. Atick and Redlich (1990, 1992) imposed translational invariance, so that all of the neurons they considered had the same receptive field structure. Their formulations resulted in consideration of functions such as K, including averages over natural images. Because all the receptive fields were taken to be the same, this approach generated an optimality condition on a single receptive field. Atick and Redlich (1990, 1992) computed several optimal receptive field filters, among them those that solved the equation for K close to a δ function. The case of function approximation that we consider involves cells with different receptive field structures and depends critically on the distribution of their parameters across the cortical population. Our D and K functions characterize the capacity of the ensemble of all simple cells for function approximation; they are not a property of single cells, as in the work of Atick and Redlich. It is interesting that both cases involve averages over natural images and some similar mathematical expressions, but the quantities being considered are very different: the capacity to transmit information in one case and the accuracy for function approximation in the other.

From the D̃ and K̃ curves computed (see Figure 6), we conclude that the set of monkey V1 simple cells approximates a tight frame with a resolution, given by the cutoff in D̃, on the order of 1/10 of a degree. The resolution for cats, obtained by appropriate scaling of D, is about 0.43 degree. In both cases the D function is more similar to the ideal δ function than the K function, which includes corrections for natural image statistics. Thus, we find no indication that this basis set is counteracting the effects of averaging over many images. In contrast, sets of filters like those of retinal ganglion neurons seem to be better suited for function approximation when the effect of image correlations is included. However, their resolution is rather poor, on the order of 1 degree. We should stress that these resolutions are meaningful only for synaptic modification processes based on correlational mechanisms. In particular, our results do not imply that simple cells have a higher overall resolution than retinal ganglion cells, an impossibility given that retinal ganglion cells provide the input to simple cells. Our results imply only that simple cells serve as a better basis from which neurons selective for arbitrary images can be constructed on the basis of linear superposition and Hebbian synaptic modification.

To understand the difference between using the D and K functions to determine whether simple cell receptive fields form a tight frame, we might imagine two forms of Hebbian learning. For the type of Hebbian learning we discussed in section 1, synaptic adaptation occurs slowly while an ensemble of images is displayed. In this case, the function K sets the resolution for function approximation with Hebbian synapses. Suppose instead that Hebbian modification of synapses occurs rapidly and only when an image equal to the target function appears. In this case the process is fast enough that images different from the target function do not affect the synaptic weights. For this learning procedure, the resolution for function approximation is set by D.

Visual acuity in humans and monkeys, in image discrimination or recognition tasks that involve the activities of large numbers of neurons, is about 1 minute of arc, or 0.017 degree (Westheimer, 1980). This is about six times better than the resolution we have estimated for approximation of an arbitrary function based on D. It might seem odd that our estimate is considerably coarser than psychophysical discrimination thresholds, but our analysis applies specifically to single neurons downstream from V1 that combine their inputs in a perfectly linear fashion.
The discrepancy indicates that the visual system does not rely exclusively on the tightness of the V1 neuronal ensemble to achieve the acuity exhibited at the behavioral level; additional mechanisms beyond linear summation are required. There are two prominent possibilities for this. First, the neural circuitry may be able to exploit some of its nonlinearities to increase accuracy. On the other hand, higher accuracy may also be achieved through a population of neurons. Considering the ubiquity of population codes throughout the cortex, the capacity of V1 simple cells to act as a basis for Hebbian-based function approximation by single postsynaptic neurons may appear substantial, only a factor of 6 below behavioral performance. However, to try to answer the question posed in the title: the frame formed by simple cells in primary visual cortex is tighter than would be expected from a random distribution of Gabor filters, but not tight enough to account for behavioral acuity.

Appendix
Here we compute the D function for retinal ganglion cells (X cells). The firing of these cells is well characterized through a linear filtering operation (afterward rectifying the result to eliminate negative rates), as in equation 2.1. In this case each filter is equal to the difference of two gaussians,

f_j(x) = \frac{A_1}{2\pi\sigma_1^2} \exp\left(-\frac{(x - a_j)^2}{2\sigma_1^2}\right) - \frac{A_2}{2\pi\sigma_2^2} \exp\left(-\frac{(x - a_j)^2}{2\sigma_2^2}\right). \qquad (A.1)

We assume that the only parameter that varies across neurons is a, the receptive field location. Values for the rest of the parameters are as follows: A₁/A₂ = 17/16 (Linsenmeier, Frishman, Jakiela, & Enroth-Cugell, 1982), σ₁ = 0.17666 degree, and σ₂ = 0.53 degree (Peichl & Wässle, 1979). The D function is computed as in equation 2.8, using the above filters. The sum over neurons is approximated by an integral over a, assuming that the receptive fields are uniformly distributed, so ρ(a) is a constant. The integrals can then be computed analytically, and the result is

D(|x - y|) = \frac{A_1^2}{4\pi\sigma_1^2} \exp\left(-\frac{(x - y)^2}{4\sigma_1^2}\right) + \frac{A_2^2}{4\pi\sigma_2^2} \exp\left(-\frac{(x - y)^2}{4\sigma_2^2}\right) - \frac{A_1 A_2}{\pi(\sigma_1^2 + \sigma_2^2)} \exp\left(-\frac{(x - y)^2}{2(\sigma_1^2 + \sigma_2^2)}\right). \qquad (A.2)
The Fourier transform of this expression is

$$\tilde{D}(k) = \left( A_1 \exp\left(-\frac{\sigma_1^2 k^2}{2}\right) - A_2 \exp\left(-\frac{\sigma_2^2 k^2}{2}\right) \right)^2, \tag{A.3}$$
where k is in radians per degree.

Acknowledgments
We thank Sacha Nelson for his helpful advice. This research was supported by the Sloan Center for Theoretical Neurobiology at Brandeis University, the National Science Foundation (DMS-9503261), and the W. M. Keck Foundation.

References

Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Comput., 2, 308–320.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comput., 4, 196–210.
Barlow, H. B. (1989). Unsupervised learning. Neural Comput., 1, 295–311.
Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation preference pinwheels, cytochrome oxidase blobs and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.
Emilio Salinas and L. F. Abbott
Bonhoeffer, T., & Grinvald, A. (1993). The layout of iso-orientation domains in area 18 of cat visual cortex: Optical imaging reveals a pinwheel-like organization. J. Neurosci., 13, 4157–4180.
Bracewell, R. (1965). The Fourier transform and its applications. New York: McGraw-Hill.
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Information Theory, 36, 961–1004.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
Daubechies, I., Grossmann, A., & Meyer, Y. (1986). Painless nonorthogonal expansions. J. Math. Phys., 27, 1271–1283.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimization by two-dimensional visual cortical filters. J. Opt. Soc. Am., 2, 1160–1169.
De Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Res., 22, 545–559.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network, 6, 345–358.
Field, D. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4, 2379–2394.
Field, D., & Tolhurst, D. J. (1986). The structure and symmetry of simple-cell receptive-field profiles in the cat's visual cortex. Proc. R. Soc. Lond. B, 228, 379–400.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks. Neural Comput., 7, 219–269.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17, 9270–9284.
Jones, J., & Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58, 1233–1259.
Lee, T. S. (1996). Image representation using 2D Gabor wavelets. IEEE Trans. Pattern Analysis and Machine Intelligence, 18, 959–971.
Linsenmeier, R. A., Frishman, L. J., Jakiela, H. G., & Enroth-Cugell, C. (1982). Receptive field properties of X and Y cells in the cat retina derived from contrast sensitivity measurements. Vision Res., 22, 1173–1183.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Spatial and temporal contrast sensitivity of neurons in area 17 and 18 of the cat's visual cortex. J. Physiol., 283, 101–120.
Peichl, L., & Wässle, H. (1979). Size, scatter and coverage of ganglion cell receptive field centers in the cat retina. J. Physiol. (Lond.), 291, 117–141.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposium on Quantitative Biology, 55, 899–910.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Simple Cells and Tight Frames
Pollen, D. A., & Ronner, S. F. (1981). Phase relationships between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C. New York: Cambridge University Press.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.
Webster, M. A., & De Valois, R. L. (1985). Relationship between spatial-frequency and orientation tuning of striate-cortex cells. J. Opt. Soc. Am. A, 2, 1124–1132.
Westheimer, G. (1980). The eye, including central nervous system control of eye movements. In V. B. Mountcastle (Ed.), Medical physiology (Vol. 1, pp. 481–503). St. Louis: Mosby.

Received June 11, 1998; accepted February 4, 1999.
LETTER
Communicated by Joachim Buhmann
Learning Overcomplete Representations

Michael S. Lewicki
Computer Science Dept. and Center for the Neural Basis of Cognition, Carnegie Mellon Univ., 115 Mellon Inst., 4400 Fifth Ave., Pittsburgh, PA 15213, U.S.A.

Terrence J. Sejnowski
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, The Salk Institute, La Jolla, CA 92037, U.S.A.

In an overcomplete basis, the number of basis vectors is greater than the dimensionality of the input, and the representation of an input is not a unique combination of basis vectors. Overcomplete representations have been advocated because they have greater robustness in the presence of noise, can be sparser, and can have greater flexibility in matching structure in the data. Overcomplete codes have also been proposed as a model of some of the response properties of neurons in primary visual cortex. Previous work has focused on finding the best representation of a signal using a fixed overcomplete basis (or dictionary). We present an algorithm for learning an overcomplete basis by viewing it as a probabilistic model of the observed data. We show that overcomplete bases can yield a better approximation of the underlying statistical distribution of the data and can thus lead to greater coding efficiency. This can be viewed as a generalization of the technique of independent component analysis and provides a method for Bayesian reconstruction of signals in the presence of noise and for blind source separation when there are more sources than mixtures.

1 Introduction
Neural Computation 12, 337–365 (2000). © 2000 Massachusetts Institute of Technology.

A common way to represent real-valued signals is with a linear superposition of basis functions. This can be an efficient way to encode a high-dimensional data space because the representation is distributed. Bases such as the Fourier or wavelet bases can provide a useful representation of some signals, but they are limited because they are not specialized for the signals under consideration. An alternative and potentially more general method of signal representation uses so-called overcomplete bases (also called overcomplete dictionaries), which allow a greater number of basis functions (also called dictionary elements) than samples in the input signal (Simoncelli, Freeman, Adelson, & Heeger, 1992; Mallat & Zhang, 1993; Chen, Donoho, & Saunders, 1996). Overcomplete bases are typically constructed by merging a set of complete
bases (e.g., Fourier, wavelet, and Gabor), or by adding basis functions to a complete basis (e.g., adding frequencies to a Fourier basis). Under an overcomplete basis, the decomposition of a signal is not unique, but this can offer some advantages. One is that there is greater flexibility in capturing structure in the data. Instead of a small set of general basis functions, there is a larger set of more specialized basis functions such that relatively few are required to represent any particular signal. These can form more compact representations, because each basis function can describe a significant amount of structure in the data. For example, if a signal is largely sinusoidal, it will have a compact representation in a Fourier basis. Similarly, a signal composed of chirps is naturally represented in a chirp basis. Combining both of these bases into a single overcomplete basis would allow compact representations for both types of signals (Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Chen et al., 1996). It is also possible to obtain compact representations when the overcomplete basis contains a single class of basis functions. An overcomplete Fourier basis, with more than the minimum number of sinusoids, can compactly represent signals composed of small numbers of frequencies, achieving superresolution (Chen et al., 1996). An additional advantage of some overcomplete representations is increased stability of the representation in response to small perturbations of the signal (Simoncelli et al., 1992). Unlike the case in a complete basis, where signal decomposition is well defined and unique, finding the "best" representation in terms of an overcomplete basis is a challenging problem. It requires both an objective for decomposition and an algorithm that can achieve that objective. Decomposition can be expressed as finding a solution to

$$x = As, \tag{1.1}$$
where x is the signal, A is a (nonsquare) matrix of basis functions (vectors), and s is the vector of coefficients, that is, the representation of the signal. Developing efficient algorithms to solve this equation is an active area of research. One approach to removing the degeneracy in equation 1.1 is to place a constraint on s (Daubechies, 1990; Chen et al., 1996), for example, by finding s satisfying equation 1.1 with minimum L1 norm. A different approach is to construct iteratively a sparse representation of the signal (Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993). In some of these approaches, the decomposition can be a nonlinear function of the data. Although overcomplete bases can be more flexible in terms of how the signal is represented, there is no guarantee that hand-selected basis vectors will be well matched to the structure in the data. Ideally, we would like the basis itself to be adapted to the data, so that for the signal class of interest, each basis function captures a maximal amount of structure. One recent success along these lines was developed by Olshausen and Field (1996, 1997) from the viewpoint of learning sparse codes. When adapted
to natural images, the basis functions shared many properties with neurons in primary visual cortex, suggesting that overcomplete representations might be a useful model for neural population codes. In this article, we present an algorithm for learning an overcomplete basis by viewing it as a probabilistic model of the observed data. This approach provides a natural solution to decomposition (see equation 1.1) by finding the maximum a posteriori representation of the data. The prior distribution on the basis function coefficients removes the redundancy in the representation and leads to representations that are sparse and are a nonlinear function of the data. The probabilistic approach to decomposition also leads to a natural method of denoising. From this model, we derive a simple and robust learning algorithm by maximizing the data likelihood over the basis functions. This work generalizes the algorithm of Olshausen and Field (1996) by deriving an algorithm for learning overcomplete bases from a direct approximation to the data likelihood. This allows learning for arbitrary input noise levels, allows for the objective comparison of different models, and provides a way to estimate a model's coding efficiency. We also show that overcomplete representations can provide a better and more efficient representation, because they can better approximate the underlying statistical density of the input data. This also generalizes the technique of independent component analysis (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995) and provides a method for the identification of more sources than mixtures.

2 Model
We assume that each data vector, x = x_1, . . . , x_L, can be described with an overcomplete linear basis plus additive noise,

$$x = As + \epsilon, \tag{2.1}$$
where A is an L × M matrix with M > L. We assume gaussian additive noise, ε. The data likelihood is

$$\log P(x \mid A, s) \propto -\frac{1}{2\sigma^2}(x - As)^2, \tag{2.2}$$
where σ² is the noise variance. One criticism of overcomplete representations is that they are redundant (a given data point can have many possible representations), but this redundancy can be removed by a proper choice for the prior probability of the basis coefficients, P(s), which specifies the probability of the alternative representations. This density determines how the underlying statistical structure is modeled and the nature of the representation. Standard approaches to signal representation do not specify a prior for the coefficients, because
for most complete bases and assuming zero noise, the representation of the signal is unique. If A is invertible, the decomposition of the signal x is given by s = A⁻¹x. Because A⁻¹ is expensive to compute, there is strong incentive to find basis matrices that are easy to invert, such as restricting the basis functions to be orthogonal or further restricting the basis functions to those for which there are fast algorithms, such as Fourier or wavelet analysis. A more general approach is afforded in the probabilistic formulation. The parameter values of the internal states are found by maximizing the posterior distribution of s,

$$\hat{s} = \arg\max_s P(s \mid x, A) = \arg\max_s P(x \mid A, s)\, P(s), \tag{2.3}$$
where ŝ is the most probable decomposition of the signal. This formulation of the problem offers the advantage that the model can fit more general types of distributions. For simplicity, we assume independence of the coefficients: P(s) = ∏_m P(s_m). Another advantage of this formulation is that the process of finding the most probable representation determines the best representation for the noise level defined by σ. This automatically performs "denoising" in that s encodes the underlying signal without noise. The noise level can be set to arbitrary levels, including zero. It is also possible to use Bayesian methods to infer the most probable noise level, but this will not be pursued here.

2.1 The Relation Between the Prior and the Representation. Figure 1 shows how different priors induce different representations of the data. A standard choice might be a gaussian prior (equivalent to factor analysis), but this yields no advantage to having an overcomplete representation because the underlying assumption is still that the data are gaussian. An alternative choice, advocated by some authors (Field, 1994; Olshausen & Field, 1996; Chen et al., 1996), is to use priors that assume sparse representations. This is accomplished by priors that have high kurtosis, such as the Laplacian, P(s_m) ∝ exp(−θ|s_m|). Compared to a gaussian, this distribution puts greater weight on values close to zero, and as a result the representations are sparser: they have a greater number of small-valued coefficients. For overcomplete bases, we expect a priori that the representation will be sparse, because only L out of M nonzero coefficients are needed to represent arbitrary L-dimensional input patterns. In the case of zero noise and P(s) gaussian, maximizing equation 2.3 is equivalent to min_s ||s||_2 subject to x = As. The solution can be obtained with the pseudoinverse, s = A⁺x, and is a linear function of the data.
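The minimum-L2-norm case just described can be sketched numerically; the basis matrix and data point below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy overcomplete basis: L = 2 inputs, M = 3 basis vectors (columns).
# These particular vectors are an assumption for illustration.
A = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
x = np.array([1.0, 0.5])

# Zero noise, gaussian prior: the minimum-L2-norm representation is
# s = A^+ x, computed with the Moore-Penrose pseudoinverse.
s_l2 = np.linalg.pinv(A) @ x

# The representation reconstructs x exactly, and s is linear in x.
assert np.allclose(A @ s_l2, x)
```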
In the case of zero noise and P(s) Laplacian, maximizing equation 2.3 is equivalent to min_s ||s||_1 subject to x = As. Unlike the gaussian prior, this solution cannot be obtained by a simple linear operation.
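The L1-norm solution can be sketched with a generic linear-programming solver by splitting s into positive and negative parts so the objective becomes linear. Here we use scipy.optimize.linprog (not the solver used in the paper), and the toy basis, sparse coefficients, and dimensions are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Toy problem: L = 4 dimensional signal, M = 8 basis vectors, zero noise.
L_dim, M = 4, 8
A = rng.normal(size=(L_dim, M))
s_true = np.zeros(M)
s_true[[1, 5]] = [2.0, -1.5]      # sparse generating coefficients (assumed)
x = A @ s_true

# min ||s||_1 subject to As = x, via the standard split s = u - v with
# u, v >= 0, which makes the objective linear: min 1^T [u; v].
c = np.ones(2 * M)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=x, bounds=(0, None))
s_l1 = res.x[:M] - res.x[M:]

# The L1 solution reconstructs x and is typically much sparser than the
# minimum-L2 (pseudoinverse) solution.
assert res.success
assert np.allclose(A @ s_l1, x, atol=1e-6)
```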
Figure 1: Different priors induce different representations. (a) The data distribution is two-dimensional and has three main arms. The overlaid axes form an overcomplete representation. (b, c) Optimal scaled basis vectors for the data point x under a gaussian and Laplacian prior, respectively. Assuming low noise, a gaussian for P(s) is equivalent to finding s with minimum L2 norm such that x = As. The solution is given by the pseudoinverse s = A⁺x and is a linear function of the data. A Laplacian prior (P(s_m) ∝ exp[−θ|s_m|]) finds s with minimum L1 norm. This is a nonlinear operation that essentially selects a subset of basis vectors to represent the data (Chen et al., 1996), so that the resulting representation is sparse. (d) A 64-sample segment of natural speech was fit to a 2× overcomplete Fourier representation (128 basis functions) (see section 7). The plot shows the rank-order distribution of the coefficients of s under a gaussian prior (dashed) and a Laplacian prior (solid). Under a gaussian prior, nearly all of the coefficients in the solution are nonzero, whereas far fewer are nonzero under a Laplacian prior.
3 Inferring the Internal State
A general approach for optimizing s in the case of finite noise (ε > 0) and nongaussian P(s) is to use the gradient of the log posterior in an optimization algorithm (Daugman, 1988; Olshausen & Field, 1996). A suitable initial condition is s = Aᵀx or s = A⁺x.
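The gradient-based approach can be sketched as follows, assuming a Laplacian prior and the gaussian likelihood of equation 2.2. The dimensions, noise level, and step size are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient ascent on the log posterior log P(s|x,A) for a Laplacian prior,
# P(s_m) ∝ exp(-theta*|s_m|), with the gaussian likelihood of equation 2.2.
L_dim, M = 4, 8
A = rng.normal(size=(L_dim, M))
x = rng.normal(size=L_dim)
theta, sigma = 1.0, 1.0           # assumed prior scale and noise level

s = A.T @ x                       # suggested initial condition s = A^T x
for _ in range(500):
    # d/ds log P(s|x,A) = (1/sigma^2) A^T (x - As) - theta * sign(s)
    grad = (A.T @ (x - A @ s)) / sigma**2 - theta * np.sign(s)
    s += 0.02 * grad

residual = np.linalg.norm(x - A @ s)
```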
An alternative method, which can be used when the prior is Laplacian and ε = 0, is to view the problem as a linear program (Chen et al., 1996):

$$\min c^T |s| \quad \text{subject to} \quad As = x. \tag{3.1}$$
Letting c = (1, . . . , 1), the objective function in the linear program, c^T|s| = Σ_m |s_m|, corresponds to maximizing the log posterior likelihood under a Laplacian prior. This can be converted to a standard linear program (with only positive coefficients) by separating positive and negative coefficients. Making the substitutions s ← [u; v], c ← [1; 1], and A ← [A, −A], equation 3.1 becomes

$$\min \mathbf{1}^T [u; v] \quad \text{subject to} \quad [A, -A][u; v] = x, \quad u, v \geq 0, \tag{3.2}$$
which replaces the basis vector matrix A with one that contains both positive and negative copies of the vectors. This separates the positive and negative coefficients of the solution s into the positive variables u and v, respectively. This can be solved efficiently and exactly with interior point linear programming methods (Chen et al., 1996). Quadratic programming approaches have also recently been suggested for similar problems (Osuna, Freund, & Girosi, 1997). We have used both the linear programming and gradient-based methods. The linear programming methods were superior for finding exact solutions in the case of zero noise. The standard implementation handles only the noiseless case but can be generalized (Chen et al., 1996). We found gradient-based methods to be faster in obtaining good approximate solutions. They also have the advantage that they can easily be adapted for more general models, such as positive noise levels or different priors.

4 Learning
To derive a learning algorithm we must first specify an appropriate objective function. A natural objective is to maximize the probability of the data given the model. For a set of K independent data vectors X = x_1, . . . , x_K,

$$P(X \mid A) = \prod_{k=1}^{K} P(x_k \mid A). \tag{4.1}$$
The probability of a single data point is computed by marginalizing over the states of the network,

$$P(x_k \mid A) = \int ds\, P(s)\, P(x_k \mid A, s). \tag{4.2}$$

This formulates the problem as one of density estimation and is equivalent to minimizing the Kullback-Leibler divergence between the model density
and the distribution of the data. If implementation-related issues such as synaptic noise are ignored, this is equivalent to the methods of redundancy reduction and maximizing the mutual information between the input and the representation (Nadal & Parga, 1994a, 1994b; Cardoso, 1997), which have been advocated by several researchers (Barlow, 1961, 1989; Hinton & Sejnowski, 1986; Daugman, 1989; Linsker, 1988; Atick, 1992).

4.1 Fitting Bases to the Data Distribution. To understand how overcomplete representations can yield better approximations to the underlying density, it is helpful to contrast different techniques for adapting a basis to a particular data set. Principal component analysis (PCA) models the data with a multivariate gaussian. This representation is commonly used to find the directions in the data with largest variation (the principal components), even if the data do not have gaussian structure. The basis functions are the eigenvectors of the covariance matrix and are restricted to be orthogonal. An extension of PCA, called independent component analysis (ICA) (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995), allows the learning of nonorthogonal bases for data with nongaussian distributions. ICA is highly effective in several applications such as blind source separation of mixed audio signals (Jutten & Hérault, 1991; Bell & Sejnowski, 1995), decomposition of electroencephalographic (EEG) signals (Makeig, Jung, Bell, Ghahremani, & Sejnowski, 1996), and the analysis of functional magnetic resonance imaging (fMRI) data (McKeown et al., 1998). In all of these techniques, the number of basis vectors is equal to the number of inputs. Because these bases span the input space, they are complete and are sufficient to represent the data, but we will see that this representation can be limited. Figure 2 illustrates how different data densities are modeled by various approaches in a simple two-dimensional data space.
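The orthogonality constraint of PCA can be seen directly in a small sketch: on a toy "three-armed" 2D distribution (an assumption loosely mimicking the densities discussed here), the two principal components are necessarily orthogonal and so cannot align with all three arms:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "three-armed" 2D data: each sample lies along one of three directions
# with exponential amplitude (an illustrative assumption).
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # 3 unit vectors
arm = rng.integers(0, 3, size=2000)
amp = rng.exponential(1.0, size=2000)
X = amp[:, None] * dirs[arm]

# PCA: eigenvectors of the covariance matrix, necessarily orthogonal,
# so two basis vectors cannot point along all three arms at once.
C = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(C)
pc1, pc2 = eigvecs[:, 1], eigvecs[:, 0]     # largest, smallest variance
orthogonality = abs(pc1 @ pc2)
```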
PCA assumes gaussian structure, but if the data have nongaussian structure, the vectors can point in directions that contain very few data, and the probability density defined by the model will predict data where none occur. This means that the model will underestimate the likelihood of data in the dense regions and overestimate it in the sparse regions. By Shannon's theorem, this will limit the efficiency of the representation. ICA assumes the coefficients have nongaussian structure and allows the vectors to be nonorthogonal. In this example, the basis vectors point along high-density regions of the data space. If the density is more complicated, however, as in the case of the three-armed density, neither PCA nor ICA captures the underlying structure. Although it is possible to represent every point in the two-dimensional space with a linear combination of two vectors, the underlying density cannot be described without specifying a complicated prior on the basis function coefficients. An overcomplete basis, however, allows an efficient representation of the underlying density with simple priors on the basis function coefficients.
Figure 2: Fitting two-dimensional data with different bases. (a) PCA makes the assumption that the data have a gaussian distribution. The optimal basis vectors are orthogonal and are not efficient at representing nonorthogonal density distributions. (b, c) ICA does not require that the vectors be orthogonal and can fit more general types of densities. (c) Some data distributions, however, cannot be modeled adequately by either PCA or ICA. (d) Allowing the basis to be overcomplete allows the three-armed distribution to be fit with simple and independent distributions for the basis vector coefficients.
4.2 Approximating the Data Probability. In deriving the learning algorithm, the first problem is that, in general, the integral in equation 4.2 is intractable. In the special case of zero noise and a complete representation (i.e., A is invertible), this integral can be solved and leads to the well-known ICA algorithm (MacKay, 1996; Pearlmutter & Parra, 1997; Olshausen & Field, 1997; Cardoso, 1997). But in the case of an overcomplete basis, such a solution is not possible. Some recent approaches have tried to approximate this integral by evaluating P(s)P(x | A, s) at its maximum (Olshausen & Field, 1996, 1997), but this ignores the volume information of the posterior. This means representations that are well determined (i.e., have a sharp posterior distribution) are treated in the same way as those that are ill determined, which can introduce biases into the learning. Another problem is that this leads to a trivial solution where the magnitudes of the basis functions diverge. This problem can be somewhat circumvented by adaptive normalization (Olshausen & Field, 1996), but setting the adaptation rates can be tricky in practice, and, more important,
there is no guarantee that the desired objective function is the one being optimized. The approach we take here is to approximate equation 4.2 with a gaussian around the posterior mode, ŝ. This yields

$$\log P(x \mid A) \approx \frac{L}{2}\log\frac{\lambda}{2\pi} + \frac{M}{2}\log(2\pi) + \log P(\hat{s}) - \frac{\lambda}{2}(x - A\hat{s})^2 - \frac{1}{2}\log\det H, \tag{4.3}$$
where λ = 1/σ², and H is the Hessian of the log posterior at ŝ, H = λAᵀA − ∇_s∇_s log P(ŝ). This is a saddle-point approximation (see the appendix for derivation details). Care must be taken that the log prior has some curvature, because det(AᵀA) = 0, arising from the fact that AᵀA has the same rank as A, which is assumed to be rectangular. The accuracy of the saddle-point approximation is determined by how closely the posterior mode can be approximated by a gaussian. The approximation will be poor if there are multiple modes or there is excessive skew or kurtosis. We present methods for addressing some of these issues in section 6.1. A learning rule is obtained by differentiating log P(x | A) with respect to A (see the appendix), which leads to the following expression:

$$\Delta A = AA^T \nabla_A \log P(x \mid A) \approx -A(z\hat{s}^T + I), \tag{4.4}$$

where z_k = ∂ log P(s_k)/∂s_k. This rule contains no matrix inverses, and the vector z involves only the derivative of the log prior. In the case where A is square, this form of the rule is exactly the natural gradient ICA learning rule for the basis matrix (Amari, Cichocki, & Yang, 1996). The difference in the more general case where A is rectangular is in how the coefficients s are calculated. In the standard ICA learning algorithm (A square, zero noise), the coefficients are given by s = Wx, where W = A⁻¹ is the filter matrix. For the learning rule to work, the optimization of s must maximize the posterior distribution P(s | x, A) (see equation 2.3).

5 Examples
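One step of the update in equation 4.4 can be sketched as follows, assuming a Laplacian prior (so z = −θ sign(ŝ)). As a stand-in for the MAP coefficients (normally obtained by the inference methods of section 3), this sketch uses the pseudoinverse solution, which is an illustrative simplification, not the paper's procedure; the dimensions and step size are also assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# One step of the basis update, Delta A ≈ -A (z s_hat^T + I), for a
# Laplacian prior: z_k = d log P(s_k)/ds_k = -theta * sign(s_k).
theta, lr = 1.0, 0.01
L_dim, M = 2, 3
A = rng.normal(size=(L_dim, M))
x = rng.normal(size=L_dim)

# Stand-in for the MAP coefficients (illustrative simplification).
s_hat = np.linalg.pinv(A) @ x
z = -theta * np.sign(s_hat)

delta_A = -A @ (np.outer(z, s_hat) + np.eye(M))
A_new = A + lr * delta_A
```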
We now demonstrate the algorithm on two simple data sets. Figure 3 shows the results of fitting two different two-dimensional data sets. The learning procedure was as follows. The initial bases were random normalized vectors with the constraint that no two vectors be closer in angle than 30 degrees. Each basis was adapted to the data using the learning rule given in equation 4.4. The bases were adapted with 50 iterations of simple gradient descent with a batch size of 500 and a step size of 0.1 for the first 30 iterations, which was reduced to 0.001 over the last 20. The most probable coefficients were obtained using a publicly available interior point
Figure 3: Three examples illustrating the fitting of 2D distributions with overcomplete bases. The gray vectors show the true basis vectors used to generate the distribution. The black vectors show the learned basis vectors. For clarity, the vectors have been rescaled. The learned basis vectors can be antiparallel to the true vectors, because both positive and negative coefficients are allowed.
linear programming package (Meszaros, 1997). Convergence of the learning algorithm was rapid, usually reaching the solution in fewer than 30 iterations. A solution was discarded if the magnitude of one of the basis vectors dropped to zero. This happened occasionally when one vector was "trapped" between two others that were already representing the data in that region. The first example is shown in Figure 3a. The data were generated from the true basis vectors (shown in gray) using x = As. To better illustrate the arms of the distribution, the elements of s were drawn from an exponential distribution (i.e., only positive coefficients) with unit mean. The direction of the learned vectors always matched that of the true generating vectors. The magnitude was smaller than that of the true basis vectors, possibly due to the approximation used for P(x | A) (see equation 4.3). Identical results were obtained when the coefficients were generated from a Laplacian prior (i.e., both positive and negative coefficients). The second example (see Figure 3b) has four arms, again generated from the true basis vectors using a Laplacian (both positive and negative coefficients) with unit variance. The learned vectors were close to the true vectors but showed a consistent bias. This might occur because the data are too dense for there to be any distinct direction, or from the approximation to P(x | A). The bias can be greatly reduced if the coefficients are drawn from a distribution with greater sparsity. The example in Figure 3c uses a data set from the same underlying vectors, but generated from a generalized Laplacian distribution, P(s_m) ∝ exp(−|s_m|^p). When this distribution is fitted to wavelet subband coefficients of images, Buccigrossi and Simoncelli (1997) found that p was in the range [0.5, 1.0]. Varying this exponent varies the kurtosis of the distribution, which is one measure of sparseness. In
Figure 3c, the data were generated with p = 0.6. The arms of this data set are more distinct, and the directions match those of the generating vectors.

6 Quantifying the Efficiency of the Representation
One advantage of the probabilistic framework is that it provides a natural means for objectively comparing alternative representations and models. In this section we describe two methods for comparing the coding efficiency of the learned basis functions.

6.1 Estimating Coding Cost Using P(x | A). The objective function P(x | A) for the probability of the data under the model is a natural measure for comparing different models. It is helpful to convert this value into a more intuitive form. According to Shannon's coding theorem, the probability of a code word gives the lower bound on the length of that code word, under the assumption that the model is correct. The number of bits required to encode the pattern is given by
$$\#\text{bits} \geq -\log_2 P(x \mid A) - L \log_2(\sigma_x), \tag{6.1}$$
where L is the dimensionality of the input pattern x, and σ_x specifies the quantization level of the encoding.

6.1.1 A More Accurate Approximation to P(x | A). The learning algorithm has been derived by making a gaussian approximation around the posterior maximum ŝ. In practice this is sufficiently accurate to generate a useful gradient for learning overcomplete bases. One caution, however, is that for general priors there is not necessarily a relation between the local curvature and the volume, as there is for a gaussian prior. When comparing alternative models or basis functions, we may want a more accurate approximation. One approach that works well is a method based on second differences. Rather than rely on the analytic Hessian to estimate the volume around ŝ, the volume can be estimated directly by calculating the change in the posterior, P(s)P(x | s, A), around the maximum ŝ using second differences. Computing the entire Hessian in this manner would require O(M²) evaluations of the posterior, which itself is an O(M²) operation. We can reduce the number of posterior evaluations to O(M) if we estimate the curvature only in the direction of the eigenvectors of the Hessian. To obtain an accurate volume estimate, the volume of the gaussian approximation should match as closely as possible the volume around the posterior maximum. This requires a choice of the step size along the eigenvector direction. In principle, this fit could be optimized, for example, by minimizing the cross entropy, but at significant computational expense. A relatively quick and reliable estimate can be obtained by choosing the step length, γ_i, which produces a
348
Michael S. Lewicki and Terrence J. Sejnowski
fixed drop, Δ, in the value of the posterior log-likelihood,

γ_i = argmin_{γ_i} [ log P(ŝ | x, A) − log P(ŝ + γ_i e_i | x, A) − Δ ],   (6.2)
where γ_i is the step length along the direction of eigenvector e_i. This method of choosing γ_i requires a line search, but this can typically be done with three or four function evaluations. In the examples below, we used Δ = 3.8, the optimal value for approximating a Laplacian with a gaussian. This procedure computes a new Hessian, H̃, which is a more accurate estimate of the posterior curvature. The estimated Hessian is then substituted into equation 4.3 to obtain a more accurate estimate of log P(x | A). Figure 4 shows cross-sections of the true posterior and the gaussian approximation using the direct Hessian and the second-differences method. The direct Hessian method produced poor approximations in directions where the curvature was sharp around ŝ. Estimating the volume with second differences produced a much better estimate along all of the eigenvector directions. There is no guarantee, however, that this is the best estimate, because there could still exist some directions that are poorly approximated. For example, the first plots in Figures 4a and 4b (labeled s − s_mp) show the direction between ŝ obtained using a convergence tolerance of 10⁻⁴ and ŝ obtained using a convergence tolerance of 10⁻⁶. This was the direction of the gradient when optimization was terminated and is likely to be in the direction of coefficients whose values are poorly determined. The maximum has not been reached, and the volume in this direction is underestimated. Overall, however, the second-differences method produces a much more accurate estimate of log P(x | A) than that using the analytic Hessian and obtains more accurate estimates of the relative coding efficiencies of different models.

6.1.2 Application to Test Data. We now quantify the efficiency of different representations for various test data sets using the methods described above. The estimated coding cost was calculated from 10 sets of 1000 randomly sampled data points using an encoding precision σ_x = 0.01.
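The second-differences procedure can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the multiplicative step-length update, and the iteration cap are assumptions; the curvature formula 2Δ/γ² follows from a quadratic that rises by Δ at step γ.

```python
import numpy as np

def second_difference_hessian(neg_log_post, s_hat, H_analytic, delta=3.8):
    """Estimate posterior curvature along the eigenvector directions of the
    analytic Hessian by second differences (sketch of section 6.1.1).

    neg_log_post : callable returning -log P(s | x, A)
    s_hat        : posterior maximum
    H_analytic   : analytic Hessian at s_hat (used only for the directions)
    delta        : target drop in the log posterior (3.8 matches a Laplacian)
    """
    evals, evecs = np.linalg.eigh(H_analytic)
    f0 = neg_log_post(s_hat)
    curvatures = np.empty_like(evals)
    for i in range(len(evals)):
        e = evecs[:, i]
        # line search for the step gamma with f(s_hat + gamma e) - f0 = delta
        gamma = 1.0 / np.sqrt(max(evals[i], 1e-12))  # guess from analytic curvature
        for _ in range(20):
            drop = neg_log_post(s_hat + gamma * e) - f0
            if abs(drop - delta) < 1e-3:
                break
            gamma *= np.sqrt(delta / max(drop, 1e-12))
        # a quadratic with curvature h rises by delta at step gamma: h = 2*delta/gamma^2
        curvatures[i] = 2.0 * delta / gamma**2
    # reassemble the estimated Hessian in the eigenvector basis
    return evecs @ np.diag(curvatures) @ evecs.T
```

For a gaussian (quadratic) posterior the estimate is exact; for sharper posteriors it matches the volume around ŝ better than the analytic curvature.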
The relative encoding efficiency of the different representations depends on the relative difference of their predictions—how closely P(x | A) matches the underlying distribution. If there is little difference between the predictive distributions, then there will be little difference in the expected coding cost. Table 1 shows the estimated coding costs of various models for the data sets used above. The uniform distribution gives an upper bound on the coding cost by assuming equal probability for all points within the range spanned by the data points. A simple bivariate gaussian model of the data points, equivalent to a code based on the principal components of the data, does significantly better. The remaining table entries assume the model defined in equation 2.1 with a Laplacian prior on the basis coefficients. The 2 × 2 basis matrix is
Figure 4: The solid lines in the figure show the cross-sections of a 128-dimensional posterior distribution P(s | x, A) along the directions of the eigenvectors of the Hessian matrix evaluated at ŝ (see equation 2.3). This posterior resulted from fitting a 2×-overcomplete representation to a 64-sample segment of natural speech (see section 7). The first cross-section is in the direction between ŝ using a tolerance of 10⁻⁴ and a more probable ŝ found using a tolerance of 10⁻⁶ (labeled s − s_mp). The remaining cross-sections are in order of the eigenvalues, showing a sample from the smallest (e128) to the largest (e1). The dots show the gaussian approximation for the same cross-section. The + indicates the position of ŝ. Note that the x-axes have different scales. (a) The gaussian approximation obtained using the analytic Hessian to estimate the curvature. (b) The same cross-sections but with the gaussian approximation calculated using the second-difference method described in the text.
equivalent to the ICA solution (under the assumption of a Laplacian prior). The 2×3 and 2×4 matrices are overcomplete representations. All bases were learned with the methods discussed above. The most probable coefficients were calculated using the linear programming method, as was done during learning. Table 1 shows that the coding efficiency estimates obtained using the approximation to P(x | A) are reasonably accurate compared to the estimated
Table 1: Estimated Coding Costs Using P(x | A) (Bits per Pattern).

Model          Figure 1a       Figure 3a       Figure 3b       Figure 3c
uniform        22.31 ± 0.17    21.65 ± 0.19    21.57 ± 0.16    26.67 ± 0.31
gaussian       18.96 ± 0.06    17.98 ± 0.07    18.55 ± 0.06    22.12 ± 0.15
A 2×2 (ICA)    18.84 ± 0.04    17.79 ± 0.07    18.38 ± 0.05    21.46 ± 0.08
A 2×3          18.59 ± 0.03    16.87 ± 0.06    18.04 ± 0.04    21.12 ± 0.08
A 2×4          —               —               18.15 ± 0.08    21.24 ± 0.08
bits per pattern for the uniform and gaussian models, for which the expected coding costs are exact. Apart from the uniform model, the differences in the predicted densities are rather subtle and do not show any large differences in coding costs. This would be expected because the densities predicted by the various models are similar. For all the data sets, using a three-vector basis (A 2×3) gives greater expected coding efficiencies, which indicates that higher degrees of overcompleteness can yield more efficient coding. For the four-vector basis (A 2×4), there is no further improvement in coding efficiency. This reflects the fact that when the prior distribution is limited to a Laplacian, there is an inherent limitation on the model's degrees of freedom; increasing the number of basis functions does not always increase the space of distributions that can be modeled.

6.2 Estimating Coding Cost Using Coefficient Entropy. The probability of the data gives a lower bound on the code length but does not specify a code. An alternative method for estimating the coding efficiency is to estimate the entropy of a proposed code. For these models, a natural choice for the code is the vector of coefficients, s. In this approach, the distribution of the coefficients s (fit to a training data set) is estimated with a function f(s). The coding cost for a test data set is computed by estimating the entropy of the fitted coefficients, quantized to the level needed to maintain an encoding noise level of σ_x. This has the advantage that f(s) can give a better approximation to the observed distribution of s (better than the Laplacian assumed under the model). In the examples below, we assume that all the coefficients have the same distribution, although it is straightforward to use a separate function for the distribution estimate of each coefficient. One note of caution with this technique, however, is that it does not include any cost of misfitting.
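The entropy-based accounting can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it uses a simple histogram in place of the kernel density estimate described below, and the function and variable names are assumptions.

```python
import numpy as np

def coding_cost_bits(s_train, s_test, ds):
    """Entropy-based coding-cost estimate (sketch of section 6.2).

    s_train : coefficients fit to training data, used to build the code f[i]
    s_test  : coefficients fit to test data (N patterns)
    ds      : quantization level for the coefficients
    """
    # quantize both sets onto a common grid of bins of width ds
    lo = min(s_train.min(), s_test.min())
    hi = max(s_train.max(), s_test.max())
    edges = np.arange(lo, hi + 2 * ds, ds)
    # code-word probabilities must come from the training set (a priori)
    counts, _ = np.histogram(s_train, bins=edges)
    p = counts + 1e-12          # avoid log of zero for empty bins
    p = p / p.sum()
    # counts n_i observed on the test set
    n, _ = np.histogram(s_test, bins=edges)
    N = s_test.shape[0]                   # number of test patterns
    return -np.sum(n * np.log2(p)) / N    # bits per pattern
```

Using the training-set probabilities for the test-set counts is what makes this an honest a priori code, as the text notes; reusing one data set for both roles underestimates the cost.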
A basis that does not fully span the data space will result in poor fits to the data but can still yield low entropy. If the bases under consideration fully span the data space, then the entropy will yield a reasonable estimate of coding cost. To compute the entropy, the precision to which each coefficient is encoded needs to be specified. Ideally, f(s) should be quantized to maintain
an encoding noise level of σ_x. In the examples below, we use the approximation δs_i ≈ σ_x / |A_i|. This relates a change in s_i to a change in the data x_i and is exact if A is orthogonal. We use the mean value of δs_i to obtain a single quantization level δs for all coefficients. The function f(s) is estimated by applying kernel density estimation (Silverman, 1986) to the distribution of coefficients fit to a training data set. We use a Laplacian kernel with a window width of 2δs. The most straightforward method of estimating the coding cost (in bits per pattern) is to sum over the individual entropies,

#bits ≥ −Σ_i (n_i/N) log₂ f[i],   (6.3)

where f(s) is quantized as f[i], the index i ranges over all the bins in the quantization, and n_i is the number of counts observed in each bin for the coefficients fit to a test data set consisting of N patterns. The distinction between the test and training data sets is necessary because the probabilities of the code words need to be specified a priori. Using the same data set for testing and training underestimates the cost, although for large data sets the difference is minimal. An alternative method of computing the coding cost using the entropy is to make use of the fact that for sparse overcomplete representations, a subset of the coefficients will be zero. The coding cost can then be computed by summing the entropy of the nonzero coefficients plus the cost of identifying them,

#bits ≥ −Σ_i (n_i/N) log₂ f_nz[i] + min(M − L, L) log₂(M),   (6.4)
where f_nz[i] is the quantized density estimate of the nonzero coefficients in the ith bin and the index i ranges over all the bins. This method is appropriate when the cost of identifying the nonzero coefficients is small compared to the cost of sending the M − L (zero-valued) coefficients.

6.2.1 Application to Test Data. We now quantify the efficiency of various representations for the data sets shown in Figure 3 by computing the entropy of the coefficients. As before, the estimated coding cost was calculated on 1000 randomly sampled data points, using an encoding precision of 0.01. A separate training data set consisting of 10,000 data points was used to estimate the distribution of the coefficients. The entropy was estimated by computing the entropy of the nonzero coefficients (see equation 6.4), which yielded lower estimated coding costs for the overcomplete bases compared with summing the individual entropies (see equation 6.3). Table 2 shows the estimated coding costs for the same learned bases and data sets used in Table 1. The second column lists the cost of labeling the
Table 2: Estimated Entropy of Nonzero Coefficients (Bits per Pattern).

Model    Labeling Cost   Figure 1a       Figure 3a       Figure 3b       Figure 3c
A 2×2    0.0             18.88 ± 0.04    17.78 ± 0.07    18.18 ± 0.05    21.45 ± 0.09
A 2×3    1.6             18.02 ± 0.03    17.28 ± 0.05    17.75 ± 0.05    20.95 ± 0.08
A 2×4    2.0             —               —               17.88 ± 0.07    20.81 ± 0.09
nonzero coefficients. The other columns list the coding cost for the values of the nonzero coefficients. The total coding cost is the sum of these two numbers. For the 2 × 2 matrices (the ICA solution), the entropy estimate yields approximately the same coding cost as the estimate based on P(x | A), which indicates that in the complete case, the computation of P(x | A) is accurate. The entropy estimate, however, yields a higher coding cost for the overcomplete bases. The coding cost computed from P(x | A) suggests that lower average coding costs are achievable. One strategy for lowering the average coding cost would be to avoid identifying the nonzero coefficients for every pattern, for example, by grouping patterns that share common nonzero coefficients. This illustrates the advantages of contrasting the minimal coding cost estimated from the probability with that of a particular code.

7 Learning Sparse Representations of Speech
Speech data were obtained from the TIMIT database, using speech from a single speaker speaking 10 different example sentences. Speech segments were 64 samples in duration (8 msecs at the sampling frequency of 8000 Hz). No preprocessing was done. Both a complete basis (1×-overcomplete, or 64 basis vectors) and a 2×-overcomplete basis (128 basis vectors) were learned. The bases were initialized to the standard Fourier basis and a 2×-overcomplete Fourier basis, which was composed of twice the number of sine and cosine basis functions, linearly spaced in frequency over the Nyquist range. For larger problems, it is desirable to make the learning faster. Some simple modifications to the basic gradient descent procedure were used that produced more rapid and reliable convergence. For each gradient (see equation A.33), a step size was computed by δ = ε/a_max, where a_max is the element of the basis matrix A with the largest absolute value. The parameter ε was reduced from 0.02r to 0.001r over the first 1000 iterations and fixed at 0.001r for the remaining iterations, where r is a measure of the data range, defined to be the standard deviation of the data. For the learning rule, we replaced the term A in equation A.39 with an approximation to λAAᵀAH⁻¹ (see equation A.36), as suggested in Lewicki
Figure 5: An example of fitting a 2×-overcomplete representation to segments of natural speech. Speech segments were 64 samples in duration (8 msecs at the sampling frequency of 8000 Hz). (a) A random sample of 30 of the 128 basis vectors (each scaled to full range). (b) The power spectral densities (0 to 4000 Hz) of the corresponding bases in a.
and Olshausen (1998, 1999):

−λ AᵀA H⁻¹ ≈ I − B Q diag⁻¹[V + Qᵀ B Q] Qᵀ,   (7.1)
where B = ∇_s∇_s log P(ŝ) and Q and V are obtained from the singular value decomposition λAᵀA = QVQᵀ. This yielded solutions similar to equation 4.4 but resulted in fewer zero-length basis vectors. One hundred patterns were used to estimate each learning step. Learning was terminated after 5000 steps. The most probable basis function coefficients, ŝ, were obtained using a modified conjugate gradient routine (Press, Teukolsky, Vetterling, & Flannery, 1992). The basic routine was modified to replace the line search with an approximate Newton step. This approach resulted in a substantial speed improvement and produced much better solutions in a fixed amount of time than the standard routine. To improve speed, the convergence criterion in the conjugate gradient routine was started at a value of 10⁻² and reduced to 10⁻³ over the course of learning. Figure 5 shows a sample of the learned basis vectors from the 2×-overcomplete basis. The waveforms in Figure 5a show a random sample of the learned basis vectors; the corresponding power spectral densities (0–4000 Hz) are shown in Figure 5b. An immediate observation is that, unlike the Fourier basis vectors, many of the learned basis vectors have a broad bandwidth, and some have multiple spectral peaks. This is an indication that the basis functions capture some of the broadband and harmonic structure inherent in the speech signal. Another way of contrasting the learned representation with the Fourier representation is shown in Figure 6. This shows the log-coefficient magnitudes over the duration of a speech sentence taken from the TIMIT database for the 2×-overcomplete Fourier (middle plot) and 2×-overcomplete learned
[Figure 6 appears here. Top panel: the words of the spoken sentence, "she had your dark suit in greasy wash water all year." Middle and bottom panels plot coefficient number against log |s| for the Fourier basis and the learned basis, respectively.]
Figure 6: Comparison of coefficient values over the duration of a speech sample from the TIMIT database. (Top) The amplitude envelope of a sentence of speech. The vertical lines are a rough indication of the word boundaries, as listed in the database. (Middle) The log-coefficient magnitudes for a 2×-overcomplete Fourier basis over the duration of the speech example, using nonoverlapping blocks of 64 samples without windowing. (Bottom) The log-coefficient magnitudes for a 2×-overcomplete learned basis. Only the largest 10% of the coefficients (in terms of magnitude) are plotted. The basis vectors are ordered in terms of their power spectra, with higher frequencies toward the top.
representations (bottom plot). The middle plot is similar to a standard spectrogram, except that no windowing was performed and the windows were not overlapping. Figure 6 shows that the coefficient values for the learned basis are used much more uniformly than in the Fourier basis. This is consistent with the learning objective, which optimizes the basis vectors to make the coefficient values as independent as possible. One consequence is that the correlations across frequency that exist in the Fourier representation (reflecting the harmonic structure in the speech) are less apparent in the learned representation.

7.1 Comparing Efficiencies of Representations. Table 3 shows the estimated number of bits per sample to encode a speech segment to a precision of 1 bit out of 8 (σ_x = 1/256 of the amplitude range). The table shows the
Table 3: Estimated Bits per Sample for Speech Data.

Basis        −log₂ P(x | A) − L log₂(σ_x)   Total Entropy   Nonzero Entropy
1× learned   4.17 ± 0.07                    3.40 ± 0.05     3.36 ± 0.08
1× Fourier   5.39 ± 0.03                    4.29 ± 0.07     4.24 ± 0.09
2× learned   5.85 ± 0.06                    5.22 ± 0.04     3.07 ± 0.07
2× Fourier   6.87 ± 0.06                    6.57 ± 0.14     4.12 ± 0.11
estimated coding costs using the probability of the data (see equation 6.1) and using the entropy, computed by summing the individual entropy estimates of all coefficients (see equation 6.3) (total). Also shown is the entropy of only the nonzero coefficients. By both estimates, the complete and the 2×-overcomplete learned bases perform significantly better than the corresponding Fourier bases. This is not surprising, given the observed redundancy in the Fourier representation (see Figure 6). For all of the bases, the coding cost estimates derived from the entropy of the coefficients are lower than those derived from P(x | A). A possible reason is that the coefficients are sparser than the Laplacian prior assumed by the model. This is supported by the histogram of the coefficients (see Figure 7), which shows a density with kurtosis much higher than the assumed Laplacian.

Figure 7: Distribution of basis function coefficients for the 2× learned and 2× Fourier bases. Note the log scale on the y-axis.

The entropy method can obtain a lower estimate because it fits the observed density of s. The last column in the table shows that although the entropy per coefficient of the 2×-overcomplete bases is less than for the complete bases, neither basis yields an overall improvement in coding efficiency. There could be several reasons for this. One likely reason is the independence assumption of the model (P(s) = Π_m P(s_m)). This assumes that there is no structure among the learned basis vectors. From Figure 6, however, it is apparent that there is considerably more structure left in the values of the basis vector coefficients. Possible generalizations for capturing this kind of structure are discussed below.

8 Discussion
We have presented an algorithm for learning overcomplete bases. In some cases, overcomplete representations allow a basis to better approximate the underlying statistical density of the data, which can lead to representations that better capture the underlying structure in the data and have greater coding efficiency. The probabilistic formulation of the basis inference problem offers the advantage that assumptions about the prior distribution on the basis coefficients are made explicit. Moreover, by estimating the probability of the data given the model or the entropy of the basis function coefficients, different models can be compared objectively. This algorithm generalizes ICA so that the model accounts for additive noise and allows the basis to be overcomplete. Unlike standard ICA, where the internal states are computed by inverting the basis function matrix, in this model the transformation from the data to the internal representation is nonlinear. This occurs because when the model is generalized to account for additive noise, or when the basis is allowed to be overcomplete, the internal representation is ambiguous. The internal representation is then obtained by maximizing the posterior probability of s given the assumptions of the model, which is, in general, a nonlinear operation. A further advantage of this formulation is that it is possible to "denoise" the data. The inference procedure of finding the most probable coefficients automatically separates the data into the underlying signal and the additive noise. By setting the noise to appropriate levels, the model specifies what structure in the data should be ignored and what structure should be modeled by the basis functions. The derivation given here also allows the noise level to be set to zero. In this case, the model attempts to account for all of the variability in the data. This approach to denoising has been applied successfully to natural images (Lewicki & Olshausen, 1998, 1999).
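The denoising idea can be illustrated in a special case. For an orthonormal basis, gaussian noise of precision λ, and a Laplacian prior, the MAP coefficients reduce to soft thresholding of the projections; this is a sketch of that special case only (the paper's overcomplete setting requires the full nonlinear optimization), and the names `lam` and `theta` are illustrative.

```python
import numpy as np

def map_denoise(x, A, lam, theta):
    """Denoising by MAP inference of the coefficients, in the orthonormal
    special case: maximize -lam/2 ||x - A s||^2 - theta * sum |s_m|.
    For orthonormal A this decouples per coefficient and the solution is
    soft thresholding at theta/lam.
    """
    proj = A.T @ x                               # coefficients of x in the basis
    thresh = theta / lam                         # prior/noise trade-off
    s_hat = np.sign(proj) * np.maximum(np.abs(proj) - thresh, 0.0)
    return A @ s_hat                             # reconstruction with noise removed
```

Raising λ (more trust in the data) shrinks the threshold, so less structure is attributed to noise, matching the text's point that the noise level specifies what the basis should ignore.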
Another potential application is the blind separation of more sources than mixtures. For example, the two-dimensional examples in Figure 3 can be viewed as a source separation problem in which a number of sources are mixed onto a smaller number of channels (three to two in Figure 3a and
four to two in Figures 3b and 3c). The overcomplete bases allow the model to capture the underlying statistical structure in the data space. True source separation will be limited, however, because the sources are being mapped down to a smaller subspace and there is necessarily a loss of information. Nonetheless, it is possible to separate three speakers on two channels with good fidelity (Lee, Lewicki, Girolami, & Sejnowski, 1999). We have also shown, in the case of natural speech, that learned bases have better coding properties than commonly used representations such as the Fourier basis. In these examples, the learned basis resulted in an estimated coding efficiency (for near-lossless compression) that was about 1.4 bits per sample better than the Fourier basis. This reflects the fact that spectral representations of speech are redundant because of spectral structures such as speaker harmonics. In the complete case, the model is equivalent to ICA, but with additive noise. We emphasize that these numbers reflect only a very simple coding scheme based on 64-sample speech segments. Other coding schemes can achieve much greater compression. The analysis presented here serves only to compare the relative efficiency of a code based on the Fourier representation with that of a learned representation. The learned overcomplete representations also showed greater coding efficiency than the overcomplete Fourier representation but did not show greater coding efficiency than the complete representation. One possible reason is that the approximations are inaccurate. The approximation used to estimate the probability of the data (see equation 4.2) works well for learning the basis functions and also gives reasonable estimates when exact quantities are available. However, it is possible that the approximation becomes increasingly inaccurate for higher degrees of overcompleteness.
An open challenge for this and other inference problems is how to obtain accurate and computationally tractable approximations to the data probability. Another, perhaps more plausible, reason for the lack of coding efficiency for the overcomplete representations is that the assumptions of the model are inaccurate. An obvious one is the prior distribution for the coefficients. For example, the observed coefficient distribution for the speech data (see Figure 7) is much sparser than the Laplacian assumed by the model. Generalizing these models so that P(s) better captures this kind of structure would improve the accuracy of the model and allow it to fit a broader range of distributions. Generalizing these techniques to handle more general prior distributions, however, is not straightforward because of the difficulties in the optimization techniques for finding the most probable coefficients and in the approximation of the data probability, log P(x | A). These are directions we are currently investigating. We also need to question the fundamental assumption that the coefficients are statistically independent. This is not true for structures such as speech, where there is a high degree of nonstationary statistical structure. This type of structure is clearly visible in the spectrogram-like plots in Figure 6. Because the coefficients are assumed to be independent, it is not possible to capture mutually exclusive relationships that might exist among different speech sounds. For example, different bases could be specialized for different types of phonemes. Capturing this type of structure requires generalizing the model so that it can represent and learn higher-order structure. In applications of overcomplete representations, it is common to assume that the coefficients are independent and have Laplacian distributions (Mallat & Zhang, 1993; Chen et al., 1996). These results show that although overcomplete representations can potentially yield many benefits, in practice these can be limited by inaccurate prior assumptions. An improved model of the data distribution will lead to greater coding efficiency and also allow more accurate inference. The Bayesian framework presented in this article provides a direct method for evaluating the validity of these assumptions and also suggests ways in which these models might be generalized. Finally, the approach taken here to overcomplete representation may also be useful for understanding the nature of neural codes in the cerebral cortex. One million ganglion cells in the optic nerve are represented by 100 million cells in the primary visual cortex of the monkey. Thus, it is possible that at the first stage of cortical processing, the visual world is overrepresented by a large factor.

Appendix

A.1 Approximating P(x | A). The integral
P(x | A) = ∫ ds P(s) P(x | s, A)   (A.1)
can be approximated with the gaussian integral,

∫ ds f(s) ≈ f(ŝ) (2π)^(k/2) |−∇_s∇_s log f(s)|^(−1/2),   (A.2)
where |·| indicates the determinant and ∇_s∇_s indicates the Hessian, which is evaluated at the solution ŝ (see equation 2.3). Using f(s) = P(s)P(x | s, A), we have

−∇_s∇_s log[P(s) P(x | s, A)] = H(s)   (A.3)
  = −∇_s∇_s [ −(λ/2)(x − As)² + log P(s) ]   (A.4)
  = ∇_s [ −λAᵀ(x − As) ] − ∇_s∇_s log P(s)
  = λAᵀA − ∇_s∇_s log P(s).   (A.5)
A.2 The Hessian of the Laplacian. In the case of a Laplacian prior on s_m,

log P(s) ≡ Σ_m log P(s_m) = M log θ − θ Σ_{m=1}^{M} |s_m|.   (A.6)

Because log P(s) is piecewise linear in s, the curvature is zero, and there is a discontinuity in the derivative at zero. To approximate the volume contribution from P(s), we use

∂ log P(s)/∂s_m ≈ −θ tanh(β s_m),   (A.7)

which is equivalent to using the approximation P(s_m) ∝ cosh^(−θ/β)(β s_m). For large β, this approximates the true Laplacian prior while staying smooth around zero. This leads to the following diagonal expression for the Hessian:

∂² log P(s_m)/∂s_m² = −θ β sech²(β s_m).   (A.8)
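Equations A.7 and A.8 translate directly into code; this is a minimal sketch, with the default values of θ and β chosen only for illustration.

```python
import numpy as np

def smoothed_laplacian_grad_hess(s, theta=1.0, beta=100.0):
    """Smoothed Laplacian prior (equations A.7-A.8): gradient and diagonal
    Hessian of log P(s) under the cosh approximation. theta is the prior
    scale; beta controls the smoothing around zero.
    """
    grad = -theta * np.tanh(beta * s)                 # ~ -theta*sign(s) for |s| >> 1/beta
    hess_diag = -theta * beta / np.cosh(beta * s)**2  # -theta*beta*sech^2(beta*s)
    return grad, hess_diag
```

For large β the gradient approaches −θ sign(s_m) away from zero, while the curvature is concentrated in a narrow well of depth θβ at the origin—exactly the piecewise-linear behavior the text describes, but smooth.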
A.3 Derivation of the Learning Rule. For a set of independent data vectors X = x_1, ..., x_K, expand the posterior probability density by a saddle-point approximation:

log P(X | A) = log Π_k P(x_k | A)
  ≈ (KL/2) log(λ/2π) + (KM/2) log(2π)
    + Σ_{k=1}^{K} [ log P(ŝ_k) − (λ/2)(x_k − Aŝ_k)² − (1/2) log det H(ŝ_k) ].   (A.9)

A learning rule can be obtained by differentiating log P(x | A) with respect to A. Letting ∇_A = ∂/∂A and, for clarity, letting H = H(s), we see that there are three main derivatives that need to be considered:

∇_A log P(x | A) = ∇_A log P(ŝ) − (λ/2) ∇_A (x − Aŝ)² − (1/2) ∇_A log det H.   (A.10)
We consider each in turn.

A.3.1 Deriving ∇_A log P(ŝ). The first term in the learning rule specifies how to change A to make the representation ŝ more probable. If we assume a prior distribution with high kurtosis, this component will change the weights to make the representation sparser.
From the chain rule,

∂ log P(ŝ)/∂A = Σ_m [∂ log P(ŝ_m)/∂ŝ_m] [∂ŝ_m/∂A],   (A.11)

assuming P(ŝ) = Π_m P(ŝ_m). To obtain ∂ŝ_m/∂A, we need to describe ŝ as a function of A. If the basis is complete (and we assume low noise), then we can simply invert A to obtain ŝ = A⁻¹x. When A is overcomplete, however, there is no simple expression, but we can still make an approximation. For certain priors, the most probable solution, ŝ, will yield at most L nonzero elements. In effect, the procedure for computing ŝ selects a complete basis from A that best accounts for the data x. We can use this hypothetical reduced basis to derive a learning algorithm for the full basis matrix A. Let Ã represent a reduced basis for a particular x; that is, Ã is composed of the basis vectors in A that have nonzero coefficients. More precisely, let c_1, ..., c_L be the indices of the nonzero coefficients in ŝ. If there are fewer than L nonzero coefficients, zero-valued coefficients can be included without loss of generality. Then Ã_{1:L,i} ≡ A_{1:L,c_i}. We then have

s̃ = Ã⁻¹(x − ε),   (A.12)

where s̃ is equal to ŝ with the M − L zero-valued elements removed, and Ã is obtained by removing the columns of A corresponding to the M − L zero-valued elements of ŝ. This allows the use of results obtained for the case when A is invertible. Note that the construction of the reduced basis is only a mathematical device used for this derivation; the final gradient equation does not depend on this construction. Following MacKay (1996), we have

∂s̃_k/∂ã_ij = (∂/∂ã_ij) Σ_l Ã⁻¹_kl (x_l − ε_l).   (A.13)

Using the identity ∂Ã⁻¹_kl/∂ã_ij = −Ã⁻¹_ki Ã⁻¹_jl,

∂s̃_k/∂ã_ij = −Σ_l Ã⁻¹_ki Ã⁻¹_jl (x_l − ε_l)   (A.14)
  = −Ã⁻¹_ki s̃_j.   (A.15)

Letting z̃_k = ∂ log P(s̃_k)/∂s̃_k,

∂ log P(s̃)/∂ã_ij = −Σ_k z̃_k Ã⁻¹_ki s̃_j.   (A.16)

Changing back to matrix notation,

∂ log P(s̃)/∂Ã = −Ã⁻ᵀ z̃ s̃ᵀ.   (A.17)

This derivative can be expressed in terms of the original (that is, nonreduced) variables. We invert the mapping used to obtain the reduced coefficients, ŝ → s̃. We have z_{c_m} = z̃_m, with the remaining M − L values of z defined to be zero. We define the matrix W by W_{1:L,c_i} ≡ Ã⁻¹_{1:L,i}, with the remaining rows equal to zero. Then

∂ log P(s)/∂A = −Wᵀ z sᵀ.   (A.18)
This gives a gradient of zero for the columns of A that have zero-valued coefficients.

A.3.2 Deriving ∇_A(x − Aŝ)². The second term specifies how to change A to minimize the data misfit. Letting e_k = [x − As]_k and using the results and notation from above:

(∂/∂a_ij) (λ/2) Σ_k e_k² = λ e_i s_j + λ Σ_k e_k Σ_l a_kl (∂s_l/∂a_ij)   (A.19)
  = λ e_i s_j + λ Σ_k e_k Σ_l (−a_kl w_li s_j)   (A.20)
  = λ e_i s_j − λ e_i s_j = 0.   (A.21)

Thus, there is no gradient component arising from the error term, as might be expected if the residual error has no structure.

A.3.3 Deriving ∇_A log det H. The third term in the learning rule specifies how to change the weights to minimize the width of the posterior distribution P(x | A) and thus increase the overall probability of the data. Using the chain rule,

∂ log det H/∂a_ij = (1/det H) Σ_mn (∂ det H/∂H_mn)(∂H_mn/∂a_ij).   (A.22)
An element of H is defined by

H_mn = Σ_k λ a_km a_kn + b_mn   (A.23)
  = c_mn + b_mn,   (A.24)

where b_mn = (−∇_s∇_s log P(ŝ))_mn. Using the identity ∂ det H/∂H_mn = (det H) H⁻¹_nm, we obtain

∂ log det H/∂a_ij = Σ_mn H⁻¹_nm [ ∂c_mn/∂a_ij + ∂b_mn/∂a_ij ].   (A.25)
Considering the first term in the square brackets,

∂c_mn/∂a_ij = 0         if m ≠ j and n ≠ j,
            = λ a_in    if m = j and n ≠ j,
            = λ a_im    if m ≠ j and n = j,
            = 2λ a_ij   if m = n = j.   (A.26)
We then have

Σ_mn H⁻¹_nm (∂c_mn/∂a_ij) = Σ_{m≠j} H⁻¹_mj λ a_im + Σ_{m≠j} H⁻¹_jm λ a_im + H⁻¹_jj 2λ a_ij   (A.27)
  = 2λ Σ_m H⁻¹_mj a_im,   (A.28)

using the fact that H⁻¹_mj = H⁻¹_jm due to the symmetry of the Hessian. In matrix notation this becomes

Σ_mn H⁻¹_nm (∂c_mn/∂A) = 2λ A H⁻¹.   (A.29)

Next we derive ∂b_mn/∂a_ij. Assuming P(ŝ) = Π_m P(ŝ_m), we have that ∇_s∇_s log P(ŝ) is diagonal, and thus

Σ_mn H⁻¹_nm (∂b_mn/∂a_ij) = Σ_m H⁻¹_mm (∂b_mm/∂ŝ_m)(∂ŝ_m/∂a_ij).   (A.30)
Letting 2ym = H ¡1 sm and using the result under the reduced repremm @bmm / @O sentation (see equation A.15), X mm
H ¡1 mm
@bmm = ¡2W T yˆs T . @A
(A.31)
Finally, putting the results for ∂c_{mn}/∂A and ∂b_{mn}/∂A back into equation A.25,

\[ \frac{\partial \log \det H}{\partial A} = 2\lambda A H^{-1} - 2 W^T y \hat{s}^T. \qquad (A.32) \]

Gathering the three components of ∇_A log P(x | A) together yields the following expression for the learning rule:

\[ \nabla_A \log P(x \mid A) = -W^T z \hat{s}^T - \lambda A H^{-1} + W^T y \hat{s}^T. \qquad (A.33) \]
A.3.4 Stabilizing and Simplifying the Learning Rule. Using the gradient given in equation A.33 directly is problematic due to the matrix inverses, which make it both impractical and unstable. This can be alleviated, however, by multiplying the gradient by an appropriate positive definite matrix (Amari et al., 1996). This rescales the components of the gradient but still preserves a direction valid for optimization. The fact that A^T W^T = I for all W allows us to eliminate W from the learning rule:

\[ A A^T \nabla_A \log P(x \mid A) = -A z \hat{s}^T - \lambda A A^T A H^{-1} + A y \hat{s}^T. \qquad (A.34) \]
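The rescaling step can be checked numerically: for any matrix A, multiplying a gradient G on the left by the positive semidefinite matrix A A^T yields a direction whose inner product with G is nonnegative, so it remains valid for gradient ascent. A small sketch (arbitrary random matrices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6))       # stand-in basis matrix
G = rng.normal(size=(4, 6))       # stand-in for grad_A log P(x|A)
M = A @ A.T                       # positive semidefinite rescaling matrix
rescaled = M @ G

# tr(G^T A A^T G) = ||A^T G||_F^2 >= 0, so the rescaled direction
# is never a descent direction for the original gradient
print(np.trace(G.T @ rescaled) >= 0)  # True
```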
Additional simplifications can be made. If λ is large (low noise), the Hessian is dominated by λ A^T A, and

\[ -\lambda A A^T A H^{-1} = -A \lambda A^T A (\lambda A^T A + B)^{-1} \qquad (A.35) \]
\[ \approx -A. \qquad (A.36) \]
It is also possible to obtain more accurate approximations of this term (Lewicki & Olshausen, 1998, 1999). The vector y hides a computation involving the inverse Hessian. If the basis vectors in A are randomly distributed, then as the dimensionality of A increases, the basis vectors become approximately orthogonal, and consequently the Hessian becomes approximately diagonal. In this case,

\[ y_m \approx \frac{1}{H_{mm}} \frac{\partial b_{mm}}{\partial \hat{s}_m} \qquad (A.37) \]
\[ = \frac{\partial b_{mm} / \partial \hat{s}_m}{\lambda \sum_k a_{km}^2 + b_{mm}}. \qquad (A.38) \]
Thus, if log P(s) and its derivatives are smooth, y_m vanishes for large λ. In practice, excluding y from the learning rule yields more stable learning. We conjecture that this term can be ignored, possibly because it represents curvature components that are unrelated to the volume. We thus obtain the following expression for a learning rule:

\[ \Delta A = A A^T \nabla_A \log P(x \mid A) \approx -A z \hat{s}^T - A \qquad (A.39) \]
\[ = -A (z \hat{s}^T + I). \qquad (A.40) \]
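As an illustration (not the authors' implementation), the simplified update of equations A.39–A.40 can be sketched numerically. Here a Laplacian prior is assumed, so z = −θ sign(ŝ); a thresholded pseudoinverse stands in for the MAP coefficients ŝ, which the paper obtains by optimization, and the learning rate and threshold are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian_score(s, theta=1.0):
    # z = d/ds log P(s) for a Laplacian prior P(s) ~ exp(-theta |s|)
    return -theta * np.sign(s)

def update_basis(A, S, eta=0.01):
    """One batch step of the simplified rule dA = -A (z s_hat^T + I),
    averaged over the columns of S (one coefficient vector per sample)."""
    M = A.shape[1]
    G = np.zeros((M, M))
    for s in S.T:
        z = laplacian_score(s)
        G += np.outer(z, s) + np.eye(M)
    G /= S.shape[1]
    return A - eta * A @ G

# toy overcomplete setting: 2-D data, 3 basis vectors
A = rng.normal(size=(2, 3))
A /= np.linalg.norm(A, axis=0)        # unit-length basis vectors
X = rng.normal(size=(2, 100))
S = np.linalg.pinv(A) @ X             # stand-in for MAP coefficients s_hat
S[np.abs(S) < 0.5] = 0.0              # crude sparsification (illustrative only)

A_new = update_basis(A, S)
print(A_new.shape)  # (2, 3)
```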
Acknowledgments
We thank Bruno Olshausen and Tony Bell for many helpful discussions.
References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). San Mateo, CA: Morgan Kaufmann.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3(2), 213–251.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Buccigrossi, R. W., & Simoncelli, E. P. (1997). Image compression via joint statistical characterization in the wavelet domain (Tech. Rep. No. 414). Philadelphia: University of Pennsylvania.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4, 109–111.
Chen, S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University.
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2), 713–718.
Comon, P. (1994). Independent component analysis, a new concept. Signal Processing, 36(3), 287–314.
Daubechies, I. (1990). The wavelet transform, time-frequency localization, and signal analysis. IEEE Transactions on Information Theory, 36(5), 961–1004.
Daugman, J. G. (1988). Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7), 1169–1179.
Daugman, J. G. (1989). Entropy reduction and decorrelation in visual coding by oriented neural receptive fields. IEEE Transactions on Biomedical Engineering, 36(1), 107–114.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6(4), 559–601.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Lee, T.-W., Lewicki, M., Girolami, M., & Sejnowski, T. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6, 87–90.
Lewicki, M. S., & Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an efficient coding framework. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10. San Mateo, CA: Morgan Kaufmann.
Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 16(7), 1587–1601.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript, University of Cambridge, Cavendish Laboratory. Available online at ftp://wol.ra.phy.cam.ac.uk/pub/mackay/ica.ps.gz.
Makeig, S., Jung, T.-P., Bell, A. J., Ghahremani, D., & Sejnowski, T. J. (1996). Blind separation of event-related brain response components. Psychophysiology, 33, S58.
Mallat, S. G., & Zhang, Z. F. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12), 3397–3415.
McKeown, M. J., Jung, T.-P., Makeig, S., Brown, G. G., Kindermann, S. S., Lee, T., & Sejnowski, T. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proceedings of the National Academy of Sciences USA, 95, 803–810.
Meszaros, C. (1997). BPMPD: An interior point linear programming solver. Code available online at ftp://ftp.netlib.org/opt/bpmpd.tar.gz.
Nadal, J.-P., & Parga, N. (1994a). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Nadal, J.-P., & Parga, N. (1994b). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Network, 5, 565–581.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23).
Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In Proc. of IEEE NNSP'97 (pp. 24–26).
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. San Mateo, CA: Morgan Kaufmann.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38, 587–607.

Received February 8, 1998; accepted February 1, 1999.
LETTER
Communicated by Adi Bulsara
Noise in Integrate-and-Fire Neurons: From Stochastic Input to Escape Rates

Hans E. Plesser
MPI für Strömungsforschung, D-37073 Göttingen, Germany

Wulfram Gerstner
MANTRA Center for Neuromimetic Systems, Swiss Federal Institute of Technology, EPFL/DI, CH-1015 Lausanne, Switzerland

We analyze the effect of noise in integrate-and-fire neurons driven by time-dependent input and compare the diffusion approximation for the membrane potential to escape noise. It is shown that for time-dependent subthreshold input, diffusive noise can be replaced by escape noise with a hazard function that has a gaussian dependence on the distance between the (noise-free) membrane voltage and the threshold. The approximation is improved if we add to the hazard function a probability current proportional to the derivative of the voltage. Stochastic resonance in response to periodic input occurs in both noise models and exhibits similar characteristics.
1 Introduction
Given the enormous number of degrees of freedom in single neurons due to millions of ion channels and thousands of presynaptic signals impinging on the neuron, it seems natural to describe neuronal activity as a stochastic process (Tuckwell, 1989). Since in most neurons the output signal consists of stereotypical electrical pulses (spikes), the theory of point processes has long been employed to analyze neuronal activity (Perkel, Gerstein, & Moore, 1967; Johnson, 1996). The simplest point process model of neural responses to stimuli is that of an inhomogeneous Poisson process with an instantaneous firing rate that depends on the stimulus, and thus on time. Models of this type have successfully been employed to characterize neural activity in the auditory system (Siebert, 1970; Johnson, 1980; Gummer, 1991) and to investigate the facilitation of signal transduction by noise, a phenomenon known as stochastic resonance (Collins, Chow, Capela, & Imhoff, 1996). Stevens and Zador (1996) have recently suggested that in the limit of very low firing rates, Poisson models of neuronal activity indeed capture most of the dynamics of more complex models.

Neural Computation 12, 367–384 (2000)
© 2000 Massachusetts Institute of Technology
368
Hans E. Plesser and Wulfram Gerstner
On the other hand, much recent research indicates that the precise timing of neuronal spikes plays a prominent role in information processing (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Thus, spiking neuron models are required that provide more insight into the relation between the input current a neuron receives and the spike train it generates as output. In short, these model neurons generate an output spike whenever the membrane potential v, driven by an input current I, reaches a threshold H; afterward the potential v is reset to some value v_res. For a recent review, see Gerstner (1998). The archetypical spiking neuron model is the integrate-and-fire neuron introduced by Lapicque (Tuckwell, 1988). Stein (1965) first replaced a continuous input current with more realistic random sequences of excitatory and inhibitory input pulses. If each pulse is small and a large number of pulses impinges on the neuron per membrane time constant, the effective input to the neuron can be described by a time-dependent deterministic current I(t) and additive noise ξ(t), which is taken to be gaussian. In this limit, the evolution of the membrane potential from reset to threshold is a time-dependent Ornstein-Uhlenbeck process with the threshold as an absorbing boundary (Johannesma, 1968; Lánský, 1997). We will refer to this as the diffusion approximation of neuronal activity. The separation of the total input current into a deterministic part I(t) and noise is much less clear-cut in neuronal modeling than in physical processes such as Brownian motion. Recent research suggests that the variability of input spike arrival times plays a much greater role than intrinsic noise (Mainen & Sejnowski, 1995; Bryant & Segundo, 1976). Stochastic spike arrival may arise in noise-free networks with heterogeneous connections, as shown in recent theoretical studies (van Vreeswijk & Sompolinsky, 1996; Brunel & Hakim, 1999).
In these studies, balanced excitation and inhibition prepared the neuron in a state that is just subthreshold. Spikes are then triggered by the fluctuations in the input. Neurons in this regime have a large coefficient of variation, just as cortical neurons do. For a recent review, see König, Engel, & Singer, 1996. The subthreshold regime is also particularly interesting from the point of view of information transmission. Neurons optimally detect temporal information if the mean membrane potential is roughly one standard deviation of the noise below threshold (Kempter, Gerstner, van Hemmen, & Wagner, 1998). The improvement of signal transduction by noise, known as stochastic resonance, also appears to be limited to this regime (Bulsara & Zador, 1996). Therefore we will focus here on neurons driven by weakly subthreshold input—those neurons that would be silent in the absence of noise. In this work, we take the diffusion approximation sketched above as the reference model for neuronal noise. A disadvantage of this model is that it is difficult to solve analytically, particularly for time-dependent input I(t). The interval distribution for a given input I(t) is mathematically equivalent
Noise in Integrate-and-Fire Neurons
369
to a first-passage-time problem, which is known to be hard. Our aim in this study is to replace the stochasticity introduced by the diffusion term by a mathematically more tractable inhomogeneous point process. To this end, we investigate various escape processes across the firing threshold. An escape process is completely described by a hazard function h(t), which depends only on the momentary values of the membrane potential v(t) and the input current I(t). We want to identify that function h which reproduces as closely as possible the behavior of the diffusion model, for both periodic and aperiodic inputs. We demonstrate that the optimal hazard function not only reproduces the generation of individual spikes well, but that complex response properties such as stochastic resonance are preserved.

2 Integrate-and-Fire Neuron Models

2.1 Diffusion Model. Between two output spikes, the membrane potential of an integrate-and-fire neuron receiving input I(t) evolves according to the following Langevin equation (Tuckwell, 1989):

\[ \dot{v}(t) = -v(t) + I(t) + \sigma \xi(t). \qquad (2.1) \]
Upon crossing the threshold, v(t) = 1, a spike is recorded and the potential is reset to v = 0. ξ(t) is gaussian white noise with autocorrelation ⟨ξ(t)ξ(t′)⟩ = δ(t − t′). The diffusion constant σ is the root mean square (rms) amplitude of this background noise. Note that we have used dimensionless units above; time is measured in units of the membrane time constant, while voltage is given in terms of the threshold H. As a consequence of the diffusion approximation, the input I(t) will not contain any δ pulses, and v(t) follows continuous trajectories from reset potential to threshold. Let us denote the firing time of the last output spike by t*. In the noise-free case (σ = 0), the potential v(t) for t > t* depends on the input I(t′) for t* < t′ < t and can be found by integration of equation 2.1:

\[ v_0(\tau \mid t^*, I(\cdot)) = e^{-\tau} \int_0^{\tau} I(t^* + s)\, e^{s}\, ds. \qquad (2.2) \]

Our notation makes explicit that the noise-free trajectory can be calculated given the last firing time t* and the input I. The subscript 0 is intended to remind the reader that equation 2.2 describes the noise-free trajectory. The next output spike of the noiseless model would occur at a time t* + τ defined by the first threshold crossing,

\[ \tau = \min \left\{ \tau' > 0 \mid v_0(\tau' \mid t^*, I(\cdot)) \geq 1 \right\}. \qquad (2.3) \]
In the presence of noise, the actual membrane trajectory will fluctuate around the reference trajectory (see equation 2.2) with a variance of the
order of σ. We can no longer predict the exact timing of the next spike, but we may ask for the probability distribution ρ(τ | t*, I(·)) that a spike occurs at t* + τ, given that the last spike was at t* and the current is I(t′) for t* < t′ < t* + τ:

\[ \rho(\tau \mid t^*, I(\cdot))\,\Delta\tau = \Pr\left\{ \text{spike in } [t^*+\tau,\ t^*+\tau+\Delta\tau) \ \middle|\ \text{last reset at } t^* \text{ and input } I(t^*+\tau'),\ \tau' > 0 \right\}. \qquad (2.4) \]

Our notation makes explicit that no information about the past of the neuron going further back than the last firing or reset at t* is needed. We may think of equation 2.4 as a conditional interspike interval (ISI) distribution. For the diffusion noise model, analytical expressions for the interval distributions are not known, except for some rare special cases (Tuckwell, 1989). It is, however, possible to calculate these ISI distributions numerically based on Schrödinger's renewal ansatz (Plesser & Tanaka, 1997; Linz, 1985). (Source code is available from us on request.) In the following we use the diffusion model as the reference noise model.

If there were no absorbing threshold, equation 2.1 would describe free diffusion with drift. Then the pertaining Fokker-Planck equation,

\[ \frac{\partial}{\partial \tau} P_f(v, t^*+\tau \mid t^*) = -\frac{\partial}{\partial v} \left[ \left( -v + I(t^*+\tau) \right) P_f(v, t^*+\tau \mid t^*) \right] + \frac{\sigma^2}{2} \frac{\partial^2}{\partial v^2} P_f(v, t^*+\tau \mid t^*), \qquad (2.5) \]

with boundary conditions lim_{v→±∞} P_f(v, t*+τ | t*) = 0, has the solution

\[ P_f(v, t^*+\tau \mid t^*) = \frac{1}{\sqrt{2\pi \gamma^2(\tau)}} \exp\left\{ -\frac{[v - v_0(\tau \mid t^*, I(\cdot))]^2}{2\gamma^2(\tau)} \right\}, \qquad (2.6) \]

where γ²(τ) = (σ²/2)(1 − e^{−2τ}). As the exponential term in γ(τ) decays at twice the rate of those in v_0(τ | t*, I(·)), we may approximate for τ ≫ 1

\[ P_f(v, t^*+\tau \mid t^*) \approx \frac{1}{\sqrt{\pi}\,\sigma} \exp\left\{ -\frac{[v - v_0(\tau \mid t^*, I(\cdot))]^2}{\sigma^2} \right\}. \qquad (2.7) \]
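The diffusion model can be simulated directly. The following sketch (not the authors' code; step size, trial count, and stimulus values are arbitrary choices) uses a simple Euler–Maruyama discretization of equation 2.1, recording first-passage times through the threshold v = 1 after reset to v = 0:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_isi(I, sigma, dt=1e-3, n_trials=500, t_max=30.0):
    """Estimate interspike intervals of eq. 2.1 by Euler-Maruyama:
    dv = (-v + I(t)) dt + sigma sqrt(dt) xi, threshold 1, reset 0."""
    v = np.zeros(n_trials)
    alive = np.ones(n_trials, dtype=bool)   # trials that have not fired yet
    isi = np.full(n_trials, np.nan)
    for k in range(int(t_max / dt)):
        t = k * dt
        noise = rng.standard_normal(n_trials)
        v[alive] += (-v[alive] + I(t)) * dt + sigma * np.sqrt(dt) * noise[alive]
        crossed = alive & (v >= 1.0)        # absorbing threshold
        isi[crossed] = t
        alive &= ~crossed
        if not alive.any():
            break
    return isi[~np.isnan(isi)]              # intervals of trials that fired

# weakly subthreshold periodic input (cf. section 2.3)
I = lambda t: 0.85 + 0.1 * np.sqrt(2.0) * np.cos(np.pi * t)
isis = simulate_isi(I, sigma=0.1)
print(isis.size)
```

A histogram of `isis` approximates the interval distribution ρ(τ | t* = 0) for this stimulus.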
2.2 Escape Noise and Hazard Functions. In a noise-free model, the membrane trajectory v_0(τ | t*, I(·)) is given by equation 2.2, and the time of the next spike is determined by equation 2.3. In the presence of noise, firing is no longer precise but may occur even though the noise-free trajectory has not yet reached threshold. A simple, intuitive noise model can be based on the idea of an escape probability: at each moment of time, the neuron may fire with an instantaneous rate h, which depends on the momentary distance
between the noise-free trajectory v_0 and the threshold, and possibly on the momentary input current as well. More generally, we may introduce a hazard function h(τ | t*, I(·)) that describes the risk of escape across the threshold and depends on the last firing time t* and the input I(t′) for t* < t′ < t* + τ. Once we know the hazard function, the ISI distribution is given by (Cox & Lewis, 1966)

\[ \rho(\tau \mid t^*, I(\cdot)) = h(\tau \mid t^*, I(\cdot)) \exp\left( -\int_0^{\tau} h(s \mid t^*, I(\cdot))\, ds \right). \qquad (2.8) \]

The exponential term accounts for the probability that a neuron survives from t* to t* + τ without firing; the factor h(τ | t*, I(·)) gives the rate of firing at t* + τ, provided that the neuron has survived thus far. We discuss in this work four models of simplified neuronal dynamics, all of which aim to approximate the diffusion model by an escape noise ansatz. The various models differ in the choice of the function h(τ | t*, I(·)) (see Figure 1). These hazard functions depend on the noise-free membrane potential v_0 and the input current I only through two scaled quantities. One is the momentary distance x between the membrane potential v_0 and the threshold H = 1, scaled by the noise amplitude:

\[ x(\tau \mid t^*) = \frac{1 - v_0(\tau \mid t^*)}{\sigma}. \qquad (2.9) \]

The other is the net current charging the membrane in the noise-free case, also scaled by the noise amplitude:

\[ Y(\tau \mid t^*) = \frac{-v_0(\tau \mid t^*) + I(t^*+\tau)}{\sigma} = \frac{1}{\sigma} \frac{d}{d\tau} v_0(\tau \mid t^*). \qquad (2.10) \]
The last equality follows immediately from equation 2.1 with σ = 0 and indicates that Y can be interpreted as a relative velocity of the noise-free membrane potential v_0.

2.2.1 Arrhenius Model. As long as the membrane potential v stays sufficiently far below threshold, we may describe neuronal firing as a noise-activated process. The latter is most easily characterized by an Arrhenius ansatz for the hazard function (van Kampen, 1992):

\[ h_{\mathrm{Arr}}(\tau \mid t^*) = w \exp\left\{ -\frac{(1 - v_0(\tau \mid t^*))^2}{\sigma^2} \right\} = w\, e^{-x(\tau \mid t^*)^2}, \qquad w_{\mathrm{opt}} = 0.95. \qquad (2.11) \]

Here, 1 − v_0(τ | t*) is the voltage gap that needs to be bridged to initiate a spike. Its square may be interpreted as the required activation energy
Figure 1: Hazard functions h(v) versus membrane potential v. (a) Arrhenius model (solid), error function model (dash-dotted), and Tuckwell model (dashed). (b) Arrhenius&Current model for different values of the input current: I = 0.95 (solid), 1 (dash-dotted), and 1.1 (dashed); the case I = 0 is drawn dotted for comparison. In all cases, the threshold is at v = 1 and the noise amplitude is σ = 0.1. The vertical dotted lines mark one noise amplitude below threshold.
and σ² as the energy supplied by the noise. The free parameter w was determined by an optimization procedure described below; we found w = w_opt = 0.95 to be the optimal value. Note that for strongly superthreshold stimuli (v_0 ≫ 1), the hazard function vanishes exponentially. This might seem paradoxical at first but is of little concern as long as the input I(t) contains no δ pulses and v_0 reaches the threshold only along continuous trajectories. Then superthreshold levels of the potential are accessible only via periods of maximum hazard, so that the neuron will usually have fired before v_0 becomes significantly superthreshold.

2.2.2 Arrhenius&Current Model. Strong positive transients in the input current will push the membrane potential toward and across the threshold in a time that is short on the timescale of diffusion. Figuratively, the membrane potential distribution P_f is shifted as a whole. The simple Arrhenius ansatz
will not be able to reproduce these transients well. We therefore consider, in addition to the diffusive current, the probability current induced by shifting the probability density at the threshold, P_f(H = 1, τ | t*), across threshold with the speed of the center of this distribution, (d/dτ) v_0(τ | t*) = −v_0(τ | t*) + I(t* + τ). Since there can be no drift from above threshold downward, we set all negative currents to zero and obtain the drift probability current

\[ J_{\mathrm{drift}}(\tau \mid t^*) = \left[ -v_0(\tau \mid t^*) + I(t^*+\tau) \right]_+ P_f(H=1, \tau \mid t^*) = \frac{[Y(\tau \mid t^*)]_+}{\sqrt{\pi}}\, e^{-x(\tau \mid t^*)^2}. \]

P_f(v, τ | t*) is again the free gaussian distribution given by equation 2.7, and [x]_+ ≡ (x + |x|)/2. Together with the original, diffusive Arrhenius term, we obtain the hazard function

\[ h_{\mathrm{AC}}(\tau \mid t^*) = h_{\mathrm{Arr}}(\tau \mid t^*) + J_{\mathrm{drift}}(\tau \mid t^*) = \left( w + \frac{[Y(\tau \mid t^*)]_+}{\sqrt{\pi}} \right) e^{-x(\tau \mid t^*)^2}, \qquad w_{\mathrm{opt}} = 0.72. \qquad (2.12) \]
As before, w is a free parameter with optimal value w_opt. Below we refer to the first term of equation 2.12 as the diffusion term and to the second as the drift term.

2.2.3 Sigmoidal Model. Abeles (1982) has suggested that the hazard should be related to the proportion of the free membrane potential distribution P_f that is beyond the threshold, that is, h(τ | t*) ∝ ∫_H^∞ P_f(v, τ | t*) dv ∝ erfc x(τ | t*). This ansatz is questionable, as the correct potential distribution vanishes beyond the threshold. In view of the widespread use of sigmoidal activation functions in the theory of neural networks (Wilson & Cowan, 1972), we nevertheless test a sigmoidal model, but provide it with two free parameters w_1 and w_2:

\[ h_{\mathrm{erf}}(\tau \mid t^*) = w_1\, \mathrm{erfc}[x(\tau \mid t^*) - w_2], \qquad w_1^{\mathrm{opt}} = 0.66,\ w_2^{\mathrm{opt}} = 0.53. \qquad (2.13) \]
erfc(x) = 1 − erf(x) is the complementary error function. Given the great similarity of the error function and the hyperbolic tangent, we have not extensively investigated the latter; preliminary results indicate negligible differences (data not shown).

2.2.4 Tuckwell Model. Suppose that the input I(t) varies sufficiently slowly, so that we may, at any point, consider the membrane potential of the neuron to be stationary. A stationary potential v_0(τ | t*) corresponds to
a constant input current I_s = v_0(τ | t*). In this case, the average firing rate, and thus the hazard, can be approximated in closed form (Tuckwell, 1989):

\[ h_{\mathrm{Tuck}}(\tau \mid t^*) = \frac{x(\tau \mid t^*)}{\sqrt{\pi}}\, e^{-x(\tau \mid t^*)^2}\, \theta(x(\tau \mid t^*)). \qquad (2.14) \]

The Tuckwell model has no free parameter. The Heaviside step function θ(x) is introduced to avoid negative values of the hazard. The approximation made in computing the rate is valid only for stimuli far below threshold, that is, x(τ | t*) ≫ 1. Note that this rate is identical to the diffusive current at the location of the threshold for the threshold-free case,

\[ J_{\mathrm{diff}}(v=1, \tau \mid t^*) = -\frac{\sigma^2}{2} \left. \frac{\partial}{\partial v} P_f(v, \tau \mid t^*) \right|_{v=1}, \]
where P_f(v, τ | t*) is the free gaussian distribution (see equation 2.7).

2.3 Input. To compare the responses of integrate-and-fire neurons with various hazard functions to those with diffusion noise, we have to choose an input scenario. We try to be as general as possible and consider time-dependent inputs with both aperiodic and periodic time courses. The general form of the input is an oscillatory component of rms amplitude q fluctuating around a DC value μ,

\[ I(t) = \mu + \frac{q\sqrt{2}}{\sqrt{\sum_k a_k^2}} \sum_j a_j \cos(\omega_j t + \varphi_j), \qquad (2.15) \]

where ω_j = j Δω, with Δω the base frequency. Note that q² = lim_{T→∞} (1/T) ∫_0^T (I(t) − μ)² dt is the expected variance of the current. The membrane potential in response to this input in the absence of noise is found from equation 2.2:

\[ v_0(\tau \mid t^*) = \mu \left( 1 - e^{-\tau} \right) + q\sqrt{2} \left[ F(t^*+\tau) - e^{-\tau} F(t^*) \right], \]
\[ F(t) = \frac{1}{\sqrt{\sum_k a_k^2}} \sum_j \frac{a_j}{\sqrt{1 + \omega_j^2}} \sin\left( \omega_j t + \varphi_j + \operatorname{acot} \omega_j \right). \qquad (2.16) \]
Aperiodic stimuli are characterized by a cutoff frequency Ω_c and are generated by drawing the phases φ_j at random from a uniform distribution over [0, 2π), while the amplitudes are given by
\[ a_j = \begin{cases} 0 & \omega_j \leq 0, \\ 1 & 0 < \omega_j \leq \Omega_c, \\ \exp\left( -\tfrac{1}{2} (j - j_c)^2 \right) & j_c < j, \end{cases} \qquad (2.17) \]

with j_c the index of the cutoff frequency, ω_{j_c} = Ω_c.
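The stimulus construction of equations 2.15–2.17 can be sketched as follows (a minimal illustration; the base frequency Δω, number of terms, and random seed are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def aperiodic_stimulus(mu, q, omega_c, d_omega=0.05, n_terms=400):
    """Build I(t) of eq. 2.15 with the aperiodic amplitude profile of
    eq. 2.17 and random phases. Returns a callable I(t)."""
    j = np.arange(1, n_terms + 1)
    omega = j * d_omega
    j_c = omega_c / d_omega                      # index of the cutoff frequency
    a = np.where(omega <= omega_c, 1.0, np.exp(-0.5 * (j - j_c) ** 2))
    phi = rng.uniform(0.0, 2.0 * np.pi, n_terms)
    norm = q * np.sqrt(2.0) / np.sqrt(np.sum(a ** 2))
    def I(t):
        t = np.asarray(t, dtype=float)
        return mu + norm * np.sum(a * np.cos(omega * t[..., None] + phi), axis=-1)
    return I

I = aperiodic_stimulus(mu=0.85, q=0.1, omega_c=np.pi)
t = np.linspace(0.0, 400.0, 4001)
x = I(t)
print(float(x.mean()), float(x.std()))  # approx. mu and q
```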
Periodic input is taken to be cosinusoidal with frequency Ω = ω_k. We set a_j = δ_{jk} in equation 2.15 and obtain I(t) = μ + q√2 cos(Ωt + φ). To compare the performance of the various models across stimuli, we describe each stimulus by its relative distance from threshold,

\[ \epsilon = \frac{1 - \left( \mu + \sqrt{2 \langle \Delta v_0^2 \rangle} \right)}{\sigma}, \qquad (2.18) \]

where

\[ \langle \Delta v_0^2 \rangle = \lim_{T \to \infty} \frac{1}{T} \int_0^T \left( v_0(\tau \mid t^*) - \mu \right)^2 d\tau = \frac{q^2}{\sum_k a_k^2} \sum_j \frac{a_j^2}{1 + \omega_j^2} \]

is the rms amplitude of the membrane potential oscillations in the noise-free case. For periodic input, the factor of √2 in the definition of ε yields

\[ \sigma \epsilon = 1 - \left( \mu + \frac{q\sqrt{2}}{\sqrt{1 + \Omega^2}} \right) = 1 - \sup_{\tau \geq 0} v_0(\tau \mid t^*). \]
Thus subthreshold stimuli have ε > 0, superthreshold stimuli ε < 0. Note that for periodic stimuli, ε > 0 indeed guarantees that the stimulus is subthreshold: in the absence of noise, the neuron never fires. For aperiodic input, however, the definition holds only in an rms sense, so that v_0 may occasionally cross the threshold even for ε > 0.

2.4 Choice of Stimuli. The results presented here were obtained from a set of 14,400 periodic and 14,400 aperiodic stimuli. DC inputs were in the range 0.55 ≤ μ ≤ 1.2, AC amplitudes approximately 0.1(1 − μ) < q√2 < 1.5(1 − μ), and stimulus/cutoff frequencies 0.02π ≤ Ω_c ≤ 2π. For each set of these parameters, five phases were chosen randomly from [0, 2π) for periodic stimuli, and five different random stimuli were generated for the aperiodic case. Each stimulus was tested at eight noise amplitudes, randomly chosen so that the stimuli would have a roughly uniform distribution with respect to their relative distance from threshold, with 0.1 < |ε| < 3. Some two-thirds of all stimuli were subthreshold.

Some stimuli were excluded from analysis. The rejection was based solely on the ISI distributions computed for the diffusion approximation. Specifically, stimuli were excluded if the firing probability was so low that their norm was insufficient (∫_0^T ρ(τ) dτ < 0.8), where T was 20 stimulus periods for periodic stimuli and 409.4 for aperiodic stimuli. Furthermore, stimuli were rejected if the numerical algorithm was unstable at the time resolution chosen. We verified for some of these cases that, with appropriate discretization,
the algorithm did converge. The instabilities were caused by very steep threshold crossings of the membrane potential in combination with small noise. After defective stimuli had been excluded, we were left with 8714 periodic stimuli (4706 of them subthreshold) and 12,181 aperiodic stimuli (8027 subthreshold).

2.5 Parameter Optimization. The weights in the models with free parameters were chosen as follows. We split the entire set of stimuli into an optimization and a validation set. For each single stimulus, we determined the weight w, or set of weights w_1, w_2, that minimized the error E defined in equation 3.1 below for this one stimulus, using MATLAB minimization routines. We repeated the procedure for all stimuli in the optimization set and constructed the distribution of weights found in this way. The weights w_opt (respectively, w_1^opt, w_2^opt) are the medians of the weight distributions. Evaluating the error E over both the optimization and the validation set with the fixed weight w_opt yielded identical results; overfitting can therefore be excluded.

3 Results

3.1 ISI Distributions. Figure 2 shows the ISI distribution evoked from the diffusion model by a subthreshold aperiodic stimulus. The predicted ISI distribution for the Arrhenius&Current model (dashed) has been based on equation 2.4. To observe this ISI distribution in an experiment, one would present the stimulus shown as the dotted line in Figure 2c repeatedly, starting over at t = 0 each time a spike has been fired. We may also think of Figure 2a as a PSTH (peri-stimulus time histogram) in which each spike triggers a reset of the stimulus. To measure the agreement between this reference solution and the distributions rendered by the escape noise models, we use the relative integrated mean square error (Scott, 1992):
\[ E = \frac{\int_0^\infty d\tau\, \left[ \rho(\tau \mid t^*=0) - \rho_{\mathrm{esc}}(\tau \mid t^*=0) \right]^2}{\int_0^\infty d\tau\, \rho(\tau \mid t^*=0)^2}. \qquad (3.1) \]
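Equation 3.1 discretizes directly; in this sketch, ρ and ρ_esc are sampled on a regular grid, and the densities used in the example are toy placeholders, not distributions from the paper:

```python
import numpy as np

def relative_imse(rho_ref, rho_esc, dt):
    """Relative integrated mean square error of eq. 3.1 on a regular grid."""
    num = np.sum((rho_ref - rho_esc) ** 2) * dt
    den = np.sum(rho_ref ** 2) * dt
    return num / den

dt = 0.01
tau = np.arange(0.0, 20.0, dt)
rho_ref = np.exp(-tau)            # toy reference ISI density
rho_esc = np.exp(-1.05 * tau)     # toy escape-model ISI density
print(relative_imse(rho_ref, rho_esc, dt))
```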
Here, ρ(τ | t* = 0) is the ISI distribution for the diffusion model, while ρ_esc(τ | t* = 0) is the distribution obtained from any one of the escape noise models. The example in Figure 2 corresponds to an error of E = 0.026. Figure 3 shows the cumulative distribution of the error E for all four models. The Arrhenius&Current model performs significantly better than all the other models for sub- and superthreshold stimuli, both periodic and aperiodic, while the Tuckwell model does worst. The Arrhenius and the error function models are tied for second, with the Arrhenius
Figure 2: (a) Interval distribution from the diffusion model (solid) and the Arrhenius&Current model (dashed) for aperiodic input with DC offset μ = 0.85, AC amplitude q = 0.1, cutoff frequency Ω_c = π, and noise amplitude σ = 0.1. The relative distance from threshold is ε = 0.88 and the error E = 0.026. (b) Hazard as a function of time for the Arrhenius&Current model (solid). The dashed line is the diffusive component of the hazard defined in equation 2.12. (c) Input current (dotted) and noise-free membrane potential for the given stimulus (solid). The membrane potential stays below the threshold (dashed) but approaches it closely around t ≈ 27. Note that peaks in the input current in (c) lead to sharp transients of the hazard in (b) that would not be reproduced correctly by the diffusion term alone. Thus the firing probability of the neuron is correlated with maxima of the current rather than of the membrane potential, in excellent agreement with the exact diffusion model; see the close fit in (a).
378
Hans E. Plesser and Wulfram Gerstner
[Figure 3 panels (a)–(d): cumulative probability P(E) versus error E on logarithmic axes from 10^−3 to 10^0.]
Figure 3: Cumulative distribution of the relative mean square error E over all stimuli for the Arrhenius (solid), Arrhenius&Current (dash-dotted), error function (dashed), and Tuckwell (dotted) models. The stimuli were (a) subthreshold periodic, (b) superthreshold periodic, (c) subthreshold aperiodic, and (d) superthreshold aperiodic. To interpret the cumulative distributions, consider the likelihood that the error E of our approximation is smaller than 0.1. One may read off from the graphs that, for example, for the subthreshold periodic stimuli in (a), the Tuckwell model gives a probability of E ≤ 0.1 of about 50%, whereas the Arrhenius model and the error function model give about 80% and the Arrhenius&Current model a probability of more than 95%.
model yielding slightly smaller errors for superthreshold stimuli. To quantify the performance of the models, Table 1 gives the median and 90th percentile of the error for all models. Note that for the Arrhenius&Current model, taken over all stimulus conditions, the error is E < 0.077 in 90% of all cases. We therefore conclude that this model provides an excellent approximation to the dynamics of the diffusion model of neuronal activity. We now describe in more detail how the model error depends on the stimuli and briefly discuss why the models perform so differently. Figure 4
Table 1: Median (= 50th) and 90th Percentiles of Errors E.

                    Periodic Sub.   Aperiodic Sub.  Periodic Super.  Aperiodic Super.  Overall
                    50%    90%      50%    90%      50%    90%       50%    90%        50%    90%
Arrhenius           0.055  0.148    0.075  0.336    0.157  0.409     0.130  0.375      0.091  0.334
Arrhenius&Current   0.018  0.044    0.024  0.070    0.042  0.090     0.034  0.090      0.026  0.077
erf                 0.061  0.147    0.083  0.346    0.189  0.429     0.160  0.397      0.103  0.358
Tuckwell            0.137  0.333    0.242  0.837    0.747  0.896     0.713  0.901      0.398  0.864
shows the error of the models versus the relative distance from threshold of the stimuli, for both periodic and aperiodic stimuli. All models perform better for sub- than for superthreshold stimuli, but the Arrhenius&Current model is relatively good across the entire range. Note that the Tuckwell model becomes quite good for periodic stimuli that are far below threshold (e > 1), as is to be expected, since the Tuckwell approximation is valid in this regime. It does not fare as well for aperiodic stimuli, for they may contain intervals during which the membrane potential of a subthreshold stimulus comes close to threshold.

3.2 Stochastic Resonance. Recent years have seen mounting evidence that signal transmission in neurons can be improved by noise, an effect known as stochastic resonance (Douglass, Wilkens, Pantazelou, & Moss, 1993; Wiesenfeld & Moss, 1995; Cordo et al., 1996; Levin & Miller, 1996; Moss & Russell, 1998). The typical experimental paradigm is to stimulate a neuron sinusoidally, adding noise of varying intensity, and to measure the signal-to-noise ratio (SNR) of the spike train elicited from the neuron. If this SNR peaks for nonvanishing input noise, the system is said to exhibit stochastic resonance. (For a recent review of the field, see Gammaitoni, Hänggi, Jung, & Marchesoni, 1998.) Plesser and Geisel (1999) have demonstrated that noise can induce a further kind of resonance in periodically stimulated integrate-and-fire neurons: the optimal SNR is attained for a particular stimulation frequency. This latter resonance arises from matching the period of stimulation to the timescale set by the membrane time constant of the neuron. We shall show here that the simplified Arrhenius&Current model reproduces this behavior well and that it is thus well suited to investigate integrate-and-fire dynamics. (For details of the methods, see Plesser & Geisel, 1999.) Briefly, one proceeds as follows to obtain the SNR in response to sinusoidal stimulation.
For periodic input I(t) = μ + q cos(Ωt + φ), the ISI distributions ρ(t | t*, I(·)) depend on input and external time only through the input phase ψ* = [Ωt* + φ] mod 2π at the time t* of the most recent spike. With respect to this spike phase, the spike train reduces to a Markov
[Figure 4 panels (a)–(d): error E versus relative distance from threshold e, roughly over −2 ≤ e ≤ 2.]
Figure 4: Error E versus relative distance from threshold e for the (a) Arrhenius, (b) Arrhenius&Current, (c) error function, and (d) Tuckwell models. Symbols mark the medians, and vertical bars span the 20th to 80th percentiles of 250 stimuli with neighboring e-values. Circles indicate periodic stimuli, crosses aperiodic ones. e > 0 corresponds to the regime of subthreshold stimuli. Note the different scales of the y-axes.
process with transition probability (or kernel)

T(ψ_k | ψ_{k−1}) = ∫₀^∞ ρ(t | ψ_{k−1}) δ(ψ_k − [Ωt + ψ_{k−1}] mod 2π) dt  (3.2)

between the phases of the kth and the (k−1)st spike. The power spectral density at the stimulus frequency Ω of a δ-spike train of total length T_o is then given by

S_{T_o}(Ω) = (1 / (π T_o)) ⟨ Σ_{j,k: t_j, t_k < T_o} e^{iΩ(t_j − t_k)} ⟩ ≈ (1 / (π N_o ⟨t⟩)) ⟨ Σ_{j,k=1}^{N_o} e^{i(ψ_j − ψ_k)} ⟩.  (3.3)
[Figure 5: SNR (0 to 16) versus noise amplitude σ over 0 ≤ σ ≤ 0.15.]
Figure 5: SNR versus noise amplitude σ for μ = 0.95, q√2 = 0.05, and Ω = 0.33π (highest SNR), Ω = π (lowest SNR), and Ω = 0.1π. Solid lines are for the diffusion model, dashed lines for the Arrhenius&Current model.
Here, t_j are the spike times and ψ_j the spike phases, and we have exploited that ψ_j − ψ_k = [Ω(t_j − t_k)] mod 2π. The mean ISI length is ⟨t⟩, and N_o = ⌊T_o / ⟨t⟩⌋ is the average number of spikes in T_o. The approximation in equation 3.3 consists in averaging the number of spikes in T_o and their phases separately. Since the correlations between the spike phases ψ_j are given by equation 3.2, the right-hand side of equation 3.3 can be evaluated in closed form. The results are in excellent agreement with spectra computed via fast Fourier transform from simulated spike trains. As an estimate of the noise background against which this signal is transmitted, we use the power spectral density of a Poisson process with the same mean number of spikes per unit time as the spike train studied in equation 3.3 (see Stemmler, 1996). A Poisson process with intensity ⟨t⟩^{−1} has a white spectrum with power S_P = (π⟨t⟩)^{−1} (see Cox & Lewis, 1966). The SNR of a spike train transmitted within a fixed observation period T_o is then

SNR = S_{T_o}(Ω) / S_P ≈ (1 / N_o) ⟨ Σ_{j,k=1}^{N_o} e^{i(ψ_j − ψ_k)} ⟩.

Figure 5 shows results for the integrate-and-fire model with diffusion noise and with Arrhenius&Current escape noise. The agreement is good, indicating that the Arrhenius&Current model is well suited to replace diffusive noise in the integrate-and-fire neuron when investigating more intricate
problems of neuronal dynamics. Note in particular that both models show a twofold stochastic resonance. The SNR exhibits, as always in stochastic resonance, a peak as a function of the noise amplitude. The height of this peak is maximal around a stimulus frequency Ω = 0.33π and decreases for both larger (Ω = π) and smaller (Ω = 0.1π) frequencies. The optimal frequency Ω = 0.33π corresponds to a period of T = 6, which should be compared with the mean ISI length ⟨t⟩ ≈ 9.7 at the resonance. (For a more detailed discussion of the timescale matching underlying this effect, see Plesser & Geisel, 1999.)
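The right-hand side of the SNR expression depends on the spike phases only, so a sample of phases determines the estimate directly. A small sketch (the phase sample below is illustrative, not drawn from the kernel of equation 3.2):

```python
import numpy as np

def snr_from_phases(psi):
    # (1/N_o) * sum_{j,k} exp[i(psi_j - psi_k)] equals |sum_j exp(i psi_j)|^2 / N_o
    z = np.exp(1j * np.asarray(psi)).sum()
    return abs(z) ** 2 / len(psi)
```

Perfect phase locking (all ψ_j equal) gives SNR = N_o, while phases spread evenly over the circle give SNR = 0, which brackets the values shown in Figure 5.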
4 Discussion
We have demonstrated that the dynamics of the integrate-and-fire neuron with diffusive noise may be well approximated by simple escape noise models and have identified the Arrhenius&Current hazard function as the optimal choice of model. Upon periodic stimulation, the SNR of the neuron's output exhibits stochastic resonance for the Arrhenius&Current noise in the same way as it does for diffusive noise. This indicates that this escape noise model renders correctly the correlations within the spike train that are induced by the time-dependent stimulus. The relative error of the model depends only weakly on the stimulus, permitting widespread use. Of the remaining models, the pure Arrhenius model finishes as a strong second, at least for subthreshold stimuli. Compared to the error function model, it has the additional charm of mathematical simplicity. The model based on Tuckwell's approximation of the mean firing rate, in contrast, is useful only if stimuli are sufficiently subthreshold. With this study, we have provided a tool for the efficient investigation of large networks of model neurons. In fact, escape noise models have been used in the past in the context of the spike response model (Gerstner & van Hemmen, 1992; Gerstner, 1995), and this study provides additional support for such a simplified treatment of noise. For escape noise models, the signal transmission properties of spiking neurons can be studied analytically (Gerstner, forthcoming). Several issues remain to be tackled, particularly the extension of the investigation presented here to the case of colored noise. Finally, although the Arrhenius&Current model is the best model we have found so far, this is as yet no proof that it is indeed the optimal model.
Acknowledgments
We acknowledge helpful comments by two anonymous referees on an earlier version of the manuscript. H. E. P. was supported by Deutsche Forschungsgemeinschaft through SFB 185 “Nichtlineare Dynamik.”
References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel J Med Sci, 18, 83–92.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput, 11, 1621–1671.
Bryant, H. L., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white-noise analysis. J Physiol, 260, 279–314.
Bulsara, A. R., & Zador, A. (1996). Threshold detection of wideband signals: A noise-induced maximum in the mutual information. Phys Rev E, 54, R2185–R2188.
Collins, J. J., Chow, C. C., Capela, A. C., & Imhoff, T. T. (1996). Aperiodic stochastic resonance. Phys Rev E, 54, 5575–5584.
Cordo, P., Inglis, J. T., Verschueren, S., Collins, J. J., Merfeld, D. M., Rosenblum, S., Buckley, S., & Moss, F. (1996). Noise in human muscle spindles. Nature, 383, 769–770.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Douglass, J. K., Wilkens, L., Pantazelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev Mod Phys, 70, 223–287.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys Rev E, 51, 738–758.
Gerstner, W. (1998). Spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 3–54). Cambridge, MA: MIT Press.
Gerstner, W. (forthcoming). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gummer, A. W. (1991). Probability density function of successive intervals of a nonhomogeneous Poisson process under low-frequency conditions. Biol Cybern, 65, 23–30.
Johannesma, P. I. M. (1968). Diffusion models of the stochastic activity of neurons. In E. R. Caianiello (Ed.), Neural networks (pp. 116–144). Berlin: Springer.
Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J Acoust Soc Am, 68, 1115–1122.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J Comp Neurosci, 3, 275–299.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neural coincidence detection with periodic spike input. Neural Comput, 10, 1987–2017.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci, 19(4), 130–137.
Lánský, P. (1997). Sources of periodical force in noisy integrate-and-fire models of neuronal dynamics. Phys Rev E, 55, 2040–2043.
Levin, J. E., & Miller, J. P. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380, 165–168.
Linz, P. (1985). Analytical and numerical methods for Volterra equations. Philadelphia: SIAM.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing of neocortical neurons. Science, 268, 1503–1506.
Moss, F., & Russell, D. F. (1998). Animal behavior enhanced by noise. Paper presented at the Computational Neuroscience Meeting '98, Santa Barbara, CA.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. Biophys J, 7, 391–418.
Plesser, H. E., & Geisel, T. (1999). Markov analysis of stochastic resonance in a periodically driven integrate-and-fire neuron. Phys Rev E, 59, 7008–7017.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys Lett A, 225, 228–234.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Scott, D. W. (1992). Multivariate density estimation. New York: Wiley.
Siebert, W. M. (1970). Frequency discrimination in the auditory system: Place or periodicity mechanisms. Proc IEEE, 58, 723–730.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys J, 5, 173–194.
Stemmler, M. (1996). A single spike suffices: The simplest form of stochastic resonance in neuron models. Network, 7, 687–716.
Stevens, C. F., & Zador, A. (1996). When is an integrate-and-fire neuron like a Poisson neuron? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 103–109). Cambridge, MA: MIT Press.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). Cambridge, UK: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: SIAM.
van Kampen, N. G. (1992). Stochastic processes in physics and chemistry (2nd ed.). Amsterdam: North-Holland.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benefits of noise: From ice ages to crayfish and SQUIDs. Nature, 373, 33–36.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J, 12, 1–24.

Received April 27, 1998; accepted March 31, 1999.
LETTER
Communicated by Misha Tsodyks
Modeling Synaptic Plasticity in Conjunction with the Timing of Pre- and Postsynaptic Action Potentials

Werner M. Kistler
J. Leo van Hemmen
Physik Department der TU München, D-85747 Garching bei München, Germany

We present a spiking neuron model that allows for an analytic calculation of the correlations between pre- and postsynaptic spikes. The neuron model is a generalization of the integrate-and-fire model and is equipped with a probabilistic spike-triggering mechanism. We show that under certain biologically plausible conditions, pre- and postsynaptic spike trains can be described simultaneously as an inhomogeneous Poisson process. Inspired by experimental findings, we develop a model for synaptic long-term plasticity that relies on the relative timing of pre- and postsynaptic action potentials. Given an input statistics, we compute the stationary synaptic weights that result from the temporal correlations between the pre- and postsynaptic spikes. By means of both analytic calculations and computer simulations, we show that such a mechanism of synaptic plasticity is able to strengthen those input synapses that convey precisely timed spikes at the expense of synapses that deliver spikes with a broad temporal distribution. This may be of vital importance for any kind of information processing based on spiking neurons and temporal coding.
1 Introduction
There is growing experimental evidence that the strength of a synapse is persistently changed according to the relative timing of the arrival of presynaptic spikes and the triggering of postsynaptic action potentials (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997). A theoretical approach to this phenomenon is far from trivial because it involves the calculation of correlations between pre- and postsynaptic action potentials (Gerstner, Kempter, van Hemmen, & Wagner, 1996). In the context of simple neuron models such as the standard integrate-and-fire model, the determination of the firing-time distribution involves the solution of first-passage time problems (Tuckwell, 1988). These problems are known to be hard, the more so if the neuron model is extended so as to include biologically realistic postsynaptic potentials.

Neural Computation 12, 385–405 (2000)
© 2000 Massachusetts Institute of Technology
386
Werner M. Kistler and J. Leo van Hemmen
We are going to use a neuron model, the spike-response model, which is a generalization of the integrate-and-fire model (Gerstner & van Hemmen, 1992; Gerstner, 1995; Kistler, Gerstner, & van Hemmen, 1997). This model combines analytic simplicity with the ability to give a faithful description of "biological" neurons in terms of postsynaptic potentials, afterpotentials, and other aspects (Kistler et al., 1997). In order to be able to solve the first-passage time problem, the sharp firing threshold is replaced by a probabilistic spike-triggering mechanism (section 2). In section 3 we investigate conditions that allow for a description of the postsynaptic spike train as an inhomogeneous Poisson process, provided the presynaptic spike trains are described by inhomogeneous Poisson processes too. Using this stochastic neuron model, we can analytically compute the distribution of the first postsynaptic firing time for a given ensemble of input spike trains in two limiting cases: low and high firing threshold (section 4). Finally, in section 5 a model of synaptic plasticity is introduced and analyzed in the context of a neuron that receives input from presynaptic neurons with different temporal precision. We show analytically and by computer simulations that the mechanism of synaptic plasticity is able to distinguish between synapses that convey spikes with a narrow or a broad temporal jitter. The resulting stationary synaptic weight vector favors synapses that deliver precisely timed spikes at the expense of the other synapses so as to produce again highly precise postsynaptic spikes.

2 The Model
We will investigate a neuron model, the spike-response model, that is in many aspects similar to but much more general than the standard integrate-and-fire model (Gerstner & van Hemmen, 1992; Gerstner, 1995). The membrane potential is given by a combination of pre- and postsynaptic contributions, described by a response kernel ε that gives the form of an elementary postsynaptic potential and a kernel η that accounts for refractoriness and has the function of an afterpotential, so that

h_i(t) = Σ_{j,f} J_ij ε(t − t_j^(f) − Δ_ij) − Σ_f η(t − t_i^(f)).  (2.1)
Here, h_i is the membrane potential of neuron i, J_ij is the strength of the synapse connecting neuron j to neuron i, Δ_ij is the corresponding axonal delay from j to i, and {t_j^(f), f = 1, 2, ...} are the firing times of neuron j. Causality is respected if both ε(s) and η(s) vanish identically for s < 0. Depending on the choice of ε and η, the spike-response model can reproduce the standard integrate-and-fire model or even mimic complex Hodgkin-Huxley-type neuron models (Kistler et al., 1997). A typical choice for ε is ε(t) = (t/τ_M) exp(1 − t/τ_M) H(t), where τ_M is a membrane time constant and
H the Heaviside step function, with H(t) = 1 for t > 0 and H(t) = 0 elsewhere. As for the refractory function η, a simple exponential can be used, for example, η(t) = η_0 exp(−t/τ_ref) H(t). Equation 2.1 could also be generalized so as to account for short-term depression and facilitation if the constant coupling J_ij is replaced by a function of the previous spike arrival times at the synapse from j to i (Kistler & van Hemmen, 1999). In this context, however, we will concentrate on situations where only a single spike is transmitted across each synapse, so that effects of short-term plasticity do not come into action. The crucial point for what follows is the way in which new firing events are defined. Instead of using a deterministic threshold criterion for spike release, we describe the spike train of the postsynaptic neuron as a stochastic process with the probability density of having a spike at time t being a nonlinear function ν of the membrane potential h(t). If we neglect the effect of the afterpotential, η = 0, the spike train of neuron i is an inhomogeneous Poisson process with rate ν = ν(h_i) for every fixed set of presynaptic firing times {t_j^(f), j ≠ i}. Spike triggering in neurons is more or less a threshold process. In order to mimic this observation, we have chosen the rate function ν(h) to be

ν(h) = ν_max H(h − θ),  (2.2)

where ν_max is a positive constant and θ the firing threshold. We note that ν_max is not the maximum firing rate of the neuron, which is given by the form of η(t), but a kind of reliability parameter. The larger ν_max, the faster the neuron will fire after the threshold θ has been reached from below. The role of the parameter ν_max is further discussed in section 4.

3 Noisy Input
For a fixed set of presynaptic spikes at times {t_j^(f), j ≠ i} and a negligible afterpotential (η = 0), the membrane potential of neuron i at time t is uniquely determined and given by

h_i(t) = Σ_{j,f} J_ij ε(t − t_j^(f) − Δ_ij).  (3.1)
In this case, the spike train of neuron i is a Poisson process with rate ν_max for those time intervals in which h_i(t) > θ, and no spikes occur in between. Things become more complicated if we include an afterpotential after each spike of neuron i and if the arrival times of the presynaptic spikes are no longer fixed but become random variables too. We will not discuss the first point, because we are mainly interested in the first spike of neuron i after a period of inactivity. As a consequence of the second point, the membrane
potential h_i(t) is no longer a fixed function of time but a stochastic process, because every realization of presynaptic firing times {t_j^(f), j ≠ i} results in a different realization of the time course of the membrane potential h_i(t) as given through equation 3.1. The composite process is thus doubly stochastic in the sense that in a first step, a set of input spike trains is drawn from an ensemble characterized by Poisson rates ν_i. This realization of input spike trains then determines the output rate function, which in a second step produces a specific realization of the postsynaptic spike train. The natural question is how to characterize the resulting postsynaptic spike train. For an inhomogeneous Poisson process with rate function ν, the expectation of the number k of events in an interval [T_1, T_2] is

E{k | ν} = ∫_{T_1}^{T_2} dt ν(t).  (3.2)
If the rate function ν is replaced by a stochastic process, we have to integrate over all possible realizations of ν (with respect to a measure μ) in order to obtain the (unconditional) expectation of the number of events,

E{k} = ∫ dμ(ν) E{k | ν} = ⟨E{k | ν}⟩_ν = ∫_{T_1}^{T_2} dt ν̄(t),  (3.3)
with ⟨·⟩_ν = ∫ · dμ(ν) and ν̄(t) = ⟨ν(t)⟩_ν. Hence, the composite process has the same expectation of the number of events as an inhomogeneous Poisson process with rate function ν̄(t), which is just the expectation of ν(t). Nevertheless, the composite process is not equivalent to a Poisson process. This can be seen by calculating the probabilities of observing k events within [T_1, T_2]. For a Poisson process with fixed rate function ν, this probability equals

prob{k events in [T_1, T_2] | ν} = (1/k!) [∫_{T_1}^{T_2} dt ν(t)]^k exp[−∫_{T_1}^{T_2} dt ν(t)],  (3.4)
whereas for the composite process, we obtain

prob{k events in [T_1, T_2]} = ⟨prob{k events in [T_1, T_2] | ν}⟩_ν
= Σ_{n=0}^∞ ((−1)^n / (k! n!)) ⟨[∫_{T_1}^{T_2} dt ν(t)]^{n+k}⟩_ν  (3.5)
= Σ_{n=0}^∞ ((−1)^n / (k! n!)) ∫_{T_1}^{T_2} dt_1 ... ∫_{T_1}^{T_2} dt_{n+k} ⟨ν(t_1) ... ν(t_{n+k})⟩_ν.  (3.6)
The last expression contains expectations of products of the rate function ν evaluated at times t_1, ..., t_{n+k}. Since the membrane potential h is a continuous function of time, the rate function is (at least) piecewise continuous, and ν(t_i) and ν(t_j) with t_i ≠ t_j are in general not statistically independent. The calculation of correlations such as C(t_i, t_j) := ⟨ν(t_i) ν(t_j)⟩ − ⟨ν(t_i)⟩⟨ν(t_j)⟩ is thus far from trivial (Bartsch & van Hemmen, 1998). In the following we approximate the composite process by an inhomogeneous Poisson process with rate function ν̄. This approximation can be justified for large numbers of overlapping excitatory postsynaptic potentials (EPSPs), when, due to the law of large numbers (Lamperti, 1966), the membrane potential as a sum of independent EPSPs, and thus the rate function ν(t), is almost always close to its expectation value ν̄(t). The approximation is equivalent to neglecting correlations in the rate function ν. That is, to good approximation (as we will see shortly), we are allowed to assume ν(t_1), ..., ν(t_{k+n}) to be independent for t_i ≠ t_j, i ≠ j. In doing so, we can replace the expectation of the product by the product of expectations under the time integrals of equation 3.5 and obtain

prob{k events in [T_1, T_2]} = Σ_{n=0}^∞ ((−1)^n / (k! n!)) ∫_{T_1}^{T_2} dt_1 ... ∫_{T_1}^{T_2} dt_{n+k} ν̄(t_1) · ... · ν̄(t_{n+k})
= (1/k!) [∫_{T_1}^{T_2} dt ν̄(t)]^k exp[−∫_{T_1}^{T_2} dt ν̄(t)],  (3.7)
which is the probability of observing k events for an inhomogeneous Poisson process with rate function ν̄, the expectation of the process ν. The efficacy of this approximation remains to be tested. This is done throughout the rest of this article by comparing theoretical results based on this approximation with numerical simulations. We now calculate ν̄ for neuron i. We assume that the spike trains of the presynaptic neurons, labeled j, can be described by inhomogeneous Poisson processes with rate functions ν_j. A description of spike trains of cortical neurons in terms of inhomogeneous Poisson processes is appropriate for time windows below approximately 1 second (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997), which is sufficient for our purpose. With these preliminaries we can calculate the first and the second moment of the distribution of the resulting membrane potential. The average postsynaptic potential generated by a single presynaptic neuron with rate function ν_j is

⟨Σ_f ε(t − t_j^(f))⟩ = (ν_j ∗ ε)(t).  (3.8)
Here ∗ denotes convolution, that is, (f ∗ g)(t) = ∫ dt′ f(t′) g(t − t′). The second moment is (Kempter, Gerstner, van Hemmen, & Wagner, 1998, appendix A)

⟨[Σ_f ε(t − t_j^(f))]²⟩ = [(ν_j ∗ ε)(t)]² + (ν_j ∗ ε²)(t).  (3.9)
Returning to equation 3.1, we see that the expectation value h̄_i of the membrane potential and the corresponding variance σ_i² are given by

h̄_i(t) = E{h_i(t)} = Σ_j J_ij (ν_j ∗ ε)(t)  (3.10)

and

σ_i²(t) = Var{h_i(t)} = Σ_j J_ij² (ν_j ∗ ε²)(t).  (3.11)
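Equations 3.10 and 3.11 reduce to discrete convolutions on a time grid. A minimal sketch with a single presynaptic neuron, a constant rate, and a toy exponential kernel (an assumption for brevity; the α kernel of section 2 would work the same way):

```python
import numpy as np

dt = 0.01
t = np.arange(0.0, 20.0, dt)
eps = np.exp(-t)                      # toy EPSP kernel (assumed), integral 1
rates = [np.full(t.size, 0.5)]        # presynaptic Poisson rates nu_j(t)
weights = [2.0]                       # synaptic strengths J_ij

h_bar = np.zeros(t.size)              # eq. 3.10: sum_j J_ij (nu_j * eps)(t)
var_h = np.zeros(t.size)              # eq. 3.11: sum_j J_ij^2 (nu_j * eps^2)(t)
for nu, J in zip(rates, weights):
    h_bar += J * np.convolve(nu, eps)[: t.size] * dt
    var_h += J ** 2 * np.convolve(nu, eps ** 2)[: t.size] * dt
```

For a constant rate ν, the stationary values are J ν ∫ε dt and J² ν ∫ε² dt, both equal to 1.0 for the numbers chosen here, which gives a simple consistency check.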
Because of the central limit theorem, the membrane potential at a certain fixed time t has a gaussian distribution if there are sufficiently many EPSPs that superimpose at time t (for an estimate by the Berry-Esseen inequality, we refer to Kempter et al., 1998). Equipped with the gaussian assumption, we can calculate ν̄(t),

ν̄(t) = ∫ dh G[h − h̄_i(t), σ_i(t)] ν(h) = (ν_max/2) {1 + erf[(h̄_i(t) − θ) / √(2σ_i²(t))]},  (3.12)

with G a normalized gaussian, G(h, σ) = (2πσ²)^{−1/2} exp[−h²/(2σ²)]. We note that this is the first time that we have used explicitly the functional dependence (see equation 2.2) of the firing rate ν on the membrane potential h. The result so far is that if the membrane potential is a sum of many EPSPs, so that we can assume h(t) to be gaussian distributed and the membrane potentials at different times to be independent, and if the effect of the afterpotential is negligible, then the postsynaptic spike train is given by an inhomogeneous Poisson process with rate function ν̄ as given by equation 3.12.

4 First Passage Time
In the previous section we discussed an approximation that allows us to replace the doubly stochastic process that describes the postsynaptic spike
train by an inhomogeneous Poisson process with a rate function ν̄ that is independent of the specific realization of the presynaptic input. This approximation allows us to calculate the distribution of the first postsynaptic spike analytically. The probability of observing no spikes in the interval [t_0, t] is

prob{no spike in [t_0, t]} = exp[−∫_{t_0}^t dt′ ν̄(t′)],  (4.1)

and the density of the first firing time after time t_0 is

p_first(t) = (d/dt) (1 − prob{no spike in [t_0, t]}) = ν̄(t) exp[−∫_{t_0}^t dt′ ν̄(t′)].  (4.2)
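Equations 4.1 and 4.2 translate directly into a few lines of code once ν̄ is sampled on a grid; the cumulative integral below is a simple Riemann sum, a sketch rather than a high-accuracy quadrature:

```python
import numpy as np

def first_spike_density(nu_bar, dt):
    # eq. 4.2: p_first(t) = nu_bar(t) * exp(-integral from t0 to t of nu_bar)
    survival = np.exp(-np.cumsum(nu_bar) * dt)   # eq. 4.1, no-spike probability
    return nu_bar * survival
```

For a constant rate λ this reproduces the exponential density λ e^{−λ(t − t_0)}, and the total mass over [t_0, T] approaches 1 − e^{−λ(T − t_0)}.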
We are going to study the first-spike distribution (first passage time) in a specific context. We consider a single neuron with N independent synapses. Each synapse i delivers a spike train determined by an inhomogeneous Poisson process with rate function ν_i(t) = (2πσ_i²)^{−1/2} exp[−t²/(2σ_i²)], so that on average, one spike arrives at each synapse with a temporal jitter σ_i around t = 0. We can calculate the averaged rate function ν̄ for this setup,

ν̄(t) = (ν_max/2) {1 + erf[(Σ_{i=1}^N J_i ε̄_i(t) − θ) / √(2 Σ_{i=1}^N J_i² ε̄_i²(t))]},  (4.3)

with expectation and variance given as convolutions,

ε̄_i(t) = (ν_i ∗ ε)(t)  and  ε̄_i²(t) = (ν_i ∗ ε²)(t).  (4.4)
The resulting density of the first postsynaptic spike is shown in Figure 1.

4.1 High Threshold. We calculate the density of the first postsynaptic spike explicitly in two limiting cases. First, we assume that the probability of observing at least one postsynaptic spike is very small,

∫ dt ν̄(t) ≪ 1.  (4.5)

In this case, the neuron will fire, if it fires at all, in the neighborhood of the maximum of the postsynaptic potential. We expand the terms involving the
Figure 1: Plots comparing the density function of the first postsynaptic spike as reconstructed from simulations (solid line) with the analytical approximation using the averaged rate function of equation 4.3 (dotted line). The dashed line represents the mean rate ν̄. (A, B) N = 100 synapses with strength J_i = 1/N receive spike trains determined by an inhomogeneous Poisson process with rate function ν_i = (2πσ²)^{−1/2} exp[−t²/(2σ²)] and σ = 1. The EPSPs are described by alpha functions (t/τ) exp(1 − t/τ) with time constant τ = 1, so that the maximum of the membrane potential would amount to h = 1 if all spikes arrived simultaneously. The postsynaptic response is characterized by ν_max = 1 and θ = 0.5 in A and θ = 0.75 in B. Increasing the threshold obviously improves the temporal precision of the postsynaptic response, but the overall probability of a postsynaptic spike is decreased. (C, D) The same plots as A and B but for N = 10 instead of 100 synapses. The synaptic weights have been scaled so that J_i = 1/10. The central approximation, which asserts statistical independence of the membrane potential at different times, produces surprisingly good results for as few as N = 10 overlapping EPSPs.
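A simulation in the spirit of Figure 1A can be sketched in a few lines. Parameter values follow the caption (N = 100, J_i = 1/N, σ = τ = ν_max = 1, θ = 0.5); the trial count and time grid are arbitrary choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, nu_max, theta = 100, 1.0, 0.5
tau, sigma, dt = 1.0, 1.0, 0.01
t = np.arange(-5.0, 10.0, dt)

def eps(s):
    # alpha-shaped EPSP kernel of the caption
    return np.where(s > 0.0, s / tau * np.exp(1.0 - s / tau), 0.0)

first_spikes = []
for _ in range(500):
    arrivals = rng.normal(0.0, sigma, N)          # one jittered spike per synapse
    h = eps(t[:, None] - arrivals[None, :]).sum(axis=1) / N
    # fire with probability nu_max * dt in every suprathreshold time step
    fire = (h > theta) & (rng.random(t.size) < nu_max * dt)
    if fire.any():
        first_spikes.append(t[np.argmax(fire)])   # keep only the first spike
```

A histogram of first_spikes approximates the solid curve of the figure; raising theta to 0.75 narrows the histogram while reducing the number of trials that fire at all.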
membrane potential in the neighborhood of the maximum at t_0 to leading order in t,

(Σ_{i=1}^N J_i ε̄_i(t) − θ) / √(2 Σ_{i=1}^N J_i² ε̄_i²(t)) = h_0 − h_2 (t − t_0)² + O((t − t_0)³).  (4.6)
The averaged rate function $\bar\nu$ can thus be written
$$\bar\nu(t) = \frac{\nu_{\max}}{2}\, \Big\{1 + \operatorname{erf}\big[h_0 - h_2 (t - t_0)^2 + \mathcal O\big((t - t_0)^3\big)\big]\Big\} = \bar\nu_0 - \bar\nu_2\, (t - t_0)^2 + \mathcal O\big((t - t_0)^3\big), \tag{4.7}$$
Modeling Synaptic Plasticity
393
with
$$\bar\nu_0 = \frac{\nu_{\max}}{2}\, [1 + \operatorname{erf}(h_0)] \qquad\text{and}\qquad \bar\nu_2 = \frac{\nu_{\max}\, h_2}{\sqrt\pi}\, e^{-h_0^2}. \tag{4.8}$$
It turns out that for $-0.5 < h_0 < 0.5$, the averaged rate function $\bar\nu$ can be approximated very well by the clipped parabola
$$\bar\nu(t) \approx \begin{cases} \bar\nu_0 - \bar\nu_2\, (t - t_0)^2, & \text{for } |t - t_0| < \sqrt{\bar\nu_0/\bar\nu_2}, \\ 0, & \text{elsewhere.} \end{cases} \tag{4.9}$$
Using this approximation, we can calculate the integral over $\bar\nu$ and obtain the distribution of the first postsynaptic spike,
$$p_{\mathrm{first}}(t) = \big[\bar\nu_0 - \bar\nu_2\, (t - t_0)^2\big] \exp\!\left[\frac{\bar\nu_2}{3} (t - t_0)^3 - \bar\nu_0\, (t - t_0) - \frac{2\bar\nu_0}{3} \sqrt{\frac{\bar\nu_0}{\bar\nu_2}}\right] H\!\left(\sqrt{\frac{\bar\nu_0}{\bar\nu_2}} - |t - t_0|\right). \tag{4.10}$$
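A minimal numerical sketch of equations 4.8 to 4.10, assuming the illustrative values $h_0 = 0.2$, $h_2 = 1$, and $\nu_{\max} = 1$ (our choice, not taken from the text):

```python
import numpy as np
from math import erf

nu_max, h0, h2 = 1.0, 0.2, 1.0                       # illustrative values
nu0 = 0.5 * nu_max * (1.0 + erf(h0))                 # Eq. 4.8
nu2 = nu_max * h2 / np.sqrt(np.pi) * np.exp(-h0**2)  # Eq. 4.8

dt = 0.001
t = np.arange(-5.0, 5.0, dt)
nu_bar = np.clip(nu0 - nu2 * t**2, 0.0, None)        # clipped parabola, Eq. 4.9 (t0 = 0)

# first-spike density: rate times survival probability, cf. Eq. 4.10
p_first = nu_bar * np.exp(-np.cumsum(nu_bar) * dt)

p_spike = p_first.sum() * dt   # overall probability of a postsynaptic spike
```

Integrating `p_first` reproduces the overall firing probability $1 - \exp[-\int \bar\nu]$, which stays below one in this high-threshold regime.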
For $h_0 < -0.5$, however, a gaussian function is the better choice,
$$\bar\nu(t) \approx \bar\nu_{\mathrm{post}}\, \mathcal G(t - t_0, \sigma_{\mathrm{post}}), \tag{4.11}$$
with $\bar\nu_{\mathrm{post}} = \sqrt{\pi \bar\nu_0^3 / \bar\nu_2}$ and $\sigma_{\mathrm{post}} = \sqrt{\bar\nu_0 / (2 \bar\nu_2)}$. Since $\bar\nu_{\mathrm{post}} \ll 1$, we have
$$p_{\mathrm{first}}(t) \approx \bar\nu(t). \tag{4.12}$$
4.2 Low Threshold. We now discuss the case where the postsynaptic neuron fires with probability near to one, that is,
$$1 - \int dt\, \bar\nu(t) \ll 1. \tag{4.13}$$
In this case, the neuron will fire approximately at the time $t_0$ when the membrane potential crosses the threshold, $h(t_0) = \vartheta$. We again expand the terms containing the membrane potential to leading order in $(t - t_0)$,
$$\frac{\sum_{i=1}^{N} J_i\, \bar N_i(t) - \vartheta}{\sqrt{\sum_{i=1}^{N} J_i^2\, \overline{N_i^2}(t)}} = h_1\, (t - t_0) + \mathcal O\big((t - t_0)^2\big), \tag{4.14}$$
and assume that the averaged rate function $\bar\nu$ is already saturated outside the region where this linearization is valid. We may therefore put
$$\bar\nu(t) = \frac{\nu_{\max}}{2}\, \big\{1 + \operatorname{erf}[h_1 (t - t_0)]\big\}, \tag{4.15}$$
Figure 2: Approximation of the averaged rate function $\bar\nu$ (left) and the density of the first postsynaptic spike $p_{\mathrm{first}}$ (right) for various threshold values. The solid line gives the numerical result; the dashed line represents the analytic approximation. (A, B) Approximation for a low threshold value ($\vartheta = 0.5$), as in equations 4.15 and 4.16. (C, D) Approximation by a clipped parabola for $\vartheta = 0.65$, as in equations 4.9 and 4.10. (E, F) Gaussian approximation for high threshold values ($\vartheta = 0.8$); see equations 4.11 and 4.12. The number of synapses is N = 100; the other parameters are as in Figure 1.
and obtain for the density of the first postsynaptic spike,
$$p_{\mathrm{first}}(t) = \bar\nu(t)\, \exp\!\left[-\bar\nu(t)\, (t - t_0) - \frac{\nu_{\max}}{2\sqrt\pi\, h_1}\, e^{-h_1^2 (t - t_0)^2}\right]. \tag{4.16}$$
Outside the neighborhood of the threshold crossing, that is, for $|t - t_0| \gg h_1^{-1}$, $p_{\mathrm{first}}$ can be approximated by
$$p_{\mathrm{first}}(t) \approx \nu_{\max}\, e^{-\nu_{\max} (t - t_0)}\, H(t - t_0), \qquad |t - t_0| \gg h_1^{-1}. \tag{4.17}$$
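The exponential tail of equation 4.17 directly yields the constant precision quoted later in equation 4.20: the interval between the 5% and 95% quantiles of $\nu_{\max} e^{-\nu_{\max} t}$ is $\Delta t = \ln(0.95/0.05)/\nu_{\max} \approx 2.94/\nu_{\max}$. A one-line check:

```python
import numpy as np

nu_max = 1.0
t1 = -np.log(1.0 - 0.05) / nu_max   # 5% quantile of the exponential density
t2 = -np.log(1.0 - 0.95) / nu_max   # 95% quantile
dt_interval = t2 - t1               # = ln(19)/nu_max, i.e. roughly 2.94/nu_max
```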
4.3 Optimal Threshold. As can be seen in Figure 2, the width of the first-spike distribution decreases with increasing threshold $\vartheta$. The temporal precision of the postsynaptic spike can thus be improved by increasing
Figure 3: Reliability (A) and precision (B) of the postsynaptic firing event as a function of the threshold $\vartheta$. The neuron receives input through N = 100 synapses; for details, see the caption to Figure 1. Reliability is defined as the overall firing probability $\bar\nu_{\mathrm{post}} = \int dt\, p_{\mathrm{first}}(t)$. Precision is the inverse of the length of the interval containing 90% of the postsynaptic spikes, $\Delta t = t_2 - t_1$ with $\int_{-\infty}^{t_1} dt\, p_{\mathrm{first}}(t) = \int_{t_2}^{\infty} dt\, p_{\mathrm{first}}(t) = 0.05\, \bar\nu_{\mathrm{post}}$. (C) demonstrates that there is an optimal threshold in the sense that $\bar\nu_{\mathrm{post}}\, \Delta t^{-1}$ exhibits a maximum at $\vartheta \approx 0.6$. The short- and long-dashed lines represent the results obtained through the high- and the low-threshold approximations, respectively; see equations 4.18–4.20.
the firing threshold $\vartheta$. The overall firing probability of the postsynaptic neuron, however, drops to zero if the threshold exceeds the maximum value of the membrane potential. The trade-off between the precision of the postsynaptic response and the reliability, that is, the overall firing probability of the postsynaptic neuron, is illustrated in Figure 3. There is an optimal threshold where the expectation value of the precision, that is, the product of reliability and precision, reaches its maximum (Kempter et al., 1998). A straightforward estimate demonstrates the functional dependence of precision and reliability on the various parameters. For high threshold values, the density $p_{\mathrm{first}}$ of the first postsynaptic spike can be approximated by a gaussian function with variance $\bar\nu_0 / (2 \bar\nu_2)$. The precision of the firing time is thus proportional to
$$\Delta t^{-1} \propto \sqrt{\bar\nu_2 / \bar\nu_0} = \left(\frac{2 h_2\, e^{-h_0^2}}{\sqrt\pi\, [1 + \operatorname{erf}(h_0)]}\right)^{1/2}, \tag{4.18}$$
which depends solely on the time course of the membrane potential and is
independent of $\nu_{\max}$. The reliability of the postsynaptic response amounts to
$$\bar\nu_{\mathrm{post}} = \sqrt{\pi \bar\nu_0^3 / \bar\nu_2} = \nu_{\max} \left(\frac{\pi\sqrt\pi\, [1 + \operatorname{erf}(h_0)]^3}{8 h_2\, e^{-h_0^2}}\right)^{1/2}, \tag{4.19}$$
which is proportional to $\nu_{\max}$. For low threshold, $p_{\mathrm{first}}(t)$ is dominated by an exponential decay $\propto e^{-\nu_{\max} t}$, resulting in a constant precision,
$$\Delta t^{-1} \approx \frac{\nu_{\max}}{3.00}, \tag{4.20}$$
and reliability close to unity.

5 Synaptic Long-Term Plasticity
There is experimental evidence (Markram et al., 1997) that the strength of synapses between cortical neurons is changed according to the relative timing of pre- and postsynaptic spikes. We develop a simple model of synaptic long-term plasticity that mimics the experimental observations. Using the densities for the first postsynaptic spike calculated in the previous section, we find analytic expressions for the stationary synaptic weights that result from the statistics of the presynaptic spike train.

5.1 Modeling Synaptic Plasticity. Very much in the spirit of the spike-response model, we describe the change of a synaptic weight as the linear response to pre- and postsynaptic spikes. Spike trains are formalized by sums of Dirac delta functions, $S(t) = \sum_f \delta(t - t^f)$, with a delta function for each firing time $t^f$. The most general ansatz up to and including terms bilinear in the pre- and postsynaptic spike trains for the change of the synaptic weight $J$ is
$$\frac{d}{dt} J(t) = S_{\mathrm{pre}}(t) \left[\delta_{\mathrm{pre}} + \int dt'\, \kappa_{\mathrm{pre}}(t')\, S_{\mathrm{post}}(t - t')\right] + S_{\mathrm{post}}(t) \left[\delta_{\mathrm{post}} + \int dt'\, \kappa_{\mathrm{post}}(t')\, S_{\mathrm{pre}}(t - t')\right]. \tag{5.1}$$
This is a combination of changes that are induced by the presynaptic ($S_{\mathrm{pre}}$) and the postsynaptic ($S_{\mathrm{post}}$) spike train. Furthermore, $\delta_{\mathrm{pre}}$ ($\delta_{\mathrm{post}}$) is the amount by which the synaptic weight is changed if a single presynaptic (postsynaptic) spike occurs. This non-Hebbian contribution to synaptic plasticity, which is well documented in the literature (Alonso, de Curtis, & Llinás, 1990; Nelson, Fields, Yu, & Liu, 1993; Salin, Malenka, & Nicoll, 1996; Urban & Barrionuevo, 1996), is necessary to ensure that the fixed point of the synaptic
strength can be reached from very low initial values and even zero postsynaptic activity (take $0 < \delta_{\mathrm{pre}} \ll 1$) and to regulate postsynaptic activity (take $-1 \ll \delta_{\mathrm{post}} < 0$). Finally, $\kappa_{\mathrm{pre}}(s)$ ($\kappa_{\mathrm{post}}(s)$) is the additional change induced by a presynaptic (postsynaptic) spike that occurs $s$ milliseconds after a postsynaptic (presynaptic) spike. This modification of synaptic strength depends critically on the timing of pre- and postsynaptic spikes (Dan & Poo, 1992; Bell et al., 1997; Markram et al., 1997). Causality is respected if the kernels $\kappa$ vanish identically for negative arguments. In order to restrict the synaptic weights to the interval [0, 1], we treat synaptic potentiation and depression separately and assume that synaptic weight changes due to depression are proportional to the current synaptic weight $J(t)$, whereas changes due to potentiation are proportional to $[1 - J(t)]$, so that
$$\frac{d}{dt} J(t) = [1 - J(t)] \left(\frac{d}{dt} J(t)\right)^{\!\mathrm{LTP}} - J(t) \left(\frac{d}{dt} J(t)\right)^{\!\mathrm{LTD}} \tag{5.2}$$
with µ
¶LTP/ D µ ¶ Z d LTP/ D LTP/ D 0 ( t )Spost (t ¡ t0 ) = Spre ( t) d pre C dt0 k pre J ( t) dt µ ¶ Z LTP/ D LTP/ D CSpost ( t) d post C dt0 k post ( t0 ) Spre ( t ¡ t0 ) , (5.3)
LTP/ D LTP/ D and d pre/ post , k pre/ post (t0 ) ¸ 0. We are interested in the net change hD Ji of the synaptic weight integrated over some time interval and averaged over an ensemble of presynaptic spike trains, ½ Z µ ¶¾ d hD Ji = dt (5.4) J ( t) . dt
If synaptic weights change only adiabatically, that is, if the change of the synaptic weight is small as compared to $J$ and $(1 - J)$, we have
$$\langle \Delta J \rangle = [1 - J(t)]\, \big\langle \Delta J^{\mathrm{LTP}} \big\rangle - J(t)\, \big\langle \Delta J^{\mathrm{LTD}} \big\rangle, \tag{5.5}$$
with
$$\big\langle \Delta J^{\mathrm{LTP/D}} \big\rangle = \left\langle \int dt\, \left(\frac{d}{dt} J(t)\right)^{\!\mathrm{LTP/D}} \right\rangle = \delta_{\mathrm{pre}}^{\mathrm{LTP/D}}\, \bar\nu_{\mathrm{pre}} + \delta_{\mathrm{post}}^{\mathrm{LTP/D}}\, \bar\nu_{\mathrm{post}} + \int\! dt \int\! dt'\, \nu(t; t')\, \kappa^{\mathrm{LTP/D}}(t; t'), \tag{5.6}$$
where $\bar\nu_{\mathrm{pre,post}} = \int dt\, \nu_{\mathrm{pre,post}}(t)$ is the expected number of pre- and postsynaptic spikes, and $\kappa^{\mathrm{LTP/D}}(t; t') = \kappa_{\mathrm{pre}}^{\mathrm{LTP/D}}(t - t') + \kappa_{\mathrm{post}}^{\mathrm{LTP/D}}(t' - t)$ is the change of the synaptic weight due to a presynaptic spike at time $t$ and a postsynaptic spike at time $t'$. In passing we note that $\nu(t; t')\, dt\, dt'$ is the joint probability of finding a presynaptic spike in the interval $[t, t + dt)$ and a postsynaptic spike in $[t', t' + dt')$.

5.2 Stationary Weights. In the following we concentrate on calculating the stationary synaptic weights defined by
$$\langle \Delta J \rangle = 0 \quad\Longleftrightarrow\quad J = \frac{\big\langle \Delta J^{\mathrm{LTP}} \big\rangle}{\big\langle \Delta J^{\mathrm{LTP}} \big\rangle + \big\langle \Delta J^{\mathrm{LTD}} \big\rangle}. \tag{5.7}$$
To this end, we return to the situation described at the beginning of section 4, where a neuron receives input through N independent synapses, each spike train being described by an inhomogeneous Poisson process with rate function $\nu_i$, $1 \le i \le N$. We study a single synapse and assume that this synapse conveys (on average) one action potential with a gaussian distribution centered around $t = 0$, so that $\nu(t) = \mathcal G(t, \sigma_{\mathrm{pre}})$. The synaptic weight of this synapse is denoted by $J$. Furthermore, we assume that the presynaptic volley of action potentials triggers at most one postsynaptic spike, as is the case, for instance, if postsynaptic action potentials are followed by strong hyperpolarizing afterpotentials. This assumption relieves us of the need to discuss the influence of the shape of the afterpotential $\eta$, because the postsynaptic spike statistics is now completely described by the first-passage-time density $p_{\mathrm{first}}$. We discuss the case where the firing time of a single presynaptic neuron and the postsynaptic firing time are virtually independent. This might be considered as an approximation of the case where either the corresponding synaptic weight is small or the number of overlapping EPSPs needed to reach the firing threshold is large. The joint probability $\nu_i(t, t')$ of the firing of the presynaptic neuron $i$ and the first postsynaptic spike is then simply the product of the probabilities,
$$\nu_i(t, t') = \nu_i(t)\, p_{\mathrm{first}}(t'). \tag{5.8}$$
We investigate a learning rule that is inspired by the observation that a synapse is weakened (depressed) if the presynaptic spike arrives a short time after the postsynaptic spike, and is strengthened (potentiated) if the presynaptic spike arrives a short time before the postsynaptic spike and thus contributes to the postsynaptic neuron's firing (Markram et al., 1997). We can mimic this mechanism within our formalism by setting
$$\kappa^{\mathrm{LTP}}(t; t') = \varepsilon^{\mathrm{LTP}} \exp\!\big[-(t' - t)/\tau^{\mathrm{LTP}}\big]\, H(t' - t), \tag{5.9}$$
Figure 4: Synaptic plasticity based on the timing of pre- and postsynaptic spikes. The kernels $+\kappa^{\mathrm{LTP}}$ (solid line) and $-\kappa^{\mathrm{LTD}}$ (dashed line) describe the modification of synaptic strength caused by the combination of a postsynaptic spike at time 0 and a presynaptic spike arriving at time $t$. See equations 5.9 and 5.10 with $\varepsilon^{\mathrm{LTP}} = \varepsilon^{\mathrm{LTD}} = 0.1$ and $\tau^{\mathrm{LTP}} = \tau^{\mathrm{LTD}} = 1$.
and
$$\kappa^{\mathrm{LTD}}(t; t') = \varepsilon^{\mathrm{LTD}} \exp\!\big[-(t - t')/\tau^{\mathrm{LTD}}\big]\, H(t - t'), \tag{5.10}$$
with $\varepsilon^{\mathrm{LTP/D}} > 0$ and $\tau^{\mathrm{LTP/D}}$ as time constants of the "synaptic memory" (see Figure 4). Our choice of exponentials to describe the kernels $\kappa^{\mathrm{LTP}}$ and $\kappa^{\mathrm{LTD}}$ is motivated both by biological plausibility and by the need to keep the mathematical analysis as simple as possible. With these prerequisites, we can calculate the mean potentiation and depression due to the correlation of pre- and postsynaptic spikes (cf. equation 5.6) for the limiting cases of high and low threshold. If we approximate the postsynaptic firing by a gaussian distribution (cf. equation 4.12),
$$p_{\mathrm{first}}(t) = \bar\nu_{\mathrm{post}}\, \mathcal G(t - t_0, \sigma_{\mathrm{post}}), \tag{5.11}$$
we find
$$\begin{aligned} \int\! dt \int\! dt'\, \nu(t; t')\, \kappa^{\mathrm{LTP}}(t; t') &= \varepsilon^{\mathrm{LTP}} \int_{-\infty}^{\infty}\! dt \int\! dt'\, \mathcal G(t, \sigma_{\mathrm{pre}})\, \bar\nu_{\mathrm{post}}\, \mathcal G(t' - t_0, \sigma_{\mathrm{post}})\, e^{-(t' - t)/\tau^{\mathrm{LTP}}}\, H(t' - t) \\ &= \frac{1}{2}\, \bar\nu_{\mathrm{post}}\, \varepsilon^{\mathrm{LTP}} \exp\!\left[\frac{1}{2}\left(\frac{\sigma}{\tau^{\mathrm{LTP}}}\right)^{\!2} - \frac{t_0}{\tau^{\mathrm{LTP}}}\right] \operatorname{erfc}\!\left[\frac{\sigma}{\sqrt 2\, \tau^{\mathrm{LTP}}} - \frac{t_0}{\sqrt 2\, \sigma}\right], \end{aligned} \tag{5.12}$$
and
$$\int\! dt \int\! dt'\, \nu(t; t')\, \kappa^{\mathrm{LTD}}(t; t') = \frac{1}{2}\, \bar\nu_{\mathrm{post}}\, \varepsilon^{\mathrm{LTD}} \exp\!\left[\frac{1}{2}\left(\frac{\sigma}{\tau^{\mathrm{LTD}}}\right)^{\!2} + \frac{t_0}{\tau^{\mathrm{LTD}}}\right] \operatorname{erfc}\!\left[\frac{\sigma}{\sqrt 2\, \tau^{\mathrm{LTD}}} + \frac{t_0}{\sqrt 2\, \sigma}\right], \tag{5.13}$$
Figure 5: Stationary synaptic weights. (A) Three-dimensional plot of the stationary synaptic weight for high threshold as a function of $\sigma$ and $t_0$, where $\sigma^2 = \sigma_{\mathrm{pre}}^2 + \sigma_{\mathrm{post}}^2$ is the sum of the variances of pre- and postsynaptic firing time and $t_0$ the time between the arrival of the presynaptic spike and the firing of the postsynaptic action potential. (B) Contour plot of the same function as in A. (C, D) Plot of the stationary synaptic weight in the case of low threshold $\vartheta$ as a function of the standard deviation $\sigma_{\mathrm{pre}}$ of the presynaptic firing time and the time $t_0$ when the expectation of the membrane potential reaches the threshold from below. The parameters used to describe synaptic plasticity are $\delta_{\mathrm{post}}^{\mathrm{LTD}} = 0.01$, $\delta_{\mathrm{pre}}^{\mathrm{LTP}} = 0.001$, $\delta_{\mathrm{pre}}^{\mathrm{LTD}} = \delta_{\mathrm{post}}^{\mathrm{LTP}} = 0$, $\varepsilon^{\mathrm{LTP}} = \varepsilon^{\mathrm{LTD}} = 0.1$, $\tau^{\mathrm{LTP}} = \tau^{\mathrm{LTD}} = 1$; see equations 5.2, 5.9, and 5.10.

Here $\sigma^2 = \sigma_{\mathrm{pre}}^2 + \sigma_{\mathrm{post}}^2$. Using equation 5.7, we are thus able to calculate the stationary weight of a synapse that conveys a single action potential with a gaussian distribution with variance $\sigma_{\mathrm{pre}}^2$, given the postsynaptic spike at $t = t_0$ with variance $\sigma_{\mathrm{post}}^2$. This result is illustrated in Figure 5. If the variance of both pre- and postsynaptic spike is small, that is, $\sigma^2 = \sigma_{\mathrm{pre}}^2 + \sigma_{\mathrm{post}}^2 \ll 1$, the stationary weight is dominated by the kernels $\kappa^{\mathrm{LTP/D}}$: the weight is close to zero if the presynaptic spike occurs after the postsynaptic one ($t_0 < 0$) and close to unity if the presynaptic spike arrives a short time before the postsynaptic action potential ($t_0 > 0$). For large variances, this dependence on the timing of the spikes is smoothed, and the stationary
weight is determined by the rates of pre- and postsynaptic spikes. We obtain a similar result for low threshold if we approximate the postsynaptic firing time by an exponential distribution (cf. equation 4.17),
$$\begin{aligned} \int\! dt \int\! dt'\, \nu(t; t')\, \kappa^{\mathrm{LTP}}(t; t') &= \varepsilon^{\mathrm{LTP}} \int_{-\infty}^{\infty}\! dt \int_{\max(t,\, t_0)}^{\infty}\! dt'\, \mathcal G(t, \sigma_{\mathrm{pre}})\, \nu_{\max}\, e^{-\nu_{\max}(t' - t_0)}\, e^{-(t' - t)/\tau^{\mathrm{LTP}}} \\ &= \frac{\varepsilon^{\mathrm{LTP}}\, \nu_{\max}}{\nu_{\max} + 1/\tau^{\mathrm{LTP}}} \left\{ \frac{1}{2} \exp\!\left[\frac{1}{2}\left(\frac{\sigma_{\mathrm{pre}}}{\tau^{\mathrm{LTP}}}\right)^{\!2} - \frac{t_0}{\tau^{\mathrm{LTP}}}\right] \operatorname{erfc}\!\left[\frac{\sigma_{\mathrm{pre}}}{\sqrt 2\, \tau^{\mathrm{LTP}}} - \frac{t_0}{\sqrt 2\, \sigma_{\mathrm{pre}}}\right] \right. \\ &\qquad \left. {} + \frac{1}{2} \exp\!\left[\frac{1}{2}\big(\sigma_{\mathrm{pre}}\, \nu_{\max}\big)^2 + t_0\, \nu_{\max}\right] \operatorname{erfc}\!\left[\frac{\sigma_{\mathrm{pre}}\, \nu_{\max}}{\sqrt 2} + \frac{t_0}{\sqrt 2\, \sigma_{\mathrm{pre}}}\right] \right\}, \end{aligned} \tag{5.14}$$
and
$$\begin{aligned} \int\! dt \int\! dt'\, \nu(t; t')\, \kappa^{\mathrm{LTD}}(t; t') &= \frac{\varepsilon^{\mathrm{LTD}}\, \nu_{\max}}{\nu_{\max} - 1/\tau^{\mathrm{LTD}}} \left\{ \frac{1}{2} \exp\!\left[\frac{1}{2}\left(\frac{\sigma_{\mathrm{pre}}}{\tau^{\mathrm{LTD}}}\right)^{\!2} + \frac{t_0}{\tau^{\mathrm{LTD}}}\right] \operatorname{erfc}\!\left[\frac{\sigma_{\mathrm{pre}}}{\sqrt 2\, \tau^{\mathrm{LTD}}} + \frac{t_0}{\sqrt 2\, \sigma_{\mathrm{pre}}}\right] \right. \\ &\qquad \left. {} - \frac{1}{2} \exp\!\left[\frac{1}{2}\big(\sigma_{\mathrm{pre}}\, \nu_{\max}\big)^2 + t_0\, \nu_{\max}\right] \operatorname{erfc}\!\left[\frac{\sigma_{\mathrm{pre}}\, \nu_{\max}}{\sqrt 2} + \frac{t_0}{\sqrt 2\, \sigma_{\mathrm{pre}}}\right] \right\}. \end{aligned} \tag{5.15}$$
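As an illustration, the stationary weight of equation 5.7 can be evaluated with the closed forms of equations 5.12 and 5.13 (high-threshold, gaussian case). The plasticity parameters follow Figure 5; the rates $\bar\nu_{\mathrm{pre}} = 1$ and $\bar\nu_{\mathrm{post}} = 0.8$ are our own illustrative choices:

```python
from math import erfc, exp, sqrt

EPS, TAU = 0.1, 1.0                    # eps^LTP = eps^LTD, tau^LTP = tau^LTD
D_PRE_LTP, D_POST_LTD = 0.001, 0.01    # non-Hebbian terms as in Figure 5

def corr_term(t0, sigma, nu_post, sign):
    # closed form of Eq. 5.12 (sign = -1, LTP) and Eq. 5.13 (sign = +1, LTD)
    return 0.5 * nu_post * EPS * exp(0.5 * (sigma / TAU) ** 2 + sign * t0 / TAU) \
        * erfc(sigma / (sqrt(2.0) * TAU) + sign * t0 / (sqrt(2.0) * sigma))

def stationary_J(t0, sigma, nu_pre=1.0, nu_post=0.8):
    dJ_ltp = D_PRE_LTP * nu_pre + corr_term(t0, sigma, nu_post, -1.0)
    dJ_ltd = D_POST_LTD * nu_post + corr_term(t0, sigma, nu_post, +1.0)
    return dJ_ltp / (dJ_ltp + dJ_ltd)  # Eq. 5.7
```

Pre-before-post timing ($t_0 > 0$) drives the weight toward unity, post-before-pre toward zero, reproducing the timing dependence of Figure 5.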
5.3 Synaptic Weights as a Function of the Input Statistics. We saw in the previous section that the stationary synaptic weights can be calculated as a function of the parameters characterizing the distribution of the pre- and postsynaptic spikes. The synaptic weights, on the other hand, determine the distribution of the postsynaptic spike. If we are interested in the synaptic weights that are produced by a given input statistics, we thus have to solve a self-consistency problem.
The self-consistency problem can be solved numerically for the limiting cases of low and high threshold, where the distribution of the postsynaptic firing time is completely characterized by only a few parameters. That is, starting from a given vector J of synaptic couplings, we calculate the parameters that characterize the postsynaptic firing time, using the results of section 4 in the limiting cases of high and low firing threshold. In a second step, we calculate the corresponding stationary synaptic weight vector $\bar J$ as determined by the statistics of pre- and postsynaptic firing times (cf. section 5.2). Finally, we solve the self-consistency equation $\bar J = J$ using standard numerics. In the simple case where all synapses deliver spikes with a common statistics, the stable solution of the self-consistency equation is unique, because the mean postsynaptic firing time is a monotone function of the synaptic weight. But if the synapses are different, say, with respect to the mean spike arrival time, the solution in general will no longer be unique. We illustrate the solution of the self-consistency problem by means of an example and consider a neuron that receives spikes via N independent synapses. Again, each presynaptic spike train is described by an inhomogeneous Poisson process with a gaussian rate $\mathcal G(t, \sigma_i)$ centered at $t = 0$. In contrast to our previous examples, the synapses differ with respect to the temporal precision $\sigma_i$ with which they deliver their action potentials. Figure 6A shows the resulting synaptic weights if we start with two groups of presynaptic neurons that deliver action potentials with $\sigma_1 = 0.1$ and $\sigma_2 = 1.0$, respectively. The numbers of neurons contained in group 1 and group 2 are $n_1 = 20$ and $n_2 = 80$, respectively, but the results turned out to be insensitive to the actual numbers. (A more balanced ratio of synapses would shift the curve $J_1(\vartheta)$ slightly to the left.)
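The loop described above can be sketched generically; `postsyn_stats` and `stationary_weights` below are placeholders for the closed-form results of sections 4 and 5.2, not functions defined in the text:

```python
import numpy as np

def solve_self_consistency(J0, postsyn_stats, stationary_weights,
                           tol=1e-6, max_iter=1000):
    """Solve J = J_bar(J) by fixed-point iteration.

    postsyn_stats(J)      -> parameters of the postsynaptic firing-time
                             distribution for weights J (section 4 results)
    stationary_weights(s) -> stationary weight vector for those statistics
                             (section 5.2 results)
    """
    J = np.asarray(J0, dtype=float)
    for _ in range(max_iter):
        J_new = stationary_weights(postsyn_stats(J))
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J
```

When the composed map is a contraction, as in the single-statistics case discussed above, the iteration converges to the unique fixed point.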
We compare the results obtained from the unique solution of the self-consistency equation with simulations of a straightforward implementation of the synaptic dynamics given in equation 5.2. The number of iterations required for the synaptic weights to settle into their stationary values depends on the amplitude of the synaptic weight change, described by $\delta^{\mathrm{LTP/D}}$ and $\varepsilon^{\mathrm{LTP/D}}$, and on the amount of postsynaptic activity and thus on the firing threshold. With the parameters used in Figure 6, stationarity is usually reached in fewer than 1000 iterations. Finally, a brief discussion of the results shown in Figure 6 is in order. For a high firing threshold, the synaptic weights of both groups are close to their maximum value, because otherwise the neuron would not fire at all. This is due to the constant presynaptic potentiation $\delta_{\mathrm{pre}}^{\mathrm{LTP}} > 0$. For very low threshold, the postsynaptic firing is triggered by the first spikes that arrive at the neuron. These spikes stem from presynaptic neurons with a broad firing-time distribution (group 2). Spikes from group 1 neurons thus arrive mostly after the postsynaptic neuron has already fired, and the corresponding synapses are depressed accordingly. This explains why synapses
Figure 6: (A) Synaptic weights for a neuron receiving input from two groups of synapses. One group ($n_1 = 20$) delivers precisely timed spikes ($\sigma_{\mathrm{pre}} = 0.1$); the other ($n_2 = 80$) delivers spikes with a broad distribution of arrival times ($\sigma_{\mathrm{pre}} = 1.0$). The synapses are subject to a mechanism that modifies the synaptic strength according to the relative timing of pre- and postsynaptic spikes; see equation 5.2. The upper trace shows the resulting synaptic weight for the group of precise synapses; the lower trace corresponds to the second group. The solid lines give the analytic result for either the low-threshold ($\vartheta < 0.4$) or the high-threshold ($\vartheta > 0.4$) approximation. The dashed lines show the results of a computer simulation. The parameters for the synaptic plasticity are the same as in Figure 5. (B, C, D) Precision $\Delta t^{-1}$, reliability $\bar\nu_{\mathrm{post}}$, and efficiency $\bar\nu_{\mathrm{post}}/\Delta t$ as functions of the threshold $\vartheta$ for the same neuron as in A. See Figure 3 for the definitions of $\bar\nu_{\mathrm{post}}$ and $\Delta t$.
delivering precisely timed spikes are weaker than group 2 synapses, unless the firing threshold exceeds a certain critical value. The most interesting finding is that in the intermediate range of the firing threshold, synapses that deliver precisely timed spikes are substantially stronger than synapses with poor temporal precision. Furthermore, there is an optimal threshold at $\vartheta \approx 0.2$ where the ratio of the synaptic strengths of precise and imprecise synapses is largest. The firing of the postsynaptic neuron is thus driven mainly by spikes from group 1 neurons, and the jitter of the postsynaptic spikes is minimal (see Figure 6B).

6 Discussion
We have presented a stochastic neuron model that allows for a description of pre- and postsynaptic spike trains by inhomogeneous Poisson processes
provided that the number of overlapping postsynaptic potentials is not too small. In view of the huge number of synapses that a single neuron carries on its dendritic tree and the number of synchronous excitatory postsynaptic potentials required to reach threshold ($10^4$ to $10^5$ synapses and several tens of EPSPs), the approximations used seem to be justified. Since the firing of a neuron is triggered by a large number of synchronous presynaptic action potentials, the postsynaptic firing time and the arrival time of a single presynaptic spike can be assumed to be independent, and the correlation of pre- and postsynaptic firing times is readily calculated. In addition to the spike-triggering mechanism, we have developed a model for synaptic long-term plasticity based on the relative timing of pre- and postsynaptic firing events. The dynamics of the synaptic weights can be analyzed in connection with the former neuron model. In particular, stationary synaptic weights can be calculated as limit states for a given ensemble of presynaptic spike trains. It turns out that this mechanism is able to tune the synaptic weights according to the temporal precision of the spikes they convey. Synapses that deliver precisely timed action potentials are favored at the expense of synapses that deliver spikes with a large temporal jitter. This learning process does not rely on an external teacher signal; it is unsupervised. Furthermore, there is only one free parameter that has to be fixed: the firing threshold $\vartheta$. It is conceivable that a small, local network of inhibitory interneurons could be entrusted with setting the firing threshold. The result concerning the tuning of synaptic weights has two consequences that are open to experimental verification. First, it shows that neurons are able to tune their synaptic weights so as to select inputs from those presynaptic neurons that provide information that is meaningful in the sense of a temporal code.
Second, this tuning ensures the preservation of temporally encoded information, because noisy synapses are depressed and the neuron is thus able to fire its action potential with high temporal precision. We believe that this mechanism has clear relevance to any information processing based on temporal spike coding.

Acknowledgments
W. M. K. gratefully acknowledges financial support from the Boehringer Ingelheim Fonds.

References

Alonso, A., de Curtis, M., & Llinás, R. (1990). Postsynaptic Hebbian and non-Hebbian long-term potentiation of synaptic efficacy in the entorhinal cortex in slices and in the isolated adult guinea pig brain. Proc. Natl. Acad. Sci. USA, 87, 9280–9284.
Bartsch, A. P., & van Hemmen, J. L. (1998). Correlations in networks of spiking neurons. Unpublished manuscript. Munich: Physics Department, Technical University of Munich.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Dan, Y., & Poo, M. (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science, 256, 1570–1573.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 384, 76–78.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Comput., 10, 1987–2017.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Kistler, W. M., & van Hemmen, J. L. (1999). Short-term synaptic plasticity and network behavior. Neural Comput., 11, 1579–1594.
Lamperti, J. (1966). Probability: A survey of the mathematical theory. New York: Benjamin.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Nelson, P. G., Fields, R. D., Yu, C., & Liu, Y. (1993). Synapse elimination from the mouse neuromuscular junction in vitro: A non-Hebbian activity-dependent process. J. Neurobiol., 24, 1517–1530.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Salin, P. A., Malenka, R. C., & Nicoll, R. A. (1996).
Cyclic AMP mediates a presynaptic form of LTP at cerebellar parallel fiber synapses. Neuron, 16, 797–803.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Urban, N. N., & Barrionuevo, G. (1996). Induction of Hebbian and non-Hebbian mossy fiber long-term potentiation by distinct patterns of high-frequency stimulation. J. Neurosci., 16, 4293–4299.

Received August 3, 1998; accepted April 29, 1999.
LETTER
Communicated by John Platt
On-line EM Algorithm for the Normalized Gaussian Network

Masa-aki Sato
ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

Shin Ishii
Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0101, Japan

A normalized gaussian network (NGnet) (Moody & Darken, 1989) is a network of local linear regression units. The model softly partitions the input space by normalized gaussian functions, and each local unit linearly approximates the output within its partition. In this article, we propose a new on-line EM algorithm for the NGnet, which is derived from the batch EM algorithm (Xu, Jordan, & Hinton, 1995) by introducing a discount factor. We show that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we show that the on-line EM algorithm can be considered as a stochastic approximation method for finding the maximum likelihood estimator. A new regularization method is proposed in order to deal with a singular input distribution. In order to manage dynamic environments, where the input-output distribution of the data changes over time, unit manipulation mechanisms such as unit production, unit deletion, and unit division are also introduced, based on a probabilistic interpretation. Experimental results show that our approach is suitable for function approximation problems in dynamic environments. We also apply our on-line EM algorithm to robot dynamics problems and compare our algorithm with the mixtures-of-experts family.

1 Introduction
The main aim of this work is to develop a fast on-line learning method that can deal with dynamic environments. We use a normalized gaussian network (NGnet) (Moody & Darken, 1989) for this purpose. The NGnet is a network of local linear regression units. The model softly partitions the input space by normalized gaussian functions, and each local unit linearly approximates the output within its partition. Since the NGnet is a local model, it is possible to change the parameters of only several units in order to learn a single datum. Therefore, the learning process becomes easier than that of global models such as the multilayered perceptron. In a local model, on the other hand, the number of necessary units grows exponentially as the

Neural Computation 12, 407–432 (2000)
© 2000 Massachusetts Institute of Technology
408
Masa-aki Sato and Shin Ishii
input dimension increases if one wants to approximate the input-output relationship over the whole input space. This results in a computational explosion, which is often called the curse of dimensionality. However, actual data often lie in a lower-dimensional space than the input space, as in attractors of dynamical systems. The NGnet, a kind of mixtures-of-experts model (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994), can be interpreted as the output of a stochastic model with hidden variables. The model parameters can be determined by the maximum likelihood estimation method. In particular, the expectation-maximization (EM) algorithm for the NGnet was derived by Xu, Jordan, and Hinton (1995). In this article, we propose a new on-line EM algorithm for the NGnet. The on-line EM algorithm is derived from the batch EM algorithm (Xu et al., 1995) by introducing a discount factor. We show that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we show that the on-line EM algorithm can be considered as a stochastic approximation method for finding the maximum likelihood estimator. Similar on-line EM algorithms have been proposed by Nowlan (1991) and Jordan and Jacobs (1994) for mixture models. However, there have been no convergence proofs for these on-line EM algorithms. Neal and Hinton (1998) proposed a wide variety of incremental EM algorithms and proved the convergence of their algorithms. One drawback of their algorithms is that they either need additional storage variables for all the training data or need to see all the training data at each iteration. This prevents these algorithms from being applied to real-time on-line settings where new data are supplied indefinitely. For practical applications, important modifications to the on-line EM algorithm are necessary.
A new regularization method is proposed in order to deal with a singular input distribution. In order to manage dynamic environments, where the input-output distribution of the data changes over time, unit manipulation mechanisms such as unit production, unit deletion, and unit division are also introduced, based on the probabilistic interpretation. The applicability of our new on-line EM algorithm is investigated using function approximation problems in three circumstances: a usual function approximation for a set of observed data; a function approximation in which the distribution of the input data changes over time (this experiment is performed to check the applicability of our approach to dynamic environments); and a function approximation in which the distribution of the input data is singular (i.e., the dimension of the input data distribution is smaller than the input space dimension). In the last case, a straightforward application of the basic on-line EM algorithm does not work, because the covariance matrices used in the NGnet become singular. Our modified on-line EM algorithm works well even in this situation. Finally, we apply our on-line EM algorithm to realistic and difficult problems in robot dynamics. The performance of our on-line EM algorithm is compared with those of the mixtures-of-experts family.
On-line EM Algorithm for the Normalized Gaussian Network
409
2 Normalized Gaussian Network
The NGnet model, which transforms an N-dimensional input vector x into a D-dimensional output vector y, is defined by the following equations:
$$y = \sum_{i=1}^{M} \mathcal N_i(x)\, \tilde W_i \tilde x \tag{2.1a}$$
$$\mathcal N_i(x) \equiv G_i(x) \Big/ \sum_{j=1}^{M} G_j(x) \tag{2.1b}$$
$$G_i(x) \equiv (2\pi)^{-N/2}\, |\Sigma_i|^{-1/2} \exp\!\left[-\frac{1}{2} (x - \mu_i)'\, \Sigma_i^{-1} (x - \mu_i)\right]. \tag{2.1c}$$
M denotes the number of units, and the prime ($'$) denotes a transpose. $G_i(x)$ is an N-dimensional gaussian function; its center is an N-dimensional vector $\mu_i$, and its covariance matrix is an $(N \times N)$-dimensional matrix $\Sigma_i$. $|\Sigma_i|$ is the determinant of the matrix $\Sigma_i$. $\mathcal N_i(x)$ is the $i$th normalized gaussian function. $\tilde W_i \equiv (W_i, b_i)$ is a $(D \times (N + 1))$-dimensional linear regression matrix, $\tilde x' \equiv (x', 1)$, and $\tilde W_i \tilde x \equiv W_i x + b_i$. The gaussian function $G_i(x)$ is a kind of radial basis function (Poggio & Girosi, 1990). However, the normalized gaussian functions $\mathcal N_i(x)$ ($i = 1, \ldots, M$) do not have radial symmetry. They softly partition the input space into M regions. The $i$th unit linearly approximates its output by $\tilde W_i \tilde x$ within the corresponding region. An output of the NGnet is given by a summation of these outputs weighted by the normalized gaussian functions. The NGnet (see equation 2.1) can be interpreted as a stochastic model in which a pair of an input and an output, $(x, y)$, is a stochastic (incomplete) event. For each event, a single unit is assumed to be selected from a set of classes $\{i \mid i = 1, \ldots, M\}$. The unit index $i$ is regarded as a hidden variable. A triplet $(x, y, i)$ is called a complete event. The stochastic model is defined by the probability distribution for a complete event (Xu et al., 1995),
$$P(x, y, i \mid \theta) = (2\pi)^{-(D+N)/2}\, \sigma_i^{-D}\, |\Sigma_i|^{-1/2}\, M^{-1} \exp\!\left[-\frac{1}{2} (x - \mu_i)'\, \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2\sigma_i^2} (y - \tilde W_i \tilde x)^2\right], \tag{2.2}$$
where \theta \equiv \{\mu_i, \Sigma_i, \sigma_i^2, \tilde{W}_i \mid i = 1, ..., M\} is a set of model parameters. From the probability distribution 2.2, the following probabilities are obtained:

P(i \mid \theta) = 1/M   (2.3a)

P(x \mid i, \theta) = G_i(x)   (2.3b)

P(y \mid x, i, \theta) = (2\pi)^{-D/2} \sigma_i^{-D} \exp\!\left( -\tfrac{1}{2\sigma_i^2} (y - \tilde{W}_i \tilde{x})^2 \right).   (2.3c)
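As a concrete sketch of the model definition, the following Python/NumPy fragment evaluates the gaussian functions G_i(x), the normalized gaussian functions \mathcal{N}_i(x), and the NGnet output of equation 2.1a. The function name and array layout are our own illustrative conventions, not part of the original specification.

```python
import numpy as np

def ngnet_output(x, mu, Sigma, W, b):
    """Output of a Normalized Gaussian Network (equation 2.1).

    x     : (N,) input vector
    mu    : (M, N) unit centers
    Sigma : (M, N, N) unit covariance matrices
    W     : (M, D, N) regression matrices
    b     : (M, D) regression biases
    """
    M, N = mu.shape
    G = np.empty(M)
    for i in range(M):
        d = x - mu[i]
        # Gaussian function G_i(x) of equation 2.1c
        G[i] = ((2 * np.pi) ** (-N / 2) * np.linalg.det(Sigma[i]) ** -0.5
                * np.exp(-0.5 * d @ np.linalg.solve(Sigma[i], d)))
    Nf = G / G.sum()  # normalized gaussian functions N_i(x), eq. 2.1b
    # weighted sum of local linear models, eq. 2.1a
    return sum(Nf[i] * (W[i] @ x + b[i]) for i in range(M))
```

For instance, with two symmetric units, zero regression matrices, and biases 0 and 1, the output at the midpoint between the centers is the average of the two biases.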
Masa-aki Sato and Shin Ishii
These probabilities define a stochastic data generation model. First, a unit is randomly selected with an equal probability (see equation 2.3a). If the ith unit is selected, the input x is generated according to the gaussian distribution (see equation 2.3b). Given the unit index i and the input x, the output y is generated according to the gaussian distribution (see equation 2.3c), which has the mean \tilde{W}_i \tilde{x} and the variance \sigma_i^2. When the input x is observed, the probability that the output value becomes y in this model turns out to be

P(y \mid x, \theta) = \sum_{i=1}^{M} \mathcal{N}_i(x) P(y \mid x, i, \theta).   (2.4)
From this conditional probability, the expectation value of the output y for a given input x is obtained as

E[y \mid x] \equiv \int y \, P(y \mid x, \theta) \, dy = \sum_{i=1}^{M} \mathcal{N}_i(x) \, \tilde{W}_i \tilde{x},   (2.5)

which is equivalent to the output of the NGnet (see equation 2.1a). The probability distribution (see equation 2.2) provides a stochastic model for the NGnet.

3 EM Algorithm
From a set of T events (observed data), (\{x\}, \{y\}) \equiv \{(x(t), y(t)) \mid t = 1, ..., T\}, the model parameter \theta of the stochastic model (see equation 2.2) can be determined by the maximum likelihood estimation method. In particular, the EM algorithm (Dempster, Laird, & Rubin, 1977) can be applied to models having hidden variables. The EM algorithm repeats the following E (expectation) step and M (maximization) step. Since the likelihood for the set of observations increases (or does not change) after an E- and M-step, the maximum likelihood estimator is asymptotically obtained by repeating the E- and M-steps.

E-step. Let \bar{\theta} be the current estimator value. By using \bar{\theta}, the posterior probability that the ith unit is selected for each observation (x(t), y(t)) is calculated according to Bayes rule:

P(i \mid x(t), y(t), \bar{\theta}) = P(x(t), y(t), i \mid \bar{\theta}) \Big/ \sum_{j=1}^{M} P(x(t), y(t), j \mid \bar{\theta}).   (3.1)
M-step. By using the posterior probability (see equation 3.1), the expected log-likelihood Q(\theta \mid \bar{\theta}, \{x\}, \{y\}) for the complete events is defined by

Q(\theta \mid \bar{\theta}, \{x\}, \{y\}) = \sum_{t=1}^{T} \sum_{i=1}^{M} P(i \mid x(t), y(t), \bar{\theta}) \log P(x(t), y(t), i \mid \theta).   (3.2)
On the other hand, the log-likelihood of the observed data (\{x\}, \{y\}) is given by

L(\theta \mid \{x\}, \{y\}) = \sum_{t=1}^{T} \log P(x(t), y(t) \mid \theta) = \sum_{t=1}^{T} \log \left( \sum_{i=1}^{M} P(x(t), y(t), i \mid \theta) \right).   (3.3)
Since an increase of Q(\theta \mid \bar{\theta}, \{x\}, \{y\}) implies an increase of the log-likelihood L(\theta \mid \{x\}, \{y\}) (Dempster et al., 1977), Q(\theta \mid \bar{\theta}, \{x\}, \{y\}) is maximized with respect to the estimator \theta. A solution of the necessity condition \partial Q / \partial \theta = 0 is given (Xu et al., 1995) by

\mu_i = \langle x \rangle_i(T) / \langle 1 \rangle_i(T)   (3.4a)

\Sigma_i = \langle (x - \mu_i)(x - \mu_i)' \rangle_i(T) / \langle 1 \rangle_i(T) = \langle x x' \rangle_i(T) / \langle 1 \rangle_i(T) - \mu_i(T) \mu_i'(T)   (3.4b)

\tilde{W}_i \langle \tilde{x} \tilde{x}' \rangle_i(T) = \langle y \tilde{x}' \rangle_i(T)   (3.4c)

\sigma_i^2 = \frac{1}{D} \langle |y - \tilde{W}_i \tilde{x}|^2 \rangle_i(T) / \langle 1 \rangle_i(T) = \frac{1}{D} \left[ \langle |y|^2 \rangle_i(T) - \mathrm{Tr}\!\left( \tilde{W}_i \langle \tilde{x} y' \rangle_i(T) \right) \right] / \langle 1 \rangle_i(T),   (3.4d)

where \mathrm{Tr}(\cdot) denotes a matrix trace. A symbol \langle \cdot \rangle_i denotes a weighted mean with respect to the posterior probability (see equation 3.1), and it is defined by

\langle f(x, y) \rangle_i(T) \equiv \frac{1}{T} \sum_{t=1}^{T} f(x(t), y(t)) P(i \mid x(t), y(t), \bar{\theta}).   (3.5)
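The M-step solution (equations 3.4a–3.4d) together with the weighted means of equation 3.5 can be sketched as a single batch update; the E-step posteriors are assumed to be given. This is an illustrative implementation in NumPy, not the authors' code.

```python
import numpy as np

def m_step(x, y, post):
    """One M-step of the batch EM algorithm (equations 3.4a-3.4d).

    x    : (T, N) inputs;  y : (T, D) outputs
    post : (T, M) posteriors P(i | x(t), y(t), theta_bar) from the E-step
    Returns mu, Sigma, Wt (D x (N+1) regression matrices), sigma2.
    """
    T, N = x.shape
    D = y.shape[1]
    M = post.shape[1]
    xt = np.hstack([x, np.ones((T, 1))])          # x-tilde = (x', 1)'
    mu = np.empty((M, N)); Sigma = np.empty((M, N, N))
    Wt = np.empty((M, D, N + 1)); sigma2 = np.empty(M)
    for i in range(M):
        w = post[:, i]
        one = w.mean()                            # <1>_i(T)
        mu[i] = (w @ x) / (T * one)               # eq. 3.4a
        xc = x - mu[i]
        Sigma[i] = (xc * w[:, None]).T @ xc / (T * one)   # eq. 3.4b
        A = (xt * w[:, None]).T @ xt / T          # <x-tilde x-tilde'>_i(T)
        B = (y * w[:, None]).T @ xt / T           # <y x-tilde'>_i(T)
        Wt[i] = B @ np.linalg.inv(A)              # eq. 3.4c
        r = y - xt @ Wt[i].T                      # residual y - Wt x-tilde
        sigma2[i] = (w @ (r * r).sum(axis=1)) / (D * T * one)  # eq. 3.4d
    return mu, Sigma, Wt, sigma2
```

With a single unit and noise-free linear data, one M-step already recovers the exact regression matrix and a vanishing error variance.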
If an infinite number of data, which are drawn independently according to an unknown data distribution density \rho(x, y), are given, the weighted mean (see equation 3.5) converges to the following expectation value,

\langle f(x, y) \rangle_i(T) \xrightarrow{T \to \infty} E[f(x, y) P(i \mid x, y, \bar{\theta})]_\rho,   (3.6)

where E[\cdot]_\rho denotes the expectation value with respect to the data distribution density \rho(x, y). In this case, the M-step equation (see equation 3.4) can be written as

g_i \equiv E[P(i \mid x, y, \bar{\theta})]_\rho   (3.7a)
\mu_i = E[x \, P(i \mid x, y, \bar{\theta})]_\rho / g_i   (3.7b)

\Sigma_i = E[x x' \, P(i \mid x, y, \bar{\theta})]_\rho / g_i - \mu_i \mu_i'   (3.7c)

\tilde{W}_i \, E[\tilde{x} \tilde{x}' \, P(i \mid x, y, \bar{\theta})]_\rho = E[y \tilde{x}' \, P(i \mid x, y, \bar{\theta})]_\rho   (3.7d)

\sigma_i^2 = \frac{1}{D} \left( E[|y|^2 P(i \mid x, y, \bar{\theta})]_\rho - \mathrm{Tr}\!\left( \tilde{W}_i \, E[\tilde{x} y' \, P(i \mid x, y, \bar{\theta})]_\rho \right) \right) / g_i,   (3.7e)
where the new parameter set, \theta = \{\mu_i, \Sigma_i, \tilde{W}_i, \sigma_i^2 \mid i = 1, ..., M\}, is calculated by using the old parameter set, \bar{\theta} = \{\bar{\mu}_i, \bar{\Sigma}_i, \bar{\tilde{W}}_i, \bar{\sigma}_i^2 \mid i = 1, ..., M\}.

At an equilibrium point of this EM algorithm, the old and new parameters become the same: \theta = \bar{\theta}. The equilibrium condition of the EM algorithm (equation 3.7 along with \theta = \bar{\theta}) is equivalent to the maximum likelihood condition,

\partial L(\theta) / \partial \theta = 0,   (3.8)

where the log-likelihood function is given by

L(\theta) = E\!\left[ \log \sum_{i=1}^{M} P(x, y, i \mid \theta) \right]_\rho.   (3.9)
4 On-line EM Algorithm

4.1 Basic Algorithm. The EM algorithm introduced in the previous section is based on batch learning (Xu et al., 1995); the parameters are updated after seeing all the observed data (\{x\}, \{y\}). In this section, we derive an on-line version of the EM algorithm. Since the estimator is changed after each observation, let \theta(t) be the estimator after the tth observation (x(t), y(t)). In the on-line EM algorithm, the weighted mean (see equation 3.5) is replaced by

\langle\langle f(x, y) \rangle\rangle_i(T) \equiv \eta(T) \sum_{t=1}^{T} \left( \prod_{s=t+1}^{T} \lambda(s) \right) f(x(t), y(t)) \, P(i \mid x(t), y(t), \theta(t-1))   (4.1a)

\eta(T) \equiv \left( \sum_{t=1}^{T} \prod_{s=t+1}^{T} \lambda(s) \right)^{-1}.   (4.1b)

Here, the parameter \lambda(t) (0 \le \lambda(t) \le 1) is a time-dependent discount factor, which is introduced for forgetting the effect of the old posterior values employing the earlier inaccurate estimator. \eta(T) is a normalization coefficient and plays a role like a learning rate.
The modified weighted mean \langle\langle \cdot \rangle\rangle_i can be obtained by the step-wise equations

\langle\langle f(x, y) \rangle\rangle_i(t) = \langle\langle f(x, y) \rangle\rangle_i(t-1) + \eta(t) \left[ f(x(t), y(t)) \, P_i(t) - \langle\langle f(x, y) \rangle\rangle_i(t-1) \right]   (4.2a)

\eta(t) = \left( 1 + \lambda(t) / \eta(t-1) \right)^{-1},   (4.2b)

where P_i(t) \equiv P(i \mid x(t), y(t), \theta(t-1)). The constraint 0 \le \lambda(t) \le 1 gives the constraint on \eta(t):

1 \ge \eta(t) \ge 1/t.   (4.3)

If this constraint is satisfied, equation 4.2b can be solved for \lambda(t) as

\lambda(t) = \eta(t-1) \left( 1/\eta(t) - 1 \right).   (4.4)
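The step-wise update 4.2a–4.2b can be sketched as follows; with \lambda(t) \equiv 1 the recursion gives \eta(t) = 1/t and reproduces the plain batch average of f \cdot P_i, as the equivalence argument of section 4.2 requires. Function names are our own illustrative choices.

```python
def stepwise_weighted_mean(f_vals, posts, lams):
    """Discounted weighted mean <<f>>_i (equations 4.2a-4.2b).

    f_vals : sequence of f(x(t), y(t)) values
    posts  : sequence of posteriors P_i(t)
    lams   : sequence of discount factors lambda(t) in [0, 1]
    """
    mean, eta = 0.0, None
    for f, P, lam in zip(f_vals, posts, lams):
        # eq. 4.2b; eta(1) = 1 follows from eq. 4.1b regardless of lambda(1)
        eta = 1.0 if eta is None else 1.0 / (1.0 + lam / eta)
        mean = mean + eta * (f * P - mean)        # eq. 4.2a
    return mean
```

With full memory (all discount factors 1) the result is the sample average; with \lambda(t) = 0 the past is forgotten entirely and only the latest term survives.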
After calculating the modified weighted means, that is, \langle\langle 1 \rangle\rangle_i(t), \langle\langle x \rangle\rangle_i(t), \langle\langle x x' \rangle\rangle_i(t), \langle\langle |y|^2 \rangle\rangle_i(t), and \langle\langle y \tilde{x}' \rangle\rangle_i(t), according to equation 4.2, the new estimator \theta(t) is obtained by the following equations:

\mu_i(t) = \langle\langle x \rangle\rangle_i(t) / \langle\langle 1 \rangle\rangle_i(t)   (4.5a)

\Sigma_i^{-1}(t) = \left[ \langle\langle x x' \rangle\rangle_i(t) / \langle\langle 1 \rangle\rangle_i(t) - \mu_i(t) \mu_i'(t) \right]^{-1}   (4.5b)

\tilde{W}_i(t) = \langle\langle y \tilde{x}' \rangle\rangle_i(t) \left[ \langle\langle \tilde{x} \tilde{x}' \rangle\rangle_i(t) \right]^{-1}   (4.5c)

\sigma_i^2(t) = \frac{1}{D} \left[ \langle\langle |y|^2 \rangle\rangle_i(t) - \mathrm{Tr}\!\left( \tilde{W}_i(t) \langle\langle \tilde{x} y' \rangle\rangle_i(t) \right) \right] / \langle\langle 1 \rangle\rangle_i(t).   (4.5d)
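Given the accumulated weighted means, the parameter read-out of equations 4.5a–4.5d can be sketched as below; the argument names and array shapes are our own conventions for illustration.

```python
import numpy as np

def params_from_means(one, x_m, xx_m, yx_m, yy_m, xt_xt_m):
    """Model parameters from the accumulated weighted means (eqs. 4.5a-4.5d).

    one     : <<1>>_i(t)                 (scalar)
    x_m     : <<x>>_i(t)                 (N,)
    xx_m    : <<x x'>>_i(t)              (N, N)
    yx_m    : <<y x-tilde'>>_i(t)        (D, N+1)
    yy_m    : <<|y|^2>>_i(t)             (scalar)
    xt_xt_m : <<x-tilde x-tilde'>>_i(t)  (N+1, N+1)
    """
    D = yx_m.shape[0]
    mu = x_m / one                                             # eq. 4.5a
    Sigma_inv = np.linalg.inv(xx_m / one - np.outer(mu, mu))   # eq. 4.5b
    Wt = yx_m @ np.linalg.inv(xt_xt_m)                         # eq. 4.5c
    sigma2 = (yy_m - np.trace(Wt @ yx_m.T)) / (D * one)        # eq. 4.5d
    return mu, Sigma_inv, Wt, sigma2
```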
This defines the on-line EM algorithm. In this algorithm, calculations of the inverse matrices are necessary at each time step. A recursive formula for \Sigma_i^{-1} and \tilde{W}_i can be derived using a standard method (Jürgen, 1996), so there is no need to calculate the matrix inverse. Let us define the weighted inverse covariance matrix of \tilde{x}: \tilde{\Lambda}_i(t) \equiv \left( \langle\langle \tilde{x} \tilde{x}' \rangle\rangle_i(t) \right)^{-1}. This quantity can be obtained by the step-wise equation

\tilde{\Lambda}_i(t) = \frac{1}{1 - \eta(t)} \left[ \tilde{\Lambda}_i(t-1) - \frac{P_i(t) \, \tilde{\Lambda}_i(t-1) \tilde{x}(t) \tilde{x}'(t) \tilde{\Lambda}_i(t-1)}{(1/\eta(t) - 1) + P_i(t) \, \tilde{x}'(t) \tilde{\Lambda}_i(t-1) \tilde{x}(t)} \right].   (4.6)

\Sigma_i^{-1}(t) can be obtained from the following relation with \tilde{\Lambda}_i(t):

\tilde{\Lambda}_i(t) \langle\langle 1 \rangle\rangle_i(t) = \begin{pmatrix} \Sigma_i^{-1}(t) & -\Sigma_i^{-1}(t) \mu_i(t) \\ -\mu_i'(t) \Sigma_i^{-1}(t) & 1 + \mu_i'(t) \Sigma_i^{-1}(t) \mu_i(t) \end{pmatrix}.   (4.7)
The estimator for the linear regression matrix, \tilde{W}_i, is given by

\tilde{W}_i(t) = \langle\langle y \tilde{x}' \rangle\rangle_i(t) \, \tilde{\Lambda}_i(t)   (4.8a)
\phantom{\tilde{W}_i(t)} = \tilde{W}_i(t-1) + \eta(t) P_i(t) \left( y(t) - \tilde{W}_i(t-1) \tilde{x}(t) \right) \tilde{x}'(t) \, \tilde{\Lambda}_i(t).   (4.8b)
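Equation 4.6 is a Sherman–Morrison-type rank-one update of the inverse weighted correlation matrix \tilde{\Lambda}_i, which is why no explicit matrix inversion is needed. A minimal sketch (our own naming), which can be checked against a direct inverse:

```python
import numpy as np

def update_inv_corr(Lam, xt, P_i, eta):
    """Recursive update of Lam = <<x-tilde x-tilde'>>^{-1} (equation 4.6).

    A Sherman-Morrison step: the underlying correlation matrix evolves as
    S(t) = (1 - eta) * S(t-1) + eta * P_i * xt xt', and Lam tracks S(t)^{-1}.
    """
    Lx = Lam @ xt
    denom = (1.0 / eta - 1.0) + P_i * (xt @ Lx)
    return (Lam - P_i * np.outer(Lx, Lx) / denom) / (1.0 - eta)
```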
Thus, the basic on-line EM algorithm is summarized as follows. For the tth observation (x(t), y(t)), the weighted means \langle\langle 1 \rangle\rangle_i(t), \langle\langle x \rangle\rangle_i(t), \langle\langle |y|^2 \rangle\rangle_i(t), and \langle\langle \tilde{x} y' \rangle\rangle_i(t) are calculated by using the step-wise equation 4.2. \tilde{\Lambda}_i(t) is also calculated with the step-wise equation 4.6. Subsequently, the estimators for the model parameters are obtained by equations 4.5a, 4.5d, 4.7, and 4.8b.

4.2 Equivalence of On-line and Batch EM Algorithms. In this part, we show that the basic on-line EM algorithm defined by equations 3.1, 4.2, and 4.5 is equivalent to the batch EM algorithm defined by equations 3.1, 3.4, and 3.5, with an appropriate choice of \lambda(t). It is assumed that the same set of data, \{(x(t), y(t)) \mid t = 1, ..., T\}, is repeatedly supplied to the learning system, that is, x(t + T) = x(t) and y(t + T) = y(t). The time t is represented by an epoch index k (= 1, 2, ...) and a data number index m (= 1, ..., T) within an epoch, that is, t = (k - 1)T + m. Let us assume that the model parameter \theta is updated at the end of each epoch, when m = T, and it is denoted by \theta(k). Let us suppose that \lambda(t) is given by

\lambda(t) = \begin{cases} 0 & \text{if } t = (k - 1)T + 1 \\ 1 & \text{otherwise.} \end{cases}   (4.9)
The corresponding \eta(t) is given by \eta((k-1)T + m) = 1/m. Then the weighted mean \langle\langle f(x, y) \rangle\rangle_i(t) is initialized to f(x(1), y(1)) P(i \mid x(1), y(1), \theta(k-1)) at the beginning of an epoch. At the end of an epoch, \langle\langle f(x, y) \rangle\rangle_i(kT) = \langle f(x, y) \rangle_i(T) is satisfied for \bar{\theta} = \theta(k-1). This shows that the on-line EM algorithm, together with \lambda(t) given by equation 4.9, is equivalent to the batch EM algorithm. Consequently, the estimator of the on-line EM algorithm converges to the maximum likelihood estimator. It should be noted that the step-wise equation 4.6 for \tilde{\Lambda}_i(t) cannot be used in this case, since the equation becomes singular for \eta(t) = 1.

4.3 Stochastic Approximation. If an infinite number of data, which are drawn independently according to the data distribution density \rho(x, y), are available, the on-line EM algorithm can be considered as a stochastic approximation (Kushner & Yin, 1997) for obtaining the maximum likelihood estimator, as demonstrated below. Let \omega be a compact notation for the weighted means: \omega(t) \equiv \{\langle\langle 1 \rangle\rangle_i(t), \langle\langle x \rangle\rangle_i(t), \langle\langle x x' \rangle\rangle_i(t), \langle\langle y \tilde{x}' \rangle\rangle_i(t), \langle\langle |y|^2 \rangle\rangle_i(t) \mid i = 1, ..., M\}. The on-line EM algorithm, equations 4.2 and 4.5, can be written in an abstract form:

\delta\omega(t) \equiv \omega(t) - \omega(t-1) = \eta(t) \left[ F(x(t), y(t), \theta(t-1)) - \omega(t-1) \right]   (4.10a)
\theta(t) = H(\omega(t)).   (4.10b)
It can be easily proved that the set of equations,

\omega = E[F(x, y, \theta)]_\rho   (4.11a)
\theta = H(\omega),   (4.11b)

is equivalent to the maximum likelihood condition, equation 3.8. Then the on-line EM algorithm can be written as

\delta\omega(t) = \eta(t) \left( E[F(x, y, H(\omega(t-1)))]_\rho - \omega(t-1) \right) + \eta(t) \, \zeta(x(t), y(t), \omega(t-1)),   (4.12)

where the stochastic noise term \zeta is defined by

\zeta(x(t), y(t), \omega(t-1)) \equiv F(x(t), y(t), H(\omega(t-1))) - E[F(x, y, H(\omega(t-1)))]_\rho,   (4.13)

and it satisfies

E[\zeta(x, y, \omega)]_\rho = 0.   (4.14)
Equation 4.12 has the same form as the Robbins–Monro stochastic approximation (Kushner & Yin, 1997), which finds the maximum likelihood estimator given by equation 4.11. The effective learning coefficient \eta(t) should satisfy the conditions

\eta(t) \xrightarrow{t \to \infty} 0, \qquad \sum_{t=1}^{\infty} \eta(t) = \infty, \qquad \sum_{t=1}^{\infty} \eta^2(t) < \infty.   (4.15)

Typically, \eta(t), which satisfies conditions 4.3 and 4.15, is given by

\eta(t) \xrightarrow{t \to \infty} \frac{1}{at + b} \qquad (1 > a > 0).   (4.16)

The corresponding discount factor is given by

\lambda(t) \xrightarrow{t \to \infty} 1 - \frac{1 - a}{at + (b - a)};   (4.17)

\lambda(t) is increased such that \lambda(t) approaches 1 as t \to \infty. For the convergence proof of the stochastic approximation (see equation 4.12), the boundedness of the noise variance is necessary (Kushner & Yin, 1997). The noise variance is given by

E[\zeta(x, y, \omega)^2]_\rho = E[F(x, y, H(\omega))^2]_\rho - E[F(x, y, H(\omega))]_\rho^2.   (4.18)
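The asymptotic schedules 4.16 and 4.17 can be sketched as below; the constants a and b are illustrative choices, not values from the paper. Their mutual consistency with equation 4.4 can be verified numerically.

```python
def eta_t(t, a=0.01, b=2.0):
    """Effective learning coefficient eta(t) = 1/(a*t + b) (cf. equation 4.16).
    The values a = 0.01, b = 2.0 are illustrative, not taken from the paper."""
    return 1.0 / (a * t + b)

def lambda_t(t, a=0.01, b=2.0):
    """Corresponding discount factor (equation 4.17):
    lambda(t) = 1 - (1 - a)/(a*t + (b - a))."""
    return 1.0 - (1.0 - a) / (a * t + (b - a))
```

Substituting \eta(t) = 1/(at + b) into equation 4.4 reproduces equation 4.17 exactly, so the two schedules describe the same algorithm.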
Both terms on the right-hand side in equation 4.18 are finite if we assume that the data distribution density \rho(x, y) has a compact support. Consequently, the noise variance becomes finite under this assumption. This assumption is not so restrictive, because actual data always distribute in a finite domain. We can weaken this assumption such that \rho(x, y) decreases exponentially as |x| or |y| goes to infinity. It should be noted that the stochastic approximation (see equation 4.12) for finding the maximum likelihood estimator is not a stochastic gradient ascent algorithm for the log-likelihood function (see equation 3.9). The on-line EM algorithm (see equation 4.12) is faster than the stochastic gradient ascent algorithm, because the M-step is solved exactly. So far we have used the on-line EM algorithm defined by equations 4.2 and 4.5, which is equivalent to the basic on-line EM algorithm defined by equations 4.2, 4.5a, 4.5d, 4.6, 4.7, and 4.8b, if \eta(t) \ne 1. Therefore, the basic on-line EM algorithm is also a stochastic approximation.

5 Regularization

5.1 Regularization of Covariance Matrix. We have assumed so far that the covariance matrix of the input data is not singular in each region; that is, the inverse matrix for every \Sigma_i (i = 1, ..., M) exists. In actual applications of the algorithm, however, this assumption often fails. A typical case occurs when the dimension of the input data distribution is smaller than the dimension of the input space. In such a case, \Sigma_i^{-1} cannot be calculated, and \tilde{\Lambda}_i diverges exponentially with the time t. There are several methods for dealing with this problem, such as the introduction of Bayes priors, the singular value decomposition, and ridge regression. However, these methods are not satisfactory for our purpose. Here, we propose a simple new method that performs well. Let us first consider a regularization of an (N \times N)-dimensional covariance matrix, \Sigma, for the observed data.
Its eigenvalues and normalized eigenvectors are denoted by \xi_n (\ge 0) and \psi_n (n = 1, ..., N), respectively. The set of the eigenvectors, \{\psi_n \mid n = 1, ..., N\}, forms a set of orthonormal bases. The condition number of the covariance matrix \Sigma is defined by

u \equiv \xi_{\min} / \xi_{\max},   (5.1)

where \xi_{\min} and \xi_{\max} are the minimum and maximum eigenvalues of \Sigma, respectively, so that 0 \le u \le 1. If u = 0, the covariance matrix \Sigma is singular. If u \approx 1, the covariance matrix \Sigma is regular and the calculation of its inverse is numerically stable. If 0 < u \ll 1, the covariance matrix is regular and its inverse is given by \Sigma^{-1} = \sum_{n=1}^{N} \xi_n^{-1} \psi_n \psi_n'. It is dominated by \xi_{\min}^{-1}, which may contain a numerical noise. As a consequence, the inverse matrix \Sigma^{-1} becomes sensitive to numerical noises.
In order to deal with this problem, we propose the following regularized covariance matrix \Sigma_R for the data covariance matrix \Sigma:

\Sigma_R = \Sigma + \alpha \Delta^2 I_N \qquad (0 < \alpha < 1)   (5.2a)
\Delta^2 = \mathrm{Tr}(\Sigma) / N,   (5.2b)

where I_N is an (N \times N)-dimensional identity matrix. The data variance \Delta^2 satisfies the inequality

\xi_{\max} \ge \Delta^2 = \frac{1}{N} \sum_{n=1}^{N} \xi_n \ge \xi_{\max} / N.   (5.3)

The eigenvalues and eigenvectors of the regularized matrix \Sigma_R are given by (\xi_n + \alpha \Delta^2) and \psi_n, respectively. From the inequality (see equation 5.3), it can be proved that the condition number of the regularized matrix \Sigma_R satisfies the condition

u_R \ge \alpha / (N(1 + \alpha)).   (5.4)
Then the condition number of \Sigma_R is bounded below and controlled by the small constant \alpha. The rate of change for the regularized eigenvalue is defined by R(\xi_n) \equiv ((\xi_n + \alpha \Delta^2) - \xi_n) / \xi_n = \alpha (\Delta^2 / \xi_n). From the inequality (see equation 5.3), this ratio satisfies the conditions R(\xi_n) \ge R(\xi_{\max}) \ge \alpha / N and R(\xi_n) \le R(\xi_{\min}) \le \alpha / u. If all the eigenvalues of \Sigma are of the same order (i.e., u \approx O(1)), the rate of change becomes small (i.e., R(\xi_n) \approx O(\alpha)), so that the effect of the regularization is negligible. If the data covariance matrix \Sigma is singular (i.e., u = 0), the zero eigenvalue of \Sigma is replaced by \alpha \Delta^2. Thus, the regularized matrix \Sigma_R becomes regular, and its condition number is bounded by equation 5.4. In general, the eigenvalues that are larger than their average are slightly affected, and the very small eigenvalues are changed so as to be nearly equal to \alpha \Delta^2. If \Sigma = 0, \Delta^2 becomes zero and the regularization (see equation 5.2) does not work. Therefore, if \Delta^2 is smaller than a threshold value \Delta^2_{\min}, \Delta^2 is set to \Delta^2_{\min}. This prevents \Sigma_R from being singular even when \Sigma = 0. By introducing a Bayes prior for a regularized matrix \Sigma_B as P(\Sigma_B) = (\kappa/2)^N \exp(-(\kappa/2) \mathrm{Tr}(\Sigma_B^{-1})), one can get a similar regularization equation,

\Sigma_B = \Sigma + \kappa I_N,   (5.5)
where the regularization parameter \kappa is a given constant. However, it is rather difficult to determine the \kappa value without knowledge of the data covariance matrix. It is especially difficult for the NGnet, since there are M independent local covariance matrices \Sigma_i (i = 1, ..., M). The proposed method (see equation 5.2) automatically adjusts this constant by using the data variance \Delta^2, which is easily calculated in an on-line manner.
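The regularization of equations 5.2a–5.2b, including the \Delta^2_{\min} floor described above, can be sketched as follows; the floor value is an assumed constant, not one given in the paper.

```python
import numpy as np

def regularize_cov(S, alpha, delta2_min=1e-8):
    """Regularized covariance Sigma_R = Sigma + alpha * Delta^2 * I_N (eq. 5.2).

    Delta^2 = Tr(Sigma)/N (eq. 5.2b), floored at delta2_min so that Sigma_R
    stays regular even when Sigma = 0. The floor value is an assumption.
    """
    N = S.shape[0]
    delta2 = max(np.trace(S) / N, delta2_min)
    return S + alpha * delta2 * np.eye(N)
```

For a singular covariance matrix, the zero eigenvalue is lifted to \alpha \Delta^2, and the resulting condition number respects the lower bound of equation 5.4.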
5.2 Regularization of On-line EM Algorithm. Based on the above consideration, the ith weighted covariance matrix is redefined in our on-line EM algorithm as

\Sigma_i^{-1}(t) = \left[ \left( \langle\langle x x' \rangle\rangle_i(t) - \mu_i(t) \mu_i'(t) \langle\langle 1 \rangle\rangle_i(t) + \alpha \langle\langle \Delta_i^2 \rangle\rangle_i(t) I_N \right) \Big/ \langle\langle 1 \rangle\rangle_i(t) \right]^{-1},   (5.6)

where

\langle\langle \Delta_i^2 \rangle\rangle_i(t) \equiv \langle\langle |x - \mu_i(t)|^2 \rangle\rangle_i(t) / N = \left( \langle\langle |x|^2 \rangle\rangle_i(t) - |\mu_i(t)|^2 \langle\langle 1 \rangle\rangle_i(t) \right) / N.   (5.7)
The regularized \Sigma_i^{-1} (see equation 5.6) can be obtained from relation 4.7 by using the following regularized \tilde{\Lambda}_i:

\tilde{\Lambda}_i(t) = \left( \langle\langle \tilde{x} \tilde{x}' \rangle\rangle_i(t) + \alpha \langle\langle \Delta_i^2 \rangle\rangle_i(t) \tilde{I}_N \right)^{-1}.   (5.8)

The ((N+1) \times (N+1))-dimensional matrix \tilde{I}_N is defined by

\tilde{I}_N \equiv \begin{pmatrix} I_N & 0 \\ 0 & 0 \end{pmatrix} = \sum_{n=1}^{N} \tilde{e}_n \tilde{e}_n',

where \tilde{e}_n is an (N+1)-dimensional unit vector; its nth element is equal to 1, and the other elements are equal to 0. From the definition 5.7, \langle\langle \Delta_i^2 \rangle\rangle_i(t) can be written as

\langle\langle \Delta_i^2 \rangle\rangle_i(t) = \langle\langle \Delta_i^2 \rangle\rangle_i(t-1) + \eta(t) \left( \Delta_i^2(t) P_i(t) - \langle\langle \Delta_i^2 \rangle\rangle_i(t-1) \right),   (5.9)

where \Delta_i^2(t) is given by

\eta(t) P_i(t) \Delta_i^2(t) = \eta(t) P_i(t) |x(t) - \mu_i(t)|^2 / N + (1 - \eta(t)) |\mu_i(t) - \mu_i(t-1)|^2 \langle\langle 1 \rangle\rangle_i(t-1) / N.   (5.10)
The second term on the right-hand side of equation 5.10 comes from the difference between \langle\langle |x - \mu_i(t)|^2 \rangle\rangle_i(t-1) and \langle\langle |x - \mu_i(t-1)|^2 \rangle\rangle_i(t-1). Using equation 5.9, equation 5.8 can be rewritten as

\tilde{\Lambda}_i(t) = \left[ (1 - \eta(t)) \tilde{\Lambda}_i^{-1}(t-1) + \eta(t) \left( \tilde{x}(t) \tilde{x}'(t) + \sum_{n=1}^{N} \tilde{\nu}_n(t) \tilde{\nu}_n'(t) \right) P_i(t) \right]^{-1}   (5.11a)

\tilde{\nu}_n(t) \equiv \sqrt{\alpha} \, \Delta_i(t) \, \tilde{e}_n.   (5.11b)
As a result, the regularized \tilde{\Lambda}_i(t) (see equation 5.8) can be calculated as follows. For a given input datum x(t), \tilde{\Lambda}_i(t) is calculated using the step-wise equation 4.6. After that, \tilde{\Lambda}_i(t) is updated by

\tilde{\Lambda}_i(t) := \tilde{\Lambda}_i(t) - \frac{\eta(t) P_i(t) \, \tilde{\Lambda}_i(t) \tilde{\nu}_n(t) \tilde{\nu}_n'(t) \tilde{\Lambda}_i(t)}{1 + \eta(t) P_i(t) \, \tilde{\nu}_n'(t) \tilde{\Lambda}_i(t) \tilde{\nu}_n(t)},   (5.12)

using the virtual data \{\tilde{\nu}_n(t) \mid n = 1, ..., N\}.

After calculating the regularized \tilde{\Lambda}_i(t) (see equation 5.8), the linear regression matrix \tilde{W}_i is obtained by using equation 4.8b, in which the regularized \tilde{\Lambda}_i(t) is used. Only the observed data \{(x(t), y(t)) \mid t = 1, 2, ...\} are used in the calculation of equation 4.8b; the virtual data \{\tilde{\nu}_n(t) \mid n = 1, ..., N;\ t = 1, 2, ...\}, which have been used in the calculation of the regularized \tilde{\Lambda}_i(t), must not be used. Therefore, equation 4.8a does not hold in our regularization method. Since the regularized \tilde{\Lambda}_i(t) is positive definite, the linear regression matrix \tilde{W}_i at an equilibrium point of equation 4.8b satisfies the condition

\tilde{W}_i \, E[\tilde{x} \tilde{x}' P(i \mid x, y, \theta)]_\rho = E[y \tilde{x}' P(i \mid x, y, \theta)]_\rho,   (5.13)

which is identical to the maximum likelihood equation 3.7d along with \theta = \bar{\theta}. Although the matrix E[\tilde{x} \tilde{x}' P(i \mid x, y, \theta)]_\rho may not have an inverse, the relation (see equation 5.13) still has a meaning. Therefore, our method does not introduce a bias on the estimation of \tilde{W}_i.

6 Unit Manipulation
Since the EM algorithm guarantees only local optimality, the obtained estimator depends on its initial value. The initial allocation of the units in the input space is especially important for attaining a good approximation. If the initial allocation is quite different from the input distribution, much time is needed to achieve a proper allocation. In order to overcome this difficulty, we introduce dynamic unit manipulation mechanisms, which are also effective for dealing with dynamic environments. These mechanisms are unit production, unit deletion, and unit division, and they are conducted in an on-line manner after observing each datum (x(t), y(t)).

Unit production. The probability P(x(t), y(t), i \mid \theta(t-1)) indicates how probable it is that the ith unit produces the datum (x(t), y(t)) with the parameter \theta(t-1). Let 0 < P_{produce} \ll 1/M. When \max_{i=1}^{M} P(x(t), y(t), i \mid \theta(t-1)) < P_{produce}, the datum is too distant from the present units to be explained by the current stochastic model. In this case, a new unit is produced to account for the new datum. The initial parameters of the new unit are
given by:

\mu_{M+1} = x(t)   (6.1a)

\Sigma_{M+1}^{-1} = \chi_{M+1}^{-2} I_N, \qquad \chi_{M+1}^2 = \beta_1 \min_{i=1}^{M} |x(t) - \mu_i|^2 / N   (6.1b)

\sigma_{M+1}^2 = \beta_2 \max_{i=1}^{M} \sigma_i^2   (6.1c)

\tilde{W}_{M+1} \equiv (W_{M+1}, b_{M+1}) = (0, y(t)),   (6.1d)

where \beta_1 and \beta_2 are appropriate positive constants.

Unit deletion. The weighted mean \langle\langle 1 \rangle\rangle_i(t), which is calculated by equation 4.2, indicates how much the ith unit has been used to account for the data until t. Let 0 < P_{delete} \ll 1/M. If \langle\langle 1 \rangle\rangle_i(t) < P_{delete}, the unit has rarely been used. In this case, the ith unit is deleted.

Unit division. The unit error variance \sigma_i^2(t) (see equation 4.5d) indicates the squared error between the ith unit's prediction and the actual output. Let D_{divide} be a specific positive value. If \sigma_i^2(t) > D_{divide}, the unit's prediction is insufficient, probably because the partition in charge is too large to make a linear approximation. In this case, the ith unit is divided into two units, and the partition in charge is divided into two. The initial parameters of the two units are given by:

\mu_i(new) = \mu_i(old) + \beta_3 \sqrt{\xi_1} \psi_1, \qquad \mu_{M+1}(new) = \mu_i(old) - \beta_3 \sqrt{\xi_1} \psi_1   (6.2a)

\Sigma_i^{-1}(new) = \Sigma_{M+1}^{-1}(new) = 4 \xi_1^{-1} \psi_1 \psi_1' + \sum_{n=2}^{N} \xi_n^{-1} \psi_n \psi_n'   (6.2b)

\sigma_i^2(new) = \sigma_{M+1}^2(new) = \sigma_i^2(old) / 2   (6.2c)

\tilde{W}_i(new) = \tilde{W}_{M+1}(new) = \tilde{W}_i(old),   (6.2d)

where \xi_n and \psi_n denote the eigenvalues and the eigenvectors of the covariance matrix \Sigma_i(old), respectively, and \xi_1 = \xi_{\max}. \beta_3 is an appropriate positive constant. Although similar unit manipulation mechanisms have been proposed (Platt, 1991; Schaal & Atkeson, 1998), these mechanisms can be conducted with a probabilistic interpretation in our on-line EM algorithm.

7 Experiments

7.1 Function Approximation in Static Environment. Applicability of our algorithm is investigated using the following function (N = 2 and
Figure 1: Learning processes of batch EM algorithm (dashed line) and on-line EM algorithm (solid). Five hundred training data are supplied once in each epoch. Ordinate denotes nMSE (mean squared error normalized by variance of desired outputs). Error is evaluated on 41 × 41 mesh grids on the input domain.
D = 1), which was previously used by Schaal and Atkeson (1998):

y = \max\!\left\{ e^{-10 x_1^2},\ e^{-50 x_2^2},\ 1.25\, e^{-5(x_1^2 + x_2^2)} \right\} \qquad (-1 \le x_1, x_2 \le 1).   (7.1)
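The benchmark function 7.1 and a noisy training set of the kind used in this experiment can be generated as follows; the helper names, the random seed, and the use of NumPy's generator API are our own illustrative choices.

```python
import numpy as np

def target(x1, x2):
    """Benchmark function of Schaal & Atkeson (1998), equation 7.1."""
    return np.maximum.reduce([np.exp(-10 * x1**2),
                              np.exp(-50 * x2**2),
                              1.25 * np.exp(-5 * (x1**2 + x2**2))])

def make_training_set(T=500, noise=0.1, seed=0):
    """T input-output pairs with inputs uniform on [-1, 1]^2 and
    gaussian output noise N(0, noise), as in section 7.1."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(T, 2))
    y = target(x[:, 0], x[:, 1]) + rng.normal(0.0, noise, size=T)
    return x, y
```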
We prepared 500 input-output pairs, \{(x(t), y(t)) \mid t = 1, ..., 500\}, as a training data set by uniformly sampling the input variable vector, x \equiv (x_1, x_2), from its domain. A fairly large gaussian noise N(0, 0.1) is added to the outputs, where N(0, 0.1) denotes a gaussian distribution whose mean and standard deviation are 0 and 0.1, respectively. The task is for the NGnet to approximate the function 7.1 from the 500 noisy data. We compared the batch EM algorithm and the on-line EM algorithm. In this experiment, the discount factor \lambda(t) was scheduled to approach 1 as in equation 4.17. Figure 1 shows the time series of the normalized mean squared error, nMSE, which is normalized by the variance of the desired outputs (see equation 7.1). In the figure, we can see that both the batch EM algorithm and the on-line EM algorithm are able to approximate the target function in a small number of epochs. Although so-called overlearning can be seen in both learning algorithms, it is more noticeable in the batch EM algorithm than in the on-line EM algorithm. The number of units used in both algorithms was 50. The data distribution does not change over time in this task. In this case, it may be thought that batch learning is more suitable than on-line learning, because batch learning can process the whole data at once. However, our on-line EM algorithm has an ability similar to the batch algorithm. In addition, the on-line EM algorithm achieves
a faster and more accurate approximation for this task than the RFWR (receptive field weighted regression) model proposed by Schaal and Atkeson (1998).

7.2 Function Approximation in Dynamic Environments. The on-line EM algorithm is effective for function approximation in a dynamic environment, where the input-output distribution changes over time. For an experiment, the distribution of the input variable x_1 in equation 7.1 was continuously changed over 500 epochs from the uniform distribution on the interval [-1, -0.2] to that on the interval [0.2, 1]. At each epoch, the input variables were generated in the current domain. In applications of the on-line EM algorithm to such dynamic environments, the learning behavior changes according to the scheduling of the discount factor \lambda(t). When the discount factor is relatively small, the model tends to forget the past learning result and to adapt quickly to the new input-output distribution. We show two typical behavioral patterns in the learning phase. In the first case, \lambda(t) is initially set at a relatively small value, and it is slowly increased; that is, the effective learning coefficient \eta(t) is relatively large throughout the experiment. The NGnet adapts to the input distribution change by a relocation of the units' centers. During the course of this relocation, the units' centers move to the new region, and consequently the past approximation in the old region is forgotten. Figures 2b and 2c show the centers and the covariances of all the units at the 50th epoch and the 500th epoch, respectively. The NGnet adapts to the input distribution change by the drastic relocation of the units' centers. Figure 2a shows that the nMSE on the current input region does not grow large throughout the task. In this experiment, the number of units does not change. In the second case, \lambda(t) is rapidly increased; that is, \eta(t) rapidly approaches zero.
The NGnet adapts to the input distribution change without forgetting the past approximation result. Figure 3a shows this learning process. Figure 3b shows the centers and the covariances of all the units at the end of the learning task. Since the unit relocation is slow, the NGnet adapts to the input distribution change mainly by the unit production mechanism. Consequently, the units in the region where no more input data appear remain, and the function approximation on the whole input domain is accurately maintained.

7.3 Singular Input Distribution. In order to evaluate our regularization method, we prepared a function approximation problem in which the input variables are linearly dependent and there is an irrelevant variable. In such a case, the basic on-line EM algorithm without the regularization method obtains a poor result, because the input distribution becomes singular. In this experiment, the output y is given by the same function as equation 7.1, while the input variables are given by x \equiv (x_1, x_2, x_3, x_4, x_5) = (x_1, x_2, (x_1 + x_2)/2, (x_1 - x_2)/2, 0.1). When the input data are generated, a gaussian noise
Figure 2: (a) Learning process of on-line EM algorithm that quickly adapts to a dynamic environment while forgetting the past. Ordinate denotes nMSE evaluated on the current input domain at each epoch. (b, c) Dynamic unit relocation for the dynamic environment. Each ellipse denotes a receptive field of a unit; the ellipse corresponds to the covariance (two-dimensional display of standard deviation). The ellipse center is set at the unit center, \mu_i. Dots denote the 500 inputs in the current epoch. (b) At the 50th epoch. (c) At the 500th epoch.
N(0, 0.05) is added to x_3, x_4, and x_5. For a training set, we prepared 500 data, where x_1 and x_2 were uniformly taken from their domain, and a gaussian noise N(0, 0.05) was added to the output y. For evaluation, a test
Figure 3: (a) Learning process of on-line EM algorithm that slowly adapts to a dynamic environment without forgetting the past. Solid line: nMSE on the whole input domain. Dashed line: (nMSE × 10) on the current input domain at each epoch. Dash-dot line: number of units, normalized by its initial value 30 for convenience. (b) Receptive fields after learning.
set was prepared by setting x_1 and x_2 on the 41 × 41 mesh grids, and the other variables were generated according to the above prescription. The same training set was repeatedly supplied during the learning. In Figure 4a,
Figure 4: (a) Learning processes when the output is given by equation 7.1 for the input x \equiv (x_1, x_2, x_3, x_4, x_5) = (x_1, x_2, (x_1 + x_2)/2 + N(0, 0.05), (x_1 - x_2)/2 + N(0, 0.05), N(0.1, 0.05)). Learning curves for \alpha = 0.23 (solid line), \alpha = 0.1 (dashed), and \alpha = 0 (dash-dot). (b) Learning curves for \alpha = 0.023 (solid) and \alpha = 0 (dashed). Output is given by equation 7.2. (c) Receptive fields at the 50th epoch, each of which is defined by an ellipse with axes for the two principal components of the local covariance matrix \Sigma_i. The ellipse center is set at the unit center, \mu_i. This figure shows a 3D view of a hemisphere.
the solid, dashed, and dash-dot lines are the learning curves for \alpha = 0.23, \alpha = 0.1, and \alpha = 0, respectively. If no regularization method is employed (\alpha = 0), the error becomes large after the early learning stage. Comparing
the cases of \alpha = 0.23 and \alpha = 0.1, the regularization effect seems fairly robust with respect to the value of the parameter \alpha.

Let us consider another situation, where the input data distribute on a curved manifold. Each input datum in the three-dimensional space, (x_1, x_2, x_3), is restricted to the unit sphere, that is, (x_1, x_2, x_3) = (\cos\theta_1 \cos\theta_2, \cos\theta_1 \sin\theta_2, \sin\theta_1). A function defined on this unit sphere is given by

y = \cos(\theta_1) \cos(\theta_2/2) \max\!\left\{ e^{-10(2\theta_1/\pi)^2},\ e^{-50(\theta_2/\pi)^2},\ 1.25\, e^{-5((2\theta_1/\pi)^2 + (\theta_2/\pi)^2)} \right\},   (7.2)

where the range of the spherical coordinates is given by -\pi/2 \le \theta_1 \le \pi/2 and -\pi \le \theta_2 < \pi. The output y does not include a noise. The function 7.2 is similar to the function 7.1, but it is changed so as to satisfy the consistency of the spherical coordinates. We prepared 2000 data points uniformly on the sphere. The covariance matrix of the input data is not singular. However, the covariance matrix of each local unit is nearly degenerate, so that the calculation of the inverse covariance matrix and the linear regression matrix may include a noise without the help of the regularization method. In Figure 4b, the solid and dashed lines are the learning curves for \alpha = 0.023 and \alpha = 0, respectively. If no regularization method is employed (\alpha = 0), the error does not become small. Our regularization method (\alpha = 0.023) provides a fairly good result compared to the basic method without the regularization (\alpha = 0). The condition numbers defined by equation 5.1 with the regularization and without the regularization are 0.0191 ± 0.0042 and 0.0071 ± 0.0038, respectively. The local covariance matrix, \Sigma_i, of each unit represents the local input data distribution fairly well in our regularized method. It has two principal components that span the tangential plane to the unit sphere at each local unit position. The third eigenvalue is very small and is bounded below by the regularization term. In Figure 4c, receptive fields of the local units are shown.
The receptive fields appropriately cover the unit sphere.

7.4 Robot Dynamics. Finally, we examine realistic and difficult problems. The problems are taken from DELVE (Rasmussen et al., 1996), which is a new standard collection of neural network benchmark problems. The Pumadyn data sets are a family of data sets synthetically generated from a realistic simulation of the dynamics of a Puma 560 robot arm (Ghahramani, 1996; Corke, 1996). The Puma robot has six revolving joints. The state vector is defined by {θ_i, θ̇_i | i = 1, …, 6}, where θ_i and θ̇_i represent the angle and the angular velocity of the ith joint, respectively. The applied torque to the ith joint is denoted by τ_i. The task of the Pumadyn-8 family is to predict the third joint acceleration θ̈_3 given the eight-dimensional input vector x ≡ {θ_1, θ̇_1, θ_2, θ̇_2, θ_3, θ̇_3, τ_1, τ_2}. The other variables are fixed to zero. The input data are randomly generated by sampling the input variables uniformly from the range |θ|/π, |θ̇|/π, |τ| ≤ 0.6. The data sets are characterized by
On-line EM Algorithm for the Normalized Gaussian Network
nonlinearity, noise level, and size of the training data. We examined six training data sets: nm64, nm256, and nm1024, which are nonlinear and medium-noise data sets consisting of 64, 256, and 1024 training data, as well as nh64, nh256, and nh1024, which are nonlinear and high-noise data sets consisting of 64, 256, and 1024 training data. Gaussian noise with standard deviations of 0.12 and 0.4 was added to both the input and the output variables for the medium- and the high-noise data sets, respectively. The test data set size was 1024 when the training data set size was 1024 and 512 for the other cases. We compared our on-line EM (OEM) algorithm with the mixtures-of-experts (ME) methods registered in DELVE (Waterhouse, 1997) and the Moody-Darken (MD) method (Moody & Darken, 1989). Nine methods are compared: the on-line EM algorithm (OEM), the on-line EM algorithm with Neal early stopping (OEM-ese) (Neal, 1998), mixtures-of-experts trained by ensemble learning (ME-el), hierarchical mixtures-of-experts trained by ensemble learning (HME-el), mixtures-of-experts trained by committee early stopping (ME-ese), hierarchical mixtures-of-experts trained by committee early stopping (HME-ese), hierarchical mixtures-of-experts grown via committee early stopping (HME-grow), the Moody-Darken (MD) method, and the modified Moody-Darken (MD-l) method. The committee early stopping method (Waterhouse, 1997) uses a committee of (hierarchical) ME models. The training data are randomly divided into a training set and a validation set, with two-thirds and one-third of the total data, respectively. The training is done for each member of the committee using different training and validation sets and terminated at the minimum of the validation error. The prediction is given by the average of the outputs of all committee members. The number of committee members depends on the training time. A minimum of 3 members and a maximum of 50 can be created.
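The committee early-stopping procedure just described (random 2/3-1/3 splits, per-member early stopping at the minimum validation error, averaged predictions) can be sketched in a few lines. In the toy version below, a gradient-trained linear model stands in for the (hierarchical) mixture-of-experts member, and the member count is fixed at three; everything except the split, the stopping rule, and the averaging is an illustrative assumption, not Waterhouse's (1997) implementation.

```python
import numpy as np

def fit_member(Xtr, ytr, Xval, yval, max_epochs=200, lr=0.1):
    """Train one committee member, keeping the weights that minimize
    the validation error (early stopping)."""
    w = np.zeros(Xtr.shape[1])
    best_w, best_err = w.copy(), np.inf
    for _ in range(max_epochs):
        w -= lr * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
        err = np.mean((Xval @ w - yval) ** 2)
        if err < best_err:                     # minimum of the validation error
            best_err, best_w = err, w.copy()
    return best_w

def committee_predict(X, y, Xtest, n_members=3, rng=None):
    """Average the predictions of members trained on random 2/3-1/3 splits."""
    rng = rng or np.random.default_rng(0)
    preds = []
    for _ in range(n_members):
        idx = rng.permutation(len(y))
        cut = 2 * len(y) // 3                  # two-thirds training, one-third validation
        tr, va = idx[:cut], idx[cut:]
        w = fit_member(X[tr], y[tr], X[va], y[va])
        preds.append(Xtest @ w)
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)
print(committee_predict(X, y, X[:5], rng=rng))
```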
If the training time is greater than a given threshold, which is proportional to the amount of data, no more members can be created. The HME-grow method constructs a hierarchical ME from the data, learning both the structure and the parameters of the model. In the growing process, one of the experts is split into two experts and a gate. The committee early stopping method is also used to control the size of the hierarchical trees. Ensemble learning estimates the posterior probability distribution of the model parameters rather than the optimal parameter value. Gaussian prior distributions on the gate and expert parameters are assumed, which introduces hyperparameters. The optimal hyperparameters are estimated by an alternating minimization procedure that combines the standard EM algorithm for parameter estimation with reestimation of the hyperparameters. The algorithm is computationally highly demanding. The training of the OEM method is terminated at a prespecified number of epochs, which is 5% of the data size. The OEM-ese method uses a simpler version of committee early stopping proposed by Neal (1998). The training data
Table 1: Computation Time Comparison for Nine Methods.

Method       ME-el    HME-el   OEM-ese  OEM   ME-ese  HME-ese  HME-grow  MD-l  MD
Time (sec)   16,515   23,580   395      40    2129    2171     2059      16    5

Notes: The data set used was nm1024 with 1024 training data. The computer used was a Silicon Graphics Octane (CPU: R10000 175 MHz × 2).
are partitioned into four equal portions. The committee consists of four networks. Each network uses three of the four portions as a training set and the remaining portion as a validation set. The MD method uses two-stage learning. It first determines the unit centers according to the input distribution by using the k-means clustering algorithm. Subsequently, the bias coefficients b_i are determined by the least mean squared error method while fixing the unit centers. The NGnet used in the original MD model has isotropic covariance matrices and no linear coefficients, that is, W_i ≡ 0. The MD-l method, in contrast, uses the NGnet with linear coefficients. The main differences among the OEM, the ME, and the MD methods are their criterion functions. The OEM method maximizes the log-likelihood of the joint probability distribution, Σ_{t=1}^T log P(x(t), y(t) | θ). Therefore, the units are allocated according to the input-output data distribution. The ME method maximizes the log-likelihood of the conditional probability, Σ_{t=1}^T log P(y(t) | x(t), θ). Therefore, the specialization process of each expert is driven by the output performance and does not take account of the input data distribution. The first-stage learning of the MD method can be considered as the maximization of the log-likelihood of the input probability, Σ_{t=1}^T log P(x(t) | θ). The second-stage learning minimizes the mean squared error of the network output. Therefore, the optimality of the total algorithm is not guaranteed. The simulation results are summarized in Figure 5. Table 1 shows the computation time needed for training by these algorithms. The results depend very much on the training set size. The most striking differences can be seen for the nm256 data set. The best performance is achieved by (H)ME-el, followed by OEM(-ese), (H)ME-ese, HME-grow, and MD(-l), in that order.
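The two-stage MD procedure described above (k-means for the centers, then least squares for the bias coefficients with the centers fixed) can be sketched as follows. The data set, unit count, and gaussian width are illustrative choices, not the benchmark setup; only the two-stage structure with isotropic units and no linear terms follows the text.

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    """Stage 1: place unit centers according to the input distribution."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def md_fit(X, y, k, width=0.5, rng=None):
    """Stage 2: with the centers fixed, solve for the bias coefficients b_i
    by least squares on normalized gaussian activations (isotropic units,
    no linear coefficients W_i)."""
    centers = kmeans(X, k, rng=rng)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2 * width ** 2))
    G /= G.sum(axis=1, keepdims=True)          # normalized gaussian activations
    b, *_ = np.linalg.lstsq(G, y, rcond=None)
    return centers, b, width

def md_predict(X, centers, b, width):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2 * width ** 2))
    G /= G.sum(axis=1, keepdims=True)
    return G @ b

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(X[:, 0])
centers, b, w = md_fit(X, y, k=12, rng=rng)
print(np.mean((md_predict(X, centers, b, w) - y) ** 2))
```

Because the centers are chosen from the input distribution alone and b is fit only afterward, the two stages optimize two different criteria, which is exactly why the text notes that the optimality of the total algorithm is not guaranteed.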
The performances of the ME family and the OEM family are almost the same for the nm64 and nh64 data sets, while those of the MD family are slightly worse. For the nm1024 and nh1024 data sets, (H)ME-el and OEM(-ese) show comparable performances. The performances of (H)ME-ese and MD-l are slightly worse. HME-grow and MD each show poor performance. Some comments can be made regarding these results. The number of data is very small compared to the dimensionality of the input space; the noisy data are sparsely distributed in a high-dimensional space. Therefore, the
Figure 5: Learning results for the Pumadyn data sets: nm64, nm256, nm1024, nh64, nh256, and nh1024. Bars represent the expected nMSE, which is calculated as an average of the nMSE over different test data sets, for the nine methods shown at the bottom of the figure.
reliable estimation of the optimal parameters is very difficult. This situation favors ensemble learning. The disadvantage of ensemble learning is that it requires a huge amount of time, as shown in Table 1. The OEM method achieves a learning ability comparable to the (H)ME-el method within a significantly smaller amount of computation time for the 1024 data sets. The original MD method showed poor performance despite being very fast. The modified MD method with linear coefficients showed better performance than the original MD method. However, its performance is worse than that of the OEM methods. The performance of the MD methods depended on the number of units. In Figure 5, the results of the best MD models are
Table 2: Learning Results for Robot Trajectory Data Sets.

Method      nMSE (%)   Time (sec)
(H)ME-el    –          –
OEM         1.8        1085
ME-ese      2.3        84,626
HME-ese     2.8        63,610
HME-grow    22.2       36,780
MD-l        3.9        220
MD          4.4        130

Notes: Computation time and nMSE for six methods are shown. Ensemble learning, (H)ME-el, did not finish within a week.
shown. For example, the best numbers of units for MD and MD-l on the nm1024 data set are 140 and 33, respectively. The OEM method automatically selects the number of units. The final numbers of units for the nm1024, nm256, and nm64 data sets are 6, 5, and 1, respectively, although the initial number of units is set to 2% of the data size. We also considered a more realistic situation. For real-time robot control, the training data are generated along the robot trajectory. Therefore, we generated the training data by freely operating the Puma robot from the upright position. Due to inertia, the robot exhibited chaotic oscillations. The task was to predict the third joint acceleration θ̈_3 given the 10-dimensional input vector x ≡ {θ_i, θ̇_i | i = 1, …, 5}. Since θ_6 and θ̇_6 were almost zero throughout the trajectory, we omitted these variables, and the applied torque was zero. The size of the training data was 7000 (the ME family could not handle more than 7000 data on our computer). The 7000 test data were generated in the neighborhood of the trajectory. The results are summarized in Table 2. The OEM method achieves a better performance than the ME family within a significantly smaller amount of time. The performance of the MD method is worse than that of the (H)ME-ese methods. The computation times of the OEM, the MD, and the MD-l methods scale as (N + 1)²M, M³, and ((N + 1)M)³, respectively, where M is the number of units and N is the input dimension. The best performance of MD (MD-l) was obtained with 420 (50) units, while the final number of units in OEM became 230 after four epochs. When the number of units in MD (MD-l) was larger than 1000 (100), the MD(-l) method became slower than the OEM method and the performance was degraded. In summary, we confirmed that the on-line EM algorithm achieves good performance in a short amount of time even when the number of data becomes large.

8 Conclusion
In this article, we proposed a new on-line EM algorithm for the NGnet. We showed that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we showed that the on-line EM algorithm can be considered a stochastic approximation method for finding the maximum likelihood estimator.
A new regularization method was proposed in order to deal with a singular input distribution. In order to manage dynamic environments, unit manipulation mechanisms such as unit production, unit deletion, and unit division were also introduced based on the probabilistic interpretation. Experimental results showed that our approach is suitable for dynamic environments where the input-output distribution of the data changes over time. We also applied our on-line EM algorithm to robot dynamics problems. Our algorithm achieved a learning ability comparable to the ME methods for a small number of data and achieved a better performance than the ME methods for a large number of data within a significantly smaller amount of computation time. Our method can also be applied to reinforcement learning (Sato & Ishii, 1999) and the reconstruction of nonlinear dynamics (Ishii & Sato, 1998).
Acknowledgments
We thank the referees and the editor for their insightful comments, which improved the quality of this article.
References

Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine, 3, 24–32.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–22.
Ghahramani, Z. (1996). The Pumadyn dataset. Available online at: http://www.cs.toronto.edu/~delve/data/pumadyn/pumadyn.ps.gz.
Ishii, S., & Sato, M. (1998). On-line EM algorithm and reconstruction of chaotic dynamics. In T. Constantinides, S.-Y. Kung, M. Niranjan, & E. Wilson (Eds.), Neural networks for signal processing VIII (pp. 360–369). New York: IEEE.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Schürmann, J. (1996). Pattern classification: A unified view of statistical and neural approaches. New York: Wiley.
Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications. New York: Springer-Verlag.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Generalization in neural networks and machine learning (pp. 97–129). Berlin: Springer-Verlag.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer Academic Press.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures (Tech. Rep. No. CS-91-126). Pittsburgh: Carnegie Mellon University.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213–225.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1496.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available online at: ftp://ftp.cs.utoronto.ca/pub/neuron/delve/doc/manual.ps.gz.
Sato, M., & Ishii, S. (1999). Reinforcement learning based on on-line EM algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 1052–1058). Cambridge, MA: MIT Press.
Schaal, S., & Atkeson, C. G. (1998). Constructive incremental learning from only local information. Neural Computation, 10, 2047–2084.
Waterhouse, S. R. (1997). Classification and regression using mixtures of experts. Unpublished doctoral dissertation, Cambridge University.
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixtures of experts. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 633–640). Cambridge, MA: MIT Press.

Received March 23, 1998; accepted February 17, 1999.
LETTER
Communicated by Jean-Pierre Nadal
A General Probability Estimation Approach for Neural Computation

Maxim Khaikine
Klaus Holthausen
Friedrich-Schiller-Universität Jena, Ernst-Haeckel-Haus, D-07745 Jena, Germany

We describe an analytical framework for neural systems that adapt their internal structure on the basis of subjective probabilities constructed by computation over randomly received input signals. A principled approach is provided with the key property that it defines a probability density model that allows studying the convergence of the adaptation process. In particular, the derived algorithm can be applied to approximation problems such as the estimation of probability densities or the recognition of regression functions. These approximation algorithms can be easily extended to higher-dimensional cases. Certain neural network models can be derived from our approach (e.g., topological feature maps and associative networks).

1 Introduction
The concepts of probability density estimation are relevant to many aspects of neural computation (Bishop, 1995). There are two general methods to deal with this problem in the framework of statistical pattern recognition: nonparametric and parametric probability density estimation. Both methods are presented and discussed in detail in Bishop (1995). Nonparametric estimation (e.g., simple histogram methods) does not assume a particular functional form for the density function. A disadvantage of this method is that it requires large amounts of processing because all data points used in the training must be retained. Moreover, it is not known a priori how many data have to be collected in order to provide a good representation. The parametric approach, on the contrary, assumes a specific form for the density function, which might be very different from the actual density. The chosen functional form contains a number of adjustable parameters that are optimized by fitting the model to the data set. For the purpose of neural network theory, an important advantage of this parametric approach is that the parameters can be adjusted locally. As local learning rules of this type, we will consider algorithms in which a hypothesis about the real distribution is changed each time a stimulus is received, without storing and processing of previous stimuli. These local recognition algorithms give quite

Neural Computation 12, 433–450 (2000)   © 2000 Massachusetts Institute of Technology
good results if the real probability density is not too different from some element of the family of functions chosen for the approximation. Generally, the quality of the approximation depends on the smoothness of the probability density function and the inherent degrees of freedom of the applied functional form. In this article, we develop a general approach to find a comprehensive family of recognition algorithms in which the class of basis functions is not limited and includes all possible smooth functions. The focus of our work is not a particular application in neural network theory, but a principled approach based on sound theoretical concepts. It allows an ideal estimation of any arbitrary probability distribution. We will show that on the basis of these learning rules, one can solve more complicated problems (e.g., an approach for the estimation of regression functions for stimulus-response pairs with arbitrary distributions). Then we discuss the general framework of our approximation theory and certain applications in neural network theory. Our findings support recent work on approximation by feedforward neural networks (Scarselli & Tsoi, 1998) since they allow the focus of investigation to be turned toward the approximation capabilities of single neurons. Furthermore, earlier general approaches that combined neural computation and approximation theory (Girosi & Poggio, 1990; Kurkova, 1992) could be revised in order to gain insight into novel architecture and optimization principles in neural computation.

2 Local Algorithms for the Approximation of Probability Densities
We consider a discrete process in which a stochastic variable x adopts one of N possible states {x_1, x_2, …, x_N}. This process can be described by the corresponding probabilities P = (p_1, p_2, …, p_N), which we will call the objective probability distribution. A priori we have no information about the objective distribution, but we can make a hypothesis Q = (q_1, q_2, …, q_N) about it. Every stimulus x_k gives rise to a change of the hypothesis Q. The corresponding learning rule can be represented by a matrix L, whose elements depend on the actual vector Q. Thus, the adaptation of the hypothesis Q^t after a stimulus x_k was received at time t can be written with the learning rule L(Q) and learning rate ε:

q_n^{t+1} = q_n^t + ε L_{n,k}(Q^t)   ∀n.
We can calculate the expected (i.e., average) value of q_n^{t+1} with respect to the objective probabilities p_k of the stimuli x_k:

⟨q_n^{t+1}⟩ = q_n^t + ε Σ_{k=1}^{N} L_{n,k}(Q^t) p_k.   (2.1)
If all elements of the matrix L_{n,k}(Q) are continuous functions and the coefficient ε is sufficiently small, we can pass to a description that is continuous in time. Thus, the statistical dynamics of the subjective probability vector Q is given by a set of coupled differential equations:

dq_n(t)/dt = Σ_{k=1}^{N} L_{n,k}(Q) p_k.   (2.2)
Which properties of the learning matrix L(Q) can be derived? First, let us note that the sum of all time gradients of the probabilities q_k in our hypothesis must always be null. This means:

Σ_{n=1}^{N} dq_n(t)/dt = Σ_{n=1}^{N} Σ_{k=1}^{N} L_{n,k}(Q) p_k = 0   ∀P.

Hence, the first property of L_{n,k}(Q) follows:

Σ_{n=1}^{N} L_{n,k}(Q) = 0   ∀Q, k.   (2.3)
Second, it is a plausible assumption that a correct hypothesis Q = P presents at least a fixed point in the phase space and is not statistically changed with time (about convergence toward this point, see below). Together with equation 2.2 we obtain the second property of L_{n,k}(Q):

Σ_{k=1}^{N} L_{n,k}(Q) q_k = 0   ∀n, Q.   (2.4)
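Properties 2.3 and 2.4 are easy to check numerically for any candidate learning matrix. A minimal sketch (the matrix used for the demonstration is the rule derived in section 2.1 below, with C = 1; the checker itself only encodes the two conditions):

```python
import numpy as np

def satisfies_learning_properties(L, Q, tol=1e-12):
    """Check property 2.3 (every column of L sums to zero) and
    property 2.4 (L(Q) Q = 0, so Q = P is a fixed point)."""
    cols_zero = np.allclose(L.sum(axis=0), 0.0, atol=tol)      # eq. 2.3
    fixed_point = np.allclose(L @ Q, 0.0, atol=tol)            # eq. 2.4
    return cols_zero and fixed_point

# Demonstration with the rule of section 2.1 (C = 1):
# L_{n,k} = 1 - q_n on the diagonal and -q_n off the diagonal.
Q = np.array([0.5, 0.3, 0.2])
L = np.eye(3) - np.outer(Q, np.ones(3))
print(satisfies_learning_properties(L, Q))   # → True
```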
Equations 2.3 and 2.4 are obviously not sufficient to define learning rules. Nevertheless, we give an example of a possible learning matrix after simplifying the structure of L_{n,k}(Q).

2.1 Pfaffelhuber's Learning Rule. Let us assume that all functions L_{n,k}(Q) in the learning matrix are divided into two principally different classes of very simple form:

L_{n,k}(Q) = f(q_n) if k ≠ n, and g(q_n) otherwise.
Equations 2.3 and 2.4 can then be rewritten as follows:

f(q_n) Σ_{k≠n} q_k + g(q_n) q_n = 0   ∀n

and

Σ_{n≠k} f(q_n) + g(q_k) = 0
This system has only one solution if N > 2 and innitely many otherwise. The general solution is f ( qn ) = ¡qn C and g(qn ) = C(1 ¡ qn ), where C is an arbitrary constant. The sign of C determines whether the dynamics will be stable. The learning matrix hence is: 0 1 1 ¡ q1 ¡q1 ... ¡q1 B ¡q2 1 ¡ q2 . . . ¡q2 C C E) = C B L( Q B .. C. .. .. @ . A . . ¡qN ¡qN . . . 1 ¡ qN This learning rule was rst suggested by Pfaffelhuber (1972), who also investigated its convergence. With this matrix, the statistical dynamics for this learning process follows from equation 2.2: dqn (t) = C( pn ¡ qn ). dt Thus, it has been shown that it is possible to formulate certain restrictions for our approximation theory from which a prominent learning rule can be derived. Here we investigate a broader class of approximation algorithms that could be of interest for both statistical pattern recognition and neural network theory. 3 General Mathematical Framework
The basic properties of the learning matrix (see equations 2.3 and 2.4) guarantee that the probabilities q_k that constitute our hypothesis are normalized and that any correctly recognized distribution Q = P represents a fixed point of the dynamics 2.2. But the convergence toward this point was not investigated. In order to solve this problem, we will apply a framework that is more suitable than the notation of matrix functions.

First, we note that both vectors, the hypothesis Q(t) and the real distribution P, are located on an (N − 1)-dimensional hyperplane, a manifold Q in the N-dimensional vector space. The manifold Q is defined by the normalization condition:

Σ_{n=1}^{N} q_n = 1   ∀Q ∈ Q.
In the context of probability estimation, we will consider only learning processes that can be described by their dynamics on the manifold Q. These
algorithms will be regarded as models of the real learning processes. Our aim is to define a learning rule (or, better, a family of rules) such that every starting hypothesis converges statistically to the objective probability distribution P during the recognition process.

The learning matrix L(Q) for every vector Q defines a linear operator on the space R^N (cf. equation 2.2). For this linear operator, one can find a set of eigenvectors {v_n | n = 1, …, N} with corresponding eigenvalues {λ_n | n = 1, …, N} so that

L(Q) v_n = λ_n v_n.

It is clear that if the eigenvectors and eigenvalues are given in every point of the manifold Q, they automatically define the learning rule (i.e., the learning matrix L(Q)) in Q. Furthermore, this method is very flexible and suitable for studying the global properties of the learning process. As a simplification, we consider here only the case where all λ_n are real. This means that in every point Q of the manifold Q, the operator L(Q) has only one-dimensional invariant subspaces.

Proposition 2.4 shows that the vector Q is an eigenvector of the operator L(Q) with corresponding eigenvalue λ_Q = 0:

L(Q) Q = 0 · Q = 0.

Furthermore, all other eigenvalues λ_n, n ∈ {1, …, N − 1}, of the learning matrix L(Q) must be different from zero, because otherwise (for example, λ_m = 0) the current hypothesis Q would be stable for every real probability distribution P̃ given as a normalized linear combination,

P̃ = (a Q + b v_m) / (a + b Σ_k v_{m,k}).

Here, the parameters a and b have to be chosen in order to satisfy the condition 0 ≤ P̃_k ≤ 1. The eigenvectors {Q, v_n | n = 1, …, N − 1} form a complete vector system, a basis for the vector space, in every point of Q, so that every vector x can be expanded as a linear combination:

x = c_Q Q + Σ_{n=1}^{N−1} c_n v_n.

The application of the operator L(Q) yields:

L(Q) x = c_Q L(Q) Q + Σ_{n=1}^{N−1} c_n L(Q) v_n = Σ_{n=1}^{N−1} λ_n c_n v_n.
The learning matrix L(Q) thus projects every vector onto the space Σ_n x_n = 0. Therefore all eigenvectors v_n associated with nonzero eigenvalues must belong to this space:

Σ_{k=1}^{N} v_{n,k} = 0   ∀n ∈ {1, …, N − 1}.   (3.1)
Theorem 1 introduces a broad class of learning rules for which the process 2.2 converges and allows the study of the global properties of the corresponding matrix L(Q):

Theorem 1. The hypothesis Q(t) always converges toward the objective probability distribution P during the learning process (see equation 2.2) if all λ_n(Q) are real and positive and all eigenvectors of the learning matrix L(Q) are orthogonal in every point Q ∈ Q:

λ_n(Q) > 0,   v_n · v_m = 0   ∀n ≠ m; n, m = 1, …, N − 1.
Remark. The eigenvalues corresponding to Pfaffelhuber's (1972) learning rule (section 2.1) for the case N = 3 are given by the roots of the characteristic polynomial χ(λ) = −λ(1 − λ)². Two orthogonal eigenvectors corresponding to the eigenvalue λ = +1 are given by v_1 = (0, +1, −1) and v_2 = (−2, +1, +1).
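Both the remark and the convergence claim can be checked numerically for Pfaffelhuber's rule. A small sketch (C = 1; the learning rate, step count, and example distributions are arbitrary choices, not taken from the text):

```python
import numpy as np

def pfaffelhuber_matrix(Q, C=1.0):
    """L(Q) with diagonal C(1 - q_n) and off-diagonal entries -C q_n."""
    N = len(Q)
    return C * (np.eye(N) - np.outer(Q, np.ones(N)))

# Eigenstructure stated in the remark (N = 3): v1 and v2 are orthogonal
# eigenvectors with eigenvalue +1, and Q itself has eigenvalue 0.
Q = np.array([0.2, 0.3, 0.5])
L = pfaffelhuber_matrix(Q)
v1 = np.array([0.0, 1.0, -1.0])
v2 = np.array([-2.0, 1.0, 1.0])
print(np.allclose(L @ v1, v1), np.allclose(L @ v2, v2),
      np.dot(v1, v2), np.allclose(L @ Q, 0.0))

# Stochastic simulation of the discrete update rule from section 2: on
# receiving stimulus x_k, q_n changes by eps * L_{n,k}(Q). The hypothesis
# drifts toward the objective distribution P, as theorem 1 predicts.
rng = np.random.default_rng(0)
P = np.array([0.6, 0.3, 0.1])
Q = np.full(3, 1.0 / 3.0)          # uniform initial hypothesis
eps = 0.01
for _ in range(20000):
    k = rng.choice(3, p=P)         # stimulus x_k drawn from P
    Q = Q + eps * pfaffelhuber_matrix(Q)[:, k]
print(np.abs(Q - P).max())
```

Note that each stochastic step adds ε(e_k − Q), so the normalization Σ_n q_n = 1 is preserved exactly; only the residual fluctuation of order √ε remains around P.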
Proof. Let us introduce the distance vector D(t) := P − Q(t). We choose the following linear combination for D(t):

D(t) = c_Q Q(t) + Σ_{n=1}^{N−1} c_n v_n.   (3.2)

Because both vectors Q and P belong to the manifold Q, the coefficient c_Q in the first part of equation 3.2 must always be null so that the normalization condition Σ_k D_k = 0 is always fulfilled. In order to describe the convergence, we consider the dynamics of the distance vector:

d‖D(t)‖²/dt = −2 D(t) · dQ(t)/dt = −2 D(t) L(Q) P.

Using the property L(Q) Q = 0, we obtain:

d‖D(t)‖²/dt = −2 D(t) L(Q) D(t) = −2 (Σ_{n=1}^{N−1} c_n v_n) · (Σ_{m=1}^{N−1} λ_m c_m v_m) = −2 Σ_{n=1}^{N−1} λ_n c_n² |v_n|².

If all λ_n(Q) are positive for all points Q, we can find an infimum c so that λ_n(Q) ≥ c > 0 ∀Q, n. Thus we obtain a simple estimate for the dynamics of D(t):

d‖D(t)‖²/dt ≤ −2c Σ_{n=1}^{N−1} c_n² |v_n|² = −2c ‖D(t)‖²

and

‖D(t)‖² ≤ ‖D(0)‖² exp(−2ct).

Because the parameter c is positive, the quantity ‖D(t)‖² converges toward null.

To emphasize the relationship of the results of this article with the learning matrix from the previous section, we indicate briefly how a learning matrix can be derived from the fields of the eigenvectors {v_n(Q) | n = 1, …, N − 1} and the eigenvalues {λ_n(Q) | n = 1, …, N − 1}. This follows easily after adjoint vectors {Q*, v*_n | n = 1, …, N − 1} are introduced so that:

Q Q* = 1,   v_n Q* = v*_n Q = 0,   v_n v*_k = 1 if n = k, and 0 otherwise.

In the case of orthogonal eigenvectors v_n, all adjoint vectors can be easily defined:

v*_n = (v_n − Q* (v_n Q)) / |v_n|²,   Q* = (1, 1, …, 1)ᵀ.

L(Q) can thus be written as a product of three matrices:

L(Q) = V(Q) Λ(Q) V*ᵀ(Q),
where V(Q) is the matrix built from the components of the eigenvectors, including Q; V*ᵀ(Q) is the transpose of the matrix composed of the adjoint vectors; and Λ(Q) is a diagonal matrix whose elements are the eigenvalues, with the corresponding eigenvalue λ_Q = 0.

4 Approximation of Continuous Probability Distributions
In this section we generalize our results for the case of one continuous variable. First, we introduce and clarify the corresponding notions for our
description of learning processes. This will allow us to describe the convergence toward a continuous probability density. Finally, after a simplification, we arrive at possible learning rules for the approximation of probability densities.

Without loss of generality, we assume a discrete stochastic process in which a continuous variable x is distributed on the interval I = [0, 2π] with a density P(x). At every time t, there is a hypothesis Q_t(x) about the objective distribution P(x). Let us suppose that both functions are smooth on I. We note that in the space of all smooth functions, the hypothesis Q(x) and the real distribution P(x) always lie on a manifold Q that is defined by the normalization condition:

∫₀^{2π} Q_t(x) dx = 1.
The adaptation process is triggered by random stimuli y ∈ I. We start with an arbitrary initial hypothesis Q_t(x) satisfying the normalization condition. In the tth step, a stimulus y_t ∈ I is chosen according to the probability distribution P(x). Thus, the learning process can be described as a discrete adaptation of the hypothesis:

Q_{t+1}(x) = Q_t(x) + ε K(x, y_t).
Here, the function K(x, y) corresponds to the learning matrix from the previous section. Because we want the hypothesis Q always to belong to Q, the kernel function K(x, y) has to be smooth. In general, the learning rule K(x, y) depends on the actual hypothesis Q_t(x). To emphasize this, we write K_Q(x, y). After we pass over to the continuous description, with ε → 0, we obtain the dynamics of Q(x):

dQ(x, t)/dt = ∫₀^{2π} P(y) K_Q(x, y) dy.   (4.1)
In order to satisfy the normalization condition at each instant of the adaptation process, the following equation must always be fulfilled:

∫₀^{2π} K_Q(x, y) dx = 0   ∀y and ∀Q(x).   (4.2)
Moreover, the correct hypothesis Q(x, t) = P(x) must be a fixed point of the dynamics of our adaptation process. Thus, the second condition for the kernel K_Q(x, y) is given by:

∫₀^{2π} K_Q(x, y) Q(y) dy = 0   ∀x and ∀Q(x).   (4.3)
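Conditions 4.2 and 4.3 can be verified numerically for a concrete rule. The kernel below is a grid discretization of a Pfaffelhuber-like kernel K_Q(x, y) = δ(x − y) − Q(x); it is not smooth, so it lies outside the class analyzed here, but it satisfies both conditions on the grid and, under the dynamics 4.1, drives the hypothesis toward P(x). A sketch under these assumptions:

```python
import numpy as np

M = 64                                   # grid points on I = [0, 2*pi)
dx = 2 * np.pi / M
x = np.arange(M) * dx

def kernel(Q):
    """Discretized K_Q(x, y) = delta(x - y) - Q(x) (the delta becomes 1/dx)."""
    return np.eye(M) / dx - Q[:, None] * np.ones((1, M))

P = (1 + 0.5 * np.cos(x)) / (2 * np.pi)      # objective density, integrates to 1
Q = np.full(M, 1 / (2 * np.pi))              # uniform initial hypothesis
K = kernel(Q)

# Condition 4.2: the integral of K over x vanishes for every y.
print(np.allclose(K.sum(axis=0) * dx, 0.0))
# Condition 4.3: the integral of K(x, y) Q(y) over y vanishes for every x.
print(np.allclose(K @ Q * dx, 0.0))

# Euler integration of the dynamics 4.1: dQ/dt = integral of P(y) K_Q(x, y) dy,
# which for this kernel reduces to dQ/dt = P - Q.
dt = 0.1
for _ in range(200):
    Q = Q + dt * (kernel(Q) @ P) * dx
print(np.abs(Q - P).max())
```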
Equations 4.2 and 4.3 can be regarded as a continuous generalization of conditions 2.3 and 2.4 applied in the case of discrete approximations. Thus,
the goal is to find an appropriate integral operator L(Q) with a smooth kernel K_Q(x, y) that is defined on the manifold Q, fulfills conditions 4.2 and 4.3, and describes the dynamics of the hypothesis Q(x, t),

dQ(x, t)/dt = L(Q) P,

in a way that leads to convergence toward the objective distribution P(x). We apply the standard definition of convergence: a hypothesis Q(x, t) converges toward the distribution P(x) with time t if for every γ > 0 one can find T so that for every time t̂ ≥ T: |Q(x, t̂) − P(x)| ≤ γ.

By analogy with the previous section, we introduce the set of eigenvectors {v_{Q,n} | n = 0, …, ∞} (in this case, they are real functions on I) and corresponding eigenvalues {λ_{Q,n} | n = 0, …, ∞} for the linear operator L(Q) in every point of the manifold Q, and note some general properties of them. Equation 4.3 shows that every point of Q is an eigenvector of the operator L(Q) with corresponding eigenvalue λ_{Q,0} = 0. With the same arguments as in section 3, one can see that all other eigenvalues {λ_{Q,n} | n = 1, …, ∞} must be different from zero to avoid the existence of an incorrect stable hypothesis. Furthermore, all eigenvectors v_{Q,n}(x) associated with nonzero eigenvalues have to belong to the class

∫₀^{2π} v_{Q,n}(x) dx = 0   (4.4)
and build a basis of this space. Following our constructions for the discrete case, we introduce a distance \Delta(x, t) between the real distribution and the hypothesis, \Delta(x, t) = P(x) - Q(x, t). The dynamics of the distance function is given by:

\frac{d\Delta(x, t)}{dt} = -\frac{dQ(x, t)}{dt} = -L(Q)\, P = -L(Q)\, \Delta = -\sum_{n=1}^{\infty} \lambda_{Q,n}\, c_n\, v_{Q,n}(x). \qquad (4.5)

As a difference between two elements of \mathcal{Q}, \Delta(x, t) satisfies the condition

\int_0^{2\pi} \Delta(x, t)\, dx = 0 \quad \forall t

and can be expanded using the basis \{v_{Q,n}\}:

\Delta(x, t) = \sum_{n=1}^{\infty} c_n\, v_{Q,n}(x).
442
Maxim Khaikine and Klaus Holthausen
And finally, for the dynamics of \|\Delta(x, t)\|^2:

\frac{d\|\Delta(x, t)\|^2}{dt} = -2 \int_0^{2\pi} \Delta(x, t)\, L(Q)\Delta\, dx = -2 \int_0^{2\pi} \left( \sum_{n=1}^{\infty} c_n\, v_{Q,n}(x) \right) \left( \sum_{k=1}^{\infty} \lambda_{Q,k}\, c_k\, v_{Q,k}(x) \right) dx.
Up to this relation, our constructions for the continuous distribution approximation were the same as for the discrete case. But here the implementation of an orthogonal system of eigenvectors with corresponding positive eigenvalues in every point of \mathcal{Q} fails, and with it the proof of convergence. Indeed, if the kernel function K_Q(x, y) of the integral operator L(Q) is smooth, the operator is compact. The infinite set of nonzero eigenvalues of such an operator has to converge toward null (\lim_{n\to\infty} \lambda_n = 0), and thus no positive infimum can be found. Thus, according to theorem 1, we cannot find an estimate of the convergence speed. In order to construct a class of possible learning rules, we introduce some simplifications. First, we make an assumption about the form of the eigenvectors of the learning operator: in every point Q of the manifold \mathcal{Q}, we want to construct an operator with uniform eigenvectors given as trigonometric polynomials in complex form:

v_n(x) = \exp(inx), \qquad n = -\infty, \ldots, \infty, \; n \neq 0.

All of these functions are orthogonal, independent, satisfy condition 4.4, and represent, together with the trivial eigenvector Q(x), a basis of the functional space. The second supposition is that all corresponding eigenvalues are positive, do not depend on the actual hypothesis, and satisfy the condition

\sum_{n=-\infty}^{\infty} |\lambda_n| < \infty.

Because all vectors \{v_n(x)\} are fixed, the dynamics of \Delta(x) in equation 4.5 can be described only by the change of the coefficients c_n:

\frac{d\Delta(x, t)}{dt} = \sum_{n \neq 0} \frac{dc_n(t)}{dt}\, v_n(x) = -\sum_{n \neq 0} \lambda_n\, c_n(t)\, v_n(x)

and

\frac{dc_n(t)}{dt} = -\lambda_n\, c_n(t). \qquad (4.6)

After these simplifications we can prove the following convergence theorem.
Theorem 2. If the dynamics of all Fourier coefficients of a smooth function g(x, t), x \in [0, 2\pi], is given by

\frac{dc_n(t)}{dt} = -\lambda_n\, c_n(t),

where all \lambda_n are positive, then the function g(x, t) converges toward null. Precisely: \forall \gamma > 0 \; \exists T such that \forall t \geq T:

|g(x, t)| \leq \gamma \quad \forall x \in [0, 2\pi].

Proof. The absolute value of the function g(x, t) can be estimated as follows:

|g(x, t)| = \left| \sum_{n \neq 0} c_n(t) \exp(inx) \right| \leq \sum_{n} |c_n(t)|.

Moreover, a well-known fact from Fourier analysis is that for every smooth function on I, the relation

\sum_{n=-\infty}^{\infty} |c_n| < \infty

holds. Therefore we can find a positive number L so that

\sum_{n=-\infty}^{-L} |c_n(0)| + \sum_{n=L}^{\infty} |c_n(0)| \leq \frac{\gamma}{2}.

Using equation 4.6 and \lambda_n > 0, we obtain |c_n(t)| = |c_n(0)| \exp(-\lambda_n t) \leq |c_n(0)| and

\sum_{n=-\infty}^{-L} |c_n(t)| + \sum_{n=L}^{\infty} |c_n(t)| \leq \frac{\gamma}{2}.

We introduce a partial sum of the coefficients:

S_L(t) = \sum_{n=-L}^{L} |c_n(t)|.
From the set of positive numbers \{\lambda_n,\ n = -L, \ldots, L\} we can always find a minimum c > 0, so that all \lambda_n \geq c. Thus, the dynamics of S_L(t) can be easily estimated:

\frac{dS_L(t)}{dt} = -\sum_{n=-L}^{L} \lambda_n |c_n(t)| \leq -c \sum_{n=-L}^{L} |c_n(t)| = -c\, S(t).

This differential equation yields S(t) \leq S(0)\exp(-c t), and for all t \geq \frac{1}{c} \ln\!\left(\frac{2S(0)}{\gamma}\right) the relation S(t) \leq \frac{\gamma}{2} holds. Finally:

|g(x, t)| \leq \sum_{n=-\infty}^{\infty} |c_n(t)| = \sum_{n=-\infty}^{-L} |c_n(t)| + \sum_{n=L}^{\infty} |c_n(t)| + S(t) \leq \frac{\gamma}{2} + \frac{\gamma}{2} \leq \gamma.
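A minimal numerical illustration of theorem 2 (the truncation N and the choice \lambda_n = 1/n^2 are our hypothetical parameters, not the authors'): the supremum norm of the series decays toward zero, yet—because \lambda_n \to 0, as for any compact operator—arbitrarily slow modes survive, so no uniform convergence rate exists.

```python
import numpy as np

# Truncated Fourier model of g(x, t) with positive eigenvalues lambda_n -> 0
# (the compact-operator situation: decay holds, but with no uniform rate).
N = 64
n = np.arange(1, N + 1)
lam = 1.0 / n**2          # hypothetical positive eigenvalues, infimum 0
c0 = 1.0 / n**2           # summable initial Fourier coefficients
x = np.linspace(0.0, 2.0 * np.pi, 512)

def sup_norm(t):
    # c_{-n} = c_n real, so the series collapses to a cosine sum.
    c = c0 * np.exp(-lam * t)
    g = 2.0 * (np.cos(np.outer(x, n)) @ c)
    return np.abs(g).max()

sups = [sup_norm(t) for t in (0.0, 10.0, 100.0, 1000.0)]
print(sups)
```

Every mode decays, so the supremum shrinks monotonically; but at any fixed t the high-n modes with \lambda_n \approx 0 have hardly moved, which is why theorem 1's rate estimate is unavailable here.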
In order to formulate the concrete form of the kernel of the operator L(Q) with the given eigenvectors and eigenvalues, we introduce an adjoint basis \{Q^*(x), v_n^*(x)\} that fulfills the following conditions:

Q(x) * Q^*(x) = 1, \qquad v_n(x) * Q^*(x) = v_n^*(x) * Q(x) = 0,

v_n^*(x) * v_k(x) = \begin{cases} 1 & \text{if } n = k \\ 0 & \text{otherwise.} \end{cases} \qquad (4.7)

The multiplication here is defined via the integral:

f * g = \int_0^{2\pi} f(x)\, g(x)\, dx.
Then for any arbitrary smooth function g(x) = \sum_{n=-\infty}^{\infty} c_n v_n(x), the application of the operator L(Q) can be written as:

L(Q)\, g(x) = \sum_{n=-\infty}^{\infty} \lambda_n c_n v_n(x) = \int_0^{2\pi} \left( \sum_{n=-\infty}^{\infty} \lambda_n\, v_n(x)\, v_n^*(y) \right) g(y)\, dy.
The functions satisfying conditions 4.7 are defined as follows:

Q^*(x) = 1, \qquad v_n^*(x) = \frac{1}{2\pi} \left( \exp(-inx) - \int_0^{2\pi} Q(z) \exp(-inz)\, dz \right), \quad n = -\infty, \ldots, \infty, \; n \neq 0.
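A quick check (our verification, not in the original) that these functions satisfy conditions 4.7, using the normalization \int_0^{2\pi} Q(x)\,dx = 1 and \int_0^{2\pi} \exp(ikx)\,dx = 0 for k \neq 0:

```latex
Q * Q^{*} = \int_0^{2\pi} Q(x)\,dx = 1, \qquad
v_n * Q^{*} = \int_0^{2\pi} e^{inx}\,dx = 0 \quad (n \neq 0),
\\[4pt]
v_n^{*} * Q = \frac{1}{2\pi}\left(\int_0^{2\pi} e^{-inx}Q(x)\,dx
  - \int_0^{2\pi} Q(z)e^{-inz}\,dz \int_0^{2\pi} Q(x)\,dx\right) = 0,
\\[4pt]
v_n^{*} * v_k = \frac{1}{2\pi}\left(\int_0^{2\pi} e^{i(k-n)x}\,dx
  - \int_0^{2\pi} Q(z)e^{-inz}\,dz \int_0^{2\pi} e^{ikx}\,dx\right)
  = \delta_{nk} \quad (k \neq 0).
```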
The integral kernel is a smooth function if \sum_{n=-\infty}^{\infty} |\lambda_n| < \infty and is given by:

K_Q(x, y) = \sum_{n=-\infty,\, n \neq 0}^{\infty} \lambda_n \frac{\exp(in(x-y))}{2\pi} - \int_0^{2\pi} \left( \sum_{n=-\infty,\, n \neq 0}^{\infty} \lambda_n \frac{\exp(in(x-z))}{2\pi} \right) Q(z)\, dz = f(x-y) - \int_0^{2\pi} Q(z)\, f(x-z)\, dz \qquad (4.8)
with the function f(x) := \sum_{n \neq 0} \lambda_n \frac{\exp(inx)}{2\pi}.

In this section, the detailed analysis of the dynamics of the approximation has been developed and the convergence has been proved. In order to apply this framework to discrete operating systems (e.g., computer simulations of approximation processes), we emphasize that the basic equations for the presented learning algorithm are 4.1 and 4.8. From these, a local algorithm can be derived that is based solely on adaptations of an internal hypothesis triggered by external stimuli. The result of a computer simulation, in which the objective stimulus distribution P(x) is compared with the evolved hypothesis Q(x) after 10,000 iterations, is presented in Figure 1. The start hypothesis was a uniform distribution, and the generating function for the integral kernel (see equation 4.8) was a gaussian function f(x) = \exp(-x^2/10).

5 Recognition of Regression Functions
One of the most important problems in neural processing and approximation theory is the recognition of statistical dependences between variables. More precisely, if we have two random variables x and w, which are correlated and distributed on the interval I = [0, 2\pi], the conditional average of w for each x defines a function W(x), the so-called regression function (RF),

W(x) = \langle w \mid x \rangle = \int_0^{2\pi} P_v(w \mid x)\, w\, dw,

where P_v(w \mid x) is the conditional probability density of w given the value of x. We see that the definition of the RF is not symmetrical in w and x. To emphasize the basic difference between the variables, we will call x a stimulus and the variable w the response. The approximation of the RF can be intuitively understood as the recognition of an objective functional dependence W(x) from a signal with symmetrical noise. In this section we suppose that W(x) and both probability densities—P(x) and P_v(w \mid x)—are smooth on I and that the distribution P(x) is nonzero at every point of the interval I.
Figure 1: Simulation of a continuous density approximation for an arbitrarily chosen objective distribution P(x). Ten thousand iterations of the learning rule (see equation 4.8) for the local update of the kernel function, together with the local update of the hypothesis, Q_{t+1}(x) := Q_t(x) + \epsilon K(x, y_t) (learning rate \epsilon = 10^{-5}), were performed. The generating function for the kernel was the gaussian f(x) = \exp(-x^2/10), which fulfills the conditions of theorem 2. As subjective initial hypothesis Q(x, 0), a uniform distribution was assumed. Due to the convergence of the algorithm, quite a good representation could be achieved.
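The local algorithm just described can be sketched end to end. Everything below—grid size, the narrower generating bump, the larger learning rate, the specific P(x), and the iteration count—is our choice for a fast self-contained demo, not the authors' original setup (the paper used \epsilon = 10^{-5}, f(x) = \exp(-x^2/10), and 10,000 steps):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 256                              # grid on [0, 2*pi)
x = 2 * np.pi * np.arange(M) / M
dx = 2 * np.pi / M

# Generating function f: a periodic bump with its mean removed, so that the
# n = 0 Fourier mode is absent and condition 4.2 (normalization) holds.
f = np.exp(-np.minimum(x, 2 * np.pi - x) ** 2 / 0.5)
f -= f.mean()

def circ_conv(a, b):
    # integral of a(z) b(x - z) dz on the periodic grid
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))) * dx

# Objective density P and uniform start hypothesis Q.
P = 1.0 + 0.5 * np.cos(x) + 0.3 * np.sin(2 * x)
P /= P.sum() * dx
Q = np.full(M, 1.0 / (2 * np.pi))
cdf = np.cumsum(P) * dx

err0 = np.abs(Q - P).max()
eps = 2e-4
for _ in range(50000):
    y = min(np.searchsorted(cdf, rng.random()), M - 1)   # stimulus y ~ P
    # Local update with kernel 4.8: K_Q(x, y) = f(x - y) - (Q conv f)(x).
    Q = Q + eps * (np.roll(f, y) - circ_conv(Q, f))
err = np.abs(Q - P).max()
print(err0, err)
```

Because f has zero mean, each stochastic update leaves \int Q\,dx = 1 unchanged, and the approximation error shrinks toward a noise floor set by the learning rate.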
Generally, there are many possibilities for applying the ideal-approximation approach described in the previous sections to the recognition of W(x). Here we describe a simple one that consists of two simultaneous approximation processes and obviously converges toward the RF. As in sections 3 and 4, we introduce a hypothesis V(x, t) about the regression function W(x). Moreover, we consider an additional function F(x, t) and the hypothesis Q(x, t) about the distribution P(x). The hypothesis Q(x, t) will be changed sequentially after the stimulus y is received, as defined by equations 4.1 and 4.8, so that

\lim_{t \to \infty} Q(x, t) = P(x). \qquad (5.1)
The local dynamics of F(x, t) is in general given by:

F(x, t + \Delta t) - F(x, t) = \epsilon K_F(x, y_t, w_t),

where K_F(x, y_t, w_t) is a kernel function depending on the actual value of F(x, t). We note that, compared with the previous section, the kernel has three parameters: the signal coordinate x, the stimulus y, and the response w. After transition to the continuous description, the average dynamics of F(x, t) can be written as:

\frac{dF(x, t)}{dt} = \int_0^{2\pi}\!\!\int_0^{2\pi} P(y, w)\, K_F(x, y, w)\, dy\, dw = \int_0^{2\pi}\!\!\int_0^{2\pi} P(y)\, P_v(w \mid y)\, K_F(x, y, w)\, dy\, dw. \qquad (5.2)
We assume the kernel function K_F(x, y, w) to be given in the form

K_F(x, y, w) = w\, \hat{f}(x - y) - \int_0^{2\pi} F(z, t)\, \hat{f}(x - z)\, dz,

where \hat{f}(x) is an arbitrary function with real and positive Fourier coefficients. After some calculations we obtain:

\frac{dF(x, t)}{dt} = \int_0^{2\pi} W(y)\, P(y)\, \hat{f}(x - y)\, dy - \int_0^{2\pi} F(z, t)\, \hat{f}(x - z)\, dz.

This equation describes the convolution of \Delta(x, t) = (W(x) P(x) - F(x, t)) with the function \hat{f}(x). Hence, the dynamics of the Fourier coefficients of \Delta(x, t) satisfies equation 4.6 and, as given in theorem 2, \Delta(x, t) \to 0. Finally:

\lim_{t \to \infty} F(x, t) = W(x)\, P(x). \qquad (5.3)
The hypothesis V(x, t) about the regression function W(x) can at any time be defined as:

V(x, t) = \frac{F(x, t)}{Q(x, t)}.

Together with equations 5.1 and 5.3 and the condition that P(x) is nonzero, we obtain:

\lim_{t \to \infty} V(x, t) = \frac{\lim_{t \to \infty} F(x, t)}{\lim_{t \to \infty} Q(x, t)} = W(x). \qquad (5.4)
In order to study the performance of this algorithm, a simultaneous approximation of the RF W(x) and an estimation of the objective distribution of the stimuli x were computed via numerical integration. The stimuli were distributed with the same probability density P(x) as in Figure 1. Figure 2 shows a comparison between the regression function W(x) and the corresponding approximation V(x, t) after 10,000 learning steps. The additive white noise was distributed in the interval [-100, 100]. The start hypothesis about the RF was a constant null function.
Figure 2: Simulation of the approximation of an arbitrarily chosen regression function W(x). The underlying probability density P(x) was the same as in the foregoing simulation (see Figure 1). The initial hypothesis V(x, 0) about the regression function was a constant null function. Again, an appropriate representation of the true regression function could be achieved.
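The two coupled approximation processes can be checked by integrating their averaged dynamics directly on a grid; the concrete P(x), target W(x), bump width, and step size below are our illustrative assumptions, and the stochastic sampling is replaced by the deterministic mean-field updates:

```python
import numpy as np

M = 256
x = 2 * np.pi * np.arange(M) / M
dx = 2 * np.pi / M

f = np.exp(-np.minimum(x, 2 * np.pi - x) ** 2 / 0.5)
f -= f.mean()                         # zero-mean generating bump

def circ_conv(a, b):
    # integral of a(z) b(x - z) dz on the periodic grid
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))) * dx

P = 1.0 + 0.5 * np.cos(x)
P /= P.sum() * dx                     # stimulus density
W = np.sin(x) + 0.2 * np.cos(2 * x)   # target regression function

Q = np.full(M, 1.0 / (2 * np.pi))     # hypothesis about P
F = np.zeros(M)                       # hypothesis about W(x) P(x)

dt = 0.05
for _ in range(4000):
    Q += dt * circ_conv(P - Q, f)         # averaged density dynamics (eq. 4.1)
    F += dt * circ_conv(W * P - F, f)     # averaged dynamics of eq. 5.2

V = F / Q                             # regression hypothesis, eq. 5.4
err = np.abs(V - W).max()
print(err)
```

Both linear dynamics contract every nonzero Fourier mode of their respective errors, so F \to W \cdot P, Q \to P, and the quotient V recovers W, as equation 5.4 asserts.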
The corresponding dynamics of the distance \|W(x) - V(x, t)\| between the regression function and the iterated hypothesis is shown in Figure 3. We performed a series of integrations with various regression functions and usually achieved satisfactory approximations; the distance converged toward null.

6 Discussion
Neural networks can be regarded as an extension of conventional techniques in statistical pattern recognition (Bishop, 1995), and certain network models can indeed be derived from approaches to probability density estimation. For example, a principled alternative to the widely used self-organizing map of Kohonen (1982) has been developed whose key property is that it defines a probability density model (Bishop, Svensen, & Williams, 1998). Accordingly, possible generalizations and applications of the presented approach can be pointed out.

Essentially, we described a framework for learning processes in neural systems, regarded as autonomous entities that adapt their internal structure on the basis of subjective probabilities constructed from randomly received input signals (Jost, Holthausen, & Breidbach, 1997).

Figure 3: The monotonic decrease of the distance \|W(x) - V(x, t)\| between the real regression function and the iterated hypothesis, measured during the foregoing simulation (see Figure 2). After 30,000 iterations the initial distance was reduced by a factor of about 300.

The approximation algorithms can easily be extended to higher-dimensional cases. To do this, we have to replace all one-dimensional hypotheses with the corresponding higher-dimensional ones and do the same for the kernels. All functions will then be represented by higher-dimensional Fourier series, but the main convergence properties will remain. For example, the density estimation for two-dimensional distributions can be performed in the following manner. We assume that the objective distribution P(x_1, x_2) and the corresponding hypothesis Q(x_1, x_2) are given. An appropriate generating function for the kernel is defined by the two-dimensional gaussian f(x_1, x_2) = \exp(-(x_1 + x_2)^2 / \sigma^2). The adaptation of the kernel after a stimulus (y_1, y_2) is received is given by the two-dimensional extension of equation 4.8. Thus, our approach can be applied to the generation of topological feature maps. Because of the optimal approximation capabilities enclosed in each element for the recognition of regression functions in higher-dimensional cases, our algorithms could be a good basis for the mathematical investigation of complex structures composed of these elements: artificial neural networks. Recent work on information processing and storage at the single-cell level has revealed previously unimagined complexity (Koch, 1997). Thus, one may assume that even single neurons could perform calculations that correspond to the presented approximation model.
The theorems provided here concerning the convergence of the approximation algorithms are only preliminary. In particular, the restriction to trigonometric polynomials could presumably be dropped in the course of a more comprehensive mathematical analysis.

Acknowledgments

This research was supported by the Thüringer Ministerium für Wissenschaft, Forschung und Kultur (B 301-96028). We thank Paul Ziche for critically reading an earlier draft of the manuscript for this article. We are grateful to one of the anonymous referees, who provided several detailed suggestions to clarify our derivations.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.

Girosi, F., & Poggio, T. (1990). Networks with the best approximation property. Biological Cybernetics, 63, 169–176.

Jost, J., Holthausen, K., & Breidbach, O. (1997). On the mathematical foundations of a theory of neural representation. Theory in Biosciences, 116, 125–139.

Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–210.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.

Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501–506.

Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience, 3, 83–88.

Scarselli, F., & Tsoi, A. C. (1998). Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11, 15–37.

Received April 13, 1998; accepted January 27, 1999.
LETTER
Communicated by Richard Golden
On the Synthesis of Brain-State-in-a-Box Neural Models with Application to Associative Memory

Fation Sevrani
Kennichi Abe
Department of Electrical Engineering, Faculty of Engineering, Tohoku University, Sendai 980-8579, Japan

In this article we present techniques for designing associative memories to be implemented by a class of synchronous discrete-time neural networks based on a generalization of the brain-state-in-a-box neural model. First, we address the local qualitative properties and global qualitative aspects of the class of neural networks considered. Our approach to the stability analysis of the equilibrium points of the network gives insight into the extent of the domain of attraction for the patterns to be stored as asymptotically stable equilibrium points, and it is useful both in the analysis of the retrieval performance of the network and for design purposes. By making use of the analysis results as constraints, the design for associative memory is performed by solving a constrained optimization problem whereby each of the stored patterns is guaranteed a substantial domain of attraction. The performance of the designed network is illustrated by means of three specific examples.

1 Introduction
In this article we consider a class of synchronous discrete-time neural networks described by a difference equation of the form

x(k+1) = g(x(k) + \alpha(W x(k) + b)), \qquad x(0) = x_0, \qquad (1.1)

where x(k) \in R^n is the state vector at time k, \alpha is the step size, W = [w_{ij}] \in R^{n \times n} is the connection (or weight) matrix, b = [b_1, \ldots, b_n]^T \in R^n is a bias vector, and g(x) = [g(x_1), \ldots, g(x_n)]^T is a vector-valued function whose components saturate linearly: g(x_i) = 1 if x_i \geq 1, g(x_i) = x_i if |x_i| < 1, and g(x_i) = -1 if x_i \leq -1.

If \nu \in \{-1, 1\}^n is a vertex of the hypercube satisfying

\nu_i (W\nu + b)_i > 0, \quad \text{for } i = 1, \ldots, n, \qquad (2.2)
Synthesis of Brain-State-in-a-Box Neural Models
455
then \nu is an asymptotically stable equilibrium point of system 1.1. Note that the above stability result is local, in the sense that there will be a sufficiently small open neighborhood U(\nu) of \nu in R^n such that any trajectory of 1.1, corresponding to an initial condition x_0 \in H^n, converges to \nu whenever x_0 belongs to U(\nu) \cap H^n. In the subsequent analysis we are concerned with a more global stability analysis for the equilibrium points of system 1.1 located at the vertices of the hypercube. More precisely, given that a vertex \nu of H^n is asymptotically stable, depending on the positive margin with which condition 2.2 is satisfied, we would like to know the extent of this stability—that is, for which points it is true that any trajectory of system 1.1 initiating from those points will converge to \nu. Let C_\rho(\nu) \subset H^n be a bounded set defined as

C_\rho(\nu) = \{x \in H^n \mid \|x - \nu\|_1 \leq 2\rho\}, \qquad (2.3)

where \|\cdot\|_1 is the l_1 vector norm on R^n, and \rho a positive scalar, 0 < \rho < n. We shall prove the following result:

Proposition 2. Let \nu be an asymptotically stable equilibrium point and let

0 < \rho < \min_i \frac{\nu_i (W\nu + b)_i}{2 \max_j |w_{ij}|}, \quad i, j = 1, \ldots, n. \qquad (2.4)

Then, every trajectory of system 1.1 starting from a point in C_\rho(\nu) cannot escape C_\rho(\nu) and will converge to \nu.

Before proceeding further, it will be convenient to introduce some additional notation and geometrical considerations that will be useful in the proof. Let V = \{\nu^{(0)}, \nu^{(1)}, \ldots, \nu^{(n)}\} be a set of n+1 affinely independent vectors¹ in R^n such that \nu^{(0)} = \nu and \nu^{(i)} = \nu - 2\rho \nu_i e_i, where e_i, i = 1, \ldots, n, is the ith column vector of the identity matrix. Let co(V) denote the n-dimensional simplex expressed as a convex combination of the vertices \nu^{(0)}, \nu^{(1)}, \ldots, \nu^{(n)}, that is,

co(V) = \left\{ x \;\middle|\; x = \sum_{i=0}^{n} \lambda_i \nu^{(i)}, \; \sum_{i=0}^{n} \lambda_i = 1, \; 0 \leq \lambda_i \in R \text{ for all } i \right\}. \qquad (2.5)
Equivalently, co(V) can be expressed by a set of linear inequalities \{x \in R^n \mid u_j^T x \leq h_j, \; j = 1, \ldots, n+1\}, where u_j = \nu_j e_j, \; j = 1, \ldots, n, \; u_{n+1} = -\nu; \; h_j = 1, \; j = 1, \ldots, n, \; h_{n+1} = 2\rho - n. Here each inequality u_j^T x \leq h_j, \; j =

¹ Recall that a set of n+1 vectors \{x^{(0)}, \ldots, x^{(n)}\} is affinely independent if and only if the translated set \{x^{(1)} - x^{(0)}, \ldots, x^{(n)} - x^{(0)}\} is linearly independent.
456
Fation Sevrani and Kennichi Abe
1, \ldots, n, defines a closed half-space associated with a tangent (or supporting) hyperplane to H^n at \nu. Accordingly, u_{n+1}^T x \leq h_{n+1} defines the closed half-space associated with the hyperplane facing the vertex \nu. In this regard, each vector \nu^{(i)} lies on n linearly independent supporting hyperplanes of co(V); that is, each \nu^{(i)} satisfies exactly n linearly independent equations u_j^T \nu^{(i)} = h_j. Now suppose that x, with |x_i| \leq 1, \; i = 1, \ldots, n, satisfies u_{n+1}^T x \leq 2\rho - n. Then u_{n+1}^T (x - \nu) \leq 2\rho or, equivalently, \|x - \nu\|_1 \leq 2\rho. Thus, the set C_\rho(\nu) can be represented as the intersection of the n-simplex spanned by the point \nu as well as the points \nu^{(1)}, \ldots, \nu^{(n)} in R^n, and the hypercube H^n. In the proof of proposition 2 we shall make use of the following lemma (a proof of the lemma is given in the appendix, following the approach of Golden, 1993).

Lemma 1. The neural network model 1.1 can be represented as an uncertain linear system described by a state equation of the form

x(k+1) = x(k) + \alpha \Gamma(k) (W x(k) + b), \qquad (2.6)
where \Gamma(k) \in R^{n \times n} is a diagonal matrix with diagonal elements 0 \leq \gamma_i(k) \leq 1.

We can now continue with the proof of proposition 2.

Proof of Proposition 2. We will show first that any trajectory of equation 1.1 originating in C_\rho(\nu) can never leave C_\rho(\nu); that is, if x_0 \in C_\rho(\nu), then x(k) will remain in C_\rho(\nu) for all k > 0. Since, due to the saturating nonlinearity, the state of the system is confined to the hypercube H^n, in order to leave C_\rho(\nu), any trajectory originating in C_\rho(\nu) must cross the face of C_\rho(\nu) opposing the vertex \nu, \{x \mid \langle \nu, x \rangle = n - 2\rho, \; x \in C_\rho(\nu)\}, represented by the closed half-space \langle \nu, x \rangle \geq n - 2\rho. Hence, in view of lemma 1, it suffices to show that if x \in C_\rho(\nu), then \langle \nu, \Gamma(Wx + b) \rangle \geq 0. For this, observe that since C_\rho(\nu) \subseteq co(V), each point x \in C_\rho(\nu) is uniquely expressible as a convex combination of the form 2.5. Using the fact that \lambda_0 + \cdots + \lambda_n = 1, we have

\langle \nu, \Gamma(Wx + b) \rangle = \sum_{j=0}^{n} \lambda_j \sum_{i=1}^{n} \gamma_i(\cdot)\, \nu_i (W\nu^{(j)} + b)_i. \qquad (2.7)
By noting that \nu^{(0)} = \nu and substituting \nu^{(j)} = \nu - 2\rho \nu_j e_j, we have by assumption that

\nu_i (W\nu^{(j)} + b)_i = \nu_i (W\nu + b)_i - 2\rho\, w_{ij}\, \nu_i \nu_j > 0, \quad \text{for } i, j = 1, \ldots, n. \qquad (2.8)

Since the \lambda_j and \gamma_i(\cdot) are nonnegative, with the \lambda_j adding up to 1, we conclude that \langle \nu, \Gamma(Wx + b) \rangle \geq 0.
To this end we have shown that C_\rho(\nu) is an invariant set. Next we show that x(k) \to \nu as k \to \infty for any x_0 \in C_\rho(\nu). For this, without loss of generality, consider the case where \nu_i = 1, and let x(k) \in C_\rho(\nu) be such that x_i(k) \neq 1. Then the ith component of x(k) + \alpha(Wx(k) + b) is

x_i(k) + \alpha \left( \sum_{j=0}^{n} \lambda_j (W\nu^{(j)} + b)_i \right) > x_i(k),

where the above inequality follows from equations 2.4 and 2.8. Thus x_i(k+1) > x_i(k). The case where \nu_i = -1 is similar. Continuing these arguments for x(k+2), x(k+3), \ldots, it can be shown that the corresponding trajectory will monotonically converge to \nu as k \to \infty.

The next result addresses the existence of other binary equilibrium points of system 1.1 in the immediate vicinity of C_\rho(\nu).

Proposition 3. Let w_{ii} = 0 for i = 1, \ldots, n. Suppose that \nu is an asymptotically stable equilibrium point, and let \kappa = \lceil \min_i (\nu_i (W\nu + b)_i / 2 \max_j |w_{ij}|) \rceil, where \lceil x \rceil denotes the smallest integer \geq x and i, j = 1, \ldots, n. Then none of the vertices \nu^{(\kappa)} at Hamming distance d_H(\nu, \nu^{(\kappa)}) \leq \kappa is an equilibrium point.
Proof. By assumption there exists \rho > 0 satisfying equation 2.4 such that, from proposition 2 and in view of the definition of \kappa, none of the vertices of the hypercube within Hamming distance \kappa - 1 from \nu, that is, \tilde{\nu} with d_H(\nu, \tilde{\nu}) \leq \kappa - 1, \; \tilde{\nu} \neq \nu, can be an equilibrium point. Now let \nu^{(\kappa-1)} be an arbitrary vertex with d_H(\nu, \nu^{(\kappa-1)}) = \kappa - 1, and let J \subset \{1, \ldots, n\} be of cardinality |J| = \kappa - 1 such that \nu^{(\kappa-1)} = \nu - 2\sum_{j \in J} \nu_j e_j. Let \nu^{(\kappa)} = \nu^{(\kappa-1)} - 2\nu_i^{(\kappa-1)} e_i be such that \nu_i^{(\kappa)} \neq \nu_i, where i \notin J, \; i = 1, \ldots, n. By assumption, we have that

\nu_i \left( W\nu^{(\kappa-1)} + b \right)_i = \nu_i (W\nu + b)_i - 2\nu_i \sum_{j \in J} w_{ij} \nu_j > 0, \quad \text{for } i = 1, \ldots, n.

Since w_{ii} = 0 \; \forall i, we obtain \nu_i^{(\kappa)} (W\nu^{(\kappa)} + b)_i = -\nu_i (W\nu^{(\kappa-1)} + b)_i < 0, \; i \notin J. Hence, from equation 2.1, it follows that \nu^{(\kappa)} cannot be an equilibrium point.
An immediate consequence of proposition 3 is the following:

Corollary 1. Let w_{ii} = 0 \; \forall i. Suppose that \nu is an asymptotically stable equilibrium point. Then none of the vertices of the hypercube at Hamming distance one from \nu can be an equilibrium point.
In the foregoing results, no considerations regarding the global stability of system 1.1 are made. Recall that a neural network (such as network 1.1) is called globally stable if every trajectory of the network converges to some equilibrium point. This guarantees that periodic and chaotic motions cannot exist in the network. Theorem 1, which follows, provides conditions on the connection matrix W that guarantee the global stability of system 1.1. In the proof of theorem 1 we will make use of an additional assumption:

Assumption A2. Given assumption A1, system 1.1 has a finite number of equilibrium points.
Assumption A2 seems reasonable. In fact, if the matrix W satisfies assumption A1 and, in addition, w_{ii} \neq 0 \; \forall i, then every principal submatrix of W is of full rank. Then, following similar arguments as in the proof of proposition 1 (see also Hui et al., 1993), it can be shown that system 1.1 cannot have an infinity of equilibrium points. When w_{ii} = 0 \; \forall i, as shown in the proof of proposition 1, assumption A2 is true for almost all b \in R^n, with the possible exception of those belonging to a set of measure zero.

Theorem 1. Suppose that system 1.1 satisfies assumption A2 and W = W^T is either positive definite or \alpha < 2/|\lambda_{\min}|, where \lambda_{\min} is the smallest eigenvalue of the symmetric matrix W. Then any trajectory of 1.1 starting from an arbitrary point x_0 \neq 0, \; x_0 \in H^n, will converge to an equilibrium point.
Proof.
See the appendix.
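A minimal sketch of the dynamics 1.1 (the helper names, parameters, and the outer-product weight choice are ours; np.clip plays the role of the saturating g):

```python
import numpy as np

def bsb_step(x, W, b, alpha):
    # One synchronous update of eq. 1.1; np.clip is the saturation g.
    return np.clip(x + alpha * (W @ x + b), -1.0, 1.0)

def bsb_run(x0, W, b, alpha=0.1, max_iter=500):
    # Iterate until the state stops changing (an equilibrium of eq. 1.1).
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = bsb_step(x, W, b, alpha)
        if np.array_equal(x_next, x):
            break
        x = x_next
    return x

# Store one pattern with a zero-diagonal outer-product weight matrix:
# then nu_i (W nu + b)_i = n - 1 > 0, so p satisfies condition 2.2.
p = np.array([1.0, -1.0, 1.0, -1.0])
W = np.outer(p, p)
np.fill_diagonal(W, 0.0)
b = np.zeros(4)

x_final = bsb_run(0.5 * p, W, b)
print(x_final)
```

Started halfway to the vertex, every component is driven monotonically into the corner, matching the argument in the proof of proposition 2.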
In practice, due to noise, every trajectory of 1.1 will converge to an asymptotically stable equilibrium point, which, from proposition 1, with the additional constraint that the matrix W have all of its diagonal elements zero, will be located at a vertex of the hypercube H^n.

3 Application to Synthesis
In this section we address the problem of synthesizing an associative memory using the neural network model 1.1:

Synthesis Problem. Given a set of binary vectors (desired memory patterns) \{p^{(k)}\}_{k=1}^{m}, \; p^{(k)} \in \{-1, 1\}^n, and given the step size \alpha > 0, find W, b such that:

1. p^{(1)}, \ldots, p^{(m)} are stored as asymptotically stable equilibrium points (memories) of system 1.1.

2. The domain of attraction of each stored pattern is as large as possible.
3. The total number of spurious asymptotically stable equilibria (extraneous stored patterns) is as small as possible.

4. The system 1.1 is globally stable; that is, every trajectory of the system converges to some equilibrium point.

In regard to design criterion 4, the global stability of the network is not essential for storing the information. In fact, several existing design methods do not take the global stability of the network into consideration (see, e.g., Michel, Farrell, & Porod, 1989; Michel & Farrell, 1990). However, in terms of the reliability of retrieving the information—for example, when one expects the retrieval of a stored pattern from an input that is further away from that pattern—the global stability of the network is of great interest.

Based on the analysis results provided in the preceding section, we approach the synthesis problem in the following manner. According to proposition 2, by selecting an n \times n matrix W = [w_{ij}] and an n-dimensional vector b = [b_1, \ldots, b_n]^T such that for each k = 1, \ldots, m the inequality

\sum_{j=1}^{n} w_{ij}\, p_i^{(k)} p_j^{(k)} + p_i^{(k)} b_i \geq \sigma, \quad i = 1, \ldots, n, \qquad (3.1)

is satisfied for some given \sigma > 0, we are guaranteed that p^{(1)}, \ldots, p^{(m)} will be stored as memories of system 1.1. Furthermore, associated with each p^{(k)} there will be a stability region \{x \in H^n \mid \|x - p^{(k)}\|_1 < 2\rho_k\}, where \rho_k = \min_i (p_i^{(k)} (Wp^{(k)} + b)_i / 2 \max_j |w_{ij}|), \; i, j = 1, \ldots, n, such that any trajectory of the system originating within this region will converge to p^{(k)}. Assuming that |w_{ij}| \leq \bar{w} for all i, j, the quantity \rho_k gives a lower bound on the extent of the domain of attraction of a stored pattern, and thus it provides a reasonable measure of the error-correction capabilities of the network. From an alternative viewpoint, given a stored pattern corresponding to an asymptotically stable equilibrium point of the network, an idea of how much the domain of attraction of that pattern can be expanded may also be given by the location of other asymptotically stable equilibrium points in the vicinity—that is, spurious asymptotically stable equilibrium points as well as points used for storing the (desired) memory patterns. Hence, the requirement of excluding spurious asymptotically stable equilibria, at least within some reasonable distance from each of the stored patterns, would allow for a larger extent of the domain of attraction of those patterns. It is clear that such an extent is bounded from above by the locations of the stored patterns, and in general it decreases as their number increases.
Note that the domain of attraction of the stored patterns is not known in its entirety.
In regard to the location of spurious asymptotically stable equilibrium points, from proposition 1 we have that, for system 1.1 with a zero-diagonal connection matrix W satisfying assumption A1, those points must be located at the vertices of the hypercube. By proposition 3, provided that condition 3.1 is satisfied with w_{ii} = 0 \; \forall i, we are guaranteed that within a Hamming distance r_k = \lceil \rho_k \rceil of a stored pattern p^{(k)} there can be no equilibrium point apart from p^{(k)} itself. In particular (see corollary 1), none of the vertices at Hamming distance one from any of the stored patterns can be an equilibrium point, which implies that none of such vertices can be used as memory locations. However, we consider this a desirable property, since patterns that differ in only one component may be too close to allow for a significant amount of error correction.

At this point some observations are in order. Suppose that for some k, 1 \leq k \leq m, we have

\sum_{j=1}^{n} w_{ij}\, p_i^{(k)} p_j^{(k)} + p_i^{(k)} b_i - 2 \sum_{j \in J} w_{ij}\, p_i^{(k)} p_j^{(k)} > 0 \qquad (3.2)

for at least one integer i \in J \subset \{1, \ldots, n\}, where J is any subset of integers of cardinality |J| \geq r_k \; (r_k \geq 1). Stated otherwise, for at least one integer i \in J, the ith component of W\nu + b, at the vertex \nu that differs from p^{(k)} in the components p_i^{(k)}, \; i \in J, points to p^{(k)}. From equation 2.1, it is clear that there cannot be any equilibrium point located at such a vertex. Note that when a significant number of components of W\nu + b, at \nu, point to p^{(k)}, we have reason to suspect that a trajectory starting at \nu will approach p^{(k)} and eventually reach that pattern. Hence, since \sigma in condition 3.1 provides a lower bound for the first two terms in equation 3.2, if we seek to maximize \sigma (provided that |w_{ij}| \leq \bar{w} for all i, j), this will certainly further reduce the number of spurious asymptotically stable equilibria, and in addition it will ensure good retrieval of the memory patterns even from locations (vertices) more distant than those for which we can guarantee perfect retrieval.

To this end, a final observation concerns the effect of the bound \bar{w} on the error-correction capabilities of the network in the presence of random binary errors. Let us suppose that we require the retrieval of a memory pattern in the presence of errors due to the flipping of some of its components. We anticipate that the dispersion (or discontinuity) of the w_{ij} values will affect the strength of perturbations along all other components of that pattern. The larger the dispersion of the w_{ij} values, the more information is lost in the presence of those errors and, hence, the less likely it is that the memory pattern will be correctly retrieved. The combined effect of the increase of the positive margin \sigma in equation 3.1 and the bound \bar{w} on the error-correction capabilities of the network will be demonstrated by the practical simulation results presented in section 4.
Now, putting it all together, a possible strategy for approaching the synthesis problem can be formulated as follows:
Synthesis of Brain-State-in-a-Box Neural Models

Synthesis Strategy. Given a set of binary vectors {p^(k)}_{k=1}^{m}, p^(k) ∈ {−1, 1}^n, to be stored as memories for system 1.1, and given α > 0, w̄ > 0, find W = [w_ij] ∈ R^{n×n} and b = [b_1, …, b_n]^T ∈ R^n such that σ is maximum subject to the constraints

Σ_{j=1}^{n} w_ij p_i^(k) p_j^(k) + p_i^(k) b_i ≥ σ,  σ > 0,  k = 1, …, m,  (3.3)

−w̄ ≤ w_ij ≤ w̄ for i ≠ j,  w_ii = 0,  i, j = 1, …, n.  (3.4)
By making use of the condition on the smallest eigenvalue λ_min(W) provided by theorem 1,3 the global stability requirement imposes the additional constraint

W = W^T,  τ I_n + W > 0,  τ = 2/α,  (3.5)

where I_n is the n × n identity matrix and the inequality sign above means that (τ I_n + W) is positive definite. In the following we point out some computational aspects of solving the synthesis problem based on the formulated strategy.

Let us first consider the problem of determining W and b such that σ is maximum subject to constraints 3.3 and 3.4. In this case, the synthesis problem reduces to a general linear programming (LP) problem. Provided that a solution for W and b has been found, the resulting network is not guaranteed to be globally stable; however, all given patterns are guaranteed to be stored as asymptotically stable equilibrium points of the network, with a substantial domain of attraction and hence a significant amount of error correction. From a computational perspective, based on an LP formulation, the synthesis problem can be solved efficiently by using any of the polynomial-time algorithms for LP (see, e.g., Lustig, Marsten, & Shanno, 1994), or by employing another neural network that offers the computational advantages of massively parallel processing and fast convergence (see, e.g., Xia & Wang, 1995; Xia, 1996). Note also that restrictions on the interconnection structure that require predetermined elements of W to be zero can be incorporated simply by excluding those elements from constraints 3.3 and 3.4.

With the addition of the global stability constraint 3.5, the synthesis problem, that is, the problem of maximizing σ over equations 3.3 through 3.5, cannot be cast as a linear program. However, it can be cast as a semidefinite program (see Vandenberghe & Boyd, 1996) in the variables W = W^T, b, and σ. In semidefinite programming (SDP), one minimizes a linear function subject to the constraint that an affine combination of symmetric matrices is positive semidefinite,4 which can be regarded

3 Note that since w_ii = 0 for all i, W = W^T cannot be positive definite.
4 Many common convex constraints (for example, linear and quadratic inequalities, eigenvalue and matrix norm inequalities) can be expressed in this way.
Fation Sevrani and Kennichi Abe
as an extension of LP in which the componentwise inequalities between vectors are replaced by matrix inequalities. Semidefinite programs are convex optimization problems, and, as in LP, they can be solved in polynomial time by effective algorithms that rapidly compute the global minimum with nonheuristic stopping criteria (Nesterov & Nemirovsky, 1994).

We note that, in terms of computational requirements, it is also possible to consider a simple heuristic algorithm for solving the synthesis problem. More specifically, given the specifications for α and w̄, one can approach the synthesis problem by finding a feasible solution for W and b subject to constraints 3.3 and 3.4 (with the additional constraint that w_ij = w_ji for i ≠ j, if global stability is required) for increasing values of σ, until the solution space is empty. In this case, simple numerical relaxation techniques can be used, such as the technique in Perfetti (1995).5 To check for global stability, one can then compute the smallest eigenvalue of the symmetric matrix W. Referring to equation 3.5, if λ_min < −2/α, the above procedure is repeated with a lower value for the bound w̄ until one ends up with a solution for W and b that solves the synthesis problem.6

We conclude this section with a final comment concerning the stability properties of the desired memory patterns, stored as asymptotically stable equilibrium points of the network, under inaccuracies incurred during the implementation process or limited-precision implementation of system 1.1. Denoting those parameter inaccuracies by ΔW and Δb, respectively, and performing an analysis similar to that in Liu and Michel (1994), it can be shown that the stored patterns retain their stability properties provided that the following condition holds:7

‖ΔW‖₁ + ‖Δb‖₁ < σ,

where ‖·‖₁ denotes the matrix norm induced by the l₁ vector norm.
Thus, increasing σ will certainly increase the robustness of the stability of the stored patterns in the presence of implementation inaccuracies or parameter perturbations.
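The LP step can be made concrete with a small sketch of ours (not the authors' implementation; the variable encoding, the helper name `synthesize`, and the toy two-pattern usage are hypothetical). It maximizes σ over the upper-triangular entries of W, the bias b, and σ itself, subject to constraints 3.3 and 3.4 plus the symmetry constraint w_ij = w_ji, using scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def synthesize(patterns, w_bar):
    """LP sketch of the synthesis strategy: maximize sigma subject to
    constraints 3.3-3.4 with the symmetry constraint w_ij = w_ji.
    Decision variables: upper-triangular entries of W, the bias b, sigma."""
    P = np.asarray(patterns, dtype=float)    # shape (m, n)
    m, n = P.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    nw = len(pairs)
    nvar = nw + n + 1                        # [w_upper, b, sigma]

    A, rhs = [], []
    for p in P:
        for i in range(n):
            row = np.zeros(nvar)
            for idx, (a, c) in enumerate(pairs):
                if a == i:   row[idx] = p[i] * p[c]   # w_ic term
                elif c == i: row[idx] = p[i] * p[a]   # w_ia term
            row[nw + i] = p[i]               # p_i * b_i term
            row[-1] = -1.0                   # ... - sigma >= 0
            A.append(-row); rhs.append(0.0)  # linprog wants A x <= rhs
    bounds = [(-w_bar, w_bar)] * nw + [(None, None)] * n + [(0, None)]
    c = np.zeros(nvar); c[-1] = -1.0         # maximize sigma
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(rhs), bounds=bounds)
    assert res.status == 0
    W = np.zeros((n, n))
    for idx, (a, c_) in enumerate(pairs):
        W[a, c_] = W[c_, a] = res.x[idx]
    return W, res.x[nw:nw + n], res.x[-1]

# Hypothetical usage: store a pattern and its negative (n = 4, w_bar = 0.5).
p = np.array([1, 1, -1, 1], dtype=float)
W, b, sigma = synthesize([p, -p], w_bar=0.5)
```

For this toy pair the optimum is attained at w_ij = 0.5 p_i p_j and b = 0, giving σ_max = (n − 1) w̄ = 1.5; for general pattern sets, one checks that the returned σ is strictly positive, as constraint 3.3 requires.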
5 Experimentation with the algorithm proposed in Perfetti (1995) shows that, in terms of numerical stability and number of iterations, this algorithm is, as can be expected, quite inferior to, for example, the interior-point algorithms for LPs or SDPs.
6 Geršgorin's theorem applied to a zero-diagonal symmetric matrix W states that all eigenvalues of W are located inside the circular disc centered at the origin with radius max_i Σ_j |w_ij|, which bounds the spectral radius of W.
7 Note that, when considering inaccuracies during the implementation process, one is more concerned with parameter inaccuracies in the connection matrix W than in the bias vector b.
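The eigenvalue check in the heuristic above, and the Geršgorin bound of footnote 6, can be sketched as follows (our illustration; the 3 × 3 example matrix is hypothetical):

```python
import numpy as np

def globally_stable(W, alpha):
    """Check constraint 3.5: W symmetric and tau*I + W positive definite,
    i.e. lambda_min(W) > -2/alpha."""
    assert np.allclose(W, W.T), "constraint 3.5 requires a symmetric W"
    lam_min = np.linalg.eigvalsh(W)[0]      # eigvalsh sorts ascending
    return lam_min > -2.0 / alpha

def gershgorin_radius(W):
    """Footnote 6: for zero-diagonal W, every eigenvalue lies in the disc
    of radius max_i sum_j |w_ij| centered at the origin."""
    return np.max(np.abs(W).sum(axis=1))

# Hypothetical example: small symmetric zero-diagonal W, alpha = 1.
W = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.4],
              [-0.2, 0.4, 0.0]])
r = gershgorin_radius(W)
assert abs(np.linalg.eigvalsh(W)[0]) <= r + 1e-12   # the bound holds
print(globally_stable(W, alpha=1.0))                # True for this W
```

Since the Geršgorin radius here is 0.7 < 2/α, global stability follows without computing any eigenvalue, which is the point of the footnote.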
At this point it should be noted that, since a large σ will usually result in a large spread or dispersion of the w_ij values, as we conjectured above and as the simulation results show, it may cause the error-correction capabilities of the network to deteriorate in the presence of random binary errors. Furthermore, if we require global stability of the network, simulation results suggest that the range of possible values of σ is practically small. It is clear that there is a trade-off among robustness, global stability, error-correction capability, and speed of operation, which remains to be resolved by the designer.

4 Simulations
In this section, we demonstrate the applicability of the theoretical results, as well as the consistency of the arguments presented in the previous sections, by means of three specific examples.

4.1 Example. In this example we compare the retrieval performance of the neural network model synthesized with the strategy given in the previous section against the performance of the networks designed by the methods of Michel, Si, & Yen (1991) and Perfetti (1995) (in the special case of system 1.1 where b = 0). For the sake of comparison with the results of Michel et al. and Perfetti, in this example we choose α = 1 and require the network to be globally stable. We consider the following set of binary patterns of length n = 10 (cf. Michel et al., 1991, example 2.1) to be stored as asymptotically stable equilibrium points of equation 1.1:

p(1) = [−1, 1, −1, 1, 1, 1, −1, 1, 1, 1]^T,
p(2) = [1, 1, −1, −1, 1, −1, 1, −1, 1, 1]^T,
p(3) = [−1, 1, 1, 1, −1, −1, 1, −1, 1, −1]^T,
p(4) = [1, 1, −1, 1, −1, 1, −1, 1, 1, 1]^T,
p(5) = [1, −1, −1, −1, 1, 1, 1, −1, −1, −1]^T,
p(6) = [−1, −1, −1, 1, 1, −1, 1, 1, −1, 1]^T.
Specifying α = 1 and w̄ = 0.5, we determine the 10 × 10 matrix W = [w_ij] and the 10-dimensional vector b by maximizing σ over equations 3.3 and 3.4 with the additional symmetry constraint w_ij = w_ji for i ≠ j, solved by linear programming. We obtain

W =
[  0.0000  −0.0071  −0.4009  −0.4419  −0.5000   0.3199   0.1017  −0.3457  −0.0071   0.2898
  −0.0071   0.0000   0.1931  −0.0335  −0.3104   0.0000  −0.3300  −0.1920   0.4552   0.2350
  −0.4009   0.1931   0.0000   0.2238  −0.3885  −0.3046   0.1568  −0.2677   0.1931  −0.3830
  −0.4419  −0.0335   0.2238   0.0000  −0.4009   0.0253  −0.2708   0.3909  −0.0335   0.1719
  −0.5000  −0.3104  −0.3885  −0.4009   0.0000   0.0780   0.1773  −0.0968  −0.3104   0.3921
   0.3199   0.0000  −0.3046   0.0253   0.0780   0.0000  −0.5000   0.2946   0.0000  −0.5000
   0.1017  −0.3300   0.1568  −0.2708   0.1773  −0.5000   0.0000  −0.3584  −0.3300  −0.3119
  −0.3457  −0.1920  −0.2677   0.3909  −0.0968   0.2946  −0.3584   0.0000  −0.1920   0.4185
  −0.0071   0.4552   0.1931  −0.0335  −0.3104   0.0000  −0.3300  −0.1920   0.0000   0.2350
   0.2898   0.2350  −0.3830   0.1719   0.3921  −0.5000  −0.3119   0.4185   0.2350   0.0000 ]
Table 1: Simulation Results for Example 1. Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number        HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                       10       35       65       62       40      212
2                       10       38       69       63       78      258
3                       10       31       39       10        0       90
4                        8       33       41       25        4      111
5                       10       35       45       20        7      117
6                       10       33       51       24        6      124
Convergence rate (%)  96.7     75.9     43.1     16.2     10.4     89.0
Table 2: Simulation Results for the Design of Michel et al. (1991, Example 2.1). Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number        HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                       10       25       16        4        0       55
2                       10       27       28        7        0       72
3                       10       32       34       12        2       90
4                        8       17       14        7        2       48
5                       10       30       26        5        0       71
6                        9       26       22        5        0       62
Convergence rate (%)  95.0     58.1     19.4      3.2      0.3     38.9
and

b = [−0.1071, 0.5972, −1.4713, 0.8561, 0.4372, 0.1175, 1.1485, −0.1649, 0.5972, −0.0323]^T,

corresponding to σ_max = 0.5. Also, λ_min(W) = −1.5, which, together with the fact that W is symmetric, guarantees, according to theorem 1, the global stability of the synthesized network. Using condition 2.3, it can be verified that system 1.1 with the W and b given above, in addition to the desired patterns p(1), …, p(6), has three further binary patterns stored as asymptotically stable equilibrium points (spurious memories).

The simulation results summarized in Tables 1–3 compare the retrieval performance of the synthesized neural network 1.1 with the performance of the networks designed by the methods of Michel et al. (1991) and Perfetti (1995). In columns 2 to 5 of each table, the data in a row, corresponding to the pattern specified in the first column, give the number of binary inputs converging to that pattern out of all possible binary inputs at Hamming distance HD = 1, …, 4 and HD ≥ 5, used as initial conditions for the state of the network. Referring, for example, to the first
Table 3: Simulation Results for the Design of Perfetti (1995, Example 2). Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number        HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                       10       31       46       30        8      125
2                        7        8        5        0        0       20
3                       10       15       19        8        0       52
4                        8       31       47       34        4      124
5                       10       33       42       28        7      120
6                        6       11        7        4        0       28
Convergence rate (%)  85.0     47.8     23.1      8.3      1.3     45.8

Note: Special case of neural network (1) with b = 0.
row of Table 1: for the given pattern p(1), 35 of the 45 possible initial conditions at Hamming distance HD = 2 from p(1) converged to p(1). The data in the last row of each table give the average rate of convergence to the correct stored pattern.8 The total in the last column of each table gives an indication of the attraction capability of each pattern. As shown in Table 1, the synthesized network (see equation 2.1) attracted 89% of the binary inputs (there are 2^10 = 1024 possible binary inputs) to the desired patterns. The design method (the eigenstructure method) of Michel et al. (1991) (example 2.1, with design parameters t1 = 1.5, t2 = 0.5) attracted 39% of the binary inputs to the desired patterns, and the method of Perfetti (1995) (example 2, with design parameter d = 0.75) attracted 46%.

At this point we make a few additional observations. As seen from Table 2, for inputs at Hamming distance HD > 1 from the desired memory patterns, the convergence rate of the network designed by the eigenstructure method drops faster than that of the other two designs. In fact, the network designed by the eigenstructure method, with the design parameters specified above, has in addition 12 spurious (nonbinary) memories that affect the retrieval efficiency of the network. This is in contrast with our design, where there are only three spurious memories, as reflected by the convergence rates shown in Table 1. For the design of Perfetti (1995) there are eight spurious memories; six are the negatives of the desired patterns, which are automatically stored as asymptotically stable equilibrium points of the network.
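The retrieval tests behind Tables 1–3 can be reconstructed along the following lines (a sketch of ours, not the authors' code; the toy Hebbian W and all helper names are hypothetical). Each binary initial condition at a given Hamming distance is iterated under the BSB update x(k+1) = g(x(k) + α(Wx(k) + b)), where g saturates each component to [−1, 1]:

```python
import numpy as np
from itertools import combinations

def bsb_step(x, W, b, alpha=1.0):
    """One synchronous BSB update: saturate x + alpha*(Wx + b) to [-1, 1]."""
    return np.clip(x + alpha * (W @ x + b), -1.0, 1.0)

def converges_to(x0, p, W, b, alpha=1.0, max_iter=500):
    """Iterate until a fixed point, then test whether it matches p."""
    x = x0.astype(float)
    for _ in range(max_iter):
        x_new = bsb_step(x, W, b, alpha)
        if np.allclose(x_new, x):
            break
        x = x_new
    return np.array_equal(np.sign(x), p)

def basin_count(p, W, b, hd, alpha=1.0):
    """Number of binary inputs at Hamming distance hd from p that
    converge to p (one table entry)."""
    n = len(p)
    count = 0
    for J in combinations(range(n), hd):
        x0 = p.copy()
        x0[list(J)] *= -1                 # flip hd components of p
        count += converges_to(x0, p, W, b, alpha)
    return count

# Hypothetical usage: a Hebbian-style W storing one pattern of length 6.
p = np.array([1., -1., 1., 1., -1., 1.])
W = np.outer(p, p)
np.fill_diagonal(W, 0.0)
print(basin_count(p, W, np.zeros(6), hd=1))   # 6: all HD=1 inputs return
```

Summing `basin_count` over hd and patterns reproduces the "Total" column; dividing by the number of inputs at each distance gives the convergence rates in the last row.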
Next, numerous simulations suggest that the maximization of σ in equation 3.3 is critical for the performance of the network, in line with the theoretical results and arguments presented in the previous

8 The convergence rate is the ratio of the number of initial conditions that converge to the desired pattern to the number of all possible initial conditions at a given Hamming distance HD from the desired pattern.
466
Fation Sevrani and Kennichi Abe
Table 4: Simulation Results for Example 2. Average Convergence Rate (%) of Binary Inputs to the Desired Pattern, and Number of Spurious States.

w̄     σ_max/2w̄   HD = 1   HD = 2   HD = 3   Number of spurious states
0.2    0.60        96.83    78.40    44.21    2.35
0.4    0.60        96.83    77.61    43.32    2.45
0.6    0.60        96.83    76.71    42.11    2.50
0.8    0.59        96.83    74.78    39.87    2.40
1.0    0.57        96.42    70.17    37.76    2.40
1.2    0.52        91.50    60.47    33.63    2.80
1.4    0.46        84.58    53.82    30.39    3.40
1.6    0.41        82.25    50.82    28.55    3.60
1.8    0.37        81.00    49.71    28.11    3.90
2.0    0.34        78.83    47.45    27.77    3.90
sections. Furthermore, by considering w̄ (the upper bound on the absolute values of the elements of the connection matrix W) as a design parameter, we observe that low values of w̄, such as the one chosen in this example, improve the retrieval efficiency of the network in the presence of errors (see the following example). This approach also allows us to design a globally stable network by solving a linear programming problem, and thus opens the possibility of designing the neural network (cf. equation 1.1) by another neural network.

4.2 Example. Our objective in this example is to investigate the effect of the bound w̄ on the retrieval performance of the neural network model 1.1 and to ascertain how typical the results of example 4.1 are. To this end, we used 20 randomly generated sets of binary patterns (with mutual Hamming distances greater than one but otherwise random) to be stored as asymptotically stable equilibrium points of the network. Each set contained m = 6 binary patterns of length n = 10. For each set, we synthesized the network (with n = 10 and α = 1) by solving a semidefinite program based on the formulation given in the preceding section, for several specific values of the bound w̄. Thus, all the designed networks are guaranteed to be globally stable.9 Table 4 shows the results together with the obtained optimal value of σ (σ_max). The first column specifies the values chosen for the bound w̄; columns 2–5 summarize the average results of the simulation tests. The simulation results show that the number of binary inputs (applied as initial conditions for the state of the network) converging to the correct stored pattern decreases as w̄ increases. This was more pronounced for inputs at Hamming distance HD = 1, 2, 3 from a stored pattern.

The number of binary inputs attracted from larger Hamming distances,

9 Note that the maximization of σ in equation 3.3 subject to large values of the bound w̄ may result in connection matrices with a large spectral radius. In such a case, without the matrix inequality constraint in equation 3.5, one cannot guarantee the global stability of the synthesized network.
HD ≥ 4, was in general unaffected (as expected) by the increase of w̄. As shown in Table 4, for w̄ less than one the average convergence rate for inputs at HD = 2, 3 decreases steadily even though σ_max/2w̄ stays practically unchanged. The convergence rate then drops faster for w̄ ≥ 1, which may be attributed to the combined effect of the increase of w̄ and the decrease of σ_max/2w̄,10 as predicted by our theoretical results. In experiments with other values of n and m (not shown here) we consistently found the same trend in retrieval performance when increasing the bound w̄. This validates our conjecture that a small dispersion of the w_ij values contributes to better error-correction capabilities of the network in the presence of random binary errors. In particular, we observe that the results of example 4.1 are quite typical.

4.3 Example. The purpose of this example is to examine how the retrieval performance of neural network model 1.1 is affected by the number of stored patterns. To this end, 150 sets of binary patterns (n = 13) were chosen randomly, where each set consisted of k desired memory patterns, with k varying between 1 and 15 (for each value of k, there were 10 random sets). For each set we synthesized the network by solving an LP problem with specifications α = 1 and w̄ = 0.5. All the designed networks satisfied the global stability constraint 3.5. Then, for each designed network, we examined the percentage of input patterns (given as initial states of the network) at Hamming distance HD = 1, …, 5 and HD ≥ 6 from a stored pattern that converge to that stored pattern. Figure 1 summarizes the average results of the simulation tests. The shape of Figure 1 is similar to the waterfall graphs in coding theory and signal processing that are used to display the degradation of system performance as the input noise increases.
From this point of view, Figure 1 shows the capability of network 1.1 to handle small signal-to-noise ratios as a function of the number of stored patterns. Compared with other existing results, Figure 1 shows that the storage ability and retrieval efficiency of the network are quite satisfactory.

5 Conclusions
We have considered the realization of associative memory by means of a class of synchronous discrete-time neural networks that can be viewed as a generalization of Anderson's BSB neural model. We first presented an analysis of the location and qualitative properties of the equilibria of the considered network. In particular, we provided results concerning the extent of the domain of attraction of asymptotically stable equilibria, which, under

10 The decrease of σ_max/2w̄ is caused by the fact that the value of σ_max is bounded by the global stability constraint 3.5.
468
Fation Sevrani and Kennichi Abe
Figure 1: Average percentage of all inputs that converge from a given Hamming distance HD to a stored pattern.
appropriate assumptions, were shown to exist only at the vertices of the hypercube. Using the analysis results as constraints, the synthesis problem was formulated as a constrained optimization problem that proved very flexible and offers numerous degrees of freedom. Taking advantage of this formulation, the proposed approach to the synthesis problem allows a design in which the interconnection structure is predetermined, with or without symmetry constraints, and includes a class of discrete cellular neural networks as a special case. In the case of a symmetric interconnection structure, it was shown that by properly choosing an upper bound on the absolute values of the elements of the connection matrix, one can guarantee the global stability of the network while using LP techniques to solve the synthesis problem, which also highlights the possibility of carrying out the synthesis with another neural network. In our experiments, computer simulations verified that all the desired patterns were stored as asymptotically stable equilibrium points of the designed network. Furthermore, trends observed in numerous simulations showed that the number of spurious asymptotically stable equilibrium points of the considered network is in general small. The simulations also demonstrated the existence of a substantial domain of attraction around the stored patterns, validating our theoretical results as well as the heuristic arguments provided in the previous sections. Compared with other results reported in the literature, the retrieval efficiency of the designed network is quite satisfactory.
Appendix

Proof of Proposition 1. Let J ⊂ {1, …, n} denote a nonempty subset of integers and let

F = {x ∈ R^n : |x_i| < 1 if i ∈ J, x_i = ±1 if i ∉ J},

with dim(F) = |J|, the cardinality of J. Let x̃ ∈ R^|J|, x̃_i = x_i, i ∈ J. Then x̃ evolves according to the linear system

x̃_i(k+1) = x̃_i(k) + α Σ_{j∈J} w_ij x̃_j(k) + α (Σ_{j∉J} w_ij u_j + b_i)  for all i ∈ J,  (A.1)

where each u_j, j ∉ J, is fixed at ±1. From equation A.1, any equilibrium point x* of system 1.1 must satisfy the condition

W̃ x̃* = −b̃,  (A.2)

where x̃* ∈ R^|J| is the vector whose components are the unsaturated components of x*, W̃ ∈ R^(|J|×|J|) is the submatrix of W obtained by selecting the rows and columns indicated by the indices in the set J, and b̃_i = Σ_{j∉J} w_ij u_j + b_i for all i ∈ J.

Consider first the case |J| = 1, in which F is an edge of H_n (a face of dimension one). We have the following result.

Lemma 2. Let w_ii = 0 for all i. Then for almost all b ∈ R^n (in the sense of Lebesgue measure), there is no equilibrium point of system 1.1 on an edge of the hypercube H_n.

Proof. Since w_ii = 0 for all i, by equation A.1 the condition for the existence of an equilibrium point on an edge of the hypercube reduces to Σ_{j∉J} w_ij u_j + b_i = 0. For fixed W, this equality can be satisfied only if b_i = −Σ_{j∉J} w_ij u_j. Since this set of values {b_i}, i = 1, …, n, has one-dimensional measure zero, for all b ∈ R^n except those belonging to a set of measure zero there is no equilibrium point on an edge of H_n.
Now consider 1 < |J| ≤ n. By the hyperbolicity assumption (assumption A1), any submatrix W̃ ∈ R^(|J|×|J|) is of full rank. Hence any equation of the form A.2 has exactly one solution. Now, since the trace of W is zero (w_ii = 0 for all i), each principal submatrix of W, including W itself, has at least one eigenvalue with positive real part (the others having negative real part). Let (I_|J| + αW̃), W̃ ∈ R^(|J|×|J|), denote the state matrix of equation A.1. By
470
Fation Sevrani and Kennichi Abe
assumption, (I_|J| + αW̃) has at least one eigenvalue with modulus greater than one. Hence the possible equilibrium x̃* is unstable. Consequently, the only possible stable equilibria must be located at the vertices of the hypercube.

Proof of Lemma 1. Using elementwise notation, if x_i(k+1) ≠ x_i(k), ζ_i(k) can be obtained as

ζ_i(k) = [g(x(k) + αL(x(k)))_i − x_i(k)] / [α(L(x(k)))_i],

where, for convenience, we use the notation L(x) = Wx + b. Now consider the following cases:

Case 1. |(x(k) + αL(x(k)))_i| ≤ 1. Then

ζ_i(k) = [(x(k) + αL(x(k)))_i − x_i(k)] / [α(L(x(k)))_i] = 1.

Case 2. (x(k) + αL(x(k)))_i > 1. Note that since |x_i(k)| ≤ 1, we have (L(x(k)))_i > 0. Then

ζ_i(k) < [(x(k) + αL(x(k)))_i − x_i(k)] / [α(L(x(k)))_i] = 1,

and also

ζ_i(k) = (1 − x_i(k)) / [α(L(x(k)))_i] ≥ 0.

The case in which (x(k) + αL(x(k)))_i < −1 is similar.

Proof of Theorem 1. Under the given hypotheses, that is, if the connection matrix W is symmetric and either positive definite or α < 2/|λ_min|, it follows
from the energy minimization theorem (see Golden, 1986, 1993; Hui et al., 1993) that the energy function E: H_n → R, defined by

E(x) = −(1/2) x^T W x − x^T b,  (A.3)

evaluated on the trajectories of equation 1.1, is monotonically decreasing unless it encounters an equilibrium point; that is, denoting ΔE(x(k)) = E(x(k+1)) − E(x(k)), we have ΔE(x(k)) < 0 if x(k+1) ≠ x(k), and ΔE(x(k)) = 0 if and only if x(k+1) = x(k). Based on these considerations, and by making use of assumption A2, we now show that every trajectory of system 1.1 converges to some equilibrium point as k → ∞. Given a trajectory {x(k)} starting from x_0 ∈ H_n, k = 0, 1, 2, …, let V(x) denote the positive limit set, defined as V(x) = {y : there is a sequence k_m → +∞ such that x(k_m) → y}. By definition, all trajectories of system 1.1 are bounded. Hence, by the Bolzano-
Weierstrass theorem, V(x) ≠ ∅. Let S = {x ∈ H_n : ΔE(x) = 0} and let M denote the largest invariant subset of S. By definition (cf. equation A.3), E is C1 and bounded on H_n. Since ΔE(x(k)) ≤ 0 for all x(k) ∈ H_n, by the invariant set theorem (see, e.g., Luenberger, 1979), every trajectory of system 1.1 approaches M as k → ∞; that is, V(x) ⊂ M. Now, by assumption, the set M is the entire set of equilibrium points of 1.1, because ΔE(x(k)) = 0 iff x(k+1) = x(k). Furthermore, by assumption, the set of equilibrium points is finite; that is, each equilibrium point is isolated. Since M is finite and V(x) ⊂ M, V(x) is finite. Thus, to prove that every trajectory of the system converges to an equilibrium point, it remains to show that V(x) contains only a single point. By contradiction, suppose that x*, y* ∈ V(x). Then, given ε/2 > 0, there is a k1 such that |x(k) − x*| < ε/2 for k > k1, and a k2 such that |x(k) − y*| < ε/2 for k > k2. Choosing k̃ = max(k1, k2), we have |x* − y*| ≤ |x(k) − x*| + |x(k) − y*| < ε for all k > k̃, which contradicts the fact that the equilibrium points are isolated. This contradiction shows that every trajectory of system 1.1 converges to a limit set containing a single equilibrium point of the equation.

References

Anderson, J. (1993). The BSB model: A simple nonlinear autoassociative neural network. In M. H. Hassoun (Ed.), Associative neural memories: Theory and implementation (pp. 77–103). Oxford: Oxford University Press.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psych. Rev., 84, 413–451.
Golden, R. (1986). The brain-state-in-a-box neural model is a gradient descent algorithm. J. Math. Psychol., 30(1), 73–80.
Golden, R. (1993). Stability and optimization analysis of the generalized brain-state-in-a-box neural network model. J. Math. Psychol., 37, 282–298.
Guez, A., Protopopescu, V., & Barhen, J. (1988). On the stability, storage capacity, and design of nonlinear continuous neural networks. IEEE Trans. Systems, Man, and Cybernetics, 18, 80–87.
Hui, S., & Żak, S. (1992). Dynamical analysis of the brain-state-in-a-box (BSB) neural models. IEEE Trans. Neural Networks, 3, 86–94.
Hui, S., Lillo, W. E., & Żak, S. H. (1993). Dynamics and stability of the brain-state-in-a-box (BSB) neural models. In M. H. Hassoun (Ed.), Associative neural memories: Theory and implementation (pp. 212–224). Oxford: Oxford University Press.
Kohonen, T. (1987). Self-organization and associative memory. Berlin: Springer-Verlag.
Liu, D., & Michel, A. (1994). Dynamical systems with saturation nonlinearities: Analysis and design. New York: Springer-Verlag.
Luenberger, D. (1979). Introduction to dynamic systems: Theory, models, and applications. New York: Wiley.
Lustig, I., Marsten, R., & Shanno, D. (1994). Interior point methods for linear programming: Computational state of the art. ORSA J. Computing, 6, 1–14.
Michel, A., & Farrell, J. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, 10(3), 6–17.
Michel, A., Farrell, J., & Porod, W. (1989). Qualitative analysis of neural networks. IEEE Trans. Circuits and Systems, 36, 229–243.
Michel, A., Si, J., & Yen, G. (1991). Analysis and synthesis of a class of discrete-time neural networks described on hypercubes. IEEE Trans. Neural Networks, 2, 32–46.
Nesterov, Y., & Nemirovsky, A. (1994). Interior-point polynomial methods in convex programming. Philadelphia: SIAM.
Perfetti, R. (1995). A synthesis procedure for brain-state-in-a-box neural networks. IEEE Trans. Neural Networks, 6, 1071–1080.
Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.
Xia, Y. (1996). A new neural network for solving linear programming problems and its application. IEEE Trans. Neural Networks, 7(2), 525–529.
Xia, Y., & Wang, J. (1995). Neural network for solving linear programming problems with bounded variables. IEEE Trans. Neural Networks, 6(2), 515–519.

Received June 10, 1998; accepted January 18, 1999.
Neural Computation, Volume 12, Number 3

CONTENTS

Article
Dynamics of Encoding in Neuron Populations: Some General Mathematical Features
    Bruce W. Knight  473

Notes
Local and Global Gating of Synaptic Plasticity
    Manuel A. Sánchez-Montañés, Paul F. M. J. Verschure, and Peter König  519
Nonlinear Autoassociation Is Not Equivalent to PCA
    Nathalie Japkowicz, Stephen José Hanson, and Mark A. Gluck  531

Letters
No Free Lunch for Noise Prediction
    Malik Magdon-Ismail  547
Self-Organization of Symmetry Networks: Transformation Invariance from the Spontaneous Symmetry-Breaking Mechanism
    Chris J. S. Webber  565
Geometric Analysis of Population Rhythms in Synaptically Coupled Neuronal Networks
    J. Rubin and D. Terman  597
An Accurate Measure of the Instantaneous Discharge Probability, with Application to Unitary Joint-Event Analysis
    Quentin Pauluis and Stuart N. Baker  647
Impact of Correlated Inputs on the Output of the Integrate-and-Fire Model
    Jianfeng Feng and David Brown  671
Weak, Stochastic Temporal Correlation of Large-Scale Synaptic Input Is a Major Determinant of Neuronal Bandwidth
    David M. Halliday  693
Second-Order Learning Algorithm with Squared Penalty Term
    Kazumi Saito and Ryohei Nakano  709
ARTICLE
Communicated by Carl van Vreeswijk
Dynamics of Encoding in Neuron Populations: Some General Mathematical Features Bruce W. Knight Laboratory of Biophysics, Rockefeller University, and Laboratory of Applied Mathematics, Mount Sinai Medical School, New York University,New York, NY 10021, U.S.A. The use of a population dynamics approach promises efcient simulation of large assemblages of neurons. Depending on the issues addressed and the degree of realism incorporated in the simulated neurons, a wide range of different population dynamics formulations can be appropriate. Here we present a common mathematical structure that these various formulations share and that implies dynamical behaviors that they have in common. This underlying structure serves as a guide toward efcient means of simulation. As an example, we derive the general population ring-rate frequency-response and show how it may be used effectively to address a broad range of interacting-population response and stability problems. A few specic cases will be worked out. A summary of this work appears at the end, before the appendix. 1 Population Dynamics: Preliminaries
A small patch of cortex may contain thousands of similar neurons, each communicating with hundreds or thousands of other neurons in that same patch or in other patches. By the very nature of this situation, an attempt to simulate it directly must be massive. A recently advanced and promising alternative is the simulation of the dynamical rules stating how the statistical status of the whole population evolves with time. Three recent expositions of this approach are Knight, Manin, and Sirovich (1996), Nykamp and Tranchina (1999), and Omurtag, Knight, and Sirovich (1999). Precursors with goals of a more analytic nature are Stein (1965), Johannesma (1969), Wilbur and Rinzel (1982), Kuramoto (1991), Abbott and van Vreeswijk (1993), and Gerstner (1995). Here we briefly give one example of the approach and then discuss the generality of its central features.

Models of single-neuron dynamics are typified by the equations of Hodgkin and Huxley (1952). A serviceable caricature of these equations, and also of more elaborate dynamical models (see the appendix), is the generalized integrate-and-fire equation:

dx/dt = f(x, s) + g(x, s),   s = s(t),   x = 1 resets to x = 0.    (1.1)

Neural Computation 12, 473–518 (2000)
© 2000 Massachusetts Institute of Technology
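Equation 1.1 can also be simulated directly for a large population, which is useful later as a check on the population-density results. The sketch below is an illustrative assumption, not the paper's own code: it takes the leaky drift f(x, s) = s − γx of equation 2.43 and a white-noise current of amplitude σ, propagates N neurons by the Euler-Maruyama method, and reads out the per-neuron firing rate as the fraction of neurons resetting per time step.

```python
import numpy as np

# Direct Euler-Maruyama simulation of equation 1.1 for N neurons.
# f(x, s) = s - gamma * x (the leaky choice of equation 2.43) and the
# white-noise amplitude sigma are illustrative assumptions.
def simulate_population(N=4000, T=4.0, dt=1e-3, s=1.5, gamma=1.0,
                        sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, N)              # initial voltages in [0, 1)
    rates = []
    for _ in range(int(T / dt)):
        x += (s - gamma * x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(N)
        fired = x >= 1.0                      # threshold crossings ...
        x[fired] = 0.0                        # ... reset x = 1 to x = 0
        x = np.maximum(x, 0.0)                # assumed reflecting floor at x = 0
        rates.append(fired.sum() / (N * dt))  # per-neuron firing rate r(t)
    return np.array(rates)

r = simulate_population()
# for these parameters the noiseless neuron fires at 1/ln 3, roughly 0.9,
# per unit time; noise shifts this somewhat
print(r[-500:].mean())
```

With s/γ > 1 the drift is everywhere positive, so every neuron fires repetitively and the population rate settles near the deterministic value.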
Here x may be regarded as a transmembrane voltage, which depends on the accumulation of a transmembrane current f(x, s) common to all neurons in a population; this current, in turn, depends on the membrane voltage x and also on a synaptic input signal s(t). Each neuron also feels a "noise current" (Tuckwell, 1988) g(x, s), which is statistically uncorrelated with the noise currents felt by other neurons. We may view the resetting of x = 1 to x = 0 as the occurrence of a nerve impulse. We will use equation 1.1 to illustrate particular cases of the general features.

The central structural features of the population approach emerge from the consideration of a single population of noninteracting neurons. The interesting further features that emerge, for interacting subpopulations, may be addressed with little further overhead in formulation. Our early discussion consequently concerns a population of noninteracting neurons. (Interactions among subpopulations and feedback interaction among the members of one population are discussed in sections 4 and 5.)

The neuron population is distributed over the voltage variable x, between 0 and 1, with a distribution function ρ(x, t), which integrates to unity and whose evolution in time must be determined by equation 1.1. The fraction of neurons whose voltage is more than a given value x changes with time in a way that depends on the probability current,

J(x, t) = f(x, s)ρ(x, t) − D(x, s) ∂ρ(x, t)/∂x
        = ( f(x, s) − D(x, s) ∂/∂x ) ρ(x, t),    (1.2)
which follows from equation 1.1. The first term, the advection term, states that the probability density is swept along by the common velocity. The second term, the diffusion term, states that neurons are more likely to be shaken (by fluctuations) from a voltage region where more of them reside to an adjacent region where they are less dense, than their converse likelihood of being shaken up the density gradient. Quantitatively, the diffusion coefficient D(x, s) must be derived from a more detailed specification of the noise g(x, s). Equation 1.2 is taken from Knight et al. (1996), where a brief derivation of D is given. Comparable expressions for current (specialized in different ways) are given by Abbott and van Vreeswijk (1993, equation 9.4), by Nykamp and Tranchina (1999, equations A4 and A16), and by Omurtag et al. (1999, equations 36 and 48). We note in equation 1.2 that the current is obtained from the probability density by a linear transformation on ρ(x, t). From the defining property of the current, we easily show that its change with x gives the conservation equation (or continuity equation) for the local rate at which density accumulates:

∂ρ/∂t = −∂J/∂x.    (1.3)
(In the present context, equation 1.3 was used by Knight et al., 1996, equation 4; Abbott & van Vreeswijk, 1993, equation 3.3; Nykamp & Tranchina, 1999, equation 10.) The linear transformation from ρ to J, equation 1.2, may be substituted into the conservation equation, 1.3, to obtain the dynamical equation,

∂ρ/∂t = (∂/∂x) ( −f(x, s) + D(x, s) ∂/∂x ) ρ    (1.4)
(Knight et al., 1996, equation 8; Abbott & van Vreeswijk, 1993, equation 9.4; Omurtag et al., 1999, equation 49), which describes how ρ(x, t) evolves with the passage of time. (Such a partial differential equation for probability density is technically a Fokker-Planck equation; an extensive reference is Risken, 1996.) Equation 1.4 may be integrated over time to obtain the statistical description of the population's evolution. This also yields J(x, t) by equation 1.2, and the current across threshold yields the per-neuron firing rate:

r(t) = J(1, t)    (1.5)
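Equations 1.3 through 1.5 can be integrated numerically by simple finite differences. The sketch below is illustrative (drift f(x, s) = s − x and constant D are assumptions, not the discretizations of the papers cited): the threshold flux J(1, t) is read out as the firing rate of equation 1.5 and re-injected at x = 0, which implements the reset.

```python
import numpy as np

# Explicit finite-difference integration of the Fokker-Planck equation 1.4
# on x in [0, 1].  The threshold flux J(1, t) of equation 1.5 is read out as
# the per-neuron firing rate and re-injected at x = 0 (the reset).
# f(x, s) = s - x and constant D are illustrative assumptions.
def evolve_density(M=100, T=6.0, dt=1e-4, s=1.5, D=0.005):
    dx = 1.0 / M
    x = (np.arange(M) + 0.5) * dx           # cell centers
    rho = np.ones(M)                        # uniform initial density, mass 1
    f = s - x                               # drift, positive on [0, 1]
    rates = []
    for _ in range(int(T / dt)):
        J = np.empty(M + 1)                 # current at the M + 1 cell faces
        J[1:-1] = f[:-1] * rho[:-1] - D * (rho[1:] - rho[:-1]) / dx
        J[-1] = f[-1] * rho[-1]             # threshold flux J(1, t)
        J[0] = J[-1]                        # reset: re-inject at x = 0
        rho -= dt * (J[1:] - J[:-1]) / dx   # continuity, equation 1.3
        rates.append(J[-1])
    return x, rho, np.array(rates)

x, rho, r = evolve_density()
print(r[-1])    # steady-state per-neuron firing rate
```

Because the out-flux at x = 1 exactly re-enters at x = 0, total probability is conserved to rounding error, and the rate relaxes toward the steady value of equation 2.37 below.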
(Abbott & van Vreeswijk, 1993, equation 9.9; Nykamp & Tranchina, 1999, equation 11; Omurtag et al., equation 39). We note from equation 1.5 that r(t) is linearly related to J(x, t), and so is also related linearly to ρ(x, t).

Particular neurons, or particular questions concerning their dynamical behavior, or other special circumstances may require a model more detailed, or differently expressed, than what followed from equation 1.1. For example, the Hodgkin-Huxley equations describe a neuron's momentary state in terms of four variables (v, m, h, n), including two (m, h) that determine the conductance of sodium channels and one (n) that governs potassium conductance. The population density in this case lives in a four-dimensional state-space, and the corresponding population current has four components that describe the flow of population density in the several coordinate directions in that space. Relay cells in the lateral geniculate nucleus (McCormick & Huguenard, 1992) reveal the presence of 10 conductance channel types, which specify a state-space with 18 dimensions. Some neurons have cell bodies that are not tightly coupled electrically with their extended dendrites, and such cells must be described by a state-space that has separate sets of dimensions for each of their multiple electrical compartments. Synapses that deliver finite increments of synaptic electric charge require a more detailed formulation (Nykamp & Tranchina, 1999; Omurtag et al., 1999). Synapses with a slow response dynamics add synapse-dynamics dimensions to the state-space. On the other hand (as briefly discussed in section 6.1), within a state-space the neuronal dynamics places a huge constraint on which probability-density configurations can actually be reached; in a specific case (Sirovich, Knight, & Omurtag, 1999c), this allows one to
replace the population density of equation 1.4 (where the density is a member of the infinite-dimensional space of functions that depend on x between 0 and 1) by a 17-entry vector, and allows the reduction of the differential operator of equation 1.4 to a 17 × 17 (s-dependent) matrix. The general structural features reviewed below carry over directly to such an efficient matrix representation of the population dynamics.

2 Population Dynamics: Common Structure
In all of the examples cited, we find the following common elements of structure:

s(t): an input signal.    (2.1)
ρ: a probability density.    (2.2)
In many of the above cases, ρ is a function defined over a state-space; the collection of such functions has the structure of a linear vector space (a weighted sum of two members yields another member). In this sense, in all of the above cases ρ may be regarded as a member of a linear vector space.

J: a probability current.    (2.3)
Since in many cases, such as the Hodgkin-Huxley case above, J has several components, convenience dictates that we regard J as in a space different from that which contains ρ. However, there are natural connections:

C(s): a current linear operator such that    (2.4)
Cρ = J.    (2.5)

In the case discussed above, equation 1.2 is a particular example of equation 2.5, and the linear operator that appears in equation 1.2 is the corresponding particular example of C(s).

K: a conservation operator such that    (2.6)
∂ρ/∂t = −KJ.    (2.7)
In the examples above where ρ is represented in a state-space, K is the familiar divergence operator of ordinary vector analysis, which in our one-dimensional example reduces equation 2.7 to equation 1.3.

Q(s) = −KC(s): a dynamical operator.    (2.8)
Substitution of equation 2.5 into 2.7 gives

∂ρ/∂t = Q(s(t))ρ: a dynamical equation for ρ.    (2.9)
Integration of the dynamical equation yields ρ at all times, and thus gives a statistical description of the evolving population. In our first example, equation 2.9 becomes 1.4.

r(t): the firing rate (per neuron) of the population.    (2.10)
R[J, s]: a firing-rate linear functional of J,    (2.11)
a linear transformation from J to the scalar firing rate r(t) such that

r = R[J],    (2.12)

which implies the linear relation,

r = R[Cρ],    (2.13)
that yields the momentary firing rate from the momentary population density. In our illustrative case above, equation 2.12 specializes to 1.5. Other cases can be more elaborate. For example, the firing rate of a population of Hodgkin-Huxley neurons is the rate at which individual members achieve the peak values of their action potentials. At the moment of maximum voltage, the right-hand side of the Hodgkin-Huxley voltage equation equals zero, and that constraint defines a 3-surface in the four-dimensional state-space. Integration of the flux of probability current across that surface yields the (per-neuron) firing rate, and that integral is linear in J, as equation 2.11 states. Because the value of an action potential's voltage maximum is influenced by the input current s, the location of the 3-surface is s-dependent, which leads to the general s-dependence of R[J, s] in equation 2.11.

For an additional view of the mathematical structure just presented, we may imagine ourselves endowed with a computer of mythical capability and ask how we might approach the dynamical equation, 2.9. For examples where ρ is naturally represented as a probability density supported by a state-space, we may subdivide that state-space into hypercubes, each sufficiently small to ensure whatever numerical accuracy we demand. If we do this, all the numbered items above may be cast in terms of finite matrices, and several structural features become more clearly visible. For example, we may observe that the dynamical equation, 2.9, describes what technically is a Markov process (Bailey, 1964), and this ensures the existence of some items we use below.
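As an illustration of this informal discretization, the sketch below builds the finite matrix Q(s) for a one-dimensional drift-diffusion model with reset (an assumed construction for illustration, not taken from the paper): drift moves probability to the next compartment, diffusion couples neighbours, and the flux leaving the last compartment (a firing) re-enters compartment 0.

```python
import numpy as np

# The informal discretization made explicit: [0, 1] is split into M
# compartments, and the operator Q(s) of equation 2.9 becomes an M x M
# matrix.  Drift f(x, s) = s - x (an illustrative choice) hops probability
# to the next compartment, diffusion D couples neighbours, and the flux
# leaving the last compartment (a firing) re-enters compartment 0 (reset).
def build_Q(M=100, s=1.5, D=0.005):
    dx = 1.0 / M
    x = (np.arange(M) + 0.5) * dx
    f = s - x                                 # positive drift rate
    Q = np.zeros((M, M))
    for k in range(M):
        Q[(k + 1) % M, k] += f[k] / dx        # advection; last bin resets to 0
        Q[k, k] -= f[k] / dx
        if k > 0:
            Q[k - 1, k] += D / dx**2; Q[k, k] -= D / dx**2
        if k < M - 1:
            Q[k + 1, k] += D / dx**2; Q[k, k] -= D / dx**2
    return Q

Q = build_Q()
# a Markov generator: off-diagonal rates are nonnegative, and every column
# sums to zero (probability conservation; compare equation 2.19 below)
print(np.abs(Q.sum(axis=0)).max())
```

The zero column sums encode probability conservation, and the nonnegative off-diagonal rates are what make equation 2.9, in this discretized form, a Markov process.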
Our informal discretization is introduced at this point not as an approximation but as an aid to uniform presentation. It leads us, by means of finite linear algebra, to some valuable general structural features that may also be obtained directly for each of the several different cases quoted above, but by means that must be customized anew for each case. In particular we note that for any specified input s, the dynamical operator Q of equation 2.9 has a set of eigenvectors φ_n, which it maps to eigenvalue multiples λ_n of themselves:

Qφ_n = λ_n φ_n.    (2.14)
All of these depend on the value of the input s:

Q = Q(s),   φ_n = φ_n(s),   λ_n = λ_n(s).    (2.15)
For time-independent s, equation 2.9 has a time-independent steady-state solution ρ_0(s) for which

Qρ_0 = 0;    (2.16)
and time independence need not be invoked to observe that this is a distinguished eigenvector of equation 2.14 with eigenvalue λ_0 = 0. This eigenvalue, of course, is independent of s in spite of equation 2.15. However, the zeroth eigenvector ρ_0(s) does depend on the input value s. The existence of a zero eigenvalue implies that all columns of the matrix Q have a common linear dependence expression satisfied by their elements. To determine this dependence we may exploit the conservation of probability. (Only in equations 2.17, 2.18, and 2.19 will subscripts refer to the compartments of our informal discretization.) Discretized, equation 2.9 is a column of equations, the jth of which reads:

dρ_j/dt = Σ_k Q_jk ρ_k.    (2.17)
The ρ_j's are probabilities that sum to unity, and so their sum has no time dependence. Thus, if equation 2.17 is summed on j, the left side is zero:

0 = Σ_k ( Σ_j Q_jk ) ρ_k.    (2.18)
Equation 2.18 is satisfied at every moment, including the moment that arbitrary initial data are imposed. Initially, we may choose any one of the entries ρ_k to be unity, and choose the rest to be zero, so that equation 2.18 implies

Σ_j Q_jk = 0:    (2.19)
every column sums to zero.

The discretized case has a natural inner product; if u and v are two column vectors, then

(v, u)    (2.20)
is the sum of products of the entries of v and u if those entries are real numbers; if the entries are complex numbers, we instead take the complex conjugates of the v entries before performing the bilinear sum. (This prescription endows each of our particular cases with an inner product, a bilinear integral over state-space, if we take the small hypercubes of our discretizing construction to their infinitesimal limit.) For any Q, the inner product induces

Q*: the adjoint operator to Q,    (2.21)

which, for any u, v, satisfies

(Q* v, u) = (v, Qu).    (2.22)
The entries of the matrix Q are real numbers, and its matrix transpose Q* satisfies equation 2.22. The determinant of a matrix has the same value as the determinant of its transpose, so the eigenvalues of Q, which satisfy

det(λI − Q) = 0,    (2.23)
are the same as those of its adjoint Q*. We may regard equation 2.23 as a polynomial in λ and must anticipate that it can have complex roots. However, λ is the only possibly complex item in equation 2.23, and so complex conjugation of that equation simply replaces λ by λ*. Thus we see λ* is also a root, and complex eigenvalues occur in conjugate pairs. The adjoint operator Q*(s) has eigenvectors φ̂_m(s), which satisfy

Q* φ̂_n = λ*_n φ̂_n.    (2.24)
Here (for convenience, which soon will be evident), we have chosen a labeling that associates the label n with the eigenvector whose eigenvalue is the complex conjugate of λ_n. We now observe that

λ_n (φ̂_m, φ_n) = (φ̂_m, λ_n φ_n) = (φ̂_m, Qφ_n) = (Q* φ̂_m, φ_n)
               = (λ*_m φ̂_m, φ_n) = (φ̂_m, λ_m φ_n) = λ_m (φ̂_m, φ_n).    (2.25)
Subtracting the last form from the first gives

(λ_n − λ_m)(φ̂_m, φ_n) = 0.    (2.26)
Since one factor or the other must vanish, we see that sets of eigenvectors, and sets of adjoint eigenvectors, which are associated with distinct eigenvalues, form biorthogonal sets. Linear algebra tells us that generically (that is, except for exceptional cases that are "unlikely" in a strict sense) there are as many distinct eigenvectors (and adjoint eigenvectors) as there are dimensions in the vector space. We may impose a normalization on these sets and get

(φ̂_m, φ_n) = δ_mn,   where δ_mn = 1 if m = n, and 0 otherwise.    (2.27)
Moreover, the dimension count says that the eigenvectors are complete in the sense that any vector v may be represented as a sum of eigenvectors:

v = Σ_n a_n φ_n,    (2.28)
where each a_m may be evaluated by observing that

(φ̂_m, v) = (φ̂_m, Σ_n a_n φ_n) = Σ_n a_n (φ̂_m, φ_n) = a_m,    (2.29)
where the last step followed from equation 2.27. An immediate first application is the following: suppose in the dynamical equation 2.9 that s has a fixed value and that for ρ(t), we have initial data ρ(0); then since equation 2.28 can represent any vector, we know we can represent ρ(t) as

ρ(t) = Σ_n a_n(t) φ_n,    (2.30)
where

a_n(0) = (φ̂_n, ρ(0)).    (2.31)
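The biorthogonal machinery of equations 2.24 through 2.31 is easy to verify numerically. In the discretized setting the adjoint eigenvectors can be read off from the inverse of the right-eigenvector matrix, which makes the normalization of equation 2.27 automatic. A sketch on a small random column-sum-zero generator (an illustrative matrix, not a neuron model):

```python
import numpy as np

# Numerical check of biorthogonality (equations 2.26-2.27) and of the
# expansion coefficients (equations 2.29 and 2.31) for a small random
# generator whose columns sum to zero (equation 2.19).
rng = np.random.default_rng(1)
M = 6
Q = rng.random((M, M))
np.fill_diagonal(Q, 0.0)
Q -= np.diag(Q.sum(axis=0))              # every column now sums to zero

lam, phi = np.linalg.eig(Q)              # right eigenvectors phi_n (columns)
phi_hat = np.linalg.inv(phi).conj().T    # adjoint set, already biorthonormal

# adjoint eigenvector property, equation 2.24
print(np.abs(Q.conj().T @ phi_hat - phi_hat * lam.conj()).max())  # ~ 0
# (phi_hat_m, phi_n) = delta_mn, equation 2.27
print(np.abs(phi_hat.conj().T @ phi - np.eye(M)).max())           # ~ 0

# expansion of an arbitrary probability vector, equations 2.28-2.31
rho_init = rng.random(M); rho_init /= rho_init.sum()
a = phi_hat.conj().T @ rho_init          # a_n(0) = (phi_hat_n, rho(0))
print(np.abs(phi @ a - rho_init).max())  # reconstruction error ~ 0
```

Taking the adjoint set from the matrix inverse sidesteps the eigenvalue-matching bookkeeping that equation 2.24 would otherwise require.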
The full solution to the dynamical equation, 2.9, is given by

a_n(t) = a_n(0) e^{λ_n t},    (2.32)
whence

ρ(t) = Σ_n a_n(0) e^{λ_n t} φ_n,    (2.33)
as substitution of equation 2.33 into 2.9 shows that the time differentiation, on the left of equation 2.9, produces on each term the same eigenvalue as does the action of Q on the corresponding eigenvector on the right. In general, if noise entered the formulation of the operator Q in equation 2.9 (and if s is constant), then an initial departure from equilibrium will die out, and with the exception of λ_0 = 0, all the other eigenvalues will have real parts that are negative. (This is a general property of Markov matrices; Bailey, 1964.) At long times only the n = 0 term in equation 2.33 will survive, giving (with equation 2.31)

ρ(t → ∞) = (φ̂_0, ρ(0)) ρ_0.    (2.34)
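A small numerical check of this expansion solution (again on a random column-sum-zero generator, an illustrative matrix rather than a neuron model): equation 2.33 is compared against brute-force time stepping of equation 2.9, and the long-time limit against equation 2.34.

```python
import numpy as np

# The eigenfunction solution, equations 2.30-2.33, checked against
# brute-force time stepping of equation 2.9, and the long-time limit
# against equation 2.34.  Q is a random column-sum-zero generator.
rng = np.random.default_rng(2)
M = 5
Q = rng.random((M, M))
np.fill_diagonal(Q, 0.0)
Q -= np.diag(Q.sum(axis=0))

lam, phi = np.linalg.eig(Q)
phi_hat = np.linalg.inv(phi).conj().T      # biorthonormal adjoint set

rho_init = rng.random(M); rho_init /= rho_init.sum()
a0 = phi_hat.conj().T @ rho_init           # a_n(0), equation 2.31

t = 0.5
rho_eig = (phi * (a0 * np.exp(lam * t))).sum(axis=1)   # equation 2.33

rho = rho_init.copy()                      # Euler integration of eq. 2.9
dt = 1e-5
for _ in range(int(t / dt)):
    rho = rho + dt * (Q @ rho)
print(np.abs(rho_eig - rho).max())         # small (Euler step error)

# at long times only the n = 0 term survives, equation 2.34
k0 = np.argmin(np.abs(lam))
rho_inf = (phi * (a0 * np.exp(lam * 200.0))).sum(axis=1)
print(np.abs(rho_inf - a0[k0] * phi[:, k0]).max())     # ~ 0
```

The decaying modes vanish at long times while the n = 0 term, weighted as in equation 2.34, survives unchanged.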
Clearly the zeroth adjoint eigenfunction is a valuable item to understand. We have seen in the discretized format that each matrix column of Q(s) must sum to zero. As Q*(s) is its transpose, every row of Q*(s) will sum to zero. By the rules of matrix action on a column vector, for a vector whose entries are all the same, the action of Q* will yield the zero vector because every row-sum vanishes. Thus, the column vector whose entries are all the same constant is the eigenvector φ̂_0(s) of Q*(s), which has eigenvalue zero. In fact it is not a function of s, but a vector ρ̂_0 that is universal for all s (just as is its eigenvalue zero). We have asserted both that the equilibrium probability function ρ_0(s) accumulates to a total probability of unity and that it is the eigenfunction of Q(s) with eigenvalue zero. We have also chosen the biorthonormal normalization, equation 2.27. In the particular case of m = n = 0, equation 2.27 becomes

(ρ̂_0, ρ_0(s)) = 1,    (2.35)
which implies that we choose the value 1 for the constant entry of ρ̂_0. The constant-entry property of ρ̂_0 shows a special feature of the eigenfunctions of Q(s). If, in the biorthogonality equation, 2.27, we choose m = 0 and n ≠ 0, we obtain

(ρ̂_0, φ_n(s)) = 0.    (2.36)
Because the entries of ρ̂_0 are constant, this says that the entries of φ_n sum to zero. Thus, in the eigenfunction expansion, equation 2.30, the vectors φ_n for n ≠ 0 do not contribute to the probability inventory, and we verify that the decay of those modes in equation 2.33 does not make probability vanish from the system. If the input s is constant, by equation 2.13 the steady-state firing rate is given by

r(s) = R[C(s)ρ_0(s), s],    (2.37)
which specifies the steady-state input-output relation for the neuron population. We note from equation 2.37 that already in the steady state, r responds to s in a manner that in general is nonlinear. Evidently equation 2.37 also predicts the population response to a time-dependent s(t) which changes slowly enough. Although the estimate is informal, it is fairly clear that "slowly enough" means that the characteristic change rate of s(t) should be small compared to the slowest decay rate determined by the eigenvalues in the exponents of equation 2.33; in mathematical terms,

(1/s(t)) ds(t)/dt ≪ |Re(λ_1(s(t)))|,    (2.38)
where the index 1 has been assigned to the nonzero eigenvalue whose absolute real part is smallest. The response to a step input also may be stated in exact closed form, because the population response is determined by the linear functional R of equation 2.13. Suppose that the population density has achieved an equilibrium distribution ρ_0(s_A) at input s_A and that the input signal is jumped to a new value, s_B. We set our time origin at the time of the jump. Equations 2.13, 2.31, and 2.33 give the time course of the response:

r(t) = Σ_n e^{λ_n(s_B) t} (φ̂_n(s_B), ρ_0(s_A)) R[C(s_B) φ_n(s_B), s_B]
     = R[C(s_B) ρ_0(s_B), s_B]
       + Σ_{n≠0} e^{λ_n(s_B) t} (φ̂_n(s_B), ρ_0(s_A)) R[C(s_B) φ_n(s_B), s_B],    (2.39)
where our special information on the zeroth eigenfunctions (equation 2.35) has been used in the second line to isolate the new steady state (compare the first term with equation 2.37). The remaining terms contribute a smooth transient to what would have been an abrupt jump if equation 2.37 had been recklessly used and 2.38 ignored. The jump response (equation 2.39) has been confirmed in the case of equation 1.4 by comparison to both a direct integration of that equation and a direct simulation involving 90,000 integrate-and-fire neurons (Knight, Omurtag, & Sirovich, 1999). The effect of our informal discretization above is to make the sum in equation 2.39 large and finite rather than infinite. This distinction is not of practical importance; Knight et al. (1999) show that the sum converges with a few terms.

An understanding of the eigenvalues clearly is valuable in the interpretation of results such as equation 2.33. While standard algorithms are conveniently available that essentially solve equation 2.23, there are several approximate procedures, and some informal arguments as well, that give valuable systematic information concerning the eigenvalues. A case that we may expect to occur frequently is that of a neuron model that incorporates
stochastic effects but quantitatively resembles a counterpart nonstochastic model, with the further feature that the steady input s_0 is fixed at a value where the nonstochastic counterpart fires repetitively on a stable limit cycle. A simple example is equation 1.1, with s_0 so chosen that f(x, s_0) is always positive and does not go to zero for any choice of x between 0 and 1; in the nonstochastic counterpart, the noise g is set to zero. In the more detailed example of the Hodgkin-Huxley equations, a steady input current in the most interesting range gives a limit cycle that is a one-dimensional closed curve in 4-space, around which a system point circulates at fixed frequency. Even in most neuron models with many electrical compartments, and hence a state-space with many dimensions, periodic firing under steady input implies a system point that pursues a closed one-dimensional limit-cycle curve embedded in that many-dimensional space. For the nonstochastic counterpart, we may construct a time-dependent probability density ρ(x, t) that is degenerate in the sense that it is nonzero only on the limit-cycle curve. But along that curve, at some initial moment a variable density of individual systems may be freely chosen. After one period, any cluster of neurons in the population will return to the same point from which they started, so that the probability density ρ(x, t) is periodic in time and may be represented as a Fourier series in time:

ρ(x, t) = Σ_n e^{inω_0 t} ρ_n(x),    (2.40)
where ω_0(s_0) is 2π times the fundamental firing frequency. We observe that equation 2.40 is entirely consistent with equation 2.33, with a set of eigenvalues,

λ_n(s) = inω_0(s),    (2.41)
that are pure imaginary and come in complex-conjugate pairs except for the value zero. Here the integer index ranges over −∞ < n < ∞. In equation 2.40, the ρ_n(x) may be regarded (aside from normalization) as the corresponding eigenfunctions. The stable limit cycle is an "attractor" in the sense that a system point near it is drawn to the limit cycle with the passage of time. The introduction of stochastic effects will superimpose a random walk on the trajectory of a system point, but the excursions of that random walk will be limited by the greater attraction exerted on trajectories further from the limit cycle. In consequence, stochastic effects will produce an equilibrium probability density that will have appreciable size in a region of state-space that we might describe as a "fattened limit cycle." We may expect that other eigenfunctions whose eigenvalues lie nearest zero (counterparts of the ρ_n(x) in 2.40, with integer n near zero) will be confined to this region as well. A probability density on that fattened limit cycle feels both advection and diffusion around the cycle. More detailed exploration suggests that with some generality, eigenvalues whose indices lie near zero approximate the form

λ_n(s) = −n² C(s) + inω_0(s),    (2.42)
where C(s) is positive. This form was found by Abbott and van Vreeswijk (1993, equation 9.15) for the case of equation 1.4 with f and D constant; by Knight et al. (1996, equation 36) for D small and constant and with

f(x, s) = s − cx    (2.43)
(the case of the "forgetful" or "leaky" integrate-and-fire encoder); and by Sirovich, Knight, and Omurtag (1999a) for a case with finite synaptic jumps (both analytic and numerical confirmation); further analytic evidence concerning the generality of this approximate eigenvalue rule, equation 2.42, is given in the appendix. (See also equation 3.24 and the comments following it.) The presence of diffusion tends to make the fundamental frequency ω_0 more rapid than it is in the nonstochastic counterpart case (Knight et al., 1996, equation 36).

Finally, there are neurons that, under steady input, cyclically fire bursts of nerve impulses. For such a neuron, the stable attractor is a two-torus embedded in the higher-dimensional state-space. The system motion exhibits two fundamental frequencies ("around and along the doughnut," which, respectively, characterize the mean impulse rate and the burst rate), and instead of equation 2.40, we will have a sum that features the various combination frequencies of the two fundamentals. A similar but more elaborate eigenvalue theory will follow from these considerations.

3 The Frequency Response
Much valuable structural information may be learned by the study of how the neuron population responds to an input that consists of a large, steady part plus a small, superimposed part that is sinusoidal in time. A valuable slight generalization is a superimposed part that was small throughout its past history and consists of a sinusoid multiplied by an exponentially growing envelope function.

The general strategy comes in two steps. The first step is to observe that a small time-variable addition to steady input will cause a small time-dependent addition to the steady value of any system variable that ultimately depends on the input. We will regard the relatively easy (though nonlinear) problem of determining the steady values of the system variables as already solved. Large, steady values we will give the index zero; small, time-varying ones will be given the index unity. In algebraic manipulations any product of terms with index unity will be dropped as small to second order. Following our numbered expressions of section 2, the input signal,

s = s_0 + s_1(t),    (3.1)
gives rise to the probability density,

ρ = ρ_0 + ρ_1(t),    (3.2)
and the probability current,

J = J_0 + J_1(t).    (3.3)
The current linear operator, through first order, is

C_0 + C_1 = C(s_0 + s_1(t)) = C(s_0) + s_1(t) C_s,    (3.4)
where the subscript s implies differentiation with respect to s of the s-dependent linear operator C. In the illustrative case of the operator C in equation 1.2 (also see the particular example, equation 2.43), both f and D are to be differentiated with respect to s. We note in equation 3.4 that the zeroth-order terms on the left and right sides are equal and may be subtracted, leaving only the evaluation of the first-order term (C_1 = s_1 C_s). Below, the same sort of steps will be taken repeatedly. The mapping from density to current, equation 2.5, now gives a first-order part,

C_0 ρ_1 + s_1 C_s ρ_0 = J_1,    (3.5)
so that the perturbed time-dependent current arises from two pieces, the second and less obvious of which depends directly on small changes in the input. The conservation operator K is not s-dependent; as it is linear, we have

K(J_0 + J_1) = KJ_0 + KJ_1,    (3.6)
and by equation 2.7, probability conservation at first order is

∂ρ_1/∂t = −KJ_1.    (3.7)
By equation 2.8, the first-order departure (from equilibrium) of the dynamical operator Q is

Q_1 = −KC_1 = −s_1 KC_s = s_1 Q_s,    (3.8)
where the last form establishes a more convenient notation. From equation 2.9 the first-order dynamical equation is

∂ρ_1/∂t = Q_0 ρ_1 + s_1 Q_s ρ_0.    (3.9)
The final term, which depends directly on the input, involves only known quantities and makes the equation linear inhomogeneous in the first-order probability density and also linear in the first-order input. In principle these observations provide the means to solve equation 3.9 analytically for ρ_1 (for example, by the construction of a Green's function solution). By equation 2.11 the firing rate satisfies

r_0 + r_1(t) = R[J_0, s_0] + R[J_1, s_0] + s_1 R_s[J_0, s_0].    (3.10)
With the two terms from the J_1 equation, 3.5, this gives the first-order response,

r_1 = R[C_0 ρ_1, s_0] + s_1 (R[C_s ρ_0, s_0] + R_s[C_0 ρ_0, s_0])
    = R[C_0 ρ_1] + s_1 R̂[ρ_0],    (3.11)
where the second line simplifies notation a bit and reminds us that the departure in firing rate can show a direct dependence on input, as well as a response through the departure in population density ρ_1.

The second step in constructing a frequency response is to assume that the small time-dependent input is sinusoidal:

s_1(t) = s_1(0) e^{iωt}.    (3.12)
Instead we at once make the slightly more general assumption that s_1 is a sinusoid multiplied by an exponentially growing envelope:

s_1(t) = s_1(0) e^{st},   where s = α + iω;    (3.13)
here α determines the growth rate of the growing envelope exp(αt). Because all of the first-order relations above are both linear and time invariant, the other first-order dynamical quantities generically will respond sympathetically with a similar time dependence:

J_1(t) = J_1(0) e^{st},   ρ_1(t) = ρ_1(0) e^{st},   r_1(t) = r_1(0) e^{st}.    (3.14)
Here the coefficients at time zero will depend on s and are to be determined by the first-order equations. Since s is generally complex, they will be complex (and, in fact, analytic functions of s). Their complex nature will indicate how their phases in the sinusoidal cycle differ from the phase of the input. Since equation 3.13 is a specialization of the general time-dependent input perturbation above, all the linear first-order equations remain intact. However, the dynamical equation, 3.9, reduces to a simpler form:

(sI − Q_0) ρ_1 = s_1 Q_s ρ_0,    (3.15)
and this equation has the formal solution

ρ_1 = s_1 (sI − Q_0)^{−1} Q_s ρ_0.    (3.16)
Now the right-hand expression Q_s ρ_0 is to be regarded as a vector, which can be represented as a sum of eigenvectors of the operator Q_0, as in equation 2.28. We further observe that

(sI − Q_0)^{−1} φ_n = (1/(s − λ_n)) φ_n.    (3.17)
(Multiply both sides by sI − Q_0 and use the eigenvector equation 2.14; equation 3.17 becomes an identity.) This enables us explicitly to solve equation 3.16. With equations 2.28 and 2.29, we have

Q_s ρ_0 = Σ_n (φ̂_n, Q_s ρ_0) φ_n,    (3.18)

which brings equation 3.16 to the form

ρ_1 = s_1 Σ_n (1/(s − λ_n)) (φ̂_n, Q_s ρ_0) φ_n.    (3.19)
If we let s = iω here, this expression tells us the frequency response of the population density to a small sinusoidal oscillation in input. If our eigenvalues are of the form expected commonly (see equation 2.42), we see that as we move the driving frequency ω, every time it passes a value nω_0, there will be a resonant response, and the eigenfunction φ_n will become prominent in the population density function. In equation 3.19 there is no resonance for s = 0 even though λ_0 = 0: the n = 0 term in the numerator vanishes, as we now show. By equation 2.16, the equilibrium distribution satisfies

0 = (∂/∂s)(Qρ_0) = Q_s ρ_0 + Q ∂ρ_0/∂s.    (3.20)
Thus the n = 0 numerator in equation 3.19 is

(ρ̂_0, Q_s ρ_0) = (ρ̂_0, −Q ∂ρ_0/∂s) = (Q* ρ̂_0, −∂ρ_0/∂s) = 0,    (3.21)
the last because ρ̂_0 is the zeroth eigenfunction of Q*. Consequently we can reduce the sum in equation 3.19 to one over n ≠ 0.

The population firing-rate frequency response, or transfer function, is the (typically complex) ratio of output departure to input departure:

T_0(s) = r_1 / s_1.    (3.22)
Because r_1 and s_1 are proportional to the same time course, equations 3.13 and 3.14, their ratio is time independent; because they are linearly related, their ratio is amplitude independent: the transfer function depends on input only through the value of s. If we substitute equation 3.19 for ρ_1 into 3.11 for r_1, we see that r_1 is indeed proportional to s_1, and the transfer function, equation 3.22, becomes

T_0(s) = R̂[ρ_0] + Σ_{n≠0} ( (φ̂_n, Q_s ρ_0) / (s − λ_n) ) R[C_0 φ_n].    (3.23)
We will loosely call this the population rate frequency response; more properly that is T_0(iω), for the more special case of purely oscillatory input. We see again, in the output response, equation 3.23, the appearance of resonance when imaginary s is brought close to a complex eigenvalue λ_n. The leading term R̂ of equation 3.23 is independent of frequency, and represents accurate tracking even at high frequencies. (However, in a detailed model whose state-space includes synaptic dynamics, the R̂ term goes to zero, as explicated by Gerstner, 1999, and its effect is taken over by large-eigenvalue terms in the sum.)

An example of the frequency response, equation 3.23, was evaluated by a quite different route some years ago (Knight, 1972a). In the case of equation 2.43 (the "forgetful integrate-and-fire" model), let the "capacitor discharge rate" c be small and also let stochastic effects be small. Then encoders, which have simultaneously reset to zero, will disperse only slightly before reaching threshold, and stochastic input current may be replaced by an effective stochastic threshold, which slightly disperses firing times with a small standard deviation τ. The frequency response in this model (Knight, 1972a, equation 8.22, adapted to present notation) is

T_0(s) = (ω_0 / 2πs_0) + Σ_{n≠0} c(ω_0 / 2πs_0) / ( s − { −n²ω_0³τ²/4π + inω_0 } ).    (3.24)
Comparison with equation 3.23 identifies the various terms and relates those of equation 3.23 to "back of the envelope" model parameters. Amplitude and phase dependence on $s = i\omega$ were in agreement with a population frequency response measured (and $\omega_0$ and $\tau$ also measured) in an experiment
Dynamics of Encoding
489
(Knight, 1972b). The form of the bracketed expression in the denominator of equation 3.24 is in agreement with the expected form of the eigenvalues in equation 2.42. (For this model, as $c$ and $\tau$ go to zero, the radian frequency is $\omega_0 = 2\pi s_0$.) For subsequent convenience, we notationally simplify equation 3.23 to

$$T_0(s) = \hat{R} + \sum_{n \neq 0} \frac{b_n}{s - \lambda_n}. \tag{3.25}$$
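Once the eigenvalues $\lambda_n$ and residues $b_n$ are on hand, equation 3.25 is immediate to tabulate. A minimal numerical sketch in Python (the pole locations and residues below are illustrative made-up values, not derived from any particular neuron model):

```python
import numpy as np

def transfer_function(s, R_hat, lambdas, residues):
    """Evaluate T0(s) = R_hat + sum over n of b_n / (s - lambda_n), equation 3.25."""
    s = np.asarray(s, dtype=complex)
    T = np.full(s.shape, R_hat, dtype=complex)
    for lam, b in zip(lambdas, residues):
        T = T + b / (s - lam)
    return T

# Illustrative damped-limit-cycle pair of eigenvalues, -0.5 +/- 2*pi*i
lambdas = [-0.5 + 2j * np.pi, -0.5 - 2j * np.pi]
residues = [0.3, 0.3]

omega = np.linspace(0.1, 20.0, 400)
gain = np.abs(transfer_function(1j * omega, 1.0, lambdas, residues))

# Resonance: the gain peaks where i*omega passes closest to lambda_1,
# i.e. near omega = Im(lambda_1) = 2*pi
omega_peak = omega[np.argmax(gain)]
```

The resonance discussed below equation 3.23 shows up directly: `gain` peaks near $\omega = \mathrm{Im}\,\lambda_1$, while far from the poles it settles toward the frequency-independent $\hat R$.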
For a multidimensional neuron model, the expedient evaluation of the various constants in equation 3.25 (defined in more detail in equation 3.23) remains a challenge for the future. However, experience to date (Omurtag et al., 1999; Knight et al., 1999; Sirovich et al., 1999) indicates that for one-dimensional neuron models, such evaluations are quite straightforward. In the first section of the appendix, a procedure is outlined to reduce multidimensional neuron models approximately to fairly realistic one-dimensional generalized integrate-and-fire counterparts. Also, in section 6.2 a method (so far not implemented) is outlined for the efficient direct numerical evaluation of the coefficients in equations 6.14, which express the population density dynamics in a "moving basis" representation. If these equations are specialized to the sinusoidal perturbation input of equations 3.1 and 3.12, equation 3.25 emerges together with a prescription for numerical evaluation of its various constants.

4 Applications of the Frequency Response: Stability of Interacting Subpopulations and of Feedback
The deliberations above are major tools for the investigation of how interacting subpopulations of neurons behave in consort. In time-independent equilibrium the kth subpopulation will satisfy a generalized form of the population-response relation, equation 2.37:

$$r_k(s_0) = R_k\left[ C_k(s_0)\, \rho_{k0}(s_0),\, s_0 \right]. \tag{4.1}$$

Here $s_0$ stands for a list of inputs that may be derived from external sources or from the outputs of other subpopulations. Equation 4.1 has been stated in a way that leaves room for differences not only in the detailed dynamics of each subpopulation but also in the way different input channels deliver synaptic signals to any particular subpopulation. Clearly the dynamical equation, 2.9, may be generalized in a manner similar to equation 4.1:

$$\frac{\partial \rho_k}{\partial t} = Q_k(s(t))\, \rho_k. \tag{4.2}$$
Again the subscript $k$ refers to the kth subpopulation. The procedures of section 3 may be applied to this more general case, to find the frequency
490
Bruce W. Knight
response of the kth population to the mth input, as in equation 3.22:

$$r_{k1} = T_{km}(s)\, s_{m1}. \tag{4.3}$$
(Looking back to equations 3.16 and 3.17, we note that the resonant eigenvalues in this expression sensibly depend on the kth dynamical equation but not on the mth input source.) We still have to specify how the response of one population is to affect the input to another. We will work through some informative simple cases, but initially let us leave this relation very general and write

$$s_m(t) = s_{(\mathrm{ext})m}(t) + u_m\left[ r(t'),\, t \right]. \tag{4.4}$$
The first right-hand term is to be regarded as specified external input from other neuron subpopulations, which can be as numerous as the populations under our direct investigation. The second term is a general nonlinear functional, where the prime on $t'$ indicates that this functional can look back across past times $t'$ at earlier outputs of all of the different subpopulations. Thus $u_m$ is general enough to include, for example, distributed propagation time lags from other subpopulations. In the special state of time-independent equilibrium, equation 4.4 reads

$$s_{m0} = s_{(\mathrm{ext})m0} + u_m[r_0], \tag{4.5}$$
where the $s_{(\mathrm{ext})m0}$ are given. These equations simply state how, through back coupling, the time-independent total inputs must depend in a specified way on the time-independent outputs. They are to be solved together with equation 4.1 for the two sets of constants $s_0$, $r_0$. Simple particular cases suggest particular simple strategies, but we note that one standard method for solving equations 4.5 and 4.1 is to replace $u_m$ by $\beta u_m$ and repeatedly solve while advancing $\beta$ from $\beta = 0$, by small steps, to $\beta = 1$. There remains the question of the stability of the feedback-coupled solution. The technical means for evaluating a transfer function, discussed in the previous section, are broadly applicable. In particular, if we are given a specified $u_m[r(t'), t]$, and if we assume that the nth input component has the time dependence

$$r_n(t) = r_{n0} + r_{n1}(t) = r_{n0} + r_{n1}(0)\, e^{st}, \tag{4.6}$$
with $r_{n1}$ small, and we further assume that the remaining components of $r(t)$ are at their equilibrium values, the general steps of the previous section yield

$$u_m\left[ r(t'),\, t \right] = u_m[r_0] + U_{mn}(s)\, r_{n1}, \tag{4.7}$$
and the transfer function $U_{mn}(s)$ is straightforward to evaluate in particular cases. If instead we assume that all the $r_n(t)$ are of the form of equation 4.6, then equation 4.7 generalizes to

$$u_m\left[ r(t'),\, t \right] = u_m[r_0] + \sum_n U_{mn}(s)\, r_{n1}, \tag{4.8}$$
which is a component of the matrix equation

$$u\left[ r(t'),\, t \right] = u[r_0] + U(s)\, r_1. \tag{4.9}$$
Again in matrix notation, if we assume that the external driving in equation 4.4 is of the form

$$s_{(\mathrm{ext})}(t) = s_{(\mathrm{ext})0} + s_{(\mathrm{ext})1}(0)\, e^{st}, \tag{4.10}$$
then the first-order part of the feedback equation, 4.4, becomes

$$s_1 = s_{(\mathrm{ext})1} + U(s)\, r_1. \tag{4.11}$$
The subpopulation input-output relation, equation 4.3, in matrix form becomes

$$r_1 = T_0(s)\, s_1. \tag{4.12}$$
If we now act on the previous equation with the population-response transfer function matrix of equation 4.12, we may express the first-order firing rates of all the subpopulations in terms of those of the external inputs:

$$r_1 = T_0(s)\, s_{(\mathrm{ext})1} + T_0(s)\, U(s)\, r_1. \tag{4.13}$$
This may be solved for the departures from equilibrium of the population rates by a matrix inversion:

$$r_1 = \left( I - T_0(s)\, U(s) \right)^{-1} T_0(s)\, s_{(\mathrm{ext})1}. \tag{4.14}$$
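As a concrete sketch of equation 4.14, take two subpopulations with made-up single-pole transfer functions and a made-up symmetric coupling matrix, evaluate at one purely oscillatory $s = i\omega$, and recover the modulated rates with a single linear solve:

```python
import numpy as np

s = 3.0j   # purely oscillatory input frequency, s = i*omega

# Illustrative diagonal transfer matrix: each subpopulation responds to its own
# total input through a single-pole transfer function (made-up poles/residues)
T0 = np.diag([0.4 / (s - (-0.5 + 2.0j)),
              0.3 / (s - (-0.8 + 4.0j))])

# Illustrative feedback coupling U(s) and external modulation of input 1 only
U = np.array([[0.10, 0.05],
              [0.05, 0.10]], dtype=complex)
s_ext1 = np.array([1.0, 0.0], dtype=complex)

# Equation 4.14: r1 = (I - T0 U)^(-1) T0 s_ext1, done as a linear solve
r1 = np.linalg.solve(np.eye(2) - T0 @ U, T0 @ s_ext1)
```

The off-diagonal entries of $U$ make both components of `r1` nonzero even though only the first external channel is modulated, which is the cross-talk between subpopulations.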
When this expression is written out in components, it tells us the transfer function that expresses the modulation of firing rate, near equilibrium, of any subpopulation in response to modulation of any one of the external inputs, and it includes the effects of cross-talk communicated between different subpopulations as well as the effect of any self-feedback. It is time to observe the connection between the analytic properties of the transfer function and the question of stability. In general, a modulated output $r_1$ is related to a modulated input $s_1$ by a (generally complex) ratio
$T(s)$ (as in equation 3.22), which allows us to find what input is required to achieve a specified output:

$$s_1 = (T(s))^{-1}\, r_1. \tag{4.15}$$
If we move $s$ on the complex plane, the required value of $s_1$ changes. In particular, if we move $s$ so that

$$(T(s))^{-1} = 0, \tag{4.16}$$
we have an output in the absence of input, which is to say that a root of equation 4.16 gives us a free-running behavior of our system, which has a time course proportional to $\exp(st)$. It may be technically simpler to observe that the free-running exponents of the system occur where the transfer function itself becomes infinite. A case in point is the population firing-rate response transfer function of equation 3.25, which goes to infinity at the eigenvalues $s = \lambda_n$. We note from equation 2.39 that the eigenvalues $\lambda_n$ are the free-running exponents of the system. We see from equation 4.14 that the free-running exponents for the coupled subpopulations are given by the values of $s$ where the inverse matrix goes to infinity, and these are the roots of the equation

$$\det\left( I - T_0(s)\, U(s) \right) = 0. \tag{4.17}$$
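For a quick illustration of equation 4.17, suppose every subpopulation shares one single-pole transfer function $T_0(s) = b/(s-\lambda)$ and $U$ is a constant matrix with eigenvalues $\mu_i$ (the pole, residue, and coupling matrix below are all made-up numbers). Then $\det(I - T_0(s)U) = \prod_i \left(1 - b\mu_i/(s-\lambda)\right)$, so the free-running exponents are $s_i = \lambda + b\mu_i$ and stability is read off from their real parts:

```python
import numpy as np

lam = -0.5 + 2.0j    # illustrative isolated-population eigenvalue
b = 0.3              # illustrative residue

# Illustrative constant coupling among three subpopulations
U = np.array([[0.5, 0.2, 0.1],
              [0.2, 0.5, 0.2],
              [0.1, 0.2, 0.5]])

mu = np.linalg.eigvals(U)
roots = lam + b * mu             # zeros of det(I - T0(s) U)

unstable = bool(np.any(roots.real > 0))
```

Here every root keeps a negative real part, so this particular equilibrium is stable; scaling $U$ up eventually pushes the root belonging to the largest $\mu_i$ across the imaginary axis.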
This expression is a polynomial in the known transfer functions of the various component pieces of the overall coupled system. If any root of equation 4.17 has a real part that is positive, then the equilibrium determined by equations 4.1 and 4.5 is unstable. In typical cases, the features of the specific problem may be exploited to find the informative roots of equation 4.17 by means more efficient than a direct determinantal expansion. The principal virtue of equation 4.17 is to show that, in general, there is a well-defined route to a well-defined goal that answers the question about stability.

5 Stability: Specific Cases

5.1 One Population, Self-Feedback with Fixed Lag. In this case we choose the feedback functional in equation 4.4 to be the linear relation

$$u\left[ r(t'),\, t \right] = g\, r(t - \hat{t}\,), \tag{5.1}$$
where $g$ is the gain from output to feedback input, and $\hat{t}$ is the fixed lag time, so that the feedback equation, 4.4, becomes

$$s(t) = s_{(\mathrm{ext})}(t) + g\, r(t - \hat{t}\,). \tag{5.2}$$
For time-independent equilibrium, equation 4.5 becomes

$$s_0 = s_{(\mathrm{ext})0} + g\, r_0. \tag{5.3}$$
This is to be solved together with the steady-state population-response equation, 2.37:

$$r_0 = R\left[ C(s_0)\, \rho_0(s_0),\, s_0 \right] \equiv R_0(s_0), \tag{5.4}$$
which is a nonlinear relation between $r_0$ and $s_0$. Thus the determination of $s_0$ in terms of $s_{(\mathrm{ext})0}$ cannot be done in a simple and exact closed form. But to understand the basic facts about the static behavior of our population response, we should already have on hand the evaluation of the function $R_0(s_0)$ over a range of $s_0$ values, and so we can solve equations 5.3 and 5.4 together to express the $s_0$ versus $s_{(\mathrm{ext})0}$ relation as

$$s_{(\mathrm{ext})0} = s_0 - g\, R_0(s_0). \tag{5.5}$$
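Given a tabulation of the static response $R_0(s_0)$, equation 5.5 is trivial to evaluate in the forward direction, and any stretch on which $s_{(\mathrm{ext})0}$ increases monotonically with $s_0$ can be inverted by interpolation. A minimal sketch, with a made-up saturating curve standing in for $R_0$ and an illustrative gain $g$:

```python
import numpy as np

g = 0.5  # illustrative feedback gain

def R0(s0):
    """Stand-in static population response (illustrative saturating curve)."""
    return np.tanh(s0)

s0_grid = np.linspace(0.0, 4.0, 401)
s_ext0 = s0_grid - g * R0(s0_grid)       # equation 5.5, forward direction

def s0_from_external(x):
    """Invert equation 5.5 by interpolation on a monotone stretch."""
    return np.interp(x, s_ext0, s0_grid)
```

For this choice $ds_{(\mathrm{ext})0}/ds_0 = 1 - g\,\mathrm{sech}^2 s_0 > 0$ everywhere, so the inversion is single-valued here; a steeper response or a larger $g$ would exclude part of the range.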
Values of $s_{(\mathrm{ext})0}$ that are outside the range of positive values given by equation 5.5 will not allow solution of the steady-state equations 5.3 and 5.4. We make the cautionary observation that in equation 5.5, any steady-state total input $s_0$ may be excluded from the solvable region by making the positive feedback $g$ strong enough. Clearly our discussion applies to the remaining broad range of cases where the time-independent solution exists. If we let

$$r(t) = r_0 + r_1(t) = r_0 + r_1(0)\, e^{st}, \tag{5.6}$$
our specialization equation, 5.1, becomes

$$u\left[ r_0 + r_1(t'),\, t \right] = g\, r_0 + g\, r_1(0)\, e^{s(t - \hat{t}\,)} = g\, r_0 + g\, e^{-s\hat{t}}\, r_1(t), \tag{5.7}$$
so in this case the transfer function of equation 4.7 is

$$U(s) = g\, e^{-s\hat{t}}. \tag{5.8}$$
The first-order part of equation 5.2 is

$$s_1 = s_{(\mathrm{ext})1} + U(s)\, r_1, \tag{5.9}$$
while from equation 3.22,

$$T_0(s)\, s_1 = r_1. \tag{5.10}$$
We multiply equation 5.9 by $T_0(s)$ and find

$$r_1 = \frac{T_0(s)}{1 - T_0(s)\, U(s)}\; s_{(\mathrm{ext})1} \tag{5.11}$$
for the population rate frequency response to external input. (Note that this is the $1 \times 1$ matrix case of the matrix equation, 4.14.) By equation 4.16, the free-running response exponents of the system will be those values of $s$ for which the denominator vanishes, so that

$$T_0(s)\, U(s) = 1, \tag{5.12}$$
or in our particular case (equation 5.8),

$$e^{-s\hat{t}}\, T_0(s) = \frac{1}{g}, \tag{5.13}$$
a transcendental equation to be solved for $s$, where $T_0(s)$ is the explicit expression (equation 3.25). A specialized case will show some qualitative features. In the simple case of self-feedback without delay ($\hat{t} = 0$), equation 5.13 becomes

$$T_0(s) = \frac{1}{g}. \tag{5.14}$$
If the feedback gain $g$ is small, then a root $s$ must make $T_0(s)$ large; this will happen if $s$ lies near a pole $\lambda_n$ of $T_0(s)$. As a first approximation we may truncate the sum in equation 3.25 to the term that is large because $s$ lies near its pole:

$$T_0(s) \approx \frac{b_n}{s - \lambda_n}, \tag{5.15}$$
and equation 5.14 may now be solved to give

$$s = \lambda_n + g\, b_n. \tag{5.16}$$
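The single-pole prediction 5.16 is easy to check against a direct numerical root of equation 5.14. In this sketch the two-pole transfer function, its residues, and the gain are all made-up illustrative numbers; a Newton iteration is seeded at the prediction itself:

```python
import numpy as np

# Illustrative complex-conjugate pole pair and residues; small illustrative gain
lam = np.array([-0.5 + 2.0j, -0.5 - 2.0j])
res = np.array([0.3 + 0.0j, 0.3 + 0.0j])
R_hat, g = 1.0, 0.05

def T0(s):
    return R_hat + np.sum(res / (s - lam))

def F(s):
    return T0(s) - 1.0 / g        # a zero of F solves equation 5.14

s_pred = lam[0] + g * res[0]      # single-pole prediction, equation 5.16

# Newton iteration with a numerical derivative, starting from the prediction
s = s_pred
for _ in range(40):
    h = 1e-7
    s = s - F(s) * h / (F(s + h) - F(s))
```

The converged root differs from the prediction only by the small contribution of the neglected pole and of $\hat R$, as the truncation argument anticipates.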
If $b_n$ is real and positive (as in the explicit example of $T_0(s)$ given by equation 3.24), then, for positive feedback $g$, equation 5.16 shows that the real part of $s$ is less negative than that of $\lambda_n$, and also that increasing $g$ moves the real part of $s$ in the positive direction, eventually past zero and beyond the threshold of instability. More generally, to find the border of instability, we may set the real part of $s$ to zero and take the real part of equation 5.16 to find the critical feedback value,

$$g_{\mathrm{crit}} = |\mathrm{Re}\,\lambda_n| \,/\, \mathrm{Re}\, b_n, \tag{5.17}$$
to within the approximation of equation 5.15. The system will be unstable if any root of equation 5.14 has a positive real part, so in equation 5.17 we must choose the index $n$ that makes $g_{\mathrm{crit}}$ smallest. Equation 2.42 suggests that commonly this will occur for $n = 1$. Returning to equation 5.13, for finite lag $\hat{t}$ the "small $g$-single pole" approximation, equation 5.15, still makes good sense; instead of equation 5.16, it now gives us the implicit relation

$$s = \lambda_n + e^{-s\hat{t}}\, b_n\, g. \tag{5.18}$$
Again let us observe that a critical value of $g$ yields a purely imaginary $s$ at the edge of instability. At that value take the real part of equation 5.18, and instead of equation 5.17 we find

$$g_{\mathrm{crit}}(\hat{t}\,) = \frac{1}{\cos\!\left(|s|\,\hat{t}\,\right)}\; \frac{|\mathrm{Re}\,\lambda_n|}{\mathrm{Re}\, b_n}, \tag{5.19}$$
and (because the cosine cannot exceed unity) the presence of the lag tends to stabilize positive feedback. Equation 5.18 may be used in this way if $g_{\mathrm{crit}}(\hat{t}\,)$ is still small, which should hold if $|\mathrm{Re}\,\lambda_n|$ is small, which should be the case if the influence of stochastic effects is modest.

5.2 One Population, Distributed Feedback Lags. Suppose that instead of a single fixed lag $\hat{t}$ as in equation 5.1, we have a distribution of lags weighted according to a distribution function $P(t')$, which integrates to unity. In place of equation 5.1, our feedback functional now is

$$u\left[ r(t') \right] = g \int_0^\infty dt'\, P(t')\, r(t - t'), \tag{5.20}$$
which is still linear in $r$, and where $g$ again is the feedback gain. Substitution of a constant $r_0$ in equation 5.20 retrieves exactly the same results as in the previous exercise, since $P(t')$ integrates to unity. If we assume a superimposed small modulation,

$$r_1(t) = r_1(0)\, e^{st}, \tag{5.21}$$
then equation 5.20 gives

$$u\left[ r_1(t'),\, t \right] = g \int_0^\infty dt'\, P(t')\, r_1(0)\, e^{s(t - t')} = g\, r_1(t) \int_0^\infty dt'\, e^{-st'}\, P(t') = U(s)\, r_1(t), \tag{5.22}$$
and the explicit feedback transfer function,

$$U(s) = g \int_0^\infty dt'\, e^{-st'}\, P(t') = g\, \hat{P}(s), \tag{5.23}$$
may be substituted into equation 5.12, whose roots give us the free-running frequencies. An example of such a situation is given by Abbott and van Vreeswijk (1993). If the feedback coupling $g$ is small enough, we may again use the "small $g$-single pole" approximation, equation 5.15. Since in this case $s$ lies near $\lambda_n$, to lowest order in $g$ we may replace $U(s)$ by $U(\lambda_n)$, and equation 5.12 may be solved for the free-running exponents,

$$s_n = \lambda_n + b_n\, U(\lambda_n) = \lambda_n + g\, b_n\, \hat{P}(\lambda_n). \tag{5.24}$$
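As a concrete check of equation 5.23, an exponential lag density $P(t') = e^{-t'/\tau}/\tau$ has the closed-form transform $\hat P(s) = 1/(1 + s\tau)$, and a direct quadrature reproduces it (the time constant and the eigenvalue below are illustrative):

```python
import numpy as np

tau = 2.0
lam_n = -0.2 + 1.5j     # illustrative eigenvalue, with Re(lam_n) > -1/tau

def P(t):
    """Exponential lag density, integrating to unity."""
    return np.exp(-t / tau) / tau

def P_hat(s, t_max=60.0, n=60001):
    """Trapezoidal quadrature of the transform in equation 5.23 (with g = 1)."""
    t = np.linspace(0.0, t_max, n)
    f = np.exp(-s * t) * P(t)
    dt = t[1] - t[0]
    return dt * (f.sum() - 0.5 * (f[0] + f[-1]))

closed_form = 1.0 / (1.0 + lam_n * tau)
```

The quadrature converges because $\mathrm{Re}\,\lambda_n > -1/\tau$; evaluated at $s = \lambda_n$, $g\hat P(\lambda_n)$ is exactly the shift coefficient that appears in equation 5.24.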
Here the first right-hand form makes no assumption concerning the nature of the feedback transfer function $U(s)$, except that its effect should be small enough to permit the single-pole approximation. In the second right-hand form, we may exploit the fact that $P(t')$ is positive in equation 5.23. If, further, $P(t')$ is concentrated around small values of $t'$, then for small integer $|n|$, equation 5.23 tells us that $U(\lambda_n)$ will have a real part that is positive. Equation 3.24 suggests (as do some detailed analytic arguments, as well as numerical evidence; Sirovich et al., 1999a) that the real part of $b_n$ (for small $|n|$) should be positive as well. Consequently the second right-hand form of equation 5.24 indicates that, again in this case, more positive feedback leads to real parts of the $s_n$ that are less negative, and so leads to more persistent free-running modes of motion. Again, the borderline of instability may be estimated. If we assume that $s_1$ is purely imaginary and take the real part of equation 5.24, we obtain

$$g_{\mathrm{crit}} = \frac{|\mathrm{Re}\,\lambda_1|}{\mathrm{Re}\left( b_1\, \hat{P}(\lambda_1) \right)}, \tag{5.25}$$
which should have quantitative accuracy if $g_{\mathrm{crit}}$ is small enough to fulfill the single-pole approximation.

5.3 A Multipopulation Generalization for Small Feedback. If feedbacks are small among interacting subpopulations of neurons, the free-running exponent equation, 4.17, has an instructive and valuable approximate analytic solution that can be obtained by the single-pole approximation. The determinant of a matrix vanishes if that matrix has a zero eigenvalue, and so equation 4.17 will be solved by any value $s$ that solves the matrix equation

$$\left( I - T_0(s)\, U(s) \right) \psi = 0. \tag{5.26}$$
A typical matrix element of $T_0(s)$ is

$$T_{km}(s) = \sum_{n \neq 0} \frac{b_n^{(km)}}{s - \lambda_n^{(k)}}, \tag{5.27}$$
where the row $k$ indexes the kth subpopulation and the column $m$ the mth input, while $n$ indexes the nth free-running mode of one subpopulation in isolation. (The fact that a subpopulation's free-running eigenvalues depend on which subpopulation, but not on which input, is reflected in equation 5.27.) There is a pole for every eigenvalue of every subpopulation. If the matrix elements of $U$ are small, then $T_0$ must have large elements to compensate, and so, in equation 5.27, $s$ must lie near some $\lambda_n^{(k)}$, let us say for the qth subpopulation, and all other poles may be ignored. The matrix elements (see equation 5.27) become

$$T_{km}(s) = \frac{b_n^{(qm)}}{s - \lambda_n^{(q)}} = \frac{v_m}{s - \lambda_n^{(q)}} \quad \text{if } k = q, \qquad = 0 \quad \text{otherwise}, \tag{5.28}$$
and the second right-hand form defines the elements of the specified vector:

$$v:\quad v_m = b_n^{(qm)}. \tag{5.29}$$
Let us specify another vector:

$$\hat{q}:\quad \hat{q}_k = 1 \quad \text{if } k = q, \qquad = 0 \quad \text{otherwise}. \tag{5.30}$$
Then (with the superscript $T$ for transpose) equation 5.28 may be expressed as the outer-product matrix,

$$T_0(s) = \frac{1}{s - \lambda_n^{(q)}}\; \hat{q}\, v^T. \tag{5.31}$$
If $s$ lies close to $\lambda_n^{(q)}$, then, in equation 5.26, to lowest order $U(s)$ may be replaced by $U(\lambda_n^{(q)})$, and equation 5.26 becomes

$$\psi - \frac{1}{s - \lambda_n^{(q)}}\left( \hat{q}\, v^T \right)\left( U\!\left(\lambda_n^{(q)}\right) \psi \right) = 0. \tag{5.32}$$
Performing the right-hand operations and multiplying by the denominator gives

$$\left( s - \lambda_n^{(q)} \right) \psi = \hat{q}\, \left( v,\, U\!\left(\lambda_n^{(q)}\right) \psi \right), \tag{5.33}$$
where the right-hand parenthesis is the scalar product of $v$ with $U\psi$. Equation 5.33 shows us that the components of the unknown vector $\psi$ are proportional to the components of the known vector $\hat{q}$, so the eigenvector of equation 5.26 is

$$\psi = \hat{q}, \tag{5.34}$$
and we may equate the coefficients of equation 5.33 to obtain the exponent $s$:

$$s = \lambda_n^{(q)} + \left( v,\, U\!\left(\lambda_n^{(q)}\right) \hat{q} \right) = \lambda_n^{(q)} + \sum_m b_n^{(qm)}\, U_{mq}\!\left(\lambda_n^{(q)}\right), \tag{5.35}$$
where equations 5.29 and 5.30 were used, in the last step, to evaluate the scalar product. Equation 5.35 tells us how far the exponent $s$ is moved from the nearby eigenvalue $\lambda_n^{(q)}$ by the interpopulation coupling $U_{mq}$. This subsection has shown how the single-population results of the previous subsection generalize naturally to a situation with interacting subpopulations. In a single small patch of cerebral cortex, we find subpopulations of neurons of different dynamical design, and we also find, in separated patches, subpopulations of neurons with the same dynamical design. This subsection was intended, together with the next subsection, as a guide to show how a model that is a caricature of cortex, with subpopulations that interact and have separate inputs, but are of only a single dynamical type, may be generalized to the more realistic situation, which involves several different dynamical types.

5.4 Caricature of an Orientation Hypercolumn. In primary visual cortex, nearby patches of neurons typically respond best to linear visual stimuli that have similar orientations. There is neural interaction between the patches, and this interaction is weaker between pairs of patches whose preferred orientations are more dissimilar. Approximately, patches for all preferred orientations are on an equal footing, with no preferred orientation exciting a larger subpopulation of neurons than others do or differing from others in design characteristics. A set of nearby patches that includes all orientations is referred to as a hypercolumn. If a hypercolumn contained neurons of only one dynamical type, and contained $N$ subpopulations with equally spaced preferred orientations, then the following caricature would be accurate:
A common dynamical equation,

$$\frac{\partial \rho_k}{\partial t} = Q(s_k(t))\, \rho_k, \tag{5.36}$$

for the kth subpopulation.
A common firing-rate rule,

$$r_k(t) = R\left[ C(s_k(t))\, \rho_k,\, s_k(t) \right]. \tag{5.37}$$
An even-handed feedback rule,

$$s_k(t) = s_{(\mathrm{ext})k}(t) + \sum_n \beta_n\, u\left[ r_{k-n}(t'),\, t \right]. \tag{5.38}$$
Here the summed index is understood to be modulo $N$, so no subpopulation has preference. Since $u$ has been left general, we may assume that the weightings $\beta_n$ sum to unity:

$$\sum_n \beta_n = 1. \tag{5.39}$$
Let us also reasonably assume that $\beta_n$ is always positive and is even ($\beta_n = \beta_{-n}$). For simplicity we assume that $u$ is a linear functional, as in the earlier subsections. If we assume a time-independent baseline external input that also is even-handedly independent of $k$, then it is consistent to assume that the resulting total inputs and the subpopulation responses are independent of $k$ as well, and we retrieve, for the $N$ subpopulations, the equilibrium equations 5.3, 5.4, and 5.5, which held for a single population with feedback. Proceeding as before with the small-departure analysis, from our earlier results we easily derive from equations 5.36 and 5.37 that

$$r_{k,1} = T_0(s)\, s_{k,1}, \tag{5.40}$$
while equation 5.38 gives us

$$s_{k,1} = s_{(\mathrm{ext})k,1} + U(s) \sum_n \beta_n\, r_{k-n,1}. \tag{5.41}$$
To relate population response to external input, we may multiply equation 5.41 by $T_0(s)$ and use equation 5.40:

$$r_{k,1} = T_0(s)\, s_{(\mathrm{ext})k,1} + T_0(s)\, U(s) \sum_n \beta_n\, r_{k-n,1}. \tag{5.42}$$
This is a special case of the matrix equation, 4.13. It is much simplified because the single population dynamics results in a population transfer function matrix that is a single scalar transfer function multiplying the identity matrix, while the feedback matrix is a single scalar function of $s$ that multiplies a discrete convolution. The presence of a discrete convolution suggests that a discrete-Fourier-transform strategy should yield major simplifications, and this is indeed
the case. In fact, discrete Fourier transformation of our nonlinear starting equations, 5.36–5.38, gives a reduction in numerical work by isolating a collection of "nonlinear dynamical modes," which can be much less than $N$ in number and also carry a valuable physiological interpretation. The stability question, which we address here, is expedited by a small bit of the discrete-Fourier machinery. We have seen that the free-running exponents of the coupled system may be found as the values of $s$ that produce a zero eigenvalue and so solve the matrix equation, 5.26. Corresponding to equation 5.42, that matrix eigenvalue equation specializes to

$$\psi_k - T_0(s)\, U(s) \sum_n \beta_n\, \psi_{k-n} = 0. \tag{5.43}$$
Because of the discrete convolution, we can anticipate that the eigenvector $\psi$ will be a discrete-Fourier basis vector:

$$\psi_k = e^{i\theta k}, \quad \text{where } \theta = 2\pi j / N, \tag{5.44}$$
and where $j$ can be any integer in the range from zero to $N - 1$. To verify this, substitute equation 5.44 into 5.43 and at once reduce that equation to

$$1 - T_0(s)\, U(s)\, \hat{\beta}(\theta) = 0, \tag{5.45}$$
where

$$\hat{\beta}(\theta) = \sum_n \beta_n\, e^{-i\theta n} \tag{5.46}$$
is the discrete Fourier transform of $\beta_n$. We note for $j = 0$ that

$$\hat{\beta}(0) = 1, \tag{5.47}$$
by equation 5.39. Because we chose the $\beta_n$ real, positive, and even in $n$, we see that $\hat{\beta}$ is a real number and

$$\hat{\beta}(\theta) \le 1. \tag{5.48}$$
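The bound 5.48 is immediate to verify numerically for any positive, even, normalized weighting. A sketch with an illustrative Gaussian-shaped coupling profile over $N = 16$ orientation patches (NumPy's FFT uses the same sign convention as equation 5.46):

```python
import numpy as np

N = 16
n = np.arange(N)
d = np.minimum(n, N - n)                  # circular distance, so beta_n = beta_{-n}
beta = np.exp(-0.5 * (d / 2.0) ** 2)      # illustrative positive, even weights
beta = beta / beta.sum()                  # equation 5.39: weights sum to unity

# Equation 5.46: beta_hat(theta_j) for theta_j = 2*pi*j/N
beta_hat = np.fft.fft(beta)
```

`beta_hat` comes out real (to rounding), equals 1 at $j = 0$, and stays strictly below 1 for every other $j$, so every nonuniform mode sees a weakened effective feedback.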
Thus equation 5.45, which is to be solved for $s$, is identical in form to equation 5.12, which was for a single-neuron population with the same population dynamics and the same feedback dynamics. For $\theta \neq 0$, the presence of $\hat{\beta}(\theta)$ in equation 5.45 is simply to reduce the strength of the effective feedback coupling according to equation 5.48. Thus, we anticipate that the
least stable exponent will go with $\theta = 0$, which gives us exactly the single-population case, equation 5.12, discussed earlier. Three further points should be made here. The first is that discrete subpopulations appeared in equations 5.36–5.38 in anticipation of numerical work to come, while a more realistic hypercolumn model would feature a continuous gradation in best-orientation sensitivity. But a free-running exponent equation, 5.45, has emerged as an effective single-population relation whose form is independent of how finely we discretize the system. We could have started equally well with a continuously graded system, and the same effective one-dimensional result would have followed. The simplification of equation 5.45 was a consequence of an underlying symmetry of the system. We could have started with the symmetry of the circle, which is what the problem gave to us, rather than reverting to the less natural symmetry of an N-sided regular polygon, which is what the discrete Fourier transform addresses. The second point is that we could have undertaken a more realistic hypercolumn model involving several different cell types, each with a population dynamics differing in its details. All the steps above would have their counterparts in this more realistic case. At the final step we would have, instead of the single equation 5.45, a matrix eigenvalue equation like 5.26. But the number of entries in the eigenvector would be not the large number of subpopulations but rather the small number of dynamically different cell types. Thus, the stability problem still is very tractable for this case. The third point is in a more speculative vein. Recent evidence (Everson et al., 1998) suggests that primary visual cortex has a single processing strategy and even-handed allocation of resources for the processing of patches of retinal image that differ not only in orientation but also in magnification scale over a fair range of magnifications, to reasonable approximation.
The combined two-parameter ("twist" and "zoom") symmetry has a set of two-parameter eigenvectors analogous to the "discrete twist" eigenvector (see equation 5.44). These mathematical constructs might not only assist more realistically detailed modeling of visual cortex but perhaps also might provide some new insight into the nature of cortical processing.

6 Possibilities for Efficient Simulation

6.1 Principal Components. Section 1 concluded with the mention of a means to reduce the dynamical equation, 2.9, and the firing-rate equation, 2.13, to matrix equations of fairly modest size. This may be achieved, once we have on hand a sample solution of equation 2.9 in response to a span of "typical input" $s(t)$, by principal component analysis (Sirovich & Everson, 1992) (also called "singular value decomposition" or "representation by empirical eigenfunctions"), which now will be briefly described. For reasonable input $s(t)$, some dynamical equations of the form 2.9 prove to be almost unable to carry the population density function $\rho(t)$ outside a subspace of modest dimension that is embedded in its full function space. If at a time $t^*$ we have a snapshot $\rho(t^*)$ of the density function, we may create from it the singular operator

$$P_{t^*} = \rho(t^*)\, \rho(t^*)^T, \tag{6.1}$$
which is a projection operator in the sense that its action on any vector $u$ gives

$$P_{t^*}\, u = \rho(t^*)\, \left( \rho(t^*),\, u \right), \tag{6.2}$$
which is a vector in the one-dimensional manifold defined by the snapshot vector $\rho(t^*)$. From equation 6.1 we immediately show that $P_{t^*}$ is self-adjoint. Because

$$P_{t^*}\, \rho(t^*) = \rho(t^*)\, \left( \rho(t^*),\, \rho(t^*) \right), \tag{6.3}$$
it has only nonnegative eigenvalues and is positive-semidefinite. If we have a collection of snapshots, taken at times $t^*$, we may form the operator

$$A = \sum_{t^*} P_{t^*}, \tag{6.4}$$
which, according to equation 6.2, maps any vector into the modest-dimensional manifold accessible to the dynamical equation. The self-adjoint property and the positive-semidefinite property are inherited by $A$, and this ensures a set of orthonormal eigenvectors in the eigenvalue problem

$$A\, \phi_n = \lambda_n\, \phi_n, \tag{6.5}$$
whose eigenvectors we can number in the order of their decreasing nonnegative eigenvalues. Because the eigenvectors are orthonormal, we have

$$\left( \phi_m,\, \phi_n \right) = \delta_{mn}. \tag{6.6}$$
Because the effectively nonnull subspace of $A$ is of modest dimension, to good approximation we may span it with a modest subset of the eigenvectors of equation 6.5. Now we can expand a solution $\rho(t)$ as

$$\rho(t) = \sum_n a_n(t)\, \phi_n, \tag{6.7}$$
where the sum extends over a modest number of terms (17 terms in the quoted example). This we now can substitute into the dynamical equation, 2.9. We take the inner product of that equation with $\phi_m$ and (employing equation 6.6 on the left-hand side) directly conclude that

$$\frac{da_m}{dt} = \sum_n \left( \phi_m,\, Q(s(t))\, \phi_n \right) a_n, \tag{6.8}$$
which may be integrated in time for the $a_m(t)$. With these $a_m(t)$ on hand, substitution of equation 6.7 into the firing-rate equation, 2.13, explicitly evaluates the firing rate as

$$r = \sum_n R[C\phi_n]\, a_n. \tag{6.9}$$
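The recipe of this subsection can be sketched end to end on a toy problem. Here $Q$ is a made-up drift-plus-diffusion generator on a ring of density bins, standing in for the operator of equation 2.9 (no particular neuron model is intended): simulate, collect snapshots, form $A$, keep the leading eigenvectors, and project $Q$:

```python
import numpy as np

# Illustrative evolution operator: drift plus diffusion on a ring of D bins.
# Columns sum to zero, so total probability is conserved. This is a stand-in
# for the operator Q of equation 2.9, not any specific neuron model.
D = 40
Q = np.zeros((D, D))
for i in range(D):
    Q[(i + 1) % D, i] += 1.0    # forward hop out of bin i
    Q[(i - 1) % D, i] += 0.5    # backward hop out of bin i
    Q[i, i] -= 1.5              # balancing loss term

def evolve(rho, dt=0.01, steps=1):
    """Forward-Euler integration of d(rho)/dt = Q rho."""
    for _ in range(steps):
        rho = rho + dt * (Q @ rho)
    return rho

# Burn in from a delta function, then collect snapshots as the rows of S
rho = np.zeros(D)
rho[0] = 1.0
rho = evolve(rho, steps=400)
snaps = []
for _ in range(200):
    rho = evolve(rho, steps=5)
    snaps.append(rho)
S = np.array(snaps)

# A = sum over snapshots of rho rho^T (equation 6.4); S.T @ S builds that sum
A = S.T @ S
evals, evecs = np.linalg.eigh(A)
k = 12
phi = evecs[:, ::-1][:, :k]     # orthonormal modes, largest eigenvalues first

# Reduced dynamics matrix (phi_m, Q phi_n), as in equation 6.8
Q_red = phi.T @ Q @ phi
```

On this toy problem a dozen modes already reconstruct the snapshots to within a few percent, which is the sense in which the trajectory stays in a subspace of modest dimension.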
We note that the $a_n$ may be regarded as the components of a vector, and this endows equations 6.8 and 6.9 with an inherited structure identical to that of equations 2.9 and 2.13. Equation 6.9 is an explicitly specified linear functional of its input vector. Equation 6.8 expresses the evolution of that vector in terms of the action on it of an $s$-dependent operator, here an $s$-dependent matrix. If the operator in equation 2.9 can be expressed as a polynomial in $s$ (as, for example, by substituting equation 2.43 into equation 1.4), then the matrix in equation 6.8 can be expressed as a corresponding polynomial in which the same set of powers of $s$ multiplies corresponding constant component matrices. These considerations translate directly into a practical and efficient computational scheme. However, this procedure requires caution at one point: the operator $Q$ has conservation of probability built into it, but its representation by a truncated matrix in equation 6.8 may slightly miss perfect conservation because of the truncation; the formerly zero eigenvalue may be slightly perturbed, and this will give to the total probability a spurious slow exponential drift as time progresses. Conservation of probability is easily restored to the matrix of equation 6.8. In discretized format, conservation for the original matrix $Q$ was assured by the fact that every column of $Q$ was orthogonal to the zeroth adjoint eigenvector $\hat\rho_0$, which had constant entries. In the present representation the nth component of this vector is given by

$$\hat{a}_n = \left( \phi_n,\, \hat\rho_0 \right). \tag{6.10}$$
Without truncation, the admixture of this vector in each column of the matrix in equation 6.8 would be zero. Let us inner-product the vector of equation 6.10 with a column of the truncated matrix, to obtain the (very small) coefficient of actual admixture; we then multiply the vector (equation 6.10) by this admixture coefficient and subtract the result from the matrix column.
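The restoration step just described is one line of linear algebra: remove from each column of the truncated matrix its small admixture of the vector in equation 6.10. In this sketch the reduced matrix and the vector $\hat a$ are random stand-ins, with $\hat a$ normalized to unit length so that the inner product is the admixture coefficient directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for a truncated reduced matrix M (equation 6.8) and for the
# vector a_hat of equation 6.10, normalized to unit length.
k = 6
M = rng.normal(size=(k, k))
a_hat = rng.normal(size=k)
a_hat = a_hat / np.linalg.norm(a_hat)

# (a_hat, column) is the admixture coefficient of each column; subtracting
# a_hat times that coefficient leaves every column orthogonal to a_hat.
M_fixed = M - np.outer(a_hat, a_hat @ M)
```

After the correction, $\hat a$ is an exact left null vector of the matrix, so the quantity it measures is conserved and the spurious drift of total probability is gone.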
Conservation of probability is now fulfilled. In cases tested (Sirovich et al., 1999c), the resulting matrix applied in equations 6.8 and 6.9 gives excellent agreement with the much longer direct calculation of equations 2.9 and 2.13. The choice of 17 dimensions rendered graphical results indistinguishable from those of the longer direct calculation. (This was not quite achieved with 15 dimensions. The dimension count jumps by twos because the natural subspaces relate to consecutive pairs of complex conjugate eigenvalues, as the next section will elucidate.)

6.2 Moving Basis. We conclude with a brief comment on a set of possibilities that might lead to extremely efficient simulation and give an additional viewpoint on mathematical structure. Once the efficient principal components representation has been constructed, it is not a heavy computation to solve the eigenvalue problem and tabulate the results, parametrically in the input $s$, for the eigenvalues and eigenfunctions that are within the subspace that the dynamical equation is able to access. In cases examined, these have proved to be in good practical agreement with their counterparts derived from the more laborious direct calculation (Sirovich et al., 1999c). Thus, it is reasonable to envision a representation of the dynamics expressed in terms of the input-dependent dynamical eigenvectors of equations 2.14 and 2.15:

$$\rho(t) = \sum_n \tilde{a}_n(t)\, \phi_n(s(t)). \tag{6.11}$$
This is a representation in terms of a "moving basis" that at each time $t$ remains optimized for the moving present value of the input $s(t)$. (The overhead mark is to distinguish these coefficients from those of equation 6.7.) Effective convergence thus should be achieved by a sum of fewer terms in equation 6.11 than were required in equation 6.7, where the fixed basis was a compromise that had to span the first several eigenvectors of $Q(s)$ over the entire range of $s(t)$. Once the $\tilde{a}_n(t)$ in equation 6.11 have been determined, the population firing rate is given by the short sum

$$r = \sum_n \tilde{a}_n\, R\left[ C(s)\, \phi_n(s),\, s \right] = \sum_n \hat{R}_n(s)\, \tilde{a}_n(t). \tag{6.12}$$
To determine the ã_n(t), we may work from the differential equations that they satisfy. To do this we may inner-product ρ(t) in equation 6.11 with the adjoint eigenvector φ̂_m(s), to obtain

ã_m(t) = ( φ̂_m(s(t)), ρ(t) ).    (6.13)
Dynamics of Encoding
Differentiation now yields

dã_m/dt = ( φ̂_m, ∂ρ/∂t ) + ( ∂φ̂_m/∂s, ρ ) ds/dt
        = ( Q^* φ̂_m, ρ ) + ( ∂φ̂_m/∂s, ρ ) ds/dt
        = Σ_n { λ_m ( φ̂_m, φ_n ) + ( ∂φ̂_m/∂s, φ_n ) ds/dt } ã_n
        = λ_m(s(t)) ã_m + (ds/dt) Σ_n M_mn(s(t)) ã_n.    (6.14)

In the second line we used the dynamical equation and the adjoint operator relation; in the third we substituted equation 6.11 for ρ; in the fourth line we abbreviated by defining the matrix

M_mn(s) = ( ∂φ̂_m(s)/∂s, φ_n(s) )    (6.15)
(whose m = 0 entries are all zero, because of the constant-entry property of ρ̂_0, as discussed at equations 2.35 and 2.36). The fourth line of equation 6.14 gives an explicit system of ordinary differential equations to be integrated for the ã_m(t). A check on equations 6.14 is that if s is constant, they may be solved analytically, and their solution is equation 2.32. Because for m ≠ 0 the eigenvectors φ_m do not carry any probability (see the discussion at equation 2.36), we must have

ã_0(t) ≡ 1    (6.16)

(by equation 6.11), and this is consistent with equation 6.14 even when s(t) is time dependent, since, when m = 0, both λ_m and M_mn are zero. If s(t) changes very slowly (see equation 2.38), then ρ(t) will pass through a progression of steady states, and we will have

ρ(t) = ρ_0(s(t)).    (6.17)
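The fourth line of equation 6.14 lends itself to direct numerical integration. Below is a minimal sketch (not the paper's code) of forward-integrating the moving-basis coefficients for a toy three-mode system; the eigenvalue function `lam`, the coupling matrix `M`, and the input `s_of_t` are hypothetical stand-ins for the tabulated quantities λ_m(s) and M_mn(s):

```python
import numpy as np

def integrate_moving_basis(lam, M, s_of_t, a0, t_grid):
    """Forward-Euler integration of the moving-basis ODEs (eq. 6.14):
        d a_m/dt = lam_m(s(t)) a_m + (ds/dt) * sum_n M[m,n](s(t)) a_n
    lam(s) -> complex vector, M(s) -> matrix; both are illustrative
    stand-ins for the tabulated eigenvalues and coupling matrix."""
    a = np.asarray(a0, dtype=complex).copy()
    out = [a.copy()]
    for k in range(len(t_grid) - 1):
        t, dt = t_grid[k], t_grid[k + 1] - t_grid[k]
        s = s_of_t(t)
        ds_dt = (s_of_t(t + dt) - s) / dt          # finite-difference ds/dt
        a = a + dt * (lam(s) * a + ds_dt * (M(s) @ a))
        out.append(a.copy())
    return np.array(out)

# Toy spectrum: lambda_0 = 0 (probability conservation) plus a damped pair.
lam = lambda s: np.array([0.0, -1.0 + 2j * s, -1.0 - 2j * s])
# M has an all-zero first row, so a_0(t) stays identically 1 (eq. 6.16).
M = lambda s: np.array([[0, 0, 0], [0.3, 0, 0.1], [0.3, 0.1, 0]], dtype=complex)
s_of_t = lambda t: 1.0 + 0.2 * np.sin(t)

t = np.linspace(0.0, 5.0, 2001)
traj = integrate_moving_basis(lam, M, s_of_t, [1.0, 0.0, 0.0], t)
print(abs(traj[-1, 0] - 1.0) < 1e-9)   # a_0 conserved
```

Because λ_0 = 0 and the m = 0 row of M vanishes, the coefficient ã_0 stays pinned at 1 throughout the integration, which serves as a conservation-of-probability check on the scheme.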
The remaining members of equation 6.14 furnish a sequence of corrections to this result. We can estimate how many terms we will need in the sums in equations 6.11 and 6.14. If our eigenvalues are of the "damped limit-cycle" form (see equation 2.42), then equations 6.14 are those for a sequence of "tuned filters," each tuned to a radian frequency given by the imaginary part of the eigenvalue, and each driven by the right-hand-most term in the equation. We see that the pair of equations for m and −m are needed to furnish the part of the response spectrum in a frequency range near m times the limit-cycle frequency.

Bruce W. Knight

If we wish to compute an output response that is accurate through frequency components that are N times as high as the highest single-neuron firing rate, then the estimated number of terms in the sums in the firing-rate and dynamical equations 6.12 and 6.14 will be 2N + 1. Thus, for example, seven terms should yield a computed output with a short-term-estimated spectrum that remains accurate through frequencies about three times the momentary single-neuron firing rate. Direct application of the moving basis, as just discussed, to a simulation in which the value of the input s(t) passes from the limit-cycle regime of an individual neuron to the regime of the stable equilibrium point will encounter a technical nuisance. (This situation will arise in the simulation of primary visual cortex.) In the region of passage between the two regimes, each pair of early eigenvalues λ_{±m}(s), at some critical value of s, will merge by having their imaginary parts move to zero, and further change in s will give rise to two negative real eigenvalues that correspond to a faster and a slower pure relaxation. The members of equation 6.14 for ±m together may be written

dã_{+m}/dt = ( λ_{+m}·ã_{+m} + 0·ã_{−m} ) + (ds/dt) L_{+m}(ã_+, ã_−, s)
dã_{−m}/dt = ( 0·ã_{+m} + λ_{−m}·ã_{−m} ) + (ds/dt) L_{−m}(ã_+, ã_−, s)    (6.18)
to emphasize that here we are dealing with a 2 × 2 diagonal matrix. The L_± terms are the remaining linear sums in equation 6.14. Clearly we can pursue equation 6.18 into the pure-relaxation regime by assigning +m and −m, respectively, to the less and to the more negative real eigenvalues there. But the transition is not smooth at the merger value, as the motion of each eigenvalue (for example) on the complex plane takes a right-angle turn at that value of s. The two invariants of the matrix, its determinant Δ and its trace T, are related to the eigenvalues by

Δ_m(s) = λ_{+m}(s) · λ_{−m}(s)    (6.19)
T_m(s) = λ_{+m}(s) + λ_{−m}(s)    (6.20)
λ_{±m}(s) = (1/2) { T_m(s) ± sqrt( (T_m(s))^2 − 4Δ_m(s) ) }.    (6.21)
The critical condition occurs at the value of s where the argument of the root passes through zero and then changes sign. We observe that the right-angle-turn behavior of the eigenvalues is not shared by either of the two invariants, and this indicates that the passage may be made smooth by a linear transformation upon equation 6.18 that carries it to the form

dx_m/dt = 0·x_m + 1·y_m + (ds/dt) X_m(x, y, s)    (6.22)
dy_m/dt = −Δ_m·x_m + T_m·y_m + (ds/dt) Y_m(x, y, s)    (6.23)
whose 2 × 2 matrix has the same determinant and trace as in equation 6.18. Equation 6.22 may be solved for y_m and substituted into equation 6.23 to remove explicit dependence on y_m from that equation's right-hand side. An additional differentiation of equation 6.22 then carries the form of the dynamics to that of the fairly familiar damped and forced harmonic oscillator. The critical value of s corresponds to critical damping in the oscillator, which does not introduce singular behavior into the oscillator's forced motion. The moving-basis equations in the real-component form, 6.22 and 6.23, may be forward integrated across the critical damping condition without encountering embarrassment. The introduction of the moving basis in terms of sharpening the results of principal component analysis at the start of this subsection was convenient but inessential. The evaluated pieces that go into equations 6.12 and 6.14, which are R̂_n(s), λ_m(s), and M_mn(s), are all to be evaluated at constant input s. If state-space is divided up into small compartments, then the only compartments visited by a neuron at constant s will be those that lie within a "fattened limit cycle" or, in the case of an s so chosen that the neuron without noise would be quiescent, will lie analogously within the "stochastically fattened neighborhood of a stable equilibrium point." These small-compartment subsets of the state-space offer a numerically tractable eigenvalue problem and direct evaluation of the pieces just mentioned. The standard sinusoidal small-perturbation assumption, as stated in equations 3.1 and 3.12, can be substituted into the moving-basis dynamical equations 6.14. Since equation 6.14 contains no prior approximation, the transfer function (see equation 3.25) must follow, and does so with a little thoughtful manipulation.
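A small numerical sketch of this point (assumed illustrative parameters, not the paper's): `eigen_pair` recovers λ_± from the two smooth invariants as in equation 6.21, and `integrate_real_form` steps the real-component system 6.22-6.23 with hypothetical T(s), Δ(s), and forcing, passing through the critical-damping value of s without incident:

```python
import numpy as np

def eigen_pair(T, Delta):
    """Eq. 6.21: eigenvalues from the two smooth invariants (complex sqrt)."""
    root = np.sqrt(complex(T * T - 4.0 * Delta))
    return 0.5 * (T + root), 0.5 * (T - root)

def integrate_real_form(T_of_s, D_of_s, s_of_t, forcing, t_grid):
    """Forward-Euler for eqs. 6.22-6.23; the 2x2 companion-form matrix
    [[0, 1], [-Delta, T]] (same trace and determinant as eq. 6.18)
    stays smooth through critical damping."""
    x, y = 0.0, 0.0
    out = []
    for k in range(len(t_grid) - 1):
        t, dt = t_grid[k], t_grid[k + 1] - t_grid[k]
        s = s_of_t(t)
        ds_dt = (s_of_t(t + dt) - s) / dt
        dx = y + ds_dt * forcing(t)               # eq. 6.22 (X_m stand-in)
        dy = -D_of_s(s) * x + T_of_s(s) * y       # eq. 6.23 (Y_m omitted)
        x, y = x + dt * dx, y + dt * dy
        out.append((x, y))
    return np.array(out)

# s ramps so that T^2 - 4*Delta changes sign: complex pair -> two real roots.
T_of_s = lambda s: -2.0
D_of_s = lambda s: 2.0 - s                 # critical damping at s = 1
lp, lm = eigen_pair(-2.0, 2.0 - 0.5)       # s = 0.5: a complex pair
traj = integrate_real_form(T_of_s, D_of_s, lambda t: t / 5.0,
                           lambda t: 1.0, np.linspace(0.0, 10.0, 4001))
print(np.isfinite(traj).all())             # no singularity at the merger
```

The integration crosses the eigenvalue-merger value of s without any special handling, which is the practical content of recasting equation 6.18 in the invariant-based form 6.22-6.23.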
Hence the numerically tractable eigenvalue problem will in principle lead to a similarly tractable numerical evaluation of the various constants that appear in equation 3.25. Equations 6.12 and 6.14 summarize the dynamics of a neuron population in a sort of compact canonical form, in the sense that, in principle, two dynamically distinct neuron designs can lead to numerically identical examples of equations 6.12 and 6.14; but if they do, then both alternative designs would play dynamically identical roles as subpopulations in a larger array of interacting neuron subpopulations. If the effect on a specific neuron population of an elsewhere-directed pharmacological agent is to be minimal, or if the effect of a base-pair change in a mouse gene is to be maximal, then this more properly could be measured by minimal or maximal numerical changes in equations 6.12 and 6.14 rather than in the underlying dynamical equations. Regarding efficient simulation, we can envision an elaborate neuron model with multiple electrical compartments, and a highly simplified neuron of the general integrate-and-fire variety, giving rise to numerically almost identical instances of equations 6.12 and 6.14.

7 Summary
We may anticipate that a fairly broad range of useful neuron-population simulations will exhibit a common mathematical structure. The main ingredient of that structure is a dynamical equation that describes the momentary evolution of a population density. That density exists in a state-space whose points represent the possible states of a single model neuron. The population density dynamical equation follows from the postulated dynamics of the single neuron, including the consequences of an input noise uncorrelated among different neurons of the population. The population dynamics equation is nonlinear in the common inputs from other neuron populations (which nonlinearity reflects the nonlinear dynamics of the identical individual neurons in the population) but is linear in the population density. This linearity endows the population dynamics with a structure whose broad outlines are familiar from applied mathematics and which gives insight regarding the ways the neuron population will respond to different input situations. The second major ingredient of a neuron population dynamics model is an expression that specifies its momentary per-neuron output rate of impulses. This output rate expression depends linearly on the momentary disposition of the population density. (In the sense that the impulse train from each neuron is a point process in time, the merged impulse trains from a large population of noninteracting neurons will be a time-dependent Poisson point process.) Because the population dynamical equation is linear in the population density, it may be viewed in terms of an input-dependent population dynamical linear operator that acts on the population density. At a specified input, that operator has a set of eigenfunctions in terms of which the population density can be expanded as a linear weighted sum, and a corresponding set of eigenvalues that include zero and complex conjugate pairs with negative real parts.
Under common circumstances, the eigenvalue pairs nearest zero may be approximated by a simple orderly progression, and we may anticipate that the corresponding eigenfunctions will dominate in the expansion of the population density. We may examine how such a population dynamics model responds to input that is a small sinusoidal oscillation around an operating point. This leads to an input-output frequency response that is a simple analytic function of frequency with "resonance denominators" at the complex eigenvalues that, under common circumstances, lie near integer multiples of the population's mean firing rate.
The output of one population can be used as input to another. If several neuron subpopulations are thus coupled together, commonly a set of equilibrium firing rates may be found. A frequency response function can be assigned to each input-output pair; from these we may derive an analytic function that is a polynomial combination of those frequency responses and whose complex roots specify the possible free-running behaviors near equilibrium of the coupled set of subpopulations. A root whose real part is positive indicates a free-running behavior that is a sinusoidal oscillation times an exponentially growing envelope; hence the equilibrium is unstable in this case. In particular cases this stability analysis is quite tractable. The question of the stability of a model of an orientation hypercolumn of neurons in primary visual cortex may be reduced by symmetry to the tractable stability analysis of a much smaller system, which contains only one subpopulation of each cortical cell type. Finally, because only fairly few early eigenfunctions proved important in exploring the analytic methods discussed above, we are led to a preliminary examination of how this feature might enable extremely efficient simulations of neuron population dynamics. One such method involves examination of the solution of the full population dynamics equation in response to a sample of typical variable input. The small subspace of function space, in which the population density always resides, may be determined by principal component analysis. The dynamical equation may be partially diagonalized into this subspace, where it becomes a modest system of linear ordinary differential equations coupled by a matrix that changes in time with the changing input. A second approach, which currently is speculative but promising, is to recast the dynamical equation in terms of coefficients of a "moving basis" of the input-dependent eigenfunctions of the dynamical operator.
This approach has the potentially useful additional feature that it reduces neuron population models to a single canonical form. This should lead, for example, to a quantitative estimate of how well an elaborate neuron model may be approximated by a simpler model. We certainly can envision models of interacting cortical neuron subpopulations whose mathematical features will include some that lie beyond those presented here. The eventual accurately descriptive model may well fall in the broader set. Still, what is presented here may help furnish a bit of guidance for theoretical developments to come.

Appendix: Estimate of the Eigenvalues

A.1 The Model. The generalized integrate-and-fire model of equation 1.1 can be made to mimic a more detailed model fairly well, for driving inputs s(t) that do not change with unnatural abruptness. In the more detailed case of the Hodgkin-Huxley neuron, for example, the voltage equation is of the form

dv/dt = F(v, m, h, n, s),    (A.1)
where m, h, and n satisfy differential equations of their own. For a typical fixed value of s, these variables follow a limit cycle, and on that s-dependent limit cycle, the other variables can be expressed as functions of the concurrent value of v, whence

F(v, m, h, n, s) = F(v, m(v, s), h(v, s), n(v, s), s) = F̂(v, s).    (A.2)

The model equation,

dv/dt = F̂(v, s),    (A.3)
when s is held constant, will integrate to yield the exact voltage time course on the limit cycle, for each s-value. More generally, for time-dependent s, the momentary exact m, h, n values will not depart far from their limit-cycle values for the present v and s, so long as s does not change by a major fraction of its value during a single cycle of v. The other variables relate to v in different ways on the fast descending side of the voltage cycle and on the much slower ascending side. To avoid double-valuedness, we may define our variable x in equation 1.1 as

x = 1/(2(v_max − v_min)) × { v_max − v                         (for downward motion)
                             (v_max − v_min) + (v − v_min)     (for upward motion).    (A.4)
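Equation A.4 can be written as a small function. The sketch below (with illustrative v_min, v_max values, not taken from the paper) shows the two branches joining continuously at v_min and wrapping modulo 1 at v_max:

```python
def phase_from_voltage(v, branch, v_min=-1.0, v_max=7.0):
    """Eq. A.4: map membrane voltage to a phase x in [0, 1), using the
    branch of the limit cycle ('down' = fast descending, 'up' = ascending).
    v_min, v_max are illustrative bounds, not values from the paper."""
    span = v_max - v_min
    if branch == 'down':
        x = (v_max - v) / (2.0 * span)            # x in [0, 1/2]
    else:
        x = (span + (v - v_min)) / (2.0 * span)   # x in [1/2, 1]
    return x % 1.0                                # periodic, period 1

# The two branches join continuously: end of 'down' meets start of 'up',
# and the top of 'up' wraps back to phase 0.
print(phase_from_voltage(-1.0, 'down'))   # 0.5
print(phase_from_voltage(-1.0, 'up'))     # 0.5
print(phase_from_voltage(7.0, 'up'))      # 1.0 wraps to 0.0
```

Taking x modulo 1 makes the variable single-valued and periodic around the cycle, which is what induces the smooth periodic boundary condition discussed next.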
If x is taken modulo unity, it is periodic on the limit cycle. Substitution of v in terms of x in equation A.3 now gives the f(x, s) of equation 1.1. If we proceed in this way, the boundary condition induced on the consequent dynamical equation, 1.4, for the population density ρ(x, t) is that ρ(x, t) should be smoothly periodic in x, with period 1. The more familiar boundary condition, for integrate-and-fire dynamics with diffusion, is an absorbing (or "no-return") upper boundary. We note the same effect is built in here, as on the final fast voltage rise of the nerve impulse, forward advection overwhelms backward diffusion. The form of equation 1.4, as a good approximation to more elaborate and realistic dynamics, may be reached by an entirely different path. Suppose that in a state-space of higher dimension, we have properly formulated the dynamical operator in diffusion approximation. Suppose further that stochastic effects are modest enough so that the "fattened limit cycle" (discussed in section 2, equations 2.40 and 2.41), which supports the population density function, is reasonably thin. Then our partial differential equation eigenvalue problem may be separated into a preliminary transverse eigenvalue problem that may be solved locally at each cross-section across the limit cycle, and an axial eigenvalue problem that follows from assembling the solutions to the transverse problem. (This is the same methodology that is natural in addressing wave propagation in a waveguide of variable cross-section.) The resulting axial dynamical operator inherits both advective and diffusive terms and is of the form given in equation 1.4, although in this case f(x, s) includes the effects of transverse diffusion.

A.2 Small Stochastic Effects. If the diffusion term in equation 1.4 is small, we may approach the eigenvalue problem,
Qφ_n = λ_n φ_n,    (A.5)

by straightforward perturbation theory. In equation 1.4, let

Q = Q_0 + Q_1,    (A.6)

where

Q_0 = −(∂/∂x) f(x),    Q_1 = (∂/∂x) D(x) (∂/∂x)    (A.7)
(we drop the s dependence from the notation, as input is held fixed in the eigenvalue problem), and we assume that inclusion of the small Q_1 induces changes away from both the eigenvalues and eigenvectors of Q_0:

Q_0 φ_n^(0) = λ_n^(0) φ_n^(0)    (A.8)
φ_n = φ_n^(0) + φ_n^(1)    (A.9)
λ_n = λ_n^(0) + λ_n^(1).    (A.10)

Substitution of equations A.6, A.9, and A.10 into A.5 (and use of equation A.8) gives the first-order part of equation A.5:

Q_0 φ_n^(1) + Q_1 φ_n^(0) = λ_n^(0) φ_n^(1) + λ_n^(1) φ_n^(0),    (A.11)
where we regard terms with index zero as known, and those with index one as to be found. The perturbed eigenfunction can be expanded in terms of the unperturbed eigenfunctions:

φ_n^(1) = Σ_{m≠n} a_m φ_m^(0).    (A.12)
It is notable that the m = n term, which would include an admixture of φ_n^(0) in φ_n^(1), is absent from the sum. This term is absent because its inclusion in equation A.9 would serve simply to rescale the unperturbed eigenvector, and the eigenvector property is not affected by a scale change. Essentially equation A.11 asks, How much must the old eigenvector be redirected to be an eigenvector of the modified operator? and a simple scale change is irrelevant to answering this question. So every term of equation A.12 is biorthonormal to the adjoint eigenvector φ̂_n^(0). If all we want from equation A.11 is the first-order correction to the eigenvalue, λ_n^(1), we may inner-product that equation with φ̂_n^(0), and by use of equations 2.22 and 2.27, we at once find

λ_n^(1) = ( φ̂_n^(0), Q_1 φ_n^(0) ).    (A.13)

The inner product for the state-space of equation 1.4 is

(v, u) = ∫_0^1 dx v*(x) u(x).    (A.14)
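A finite-dimensional analog of the perturbation formula A.13 is easy to check numerically. In the sketch below (a generic-matrix stand-in, not the operator of equation 1.4), the first-order shift computed from a biorthonormal left-right eigenvector pair is compared against the exact eigenvalue of the perturbed matrix:

```python
import numpy as np

# Finite-dimensional sketch of eq. A.13: for Q = Q0 + Q1 with Q1 small,
# the first-order eigenvalue shift is (psi_hat_n, Q1 psi_n), where psi_n
# and psi_hat_n are right and left eigenvectors of Q0, biorthonormalized.
rng = np.random.default_rng(0)
Q0 = rng.standard_normal((6, 6))
Q1 = 1e-5 * rng.standard_normal((6, 6))        # small perturbation

evals, right = np.linalg.eig(Q0)
left = np.linalg.inv(right).conj().T           # conjugated left eigenvectors
n = 0
psi, psi_hat = right[:, n], left[:, n]
# First-order shift; (psi_hat, psi) = 1 by construction of inv(right)
shift_first_order = (psi_hat.conj() @ Q1 @ psi) / (psi_hat.conj() @ psi)

exact = np.linalg.eigvals(Q0 + Q1)
# match the perturbed eigenvalue nearest to the unperturbed one
shift_exact = exact[np.argmin(abs(exact - evals[n]))] - evals[n]
print(abs(shift_first_order - shift_exact) < 1e-7)
```

With a perturbation of size 1e-5, the discrepancy is second order (about 1e-10 here), so the first-order formula and the exact diagonalization agree to many digits.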
We observe that

(v, Q_0 u) = −∫_0^1 dx v*(x) (∂/∂x)( f(x) u(x) )
           = ∫_0^1 dx ( f(x) ∂v(x)/∂x )* u(x)
           = ( Q_0^* v, u ),    (A.15)

where for the second line we integrated by parts and used the periodic boundary condition discussed in section A.1. By equation A.15 the zeroth-order adjoint operator is

Q_0^* = f(x) ∂/∂x,    (A.16)

and the adjoint eigenvalue equation,

Q_0^* φ̂_n^(0) = ( λ_n^(0) )* φ̂_n^(0),    (A.17)

is

f(x) (d/dx) φ̂_n^(0) = ( λ_n^(0) )* φ̂_n^(0),    (A.18)
which (by direct substitution back into equation A.18) has the solution

φ̂_n^(0)(x) = exp{ ( λ_n^(0) )* ∫_0^x dx′/f(x′) }.    (A.19)
In the language of the NFL theorems (Wolpert, 1996b), ⟨E_tr^*⟩_n is E(C | f, m), where the on-training-set likelihood and the quadratic loss function are used.
No Free Lunch for Noise Prediction
data points and estimate the training error. We use bootstrapping (Shao & Tu, 1996) to estimate ⟨E_tr^*(N_1)⟩, and by varying N_1, we can fit the resulting dependence of ⟨E_tr^*(N_1)⟩ on 1/N_1 using equation 2.3. Let the slope of this fit be Â. From equation 2.3, we see that Â is an estimate for σ^2(d + 1) + B; therefore, Â/(d + 1) used as an estimate of σ^2 is off by B/(d + 1). Perhaps we can estimate B by the bootstrap as well. Given a bootstrapped estimate for B, we can combine it with Â to obtain an estimate for σ^2. A more careful analysis presented in the appendix shows that this too leads to failure. Thus, to get information on σ^2, perhaps a more subtle approach, such as looking at terms of higher order in 1/N, would work. Perhaps some method other than simply using the training error would yield a better result. Instead of using linear models, why not use a more complex model for which E_0 is zero? Then, at least asymptotically, the training error will approach σ^2 (provided that the model is not too complex). However, for any finite N, we run the risk of overfitting the data, perhaps even obtaining an estimate σ^2 = 0. Perhaps there is a systematic way of choosing the model complexity as a function of N to get an optimal noise variance estimate. The next few sections show that such an exhaustive search will necessarily be fruitless unless we make some statement about f. In this case it would suffice to tell us "how good our model is at estimating f" (i.e., E_0). Rather than explore every other potential method for estimating σ^2 in this linear scenario, we will set up a more general framework for the estimation of noise distributions and show that unless some assumptions are made about the target function, all noise models are just as probable, given the data. The intuition is that the data points are consistent with any noise level if we do not restrict the target function. We pursue these issues next.

3 Problem Set-Up
A finite data set D = {x_α, y_α}_{α=1}^l is given, where x_α ∈ R^d, y_α = f(x_α) + n_α, f : R^d → R, and the data set consists of l data points.³ We concentrate on the case where the noise distribution is fixed. The results can be modified to accommodate noise distributions that vary with x. It will always be understood that α indexes points in the data set. We concentrate on compact spaces and assume that the input space is the unit hypercube in d dimensions. Let the output space be the bounded interval [−C, C]. Discretize the input and output spaces as follows. Let the input variable, x, take on values in the set

X = {0, Δ_x, 2Δ_x, ..., 1 − Δ_x, 1}^d,    (3.1)

³ We use l rather than the customary N to avoid confusion with N_x, N_y, to be defined shortly.
Malik Magdon-Ismail
Figure 3: Discretized function space. The space has only a finite number of functions. An arbitrary precision can be obtained by making the grid finer and finer.
so there are N_x = 1 + 1/Δ_x possible values for each input variable. Let the output variable, y, take a value in the set

Y = {−C, −C + Δ_y, ..., 0, Δ_y, 2Δ_y, ..., C − Δ_y, C}.    (3.2)

There are N_y = 2C/Δ_y + 1 possible y values. A function f will be a mapping f : X → Y. There are 𝒞 = N_y^(N_x^d) different functions. By making Δ_x, Δ_y as small as we choose, we can achieve an arbitrary precision (for example, see Figure 3). In all applications, only a finite number of functions are available (computers are finite-precision machines). Due to this finiteness, we can define a probability distribution on the function space: a prior over the function space. One natural prior is to take this distribution to be uniform. Therefore to every function we assign a probability of 1/𝒞. We assume that 𝒞 can be made as large as we wish. Further, without loss of generality, we can scale the outputs by 1/Δ_y; therefore we can assume that y ∈ {−C̃, −C̃ + 1, −C̃ + 2, ..., C̃ − 1, C̃}, with N_y = 2C̃ + 1 and C̃ = C/Δ_y. With this representation, the noise n ∈ {0, ±1, ±2, ...} has a distribution given by
P a vector P , where Pi D Pr[n D i] and i Pi D 1. Assume a prior probability measure on the possible noise distributions g( P ). We are interested in the Bayesian posterior on these noise distributions, Pr[D| P ]g( P ) , (3.3) Pr[D ] P where Pr[D ] D P Pr[D | P ]](P ). We can calculate Pr[D| P ] as follows: g( P |D) D
Pr[D | P , f ] D
l Y aD1
Pr[na D ya ¡ f ( x a )].
(3.4)
Note that the right-hand side implicitly depends on P through the probabilities for the noise realizations na . Multiplying by P[ f ] and summing over f , we have Pr[D| P ] D
l X Y 1 XX ¢¢¢ Pr[na | f , D ] C f ( x ) f (x ) f ( x ) aD1 1
l 1 Y
( a)
D
Ny l
2
Nx
C X
Pr[na | f , D ],
(3.5)
aD1 f ( xa )D¡C
where we have used the notation Pr[n_α | f, D] to mean Pr[n_α = y_α − f(x_α)]; (a) follows when we sum over all points not in the data set, and we have used the fact that Pr[f] = 1/𝒞 (uniform prior).

4 No Free Lunch for Noise Prediction

4.1 Boolean Functions. We use the following notation:

x ∈ {0, 1}^d,  f : x → {0, 1},  (1 − p) ∈ [0, 1] = probability of flip.

There are 𝒞 = 2^(2^d) possible functions. Let g(p) be the prior distribution for the noise parameter, p.

Theorem 1. If the prior distribution on the possible functions is uniform, then g(p | D) = g(p).

Proof. Let m_i be the number of times f_i agrees with D, where we use i to index functions. Then Pr[D | p, f_i] = p^{m_i} (1 − p)^{l−m_i}, so

Pr[D, f_i | p] = Pr[D | p, f_i] Pr[f_i | p] = (1/𝒞) p^{m_i} (1 − p)^{l−m_i}.    (4.1)
Let ρ(m_i) be the number of functions agreeing exactly m_i times with the data set D: ρ(m_i) = (l choose m_i) 2^(2^d − l) = (l choose m_i) 2^{−l} 𝒞. Summing equation 4.1 over f_i, we get

Pr[D | p] = Σ_{f_i} Pr[D, f_i | p]
          = (1/𝒞) Σ_{m_i=0}^{l} p^{m_i} (1 − p)^{l−m_i} ρ(m_i)
          = 2^{−l} Σ_{m_i=0}^{l} (l choose m_i) p^{m_i} (1 − p)^{l−m_i}
          = 2^{−l},

independent of p. Therefore,

g(p | D) = Pr[D | p] g(p) / Pr[D] = g(p).
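The proof can be verified by brute force for small d. The sketch below (illustrative, not from the paper) enumerates all 2^(2^d) Boolean functions for d = 2 and confirms that Pr[D | p] = 2^{−l} for a data set of l = 3 distinct input points, independent of the flip probability:

```python
from itertools import product
from math import isclose

def pr_data_given_p(data, p, d):
    """Average of Pr[D | p, f] over all 2^(2^d) Boolean functions
    (uniform prior), for agreement probability p (flip prob. 1 - p)."""
    inputs = list(product([0, 1], repeat=d))
    total = 0.0
    for table in product([0, 1], repeat=len(inputs)):   # every function
        f = dict(zip(inputs, table))
        pr = 1.0
        for x, y in data:
            pr *= p if f[x] == y else (1.0 - p)
        total += pr
    return total / 2 ** len(inputs)                     # times Pr[f] = 1/C

# A fixed data set of l = 3 labeled points on distinct inputs, d = 2.
D = [((0, 0), 1), ((0, 1), 0), ((1, 0), 0)]
vals = [pr_data_given_p(D, p, d=2) for p in (0.1, 0.5, 0.9)]
print(all(isclose(v, 2 ** -3) for v in vals))   # = 2^-l, independent of p
```

Note the data points sit on distinct inputs, as the counting argument ρ(m_i) = (l choose m_i) 2^(2^d − l) implicitly assumes.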
4.2 Extension to More General Functions. We would like to extend the previous analysis to the case where the output is not binary and the input is the discretized version of [0, 1]^d. Figure 4 illustrates the conclusions of this section on a particular discretized grid. This section is somewhat technical. Its essential content is contained in theorem 3, which formalizes the intuition that if every target function value is equally likely, then each noise realization for every data point is equally likely. Hence nothing can be deduced about the relative likelihoods of different noise realizations.
4.2.1 Cyclic Noise. When the noise is additive modulo N_y, then P = [P_0, P_1, ..., P_{N_y−1}] fully specifies the noise distribution.

Example. In the binary case with probability of flip 1 − p, we have N_y = 2, P_0 = p, P_1 = 1 − p.
Lemma 1. Σ_{f(x_α)} Pr[n_α | f, D] = 1 for all x_α ∈ D.
Proof.

Σ_{f(x_α)=−C}^{C} Pr[n_α | f, D] = Σ_{n_α=y_α−C}^{y_α+C} Pr[n_α | f, D] = Σ_{n_α=0}^{N_y−1} P_{n_α} = 1.
Combining lemma 1 with equation 3.5, we have the following theorem.

Theorem 2. Let g(P) be the prior distribution for the cyclic noise. If the prior distribution for the functions is uniform, then g(P | D) = g(P).
Figure 4: Illustration of no noisy free lunch (NNFL). Suppose that the two functions shown are both equally likely (probability 1/2) to be the function that created the data. Then the two hypotheses, noise = 0 and noise > 0, are still equally likely to be true if they were a priori equally likely to be true.

Proof.
By lemma 1 and equation 3.5, Pr[D | P] = 1/N_y^l. Therefore,

g(P | D) = Pr[D | P] g(P) / Pr[D] = g(P).
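Theorem 2 can likewise be checked numerically. The sketch below (illustrative parameters) averages Pr[D | P, f] over a uniform function prior for additive mod-N_y noise and confirms that the result is N_y^{−l} regardless of the noise distribution P:

```python
from itertools import product
from math import isclose

def pr_data_given_noise(data, P, Ny):
    """Pr[D | P] for additive mod-Ny noise, averaged over all functions
    on the observed inputs with a uniform prior (eq. 3.5 plus lemma 1);
    the unobserved inputs cancel against the 1/C normalization."""
    xs = sorted({x for x, _ in data})
    total = 0.0
    for table in product(range(Ny), repeat=len(xs)):
        f = dict(zip(xs, table))
        pr = 1.0
        for x, y in data:
            pr *= P[(y - f[x]) % Ny]     # Pr[n = y - f(x) (mod Ny)]
        total += pr
    return total / Ny ** len(xs)

D = [(0, 2), (1, 0), (2, 4), (3, 1)]     # l = 4 points, distinct inputs
vals = [pr_data_given_noise(D, P, Ny=5)
        for P in ([0.7, 0.1, 0.1, 0.05, 0.05],   # strongly peaked noise
                  [0.2, 0.2, 0.2, 0.2, 0.2])]    # uniform noise
print(all(isclose(v, 5.0 ** -4) for v in vals))  # both equal Ny^-l
```

Two very different noise laws yield the identical data likelihood, which is the symmetric-channel argument of the preceding paragraph made concrete.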
Note that the boolean case is a special case of the cyclic noise case. In the language of information theory (Cover & Thomas, 1991), the noise forms a symmetric channel between the function value and the output value. A uniform distribution for the function values induces a uniform distribution on the output values, independent of the details of the symmetric noisy channel. Hence, observing the output conveys no information about the noisy channel. This is essentially the content of theorem 2.

4.2.2 Thresholded Additive Noise. Noise is added to the target, and then the resultant value is thresholded, so

y_i = min{C, f(x_i) + n_i}   if f(x_i) + n_i ≥ 0,
y_i = max{−C, f(x_i) + n_i}  if f(x_i) + n_i < 0.
P = [P_0, P_{±1}, P_{±2}, ...]. Unlike the previous case, edge effects make Pr[D | P] dependent on P. These effects can be made as small as we please by allowing C to be as large as we please. The intuition is that the data set is finite (y_i is bounded), and so, because the noise distribution must decay to zero, if C is large enough, the probability that f(x_α) is close to the edge and y_α so far away becomes negligible. This is formalized in the next lemma.

Lemma 2. Given ε > 0, ∃C* such that for all C ≥ C*,

(1/N_y^l)(1 − ε) ≤ Pr[D | P] ≤ 1/N_y^l.    (4.2)

(N_y = 2C + 1, and, intuitively speaking, ε → 0 as C → ∞.)
Proof. Because {P_i} is summable, there exists γ such that Σ_{|i|>γ} P_i < ε/l. Let y_max = max{y_i} and y_min = min{y_i}. Choose C* > max{|y_max + γ|, |y_min − γ|} and let C ≥ C*. We then have

Σ_{f(x_α)=−C}^{C} Pr[n_α | f, D] = Σ_{n_α=y_α−C}^{y_α+C} Pr[n_α | f, D]
  = 1 − Σ_{n_α<y_α−C} Pr[n_α | f, D] − Σ_{n_α>y_α+C} Pr[n_α | f, D]
  ≥ 1 − ε/l,

where the last inequality follows because, by construction, y_max − C < −γ and y_min + C > γ. Using equation 3.5, we have

1/N_y^l ≥ Pr[D | P] ≥ (1/N_y^l)(1 − ε/l)^l.

The lemma now follows by using the inequality (1 − x)^n ≥ 1 − nx for 0 ≤ x ≤ 1.

Thus, we see that by choosing C, the bound on our function f, to be large enough, the noise distribution has almost no effect on the probability of obtaining a particular data set. All data sets become equally probable in the limit of large C. This tells us that the data should not be telling us any more about the noise distribution than we already know. The prior distribution for the noise is unaffected given any finite data set. To formalize this intuition, we need to impose a technical condition on the prior over the noise distribution space.
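Lemma 2 can be illustrated numerically. The sketch below (hypothetical geometric-tailed noise, tiny data set) computes Pr[D | P] for thresholded additive noise exactly and shows the rescaled value N_y^l · Pr[D | P] rising toward 1 as the output bound C grows:

```python
from itertools import product

def pr_data_thresholded(data, P_of, C):
    """Pr[D | P] for thresholded additive noise on outputs {-C, ..., C},
    averaged over a uniform function prior on the observed inputs.
    P_of(i) is the probability of integer noise i before thresholding."""
    Ny = 2 * C + 1
    xs = sorted({x for x, _ in data})
    noise_vals = range(-12 * C, 12 * C + 1)     # truncated noise support
    total = 0.0
    for table in product(range(-C, C + 1), repeat=len(xs)):
        f = dict(zip(xs, table))
        pr = 1.0
        for x, y in data:
            # every noise value whose thresholded result lands on y
            pr *= sum(P_of(n) for n in noise_vals
                      if max(-C, min(C, f[x] + n)) == y)
        total += pr
    return total / Ny ** len(xs)

P_of = lambda n: (1.0 / 3.0) * 0.5 ** abs(n)    # geometric tails, sums to 1
D = [(0, 1), (1, -1)]                           # l = 2 data points
vals = [pr_data_thresholded(D, P_of, C) * (2 * C + 1) ** 2 for C in (2, 4, 8)]
print(vals[0] < vals[1] < vals[2] <= 1.0 and vals[2] > 0.99)
```

The rescaled likelihood stays below the lemma's upper bound of 1 and approaches it monotonically as C grows, so the edge effects that distinguish noise distributions vanish in the large-C limit.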
Definition 1. Let 0 < ε ≤ 1. Define γ_P(ε) by

γ_P(ε) = min{ γ : Σ_{|i|>γ} P_i < ε }.

For all 0 < ε ≤ 1, γ_P(ε) < ∞, by the summability of P.

Definition 2. Let 0 < ε ≤ 1. Define γ̄(ε) by

γ̄(ε) = sup_P { γ_P(ε) },

where the sup is taken over P ∈ support{g(P)}. We say that a prior on the space in which P is defined, g(P), is totally bounded if
γ̄(ε) < ∞ for all ε > 0. Essentially this means that the support set for the prior on the noise distributions is not "too large." For example, a prior over the noise distribution space that has support over only a finite number of points (i.e., that is nonzero only for a finite number of noise distributions) is totally bounded. We are now in a position to make precise the intuition mentioned above.

Theorem 3 (No Noisy Free Lunch (NNFL)). Let the prior over noise distributions, g(P), be totally bounded. For all 0 < ε < 1, ∃C* > 0 such that for all C ≥ C*,

(1 − ε) g(P) ≤ g(P | D) ≤ g(P)/(1 − ε).    (4.3)

Therefore,

lim_{C→∞} g(P | D) = g(P),
because ε → 0 can be attained by letting C → ∞.

Proof. Following lemma 2, now let C* > max{|y_max + γ̄(ε)|, |y_min − γ̄(ε)|} and let C ≥ C*. Then lemma 2 applies uniformly to every P ∈ support{g(P)}.
Therefore equation 4.2 holds for every P ∈ support{g(P)}. Now,

g(P | D) = Pr[D | P] g(P) / Pr[D] = Pr[D | P] g(P) / Σ_P Pr[D | P] g(P).    (4.4)
Using equation 4.2, an upper bound can be obtained by substituting Pr[D | P] with 1/N_y^l in the numerator and (1 − ε)/N_y^l in the denominator. To get a lower bound, we do the opposite. Then, using the fact that Σ_P g(P) = 1, we get equation 4.3.

Therefore, having no bound on f and allowing all f's to be equally likely does not allow anything new to be deduced about the noise distribution given any finite data set. It should also be clear how to extend these results to the case where the noise distribution has x dependence: one looks at each point independently. The nonthresholded additive noise model is asymptotically obtained by letting C become arbitrarily large. Further, the results apply to arbitrary Δ_x, Δ_y, so by allowing Δ_x, Δ_y → 0, these results can be applied to the nondiscretized model in this asymptotic sense.

5 Discussion
The posterior g(P | D) is given by

g(P | D) ∝ g(P) Σ_f Pr[D | P, f] P[f],    (5.1)
where P[f] is the prior over target functions. We have demonstrated that when P[f] is "uniform," it is not possible to update a prior on the noise distribution from a finite data set, provided that a certain technical condition is satisfied. A nontrivial P[f] is needed for g(P | D) to be different from g(P). Exactly how P[f] will factor into g(P | D) is given by equation 5.1. The result is more useful than it appears at first sight, for one might argue that no one ever has a uniform prior. A restatement of the result is: given two noise distributions, P_1 and P_2, there are just as many (nonuniform) priors that favor P_1 over P_2 (when one weights by the amount by which P_1 is favored) as vice versa. This result fits into a set of NFL theorems that might be considered a collection of negative results, offering no hope for learning theory. The situation is considerably more positive, however. One is never in a situation with a uniform P[f]. Further, it is rare that no information about P[f] is known (excepting that it is nonuniform). Therefore, incorporating P[f] into
No Free Lunch for Noise Prediction
559
the noise prediction or learning algorithm is necessary in order to claim superior performance. Distribution-independent bounds such as the VC bounds (Vapnik, 1995) say that with high probability, the training error is close to the test error given some fixed target function f. But VC results do not guarantee that the training error will be low with high probability; therefore they are of little practical use (for a deeper discussion of the subtleties underlying the connection between NFL and VC theory, see Wolpert, 1995, 1996b).4 How could one guarantee that a training error close to zero is achievable? One has to use the prior information available on the target function. Along a similar vein, results that exemplify the superiority of algorithm A over algorithm B on a certain benchmark problem are also of little use, for the reverse will be true on some other "benchmark" problem, and we never know which "benchmark" problem we will run into. Similarly, looking at a data set alone, one cannot produce any reliable estimate for the noise variance. Perhaps we "feel" as though we ought to be able to (for how often can it be heard: "This data set is clearly noisy"), but only because we already have a smoothness prior built into our neural network. Why? Because the types of functions we tend to encounter in practice are smooth. NFL focuses our attention on the importance of the prior information available. One should not attempt to find learning systems that are universally good. Instead, one should study the possibilities that various priors present. Unfortunately, identifying and justifying the prior for the problem at hand is already a daunting task, let alone incorporating it into the learning problem. One way would be through the use of hints (Abu-Mostafa, 1995; Sill, 1998; Sill & Abu-Mostafa, 1997).
A possible starting point would be to identify priors, P[f], with learning systems that perform well with respect to them (for example, Barber & Williams, 1997; Zhu & Rohwer, 1996). One might then be able to relate real problems to these priors and thus choose an appropriate learning system; at the least, it might become possible to make precise the assumptions behind a belief that a certain learning system is appropriate for a certain problem. A useful result would be of the form: "Algorithm A [or learning system A] performs better than average on problems drawn from the (useful) class B."

Appendix
A.1 Estimating Noise Variance Using Linear Models. We have $\hat{A}$, an estimate of $\sigma^2(d+1) + B$, and we would like to estimate B by bootstrapping. Suppose we draw $N_B$ data points and estimate B (using the bootstrap technique) by $\hat{B}$. It can be shown that (see theorem 5)

4 VC results make statements about Pr[training error | test error]: the probability of a low training error given a high test error is extremely small. In practice, what we would really like to make statements about is Pr[test error | training error], for which we really need to make some statement about the a priori probability of a certain test error; that is, we need to make some statement about a prior over target functions.

560
Malik Magdon-Ismail
$$\langle \hat{B} \rangle_n \;=\; \tilde{B} \;+\; \sigma^2 (d+1)\left(1 - \frac{N_B}{N}\right) \;+\; \underbrace{\frac{\sigma^2}{N}\,\mathrm{tr}\big\langle S^{-1} V S^{-1} V \big\rangle}_{O((d+1)/N)}, \qquad (A.1)$$
where $\tilde{B}$ is an unbiased estimate of B and the expectation in the last term is with respect to the bootstrapping. The definitions of V and S are given in the next section and are unimportant for our purpose here. Thus, we now have two quantities: $\hat{A}$, an estimate of $B + \sigma^2(d+1)$, and $\hat{B}$, an estimate of $B + \sigma^2\big((d+1)(1 - N_B/N) + \mathrm{tr}\langle S^{-1}VS^{-1}V\rangle / N\big)$. Solving for $\sigma^2$, we have

$$\sigma^2 \;=\; \frac{\hat{A} - \hat{B}}{(d+1)\dfrac{N_B}{N} - \dfrac{1}{N}\,\mathrm{tr}\big\langle S^{-1} V S^{-1} V \big\rangle}. \qquad (A.2)$$
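A quick numerical consistency check of equation A.2 (the plug-in values below are assumed purely for illustration): substituting the stated expectations of $\hat{A}$ and $\hat{B}$ recovers $\sigma^2$ exactly.

```python
# Consistency check of equation A.2 against the stated expectations of
# A-hat and B-hat. All numbers are arbitrary illustrative values.
sigma2, d, N, NB, trterm, B = 0.25, 3, 1000, 50, 1.8, 0.7

A_hat = B + sigma2 * (d + 1)
B_hat = B + sigma2 * ((d + 1) * (1 - NB / N) + trterm / N)

sigma2_rec = (A_hat - B_hat) / ((d + 1) * NB / N - trterm / N)
print(sigma2_rec)   # recovers 0.25
```

Note how the denominator vanishes as $N_B/N \to 0$ with the trace term fixed, which is exactly the indeterminacy discussed next.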
However, we are bootstrapping a quantity B, which is related to the deviation of a statistical quantity of the sample from its population value. This estimate is good only as long as N represents the population—that is, $N_B/N \to 0$. In this case, $\hat{B} = B + \sigma^2(d+1)$, and thus we cannot combine it with $\hat{A}$ to solve for $\sigma^2$ (the expression in equation A.2 becomes indeterminate). We are faced with a dilemma. For finite $N_B/N$ we can get an estimate of $\sigma^2$, but this estimate is not very good; and we have no way to make it better, because by making the bootstrap more effective for B, the final result cannot be solved for $\sigma^2$. Technically speaking, we cannot in this way get a consistent estimate of $\sigma^2$, even as $N \to \infty$.

A.2 Proof of Theorems. We use the notation

$$X = [\,x_1 \; x_2 \; \cdots \; x_N\,], \qquad y = f + n, \qquad S \equiv \langle x x^T \rangle_x, \qquad q \equiv \langle x f(x) \rangle_x.$$
X is a $(d+1) \times N$ matrix whose columns $x_i$ are the data points (we use the convention that the first component is set to 1 and the remaining d components are the coordinates of the data point). f is an $N \times 1$ vector of output values at the data points, and n is a vector of noise realizations. If $\langle x \rangle = 0$, then S is the covariance matrix for the input distribution. The law of large numbers gives us that $XX^T/N \to S$ and $Xf/N \to q$ as $N \to \infty$, where we assume that the conditions for this to happen are satisfied. We thus define V and a by

$$\frac{XX^T}{N} = S + \frac{V(X)}{\sqrt{N}}, \qquad \frac{Xf}{N} = q + \frac{a(X)}{\sqrt{N}}. \qquad (A.3)$$

Note that $\langle V \rangle_X = \langle a \rangle_X = 0$, and $\mathrm{Var}(V)$ and $\mathrm{Var}(a)$ are O(1).
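The scaling in equation A.3 can be illustrated numerically. The sketch below (an assumed setup: the remaining coordinates are standard normal, so S is the identity) shows that the fluctuation matrix $V(X) = \sqrt{N}(XX^T/N - S)$ stays O(1) as N grows, so $XX^T/N \to S$ at rate $1/\sqrt{N}$:

```python
import numpy as np

# Assumed illustrative setup: x = (1, z1, z2) with z ~ N(0, I), so S = I.
rng = np.random.default_rng(0)
d = 2
S = np.eye(d + 1)

def V_of_X(N):
    X = np.vstack([np.ones(N), rng.standard_normal((d, N))])  # (d+1) x N
    return np.sqrt(N) * (X @ X.T / N - S)

# Entrywise fluctuations do not grow with N.
fluct = {N: np.abs(V_of_X(N)).max() for N in (100, 10_000)}
print(fluct)
```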
Let the learning model be the set of linear functions $w \cdot x$ and let the learning algorithm be minimization of the squared error. Then:

Theorem 4.

$$\langle E_{tr}^{*} \rangle_{n} \;=\; E_0 + \sigma^2 - \frac{\sigma^2 (d+1) + B}{N} + O\!\left(\frac{1}{N^{3/2}}\right), \qquad (A.4)$$
where $E_0$ and B are constants given by

$$E_0 = \langle f^2 \rangle - q^T S^{-1} q, \qquad (A.5)$$

$$B = \big\langle\, q^T S^{-1} V S^{-1} V S^{-1} q \;+\; a^T S^{-1} a \;-\; 2\, a^T S^{-1} V S^{-1} q \,\big\rangle. \qquad (A.6)$$

Proof.
The least squares estimate of w is given by

$$\hat{w} = (X X^T)^{-1} X y, \qquad (A.7)$$
from which we calculate the expected training error as

$$\langle E_{tr} \rangle = \left\langle \frac{1}{N} (X^T \hat{w} - y)^2 \right\rangle = \left\langle \frac{1}{N} \big( (X^T (XX^T)^{-1} X - 1)\, y \big)^2 \right\rangle$$
$$= \frac{\langle f^T f \rangle - \langle f^T X^T (XX^T)^{-1} X f \rangle + \langle n^T n \rangle - \langle n^T X^T (XX^T)^{-1} X n \rangle}{N}$$
$$= \underbrace{\langle f^2 \rangle - \frac{\langle f^T X^T (XX^T)^{-1} X f \rangle}{N}}_{T_1} \;+\; \sigma^2 \;-\; \frac{\sigma^2 (d+1)}{N}.$$
Using equation A.3 and the identity $[1 + \lambda A]^{-1} = 1 - \lambda A + \lambda^2 A^2 + O(\lambda^3)$, we evaluate $T_1$ as

$$T_1 = \langle f^2 \rangle - q^T S^{-1} q - \frac{\big\langle q^T S^{-1} V S^{-1} V S^{-1} q + a^T S^{-1} a - 2\, a^T S^{-1} V S^{-1} q \big\rangle}{N} + O\!\left(\frac{1}{N^{3/2}}\right).$$
The theorem now follows. This result is similar to results obtained by Amari, Murata, Müller, and Yang (1997) and Moody (1991). From theorem 4, we have that

$$B = \underbrace{\big\langle q^T S^{-1} V S^{-1} V S^{-1} q \big\rangle}_{T_1} \;+\; \underbrace{\big\langle a^T S^{-1} a \big\rangle}_{T_2} \;-\; 2\,\underbrace{\big\langle a^T S^{-1} V S^{-1} q \big\rangle}_{T_3}. \qquad (A.8)$$
We wish to estimate $T_1$, $T_2$, $T_3$. Suppose that we try to bootstrap them as follows. For $T_1$, we take $\hat{q} = Xy/N = X(f+n)/N$ and $\hat{S} = XX^T/N$. We estimate V by sampling $N_B$ of the N data points and computing $\sqrt{N_B}\,\big( X_B X_B^T / N_B - XX^T / N \big)$. We then estimate $T_1$ by $\hat{T}_1$, a bootstrapped expectation of $\hat{q}^T \hat{S}^{-1} V \hat{S}^{-1} V \hat{S}^{-1} \hat{q}$. This requires $N_B/N$ to be small. Doing all this, we find that

$$\hat{T}_1 = \left\langle \frac{f^T X^T}{N}\, \hat{S}^{-1} V \hat{S}^{-1} V \hat{S}^{-1}\, \frac{X f}{N} \right\rangle \;+\; \frac{\sigma^2}{N}\,\mathrm{tr}\big\langle \hat{S}^{-1} V \hat{S}^{-1} V \big\rangle, \qquad (A.9)$$
where we have an expectation with respect to the noise and the remaining expectation is with respect to the bootstrapping. Notice that the first term is the unbiased estimator of $T_1$; denote this term by $\tilde{T}_1$.

For $\hat{T}_2 = \langle a^T S^{-1} a \rangle$ we use $a = \sqrt{N_B}\,(X_B y_B / N_B - \hat{q})$. Partitioning the input matrix and noise vector into the part in the sampled $N_B$ points and the remaining part, write $X = [X_B \; \tilde{X}_B]$ and $n = [n_B^T \; \tilde{n}_B^T]^T$. Noting that $n_B$ and $\tilde{n}_B$ are independent, we have for $\hat{T}_2$:

$$\hat{T}_2 = \tilde{T}_2 + \frac{1}{N^2 N_B}\,\big\langle (N_B X n - N X_B n_B)^T\, S^{-1}\, (N_B X n - N X_B n_B) \big\rangle$$
$$= \tilde{T}_2 + \frac{N_B (N - N_B)^2}{N^2} \left\langle \left[ \frac{\tilde{X}_B \tilde{n}_B}{N - N_B} - \frac{X_B n_B}{N_B} \right]^T S^{-1} \left[ \frac{\tilde{X}_B \tilde{n}_B}{N - N_B} - \frac{X_B n_B}{N_B} \right] \right\rangle$$
$$= \tilde{T}_2 + \sigma^2 (d+1) \left( 1 - \frac{N_B}{N} \right) + o\!\left(\frac{1}{N}\right), \qquad (A.10)$$
where we have taken the expectation with respect to the noise and $\tilde{T}_2$ is an unbiased estimate of $T_2$. Performing the same kind of analysis for $\hat{T}_3$, we find that $\hat{T}_3 = \tilde{T}_3$, an unbiased estimate of $T_3$. Now we estimate $\hat{B}$ by $\hat{T}_1 + \hat{T}_2 - 2\hat{T}_3$, which is the unbiased estimate of B plus some correction terms. Therefore we have proved the following theorem.

Theorem 5. For $N \to \infty$, $N_B/N \to 0$, the bootstrapped estimate of B, which we call $\hat{B}$, has an expectation given by
$$\langle \hat{B} \rangle_n \;=\; \tilde{B} \;+\; \sigma^2 (d+1)\left(1 - \frac{N_B}{N}\right) \;+\; \frac{\sigma^2}{N}\,\mathrm{tr}\big\langle S^{-1} V S^{-1} V \big\rangle_X, \qquad (A.11)$$
where $\tilde{B}$ is an unbiased estimate of B and the remaining expectation is with respect to the bootstrapping.
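Theorem 4 can be checked numerically in the special case of an exactly linear target, where $E_0 = 0$ and $B = 0$, so the expected training error reduces to $\sigma^2 (N - (d+1))/N$. A minimal sketch (all parameters are assumed illustrative values):

```python
import numpy as np

# For an exactly linear target, least squares leaves a training error whose
# expectation is sigma^2 (N - (d+1)) / N.
rng = np.random.default_rng(1)
d, N, sigma, trials = 3, 50, 0.7, 4000
w_true = rng.standard_normal(d + 1)

errs = []
for _ in range(trials):
    X = np.vstack([np.ones(N), rng.standard_normal((d, N))])  # (d+1) x N
    y = w_true @ X + sigma * rng.standard_normal(N)
    w_hat = np.linalg.solve(X @ X.T, X @ y)                    # (X X^T)^{-1} X y
    errs.append(np.mean((w_hat @ X - y) ** 2))

print(np.mean(errs), sigma**2 * (N - (d + 1)) / N)  # both close to 0.451
```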
Acknowledgments
We thank Yaser Abu-Mostafa and the members of the Caltech Learning Systems Group for helpful comments. In addition, two anonymous referees provided useful comments.
References

Abu-Mostafa, Y. (1995). Hints. Neural Computation, 7(4), 639–671.
Amari, S., Murata, N., Müller, K., & Yang, H. (1997). Asymptotic statistical theory of overtraining and cross validation. IEEE Transactions on Neural Networks, 8(5), 985–996.
Barber, D., & Williams, C. K. I. (1997). Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 340–346). San Mateo, CA: Morgan Kaufmann.
Cataltepe, Z., Abu-Mostafa, Y. S., & Magdon-Ismail, M. (1999). No free lunch for early stopping. Neural Computation, 11, 995–1001.
Cortes, C., Jackel, L. D., & Chiang, W.-P. (1994). Limits on learning machine accuracy imposed by data quality. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Neural information processing systems, 7 (pp. 239–246). Cambridge, MA: MIT Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Moody, J. E. (1991). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Shao, J., & Tu, D. (1996). The jackknife and the bootstrap. New York: Springer-Verlag.
Sill, J. (1998). Monotonic networks. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Sill, J., & Abu-Mostafa, Y. S. (1997). Monotonicity hints. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 634–640). Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wolpert, D. H. (1995). The mathematics of generalization. Reading, MA: Addison-Wesley.
Wolpert, D. H. (1996a). The existence of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1391–1420.
Wolpert, D. H. (1996b). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82.
Zhu, H. (1996). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.
Zhu, H., & Rohwer, R. (1996). Bayesian regression filters and the issue of priors. Neural Computing and Applications, 4, 130–142.

Received May 29, 1998; accepted March 31, 1999.
LETTER
Communicated by Peter Foldiak
Self-Organization of Symmetry Networks: Transformation Invariance from the Spontaneous Symmetry-Breaking Mechanism

Chris J. S. Webber
Defence Evaluation and Research Agency, Malvern WR14 3PS, U.K.

Symmetry networks use permutation symmetries among synaptic weights to achieve transformation-invariant response. This article proposes a generic mechanism by which such symmetries can develop during unsupervised adaptation: it is shown analytically that spontaneous symmetry breaking can result in the discovery of unknown invariances of the data's probability distribution. It is proposed that a role of sparse coding is to facilitate the discovery of statistical invariances by this mechanism. It is demonstrated that the statistical dependences that exist between simple-cell-like threshold feature detectors, when exposed to temporally uncorrelated natural image data, can drive the development of complex-cell-like invariances, via single-cell Hebbian adaptation. A single learning rule can generate both simple-cell-like and complex-cell-like receptive fields.

1 Introduction
Symmetry networks (Shawe-Taylor, 1989) are neural networks in which some synaptic weights are equal to others, to some degree of approximation. In other words, the weight vectors are invariant under certain permutations among the weights. For example, the simple permutation that swaps over the ith and i′th components of weight vector w will leave w invariant if the weights $w_i$ and $w_{i'}$ are equal. These symmetries have the consequence that the scalar product with an input vector x will be invariant under corresponding permutations of the input nodes:

$$S_{ii'\ldots}(w) = w \;\Rightarrow\; w \cdot S_{ii'\ldots}(x) = w \cdot x. \qquad (1.1)$$
This follows because $w \cdot S(x) = S^{-1}(w) \cdot x$ and because $S(w) = w$ implies $S^{-1}(w) = w$. This is true for any permutation, including those that cycle more than two weights and those that cycle several distinct equivalence classes of weights simultaneously. Because the length |w| of a vector is invariant under a permutation, invariance of $w \cdot x$ implies invariance of the distance |x − w|, so that

$$S_{ii'\ldots}(w) = w \;\Rightarrow\; |S_{ii'\ldots}(x) - w| = |x - w|. \qquad (1.2)$$

Neural Computation 12, 565–596 (2000)
© British Crown copyright 2000
566
Chris J. S. Webber
Any neuron whose response is a function of $w \cdot x$ or |x − w| only will respond invariantly under those input-node permutations that correspond to its weight vector's symmetries. The converse of equation 1.1 implies that in general, such a neuron will discriminate between a pair of permutation-related patterns if its weight vector is not symmetric under that particular permutation. Depending on the weight vector's specific symmetries, the response will be invariant for some input-node permutations but not for others; the neuron's invariances are selective for specific groups of transformations. A neuron with a high-dimensional weight vector can be invariant for many distinct permutations, while retaining the capacity to discriminate for very many other permutations. These arguments are also valid for the case of approximate symmetries, if the neural response is a continuous function of $w \cdot x$ or |x − w|, because $S(w) \approx w$ implies $w \cdot S(x) \approx w \cdot x$. Actually the arguments based on equations 1.1 and 1.2 apply to a more general group of linear transformations than the permutations: the continuous orthogonal1 group of proper and improper rotations in the weight-vector space, of which the coordinate permutations form a discrete subgroup. (The orthogonal transformations are those linear transformations that leave vector lengths invariant: |S(w)| = |w|.) The condition for $w \cdot x$ and |x − w| to be invariant under such a transformation is that the weight vector lies in the invariant subspace of the transformation, which is defined to be the subspace of vectors w for which S(w) = w. The invariant subspace is the multidimensional analog of a mirror plane or axis of rotation.
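Equations 1.1 and 1.2 can be illustrated with a small numerical example (an assumed toy case, not from the article): a four-component weight vector whose middle two weights are equal is fixed by the swap permutation, and its scalar product with any input is unchanged when the same swap is applied to the input:

```python
import numpy as np

# Assumed toy example of equation 1.1: w is invariant under swapping
# components 1 and 2, so w.x is invariant under the same input permutation.
w = np.array([0.3, 0.5, 0.5, -0.2])   # w[1] == w[2], so the swap fixes w
perm = [0, 2, 1, 3]                    # S swaps input nodes 1 and 2
x = np.array([1.0, -2.0, 0.7, 3.0])

assert np.allclose(w[perm], w)         # S(w) = w
assert np.isclose(w @ x[perm], w @ x)  # => w . S(x) = w . x
print(w @ x, w @ x[perm])              # both -0.95
```

The distance version (equation 1.2) follows the same way, since |x[perm] − w| = |x − w[perm⁻¹]| = |x − w| when w[perm] = w.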
Minsky and Papert (1969) originated the idea that invariant response can be related to weight symmetries in this way: their group-invariance theorem proves that if one can construct a perceptron for computing a given symmetric function, then one can also construct a perceptron with symmetric weights for computing the same function. Many results on the subject of transformation invariance have depended on the design of networks having equalities among weights. Such networks have been successfully applied to the problem of graph isomorphism (Shawe-Taylor, 1993). The rotation-invariant network of Fukumi, Omatu, Takeda, and Kosaka (1992), the handwritten zip code recognizer of Le Cun et al. (1989), and the neocognitron (Fukushima & Miyake, 1982) all implicitly use built-in weight equalities to achieve their transformation invariances. Instead of imposing prior knowledge of a desired invariance directly onto the network architecture in this way, one can incorporate that prior knowledge into the learning algorithm, in the form of prior constraints on the learning equations. The use of weight sharing to distinguish T and C shapes independently of translation and 90 degree rotation (Rumelhart, Hinton, & Williams, 1986) provides one example.

1 Results obtained for orthogonal transformations can be generalized straightforwardly to apply to unitary transformations if we were to consider complex-valued weights and data.

Self-Organization of Symmetry Networks
567

Rao and Ballard (1998) and Wood (1995) provide others. Supervised networks have even been explicitly trained to be invariant under infinitesimal transformations of a known invariance group (Hinton, 1987; Simard, Victorri, Le Cun, & Denker, 1991). This article addresses the question of what kind of intrinsic invariances of the data might be able to drive the development of weight symmetries during unsupervised adaptation, and proposes a generic mechanism by which specific invariances of the data can drive the development of transformation-invariant response. In principle, this generic mechanism could be used to generate invariances with respect to a wide range of transformations, in a range of learning rules. As a specific example, the Hebbian rule of Webber (1994) is studied and is shown analytically to give rise to weight symmetries when trained on data having statistical invariances under permutations. Webber (1994) used an analytical model, and simulations involving synthetic images, to demonstrate that this single-cell learning rule can discover independent components in kurtotically distributed data. The component-finding behavior of this learning rule has also been demonstrated using natural data: the rule discovers phones and words when exposed to continuous speech (Webber, 1997). Blais, Intrator, Shouval, and Cooper (1998) have shown that various other single-cell rules also belonging to the general class of projection pursuit algorithms (Friedman, 1987) can discover independent components in natural images, which resemble the oriented receptive fields of cortical simple cells and appear localized to some extent. Such single-cell learning rules are very different from multicell independent component analysis algorithms such as those of Olshausen and Field (1996) and Bell and Sejnowski (1997), in that they are not derived from any kind of minimum mutual entropy or statistical independence principle, such as was originally proposed by Barlow (1989).
In this article the Webber (1994) rule is also trained on natural images, and it generates simple-cell-like feature detectors having oriented and highly localized receptive fields, as expected. The patterns of output activity induced in a set of these "simple cells," when natural images are presented at their inputs, are then used to train other, similar, higher-level Webber (1994) neurons. These higher-level neurons develop weight symmetries that give them complex-cell-like translation invariances. Thus, key response properties of both simple and complex cells can be derived within the framework of the Hebbian single-cell learning rule, without the need to rely on temporal correlations to explain transformation invariance, as others have done (Foldiak, 1991).

2 The Spontaneous Symmetry-Breaking Mechanism
If weight equalities are not explicitly constructed by prior constraints, how can they arise through self-organization? What properties of the data, unknown prior to learning, can determine the permutations under which the weight vector is to become invariant? In other kinds of dynamical system, the symmetry of the convergent state is often determined by spontaneous breaking of dynamical symmetries. Examples are widespread in physical systems. This article proposes that spontaneous symmetry-breaking mechanisms can also determine the self-organization of invariant neural response. The term dynamical symmetry refers to an invariance of the equations of motion under a transformation of the system's dynamical coordinates. In the neural network context, these coordinates are a set of adaptive weight vectors $W \equiv (w_1, \ldots, w_m)$, and the equations of motion are the adaptation rules,

$$\frac{dw_j}{dt} = v_j(w_1, \ldots, w_m) \qquad (j = 1, \ldots, m), \qquad (2.1)$$

or, defining $V \equiv (v_1, \ldots, v_m)$,

$$\frac{dW}{dt} = V(W). \qquad (2.2)$$
Equations of motion that are first order in t include most learning rules, notable exceptions being rules with second-order "momentum" terms. The velocity functions $v_j$ will depend in some way on the distribution of the training data Pr(x). The class of rules of the form 2.2 includes the familiar rules in which the weight vector is updated smoothly on the basis of some average $\langle \ldots \rangle_{x}$ over Pr(x):

$$v_j(w_1, \ldots, w_m) = \int \chi_j(x, w_1, \ldots, w_m)\, \Pr(x)\, dx \qquad (j = 1, \ldots, m). \qquad (2.3)$$
In these ensemble-average rules, the order in which data patterns are drawn from Pr(x) has no influence on the learning dynamics: it is assumed that the statistics of Pr(x) are stationary over time. The velocities $v_j$ are usually derived from a Lyapunov function U, called the objective function in the neural network context,

$$U(w_1, \ldots, w_m) = \int u(x, w_1, \ldots, w_m)\, \Pr(x)\, dx, \qquad (2.4)$$

often via gradient ascent or descent:

$$v_j(w_1, \ldots, w_m) = \pm\, \frac{\partial U(w_1, \ldots, w_m)}{\partial w_j} \qquad (j = 1, \ldots, m). \qquad (2.5)$$
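A minimal sketch of such an ensemble-average gradient rule (the objective below is an assumed example, chosen only to give a concrete $v_j$): Euler integration of $dw/dt = \langle x (w \cdot x) \rangle - |w|^2 w$ drives the weight vector toward the dominant-variance direction of Pr(x):

```python
import numpy as np

# Assumed example objective U(w) = <(w.x)^2>/2 - |w|^4/4, whose gradient
# ascent gives the Oja-like velocity v(w) = <x (w.x)> - |w|^2 w.
rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 2)) * np.array([2.0, 0.5])  # anisotropic Pr(x)

def velocity(w):
    return data.T @ (data @ w) / len(data) - (w @ w) * w

w = np.array([0.1, 0.1])
for _ in range(2000):              # Euler integration of dw/dt = v(w)
    w += 0.01 * velocity(w)
print(w)                           # settles near the high-variance axis, |w|^2 ~ variance
```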
Consider a linear transformation of the coordinate system from old coordinates W to new coordinates W′,

$$W' \equiv S(W). \qquad (2.6)$$
569
In the new coordinates, the equations of motion (equation 2.2) become dS¡1 ( W 0 ) D V (S¡1 ( W 0 )) dt
i.e.
dW 0 D S( V (S¡1 ( W 0 ))), dt
(2.7)
because linearity of the transformation implies S¡1 ( W 0o C dW 0 ) ¡ S¡1 ( W 0o ) dS¡1 ( W 0 ) D lim D S¡1 dt dt!0 dt
¡ dW ¢ 0
dt
.
(2.8)
Note that although we have assumed here that the transformation is linear, we do not have to assume that the dynamics are linear—that V ( W ) is a linear function of W . The transformation S on the dynamical coordinates W is said to be a symmetry of the equations of motion if the equations have the same functional form, dW 0 D V ( W 0 ), dt
(2.9)
in the new coordinate system. For linear transformations, such dynamical symmetry is dened by V ( W 0 ) D S(V (S¡1 (W 0 )))
i.e.
V (S(W )) D S(V ( W )),
(2.10)
as can be seen by comparing equation 2.7 with 2.9. In cases where the equations of motion just perform gradient ascent or descent on an objective function, any orthogonal transformation that leaves the objective function invariant, U( W 0 ) D U( W ),
(2.11)
will also be a symmetry (see equation 2.10) of the equations of motion, because the gradient in the transformed coordinate system is (using the chain rule and the denition of orthogonality 2 ) @U @W T @U @U @U D D (S¡1 ) T DS . @W 0 @W 0 @W @W @W
(2.12)
However, the concept and denition (see equation 2.10) of symmetries of the equations of motion apply even to dynamics for which no objective function U exists. If the equations of motion are invariant under some group of transformations, this has implications for the invariances of the weight vectors when the 2
S¡1 D ST for orthogonal matrices.
570
Chris J. S. Webber
learning dynamics have brought them to convergence. From equation 2.2, it can be seen that the rst of two conditions for a conguration of weight 1) vectors to be a convergent state W 1 ´ (W 1 1 , . . . , W m is that it should be a stationary point, V ( W 1 ) D 0,
(2.13)
even for dynamics for which no Lyapunov function exists. This condition is satised for unstable stationary states as well as for stable or metastable convergent states, so, in addition, convergent states must satisfy a second condition,
²T ( @V T / @W ) ² · 0,
(2.14)
for any small perturbation W D W 1 C ² away from the stationary state. This says that the “restoring force” V ( W ) should oppose3 perturbations ² away from the stationary point, so ² ¢ V (W ) should not be positive for any ²: equation 2.14 can be derived by Taylor expansion of V (W ) about W 1 in the standard way. In the special case of gradient ascent, the matrix @V T / @W is the objective function’s Hessian matrix, and equation 2.14 is just the condition for the stationary point dened by equation 2.13 to be a local maximum of the objective function. If S( W ) is an orthogonal symmetry of the equations of motion (equation 2.2) and if the particular state W 1 is one of the stable convergent states satisfying equations 2.15 and 2.16, then the symmetry denition (equation 2.10) ensures that the state S(W 1 ) will also be one of the stable convergent states satisfying equations 2.13 and 2.14. Clearly S( W 1 ) will also be a stationary point because V(W1) D 0
implies
S( V ( W 1 )) D S( 0 ) D 0 ,
(2.15)
and, by symmetry of the objective function under S, S( W 1 ) will obviously be a local maximum of U if W 1 is. Stability (see equation 2.14) of W 1 implies stability of S( W 1 ) even when no objective function U exists. If S is an orthogonal symmetry (see equation 2.10) of the equations of motion, a proof analogous to equation 2.12 shows that T @V T ( W 0 ) ¡1 @V ( W ) D S S. @W 0 @W
(2.16)
The matrices @V T ( W 0 ) / @W 0 and @V T ( W ) / @W are thus related by a similarity transformation.4 For every possible perturbation ²0 in the W 0 coordinate system, there is a corresponding direction ² D S( ²0 ) in the W coordinate 3 4
Or not further enhance it, in the case of metastability. It can be shown that the stability condition, equation 2.14, is equivalent to the re-
Self-Organization of Symmetry Networks
571
system for which the left-hand side of equation 2.14 has identical value (and vice versa),
²0 T
@V T ( W 0 ) 0 @V T ( W ) ² D ²T ², 0 @W @W
(2.17)
so the condition (see equation 2.14) that these values are nonpositive for every possible perturbation is independent of the coordinate transformation: S(W 1 ) must be stable if W 1 is stable or metastable if W 1 is metastable. Orthogonal symmetries of the equations of motion map convergent states onto convergent states. Two possibilities arise when the equations of motion are invariant under an orthogonal transformation S( W ). Either convergent states W 1 and S(W 1 ) are distinct from one another, in which case S( W 1 ) 6 D W 1 and S( W ) is said to be a spontaneously broken, or hidden, symmetry, or else they represent identical states, in which case S( W 1 ) D W 1 ,
(2.18)
and S( W ) is said to be a preserved, or manifest, symmetry. The preserved symmetries always form a subgroup of the equation of motion’s symmetry group, because they are elements of the equation of motion’s symmetry group and because symmetries form groups by denition. It is not helpful to think of spontaneous symmetry breaking as a temporal process that begins at the initial state and ends with convergence, and to regard it as being caused by asymmetry in the initial state. It simply describes the situation in which the objective function as a whole is symmetric under some transformation group, but in which the optima of the objective function are mapped onto one another, rather than onto themselves, by the action of that group. It is important to note that spontaneous symmetry breaking is not caused by slight asymmetries in the equations of motion. There are cases in physics where the dominant component of the Lyapunov function may be symmetric but some perturbation effect introduces a small asymmetry into the Lyapunov function, with the result that its optima no longer map onto themselves exactly. This situation is also called symmetry breaking, but the symmetry is then not hidden and the breaking is not called spontaneous. This article is concerned exclusively with spontaneous symmetry breaking. In the neural network context, whether a particular symmetry of the equations of motion is preserved or broken will depend on the distribution quirement that @V T / @W has no positive eigenvalues. Matrices H and H0 ´ S¡1 H S related by a similarity transformation have the same set of eigenvalues: H e D c e implies H0 S¡1 e D c S¡1 e. The condition that @V T ( W ) / @W has no positive eigenvalues is therefore independent of the transformation S( W ) if S( W ) is an orthogonal symmetry of the equations of motion.
572
Chris J. S. Webber
Pr( x ) and on the particular functional form of the Àj in equation 2.3 and on any xed symmetry-breaking parameters implicit in the functions Àj . In other words, whether a symmetry is preserved or broken depends on the particular learning rule; there is no guarantee that an arbitrary learning rule will preserve any symmetries, except the trivial identity transformation. The values of any symmetry-breaking parameters will control how large a subgroup of the dynamical symmetries is preserved; section 5 provides a simple example. Under the action of a learning rule that preserves a symmetry, the weight-vector conguration will become symmetric (see equation 2.18) on convergence, even though the learning starts from a random and asymmetric initial conguration in general. 3 Symmetry Preservation and Invariant Neural Response
The principles proved in the previous section will be familiar to those who have come across the spontaneous symmetry-breaking mechanism in physics. Although these principles have not elsewhere been set down and worked through explicitly for the neural network context, other authors have made references to symmetry breaking in neural networks. For example, Yamagishi (1994) refers to the breaking of left-right ocular symmetry in the visual cortex, which he suggests takes place when dominance columns are formed. However, other authors have not considered the signicance of the preservation of symmetries for transformation-invariant neural response. Considering this leads us to new results, which connect the subjects of symmetry networks and unsupervised learning quite intimately. Because we are interested in the response invariances of individual neurons in symmetry networks, we concentrate on transformations that permute the coordinates of individual weight vectors, operating in the same way on all the weight vectors: ( w 1 , . . . , w m ) ¡! (S( w 1 ), . . . , S( w m )).
(3.1)
Where the transformation S permutes all the weight vectors in the same way, the general symmetry denition, equation 2.10, simplies to vj (S(w 1 ), . . . , S( w m )) D S( vj ( w 1 , . . . , w m ))
( j D 1, . . . , m).
(3.2)
S( W ) will be an orthogonal transformation in the W -space if S( w ) is an orthogonal transformation on all the individual weight vectors. 5 If S(W ) D (S( w 1 ), . . . , S( w m )) is a preserved permutation symmetry, the symmetry-network conditions, 1.1 and 1.2, for the neurons’ responses to be invariant under S( x ) are automatically satised. If S( W ) D (S( w 1 ), . . . , 5
Because the mn £ mn S-matrix that transforms the W -space is assembled from m repetitions of the n £ n S-matrix that transforms each w -space, in block-diagonal form.
Self-Organization of Symmetry Networks
573
S(w m )) is a spontaneously broken permutation symmetry, however, neurons will instead discriminate between data patterns related by the transformation S(x ). (These results are valid not just for permutations but hold for all orthogonal transformations in the weight-vector space.) If the dynamics are invariant under a whole group of symmetries, some of them may be preserved while others are spontaneously broken, so a neuron may have invariant response under some transformations but discriminate for others. We observe that symmetry preservation provides a mechanism for neurons to acquire invariant response through self-organization, and that symmetry breaking provides a mechanism for the response invariances to be restricted to a specic subgroup of transformations. 4 Invariances of the Data Determine the Dynamical Symmetries
Whether a given permutation is actually a symmetry of the equations of motion in the first place depends on Pr(x). Consider the familiar ensemble-average learning rules in which u is a continuous function of m scalar products w_j · x, m distances |w_j − x|, and/or m lengths |w_j|, and for which the m neurons see the same input space x and so belong to the same layer of the network. Then

S(w_j) · S(x) = w_j · x    and    |S(w_j) − S(x)| = |w_j − x|,    (4.1)

for any permutation S, so the objective function, equation 2.4, will be symmetric,

U(S(w_1), ..., S(w_m)) = U(w_1, ..., w_m),    (4.2)

under any permutation that satisfies

Pr(S(x)) = Pr(x)    (4.3)

because

∫ u(x, S(W)) Pr(x) dx = ∫ u(S⁻¹(x), W) Pr(x) dx
                      = ∫ u(x, W) Pr(S(x)) dx.    (4.4)
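Equation 4.4 can be illustrated by Monte Carlo estimation. In the sketch below, the choice u = Σ_j softplus(w_j · x) is an assumed example of a continuous function of the scalar products, not the article's specific u, and the data distribution has i.i.d. components, which makes Pr(x) invariant under every coordinate permutation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2

# Training data with i.i.d. components, so Pr(x) is invariant under every
# coordinate permutation S(x).
X = rng.normal(size=(100_000, n))

S = np.eye(n)[[1, 0, 2, 3]]   # permutation swapping coordinates 0 and 1

def U(W):
    """Monte Carlo estimate of U(W) = <u(x, W)>, with u = sum_j softplus(w_j . x)."""
    A = X @ W.T                                   # scalar products w_j . x
    return float(np.mean(np.sum(np.logaddexp(0.0, A), axis=1)))

W = rng.normal(size=(m, n))
W_perm = W @ S.T                                  # S applied to every weight vector

# Equation 4.4: U(S(W)) = U(W) whenever Pr(S(x)) = Pr(x), here up to
# Monte Carlo error.
assert abs(U(W_perm) - U(W)) < 0.05
```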
For many ensemble-average learning rules (see equation 2.3), we can derive analogous results even if no objective function exists. Consider learning rules of the form

χ_j(x, w_1, ..., w_m) = v_j x − ω_j w_j    (j = 1, ..., m),    (4.5)
574
Chris J. S. Webber
where each of the 2m scalars v_j and ω_j is a continuous but not necessarily linear function of the m scalar products w_j · x, the m distances |w_j − x|, and/or the m lengths |w_j|. Their dynamics will be symmetric (see equation 2.10) under any permutation that satisfies equation 4.3, because all the v_j and ω_j are invariants (see equation 4.1) under combined S(W) and S(x), so that

∫ χ_j(x, S(W)) Pr(x) dx = ∫ S(χ_j(S⁻¹(x), W)) Pr(x) dx
                        = ∫ S(χ_j(x, W)) Pr(S(x)) dx.    (4.6)

These new results state that any permutation that is an invariance of the data's distribution will be a dynamical symmetry of these learning dynamics. Those permutations that are invariances of the data's distribution will therefore give rise to corresponding weight equalities if they are preserved unbroken in the convergent state. We therefore propose that such invariances of the data, unknown prior to learning, can drive the self-organization of transformation invariances in the response of individual neurons, through symmetry preservation. (Again, all the arguments of this section hold for the wider group of orthogonal⁶ transformations.) The theory of spontaneous symmetry breaking applies to real physical systems, which are only ever deemed symmetric to a finite approximation. For u(x, w_1, ..., w_m) that are continuous functions, approximate symmetries of Pr(x) will give rise to approximate symmetries of the objective function U(W). Any Pr(x) can be decomposed into a sum of symmetric and antisymmetric components⁷ under a given permutation S(x), and any U(W) can be decomposed under the corresponding S(W). As one perturbs Pr(x) away from exact symmetry by introducing an infinitesimal antisymmetric component, U(W) will acquire an infinitesimal antisymmetric component.
As this component grows, any symmetry-preserving stable optima W_∞ will be perturbed away from the invariant subspace defined by S(W) = W, remaining approximately invariant under S but to an increasingly inaccurate degree.⁸ Thus, approximate symmetries of Pr(x) can give rise to approximate weight symmetries in the convergent state. The example of section 6 demonstrates that approximate invariances of the data's distribution can indeed drive the development of approximate symmetries among synaptic weights.

⁶ The arguments could also be generalized straightforwardly to unitary transformations if we were to consider complex-valued weights and data.
⁷ The decomposition Pr(x) = Pr₊(x) + Pr₋(x) is merely an analytical device. The functions Pr±(x) are not themselves probability densities and are not required to satisfy Pr±(x) ≥ 0 and ∫ Pr±(x) dx = 1.
⁸ A large enough antisymmetric component may eventually break the approximate symmetry of these stable optima, when they abruptly cease to be stable, or, alternatively, it may move them so far from the invariant subspace that their approximate symmetry becomes washed out.
5 Hebbian Adaptation of Threshold Neurons as an Example
So far, only a few properties of learning rules have been assumed, but remarkably general predictions have been made. To demonstrate that spontaneous symmetry breaking gives rise to self-organized symmetry in practice, a specific learning rule is cited. Webber (1994) studied the Hebbian adaptation of a scalar-product neuron whose continuous threshold nonlinearity approximates the limiting case

lim_{s→0} ρ(a) = a − θ if a ≥ θ,  0 if a ≤ θ;
lim_{s→0} dρ/da = 1 if a ≥ θ,  0 if a ≤ θ,    (5.1)

where a ≡ w · x is the scalar product. s represents a "softness" parameter, which determines how closely the threshold nonlinearity approximates this sharp piece-wise linear limit in practice: a convenient soft parameterization (Webber, 1997, 1998) is

ρ(a) = s log[1 + exp((a − θ)/s)],    dρ/da = [1 + exp(−(a − θ)/s)]⁻¹.    (5.2)
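The relationship between the soft form (equation 5.2) and the sharp limit (equation 5.1) can be checked directly. The sketch below uses the identity that the two forms differ by at most s·log 2, attained at a = θ (the grid and parameter values are illustrative):

```python
import numpy as np

def rho_soft(a, theta, s):
    """Soft threshold nonlinearity of equation 5.2 (numerically stable form)."""
    return s * np.logaddexp(0.0, (a - theta) / s)

def rho_hard(a, theta):
    """Sharp piece-wise linear limit of equation 5.1 (the s -> 0 case)."""
    return np.maximum(a - theta, 0.0)

a = np.linspace(-1.0, 1.0, 201)
theta = 0.2

# The soft form deviates from the hard limit by at most s * log(2),
# so the gap vanishes as the softness parameter s -> 0.
for s in (0.1, 0.01, 0.001):
    gap = np.max(np.abs(rho_soft(a, theta, s) - rho_hard(a, theta)))
    assert gap <= s * np.log(2.0) + 1e-12
```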
The neuron's weight vector rotates under the effect of Hebbian (Hebb, 1949) adaptation to a distribution Pr(x) of training patterns,

dw/dt ∝ ⟨x ρ(w · x)⟩_{x} − (length-constraint term parallel to w),    (5.3)
subject to a Lagrange length constraint |w| = 1. The Lagrange term in equation 5.3 must operate parallel (or antiparallel) to w, because it is derived from the gradient of |w|² (Webber, 1994). The only effect of using the soft threshold, equation 5.2, instead of 5.1 is that the adaptation can bootstrap itself if all the training patterns initially lie below threshold. The convergent states for the two forms (equations 5.1 and 5.2) are clearly no different if s is small (Webber, 1997). (Actually, the soft form, equation 5.2, is used in the simulations here only because it is generally preferable to the original Webber (1994) form, equation 5.1; the simulation results can also be obtained using the s = 0 form, equation 5.1.) Because u(x, w) = ρ(x · w)², a function of x · w only, the objective function U(w) will be symmetric under any permutation symmetry of Pr(x). Hebbian adaptation optimizes this objective function by gradient ascent (subject to the length constraint), seeking out one of its stable maxima. Maxima related to one another by permutation symmetries of Pr(x) have basins of attraction of equal area. A large enough population of neurons, adapting independently of one another, will converge to populate such maxima in equal numbers, provided their initial states were distributed randomly and uniformly over the unit hypersphere in w-space.
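A discrete-time sketch of this adaptation process follows. The Lagrange length constraint is implemented here as explicit renormalization to |w| = 1 after each step, a common practical substitute assumed for illustration; the batch size, learning rate, and random training set are likewise illustrative, not the article's:

```python
import numpy as np

def rho_soft(a, theta, s):
    """Soft threshold of equation 5.2."""
    return s * np.logaddexp(0.0, (a - theta) / s)

def hebbian_step(w, X, theta, s, eta=0.01):
    """One batch step of the thresholded Hebbian rule, equation 5.3.

    The Lagrange constraint |w| = 1 is enforced by renormalizing after
    the gradient step (a standard practical choice).
    """
    rho = rho_soft(X @ w, theta, s)            # rho(w . x) for every pattern
    dw = (X * rho[:, None]).mean(axis=0)       # batch estimate of <x rho(w . x)>
    w = w + eta * dw
    return w / np.linalg.norm(w)               # project back onto |w| = 1

rng = np.random.default_rng(1)
X = rng.random((500, 16))                      # nonnegative training patterns
w = rng.normal(size=16)
w /= np.linalg.norm(w)
for _ in range(200):
    w = hebbian_step(w, X, theta=0.5, s=0.01)
assert np.isclose(np.linalg.norm(w), 1.0)      # constraint maintained throughout
```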
Appendix A shows analytically that the threshold θ is a symmetry-breaking parameter, which determines how large a subgroup of Pr(x)'s symmetries will leave the response ρ(w · x) invariant when w has converged to a stable state. For low enough θ, the weight vector will develop symmetry under every permutation that is a symmetry of Pr(x). As θ is raised, various symmetry-breaking bifurcations will be passed, which depend quantitatively on Pr(x). At each bifurcation, a previously stable state becomes unstable, to be replaced by a set of new stable states having a smaller symmetry group.⁹ The new stable states will be capable of invariant response for only a smaller group of transformations. The broken symmetries no longer map a stable state onto itself, but instead map the set of new stable states onto one another. The sequence in which the permutation symmetries of Pr(x) are progressively broken into smaller and smaller symmetry groups as θ is raised can be determined analytically (see appendix A). In this way, the symmetry-breaking parameter can be used to control the degree of invariance or discrimination in the neuron's response. Among a collection of neurons adapting with various values for the symmetry-breaking parameter, those with lower values will develop invariant responses for a larger group of transformations, while those with higher values will discriminate to a greater extent. If all permutation symmetries are preserved, so that the weight vector just ends up at the mean of the training ensemble, one obtains a useless neuron with no capacity to discriminate. If the spontaneous symmetry breaking is so extensive that only the identity symmetry is preserved, no invariances will develop.
It is the ability of the Webber (1994) rule to control the symmetry breaking under the influence of a symmetry-breaking parameter, in such a way that an intermediate-sized subgroup of the data's permutation symmetries is preserved, which makes possible the development of response invariances in a discriminating neuron. It is clear from section 4 that a wide range of unsupervised algorithms will constitute spontaneous symmetry-breaking systems when trained on data distributions that possess symmetries, but this does not mean that such systems preserve their symmetries in general. Only the single-cell rule of Webber (1994) has so far been proved analytically to be able to preserve specific subgroups, up to and including the group of all the permutation symmetries of the data, and to be able to break them in a well-defined and controlled sequence under the influence of a symmetry-breaking parameter.

⁹ If the symmetry is exact and discrete, the transition to states of lower symmetry will be abrupt. However, if Pr(x) has an almost continuous sequence of progressively less accurate symmetries, the stable states can appear to change continuously as these approximate symmetries are progressively broken. Section 6's synthetic data set has examples of both of these kinds of symmetry.

(It is also possible to replace the threshold function ρ(a) in equations 5.1 and 5.2 by the more conventional and more biologically plausible sigmoid, and to use the same single-cell Hebbian rule and
symmetry-breaking parameter of Webber (1994) to reproduce this article's empirical demonstrations, though at the expense of no longer being able to characterize the symmetry-preserving behavior analytically so easily.)

6 A First Demonstration Using Completely Defined Statistics
Webber (1991) provided an empirical demonstration of self-organized permutation invariance, using a synthetic data set having completely defined statistics. The data set was designed to model the problem of finding statistical dependences among sparse-coded features. The same data set is used here to demonstrate that such invariances arise through the spontaneous symmetry-breaking mechanism. The statistics of this example data set are made simple to define, so that one can perceive clearly the invariances of the data and the self-organized symmetries of the weight vectors. However, the statistical dependences are nontrivial and cannot be revealed by principal component analysis or by algorithms designed to find localized clusters, as is explained below. This example addresses the problem of revealing the statistical dependences between sparse, nonoverlapping, sharp spikes of neural activity, and is described at greater length, with the benefit of diagrams, in Webber (1991). Every pattern of this data set consists of exactly four spikes of unit height on a zero background. Each spike occurs in one of four separate one-dimensional (1D) coordinate spaces labeled A to D, and there is always exactly one spike in each of the four spaces. The location of the spike within each space is determined by a single position-coordinate t, which can take integer values modulo n/4; that is, the four 1D spaces t_A, ..., t_D are each discretized into n/4 locations and have circular boundary conditions. The spikes can be thought of as representing the responses of n localized feature detectors, arranged in four rows, each containing n/4 detectors. Each row detects localized features of one of four types in hypothetical 1D images, within which the location of the feature is measured by the position-coordinate t. Together the four rows of feature detectors encode those images as n-dimensional code vectors x, using a sparse 4-of-n code.
The analogy with rows of feature detectors is taken further in the more realistic but less precisely definable example of section 7. The n-dimensional x-vectors of this data set all subtend an angle arccos √(4/n) with the uniform n-dimensional vector (1, 1, ..., 1). Positive-valued sparse codes have the property that their code vectors subtend large angles with the uniform vector. Every x-vector lies on one of the many vertices of the unit hypercube that lie √4 length-units from the origin. Nonidentical patterns are separated from one another by distances of √1, √2, √3, or √4 units in x-space, depending on the number of spikes the patterns happen to share in common at corresponding locations. No two patterns are farther than 2 units from each other, and no two patterns lie closer than 1 unit.
Individually, the spikes are uniformly distributed over the four 1D spaces:

Pr(t_A) = Pr(t_B) = Pr(t_C) = Pr(t_D) = 4/n.    (6.1)
Despite their uniform marginal distributions, the locations of the spikes are statistically dependent on one another, with two significantly different correlation lengths s₂ > s₁ ≫ 1. The statistics Pr(x) of these training data are determined completely from the joint distribution

Pr(t_A, t_B, t_C, t_D) = (4/n) N(t_A − t_B, s₁) N(t_C − t_D, s₁) N((t_A + t_B)/2 − (t_C + t_D)/2, s₂).    (6.2)
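One way to draw samples consistent with equation 6.2 is to use the center/difference decomposition of equations 6.3 to 6.5. The sketch below is an approximation: positions are rounded to integers and wrapped modulo n/4 rather than drawn from properly discretized gaussians, and the parameter values are illustrative:

```python
import numpy as np

def sample_patterns(num, n=80, s1=2.0, s2=8.0, seed=0):
    """Draw 4-spike patterns whose spike positions approximate equation 6.2."""
    rng = np.random.default_rng(seed)
    m = n // 4                                 # locations per 1D space
    cAB = rng.uniform(0, m, size=num)          # (tA + tB)/2, uniform (eq. 6.4)
    cCD = cAB + rng.normal(0, s2, size=num)    # correlated centers (eq. 6.5)
    dAB = rng.normal(0, s1, size=num)          # tA - tB (eq. 6.3)
    dCD = rng.normal(0, s1, size=num)          # tC - tD
    spikes = [cAB + dAB / 2, cAB - dAB / 2,    # tA, tB
              cCD + dCD / 2, cCD - dCD / 2]    # tC, tD
    X = np.zeros((num, n))
    for i, t in enumerate(spikes):
        loc = np.round(t).astype(int) % m      # circular boundary conditions
        X[np.arange(num), i * m + loc] = 1.0   # one unit spike per 1D space
    return X

X = sample_patterns(1000)
assert X.shape == (1000, 80)
assert np.all(X.sum(axis=1) == 4.0)            # exactly four spikes per pattern
```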
Although (discretised) univariate gaussians N(t, s), of zero mean and standard deviation s, appear in Pr(t_A, t_B, t_C, t_D), Pr(x) is clearly not a multivariate gaussian and is not localized into distinct clusters. Algorithms designed to find localized clusters are inappropriate for revealing the statistical dependences between these nonoverlapping spikes, because there are no distinct, localized clusters in x-space to find. However, both the pair of spikes (A, B) and the pair (C, D) appear as relatively localized structures in image-space t, as is evident from the marginal distributions

Pr(t_A, t_B) = (4/n) N(t_A − t_B, s₁)    and    Pr(t_C, t_D) = (4/n) N(t_C − t_D, s₁).    (6.3)
In image-space t, each pair is a composite structure that has an internal degree of freedom (t_A − t_B or t_C − t_D). Webber (1991) points out that internal degrees of freedom of such composite image-space structures are analogous to shape degrees of freedom. The "centers of gravity" of the two pairs are uniformly distributed over image-space t,

Pr((t_A + t_B)/2) = Pr((t_C + t_D)/2) = 4/n,    (6.4)
although the two centers are locally correlated with each other:

Pr((t_A + t_B)/2, (t_C + t_D)/2) = (4/n) N((t_A + t_B)/2 − (t_C + t_D)/2, s₂).    (6.5)
The two pairs are intercorrelated more weakly than they are intracorrelated, because s₂ > s₁. In the limit s₂ → ∞, the pairs would cease to be intercorrelated at all. The statistical dependences define composite structures on two size scales: the strongly correlated (A, B) and (C, D) pairs on the finer scale, and the weaker associations between those (A, B) and (C, D) pairs
on the coarser scale. The coarser-scale composite structures have a shape degree of freedom (t_A + t_B)/2 − (t_C + t_D)/2. The statistics Pr(t_A, t_B, t_C, t_D) are invariant under any translation t → t + τ applied to all four t-coordinates (corresponding to translationally invariant statistics of the hypothetical images that contain the features). These global translation invariances give rise to permutation symmetries of Pr(x). The distribution is also symmetric under the two global exchanges t_A ↔ t_B and t_C ↔ t_D, permutations that map the four t-coordinates onto one another, and also under the global exchange (t_A, t_B) ↔ (t_C, t_D). More interesting are the approximate symmetries that permute individual pairs of neighboring x_i, corresponding to similar t-values, that is, t ↔ t + τ where |τ| ≪ s₁. These approximate symmetries result from the fact that N(t, s) changes appreciably only over a scale length |t| ~ s, so that a pair of x_i corresponding to neighboring t-locations within the same t-space will have very similar statistics. These symmetries would correspond to approximate invariances of the statistics of the hypothetical 1D images, under local shifts of individual features. Principal component analysis (PCA) would be inadequate to deduce the localized nature of the structures defined by these correlations, because the eigenvectors of Toeplitz matrices are periodic, not localized. (The distinction between the two correlation lengths would result in periodic eigenvectors of distinct wavelengths, corresponding to distinct eigenvalues, however.) Because of its threshold nonlinearity, without which the Hebbian neuron could do no more than the PCA neuron of Oja (1982), the neuron described by equation 5.1 is sensitive to higher-order statistics, that is, properties of Pr(x) other than its mean and covariance. Its nonlinearity allows its convergent states to undergo symmetry-breaking bifurcations under the control of a neural parameter.
This is demonstrated in Figure 1. Through the preservation of some permutation symmetries of Pr(x) and the breaking of others, this Hebbian neuron is able to reveal the two scales of localized composite structure present in the data. If the bifurcation parameter θ is high enough, the neuron develops into a detector for either (A, B) pairs or (C, D) pairs, the finer-scale structures. Because the global translation symmetry of Pr(x) has been broken, a pair detector can resolve its pair's rough location. Because the approximate symmetry of Pr(x) under exchange of neighboring t-locations has been preserved, the responses of these pair detectors are tolerant to variation in the shape degrees of freedom t_A − t_B and t_C − t_D. If θ lies between the two bifurcation points, the neuron resolves the coarser-scale structure but can no longer resolve the finer-scale structure. Again, because the global translation symmetry of Pr(x) has been broken, the neuron can resolve the rough location of the coarser-scale structure. Again, the approximate symmetry of Pr(x) under exchange of neighboring t-locations has been preserved, so the neural response is tolerant to the shape degree of freedom (t_A + t_B)/2 − (t_C + t_D)/2. If the bifurcation parameter θ is low enough, none of
Figure 1: The weight vector w of the threshold neuron (see equation 5.1), having converged under Hebbian adaptation (see equation 5.3) to patterns drawn from the permutation-symmetric distribution Pr(x) defined by equation 6.2, is plotted for each of three values of the bifurcation parameter θ. Permutation-symmetric receptive fields have developed from randomly initialized, asymmetric initial states. The approximate symmetry of Pr(x) under individual swaps of closely neighboring t-locations is preserved unbroken into the convergent states. The receptive fields vary smoothly with t, resulting in local position tolerance of the neural response, and tolerance to the "shape" degree of freedom. Above the second bifurcation, two kinds of convergent state can be obtained: the one depicted and another, related to it by the symmetry transformation (t_A, t_B) ↔ (t_C, t_D), which is broken at the second bifurcation. The global translation symmetry of Pr(x) is broken at the first bifurcation, and the resulting localized convergent states are related to one another via those broken translation symmetries. The special choice of origin used for this figure shows the states depicted as centered, although there are equally many such states at every location, identical up to a translation. Broken symmetries always relate convergent states that have equal basins of attraction, so such states are equally likely to develop if the initial states were uniformly distributed. (The very small value 10⁻³ used for s means that ρ(w · x) is a very close approximation to the sharp nonlinearity; see equation 5.1.)
the permutation symmetries of Pr(x) is broken, and w converges to the mean of Pr(x), as predicted by the theory. When no symmetries are broken, the neuron has no capacity to discriminate between patterns. Each bifurcation breaks some symmetries but not others; the symmetry-breaking mechanism gives rise to self-organized detectors that can discriminate between structures but generalise over their internal degrees of freedom. A different kind of structure is detected in each distinct regime of the symmetry-breaking parameter. A large enough population of such neurons, with randomly initialized weights and randomly set θ-values, will converge to span the full variety of convergent states available. This demonstration gives examples of both discrete, exact symmetries (the global exchanges t_A ↔ t_B, t_C ↔ t_D, and (t_A, t_B) ↔ (t_C, t_D)) and continuous, approximate symmetries (the symmetries under local pair-wise swaps t ↔ t + τ, which become progressively less accurate with increasing |τ|). The breaking of exact, discrete symmetries involves abrupt transitions to states of lower symmetry, as is the case for the two bifurcations illustrated in Figure 1. The breaking of an almost continuous sequence of progressively less accurate symmetries involves an apparently continuous process of transition, however. Although this is not illustrated, the smooth weight profiles of Figure 1 become progressively narrower as the bifurcation parameter θ is raised. This narrowing process corresponds to the progressive breaking of the continuous sequence of approximate symmetries under local pair-wise swaps t ↔ t + τ, in which less accurate, larger-|τ| symmetries are broken before more accurate, smaller-|τ| symmetries. Preservation of symmetry under local permutations of neighboring input coordinates has given rise to locally smooth weight-vector profiles, even though the individual training patterns are sharply spiked and not smooth.
This does not happen with every equation of motion that has dynamical symmetries, however, because arbitrary adaptation rules are not in general guaranteed to preserve permutation symmetries as the Hebbian rule of Webber (1994) is. If the local permutation symmetries of this data set were not preserved, the converged weight-vector profiles would instead appear asymmetric, spiky, and disordered, and would not provide translation-tolerant response.

7 Hebbian Adaptation to Simple-Cell-Encoded Natural Images
The statistical dependences present in temporally uncorrelated natural image data can also give rise to local weight symmetries. This demonstration also shows that the mechanism is no less applicable if the input vectors are real valued, rather than binary valued as in the previous idealized example. However, for nonsynthetic data, the distribution Pr(x) is not constructed with symmetries defined a priori, so it is not so easy to determine precisely which local permutation symmetries of Pr(x) are responsible for the approximate local weight symmetries that develop. These statistical symmetries
are not necessarily simple pair-wise swaps as in the previous synthetic example, but may involve larger subsets of the set of feature detectors. When exposed to a temporally uncorrelated ensemble of spectrally flattened natural images that has rotationally invariant statistics, the Hebbian threshold neurons of Webber (1994) develop localized, oriented, bandpass receptive fields that appear similar to those of cortical simple cells. Two examples are shown in Figure 2, obtained under simulation conditions described in appendix B. (The two examples chosen for Figure 2 are rather special in that they are horizontally and vertically oriented. Actually one can obtain receptive fields having a range of orientations, positions, and size scales, depending on the random initialization of the weight vector.) This result raises the questions of what is gained by transforming images into this sparse, nonlinear simple-cell representation, what happens if similar, higher-level Hebbian neurons are trained on input that has already been preprocessed by a lower network layer into this simple-cell-encoded form, and whether the naturally occurring statistical dependences between the simple-cell outputs can drive the development of weight symmetries in higher-level neurons. These questions have to be addressed using much larger natural images than those that other authors have used to derive simple-cell receptive fields (160 × 160 pixels as opposed to 16 × 16). This is because one has to simulate a significant number of simple cells simultaneously, which collectively cover an image area substantially larger than the individual receptive fields. Only then can one demonstrate the emergence of envelopes of translation tolerance that are substantially larger than the simple cells' receptive fields.
The simulations required are much more time-consuming, because more training patterns are required to characterize the greater number of degrees of freedom that exist in higher-dimensional pattern-vector spaces. Consequently, constraints of computing resources have restricted the number of simple-cell orientations simulated here to two (horizontal and vertical), and restricted all the simple cells to having the same receptive field size. This is because one can efficiently simulate many receptive fields that share the same orientation and size by exploiting the convolution theorem (see appendix B). A more realistic demonstration of complex-cell self-organization would require the simulation of many more simple cells, spanning a full range of orientations, sizes, modal spatial frequencies, and modal Fourier phases. However, these simulations are sufficient to demonstrate the self-organization of weight symmetries driven by the statistical dependences within temporally uncorrelated natural data. Figure 2 describes the first, simple-cell layer of a two-layer network, in which all neurons adapt according to the Webber (1994) Hebbian rule. Such a two-layer network of these Hebbian neurons was originally studied in Webber (1994), analytically and using synthetic data. All neurons in the simple-cell layer have the same threshold function (see equations 5.1
Figure 2: Grids of 512 "simple cells," whose outputs form the input vector for the "complex cell" whose weight vector is represented in subsequent figures. These simple cells have either horizontal orientation selectivity (left) or vertical (right). The numbers in the figure represent pixel coordinates within the input image. The 256 horizontal cells differ only in the location of their receptive field's centroid within the image plane (as do the 256 vertical cells). The centroids of the two kinds of simple cell are located on the lattice points of two corresponding grids (lower). These simple cells sample the image plane anisotropically; they are most densely packed along the dimension perpendicular to their orientation axis. The simple cells have no lateral connections with each other in this model, so the complex cell must deduce their topological relationships purely on the basis of the joint statistics of their responses. The receptive fields are bandpass; their 2D power spectra all have bandwidths of 0.082 f_N and 0.14 f_N measured parallel and perpendicular to the orientation axis, respectively. (f_N denotes the Nyquist frequency.) Their mean spatial frequency perpendicular to the orientation axis is 0.26 f_N. A single receptive-field size is used for all the simple cells so that the convolution theorem can be exploited in order to compute many scalar products efficiently (see appendix B).
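The convolution-theorem trick mentioned in the caption can be sketched as FFT-based cross-correlation: one shared receptive-field profile is scored against every image offset at once. The image and field below are random stand-ins, not data from the article:

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.normal(size=(64, 64))     # a toy stand-in for a whitened image
field = rng.normal(size=(9, 9))       # one shared receptive-field profile

# Cross-correlating the image with the field evaluates the scalar product
# x . w for a receptive field centred at every image offset simultaneously:
# correlate = IFFT( FFT(image) * conj(FFT(field)) ).
F_img = np.fft.rfft2(image)
F_fld = np.fft.rfft2(field, s=image.shape)   # zero-pad the field to image size
responses = np.fft.irfft2(F_img * np.conj(F_fld), s=image.shape)

# Spot-check one (circular) offset against the direct scalar product.
i, j = 10, 20
patch = np.roll(np.roll(image, -i, axis=0), -j, axis=1)[:9, :9]
assert np.isclose(responses[i, j], np.sum(patch * field))
```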
and 5.2) and threshold value,¹⁰ and collectively transform the spectrally flattened input images into a nonlinear sparse representation. The array of real-valued, nonnegative simple-cell outputs so generated for each image is used as the input vector x for a single higher-level neuron, which adapts by the same rule. It is this "complex cell" whose self-organization we are primarily concerned with. The array of simple cells described in Figure 2 can be conceptually divided into two grids of neurons. One grid consists of similar receptive fields all oriented vertically, and the other of similar receptive fields all oriented horizontally. The centroids of the vertically oriented receptive fields are more closely packed along the horizontal image dimension than along the vertical, and the converse is true for the horizontally oriented fields. Field (1987) has proposed, on the grounds of code completeness and economy, that simple cells should not overlap with other similarly oriented simple cells to a greater extent along their orientation axes than perpendicular to their orientation axes, and therefore that their centroids may in reality be distributed anisotropically in this way with respect to orientation axis. With the topographic arrangement of simple cells that is used here (in Figure 2), the distinct receptive fields are roughly orthogonal to one another.¹¹ The simple cells have no lateral connections with each other in this model, so the complex cell must deduce their topological relationships purely on the basis of the joint statistics of their responses. Each 160 × 160-pixel training image is excised from a larger landscape photograph, which has been subjected to a different and uniformly distributed random rotation and then spectrally flattened (see appendix B). The ensemble of training images therefore has rotationally invariant statistics.
Consequently, the joint distribution Pr(x ) of simple-cell outputs will be invariant under the global permutation that swaps over the two grids of Figure 2—that is, the permutation that exchanges the set of all vertically oriented simple cells with the set of all horizontally oriented simple cells, while maintaining the topographic relationships within each set. Neglecting edge effects, the training ensemble is also invariant under global transla-
10 The simple cell’s thresholds # are set at 0.012, the same value as was used during the training process that generated their receptive elds. (The training used a very small but nonzero value of the softness parameter s D 5 £ 10¡4 , so the hard-threshold limit [see equation 5.1] is approximated very closely.) # and s are quoted in input-vector length units in which the whitened input images have pixel variance measured as 4 £ 10¡5 squared length units. The variance of the projection x ¢ w along each simple cell’s unit weight vector is also 4 £ 10¡5 (because the image ensemble is on average spectrally at, and so must have an isotropic covariance matrix). 11 The scalar product between the weight vectors of immediate neighbors in the same row is 0.0027, corresponding to a w -space angle of 89.8 degrees. The corresponding values for nearest neighbors in adjacent rows of the same orientation grid are 0.29 and 73 degrees, and 0.15 and 81 degrees for nearest neighbors separated by two rows. The values for colocated receptive elds of perpendicular orientations are 0.17 and 80 degrees.
Self-Organization of Symmetry Networks
585
Figure 3: A self-organized conguration of input synapses for the second-layer “complex cell.” Heights represent values of synaptic weights (scaled by a factor of 1000). The 512 synapses provide input from the 512 “simple cells” located at the 512 grid points depicted in the previous gure. (Although not constrained a priori to be nonnegative, all the complex cell’s synaptic weights have developed nonnegative values, because the outputs of the simple cells that feed these synapses are always nonnegative.) In the example, the global rotation symmetry of the data’s statistics happens to have been broken in such a way that this complex cell will only be triggered by activity in the horizontally oriented simple cells: the complex cell depicted has horizontal orientation selectivity. The weight prole is smooth—approximately symmetric under permutations of neighboring weights. The complex cell will have translation-tolerant response as a result. In this simulation, the threshold parameter # was set at 0.02 (measured in the same length-units as the simple cells’ thresholds).
tions, because the training images were selected randomly from uniformly distributed locations within larger landscape photographs and because the training images are presented repeatedly to the network, translated to every possible position offset with respect to the network's input window (see appendix B). Pr(x) will therefore be approximately invariant under those global cycles among simple cells that correspond to these translations. These symmetries are only approximate because the periodic boundary condition implied by global cycling is an imperfect model of image edge effects. If Pr(x) also has approximate symmetries under more local permutations, which involve smaller subsets of simple cells as in the synthetic example of section 6, those symmetries will give rise to localized weight symmetries when the symmetry-preserving Hebbian rule is trained on these simple-cell outputs. Further details of how this Hebbian rule is simulated here are given in appendix B. Figure 3 depicts a self-organized configuration of the higher-level Hebbian "complex cell," trained on the outputs of Figure 2's simple-cell grid as images from the training ensemble are presented. After adapting from a random initial configuration, the input synapses of this higher-level cell vary smoothly with the grid location of the corresponding simple cell. The approximate weight symmetries that have developed in the complex cell's receptive field mean that the complex cell will be tolerant to the position of the simple cells that trigger it. The complex cell's translation tolerance is substantially greater than that of the simple cells, which will be of the order of the separation between peak and trough in the simple cells' receptive fields (3 pixels), or less. This observation is consistent with the experimental result that real complex cells are position tolerant over a distance greater than the width of the barlike features for which they are optimally sensitive (Hubel & Wiesel, 1962). One can see from Figure 3 that the "complex cell" has developed so as to become sensitive to activity in simple cells of one orientation only; the global rotation symmetry has been broken. The complex cell's orientation selectivity has developed because the statistical dependences between neighboring simple cells of the same orientation are stronger than the statistical dependences between simple cells of very different orientations. Long image features typically trigger several similarly oriented neighboring simple cells simultaneously, but neighboring simple cells of very different orientations are simultaneously triggered less frequently, by rarer features such as corners, which have both horizontal and vertical components. (Because rotational symmetry has been broken, the translation invariance result cannot be attributed to rotational symmetry, and so the credibility of that result is not undermined by the fact that rotational invariance of the training statistics has been reinforced by applying random rotations.)
Figure 4 illustrates that broken permutation symmetries map distinct stable states onto one another, and that raising the value of the Webber (1994) threshold parameter ϑ breaks symmetry, as the analytical treatment of appendix A predicts. At the lower of the two threshold values depicted, the stable weight vectors are approximately invariant (i.e., invariant up to edge effects) under global permutations that correspond to translations of the whole image. The 90 degree rotation symmetry of Pr(x) is already broken at this ϑ-value; this rotation symmetry maps the two stable configurations obtainable at this ϑ-value onto one another. When ϑ is raised to the higher value, the global translation symmetry is broken, and longer-range translations now map distinct localized stable configurations onto one another. Even at this higher ϑ-value, the translation tolerance of the "complex cell" is still 10 times wider than that of the simple cells that provide its input; its weight vector remains approximately symmetric under the most localized permutations of neighboring simple cells.

8 Discussion
Webber (1991) proposed that a role of sparse feature coding might be to map the raw data into a representation in which transformations of features, such as local shift and deformation, are rendered as permutations of the output values among the various feature-coding units. That hypothesis can be extended by observing that any invariances of the raw data's statistics under such transformations would give rise to homomorphic symmetries of the sparse code's statistics under permutations among the coding units. Such mapping of invariances onto permutation symmetries would be a useful function for sparse coding to perform, because permutation symmetries can drive the development of invariant response in neurons of the next network layer, via the spontaneous symmetry-breaking mechanism. The examples have illustrated how translations (of individual features or sets of features) can be mapped onto permutations by a sparse feature code. When the features are moved, a different subset of feature detectors fires, which corresponds to the new feature locations. By analogy, one can see how scale transformations might be mapped onto permutations by a sparse code based on self-similar wavelets of various size scales; when a feature or set of features is enlarged or reduced, a different subset of wavelet detectors will fire, corresponding to the new feature sizes. Because the principles proposed here do not constrain the coding units of sparse codes to have any particular topological relationships (such as the 2D topology of the image plane, for example), it is conceivable that rotations in three dimensions could be represented as permutations of some more abstract, higher-level, sparse code in which 3D features having very different orientations in 3D space are encoded by different feature detectors. (However, it would be inefficient to encode certain very simple transformation groups by homomorphism onto the permutations. For example, image illumination variation can be effectively removed by simple preprocessing operations such as local normalization or local background subtraction on a logarithmic scale.)
Sparse feature codes represent individual features of the data on individual coding units, or feature detectors. If a sparse feature code has permutation symmetries Pr(S(x)) ≈ Pr(x), what does this imply about the subset of features exchanged under the permutation? If exchanging one feature with another (or one subset of features with another subset) is to transform every training pattern into a roughly equiprobable pattern, then the exchanged features (or subsets of features) must have similar statistics, so that the exchanged features always arise in similar contexts with similar likelihood. ("Similar context" means that apart from the features exchanged, the same combination of features is present in the two equiprobable training patterns; the pattern of responses of the rest of the code's feature detectors is the same for the two patterns.) Thus, if a feature code has permutation symmetries, this means that the exchanged features are indistinguishable from each other on the basis of their statistical dependences on the rest of the features that make up the code. In the spontaneous symmetry-breaking mechanism, it is this property of statistical indistinguishability among features that can drive the development of invariant neural response. The example of section 6 illustrates how discrete local permutations can represent effectively continuous transformations like local position shifts and shape deformations. However, permutations can also represent absolutely discrete transformations, because permutations are inherently discrete transformations themselves. One example of an absolutely discrete transformation is the global exchange of all visual cells having left-ocular dominance with all cells having right-ocular dominance. Yamagishi (1994) has already argued that the breaking of this obvious symmetry may be responsible for the formation of ocular dominance columns. This leads us to speculate that the symmetry-preservation mechanism proposed here could explain how invariance with respect to ocularity might be reacquired by cells involved in higher stages of visual processing. Another example of a discrete transformation is the replacement of a word in a sentence by a synonymous word. If the exchanged words have always arisen in similar contexts in the listener's experience and have similar frequencies of occurrence, then they are statistically indistinguishable by definition. Another example is the difference in pronunciation of the vowels between one speaking accent and another. In principle, the statistical indistinguishability model could allow invariance with respect to pronunciation to be learned, because the different pronunciations of the vowels are always experienced in the same contexts.
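As an illustration (a synthetic toy, not from the original text), the statistical-indistinguishability criterion can be checked empirically on a hypothetical three-unit sparse code in which two units are exchangeable by construction while a third has different statistics:

```python
import random
from collections import Counter

# Hypothetical toy sparse code: units 0 and 1 are statistically
# indistinguishable by construction (they fire with the same probability
# in the same shared context), while unit 2 has different statistics.
random.seed(0)
patterns = []
for _ in range(200_000):
    context = random.random() < 0.3            # shared "feature context"
    a = context and (random.random() < 0.5)    # unit 0: fires in context
    b = context and (random.random() < 0.5)    # unit 1: same statistics as unit 0
    c = random.random() < 0.05                 # unit 2: different statistics
    patterns.append((a, b, c))

counts = Counter(patterns)
n = len(patterns)

def prob(p):
    return counts[p] / n

# Exchanging units 0 and 1 maps each pattern onto a roughly equiprobable one:
swap01 = lambda p: (p[1], p[0], p[2])
symmetric = max(abs(prob(p) - prob(swap01(p))) for p in counts)

# ...whereas exchanging units 0 and 2 does not (that symmetry is absent):
swap02 = lambda p: (p[2], p[1], p[0])
asymmetric = max(abs(prob(p) - prob(swap02(p))) for p in counts)

print("0<->1 swap, max probability change:", round(symmetric, 4))
print("0<->2 swap, max probability change:", round(asymmetric, 4))
```

With 200,000 samples, the 0↔1 exchange changes no pattern's empirical probability by more than sampling noise, while the 0↔2 exchange changes some probabilities by several percent: only the first pair of units is statistically indistinguishable.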
Figure 4: Facing page. The effect of varying the complex cell's threshold parameter ϑ and the effect of starting from different random initial weight configurations. Neglecting edge effects, the two weight vectors obtainable at ϑ = 0.02 (a and b) are approximately symmetric under long-range translations, but not under exchange of horizontal with vertical; the "complex cell" is translation tolerant but orientation specific. The global translation symmetry of the data's statistics (made approximate by edge effects) has been preserved, but the 90 degree rotation symmetry has been broken. The rotation symmetry instead maps the two distinct convergent states a and b onto one another. As the threshold is raised to ϑ = 0.04, the receptive fields become more localized (c and d); the global translation symmetry has now been broken, but the weights are still approximately symmetric under more localized permutations. As a result, the translation tolerance of the complex cell's response will have become more short range but not eliminated. Even at the higher ϑ-value, the width of the tolerance envelope is still about 20 pixels, 10 times wider than the simple cells' tolerance width. With the long-range translation symmetry broken at the higher ϑ-value, translations as well as 90 degree rotations map distinct convergent states onto one another (e.g., c onto d). Although not depicted, the rotation symmetry may be restored by lowering ϑ (to 0.01); then the weights all become equal (edge effects make the equality approximate).
Invariant neurons in symmetry networks achieve invariance by discarding information about the transformation's degrees of freedom. For example, a translation-invariant complex cell does not retain precise information about its feature's position degree of freedom. The discarded degree-of-freedom information is not destroyed; it is still obtainable from other neurons, for example, from the simple cells from which the complex cell takes its input. However, nowhere is there stored a feature-independent generator of the transformation, which could be used to generate transformed copies of arbitrary features from prototypical templates. The transformation is not represented independently of the object on which it acts. There is an advantage to this apparent omission in that there is no binding problem: no ambiguity in deciding which, of the many objects that may be present, each particular transformation applies to. Single-cell learning rules such as the Webber (1994) rule do not attempt to minimize statistical dependences between neurons as Barlow (1989) proposed. Indeed, clear statistical dependences between simple cells are needed to drive the development of complex-cell invariances in the symmetry-preservation theory, so the elimination of statistical dependence between simple cells would not be a useful objective from the point of view of this theory. Other theories have been proposed to account for the unsupervised discovery of unknown invariances, based on the temporal association of sequences of transformation-related patterns over short memory timescales (Foldiak, 1991; Wallis et al., 1993; Bartlett & Sejnowski, 1995). Using simple synthetic training patterns rather than natural images, Foldiak (1991) argued that complex cells could discover short-timescale correlations between neighboring simple cells if they were to average the simple-cell activity pattern over those short timescales. Although the explanation of complex-cell development offered here does not exploit time dependence, the general principle of invariance discovery proposed here could also make use of time dependence, though in other contexts and in a different way. For example, it could discover statistical indistinguishabilities among the statistics of sets of motion detectors. Temporal association theories of invariance discovery have problems in principle that are not shared by the symmetry-preservation theory.
If a neuron's memory timescale is to be short enough for it to be sensitive to the temporal associations in short transformation sequences, it is hard to see why it would not forget those associations just as quickly if its stimulus temporarily became stationary, and become matched to the stationary stimulus instead, so losing its transformation invariance. By exposure to rotation sequences, complex cells would find associations between simple cells of different orientations as easily as they find associations between simple cells of different positions from translation sequences. The temporal association model does not explain why complex cells are orientation specific but translation tolerant, except by assuming that because translation sequences are (presumably) so much more prevalent than rotation sequences, cells would quickly forget any rotation sequences that they happened to learn. That assumption raises a stability-plasticity dilemma, however; if cells forget some transformation sequences in order to learn newer ones, it is not obvious that their response properties would ever settle down. Temporal association models are not relevant to discrete transformations where neurons never experience a temporal sequence: for example, the transformations that relate different speakers' pronunciations of the same vowel, or the transformation that exchanges ocularity between left and right. Association is just as important in the symmetry-preservation theory as in temporal association theories, but the association is not on the basis of similarity over time, but similarity of context: statistical indistinguishability. In principle, symmetry preservation provides a generic mechanism for generating neurons that respond similarly to different features that arise in similar contexts.

Appendix A: Symmetry-Breaking Bifurcation in the Hebbian Threshold Neuron
Webber (1994) showed that the condition for the weight vector of the neuron (see equation 5.1) to be stationary under the Hebbian dynamics is

    ⟨x xᵀ⟩_{H(ŵ₁)} ŵ₁ − ϑ ⟨x⟩_{H(ŵ₁)} = β ŵ₁   (positive scalar β),   (A.1)

where the average is now taken over just that subset H of patterns for which the neuron's response r rises above threshold:

    x · ŵ > ϑ  ⟺  x ∈ H(ŵ).   (A.2)

In addition, for ŵ₁ to be a stable stationary state, any small perturbation ŵ = ŵ₁ + ε perpendicular to ŵ₁ must satisfy

    ε̂ᵀ ⟨x xᵀ⟩_{H(ŵ₁)} ε̂ < ŵ₁ᵀ ⟨x xᵀ⟩_{H(ŵ₁)} ŵ₁ − ϑ ŵ₁ᵀ ⟨x⟩_{H(ŵ₁)}   (ε̂ ≡ |ε|⁻¹ ε).   (A.3)

Only perturbations ε perpendicular to ŵ₁ are permitted by the constraint |ŵ| = 1. This stability condition is an upper limit on the eigenvalues of that part N_{H(ŵ₁)} of the matrix ⟨x xᵀ⟩_{H(ŵ₁)} that lies in the subspace of permissible perturbations ε, that is, the part of ⟨x xᵀ⟩_{H(ŵ₁)} that projects normal to ŵ₁ (Webber, 1994). Those conditions can be used to define the symmetry-breaking behavior of this Hebbian neuron. When the threshold ϑ is very large and negative, the entire training ensemble will belong to the set H, and the only stable state that will satisfy equations A.1 and A.3 is

    ŵ₁ = ⟨x⟩_{{x}} / |⟨x⟩_{{x}}|   for ϑ → −∞,   (A.4)

where the average is now over the whole ensemble {x}. So when ϑ is very negative, the weight vector inevitably converges to the mean of the ensemble. Because the mean is invariant under the entire group G of Pr(x)'s permutation symmetries, no permutation symmetry will be broken if ϑ is low enough.
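Equations A.1 through A.4 can be illustrated with a toy numerical sketch (a caricature for illustration, not the paper's simulation): a two-dimensional ensemble whose distribution is symmetric under the permutation (x1, x2) → (x2, x1) is fed through a naive fixed-point iteration of the stationarity condition A.1, with the above-threshold subset H recomputed from equation A.2 at each step. A very negative threshold yields the symmetric mean direction of equation A.4; a raised threshold yields one of two symmetry-broken states, which the permutation maps onto one another.

```python
import math, random

random.seed(1)
# Toy ensemble, symmetric under the permutation (x1, x2) -> (x2, x1):
# two clusters mapped onto each other by the symmetry.
data = []
for _ in range(2000):
    base = (1.0, 0.2) if random.random() < 0.5 else (0.2, 1.0)
    data.append((base[0] + random.gauss(0, 0.05),
                 base[1] + random.gauss(0, 0.05)))

def converge(theta, w):
    """Naive fixed-point iteration of the stationarity condition (A.1):
    w <- normalize( <x x^T>_H w - theta <x>_H ), where H is the
    above-threshold subset x.w > theta of equation (A.2)."""
    for _ in range(200):
        H = [x for x in data if x[0] * w[0] + x[1] * w[1] > theta]
        if not H:
            return w
        u = [0.0, 0.0]
        for x in H:
            xw = x[0] * w[0] + x[1] * w[1]
            u[0] += x[0] * xw - theta * x[0]   # rows of <x x^T>_H w - theta <x>_H
            u[1] += x[1] * xw - theta * x[1]
        norm = math.hypot(u[0], u[1])
        w = (u[0] / norm, u[1] / norm)
    return w

w_sym = converge(theta=-100.0, w=(0.9, 0.1))  # very negative threshold
w_a = converge(theta=0.9, w=(0.9, 0.1))       # raised threshold, one initial state
w_b = converge(theta=0.9, w=(0.1, 0.9))       # ...and its permuted counterpart

print("low  threshold:", w_sym)   # near the mean direction: symmetry preserved
print("high threshold:", w_a, w_b)  # two distinct states, swapped by the symmetry
```

The iteration scheme is only one plausible way of locating states satisfying equation A.1; the paper's actual learning dynamics are those of Webber (1994).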
As ϑ is raised, the right-hand side of equation A.3 will get gradually smaller (see footnote 12), until it drops below the left-hand side, at which point the stationary state ŵ₁ will cease to be stable. It will bifurcate along the direction of the principal eigenvector of N_{H(ŵ₁)} (or along any direction within its principal subspace if its largest eigenvalue is degenerate). Regardless of Pr(x), the indicatrix ellipsoid of this matrix always has reflection and rotation symmetries about a number of invariant subspaces, each containing a number of its eigenvectors. The symmetries that will be broken by the bifurcation are those whose invariant subspaces do not contain the principal eigenvector(s) of N_{H(ŵ)}. Permutation symmetries correspond to rotations and reflections that map the axes of x-space onto one another and are special cases of proper and improper rotations: the orthogonal transformations. If Pr(x) is symmetric under a group G of orthogonal transformations, every convergent state ŵ₁ will lie in the invariant subspaces of some group G′ of unbroken orthogonal symmetries, where G′ is a subgroup of G. ⟨x⟩_{H(ŵ₁)}, ⟨x xᵀ⟩_{H(ŵ₁)}, and N_{H(ŵ₁)} will all be symmetric under the unbroken orthogonal symmetries G′ of Pr(x), because ŵ₁ lies in the invariant subspaces of all these unbroken symmetries, and because the above-threshold subset H will therefore be symmetric if the whole of Pr(x) is. Thus, every unbroken orthogonal symmetry of Pr(x) will be a symmetry of the indicatrix of N_{H(ŵ₁)}. Whichever unbroken symmetries have invariant subspaces that do not contain the principal eigenvector(s) of N_{H(ŵ₁)} will be the ones that will be broken when the next bifurcation point is passed.

Appendix B: Details of Algorithm Implementation for the Natural Image Example
12. The left-hand and right-hand sides may both change discontinuously if the set H loses members as a result of the raising of the threshold ϑ, although these discontinuous changes turn out not to break the inequality, because they affect both sides equally. This is because any pattern that is just about to pass below the x · w = ϑ plane will make only a negligible contribution u(x, w) to the objective function.

The first network layer, represented in Figure 2, consists of "simple-cell" receptive fields of just two orientations but many different locations in the image plane. All the receptive fields of a single orientation can be mapped onto one another by translations. The correlation function between the image and a receptive field, computable using the convolution theorem, is equivalent to an array of scalar products between the image and an array of similar receptive fields translated to all possible image locations. Consequently, the array of scalar products x · ŵ required to compute the outputs of each grid of "simple cells" can be computed very efficiently by fast Fourier transform, as is done in Webber (1998), for example, which also explains how to compute the weight updates δŵ due to an array of cell outputs equally efficiently. This trick can be used to train both the simple-cell layer and the complex cell. It is only by exploiting the huge speed enhancement afforded by the convolution theorem that the simulations presented here are feasible for a contemporary serial computer in reasonable time. Computing a pair of correlation functions, between a 160 × 160-pixel input image and a pair of 64 × 64-pixel "simple-cell" receptive-field kernels (one horizontally and the other vertically oriented), yields a pair of square correlation functions of side length 160 − 64 + 1 = 97 pixels, if the artifact of image wraparound is to be avoided. That artifact would introduce undesirable horizontal and vertical correlations between the simple-cell outputs, due to the long discontinuities at the input-image edges. This pair of 97 × 97-pixel correlation functions forms the basis for the training input to the single Hebbian "complex cell" that forms the second layer of the network. The "simple-cell" receptive fields are not located at 1-pixel intervals, but sample only at the grid points illustrated in Figure 2, so the "complex cell" is allowed to receive input from the two correlation functions only through a limited number of synapses, connecting it only to those coordinates of the correlation functions corresponding to these grid points. As illustrated, these grid points extend over a region of the image 64 pixels square, but there are only 512 of them, so the "complex cell" has 512 synaptic inputs, half of them from vertically oriented "simple cells" and half from horizontally oriented "simple cells." The size of the training set is augmented by updating the complex cell's synapses with each training image presented repeatedly, at every possible location translated with respect to the input synapses. This amounts to reinforcing the natural translational invariance of the image statistics. The convolution theorem allows this to be done very efficiently (Webber, 1998).
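The equivalence described above, between a wraparound-free correlation function and an array of patch scalar products, can be checked at small scale. The sketch below uses illustrative sizes (a 20 × 20 image and an 8 × 8 kernel rather than the paper's 160 × 160 and 64 × 64) and assumes NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((20, 20))   # stand-in for a 160x160 whitened image
ker = rng.standard_normal((8, 8))     # stand-in for a 64x64 receptive-field kernel

# Direct computation: scalar product of the kernel with every image patch,
# avoiding wraparound, giving a (20-8+1) x (20-8+1) = 13 x 13 output.
H, W = img.shape
h, w = ker.shape
direct = np.array([[np.sum(img[i:i+h, j:j+w] * ker)
                    for j in range(W - w + 1)]
                   for i in range(H - h + 1)])

# Same result via the convolution theorem: one FFT product instead of
# (H-h+1)(W-w+1) separate scalar products.
circ = np.real(np.fft.ifft2(np.fft.fft2(img) *
                            np.conj(np.fft.fft2(ker, s=img.shape))))
fft_based = circ[:H - h + 1, :W - w + 1]   # keep only wraparound-free offsets

print(np.allclose(direct, fft_based))      # the two routes agree
```

The FFT route computes all (H − h + 1)(W − w + 1) scalar products in one pass, which is the speed enhancement exploited in the text (there, 97 × 97 offsets per image).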
In this way, one can augment the training set by a factor of (97 − 64 + 1)² if one wishes to avoid wraparound of the correlation functions. Augmenting the size of the training set in this way is necessary for two closely related reasons. First, one needs a sufficiently large training set to characterize the relevant degrees of freedom of natural images adequately; without such augmentation, too few photographs would have been available to populate the high-dimensional pattern space, so there would have been very strong statistical biases due to gross undersampling of the space. Second, one needs to present at each iteration a sufficiently large random subsample of this training set to ensure that the adaptation converges smoothly and is not disrupted by statistical fluctuations introduced by undersampling bias. Each 200-iteration simulation required about six weeks' CPU time on a 266 MHz Alpha, benchmarked at 90 SPECfp rate95. At each iteration, 16,000 training images were presented to the network, each presented at each of the (97 − 64 + 1)² position offsets computable simultaneously using the convolution theorem. Each set of 16,000 training images was selected randomly from a total ensemble of 392,400 images, which were
obtained by rotating 109 large landscape photographs by 1800 different angles covering the full 360 degrees (see footnote 13), duplicating these by reflections, then applying a spectral whitening filter (the same filter is applied to all photographs; see footnote 14), and then excising a randomly chosen 160 × 160-pixel region within the rotated photograph (avoiding the edges of the photograph, where the whitening produces artifacts). Together with the translations, the reflections and rotations are necessary to make the training set large enough to eliminate statistical bias due to undersampling of the high-dimensional pattern space. In all simulations depicted in Figures 3 and 4, the complex cell was trained with the very small but nonzero softness parameter s = 5 × 10⁻⁴. The fractional rate constant λ₀, which is related to the time-varying Webber (1994) rate parameter λ (the constant of proportionality in equation 5.3) as defined in Webber (1997), was set at 0.3 for all cells in all these simulations.

Acknowledgments
I am grateful to Steve Luttrell for discussions.

References

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bartlett, M. S., & Sejnowski, T. (1995). Unsupervised learning of invariant representations of faces through temporal association. In J. M. Bower (Ed.), The neurobiology of computation. Norwell, MA: Kluwer.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Blais, B. S., Intrator, N., Shouval, H. V., & Cooper, L. N. (1998). Receptive field formation in natural scene environments: Comparison of single-cell learning rules. Neural Computation, 10, 1797–1813.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America, 4, 2379–2394.
Foldiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Fukumi, M., Omatu, S., Takeda, F., & Kosaka, T. (1992). Rotation-invariant neural pattern recognition system with application to coin recognition. IEEE Transactions on Neural Networks, 3, 272–279.
13. The rotation uses a conventional bilinear interpolation algorithm.
14. The whitening filter has been optimized so as to make the average power spectrum over the entire training ensemble flat.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469.
Hebb, D. O. (1949). The organisation of behaviour. New York: Wiley.
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In G. Goos & J. Hartmanis (Eds.), PARLE: Parallel architectures and languages Europe (pp. 1–13). New York: Springer-Verlag.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Oja, E. (1982). A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Rao, R. P. N., & Ballard, D. H. (1998). Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9, 219–234.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructures of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Shawe-Taylor, J. (1989). Building symmetries into feedforward networks. In First IEEE Conference on Artificial Neural Networks (pp. 158–162).
Shawe-Taylor, J. (1993). Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4, 816–826.
Simard, P., Victorri, B., Le Cun, Y., & Denker, J. (1991). Tangent prop—a formalism for specifying selected invariances in an adaptive network. Neural Information Processing Systems, 4, 895–903.
Wallis, G., Rolls, E., & Foldiak, P. (1993). Learning invariant responses to the natural transformations of objects. Proceedings of the 1993 International Joint Conference on Neural Networks, 2, 1087–1090.
Webber, C. J. S. (1991). Self-organisation of position- and deformation-tolerant neural representations. Network: Computation in Neural Systems, 2, 43–61.
Webber, C. J. S. (1994). Self-organisation of transformation-invariant detectors for constituents of perceptual patterns. Network: Computation in Neural Systems, 5, 471–496.
Webber, C. J. S. (1997). Generalisation and discrimination emerge from a self-organising componential network: A speech example. Network: Computation in Neural Systems, 8, 425–440.
Webber, C. J. S. (1998). Emergent componential coding of a handwriting-image database by neural self-organisation. Network: Computation in Neural Systems, 9, 433–447.
Wood, J. J. (1995). A model of adaptive invariance. Unpublished doctoral dissertation, University of London.
Yamagishi, K. (1994). Spontaneous symmetry breaking and the formation of columnar structures in the primary visual cortex. Network: Computation in Neural Systems, 5, 61–73.

Received March 23, 1998; accepted April 26, 1999.
LETTER
Communicated by Bard Ermentrout
Geometric Analysis of Population Rhythms in Synaptically Coupled Neuronal Networks

J. Rubin and D. Terman
Department of Mathematics, Ohio State University, Columbus, Ohio 43210, U.S.A.

We develop geometric dynamical systems methods to determine how various components contribute to a neuronal network's emergent population behaviors. The results clarify the multiple roles inhibition can play in producing different rhythms. Which rhythms arise depends on how inhibition interacts with intrinsic properties of the neurons; the nature of these interactions depends on the underlying architecture of the network. Our analysis demonstrates that fast inhibitory coupling may lead to synchronized rhythms if either the cells within the network or the architecture of the network is sufficiently complicated. This cannot occur in mutually coupled networks with basic cells; the geometric approach helps explain how additional network complexity allows for synchronized rhythms in the presence of fast inhibitory coupling. The networks and issues considered are motivated by recent models for thalamic oscillations. The analysis helps clarify the roles of various biophysical features, such as fast and slow inhibition, cortical inputs, and ionic conductances, in producing network behavior associated with the spindle sleep rhythm and with paroxysmal discharge rhythms. Transitions between these rhythms are also discussed.

1 Introduction
Neural Computation 12, 597–645 (2000); © 2000 Massachusetts Institute of Technology.

Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated; it may, for example, fire repetitive spikes or bursts of action potentials, each followed by a silent phase of near-quiescent behavior (Rinzel, 1987; Wang & Rinzel, 1995). The bursting behavior may wax and wane on a slower timescale (Destexhe, Babloyantz, & Sejnowski, 1993; Bal & McCormick, 1996). Examples of population rhythms include synchronous behavior, in which every cell in the network fires at the same time, and clustering, in which the entire population of cells breaks up into subpopulations or blocks; the cells within a single block fire synchronously while different blocks are desynchronized from each other (Golomb & Rinzel, 1994; Kopell & LeMasson, 1994). Of course, much more complicated population rhythms are also possible (Traub & Miles, 1991; Terman & Lee, 1997). Activity may also propagate through the
network in a wavelike manner (Kim, Bal, & McCormick, 1995; Destexhe, Bal, McCormick, & Sejnowski, 1996; Golomb, Wang, & Rinzel, 1996; Rinzel, Terman, Wang, & Ermentrout, 1998). A network’s population rhythm results from interactions among three separate components: the intrinsic properties of individual neurons, the synaptic properties of coupling between neurons, and the architecture of coupling (i.e., which neurons communicate with each other). These components typically involve numerous parameters and multiple timescales. The synaptic coupling, for example, can be excitatory or inhibitory, and its possible turn-on and turn-off rates can vary widely. Neuronal systems may include several different types of cells, as well as different types of coupling. An important and typically challenging problem is to determine the role each component plays in shaping the emergent network behavior. In this article we consider recent models for thalamic oscillations (see, for example, Destexhe, McCormick, & Sejnowski, 1993; Steriade, McCormick, & Sejnowski, 1993; Golomb, Wang, & Rinzel, 1994; Terman, Bose, & Kopell, 1996; Destexhe & Sejnowski, 1997). The networks consist of several types of cells and include excitatory as well as both fast and slow inhibitory coupling. One interesting property of these networks is that they exhibit very different rhythms for different parameter ranges. For some parameter values, the network behavior resembles that of the spindle sleep rhythm: one population of cells is synchronized at the spindle frequency, while another population of cells exhibits clustering. If a certain parameter, corresponding to the strength of fast inhibition, is varied, then the entire network becomes synchronized. This resembles paroxysmal discharge rhythms associated with spike-and-wave epilepsy. 
In other parameter ranges, the network behavior is similar to that associated with the delta sleep rhythm; in this case, each cell exhibits an entirely different behavior from before. We develop geometric dynamical systems methods to analyze the mechanisms responsible for each of these rhythms and the transitions between them. This approach helps determine each component's contribution to the network behavior and to clarify how the behavior changes with respect to parameters. We are particularly interested in analyzing the role of inhibitory coupling in generating different oscillatory behaviors. This is done by considering a series of networks with increasing levels of complexity. Our analysis demonstrates, for example, how networks with distinct architectures can make different uses of inhibition to produce different rhythms. The techniques we develop are quite general and do not depend on the details of the specific systems. For a given network, however, these techniques lead to rather precise conditions for when a particular rhythm is possible. Much work has considered the role of inhibition in synchronizing oscillations (for example, Wang & Rinzel, 1992, 1993; Golomb & Rinzel, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Whittington, Traub, & Jefferys, 1995; Bush & Sejnowski, 1996; Gerstner, van Hemmen, & Cowan, 1996; Rowat & Selverston, 1997; Terman, Kopell, & Bose, 1998). Many of
Population Rhythms
599
these articles used model neurons with very short spiking times. These include integrate-and-fire models along with alpha function–type dynamics for the synapses. More biophysically based models for bursting neurons were considered by, among others, Wang and Rinzel (1993) and Terman et al. (1998). One conclusion of those works is that inhibition can lead to synchrony only if the inhibition decays at a sufficiently slow rate; in particular, the rate of decay of the synapses must be slower than the rate at which the neurons recover in their refractory period. These theoretical and numerical studies, however, considered rather idealized networks: Each cell contained only one channel state variable, and the network architecture was simply two mutually coupled cells. By considering more realistic biophysical models, we demonstrate that fast inhibitory coupling can indeed lead to synchronous rhythms. We show that this is possible in networks with complicated cells but simple architectures and in networks with more complicated architectures but simple cells. Geometric singular perturbation methods have been used previously to study the population rhythms of neuronal networks (for example, Somers & Kopell, 1993; Skinner, Kopell, & Marder, 1994; Terman & Wang, 1995; Terman & Lee, 1997; Terman et al., 1998; LoFaro & Kopell, in press). Each of the relevant networks possesses multiple timescales; this allows one to dissect the full system into fast and slow subsystems. Often, however, there is no clear-cut separation of timescales, so it is not obvious how one decides which variables should be considered as fast or slow. This is particularly the case when there are multiple intrinsic channel state variables and the synaptic variables turn on and turn off at different rates. A primary goal of this article is to demonstrate how one can make the singular reduction in order to understand mechanisms responsible for the thalamic rhythms. 
Two crucial issues are related to the geometric analysis. The first is concerned with the existence of a singular solution corresponding to a particular pattern. We assume that individual cells, without any coupling, are unable to oscillate. The existence of network oscillatory behavior then depends on whether the singular trajectory is able to "escape" from the silent phase. An important point will be that greater cellular or network complexity enhances each cell's opportunity to escape the silent phase when coupled. The second issue is concerned with the stability of the solution. To demonstrate stability of a perfectly synchronous state, for example, we must show that the trajectories corresponding to different cells are brought closer together as they evolve in phase space. As we shall see, this compression is usually not controlled by a single factor; it depends on the underlying architecture as well as nontrivial interactions between the intrinsic and synaptic properties of the cells (see also Terman et al., 1998). Our analysis demonstrates, for example, why thalamic networks are well suited to use inhibitory coupling to help synchronize oscillations and produce other, clustered, rhythms. In the next section we describe in detail the types of models to be considered for individual neurons. We distinguish between basic and compound
cells. As a concrete example, we consider recent conductance-based models for thalamocortical relay (TC) cells (Golomb et al., 1994; Destexhe & Sejnowski, 1997, and references therein). Compound cells can be realized as a model for a TC cell that includes three ionic currents: a low-threshold calcium current (IT), the nonselective sag current (Isag), and a leak. A basic cell does not include Isag. In this context, our results help to explain the role of Isag in generating network activity; we shall see that this role depends on the architecture of the network. We also discuss different forms of synaptic coupling and different architectures. In section 3, we show that synchronization is possible in mutually coupled networks that include compound cells and fast synapses. This analysis is similar to that done by Terman et al. (1998), which showed that synchronization is possible in networks with basic cells and slow inhibitory coupling. Those networks contain two types of slow processes: one corresponds to an intrinsic ionic current and the other to a synaptic slow variable. The main conclusion of the analysis here is that what is crucial for synchronization is that the network possesses at least two slow processes; one may be intrinsic and the other synaptic, or both may be intrinsic. In section 4, we consider networks with architectures motivated by recent models for the thalamic spindle sleep rhythm. The more complex architectures allow the network to use inhibition in different ways to produce different population rhythms. In particular, inhibition can play an important role in synchronizing the cells in a much more robust way than in the mutually coupled networks. We demonstrate how tuning various parameters allows the network to control the effect of inhibition and thereby control the emergent behavior. Consequences of these results for the full thalamic networks are presented in section 5. 
We consider the roles of various biophysical parameters associated with fast and slow inhibition, the sag current, cortical inputs, and other currents. These results help clarify the mechanisms responsible for spindle and paroxysmal discharge rhythms and the ways that changes in biophysical parameters lead to transitions between different rhythms. We conclude with a discussion in section 6.
2 The Models
We begin by describing the equations corresponding to individual cells and then describe the synaptic coupling between two cells. It is necessary to explain which parameters determine whether the synapse is excitatory or inhibitory and which other parameters determine whether the synapse is fast or slow. It will also be necessary to distinguish between direct synapses and indirect synapses. Finally, we describe the types of architectures to be considered.
Figure 1: Nullclines of basic and compound cells. (A) The v- and w-nullclines of a basic cell intersect at p_0, on the middle branch of the v-nullcline, in the oscillatory case. The closed curve indicates a singular periodic orbit, with double arrows denoting fast pieces and single arrows denoting slow pieces. (B) The v-nullcline and a singular solution of an excitable compound cell with a stable critical point p_0.

2.1 Single Cells. We model a basic cell as the relaxation oscillator

v' = f(v, w)
w' = ε g(v, w),     (2.1)

where ' = d/dt. Here ε is assumed to be small; hence, w represents a slowly evolving quantity. The active phase of the oscillation can be viewed as the envelope of a burst of spikes. We assume that the v-nullcline, f(v, w) = 0, defines a cubic-shaped curve, as shown in Figure 1A, and the w-nullcline, g = 0, is a monotone decreasing curve that intersects f = 0 at a unique point p_0. We also assume that f > 0 (f < 0) above (below) the v-nullcline and
g > 0 (< 0) below (above) the w-nullcline. If p_0 lies on the middle branch of f = 0, then equation 2.1 gives rise to a periodic solution for all ε sufficiently small, and we say that the system is oscillatory. In the limit ε → 0, one can construct a singular solution as shown in Figure 1A. If p_0 lies on the left branch of f = 0, then the system is said to be excitable; p_0 is a stable fixed point, and there are no periodic solutions for all ε small. Note that the thalamic cells we seek to model are typically excitable during the sleep state. For some of our results, it will be necessary to make some more technical assumptions on the nonlinearities f and g. We will sometimes assume that

f_w > 0,  g_v < 0,  and  g_w < 0     (2.2)

near the singular solutions. By a compound cell we mean one that contains at least two slow processes. We consider compound cells that satisfy equations of the form

v' = f(v, w, y)
w' = ε g(v, w)
y' = ε h(v, y).     (2.3)
Precise assumptions required on the nonlinear functions in equation 2.3 are given later. For now, we assume that for each fixed value of y, the functions f(v, w, y) and g(v, w) satisfy the conditions described for a basic cell. Then {f(v, w, y) = 0} defines a cubic-shaped surface. The system (see equation 2.3) is said to be excitable if there exists a unique fixed point, which we denote by p_0, and this lies on the left branch of the cubic-shaped surface. One can construct singular solutions of this equation, and one of these is shown in Figure 1B. The singular solution shown begins in the silent phase, or left branch, of the surface. It evolves there until it reaches the curve of jump-up points that correspond to the left knees of the cubic surface. The singular solution then jumps up to the active phase, or right branch, of the surface. It evolves in the active phase until it reaches the jump-down points or right knees of the surface. It then evolves in the silent phase, approaching the stable fixed point at p_0. A more formal description of certain singular solutions is given in section 3.2.

2.2 Synaptic Coupling. Consider the network of two mutually coupled cells: E1 ↔ E2. The equations corresponding to this network are
v_1' = f(v_1, q_1) − g_syn s_2 (v_1 − v_syn)
q_1' = ε L(v_1, q_1)
v_2' = f(v_2, q_2) − g_syn s_1 (v_2 − v_syn)
q_2' = ε L(v_2, q_2).     (2.4)
Here, q_i = w_i and L = g if the cells are basic, while q_i = (w_i, y_i) and L = (g, h) if the cells are compound. In equation 2.4, g_syn > 0. It is the parameter v_syn that determines whether the synapse is excitatory or inhibitory. If v_syn < v along each bounded singular solution, then the synapse is inhibitory. The coupling depends on the synaptic variables s_i, i = 1, 2. We consider two choices for the s_i. Each s_i may satisfy a first-order equation of the form

s_i' = a(1 − s_i) H(v_i − h_syn) − b s_i.     (2.5)
Here, a and b are positive constants, H is the Heaviside step function, and h_syn is a threshold above which one cell can influence the other. Note that a and b are related to the rates at which the synapses turn on or turn off. For fast synapses, we assume that a is O(1) with respect to ε. It may seem natural also to assume that b is O(1), and this is, in fact, what we will do in the next section. However, when we consider the thalamic networks in sections 4 and 5, it will be necessary to assume that b = Kε, where K is a large constant. The reason for choosing the fast synaptic variables in this way is discussed in more detail later. (The model for slow synapses is discussed in section 5.2.) If the synaptic variables satisfy equation 2.5, then we say that the synapse is direct. We will also consider indirect synapses. These are modeled by introducing a second synaptic variable x_i (see Golomb et al., 1994; Terman et al., 1998). The equations for (x_i, s_i) are:

x_i' = ε a_x (1 − x_i) H(v_i − h_syn) − ε b_x x_i
s_i' = a(1 − s_i) H(x_i − h_x) − b s_i.     (2.6)
Here, a_x and b_x are positive constants. Note that indirect synapses have the effect of introducing a delay in the synaptic action, and this delay takes place on the slow timescale. If, say, the cell E1 fires, then x_1 will activate once v_1 crosses the threshold h_syn. The activation of s_1 must wait until x_1 crosses the second threshold h_x. Note also that an indirect synapse can be fast if a and b are O(1), as discussed above.

2.3 Globally Inhibitory Networks. Besides mutually coupled networks, we also consider networks with the architecture shown in Figure 2. This network contains two different types of cells, labeled E-cells and J-cells. Each E-cell sends fast excitation to some of the J-cells, and each J-cell sends inhibition to some of the E-cells. The inhibition may be fast or slow (or both). There is no communication among different E-cells; however, the J-cells communicate with each other via fast inhibitory coupling. This network is motivated by recent models for the thalamic sleep rhythms discussed in section 1. The E- and J-cells correspond to thalamocortical relay (TC) and thalamic reticularis (RE) cells, respectively.
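The two-cell system of equations 2.4 and 2.5 can be sketched numerically. In the following minimal forward-Euler sketch, the nonlinearities f, g, h and every parameter value are invented stand-ins (a FitzHugh–Nagumo-style cubic for f), not the thalamic model of appendix A; only the structure of the equations comes from the text.

```python
import math

# Hypothetical stand-ins for f, g, h of eqs. 2.3-2.5; the paper leaves
# these general, so a FitzHugh-Nagumo-style cubic and made-up slow
# kinetics are used purely for illustration.
def f(v, w, y): return v - v**3 / 3.0 - w + y                  # cubic v-nullcline
def g(v, w):    return v + 0.7 - 0.8 * w                       # first slow process
def h(v, y):    return -y + 0.2 * (1.0 - math.tanh(2.0 * v))   # second slow process

def simulate(T=500.0, dt=0.01, eps=0.05, gsyn=0.5, vsyn=-2.0,
             hsyn=0.0, a=2.0, b=0.5):
    """Forward-Euler sketch of two mutually coupled compound cells with
    direct fast synapses (eq. 2.5); all parameter values are illustrative."""
    v = [-1.2, -0.8]; w = [-0.6, -0.5]; y = [0.1, 0.2]; s = [0.0, 0.0]
    trace = []
    for _ in range(int(T / dt)):
        nv, nw, ny, ns = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
        for i in range(2):
            j = 1 - i                                # the other cell
            inh = gsyn * s[j] * (v[i] - vsyn)        # inhibition gated by s_j
            nv[i] = v[i] + dt * (f(v[i], w[i], y[i]) - inh)
            nw[i] = w[i] + dt * eps * g(v[i], w[i])
            ny[i] = y[i] + dt * eps * h(v[i], y[i])
            gate = 1.0 if v[i] > hsyn else 0.0       # Heaviside H(v_i - hsyn)
            ns[i] = s[i] + dt * (a * (1.0 - s[i]) * gate - b * s[i])
        v, w, y, s = nv, nw, ny, ns
        trace.append((v[0], v[1]))
    return trace

trace = simulate()
```

Note how the timescale separation enters only through eps multiplying g and h, matching the singular-perturbation structure analyzed in section 3.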
Figure 2: Network of inhibitory J-cells and excitatory E-cells.
3 Mutually Coupled Compound Cells with Fast Inhibitory Synapses
Rubin and Terman (1998) proved that stable synchronous oscillations are not possible in mutually coupled networks with fast inhibitory coupling and only one slow variable corresponding to each cell. This holds regardless of whether the cells are excitable or oscillatory and whether the synapses are direct or indirect. These results are for networks with relaxation-type neurons, as discussed in the previous section. Here we demonstrate that when the mutually coupled cells are compound, they can exhibit synchronized oscillations when connected with fast inhibitory coupling. The synchronous solution may exist if the synapses are direct, but as in Terman et al. (1998), it can be stable only if the synapses are indirect (see section 3.3). We assume that each cell is excitable for fixed levels of input; however, there is no problem in extending the analysis if this does not hold. We analyze the network by constructing singular solutions, done by piecing together solutions of reduced fast and slow subsystems. Since the cells are compound, there will be at least two slow variables corresponding to each cell. The multiple slow variables are needed for both the existence and the stability of the synchronous solution. For existence, the multiple slow variables allow the singular trajectory to escape from the silent phase, despite the fact that each cell is excitable. The multiple slow variables also allow for several mechanisms that lead to compression of cells as they evolve in phase-space. We do not give precise conditions on the parameters and nonlinearities in the equations to specify when the synchronous solution is stable, as was done in Terman et al. (1998). This will be done elsewhere. Here we describe the geometric mechanisms that allow the cells to escape the silent phase so a synchronous solution is possible and then characterize the compression mechanisms that act to stabilize the synchronous solution. 
A primary aim of this article is to compare how inhibition is used in different networks with different architectures to produce stable synchrony. By identifying the compression mechanisms here, we are able to evaluate the robustness of the
synchronous solution. As we shall see, the compression mechanisms for the mutually coupled architecture are considerably less robust than those that arise in the globally inhibitory thalamic networks. The analysis here is similar to but more general than that in Terman et al. (1998), where it is demonstrated that stable synchrony can arise in networks with basic cells but slow synapses. It is assumed in Terman et al. (1998) that the synapses activate on the fast timescale and that the evolution of the cells in the active phase does not depend on the level of synaptic input. These assumptions imply that the slow system corresponding to the active phase is one-dimensional. We do not make these assumptions here and demonstrate that the resulting richer dynamics may lead to additional compression mechanisms for stabilizing the synchronous solution.

3.1 Singular Solutions. We analyze the network by treating ε as a small, singular perturbation parameter and constructing singular solutions. These consist of various pieces, each piece corresponding to a solution of either fast or slow equations. The fast equations are obtained by simply letting ε = 0 in equation 2.4 and in either equation 2.5 or 2.6. The slow equations are obtained by replacing t with τ = εt as the independent variable and then letting ε = 0. We assume here that both a and b are independent of ε. There is no problem in extending this analysis if b = O(ε); this is actually an easier case, since the additional slow variable then provides additional opportunities for compression. Here we derive the reduced slow equations valid when the synapses are direct. The equations for indirect synapses are similar, but there are more cases to consider. The slow equations are
0 = f(v_i, w_i, y_i) − g_syn s_j (v_i − v_syn)
ẇ_i = g(v_i, w_i)
ẏ_i = h(v_i, y_i)
0 = a(1 − s_i) H(v_i − h_syn) − b s_i,     (3.1)
where ˙ = d/dτ. One can reduce this system to equations for just the slow variables (y_i, w_i). There are several cases to consider depending on whether both cells are silent, both are active, or one is silent and the other is active. We assume that the solution of the first equation in 3.1 defines a cubic-shaped surface C_s, and the left and right branches of this surface can be expressed as v_i = W_L(w_i, y_i, s_i) and v_i = W_R(w_i, y_i, s_i), respectively. If both cells are silent, then each v_i < h_syn and s_i = 0. Let G_L(w, y, s) ≡ g(W_L(w, y, s), w) and H_L(w, y, s) ≡ h(W_L(w, y, s), y). Then each (w_i, y_i) satisfies the equations

ẇ = G_L(w, y, 0)
ẏ = H_L(w, y, 0).     (3.2)
If both cells are active, then each v_i > h_syn and the last equation in 3.1 implies that s_i = s_A ≡ a/(a + b). Let G_R(w, y, s) ≡ g(W_R(w, y, s), w) and H_R(w, y, s) ≡ h(W_R(w, y, s), y). Then each (w_i, y_i) satisfies the equations

ẇ = G_R(w, y, s_A)
ẏ = H_R(w, y, s_A).     (3.3)
Finally, suppose that one cell, say cell 1, is silent and cell 2 is active. Then the slow variables satisfy the reduced equations

ẇ_1 = G_L(w_1, y_1, s_A)     ẏ_1 = H_L(w_1, y_1, s_A)
ẇ_2 = G_R(w_2, y_2, 0)     ẏ_2 = H_R(w_2, y_2, 0).     (3.4)
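The reduction can be sketched numerically. The following minimal sketch integrates the silent-phase reduced system (eq. 3.2): at each slow step, a Newton iteration solves the algebraic fast equation for v on the left branch (the branch function W_L), and the slow variables are advanced by Euler steps. The functions f, g, h and all parameter values are hypothetical stand-ins (the paper keeps them general), chosen so that the trajectory reaches the jump-up curve.

```python
import math

# Hypothetical stand-ins for the nonlinearities of eq. 3.1; a cubic fast
# nullcline and made-up slow kinetics are used purely for illustration,
# as are the coupling parameters GSYN and VSYN.
GSYN, VSYN = 0.5, -2.0

def f(v, w, y): return v - v**3 / 3.0 - w + y
def g(v, w):    return v + 0.7 - 0.8 * w
def h(v, y):    return -y + 0.2 * (1.0 - math.tanh(2.0 * v))

def solve_branch(w, y, s, v):
    """Newton iteration on 0 = f(v,w,y) - GSYN*s*(v - VSYN), warm-started
    at the previous v: this evaluates the branch function v = W_L(w, y, s)
    near the left branch."""
    for _ in range(30):
        F = f(v, w, y) - GSYN * s * (v - VSYN)
        dF = 1.0 - v * v - GSYN * s      # dF/dv for this particular f
        v -= F / dF
    return v

def silent_phase(w, y, s=0.0, dtau=0.01, max_steps=5000):
    """Euler integration of the reduced slow system (eq. 3.2): the cell
    slides along the left branch until it nears the left knee (v = -1
    when s = 0), i.e., the jump-up curve."""
    v = solve_branch(w, y, s, -2.0)
    for _ in range(max_steps):
        w += dtau * g(v, w)              # w-dot = G_L(w, y, s)
        y += dtau * h(v, y)              # y-dot = H_L(w, y, s)
        v = solve_branch(w, y, s, v)
        if v > -1.1:                     # reached a neighborhood of the knee
            return v, w, y, True
    return v, w, y, False

v, w, y, escaped = silent_phase(0.0, 0.0)
```

With these made-up kinetics the slow trajectory escapes; choosing g or h so that the reduced flow has a stable fixed point on the branch would instead trap the cell, which is the excitable situation analyzed in the text.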
We may view the singular solution as two points moving around in the (y, w) slow phase-space. Each point corresponds to the projection of one of the cells onto the slow phase plane. The points evolve according to one of the reduced slow systems until one of the points reaches a jump-up or jump-down curve. The cells then jump in the full phase-space; however, the slow variables remain constant during the fast transitions. The points then "change directions" in slow phase-space and evolve according to some other reduced slow equations. Since the cells are excitable, the reduced system, equation 3.2, with s = 0 has a stable fixed point, which we denote by P_0. The slow phase-space corresponding to equation 3.2 is illustrated in Figure 3A. Note that while some of the trajectories are attracted toward P_0, others are able to reach the jump-up curve. That is, although the uncoupled cells are excitable, it is possible for a cell to begin in the silent phase and still fire. This will be important in the next section, when we discuss the existence of the synchronous solution. The following lemma characterizes the left and right folds (or jump-up and jump-down curves) of C_s. We assume here that f_y > 0 on the left branch of C_s while f_y < 0 on the right. This assumption is justified, based on biophysical considerations, in remark 4 (in appendix A).

Lemma. The left and right folds of C_s can be expressed as J_L = {(v_L(y, s), w_L(y, s), y)} and J_R = {(v_R(y, s), w_R(y, s), y)}, where ∂w_L/∂y < 0, ∂w_L/∂s > 0, ∂w_R/∂y > 0, and ∂w_R/∂s > 0.

Proof. Since v_L(y, s) = W_L(w_L(y, s), y, s), it follows from equation 3.1 and the definition of folds that

0 = f(W_L(w_L(y, s), y, s), w_L(y, s), y) − g_syn s (W_L(w_L(y, s), y, s) − v_syn)
0 = f_v(W_L(w_L(y, s), y, s), w_L(y, s), y) − g_syn s.     (3.5)
Figure 3: Singular solutions for compound cells. (A) The slow phase-space of an uncoupled compound cell, bounded by the jump-up curve J_L and the jump-down curve J_R of the slow manifold C_s. The solid (dashed) curves represent evolution in the silent (active) phase. P_0 is a stable fixed point. (B) Numerically generated synchronous solution for mutually coupled compound cells that are separately excitable, in (y, w)-space. The inset (v versus t) shows the voltage traces as two cells approach synchrony. These curves, as well as those in other figures, were generated using the program XPPAUT, developed by G. B. Ermentrout, with parameter values given in appendix A. The solid curve is the synchronous solution, the dashed curves are the jump-up (labeled) and jump-down (approximately horizontal, unlabeled) curves for s = 0.2, and the dash-dotted curves (partially obscured) are those for s = s_A = 0.8. Note that the curves for s = 0.8 lie at larger w-values than those for s = 0.2. Since ε ≠ 0, the synchronous solution does not jump up [y' = 0, near (y, w) = (0.08, 0.07)] immediately on reaching the jump-up curve.
Differentiating the first equation in 3.5 with respect to s and using the second equation, we obtain

0 = f_w (∂w_L/∂s) − g_syn (W_L(w_L(y, s), y, s) − v_syn).

Hence,

∂w_L/∂s = g_syn (W_L(w_L(y, s), y, s) − v_syn) / f_w.     (3.6)

The right-hand side of this expression has a positive numerator because the coupling is inhibitory and a positive denominator from equation 2.2, so ∂w_L/∂s > 0. Analogously, ∂w_R/∂s > 0. Similarly, differentiating with respect to y in equation 3.5 yields 0 = f_w (∂w_L/∂y) + f_y, or ∂w_L/∂y = −f_y/f_w, with f_y and f_w evaluated on J_L. Analogously, ∂w_R/∂y = −f_y/f_w, with f_y, f_w evaluated on J_R. The above assumptions on f_y, together with equation 2.2, yield the desired result.

Remark 1. In the thalamic networks of interest, the jump-down curve J_R is nearly horizontal (see Figure 3B). This holds because the y current has a much smaller reversal potential and maximal conductance than the w current; when a cell is in the active phase, this implies that |f_y| ≪ |f_w|, so |∂w_R/∂y| is quite small. (See remark 5 in appendix A.)
3.2 Existence of the Synchronous Solution. Here we illustrate why it is possible for a synchronous solution to exist in a network of mutually coupled compound cells even when the individual cells are excitable. A numerically generated picture of such a solution, projected onto the slow variables (y, w), is shown together with certain jump-up and jump-down curves in Figure 3B. We also show a solution with initial conditions slightly perturbed from the synchronous solution. The precise equations that this solution satisfies and the parameter values used numerically are given in appendix A. One constructs the singular synchronous solution as follows. We begin when the cells are in the silent phase, just after they have jumped down. The slow variables then evolve according to equation 3.2. If they are able to reach the jump-up curve, then they jump up according to the fast equations. While in the active phase, the slow variables satisfy equation 3.3 until they reach the jump-down curve. They then jump down according to the fast equations, and this completes one cycle of the synchronous solution. Note that w and y are decreasing in the active phase and then increase just after jump-down; we use this in the next subsection.
It is not clear how to choose the starting point (y_i(0), w_i(0)) so that the singular orbit returns precisely to this point after one cycle. Note, however, that the variables y_i relax very close to y_i = 0 during the active phase. If we suppose that y_i ≈ 0 at the jump-down point, then the value of w_i is determined; that is, for the coupled cells, w_i ≈ w_R(0, s_A). A straightforward fixed-point argument shows that the synchronous solution will therefore exist if the solutions of equation 3.2, which begin near (y_i, w_i) = (0, w_R(0, s_A)), are able to reach the curve of jump-up points. The reason that a synchronous solution can exist even when the uncoupled cells cannot oscillate is that the synchronous solution lies on a different cubic during the active phase than the uncoupled cells. For this reason, the synchronous solution jumps down along a different curve than the uncoupled cells do. From the lemma, the jump-down curve J_R(s_A) has larger values of w than the jump-down curve J_R(0) (see Figure 3B). It is therefore possible for the coupled cells to jump down to a point from which they are able eventually to escape the silent phase, although the uncoupled cells jump down to a point from which they cannot escape. There is a nice biophysical interpretation for why coupled excitable cells may be able to oscillate. Recall that a thalamocortical relay cell is an example of a compound cell. (See section 5 for a more detailed discussion.) Then w corresponds to the inactivation variable of the IT current. A larger value of w means that this current is more deinactivated. This implies that if the cells jump down at a larger value of w, then it is easier for the cells to become sufficiently depolarized so they can reach threshold and fire. The construction of the synchronous solution for indirect synapses is very similar. The only difference is that after the cells jump up or jump down, there is a delay until the inhibition either turns on or turns off. 
The cells switch their cubic surface while in the silent and active phases, assuming that the delay is shorter than the time that the cells spend in each of their silent and active phases. We will assume that this is the case throughout the remainder of this article.

3.3 Stability of the Synchronous Solutions. We assume throughout this section that the synapses are indirect. The synchronous solution cannot be stable if the synapses are direct for the following reason. Suppose we start with both oscillators in the silent phase and assume that cell 1 jumps up. If the synapse is direct, then s_1 jumps instantly (with respect to the slow timescale) toward its saturated value. The effect of this is to move cell 2 instantly away from its firing threshold, thus destabilizing the synchronous solution. Indirect synapses are needed for stability since they provide a window of opportunity for both cells to jump up during the same cycle. However, this does not guarantee that the synchronous solution is stable. One must still show that cells that are initially close together are brought closer together, or compressed, as they evolve in phase-space.
It is not at all obvious how to define compression. We need to demonstrate that the cells are brought closer together; however, this requires that we have a notion of distance between the cells. There are several possible metrics, each with certain advantages on different pieces of the solution. One obvious metric is the Euclidean distance between the points in phase-space corresponding to the cells. It is sometimes convenient to work with a time metric; the "distance" between the cells is then the time it takes for one cell to reach the initial position of the other cell. This was used previously in Somers & Kopell (1993), Terman & Wang (1995), Terman et al. (1998), and LoFaro & Kopell (in press). Here, we describe several mechanisms for compression, each corresponding to a different piece of the singular solution. These mechanisms illustrate how the geometry of the two-dimensional slow subsystem, notably its curves of knees, allows for compression in the presence of inhibitory coupling. By identifying the compression mechanisms, we can then understand how changing parameters in the equations influences the stability of solutions. We can also compare the robustness of the compression mechanisms for this network with that for other networks with other architectures.

3.3.1 The Jump Up. Suppose that cell 1 lies on the jump-up curve when t = 0. After cell 1 fires, there is a delay in the onset of inhibition. We assume that cell 2 begins in the silent phase so close to the jump-up curve that it fires before it feels this inhibition. Suppose that cell 2 fires when t = T_0. We now need to make an assumption on the nonlinearities. Let (y*, w*) be the point where the synchronous solution jumps up. We assume that

(A)  |G_L(y*, w*, 0)| < |G_R(y*, w*, 0)|  and  |H_L(y*, w*, 0)| < |H_R(y*, w*, 0)|.

Note that this assumption implies that the w and y coordinates of both cells change at a faster rate in the active phase after the jump up than in the silent phase before the jump up. This is certainly satisfied for the example described in appendix A and is similar to assumptions in previous work (see, for example, Somers & Kopell, 1993, where the notion of fast threshold modulation is introduced). There are now several cases to consider depending on the orientation of the cells both before and after they jump up. We work out two of these in detail. These are the cases that arise most often for the system described in appendix A. Similar analysis applies to other cases. Assume that w_1(0) < w_2(T_0) < w_2(0) and y_2(0) < y_1(T_0) < y_1(0) (see Figure 4A). These assumptions, together with the lemma, imply that |y_1(T_0) − y_2(T_0)| < |y_1(0) − y_2(0)|, so there is compression in the y-coordinates after the jumps. From assumption A, there is also compression in a time metric corresponding to the y-coordinate. For each t_0, let r_y(t_0) be the time it takes for cell 2 to evolve from its position at t = t_0 until its y-coordinate is that of cell 1 when t = t_0. It then follows that r_y(T_0) < r_y(0).
Figure 4: Compression mechanisms for mutually coupled compound cells. (A) Compression in the time metric r_w can occur between mutually coupled compound cells in the jump-up. Note that w_1(0) < w_2(T_0) < w_2(0) and y_2(0) < y_1(T_0) < y_1(0). The Euclidean distances d_w(0), d_w(T_0) are used to compute the time metrics r_w(0), r_w(T_0), respectively. (B) A reversal of orientation in the active phase (solid lines) can lead to compression in the jump-down; the dashed line indicates evolution of cell 1 in the silent phase. (C) Numerically computed trajectories of a pair of mutually coupled compound cells undergoing order reversal. Cells 1, 2 correspond to c1, c2, respectively, in (B). Parameter values are given in appendix A.
We now show that there is also a compression in the time metric corresponding to the w-coordinate. This is denoted by r_w(t). Let a_− = |G_L(y*, w*, 0)| and a_+ = |G_R(y*, w*, 0)|. Then

r_w(0) ≈ (w_2(0) − w_1(0))/a_−
      = (w_2(0) − w_2(T_0))/a_− + (w_2(T_0) − w_1(0))/a_−
      ≈ T_0 + (w_2(T_0) − w_1(0))/a_−
      > T_0 + (w_2(T_0) − w_1(0))/a_+
      ≈ (w_1(0) − w_1(T_0))/a_+ + (w_2(T_0) − w_1(0))/a_+
      ≈ r_w(T_0).     (3.7)
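The chain of estimates in equation 3.7 can be checked with illustrative numbers (all assumed, not taken from the paper): with silent-phase rate a_− smaller than active-phase rate a_+ (assumption A), the time metric r_w contracts across the jump-up.

```python
# Assumed illustrative values: silent-phase rate a_minus, active-phase
# rate a_plus, initial w-coordinates, and the delay T0 before cell 2 fires.
a_minus, a_plus = 0.5, 2.0          # |G_L| and |G_R| near the jump point
w1_0, w2_0, T0 = 0.30, 0.40, 0.12

w2_T0 = w2_0 - a_minus * T0         # cell 2 is still silent on [0, T0]
w1_T0 = w1_0 - a_plus * T0          # cell 1 is already active (w decreasing)

rw_0  = (w2_0 - w1_0) / a_minus     # time metric before the jump-up
rw_T0 = (w2_T0 - w1_T0) / a_plus    # time metric after both cells have jumped
```

Here rw_0 = 0.2 while rw_T0 = 0.14, and the ordering w_1(0) < w_2(T_0) < w_2(0) assumed in the text holds for these numbers.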
Now suppose that w_1(0) < w_2(T_0) < w_2(0) and y_1(T_0) < y_2(0). The exact same calculation given in equation 3.7 shows that there is compression in the time metric r_w across the jump up. A simple calculation also shows that there is compression in r_y.

3.3.2 The Active Phase. Next assume that the cells are active with x_i > h_x. Then each (y_i, w_i) satisfies equation 3.3. It is easy to see why the cells are compressed in the Euclidean metric if we make some simplifying assumptions concerning the nonlinear functions g and h. These assumptions arise naturally if one considers the network in appendix A; it is also a simple matter to extend this analysis to more general systems. Suppose that g and h are of the form g(v, w) = (w_∞(v) − w)/τ_w(v) and h(v, y) = (y_∞(v) − y)/τ_y(v). Note that while in the active phase, y_∞(v) and w_∞(v) are very small. Moreover, τ_y(v) and τ_w(v) are nearly constant. We assume here that while in the active phase, g(v, w) = −w/τ_w and h(v, y) = −y/τ_y, where τ_w and τ_y are positive constants. It follows that each (w_i, y_i) satisfies simple linear equations. If we ignore the jump-down curve, then each slow variable decays to 0 at an exponential rate. In particular, the distance between the cells decays exponentially. Actually, more is true. Each (y_i, w_i) approaches the origin tangent to the weakest eigendirection. Now suppose that the jump-down curve passes close to the origin. The Euclidean distance between the cells still decreases exponentially, and both cells jump down at nearly the same point. This is the point where the jump-down curve crosses the weakest eigendirection. We note that there is also a more subtle source of compression while the cells are active. There will be some period of time when cell 2 receives inhibition but cell 1 does not; that is, s_1 = s_A but s_2 = 0. During this time, the (y_i, w_i) satisfy different equations. 
It is then possible that the trajectories ( yi (t ), wi (t )) cross in the slow phase-space. This leads to a reversal of orientation between the cells, as shown in Figure 4B. Next, we discuss why a reversal of orientation can lead to compression in the cells’ trajectories.
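The exponential compression under the linearized active-phase dynamics can be checked with a short numerical sketch (a toy illustration with invented time constants τw < τy, not parameters from the paper): two points evolve under w' = −w/τw, y' = −y/τy; their Euclidean distance decays exponentially, and each trajectory becomes tangent to the weakest eigendirection (here the y-axis, since y decays more slowly).

```python
import math

# Toy time constants (assumed values, not from the paper): w decays faster than y.
tau_w, tau_y = 2.0, 10.0

def flow(y0, w0, t):
    """Solve y' = -y/tau_y, w' = -w/tau_w exactly, starting from (y0, w0)."""
    return y0 * math.exp(-t / tau_y), w0 * math.exp(-t / tau_w)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two nearby cells in the active phase.
p1, p2 = (0.30, 0.40), (0.36, 0.52)

d0 = dist(p1, p2)
dT = dist(flow(*p1, 10.0), flow(*p2, 10.0))
assert dT < d0  # Euclidean distance between the cells is compressed

# Tangency to the weakest eigendirection: the slope w/y of each
# trajectory tends to 0, i.e., trajectories flatten onto the y-axis.
y, w = flow(*p1, 50.0)
assert abs(w / y) < abs(p1[1] / p1[0])
```

Because both cells approach the origin along the same (weakest) eigendirection, a jump-down curve passing near the origin is met by both cells at nearly the same point, as claimed above.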
Population Rhythms
613
3.3.3 The Jump Down.

We now show that if the cells reverse their orientation while in the active phase, this can lead to a form of compression after the cells jump down. Let Ti be the time when cell i jumps down. As above, assume that cell 1 jumped up first with w1(0) < w2(0). We will assume that w1(t) < w2(t) as long as both cells are active. Moreover, from remark 1, the jump-down curve is nearly horizontal. Hence, cell 1 jumps down first; that is, T1 < T2. If the cells' trajectories cross while in the active phase, then y1(T1) < y2(T2). This is shown in Figure 4B. Let ρyA(T1) be the time it would take for the solution of equation 3.3 starting at (y2(T1), w2(T1)) to reach the y-coordinate y1(T1). It follows that ρyA(T1) > T2 − T1. For T1 < t < T2, cell 1 evolves in the silent phase with y increasing, while cell 2 evolves in the active phase with y decreasing. If y1(T2) < y2(T2), then |y2(T2) − y1(T2)| < |y2(T1) − y1(T1)|, so there is compression in the y-coordinates of the cells across the jump. Now suppose that y1(T2) > y2(T2), as shown in Figure 4B. Let ρyS(T2) be the time it would take for the solution of equation 3.2 starting at (y2(T2), w2(T2)) to reach the y-coordinate of cell 1. Since y2(T1) > y1(T1), it follows that ρyS(T2) < T2 − T1. We have now demonstrated that ρyS(T2) < T2 − T1 < ρyA(T1). That is, there is compression in the time metric corresponding to the y-coordinate. We note that since the jump-down curve is nearly horizontal, any compression in the w-coordinates is, to first order, neutral. A numerical example of orientation reversal in two cells' trajectories in (y, w)-space is shown in Figure 4C. At the top of the figure, the cells are in the silent phase. Each cell jumps up where the corresponding yi' = 0; chronologically, cell 1 jumps up first.
In the active phase, the trajectories cross, because the cells experience different levels of inhibition; cell 2 receives inhibition first. The paths cross again at the bottom left of the figure after the leading cell, cell 1, falls down to the silent phase.

3.3.4 The Silent Phase.

Suppose that both cells lie in the silent phase with xi < θsyn. Then each (yi, wi) satisfies equation 3.2 until one of them reaches the jump-up curve. We now define a metric between the cells and, in appendix B, we analyze how to choose parameters to guarantee that the metric decreases as the cells evolve in the silent phase. This metric is similar to that introduced by Terman et al. (1998). Suppose that cell 1 reaches the jump-up curve first, at the point (y1*, w1*). (See Figure 13 in appendix B.) Fix some time t0, and let WLt0 be the translate of the jump-up curve under which (y1*, w1*) is carried to the point (y1(t0), w1(t0)). Then the "distance" between (y1(t0), w1(t0)) and (y2(t0), w2(t0)) is the time it takes for the solution of equation 3.2 that begins at (y2(t0), w2(t0)) to cross WLt0. This is certainly well defined as long as the two cells are sufficiently close to each other.
614
J. Rubin and D. Terman
One can compute explicitly how this metric changes as the cells evolve in the silent phase. (The computation, rather technical, is given in appendix B.) A more complete discussion of how this metric is used to prove the stability of the synchronous solution of two mutually coupled basic cells with slow synapses is given in Terman et al. (1998).

3.4 Further Remarks.

The compression of cells can take place as the cells evolve along the multidimensional slow manifold or as they jump up or down. The compression during the jumping process depends on the geometry (or slope) of the curve of knees and on the orientation of the cells both before and after the jumps. Parameters that determine this slope may therefore have subtle effects on the stability of the synchronous solution; gsyn is one such parameter (see equation 3.6). The results in Terman et al. (1998) provide precise conditions on combinations of parameters that ensure that the synchronous solution is stable when the inhibition is slow. Increasing gsyn, for example, may sometimes stabilize the synchronous solution; however, when other parameters satisfy a different relationship, increasing gsyn may destabilize it. The size of the domain of attraction of the synchronous solution is, to a large extent, determined by the delay in the onset of inhibition. The two cells are able to fire together if the trailing cell lies within the window of opportunity determined by this delay. If the trailing cell lies outside this window, then the network typically exhibits antiphase behavior in which the cells take turns firing, although other network behavior is possible. The system may crash, for example, since the completely quiescent state is asymptotically stable. In our analysis, we assumed that the cells and coupling are homogeneous.
The effect of heterogeneities on mutually coupled basic cells with slow synapses was studied in several papers (Golomb & Rinzel, 1993; White, Chow, Ritt, Soto, & Kopell, 1998; Chow, 1998). These found that the synchronous solution is not very robust to mild levels of heterogeneity; a 5% variation in parameters was sufficient to destroy synchronous behavior. We have done a number of numerical simulations in order to study the effects of heterogeneities on the network considered in this section. Our numerical results are consistent with those of previous studies.

4 Globally Inhibitory Networks
We now consider the network described in section 2.3. Recall that in this network, E-cells excite J-cells, which in turn inhibit E-cells. The thalamic networks involved in sleep rhythms, discussed in section 5, are examples of such networks with compound cells, to which the results of this section apply. We assume for now that there are just two E-cells, denoted by E1 and E2, and one J-cell, which we denote by J. Larger networks are considered later. Initially, each cell is assumed to be a basic cell; the generalization
for compound cells is discussed in section 4.4. The E-cells are identical to each other, but they may differ from J. Each cell is also assumed to be excitable for fixed levels of input. The system of equations corresponding to each Ei is

    vi' = f(vi, wi) − ginh sJ (vi − vinh)
    wi' = ε g(vi, wi)                                      (4.1)
    si' = α(1 − si) H(vi − θ) − β si,
while the equation for J is

    vJ' = fJ(vJ, wJ) − (1/2)(s1 + s2) gexc (vJ − vexc)
    wJ' = ε gJ(vJ, wJ)                                     (4.2)
    sJ' = αJ(1 − sJ) H(vJ − θJ) − ε KJ sJ.
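The fast-timescale behavior of the synaptic variables in equations 4.1 and 4.2 can be verified directly: while a cell is held above threshold, H(v − θ) = 1 and the synaptic equation relaxes to the saturation level sA = α/(α + β). A minimal forward-Euler sketch, with illustrative rates α and β (not values from the paper):

```python
# Forward-Euler integration of s' = alpha*(1 - s)*H(v - theta) - beta*s
# with the cell held above threshold, so H(v - theta) = 1 throughout.
alpha, beta = 2.0, 0.5   # illustrative rates, not the paper's values
dt, T = 1e-3, 20.0

s = 0.0
for _ in range(int(T / dt)):
    s += dt * (alpha * (1.0 - s) - beta * s)

s_A = alpha / (alpha + beta)  # predicted saturation level s_A = α/(α+β)
assert abs(s - s_A) < 1e-6
```

The relaxation rate is α + β, so on the fast timescale si reaches sA almost immediately after a cell crosses threshold, as used repeatedly in the constructions below.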
Here, each synapse is direct. Indirect synapses will be needed when we discuss the stability of solutions. Note that the inhibitory variable sJ turns off on the slow timescale. The reason that we write the equations this way will become clear in the analysis. We assume that β = O(1); however, there is no problem in extending the analysis if β = O(ε). If vi > θ, then si → sA ≡ α/(α + β) on the fast timescale. Two types of network behavior are shown in Figures 8 and 9. A synchronous solution, in which each cell fires during every cycle, is shown in Figure 9. In Figure 8, each excitatory cell fires every second cycle, while J fires during every cycle. This type of solution is referred to as a clustered solution. In the following sections, we construct singular orbits corresponding to each of these solutions and then analyze their stability. The constructions then lead to conditions for when the different solutions exist and are stable.

4.1 Existence of the Synchronous Solution.

We now construct a singular trajectory corresponding to a synchronous solution in phase space. As before, the trajectory for each cell lies on the left or right branch of a cubic nullcline during the silent and active phases. Which cubic a cell inhabits depends on the total synaptic input that the cell receives. Nullclines for the Ei are shown in Figure 5A and those for J in Figure 5B. Note in Figure 5A that the sJ = 1 nullcline lies above the sJ = 0 nullcline, while in Figure 5B, the stot ≡ (1/2)(s1 + s2) = sA nullcline lies below the stot = 0 nullcline. This is because the Ei receive inhibition from J, while J receives excitation from the Ei. We will make several assumptions concerning the flow in the following construction. These are justified later. We begin with each cell in the active phase just after it has jumped up. These are the points labeled P0 and Q0 in Figure 5. Each Ei evolves down
Figure 5: Nullclines for (A) E-cells and (B) J-cells in a globally inhibitory network with basic cells. The closed curves and points Pi, Qi correspond to the singular synchronous solution discussed in the text. Note that sJ decays on the slow timescale.
the right branch of the sJ = 1 cubic, while J evolves down the right branch of the stot = sA cubic. We assume that the Ei have a shorter active phase than J, so each Ei reaches the right knee P1 and jumps down to the point P2 before J jumps down. We also assume that at this time, J lies above the right knee of the stot = 0 cubic. J must then jump from the point Q1 to the point Q2 on the stot = 0 cubic. On the next piece of the solution, J moves down the right branch of the stot = 0 cubic while the Ei move up the left branch of the sJ = 1 cubic. When J reaches the right knee Q3, it jumps down to the point Q4 on the left branch of the stot = 0 cubic. Now the inhibition sJ to the Ei starts to turn off on the slow timescale. Thus, the Ei do not jump to another cubic. Instead, the trajectory for the Ei moves upward, with increasing wi, until it crosses the w-nullcline. Then each wi starts to decrease. If this orbit is able to reach a left knee, it jumps up to
Figure 6: Slow phase plane for an E-cell. The curve wL(sJ) is the jump-up curve, which trajectories reach if KJ is large enough (dash-dotted path; the cell jumps up from the point marked *). The dotted curve WF(sJ) consists of zeros of GL(w, sJ) in system 4.3; trajectories tend to the stable critical point WF(0) as sJ → 0 for small KJ (dash-double-dotted path). Note that w' < 0 for w > WF.
the active phase, and this completes one cycle of the synchronous solution. When the Ei jump up, J also jumps up if it lies above the left knee of the stot = sA cubic. We now derive more quantitative conditions for when the singular synchronous solution exists. It is not at all obvious, for example, why we needed to assume that the active phase of J is longer than that of the Ei. It is also not clear what conditions are needed to ensure that the Ei are able to reach a jump-up curve and escape once they are released from inhibition. These two issues are actually closely related. We first discuss how the Ei can reach the jump-up curve. For this, it is convenient to derive equations for the evolution of the slow variables (wi, sJ), as was done in section 3.1. Let τ = εt, denote the left branch of the cubic f(v, w) − ginh s(v − vinh) = 0 by v = WL(w, s), and let GL(w, s) ≡ g(WL(w, s), w). Then each (wi, sJ) satisfies the slow equations

    ẇ = GL(w, sJ)
    ṡJ = −KJ sJ.                                           (4.3)
The phase plane corresponding to this system is illustrated in Figure 6. There are two important curves shown in the figure. The first is the jump-up curve w = wL(sJ); this is the curve of "left knees." The second curve, denoted by WF(sJ), corresponds to the fixed points of the first two equations
in 4.1 with the input sJ held constant. This corresponds to the w-nullcline of equation 4.3. We need to determine when a solution (w(t), sJ(t)) of equation 4.3 beginning with sJ(0) = 1 and w(0) < WF(1) can reach the jump-up curve wL(sJ). This is clearly impossible if WF(1) < wL(0), so we shall assume that WF(1) > wL(0). If w(0) > wL(0) and KJ is sufficiently large, the solution will certainly reach the jump-up curve; this is because the solution will be nearly vertical, as shown in Figure 6. If, on the other hand, KJ is too small, the solution will never be able to reach the jump-up curve. This is because the solution will slowly approach the curve WF(sJ) and lie very close to this curve as sJ approaches zero. This is also shown in Figure 6. We conclude that the cells are able to escape the silent phase if the inhibitory synapses turn off sufficiently quickly and the w-values of the cells are sufficiently large when this deactivation begins. Escape is not possible for very slowly deactivating synapses (although it would be possible with slow deactivation if the cells were oscillatory for some levels of synaptic input). A biophysical interpretation of this is that escape is possible for GABA_A synapses and will occur if the cell's I_T current is sufficiently deinactivated when the inhibition begins to wear off. We assume that KJ is large enough so that escape is possible. Choose Wesc so that the solution of equation 4.3 that begins with sJ(0) = 1 is able to reach the jump-up curve only if w(0) > Wesc. The existence of the singular synchronous solution now depends on whether the Ei lie in the region where wi > Wesc when J jumps down to the silent phase. We claim that this requires that the active phase of J be sufficiently long. One can estimate how long this active phase must be as follows. Suppose that all the cells jump up when t = 0, the Ei jump down when t = tE, and J jumps down when t = tJ. We require that wi(tJ) > Wesc.
Since the time the Ei spend in the silent phase before they are released from inhibition is tJ − tE, this implies that tJ − tE must be sufficiently large. Hence, J's active phase must be sufficiently longer than the Ei's. More precisely, let wRK be the value of w at the right knee of the sJ = 1 cubic, and let tL be the time it takes for the solution of the first equation in 4.3 with sJ = 1 to go from w = wRK to w = Wesc. We require that

    tJ − tE > tL.                                          (4.4)
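The escape condition can be illustrated by integrating a caricature of system 4.3. The curves below are invented stand-ins for WF(sJ) and wL(sJ), chosen so that WF(1) > wL(0) and WF(sJ) < wL(sJ) for sJ < 1, as in Figure 6; only the qualitative conclusion matters: fast synaptic decay (large KJ) lets the trajectory reach the jump-up curve, while slow decay traps it near WF(sJ).

```python
def time_to_escape(K_J, w0=0.6, tau_w=5.0, T=100.0, dt=1e-3):
    """Euler-integrate w' = (W_F(s_J) - w)/tau_w, s_J' = -K_J*s_J.

    Returns the first time w reaches the jump-up curve w_L(s_J),
    or None if the trajectory never escapes before time T.
    All curves and constants here are toy choices, not the paper's.
    """
    W_F = lambda s: 0.2 + 0.6 * s   # fixed-point curve, W_F(1) = 0.8
    w_L = lambda s: 0.55 + 0.3 * s  # curve of left knees, w_L(0) = 0.55
    w, s, t = w0, 1.0, 0.0
    while t < T:
        if w >= w_L(s):
            return t                # reached the jump-up curve: escape
        w += dt * (W_F(s) - w) / tau_w
        s += dt * (-K_J * s)
        t += dt
    return None

assert time_to_escape(K_J=20.0) is not None   # fast decay: escape
assert time_to_escape(K_J=0.05) is None       # slow decay: trapped near W_F
```

Varying `w0` in this sketch locates a threshold value playing the role of Wesc: starts below it slide onto WF(sJ) and never fire, starts above it reach the knee curve.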
4.2 Stability of the Singular Synchronous Solution.

We now demonstrate that the synchronous solution is stable if the synapse sJ is indirect and the active phase of J is sufficiently long. We start with the Ei a small distance apart, just after both have jumped up to the active phase. Assume that this causes J to fire. We will show that after one cycle, both of the Ei are so close that they must fire together again. Moreover, there is a contraction in the distance between the E-cells during each cycle. The analysis proceeds as in the previous section. We assume that the active phases of the Ei are shorter than that of J, so that the Ei return to
the silent phase and proceed up the left branch of the sJ = 1 cubic before J jumps down. As the Ei move up this left branch, they approach the point PL where the sJ = 1 cubic intersects the w-nullcline. (See Figure 5A.) If the active phase of J is sufficiently long, then the Ei lie as close as we please to PL, and therefore to each other, when J jumps down. This is precisely what is required to guarantee that both will fire together during the next cycle. While in the silent phase, the Ei approach PL at an exponential rate (on the slow timescale). This leads to a very strong compression of the Euclidean distance between the cells while in the silent phase. This compression is certainly stronger than any possible expansion over the remainder of the cycle. After the J-cell falls down, sJ decays on the slow timescale. This allows the J-cell to recover so that it can fire when excited by the firing of the E-cells, and the whole cycle repeats. We need to assume that sJ corresponds to an indirect synapse for the same reason as before. When one of the E-cells fires, this causes J to fire, which sends inhibition back to the other E-cell. If sJ is direct, this causes the second E-cell to be "stepped on" on the fast timescale, and the synchronous solution cannot be stable. Note that the time between the firings of the E-cells is determined by KJ, the rate at which sJ decays. If KJ is large, the time between firings is short; it is then easier for the second cell to pass through the window of opportunity provided by the indirect synapse. Our analysis has shown that the dynamics of the J-cell can influence the domain of attraction of the synchronous solution in several ways. If J's active phase is long, then the E-cells lie close to each other, near PL, when J jumps down and releases them from inhibition. Moreover, if J recovers quickly in the silent phase, then KJ can be chosen to be large.
Both factors make it easier for the E-cells, once they are released from inhibition, to pass through a window of opportunity and fire during the same cycle. Hence, both enlarge the domain of attraction of the synchronous solution.

Remark 2. There are important differences between the ways in which mutually coupled and globally inhibitory networks use inhibition to synchronize oscillations. In mutually coupled networks, a second slow variable is required for the existence of the synchronized solution; it allows the cells to escape from the silent phase. The second slow variable is also required for the compression of the cells as they evolve in phase space. The existence and stability of the synchronous solution in globally inhibitory networks, on the other hand, are controlled by the dynamics of the J-cell. If the J-cell's active phase is long enough, then this pushes the E-cells, in their silent phase, to a position from which they can escape; moreover, this provides a strong compression of the E-cells. For this network, we require that the inhibition decays on the slow timescale; however, the reason is so that the J-cell can recover sufficiently. The slow recovery is not needed to allow the E-cells to escape or for compression. In fact, the domain of stability of the
synchronous solution is increased if the recovery of the J-cell and the decay of the synapses occur quickly, on the slow timescale.

4.3 Clustered Solution.

We now describe the geometric construction of the singular antiphase, or clustered, solution. It suffices to consider half of a complete cycle. During this half-cycle, E1 fires and returns to the initial position of E2, J fires and returns to its initial position, and E2 evolves in the silent phase to the initial position of E1. By symmetry, we can then continue the solution for another half-cycle with the roles of E1 and E2 reversed. When E1 jumps up, it forces J to jump up to the right branch of the stot = (1/2)sA cubic. Then E1 moves down the right branch of the sJ = 1 cubic, while J moves down the right branch of the stot = (1/2)sA cubic and E2 moves up the left branch of the sJ = 1 cubic. We assume, as before, that E1's active phase is shorter than J's, so E1 jumps down before J does. It is possible that J lies below the right knee of the stot = 0 cubic at this time, in which case J also jumps down. If J lies above this right knee, then it moves down the right branch of the stot = 0 cubic until it reaches the right knee and then jumps down. During this time, both E1 and E2 move up the left branch of the sJ = 1 cubic. After J jumps down, sJ(t) slowly decreases. If E2 is able to reach the jump-up curve, then it fires, and this completes the first half-cycle of the singular solution. Suppose that t = tF when this occurs. For this to be one-half of an antiphase solution, we need w2(tF) = w1(0), w1(tF) = w2(0), and wJ(tF) = wJ(0). We now derive conditions for when the antiphase solution exists. These will imply that the active phase of J can be neither too long nor too short, compared with the active phase of the Ei. If J's active phase is too long, then the network exhibits synchronous behavior, as described before.
If J's active phase is too short, then the system approaches the stable quiescent state. Suppose that E1 and J jump up when t = 0, E1 jumps down when t = tE, and J jumps down when t = tJ. Let tL and Wesc be as defined in the previous section. To have a clustered solution, we require that
    w1(tJ) < Wesc < w2(tJ).
(4.5)
The second inequality is necessary to allow E2 to fire during the second half-cycle. The first inequality guarantees that E1 does not fire during this half-cycle. It follows from the definitions that the first inequality is equivalent to

    tJ − tE < tL.
(4.6)
Next we derive a similar expression for the second inequality in equation 4.5. For w0 < w1, let ρ(w0, w1) be the time it takes for a solution of the
first equation in 4.3 with sJ = 1 to go from w0 to w1. The second inequality is equivalent to ρ(wRK, w2(tJ)) > tL. Now, ρ(wRK, w2(tJ)) = ρ(wRK, w2(0)) + ρ(w2(0), w2(tJ)). Moreover, wRK = w1(tE) and w2(0) = w1(tF). Hence, ρ(wRK, w2(tJ)) = ρ(w1(tE), w1(tF)) + ρ(w2(0), w2(tJ)). Clearly, ρ(w2(0), w2(tJ)) = tJ, because E2 lies on the sJ = 1 cubic for 0 < t < tJ. It is not true that E1 lies on the sJ = 1 cubic for tE < t < tF; however, if the first equation in 4.3 depends only weakly on sJ, then ρ(w1(tE), w1(tF)) ≈ tF − tE. In this case, the second inequality in equation 4.5 is equivalent to

    tF − tE + tJ > tL.
(4.7)
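The decomposition ρ(wRK, w2(tJ)) = ρ(wRK, w2(0)) + ρ(w2(0), w2(tJ)) used above relies only on the additivity of the time metric along a common one-dimensional flow: flowing through an intermediate point costs no extra time. A quick numerical check on a toy flow (the vector field below is invented purely for illustration):

```python
def rho(w0, w1, dt=1e-4):
    """Time for the toy flow w' = (0.9 - w)/5 to carry w0 up to w1 (w0 < w1 < 0.9)."""
    t, w = 0.0, w0
    while w < w1:
        w += dt * (0.9 - w) / 5.0
        t += dt
    return t

# Additivity of the time metric: rho(a, c) = rho(a, b) + rho(b, c).
total = rho(0.1, 0.7)
split = rho(0.1, 0.5) + rho(0.5, 0.7)
assert abs(total - split) < 1e-2
```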
One can simplify this formula if the parameter KJ is rather large. In this case, tF − tJ is small; that is, E2 escapes the silent phase as soon as it is released from inhibition. Then equation 4.7 is approximately equivalent to

    tJ > (1/2)(tE + tL).                                   (4.8)
Combining equations 4.6 and 4.8 leads to the following condition for the existence of a clustered solution when the synaptic variable sJ turns off quickly:

    (1/2)(tE + tL) < tJ < tE + tL.                         (4.9)
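Condition 4.9 pins tJ to a window determined by tE and tL. A trivial numerical reading, with illustrative times (not values from the paper): if J's active phase is too long the E-cells synchronize instead, and if it is too short the E-cells cannot escape.

```python
t_E, t_L = 10.0, 6.0  # illustrative durations, in slow-time units

def clustering_window(t_E, t_L):
    """Interval of J active-phase durations t_J allowed by condition 4.9."""
    return 0.5 * (t_E + t_L), t_E + t_L

lo, hi = clustering_window(t_E, t_L)
assert (lo, hi) == (8.0, 16.0)
assert lo < 12.0 < hi        # admissible t_J: clustering possible
assert not (lo < 20.0 < hi)  # t_J too long: synchronous behavior expected
assert not (lo < 6.0 < hi)   # t_J too short: E-cells cannot escape
```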
Remark 3. It is possible for both stable synchrony and stable clustering to exist for the same parameter values. Note that the domain of stability of the synchronous solution is controlled to a large extent by the size of the delay caused by the indirect inhibitory synapses. If this delay is small, the domain of stability will also be small. In this case, the synchronous solution will still be stable; however, most solutions will converge to a stable clustered solution.
4.4 Globally Inhibitory Networks with Compound Cells.

The discussion in the previous subsections generalizes to globally inhibitory networks with compound cells, such as the model thalamic network in the next section. The primary difference is that each cell contains an additional slow variable, yi, so it is necessary to consider a higher-dimensional slow phase-space. As a consequence, the jump-up curves of the previous subsections are
Figure 7: Three-dimensional slow phase-space for a singular synchronous periodic orbit of the globally inhibitory network. Double (single) arrows denote evolution on the fast (slow) timescale, solid (dashed) lines indicate the silent (active) phase, and the points Pi are as discussed in the text. The shaded region represents the jump-up surface w = wL(y, sJ).
replaced by jump-up surfaces {w = wa(y, sJ)}, a = L or R, where wa(y, sJ) is as in the lemma in section 3. Figure 7 illustrates the evolution of the slow variables (wi, yi, sJ) for the singular synchronous solution. We begin at the point labeled P1 on the jump-up surface w = wL(y, sJ). The E-cells then jump up, and this forces J to jump up. Hence, sJ → 1. This corresponds to the segment in Figure 7 that connects P1 to the point P2 on the sJ = 1 surface. Each cell then evolves in the active phase with sJ = 1. As before, we assume that the active phases of the E-cells are shorter than that of J. Hence, the E-cells jump down when the (wi, yi) reach the jump-down curve wi = wR(yi, 1). This is at the point labeled P3 in Figure 7. While J lies in the active phase, the E-cells evolve in the silent phase, but sJ = 1 still holds. At P4, J jumps down and the (wi, yi, sJ) evolve
until they reach the jump-up curve. This completes one cycle of the singular solution. The stability analysis proceeds just as in section 4.2. What is crucial for stability is that J remain active long enough. The E-cells then approach the stable fixed point on the left branch of the sJ = 1 surface while J is active. This provides the compression needed for stability. The construction of a clustered solution is also very similar to that described in section 4.3. We do not describe the construction here; some comments are given in the next section.

4.5 Further Remarks.

The geometric constructions of the synchronous and clustered solutions extend in a straightforward manner to globally inhibitory networks with an arbitrary number of excitatory cells Ei. Of course, in a larger network there are more possibilities for clustered solutions; however, if each cluster contains (approximately) the same number of cells, then inequalities similar to equation 4.9 must be satisfied. This is similar to the analysis of Terman and Wang (1995), which yields precise conditions for the existence of clustered states in a locally excitatory, globally inhibitory network model for scene segmentation. Further analysis of clustered solutions in inhibitory networks is given in Rubin and Terman (1999). The analysis leads to simple formulas for the periods of the synchronous and clustered solutions. Let tJ be, as above, the time cell J spends in the active phase, and let tS be the time for the E-cells to reach the jump-up curve after the J-cell jumps down. Then the period of the synchronous solution is simply tJ + tS. Now tJ is determined by the dynamics of the J-cell, while tS is primarily controlled by the rate at which the synapses turn off; this is the parameter KJ in equation 4.2. (Also see Figure 11.) Other parameters play a secondary role. Note, for example, that the parameter gsyn influences the period by controlling the slope of the jump-up curve, as shown in equation 3.6.
The synchronous and clustered solutions can exist only if trajectories are able to escape from the silent phase. Previous work on mutually inhibitory neurons has emphasized the distinction between "release" and "escape" in producing both synchronous and antiphase solutions (Wang & Rinzel, 1993). "Release" refers to the case in which the active phase of one cell ends, releasing another, silent cell from inhibition. "Escape" refers to the case in which the dynamics of the inactive cell allow it to fire even while it receives inhibition. As pointed out in Terman et al. (1998), there is no clear distinction between escape and release when there are multiple slow processes. A silent E-cell is in some sense released when the J-cell jumps down to the silent phase. The rhythms can continue, however, only if this E-cell is able to escape the silent phase. Here we are assuming that each E-cell is excitable for constant levels of inhibition. Both escape and release are therefore needed to maintain oscillations.
5 Thalamic Network
The model for the thalamic spindle sleep rhythm falls into the framework of the network analyzed in the preceding section. The two populations of cells in the model are the thalamocortical relay (TC) cells and the thalamic reticularis (RE) cells, corresponding to the E-cells and J-cells, respectively. One difference between the spindle model and those considered earlier is that the spindle model contains numerous RE as well as TC cells. The RE cells communicate with each other through fast inhibitory synapses, as illustrated in the network shown in Figure 2. During spindling, the network exhibits behavior similar to the clustered solution discussed in the previous section. The RE population is synchronized, while the TC cells break up into groups; cells within each group are synchronized, while cells in different groups are desynchronized. The network also exhibits completely synchronized rhythms. The synchronized rhythms arise, for example, when fast inhibition is removed from the entire network or from between the RE cells only. Recent results have also shown that synchrony can arise if the RE population receives additional phasic excitation, corresponding to cortical input, at the delta frequency. Hence, the network can switch from clustered to synchronized behavior without any change in the inhibitory synapses. In this section, we demonstrate how the geometric analysis helps to explain the dynamical mechanisms responsible for these rhythms and the transitions between them. We begin by presenting a concrete model for the sleep rhythms and then present results of numerical simulations of this model. The numerical simulations clearly show that solutions of the model behave as predicted by the singular perturbation analysis; in particular, solutions jump up and jump down when they reach a curve (or surface) of knees.
This confirms that the decomposition into fast and slow variables, as described in the previous section, provides the correct singular perturbation framework for analyzing these rhythms. We also demonstrate how the geometric analysis leads to quantitative statements concerning the behavior of solutions. In particular, the analysis correctly predicts how the frequency of oscillations depends on parameters. The analysis also leads to precise conditions for when the network exhibits either synchronous or clustered solutions. For example, factors that enable the TC cells to synchronize are a long RE active phase, a relatively fast RE recovery, and a fast decay of inhibition. Some of these factors are inconsistent with mechanisms for synchronization in mutually coupled inhibitory networks. Finally, the analysis clarifies the roles of the various intrinsic and synaptic currents in generating a particular rhythm. We illustrate this in section 5.3, where we consider the role of the sag current, which also differs from that of the second slow intrinsic current in mutually coupled networks.
5.1 Model.

The following model contains many parameters and nonlinear functions (these are given in appendix A). The cells are modeled using the Hodgkin-Huxley formalism (Hodgkin & Huxley, 1952); the equations are very similar to those in Golomb et al. (1994). The equations of each TC cell are:
    vi' = −IT(vi, hi) − Isag(vi, ri) − IL(vi) − IA − IB
    hi' = (h∞(vi) − hi)/τh(vi)                             (5.1)
    ri' = (r∞(vi) − ri)/τr(vi).
This describes a compound cell, with hi and ri corresponding to w and y, respectively, and with ε absorbed into τh and τr rather than written explicitly. The terms IT, Isag, and IL are intrinsic currents; they are given by IT(v, h) = gCa m∞²(v) h (v − vCa), Isag(v, r) = gsag r (v − vsag), and IL(v) = gL(v − vL). The terms IA and IB represent the fast (GABA_A) and slow (GABA_B) inhibitory input from the RE cells. We model the fast inhibition IA as in previous sections; that is, IA = gA (vi − vA) (1/NTR) Σj sAj, where gA and vA are the maximal conductance and the reversal potential of the synaptic current. The sum is over all RE cells that send input to the ith TC cell, and NTR represents the maximum number of RE cells that send inhibition to a single TC cell. Each synaptic variable sAj satisfies the first-order equation

    (sAj)' = αR(1 − sAj) H(vRj − θR) − βR sAj,             (5.2)

where vRj is the membrane potential variable of the jth RE cell. Motivated by recent experiments (Destexhe, Bal, McCormick, & Sejnowski, 1996; Destexhe & Sejnowski, 1997), we model the slow inhibition IB somewhat differently from IA. We first discuss, however, the model for the RE cells. The equations of each RE cell are:

    (vRi)' = −IRT(vRi, hRi) − IAHP(vRi, mi) − IRL(vRi) − IRA − IE
    (hRi)' = (hR∞(vRi) − hRi)/τRh(vRi)
    mi' = μ1 [Ca]i (1 − mi) − μ2 mi                        (5.3)
    [Ca]i' = −ν IRT − γ [Ca]i.
The terms $I_{RT}$, $I_{AHP}$, and $I_{RL}$ represent intrinsic currents. These are given by $I_{RT}(v, h) = g_{RCa}\, m_{R\infty}^2(v)\, h\, (v - v_{RCa})$, $I_{AHP}(v, m) = g_{AHP}\, m\, (v - v_K)$, and $I_{RL}(v) = g_{RL}(v - v_{RL})$. More details concerning the biophysical significance of each term are given in Golomb et al. (1994) and Terman et al. (1996). In equation 5.3, $I_{RA}$ represents the inhibitory input from other RE cells. It is modeled as $I_{RA} = g_{RA}(v_i^R - v_{RA}) \frac{1}{N_{RR}} \sum_j s_{RA}^j$, where the sum is over all RE cells that send input to the $i$th RE cell. Each synaptic variable $s_{RA}^j$ satisfies a
626
J. Rubin and D. Terman
first-order equation similar to 5.2. The term $I_E$ represents excitatory (AMPA) input from the TC cells and is expressed as $I_E = g_E (v_i^R - v_E) \frac{1}{N_{RT}} \sum_j s_E^j$, where the sum is over all TC cells that send excitatory input to the $i$th RE cell. The synaptic variables $s_E^j$ are fast and also satisfy first-order equations similar to 5.2. It remains to discuss how we model the slow inhibitory current $I_B$. Similarly to Destexhe et al. (1996), we assume that $I_B = g_B \frac{s_{bi}^4}{s_{bi}^4 + \lambda}(v_i - v_B)$, where $s_{bi}$, along with the variable $x_{bi}$, satisfies

$$
\begin{aligned}
s_{bi}' &= k_1 H(x_{bi} - \theta_{xb})(1 - s_{bi}) - k_2 s_{bi}\\
x_{bi}' &= \frac{k_3}{N_{TR}} \sum_j H(v_R^j - \theta_{Rb})\, (1 - x_{bi}) - k_4 x_{bi}.
\end{aligned}
$$

The parameters are such that $x_{bi}$ can become activated (i.e., exceed $\theta_{xb}$) only if a sufficiently large number of RE cells have their membrane potentials $v_R^j$ above the threshold $\theta_{Rb}$. The threshold is chosen rather large, so the RE bursts must be sufficiently powerful to activate $x_{bi}$. Once $x_{bi}$ becomes activated, it turns on the synaptic variable $s_{bi}$; the expression $s_{bi}^4$ in $I_B$ further delays the effect of the inhibition on the postsynaptic cell.

5.2 Numerical Simulations. A clustered solution is shown in Figure 8. There are three RE cells in this example, and they oscillate in synchrony at about 12.5 Hz; one of the RE cells is shown in Figure 8A. The RE cells synchronize due to excitation from the TC population. (This is discussed in more detail in section 5.3.) There are six TC cells, and they form two clusters, each oscillating at half of the RE oscillation frequency, as shown in Figure 8B. In Figure 8C, we show the time courses of the fast ($s_A$) and slow ($s_b$) inhibitory synaptic variables, respectively. Note that the fast inhibition activates during every cycle, providing the hyperpolarizing current needed to deinactivate each TC cell's $I_T$ current. The fast inhibition is also needed to desynchronize the TC cells so they can form clusters. The slow current $I_B$ never activates during this solution because the RE cells do not fire powerful enough bursts; that is, the membrane potentials $v_i^R$ do not rise above the threshold $\theta_{Rb} = -25$ mV long enough to activate the variables $x_{bi}$.

A synchronous solution is shown in Figure 9. The parameters are exactly as in Figure 8 except that we set $g_{RA} = 0$; that is, we have turned off the fast inhibition between RE cells. Note in Figure 9 that each TC cell fires during every cycle along with the RE cells. The slow inhibitory current $I_B$ now activates during every cycle.
Comparing the slow inhibitory variable with the fast inhibitory variable in Figure 9C, we see that the slow inhibition stays on longer and both turns on and turns off more gradually. Geometric analysis is useful in understanding why removing fast inhibition allows slow inhibition to activate and why this leads to synchronization of the network. This is discussed in the next subsection.
Figure 8: Numerical solution, with two TC clusters, of the thalamic network (parameter values are given in appendix A). Voltages are in mV and time is in msec. (Top) RE cell (the time course of which matches that of the RE population of three cells). (Middle) TC population of six cells, forming two clusters of three cells each. (Bottom) Inhibitory synaptic variables $s_A$ (dashed) and $s_b$ (solid). Note that $s_b \equiv 0$, since the RE cell bursts are not powerful enough to activate slow inhibition in this case.
In Figure 10 we plot the trajectory $(h_i(t), r_i(t), s_b(t))$ corresponding to the synchronous solution shown in Figure 9. We also plot the numerically computed surface of knees corresponding to the jump-up points; there is a similar jump-down surface, but it is not shown. The behavior of the trajectory is consistent with that discussed in the previous section. For example, the cells jump up when the trajectory crosses the jump-up surface of knees (approximately, since $\epsilon \neq 0$ numerically) (see Rubin & Terman, 1998, for a similar plot for the clustered solution). Figure 11 shows a plot of the period of synchronous TC oscillations as the rate of decay of inhibition, the rate of change of $h$, and the parameter $g_B$ are
Figure 9: Numerical synchronous solution of the thalamic network (parameter values are given in appendix A). Voltages are in mV and time is in msec. (Top) RE cell (the time course of which matches that of the RE population of three cells). (Middle) TC population of six cells (synchronized). (Bottom) Inhibitory synaptic variables sA (dashed) and sb (solid). Note that sA turns on and off faster than sb , which in turn stays on longer.
separately varied. These relations can be derived from our analysis, as can predictions about the dependence of the period on other model parameters. As noted earlier, the period is given by the duration of the RE active phase ($t_J$) plus the time it takes for inhibition to decay sufficiently that the TC cells can jump up ($t_S$); hence, the period drops relatively sharply as the inhibitory decay rate increases. The same strong dependence, which our analysis explains, was found in the modeling work of Destexhe (1998). As the rate of change of $h$ increases, there is essentially no change in the length of the TC silent phase (not shown) or in the period. This reflects the fact that over the parameter range considered, the TC cells are compressed close to their silent phase rest state while the RE cells are active, and then the TC cells
Figure 10: Numerical trajectory of a TC cell, together with jump-up surface of knees (shaded), in the (h, r, sb ) slow phase-space. The TC cell shown belongs to the synchronized population of six TC cells shown in Figure 9. Note that sb does not immediately increase when the cell jumps up, since we have taken slow inhibitory synapses to be indirect.
evolve with little change in $h$ after the RE cells jump down. Interestingly, $g_B$ has little effect on the period because it has little influence on $t_J$, $t_S$. The mild effect that does occur is due to the subtle influence of $g_B$ on the slope of the curve of knees for the TC silent phase. This actually causes the period to increase as $g_B$, and hence the strength of slow inhibitory coupling, increases (see also Golomb et al., 1994). Our numerical studies show that the synchronous and clustered solutions are robust to moderate levels of heterogeneities and variations in the parameters. This is due to the strong TC compression provided by the RE cells. For example, these solutions were not affected by heterogeneities of about 20% in sag conductances and about 5% in the TC $I_T$ conductance.
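The decomposition of the period into $t_J$ plus $t_S$ can be made concrete with a toy calculation: if the inhibition felt by the TC cells decays as $s(t) = s_0 e^{-k_2 t}$ after RE jump-down, and the TC cells can jump up once $s$ falls below some level $s_\theta$, then $t_S = \ln(s_0/s_\theta)/k_2$. The numbers below are purely illustrative, not fitted to the model:

```python
import numpy as np

def toy_period(k2, t_J=30.0, s0=1.0, s_theta=0.3):
    """Toy period estimate (msec): RE active-phase duration t_J plus the
    time t_S for s(t) = s0*exp(-k2*t) to decay to the jump-up level s_theta.
    All values are illustrative stand-ins, not the paper's parameters."""
    t_S = np.log(s0 / s_theta) / k2
    return t_J + t_S

# The period falls sharply as the decay rate k2 grows, qualitatively
# matching the solid curve of Figure 11 (where k2 ranges over 0.04-0.14).
periods = [toy_period(k2) for k2 in (0.04, 0.08, 0.14)]
```

Because $t_S$ scales like $1/k_2$, the drop in period is steepest at the slow end of the decay-rate range, consistent with the shape of the solid curve in Figure 11.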
Figure 11: Period of synchronous TC oscillations versus rate of decay of inhibition ($k_2 = \beta_R$, solid curve), rate of change of $h$ (dashed line), and $g_B$ (dotted line) for the population shown in Figure 9. The parameters $k_2$, $\beta_R$ were varied from 0.04 to 0.14 and were held at 0.1 for the other two curves. A parameter $w$ was factored out from $\tau_{h0}$, $\tau_{h1}$ in the $h$-equation and was varied from 1.5 to 2.5; $w$ was held at 2.0 for the other curves. The parameter $g_B$ was varied from 0.03 to 0.07 and was held at 0.05 for the other two curves. Other parameter values are as given in appendix A.
Stronger TC heterogeneities tend to promote TC clustering in this model. In the resultant patterns, TC cells with similar properties fire together, independent of the way they are initially perturbed from synchrony.

5.3 Further Implications of the Analysis.
5.3.1 Removing Fast Inhibition. Fast inhibition occurs in two places: the RE cells inhibit themselves as well as the TC cells. Removing fast inhibition has different consequences for each of these synaptic connections, and
both of these help to synchronize the TC cells (see also Golomb et al., 1994; Destexhe & Sejnowski, 1997). Removing the RE-TC fast inhibition is clearly helpful for synchronization among the TC cells. This holds because the fast inhibition has a very short rise time, which follows since the fast inhibitory synapses are direct in equations 5.1 and 5.2. This short rise time corresponds to a small domain of attraction of the synchronous solution. In fact, it is precisely this inhibition that is responsible for desynchronizing the TC cells during the clustered solution. Removing the RE-RE fast inhibition appears to be even more crucial for synchronizing the TC cells (see also Steriade, McCormick, & Sejnowski, 1993; Huguenard & Prince, 1994; Destexhe, Bal, McCormick, & Steriade, 1996; Destexhe, 1998). This allows the RE cells to fire longer, more powerful bursts, which activate the slow inhibitory current $I_B$. The analysis in section 4 demonstrates that long, powerful bursting of the RE cells is needed for the TC cells to synchronize, unless the desynchronizing effect of inhibition is somehow removed, as discussed in Terman et al. (1996). Our numerical simulations (e.g., Figures 9-12) show, in fact, that the TC cells will synchronize even if fast inhibition is removed from within the RE population but not from the RE-TC connections. Why removal of inhibition leads to stronger RE bursts can be easily understood by our analysis of trajectories in phase-space. This removal forces the RE cells to lie on the right branch of a different cubic while in the active phase. The cubic of the disinhibited cells lies below the cubic of the cells with inhibition. The disinhibited cells therefore jump up to larger values of membrane potential; moreover, their jump-down point (right knee) lies below the jump-down point of the inhibited cells. The disinhibited RE cells therefore have a longer active phase.
Note that the slow inhibitory current $I_B$, activated when RE-RE fast inhibition is removed, has slower rise and decay times than $I_A$. The slow rise time enhances the domain of attraction of the synchronous solution by expanding the window of opportunity. The slow decay time can help to bring the cells closer together while in the silent phase, as discussed in section 3.3 and appendix B. Hence, both effects improve the ability of the TC cells to synchronize in the absence of fast inhibition.

5.3.2 Role of the Sag Current. We view the TC cells as examples of compound cells. If one removes the sag current $I_{sag}$ or keeps it constant, then they become basic cells. Hence, questions concerning the differences between networks with basic or compound cells are closely related to issues pertaining to the role of the sag current. The analysis in section 3 shows that in mutually coupled networks, whether the cells are basic or compound is significant. Added complexity allows the cells to escape from the silent phase and also leads to possible compression mechanisms. In globally inhibitory networks, the dynamical mechanisms responsible for the synchronous and clustered solutions are basically the same for basic or compound cells. We
therefore do not view the sag current as playing a direct role in the generation of these rhythms. Recall that oscillations arise when the cells are capable of escaping from the silent phase. The sag current helps modulate the intrinsic properties of the TC cells, and this can determine whether escape is even possible. From a more biophysical viewpoint, the sag current depolarizes the TC cells while they are silent. Hence, a stronger sag current raises the TC's resting potential. If this resting potential is too large, then inhibitory input from the RE cells may not be capable of hyperpolarizing the TC cells sufficiently to deinactivate the $I_T$ current. If this is the case, then the TC cells will not be able to fire (i.e., escape the silent phase). We note that this relates closely to some theories about mechanisms responsible for waxing and waning behavior during the spindle rhythm (Destexhe, Babloyantz, & Sejnowski, 1993; Bal & McCormick, 1996). Within the range where escape is possible, increasing the sag current tends to enhance the TC synchronization. This occurs because the increased sag current tends to depolarize the TC cells more rapidly, leading to stronger compression. Note that while the sag current is depolarizing during the silent phase, it is hyperpolarizing in the active phase. Hence, increasing the sag conductance lowers the left branch of the cubic corresponding to the TC cells but raises the right branch of the cubic. This decreases the voltage level to which TC cells jump up and hence the overall amplitude of TC oscillations (see Figure 12); at jump-down, the sag current is essentially deactivated, so the size of $g_{sag}$ has little effect there. Increasing $g_{sag}$ also tends to decrease the level of deinactivation of $I_T$ at firing, as we can show using arguments similar to the proof of the lemma in section 3, and to decrease the period of the TC cells (see Figure 12).
To analyze this dependence of the period more carefully, note that the evolution of the inactivation variable $h$ for $I_T$ is almost linear for most of the silent phase; we can model this by $h' = w(1 - h)$, independent of the inhibition level. From this equation, we can approximate the time $t_k$ it takes for the cell to reach the curve of knees. This depends on $g_{sag}$, since the location of the curve of knees does. We find that $t_k \approx -\frac{1}{w}\ln(1 + h_f - h_k)$, where $h_f$ is the $h$-value when the cell falls down and $h_k$ is the $h$-value when it hits the curve of knees to jump up. Incorporating the dependence of $h_f$ and $h_k$ on $g_{sag}$ implies that $\partial t_k/\partial g_{sag} < 0$; in fact, since $h_f - h_k$ is small relative to 1 and $h_f$, $h_k$ depend close to linearly on $g_{sag}$, the dependence of $t_k$ on $g_{sag}$ is almost linear as well. An analogous argument shows that the time spent in the active phase also has a roughly linear relation to $g_{sag}$. These nearly linear dependencies sum to yield a nearly linear relationship between overall oscillation frequency and $g_{sag}$, as appears in Figure 12.

5.3.3 Synchronization Among the RE Cells. The RE cells are synchronized during the spindle rhythm, primarily due to the excitation they receive from
Figure 12: The effect of gsag on synchronous TC oscillations. Six TC cells and three RE cells were simulated with parameter values given in appendix A.
the TC cells. Although only about half of the TC cells fire during each cycle, if each TC cell sends excitation to several RE cells, then every RE cell will receive a sufficient amount of excitation to fire during every cycle. Experiments have also shown that the RE cells can sustain synchronized rhythms when these cells are completely isolated (Steriade, Domich, Oakson, & Deschênes, 1987; Steriade, Jones, & Llinás, 1990). In fact, these are the experiments that motivated many of the theoretical studies concerning how synchronization can arise in a population of inhibitory cells. If one considers the RE cells to be modeled as basic cells, then the conclusion of these studies is that synchronization is possible only if the decay of inhibition is sufficiently slow. However, modeling the RE cells as in equation 5.3, we consider them to be compound cells; the slow variables are $h_i^R$ and $m_i$. We then conclude from our analysis of mutually coupled inhibitory cells in section 3 that synchronization is possible even if the inhibition decays quickly.
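As a numerical sanity check of the silent-phase timing estimate in section 5.3.2: solving $h' = w(1 - h)$ from $h_f$ up to the knee value $h_k$ gives the exact passage time $t_k = \frac{1}{w}\ln\frac{1 - h_f}{1 - h_k}$, which the text approximates by $-\frac{1}{w}\ln(1 + h_f - h_k)$. The sample values of $w$, $h_f$, $h_k$ below are illustrative, not taken from the model:

```python
import numpy as np

def t_k_exact(w, h_f, h_k):
    """Exact time for h' = w*(1 - h) to grow from h_f to the knee value h_k."""
    return np.log((1.0 - h_f) / (1.0 - h_k)) / w

def t_k_approx(w, h_f, h_k):
    """The text's approximation t_k ~ -(1/w)*ln(1 + h_f - h_k), accurate when
    h_f and h_k - h_f are small."""
    return -np.log(1.0 + h_f - h_k) / w

w, h_f, h_k = 2.0, 0.05, 0.12   # illustrative values only
```

For small $h_f$ and small $h_k - h_f$, both expressions reduce to $(h_k - h_f)/w$ to first order, so they agree closely in this regime.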
5.3.4 Cortical Inputs. Our results clearly demonstrate that globally inhibitory networks provide a natural framework for analyzing RE-TC interactions in the generation of thalamic sleep rhythms. These results are also consistent with recent work about the impact of cortical inputs in the development and manifestation of the sleep rhythms (Steriade, Currò Dossi, & Núñez, 1991; Steriade, Núñez, & Amzica, 1993; Steriade, 1994; Contreras, Destexhe, Sejnowski, & Steriade, 1996; Timofeev & Steriade, 1996; Contreras, Destexhe, & Steriade, 1997; Destexhe, Contreras, & Steriade, 1998). Here we review some of this work and briefly describe how it relates to our analysis. Recent work has emphasized the importance of the cortex in the transformation of spindle oscillations into spike-and-wave-like (SW) epileptiform oscillations in the thalamus (Steriade, 1994; Steriade, Contreras, & Amzica, 1994; Contreras, Destexhe, & Steriade, 1996; Destexhe, Contreras, Sejnowski, & Steriade, 1996; Destexhe & Sejnowski, 1997; Steriade, Contreras, & Amzica, 1997; Destexhe, 1998). The experiments and modeling described in these works have suggested that this can arise without the removal of fast inhibition from the thalamus. Recall that in the mechanism we have discussed, disinhibition of the RE cells leads to more powerful RE bursts, and this permits the TC cells to synchronize. Our analysis supports the finding (Contreras, Destexhe, & Steriade, 1996; Destexhe et al., 1996; Steriade et al., 1997; Destexhe, 1998) that if one does not remove the fast inhibition among the RE cells but instead induces sufficiently strong excitation from the cortex, then this will have the same effect: the RE cells will fire more powerful bursts because of the additional excitation (their cubics are lowered). Hence, GABA_B inhibition of the TC cells ensues and the TC cells can synchronize, even though they receive fast inhibition from the RE cells.
In particular, this explains the mechanism behind the results of Destexhe (1998), in which sufficiently strong corticothalamic excitation (achieved by blocking only cortical GABA_A) is found to be crucial for triggering powerful RE bursts. These bursts in turn lead to the activation of GABA_B in the thalamus and the generation of synchronized $\approx$3 Hz oscillations in TC and RE cells. It is interesting to note that SW discharges of pyramidal cells and interneurons in the cortex, as modeled by Destexhe, apparently result from a related interaction: excitation from TC cells induces the “spike” of prolonged firing of pyramidal cells and interneurons in the cortex, and the resulting GABA_B inhibition from the interneurons to the pyramidal cells enhances synchronization and maintains the subsequent “wave” of inactivity until the TC cells rebound and restart the cycle. Our analysis has shown that it is possible for the TC cells to synchronize even in the presence of fast inhibition if the RE cell bursts are powerful enough. This does not, however, contradict the argument in Terman et al. (1996) that the effect of fast inhibition to the TC cells must be removed during delta. Since the RE cells fire at the slow rhythm in delta, their bursts
are completely absent (after the first cycle) while the TC cells fire. Hence, the RE cells do not produce the powerful bursts that would be needed to synchronize the TC cells if the effect of the fast inhibition remained.

5.3.5 Variations in Synchronous Oscillations. The thalamic network may exhibit other types of solutions than those discussed. This is because assumptions required for the geometric constructions of the solutions may not be satisfied for some ranges of parameter values. Subtle variations in network behavior can arise as parameters that change the underlying geometry of phase-space are varied. One possibility that arises, for certain biophysically relevant parameter values, is that the right knees of each RE cell's family of cubics lie very close to the left knees of the RE cell's cubics. In this case, each RE cell recovers almost instantly on jump-down (after a long, gradual decrease in $v_R$ during the active phase). If the TC cells approach sufficiently close to the stable equilibrium in the silent phase while the RE cells are still active, then they can all fire as soon as they are disinhibited by RE jump-down. Hence, stable synchrony can occur here without activation of slow inhibition, even if fast inhibition decays on the fast timescale, since slow decay of inhibition was needed only to allow the RE cells to recover; this holds even though RE-TC fast inhibitory synapses are direct. (For more details, see Rubin & Terman, 1998.)

6 Discussion
Numerous articles have considered models for thalamic oscillations and mechanisms for synchronization. The RE-TC model studied here is closely related to that in Golomb et al. (1994); however, those results are based primarily on numerical studies. Here we develop, for the first time, a systematic approach for analytically studying models as complex as the thalamic networks. Related articles that deal with the role of inhibition in synchronizing bursting-like oscillations are Wang and Rinzel (1993) and Terman et al. (1998). Their conclusion is that a slow decay of inhibition is needed to obtain synchrony. This does not account, however, for synchronization of the RE-TC network in the presence of GABA_A inhibition. By extending the analysis in Terman et al. (1998) to more complex networks, we are able to understand the mechanisms responsible for this. Several other articles, including van Vreeswijk et al. (1994) and Gerstner et al. (1996), have also shown that inhibition can lead to synchrony. These articles considered integrate-and-fire-type models and therefore do not directly apply to the thalamic networks. One of their conclusions is that the synchronous solution must be unstable if the synapses have instantaneous onset. This is consistent with our result (also in Terman et al., 1998) that indirect synapses are needed for synchrony. Our analysis shows, however, that parameters not in the integrate-and-fire models, such as the time on the excited branch or the geometry of the curves of knees, can strongly influence when synchrony arises.
Our results clarify the multiple roles that inhibition can play in producing different rhythms. Inhibition may help to synchronize or desynchronize oscillations, depending on several factors. Fast onset of inhibition tends to desynchronize the cells, because when one cell fires, it then quickly “steps on” other cells. For this reason, we generally need to assume that the synapses are indirect for stable synchronization to occur. Slow offset of inhibition may help to synchronize the oscillations; it can lead to compression of the cells while they evolve in the silent phase. Compression (or expansion) can also take place as the cells jump up or down between the silent and active phases. This depends on the geometry of curves (or surfaces) of knees, which in turn depends on both the intrinsic and the synaptic parameters, such as $g_{syn}$ and $g_{sag}$. A more powerful source of compression arises in the globally inhibitory networks. While the inhibitory cell J is active, it produces sustained inhibition to the E-cells. This forces the E-cells close to a stable fixed point and therefore close to each other. In these networks, fast offset of inhibition also helps to synchronize the cells; it helps to get cells through the narrow window of opportunity for firing. We have demonstrated that mutually coupled networks of excitable cells with indirect fast inhibitory coupling can produce synchronized rhythms. This is not possible if the network contains only basic cells. The geometric approach helps explain why additional cellular complexity allows for synchronized rhythms. The additional complexity translates, within the framework of the geometric approach, to higher-dimensional slow manifolds. This allows the cells to escape from the silent phase. It also leads to additional sources of compression. We conclude that synchronization is possible in mutually coupled inhibitory networks if there are at least two slow variables in the intrinsic or synaptic dynamics.
This has relevance for isolated RE cell populations, which can synchronize even though they include fast inhibition (Steriade et al., 1990). Globally inhibitory networks can produce different rhythms depending on the intrinsic dynamics of the J-cells, which controls the amount of inhibition sent back to the E-cells. A rapid rate of synchronization, and a large domain of attraction for the synchronous solution, are achieved when the following factors are present: indirect synapses to provide a window of opportunity, a long J-cell active phase to enhance compression among E-cells, and a fast J-cell recovery coupled with relatively fast synaptic decay. Less powerful inhibition, or a smaller window of opportunity relative to synaptic decay rate, results in clustering among the E-cells. The network crashes if the amount of inhibition is too small. In the models for sleep rhythms, there are several possible ways to control the RE cells’ bursts and therefore the emergent network behavior. More powerful RE bursts result from removal of fast inhibition from among the RE cells (Steriade, McCormick, & Sejnowski, 1993; Huguenard & Prince, 1994; Destexhe, Bal, McCormick, & Sejnowski, 1996; Destexhe, 1998) or from the addition of excitatory input from the cortex (Contreras, Destexhe, & Steriade, 1996; Destexhe, Contreras, Sejnowski,
& Steriade, 1996; Steriade et al., 1997; Destexhe, 1998). Other intrinsic RE parameters, such as a leak conductance, may also greatly influence the RE cells' dynamics. We have seen how such considerations help explain the transition between spindling, delta, and paroxysmal discharges in RE-TC networks. The geometric analysis can lead to precise statements for when a particular rhythm is possible. For example, a TC cell can fire only if it is sufficiently hyperpolarized; this deinactivates the $I_T$ current. A geometric interpretation is that a TC cell can fire only if, during the silent phase, it lies in the region where trajectories are able to reach the jump-up curve of knees. By considering the slow equations corresponding to the TC cells (the compound analogue of equation 4.1), one can then derive conditions for when a TC cell will fire. In this sense, the geometric analysis helps clarify how different parameters influence the rhythms. For example, we have seen how the analysis demonstrates that the sag current plays different roles in generating synchronized rhythms in mutually coupled and globally inhibitory networks. We note that the geometric approach used here (see also Terman & Wang, 1995; Terman & Lee, 1997; Terman et al., 1998) is somewhat different from that used in many dynamical systems studies. All of the networks considered here consist of many differential equations, especially for larger networks. Traditionally, one would interpret the solution of this system as a single trajectory evolving in a very large-dimensional phase-space. Instead, we consider several trajectories, one corresponding to each cell, moving around in a much lower-dimensional phase-space. After reducing the full system to a system for just the slow variables, the dimension of the lower-dimensional phase-space equals the number of slow intrinsic variables and slow synaptic variables corresponding to each cell.
In the worst case considered here, there are two slow variables for each compound cell and one slow synaptic variable; hence, we never have to consider phase-spaces with dimension more than three. Of course, the particular phase-space we need to consider may change, depending on whether the cells are active or silent and also depending on the synaptic input that a cell receives. We assumed throughout that the networks were completely homogeneous. The analysis certainly extends to network models with mild heterogeneities. In this case, cells within each cluster will no longer be perfectly synchronized. They may lie on different branches of different cubic surfaces. If the cubics are sufficiently close to each other, however, the jumps will be very close to each other and almost synchrony will result.

Appendix A
The equations for the TC and RE cells in the thalamic network are given in section 5.1. As our biophysical mutually coupled network of compound cells, we considered a pair of TC cells, each governed by equation 5.1; note that in the case of basic cells without slow inhibition, these could be thought
of as simplified RE cells or TC cells. The synaptic variables in this model satisfy equation 2.6. For numerical simulations, we approximated Heaviside functions by functions of the form

$$ H_\infty(v) = \frac{1}{1 + \exp(-k(v - \theta))}. $$
The functions $h_\infty(v)$, $m_\infty(v)$, $r_\infty(v)$, $h_{R\infty}(v)$, and $m_{R\infty}(v)$ are assumed to be of the same form: if $j = h, m, r, hR$, or $mR$, then $j_\infty(v) = \frac{1}{1 + \exp((v + \theta_j)/\sigma_j)}$. Further, we take

$$ \tau_h(v) = \tau_{h0} + \frac{\tau_{h1}}{1 + \exp((v + v_{\tau h})/\sigma_{\tau h})}, $$

with $\tau_{Rh}(v)$ having an analogous form, and

$$ \tau_r(v) = \tau_{r0} + \frac{\tau_{r1}}{\exp((v + v_{\tau r0})/\sigma_{\tau r0}) + \exp(-(v + v_{\tau r1})/\sigma_{\tau r1})}. $$
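These functional forms are straightforward to transcribe. A hedged Python transcription follows (the function names are ours; the forms are exactly those above):

```python
import numpy as np

def smooth_heaviside(v, theta, k):
    """H_inf(v) = 1/(1 + exp(-k*(v - theta))): the smoothed step used in
    place of the Heaviside function in the simulations."""
    return 1.0 / (1.0 + np.exp(-k * (v - theta)))

def j_inf(v, theta_j, sigma_j):
    """Steady-state gating, j_inf(v) = 1/(1 + exp((v + theta_j)/sigma_j))."""
    return 1.0 / (1.0 + np.exp((v + theta_j) / sigma_j))

def tau_h(v, tau_h0, tau_h1, v_tauh, sigma_tauh):
    """tau_h(v) = tau_h0 + tau_h1/(1 + exp((v + v_tauh)/sigma_tauh))."""
    return tau_h0 + tau_h1 / (1.0 + np.exp((v + v_tauh) / sigma_tauh))

def tau_r(v, tau_r0, tau_r1, v0, s0, v1, s1):
    """tau_r(v) = tau_r0 + tau_r1/(exp((v + v0)/s0) + exp(-(v + v1)/s1))."""
    return tau_r0 + tau_r1 / (np.exp((v + v0) / s0) + np.exp(-(v + v1) / s1))
```

With the Figure 8 TC values ($\theta_h = 82$, $\sigma_h = 5$), $j_\infty$ is near 1 at hyperpolarized potentials and near 0 when depolarized, as expected for the inactivation gate of $I_T$, and $\tau_h$ is largest at hyperpolarized potentials.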
To generate our numerical figures, we started cells under a slight perturbation from a synchronous state. For Figures 3B and 4C, we used the following parameter values, based on Golomb et al. (1994) and Terman et al. (1996). $I_T$: $g_{Ca} = 2.5$, $\theta_m = 57.0$, $\sigma_m = -6.0$, $v_{Ca} = 140.0$, $\theta_h = 81.0$, $\sigma_h = 4.0$, $\tau_{h0} = 10.0$, $\tau_{h1} = 73.\overline{3}$, $v_{\tau h} = 78.0$, $\sigma_{\tau h} = 3.0$; $I_{sag}$: $g_{sag} = 0.2$, $v_{sag} = -50.0$, $\theta_r = 75.0$, $\sigma_r = 5.5$, $\tau_{r0} = 20.0$, $\tau_{r1} = 1000.0$, $v_{\tau r0} = 71.5$, $\sigma_{\tau r0} = 14.2$, $v_{\tau r1} = 89.0$, $\sigma_{\tau r1} = 11.6$; $I_L$: $g_L = 0.025$, $v_L = -75.0$; $I_A$: $g_A = 0.4$, $v_A = -79.0$, $\alpha = 16.0$, $\beta = 4.0$, $\theta_x = 0.1$, $k = 100.0$, $\epsilon\alpha_x = 0.3$, $\epsilon\beta_x = 0.1$, $\theta_{syn} = -50.0$. We did not include slow inhibition here.

To generate the clustered solution displayed in Figure 8, we used the following parameter values, also based on Golomb et al. (1994) and Terman et al. (1996). Six TC cells: $I_T$: $g_{Ca} = 1.5$, $\theta_m = 59.0$, $\sigma_m = -9.0$, $v_{Ca} = 90.0$, $\theta_h = 82.0$, $\sigma_h = 5.0$, $\tau_{h0} = 66.\overline{6}$, $\tau_{h1} = 333.\overline{3}$, $v_{\tau h} = 78.0$, $\sigma_{\tau h} = 1.5$; $I_{sag}$: $g_{sag} = 0.15$, $v_{sag} = -40.0$, $\theta_r = 75.0$, $\sigma_r = 5.5$, $\tau_{r0} = 20.0$, $\tau_{r1} = 1000.0$, $v_{\tau r0} = 71.5$, $\sigma_{\tau r0} = 14.2$, $v_{\tau r1} = 89.0$, $\sigma_{\tau r1} = 11.6$; $I_L$: $g_L = 0.2$, $v_L = -76.0$; $I_A$: $g_A = 0.1$, $v_A = -84.0$, $\alpha_R = 8.0$, $\beta_R = 0.05$, $\theta_R = -50.0$, $k_A = 2.0$; $I_B$: $g_B = 0.05$, $v_B = -95.0$, $\lambda = 10^{-4}$, $k_1 = 0.1$, $k_2 = 0.05$, $\theta_{xb} = 0.8$, $k_{xb} = 0.02$, $k_3 = 0.5$, $k_4 = 0.005$, $\theta_{Rb} = -25.0$, $k_b = 2.0$. Three RE cells: $I_{RT}$: $g_{RCa} = 2.0$, $\theta_{mR} = 52.0$, $\sigma_{mR} = -9.0$, $v_{RCa} = 90.0$, $\theta_{hR} = 72.0$, $\sigma_{hR} = 2.0$, $\tau^R_{h0} = 66.\overline{6}$, $\tau^R_{h1} = 333.\overline{3}$, $v^R_{\tau h} = 78.0$, $\sigma^R_{\tau h} = 1.0$; $I_{AHP}$: $g_{AHP} = 0.1$, $v_K = -90.0$, $\mu_1 = 0.02$, $\mu_2 = 0.025$, $\nu = 0.01$, $\gamma = 0.08$; $I_{RL}$: $g_{RL} = 0.3$, $v_{RL} = -76.0$; $I_{RA}$: $g_{RA} = 0.25$, $v_{RA} = -84.0$, other parameters as in $I_A$ for the TC cells; $I_E$: $g_E = 0.6$, $v_E = 0$, $\alpha_E = 2.0$, $\beta_E = 0.05$, $\theta_E = -35.0$, $k_E = 2.0$. [Note that in the current $I_E$, the variables $s_E^j$ satisfy an equation of the form 5.2, with the parameters $\alpha_E, \ldots$ replacing $\alpha_R, \ldots$.]
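To make the parameter listing concrete, here is a minimal forward-Euler integration of a single TC cell (equation 5.1) with the Figure 8 TC parameters and a fixed level of fast inhibition standing in for the RE input. This is an orientation sketch only; the actual simulations integrate the full coupled network with smoothed Heaviside functions, and the initial conditions and the inhibition level $s_A = 0.2$ here are our own arbitrary choices:

```python
import numpy as np

# Figure 8 TC-cell parameters from appendix A (66.6..., 333.3... written as thirds).
g_Ca, theta_m, sigma_m, v_Ca = 1.5, 59.0, -9.0, 90.0
theta_h, sigma_h = 82.0, 5.0
tau_h0, tau_h1, v_tauh, s_tauh = 200.0 / 3.0, 1000.0 / 3.0, 78.0, 1.5
g_sag, v_sag, theta_r, sigma_r = 0.15, -40.0, 75.0, 5.5
tau_r0, tau_r1, v0, s0, v1, s1 = 20.0, 1000.0, 71.5, 14.2, 89.0, 11.6
g_L, v_L = 0.2, -76.0
g_A, v_A = 0.1, -84.0

def sig(v, th, s):
    """Sigmoid 1/(1 + exp((v + th)/s)); gives j_inf for each gating variable."""
    return 1.0 / (1.0 + np.exp((v + th) / s))

def rhs(v, h, r, s_A):
    """Right-hand side of equation 5.1 for one TC cell, with s_A held fixed."""
    I_T = g_Ca * sig(v, theta_m, sigma_m) ** 2 * h * (v - v_Ca)
    I_sag = g_sag * r * (v - v_sag)
    I_L = g_L * (v - v_L)
    I_A = g_A * s_A * (v - v_A)
    tau_h = tau_h0 + tau_h1 / (1.0 + np.exp((v + v_tauh) / s_tauh))
    tau_r = tau_r0 + tau_r1 / (np.exp((v + v0) / s0) + np.exp(-(v + v1) / s1))
    return (-I_T - I_sag - I_L - I_A,
            (sig(v, theta_h, sigma_h) - h) / tau_h,
            (sig(v, theta_r, sigma_r) - r) / tau_r)

v, h, r = -65.0, 0.3, 0.1         # arbitrary hyperpolarized starting point
dt = 0.05                         # msec
for _ in range(int(200.0 / dt)):  # integrate 200 msec
    dv, dh, dr = rhs(v, h, r, s_A=0.2)
    v, h, r = v + dt * dv, h + dt * dh, r + dt * dr
```

Starting from a hyperpolarized state with partially deinactivated $I_T$, the cell produces a calcium-driven depolarization that collapses as $h$ inactivates, illustrating the burst-and-recover behavior of the compound cell; gating variables remain in $[0, 1]$ and the voltage remains bounded.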
To generate the synchronous solution displayed in Figures 9 and 10, we used the same parameter values except a D 16, b D 0.2, st h D 3.0, stRh D 2.0, gRA D 0. By setting gRA D 0, we removed the RE-RE fast inhibition; for aR D 16, bR D 0.2 with the same initial conditions and gRA D 0.25, the solution forms two clusters, but slow inhibition is slightly activated and it eventually destabilizes, while for gRA D 0.5, the solution forms two clusters without activation of slow inhibition. Finally, for Figures 11 and 12, we used the above parameter values except certain parameters were varied as discussed in the text and gure captions and th0 D 100.0, th1 D 500.0, st h D R D 100.0, t R D 500.0, s R D 2.0. 3.0, th0 th h1 We assumed in the proof of the lemma in section 3 that for compound cells, fy > 0 in the silent phase, while fy < 0 in the active phase. This is justied for the TC cell model for the following reason. Note that y corresponds to the variable r; hence, fy D ¡gsag (v ¡ vsag ). Since vsag ¼ ¡40 mV typically, while v ranges from around ¡80 mV in the silent phase to at least ¡30 mV in the active phase, the result follows. Remark 4.
Remark 5. We claimed in remark 1 that |∂wR/∂y| is quite small. This is also shown numerically in Figure 3B. We can understand analytically why this is so by recalling, from the proof of the lemma, that ∂wα/∂y = −fy/fw for α = L or R, corresponding to the silent or active phase, respectively. From equation 5.1,
∂wα/∂y = −gsag(v − vsag) / [gCa m∞²(v)(v − vCa)].    (A.1)
Typical values are gsag ≈ 0.04 up to 2.0, vsag ≈ −40 mV, gCa ≈ 2.5, vCa ≈ 140 mV. In the silent phase, m∞(v) is small, so the numerator and denominator in equation A.1 have similar magnitudes, even though v ≈ −80 mV. In the active phase, however, m∞(v) ≈ 1 while v ≈ 0, typically. Hence, |∂wR/∂y| is quite small.

Appendix B
Consider a pair of compound cells in the active phase, with mutual inhibitory coupling. Suppose that the cells are perturbed from synchrony such that the lead cell falls down to the silent phase at time t1 and the following cell falls down at time t2 > t1. Assume slow decay of inhibition (β = εK), although the analysis simplifies otherwise (see below). During the silent phase, for t ≥ t2, the slow dynamics of the two cells are given by

ẇ = g(v, w),  ẏ = h(v, y),  ṡ = −Ks,    (B.1)
640
J. Rubin and D. Terman
Figure 13: The set-up for analysis of compression in the silent phase in two dimensions (i.e., assuming s is fixed). The picture generalizes naturally to 3D when s is included as another slowly evolving variable. Cell 1 jumps up first, with t = T1, at (y1*, w1*). The larger dotted square shows a blow-up of the smaller dotted square, in which the vectors V and (η, ρ) are defined.

where v satisfies v = WL(w, y, s) and ˙ denotes d/dt. Let V(t) = (g(w, v), h(y, v), −Ks). The equation of variations that describes the evolution of tangent vectors to the flow of equation B.1 is
δẇ = −aw δw − bw δy − cw δs
δẏ = −ay δw − by δy − cy δs
δṡ = −K δs,    (B.2)
where aw = −∂g/∂w, bw = −∂g/∂y = −∂g/∂v · ∂WL/∂y, and so on. Fix the cell that jumps up out of the silent phase first as cell 1 and call the other cell 2. Let Ti denote the jump-up time of cell i and restrict to t ∈ [t2, T1]. Let (η(t), ρ(t), σ(t)) denote the vector from the position of cell 1 to that of cell 2, such that (η, ρ, σ) satisfies equation B.2; see Figure 13 for a two-dimensional representation of this set-up. Next, let wL(y, s) denote the projection of the left surface of knees to (w, y, s)-space and let (a, b, c) = ∇[wL(y1(T1), s(T1)) − w(T1)] = (−1, ∂wL/∂y(y1(T1), s(T1)), ∂wL/∂s(y1(T1), s(T1))). Let wLt denote the physical translate
of wL(y, s) to the position of cell 1 at time t; for example, at time t2, cell 1 lies in wLt2. Again, see Figure 13. We define the time T(t) as the time for cell 2 to flow to its first intersection with wLt. This satisfies

(a, b, c) · [(η, ρ, σ) + ∫₀^T(t) V(t + ξ) dξ] = 0,    (B.3)
where V is evaluated along the path of cell 2 (see Figure 13), but

∫₀^T(t) V(t + ξ) dξ = ∫₀^T(t) (V(t) + ξV̇(t) + O(ξ²)) dξ = V(t)T(t) + O(T²).    (B.4)
For small perturbations from synchrony, if the vector field of equation B.1 is O(1), then the O(T²) term in equation B.4 can be neglected. Substituting the approximation equation B.4 into B.3 yields, at leading order,

T(t) = −[(a, b, c) · (η, ρ, σ)] / [(a, b, c) · V] = (η − bρ − cσ) / (−g(w2, v2) + bh(y2, v2) − cKs2),    (B.5)
since a = −1. Henceforth, we omit the arguments indicating evaluation along the path of cell 2 when this is clear. A sufficient condition for compression in the silent phase is that Ṫ(t) < 0 for t ∈ [t2, T1]. Thus, we proceed to compute Ṫ(t). Since (η, ρ, σ) and V(t) satisfy equation B.2, we differentiate, substitute from B.2, and simplify to obtain

Ṫ(t) = [V(t) × (η, ρ, σ)] · [((a, b, c) · DV) × (a, b, c)]† / ((a, b, c) · V(t)†)²,    (B.6)
where † denotes transpose. Geometrically, the determinant of three vectors gives the volume of the parallelepiped they bound. The numerator of Ṫ(t) consists of such a determinant, with the three edges of the parallelepiped given by the vector field, the vector from the position of cell 1 to that of cell 2, and a third vector. This last vector relates the linearization of the vector field to the gradient of the translate of the surface of knees. Next, consider

Z ≡ (Z1, Z2, Z3) := V × (η, ρ, σ) = (σh + ρKs, −σg − ηKs, ρg − ηh)
  = ((h, −Ks) ∧ (ρ, σ), (η, σ) ∧ (g, −Ks), (g, h) ∧ (η, ρ)).    (B.7)
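Writing the displacement from cell 1 to cell 2 as (η, ρ, σ), the component formulas in equation B.7 are simply the cross product of V = (g, h, −Ks) with that vector. A quick numerical sanity check with arbitrary values (illustration only):

```python
# Verify Z = V x (eta, rho, sigma) against the component formulas of
# equation B.7, using arbitrary numerical values.
g, h, Ks = 1.3, -0.7, 0.4          # vector field is V = (g, h, -Ks)
eta, rho, sigma = 0.2, -0.5, 0.9   # displacement from cell 1 to cell 2

V = (g, h, -Ks)
D = (eta, rho, sigma)

# generic cross product V x D
Z = (V[1]*D[2] - V[2]*D[1],
     V[2]*D[0] - V[0]*D[2],
     V[0]*D[1] - V[1]*D[0])

# component formulas from equation B.7
Z1 = sigma*h + rho*Ks
Z2 = -sigma*g - eta*Ks
Z3 = rho*g - eta*h

assert all(abs(u - v) < 1e-12 for u, v in zip(Z, (Z1, Z2, Z3)))
```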
Differentiating equation B.7 and using B.2 gives, in system form,

Ż1 = −(by + K)Z1 + ay Z2
Ż2 = bw Z1 − (aw + K)Z2
Ż3 = cw Z1 + cy Z2 − (aw + by)Z3.    (B.8)
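For basic cells, discussed next, Z1 vanishes identically and equation B.8 reduces to Ż2 = −(aw + K)Z2, whose solution cannot change sign. A small Euler integration with arbitrary placeholder coefficients illustrates this sign invariance:

```python
# Euler integration of equation B.8 with the y-coupling coefficients
# set to zero (basic cells, no current y): Z1 stays identically zero
# and Z2' = -(a_w + K) Z2, so the sign of Z2 is invariant.
a_w, K = 0.8, 0.2                 # arbitrary positive constants
a_y = b_w = b_y = c_w = c_y = 0.0
Z1, Z2, Z3 = 0.0, -1.0, 0.5       # Z2(t2) < 0, as in Terman et al. (1998)
dt, steps = 1e-3, 20000
for _ in range(steps):
    dZ1 = -(b_y + K)*Z1 + a_y*Z2
    dZ2 = b_w*Z1 - (a_w + K)*Z2
    dZ3 = c_w*Z1 + c_y*Z2 - (a_w + b_y)*Z3
    Z1, Z2, Z3 = Z1 + dt*dZ1, Z2 + dt*dZ2, Z3 + dt*dZ3
    assert Z2 < 0                 # the sign of Z2 never changes
```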
For basic cells, without the current y, equation B.6 simplifies to

Ṫ(t) = (σg + ηKs)(c(aw − K) + cw) / (g + cKs)² = −Z2(c(aw − K) + cw) / (g + cKs)².

Moreover, Z1 ≡ 0, so Ż2 = −(aw + K)Z2. Hence, the sign of Z2(t) is invariant for t ∈ [t2, T1]. In Terman et al. (1998), Z2(t2) = (w1(t2) − w2(t2))Ks2(t2) < 0, which implies that −Z2(t2) > 0. Thus, the sign of Ṫ(t) matches that of (c(aw − K) + cw) (Terman et al. use λ1 to denote c). By showing that this quantity is negative for all relevant t for K < a− (case I in Terman et al., 1998), Terman et al. achieve a sufficient condition for compression in the silent phase.

In general, one can compute the signs of a, b, c as well as aw, bw, . . . (see, e.g., the lemma in section 3) and then use equations B.6 and B.8 to derive compression conditions for the silent phase. Simplifications facilitate this process in certain cases even for compound cells. For example, s = ṡ = 0 during the silent phase for E cells in a globally inhibitory network, as well as for much of the silent phase for mutually coupled cells with fast synaptic decay. In the latter case, however, an adjustment must be made to compensate for the fact that one cell loses inhibition before the other; we omit details and explicit computations here.

Acknowledgments
We thank N. Kopell for her very useful comments. Research for this article was supported in part by NSF grants DMS-9423796 and DMS-9804447. Some of the research was done while the authors were visiting the IMA at the University of Minnesota.

References

Bal, T., & McCormick, D. A. (1996). What stops synchronized thalamocortical oscillations? Neuron, 17, 297–308.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns of realistic network models. J. Comp. Neurosci., 3, 91–110.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Contreras, D., Destexhe, A., Sejnowski, T. J., & Steriade, M. (1996). Control of spatiotemporal coherence of a thalamic oscillation by corticothalamic feedback. Science, 274, 771–774.
Contreras, D., Destexhe, A., & Steriade, M. (1996). Cortical and thalamic participation in the generation of seizures after blockage of inhibition. Soc. Neurosci. Abst., 22, 2099.
Contreras, D., Destexhe, A., & Steriade, M. (1997). Spindle oscillations during cortical spreading depression in naturally sleeping cats. Neuroscience, 77, 933–936.
Destexhe, A. (1998). Spike-and-wave epileptic oscillations based on the properties of GABA-B receptors. J. Neurosci., 18, 9099–9111.
Destexhe, A., Babloyantz, A., & Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay and reticularis neurons. Biophys. J., 65, 1538–1552.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1996). Cortical projections to the thalamic reticular nucleus may control the spatiotemporal coherence of spindle and epileptic oscillations. Soc. Neurosci. Abst., 22, 2099.
Destexhe, A., Contreras, D., & Steriade, M. (1998). Mechanisms underlying the synchronizing action of corticothalamic feedback through inhibition of thalamic relay cells. J. Neurophysiol., 79, 999–1016.
Destexhe, A., McCormick, D. A., & Sejnowski, T. J. (1993). A model for 8–10 Hz spindling in interconnected thalamic relay and reticularis neurons. Biophys. J., 65, 2474–2478.
Destexhe, A., & Sejnowski, T. J. (1997). Synchronized oscillations in thalamic networks: Insights from modeling studies. In M. Steriade, E. G. Jones, & D. A. McCormick (Eds.), Thalamus. Amsterdam: Elsevier.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Golomb, D., Wang, X.-J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol., 72, 1109–1126.
Golomb, D., Wang, X.-J., & Rinzel, J. (1996). Propagation of spindle waves in a thalamic slice model. J. Neurophysiol., 75, 750–769.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Huguenard, J., & Prince, D. (1994). Clonazepam suppresses GABA-B-mediated inhibition in thalamic relay neurons. J. Neurophysiol., 71, 2576–2581.
Kim, U., Bal, T., & McCormick, D. A. (1995). Spindle waves are propagating synchronized oscillations in the ferret LGNd in vitro. J. Neurophysiol., 74, 1301–1323.
Kopell, N., & LeMasson, G. (1994). Rhythmogenesis, amplitude modulation and multiplexing in a cortical architecture. Proc. Nat. Acad. Sci. (USA), 91, 10586–10590.
LoFaro, T., & Kopell, N. (In press). Time regulation in a network reduced from voltage-gated equations to a one-dimensional map. J. Math. Biology.
Rinzel, J. (1987). A formal classification of bursting mechanisms in excitable systems. In A. M. Gleason (Ed.), Proceedings of the International Congress of Mathematics (pp. 1578–1593). Providence, RI: AMS.
Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355.
Rowat, P., & Selverston, A. (1997). Oscillatory mechanisms in pairs of neurons connected with fast inhibitory synapses. J. Comp. Neurosci., 4, 103–127.
Rubin, J., & Terman, D. (1998). Geometric analysis of population rhythms in synaptically coupled neuronal networks. Unpublished manuscript, IMA Preprint Series 1596, University of Minnesota.
Rubin, J., & Terman, D. (1999). Analysis of clustered firing patterns in synaptically coupled networks of oscillators. Manuscript submitted for publication.
Skinner, F., Kopell, N., & Marder, E. (1994). Mechanisms for oscillation and frequency control in networks of mutually inhibitory relaxation oscillators. J. Comp. Neurosci., 1, 69–87.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cyber., 68, 393–407.
Steriade, M. (1994). Coherent activities in corticothalamic networks during resting sleep and their development into paroxysmal events. In G. Buzsáki, R. Llinás, W. Singer, A. Berthoz, & Y. Christen (Eds.), Temporal coding in the brain. New York: Springer-Verlag.
Steriade, M., Contreras, D., & Amzica, F. (1994). Synchronized sleep oscillations and their paroxysmal developments. TINS, 17, 199–208.
Steriade, M., Contreras, D., & Amzica, F. (1997). The thalamocortical dialogue during wake, sleep, and paroxysmal oscillations. In M. Steriade, E. G. Jones, & D. A. McCormick (Eds.), Thalamus. Amsterdam: Elsevier.
Steriade, M., Curró Dossi, R., & Núñez, A. (1991). Network modulation of a slow intrinsic oscillation of cat thalamocortical neurons implicated in sleep delta waves: Cortically induced synchronization and brainstem cholinergic suppression. J. Neurosci., 11, 3200–3217.
Steriade, M., Domich, L., Oakson, G., & Deschênes, M. (1987). Abolition of spindle oscillations in thalamic neurons disconnected from nucleus reticularis thalami. J. Neurophysiol., 57, 260–273.
Steriade, M., Jones, E. G., & Llinás, R. R. (1990). Thalamic oscillations and signaling. New York: Wiley.
Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Steriade, M., Núñez, A., & Amzica, F. (1993). Intracellular analysis of relations between the slow (<1 Hz) neocortical oscillation and other sleep rhythms of the electroencephalogram. J. Neurosci., 13, 3266–3283.
Terman, D., Bose, A., & Kopell, N. (1996). Functional reorganization in thalamocortical networks: Transition between spindling and delta sleep rhythms. Proc. Natl. Acad. Sci. (USA), 93, 15417–15422.
Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled inhibitory neurons. Physica D, 117, 241–275.
Terman, D., & Lee, E. (1997). Partial synchronization in a network of neural oscillators. SIAM J. Appl. Math., 57, 252–293.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Timofeev, I., & Steriade, M. (1996). Low-frequency rhythms in the thalamus of intact-cortex and decorticated cats. J. Neurophysiol., 76, 4152–4168.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. New York: Cambridge University Press.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition, not excitation, synchronizes neural firing. J. Comp. Neurosci., 313–321.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. (USA), 92, 5577–5581.
Wang, X.-J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp., 4, 84–97.
Wang, X.-J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalamic nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
Wang, X.-J., & Rinzel, J. (1995). Oscillatory and bursting properties of neurons. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 686–691). Cambridge, MA: MIT Press.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
White, J., Chow, C. C., Ritt, J., Soto, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16.

Received July 20, 1998; accepted February 23, 1999.
LETTER
Communicated by George Gerstein
An Accurate Measure of the Instantaneous Discharge Probability, with Application to Unitary Joint-Event Analysis

Quentin Pauluis
Laboratory of Neurophysiology, School of Medicine, Université Catholique de Louvain, Brussels, Belgium

Stuart N. Baker
Department of Anatomy, University of Cambridge, Cambridge, CB2 3DY, U.K.

We present an estimate for the instantaneous discharge probability of a neurone, based on single-trial spike-train analysis. By detecting points where the neurone abruptly changes its firing rate and treating them specially, the method is able to achieve smooth estimates yet avoid the blurring of significant changes. This estimate of instantaneous discharge probability is then applied to the method of unitary event analysis. We show that the unitary event analysis as originally conceived is highly sensitive to firing-rate nonstationarities and covariations, but that it can be considerably improved if calculations of statistical significance use an instantaneous discharge probability instead of a firing-rate estimate based on averaging across multiple trials.
1 Introduction
A large number of new techniques have been developed for the analysis of single-unit neuronal discharge. The increasing mathematical complexity of these techniques permits maximal information to be extracted about how these signals are being used by the brain. However, most new techniques are developed and tested on the assumption that neurone spikes resemble stationary Poisson processes. This often yields mathematical tractability, but at the expense of grossly oversimplifying the statistics of the target spike trains.

Nonstationarity in cell discharge can occur in a number of ways. First, the firing rate often changes following delivery of a stimulus or performance of a behavioral task by the experimental animal. These changes can be profound (from no firing to >100 Hz) and rapid. In order to account for such rate modulation, many analysis techniques estimate the firing rate on a single trial from the peristimulus time histogram (PSTH; Gerstein & Kiang, 1960; Lemon, 1984), derived by summing together the discharge on a number of repeats of the same trial. The underlying model for the spike train then

Neural Computation 12, 647–669 (2000)
© 2000 Massachusetts Institute of Technology
648
Quentin Pauluis and Stuart N. Baker
becomes a Poisson process with instantaneous firing rate on any one trial modulating as the all-trial PSTH (Aertsen, Gerstein, Habib, & Palm, 1989; Grün, 1996).

More complex nonstationarities occur when the firing-rate modulation is not the same from one trial to the next. First, there may be a jitter in the time of occurrence of a rise in rate relative to the alignment event; this is common in work on awake, behaving animals, where a trial lasting several seconds must be aligned to one point in the performance (Seidemann, Meilijson, Abeles, Bergmann, & Vaadia, 1996; Munoz & Wurtz, 1993). Second, the overall shape could differ from one trial to the next, depending on uncontrolled external or random events such as attentional shifts. These two forms of nonstationarity may show no covariation between simultaneously recorded cells. However, since such variation is often not simply noise in the system, but instead reflects genuine variation in the task performance of the animal from trial to trial, it is more likely that the times, the magnitudes, or the overall shape of rate changes will be correlated. The difference between these two situations is of great importance for methods that seek to estimate the expected number of synchronous discharges between two cells; unfortunately, the PSTH can reflect only the mean firing-rate profile and gives no clue to its variability or correlation structure from trial to trial.

Brody (1997, 1999a, 1999b) proposed a method of overcoming these problems by fitting the cell discharge on a single trial to a function that had the same shape as the mean PSTH but was shifted by an adjustable latency and scaled by a multiplicative gain. This estimate of the single-trial firing rates permitted a better prediction of the cross-correlation histogram between two units than one derived from the PSTH alone.
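The single-trial model described above — a Poisson process whose instantaneous rate modulates as the all-trial PSTH — can be simulated by thinning; this is our own illustration, and the step-shaped rate profile and parameter values are made up rather than taken from the text:

```python
import random

def inhomogeneous_poisson(rate, t_max, rate_max, rng):
    """Draw one trial from a rate-modulated Poisson process by
    thinning: generate candidate events at the ceiling rate rate_max
    and keep each with probability rate(t) / rate_max."""
    spikes, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)
        if t >= t_max:
            return spikes
        if rng.random() < rate(t) / rate_max:
            spikes.append(t)

# Toy PSTH-like profile (spikes/s): 25 Hz baseline with a burst.
def rate(t):
    return 100.0 if 0.15 <= t < 0.35 else 25.0

rng = random.Random(0)
trial = inhomogeneous_poisson(rate, 0.5, 100.0, rng)  # one 500 ms trial
```

Averaged over many such trials, the spike count per trial approaches the integral of the rate profile (27.5 here).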
However, the all-trial PSTH reflects only the mean change in rate, taking no account of the variability of this profile among trials, so that this technique is probably limited in its applicability to experimental data.

In this article, we present a method for estimating the instantaneous discharge probability density of a neurone in a single trial. We then apply it to the method of unitary event analysis pioneered by Grün (1996) and recently applied to data recorded in the motor cortex of awake monkeys (Riehle, Grün, Diesmann, & Aertsen, 1997). We show that this method, as originally conceived, is highly sensitive to the nonstationarities described above, but that it can be considerably improved using single-trial firing-rate estimates.

2 Measurement of Instantaneous Discharge Probability
We first draw a distinction between the instantaneous discharge probability density and the instantaneous firing rate. We view the latter as an output, which may be read by a neurone farther downstream. By contrast, we view the discharge probability density as an input function that could be fed into a neuronal model to generate a spike train similar to the one observed.

Instantaneous Discharge Probability
649

Generation of such a measure requires allowance to be made for the delayed response of the neuronal integrator to changes in its inputs. The discharge probability density should be smooth, and not replicate the variability typically seen in interspike intervals (Softky & Koch, 1992; Shadlen & Newsome, 1994). However, in conflict with this, the function should also be able to identify and vary abruptly with sudden firing changes. Finally, since the only data available are the spike train, the method has to start with a measure of the instantaneous firing rate.

Two distinct methods exist in the literature for estimation of the instantaneous firing rate of a spike train. In one, pulses produced at the spike times are low-pass filtered (French & Holden, 1971), either using analog filters (Houk, Dessem, Miller, & Sybirska, 1987) or by convolution with a smoothing function (often a gaussian; MacPherson & Aldridge, 1979; Richmond, Optican, Podell, & Spitzer, 1987). The closer together the spikes, the more summation of the smoothed waveforms occurs, and hence the larger the output signal. However, this method requires that the cutoff of the low-pass filter be adjusted to be appropriate for the firing rate under study. If it is too low, a highly smoothed signal is obtained, which cannot follow rapid rate changes; if it is too high, periods of zero estimated firing rate are interposed between spikes with long intervals. Adaptive kernel methods (Silverman, 1986; Richmond, Optican, & Spitzer, 1990) represent an attempt to adjust the amount of smoothing depending on the local density, so that these two extremes are avoided. More sophisticated techniques such as orthogonal series estimates and maximum penalized likelihood estimates (Silverman, 1986) can suffer from the need to tune smoothing parameters to find the optimal representation of the data. An alternative is to use the reciprocal of the interspike interval (ISI) as an estimate of the firing rate during the time spanned by that interval.
Such a method is often implicitly employed in “frequencygram” plots, where a point is plotted at the time of each spike with a height of the reciprocal of the preceding interval (Matthews, 1963; Bessou, Laporte, & Pagès, 1968; Awiszus, 1988). It is similar to the use of “fractional intervals” (Schwartz, 1992). This technique has the advantage that it never has periods of zero estimated rate; however, in its raw form, the estimate shows great variability. Any subsequent smoothing will necessarily be a trade-off between the need to reduce this variability without removing rapid fluctuations.

A means of avoiding this trade-off would be to detect rapid changes in firing rate and treat them as a special case. This is essentially the same problem as detection of bursts in a spike train, and a number of techniques have been suggested. The first was the CUMSUM technique (Hurst, 1950), which has been applied both to averaged PSTHs (Ellaway, 1978) and to single-trial analysis (Butler, Horne, & Hawkins, 1992; Butler, Horne, & Rawson, 1992). The CUMSUM involves subtraction of a control reference level from a series of datum points and adding cumulatively. However, when the goal is the formation of a continuous estimate of discharge probability, it is not clear where the control level should be placed. CUMSUM methods are also very sensitive to small variations during the control period and require excessive parameter tuning (Churchward et al., 1997). An alternative has been proposed by Cocatre-Zilgien and Delcomyn (1992), who first identified ISI subpopulations using an ISI histogram and then tested identified bursts against the mean firing frequency with a chi-square test. This method is well suited to a spike train where firing is at a constant level for most of the time, with occasional high-frequency bursts; it is much less suitable for neurones where the firing rate changes continually over a wide range. Other methods in the literature suffer from similar problems: ISIs are always compared with the mean ISI for the whole recording, making the assumption that intervals are Poisson distributed (Legéndy & Salcman, 1985; Eggermont, Smith, & Bowman, 1993).

Our proposed technique is a variation of these methods. The first difference is that we work with ISIs locally. In this way, a change in firing rate is identified relative to the firing of the cell close to the time of interest, rather than by comparison to the global ISI statistics. Three consecutive intervals are used, since a single interval shorter than the preceding one could simply represent the tail of the ISI distribution, even if the firing rate were constant. Two consecutive ISIs significantly shorter than the preceding ISI are much less likely without a change in firing rate. The same limit to the smallest detectable burst has been used by Cocatre-Zilgien and Delcomyn (1992). Second, we assume that the ISIs follow a gamma density. This allows for an absolute refractory period and for the relative refractory period produced by afterhyperpolarization, and is a more realistic description of interval statistics than the more commonly used Poisson process (for discussion, see Berry & Meister, 1998; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997).
Moreover, by changing the order parameter of the gamma density, it can be adjusted to fit interval statistics from a wide range of experimental data.

The first step is to generate an instantaneous firing-rate function F(t), sampled at an appropriately fine temporal resolution, whose value at any point is simply the reciprocal of the ISI that spans it (see Figure 1B). The series of ISIs is then scanned to determine times at which the firing rate abruptly changes. To do this, we make the assumption that, at constant firing rate, ISIs are distributed according to a gamma density. The probability that, if the mean interval is μ, an interval i will be larger than some value I is then

P(i > I | μ) = ∫_I^∞ [(a/μ)^a / (a − 1)!] t^(a−1) e^(−ta/μ) dt,    (2.1)

where a is the order of the gamma process used to model the intervals. In the remainder of this article, we use a = 4, although this can be adjusted to fit the interval statistics for a given cell (a = 1 would correspond to a Poisson process). As illustrated in Figure 1A, the nth interval was accepted as the
Figure 1: Reconstruction of the instantaneous frequency. (A) Original spike train. (B) Inverse of the interspike intervals (ISIs). Statistical tests detect points of rate increase (empty arrow) or decrease (filled arrow). (C) The high firing rates are extended out half an interval, to the best estimate of the time of the rate change. In order to maintain the overall mean rate, the firing rate before or after these rate change points is reduced. (D) The signal is smoothed by convolution with a gaussian. No smoothing is performed across the rate change points. Full details are given in the text.
onset of an increase in rate if the following two criteria were both fulfilled:

P(i > In | μ = In+1) < 0.05
P(i > In | μ = In+2) < 0.05.    (2.2)
Likewise, the nth interval was accepted as the onset of a decrease in rate if both of the following two criteria were met:

P(i > In | μ = In−1) < 0.05
P(i > In | μ = In−2) < 0.05.    (2.3)
The longest of the three intervals under consideration is always the one tested. As noted, the requirement that the interval should be significantly larger than both of the intervals preceding it (for a rate fall) or succeeding it (for a rate rise) ensures that only genuine rate changes are detected, and not anomalous intervals arising from the tail of the ISI distribution.

Changes in rate can thus be detected. The next stage is to determine at what precise time the change occurs. One possibility would obviously be at the time of the first spike contributing to the changed interval. However, this is likely to be biased. A neurone that suddenly receives a higher rate of input will not fire immediately; the moment of firing probability rise is therefore likely to be some time prior to the first spike of the shortened interval. The best solution can be appreciated by the following formalism. Let the long interval In be divided into two parts: one lasting TLo in which the discharge probability density is FLo, and another lasting THi in which the discharge probability density is FHi. We have therefore that:

TLo + THi = In.    (2.4)
Since one spike in total resulted from the interval In, the sum of the firing probability density must be 1 over this time:

TLo FLo + THi FHi = 1.    (2.5)
Finally, in the absence of any more information about the timing of the change in discharge probability, we assume that the change occurs independently of the spike process. Had the firing probability density not changed, the interval would have been 1/FLo. We therefore assume that on average, TLo is half of this value:

TLo = 1/(2FLo)
TLo FLo = 0.5 = THi FHi.    (2.6)
FHi is assumed to be equal to the reciprocal of the short interval IShort either preceding (for a rate fall) or succeeding (for a rate rise) In. Hence equations 2.4–2.6 provide a unique solution, which is:

THi = 1/(2FHi) = IShort/2
TLo = In − 1/(2FHi) = In − IShort/2
FLo = 1/(2TLo).    (2.7)
The high discharge probability density calculated from the short interval is therefore extended out for half of this interval into the adjoining long interval, as shown in Figure 1C, for both rate increase and rate decrease. This is compensated for by reducing the discharge probability density in the remainder of the long interval to a value close to half of the original (see Figure 1C). A probability density function of this form is exactly normalized, in the sense that its integral from time zero to a given time, rounded down, gives the number of spikes which have occurred to that point. This function can therefore be seen as the input to an integrate-and-fire (IF) neurone, which would thereby generate the observed spike train.

The discharge probability density of Figure 1C still exhibits high variability, due to high variance in intervals even if the underlying rate remains constant (Shadlen & Newsome, 1994; Softky & Koch, 1992). In order to partially remove this unwanted variability, intervals between abrupt firing changes are then smoothed by convolution with a gaussian kernel (standard deviation: 10 ms, chosen to be in the range of the somatic time constant; Koch, Rapp, & Segev, 1996). This smoothing is not carried out across boundaries of rate changes as defined above, producing a measure that varies smoothly between rate change points but abruptly at them. This is illustrated in Figure 1D. It is the explicit identification and localization of abrupt firing probability changes that confers on this measure the most tangible advantage over other techniques. Without such a step, there is necessarily a trade-off between oversmoothing rapid changes and high variability, which cannot be satisfactorily resolved.

The method described produces an estimate of the discharge probability density function that can be computed without averaging across trials. The next section investigates an application of this method to the assessment of the significance of unitary joint events.

3 Application to Unitary Joint Event Analysis
The aim of the unitary joint event method is to detect discharges of two neurones that are synchronous within an allowed jitter and occur more often than expected by chance. The technique, similar to that proposed by Grün (1996; UEMWA method) but with minor changes to avoid unnecessary binning, is reviewed in Figures 2A–G. From the repeated trials (see Figures 2A and 2C), PSTHs for the two cells are built up, and these are smoothed with a moving window (here chosen to be 101 ms wide; see Figures 2B and 2D). Spikes from the two cells occurring at times t1 and t2 are designated a coincident event (CE) if

|t1 − t2| < Δ,    (3.1)
where Δ in this study was taken as 5 ms. We conventionally place the occurrence time of the CE halfway between the two coincident spikes, at (t1 + t2)/2. The number of CEs is counted in successive bins, and the count is plotted as dots in Figure 2E. The number of these raw CEs rises as the cell discharge
654
Quentin Pauluis and Stuart N. Baker
rates increase. The expected number of coincidences is approximately

$$E(\text{CEs in bin } T) = 2 w \Delta N F_1(T) F_2(T), \qquad (3.2)$$
where N is the number of trials, w is the bin width used to count the CEs (here 1 ms), and F1, F2 represent the firing rates of the cells estimated from the smoothed PSTHs. Computation of the expected number of CEs is made more complex than equation 3.2 because the bin width w is in general larger than the precision to which spike times are recorded. If a CE falls into the bin centered on time T, then
$$\left| \frac{t_1 + t_2}{2} - T \right| < \frac{w}{2}. \qquad (3.3)$$
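Concretely, equations 3.1 and 3.3 amount to a detect-then-bin procedure. The sketch below is illustrative only (the function names and the brute-force pairing are ours, not the original analysis code); times are in ms:

```python
import numpy as np

def coincident_events(t1, t2, delta=5.0):
    """Return CE times for two spike trains (eq. 3.1).

    A spike pair with |t1[i] - t2[j]| < delta is a coincident event,
    conventionally placed at the midpoint (t1[i] + t2[j]) / 2.
    Brute force over all pairs, for clarity rather than speed.
    """
    gaps = np.abs(t1[:, None] - t2[None, :])
    i, j = np.nonzero(gaps < delta)
    return (t1[i] + t2[j]) / 2.0

def bin_ces(ce_times, t_start, t_end, w=1.0):
    """Count CEs in successive bins of width w (eq. 3.3): a CE belongs
    to the bin whose centre T satisfies |(t1 + t2)/2 - T| < w/2."""
    edges = np.arange(t_start - w / 2.0, t_end + w, w)
    counts, _ = np.histogram(ce_times, bins=edges)
    return counts

# Spikes at 10 ms and 12 ms are within 5 ms of one another,
# giving a single CE at their midpoint, 11 ms.
ces = coincident_events(np.array([10.0]), np.array([12.0]))
```

Counting the binned CEs and comparing the counts with an expectation such as equation 3.2 then proceeds exactly as in Figure 2E.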
Suppose as an example that t1 is in the bin centered on T − w, and t2 is in the bin centered on T + w. For some possible combinations of t1 and t2, equation 3.3
Figure 2: Facing page. Comparison of significance calculation for coincident events using the method of Grün (1996) and the novel method proposed here. One hundred trials, each lasting 500 ms, were generated with a central 250 ms burst of activity (firing rate rose from 25 Hz to 75–125 Hz), which began with a temporal jitter of 100 ms. (A–G) Methods based on the peristimulus time histograms (PSTH). For each cell (A and C), the mean firing rates are computed by compiling PSTHs across trials; these are smoothed by a 101 ms moving window (B and D). (E) The expected coincidence rate (solid line) is calculated by multiplying the two mean firing rates, with a correction for the size of the window within which coincidences are detected. Dots show the actual coincidences detected in this simulation. (F) Actual and expected coincidence rates are smoothed with a 101 ms moving window. (G) The probability P to find as many or more CEs as observed is calculated from the expected rate assuming a Poisson distribution. P is displayed as log10((1 − P)/P), facilitating visualization of high and low values of P. (H–P) Method based on the single-trial firing-rate estimate. For each cell (rasters for the first trial shown in H and J), the instantaneous frequency is reconstructed on a trial-by-trial basis (I and K; filter SD between rate change points = 10 ms). Then the rate of expected CEs is calculated (L: first trial; M: four other trials). A bin-per-bin summation of the single-trial CE expectancy curves produces the global CE expectancy (N, solid line), plotted with the number of CEs detected (dots). (O) Both rates are then filtered using a 101 ms moving window. (P) Significance of departures from the expected rate is assessed as in (G) and plotted in the same way. Vertical bar: (B and D) 100 sp·s⁻¹; (E, F, N, O) 100 CE·s⁻¹·trial⁻¹; (I, K) 500 sp·s⁻¹; (L, M) 1667 CEs·s⁻¹; (G, P) 5.
Instantaneous Discharge Probability
655
could still be left unfulfilled (t1 = T − 0.7w and t2 = T + 1.3w). This is shown schematically by Figure 3A: CEs generated by spikes each confined to one bin may fall into two adjacent bins. Extending these considerations to all possible combinations of bin occupancy for t1 and t2 that could fulfill equations 3.1 and 3.3 leads us to an exact solution for the case of Δ = 5 ms, shown in Figure 3B and described numerically as:

$$E(\text{CEs in } T) = w^2 N \sum_{j=-3}^{+3} F_1(T + jw) \left( \sum_{k=-3}^{+3} M_{j,k} F_2(T + kw) \right) \qquad (3.5)$$
$$M_{j=-3\ldots3,\;k=-3\ldots3} = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/4 & 0 \\
0 & 0 & 0 & 0 & 2/4 & 4/4 & 1/4 \\
0 & 0 & 0 & 2/4 & 4/4 & 2/4 & 0 \\
0 & 0 & 2/4 & 4/4 & 2/4 & 0 & 0 \\
0 & 2/4 & 4/4 & 2/4 & 0 & 0 & 0 \\
1/4 & 4/4 & 2/4 & 0 & 0 & 0 & 0 \\
0 & 1/4 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}. \qquad (3.6)$$
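A sketch of how the exact expectation of equation 3.5 can be evaluated from this matrix (the code and names are our own, with rates expressed in spikes per ms). A useful consistency check: the entries of M sum to 10 = 2Δ/w, so for constant firing rates the exact form reduces to the approximate equation 3.2:

```python
import numpy as np

# The 7 x 7 correction matrix of equation 3.6 (rows j = -3..+3,
# columns k = -3..+3), valid for a coincidence window Delta = 5w.
M = np.array([
    [0,   0,   0,   0,   0,   1/4, 0  ],
    [0,   0,   0,   0,   2/4, 4/4, 1/4],
    [0,   0,   0,   2/4, 4/4, 2/4, 0  ],
    [0,   0,   2/4, 4/4, 2/4, 0,   0  ],
    [0,   2/4, 4/4, 2/4, 0,   0,   0  ],
    [1/4, 4/4, 2/4, 0,   0,   0,   0  ],
    [0,   1/4, 0,   0,   0,   0,   0  ],
])

def expected_ces(F1, F2, t_idx, w=1.0, n_trials=1):
    """Exact expected CE count in the bin at index t_idx (eq. 3.5):
    E = w^2 N sum_j F1(T + j*w) sum_k M[j, k] F2(T + k*w).
    F1 and F2 are firing-rate arrays sampled at the bin centres."""
    offsets = np.arange(-3, 4)
    f1 = F1[t_idx + offsets]   # F1(T + j*w)
    f2 = F2[t_idx + offsets]   # F2(T + k*w)
    return w**2 * n_trials * (f1 @ M @ f2)
```

Because M is symmetric, swapping the roles of the two cells leaves the expectation unchanged, as it should.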
Figure 2E plots the expected number of coincidences E as a solid line. The number of observed coincidences is then smoothed with a 101 ms moving window, yielding the plot in Figure 2F. Finally, the probability that the number of coincidences c will be at least as many as the number observed, C, given that E are expected, is calculated assuming that they follow a Poisson distribution:

$$P(c \geq C) = 1 - \sum_{i=0}^{C-1} \frac{e^{-E} E^i}{i!}. \qquad (3.7)$$
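The tail probability of equation 3.7 can be evaluated directly. A minimal sketch (the function name is ours); for large observed counts, the numerically safer equivalent is `scipy.stats.poisson.sf(C - 1, E)`:

```python
import math

def ce_significance(observed, expected):
    """P(c >= C) for C observed CEs under a Poisson null with mean
    `expected` (eq. 3.7), together with the display quantity
    log10((1 - P) / P) used to plot significance."""
    p = 1.0 - sum(math.exp(-expected) * expected**i / math.factorial(i)
                  for i in range(observed))
    return p, math.log10((1.0 - p) / p)

# Five CEs observed where one is expected: a significant excess.
p, score = ce_significance(5, 1.0)
```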
This probability is plotted in Figure 2G as log10((1 − P)/P), thereby permitting bins with a significant excess, or a significant deficit, of CEs to stand out. The horizontal lines mark the 99% confidence limits (two tailed). The modification to this technique that we propose is illustrated in Figures 2H–2P. For each trial (see Figures 2H and 2J), the instantaneous discharge probability density is calculated (see Figures 2I and 2K). This is used to calculate the expected number of coincidences in a single bin, in one trial; summing over all trials yields the approximate number of coincidences in a given bin:

$$E(\text{CEs in } T) = 2 w \Delta \sum_{n=1}^{N} F_1^n(T) \, F_2^n(T), \qquad (3.8)$$
where $F_1^n$, $F_2^n$ represent the instantaneous discharge probability density during trial n. Once again, however, the situation is made more complex by the binning used. By similar considerations to those above, we find the exact formula for the case Δ = 5w to be:

$$E(\text{CEs in } T) = w^2 \sum_{n=1}^{N} \left( \sum_{j=-3}^{+3} F_1^n(T + jw) \left( \sum_{k=-3}^{+3} M_{j,k} F_2^n(T + kw) \right) \right), \qquad (3.9)$$
where $M_{j,k}$ is as defined in equation 3.6. This measure is plotted as a solid line in Figure 2N, overlaid with the observed number of coincidences (dots). The moving-window-smoothed traces are shown in Figure 2O, and the significance plot, calculated according to equation 3.7, is illustrated in Figure 2P. The crucial difference between the two methods lies in the order in which averaging across trials is performed. In the original method proposed by
Figure 3: Schematic illustration of CE binning. Calculation of the CEs before the binning step ensures their exact determination but renders complex the calculation of their expectancy. Black squares represent a bin containing spikes from one unit; white squares represent a bin containing spikes from the other. The gray squares code on a gray scale the density of CEs resulting from spikes in the black and white squares. White bins with dotted outlines indicate a situation where not all of the possible combinations of spikes will lie within the coincidence detection interval Δ. (A) The CE distribution resulting from the combination of spikes in two bins has a triangular shape. If the bins are separated by (Δ − 1)w, w being odd, half of the spike combinations do not satisfy the coincidence criterion of equation 3.1, and the number of CEs is halved. (B) The problem is solved exhaustively for Δ = 5w. Figures on the right represent the fraction of spikes from these bins that would form CEs within the central bin of interest (shown with thick vertical lines). The values given are used to construct the matrix $M_{j,k}$ of equation 3.6.
Table 1: Evaluation of the Numbers of False-Positive CEs Occurring in Randomly Simulated Independent Spike Trains.

Frequency        IDPM                      PSTHM                     Difference
(Hz)       α = 0.005   α = 0.025     α = 0.005   α = 0.025     α = 0.005   α = 0.025
10             6          26.3           6.5        26             −0.5        0.3
30             2          15.8           3.3        20             −1.3*      −4.3*
50             0.5         8.3           0.5        12.5            0         −4.3*
70             0           2.3           0.3         4             −0.3       −1.8*

Notes: One hundred trials, 1500 ms long, were simulated at a constant firing frequency for the two cells. CEs were counted within the central 500 ms of the simulation. The table shows the percentage of simulations with significant CEs in a sample of 400 simulations with the same firing frequency. Two values for the criterion level α were used. Results for the methods using instantaneous discharge probability (IDPM) and normalization by the PSTH (PSTHM) are presented. The PSTH normalization method gives more false positives, although false-positive rates are high for both methods with α = 0.025. * = a significant difference between the two methods (binomial test, p < 0.05).
Grün (1996), the firing rate is first averaged across trials, and then the expected number of coincidences is calculated as a product of firing rates. In the current method, the expectation product is calculated on a single trial, and this is then summed. These will produce different results if the firing rates show any covariation from trial to trial about their mean profiles. In agreement with this, Table 1 presents data showing that both methods are indeed similar if the data are stationary. Two spike trains were generated as independent gamma processes, order 4, with constant firing rate. Simulations were run for 1500 ms, with 100 trials per simulation; 400 simulations were run for each set of conditions.

Figure 4: Facing page. Performance of the two methods on spike trains with firing-rate changes that covary in their timing. One hundred trials were generated following the sequence of firing-rate changes shown in (A). The black bars indicate the time range over which the rate changes began (uniform probability distribution). The first column shows the results obtained with the single-trial expectancy calculation, and the second column with the method based on the mean firing rate. (G) Rasters of activity over the 100 trials for both cells. (B) Reconstructed instantaneous frequency for both cells during the first trial (filter SD between rate change points: 10 ms) and (H) mean firing rates assessed over all trials. (C and I) CE rasters. (D and J) CE expectancy (solid) and observed CE rate (dots) smoothed with a 101 ms moving window. (E and K) Probability plot (same display as Figures 2G and 2P). (F and L) CE rasters. The CEs in windows that exceed the 0.5% confidence limits are marked by a square. A large number of false positives occur with the mean firing-rate method.

The number of simulations with significant CEs in the central 500 ms was counted using both of the methods described above, for two different significance criterion levels α. At the more relaxed criterion of α = 0.025, a large number of false positives occur with each method, although the
situation is slightly better for the instantaneous discharge probability technique. With the more rigorous criterion of α = 0.005, the two methods give comparable numbers of false positives. Figure 4 illustrates the performance of the two algorithms on simulated data that have a covariation in the timing of the firing-rate modulation. The spike trains were independent, apart from the rate covariation, so that no significant CEs should be detected. For a particular trial, the underlying firing-rate profile of the two cells was the same. Figure 4A shows how the underlying firing rate changed with time; the onset of the two periods of increased firing rate was uniformly distributed over the times shown by the black bars; the timing of the two bursts was independent. One hundred such trials were simulated. Figure 4G shows rasters of the unit activities simulated according to the firing-rate profile illustrated in Figure 4A. The estimated instantaneous discharge probability density of the first trial (see Figure 4B) can be compared with the firing rate derived by averaging across all trials in Figure 4H. Spikes that were coincident within 5 ms are shown in Figures 4C and 4I. The numbers of CEs, counted in a 101 ms wide moving window, are plotted as dots in Figures 4D and 4J. Overlaid as a solid line is the expected number of coincidences, calculated according to the instantaneous discharge probability method (see Figure 4D) or the PSTH method (see Figure 4J). While the two plots closely follow one another in Figure 4D, there are discrepancies in Figure 4J associated with the changes in firing rate. These differences reach significance on the calculation of equation 3.7, such that the significance plot of Figure 4K goes outside the 0.5% confidence intervals. In Figures 4F and 4L, following the display convention established by Grün (1996), we display rasters of the CEs in which all events falling within a window that tested as significant are highlighted by a square.
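The spike trains in these simulations were generated as gamma renewal processes of order 4. A minimal generator along those lines (a sketch only; the function name and the truncation strategy are ours):

```python
import numpy as np

def gamma_spike_train(rate_hz, duration_ms, order=4, rng=None):
    """Simulate a spike train as a gamma renewal process.

    ISIs are drawn from a gamma distribution with the given order
    (shape) and mean 1000/rate_hz ms, so the ISI coefficient of
    variation is 1/sqrt(order) (0.5 for the order-4 process used here).
    """
    rng = np.random.default_rng() if rng is None else rng
    mean_isi = 1000.0 / rate_hz
    # Draw comfortably more ISIs than needed, then truncate.
    n_draws = int(3 * duration_ms / mean_isi) + 20
    isis = rng.gamma(order, mean_isi / order, size=n_draws)
    spike_times = np.cumsum(isis)
    return spike_times[spike_times < duration_ms]
```

Two independent calls give independent trains like those of Table 1; imposing CEs, as in Figure 6, then amounts to moving the spike of the second train nearest to each selected spike of the first.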
It is clear that the use of a trial-averaged firing-rate estimate can lead to errors in significance testing of unitary events when the neurone firing rates show a temporal jitter relative to the analysis alignment point, and that jitter is not independent between the two units. Our method based on single-trial discharge probability estimates is capable of alleviating this problem. The
Figure 5: Facing page. Performance of the two methods on spike trains that show covariant firing rates. (A) Schematic indicating the simulated firing-rate profiles. After an initial period of constant rate, the rate changed to a value chosen at random from the range shown. Three further rate changes occurred. The organization of the panels is the same as in Figure 4, with the methods based on the single-trial expectancy in the left column and the mean firing-rate approximation on the right. Since the mean firing rate does not change in these simulations, the PSTH-based method cannot predict the increased number of CEs expected, and a large number of false positives result.
improvement primarily results from the changed order of the operators (multiply, then sum, versus sum, then multiply). It is likely that almost as much of an improvement would be obtained with simpler estimates of the
instantaneous discharge probability, such as the ISI reciprocal (smoothed or not) or a low-pass filtered spike train. Our method, however, is capable of following both fast discharge probability changes (via insertion of breakpoints) and slow ones (via the smoothing of the ISI reciprocal); it is therefore preferred. Figure 5 illustrates the results of analysis on spike trains simulated to display variation in firing rate from trial to trial. The underlying rates of the two neurones on a given trial were the same, allowing the effect of rate covariation to be examined. The simulated rate profiles are shown schematically in Figure 5A. The remainder of Figure 5 follows the same display conventions as Figure 4. It can be seen that the trial-averaged firing rates (smoothed PSTH) in Figure 5H show no modulation: this is expected from Figure 5A, since the mean firing rates do not change. However, the increase in firing-rate variability brings about a rise in the number of CEs, shown by the dots in Figures 5D and 5J. This cannot be predicted by the trial-averaged measure, which therefore assigns a high significance level to these extra events. By contrast, the expected and observed curves show good agreement using the instantaneous discharge probability method in Figure 5D, such that no CEs reach significance. Figure 6 examines the detection of genuine excess CEs by the new algorithm. Figures 6A–6F follow the same format as Figures 4 and 5. One spike train was simulated as previously by a simple gamma process (order 4). Some of these spikes were selected at random to become "imposed CEs."
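The rise in chance CEs under rate covariation, and why the trial-averaged expectation misses it, follows directly from comparing equations 3.8 and 3.2: the two differ by the across-trial covariance of the rates. The sketch below (illustrative rate values, not those of the paper's simulations) makes this explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
w, delta, n_trials = 1.0, 5.0, 100

# Single-trial rates (spikes/ms) that covary across trials:
# on each trial both cells share the same randomly drawn rate.
shared_rate = rng.uniform(0.01, 0.05, n_trials)
F1_trials = shared_rate
F2_trials = shared_rate

# Equation 3.8: sum over trials of the product of single-trial rates.
E_single = 2 * w * delta * np.sum(F1_trials * F2_trials)

# Equation 3.2: product of trial-averaged rates (the PSTH estimate).
E_psth = 2 * w * delta * n_trials * F1_trials.mean() * F2_trials.mean()

# The gap equals 2*w*delta*N times the across-trial covariance of the
# rates, so the PSTH estimate undercounts expected CEs whenever the
# single-trial rates covary: the source of the false positives here.
```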
Figure 6: Facing page. Detection of injected CEs by the method based on the single-trial CE expectation, and comparison of the sensitivity of the two methods. One hundred trials were simulated with a constant firing rate of 30 Hz; between −0.4 and 0.25 s, 371 coincident events were imposed on the spike train of the second cell (20% of the spikes of cell 1; this yielded 893 CEs in total, 250 more than the 643 CEs expected from a control simulation). (A–F) Detection of imposed CEs (same display conventions as Figures 4 and 5, left column). (G) Data from simulations of 100 trials at a constant firing rate, which investigated the sensitivity of the two methods. Trials lasted 500 ms; CEs were imposed during the central 150 ms. The number of windows that exceeded significance is plotted versus the number of excess CEs. Different curves show results for different simulated background firing rates. Circles represent the method based on single-trial expectancy (instantaneous discharge probability method, IDPM) and triangles the method based on the mean firing rates (PSTHM). Bars indicate one standard deviation in a sample of 20 simulations. The methods show similar sensitivity; the decrease seen at high firing rates is due to the high level of chance coincidences expected. (H) Number of CEs within windows that tested as significant versus the number of excess CEs, for the same simulations as shown in (G). The correlation is good, indicating that many of the injected CEs are retrieved. Filter SD between rate change points: 10 ms.
The second spike train was also initially simulated as a gamma process; the spike closest to each imposed CE was then moved to be at the same time as this event. This produced an increased number of CEs above the chance
level but maintained the firing rate of the two units. In Figures 6E and 6F, the CEs injected between −0.4 and 0.25 s were accurately detected. During this period, the firing rate was constant at 30 Hz, and there were 38% more CEs than expected by chance (893 CEs versus 643 in a control simulation; 2.5 extra CEs per trial spread over 650 ms). The right-hand side of Figure 6 compares both methods for CE detection in stationary spike trains at different firing rates. This was investigated by simulations similar to the one illustrated in Figure 6A, except that the firing rate and the number of imposed CEs were varied systematically. Figure 6G plots the number of bins that tested as having a significant excess of CEs versus the number of excess CEs (calculated as the difference between the mean number of CEs observed and the mean number of CEs during control simulations without imposed CEs). The curves for the two methods lie close together at all tested firing rates, indicating a similar sensitivity for each method. The decline of apparent sensitivity at high firing rates is due to the high number of CEs that would be expected by chance at these rates; larger numbers of excess CEs are therefore necessary before significance is reached. Figure 6H plots the number of detected CEs (the number of CEs in the significant bins) versus the number of excess CEs; for both methods there is a good correlation. Points lie above the identity line because significance is attributed to all the CEs in the 101 ms window; there is no means of dissociating the injected CEs from those that occurred by chance. The similar performance when conditions of firing-rate stationarity are fulfilled indicates that the different results in more complex situations cannot be simply explained by a difference in overall sensitivity between the two methods.

4 Discussion

4.1 Instantaneous Discharge Probability Density Calculation.
The instantaneous discharge probability density function has a number of attractive features, which make it a useful measure for single-trial analysis. It is able to cope well with a large range of firing rates, unlike methods that smooth pulses derived from the spike train (French & Holden, 1971; Houk et al., 1987). It is locally smooth, except at a few points of rapid change, and does not show great fluctuations simply due to the variable nature of ISIs. However, by explicitly detecting points of rate change, blurring of rapid changes is avoided. The probabilistic framework for the detection of rate changes means that it is an entirely objective method in application. Finally, the use of a general gamma distribution as the underlying spike train model permits accurate representation of neurones with many different firing characteristics simply by changing the order parameter a.

4.2 Unitary Event Significance Testing. The results presented demonstrate that the current methods used for assessing the significance of unitary events can lead to errors when there is variability in the firing rate of the neurones from trial to trial. Such variation in responses is less likely in sensory systems where the response to a stimulus is measured under anesthesia, since relatively few synapses are interposed between peripheral receptor and recorded neurones. By contrast, in awake animals subject to attentional fluctuations, trial-to-trial variation is the norm. Animals can never be trained to produce precisely reproducible behavior, and even if the overt output appears similar, it is likely that there are many different redundant patterns of neural activity that will lead to the same goal. Lee, Port, Kruse, and Georgopoulos (1998) showed that single-trial firing rates often show correlation (noise correlation in their terminology), which is strongest for cells close together. The situations presented in Figures 4 and 5 are therefore not artificially contrived to test current methods to their limits, but rather represent realistic possibilities common in experimental recordings. Other methods that rely on trial-averaged measures, such as the joint peristimulus time histogram (JPSTH) of Aertsen et al. (1989), or that assume intertrial constancy, such as the widely used shuffle predictor (Perkel, Gerstein, & Moore, 1967; Engel, König, Gray, & Singer, 1990; Lee et al., 1998), are also vulnerable to error in such situations. When there is little variation in firing rate from trial to trial, the proposed method based on instantaneous discharge probability has similar sensitivity to the use of the PSTH (see Table 1 and Figure 6). The novel method is therefore preferable when analyzing a wide range of empirical data. Any technique for significance testing of CEs will make assumptions (the null hypothesis); a significant excess of CEs indicates a region of the data where these assumptions do not hold.
Normalization using the PSTH makes the assumption that each trial is a realization of a random process whose discharge probability changes as the PSTH. This will detect as significant all correlation between the two spike trains, over both the short term and the longer time-course covariation of the firing rates about the trial-averaged level. Our method assumes that the neurones fire at random in response to an input that either changes smoothly over time or makes abrupt jumps in magnitude. CEs that are assigned significance using this approach result from transient input fluctuations correlated between the two neurones and occur over a timescale briefer than that implied by the assumed smooth changes in input. We consider that this is appropriate, since we are interested in short-term synchronization, which may reflect processes of temporal coding rather than longer-term comodulation of firing rate. Artifactual CEs could be produced, however, if other assumptions of the method are violated. For example, consider the situation where the neurone is more regular in its response than assumed in the algorithm used to position breakpoints (e.g., the cell fires as a gamma process of order 4, but we test it against a Poisson process). In this case, the algorithm will fail to detect most of the step changes in input undergone by the cell; if these input changes covary between the two neurones investigated, an artifactual excess of CEs will be detected. It is therefore important to adjust the assumed order of the gamma distribution to agree with the observed ISI distribution in experimental data. There is one interesting circumstance in which our method will fail to assign significant CEs; it is debatable whether this is desirable. If a neurone fires brief bursts of only a few spikes, breakpoints will be set to delimit the bursts. The instantaneous discharge probability will therefore show a sharp increase closely surrounding the burst. If two neurones burst simultaneously, the expected CE count will rise sharply during the burst, such that any CEs that do occur are unlikely to reach significance. Whether such CEs should be treated as significant is an open question. Some other instantaneous discharge probability estimates could be proposed that would smooth across the burst limits; they would result in a measure of burst coincidence between the two trains rather than a measure of spike coincidence inside synchronous bursts. Alternative measures more suited to the data, such as the correlation between neurones of burst onset latency measured on single trials, would probably be simpler to interpret. The one report to date that has applied unitary joint-event analysis to experimental data demonstrated that significant CEs were seen in association with rate changes and during periods of constant firing rate (Riehle et al., 1997). No data were presented on the variability of firing rate around rate changes, but it is likely that some of these CEs would not be significant using our improved technique. In the instances where significant CEs occurred without a modulation in firing rate, it is possible that there was nevertheless a change in the variability of discharge, as simulated in Figure 5. This could also lead to erroneous acceptance of chance CEs as significant.
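A simple way to perform this adjustment is by moments: for a gamma process of order a, the ISI coefficient of variation is 1/sqrt(a), so a can be estimated as mean(ISI)^2 / var(ISI). A sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def estimate_gamma_order(isis):
    """Method-of-moments estimate of the gamma order from observed
    ISIs: a = mean^2 / variance, rounded to the nearest integer.
    Order 1 recovers the Poisson (exponential-ISI) assumption."""
    isis = np.asarray(isis, dtype=float)
    order = isis.mean() ** 2 / isis.var()
    return max(1, int(round(float(order))))
```

The estimated order can then be passed to the breakpoint-positioning algorithm in place of the default Poisson assumption.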
The unitary joint-event method has received considerable interest, since it appears to allow the detection of synchronous firing on single trials (Fetz, 1997). However, as made clear by section 3, determination of the significance of CEs requires averaging across trials; this article has been concerned primarily with the best way in which to do this. Formally, therefore, the unitary joint-event method is similar to time-resolved cross-correlation methods, such as the JPSTH (Aertsen et al., 1989), except that it concentrates only on the central portion of the cross-correlation histogram and then identifies the moments and the trials where synchronous firings occurred that contributed to significant cross-correlation peaks. Whether this trial identification is useful depends on our underlying model for how synchrony occurs and is used in information processing in the brain. If we postulate that synchrony between the two recorded neurones is itself an important event, then knowing the trials on which it occurs is of obvious value in investigating the relationship between the CEs and, for example, task precision or reaction time. However, we may alternatively assume that the brain is composed of many small assemblies of neurones and that information is carried by the synchronous firing of a random subset of cells in one assembly (e.g., the synfire chain hypothesis of Abeles, 1991; Pauluis, Baker, & Olivier, 1999).
This seems likely, since at any moment some cells of a given assembly may be refractory or otherwise nonexcitable due to their recent participation in other assemblies. If this is the case, trials on which the two cells whose discharge is monitored fire synchronously are those trials where not only was this assembly active, but also where both cells were (by chance) able to participate in it. Knowing which trials contained the synchronous events is, on this model, less useful than knowing where, within the task, there was on average an excess of CEs. We have shown that modification of unitary event analysis to use an instantaneous discharge probability density measure can compensate accurately for changes in the occurrence rate of CEs produced by firing-rate covariation. It is anticipated that this technique, or others based on the instantaneous discharge probability density, will find wide applicability in the analysis of synchronization among neural populations recorded from awake behaving animals.

Acknowledgments
We thank Sonia Grün for helpful discussions. This work was supported by a British Council–CGRI travel grant. S.N.B. is supported by the UK MRC and Christ's College, Cambridge.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61, 900–917.
Awiszus, F. (1988). Continuous functions determined by spike trains of a neuron subject to stimulation. Biol. Cybern., 58, 321–327.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Bessou, P., Laporte, Y., & Pagès, B. (1968). A method of analysing the response of spindle primary endings to fusimotor stimulation. J. Physiol., 196, 37–45.
Brody, C. D. (1997). Latency, excitability, and spike timing correlations. Soc. Neurosci. Abstr., 23, 14.
Brody, C. D. (1999a). Disambiguating different covariation types. Neural Comput., 11, 1527–1535.
Brody, C. D. (1999b). Latency, excitability, and spike timing correlations. Neural Comput., 11, 1537–1551.
Butler, E. G., Horne, M. K., & Hawkins, N. J. (1992). The activity of monkey thalamic and motor cortical neurones in a skilled, ballistic movement. J. Physiol., 42, 25–48.
Butler, E. G., Horne, M. K., & Rawson, J. A. (1992). Sensory characteristics of monkey thalamic and motor cortex neurones. J. Physiol., 442, 1–24.
Churchward, P. R., Butler, E. G., Finkelstein, D. I., Aumann, T. D., Sudbury, A., & Horne, M. K. (1997). A comparison of methods used to detect changes in neuronal discharge patterns. J. Neurosci. Meth., 76, 203–210.
Cocatre-Filgien, J. H., & Delcomyn, F. (1992). Identification of bursts in spike trains. J. Neurosci. Meth., 41, 19–30.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Eggermont, J. J., Smith, G. M., & Bowman, D. (1993). Spontaneous firing in cat primary auditory cortex: Age and depth dependence and its effect on neural interaction measures. J. Neurophysiol., 69, 1292–1313.
Ellaway, P. H. (1978). Cumulative sum technique and its application to the analysis of peristimulus time histograms. Electroenceph. Clin. Neurophys., 45, 302–304.
Engel, A. K., König, P., Gray, C. M., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex—intercolumnar interaction as determined by cross-correlation analysis. Europ. J. Neurosci., 2, 588–606.
Fetz, E. E. (1997). Temporal coding in neural populations? Science, 278, 1901–1902.
French, A. S., & Holden, A. V. (1971). Alias-free sampling of neuronal spike trains. Kybernetik, 8, 165–171.
Gerstein, G. L., & Kiang, N. Y. S. (1960). An approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 1, 15–28.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity. Thun, Frankfurt am Main: Harri Deutsch.
Houk, J. C., Dessem, D. A., Miller, L. E., & Sybirska, E. H. (1987). Correlation and spectral analysis of relations between single unit discharge and muscle activities. J. Neurosci. Methods, 21, 201–224.
Hurst, H. E. (1950). Long-term storage capacity of reservoirs. Proc. Amer. Soc. Civil Engineering, 76, 1–30.
Koch, C., Rapp, M., & Segev, I. (1996). A brief history of time (constants). Cereb. Cortex, 6, 93–101.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neurosci., 18, 1161–1170.
Legéndy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53, 926–939.
Lemon, R. N. (1984). Methods for neuronal recording in conscious animals. London: Wiley.
MacPherson, J. M., & Aldridge, J. W. A. (1979). A quantitative method of computer analysis of spike train data collected from behaving animals. Brain Res., 175, 183–187.
Matthews, P. B. C. (1963). Apparatus for studying the responses of muscle spindles to stretching. J. Physiol., 169, 58–60P.
Munoz, D. P., & Wurtz, R. H. (1993). Fixation cells in monkey superior colliculus.
I. Characteristics of cell discharge. J. Neurophysiol., 70, 559–575.
Pauluis, Q., Baker, S. N., & Olivier, E. (1999). Emergent oscillations in a realistic network: The role of inhibition and the effect of spatiotemporal distribution of the input. J. Comput. Neurosci., 6, 27–48.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Richmond, B. J., Optican, L. M., Podell, M., & Spitzer, H. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. J. Neurophysiol., 57, 132–146.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990). Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. J. Neurophysiol., 64, 351–369.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., & Vaadia, E. (1996). Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. J. Neurosci., 16, 752–768.
Schwartz, A. B. (1992). Motor cortical activity during drawing movements: Single-unit activity during sinusoid tracing. J. Neurophysiol., 68, 528–541.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Comput., 4, 643–646.

Received August 31, 1998; accepted March 1, 1999.
LETTER
Communicated by Anthony Zador

Impact of Correlated Inputs on the Output of the Integrate-and-Fire Model

Jianfeng Feng
David Brown
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K.

For the integrate-and-fire model with or without reversal potentials, we consider how correlated inputs affect the variability of the cellular output. For both models, the variability of efferent spike trains, measured by the coefficient of variation (CV) of the interspike interval, is a nondecreasing function of the input correlation. When the correlation coefficient is greater than 0.09, the CV of the integrate-and-fire model without reversal potentials is always above 0.5, no matter how strong the inhibitory inputs. When the correlation coefficient is greater than 0.05, the CV of the integrate-and-fire model with reversal potentials is always above 0.5, independent of the strength of the inhibitory inputs. Under a given condition on the correlation coefficients, we find that correlated Poisson processes can be decomposed into independent Poisson processes. We also develop a novel method to estimate the distribution density of the first passage time of the integrate-and-fire model.

1 Introduction
Although the single-neurone model has been widely studied using computer simulation, most of these studies have made the assumption that the inputs are independent (Brown & Feng, 1999; Feng, 1997; Feng & Brown, 1998a, 1998b; Tuckwell, 1988), both spatially and temporally. This assumption obviously contradicts the physiological data, which clearly show that nearby neurones usually fire in a correlated way (Zohary, Shadlen, & Newsome, 1994), what we term spatial correlation, and that neurones with similar functions frequently form groups and fire together. In fact, "firing together, coming together" is a basic principle in neuronal development (Sheth, Sharma, Rao, & Sur, 1996). Furthermore, the data in Zohary et al. (1994) show that even a weak correlation within a population of neurones can have a substantial impact on network behavior, which suggests that when comparing simulation results with experimental data, it is of vital importance to investigate the effects of correlated input signals on the output of single cells. We address the following two important issues: how correlation between the inputs affects the output of single-neurone models, and when and how

Neural Computation 12, 671–692 (2000) © 2000 Massachusetts Institute of Technology
correlated inputs can be transformed into equivalent independent inputs. The second issue is interesting since it is usually not easy to model or analyze a correlated system, except when it is gaussian distributed. Theoretically, we lack general analytical tools for handling correlated systems, or even for finding analytical expressions for their distributions, and it is difficult even to generate a correlated random vector numerically. Therefore, a transformation from correlated to independent inputs greatly simplifies the problem, both theoretically and numerically, and, furthermore, provides insight into the functional role of the correlation. The technique can be applied both to simulations of biophysical models and to experiments in order to generate correlated Poisson inputs.

The neuronal model we use is the integrate-and-fire model with or without reversal potentials. Positive correlation in the input signals usually enlarges the coefficient of variation (CV) of the integrate-and-fire model's output. When the correlation coefficient is greater than 0.09, we find that the CV of the leaky integrate-and-fire model without reversal potentials is always greater than 0.5. When the correlation coefficient is greater than 0.05, the CV of the equivalent model with reversal potentials is always greater than 0.5. Much research has been devoted to finding a neuronal mechanism that can generate spike trains with a CV greater than 0.5 (see Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a; König, Engel, & Singer, 1996, and references therein). To the best of our knowledge, this is the first article to provide a possible mechanism for doing this in the integrate-and-fire model independently of the ratio between the inhibitory and excitatory inputs. Independently and experimentally, Stevens and Zador (1998) found that correlation between inputs was necessary in order to generate a sufficiently large CV in the output spike trains of neurons in neocortical slices.

Our analysis and simulations therefore provide a theoretical explanation of their findings. We also show that dependent inputs can be decomposed into independent inputs only under a restrictive condition. Finally, we propose a novel method of finding the distribution density of the first passage time of the integrate-and-fire model.

Correlation usually imposes a strong restriction on a system and has a profound impact on its output (Gawne & Richmond, 1993; Zohary et al., 1994). If we imagine a neurone exclusively receiving excitatory postsynaptic potentials (EPSPs) from a population of p excitatory neurones with mutual positive correlation coefficient c, then the output CV of the perfect integrate-and-fire neurone (see section 3) is proportional to [1 + (p² − p)c/p]/N_thre = [1 + (p − 1)c]/N_thre, where N_thre is the number of EPSPs required for the cell to fire. The first term, 1/N_thre, is well known (Softky & Koch, 1993; Abbott, Varela, Sen, & Nelson, 1997) and is responsible for what is usually termed the central limit theorem effect. The important difference between the first and second terms is that the latter depends on p, which is usually a large number. The appearance of this factor here, and not in the first term, is due to the fact that
the correlation is a second-order statistical quantity. Hence, even when c is small, say 0.1, the quantity (p − 1)c can be substantially greater than the first term, approaching or exceeding N_thre, and thus resulting in a high CV.

This is the second of our series of articles aiming to elucidate how more realistic inputs, in contrast to the conventional independently and identically distributed Poisson inputs that have been intensively studied in the literature, affect the outputs of simple neuronal models. We aim to describe as completely as possible the full spectrum of behavior inherent in these models, documenting more thoroughly their restrictions and potential. Feng and Brown (1998b) have considered the behavior of the integrate-and-fire model subject to independent inputs with different distribution tails. In the near future, we will report on more complicated biophysical models with realistic inputs, as developed here and in Feng and Brown (1998b).

2 Models

In this section we define the models used and then discuss the effects of including reversal potentials on the dynamic behavior of the leaky integrate-and-fire model, attempting to elucidate the essential difference this makes to its behavior.

Suppose that a cell receives EPSPs at p excitatory synapses and inhibitory postsynaptic potentials (IPSPs) at q inhibitory synapses. The activities among excitatory synapses and among inhibitory synapses are correlated, but for simplicity of notation we assume here that the activities of the two classes are independent of each other. When the membrane potential V_t is between the resting potential V_rest and the threshold V_thre,

dV_t = −(1/γ) V_t dt + a Σ_{i=1}^{p} dE_i(t) − b Σ_{j=1}^{q} dI_j(t),   (2.1)

where 1/γ is the decay rate, E_i(t) and I_j(t) are Poisson processes with rates λ_E and λ_I, respectively, and a, b are the magnitudes of each EPSP and IPSP (Thomson, 1997). Once V_t crosses V_thre from below, a spike is generated and V_t is reset to V_rest. This model is termed the integrate-and-fire model. The interspike interval of the efferent spike train is

T = inf{t : V_t ≥ V_thre}.

Without loss of generality, we assume that the correlation coefficient between the ith and jth excitatory (inhibitory) synapses is

c(i, j) = ⟨(E_i(t) − ⟨E_i(t)⟩)(E_j(t) − ⟨E_j(t)⟩)⟩ / √( ⟨(E_i(t) − ⟨E_i(t)⟩)²⟩ · ⟨(E_j(t) − ⟨E_j(t)⟩)²⟩ ) = ρ(|i − j|),
where ρ is a nonincreasing function. More specifically, we consider two structures for the correlation: block-type correlation and gaussian-type correlation. Block-type correlation means that the N cells are divided into N/k blocks; cell activities inside each block are correlated, whereas between blocks they are independent. Gaussian-type correlation implies that the correlation between cells is a decreasing function of their geometrical distance (we impose a periodic boundary condition).

A slightly more general model than the integrate-and-fire model defined above is the integrate-and-fire model with reversal potentials, defined by

dZ_t = −(Z_t − V_re)/γ dt + ā Σ_{i=1}^{p} (V_E − Z_t) dE_i(t) + b̄ Σ_{j=1}^{q} (V_I − Z_t) dI_j(t),   (2.2)

where V_re is the resting potential, ā and b̄ are the magnitudes of a single EPSP and IPSP, respectively, and V_E and V_I are the reversal potentials. The membrane potential Z_t is now a birth-and-death process with boundaries V_E and V_I. Once Z_t falls below V_re, the decay term Z_t − V_re increases the membrane potential Z_t; when Z_t is above V_re, the decay term decreases it.

A detailed analysis might clarify the essential difference between the models with and without reversal potentials. Rewrite the model with reversal potentials, equation 2.2, in the following way:

dZ_t = −(Z_t − V_re)/γ dt + ā(V_E − V_re) Σ_{i=1}^{p} dE_i(t) + b̄(V_I − V_re) Σ_{j=1}^{q} dI_j(t)
       − ā(Z_t − V_re) Σ_{i=1}^{p} dE_i(t) − b̄(Z_t − V_re) Σ_{j=1}^{q} dI_j(t).   (2.3)

We see that the term

−(Z_t − V_re)/γ dt + ā(V_E − V_re) Σ_{i=1}^{p} dE_i(t) + b̄(V_I − V_re) Σ_{j=1}^{q} dI_j(t)

is exactly of the same form as the integrate-and-fire model without reversal potentials: the coefficients of the diffusion term are fixed and independent of time. Compared to the integrate-and-fire model without reversal potentials, there is an additional term,

−ā(Z_t − V_re) Σ_{i=1}^{p} dE_i(t) − b̄(Z_t − V_re) Σ_{j=1}^{q} dI_j(t),

which is a pure decay term. Hence equation 2.3 can be expressed in the following way:

dZ_t = −(Z_t − V_re) [ dt/γ + ā Σ_{i=1}^{p} dE_i(t) + b̄ Σ_{j=1}^{q} dI_j(t) ]
       + ā(V_E − V_re) Σ_{i=1}^{p} dE_i(t) + b̄(V_I − V_re) Σ_{j=1}^{q} dI_j(t).   (2.4)
Therefore, the model with reversal potentials essentially increases the decay rate of the membrane potential, resulting in the system forgetting its dynamic history more quickly. This property is important when we consider how a group of cells synchronizes its behavior (Feng & Brown, 1999b), and it is also important for understanding the numerical differences between the behavior of the models with and without reversal potentials presented below. Furthermore, equation 2.4 might help us reach a theoretical conclusion on a debate (Tuckwell, 1988, p. 165; Wilbur & Rinzel, 1983) that has existed in the literature for decades.

3 Analytical Results
We first use the technique of martingale decomposition to approximate the input to the integrate-and-fire model. Together with the result in the appendix, this leads in theorem 1 to a formula for the distribution of the interspike interval of the perfect integrate-and-fire model subjected to correlated inputs. We then obtain (in lemma 1) the interspike interval distribution of the leaky integrate-and-fire model for the special case in which the leakage rate is such that the attractor of the deterministic part of the dynamics (Feng & Brown, 1999a) coincides with the spiking threshold. From this result, we find an expression in theorem 2 for the distribution density of the interspike interval of the general leaky integrate-and-fire model, which is then discussed. Finally, the model with reversal potentials is considered.

First, martingale decomposition is used to approximate the integrate-and-fire model with or without reversal potentials. We do not discuss the accuracy of the approximation, since this has been done by many authors previously (Tuckwell, 1988; Ricciardi & Sato, 1990) (see section 5). The Poisson point processes representing EPSP and IPSP input involve perturbations of the membrane potential with the following two properties. First, they are either fixed in size (for the model without reversal potentials) or of a size dependent on the membrane potential (for the model with reversal potentials). Second, they arrive irregularly in time according to Poisson processes. Using martingale decomposition, we effectively rewrite this process as a Wiener or Brownian motion process, which is simulated as the sum of an equivalent continuous current input, representing the drift that occurs with unbalanced EPSP/IPSP
input, and unbiased random input involving independently varying, small, random perturbations of the membrane potential, arriving at regularly spaced, very short intervals of time. The martingale decomposition reads

E_i(t) ≈ λ_E t + √λ_E B_i^E(t),

and, similarly,

I_i(t) ≈ λ_I t + √λ_I B_i^I(t),

where B_i^E(t) and B_i^I(t) are standard Brownian motions. For fixed time t, the above decomposition is equivalent to the central limit theorem. Therefore the integrate-and-fire model without reversal potentials can be approximated by

dv_t = −(1/γ) v_t dt + a Σ_{i=1}^{p} λ_E dt − b Σ_{i=1}^{q} λ_I dt + a√λ_E Σ_{i=1}^{p} dB_i^E(t) − b√λ_I Σ_{i=1}^{q} dB_i^I(t).

Since a sum of Brownian motions is again a Brownian motion, we can rewrite the equation above as follows:

dv_t = −(1/γ) v_t dt + (apλ_E − bqλ_I) dt + √( a²λ_E Σ_{i=1}^{p} Σ_{j=1}^{p} c_E(i, j) + b²λ_I Σ_{i=1}^{q} Σ_{j=1}^{q} c_I(i, j) ) dB(t)   (3.1)
     = −(1/γ) v_t dt + (apλ_E − bqλ_I) dt + √( a²pλ_E + b²qλ_I + a²λ_E Σ_{i≠j} c_E(i, j) + b²λ_I Σ_{i≠j} c_I(i, j) ) dB(t).

When c_E(i, j) = c_I(i, j) = 0, equation 3.1 gives rise to the same results as in the literature (Tuckwell, 1988) for independent inputs. When q = 0 and c_E(i, j) = c, the signal-to-noise ratio of the inputs given by equation 3.1 is

SNR = pλ_E / √( pλ_E + p(p − 1)λ_E c ),

which coincides with that in Zohary et al. (1994, fig. 3a).
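The drift and diffusion coefficients in equation 3.1 are easy to evaluate for block-type correlation c_E(i, j) = c_I(i, j) = c, i ≠ j. The sketch below is our own illustration (the helper names and parameter values are not from the paper); it shows how even c = 0.1 dominates the diffusion variance once p is large:

```python
import math

def snr(p, lam_e, c):
    """Input signal-to-noise ratio p*lam_E / sqrt(p*lam_E + p*(p-1)*lam_E*c)
    for p excitatory Poisson inputs with pairwise correlation c
    (the q = 0 case discussed with equation 3.1)."""
    return p * lam_e / math.sqrt(p * lam_e + p * (p - 1) * lam_e * c)

def diffusion_coefficient(a, b, p, q, lam_e, lam_i, c):
    """Coefficient of dB(t) in equation 3.1 for block-type correlation
    c(i, j) = c, i != j, in both the excitatory and inhibitory populations."""
    var = (a * a * p * lam_e + b * b * q * lam_i
           + a * a * lam_e * p * (p - 1) * c
           + b * b * lam_i * q * (q - 1) * c)
    return math.sqrt(var)

# p = 100 inputs at 100 Hz (0.1 spikes/msec), a = 0.5 mV, q = 0:
# the correlation term a^2 * lam_E * p * (p - 1) * c swamps the
# independent-input term as soon as (p - 1) * c >> 1.
print(round(diffusion_coefficient(0.5, 0.5, 100, 0, 0.1, 0.1, 0.0) ** 2, 6))  # 2.5
print(round(diffusion_coefficient(0.5, 0.5, 100, 0, 0.1, 0.1, 0.1) ** 2, 6))  # 27.25
print(snr(100, 0.1, 0.0))  # sqrt(10) ≈ 3.162
print(snr(100, 0.1, 0.1))  # ≈ 0.958, near sqrt(lam_E / c) = 1 for large p
```

The last two lines reproduce the saturation of the input SNR with increasing correlation reported by Zohary et al. (1994).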
Combining the results above with the conclusions in the appendix, we arrive at the following theorem for the perfect integrate-and-fire model. Let T be the first passage time of v_t.

Theorem 1. If γ = ∞, the distribution density of T is

p(t) = V_thre / √( 2π (a²pλ_E + b²qλ_I + a²λ_E Σ_{i≠j} c_E(i, j) + b²λ_I Σ_{i≠j} c_I(i, j)) t³ )
       · exp( −(V_thre − (apλ_E − bqλ_I)t)² / (2(a²pλ_E + b²qλ_I + a²λ_E Σ_{i≠j} c_E(i, j) + b²λ_I Σ_{i≠j} c_I(i, j)) t) ),   (3.2)

and in particular

⟨T⟩ = V_thre / (apλ_E − bqλ_I),
CV² = (a²pλ_E + b²qλ_I) / (V_thre (apλ_E − bqλ_I)) + (a²λ_E Σ_{i≠j} c_E(i, j) + b²λ_I Σ_{i≠j} c_I(i, j)) / (V_thre (apλ_E − bqλ_I)).   (3.3)

Theorem 1 tells us that, for the perfect integrate-and-fire model, correlation between the inputs does have an impact on the variability (CV) of the output interspike intervals, but not on the mean firing rate. Compared to the perfect integrate-and-fire model without correlation, the CV of efferent interspike intervals falls between 0.5 and 1 for greater degrees of imbalance between EPSPs and IPSPs (Feng & Brown, 1998b). Furthermore, the output distribution is long-tailed if and only if an exact balance between EPSPs and IPSPs is reached, although in general correlated inputs will generate efferent spikes with a larger CV. Although theorem 1 is a special case of theorem 2, we prefer to state it as a separate theorem since we apply it in the next section.

In appendix A, we present a new and simple proof of theorem 1 without resorting to the classic Laplace transform (Tuckwell, 1988), which we hope opens up new possibilities for obtaining an analytical formula for the general case, the leaky integrate-and-fire model without reversal potentials. The idea of our approach is to use Girsanov's theorem (Protter, 1980), which connects two diffusion processes by a displacement transformation. Once we know the properties of the diffusion process (with σ > 0 a constant)

dX_t = b(X_t) dt + σ dB_t,
we can, by Girsanov's theorem, infer the properties of the diffusion process

dY_t = b(Y_t) dt + μ dt + σ dB_t.

For the problem we consider here, we can find the distribution density for the following case,

dv̄_t = −v̄_t/γ dt + μ₀ dt + σ dB_t, with μ₀ = V_thre/γ,   (3.4)

the special case in which the attractor of the deterministic dynamics coincides with the threshold. (For further interesting properties of the dynamics under condition 3.4, see Feng, 1999; Feng & Brown, 1999a.) Let T̄ be the first passage time of v̄_t.

Lemma 1. When μ = V_thre/γ, the distribution density of T̄ is given by

p₀(t) = ( 2σ² V_thre exp(−t/γ) / √( π [σ²γ(1 − exp(−2t/γ))]³ ) ) · exp( −V_thre² exp(−2t/γ) / (σ²γ(1 − exp(−2t/γ))) ).   (3.5)
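A quick numerical sanity check of equation 3.5: since the Ornstein-Uhlenbeck process v̄_t reaches the threshold (its attractor) with probability one, p₀ must integrate to one. The parameter values below are our own, chosen only for illustration:

```python
import math

def p0(t, v_thre, sigma, gamma):
    """Equation 3.5: first-passage-time density when the attractor of the
    deterministic dynamics coincides with the threshold (lemma 1)."""
    s = sigma**2 * gamma * (1.0 - math.exp(-2.0 * t / gamma))
    return (2.0 * sigma**2 * v_thre * math.exp(-t / gamma)
            / math.sqrt(math.pi * s**3)
            * math.exp(-v_thre**2 * math.exp(-2.0 * t / gamma) / s))

# Trapezoidal integration over (0, 1500] msec; the density decays like
# exp(-t/gamma), so the truncated tail is negligible.
v_thre, sigma, gamma = 20.0, 5.0, 20.2   # illustrative values only
dt = 0.02
grid = [dt * k for k in range(1, 75001)]
vals = [p0(t, v_thre, sigma, gamma) for t in grid]
total = sum(0.5 * (u + v) * dt for u, v in zip(vals, vals[1:]))
print(total)  # ≈ 1 (the process hits the threshold with probability one)
```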
Hence, by using Girsanov's theorem as we do in the appendix, we find an analytical formula for the general case,

dv_t = −v_t/γ dt + μ₀ dt + (μ − μ₀) dt + σ dB_t,

as described below.

Theorem 2. The distribution density of the first passage time T of v_t is

p(t) = p₀(t) exp( −(μ − μ₀)² t / (2σ²) + V_thre(μ − μ₀)/σ² − (μ − μ₀)μ₀ t / σ² ) · ⟨ exp( ((μ − μ₀)/σ²) ∫₀^T̄ (v̄_t/γ) dt ) | F_T̄ ⟩|_{T̄ = t}.   (3.6)
As in the appendix, we only need to calculate the Radon-Nikodym derivative in Girsanov's theorem, which is given by

F(t) = exp( ∫₀^t ((μ − μ₀)/σ) dB_s − (1/2) ∫₀^t ((μ − μ₀)²/σ²) ds )
     = exp( ((μ − μ₀)/σ) B_t − ((μ − μ₀)²/(2σ²)) t ).

Therefore we have

⟨g(T)⟩ = ∫₀^∞ g(t) p₀(t) exp( −(μ − μ₀)² t / (2σ²) ) ⟨ exp( ((μ − μ₀)/σ) B_T̄ ) | F_T̄ ⟩|_{T̄ = t} dt,   (3.7)

where g is a measurable, nonnegative function, T is the first passage time, and F_t is the σ-algebra up to time t. Now the only remaining task in proving theorem 2 is to find an analytical expression for B_T̄. From

dv̄_t = −v̄_t/γ dt + μ₀ dt + σ dB_t,

we have

v̄_T̄ = −∫₀^T̄ (v̄_t/γ) dt + μ₀ T̄ + σ B_T̄.

Equation 3.7 thus becomes

⟨g(T)⟩ = ∫₀^∞ g(t) p₀(t) exp( −(μ − μ₀)² t / (2σ²) + V_thre(μ − μ₀)/σ² − (μ − μ₀)μ₀ t / σ² ) ⟨ exp( ((μ − μ₀)/σ²) ∫₀^T̄ (v̄_t/γ) dt ) | F_T̄ ⟩|_{T̄ = t} dt,

which gives the desired result.
Figure 1: Dark area above the resting potential − dark area below the resting potential = ∫₀^T̄ v̄_t dt (V_rest = 0 mV).

Now let us consider the meaning of the term ∫₀^T̄ v̄_t dt. It is the area under the curve visited by the process v̄_t before it hits the threshold, as shown in Figure 1. It is easily seen that we can use different ways to approximate ∫₀^T̄ v̄_t dt (we know the distribution of T̄ from lemma 1), which give rise to various approximations of the distribution density of T. Finding an analytical formula for the distribution density of T is a problem that has been studied for decades, and a number of different methods, mainly in terms of PDEs, have been employed (see, for example, Ricciardi & Sato, 1990). Here we present a novel, and probably the most natural, way, based on the probability
method, to approximate the distribution density. (A study of how to improve the accuracy of the approximation is an interesting problem, but it is outside the scope of this article and will be published elsewhere.) In fact, from theorem 2 we can already obtain some informative conclusions. For example, we claim that as soon as μ ≥ μ₀, p(t) goes to zero (as t → ∞) at least as fast as

exp( −( (μ − μ₀)²/(2σ²) + 1/γ ) t ),

an exponentially decreasing function. This is because v̄_t/γ ≤ V_thre/γ = μ₀, and therefore

exp( −(μ − μ₀)μ₀ t / σ² ) ⟨ exp( ((μ − μ₀)/σ²) ∫₀^T̄ (v̄_t/γ) dt ) | F_T̄ ⟩|_{T̄ = t} ≤ 1.
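The reweighting behind theorem 2 suggests a simple Monte Carlo recipe: simulate the reference process of equation 3.4 until it first crosses V_thre, record T̄ and the pathwise area ∫₀^T̄ v̄_t dt, and weight each sample by the factor appearing in equation 3.6. The sketch below is our own illustration (an Euler scheme with arbitrary parameter values), not the authors' numerical method; note that, exactly as in the bound just given, the recorded area never exceeds γμ₀T̄, because v̄_t ≤ V_thre before the crossing:

```python
import math
import random

def first_passage_sample(v_thre, sigma, gamma, dt, rng):
    """Euler simulation of the reference process of equation 3.4,
    dv = (-v/gamma + mu0) dt + sigma dB with mu0 = v_thre/gamma,
    run until v first crosses v_thre.  Returns (T_bar, area),
    where area approximates the pathwise integral of v over [0, T_bar]."""
    mu0 = v_thre / gamma
    v = t = area = 0.0
    while v < v_thre:
        area += v * dt          # v < v_thre here, so area <= mu0 * gamma * t
        v += (-v / gamma + mu0) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return t, area

def girsanov_weight(t, area, mu, v_thre, sigma, gamma):
    """Reweighting factor of theorem 2 for a general drift mu."""
    mu0 = v_thre / gamma
    d = mu - mu0
    return math.exp(-d * d * t / (2.0 * sigma**2)
                    + v_thre * d / sigma**2
                    - d * mu0 * t / sigma**2
                    + d * area / (gamma * sigma**2))

rng = random.Random(0)
v_thre, sigma, gamma, mu = 20.0, 5.0, 20.2, 1.2   # illustrative only
samples = [first_passage_sample(v_thre, sigma, gamma, 0.01, rng) for _ in range(100)]
weights = [girsanov_weight(t, a, mu, v_thre, sigma, gamma) for t, a in samples]
# A weighted histogram of the sampled T_bar values then estimates p(t)
# for drift mu, using only reference-process simulations.
print(min(weights) > 0.0)  # True
```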
Now we turn to the model with reversal potentials. Similar to the treatment of the model without reversal potentials above, we can rewrite the model in the form

dz_t = −(1/γ) z_t dt + ā Σ_{i=1}^{p} λ_E (V_E − z_t) dt + b̄ Σ_{i=1}^{q} λ_I (V_I − z_t) dt + ā√λ_E Σ_{i=1}^{p} |V_E − z_t| dB_i^E(t) + b̄√λ_I Σ_{i=1}^{q} |V_I − z_t| dB_i^I(t),

which gives

dz_t = −(1/γ) z_t dt + ( āp(V_E − z_t)λ_E − b̄q(z_t − V_I)λ_I ) dt + σ_r(t) dB_t,   (3.8)

where

σ_r²(t) = ā²(z_t − V_E)² λ_E Σ_{i=1}^{p} Σ_{j=1}^{p} c_E(i, j) + b̄²(z_t − V_I)² λ_I Σ_{i=1}^{q} Σ_{j=1}^{q} c_I(i, j)
        = ā²(z_t − V_E)² pλ_E + b̄²(z_t − V_I)² qλ_I + ā²(z_t − V_E)² λ_E Σ_{i≠j} c_E(i, j) + b̄²(z_t − V_I)² λ_I Σ_{i≠j} c_I(i, j).   (3.9)
There are other forms of the diffusion term in equation 3.8 with which to approximate the original process (Musila & Lánský, 1994). For simplicity of notation, we confine ourselves to equation 3.9.

4 Numerical Results
From now on we assume that c_E(i, j) = c_I(i, j) = c(i, j). In this section we present and discuss the results of numerical simulations of equations 3.1 and 3.8, showing the effect of block-type correlation on the mean interspike interval (ISI) and CV of the leaky integrate-and-fire model without (see Figure 2) and with reversal potentials (see Figure 4) for moderate EPSP/IPSP sizes, and also for very small EPSP/IPSP sizes (see Figure 3). We also demonstrate the effect of a correlation that falls off as an inverse square law (see Figure 5).

4.1 Block-Type Correlation. For simplicity of notation, we consider only the case c(i, j) = c for i ≠ j, i, j = 1, …, p and i ≠ j, i, j = 1, …, q. It is reported in the literature (Zohary et al., 1994) that the correlation coefficient between cells is around 0.1 in area V5 of rhesus monkeys in vivo. In human motor units of a variety of muscles, the correlation coefficients are usually in the range 0.1 to 0.3 (Matthews, 1996). In Figure 2 we show numerical simulations for q = 0, 10, 20, …, 100, p = 100, V_rest = 0, V_thre = 20 mV, a = b = 0.5 mV, γ = 20.2 msec, λ_E = λ_I = 100 Hz, a set of parameters as in the literature (Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a), and c = 0, 0.01, 0.02, …, 0.1. For c = 0, inputs without correlation, there
Figure 2: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 for the integrate-and-fire model without reversal potentials. Ten thousand spikes were generated to calculate the mean firing time and CV. Parameters are specified in the text. As soon as c(i, j) ≥ 0.09, the CV of efferent spike trains is greater than 0.5.
are many numerical and theoretical investigations (see, for example, Softky & Koch, 1993; Ricciardi & Sato, 1990; Feng & Brown, 1998b). As one might expect, the larger the input correlation, the larger the output CV. In particular, we note that when c ≥ 0.09, CV > 0.5 for any value of q. When the ratio q/p is small, the mean firing time is independent of the correlation, a phenomenon we have pointed out for the perfect integrate-and-fire model (see theorem 1). However, when q/p is large (i.e., q ≥ 50), the mean firing time with weak correlation is greater than that with strong correlation. This can be easily understood from the properties of the process: the larger the correlation, the larger the diffusion term, which drives the process across the threshold more often. However, as in the perfect integrate-and-fire model, the CV of efferent spike trains does depend on both the ratio and the correlation: the larger the correlation and the ratio, the larger the CV.

Many studies have been devoted to the problem of how to generate spike trains with CV between 0.5 and 1 (see, for example, Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a; Softky & Koch, 1993; Shadlen & Newsome, 1994). In particular, Softky and Koch (1993) point out that it is impossible for the integrate-and-fire model to generate spike trains with CV
Figure 3: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 for the integrate-and-fire model without reversal potentials, with a = b = 0.05 and p = 1000. Ten thousand spikes were generated to calculate the mean firing time and CV. Parameters are specified in the text. For c(i, j) = 0 and q = 0, the CV is 0.05. Note that only q = 0, 100, …, 800 was calculated for c(i, j) = 0, since the mean firing time is too large for q = 900, 1000.
between 0.5 and 1 if the inputs are exclusively excitatory. A phenomenon referred to as the central limit effect is widely cited in the literature (Abbott et al., 1997) as justification of this statement, and many approaches have been proposed to get around the problem (Shadlen & Newsome, 1994). Here we clearly demonstrate that even with exclusively excitatory inputs, the integrate-and-fire model is capable of emitting spike trains with CV greater than 0.5. No matter what the ratio between the excitatory and inhibitory synapses, the CV of efferent spike trains is always greater than 0.5, provided that the correlation coefficient between the inputs is greater than or equal to 0.09. Furthermore, we emphasize that in Zohary et al. (1994) the authors claim, based on their physiological data, that c > 0.1 is the most plausible case.

To further confirm that a small correlation plays an important role in neuronal output, in Figure 3 we simulate an extreme case with EPSP and IPSP sizes of 0.05 mV. Shadlen and Newsome (1994) report that the size of an EPSP is between 0.05 mV and 2 mV. This choice implies that 400 EPSPs are needed to drive a cell to fire. To allow a reasonable comparison with the numerical results for a = 0.5, μ is kept constant when the ratio
Figure 4: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 for the integrate-and-fire model with reversal potentials. Ten thousand spikes were generated to calculate the mean firing time and CV. Parameters are specified in the text. As soon as c(i, j) ≥ 0.05, the CV of efferent spike trains is greater than 0.5.
q/p is the same, and hence p = 1000 in Figure 3. When c(i, j) = 0 and q = 0, the CV is about 0.05 (see Figure 3). Figure 3 clearly shows that a large efferent CV is again obtained with a small correlation.

Now we turn to the integrate-and-fire model with reversal potentials. As we have already pointed out, the essential difference between this model and the one without reversal potentials is that the former has a larger effective decay rate of the membrane potential. In this case it is therefore natural to expect a larger efferent spike-train CV, since the decay term plays a predominant role. We assume that V_re = −50 mV, V_I = −60 mV, V_thre = −30 mV, and V_E = 50 mV, values that match experimental data. As in the literature (Musila & Lánský, 1994; Feng & Brown, 1998b), we impose a local balance condition on the magnitudes of the EPSPs and IPSPs: ā(V_E − V_re) = b̄(V_re − V_I) = 1 mV; starting from the resting potential, a single EPSP or a single IPSP will depolarize or hyperpolarize the membrane potential by 1 mV (see the discussion in section 3). Figure 4 clearly shows that efferent spike trains of the integrate-and-fire model with reversal potentials are more variable than those of the model without reversal potentials (except for high q and c(i, j)). When the correlation coefficient is greater than 0.05, CV > 0.5, no matter what the ratio
Figure 5: Mean firing time (left) and CV (right) versus q for s = 1, 2, 5, 9, 10 for the integrate-and-fire model without reversal potentials with a gaussian-type correlation pattern. Ten thousand spikes were generated to calculate the mean firing time and CV. Parameters are specified in the text. As soon as s ≥ 5, the CV of efferent spike trains is greater than 0.5.
between inhibitory and excitatory inputs. Due to this effect (large CV), the mean firing rates are also more widely spread than for the model without reversal potentials. When q ≥ 20, the differences between the mean firing times become discernible (see Figure 4).

From Figures 2 and 4 we can see another interesting phenomenon. When q/p approaches one, the CV of the integrate-and-fire model without reversal potentials is greater than that of the model with reversal potentials (which itself is less than or equal to one) (see Tuckwell, 1988, p. 165; Wilbur & Rinzel, 1983). The reason can be understood from our analyses in section 3. Note that the decay term returns the system to the resting potential. The model with reversal potentials is equivalent to one with a large decay rate, and so Z_t generally remains closer to the resting potential than V_t. When q gets close to 100, the deterministic forces tending to increase the membrane potential become weak, and so a process starting from the resting potential will tend to fall below the resting potential more and more often. However, the decay terms together with the reversal potentials in the integrate-and-fire model with reversal potentials prevent the process from going far below the resting potential. The process Z_t is thus more densely packed in a compact region, and a smaller range of CV is observed.
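The central claim of this subsection, that block-type correlation inflates the output CV, can be reproduced directly from the diffusion approximation in equation 3.1. The following Euler-scheme sketch (our own code and step size, with the q = 0 parameter set quoted above) is illustrative only, not the simulation code used for the figures:

```python
import math
import random

def isi_stats(c, n_spikes, rng, p=100, q=0, a=0.5, b=0.5,
              lam_e=0.1, lam_i=0.1, gamma=20.2, v_thre=20.0, dt=0.01):
    """Euler simulation of equation 3.1 with block-type correlation
    c(i, j) = c.  Rates are per msec (100 Hz = 0.1/msec); the membrane
    potential is reset to V_rest = 0 after each threshold crossing.
    Returns (mean ISI in msec, CV) over n_spikes intervals."""
    mu = a * p * lam_e - b * q * lam_i
    sig = math.sqrt(a * a * p * lam_e + b * b * q * lam_i
                    + a * a * lam_e * p * (p - 1) * c
                    + b * b * lam_i * q * (q - 1) * c)
    isis = []
    for _ in range(n_spikes):
        v = t = 0.0
        while v < v_thre:
            v += (-v / gamma + mu) * dt + sig * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
        isis.append(t)
    m = sum(isis) / n_spikes
    sd = math.sqrt(sum((x - m) ** 2 for x in isis) / (n_spikes - 1))
    return m, sd / m

rng = random.Random(1)
m0, cv0 = isi_stats(0.0, 1000, rng)   # independent inputs
m1, cv1 = isi_stats(0.1, 1000, rng)   # pairwise correlation c = 0.1
print(cv0, cv1)  # the CV rises sharply once c reaches 0.1 (cf. Figure 2)
```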
4.2 Gaussian-Type Correlation. We suppose that c(i, j) = exp(−(i − j)²/s²), i, j = 1, …, p, with periodic boundary conditions. Hence the larger s is, the more widely spread the correlation for the ith cell. We simulate the integrate-and-fire model using the Euler scheme (Feng, Lei, & Qian, 1992) with s = 1, 2, …, 10. Phenomena similar to those of the previous subsection are observed (see Figure 5): there is a threshold value s_c such that as soon as s > s_c, the CV of efferent spike trains becomes greater than 0.5, independent of the number of inhibitory inputs. Similar results are obtained for the integrate-and-fire model with reversal potentials (see the previous subsection) as well (not shown).

5 From Dependent to Independent Synapses
We now consider how we can realize Poisson processes with a constant correlation. Define E_i(t) = N_i(t) + N(t), where the N_i(t), i = 1, …, p, are independently and identically distributed Poisson processes with rate (1 − c)λ_E, and N(t) is an independent Poisson process with rate cλ_E. It is then easily seen that the correlation coefficient between E_i and E_j is c. Therefore, in terms of our conclusions above, we can rewrite the total synaptic input Σ_i E_i(t) in the following way:

Σ_i E_i(t) = Σ_i N_i(t) + pN(t).   (5.1)

This conclusion tells us that constant correlated Poisson inputs are equivalent to the case in which there is a common source, the term N(t) in equation 5.1, shared by all synapses. We usually prefer independent inputs, which are much easier to deal with than correlated inputs, and we do so again here. Let us rewrite the integrate-and-fire model in the following way:

dV_t = −(V_t/γ) dt + a Σ_{i=1}^{p} dN_i^E(t) + ap dN^E(t) − b Σ_{i=1}^{q} dN_i^I(t) − bq dN^I(t)
     ≈ −(V_t/γ) dt + μ dt + √( a²p(1 − c)λ_E + b²q(1 − c)λ_I + a²p²cλ_E + b²q²cλ_I ) dB_t.
The independent terms contribute an amount a²p(1 − c)λ_E to the variance, but the common source contributes an amount a²p²cλ_E. Since p is usually large, the common source thus plays a predominant role in the model's behavior when we employ physiological parameters (a = 0.5, c = 0.1) in the integrate-and-fire model. We thus ask whether a given series of correlated Poisson processes (suppose now that c(i, j), j = 1, …, p, is not constant) can be decomposed into a
series of independent Poisson processes. Furthermore, without loss of generality, we assume that \lambda_E = 1.

Theorem 3 (Decomposition Theorem). Suppose that p is large and

\sum_{j=2}^{p} c(1, j) \le \frac{1}{2},    (5.2)

then the Poisson processes E_i(t) defined below,

E_i(t) = N_i(t) + \sum_{j=1}^{i-1} N_{j, i-j}(t) + \sum_{j=1}^{p-i} N_{i, j}(t),

are correlated, with the correlation between E_i(t) and E_j(t) equal to c(1, |i-j|), i \ne j, j \pm 1, where N_i(t), N_{i,j}(t), i = 1, ..., p; j = 1, ..., p-i, are independent Poisson processes with rates

\lambda_{N_i} = t\Big(1 - \sum_{j=2}^{i-1} c(1, j) - \sum_{j=2}^{p-i} c(1, j)\Big)  and  \lambda_{N_{i,j}} = t\, c(1, j).

Proof. For i \ne j, j+1, j-1, we obtain

\langle E_i(t) - \langle E_i(t)\rangle,\; E_j(t) - \langle E_j(t)\rangle\rangle
  = \langle N_i(t) - \langle N_i(t)\rangle,\; N_j(t) - \langle N_j(t)\rangle\rangle
  + \sum_{k=1}^{i-1}\sum_{l=1}^{j-1} \langle N_{k,i-k}(t) - \langle N_{k,i-k}(t)\rangle,\; N_{l,j-l}(t) - \langle N_{l,j-l}(t)\rangle\rangle
  + \sum_{k=1}^{i-1}\sum_{l=1}^{p-j} \langle N_{k,i-k}(t) - \langle N_{k,i-k}(t)\rangle,\; N_{j,l}(t) - \langle N_{j,l}(t)\rangle\rangle
  + \sum_{k=1}^{p-i}\sum_{l=1}^{j-1} \langle N_{i,k}(t) - \langle N_{i,k}(t)\rangle,\; N_{l,j-l}(t) - \langle N_{l,j-l}(t)\rangle\rangle
  + \sum_{k=1}^{p-i}\sum_{l=1}^{p-j} \langle N_{i,k}(t) - \langle N_{i,k}(t)\rangle,\; N_{j,l}(t) - \langle N_{j,l}(t)\rangle\rangle
  = \tfrac{1}{2}(1 + \mathrm{sgn}(i-j))\, c(1, |i-j|)\, t + \tfrac{1}{2}(1 + \mathrm{sgn}(j-i))\, c(1, |i-j|)\, t
  = c(1, |i-j|)\, t.

Using a similar algebraic calculation, we have \langle E_i(t) - \langle E_i(t)\rangle,\; E_i(t) - \langle E_i(t)\rangle\rangle = t. Therefore the correlation coefficient between E_i(t) and E_j(t) is given by c(1, |i-j|).
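The bookkeeping in this proof can be checked deterministically: because each E_i is a sum of independent Poisson counts, covariances follow by adding the rates of the shared components. The sketch below uses one plausible reading of the index conventions (c1[d] plays the role of c(1, d), the target correlation at distance d); p, t, and the c1 values are illustrative assumptions, not values from the text.

```python
# Deterministic check of the decomposition: covariances of sums of
# independent Poisson counts are obtained by adding the rates of the
# shared components.  c1[d] stands for c(1, d); p, t, and the c1 values
# are illustrative assumptions.

t = 10.0                                   # window length (lambda_E = 1)
p = 5                                      # number of processes (assumed)
c1 = {1: 0.2, 2: 0.1, 3: 0.05, 4: 0.02}    # assumed c(1, d) values

def components(i):
    """Independent Poisson components of E_i with their rates."""
    own = t * (1 - sum(c1[d] for d in range(1, i))
                 - sum(c1[d] for d in range(1, p - i + 1)))
    comps = {("own", i): own}
    comps.update({(j, i - j): t * c1[i - j] for j in range(1, i)})
    comps.update({(i, d): t * c1[d] for d in range(1, p - i + 1)})
    return comps

def corr(i, j):
    ci, cj = components(i), components(j)
    cov = sum(rate for key, rate in ci.items()
              if key in cj and key[0] != "own")        # shared components only
    var_i, var_j = sum(ci.values()), sum(cj.values())  # Poisson variance = rate
    return cov / (var_i * var_j) ** 0.5

print(round(corr(1, 2), 3))   # 0.2  = c(1, 1)
print(round(corr(2, 5), 3))   # 0.05 = c(1, 3)
```

Each pair E_i, E_j shares exactly one component, whose rate t·c(1, |i−j|) supplies the required covariance, while the "own" rates are chosen so that every E_i has total rate t.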
688
Jianfeng Feng and David Brown
It is obvious that condition 5.2 is rather restrictive, since c(i, j) = c with c > 1/p violates it. However, the following example shows that the condition in theorem 3 is a necessary and sufficient condition for a linear decomposition:

Example. Consider four cells i = 1, 2, 3, 4 and assume that c(2, 3) > 0 and c(2, 4) = 0. Since we have c(1, 3) = 0, the most efficient way to construct E_2 is E_2(t) = N_1(t) + N_3(t), with \langle N_1(t)\rangle = c(2, 3) = \langle N_3(t)\rangle. Therefore if and only if 2c(1, 2) \le 1 do we have a linear decomposition of E_i(t).

6 Discussion
We have shown that a weak correlation between neurones can have a dramatic influence on model output. In the integrate-and-fire model without reversal potentials, the CV of efferent spike trains is above 0.5 once the correlation coefficient exceeds 0.09; in the integrate-and-fire model with reversal potentials, the CV already exceeds 0.5 when the correlation coefficient is above 0.05. These properties are independent of the number of inhibitory inputs and therefore resolve an existing problem in the literature: how the integrate-and-fire model generates spike trains with CV greater than 0.5 when the ratio between inhibitory and excitatory inputs is low. Note that Zohary et al. (1994) have claimed that a correlation greater than 0.1 is the most plausible value. The integrate-and-fire model is one of the most widely used single-neuronal models. The advantage when studying it is that some useful analytical formulas exist, although a fundamental question remains open: to find the distribution density of the firing time. We have proposed a novel approach based on Girsanov's theorem, which opens up new possibilities for answering this question. Analytical approaches of this type are a useful, and wherever possible essential, supplement to computer simulation results, since analytically derived expressions are usually very general rather than applicable only to the specific parameter values used in simulations. We illustrate this by deriving an algebraic form for the tail of the interspike interval distribution. This could be important for fitting to experimental data, and hence obtaining parameter estimates, but also has implications explored in our previous work (Feng & Brown, 1998a) when the output of the current layer of neurones is the input to a subsequent layer. When the correlation coefficient between neurones is a constant, we can easily decompose them into a linear summation of independent Poisson
processes, and therefore all results in the literature are applicable. Under a sufficient condition, we also study the possibility of decomposing a correlated Poisson process into independent Poisson processes. The results can be widely used in biophysical models and experiments. Finally, we discuss the implication of random inputs (Poisson process inputs) in our model. This is a puzzling issue, and a solid answer can be provided only in terms of experiments (Abeles, 1982, 1990). Mainen and Sejnowski (1995) pointed out that "Reliability of spike timing depended on stimulus transients. Flat stimuli led to imprecise spike trains, whereas stimuli with transients resembling synaptic activity produced spike trains with timing reproducible to less than one millisecond." However, it must be emphasized that their experiments were carried out in neocortical slices. It is interesting to see that the variability of spike trains depends on the nature of the inputs, but the CV of efferent spike trains in vivo might be very different from that in slices due to the absence of other inputs in slice experiments. For example, it has been reported that CV is between 0.5 and 1 for visual cortex (V1) and extrastriate cortex (MT) (Softky & Koch, 1993; Marsalek, Koch, & Maunsell, 1997), while even in human motor cells CV is between 0.1 and 0.25 (Matthews, 1996, p. 597). There are also examples of groups of cells that behave totally differently in vivo from in slices. Oxytocin cells fire synchronously in vivo, but this property is totally lost in slices (see Brown & Moos, 1997, and references therein). Very recently, it has been reported that random noise of similar amplitude to the deterministic component of the signal plays a key role in neural control of eye and arm movements (Harris & Wolpert, 1998; Sejnowski, 1998). Thus randomness is present, and appears to have a functionally important role, in a number of physiological functions.

Appendix
We know that, in terms of the reflection principle, the distribution density of the first passage time from (-\infty, V_{thre}) of a Brownian motion \sigma B_t is given by

p(t) = \frac{V_{thre}}{\sqrt{2\pi t^3 \sigma^2}} \exp\big(-V_{thre}^2 / (2\sigma^2 t)\big).

We intend to find the distribution density of the first passage time of the process X_t = \mu t + \sigma B_t. For any measurable positive function g, we have

\langle g(T) \rangle = \langle g(\hat{T})\, F(\hat{T}) \rangle,    (A.1)

where F(t) is the Radon-Nikodym derivative given by Girsanov's theorem,

F(t) = \exp\Big( \int_0^t \frac{\mu}{\sigma}\, dB_s - \frac{1}{2\sigma^2} \int_0^t \mu^2\, ds \Big) = \exp\big( \mu B_t / \sigma - \mu^2 t / (2\sigma^2) \big),

and \hat{T} is the first passage time of B_t. Inserting the expression for F(t) into equation A.1 and using the fact that B_{\hat{T}} = V_{thre} / \sigma, we obtain

\langle g(T) \rangle = \langle g(\hat{T}) \exp\big( \mu B_{\hat{T}} / \sigma - \mu^2 \hat{T} / (2\sigma^2) \big) \rangle
  = \int_0^\infty g(t) \exp\big( \mu V_{thre} / \sigma^2 - \mu^2 t / (2\sigma^2) \big) \cdot \frac{V_{thre}}{\sqrt{2\pi t^3 \sigma^2}} \exp\big( -V_{thre}^2 / (2\sigma^2 t) \big)\, dt
  = \int_0^\infty g(t)\, \frac{V_{thre}}{\sqrt{2\pi \sigma^2 t^3}} \exp\big( \mu V_{thre} / \sigma^2 - \mu^2 t / (2\sigma^2) - V_{thre}^2 / (2\sigma^2 t) \big)\, dt
  = \int_0^\infty g(t)\, \frac{V_{thre}}{\sqrt{2\pi \sigma^2 t^3}} \exp\big( -(\mu t - V_{thre})^2 / (2\sigma^2 t) \big)\, dt.

Therefore the distribution density of T, the first exit time of X_t, is given by

\frac{V_{thre}}{\sqrt{2\pi \sigma^2 t^3}} \exp\big( -(\mu t - V_{thre})^2 / (2\sigma^2 t) \big).
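This first-passage density is an inverse Gaussian, so it should integrate to one and, for drift \mu > 0, have mean V_{thre}/\mu. A minimal numerical check, with the illustrative choice V_{thre} = \mu = \sigma = 1 (these values are assumptions, not parameters from the text):

```python
# Numerical sanity check of the first-passage density derived above:
# with V = mu = sigma = 1 it should have total mass 1 and mean V/mu = 1.
import numpy as np

V, mu, sigma = 1.0, 1.0, 1.0          # threshold, drift, noise (illustrative)

t = np.linspace(1e-6, 60.0, 2_000_000)
dt = t[1] - t[0]
p = V / np.sqrt(2 * np.pi * sigma**2 * t**3) \
    * np.exp(-(mu * t - V)**2 / (2 * sigma**2 * t))

mass = p.sum() * dt                   # Riemann sum; should be ~1
mean = (t * p).sum() * dt             # should be ~V/mu = 1
print(round(mass, 3), round(mean, 3))
```

The truncation of the integration range at t = 60 is harmless here because the inverse Gaussian tail is exponentially small at that point.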
Acknowledgments
We are grateful to the anonymous referees for bringing the references Stevens and Zador (1998) and Gawne and Richmond (1993) to our attention and for their valuable comments on the manuscript of this article. The work was partially supported by the BBSRC and an ESEP grant of the Royal Society.

References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. New York: Springer-Verlag.
Abeles, M. (1990). Corticonics. Cambridge: Cambridge University Press.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Brown, D., & Moos, F. (1997). Onset of bursting in oxytocin cells in suckled rats. J. of Physiology, 503, 652–634.
Brown, D., & Feng, J. (1999). Is there a problem matching model and real CV(ISI)? Neurocomputing, 26–27, 117–122.
Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Phys. Rev. Lett., 79, 4505–4508.
Feng, J. (1999). Origin of firing variability of the integrate-and-fire model. Neurocomputing, 26–27, 87–91.
Feng, J., & Brown, D. (1998a). Spike output jitter, mean firing time and coefficient of variation. J. Phys. A: Math. Gen., 31, 1239–1252.
Feng, J., & Brown, D. (1998b). Impact of temporal variation and the balance between excitation and inhibition on the output of the perfect integrate-and-fire model. Biol. Cybern., 78, 369–376.
Feng, J., & Brown, D. (1999a). Coefficient of variation greater than .5: how and when? Biol. Cybern., 80, 291–297.
Feng, J., & Brown, D. (1999b). Synchronisation due to common pulsed input in Stein's model. Accepted by Phys. Rev. E.
Feng, J., Lei, G., & Qian, M. (1992). Second-order algorithms for SDE. Journal of Computational Mathematics, 10, 376–387.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19, 130–137.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurones. Science, 268, 1503–1506.
Marsalek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurones. Proceedings of the National Academy of Sciences U.S.A., 94, 735–740.
Matthews, P. B. C. (1996). Relationship of firing intervals of human motor units to the trajectory of post-spike after-hyperpolarization and synaptic noise. J. Physiology, 492, 597–628.
Musila, M., & Lánský, P. (1994).
On the interspike intervals calculated from diffusion approximations for Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Protter, P. (1980). Stochastic integration and differential equations: A new approach. Berlin: Springer-Verlag.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Sejnowski, T. J. (1998). Neurobiology: Making smooth moves. Nature, 394, 725–726.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Sheth, B. R., Sharma, J., Rao, S. C., & Sur, M. (1996). Orientation maps of subjective contours in visual cortex. Science, 274, 2110–2115.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neurosci., 1, 210–217.
Thomson, A. M. (1997). Activity-dependent properties of synaptic transmission at two classes of connections made by rat neocortical pyramidal. J. Physiology, 502, 131–147.
Tuckwell, H. C. (1988). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. J. Theor. Biol., 105, 345–368.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received September 10, 1998; accepted March 4, 1999.
LETTER
Communicated by Raphael Ritz
Weak, Stochastic Temporal Correlation of Large-Scale Synaptic Input Is a Major Determinant of Neuronal Bandwidth

David M. Halliday
Division of Neuroscience and Biomedical Systems, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, U.K.

We determine the bandwidth of a model neurone to large-scale synaptic input by assessing the frequency response between the outputs of a two-cell simulation that share a percentage of the total synaptic input. For temporally uncorrelated inputs, a large percentage of common inputs is required before the output discharges of the two cells exhibit significant correlation. In contrast, a small percentage (5%) of the total synaptic input consisting of stochastic spike trains that are weakly correlated over a broad range of frequencies exerts a clear influence on the output discharge of both cells over this range of frequencies. Inputs that are weakly correlated at a single frequency induce correlation between the output discharges only at the frequency of correlation. The strength of temporal correlation required is sufficiently weak that analysis of a sample pair of input spike trains could fail to reveal the presence of correlated input. Weak temporal correlation between inputs is therefore a major determinant of the transmission to the output discharge of frequencies present in the spike discharges of presynaptic inputs, and therefore of neural bandwidth.
1 Introduction
Mammalian central nervous system neurones have extensive dendritic trees (Mel, 1994). The combined effects of large-scale synaptic input to such structures act in several ways, which contribute directly and indirectly to the output discharge of the cell. Bernander, Douglas, Martin, and Koch (1991) and Rapp, Yarom, and Segev (1992) demonstrated that large-scale synaptic activity can alter the integrative properties of a cell by acting to reduce the input resistance and time constant. Bernander et al. (1991) further demonstrated that large-scale background synaptic activity results in an increased sensitivity to highly synchronized inputs. Murthy and Fetz (1994) and Bernander, Koch, and Usher (1994) showed that highly synchronized inputs can modulate the mean firing rate of a cell. Dendrites are known to act as low-pass filters (Rinzel & Rall, 1974), which in terms of populations of synaptic inputs results in a filtering out of higher-frequency components present in

Neural Computation 12, 693–707 (2000)
© 2000 Massachusetts Institute of Technology
694
David M. Halliday
input spike trains (Farmer, Bremner, Halliday, Rosenberg, & Stephens, 1993; Halliday, 1995). This study examines the response of cells to different strengths of temporal correlation within subpopulations of the total synaptic input. Particular attention is given to an important aspect of neuronal systems: the widespread presence of noise at all levels (Holden, 1976), which results in the random fluctuations in membrane potential observed in intracellular recordings (Calvin & Stevens, 1968) and the stochastic nature of both interspike intervals and the temporal correlation between spike trains. This study includes weak stochastic temporal correlation between presynaptic inputs, in contrast to the highly synchronized form of temporal correlation used in the above studies, where all synchronized inputs always fire together within a narrow time window. The aim is to match natural patterns of correlation more closely (Singer, 1993; Gray, 1994). We consider first the effects of uncorrelated inputs, where a large percentage of common inputs is required before the two cells exhibit significant correlation. In contrast, for inputs with weak temporal correlation, a small percentage (5–10%) of the total synaptic input can exert a clear influence on the output discharge at frequencies where the inputs are correlated. Preliminary results are presented in Halliday (1998b).

2 Model Description and Parameters
The model is based on a class of cells with extensive dendrites whose morphology and electrophysiology have been widely studied (Cullheim, Fleshman, Glenn, & Burke, 1987; Fleshman, Segev, & Burke, 1988; Rall et al., 1992): spinal motoneurones. Based on these studies, the membrane resistivity, Rm, is 11,000 Ω·cm², the cytoplasmic resistivity, Ri, is 70 Ω·cm, and the membrane capacitance, Cm, is 1.0 μF/cm². Each cell has a spherical soma of 50 μm diameter with 12 tapered dendrites, 4 each of three types: short, medium, and long. A uniform taper of 0.5 μm per 100 μm length is used for dendrite diameters in the distal direction. For each dendrite type, the initial diameters at the soma are 5, 7.5, and 10 μm; the physical lengths are 766, 1258, and 1904 μm; the electrotonic lengths are L = 0.7, 1.0, and 1.5; and the total membrane areas are 8100, 18,500, and 33,500 μm², respectively. Each dendrite is modeled as a sequence of connected cylinders of fixed electrotonic length 0.1 unit; the complete cell model has 129 compartments, representing a total membrane area of 248,200 μm², with 97% of this area in the 12 dendrites. The input resistance of the complete model neurone (including soma and tapered dendrites) to hyperpolarizing current injected at the soma is 4.92 MΩ, and the time constant is 9.7 msec. The compartmental model provides a model of current flow in dendrites accurate to second order in both space and time (Segev, Fleshman, & Burke, 1989; Mascagni, 1989). Action potential generation and afterhyperpolarization (AHP) currents are based on the model of Baldissera and Gustafsson (1974).

Dynamic Modulation of Neuronal Bandwidth
695

This incorporates a threshold-crossing mechanism in conjunction with a time-dependent potassium conductance to represent AHP currents. The three-term model for potassium conductances proposed by Baldissera and Gustafsson is reduced to its single dominant term, an exponential function with a time constant of 14 msec. At the low firing rates used in this study, it is not necessary to include the faster-acting conductances. Output spikes are generated when the membrane potential at the soma exceeds the fixed threshold of −55 mV, and the AHP conductance is activated after each output spike. This conductance produces a rapid hyperpolarization of the membrane potential toward the reversal potential of −75 mV. With constant injected current sufficient to induce an output firing rate of 12 spikes/sec, the membrane potential at the soma following each output spike decreases by approximately 12 mV. Individual excitatory synaptic inputs use a time-dependent conductance change specified by an alpha function, g_syn(t) = G_s (t/τ) e^{−t/τ} (Jack, Noble, & Tsien, 1975), with the time constant for this conductance change set at τ = 2e−4 sec (Segev, Fleshman, & Burke, 1990) and scaling factor G_s = 1.19e−8, giving a peak conductance change of 4.38 nS at t = 0.2 msec. The membrane resting potential is −70 mV, and the reversal potential for excitatory postsynaptic potentials (EPSPs) is −10 mV. When activated at the soma from rest, this conductance generates an EPSP with peak magnitude 100 μV at t = 0.525 msec, and a rise time (10% to 90%) and half-width of 0.275 msec and 2.475 msec. The EPSP parameters at the soma when activated in the most distal compartment of the short dendrite (L = 0.7) are 52.8 μV at t = 2.00 msec, and 0.900, 7.05 msec. When activated in the most distal compartment of the medium dendrite (L = 1.0), the EPSP parameters are 40.4 μV at t = 3.08 msec, and 1.35, 9.35 msec.
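Two of these figures can be cross-checked from the quoted parameters. The nominal membrane time constant is R_m C_m (a passive, isopotential approximation; the fitted 9.7 msec is expected to differ because of the cell's geometry), and the alpha-function conductance peaks at t = τ with value G_s/e:

```python
# Two consistency checks on the passive and synaptic parameters above.
import math

# Nominal membrane time constant tau_m = Rm * Cm
Rm = 11_000.0          # membrane resistivity, ohm cm^2
Cm = 1.0e-6            # membrane capacitance, F/cm^2 (1.0 uF/cm^2)
tau_m_ms = Rm * Cm * 1e3
print(round(tau_m_ms, 1))    # 11.0 ms nominal; the fitted 9.7 ms reflects
                             # loading by the soma and tapered dendrites

# Alpha-function conductance g(t) = Gs * (t/tau) * exp(-t/tau):
# its derivative vanishes at t = tau, so the peak is Gs/e at t = 0.2 msec.
Gs = 1.19e-8           # S
tau = 2.0e-4           # s
g_peak_nS = Gs * math.exp(-1.0) * 1e9
print(round(g_peak_nS, 2))   # 4.38 nS, matching the value quoted above
```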
Activation in the most distal compartment of the long dendrite (L = 1.5) gives EPSP parameters at the soma of 27.1 μV at t = 4.75 msec, and a rise time and half-width of 2.08 and 11.38 msec. Presynaptic inputs are distributed spatially uniformly by membrane area over the soma and dendrites. The level of excitation necessary to induce a constant firing rate of around 12 spikes/sec in the model cell is 31,872 EPSPs/sec. This can be achieved by 996 uniformly distributed inputs firing randomly at 32 spikes/sec. In this configuration the soma receives 32 inputs, and each dendrite of the three types (short, medium, and long) receives 33, 74, and 134 synaptic inputs, respectively, distributed by area over the surface of the dendrite. In the examples below, where periodic inputs form part of the total synaptic input, the mean rate and number of inputs are adjusted in tandem to provide the same number of EPSPs/sec acting on each compartment of the model.

3 Data Analysis and Inference of Neural Bandwidth
The simulations consist of two cells in which a percentage of the total synaptic input is common to both cells. The cells therefore exhibit a tendency for
correlated discharge, which we use to estimate the range of frequencies in the common input spike trains that are transmitted to the output discharge of the cells. The data analysis methods are those set out in Rosenberg, Amjad, Breeze, Brillinger, and Halliday (1989) and Halliday et al. (1995) for frequency domain analysis of stochastic neuronal data. We use estimates of coherence functions and cumulant density functions to characterize the correlation between the output discharges. Cumulant density functions provide a measure of correlation between spike trains as a function of time; coherence functions provide a measure of correlation as a function of frequency. The cumulant density function at lag u between two spike trains (1, 2), denoted by q_{12}(u), can be defined and estimated as the inverse Fourier transform of the cross spectrum:

q_{12}(u) = \int_{-\pi}^{\pi} f_{12}(\lambda)\, e^{i\lambda u}\, d\lambda,

where f_{12}(\lambda) is the cross spectrum between (1, 2) at frequency \lambda. The corresponding coherence function, |R_{12}(\lambda)|^2, can be defined and estimated as the magnitude of the cross spectrum squared, normalized by the product of the two auto-spectra:

|R_{12}(\lambda)|^2 = \frac{|f_{12}(\lambda)|^2}{f_{11}(\lambda)\, f_{22}(\lambda)}.
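A minimal sketch of how such a coherence estimate is computed in practice: bin the two spike trains, average periodograms over non-overlapping segments, and normalize the cross spectrum. The segment length, bin width, and input rates below are illustrative assumptions, not the authors' analysis code; the independence confidence limit 1 − 0.05^{1/(L−1)} for L segments follows the framework of Halliday et al. (1995).

```python
# Segment-averaged coherence estimate between two binned spike trains.
import numpy as np

def coherence(x, y, seg_len=256):
    """Coherence between binned (0/1) spike trains, averaged over segments."""
    n_seg = len(x) // seg_len
    X = np.fft.rfft(x[: n_seg * seg_len].reshape(n_seg, seg_len), axis=1)
    Y = np.fft.rfft(y[: n_seg * seg_len].reshape(n_seg, seg_len), axis=1)
    f11 = np.mean(np.abs(X) ** 2, axis=0)          # auto-spectrum of x
    f22 = np.mean(np.abs(Y) ** 2, axis=0)          # auto-spectrum of y
    f12 = np.mean(np.conj(X) * Y, axis=0)          # cross spectrum
    coh = np.abs(f12) ** 2 / (f11 * f22)
    limit = 1.0 - 0.05 ** (1.0 / (n_seg - 1))      # 95% limit, independence
    return coh, limit

rng = np.random.default_rng(0)
a = (rng.random(51_200) < 0.05).astype(float)      # binned Poisson-like train
b = (rng.random(51_200) < 0.05).astype(float)      # independent train
coh_ind, limit = coherence(a, b)
coh_same, _ = coherence(a, a)
print(round(float(np.mean(coh_ind[1:] > limit)), 2))  # ~0.05 by construction
print(round(float(coh_same[1:].max()), 6))            # 1.0: fully self-coherent
```

For independent trains, roughly 5% of frequency bins exceed the 95% limit, which is exactly the false-positive rate the limit is designed to give.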
For further details, including estimation procedures and the setting of confidence limits, see Halliday et al. (1995). For two identical neurones acted on by a single common input, the estimated coherence between the output discharges is proportional to the product of the input spectrum with the squared magnitude of the transfer function (defined as the ratio of the cross spectrum and input spectrum) for this single input (Rosenberg, Halliday, Breeze, & Conway, 1998). Using random, or Poisson, common spike train inputs allows the bandwidth of the cell input-output transfer function to be inferred from the estimated coherence between the output discharges of the two cells. We use a similar approach in this study, except that we are dealing with a population of common inputs to the two cells. This method has the advantage of being independent of the number of inputs applied to the cell; the application and interpretation are the same for 1 or 1000 common inputs. The estimation of a multivariate transfer function for a cell with several hundred inputs is not practical. In addition, the two-cell method will also be useful when nonlinear mechanisms are involved in synaptic transmission. Our concern is with the presence of common frequency components in the two output discharges; since the common inputs have the same effect on both cells, the two-cell model will also be useful when linear and nonlinear transformations of input signals occur. Some of the simulations use presynaptic input spike trains that are periodic. We interpret a distinct peak in the estimated coherence between the
Figure 1: Synchronized discharge in the two-cell model induced by 50% common inputs with Poisson discharges. (a) Cumulant density estimate. Dashed and solid horizontal lines are the expected value and the upper and lower 95% confidence limits, respectively, based on the assumption of independence. (b) Coherence estimate. Dashed horizontal line is the upper 95% confidence limit based on the assumption of independence. Results estimated from 400 seconds of data.
two output discharges at the frequency of the periodicity to indicate that this frequency is within the bandwidth of the cell for that particular input configuration. This inference uses an important property of a Fourier-based analytical framework: distinct frequencies persist within a (linear) system (Brillinger, 1974) and can be detected as a periodic correlation between an input and an output, or between two outputs. Inferring the properties of common inputs to neurones from the estimated correlation between their output discharges is a well-established approach in both modeling of neural systems (Perkel, Gerstein, & Moore, 1967; Moore, Segundo, Perkel, & Levitan, 1970; Rosenberg et al., 1998) and neurophysiology (Farmer et al., 1993).

4 Results

4.1 Uncorrelated Inputs. With both cells acted on by 996 inputs, each activated by a Poisson spike train of rate 32 spikes/sec, and no common inputs, both cells fire with a mean rate of around 12 spikes/sec, and, as expected, their output discharges are uncorrelated. When 50% of the inputs are applied commonly, the cells share 15,936 EPSPs/sec, and their output discharges exhibit a tendency for synchronous discharge. This results in a peak at time zero in the time domain correlation (see Figure 1a), estimated as a cumulant density function (Halliday et al., 1995). In the frequency domain, significant correlation is present at frequencies up to around 50 Hz in the estimated coherence (see Figure 1b). Therefore the bandwidth of this configuration, which reflects the ability of 50% of the synaptic inputs activated by random spike trains to influence the output discharge, is around 50 Hz.

Figure 2: Response of two-cell model to different percentages of common Poisson inputs. Estimated coherence between output discharges for (a) 10%, (b) 20%, (c) 40%, and (d) 80% common Poisson inputs applied uniformly over the entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data for each configuration.

In Figure 2 are shown coherence estimates between the output discharges when each cell receives the same number of EPSPs/sec as above, except that the percentage of common EPSPs/sec is varied from 10% to 80%. These common inputs are activated by Poisson spike trains, their location is distributed uniformly by area over the cell body and dendrites, and they have the same location in both cells. The coherence estimate with 10% common inputs (see Figure 2a) has no significant features; therefore, the combined effect of 10% of the total synaptic input is insufficient to exert any influence on the timing of output spikes. The configuration with 80% common input (see Figure 2d) has a coherence estimate with significant values up to frequencies in excess of 200 Hz. The bandwidth of this configuration is therefore above 200 Hz. The bandwidth of the neurone therefore depends on the percentage of the total synaptic input that is considered to provide the input.

Figure 3: Response of two-cell model to different percentages of common uncorrelated periodic 25 spikes/sec inputs. Estimated coherence between output discharges for (a) 10%, (b) 20%, (c) 40%, and (d) 80% common inputs applied uniformly over the entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data for each configuration.

Coherence estimates provide a normative measure of association on a scale from 0 to 1 (Rosenberg et al., 1989; Halliday et al., 1995). The peak coherence between the two output discharges with 80% common synaptic input is 0.15. Thus, even with 80% common synaptic input to the two cells, the discharge of one cell can predict a maximum of 15% of the variability in the discharge of the companion cell. The above configurations using Poisson spike trains are equivalent to probing a continuous system with a white noise signal; both types of signal have a flat spectral estimate. Many neural systems have pathways that carry periodic spike trains. In addition, there has been recent interest in the role of rhythmic neuronal activity in neural integration (Singer, 1993; Gray, 1994). This raises the question of whether the bandwidths inferred from the coherence estimates in Figure 2 are valid for spike train inputs that contain distinct periodicities. This is explored in Figure 3, which shows
coherence estimates, with the common inputs activated by periodic spike trains with a mean rate of 25 spikes/sec and a coefficient of variation (COV) of 0.1. As above, in determining the number of inputs, the total number of EPSPs/sec acting on each compartment remained fixed. For example, the soma of each cell receives 1024 EPSPs/sec, which with 10% common synaptic input consists of 4 periodic inputs firing at 25 spikes/sec common to both cells and 37 random inputs firing at 25 spikes/sec applied independently to each cell. Figure 3 demonstrates that the behavior of the simulation in response to a varying percentage of common periodic inputs is similar to that for random inputs. With only 10% of the inputs firing at 25 spikes/sec, there is no significant coherence at 25 Hz (see Figure 3a). An increasing percentage of inputs firing at 25 spikes/sec results in an increase in the magnitude of the coherence estimate at 25 Hz, and also an increase in the range of frequencies present, seen as peaks at harmonics of the 25 Hz input.

4.2 Correlated Inputs. The simulations suggest that populations of synaptic inputs activated by uncorrelated spike trains that constitute less than 20% of the total synaptic input to a cell are unlikely to exert any influence on the timing of output spikes. Next we explore the response to input spike trains that are themselves correlated. Each spike train is generated using an integrate-and-fire encoder, and the resulting correlation within each population of common inputs is both weak and stochastic in nature (Halliday, 1998a). These integrate-and-fire encoders are much simpler than the detailed conductance-based compartmental neurone model. They are used to generate correlated spike trains and are not intended to represent presynaptic circuitry acting on motoneurones.
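The idea behind such encoders can be sketched as follows: simple integrate-and-fire units each receive independent noisy drive plus a small contribution from one shared Poisson train, so their output spike trains are weakly correlated across the population. All parameter values below are assumptions for illustration; they are not those of Halliday (1998a).

```python
# Sketch of a population of integrate-and-fire encoders sharing one weak
# common Poisson input; parameter values are illustrative assumptions.
import numpy as np

def encoders(n_cells, n_steps, shared_weight, rng):
    """Spike matrix (n_cells x n_steps) from simple IF encoders, 1 ms bins."""
    common = rng.random(n_steps) < 0.032            # shared ~32 Hz train
    spikes = np.zeros((n_cells, n_steps), dtype=bool)
    v = np.zeros(n_cells)
    for t in range(n_steps):
        v += rng.normal(0.02, 0.02, n_cells)        # independent noisy drive
        v += shared_weight * common[t]              # weak shared drive
        fired = v >= 1.0                            # threshold crossing
        spikes[fired, t] = True
        v[fired] = 0.0                              # reset after a spike
    return spikes

rng = np.random.default_rng(2)
spk = encoders(n_cells=20, n_steps=20_000, shared_weight=0.05, rng=rng)
rates = spk.mean(axis=1) / 0.001                    # spikes/sec per encoder
print(round(float(rates.mean()), 1))                # ~20 spikes/sec here
```

Because the shared drive is a small fraction of each encoder's total input, any one pair of outputs is only weakly correlated, consistent with the point that a single sample pair may show no significant correlation.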
First we examine the response of the paired motoneurone model to a population of common inputs with the same firing rate (32 spikes/sec) and COV (1.0) as the Poisson inputs used in Figures 1 and 2, except that they are weakly correlated over a broad range of frequencies. These inputs are generated using the spike generators described in Halliday (1998a), with the temporal correlation generated by a single Poisson spike train of rate 32 spikes/sec applied as common input to all spike generators. The resulting correlation between sample pairs of spike trains is broadband and weak. Because of this weak temporal correlation, analysis of a sample pair of spike trains of duration 100 seconds can fail to reveal any significant correlation between inputs (Halliday, 1998a). However, combined analysis of 20 sample pairs of 100 seconds duration using the technique of pooled coherence (Amjad, Halliday, Rosenberg, & Conway, 1997; Halliday, 1998a) reveals correlation within the population of inputs up to 100 Hz, with a maximum estimated coherence of 0.013. Figure 4a shows the coherence estimate between the two output discharges, with 10% common input supplied by 100 of these correlated inputs distributed uniformly by area as before. This estimate is similar to that obtained with 60% common Poisson inputs (not
Figure 4: Response of two-cell model to populations of inputs with weak temporal correlation. Estimated coherence between output discharges for (a) 10% common correlated broadband inputs and (b) 5% common correlated periodic inputs at 10 spikes/sec plus 5% common correlated periodic inputs at 25 spikes/sec, applied uniformly over the entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data in (a) and 400 seconds of data in (b).
shown; cf. Figures 2c and 2d). Weakly correlated inputs exert a significantly stronger influence on the output discharge of the cell than uncorrelated inputs. The configuration with 10% weakly correlated inputs has a bandwidth in excess of 100 Hz, in contrast to the configuration with 10% uncorrelated inputs, which has no significant effect on the timing of output spikes. Next we examine the effects of applying two sets of common periodic spike trains, with each set assigned 5% of the total number of EPSPs/sec acting on the cells, and consisting of 160 inputs firing at 10 spikes/sec with COV 0.1 and 64 inputs firing at 25 spikes/sec with COV 0.1. These spike trains have a similar strength of correlation to the broadband inputs above, except that the correlation is at the firing frequency of each set, 10 Hz and 25 Hz, respectively (Halliday, 1998a). The coherence estimate between the two output discharges has clear peaks at 10 Hz and 25 Hz. Thus, the output discharge supports both periodicities present in the two populations of weakly correlated common inputs. The 5% of inputs at 25 spikes/sec induce a peak coherence similar to that in Figure 3c for 40% common uncorrelated 25 spikes/sec inputs. A small percentage of weakly correlated periodic inputs is able to exert an influence on the timing of output spikes whose relative magnitude is far in excess of the percentage of input EPSPs involved. Weak stochastic correlation between inputs (which may not be detected) is converted into stronger correlation between outputs. A small change in the temporal correlation structure of a subset of the total synaptic input, without any alteration in mean firing rate, can alter the bandwidth of the cells.
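One simple way to realize a "periodic" spike train with a given mean rate and COV (a sketch, not necessarily the generator used in these simulations) is to draw interspike intervals from a gamma distribution, whose COV is 1/sqrt(shape):

```python
# Periodic spike train with specified rate and COV via gamma-distributed
# interspike intervals; the gamma choice is an illustrative assumption.
import numpy as np

def periodic_train(rate_hz, cov, n_spikes, rng):
    """Spike times (seconds) with gamma-distributed interspike intervals."""
    shape = 1.0 / cov**2                  # gamma COV = 1/sqrt(shape)
    scale = 1.0 / (rate_hz * shape)       # mean ISI = shape * scale = 1/rate
    return np.cumsum(rng.gamma(shape, scale, n_spikes))

rng = np.random.default_rng(1)
times = periodic_train(25.0, 0.1, 10_000, rng)
isis = np.diff(times)
print(round(float(1.0 / isis.mean()), 1))          # ~25.0 spikes/sec
print(round(float(isis.std() / isis.mean()), 2))   # ~0.1
```

With COV 0.1, the gamma shape parameter is 100, giving a nearly regular train whose spectrum is concentrated at the firing frequency and its harmonics.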
702
David M. Halliday
5 Discussion
The main finding of this study is the dynamic nature of the cell bandwidth, which, for a fixed level of synaptic excitation (EPSPs/sec), exhibits systematic changes in response to changes in the number of presynaptic inputs considered to provide the input signal and in their temporal correlation characteristics. For uncorrelated inputs, neural bandwidth is determined by the percentage of the total synaptic input that is considered to provide the input signal. Populations that constitute less than 10% of the total synaptic input have no significant influence on the output discharge (see Figures 2a and 3a). In contrast, weak temporal correlation within 5–10% of the total synaptic input has considerable influence on the timing of output spikes (see Figure 4). Correlation between the output discharges occurs at the frequencies at which inputs are correlated. Weak temporal correlation between inputs is converted into stronger correlation between outputs. The question of neural bandwidth has been addressed previously. Farmer et al. (1993) used a similar two-cell simulation of a point neurone model and found that EPSPs with longer rise times and half-widths acted to filter out higher-frequency components. Based on a two-cell model with 96.5% of the total inputs common to both cells, they concluded that the bandwidth of their point neurone model was in excess of 250 Hz. Using a two-cell model of a motoneurone with a single attached dendrite and 75% of the total synaptic input common to both cells, Halliday (1995) concluded that a distal input location could result in a bandwidth of around 50 Hz. The findings of this study are consistent with these previous results and show that these different estimates of the cell bandwidth result from the different percentages of common inputs applied.
The classical view of the input-output relationship for motoneurones is that, over a wide range of discharge rates, the steady-state discharge rate is determined by a simple linear relation proportional to the effective synaptic current input to the cell (Binder, Heckman, & Powers, 1993). This study is concerned with spatial-temporal interactions on a millisecond timescale, which for large-scale synaptic input result in membrane potential fluctuations (Calvin & Stevens, 1968). It is the random nature of these fluctuations, and the interactions between the populations of common and independent inputs, that produces the modulation of neural bandwidth for the same level of average excitation. Two previous studies of the effects of input synchrony have concentrated on variations in mean firing rate (Murthy & Fetz, 1994; Bernander et al., 1994). Both sets of authors found that increased synchronization led to a decrease in output firing rate. The proposed mechanism was an "overcrowding" effect in which excess depolarizing input, above that required to fire the cell, was effectively wasted during the refractory period. For the present simulations the average output firing rate was 12.32 spikes/sec for the Poisson input configurations (see Figures 1 and 2), 12.37 spikes/sec for the weakly correlated broadband inputs (see Figure 4a), and 12.16 spikes/sec for the
Dynamic Modulation of Neuronal Bandwidth
703
weakly correlated periodic inputs (see Figure 4b). This study differs from these in its use of weak, stochastically correlated inputs. Generation of spike trains using integrate-and-fire encoders results in patterns of correlation that mimic natural patterns of correlation more realistically than the highly synchronized inputs used by Murthy and Fetz (1994) and Bernander et al. (1994), where all correlated inputs fire synchronously with probability 1 in a narrow time window. Bernander et al. (1991) and Rapp et al. (1992) found that large-scale synaptic activity can alter the basic characteristics of a cell, resulting in a reduction in cell input resistance and time constant. The present model exhibits similar behavior; when excited by 31,872 EPSPs/sec, the input resistance is 3.8 MΩ, the time constant is 6.45 msec, and the average membrane potential is −55.1 mV. This compares with 4.9 MΩ, 9.7 msec, and −70 mV with no synaptic inputs. Bernander et al. (1991) concluded that large-scale background synaptic activity also results in increased sensitivity to synchronized inputs, which they demonstrated by contrasting the cell response to 150 inputs that were either exactly synchronized every 25 msec or distributed throughout time. There are two differences in the methods of this study. First, Bernander et al. (1991) and Rapp et al. (1992) replaced individual background synaptic conductances with a time-averaged constant calculated to provide the same average level of excitation to each compartment. This study is concerned with the stochastic nature of neuronal spike trains and therefore models the time course of every individual synaptic input in full, for all synaptic inputs to both cells. In the absence of any other synaptic inputs, a constant background conductance will result in a constant membrane potential, which will not exhibit the random fluctuations observed in intracellular recordings (Calvin & Stevens, 1968; see Bernander et al., 1991, Figure 1c).
This will bias the response of the model in favor of any synaptic inputs that are applied, since these will be the only source of membrane potential fluctuations, and the cell will always respond with an increased depolarization to excitatory inputs. This is in contrast to the present model, where the synaptic inputs that provide the input signal interact and compete with the background synaptic inputs on a millisecond timescale. Theoretical studies have indicated that the addition of noise to a system can result in an improved signal-to-noise ratio, particularly for periodic signals in the presence of noise (Wiesenfeld & Moss, 1995; Bezrukov & Vodyanoy, 1997). In such a case it is important to model the stochastic nature of all inputs to the cell; indeed, in the light of this work on stochastic resonance, the presence of stochastic background synaptic activity may contribute to the extreme sensitivity of the cell to other temporally correlated synaptic inputs. The second difference between this study and that of Bernander et al. (1991) is in the form of correlated inputs. Bernander et al. (1991) used precisely synchronized inputs, in contrast to the weak, stochastic temporal correlation used in our populations of inputs. This temporal correlation is sufficiently weak that examination of a sample
pair of spike trains selected from the population may fail to reveal the presence of any correlation (Halliday, 1998a). Despite the presence of large-scale background activity, the model cell is able to detect the presence of weak, stochastic temporal correlation in a small fraction of the total input, without any change in the overall level of excitation (EPSPs/sec). This high selectivity to small changes in the temporal characteristics of subpopulations of the total synaptic input suggests that weak, stochastic temporal correlation between presynaptic inputs is a major determinant of neural bandwidth. The results of this study are also relevant to studies considering the relative merits of rate coding, coincidence detection, and temporal integration as mechanisms involved in neural coding (Shadlen & Newsome, 1994, 1995, 1998; Softky, 1995; König, Engel, & Singer, 1996), which have led to the view that temporal integration, where many PSPs contribute to the generation of each output spike, will not preserve temporal coding in input spike trains (Shadlen & Newsome, 1994; König et al., 1996). In the simulations here, around 2666 EPSPs contribute to each spike; temporal integration would therefore appear to be the dominant mode of operation. Nevertheless, the presence of weak stochastic temporal correlation in 5–10% of the total synaptic input allows these inputs to exert a strong influence on the timing of output spikes, allowing the temporal coding in the inputs to be transmitted to the output discharges. This result is not consistent with the suggestion that temporal information will not propagate through neurones whose mode of operation is temporal integration. An alternative mode of operation proposed for neurones is coincidence detection (Abeles, 1982; Softky, 1994; König et al., 1996).
It is not clear that this is an appropriate mechanism to describe the present simulations, where weakly correlated inputs are distributed over the extensive dendritic tree, with propagation delays of up to 5 ms for distal inputs to arrive at the somatic spike generation site. In addition, the weak and stochastic nature of the correlation structure in the common inputs does not produce reliable coincidences in the input spike trains (Halliday, 1998a), but rather a diffuse temporal correlation distributed across all the spike trains in each population. We therefore suggest that the mode of operation of the present neurones is temporal integration with correlation detection. The results also demonstrate that correlation between neurones does not require any alteration in output firing rates (deCharms & Merzenich, 1996; Riehle, Grün, Diesmann, & Aertsen, 1997). Weak correlation between inputs results in stronger correlation between outputs; reciprocal connections in networks of neurones may provide a means of sustaining these oscillations. In such networks, conduction delays are an important factor (König & Schillen, 1991); however, in the present data, dendritic conduction delays of up to 5 ms do not affect the ability of the neurone to detect temporal correlation between inputs. One unresolved issue in cortical signaling is the variability exhibited in neuronal discharges, which have a COV ≈ 1.0 (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998). In this study the neurones
have a regular output discharge. However, all inputs in the example in Figure 4a have a COV ≈ 1.0, yet the cell is sensitive to weak stochastic temporal correlation between inputs distributed over the dendritic tree. Further simulation studies are necessary to ascertain whether weak stochastic temporal correlation in populations of inputs contributes to the variability of cortical neurone discharges.

Acknowledgments
This research was undertaken initially at the Information Science Division, NTT Basic Research Labs, Musashino, Tokyo. Completion was supported in part by grants from the Joint Research Council/HCI Cognitive Science Initiative and the Wellcome Trust (036928; 048128).

References

Abeles, M. (1982). Role of cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Amjad, A. M., Halliday, D. M., Rosenberg, J. R., & Conway, B. A. (1997). An extended difference of coherence test for comparing and combining several independent coherence estimates—theory and application to the study of motor units and physiological tremor. Journal of Neuroscience Methods, 73, 69–79.
Baldissera, F., & Gustafsson, B. (1974). Afterhyperpolarization conductance time course in lumbar motoneurones of the cat. Acta Physiologica Scandinavica, 91, 512–527.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proceedings of the National Academy of Sciences, 88, 11569–11573.
Bernander, Ö., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Computation, 6, 622–641.
Bezrukov, S. M., & Vodyanoy, I. (1997). Stochastic resonance in non-dynamical systems without response thresholds. Nature, 385, 319–321.
Binder, M. D., Heckman, C. J., & Powers, R. K. (1993). How different afferent inputs control motoneuron discharge and the output of the motoneuron pool. Current Opinion in Neurobiology, 3, 1028–1034.
Brillinger, D. R. (1974). Fourier analysis of stationary processes. Proceedings of the IEEE, 62, 1628–1643.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. Journal of Neurophysiology, 31, 574–587.
Cullheim, S., Fleshman, J. W., Glenn, L. L., & Burke, R. E. (1987). Membrane area and dendritic structure in type-identified triceps surae alpha motoneurones.
Journal of Comparative Neurology, 255, 68–81.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613.
Farmer, S. F., Bremner, F. D., Halliday, D. M., Rosenberg, J. R., & Stephens, J. A. (1993). The frequency content of common synaptic inputs to motoneurones studied during voluntary isometric contraction in man. Journal of Physiology, 470, 127–155.
Fleshman, J. R., Segev, I., & Burke, R. E. (1988). Electrotonic architecture of type-identified α-motoneurones in the cat spinal cord. Journal of Neurophysiology, 60, 60–85.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. Journal of Computational Neuroscience, 1, 11–38.
Halliday, D. M. (1995). Effects of electrotonic spread of EPSPs on synaptic transmission in motoneurones—A simulation study. In A. Taylor, M. H. Gladden, & R. Durbaba (Eds.), Alpha and gamma motor systems (pp. 337–339). New York: Plenum.
Halliday, D. M. (1998a). Generation and characterization of correlated spike trains. Computers in Biology and Medicine, 28, 143–152.
Halliday, D. M. (1998b). Weak, stochastic temporal correlation of large scale synaptic input can modulate neuronal bandwidth. Society for Neuroscience Abstracts, 24, 1151.
Halliday, D. M., Rosenberg, J. R., Amjad, A. M., Breeze, P., Conway, B. A., & Farmer, S. F. (1995). A framework for the analysis of mixed time series/point process data—Theory and application to the study of physiological tremor, single motor unit discharges and electromyograms. Progress in Biophysics and Molecular Biology, 64, 237–278.
Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: Springer-Verlag.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1975). Electric current flow in excitable cells (2nd ed.). Oxford: Clarendon Press.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19, 130–137.
König, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses. I. Synchronization. Neural Computation, 3, 155–166.
Mascagni, M. V. (1989).
Numerical methods for neuronal modeling. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 439–484). Cambridge, MA: MIT Press.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Moore, G. P., Segundo, J. P., Perkel, D. H., & Levitan, H. (1970). Statistical signs of synaptic interaction in neurones. Biophysical Journal, 10, 876–900.
Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Computation, 6, 1111–1126.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7, 419–440.
Rall, W., Burke, R. E., Holmes, W. R., Jack, J. J. B., Redman, S. J., & Segev, I. (1992). Matching dendritic neuron models to experimental data. Physiological Reviews, 72 (Suppl.), S159–S186.
Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Computation, 4, 518–533.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rinzel, J., & Rall, W. (1974). Transient response in a dendritic neuron model for current injected at one branch. Biophysical Journal, 14, 759–789.
Rosenberg, J. R., Amjad, A. M., Breeze, P., Brillinger, D. R., & Halliday, D. M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Progress in Biophysics and Molecular Biology, 53, 1–31.
Rosenberg, J. R., Halliday, D. M., Breeze, P., & Conway, B. A. (1998). Identification of patterns of neuronal activity—partial spectra, partial coherence, and neuronal interactions. Journal of Neuroscience Methods, 83, 57–72.
Segev, I., Fleshman, J. R., & Burke, R. E. (1989). Compartmental models of complex neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 63–96). Cambridge, MA: MIT Press.
Segev, I., Fleshman, J. R., & Burke, R. E. (1990). Computer simulations of group Ia EPSPs using morphologically realistic models of cat α-motoneurones. Journal of Neurophysiology, 64, 648–660.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. Journal of Neuroscience, 18, 3870–3896.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annual Review of Physiology, 55, 349–374.
Softky, W. R. (1994).
Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Softky, W. R. (1995). Simple versus efficient codes. Current Opinion in Neurobiology, 5, 239–247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benefits of noise—from ice ages to crayfish and SQUIDs. Nature, 373, 33–36.

Received July 24, 1998; accepted February 26, 1999.
LETTER
Communicated by Steven Nowlan
Second-Order Learning Algorithm with Squared Penalty Term

Kazumi Saito
Ryohei Nakano*

NTT Communication Science Laboratories, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

This article compares three penalty terms with respect to the efficiency of supervised learning, using first- and second-order off-line learning algorithms and a first-order on-line algorithm. Our experiments showed that, for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order learning algorithm drastically improves the convergence performance in comparison to the other combinations, at the same time bringing about excellent generalization performance. Moreover, in order to understand how differently each penalty term works, a function surface evaluation is described. Finally, we show how cross-validation can be applied to find an optimal penalty factor.

1 Introduction
It has been found empirically that adding a penalty term to the objective function in the learning of neural networks can lead to significant improvements in network generalization. Such terms have been proposed on the basis of several viewpoints, such as weight decay (Hinton, 1987), regularization (Poggio & Girosi, 1990), function smoothing (Bishop, 1995), weight pruning (Hanson & Pratt, 1989; Ishikawa, 1996), and Bayesian priors (MacKay, 1992; Williams, 1995). Some are calculated using simple arithmetic operations, while others use higher-order derivatives. The most important criterion for evaluating these terms is how much they improve generalization performance, but learning efficiency is also an important criterion in large-scale practical problems; computationally demanding terms are hardly applicable to such problems. It is also natural to expect that the effects of penalty terms depend on the learning algorithm; thus, comparative evaluations are needed. We have already outlined an evaluation of the learning performance of first- and second-order off-line learning algorithms with three penalty terms, using an artificial regression problem (Saito & Nakano, 1997b). This article gives the details of that evaluation and also includes an evaluation using an actual classification problem. It also compares the learning performance of a standard on-line learning algorithm with the three penalty terms, using both the artificial regression problem and the real classification problem. Section 2 explains the framework of the learning and presents a second-order learning algorithm with the penalty terms. Sections 3 and 4 show experimental results for the regression problem and the real classification problem, respectively. Section 5 describes a function surface evaluation, in order to explain how differently each penalty term works. Section 6 shows how cross-validation can be applied to find an optimal penalty factor.

* Present address: Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan.

Neural Computation 12, 709–729 (2000)
© 2000 Massachusetts Institute of Technology

2 Learning with Penalty Term

2.1 Framework. Let $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a set of examples, where $x_t$ denotes an n-dimensional input vector and $y_t$ a k-dimensional target output vector corresponding to $x_t$. In a three-layer neural network, let $h$ be the number of hidden units, $w_j = (w_{j0}, \ldots, w_{jn})^T$ be the weight vector between all the input units and hidden unit $j$, and $c_i = (c_{i0}, \ldots, c_{ih})^T$ be the weight
vector between all the hidden units and output unit $i$; $w_{j0}$ and $c_{i0}$ are bias terms, and $x_{t0}$ is set to 1. Note that $a^T$ denotes the transpose of a vector $a$. Hereafter, the vector consisting of all parameters, $(c_1^T, \ldots, c_k^T, w_1^T, \ldots, w_h^T)^T$, is simply written $W = (w_1, \ldots, w_N)^T$, where $N\ (= h(n+1) + k(h+1))$ denotes the dimension of $W$. Then the output value of output unit $i$ is defined as follows:

$$z_i(x_t; W) = c_{i0} + \sum_{j=1}^{h} c_{ij}\,\sigma(w_j^T x_t),$$
where $\sigma(u)$ represents a sigmoidal function, $\sigma(u) = 1/(1 + e^{-u})$.

First, we consider k-dimensional regression problems. Provided that each target output value independently includes noise with a mean of 0 and an unknown common standard deviation, the learning error term can be defined using a squared-error criterion:

$$f(W) = \frac{1}{2}\sum_{t=1}^{m}\sum_{i=1}^{k} \left(y_{ti} - z_i(x_t; W)\right)^2.$$
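As a concrete illustration of the definitions above (toy sizes and random data are assumptions for the example, not the paper's setup), the output $z_i$ and the squared-error term can be written directly:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def net_output(x, w, c):
    """z_i(x; W) for a three-layer network.
    x: (n+1,) input with x[0] = 1 (bias); w: (h, n+1); c: (k, h+1)."""
    s = sigmoid(w @ x)             # hidden activations, shape (h,)
    return c[:, 0] + c[:, 1:] @ s  # c_i0 + sum_j c_ij * sigma(w_j^T x)

def squared_error(X, Y, w, c):
    """f(W) = 1/2 * sum_t sum_i (y_ti - z_i(x_t; W))^2."""
    Z = np.array([net_output(x, w, c) for x in X])
    return 0.5 * np.sum((Y - Z) ** 2)

rng = np.random.default_rng(1)
n, h, k, m = 2, 3, 2, 5  # toy dimensions (illustrative only)
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
Y = rng.normal(size=(m, k))
w = rng.normal(size=(h, n + 1))
c = rng.normal(size=(k, h + 1))
print(squared_error(X, Y, w, c))
```

By construction the error is nonnegative and vanishes exactly when the targets equal the network outputs.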
Next, we consider k-class classification problems. Note that $y_{ti} \in \{0, 1\}$ because each target value indicates a class label. Provided that each example belongs to exactly one class, the learning error term can be defined using a soft-max function and a cross-entropy criterion:

$$f(W) = -\sum_{t=1}^{m}\sum_{i=1}^{k} y_{ti} \log\!\left(\frac{\exp(z_i(x_t; W))}{\sum_{p=1}^{k}\exp(z_p(x_t; W))}\right).$$
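The soft-max cross-entropy term can likewise be sketched (the logits and one-hot targets below are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(Z, Y):
    """f(W) = -sum_t sum_i y_ti * log softmax(z(x_t; W))_i.
    Z: (m, k) network outputs; Y: (m, k) one-hot class labels."""
    return -sum(Y[t] @ np.log(softmax(Z[t])) for t in range(len(Z)))

Z = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
Y = np.array([[1, 0, 0], [0, 0, 1]])
print(cross_entropy(Z, Y))
```

The loss is strictly positive unless the soft-max probability of the correct class is 1, consistent with the criterion's unique (weak) minimum structure discussed in the text.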
Incidentally, for classification problems, we can consider another learning error term using the squared-error criterion: $\frac{1}{2}\sum_{t=1}^{m}\sum_{i=1}^{k}(y_{ti} - \sigma(z_i(x_t; W)))^2$. However, when $z_i$ is defined as a linear output with respect to the input vector, we can easily see that the Hessian matrix of the cross-entropy criterion is always nonnegative definite, but that of the squared-error criterion is not. The learning error term using the cross-entropy criterion has a unique (weak) local minimum. Thus, even if $z_i$ is the nonlinear output value of the neural network, learning with the cross-entropy criterion will be easier than with the squared-error criterion.

In this article, we consider the following three penalty terms:

$$V_1(W) = \frac{1}{2}\sum_{k=1}^{N} w_k^2, \qquad V_2(W) = \sum_{k=1}^{N} |w_k|, \qquad V_3(W) = \frac{1}{2}\sum_{k=1}^{N}\frac{w_k^2}{1 + w_k^2}. \tag{2.1}$$
Hereafter, $V_1$, $V_2$, and $V_3$ are referred to as the squared (Hinton, 1987; MacKay, 1992), absolute (Williams, 1995; Ishikawa, 1996), and normalized (Hanson & Pratt, 1989) penalty terms, respectively. Learning with one of these three terms can then be defined as the problem of minimizing the following objective function:

$$F_i(W) = f(W) + \mu V_i(W), \tag{2.2}$$

where $\mu$ is a penalty factor. Although each of the single penalty terms is used for our evaluation, using adequate multiple penalty terms has lately attracted considerable attention. Ripley (1996) has pointed out that a single penalty term will make sense only if the inputs and outputs have been rescaled to the range [0, 1]. In data analysis, such normalization is usually employed; thus, the single-penalty-term approach remains practical and significant due to its simplicity. One limitation of the single-penalty-term approach is that it is inconsistent with certain scaling properties of network mappings (Ripley, 1996). From this standpoint, we should consider multiple penalty factors; however, if we consider a practically meaningful scaling of inputs and outputs such as $\tilde{x}_m = a_m x_m + b_m$ and $\tilde{y}_i = c_i y_i + d_i$, we have to optimize as many penalty factors as the number of inputs and outputs. The Bayesian approach (MacKay, 1992; Williams, 1995) may make it possible to estimate numerous penalty factors, but it has been the subject of partial implementation and considerable controversy (Ripley, 1996). Moreover, a constant penalty factor is used for our evaluation. Although the Bayesian approach can adaptively optimize penalty factors together with network weights, the following approach is still practical and useful: train several networks with different penalty factors, estimate the generalization error for each, and then choose the penalty factor that minimizes the estimated generalization error.
712
Kazumi Saito and Ryohei Nakano
Table 1: Classification of Supervised Learning Algorithms.

Search Direction | Step Length: Fixed    | Step Length: Variable
First order      | Backpropagation, etc. | Silva-Almeida method, etc.
Second order     | Newton method         | SCG, OSS, BPQ, etc.

Note: SCG = scaled conjugate gradient; OSS = one-step secant; BPQ = backpropagation based on quasi-Newton.
2.2 Learning Algorithms. Since a large number of algorithms have been proposed for supervised neural network learning, it is almost impossible to evaluate the learning performance of all of them in combination with each penalty term. In general, supervised learning algorithms can be classified into on-line and off-line algorithms. In this article, as a representative on-line algorithm, the on-line backpropagation (BP) algorithm with a fixed learning rate is used for evaluation. The basic characteristics of off-line algorithms, on the other hand, are determined by the methods used to calculate a search direction and its step length. Thus, as shown in Table 1, we classify the search-direction calculation methods into first and second order, and the step-length calculation methods into fixed and variable; a representative algorithm in each category is then used for evaluation. Since it is well known that the off-line BP algorithm with a fixed step length (learning rate) is very inefficient, and Newton techniques are hardly applicable even to midscale problems, hereafter we consider only the first- and second-order methods whose step length is adjusted.

2.3 Second-Order Algorithm with Penalty Term. Among the second-order algorithms shown in Table 2, each of the existing representative methods has advantages and disadvantages with respect to applicability to large-scale problems and efficiency under an inaccurate step length: second-order algorithms based on Levenberg-Marquardt or quasi-Newton methods cannot suitably scale up to large problems,¹ while a line search to calculate an adequate step length is indispensable for second-order algorithms based on quasi-Newton or conjugate gradient methods.² In order to make the most of the second-order approach, we employ a newly invented second-order learning algorithm based on a quasi-Newton method (BPQ) (Saito & Nakano, 1997a).
¹ Levenberg-Marquardt algorithms require O(N²m) operations to calculate the descent direction during any one iteration; standard quasi-Newton algorithms require N² storage space to maintain an approximation to an inverse Hessian matrix.
² Successful convergence of conjugate gradient algorithms relies heavily on the accuracy of the line search; thus, they generally incur a nonnegligible computational load. Successful convergence of quasi-Newton algorithms, by contrast, is theoretically guaranteed even with an inaccurate line search, provided certain conditions are satisfied.
Second-Order Learning Algorithm
713
Table 2: Second-Order Learning Methods.

Learning Algorithm        | Applicable Scale of Network | Required Accuracy for Step-Length Calculation
Newton method             | Small                       | Unnecessary to calculate
Quasi-Newton method       | Middle                      | Moderate
Conjugate gradient method | Moderately large            | High
Although we could employ other second-order learning algorithms such as scaled conjugate gradient (Møller, 1993) or one-step secant (Battiti, 1992), the BPQ worked the most efficiently among them in our own experience (Saito & Nakano, 1997a). In the BPQ, the descent direction $\Delta W$ is calculated on the basis of a partial Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, and a reasonably accurate step length $\lambda$ is efficiently calculated as the minimal point of a second-order approximation. The partial BFGS update can be applied directly, while the step length $\lambda$ is evaluated as follows:

$$\lambda = \frac{-\nabla F_i(W)^T \Delta W}{\Delta W^T \nabla^2 F_i(W)\, \Delta W} = \frac{-\nabla F_i(W)^T \Delta W}{\Delta W^T \nabla^2 f(W)\, \Delta W + \mu\, \Delta W^T \nabla^2 V_i(W)\, \Delta W}. \tag{2.3}$$

The quadratic form for the training error term, $\Delta W^T \nabla^2 f(W)\, \Delta W$, can be calculated efficiently with computational complexity $Nm + O(hm)$ using the procedure of the BPQ, while the quadratic forms for the penalty terms are calculated as follows:

$$\Delta W^T \nabla^2 V_1(W)\, \Delta W = \sum_{k=1}^{N} \Delta w_k^2, \qquad \Delta W^T \nabla^2 V_2(W)\, \Delta W = 0,$$
$$\Delta W^T \nabla^2 V_3(W)\, \Delta W = \sum_{k=1}^{N} \frac{(1 - 3w_k^2)\,\Delta w_k^2}{(1 + w_k^2)^3}. \tag{2.4}$$
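The penalty-term quadratic forms of equation 2.4 are cheap to compute directly; a sketch (the random test vectors are illustrative):

```python
import numpy as np

def quad_V1(w, dw):
    """dW^T grad2(V1) dW = sum dw_k^2 (Hessian of V1 is the identity)."""
    return np.sum(dw ** 2)

def quad_V2(w, dw):
    """dW^T grad2(V2) dW = 0 (away from w_k = 0 the absolute term is linear)."""
    return 0.0

def quad_V3(w, dw):
    """dW^T grad2(V3) dW = sum (1 - 3 w_k^2) dw_k^2 / (1 + w_k^2)^3."""
    return np.sum((1 - 3 * w ** 2) * dw ** 2 / (1 + w ** 2) ** 3)

rng = np.random.default_rng(2)
w, dw = rng.normal(size=4), rng.normal(size=4)
print(quad_V1(w, dw), quad_V2(w, dw), quad_V3(w, dw))
```

As the text notes, `quad_V1` is always nonnegative, `quad_V2` contributes nothing, and `quad_V3` goes negative once many of the |w_k| exceed 1/√3.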
Note that in the step-length calculation, $\Delta W^T \nabla^2 F_i(W)\, \Delta W$ is expected to be positive. The three terms have different effects on $\Delta W^T \nabla^2 F_i(W)\, \Delta W$: the squared penalty term always adds a nonnegative value, the absolute penalty term has no effect, and the normalized penalty term may add a negative value if many weight values are larger than $1/\sqrt{3}$. This indicates that the squared penalty term has a very desirable feature.

3 Experiments Using Regression Problem

3.1 Nonlinear Regression Problem. Using a regression problem for the function $y = (1 - x + 2x^2)\,e^{-0.5x^2}$, the learning performance of adding a
Figure 1: Nonlinear regression problem.
penalty term was evaluated. In the experiment, values of $x$ were randomly generated in the range $[-4, 4]$, and the corresponding value of $y$ was calculated from each $x$; each value of $y$ was corrupted by adding gaussian noise with a mean of 0 and a standard deviation of 0.2. The total number of training examples was set to 30. The number of hidden units was set to 5, where the initial values for the weights between the input and hidden units were independently generated according to a normal distribution with a mean of 0 and a standard deviation of 1; the initial values for the weights between the hidden and output units were set to 0, but the bias value at the output unit was initially set to the average output value of all training examples. The iteration was terminated when the gradient vector was sufficiently small ($\|\nabla F_i(W)\|^2 / N < 10^{-12}$) or the total processing time exceeded 100 seconds. Our experiments using the regression problem were done on SUN S-4/20 computers. The penalty factor $\mu$ was gradually changed from $2^0$ to $2^{-19}$ by repeated multiplication by $2^{-1}$; trials were performed 20 times for each penalty factor. Figure 1 shows the training examples, the true function, and a function obtained after learning without a penalty term. We can see that such learning overfitted the training examples to some degree.

3.2 Evaluation Using Second-Order Algorithm. Using the BPQ, an evaluation was made after adding each penalty term. Figure 2 compares the generalization performance, which was evaluated using the average RMSE (root mean squared error) and its standard deviation over a set of 5000 examples newly generated for testing.³ The best possible RMSE level is 0.2 because each test example includes the same amount of gaussian noise as given to each training example.
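The data-generating setup of section 3.1 can be sketched as follows (a reconstruction from the text; the seed and sampler choices are assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)

def target(x):
    """True regression function: y = (1 - x + 2x^2) * exp(-0.5 x^2)."""
    return (1 - x + 2 * x ** 2) * np.exp(-0.5 * x ** 2)

m = 30                                  # number of training examples
x = rng.uniform(-4, 4, size=m)          # inputs drawn from [-4, 4]
y = target(x) + rng.normal(0, 0.2, m)   # gaussian noise, sd 0.2

# Penalty-factor sweep: 2^0, 2^-1, ..., 2^-19 (20 values).
mus = [2.0 ** (-e) for e in range(20)]
print(len(mus), x.min(), x.max())
```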
For each penalty term, the generalization performance was improved when μ was set adequately, but the normalized penalty term was the most unstable of the three, because it frequently got stuck in undesirable local minima.
³ All graphs in this article include results of unconverged trials within the maximum processing time.
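The three penalty terms themselves are defined earlier in the article and not repeated in this section. Assuming the standard forms used in this literature (squared V1(W) = (1/2) Σ wᵢ², absolute V2(W) = Σ |wᵢ|, and normalized V3(W) = Σ wᵢ² / (1 + wᵢ²)), their per-weight second derivatives reproduce the curvature behavior noted in the step-length discussion:

```python
import numpy as np

# Assumed forms of the three penalty terms (not spelled out in this excerpt):
def V1(w):  # squared penalty
    return 0.5 * np.sum(w**2)

def V2(w):  # absolute penalty
    return np.sum(np.abs(w))

def V3(w):  # normalized penalty
    return np.sum(w**2 / (1.0 + w**2))

# Per-weight second derivatives, which determine each term's contribution
# to the curvature dW^T grad^2 F(W) dW in the step-length calculation:
def d2_V1(w):
    return np.ones_like(w)            # always positive

def d2_V2(w):
    return np.zeros_like(w)           # no effect (away from w = 0)

def d2_V3(w):
    # Negative whenever |w| > 1/sqrt(3), matching the remark in the text.
    return (2.0 - 6.0 * w**2) / (1.0 + w**2) ** 3

w = np.array([0.1, 0.5, 1.0, 2.0])
print(d2_V3(w))   # first entries positive, last entries negative
```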
Second-Order Learning Algorithm
Figure 2: Generalization performance using the BPQ for the regression problem.
Figure 3 compares the statistics of the processing time until convergence, where plots are omitted if the standard deviation is 0. In comparison to learning without a penalty term, the squared penalty term drastically decreased the processing time, especially when μ was large, while the absolute penalty term did not converge when μ was in [2⁻⁷, 2⁰]; the normalized penalty term generally required much more processing time. Thus, only the squared penalty term improved the convergence performance, by a factor of roughly 2 to 100, while keeping better generalization performance for an adequate penalty factor. Figure 4 shows the learning results using the squared penalty term with different penalty factors. The figure shows that the result with the large penalty factor μ = 2⁻³ underfitted the training examples, while the result with the small one, μ = 2⁻¹⁵, overfitted them. We can see that the result with an adequate penalty factor, μ = 2⁻⁶, closely approximated the true function.
Figure 3: Convergence performance using the BPQ for the regression problem.
3.3 Evaluation Using Off-line BP. By using the off-line BP, a similar evaluation was made after adding each penalty term. Here, we adopted Silva and Almeida's (1990) learning-rate adaptation rule: the learning rate η_k for each weight w_k is adjusted by the signs of two successive gradient values.⁴ Figure 5 compares the generalization performance, and Figure 6 compares the processing time until convergence, where the trials without a penalty term are not displayed because none of them converged within 100 seconds. For each penalty term, the generalization performance was improved when μ was set adequately. Note that the off-line BP with the squared penalty term V1 required more processing time than the BPQ with V1. As for the normalized penalty term V3, the off-line BP with V3 worked more stably than the BPQ with V3.
⁴ The increasing and decreasing parameters were set to 1.1 and 1/1.1, respectively, as recommended by Silva and Almeida (1990); if the value of the objective function increases, all learning rates are halved until the value decreases.
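The sign-based adaptation rule of the footnote can be sketched as follows; the up/down factors 1.1 and 1/1.1 follow the text, while the quadratic demo objective and all names are our own scaffolding (the global rate-halving safeguard is omitted for brevity):

```python
import numpy as np

def silva_almeida_step(w, eta, grad, prev_grad, up=1.1, down=1.0 / 1.1):
    """One off-line BP step with per-weight learning-rate adaptation:
    grow eta[k] when two successive gradients agree in sign, shrink it
    when they disagree (sign rule of Silva & Almeida, 1990)."""
    same_sign = grad * prev_grad > 0
    eta = np.where(same_sign, eta * up, eta * down)
    return w - eta * grad, eta

# Tiny quadratic example: minimize 0.5 * w^2 per weight (gradient = w).
w = np.array([2.0, -3.0])
eta = np.full_like(w, 0.1)
prev_grad = np.zeros_like(w)
for _ in range(200):
    grad = w
    w, eta = silva_almeida_step(w, eta, grad, prev_grad)
    prev_grad = grad
print(w)  # close to the minimum at 0
```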
Figure 4: Learning results of a squared penalty term for the regression problem.
3.4 Evaluation Using On-line BP. By using the on-line BP, a similar evaluation was made after adding each penalty term, with the maximum processing time extended to 1000 seconds to allow enough processing time. Here, the learning rate η was set to 0.01, 0.001, or 0.0001, and the penalty factor μ was gradually changed from 0.01 to 0.01 × 2⁻¹⁹ by multiplying by 2⁻¹. Since on-line learning algorithms perform a weight update for each single example, each penalty operation on the weight values must be performed for each single example as well; consequently, the adequate penalty factors of on-line algorithms are generally much smaller than those of off-line algorithms. Figure 7 compares the generalization performance, where large RMSE values are omitted to keep the graphs on the same scale. As for the standard deviation, since the graphs for the three learning rates were almost the same, only the graph for η = 0.01 is shown. This figure shows that for the same learning rate, the generalization-performance curves for the three penalty terms were very similar. Moreover, the best generalization performance with each penalty term was almost equivalent to that of the
Figure 5: Generalization performance using the off-line BP for the regression problem.
on-line BP without a penalty term. As for the convergence performance, none of the trials converged within 1000 seconds. This indicates that the on-line BP with a penalty term could not reduce the error efficiently, doing even worse than the off-line BP.

3.5 Consideration. As a consequence of the evaluation using the regression problem, among all the combinations, the combination of V1 and the BPQ performed the best learning when considering both the generalization performance and the convergence performance. Incidentally, the generalization performance of the on-line or off-line BP without a penalty term was substantially better than that of the BPQ without one. We conjecture that this is because the effect of early stopping (Bishop, 1995) worked for both, and more so for the on-line BP: in this problem, the convergence of the BP is so slow that the effect of early stopping is inevitable.
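The per-example penalty application of section 3.4 amounts to adding the penalty gradient to every single-example update. A minimal sketch with the squared penalty, using a linear unit rather than the three-layer network of the experiments (all names and the toy data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy linear target.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def online_sgd(X, y, eta=0.01, mu=1e-4, epochs=50):
    # On-line update: one example at a time, with the penalty gradient
    # mu * w (squared penalty) applied at every single-example step.
    # This is why adequate mu is much smaller than in off-line learning.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            err = xt @ w - yt
            w -= eta * (err * xt + mu * w)
    return w

w = online_sgd(X, y)
print(w)  # close to true_w for a small mu
```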
Figure 6: Convergence performance using the off-line BP for the regression problem.
On the other hand, in our experiments, the choice of the maximum processing time is considered sufficient and reasonable. Figure 8 shows the learning curves of the BPQ, the off-line BP, and the on-line BP, each without a penalty term, plotted as the average RMSE on the training examples with respect to the processing time. This figure indicates that even if the maximum processing time were increased, most of our experimental results would be unchanged, because each of the learning curves is almost saturated within the maximum processing time.

4 Experiments Using Classification Problem
We performed similar evaluations using a real handwritten numeral recognition problem, where each numeral, 0–9 (k = 10), is described by reduced Glucksman's features (n = 16) (Glucksman, 1967; Ishii, 1983). In experiments, among 4000 examples (400 per class), 2000 examples (200 per class) were used for training and the remaining examples for testing; the generalization performance was evaluated by using the misclassification rate. The experimental conditions were exactly the same as in section 3, except that the number of hidden units was set to 10 and the maximum processing time was set to 200 seconds. Our experiments using the classification problem were done on HP C180 computers.

Figure 7: Generalization performance using the on-line BP for the regression problem.

Figure 8: Learning curves for the regression problem.

Figures 9a and 9b show learning results with each of the three penalty terms by using second- and first-order off-line learning algorithms, respectively. Here, the penalty factor μ was gradually changed from 40 to 40 × 2⁻⁹ by multiplying by 2⁻¹. We can see that for each penalty term, the generalization performance was greatly improved when μ was set adequately. The misclassification rate with the squared penalty term V1 is very similar to that with the normalized penalty term V3, but the former showed the best generalization performance in this experiment.

Figure 9: Generalization performance for the classification problem.

Figure 10 shows learning results with each of the three penalty terms by using the on-line BP, whose learning rate η was set to 0.1, 0.01, or 0.001. Here, the penalty factor μ was gradually changed from 0.002 to 0.002 × 2⁻¹⁹ by multiplying by 2⁻¹. We can see that for each penalty term, the generalization performance was also improved when μ was set adequately. However, the best generalization performance with each of the penalty terms was only slightly better than that without a penalty term, in contrast to the experiments using the off-line algorithms.
Figure 10: Generalization performance using the on-line BP for the classication problem.
Figure 11: Convergence performance using the BPQ with V1 for the classication problem.
Except for the case of the on-line BP with learning rate 0.1, the generalization performance of the on-line or off-line BP without a penalty term was substantially better than that of the BPQ without one; the effect of early stopping may be especially clear for large-scale problems. Incidentally, for all trials except those of the off-line algorithms with penalty factor 40 or 20, the standard deviations of the generalization misclassification rates were less than 1%. As for the convergence performance, only the combination of the squared penalty term V1 and the second-order algorithm BPQ could make the gradient vector sufficiently small; the runs of the other combinations were terminated by the maximum processing time. Figure 11 shows the convergence performance of the BPQ with V1. Thus, when considering both the generalization performance and the convergence performance, these experiments using the classification problem also show that the combination of V1 and the BPQ performed the best learning.

5 Comparison of Function Surfaces
In order to examine graphically why the effects of adding each penalty term differed, we designed a simple experiment: learning a function y = σ(w1 x) + σ(w2 x), where only two weights, w1 and w2, are adjustable. In the three-layer network shown in Figure 12, the input and output layers consist of only one unit, and the hidden layer consists of two units. Note that the weights between the hidden units and the output unit are fixed at 1, there is no bias, and the activation function of the hidden units is σ(x) = 1/(1 + exp(−x)). Each target value y_t was calculated from the corresponding input value x_t ∈ {−0.2, −0.1, 0, 0.1, 0.2} by setting (w1, w2) = (1, 3).
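Since the hidden-to-output weights are fixed at 1 and there are no biases, the whole network reduces to y = σ(w1 x) + σ(w2 x), and the target generation can be sketched directly (our own minimal illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def net(x, w1, w2):
    # Two-weight network of Figure 12: fixed output weights, no biases.
    return sigmoid(w1 * x) + sigmoid(w2 * x)

# Targets generated from (w1, w2) = (1, 3), as in the text.
xs = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
ys = net(xs, 1.0, 3.0)

def error(w1, w2):
    # Sum-of-squares learning error over the five training pairs.
    return 0.5 * np.sum((ys - net(xs, w1, w2)) ** 2)

# The true weights (and, by symmetry, the swapped pair) give zero error.
print(error(1.0, 3.0), error(3.0, 1.0))
```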
Figure 12: A two-weight learning problem.
Figure 13 shows the learning trajectories on function contour maps with respect to w1 and w2 during 100 iterations starting at (w1, w2) = (−1, −3), where the penalty factor μ was set to 0.1 or 0.01 and the BPQ was used as the learning algorithm. The contours for the squared penalty term form ovals, making the BPQ's learning easy. When μ = 0.1, the contours for the absolute penalty term form an almost square shape, and the learning trajectories oscillate near the origin (w1 = w2 = 0) due to the discontinuity of the gradient function. The contours for the normalized penalty term form a valley, making the BPQ's learning more difficult. Here, in order to evaluate the learning results graphically, the number of weights was set to two, but the property of each penalty term extends naturally to higher dimensions: even if the number of weights is larger than two, these properties are considered to be the same. Moreover, these considerations coincide with the experimental results of sections 3 and 4: the combination of the squared penalty term V1 and the second-order learning algorithm BPQ performed the best learning.

6 Determining Penalty Factor
In general, for a given problem, we cannot know an adequate penalty factor in advance; given a limited number of examples, we must find a reasonably adequate one. The procedure of cross-validation (Stone, 1978) is adopted for this purpose. Since we know that the combination of the squared penalty term V1 and the second-order algorithm BPQ works very efficiently, hereafter we consider only this combination. Cross-validation was implemented as a leave-one-out method: the learning error term without the hth example is defined as

    f^(h)(W) = (1/2) Σ_{t=1, t≠h}^{m} (y_t − z(x_t; W))²,

and the parameters minimizing the objective function with the penalty term are defined as

    Ŵ^(h) = argmin_W ( f^(h)(W) + μ V1(W) ).

Then, the cross-validation error is

    RMSE^(CV)(μ) = sqrt( (1/m) Σ_{h=1}^{m} (y_h − z(x_h; Ŵ^(h)))² ).

Thus, the optimal penalty factor can be calculated by gradually changing the factor:

    μ_opt = argmin_μ RMSE^(CV)(μ).

Figure 13: Graphical evaluation of function surfaces for the two-weight learning problem.
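The leave-one-out sweep over μ can be sketched as follows. To keep the example self-contained, we illustrate the μ_opt search with ridge regression, whose Ŵ^(h) has a closed form, instead of the three-layer network (data and names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: polynomial features of a noisy 1-d target.
x = rng.uniform(-1, 1, size=25)
y = np.sin(3 * x) + 0.2 * rng.normal(size=25)
Phi = np.vander(x, 8)  # 25 x 8 design matrix

def fit(Phi, y, mu):
    # W_hat = argmin_W ||y - Phi W||^2 / 2 + mu * ||W||^2 / 2
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + mu * np.eye(d), Phi.T @ y)

def rmse_cv(Phi, y, mu):
    # Leave-one-out: refit without example h, then predict example h.
    m = len(y)
    errs = []
    for h in range(m):
        keep = np.arange(m) != h
        w_h = fit(Phi[keep], y[keep], mu)
        errs.append(y[h] - Phi[h] @ w_h)
    return np.sqrt(np.mean(np.square(errs)))

# Sweep mu over a halving schedule, as in the experiments.
mus = [2.0 ** (-k) for k in range(0, 20)]
mu_opt = min(mus, key=lambda mu: rmse_cv(Phi, y, mu))
print(mu_opt)
```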
Figure 14: Generalization performance using cross validation for the regression problem.
On the other hand, the initial weight values for evaluating the cross-validation error were set to the learning results over the entire set of examples:

    Ŵ = argmin_W ( f(W) + μ V1(W) ).
The advantage of this initial value setting is that Ŵ^(h) can be calculated with a light computational load, because the difference between Ŵ and Ŵ^(h) is generally small. Moreover, although neural network learning generally has several local minima, searching for Ŵ^(h) in the neighborhood of Ŵ is expected to keep the cross-validation error stable. Experiments were performed using the above regression problem with exactly the same experimental conditions as in section 3. Figure 14 compares the generalization error and the cross-validation error by their averages and standard deviations. Although the cross-validation error was a pessimistic estimator of the generalization error, it showed the same tendency and was minimized at almost the same penalty factor. Moreover, since the standard deviations of the results were very small, we can see that this learning was stable. Figure 15 shows the average processing time and its standard deviation. Although the processing time includes the cross-validation evaluation, we can see that the learning was performed quite efficiently.

Figure 15: Convergence performance using cross validation for the regression problem.

7 Conclusion
This article investigated the efficiency of supervised learning with each of three penalty terms, using first- and second-order off-line learning algorithms and a standard on-line algorithm. Our experiments showed that, for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order algorithm drastically improves the convergence performance in comparison to the other combinations, with excellent generalization performance. For other second-order learning algorithms such as SCG or OSS, similar results can be expected, because the main difference between the BPQ and those algorithms involves only the learning efficiency. In the future, we plan to perform evaluations using larger-scale problems.

Acknowledgments
We thank Kenichiro Ishii for providing handwritten numeral recognition data.
References

Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2), 141–166.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Glucksman, H. (1967). Classification of mixed-font alphabetics by characteristic loci. In IEEE Computer Conference (pp. 138–141).
Hanson, S., & Pratt, L. (1989). Comparing biases for minimal network construction with back-propagation. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 177–185). San Mateo, CA: Morgan Kaufmann.
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. de Bakker, A. Nijman, & P. Treleaven (Eds.), Proceedings PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin.
Ishii, K. (1983). Generation of distorted characters and its applications. Systems and Computers in Japan, 14(6), 19–27.
Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9(3), 509–521.
MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Saito, K., & Nakano, R. (1997a). Partial BFGS update and efficient step-length calculation for three-layer neural networks. Neural Computation, 9(1), 123–141.
Saito, K., & Nakano, R. (1997b). Second-order learning algorithm with squared penalty term. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 627–633). Cambridge, MA: MIT Press.
Silva, F., & Almeida, L. (1990). Speeding up backpropagation. In R. Eckmiller (Ed.), Advanced neural computers (pp. 151–160). Amsterdam: North-Holland.
Stone, M. (1978). Cross-validation: A review. Operationsforsch. Statist. Ser. Statistics, 9(1), 111–147.
Williams, P. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1), 117–143.

Received March 30, 1998; accepted February 4, 1999.
Neural Computation, Volume 12, Number 4
CONTENTS

Articles

Minimizing Binding Errors Using Learned Conjunctive Features, Bartlett W. Mel and József Fiser, p. 731
The Multifractal Structure of Contrast Changes in Natural Images: From Sharp Edges to Textures, Antonio Turiel and Néstor Parga, p. 763

Note

Exponential or Polynomial Learning Curves? Case-Based Studies, Hanzhong Gu and Haruhisa Takahashi, p. 795

Letters

Training Feedforward Neural Networks with Gain Constraints, Eric Hartman, p. 811
Variational Learning for Switching State-Space Models, Zoubin Ghahramani and Geoffrey E. Hinton, p. 831
Retrieval Properties of a Hopfield Model with Random Asymmetric Interactions, Zhang Chengxiang, Chandan Dasgupta, and Manoranjan P. Singh, p. 865
On "Natural" Learning and Pruning in Multilayered Perceptrons, Tom Heskes, p. 881
Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations, Michele Giugliano, p. 903
Hierarchical Bayesian Models for Regularization in Sequential Learning, J. F. G. de Freitas, M. Niranjan, and A. H. Gee, p. 933
Sequential Monte Carlo Methods to Train Neural Network Models, J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet, p. 955
ARTICLE
Communicated by Shimon Edelman
Minimizing Binding Errors Using Learned Conjunctive Features

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

József Fiser
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627-0268, U.S.A.

We have studied some of the design trade-offs governing visual representations based on spatially invariant conjunctive feature detectors, with an emphasis on the susceptibility of such systems to false-positive recognition errors (von der Malsburg's classical binding problem). We begin by deriving an analytical model that makes explicit how recognition performance is affected by the number of objects that must be distinguished, the number of features included in the representation, the complexity of individual objects, and the clutter load, that is, the amount of visual material in the field of view in which multiple objects must be simultaneously recognized, independent of pose and without explicit segmentation. Using the domain of text to model object recognition in cluttered scenes, we show that with corrections for the nonuniform probability and nonindependence of text features, the analytical model achieves good fits to measured recognition rates in simulations involving a wide range of clutter loads, word sizes, and feature counts. We then introduce a greedy algorithm for feature learning, derived from the analytical model, which grows a representation by choosing those conjunctive features that are most likely to distinguish objects from the cluttered backgrounds in which they are embedded. We show that the representations produced by this algorithm are compact, decorrelated, and heavily weighted toward features of low conjunctive order.
Our results provide a more quantitative basis for understanding when spatially invariant conjunctive features can support unambiguous perception in multiobject scenes, and lead to several insights regarding the properties of visual representations optimized for specific recognition tasks.

1 Introduction
Neural Computation 12, 731–762 (2000)
© 2000 Massachusetts Institute of Technology

The problem of classifying objects in visual scenes has remained a scientific and a technical holy grail for several decades. In spite of intensive research in the field of computer vision and a large body of empirical data from the fields of visual psychophysics and neurophysiology, the recognition competence of a two-year-old human child remains unexplained as a theoretical matter and well beyond the technical state of the art. This fact is surprising given that (1) the remarkable speed of recognition in the primate brain allows for only very brief processing times at each stage in a pipeline containing only a few stages (Potter, 1976; Oram & Perrett, 1992; Heller, Hertz, Kjær, & Richmond, 1995; Thorpe, Fize, & Marlot, 1996; Fize, Boulanouar, Ranjeva, Fabre-Thorpe, & Thorpe, 1998), (2) the computations that can be performed within each neural processing stage are strongly constrained by the structure of the underlying neural tissue, about which a great deal is known (Hubel & Wiesel, 1968; Szentagothai, 1977; Jones, 1981; Gilbert, 1983; Van Essen, 1985; Douglas & Martin, 1998), (3) the response properties of neurons in each of the relevant brain areas have been well studied and evolve from stage to stage in systematic ways (Oram & Perrett, 1994; Logothetis & Sheinberg, 1996; Tanaka, 1996), and (4) computer systems may already be powerful enough to emulate the functions of these neural processing stages, were it only known what exactly to do. A number of neuromorphic approaches to visual recognition have been proposed over the years (Pitts & McCulloch, 1947; Fukushima, Miyake, & Ito, 1983; Sandon & Uhr, 1988; Zemel, Mozer, & Hinton, 1990; Le Cun et al., 1990; Mozer, 1991; Swain & Ballard, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Schiele & Crowley, 1996; Mel, 1997; Weng, Ahuja, & Huang, 1997; Lang & Seitz, 1997; Wallis & Rolls, 1997; Edelman & Duvdevani-Bar, 1997).
Many of these neurally inspired systems involve constructing banks of feature detectors, often called receptive fields (RFs), each of which is sensitive to some localized spatial configuration of image cues but invariant to one or more spatial transformations of its preferred stimulus—typically including invariance to translation, rotation, scale, or spatial distortion, as specified by the task.¹ Since the goal to extract useful invariant features is common to most conventional approaches to computer vision as well, the "neuralness" of a recognition system lies primarily in its emphasis on feedforward, hierarchical construction of the invariant features, where the computations are usually restricted to simple spatial conjunction and disjunction operations (e.g., Fukushima et al., 1983), and/or in the inclusion of a relatively large number of invariant features in the visual representation (see Mozer, 1991; Califano & Mohan, 1994; Mel, 1997)—anywhere from hundreds to millions. Neural approaches typically also involve some form of learning, whether supervised by category labels at the output layer (Le Cun et al., 1990), supervised directly within

¹ A number of the above-mentioned systems have emphasized other mechanisms instead of, or in addition to, straightforward image filtering operations (Zemel et al., 1990; Mozer, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Edelman & Duvdevani-Bar, 1997).
intermediate network layers (Fukushima et al., 1983), or involving purely unsupervised learning principles that home in on features that occur frequently in target objects (Fukushima et al., 1983; Zemel et al., 1990; Wallis & Rolls, 1997; Weng et al., 1997). While architectures of this general type have performed well in a variety of difficult, though limited, recognition problems, it has yet to be proved that a few stages of simple feedforward filtering operations can explain the remarkable recognition and classification capacities of the human visual system, which is commonly confronted with scenes acquired from varying viewpoints, containing multiple objects drawn from thousands of categories, and appearing in an infinite panoply of spatial configurations. In fact, there are reasons for pessimism.

1.1 The Binding Problem. One of the most persistent worries concerning this class of architecture is due to von der Malsburg (1981), who noted the potential for ambiguity in visual representations constructed from spatially invariant filters. A spatial binding problem arises, for example, when each detector in the visual representation reports only the presence of some elemental object attribute but not its spatial location (or, more generally, its pose). Under these unfavorable conditions, an object is hallucinated (i.e., detected when not actually present) whenever all of its elemental features are present in the visual field, even when embedded piecemeal, in improper configurations, within a scattered coalition of distractor objects (see Figure 1A). This type of failure mode for a visual representation emanates from the dual pressures to cope with uncontrolled variation in object pose, which forces the use of detectors with excessive spatial invariances, and the necessity to process multiple objects simultaneously, which overstimulates the visual representation and increases the probability of ambiguous perception.
One approach to the binding problem is to build a separate, full-order conjunctive detector for every possible view (e.g., spatial pose, state of occlusion, distortion, degradation) of every possible object, and then for each object provide a huge disjunction that pools over all of the views of that object. von der Malsburg (1981) pointed out that while the binding problem could thus be solved, the remedy is a false one, since it leads to a combinatorial explosion in the number of needed visual processing operations (see Figure 1B). Other courses of action include (1) strategies for image preprocessing designed to segment scenes into regions containing individual objects, thus reducing the clutter load confronting the visual representation, and (2) explicit normalization procedures to reduce pose uncertainty (e.g., centering, scaling, warping), thereby reducing the invariance load that must be sustained by the individual receptive fields. Both strategies reduce the probability that any given object's feature set will be activated by a spurious collection of features drawn from other objects.

Figure 1: (A) A binding problem occurs when an object's representation can be falsely activated by multiple (or inappropriate) other objects. (B) A combinatorial explosion arises when all possible views of all possible objects are represented by explicit conjunctions. (C) Compromise representation containing an intermediate number of receptive fields that bind image features to intermediate order.

1.2 Wickelsystems. Preprocessing strategies aside, various workers have explored the middle ground between "binding breakdown" and "combinatorial catastrophe," where a visual representation contains an intermediate number of detectors, each of which binds together an intermediate-sized subset of feature elements in their proper relations before integrating out the irrelevant spatial transformations (see Mozer, 1991, for a thoughtful advocacy of this position). Under this compromise, each detector becomes a spatially invariant template for a specific minipattern, which is more complex than an elemental feature, such as an oriented edge, but simpler than a full-blown view of an object (see Figure 1C). A set of spatially invariant features of intermediate complexity can be likened to a box of jumbled jigsaw puzzle pieces: although the final shape of the composite object (the puzzle) is not explicitly contained in this peculiar "representation," it is nonetheless unambiguously specified, since the collection of parts (features) can interlock in only one way. An important early version of this idea is due to Wickelgren (1969), who proposed a scheme for representing the phonological structure of spoken language involving units sensitive to contiguous triples of phonological features but not to the absolute position of the tuples in the speech stream. Though not phrased in this language, "Wickelfeatures" were essentially translation-invariant detectors of phonological minipatterns of intermediate complexity, analyzing the one-dimensional speech signal. Representational schemes inspired by Wickelgren's proposal have since cropped up in other influential models of language processing (McClelland & Rumelhart, 1981; Mozer, 1991). The use of invariant conjunctive feature schemes to address representational problems in fields other than vision highlights the fact that the binding problem is not an inherently visuospatial one. Rather, it can arise whenever the set of features representing an object (visual, auditory, olfactory, etc.)
can be fully, and thus inappropriately, activated by input "scenes" that do not contain the object. In general, a binding problem has the potential to exist whenever a representation lacks features that conjoin an object's elemental attributes (e.g., parts, spatial relations, surface properties) to sufficiently high conjunctive order. On the solution side, binding ambiguity can be mitigated, as in the scenario of Figure 1, by building conjunctive features that are increasingly object specific, until the probability of false-positive activation for any object is driven below an acceptable threshold. Given that this conjunctive order-boosting prescription for reducing ambiguity is algorithmically trivial, it shifts emphasis away from the question of how to eliminate binding ambiguity, toward the question of whether, for a given recognition problem, ambiguity can be eliminated at an acceptable cost.

1.3 A Missing Theory. In spite of the importance of spatially invariant conjunctive visual representations as models of neural function, and the demonstrated practical successes of computer vision systems constructed along these lines, many of the design trade-offs that govern the performance of visual "Wickelsystems" in response to cluttered input images have remained poorly understood. For example, it is unknown in general how recognition performance depends on (1) the number of object categories that must be distinguished, (2) the similarity structure of the object category distribution (i.e., whether object categories are intrinsically very similar or very different from each other), (3) the featural complexity of individual objects, (4) the number and conjunctive order of features included in the representation, (5) the clutter load (i.e., the amount of visual material in the field of view from which multiple objects must be recognized without explicit segmentation), and (6) the invariance load (i.e., the set of spatial transformations that do not affect the identities of objects and must be ignored by each individual detector). In a previous analysis that touched on some of these issues, Califano and Mohan (1994) showed that large, parameterized families of complex invariant features (e.g., containing 10⁶ elements or more) could indeed support recognition of multiple objects in complex scenes without prior segmentation, and discrimination among a large number of highly similar objects, in a spatially invariant fashion. However, these authors were primarily concerned with the mathematical advantages of large versus small feature sets in a generic sense, and thus did not test their analytical predictions using a large object database with varying object complexity, category structure, clutter load, and other factors. Beyond our lack of understanding of the trade-offs influencing system performance, the issue of which invariant conjunctive features, and of what complexity, should be incorporated into a visual representation through learning has also not been well worked out as a conceptual matter.
Previous approaches to feature learning in neuromorphic visual systems have invoked (1) generic gradient-based supervised learning principles (e.g., Le Cun et al., 1990), which offer no direct insight into the qualities of a good representation, or (2) straightforward unsupervised learning principles (e.g., Wallis & Rolls, 1997), which tend to discover frequently occurring features or principal components rather than those needed to distinguish objects from cluttered backgrounds on a task-specific basis. Network approaches to learning have also typically operated within a fixed network architecture, which largely predefines and homogenizes the complexity of top-level features to be used for recognition, though optimal representations may actually require features drawn from a range of complexities and could vary in their composition on a per-object basis. All this suggests that further work on the principles of feature learning is needed. In this article, we describe our work to test a simple analytical model that captures several trade-offs governing the performance of visual recognition systems based on spatially invariant conjunctive features. In addition, we introduce a supervised greedy algorithm for feature learning that grows a visual representation in such a way as to minimize false-positive recognition errors. Finally, we consider some of the surprising properties of "good" representations and the implications of our results for more realistic visual recognition problems.
Minimizing Binding Errors
737
2 Methods

2.1 Text World as a Model for Object Recognition. The studies described here were carried out in text world, a domain with many of the complexities of vision in general (see Figure 2): target objects are numerous (more than 40,000 words in the database), are highly nonuniform in their prior probabilities (relative probabilities range from 1 to 400,000), are constructed from a set of relatively few underlying parts (26 letters), and individually can contain from 1 to 20 parts. Furthermore, the parts from which words are built are highly nonuniform in their relative frequencies and contain strong statistical dependencies. Finally, word objects occur in highly cluttered visual environments, embedded in input arrays in which many other words are simultaneously present, but must nonetheless be recognized regardless of position in the visual field. Although text world lacks several additional complexities characteristic of real object vision (see section 6), it nonetheless provides a rich but tractable domain in which to quantify binding trade-offs and to facilitate the testing of analytical predictions and the development of feature learning algorithms. An important caveat, however, is that the use of text as a substrate for developing and testing our analytical model does not imply that our conclusions bear directly on any aspect of human reading behavior.

Two databases were used in the studies. The word database contained 44,414 entries, representing all lowercase punctuation-free words and their relative frequencies found in 5 million words in the Wall Street Journal (WSJ) (available online from the Association for Computational Linguistics at http://morph.ldc.upenn.edu/). The English text database consisted of approximately 1 million words drawn from a variety of sources (Kučera & Francis, 1967).

2.2 Recognition Using n-Grams.
A natural class of visual features in text world is the position-invariant n-gram, defined here as a binary detector that responds when a specific spatial configuration of n letters is found anywhere (one or more times) in the input array.2 The value of n is termed the conjunctive or binding order of the n-gram, and the diameter of the n-gram is the maximum span between characters specified within it. For example, th**_ (with _ denoting the space character) is a 3-gram of diameter 5, since it specifies the relative locations of three characters (including the space character but not the wild-card characters *) and spans a field five characters in width. The concept of an n-gram is also used in computational linguistics, though usually referring to n-tuples of consecutive words (Charniak, 1993).
2 An n-gram could also be multivalued, that is, respond in a graded fashion depending on the number of occurrences of the target feature in the input; the binary version was used here to facilitate analysis.
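A position-invariant n-gram detector of this kind can be sketched in a few lines of Python. This is our illustration, not the paper's implementation; a literal space stands in for the space character in the pattern string.

```python
def ngram_matches(pattern: str, text: str) -> bool:
    """Binary position-invariant n-gram detector: fires iff the pattern
    occurs anywhere (one or more times) in the input array.
    '*' is a wild card; every other character must match exactly."""
    span = len(pattern)  # the n-gram's diameter
    for start in range(len(text) - span + 1):
        window = text[start:start + span]
        if all(p == "*" or p == c for p, c in zip(pattern, window)):
            return True
    return False

# "th** " is a 3-gram of diameter 5: it specifies t, h, and a final
# space, with two wild-card slots in between.
print(ngram_matches("th** ", "a thin man"))   # the window "thin " matches
print(ngram_matches("th** ", "theory"))       # no 5-character window matches
```

Because the detector is binary, repeated occurrences of the pattern in one input array do not change its output, matching footnote 2.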
Figure 2: Text world was used as a surrogate for visual recognition in multiobject scenes. (A) Word frequencies in the 5-million-word Wall Street Journal (WSJ) corpus are highly nonuniform, following an approximately exponential decay according to Zipf's law. (B) Words range from 1 to 20 characters in length; 7-letter words are most numerous. (C, D) Relative frequencies of individual letters or adjacent letter pairs are highly nonuniform. Ordinate values represent the number of times the feature was found in the WSJ corpus. The dashed rectangle in D represents a simplifying assumption regarding the 2-gram frequency distribution used for quantitative predictions below. (E) The parts of words show strong statistical dependencies. Columns show the information gain about the identity of the adjacent letter (or space character) to the left or right, given the presence of the letter indicated on the x-axis. The dashed horizontal line indicates perfect knowledge: log2(27) = 4.75 bits.
In relation to conventional visual domains, n-grams may be viewed as complex visual feature detectors, each of which responds to a specific conjunction of subparts in a specific spatial configuration, but with built-in invariance to a set of spatial transformations predefined by the recognition task. In relation to a shape-based visual domain, a 2-gram is analogous to a corner detector that signals the co-occurrence of a vertical and a horizontal edge, with appropriate spatial offsets, anywhere within an extended region of the image.

Let R be a representation containing a set of position-invariant n-grams analyzing the input array, and let W be a set of word detectors analyzing R, with one detector for every entry in the word database. Each word detector receives input from some or all of the n-gram units in R that are contained in the respective word. However, not all of a word's n-grams are necessarily contained in R, and not all of the word's n-grams that are contained in R necessarily provide input to the word's detector. For example, the representation might contain four spatially invariant features R = {a, c, o, s}, while the detector for the word cows receives input from features c and s, but not from the o feature, which is present in R, or from the w feature, which is not contained in R. Word detectors are binary-valued conjunctions, responding 1 when all of their inputs are active and 0 otherwise. The collection of n-grams feeding a word detector is hereafter referred to as the word's minimal feature set. A correct recognition is scored when an input array of characters is presented and each word detector is activated if and only if its associated word is present in the input. Any word detector activated when its corresponding word is not actually present in the input array is counted as a hallucination, and the recognition trial is scored as an error.
Recognition errors for a particular representation R are reported as the percentage of input arrays drawn randomly from the English text database that generated one or more word hallucinations.
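This scoring scheme can be made concrete with a short sketch. Treating n-grams as literal substrings of the input array is our simplification for illustration, not the paper's implementation.

```python
def recognize(word_detectors, R, text):
    """Score one input array. word_detectors maps each word to its
    minimal feature set (a subset of R); a word detector is a binary
    conjunction, responding 1 iff all of its inputs are active.
    Returns the sets of detected and hallucinated words."""
    on = {f for f in R if f in text}                  # active features
    detected = {w for w, feats in word_detectors.items()
                if feats and feats <= on}
    hallucinated = {w for w in detected if w not in text}
    return detected, hallucinated

# Toy example from the text: R = {a, c, o, s}, with the detector for
# "cows" wired only to c and s.
R = {"a", "c", "o", "s"}
detectors = {"cows": {"c", "s"}}
det, hall = recognize(detectors, R, "the cast")
print(det, hall)   # "cows" is detected but absent: a hallucination
```

On the input "the cast", both c and s are active, so the "cows" detector fires even though the word is absent, exactly the binding error the analysis quantifies.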
3 Results

3.1 Word-Word Collisions in a Simple 1-Gram Representation. We begin by considering the case without visual clutter and ask how often pairs of isolated words collide with each other within a binary 1-gram representation, where each word is represented as an unordered set of individual letters (with repeats ignored). This is the situation in which the binding problem is most evident, as schematized in Figure 1A. Results for the approximately 2 billion distinct pairs of words in the database are summarized in Table 1. About half of the 44,414 words contained in the database collide with at least one other word (e.g., cerebrum ↔ cucumber, calculation ↔ unconstitutional). The largest cohort of 1-gram-identical words contained 28 members.
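The 1-gram collision test amounts to grouping words by their letter sets. A minimal sketch, with the word list drawn from the examples in the text:

```python
from collections import defaultdict

def one_gram_cohorts(words):
    """Group words by their binary 1-gram signature: the unordered set
    of letters, with repeats ignored. Words that share a signature
    collide under the 1-gram representation."""
    cohorts = defaultdict(list)
    for w in words:
        cohorts[frozenset(w)].append(w)
    return cohorts

words = ["stare", "tears", "rarest", "easter", "cucumber", "cerebrum"]
colliding = [c for c in one_gram_cohorts(words).values() if len(c) > 1]
print(colliding)   # two cohorts: the stare/tears/... group and cucumber/cerebrum
```

A multivalued variant would key on the multiset of letters (e.g., `collections.Counter`) instead of the set, which is the comparison used for the stricter counts reported below.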
Table 1: Performance of a Binary 1-Gram Representation in Word-Word Comparisons.

Quantity | Value | Comment
Number of 1-grams in R | 27 | a, b, ..., z, _ (space)
Number of distinct words | 44,414 | From 5 million words in the WSJ
Word-word comparisons | ~2 billion |
Number of ambiguous words | 24,488 | analogies ↔ gasoline; suction ↔ continuous; scientists ↔ nicest
Largest self-similar cohort | 28 words | stare, arrest, tears, restates, reassert, rarest, easter, ...
Baseline entropy in word-frequency distribution | 9.76 bits | < log2(44,414) = 15.4 bits
Residual uncertainty about identity of a randomly drawn word, knowing its 1-grams | 1.4 bits | Narrows field to ~3 possible words
As an aside, we noted that if the two words compared were required also to contain the same number of letters, the number of ambiguously represented words fell to 11,377, and if the words were compared within a multivalued 1-gram representation (where words matched only if they contained the identical multiset of letters, i.e., in which repeats were taken into account), the ambiguous word count fell further to 5,483, or about 12% of English words according to this sample. The largest self-similar cohort in this case contained only 5 words (traces, reacts, crates, caters, caster). Although a 1-gram representation contains no explicit part-part relations and fails to pin down uniquely the identity of more than half of the English words in the database, it nonetheless provides most of the information needed to distinguish individually presented words. The baseline entropy in the WSJ word-frequency distribution is 9.76 bits, significantly less than the 15.4 = log2(44,414) bits for an equiprobable word set. The average residual uncertainty about the identity of a word drawn from the WSJ word-frequency distribution, given its 1-gram representation, is only 1.4 bits, meaning that for individually presented words, the 1-gram representation narrows the set of possibilities to about three words on average.

3.2 Word-Word Collisions in an Adjacent 2-Gram Representation.
Since many word objects have identical 1-gram representations even when full multisets of letters are considered, we examined the collision rate when an adjacent binary-valued 2-gram representation was used (i.e., coding the set of all adjacent letter pairs, including the space character). This representation is significant in that adjacent 2-grams are the simplest features that explicitly encode spatial relations between parts. The results of this test are summarized in Table 2.

Table 2: Performance of a Binary Adjacent 2-Gram Representation in Word-Word Comparisons.

Quantity | Value | Comment
Number of adjacent 2-grams | 729 | [aa], [ab], ..., [a_], ..., [_z], [__]
Number of words | 44,414 |
Word-word comparisons | ~2 billion |
Number of ambiguous words | 57 | ohhh, ahhh, shhh, hmmm, whoosh, whirr, ...
Largest self-similar cohort | 5 words | ohhh ↔ ohhhh ↔ ohhhhh ↔ ohhhhhh ↔ ohhhhhhh
Ambiguous word pairs that are linguistically distinct | 4 pairs | asses ↔ assess; possess ↔ possesses; seamstress ↔ seamstresses; intended ↔ indented
Words with identical adjacent 2-gram multisets | 2 words | intended ↔ indented

There were only 46 word-pair collisions in this case, comprising 57 distinct words. Forty-two of the 46 problematic word pairs differed only in the length of a string of repeated letters (e.g., ahhh ↔ ahhhh ↔ ahhhhh . . . , and similarly for ohhh, ummm, zzz, wheee, whirrr) or fell into sets representing variant spellings or misspellings of the same word, also involving repeated letters (e.g., tepees ↔ teepees, llama ↔ lllama). Only four pairs of distinct words collided that had different morphologies in the linguistic sense (see Table 2). When words were constrained to be the same length or to contain the identical multiset of adjacent 2-grams, only a single collision persisted: intended ↔ indented. At the level of word-word comparisons, therefore, the binding of adjacent letter pairs is practically sufficient to eliminate the ambiguity in a large database of English words. We noted empirically that the inclusion of fewer than 20 additional nonadjacent 2-grams in the representation entirely eliminated the word-word ambiguity in this database, without recourse to multiset comparisons or word-length constraints.

4 An Analytical Model
The results of these word-word comparisons suggest that under some circumstances, a modest number of low-order n-grams can disambiguate a large, complex object set. However, one of the most dire challenges to a spatially invariant feature representation, object clutter, remains to be dealt with. To this end, we developed a simple probabilistic model to predict recognition performance when a cluttered input array containing multiple objects is presented to a bank of feature detectors. To permit analysis of this problem, we make two simplifying assumptions. First, we assume that the features contained in R are statistically independent. This assumption is false for English n-grams: for example, [th] predicts that [he] will follow about one in three times, whereas the baseline rate for [he] is only 0.5%. Second, we assume that all the features in R are activated with equal probability. This is also false in English: for example, [ed] occurs approximately 90,000 times more often than [mj] (as in ramjet). Proceeding nonetheless under these two assumptions, it is straightforward to calculate the probability that no object in the database is hallucinated in response to a randomly drawn input image. We first calculate the probability o that all of the features feeding a given object detector are activated by an arbitrary input image:
$$ o \;=\; \frac{\binom{d-w}{c-w}}{\binom{d}{c}} \;=\; \frac{(d-w)!\,c!}{(c-w)!\,d!} \;\approx\; \left(\frac{c}{d}\right)^{w} \qquad (4.1) $$
where d is the total number of feature detectors in R, w is the size of the object's minimal feature set (i.e., the number of features required by the target object (word) detector), and c is the number of features activated by a "cluttered" multiobject input image. The leftmost combinatorial ratio in equation 4.1 counts the number of ways of choosing c of the d detectors such that all w belonging to the target object are included, as a fraction of the total number of ways of choosing c of the d detectors in any fashion. The approximation in equation 4.1 holds when c and d are large compared to w, as is generally the case. Equation 4.1 is, incidentally, the value of a "hypergeometric" random variable in the special case where all of the target elements are chosen; this distribution is used, for example, to calculate the probability of picking a winning lottery ticket. The probability that an object is hallucinated (i.e., perceived but not actually present) is given by

$$ h_i = \frac{o_i - q_i}{1 - q_i} \qquad (4.2) $$
where $q_i$ is the probability that the object does in fact appear in the input image, in which case by definition it cannot be hallucinated. According to Zipf's law, however, which tells us that most objects in the world are quite uncommon (see Figure 2A), we may in some cases make the further simplification that $q_i \ll o_i$, giving $h_i \approx o_i$. We adopt this assumption in the following. (This simplification would result also from the assumption that inputs consist of object-like noise structured to trigger false-positive perceptions.) We may now write the probability of veridical perception, that is, the probability that no object in the database is hallucinated:

$$ p_v = (1 - h)^N \approx \left[\,1 - \left(\frac{c}{d}\right)^{w}\right]^{N} \qquad (4.3) $$

where N is the number of objects in the database. This expression assumes a homogeneous object database, where all N objects require exactly w features in R. Given the independence assumption, the expression for a heterogeneous object database consists of a product of object-specific probabilities, $p_v = \prod_i \bigl(1 - (c/d)^{w_i}\bigr)$, where $w_i$ is the number of features required by object i. Note that objects (words) containing the same number of parts (letters) do not necessarily activate the same number of features (n-grams) in R, depending on both the object and the composition of R. From a first-order expansion of equation 4.3 around $(c/d)^w = 0$, which gives $p_v = 1 - N(c/d)^w$, it may be seen that perception is veridical (i.e., $p_v \approx 1$) when $(c/d)^w \ll 1/N$. Thus, recognition errors are expected to be small when (1) a cluttered input scene activates only a small fraction of the detectors in R, (2) individual objects depend on as many detectors in R as possible, and (3) the number of objects to be discriminated is not too large. The first two effects generate pressure to grow large representations, since if d is scaled up and c and w scale with it (as would be expected for statistically independent features), recognition performance improves rapidly. Conveniently, the size of the representation can always be increased by including new features of higher binding order or larger diameters, or both. It is also interesting to note that item 1 in the previous paragraph appears to militate for a sparse visual representation, while equation 4.2 militates for a dense representation. This apparent contradiction reflects the fundamental trade-off between the need for individual objects to have large minimal feature sets in order to ward off hallucinations and the need for input scenes containing many objects to activate as few features as possible, again in order to minimize hallucinations.
This trade-off is not captured within the standard conception of a feature space, since the notion of "similarity as distance" in a feature space does not naturally extend to the situation in which inputs represent multiobject scenes and require multiple output labels.

4.1 Fitting the Model to Recognition Performance in Text World. We ran a series of experiments to test the analytical model. Input arrays varying in width from 10 to 50 characters were drawn at random from the text database and presented to R. The representation contained either all adjacent 2-grams, called R{2} to signify a diameter of 2, or the set of all 2-grams of diameter 2 or 3, called R{2,3}, where R{2,3} was twice the size of R{2}.
In individual runs, the word database was restricted to words of a single length: two-letter words (N = 253), five-letter words (N = 4024), or seven-letter words (N = 7159), in order for the value of w to be kept relatively constant within a trial.3 For each word size and input size, and for each of the two representations, average values for w and c were measured from samples drawn from the word or text database. Not surprisingly, given the assumption violations noted above (feature independence and equiprobability), predicting recognition performance using these measured average values led to gross underestimates of the hallucination errors actually generated by the representation. Two kinds of corrections to the model were made. The first correction involved estimating a better value for d, the number of n-grams contained in the representation, since some of the detectors nominally present in R{2} and R{2,3} were infrequently, if ever, actually activated. The value for d was thus down-corrected to reflect the number of detectors that would account for more than 99.9% of the probability density over the detector set: for R{2}, the representation size dropped from d = 729 to d′ = 409, while for R{2,3}, d was cut from 1,458 to 948. The second correction was based on the fact that the n-grams activated by English words and phrases are highly statistically redundant, meaning that a set of d′ n-grams has many fewer than d′ underlying degrees of freedom. We therefore introduced a redundancy factor r, which allowed us to "pretend" that the representation contained only d′/r virtual n-grams, and that a word or an input image activated only w/r or c/r virtual n-grams in the representation, respectively. (Note that since c and d appear only as a ratio in the approximation of equation 4.3, their r-corrections cancel out.)
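Folding the redundancy factor into equation 4.3 gives the corrected prediction below. This is a sketch; the parameter values are illustrative only:

```python
def p_veridical_corrected(N, d_eff, w, c, r):
    """Eq. 4.3 with the redundancy correction: each word is treated as
    requiring only w/r virtual n-grams, while the r-corrections to c
    and d cancel in the ratio c/d. d_eff is the down-corrected d'."""
    return (1 - (c / d_eff) ** (w / r)) ** N

# A larger redundancy factor r raises the per-object hallucination
# probability (c/d)**(w/r), and so lowers the predicted p_v.
for r in (1.0, 1.6, 2.13):
    print(r, p_veridical_corrected(N=4024, d_eff=409, w=10, c=60, r=r))
```

Only the exponent changes under the correction, which is why a small shift in r can reshape the whole predicted curve, as noted below.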
By systematically adjusting the redundancy factor depending on test condition (ranging from 1.60 to 3.45), qualitatively good fits to the empirical data could be generated, as shown in Figure 3. The redundancy factor was generally larger for long words than for short words, reflecting the fact that long English words have more internal predictability (i.e., fill a smaller proportion of the space of long strings than short words fill in the space of short strings). The r-factor was also larger for R{2,3} than for R{2}, reflecting the fact that the increased feature overlap due to the inclusion of larger feature diameters in R{2,3} leads to stronger interfeature correlations, and hence fewer underlying degrees of freedom. Thus, with the inclusion of correction factors for nonuniform activation levels and nonindependence of the features in R, the remarkably simple relation in equation 4.3 captures the main trends that describe recognition performance in scenarios with database sizes ranging from 253 to 7159 words, words containing between
3 Since R{2} and R{2,3} contained all n-grams of their respective sizes, the value of w could vary only from word to word to the extent that words contained one or more repeated n-grams.
[Figure 3 appears here. (A) "Fits of Analytical Model to Measured Error Rates: Adjacent 2-gram Representation." R{2}: 729 total features ({aa}, {ab}, ..., {a_}, ..., {_z}, {__}); legend: 2-letter words, r = 1.60; 5-letter words, r = 1.60; 7-letter words, r = 2.13. (B) "Fits of Analytical Model to Measured Error Rates: Extended 2-gram Representation." R{2,3}: 1,458 total features ({aa}, {a*a}, {ab}, {a*b}, ..., {a_}, {a*_}, ...); legend: 2-letter words, r = 2.19; 5-letter words, r = 2.95; 7-letter words, r = 3.45. Axes: text input width (10-50 characters) vs. hallucination errors (% of inputs generating).]
Figure 3: Fits of equation 4.3 to recognition performance in text world. (A) The representation, designated R{2}, contained the complete set of 729 adjacent 2-grams (including the space character). The x-axis shows the width of a text window presented to the n-gram representation, and the y-axis shows the probability that the input generates one or more hallucinated words. Analytical predictions are shown as dashed lines using the redundancy factors specified in the legend. (B) Same as A, with results for R{2,3}.
2 and 7 letters, input arrays containing from 10 to 50 characters, and representations varying in size by a factor of 2. We did not attempt to fit larger ranges of these experimental variables. Regarding the sensitivity of the model fits to the choice of d′ and r, we noted first that the fits were not strongly affected by the cumulative probability cutoff used to choose d′: qualitatively reasonable fits could also be generated using a cutoff value of 95%, for example, for which the d′ values for R{2} and R{2,3} shrank to 220 and 504, respectively. In contrast, we noted that a change of one part per thousand in the value of r could result in a significant change in the shape of a curve generated by equation 4.3, and with it a significant degradation of the quality of a fit. However, the systematic increase in the optimal value of r with increasing word size and representation size indicates that this redundancy correction factor is not an entirely free parameter. It is also worth emphasizing that a redundancy "factor," which uniformly scales w, c, and d, is a very crude model of the complex statistical dependencies that exist among English n-grams; the fact that any set of small r values can be found that tracks empirical recognition curves over wide variations in the several other task variables is therefore significant.

5 Greedy Feature Learning
Several lessons from the analysis and experiments are that (1) larger representations can lead to better recognition performance, (2) only those features that are actually used should be included in a representation, (3) considerable statistical redundancy exists in a fully enumerated (or randomly drawn) set of conjunctive features, which should be suppressed if possible, and (4) features of low binding order, while potentially adequate to represent isolated objects, are insufficient for veridical perception of scenes containing multiple objects. In addition, we infer that different objects are likely to have different representational requirements depending on their complexity, and that the composition of an ideal representation is likely to depend heavily on the particular recognition task. Taken together, these points suggest that a representation should be learned, one that includes higher-order features only as needed. In contrast, blind enumeration of an ever-increasing number of higher-order features is an expensive and relatively inefficient way to proceed. Standard network approaches to supervised learning could in principle address all of the points raised above, including finding appropriate weights for features of various orders depending on their utility and helping to decorrelate the input representation. However, the potential need to acquire features of high order, so high that enumeration and testing of the full cohort of high-order features is impractical, compels the use of some form of explicit search as part of the learning process. The use of greedy, incremental, or heuristic techniques to grow a neural representation through increasing nonlinear orders is not new (e.g., Barron, Mucciardi, Cook, Craig,
& Barron, 1984; Fahlman & Lebiere, 1990), though such techniques have been relatively little discussed in the context of vision. To cope with the many constraints that must be satisfied by a good visual representation, we developed a greedy supervised algorithm for feature learning that (1) selects the order in which features should be added to the representation and (2) selects which features added to the representation should contribute to the minimal feature sets of each object in the database. By differentiating the log probability $u_v = \log(p_v) = \log \prod_i (1 - h_i)$ with respect to the hallucination probability of the ith object, we find that

$$ \frac{du_v}{dh_i} = \frac{-1}{1 - h_i} \qquad (5.1) $$
indicating that the largest multiplicative increments to $p_v$ are realized by squashing the largest values of $h_i = (c/d)^{w_i}$. Practically, this is achieved by incrementing the w-values (i.e., growing the minimal feature sets) of the most frequently hallucinated objects. The following learning algorithm aims to do this in an efficient way:

1. Start with a small bootstrap representation. (We typically used 50 2-grams found to be useful in pilot experiments.)
2. Present a large number of text "images" to the representation.
3. Collect hallucinated words: any word that was ever detected but not actually present in an input image.
4. For each hallucinated word and the input array that generated it, increment a global frequency table for every n-gram (up to a maximum order of 5 and diameter of 6) that is (i) contained in the hallucinated word, (ii) not contained in the offending input, and (iii) not currently included in the representation.
5. Choose the most frequent n-gram in this table, of any order and diameter, and add it to the current representation.
6. Build a connection from the newly added n-gram to any word detector involved in a successful vote for this feature in step 4. Inclusion of this n-gram in these words' minimal feature sets will eliminate the largest number of word hallucination events encountered in the set of training images.
7. Go to step 2.

We ran the algorithm using input arrays 50 characters in width (often containing 10 or more words) and plotted the number of hallucinated words (see Figure 4A) and the proportion of input arrays causing hallucinations (see Figure 4B) as a function of increasing representation size during learning. The periodic ratcheting in the performance curves (most visible in the lower curve of Figure 4B) was due to staging of the training process for efficiency reasons: batches of 1000 input arrays were processed until a fixed hallucination error criterion of 1% was reached, when a new batch of 1000 random input arrays was drawn, and so on. The envelope of the oscillating performance curves provides an estimate of the true performance curve that would result from an infinitely large training corpus. Learning was terminated when the average initial error rate for three consecutive new training batches fell below 1%. Given that this termination condition was typically reached in fewer than 50 training epochs (together comprising 50,000 50-character input images), no more than half the text database was visited by random drawings of training-testing sets. We found that for 50-character input arrays, the hallucination error rate fell below 1% after the inclusion of 2287 n-grams in R. These results were typical. The inferior performance of a simple unsupervised algorithm, which adds n-grams to the representation in order of their raw frequencies in English, is shown in Figures 4A and 4B for comparison (upper dashed curves). The counts of total words hallucinated (see Figure 4A), with initial values indicating many tens of hallucinated words per input, can be seen to fall far faster than the hallucination error rate (see Figure 4B), since a sentence was scored as an error until its very last hallucinated word was quashed.
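The control flow of steps 2 through 7 can be sketched as follows. For brevity, we use contiguous substrings as a stand-in for the full wild-card n-gram family, so this illustrates the greedy loop, not the paper's exact feature set:

```python
from collections import Counter, defaultdict

def candidate_ngrams(word, max_len=5):
    # Simplified feature family: contiguous substrings up to length 5.
    # (The paper's family also allows wild cards up to diameter 6.)
    return {word[i:i + n]
            for n in range(2, max_len + 1)
            for i in range(len(word) - n + 1)}

def greedy_grow(words, images, R, detectors, rounds=100):
    """Greedy loop (steps 2-7): repeatedly add the n-gram voted for by
    the most hallucination events, wiring it into the minimal feature
    sets of the words it rescues."""
    for _ in range(rounds):
        votes, voters = Counter(), defaultdict(set)
        for img in images:
            on = {f for f in R if f in img}              # step 2
            for w in words:
                if detectors[w] and detectors[w] <= on and w not in img:
                    # steps 3-4: hallucination; vote for n-grams that are
                    # in the word, absent from the input, and not in R
                    for g in candidate_ngrams(w):
                        if g not in img and g not in R:
                            votes[g] += 1
                            voters[g].add(w)
        if not votes:
            break                                        # no hallucinations left
        best = votes.most_common(1)[0][0]                # step 5
        R.add(best)
        for w in voters[best]:
            detectors[w].add(best)                       # step 6
    return R, detectors

# A "cows" detector wired to 1-grams c and s hallucinates on "the cast";
# one round of learning adds a vetoing 2-gram such as "co" or "ws".
R, det = greedy_grow(["cows"], ["the cast ran"],
                     {"c", "s"}, {"cows": {"c", "s"}})
```

In the toy run, any substring of cows absent from the input suffices to veto the hallucination, mirroring the algorithm's bias toward the lowest-order feature that gets the job done.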
Figure 4: Facing page. Results of the greedy n-gram learning algorithm using input arrays 50 characters in width and a 50-element bootstrap representation. Training was staged in batches of 1000 input arrays until the steady-state hallucination error rate fell below 1%. (A) Growth of the representation is plotted along the x-axis, and the number of words hallucinated at least once in the current 1000-input training batch is plotted on the y-axis. The drop is far faster for the greedy algorithm (lower curve) than for a frequency-based unsupervised algorithm (upper curve). The unsupervised bootstrap representation also included the 27 1-grams (i.e., individual letters plus the space character) that were excluded from the bootstrap representation used in the greedy learning run. Without these 1-grams, performance of the unsupervised algorithm was even more severely hampered; with these 1-grams incorporated in the greedy learning runs, the w-values of every object were unduly inflated. (B) Same as A, but the y-axis shows the proportion of input arrays producing at least one hallucination error. Jaggies occur at transitions to new training batches each time the 1% error criterion was reached. Asymptote is reached in this run after the inclusion of 2287 n-grams. (C) A breakdown of the learned representation showing the relative numbers of n-grams of each order and diameter. (D) Column height shows the minimal feature set size after learning, that is, the average number of n-grams feeding word conjunctions for words of various lengths. The dark lower portion of each column shows the initial count based on full connectivity to the bootstrap representation; the light upper portion shows the increment in w due to learning. Nearly constant w values after learning reflect the tendency of the algorithm to grow the smallest minimal feature sets. The dashed line corresponds to one n-gram per letter in a word.
Minimizing Binding Errors
749
One of the most striking outcomes of the greedy learning process is that the learned representation is heavily weighted toward low-order features, as shown in Figure 4C. Thus, while the learned representation includes n-grams up to order 5, the number of higher-order features required to remedy perceptual errors falls off precipitously with increasing order, far faster than the explosive growth in the number of available features at each order. Quantitatively, the ratios of n-grams included in R to n-grams contained in the word database for orders 2, 3, 4, and 5 were 0.42, 0.0095, 0.00017, and 0.0000055, respectively. The rapidly diminishing ratios of useful features to available features at higher orders confirm the impracticality of learning on a fully enumerated high-order feature set and illustrate the natural bias in the learning procedure to choose the lowest-order features that "get the job done." The dominance of low-order features is of considerable practical significance, since higher-order features are more expensive to compute, less robust to image degradation, operate with lower duty cycles, and provide a more limited basis for generalization. The relatively broad distribution of diameters of the learned feature set is also shown in Figure 4C for n-grams of each order.

[Figure 4 appears here (see caption on facing page). Panels: (A) Fewer words are hallucinated as the representation grows; (B) fewer inputs cause hallucinations as the representation grows (both plotted against the number of n-grams included in the representation, for the greedy and unsupervised algorithms); (C) sizes of n-grams in the final learned representation, by order and diameter; (D) number of n-grams feeding words of varying lengths, initial versus learned.]

Bartlett W. Mel and József Fiser

In spite of its relatively small size, it is likely that the n-gram representation produced by the greedy algorithm could be significantly compacted without loss of recognition performance, by alternately pruning the least useful n-grams from R (those leading to minimal increments in error rate) and retraining back to criterion. However, such a scheme was not tried. After learning to the 1% error criterion, a 50-character input array activated an average of 168 (7.3%) of the 2287 detectors in R, while the average minimal feature set size for an individual word was 11.4. By comparison, the initial c/d ratio for 50-character inputs projected onto the 50-element bootstrap representation was a much less favorable 44%. Thus, a primary outcome of the learning algorithm in this case involving heavy clutter is a larger, more sparsely activated representation, which minimizes collisions between input images and the feature sets associated with individual target words. Further, the estimated r value from this run was 1.95, indicating significantly less redundancy in the learned representation than was estimated using seven-letter words (r = 3.45) in the fully enumerated 2-gram representations reported in Figure 3; this is in spite of the fact that the learned representation contained more than twice the number of active features as the fully enumerated 2-gram representation. The total height of each column in Figure 4D shows the w value, that is, the size of the minimal feature set after learning for words of various lengths. The dark lower segment of each column counts initial connections to the word conjunctions from the 50-element bootstrap representation, while the average number of new connections gained during learning for words of each length is given by the height of the gray upper segment.
Given the algorithm's tendency to expand the smallest minimal feature sets, the w values for short words grew proportionally much more than those for longer words, producing a final distribution of w values that was remarkably independent of word length for all but the shortest words. The w values for one-, two-, and three-letter words were lower either because they reached their maximum possible values (4 for one-letter words) or because these words contained rarer features, on average, derived from a relatively high concentration in the database of unpronounceable abbreviations, acronyms, and so forth. The dashed line indicates a minimal feature set size equal to the number of letters contained in the word; the longest words can be seen to depend on substantially fewer n-grams than they contain letters. The pressures influencing both the choice of n-grams for inclusion in R and the assignment of n-grams to the minimal feature sets of individual words were very different from those associated with a frequency-based learning scheme. Thus, the features included in R were not necessarily the most common ones, and the most commonly occurring English words were not necessarily the most heavily represented.
Table 3: First 10 n-grams to be added to R during a learning run with 50-character input arrays, initialized with a 200-element bootstrap representation.

Order of Inclusion   n-Gram    Relative Frequency   Cumulative %
 1                   [z ]           417                  8
 2                   [ j]        14,167                 55
 3                   [k* ]       39,590                 71
 4                   [x ]         6,090                 42
 5                   [ki]        20,558                 62
 6                   [ q]         8,771                 47
 7                   [ *v]       24,278                 64
 8                   [ **z]       3,184                 31
 9                   [o*i]       37,562                 70
10                   [p*** ]     55,567                 76

Note: The " " character represents the space character, and "*" matches any character. A value of k in the cumulative distribution column indicates that the total probability summed over all less common features is k%.
First, in lieu of choosing the most commonly occurring n-grams, the greedy algorithm prefers features that are relatively rare in English text as a whole but relatively common in those words most likely to be hallucinated: short words and words containing relatively common English structure. The paradoxical need to find features that distinguish objects from common backgrounds, but do so as often as possible, leads to a representation containing features that occur with moderate frequency. Table 3 shows the first 10 n-grams added to a 200-element bootstrap representation during learning in one experimental run. The features' relative frequencies in English jump about erratically, but fall mostly within the middle ranges of the cumulative probability distribution for English n-grams. Most of these first 10 features include a space character, indicating that failure to represent elemental features properly in relation to object boundaries is a major contributor to hallucination errors in this domain. Specifically, we found that hallucination errors are frequently caused by the presence in the input of multiple "distractor" objects that together contain most or all of the features of a target object; for example, national and vacation together contain nearly all the features of the word nation. To distinguish nation from the superposition of these two distractors requires the inclusion in the representation of a 2-gram such as [n***** ], whose diameter of 7 exceeds the maximum value of 6 arbitrarily established in the present experiments. Having established that the features included in R are not keyed in a simple way to their relative frequencies in English, we also observed that no simple relation exists between word frequency and the per-word representational cost, that is, the minimal feature set size established during
Table 4: Correlation between word frequency in English and minimal feature set size after learning was only 0.1.

Word      Relative Frequency   Minimal Feature Set Size*
resting        100                 25
leading        315                 24
buffalo        188                  8
unhappy        144                  7
thereat          2                 25
fording          1                 23
bedknob          1                  7
jackdaw          1                  6

*Number of n-grams in R that feed the word's conjunction, that is, which must be present in order for the word to be detected. The table shows examples of common and uncommon words with large and small minimal feature sets.
learning. In fact, if attention is restricted to the set of all words of a given length, with nominally equivalent representational requirements, the correlation between the relative frequency of a word in English and its postlearning minimal feature set size was only 0.1. Thus some common words require a conjunction of many n-grams in R, while others require few, and some rare words require many n-grams while others require few (see Table 4).

5.1 Sensitivity of the Learned Representation to Object Category Structure. Words are unlike common objects in that every word forms its own category, which must be distinguished from all other words. In text world, therefore, the critical issue of generalization to novel class exemplars cannot be directly studied. To address this shortcoming partially and assess the impact of broader category structures on the composition of the n-gram representation, we modified the learning algorithm so that a word was considered to be hallucinated only if no similar (rather than identical) word was present in the input when the target word was declared recognized. In this case, similar was defined as different up to any single letter replacement; for example, dog = {fog, dig, ...}, but dog ≠ {fig, og, dogs, ...}. The resulting category structure imposed on words was thus a heavily overlapping one in which most words were similar to several other words.[4]

[4] This learning procedure, combined with the assumption that every word is represented by only a simple conjunction of features in R (disjunctions of multiple prototypes per word were not supported), cannot guarantee that every word detector is activated in response to any of the legal variants of the word. Rather, the procedure allows a word detector to be activated by any of the legal variants without penalty.
[Figure 5 appears here: bar chart, "Decrease in Representational Costs for Broader, Overlapping Object Categories," showing the number of n-grams by n-gram order (2 through 5) for tolerance = 0 versus tolerance = 1.]
Figure 5: Breakdown of representations produced by two learning runs differing in breadth of category structure. Fifty-character input arrays were used as before to train to an asymptotic 1% error criterion, this time beginning with a 200-element bootstrap representation. The run with tolerance = 0 was as in Figure 4, where a hallucination was scored whenever a word's minimal feature set was activated but the word was not identically present in the input. In the run with tolerance = 1, a hallucination was not scored if any similar word was present in the input, up to a single character replacement. The latter run, with its more lenient category structure, led to a representation that was significantly smaller, since it was not required to discriminate as frequently among highly similar objects.
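The tolerance = 1 similarity test (same length, at most one character replaced) is simple to state in code; the sketch below is a minimal illustration, not taken from the paper:

```python
def similar(a: str, b: str) -> bool:
    """Tolerance-1 category test: words are 'similar' iff they have equal
    length and differ in at most one character position (replacement only;
    insertions and deletions do not count)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1
```

Under this test, dog is similar to fog and dig, but not to dogs, og, or fig, matching the category structure described above.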
The main result of using this more tolerant similarity metric during learning is that fewer word hallucinations are encountered per input image, leading to a final representation that is smaller by more than half (see Figure 5). In short, the demands on a conjunctive n-gram representation are lessened when the object category structure is broadened, since fewer distinctions among objects must be made.

5.2 Scaling of Learned Representation with Input Clutter. As is confirmed by the results of Figure 3, one of the most pernicious threats to a spatially invariant representation is that of clutter, defined here as the quantity of input material that must be simultaneously processed by the representation. The difficulty may be easily seen in equation 4.3, where an unopposed increase in the value of c leads to a precipitous drop in recognition performance. On the other hand, equation 4.3 also indicates that increases in c can be counteracted by appropriate increases in d and w, which conserve the value of (c/d)^w. To examine this issue, we systematically increased the clutter load in a series of three learning runs from 25 to 75 characters, training always to a fixed 3% error criterion, and recorded for each run the postlearning values of c, d, and w. Although the WSJ database contained a wide range of word sizes (from 1 to 20 characters in length), the measurement of a single value for w averaged over all words was justified since, as shown in Figure 4D, the learning algorithm tended to equalize the minimal feature set sizes for words independent of their lengths. Under the assumption of uniformly probable, statistically independent features in the learned representations, the value of h = (c/d)^w for all three runs should be equal to ≈ 7 × 10^−7, the value that predicts a 3% error rate for a 44,414-object database using equation 4.3. However, corrections (w → w/r) were needed to factor out unequal amounts of statistical redundancy in representations of various sizes, where larger representations contained more redundancy. The measured values of c, d, and w and the empirically determined r values are shown for each run in Table 5. As in the runs of Figure 3, redundancy values were again systematically greater for the larger representations generated, and the redundancy values calculated here for the learned representation were again significantly smaller than those seen for the fully enumerated 2-gram representation.

Table 5: Measured average values of c, d, and w, and calculated values of r, in a series of three learning runs with input clutter load increased from 25 to 75 characters, trained to a fixed error criterion of 3%.

Input Width     c       d       w       r
25             55      816     7.6    1.45
50            152     1768    10.7    1.85
75            253     2634    12.8    2.11

Note: The 50-element bootstrap representation was used to initialize all three runs. Equation 4.3 was used to find values of r for each run, such that with w → w/r, the measured values of c and d, and N = 44,414, a 3% error rate was predicted.

6 Discussion
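The redundancy-corrected fit can be checked numerically against the Table 5 values. The sketch below assumes a first-order reading of equation 4.3, error ≈ N · (c/d)^(w/r), which is our interpretation rather than code from the paper:

```python
N = 44_414  # number of words in the database

# Table 5 values: input width -> (c, d, w, r)
runs = {25: (55, 816, 7.6, 1.45),
        50: (152, 1768, 10.7, 1.85),
        75: (253, 2634, 12.8, 2.11)}

for width, (c, d, w, r) in runs.items():
    p_err = N * (c / d) ** (w / r)  # redundancy-corrected error estimate
    print(f"width {width}: predicted error rate {100 * p_err:.1f}%")
```

With the correction w → w/r applied, all three runs land close to the 3% training criterion.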
Using a simple analytical model as a taking-off point and a variety of simulations in text world, we have shown that low-ambiguity perception in unsegmented multiobject scenes can occur whenever the probability of fully activating any single object's minimal feature set, by a randomly drawn scene, is kept sufficiently low. One prescription for achieving this condition is straightforward: conjunctive features are mined from hallucinated objects until false-positive recognition errors are suppressed to criterion. Even in a complex domain containing a large number of highly similar object categories and severe background clutter, this prescription can produce a remarkably compact representation. This confirms that brute-force enumeration of large families or, worse, all high-order conjunctive features (the combinatorial explosion recognized by Malsburg) is not inevitable. The analysis and experiments reported here have provided several insights into the properties of feature sets that support hallucination-proof recognition and into learning procedures capable of building these feature sets:

1. Efficient feature sets are likely to (1) contain features that span the range from very simple to very complex, (2) contain relatively many simple features and relatively few complex features, (3) emphasize features that are only moderately common (giving a representation that is neither sparse nor dense), in response to the conflicting constraints that features should appear frequently in objects but not in backgrounds also composed of objects, and (4) in spatial domains, emphasize features that encode the relations of parts to object boundaries.

2. An efficient learning algorithm works to drive toward zero, and therefore to equalize, the false-positive recognition rates for all objects considered individually. Thus, frequently hallucinated objects, that is, objects with few parts or common internal structure, or both, demand the most attention during learning.
Two consequences of this focus of effort on frequently hallucinated objects are that (1) the average value of w, the size of the minimal feature set required to activate an object, becomes nearly independent of the number of parts contained in the object, so that simpler objects are relatively more intensively represented than complex objects, and (2) among objects of the same part complexity, the minimal conjunctive feature sets grow largest for objects containing the most common substructures, though these are not necessarily the most common objects. A curious implication of the pressure to heavily represent frequently hallucinated objects is that the backgrounds in which objects are typically embedded can strongly influence the composition of the optimal feature set for recognition.

3. The demands on a visual representation are heavily dependent on the object category structure imposed by the task. Where object classes are large and diffuse, the required representation is smaller and weighted
to features of lower conjunctive order, whereas for a category structure like words, in which every object forms its own category that must often be distinguished from a large number of highly similar objects (e.g., cat from cats, cut, rat), the representation must be larger and depend more heavily on features of higher conjunctive order.

6.1 Further Implications of the Analytical Model. Equation 4.3 provides an explicit mathematical relation among several quantities relevant to multiple-object recognition. Since the equation assumes statistical independence and uniform activation probability of the features contained in the representation, neither of which is a valid assumption in most practical situations, we were unable to predict recognition errors using measured values for d, c, and w. However, we found that error rates in each case examined could be fitted using a small correction factor for statistical redundancy, which uniformly scaled d, c, and w and whose magnitude grew systematically for larger collections of features activated by larger (more redundant) units of text. From this we conclude that equation 4.3 at least crudely captures the quantitative trade-offs governing recognition performance in receptive-field-based visual systems. One of the pithier features of equation 4.3 is that it provides a criterion for good recognition performance: (c/d)^w ≪ 1/N. The exponential dependence on the minimal feature set size, w, means that modest increases in w can counter a large increase in the number of object categories, N, or unfavorable increases in the activation density, c/d, which could be due to either an increase in the clutter load or a collapse in the size of the representation.
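The leverage of the exponent w can be checked with rough numbers; the first-order error model N · (c/d)^(w/r), and the values c/d ≈ 0.07 and r ≈ 2, are assumptions based on the run of Figure 4, not code from the paper:

```python
x, r = 0.07, 2.0  # activation density c/d and redundancy correction

def err(N, x, w):
    """First-order hallucination estimate: N * (c/d) ** (w/r)."""
    return N * x ** (w / r)

# Raising w from 10 to 12 lowers the per-object error by a factor of 1/x,
# so a comparable (more than 10-fold) increase in N can be absorbed:
gain = err(1, x, 10) / err(1, x, 12)

# Alternatively, the same increase in w tolerates a rise in c/d by the
# factor f solving (f * x) ** (12 / r) == x ** (10 / r), i.e. nearly 60%:
f = x ** (-r / 12)

# And halving the clutter c at w/r = 5 cuts the error 2 ** 5 = 32-fold:
cut = err(1, x, 10) / err(1, x / 2, 10)
```

These three quantities reproduce the trade-offs quoted in the surrounding text.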
For example, using roughly the postlearning values of c/d ≈ 0.07 and r ≈ 2 from the run of Figure 4, we find that increasing w from 10 to 12 can compensate for a more than 10-fold increase in N, or a nearly 60% increase in the c/d ratio, while maintaining the same recognition error rate. The first-order expansion of equation 4.3 also states that recognition errors grow as the wth power of c/d, where w is generally much larger than 1. A visual system constructed to minimize hallucinations therefore abhors uncompensated increases in clutter or reductions in the size of the representation. These effects are exactly those that have motivated the various compensatory strategies discussed in section 1. The abhorrence of clutter generates strong pressure to invoke any readily available segmentation strategies that limit processing to one or a small number of objects at a time. For example, cutting out just half the visual material in the input array when w/r = 5, as in the above example, knocks down the expected error rate by a factor of 32. The pressure to reduce the number of objects simultaneously presented to the visual system could account in part for the presence in biological visual systems of (1) a fovea, which leads to a huge overrepresentation of the center of fixation and marginalization of the surround, (2) covert attentional mechanisms, which selectively up- or down-modulate sensitivity to different portions of the visual field, and
(3) dynamical processes that segment "good" figures (in the Gestalt sense) from background. The second pressure involved in maintaining a low c/d ratio involves maintaining a visual representation of adequate size, as measured by d. One situation in which this is particularly difficult arises when the task mandates that objects be distinguished under excessively wide ranges of spatial variation, an invariance load that must be borne in turn by each of the individual feature detectors in the representation. For example, consider a relatively simple task in which the orientation of image features is a valid cue to object identity, such as a task in which objects usually appear in a standard orientation. In this case a bank of several orientation-specific variants of each feature can be separately maintained within the representation. In contrast, a more difficult task that requires that objects be recognized in any orientation forces each bank of orientation-specific detectors to be pooled together to create a single, orientation-invariant detector for that conjunctive feature. The inclusion of this invariance in the task definition can thus lead to an order-of-magnitude reduction in the size of the representation, which, unopposed, could produce a catastrophic breakdown of recognition according to equation 4.3. As a rule of thumb, therefore, the inclusion of an additional order-of-magnitude invariance must be countered by an order-of-magnitude increase in the number of distinct feature conjunctions included in the representation, drawn from the reservoir of higher-order features contained in the objects in question. The main cost in this approach lies in the additional hardware, requiring components that are more numerous, expensive, prone to failure, and used at a far lower rate.

6.2 Space, Shape, Invariance, and the Binding Problem.
While one emphasis in this work has been the issue of spatial invariance in a visual representation, we reiterate that the mathematical relation expressed by equation 4.3 knows nothing of space or shape or invariance, but rather views objects and scenes as collections of statistically independent features. What, then, is the role of space? From the point of view of predicting recognition performance, and with regard to the quantitative trade-offs discussed here, we can find no special status for worlds in which spatial relations among parts help to distinguish objects, since spatial relations may be viewed simply as additional sources of information regarding object identity. (Not all worlds have this property; an example is olfactory identification.) Further, the issue of spatial invariance plays into the mathematical relation of equation 4.3 only indirectly, in that it specifies the degree of spatial pooling needed to construct the invariant features included in the representation; once again, the details relating to the internal construction of these features are not visible to the analysis. In this sense, the spatial binding problem, which appears at first to derive from an underspecification of the spatial relations needed to glue an object together, in fact reduces to the more mundane problem of ambiguity arising
from overstimulation of the visual representation by input images, that is, a c/d ratio that is simply too large. The hallucination of objects based on parts contained in background clutter can thus occur in any world, even one in which there is no notion of space (e.g., matching bowls of alphabet soup). However, the fact that object recognition in real-world situations typically depends on spatial relations between parts, and the fact that recognition invariances are also very often spatial in nature, together strongly influence the way in which a visual representation based on spatially invariant receptive fields can be most efficiently constructed. In particular, many economies of design can be realized through the use of hierarchy, as exemplified by the seminal architecture of Fukushima et al. (1983). The key observation is that a population of spatially invariant, higher-order conjunctive features generally shares many underlying processing operations. The present analysis is, however, mute on this issue.

6.3 Relations to "Real" Vision. The quantitative relations given by equation 4.3 cast the problem of ambiguous perception in cluttered scenes in the simple terms of set intersection probabilities: a cluttered scene activates c of the d feature detectors at random and is correctly perceived with high probability as long as the set of c activated features only rarely includes all w features associated with any one of the N known objects. The analysis is thus heavily abstracted from the details of the visual recognition problem, most obviously ignoring the particular selectivities and invariances that parameterize the detectors contained in the representation.
On the other hand, the analysis makes it possible to understand in a straightforward way how the simultaneous challenges of clutter and invariance conspire to create binding ambiguity, and the degree to which compensatory increases in the size of the representation, through conjunctive order boosting, can be used to suppress this ambiguity. These basic trade-offs have operated close to their predicted levels in the text world experiments carried out in this work, in spite of violations of the simplifying assumptions of feature independence and equiprobability. Since the application of our analysis to any particular recognition problem involves estimating the domain-specific quantities c, d, w, and N, we do not expect predictions relating to levels of binding ambiguity or recognition performance to carry over directly from the problem of word recognition in blocks of symbolic text to other visual recognition domains, such as viewpoint-independent recognition of real three-dimensional objects embedded in cluttered scenes. On the other hand, we expect the basic representational trade-offs to persist. The most important way that recognition in more realistic visual domains differs from recognition in text world lies in the invariance load, which in most interesting cases extends well beyond simple translation invariance. Basic-level object naming in humans, for example, is largely invariant to
translation, rotation, scale, and various forms of distortion, occlusion, and degradation, invariances that persist even for certain classes of nonsense objects (Biederman, 1995). Following the logic of our probabilistic analysis, the daunting array of invariances that must be maintained in this and many other natural visual tasks would seem to present a huge challenge to biological visual systems. In keeping with this, performance limitations in human perception suggest that our visual systems have responded in predictable ways to the pressures of the various tasks with which they are confronted; in particular, recognition in a variety of visual domains appears to include only those invariances that are absolutely necessary and those for which simple compensatory strategies are not available. For example, (1) recognition is restricted to a highly localized foveal acceptance window for text and other detailed figures, where by "giving up" on translation invariance, this substream of the visual system frees hardware resources for the several other essential invariances that cannot be so easily compensated for (e.g., size, font, kerning, orientation, distortion); (2) face recognition operates under strong restrictions on the image-plane orientation and direction of lighting of faces (Yin, 1969; Johnston, Hill, & Carman, 1992), a reasonable compromise given that faces almost always present themselves to our visual systems upright, with positive contrast and lighting from above; and (3) unlike the case for common objects, reliable discrimination of complex three-dimensional nonsense objects (e.g., bent paper clips, crumpled pieces of paper) is restricted to a very limited range of three-dimensional orientations in the vicinity of familiar object views (Biederman & Gerhardstein, 1995), though this deficiency can be overcome in some cases by extensive training (Logothetis & Pauls, 1995).
When viewed within our framework, the exceptional difficulty of this latter task arises from the need for, or more precisely the lack of, the very large number of very-high-order conjunctive features (essentially full object views locked to specific orientations; see Figure 1B) that are necessary to support reliable viewpoint-invariant discrimination among a large set of objects with highly overlapping part structures. In summary, the primate visual system manifests a variety of domain-specific design compromises and performance limitations, which can be interpreted as attempts to achieve simultaneously the needed perceptual invariances, acceptably low levels of binding ambiguity, and reasonable hardware costs. In considering the astounding overall performance of the human visual system in comparison to the technical state of the art, it is also worth noting that human recognition performance is based on neural representations that could contain tens of thousands of task-appropriate visual features (extrapolated from the monkey; see Tanaka, 1996), spanning a wide range of conjunctive orders and finely tuned invariances. In continuing work, we are exploring further implications of the trade-offs discussed here in relation to the neurobiological underpinnings of spatially invariant conjunctive receptive fields in visual cortex (Mel, Ruderman, & Archie, 1998) and to the development of more efficient supervised
and unsupervised hierarchical learning procedures needed to build high-performance, task-specific visual recognition machines.

Acknowledgments
Thanks to Gary Holt for many helpful comments on this work. This work was funded by the Office of Naval Research.

References

Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and applications in the United States of algorithms related to GMDH. In S. Farrow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.
Biederman, I. (1995). Visual object recognition. In S. Kosslyn & D. Osherson (Eds.), An invitation to cognitive science (2nd ed., pp. 121–165). Cambridge, MA: MIT Press.
Biederman, I., & Gerhardstein, P. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). J. Exp. Psychol. (Human Perception and Performance), 21, 1506–1514.
Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.
Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.
Douglas, R., & Martin, K. (1998). Neocortex. In G. Shepherd (Ed.), The synaptic organization of the brain (pp. 567–509). Oxford: Oxford University Press.
Edelman, S., & Duvdevani-Bar, S. (1997). A model of visual recognition and categorization. Phil. Trans. R. Soc. Lond. [Biol.], 352, 1191–1202.
Fahlman, S., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Fize, D., Boulanouar, K., Ranjeva, J., Fabre-Thorpe, M., & Thorpe, S. (1998). Brain activity during rapid scene categorization: A study using event-related fMRI. J. Cog. Neurosci., suppl. S, 72.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man & Cybernetics, SMC-13, 826–834.
Gilbert, C. (1983). Microcircuitry of the visual cortex. Ann. Rev. Neurosci., 89, 8366–8370.
Heller, J., Hertz, J., Kjær, T., & Richmond, B. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.
Hummel, J., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.
Johnston, A., Hill, H., & Carman, N. (1992). Recognizing faces: Effects of lighting direction, inversion, and brightness reversal. Perception, 21, 365–375.
Minimizing Binding Errors
761
Jones, E. (1981). Anatomy of cerebral cortex: Columnar input-output relations. In F. Schmitt, F. Worden, G. Adelman, & S. Dennis (Eds.), The organization of cerebral cortex. Cambridge, MA: MIT Press. KuLcera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press. Lades, M., Vorbruggen, J., Buhmann, J., Lange, J., Malsburg, C., Wurtz, R., & Komen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311. Lang, G. K., & Seitz, P. (1997). Robust classication of arbitrary object classes based on hierarchical spatial feature-matching. Machine Vision and Applications, 10, 123–135. Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Science Press. Logothetis, N., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 3, 270–288. Logothetis, N., & Sheinberg, D. (1996). Visual object recognition. Ann. Rev. Neurosci., 19, 577–621. Malsburg, C. (1994).The correlation theory of brain function (reprint from 1981). In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 95–119). Berlin: Springer. McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception. Psych. Rev., 88, 375–407. Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation, 9, 777–804. Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual “complex” cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334. Mozer, M. (1991). 
The perception of multiple objects. Cambridge, MA: MIT Press. Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84. Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972. Pitts, W., & McCullough, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147. Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522. Sandon, P., & Urh, L. (1988). An adaptive model for viewpoint-invariant object recognition. In Proc. of the 10th Ann. Conf. of the Cog. Sci. Soc. (pp. 209–215). Hillsdale, NJ: Erlbaum. Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive eld histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2 pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press. Swain, M., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.
762
Bartlett W. Mel and J´ozsef Fiser
Szentagothai, J. (1977). The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. Lond. B, 201, 219–248. Tanaka, K. (1996). Inferotemporal cortex and object vision. Ann. Rev. Neurosci, 19, 109–139. Thorpe, S., Fize, D., & Marlot, C. (1996).Speed of processing in the human visual system. Nature, 381, 520–522. Van Essen, D. (1985). Functional organization of primate visual cortex. In A. Peters & E. Jones (Eds.), Cerebral cortex (pp. 259–329). New York: Plenum Publishing. Wallis, G., & Rolls, E. T. (1997).Invariant face and object recognition in the visual system. Prog. Neurobiol., 51, 167–194. Weng, J., Ahuja, N., & Huang, T. S. (1997). Learning recognition and segmentation using the cresceptron. Int. J. Comp. Vis., 25(2), 109–143. Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15. Yin, R. (1969). Looking at upside down faces. J. Exp. Psychol., 81, 141–145. Zemel, R., Mozer, M., & Hinton, G. (1990). TRAFFIC: Recognizing objects using hierarchical reference frame transformations. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 266–273). San Mateo, CA: Morgan Kaufmann. Received September 29, 1998; accepted March 5, 1999.
This paper was originally printed in Neural Computation, Volume 12, Number 2—pages 247–278. It is being reprinted here because of print errors in Figures 2, 4, and 5.
ARTICLE
Communicated by William Bialek
The Multifractal Structure of Contrast Changes in Natural Images: From Sharp Edges to Textures

Antonio Turiel
Néstor Parga
Departamento de Física Teórica, Universidad Autónoma de Madrid, 28049 Madrid, Spain

We present a formalism that leads naturally to a hierarchical description of the different contrast structures in images, providing precise definitions of sharp edges and other texture components. Within this formalism, we achieve a decomposition of the pixels of the image in sets, the fractal components of the image, such that each set contains only points characterized by a fixed strength of the singularity of the contrast gradient in its neighborhood. A crucial role in this description of images is played by the behavior of contrast differences under changes in scale. Contrary to naive scaling ideas, where the image is thought to have uniform transformation properties (Field, 1987), each of these fractal components has its own transformation law and scaling exponents. A conjecture on their biological relevance is also given.
1 Introduction
It has been frequently remarked that natural images contain redundant information. Two nearby points in a given image are very likely to have similar values of light intensity, except when an edge lies between them. In this case, the sharp change in luminosity also represents a loss of predictability, because the two points probably belong to two different objects, with almost independent luminosities. Since they cannot be predicted easily, it is frequently said that edges are the most informative structures in the image and that they represent its independent features (Bell & Sejnowski, 1997). Anyone's subjective experience is consistent with this: most people would start drawing a scene by tracing the lines of the obvious contours in the scene. Even more, the picture can be understood by anyone observing that sketch. Only afterward does one start adding texture to the flat areas in the drawing. These textures represent the less informative data about the position and intensity of the light sources. It seems that different structures can be distinguished in every image, following a hierarchy defined according to their information content, from the sharpest edges to the softest textures.

Neural Computation 12, 763–793 (2000)
© 2000 Massachusetts Institute of Technology
Image-processing techniques also emphasize that edges are the most prominent structures in a scene. A large variety of methods for edge detection have been proposed, among them high-pass filtering, zero crossing, and wavelet projections (Gonzalez & Woods, 1992). But once the edges have been extracted from the image, these methods do not address the problem of how to describe and obtain the other structures. In order to detect these other sets and extract them in the context of an organized hierarchy, it is necessary to provide a more general mathematical framework and, in particular, to define the variables suitable to perform this classification. In this article we present a more rigorous approach to these issues, proposing precise definitions of edges and other texture components of images. As we shall see, our formalism leads very naturally to a hierarchical description of the different structures in the images, achieving a decomposition of its pixels in sets, or components, ranging from the most to the least informative structures.

Another motivation to understand the role that edges play in natural images is related to the development of the early visual system. The brain has to solve the problem of how to represent and transmit efficiently the information it receives from the outside world. As has been frequently pointed out (Barlow, 1961; Linsker, 1988; Atick, 1992; van Hateren, 1992), cells located in the first stages of the visual pathway do this by trying to take advantage of statistical regularities in the image ensemble. The idea is that since natural images are highly redundant over space, they have to be represented in a compact and nonredundant way before being transmitted to inner areas in the brain. In order to do this, it is first necessary to know the regularities of the images, which could then be used to build more appropriate and efficient internal representations of the environment.
The statistical properties of natural images have been studied for many years, but the effort was mainly restricted to second-order statistics. The emphasis was put on the two-point correlation, first because of the simplicity of the analysis and then because its Fourier transform (the power spectrum) has a power-law behavior that reflects the existence of scale-invariant properties in natural images. This implies that the second-order statistics do capture some of the regularities. However, most of the underlying statistical structure remains unveiled. One can easily convince oneself that most of the correlations in the image are still present after whitening is performed: the contours of the objects in the whitened image are still easily recognizable (Field, 1987). The conclusion is that nongaussian statistical properties of edges are important to describe natural images, a fact that has also been recognized by Ruderman and Bialek (1994) and Ruderman (1994). However, no systematic studies of the statistical properties of edges had been done until very recently (Turiel, Mato, Parga, & Nadal, 1998).

A complete characterization of the regularities of natural images is impossible, and probably useless. Some intuition is necessary to guide the search for regularities. The aim of this work is to exploit the basic idea that
changes in contrast are relevant because they occur at the most informative pixels of the scene. Hence we focus on the statistics of changes in luminosity. Since these changes are graded, the task could seem rather difficult and even hopeless, but the solution to this problem turns out to be quite simple and elegant. A single parameter is enough to describe a full hierarchy of differences in contrast, and the hierarchy can be explicitly constructed. Moreover, it is intimately related to the scale-invariance properties of the image, which are certainly more elaborate than those that have been used until now but are still simple enough, and have plenty of structure, to be detected by a network adapting to the statistics of the stimuli.

To understand these scale invariances we first need the basic concepts of multiscaling and multifractality (Falconer, 1990), together with a simple explanation of their meaning for images. While in naive scaling there is a single transformation law throughout the image (which is then said to be a fractal), in the more general case there is no invariance under a global scale transformation, but the space can be decomposed in sets with a different transformation law for each of them (the image is then said to be a multifractal). To make this concept more precise, let us consider a simple situation where a set of images presents a single type of change in contrast: the luminosity suddenly changes from one (roughly) constant value to another. In this case, a natural description of the properties of changes in contrast would be to count the number of such changes that appear along a segment of size r, oriented in a given direction. The statistical distribution of the position and size of these jumps in contrast would give relevant information about an ensemble of such images. However, in real images, this counting is not enough to describe well the properties of contrast changes.
This is because these changes do not follow such a simple pattern: some of them can be sharp (the notion of sharp changes has to be given), but there will also be all types of softer textures that would be lost by counting just the most noticeable changes. It is then necessary to consider all of them as a whole. A natural way to deal with contrast changes is by defining a quantity that accumulates all of them, whatever their strength, contained inside a scale r, that is:

    ε_r(x⃗) = (1/r²) ∫_{x₁−r/2}^{x₁+r/2} dx′₁ ∫_{x₂−r/2}^{x₂+r/2} dx′₂ |∇C(x⃗′)| .   (1.1)
Hereafter, the contrast C(x⃗) is taken as C(x⃗) = I(x⃗) − ⟨I⟩, where I(x⃗) is the field of luminosities and ⟨I⟩ its average value across the ensemble. The bidimensional integral¹ on the right-hand side, defined on the set of pixels contained

¹ The variables x′₁, x′₂ are the components of the vector x⃗′, and |·| denotes the modulus of a vector.
in a square of linear size r, is a measure of that square.² It is divided by the factor r², which is the usual Lebesgue measure (which we will denote λ) of a square of linear size r. The quantity ε_r(x⃗) can then be regarded as the ratio between these two measures. More generally, we can define the measure μ of a subset A of image pixels as

    μ(A) = ∫_A dx⃗′ |∇C(x⃗′)|   (1.2)
(∫_A dx⃗′ means bidimensional integration over the set A)—what we will call the edge measure (EM) of the subset A. It is also possible to generalize the definition of ε_r to any subset A:

    ε_A = μ(A) / λ(A) .   (1.3)
ε_A is a density, a quantity that compares how much the distribution of |∇C| deviates from being homogeneously distributed on A. We shall call it the edge content (EC) of the set A. Denoting by B_r(x⃗) the "ball"³ of radius r centered around x⃗, it is clear that ε_r(x⃗) ≡ ε_{B_r(x⃗)}, so this definition generalizes the previous one. We shall refer to ε_r(x⃗) as the EC of x⃗ at the scale r. Notice that ε_{dx⃗}(x⃗) is the μ-density at x⃗, |∇C|(x⃗). Thus, equation 1.2 can be symbolically expressed as

    dμ(x⃗) = |∇C|(x⃗) dx⃗ .   (1.4)
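As a concrete illustration (our own sketch, not part of the paper), equations 1.2–1.4 discretize directly: the μ-density is approximated by the modulus of the finite-difference gradient, and the EC of a pixel set is its accumulated gradient mass per unit area. All function and variable names below are ours.

```python
import numpy as np

def mu_density(contrast):
    """Discrete mu-density |grad C| of equation 1.4."""
    gy, gx = np.gradient(contrast)
    return np.hypot(gx, gy)

def edge_measure(contrast, mask):
    """mu(A) of equation 1.2: gradient mass accumulated over the pixel set A."""
    return mu_density(contrast)[mask].sum()

def edge_content(contrast, mask):
    """eps_A = mu(A) / lambda(A) of equation 1.3 (lambda(A) = pixel count)."""
    return edge_measure(contrast, mask) / mask.sum()

# Toy image: a sharp vertical step edge on an otherwise flat background.
I = np.zeros((32, 32))
I[:, 16:] = 1.0
C = I - I.mean()                       # contrast C = I - <I>

# Square "balls" (cf. footnote 3): one straddling the edge, one in a flat region.
ys, xs = np.mgrid[0:32, 0:32]
on_edge = (np.abs(xs - 16) <= 4) & (np.abs(ys - 16) <= 4)
flat = (np.abs(xs - 4) <= 3) & (np.abs(ys - 16) <= 3)
```

The ball crossing the edge accumulates a finite edge measure while the flat ball accumulates none, which is exactly the contrast between informative and uninformative regions that ε_r is meant to quantify.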
The main point is that contrast changes are distributed over the image in such a way that the EC has rather large contributions even from pixels that are very close together. As a consequence, the measure presents a rather irregular behavior. But it is precisely in its irregularity where the information about the contrast lies. Let us assume, for instance, that as r becomes very small, the measure behaves at every point x⃗ as

    μ(B_r(x⃗)) ≡ μ_r(x⃗) = α(x⃗) r^{h(x⃗)+d} ,   (1.5)

where d = 2 is the dimension of the images. For convenience a shift in the definition of the exponent of μ_r(x⃗) has been introduced. The EC then verifies

    ε_r(x⃗) = α(x⃗) r^{h(x⃗)} ,   (1.6)
² Any measure of a set has the property of being additive; that is, if one splits the set in pieces, the measure of the set is equal to the sum of the measures of all its pieces.
³ "Ball" in a mathematical sense: the balls need not be true circles, but could also be squares or diamonds. In fact, here the "balls" will be taken as squares.
and this choice of h(x⃗) removes trivial dependencies on r. The exponent h(x⃗) is just the singularity exponent of |∇C|(x⃗). That is, h < 0 indicates a divergence of |∇C|(x⃗) to infinity, while h > 0 indicates finiteness and continuous behavior. The greater the exponent, the smoother is the density around that point. In general, we will talk about "singularity" even for the positive-h cases. Thus, the meaning of equation 1.5 is that all the points are singular (in this wide sense) and that the singularity exponent is not uniform. This irregular behavior allows us to classify the pixels in a given image: the set of pixels with a singularity exponent contained in the interval [h − Δh, h + Δh] (where Δh is a small positive number) defines a class F_h. These classes are the fractal components of the image. The smaller the exponent, the more singular is the class, and the most singular component is the one with the smallest value of h. A measure μ verifying equation 1.5 at every point x⃗ is called a multifractal measure.

We will refer to these components of the images as fractal sets because they are of a very irregular nature. One way to characterize the odd arrangement of the points in fractal sets is by counting the number of balls of radius r needed to cover them; we will denote by N_r(h, Δh) this number for the set F_h. As r → 0 it is verified that

    N_r(h, Δh) ∼ r^{−D(h)} .   (1.7)

This new exponent, D(h), quantifies the size of the set of pixels with singularity h as the image is covered with small balls of radius r. It is the fractal dimension of the associated fractal component F_h, and the function D(h) is called the dimension spectrum (or the singularity spectrum) of the multifractal. It is also worth estimating the probability for a ball of radius r to contain a pixel belonging to a given fractal component. Its behavior for small scales r is

    Prob(h, Δh) ∼ r^{d−D(h)} ,   (1.8)

where d = 2 as before. Notice that since r is small, this quantity increases as D(h) approaches d.⁴

The multifractal behavior also implies a multiscaling effect. Because every fractal component possesses its own fractal dimension, each changes differently under changes in the scale. Thus, although every component has no definite scale and is invariant under scale transformations, scale invariance is broken for a given image: the points in the image transform with different exponents according to the fractal component to which they belong. This leads to the presence of intermittency effects: if the image were
⁴ It can be proved that any set contained in a d-dimensional real space has fractal dimension D smaller than or equal to d.
enlarged, its statistical properties would change. Self-similarity is lost. However, scale invariance is restored in the averages over an ensemble of such images.

At this point, one may wonder if a multifractal behavior is actually realized in natural images. One way to check its validity is to verify that equation 1.5 is indeed correct. This in turn leads to the technical problem of how to obtain in practice the exponent h(x⃗) at a given pixel x⃗ of the image. This is a local analysis where the singularity is found by looking at the neighborhood of the point x⃗. It has the advantage that it not only provides a tool to check the multifractal character of the measure, but also gives a way to decompose each image in its fractal components. These issues are addressed in the next section. The multifractality of the measure can also be studied by means of statistical methods. In this case one looks for the effect of the singularities on the moments of marginal distributions of, for example, the EC defined in equation 1.1. This is done in section 3, where the important notions of self-similarity, extended self-similarity, and multiplicative processes are explained. Once the relevant mathematical tools are given, they are used in section 4 to analyze natural images. First, in section 4.2, we check that the measure introduced in equation 1.2 is indeed multifractal. Examples of fractal components of natural images are also presented in the same section. After this we study the consistency between the singularity analysis and the statistical analysis of the measure (section 4.3). The results of the numerical study of the self-similarity properties of the contrast C(x⃗) and of the measure density |∇C|(x⃗) are presented in section 4.4. The discussion of the results and perspectives for future work are given in the last section. Some technical aspects are included in the appendix.
2 Singularity Analysis: The Wavelet Transform
In this section we present the appropriate techniques to deal with multifractal measures, especially those adapted to natural images. To check whether μ defines a multifractal measure and to obtain the local exponent h(x⃗), we need to analyze the dependence of μ(B_r(x⃗)) on the scale r. Unfortunately, direct logarithmic regression performed on equation 1.5 yields rather coarse results when applied to discretized images, allowing only the detection of very sharp, well-isolated features. It is then necessary to use more sophisticated methods. A convenient technique for this purpose (Mallat & Huang, 1992; Arneodo, Argoul, Bacry, Elezgaray, & Muzy, 1995; Arneodo, 1996) is the wavelet transform (see, e.g., Daubechies, 1992). Roughly speaking, the wavelet transform provides a way to interpolate the behavior of the measure μ over balls of radius r when r is a noninteger number of pixels. Besides, if the image is a discretization of a continuous signal, wavelet analysis is the appropriate technique to retrieve the correct exponents, as we explain later.
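To see what the direct logarithmic regression just mentioned looks like in practice, here is a toy sketch (ours, not the paper's procedure): we build a synthetic μ-density with a known point singularity of nominal exponent h = −1/2 and fit the slope of log ε_r against log r, as equation 1.6 suggests. All names are our own.

```python
import numpy as np

def local_exponent(mu_density, y, x, radii):
    """Estimate h at pixel (y, x): slope of log eps_r vs. log r (eq. 1.6),
    with eps_r taken as the mean mu-density over a (2r+1)x(2r+1) square ball."""
    eps = [mu_density[y - r:y + r + 1, x - r:x + r + 1].mean() for r in radii]
    slope, _ = np.polyfit(np.log(radii), np.log(eps), 1)
    return slope

# Synthetic mu-density with a regularized |x|^(-1/2) point singularity at the
# center; the nominal singularity exponent there is h = -1/2.
n = 257
ys, xs = np.mgrid[0:n, 0:n]
dist = np.hypot(xs - n // 2, ys - n // 2)
density = (dist + 1.0) ** -0.5

h = local_exponent(density, n // 2, n // 2, radii=[4, 8, 16, 32, 64])
```

The recovered slope lands in the right ballpark (negative, near −1/2) but is visibly biased toward zero by discretization and the regularized core, which illustrates why the text now turns to the wavelet transform for finer estimates.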
The wavelet transform of a measure μ is a function that is localized neither purely in space nor purely in frequency: its arguments are both the site x⃗ and the scale r. It also involves an appropriate, smooth, interpolating function Ψ(x⃗). More precisely, the wavelet transform of μ is:

    T_Ψ^r dμ(x⃗) ≡ ∫ dμ(x⃗′) Ψ_r(x⃗ − x⃗′) = (dμ/dx⃗ ⋆ Ψ_r)(x⃗) ,   (2.1)
where Ψ_r(x⃗) ≡ (1/r^d) Ψ(x⃗/r). The important point here is that, for the case of a positive measure, it can be proved that

    T_Ψ^r dμ(x⃗₀) ∼ r^{h(x⃗₀)}   (2.2)
if and only if μ verifies equation 1.5 (h being exactly the same) and Ψ decreases fast enough (see Arneodo, 1996, and references therein). In that way, this property allows us to extract the singularities directly from T_Ψ^r dμ. It is very important to remark that Ψ can be a positive function—which makes an essential difference with respect to the analysis of multiaffine functions (see section 3.4).⁵

In general the ideal, continuous measure dμ(x⃗) is unknown. In these cases, the data are given by a discretized sampling, which in our case is a collection of pixels. One can reasonably argue that each pixel is the result of the convolution of an ideal signal with a compact-support function χ_a(x⃗) (describing, for example, the resolution ∼ a of the optical device). In this case, under convenient hypotheses, the wavelet transform yields the correct projection of the ideal signal. Let us call x⃗_{(n₁,n₂)} ≡ x⃗_n⃗ the positions of the pixels in the discretized sample and C_n⃗ the values of the discretized version of the contrast; then

    C_n⃗ = T_{χ_a} C(x⃗_n⃗) ≡ (C ⋆ χ_a)(x⃗_n⃗) .   (2.3)
Taking the sample {x⃗_n⃗} in such a way that the centers of adjacent pixels are at a distance larger than a, it is verified that

    ∇C_n⃗ ≡ ( C_{(n₁+1,n₂)} − C_{(n₁,n₂)} , C_{(n₁,n₂+1)} − C_{(n₁,n₂)} ) ∼ (∇C ⋆ χ_a)(x⃗_n⃗) ,   (2.4)

and so the discretized version μ^{(dis)} of the ideal measure μ can be considered as a wavelet coefficient of the continuous μ with the wavelet χ at the scale a, namely,

    dμ^{(dis)}_n⃗ ∼ dx⃗ (|∇C| ⋆ χ_a)(x⃗_n⃗) = dx⃗ T_{χ_a} dμ(x⃗_n⃗) .   (2.5)
⁵ This means that the exponents can be obtained with a nonadmissible wavelet Ψ, that is, a function with a nonzero mean.
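The scaling relation 2.2 can be demonstrated on the same kind of synthetic singular density (our own sketch; the Gaussian kernel plays the role of the positive, nonadmissible wavelet that footnote 5 permits). The projection is computed as a circular FFT convolution, which is only a convenience here, not anything the paper prescribes.

```python
import numpy as np

def wavelet_projection(density, r, half_width=3.0):
    """T_Psi^r dmu (eq. 2.1): convolve the mu-density with
    Psi_r(x) = (1/r^2) Psi(x/r), Psi a positive Gaussian (footnote 5)."""
    m = int(half_width * r)
    ys, xs = np.mgrid[-m:m + 1, -m:m + 1]
    psi_r = np.exp(-(xs**2 + ys**2) / (2.0 * r**2)) / r**2
    # Circular 2-D convolution via FFT (periodic boundaries; a sketch only).
    kernel = np.zeros_like(density)
    kernel[:2*m + 1, :2*m + 1] = psi_r
    kernel = np.roll(kernel, (-m, -m), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(density) * np.fft.fft2(kernel)))

# At a singular point, projections scale as r^h (eq. 2.2). With a regularized
# |x|^(-1/2)-type density, the log-log slope across scales recovers an
# exponent close to the nominal h = -1/2.
n = 256
ys, xs = np.mgrid[0:n, 0:n]
density = (np.hypot(xs - n//2, ys - n//2) + 1.0) ** -0.5
scales = [2.0, 4.0, 8.0, 16.0]
vals = [wavelet_projection(density, r)[n//2, n//2] for r in scales]
slope, _ = np.polyfit(np.log(scales), np.log(vals), 1)
```

Because the wavelet interpolates smoothly between pixel scales, the fit is noticeably more stable than the raw box regression of the previous sketch.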
Taking into account that the convolution is associative, the discretized wavelet analysis of dμ^{(dis)}_n⃗ with a wavelet Ψ can be simply expressed as

    T_Ψ^r dμ^{(dis)}(x⃗_n⃗) ∼ T_{Ψ⋆χ_{a/r}}^r dμ(x⃗_n⃗) ,   (2.6)
that is, analyzing the discretized measure μ^{(dis)} with a wavelet Ψ is equivalent to analyzing the continuous measure μ with a wavelet Ψ ⋆ χ_{a/r}. But if the scale r is large enough compared to the photoreceptor extent a, χ_{a/r}(x⃗) ∼ δ(x⃗), which does not depend on r. Thus, if the internal size a is small enough to allow fine detection, we can recover the correct exponent at the point x⃗_n⃗ by just analyzing the discretized measure. Conversely, the extent a imposes a lower cutoff on the details that can be observed in any image constructed from pixels. Because it is just the size of the pixel, the estimation of the singularity exponent at a given point requires performing the analysis over scales containing several pixels.

This analysis is similar to that performed in Mallat and Zhong (1991) and Mallat and Huang (1992): any image is analyzed by means of its scaling properties under wavelet projection. There are, however, important differences between the study performed by these authors and the work examined here. The first difference concerns the basic scalar field to be analyzed: Mallat and Zhong considered C(x⃗) instead of |∇C|(x⃗), which we prefer because it is the μ-density of a multifractal measure μ (see section 4). Another difference is methodological: we intend to classify the points according to their singularity exponent to form every fractal component, while the previous work was devoted to obtaining only the wavelet transform modulus maxima, a concept related to but different from what we will call the most singular component—that is, only one (although the most important) of the fractal components. The third difference concerns motivation: we want to classify every point with respect to a hierarchical scheme, while these authors search for a wavelet-based optimal codification algorithm, in the sense of providing an (almost) perfect reconstruction algorithm from the wavelet maxima.

3 Statistical Self-Similarity of Natural Images
In this section we introduce the basic concepts of the statistical approach, defining at the same time new variables that are closer to the wavelet analysis.

3.1 SS and ESS. From the statistical point of view, μ_r is a random variable, in the sense that given a random point x⃗ in an arbitrary image I, the value of μ_r(x⃗) cannot be deterministically predicted. By definition, the multifractal structure is a multiscaling effect. This multiscaling should be somehow apparent from the dependence of the probability density of μ_r on the scale parameter r. To determine the whole probability
density function of μ_r requires a large data set, especially for the rare events; besides, its dependence on r could mix multiscaling effects in a complicated way. On the contrary, the p-moments of μ_r (that is, the expectation values of μ_r raised to the pth power) suffice to characterize completely the probability density,⁶ and given that they are averages of powers of μ_r they are likely to depend on r in a simple way, somehow related to the behavior of the measure in equation 1.5.

It is more convenient to deal with the p-moments of normalized versions of μ_r such as the EC ε_r ≡ (1/r²) μ_r and the wavelet projections |T_Ψ^r dμ|. They are normalized in such a way that the first-order moments, ⟨ε_r⟩ and ⟨|T_Ψ^r dμ|⟩, do not depend on r: the trivial factor r^{pd} is hence removed from the random variable.

As we will see, the moments ⟨ε_r^p⟩ and ⟨|T_Ψ^r dμ|^p⟩ exhibit the remarkable scaling properties of self-similarity (SS):

    ⟨ε_r^p⟩ = α_p r^{τ_p}   (3.1)
(analogously for |T_Ψ^r dμ| with exponents τ_p^Ψ) and of extended self-similarity (ESS), referred to the moment of order 2 (Benzi, Ciliberto, Baudet, Chavarria, & Tripiccione, 1993; Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993; Benzi, Ciliberto, & Chavarria, 1995):

    ⟨ε_r^p⟩ = A(p, 2) [⟨ε_r²⟩]^{ρ(p,2)} ,   (3.2)
(and similarly for |T_Ψ^r dμ| with exponents ρ^Ψ(p, 2)). ESS has been referred to the moment of order 2, but equation 3.2 trivially implies that any moment can be expressed as a power of the moment of order q with an exponent ρ(p, q) = ρ(p, 2)/ρ(q, 2). In this way, we have the sets of SS exponents τ_p and ESS exponents ρ(p, 2) for ε_r, and similarly the exponents τ_p^Ψ and ρ^Ψ(p, 2) for T_Ψ^r dμ. Notice that if SS holds, so does ESS, in which case there is a simple relation between τ_p and ρ(p, 2):

    τ_p = τ_2 ρ(p, 2) .   (3.3)
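The SS fit of equation 3.1 can be rehearsed on a synthetic multiplicative cascade (our own construction, not the paper's image data): products of independent weights W with ⟨W⟩ = 1 have exactly power-law moments, with exponents τ_p known in closed form, so a Monte Carlo slope fit can be checked against them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_eps(n_samples, n_levels):
    """eps_r at scale r = 2^-n_levels as a product of i.i.d. cascade
    weights W in {0.5, 1.5}; <W> = 1 keeps <eps_r> scale independent."""
    w = rng.choice([0.5, 1.5], size=(n_samples, n_levels))
    return w.prod(axis=1)

def fitted_tau(p, n_samples=100_000, levels=range(2, 9)):
    """Estimate tau_p as the slope of log <eps_r^p> vs. log r (eq. 3.1)."""
    log_r = [-m * np.log(2.0) for m in levels]
    log_mom = [np.log(np.mean(sample_eps(n_samples, m) ** p)) for m in levels]
    slope, _ = np.polyfit(log_r, log_mom, 1)
    return slope

def tau_exact(p):
    """For this cascade, tau_p = -log2 <W^p>, which is nonlinear in p."""
    return -np.log2(0.5 * (0.5 ** p + 1.5 ** p))
```

Here tau_exact(1) = 0 while tau_exact(2) ≈ −0.32 and tau_exact(3) ≈ −0.81: the departure from τ_p ∝ p is the signature of a multifractal rather than monofractal structure, as discussed next.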
The actual dependence of τ_p and ρ(p, 2) on p determines the fractal structure of the system. A trivial dependence τ_p ∝ p would reveal a monofractal structure. A different dependence on p indicates the existence of a more complicated (multifractal) geometrical structure.

3.2 The Multiplicative Process. A random variable subtending an area of linear size r and possessing SS (see equation 3.1) and ESS (see equation 3.2)
⁶ Provided they do not diverge too fast with p.
can be described statistically by means of a multiplicative process (Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993; Novikov, 1994). The statistical formulation is rather simple: given two different scales r and L such that r < L, there is a simple stochastic relation between the ECs at these two scales,

    ε_r ≐ α_rL ε_L ,   (3.4)
where α_rL is a random variable, independent of the EC, which describes how the change between these two scales takes place. The symbol ≐ indicates that both sides of the equation are distributed in the same way, but this does not necessarily imply the validity of the relation at every pixel of a given image. The random variables α_rL between all possible pairs of scales r and L are said to define a multiplicative process.

Let us understand in a more intuitive way the meaning of this process. It is rather clear that given the distribution of ε_L at a fixed large scale L, to compute the distribution of the EC ε_r it is enough to know the distribution of α_rL. But let us now introduce an intermediate scale r′. One could also obtain first ε_{r′} using α_{r′L} and then ε_r by means of α_{rr′}. Thus, the multiplicative process must verify the cascade relation,

    α_rL = α_{rr′} α_{r′L} ,   (3.5)
1¡D b(1 ¡ D
dr , r dr ), r
with probability 1 ¡ [d ¡ D1 ] dr r with probability [d ¡ D1 ] drr .
(3.6)
The form of this process is determined by its compatibility with equation 3.5 (see, e.g., She & Leveque, 1994; She & Waymire, 1995; and Castaing, 1996, for a detailed study). Moreover, there are only two free parameters: it is veried that d ¡ D1 D D / (1 ¡ b). They have also to satisfy the following bounds: 0 < b < 1 and 0 < D < 1. The innitesimal process (see equation 3.6) can be easily interpreted in terms of the EC and the measure density: Probability D 1 ¡ [d ¡ D1 ] dr . This is the most likely situation. In r this case, the EC (and consequently the measure) changes smoothly under small changes in scale: ar,rCdr is close to one and its variation is proportional to d ln r.
Probability $[d-D_\infty]\,\frac{dr}{r}$. In this unlikely case, under infinitesimal changes in scale the EC undergoes finite variations, which is reflected in the fact that $\alpha_{r,r+dr}$ deviates from one by a factor $\beta$. The $\mu$-density $|\nabla C|$ must then be divergent somewhere along the boundary of the ball of radius $r$, and the parameter $\beta$ is a measure of how sharp this divergence is.

The process for noninfinitesimal changes in scale can be easily derived from the infinitesimal process presented in equation 3.6. In this case the random variable $\alpha_{rL}$ follows a log-Poisson process (She & Waymire, 1995); its probability distribution $\rho_{\alpha_{rL}}(\alpha_{rL})$ has the form

$$\rho_{\alpha_{rL}}(\alpha_{rL}) = \left[\frac{r}{L}\right]^{d-D_\infty} \sum_{n=0}^{\infty} \frac{\left[(d-D_\infty)\ln\frac{L}{r}\right]^n}{n!}\;\delta\!\left(\alpha_{rL}-\beta^n\left[\frac{r}{L}\right]^{-\Delta}\right). \tag{3.7}$$
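The log-Poisson law in equation 3.7 is straightforward to sample: draw $n$ from a Poisson distribution of mean $(d-D_\infty)\ln(L/r)$ and set $\alpha_{rL}=\beta^n(r/L)^{-\Delta}$. A minimal sketch (parameter values are illustrative, not the paper's fits) checking the resulting moments $\langle\alpha_{rL}^p\rangle = (r/L)^{\tau_p}$ against the SS exponents of equation 3.9:

```python
import numpy as np

# Sketch: sample alpha_rL = beta**n * (r/L)**(-Delta), with n Poisson of
# mean (d - D_inf) * ln(L/r), and compare its empirical moments with the
# exact prediction <alpha^p> = (r/L)**tau_p, tau_p from equation 3.9.
# Parameter values are illustrative: d = 2, beta = 0.5, Delta = 0.5.
rng = np.random.default_rng(0)
d, beta, Delta = 2, 0.5, 0.5
D_inf = d - Delta / (1 - beta)            # uses d - D_inf = Delta / (1 - beta)
r, L = 1.0, 16.0

n = rng.poisson((d - D_inf) * np.log(L / r), size=2_000_000)
alpha = beta**n * (r / L) ** (-Delta)

moments = {}
for p in (1, 2, 3):
    tau_p = -Delta * p + (d - D_inf) * (1 - beta**p)
    moments[p] = ((alpha**p).mean(), (r / L) ** tau_p)
    print(p, moments[p])
```

Each empirical moment should match its exact counterpart up to Monte Carlo noise, which illustrates how the two parameters $(\beta,\Delta)$ control the full hierarchy of moments.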
If a multiplicative process is realized, the EC has the SS and the ESS properties, equations 3.1 and 3.2. Thus, knowledge of the ESS exponents $\rho(p,2)$ and of $\tau_2$ is enough to compute $\tau_p$ using equation 3.3. For the log-Poisson process, the ESS exponents are

$$\rho(p,2) = \frac{p}{1-\beta} - \frac{1-\beta^p}{(1-\beta)^2}, \tag{3.8}$$
which depend only on $\beta$. This process was proposed in She and Leveque (1994) for turbulent flows, so we will refer to it as either the She-Leveque (S-L) or the log-Poisson model. The SS exponents are

$$\tau_p = -\Delta p + (d-D_\infty)(1-\beta^p), \tag{3.9}$$
which compared with equation 3.3 gives the following set of relations:

$$\begin{cases} \Delta = -\dfrac{\tau_2}{1-\beta} \\[8pt] d-D_\infty = -\dfrac{\tau_2}{(1-\beta)^2}. \end{cases} \tag{3.10}$$
This means that the model has only two free parameters, which can be chosen to be $\tau_2$ and $\beta$, or $\Delta$ and $D_\infty$, for instance. The geometrical interpretation of the last two is very interesting (Turiel et al., 1998) and will be explained in the next section. For the time being, let us notice that the maximum value of the EC, $\|\epsilon_r\|_\infty$, also follows a power law with exponent $-\Delta$,

$$\|\epsilon_r\|_\infty = a_\infty\, r^{-\Delta}, \tag{3.11}$$

and thus the parameter $\Delta$ characterizes the most divergent behavior present in natural images.
Antonio Turiel and Néstor Parga
3.3 SS and Multifractality. Let us now analyze in more detail the geometrical meaning of the model and its relation with the singularities described in section 2. For a multifractal measure $\mu$ with singularity spectrum $D(h)$, there is an important relation between the SS exponents $\tau_p$ and $D(h)$. Let us denote by $\rho_h(h)$ the probability density of the distribution of the singularity exponents $h$ in the image. Then, partitioning the image pixels according to their values of $h$, equation 1.6 can be expressed, for $r$ small enough, as

$$\langle \epsilon_r^p \rangle \sim \int dh\, \rho_h(h)\, \langle a^p \rangle_{F_h}\, r^{hp+d-D(h)}. \tag{3.12}$$

The factor $r^{d-D(h)}$ comes from the probability of a randomly chosen pixel being in the fractal $F_h$ (see equation 1.8) and guarantees the correct normalization of $\langle a^p \rangle_{F_h}$ (see Frisch, 1995; Arneodo, 1996, for details). For very small $r$, the application of the saddle-point method to equation 3.12 yields

$$\langle \epsilon_r^p \rangle = a_p\, r^{\tau_p} \propto r^{\min_h\{ph+d-D(h)\}}. \tag{3.13}$$

This gives a very interesting relation between $\tau_p$ and $D(h)$: $\tau_p$ is the Legendre transform of $D(h)$. The important point is that $D(h)$ is easily expressed in terms of $\tau_p$ by means of another Legendre transform, namely:

$$D(h) = \min_p\{ph+d-\tau_p\}. \tag{3.14}$$
If one has a model for the $\tau_p$'s (the S-L model, equation 3.9), the dimension spectrum can be predicted:

$$D(h) = D_\infty - \frac{h+\Delta}{\ln\beta}\left[1-\ln\left(-\frac{h+\Delta}{(d-D_\infty)\ln\beta}\right)\right]. \tag{3.15}$$
This is represented in Figure 8. There are natural cutoffs for the dimension spectrum; in particular, there cannot be any exponent $h$ below the minimal value $h_\infty = -\Delta$. Thus, $\Delta$ defines the most singular of the fractal components, which has fractal dimension $D(h_\infty) = D_\infty$. We could use these properties of the most singular fractal (its dimension and its associated exponent) to characterize the whole dimension spectrum, in agreement with the arguments given in section 3.2.^7

3.4 Multiaffinity. Related to multifractality there exists the simpler but more unstable property of multiaffinity (Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993). This is a characterization of chaotic, irregular scalar functions that somewhat generalizes the concepts of continuity and differentiability.

Footnote 7: Remember that $\beta$ can be expressed in terms of $\Delta$ and $D_\infty$: $\beta = 1-\Delta/(d-D_\infty)$.
First, let us give the concept of Hölder exponent: a scalar function $F(\vec x)$ is said to be Hölder of exponent $h_F(\vec x_0)$ at a given point $\vec x_0$ if for any point $\vec y$ close enough to $\vec x_0$ the following inequality holds:

$$|F(\vec y)-F(\vec x_0)| < A_0\, |\vec y-\vec x_0|^{h_F(\vec x_0)}, \tag{3.16}$$

$A_0$ being a constant depending on the point $\vec x_0$. We define the Hölder exponent of $F(\vec x)$ at $\vec x_0$ as the maximum of the exponents $h_F(\vec x_0)$ verifying equation 3.16. Defining the linear increment (LI) of $F(\vec x)$ by a displacement vector $\vec r$ as $\delta_{\vec r}F(\vec x) \equiv |F(\vec x+\vec r)-F(\vec x)|$, if the function $F$ has a Hölder exponent $h_F(\vec x)$ at $\vec x$, then

$$\delta_{\vec r}F(\vec x) \sim a_F(\vec x)\, r^{h_F(\vec x)}, \tag{3.17}$$
which is similar to equation 1.6. In the same spirit, we say that a function is multiaffine if equation 3.17 is verified at every point $\vec x$. Multiaffinity is a more intuitive concept than multifractality, because the Hölder exponents are good characterizations of Taylor-like local expansions of the function $F(\vec x)$. For instance, $h_F(\vec x) > 0$ means that the function is continuous at $\vec x$, $h_F(\vec x) > 1$ implies that the function has a continuous first derivative, and so on. It also works in the other sense: a function having exponent $h_F = -1$ behaves as badly as a $\delta$-function (see the appendix for a brief tutorial on Hölder exponents of simple functions). Even more, a function $F$ has Hölder exponent $h_F$ at a given point if and only if its first-order derivatives have Hölder exponent $h_F-1$ at the same point. Besides, multiaffinity implies multifractality in the following sense: given a measure $\mu$, if its density $\frac{d\mu}{d\vec x}$ is a multiaffine function, then $\mu$ is a multifractal measure having exactly the same exponents as $\frac{d\mu}{d\vec x}$ at every point.^8 For natural images, taking $F = |\nabla C|$, this fact can be expressed as

$$\epsilon_r \sim T_\Psi^r\, d\mu \sim \delta_{\vec r}|\nabla C|, \tag{3.18}$$

in the sense that the three variables have the same statistical dependence on $r$. This relation can also be expressed in terms of the SS exponents. Let us denote by $\tau_p$ the SS exponents of $\epsilon_r$, by $\tau_p^{\Psi}$ those of $T_\Psi^r\, d\mu$, and by $\tau_p^{\nabla C}$ those of $\delta_{\vec r}|\nabla C|$. Then the statistical relation in equation 3.18 implies that

$$\tau_p = \tau_p^{\Psi} = \tau_p^{\nabla C}. \tag{3.19}$$
We are also interested in the relation between $\tau_p$ and the SS exponents $\tau_p^C$ of the moments of $\delta_{\vec r}C$. If $C$ exhibits the multiaffine behavior shown in equation 3.17, then, recalling the meaning of the Hölder exponents in terms of differentiability, $\delta_{\vec r}|\nabla C|(\vec x)$ verifies

$$\delta_{\vec r}|\nabla C|(\vec x) \sim a_{|\nabla C|}(\vec x)\, r^{h_C(\vec x)-1}. \tag{3.20}$$

This can be statistically interpreted as

$$\epsilon_r \sim \delta_{\vec r}|\nabla C| \sim \frac{1}{r}\,\delta_{\vec r}C, \tag{3.21}$$

which reflects the shift by $-1$ in the exponents of the derivative of $C$.^9 In terms of the SS exponents, this implies that

$$\tau_p^C = \tau_p^{\nabla C} + p = \tau_p + p. \tag{3.22}$$

Footnote 8: Notice a shift of size $d$ in the definition of the exponents of $\mu$ (see equation 1.5) that is not present in the definition of the Hölder exponents of a multiaffine function (see equation 3.16).
Let us remark that the singularity spectra associated with the variables $\delta_{\vec r}C$, $\delta_{\vec r}|\nabla C|$, and $T_\Psi^r\, d\mu$ are the Legendre transforms of the SS exponents $\tau_p^C$, $\tau_p^{\nabla C}$, and $\tau_p^{\Psi}$, respectively (a fact that can be derived in the same way as equation 3.14).

There still remains the converse question: does multifractality imply multiaffinity? Unfortunately, this is not true (see Daubechies, 1992). In fact, the term $|F(\vec y)-F(\vec x_0)|$ in equation 3.16 suggests the existence of a pseudo-Taylor expansion, in the sense that

$$F(\vec y) \approx p_n(\vec y-\vec x_0) + A_0\, |\vec y-\vec x_0|^{h_F(\vec x_0)}, \tag{3.23}$$

where $p_n(\vec y-\vec x_0)$ is a polynomial of degree $n$ in the projection of the vector $\vec y-\vec x_0$ along its direction. We should thus generalize the concept of Hölder exponent: given a point $\vec x_0$, we will say that this point possesses a Hölder exponent $n < h_F(\vec x_0) < n+1$ ($n$ integer) if there exists a polynomial $p_n(\vec x)$ of degree $n$ and a constant $A_0$ such that

$$|F(\vec y)-p_n(\vec y-\vec x_0)| < A_0\, |\vec y-\vec x_0|^{h_F(\vec x_0)}, \tag{3.24}$$

and $h_F(\vec x_0)$ is the maximum of the exponents verifying such a relation. Multifractal measures with no multiaffine densities (as defined in equation 3.16) have been studied in Bacry, Muzy, & Arneodo (1993) and Arneodo (1996). The problem can be explained by the masking presence of the regular, polynomial part, which stands for global, nonlocalized effects in the signal field $F(\vec y)$. To remove this part, those authors propose a generalization of the LI. Instead of LIs of the function, wavelet transforms of the appropriate one-dimensional restriction are considered, taking the scalar field as a real-valued measure density. This wavelet transform of the function $F(\vec x)$ is defined as

$$T_\Psi^{\vec r} F(\vec x_0) \equiv \int ds\, F\!\left(\vec x_0 + s\,\frac{\vec r}{r}\right) \Psi_r(s), \tag{3.25}$$

Footnote 9: Let us note that the relation $r\,\epsilon_r \sim \delta_{\vec r}C$ is the analog of the Kolmogorov hypothesis of local similarity (Arneodo, 1996).
where $\Psi$ is a real, one-dimensional analyzing wavelet. Since the integral is one-dimensional, the wavelet at the scale $r$ is $\Psi_r(s) \equiv \frac{1}{r}\Psi\!\left(\frac{s}{r}\right)$. If the scale $r$ is small enough, the integral is dominated by the first terms in the pseudo-Taylor expansion of $F(\vec x)$ around $\vec x_0$, equation 3.23:

$$T_\Psi^{\vec r} F(\vec x_0) \approx \int ds\, p_n(s)\,\Psi_r(s) + A_0 \int ds\, |s|^{h_F(\vec x_0)}\,\Psi_r(s). \tag{3.26}$$

If the analyzing wavelet is now required to have at least its first $n$ integer moments vanishing, the singular part appears as the main contribution to the wavelet transform. (For later reference, let us say now that a wavelet with a vanishing zeroth-order moment, that is, a null integral, is called an admissible wavelet.) Hence, changing variables $t = s/r$ and defining $A = A_0 \int dt\, |t|^{h_F(\vec x_0)}\,\Psi(t)$, one has

$$T_\Psi^{\vec r} F(\vec x_0) \approx A\, r^{h_F(\vec x_0)}, \tag{3.27}$$

where all the dependence on $r$ stands in the prefactor $r^{h_F(\vec x_0)}$. In this way, the Hölder exponent (in the sense of equation 3.24) can be obtained by a wavelet projection, provided the wavelet has zero moments up to a large enough order. Since $h_F(\vec x_0)$ is noninteger, this procedure cannot cancel the singular part even if the order $n$ of the largest vanishing moment of $\Psi$ is larger than $h_F(\vec x_0)$. On the other hand, the wavelet would not be able to detect singularity exponents $h_F > n+1$. This is because a polynomial of order $n+1$ could still be present and dominate the contribution of the singular term. For this reason, it is important to determine the right order of the required zero moments of the wavelet. This is, then, the appropriate scheme to detect the Hölder exponents of signals with a regular part. In cases in which the $\mu$-density $|\nabla C|$ or its primitive $C$ is not multiaffine, the concept of multiaffinity (see equation 3.16) should be applied to their wavelet projections. It is then necessary to determine how many zero moments the wavelet should have, because they provide information on the global aspects of the signal. This generalization is straightforward, replacing in all the cases $\delta_{\vec r}C$ by $T_\Psi^{\vec r}C$ and $\delta_{\vec r}|\nabla C|$ by $T_\Psi^{\vec r}|\nabla C|$. For instance, if the wavelet transforms of $C$ and $|\nabla C|$ verify equation 3.27 for certain $A$ and $h_F(\vec x)$, the same arguments that led us to equations 3.21 and 3.22 can be used to derive that

$$\epsilon_r \sim T_\Psi^{\vec r}|\nabla C| \sim \frac{1}{r}\, T_\Psi^{\vec r} C, \tag{3.28}$$
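A small numerical illustration of equations 3.25 through 3.27 (a sketch on a synthetic one-dimensional signal, not the paper's data): the function $F(s) = 1 + 2s + |s|^{1.5}$ has Hölder exponent 1.5 at $s = 0$, masked by a regular polynomial part. A wavelet with two vanishing moments (the second derivative of a gaussian) removes the polynomial and recovers the exponent, while a positive (nonadmissible) gaussian sees the constant part and truncates the detection at $h = 0$:

```python
import numpy as np

def proj(F, psi, r, n=40001, span=12.0):
    # T_Psi^r F(0) = int ds F(s) Psi_r(s), with Psi_r(s) = Psi(s/r)/r
    s = np.linspace(-span * r, span * r, n)
    return (F(s) * psi(s / r) / r).sum() * (s[1] - s[0])

gauss = lambda t: np.exp(-t**2 / 2)                 # positive: nonadmissible
mexhat = lambda t: (1 - t**2) * np.exp(-t**2 / 2)   # zero 0th and 1st moments
F = lambda s: 1.0 + 2.0 * s + np.abs(s) ** 1.5      # Holder exponent 1.5 at 0

radii = np.array([0.02, 0.04, 0.08, 0.16])

def scaling_exponent(psi):
    c = np.abs([proj(F, psi, r) for r in radii])
    return np.polyfit(np.log(radii), np.log(c), 1)[0]

h_mexhat = scaling_exponent(mexhat)   # ~1.5: the true Holder exponent
h_gauss = scaling_exponent(gauss)     # ~0: the wavelet truncates at h = 0
print(h_mexhat, h_gauss)
```

This reproduces, in miniature, the point made below in section 4.4 about the need for enough vanishing moments.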
Table 1: Images from van Hateren's Database.

imk00034.imc  imk00801.imc  imk01164.imc  imk02649.imc  imk03807.imc
imk00211.imc  imk00808.imc  imk01406.imc  imk03134.imc  imk03842.imc
imk00478.imc  imk00881.imc  imk02035.imc  imk03514.imc  imk03940.imc
imk00586.imc  imk01017.imc  imk02603.imc  imk03536.imc  imk04042.imc
imk00605.imc  imk01032.imc  imk02626.imc  imk03789.imc  imk04069.imc

Note: The images can be observed and downloaded from the following URL: http://hlab.phys.rug.nl/archive.html.
which is expressed in terms of the SS exponents $\tau_p^{\Psi C}$ (of $T_\Psi^{\vec r}C$) and $\tau_p^{\Psi\nabla C}$ (of $T_\Psi^{\vec r}|\nabla C|$) as

$$\tau_p^{\Psi C} = \tau_p^{\Psi\nabla C} + p = \tau_p + p. \tag{3.29}$$
Let us make a final remark: it is possible that the wavelet projections of a given scalar function $F$ verify the scaling in equation 3.27 without verifying equation 3.24. This can happen, for instance, for functions with such an irregular behavior that the pseudo-Taylor expansion is not possible. In those cases, $T_\Psi^{\vec r}F$ can be used as a stronger generalization of the usual LI. As will be shown in section 4.4, this is the case for $|\nabla C|$.

4 Numerical Analysis of Images

4.1 Databases. We have used two different image data sets: one from Daniel Ruderman (1994) consisting of 45 black-and-white images of $256\times 256$ pixels with up to 13 bits of luminosity depth (from now on this set will be referred to as the first data set), and another containing 25 images taken from Hans van Hateren's database (see van Hateren & van der Schaaf, 1998), having $1536\times 1024$ pixels and 16 bits of luminosity depth (we will refer to these images as the second data set). These 25 images are enumerated in Table 1. The first data set has been used for the statistical analysis of the one-dimensional wavelet transform, equation 3.25, while the second data set was used for the more demanding task (from the statistics point of view) of computing the moments of the two-dimensional variable in equation 1.1. No matter which ensemble is considered, however, the studied properties seem to behave in essentially the same way.

4.2 Multifractality of the Measure and Singularity Analysis. We first address the issue of the multifractality of the measure $\mu$ defined in equation 1.2. Our numerical analysis gives an affirmative answer to this question: the wavelet transform $T_\Psi^r\, d\mu(\vec x_0)$ is well fitted by $\sim r^{h(\vec x_0)}$ in about 98% of the pixels $\vec x_0$ in each image of the two data sets. The analysis was done using
several wavelets, although the most convenient was the one-parameter family $\Psi_\gamma(\vec x) = \frac{1}{(1+|\vec x|^2)^\gamma}$ ($\gamma \geq 1$). These are positive (nonadmissible) wavelets. The singularity exponent of the measure at a given pixel, $h(\vec x)$, was obtained from a regression on equation 2.2. The value of the scale $r$ was typically taken in the range $(0.5, 2.0)$ in pixel units. Notice that for scales smaller than $r \sim 0.5$ the method would detect an artificial edge that actually corresponds to the pixel boundaries, which corresponds to $h = -1$. Similarly, as the scale becomes large, the wavelet can discriminate only between very sharp edges, and for $r$ of the order of the size of the image it would yield the value $h = 0$ everywhere. To obtain the fractal components $F_h$, the pixels of a given image were classified according to the value of their exponents. The exponents lie in the interval $(-0.5, 1.4)$; only a negligible fraction of pixels (about 0.8% of them) have exponents less than $-0.5$, and none has a singularity below $-1$. Also, there are no exponents larger than 1.4. This range of possible exponent values coincides exactly with that predicted in Turiel et al. (1998) and is in agreement with the model explained in section 3.3. The most singular component is the one with $h = -0.5$, defined within a window of size $\Delta h$. The visualization of this set is very instructive (see Figure 1). Clearly, the points appear to be associated with the contours of the objects present in the image. This is in agreement with our expectation that the behavior of the measure at these edges should be singular, and as much as possible. But there is something rather surprising: for a sharp, theta-like contrast, with independent values at both sides of the discontinuity, one would have expected an exponent $-1$, which appears as a Dirac $\delta$-function in the density measure $\nabla C$ (see example 4 in the appendix).
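The regression just described can be sketched on a synthetic one-dimensional stand-in (the image databases are not bundled here): project a density with a known singularity $|x|^{-0.5}$ onto the positive wavelet $\Psi_\gamma$ at several small scales and read off the exponent from a log-log fit.

```python
import numpy as np

# Synthetic 1-D sketch of the singularity analysis (illustrative, not the
# paper's image pipeline): project a density with a known singularity
# |x|**(-0.5) onto the positive wavelet Psi_gamma(x) = 1/(1 + x**2)**gamma
# and regress log T vs log r over small scales to recover h = -0.5.
def wavelet_proj(dens, r, gamma=1.0, span=200.0, dx=1e-4):
    x = np.arange(dx / 2.0, span, dx)             # midpoint grid, x > 0
    psi = 1.0 / (1.0 + (x / r) ** 2) ** gamma
    return 2.0 * np.sum(dens(x) * psi) * dx / r   # even density: twice x > 0

dens = lambda x: np.abs(x) ** (-0.5)
radii = np.array([0.5, 1.0, 1.5, 2.0])            # scales in "pixel" units
T = np.array([wavelet_proj(dens, r) for r in radii])
h = np.polyfit(np.log(radii), np.log(T), 1)[0]
print(h)   # close to the singularity exponent -0.5
```

Classifying pixels by the fitted $h$ (within windows $\Delta h$) is then what isolates the fractal components $F_h$.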
Our interpretation of the minimal value of $h$, $h_\infty = -0.5$, is that it reflects the existence of correlations among different fractal components of the images that smooth the singularities (in particular, those of the sharpest edges). It is also interesting to observe the fractal components with exponents just above the most singular one. Figure 2 shows the next two components, defined with $\Delta h = 0.05$. A comparison with Figure 1 shows that pixels in these components are close to pixels in the most singular one. It is also rather clear that the probability that a random ball of size $r$ contains a pixel from the component $F_h$ increases with $h$, at least for small values of this exponent. Since this probability scales as shown in equation 1.8, this behavior reveals an increase with $h$ of the dimensionality of the fractal components. It is also observed (although not shown in the figures) that the dimensionality starts decreasing beyond $h \approx 0.2$. This is in agreement with the arguments of section 3.3: the singularity spectrum $D(h)$ of the S-L model, shown in Figure 8, reflects this behavior of the fractal components. The analysis done in this subsection justifies the arguments presented in section 1. One cannot consider the sharpest edges of an image independent
Figure 1: Image from Ruderman's ensemble (Ruderman, 1994) and its most singular component (taken as the set of points having a singularity exponent in the interval $h = -0.5 \pm 0.05$).
Figure 2: Fractal components with exponents $h = -0.4 \pm 0.05$ (above) and $h = -0.3 \pm 0.05$ (below) for the same image represented in Figure 1. Notice the spatial correlations between components with similar singularities.
of the other texture structures (the other fractal components). Luminosity jumps from one roughly constant value to another (statistically independent) constant value are not typical in natural images. The sharpest edges appear correlated with smoother textures, an effect that tends to decrease the strength of their singularity exponent, reaching the value $h \sim -0.5$. Standard edge-detection techniques (Gonzalez & Woods, 1992) have not been designed to search for singularities of this strength. Good edge-detection algorithms should be specifically constructed to capture singularities close to $h_\infty$.

4.3 Statistical Analysis of the Measure. We computed the $p$-moments of three variables related to the measure:

- Bidimensional edge content ($\epsilon_r$). The definition is given in equation 1.1. This analysis was done using the 25 images of the second data set. Because the variable itself is positive, we computed directly the quantities $\langle \epsilon_r^p \rangle$.

- Two one-dimensional restrictions of the wavelet projection of the measure^10 ($T_\Psi^r\, d\mu$). We considered horizontal ($h$) and vertical ($v$) restrictions of the measure $d\mu$, that is, $d\mu_h = dx\,|\frac{\partial \mu}{\partial x}|$ and $d\mu_v = dy\,|\frac{\partial \mu}{\partial y}|$. The appropriate wavelet transforms are denoted by $T_\Psi^r\, d\mu_l$ ($l = h, v$). Because these coefficients need not be positive (in general $\Psi$ is not positive), we computed the moments of their absolute values, $\langle |T_\Psi^r\, d\mu_l|^p \rangle$.

It is remarkable that the bidimensional EC and the two one-dimensional wavelet projections exhibit the scaling properties of SS (see equation 3.1) and ESS (see equation 3.2); the ESS tests for the three variables are presented in Figures 3 and 4. This was also observed in Turiel et al. (1998), although for somewhat different variables: the one-dimensional horizontal and vertical restrictions of the local edge variance. Surprisingly, no matter the particular variable, the exponents $\tau_p$ and $\rho(p,2)$ obtained are very similar (see Figure 5 and those in Turiel et al., 1998). Therefore, they should refer to essential, robust aspects of images. The S-L model fits the experimental exponents well (see Figure 5). It seems, then, that the multifractal structure underlying the statistical description can be described with the log-Poisson process, which means that under infinitesimal changes in the scale there are only two possible types of transformations.

Footnote 10: Since the calculations involving $T_\Psi^r\, d\mu$ are highly computer time-consuming, we did not consider the bidimensional wavelet transform. The fast Fourier transform cannot be safely used due to aliasing, which heavily changes the tails (rare events) of the distribution.
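The ESS test itself reduces to a linear regression between log-moments of different orders. A sketch on surrogate data (drawn from the log-Poisson cascade of section 3.2, not from the image ensembles):

```python
import numpy as np

# Sketch of the ESS test (illustrative surrogate data, not the paper's
# image pipeline): draw edge contents eps_r from the log-Poisson cascade
# at several scales, then regress log<eps_r^3> against log<eps_r^2>;
# the slope should approach rho(3,2) from equation 3.8.
rng = np.random.default_rng(1)
beta, Delta, L = 0.5, 0.5, 64.0
lam = lambda r: (Delta / (1 - beta)) * np.log(L / r)   # (d - D_inf) ln(L/r)
scales = [4.0, 8.0, 16.0, 32.0]

logm2, logm3 = [], []
for r in scales:
    n = rng.poisson(lam(r), size=2_000_000)
    eps = beta**n * (r / L) ** (-Delta)
    logm2.append(np.log((eps**2).mean()))
    logm3.append(np.log((eps**3).mean()))

slope = np.polyfit(logm2, logm3, 1)[0]
rho32 = 3 / (1 - beta) - (1 - beta**3) / (1 - beta) ** 2
print(slope, rho32)   # slope close to rho(3,2) = 2.5
```

The same regression, applied to the empirical moments of $\epsilon_r$ and of the wavelet projections, produces the exponents shown in Figure 5.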
Figure 3: Test of ESS for the bidimensional edge content $\epsilon_r$. The graphs represent the logarithm of its moments of order 3, 5, and 7 versus the logarithm of the second-order moment, for distances $r = 4$ to $r = 64$ pixels, that is, about an order of magnitude smaller than the image size. This upper bound is needed both to consider values of $r$ small enough compared to the size of the images and to be able to use moments of order as high as $p = 7$. The data correspond to the second image ensemble. According to equation 3.2, these graphs should be straight lines; the best linear fits are also represented. Except for lower cutoff effects at very small distances, the scaling law is well verified.
It is remarkable that for our data sets, $\Delta \approx 0.5$ and $D_\infty \approx 1$. This means that the most singular component is a collection of lines (the sharp edges present in the image), which can be characterized by their common exponent $-\Delta = -0.5$. But this is precisely the smallest value we observed previously in our wavelet analysis. The statistical approach confirms the result obtained by a local singularity analysis. The nonlinear dependence of $\rho(p,2)$ on $p$ again indicates that the EM is multifractal. This has to be contrasted with the results of Ruderman and
Figure 4: Test of ESS for (a) $T_\Psi^r\, d\mu_h$ and (b) $T_\Psi^r\, d\mu_v$. The graphs represent the logarithm of the moments of order 3, 5, and 7 for both variables versus the logarithm of the corresponding second-order moment. The scales range from $r = 4$ to $r = 64$ pixels. The first data set was used in this computation. The corresponding best linear fits are also represented. Except for lower and upper cutoff effects at small and large distances, the scaling law is well verified. Although the test is not shown here, SS also holds for these variables.
Bialek (1994) and Ruderman (1994), who find that the log-contrast distribution follows a simple classical scaling.^11 The same holds for any derived variable such as the gradient (see Ruderman, 1994, for details). This particular scaling is related in that context to general properties of scale invariance of natural images: the averaging of the contrast does not keep track of the details (e.g., the edges completely contained in the averaging block) and produces a statistically equivalent image. The EC and the wavelet projections, being local averages of a gradient, are able to keep the information about the more complex underlying multifractal structure.

Footnote 11: Other luminosity analyses performed by D. Ruderman provided some evidence of multiscaling behavior (private communication).
Figure 5: ESS exponents $\rho(p,2)$ for (a) the bidimensional edge content $\epsilon_r$, (b) the horizontal wavelet transform $T_\Psi^r\, d\mu_h$, and (c) the vertical one, $T_\Psi^r\, d\mu_v$. The convolution function $\Psi$ was taken to be a gaussian function. Each value of $\rho(p,2)$ was obtained by linear regression of the logarithm of the $p$th moment versus the logarithm of the second moment, for $r$ between 8 and 32. The solid line represents the fit with the log-Poisson process. The best fit is obtained with (a) $\beta = 0.52 \pm 0.05$, (b) $\beta = 0.55 \pm 0.06$, and (c) $\beta = 0.5 \pm 0.07$. The SS parameters $\tau_2$ were also calculated; they turned out to be (a) $\tau_2 = -0.26 \pm 0.06$, (b) $\tau_2 = -0.25 \pm 0.06$, and (c) $\tau_2 = -0.25 \pm 0.08$.
The same authors have also designed an iterative procedure that tends to decompose the image into a gaussian-like piece (local fluctuations of the log-contrast) and an image-like piece (local variances). Such a decomposition would suggest the existence of a multiplicative process for the log-contrast itself, where the random variable relating the image at two different levels of the hierarchy is given by the local fluctuations. A closer look at the iterative procedure indicates that this is not the case: independence between the two factors of the process is not guaranteed, and the process is not infinitely divisible.

4.4 Multiaffinity Properties of $C$ and $|\nabla C|$. In this section we study the scaling properties of the LIs of the contrast $C(\vec x)$ and the measure density $|\nabla C|(\vec x)$. Our aim is to check whether $C$ and $|\nabla C|$ satisfy the scaling defined in equation 3.17. Recalling the connection (the analog of equation 3.12 for these two quantities) between local scaling (described by the singularity exponents) and statistical scaling (characterized by the SS exponents), one concludes that the validity of SS implies that multiaffinity holds. We have done a statistical analysis of the LIs of the contrast and the measure density. We first computed the $p$-moments of the LIs of both variables, looking for SS and ESS scalings. Let us notice that there is no a priori argument to expect that $\delta_{\vec r}C(\vec x)$ shows SS. On the other hand, since the measure $\mu$ is multifractal (as was shown in section 4.2), it would be reasonable to expect
Figure 6: Test of ESS for (a) the horizontal linear increment of the contrast, $\delta_{\vec r_i}C$, and (b) the horizontal wavelet coefficient $|T_\Psi^r C_h|$ ($\Psi$ was taken as the second derivative of a gaussian). The first data set was used, and the range of scales was from $r = 4$ to $r = 64$. ESS is not present for $\delta_{\vec r_i}C$, but it is clearly observed for the wavelet transform (notice the presence of the cutoffs due to numerical effects). Note that the scales of the axes in (a) and (b) are different.
that $|\nabla C|(\vec x)$ has a multiaffine behavior. However, this has to be verified explicitly, because the multifractality of the measure does not guarantee the multiaffinity of its density, as was discussed in section 3.4. In fact, our numerical analysis showed that neither $C(\vec x)$ nor $|\nabla C|(\vec x)$ has SS. The same is true for ESS, as shown in Figures 6a and 7, where the moments exhibit a rather erratic behavior. It seems that the failure of SS for $C(\vec x)$ and for $|\nabla C|(\vec x)$ has to be explained in different terms. In the case of $|\nabla C|(\vec x)$, this quantity is the measure density and, as was stated in equation 2.2, its convolution with a nonadmissible (i.e., positive) wavelet has the same singularities as the measure, equation 1.2. This implies that the lack of SS cannot be explained in terms of a polynomial part as described in equation 3.23. In fact, a positive wavelet cannot remove the contribution from this polynomial, while one observes that
Figure 7: Test of ESS for (a) the horizontal linear increment of $|\nabla C|$, $\delta_{\vec r_i}|\nabla C|$, and (b) the vertical linear increment of the same function, $\delta_{\vec r_j}|\nabla C|$. The first data set was used, and the scales were taken ranging from $r = 4$ to $r = 64$ pixels. Again, these variables do not have ESS. The corresponding wavelet projections are shown in Figure 4, where one can see that ESS holds.
the data for $\delta_{\vec r}|\nabla C|(\vec x)$ (see Figure 7) are different from those of $T_\Psi^r\, d\mu_h$ and $T_\Psi^r\, d\mu_v$ (see Figure 4). On the other hand, it is plausible that the failure of SS of $\delta_{\vec r}C(\vec x)$ can be attributed to the presence of a polynomial part in $C(\vec x)$. In contrast to what happened with $|\nabla C|(\vec x)$, when the contrast is convolved with a nonadmissible wavelet, SS and ESS are still not present (that is, one observes erratic behavior similar to that in Figure 6a). The same behavior is observed when $C(\vec x)$ is convolved with an admissible wavelet with a zero mean. The situation changes drastically when a wavelet with vanishing zeroth- and first-order moments is used. The result obtained using the second derivative of a gaussian is presented in Figure 6b, which clearly shows ESS (it also exhibits SS). This can be understood by recalling that the possible values of the Hölder exponent of $C(\vec x)$ are contained in the interval $(0.5, 2.4)$ (see equation 3.20 and section 4.2). A nonadmissible wavelet truncates the detected singularities at the value $h = 0$; an admissible one truncates them at $h = 1$. The third wavelet considered, with two vanishing moments, truncates the singularities only from $h = 2$, and thus it covers almost the whole interval. (In theory, a wavelet with another vanishing moment would be more appropriate; however, in practice it is numerically more unstable because the resolution of the wavelet is smaller.) Finally, we also checked that the experimental values of $\tau_p$, $\tau_p^{\Psi C}$, and $\tau_p^{\Psi\nabla C}$ satisfy the relation given in equation 3.29.

Figure 8: Singularity spectrum of the log-Poisson model for $\beta = 0.5$ and $\tau_2 = -0.25$. For these values $D_\infty = 1.0$ and $\Delta = 0.5$ (see text).

5 Discussion
We have shown that in natural scenes, pixels are arranged in nonoverlapping classes, the fractal components of the image, characterized by the behavior of the contrast gradient under changes in the scale. In each of these classes this quantity exhibits a distinct power-law behavior under scale transformations, which indicates that there is no well-defined scale for the isolated components, but at the same time the scene is not globally scale invariant. This result implies that a given image is composed of many different spatial structures of the contrast associated with the fractal components, which can be arranged in a hierarchical way. In fact, how sharp or soft a
change in contrast is at a given point can be quantified in terms of the value of the scaling exponent at that site. The smaller the exponent, the sharper the local change in contrast. More precisely, we have dealt with two basic quantities: the contrast $C(\vec x)$ and the edge measure $\mu$, whose density is the modulus of the contrast gradient $|\nabla C|(\vec x)$ (see equation 1.2). Closely related to the measure, we can define the edge content of a region $A$ of the image (see equation 1.3), which quantifies how much the contrast gradient deviates from being uniformly distributed inside the region $A$. This is a suitable definition to study the local behavior of the contrast gradient. As the size of $A$ is reduced, some contributions of the contrast gradient to the edge content are left outside this region. Under an infinitesimal change in the scale, these contributions often are small, but sometimes a small change in the scale produces a large change in the edge content, due to singularities in the contrast gradient. As a consequence, the edge content (and the measure) exhibits large spatial fluctuations. The edge content can have different singularities at neighboring points, characterized by a power-law behavior with exponents $h(\vec x)$ that depend on the site (see equation 1.6). This means that the measure is multifractal, as has been explicitly verified in this work. The set of pixels with the same value of $h(\vec x)$ defines a particular spatial structure of contrast: the fractal component $F_h$. The smaller $h(\vec x)$ is, the sharper the contrast gradient at the pixel $\vec x$ is. Its minimal value was, for our data set, $h_\infty = -0.5$. Another question refers to the fractal dimension $D_\infty$ of this component. One finds $D_\infty = 1$ (Turiel et al., 1998), and when this component is extracted from the image, one checks that it corresponds to the boundaries of the objects in the scene. The associated contrast structure is made of the sharpest edges of the image.
The less singular components represent softer contrast structures inside the objects. It is also observed that there are strong spatial correlations between fractal components with similar exponents. We have also shown how to extract the fractal components from the image. To do this, one needs a suitable technique to compute the scale behavior at a given pixel. A wavelet projection (Mallat & Huang, 1992; Arneodo, 1996) seems to be the correct way to perform this analysis, since it provides a way of interpolating scale values that are not an integer number of pixels. Our approach differs from the one followed in those references in several aspects, but in particular in that we have been interested here in the characterization of the whole set of fractal components. We remark on the important fact that the multifractal can be explained by a model with only two parameters (e.g., $\beta$ and $\Delta$). This had been noticed before in terms of a variable that integrates the square of the derivative of the contrast along a segment of size $r$, taken in either the horizontal or the vertical direction (Turiel et al., 1998). This quantity admits the interpretation of a local linear-edge variance. We have seen that the same model is valid for many other quantities, some of them more general and natural.
Antonio Turiel and Néstor Parga
The basic idea of the model is that the contrast gradient has singularities that, under an infinitesimal change in the size of A, produce a substantial change in the edge content inside that region. The simplest stochastic process to assign a value to this change is a binomial distribution, equation 3.6; its most likely value is very small, but with small probability the change is given by the parameter β, which for our data set is about 0.5. The other parameter of the model, Δ, measures the strength of the singularity. The geometrical locus of these singularities is the most singular fractal component. For a finite scale transformation, this process becomes log-Poisson. The model has appeared before in other problems (She & Leveque, 1994). In a sense, the most singular component conveys a great deal of information about the whole image. The two model parameters can be obtained by analyzing properties of this particular class of pixels: its dimension (D_∞) and the scale exponent of the edge content (Δ = −h_∞). In turn, these two parameters completely define the whole dimension spectrum in the log-Poisson model. It is then very plausible that the other components have a great deal of redundancy with respect to the most singular one. This hierarchical representation of spatial structure can be used to predict specific feature detectors in the visual pathways. We conjecture that the most singular component contains highly informative pixels of the images that are responsible for the epigenetic development leading to the adaptation of receptive fields. Learning of the statistical regularities (Barlow, 1961) present in this portion of visual scenes (using, for instance, the algorithm in Bell & Sejnowski, 1997) would give relevant predictions about the structure of receptive fields of cells in the early visual system.

Appendix A: Characterization of the Singularities of Simple Functions
It is instructive to compute the singularity exponents defined in equation 3.16 for several simple functions.

Continuous functions. If a function f(x) is continuous at a given point x, the value of f at y should be rather close to f(x) provided y is close to x; then |f(y)| < |f(x)| + 2, so |f(x) − f(y)| < 2|f(x)| + 2 ≡ A. That is, f is Hölder of exponent 0 at x.

Smooth functions with continuous, nonvanishing first derivative. Using the Taylor expansion, f(x) − f(y) = ∇f(y′)·(x − y), with y′ a point between x and y. Let A = max |∇f(y′)| > 0, so |f(x) − f(y)| < A|x − y|. That is, f is Hölder of exponent 1 at x. Analogously, any function f having n − 1 vanishing derivatives at a point x and a nonvanishing continuous nth derivative is Hölder of exponent n at x.
Structure of Contrast Changes
θ-function. The θ(x) function is given by

θ(x) = 0 for x < 0; 1 for x > 0; undefined at x = 0.

Since for x ≠ 0 the increments are zero (for small enough displacements), θ is Hölder of any order there. For x = 0 we make use of the property that f is Hölder of exponent h at a given point if and only if its primitive F is Hölder of exponent h + 1 at the same point. One possible primitive F(x) of θ(x) is

F(x) = 0 for x < 0; x for x ≥ 0,

which is a continuous function. The increments |F(0) − F(r)| = |F(r)| = F(r) are thus immediate, and obviously F(r) < |r|. Moreover, it is also clear that the maximal h such that |F(0) − F(r)| < A|r|^h is h = 1. Thus, F has Hölder exponent 1 at x = 0, and therefore θ(x) has Hölder exponent 0 at x = 0.
δ-function. Dirac's δ-function is in fact a distribution. Nevertheless, it is possible to define the Hölder exponent of distributions, using the fact that any distribution can be expressed as a finite-order derivative of a bounded function. If the nth-order primitive of a distribution is a Hölder function of exponent h at a given point, the distribution is Hölder of exponent h − n at the same point. Since the δ(x) distribution vanishes for x ≠ 0, the only nontrivial point is x = 0. Since δ is the derivative of θ(x), and θ has Hölder exponent 0 at x = 0, one concludes that δ(x) has Hölder exponent −1 at x = 0.

Acknowledgments
We thank Jean-Pierre Nadal, Germán Mato, and Dan Ruderman for useful comments and Angel Nevado for many fruitful discussions during the preparation of this work. We are grateful to Dan Ruderman and Hans van Hateren, some of whose images we used in the statistical analysis. A.T. is financially supported by an FPI grant from the Comunidad Autónoma de Madrid, Spain. This work was funded by the Spanish grant PB96-0047.

References

Arneodo, A. (1996). Wavelet analysis of fractals: From the mathematical concepts to experimental reality. In G. Erlebacher, M. Y. Hussaini, & L. Jameson (Eds.), Wavelets: Theory and applications. Oxford: Oxford University Press.
Arneodo, A., Argoul, F., Bacry, E., Elezgaray, J., & Muzy, J. F. (1995). Ondelettes, multifractales et turbulence. Paris: Diderot Editeur.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Comput. Neural Syst., 3, 213–251.
Bacry, E., Muzy, J. F., & Arneodo, A. (1993). Singularity spectrum of fractal signals from wavelet analysis: Exact results. J. Stat. Phys., 70, 635–673.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Benzi, R., Biferale, L., Crisanti, A., Paladin, G., Vergassola, M., & Vulpiani, A. (1993). A random process for the construction of multiaffine fields. Physica D, 65, 352–358.
Benzi, R., Ciliberto, S., Baudet, C., Chavarria, G. R., & Tripiccione, C. (1993). Extended self similarity in the dissipation range of fully developed turbulence. Europhysics Letters, 24, 275–279.
Benzi, R., Ciliberto, S., & Chavarria, C. B. G. R. (1995). On the scaling of three-dimensional homogeneous and isotropic turbulence. Physica D, 80, 385–398.
Benzi, R., Ciliberto, S., Tripiccione, C., Baudet, C., Massaioli, F., & Succi, S. (1993). Extended self-similarity in turbulent flows. Physical Review E, 48, R29–R32.
Castaing, B. (1996). The temperature of turbulent flows. Journal de Physique II, 6, 105.
Daubechies, I. (1992). Ten lectures on wavelets. Montpelier, VT: Capital City Press.
Falconer, K. (1990). Fractal geometry: Mathematical foundations and applications. Chichester: Wiley.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am., 4, 2379–2394.
Frisch, U. (1995). Turbulence. Cambridge: Cambridge University Press.
Gonzalez, R. C., & Woods, R. E. (1992). Digital image processing. Reading, MA: Addison-Wesley.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105.
Mallat, S., & Huang, W. L. (1992). Singularity detection and processing with wavelets. IEEE Trans. on Inf. Th., 38, 617–643.
Mallat, S., & Zhong, S. (1991). Wavelet transform maxima and multiscale edges. In M. B. Ruskai et al. (Eds.), Wavelets and their applications. Boston: Jones and Bartlett.
Novikov, E. A. (1994). Infinitely divisible distributions in turbulence. Physical Review E, 50, R3303.
Ruderman, D. L. (1994). The statistics of natural images. Network, 5, 517–548.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Physical Review Letters, 73, 814.
She, Z. S., & Leveque, E. (1994). Universal scaling laws in fully developed turbulence. Physical Review Letters, 72, 336–339.
She, Z. S., & Waymire, E. (1995). Quantized energy cascade and log-Poisson statistics in fully developed turbulence. Physical Review Letters, 74, 262–265.
Turiel, A., Mato, G., Parga, N., & Nadal, J. P. (1998). The self-similarity properties of natural images resemble those of turbulent flows. Physical Review Letters, 80, 1098–1101.
van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMCs, and experimental validation. J. Comp. Physiology A, 171, 157–170.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond., B265, 359–366.

Received August 11, 1998; accepted May 5, 1999.
NOTE
Communicated by Naftali Tishby
Exponential or Polynomial Learning Curves? Case-Based Studies

Hanzhong Gu
Haruhisa Takahashi
Department of Information and Communication Engineering, University of Electro-Communications, Chofugaoka, Chofu, Tokyo 182, Japan

Learning curves exhibit a diversity of behaviors, such as phase transitions. However, the understanding of learning curves is still extremely limited, and existing theories can give the impression that, without empirical studies (e.g., cross-validation), one can probably do nothing more than qualitative interpretation. In this note, we propose a theory of learning curves based on the idea of reducing learning problems to hypothesis-testing ones. The theory provides a simple approach that is potentially useful for predicting and interpreting (a diversity of) learning curve behaviors, qualitatively and quantitatively, and it applies to finite training sample sizes and finite learning machines, and to learning situations not necessarily within the Bayesian framework. We illustrate the results by examining some exponential learning curve behaviors observed in the experiments of Cohn and Tesauro (1992).

1 Introduction
A fundamental problem in neural network learning theory is how average generalization performance depends on the training sample size m; this dependence is called the learning curve. According to the experimental results of Cohn and Tesauro (1992), learning curves in average-case learning processes (from backpropagation experiments) can exhibit different behaviors corresponding to different learning problems. Specifically, in the "majority" and "majority-XOR" problems with binary inputs, the learning curves exhibit roughly exponential behavior for moderately large sample sizes. The exponential learning curve behaviors may be covered by the theoretical results from statistical mechanics (Tishby, Levin, & Solla, 1989; Schwartz, Samalam, Solla, & Denker, 1990; Haussler, Seung, Kearns, & Tishby, 1994; for a survey, see Watkin, Rau, & Biehl, 1993), which suggest that if the spectrum of possible generalizations has a "gap" near perfect performance, then the generalization error will approach zero exponentially in the number of examples. One might think that since the inputs are binary, the generalization is necessarily quantized, and this quantization provides the gap that

Neural Computation 12, 795–809 (2000)
© 2000 Massachusetts Institute of Technology
explains the results. However, Cohn and Tesauro did not find evidence of the gap in their experiment. Thus, it remains interesting to see which mechanism accounts for the exponential behaviors. In this note, we try to bridge this gap in a learning setting discussed in Gu and Takahashi (1996, 1997), based on a conceptual lower-quality learning algorithm, the so-called ill-disposed learning algorithm, and the Boolean interpolation dimension (Takahashi & Gu, 1998). One of the main ideas is to relate a learning problem to a hypothesis-testing one, so that a learning problem with the ill-disposed learning algorithm can be equivalently reduced to a simple one for which the evaluation of generalization performance is usually much simpler. This note extends the result of Gu and Takahashi (1997) so as to examine the diversity of learning curve behaviors. As an example, we apply the theory to examining again the learning curves of the majority problem.

2 Learning Model
We begin by briefly describing the basic learning setting. (For details, see Gu & Takahashi, 1997, and Takahashi & Gu, 1998.) In particular, we assume that the learning machine implements a class of {0, 1}-valued functions {f(x, w): w ∈ W} on an input space X, and that the target function to be learned is chosen from this class. Here W ⊆ R^p is called the configuration space. We refer to any f(x, w) or w as a hypothesis, and to the boundary of the region {x ∈ X: f(x, w) = 1} as the decision boundary of w. Consider the problem of inferring a function f(x, w) by a learning algorithm from m random training examples (x_1, y_1), …, (x_m, y_m). Each example consists of an instance (input vector) x ∈ X and a {0, 1}-valued output y. If y_i = 1 (y_i = 0), (x_i, y_i) is called a positive (negative) example. We assume that the instances x_1, …, x_m are generated independently according to a fixed distribution P(x) over X and that the corresponding outputs y_1, …, y_m are given by a fixed unknown target function f(x, w_t), w_t ∈ W. Let x^m = (x_1, …, x_m) denote the instances; x^m is called a sample of size m. A hypothesis w is said to be consistent with x^m if f(x_i, w) = y_i for i = 1, …, m. A learning algorithm is called a consistent learning algorithm if, for any sample, it outputs a hypothesis consistent with the given sample. For any f(x, w), the generalization error of w is the probability that w misclassifies a randomly drawn example: R(w) = ∫_X (y − f(x, w))² dP(x). Let w be the output of a learning algorithm, and let E{R(w)} denote the expectation of R(w) over training samples of size m. Then E{R(w)}, as a function of m, is called the learning curve of the algorithm. Now let w_1 ~ w_2 ⇔ f(·, w_1) = f(·, w_2) be an equivalence relation on W, and let W* be a set of representatives of the equivalence classes. Then, in general, W* is a space whose dimensionality is not greater than that of W.
In what follows, we transfer the term hypothesis to W* and, with some abuse of notation, use f(x, w) or w, w ∈ W*, to denote a hypothesis.
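As a toy illustration of these definitions (the example is our own, not from the note), take X = [0, 1] with uniform P(x) and the one-dimensional threshold class f(x, w) = 1 for x ≥ w. The generalization error of a hypothesis and the learning curve E{R(w)} of a consistent learner can then be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # hypothesis class: {0,1}-valued threshold functions on X = [0, 1]
    return (np.asarray(x) >= w).astype(float)

def R(w, w_t):
    # generalization error under uniform P(x): the measure of the set
    # where f(., w) and f(., w_t) disagree, which is |w - w_t|
    return abs(w - w_t)

def learning_curve(m, w_t=0.3, runs=2000):
    # E{R(w)} for a consistent learner that outputs the midpoint of the
    # version space [largest negative instance, smallest positive instance]
    errs = []
    for _ in range(runs):
        x = rng.random(m)
        y = f(x, w_t)
        lo = x[y == 0].max(initial=0.0)   # largest instance labeled 0
        hi = x[y == 1].min(initial=1.0)   # smallest instance labeled 1
        errs.append(R((lo + hi) / 2, w_t))
    return float(np.mean(errs))
```

For this class the empirical curve decays roughly like 1/m, the polynomial behavior expected for a one-parameter consistent learner.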
We say that a decision boundary is uniquely specified by a set of m instances x_1, …, x_m if it is the common decision boundary of all possible hypotheses w ∈ W* whose decision boundary contains all x_i, i = 1, …, m. The Boolean interpolation dimension d of the function class is then the smallest integer m such that a set of m distinct instances in X uniquely specifies a decision boundary with the property that there are at most two hypotheses w_1, w_2 ∈ W* having this decision boundary in common and satisfying f(x, w_1) = 1 − f(x, w_2) for all permissible x ∈ X outside the decision boundary. Now we introduce the ill-disposed learning algorithm, which behaves worse than ordinary algorithms and thus can be used to bound their generalization performance (Takahashi & Gu, 1998). Let x^m be a training sample. We call a hypothesis w_ill an ill-disposed hypothesis if its decision boundary contains d instances of x^m and it is consistent with the other m − d training instances. Here the output values of the target function at the instances on the decision boundary of w_ill can be either 0 or 1; that is, w_ill is not necessarily consistent with x^m. There are then generally a number of ill-disposed hypotheses for a given x^m. For convenience, we call any artificial procedure that chooses a w_ill randomly from the collection of ill-disposed hypotheses an ill-disposed learning algorithm. An example of such an artificial procedure is described by Takahashi and Gu (1998). The following theorem is due to Gu and Takahashi (1997).

Theorem 1.
Let P(x) be not degenerated.¹ Let m ≥ d, where d < ∞. Then

E{R(w_ill)} ≤ (1 + o) d/(m + d),

where o = 0 for m = d and for m/d → ∞, and o ≪ 1 otherwise. Thus, if w_0 is produced by an average good, consistent learning algorithm, E{R(w_0)} ≤ d/(m + d).

3 Characterizing Generalization Error Under Degenerated Input Distribution
Assume that P(x) is degenerated, so that the probability measure of some instance is not zero. Let us name such an instance x_0 and assume that x_0 is contained in the decision boundary of some w_ill. Let w_0 and w_1 be two possible limit hypotheses of w_ill such that w_0 can correctly classify x_0, whereas w_1 cannot. Then, since the measure of x_0 is not zero, there is a discontinuity

¹ This means that for any given x^m drawn from P(x), all the examples are, with probability one, distinct. In practice, this is equivalent to saying that all the examples in the training set and the test set are almost surely distinct.
in the generalization error of a hypothesis when it moves from w_0 to w_1. Let P′(x) be induced from P(x) by perturbing P(x) in a neighborhood of x = x_0, such that P′(x) is not degenerated in the neighborhood and the generalization error of w_0 with respect to P′(x) is R(w_0). Then the generalization error of w_ill with respect to P′(x), denoted R′(w_ill), ranges from R(w_0) through R(w_1), depending on the perturbation. Here one can establish a (one-to-one) relationship between R′(w_ill) and R(w_0) according to the perturbation. Generalizing, we conclude that in the case where P(x) is degenerated, one can generally characterize the generalization ability of a consistent hypothesis w_0 by an ill-disposed hypothesis w_ill such that

R′(w_ill) = ε,  R(w_0) = χ(ε),

where χ(·) is some real-valued function and R′(w_ill) is measured with respect to some nondegenerated P′(x) perturbed from P(x). Now apply the method of relating learning to hypothesis testing (Gu & Takahashi, 1997). Let w_ill be produced on a sample of size m chosen from P′(x). We call the d instances contained in the decision boundary of w_ill the support of w_ill. Let x^{m′}, m′ = m/d, be an average-case sample, obtained by deleting randomly d − 1 instances from the support and (m − d)(d − 1)/d instances from the other m − d instances. This is the uniform pruning method. Then, according to Gu and Takahashi (1997), one can approximately treat w_ill as a hypothesis whose decision boundary is uniquely specified by one instance in x^{m′}, such that, ignoring the constraints on w_ill due to x^m − x^{m′},

Pr{R′(w_ill) ≥ ε} ≤ (1 + o)(1 − ε)^{m′}  (3.1)

holds, where o = 0 for m′ = 1 and for m′ → ∞, and o ≪ 1 otherwise. The real constraints due to x^m − x^{m′} make R′(w_ill) tend to exhibit a behavior with a smaller deviation than equation 3.1. Therefore, if w_0 is produced by an average good learning algorithm, then

Pr{R(w_0) ≥ ε} = Pr{R′(w_ill) ≥ χ⁻¹(ε)} ≤ (1 − χ⁻¹(ε))^{m/d}.

Note that the uniform pruning method equivalently reduces the learning problem to a simple one where d = 1 and the training sample size is m′, and that for m′ = 1, Pr{R(w_0) ≤ ε} ≥ χ⁻¹(ε).

4 Analysis of a Simple Problem
In this section, we develop a result that, when applied with the method of relating learning to hypothesis testing, can be used to account for the learning curve behaviors in learning majority functions.
Consider a simple learning problem in which d = 1 and an instance x_i takes one of a number of discrete values over X = [a, b] for some real a, b. We use w_0 to denote the output hypothesis consistent with the training sample x^m. For simplicity, we shall not specify how w_0 is generated but directly specify the distribution of R(w_0). Assume that there are only a finite number, K + 1, of discrete values 0 = ε_0 < ε_1 < ε_2 < ··· < ε_K = 1 that R(w_0) can take, where K > 1 is some integer. For m = 1, let R(w_0) take these values with probabilities p_0, p_1, …, p_K, respectively. We assume that Σ_{i=0}^{k−1} p_i ≥ ε_k for k = 1, …, K, so that Pr{R(w_0) ≤ ε} ≥ ε, a situation in which the learning algorithm performs better than the ill-disposed one. Then for m ≠ 1,

Pr{R(w_0) ≥ ε} = (1 − Σ_{i=0}^{k−1} p_i)^m   for ε ∈ [ε_{k−1}, ε_k), k = 1, …, K,   (4.1)

and

E{R(w_0)} = Σ_{k=1}^{K} ∫_{ε_{k−1}}^{ε_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε = Σ_{k=1}^{K} (ε_k − ε_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m.

Since equation 4.1 implies Pr{R(w_0) ≥ ε} ≤ (1 − ε)^m, we have

E{R(w_0)} = Σ_{k=1}^{K_1} ∫_{ε_{k−1}}^{ε_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε + Σ_{k=K_1+1}^{K_2} ∫_{ε_{k−1}}^{ε_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε + Σ_{k=K_2+1}^{K} ∫_{ε_{k−1}}^{ε_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε

≤ ∫_0^{ε_{K_1}} (1 − ε)^m dε + Σ_{k=K_1+1}^{K_2} (ε_k − ε_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m + ∫_{ε_{K_2}}^{1} (1 − ε)^m dε

= (1/(m + 1)) [1 − (1 − ε_{K_1})^{m+1}] + Σ_{k=K_1+1}^{K_2} (ε_k − ε_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m + (1 − ε_{K_2})^{m+1}/(m + 1),   (4.2)
where K_1 < K_2 < K. Note that if ε_{K_1} is sufficiently small but ε_{K_2} is considerably large, the learning curve would quickly transit through several exponential
behaviors (the second part) and then to a polynomial behavior (the first part). Also note that if ε_{K_1} is far smaller than ε_{K_1+1}, the last polynomial behavior may not be visible unless the sample size is sufficiently large.

5 Some General But More Descriptive Results
It is expected that the above results will allow us to interpret a number of learning curve behaviors in general cases. It is also clear, however, that even when a learning problem can be reduced to one equivalent to the case d = 1, a complete characterization of the learning problem might still be impossible (see, e.g., section 6). In this section, we develop some results useful for qualitative interpretation.

5.1 Learning Is an Experiment of m Trials. Recall that the training examples can be presented to the learning machine either one by one or in batch mode. We shall now regard the learning process in which a hypothesis is randomly selected from the set of consistent hypotheses as an experiment consisting of a sequence of m trials, where a trial in the learning process is one in which the learning machine receives a training example (and consequently the learning algorithm adjusts its output). Sets of realizations (hypotheses produced by the learning algorithm) will be called outcomes of the experiment.² The most important outcome of interest here is the one in which the learning machine attains perfect performance (R(w_0) = 0).

5.2 Events of a Trial. In each trial, the learning machine receives an example according to some distribution P(x). For convenience, we shall take the input space X as a feature space and instances as points of the feature space. Thus, in each trial the learning machine is assigned a feature according to some probability distribution; in this case, we say that the feature occurs in that trial. It is possible that a set of features exhibits the same overall effect on generalization as a single feature in the set does. For example, in the case where X = [0, 1], P(x) is uniform over [0, 0.5 − a] ∪ [0.5 + a, 1], and the boundary of the target function lies at x_t = 0.5, where 0 ≤ a < 0.5, any x ∈ (0.5 − a − c, 0.5 − a] has the same effect as far as perfect performance is concerned.
We call such a set of features an event of a trial.³

² Accordingly, the terms experiment, trial, and outcome are used here exactly as in classical probability theory.
³ It might be instructive to note that if one defines events simply on the basis of a specific generalization error, then the definition has a close relation to the discussion of slicing the ε-ball (see, e.g., Haussler et al., 1994), and the occurrence of the events should be related to the information entropy of the ε-ball. It is clear, however, that finding such events, or the way of slicing the ε-ball, is usually very difficult and even unrealistic, except in some very simple problems such as d = 1.
Thus, mutually exclusive sets of features can be assigned to different events. If at least one feature contained in an event occurs in a trial, we say that the event occurs in that trial. Further, since the (learning) experiment consists of a number of trials, if an event occurs in a trial, it will often be convenient to say that the event occurs in the experiment.

Lemma 1. Let A be an outcome of an experiment consisting of a sequence of m trials. Let A_1, A_2, …, A_N be N < ∞ mutually exclusive events of a trial such that once all of them occur in the m trials, the outcome of the experiment is A. Let P(A) denote the probability that the outcome of the experiment is A. Then P̄(A) = 1 − P(A) decreases exponentially in m for sufficiently large m.
Lemma 2. Let A be an outcome of an experiment consisting of a sequence of m trials. Let A_1, A_2, …, A_N be N mutually exclusive, equally probable events of a trial such that once all of them occur in the m trials, the outcome of the experiment is A. Let p denote the probability that the event in a trial is A_i, for 1 ≤ i ≤ N. Let P(A) denote the probability that the outcome of the experiment is A. Then
P̄(A) = 1 − P(A) = Σ_{i=1}^{N} (N choose i) (−1)^{i+1} (1 − ip)^m.
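Lemma 2's inclusion-exclusion formula is easy to check numerically. The sketch below is our own (the layout in which event A_i occupies an interval of length p of the unit interval is an arbitrary illustrative choice); it compares the exact expression with a Monte Carlo estimate:

```python
import math
import random

def p_not_all(N, p, m):
    # Lemma 2: probability that NOT all of N mutually exclusive events
    # (each with per-trial probability p) occur within m trials:
    # sum_{i=1}^{N} C(N, i) (-1)^{i+1} (1 - i p)^m
    return sum(math.comb(N, i) * (-1) ** (i + 1) * (1 - i * p) ** m
               for i in range(1, N + 1))

def mc_not_all(N, p, m, runs=20000, seed=1):
    # Monte Carlo estimate of the same probability
    rng = random.Random(seed)
    misses = 0
    for _ in range(runs):
        seen = set()
        for _ in range(m):
            k = int(rng.random() / p)   # which event interval this trial hits
            if k < N:
                seen.add(k)
        misses += len(seen) < N
    return misses / runs

exact = p_not_all(3, 0.1, 60)
approx = 1 - (1 - math.exp(-0.1 * 60)) ** 3   # the small-p approximation
```

The exact and simulated values agree closely, and the small-p approximation reproduces the correct order of magnitude.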
In case p ≪ 1, P̄(A) ≈ 1 − (1 − e^{−pm})^N, which is approximately N e^{−pm} for sufficiently large m.

5.3 Exponential Transition of Learning Curves. Now assume that A_1, A_2, …, A_N are N mutually exclusive events of a trial such that once they all occur in the learning process, perfect performance is attained. Let x^m be a sample in which not all of A_1, A_2, …, A_N occur, and let w_0 be the output of an average good learning algorithm; then R(w_0) may not be zero. Let ε_1 be the maximum of such R(w_0). Consider now x^m as being drawn from a nondegenerated input distribution⁴ P′(x), let w_ill be an ill-disposed hypothesis, and let R′(w_ill) be the generalization error measured with respect to P′(x). Since we have assumed that an average good learning algorithm is better than the ill-disposed one, R′(w_0) ≤ R′(w_ill) and R(w_0) ≤ R(w_ill), where R(w_ill) is measured with respect to P(x). Thus R(w_ill) is at least ε_1; that is, R(w_ill) ∈ [ε_1, 1]. Applying the uniform pruning method, we obtain a subsample x^{m′} of x^m, and E{R′(w_ill)} ≤ (1 + o)d/(m + d). Since in the case d = 1 and m′ = 1, R′(w_ill) is uniformly distributed over [0, 1] and R(w_ill) is uniformly distributed over [ε_1, 1] (refer to equation 3.1), one can simply consider that in the case d = 1 there is a linear map from R′(w_ill) to R(w_ill), that is, R(w_ill) = ε_1 + (1 − ε_1)R′(w_ill). Hence in the case where not all of A_1, A_2, …, A_N
⁴ If necessary, x^m can be perturbed to some extent.
occur, E{R(w_0)} is upper bounded by ε_1 + (1 − ε_1)d/(m + d), according to theorem 1. Let A denote the outcome of the (learning) experiment in which the learning machine attains perfect performance. Then
E{R(w_0)} ≤ 0 · P(A) + P̄(A) (ε_1 + (1 − ε_1) d/(m + d))
= ε_1 P̄(A) + (d/(m + d)) (1 − ε_1) P̄(A).   (5.1)
Theorem 2. If the measure of the solutions giving perfect performance is not zero, then the learning process improves its average generalization performance exponentially in m for sufficiently large m.
Proof. Since the measure of the solutions giving perfect performance is not zero, it is sufficient to assume that there are finitely many (N) mutually exclusive events of a trial such that once the training sample contains all of these N events, perfect performance is attained. Let p denote the probability of the least probable of the N events of a trial. Then, by lemma 1 and equation 5.1,
E{R(w_0)} ≤ N ε_1 (1 − p)^m + N (1 − ε_1)(1 − p)^m · d/(m + d),

which decays exponentially in m when the exponential term dominates. Thus, as far as perfect performance is concerned, average-case learning curves will probably exhibit roughly polynomial behaviors for small m and roughly exponential behaviors for sufficiently large m. However, it is also clear that unless N ≪ m and p is considerably large, the exponential transition to perfect performance may in fact be invisible at finite m. We have so far focused on the situation where only the transition to perfect performance is of concern. In some learning problems, there could exist events that make an output hypothesis have only a small generalization error. In this case, it is not difficult to verify that the learning curve could also exhibit exponential behaviors, but the transition would not be toward perfect performance. On the other hand, it is also possible that there exist events such that if the ill-disposed algorithm outputs w_ill with R(w_ill) = ε, then the events would make the learning algorithm produce w_0 with R(w_0) = aε for some a < 1, and without the events, R(w_0) is close to R(w_ill). In this case, the learning curve will follow from

E{R(w_0)} ≤ P(A) a d/(m + d) + P̄(A) d/(m + d).
Since P(A) exponentially approaches unity for sufficiently large m, the learning curve could exhibit an exponential transition between two polynomial behaviors. Furthermore, it is still possible that for a given learning problem, a learning algorithm A′ could perform either optimally or nonoptimally, depending on how much information is contained in the training sample. In this case, one may consider that there exists another consistent learning algorithm A″ that is the same as A′ when A′ performs nonoptimally. The situation can then be modeled as follows: there are events A such that if A″ outputs w″ with R(w″) = ε, then the events make A′ produce w_0 with R(w_0) = aε, and without the events, R(w_0) = ε. The learning curve will then be upper bounded by P(A) a E{R(w″)} + P̄(A) E{R(w″)}. Thus, if E{R(w″)} has an exponential behavior, the learning curve of A′ exhibits an exponential transition between two exponential behaviors (for example, the learning curve in section 6.3 may be interpreted in this way). However, since these discussions are rather descriptive, we shall not pursue them further.

6 Learning Majority Functions
In this section, we present simulation results on learning majority functions. (For a detailed discussion, see Gu, 1997.) The majority function is a Boolean predicate and is linearly separable. The output of a D-dimensional majority function is 1 if and only if more than half of the D inputs are 1. In what follows, we use (x_{i1}, x_{i2}, …, x_{iD}) to denote the D inputs of the ith instance x_i, where x_{ij} ∈ {0, 1}. Accordingly, the majority function sets the target output y_i to 1 if Σ_{j=1}^{D} x_{ij} > D/2 and to 0 otherwise. We assume that D is even and that each of the D inputs of an example is set independently to 0 or 1 with probability 1/2. Since a positive example satisfies Σ_{j=1}^{D} x_{ij} > D/2, we categorize the positive instances (examples) into A_k^+ = {x_i: Σ_{j=1}^{D} x_{ij} = D/2 + k + 1} for k = 0, …, D/2 − 1, and the negative instances (examples) into A_k^− = {x_i: Σ_{j=1}^{D} x_{ij} = D/2 − k} for k = 0, …, D/2, respectively. If an instance x_i categorized into A_k^+ (or A_k^−) occurs in a trial, we shall simply say that the instance in the trial is A_k^+ (or A_k^−). Let p_k^+, p_k^− denote the probabilities that the instance in a trial is A_k^+, A_k^−, for k = 0, …, D/2 − 1, respectively. Then p_k^+ = (D choose D/2 + k + 1) (1/2)^D and p_k^− = (D choose D/2 − k) (1/2)^D. For large D, p_k^+ ≈ p_k^−. For simplicity, we shall treat A_k^+ and A_k^− as two equally probable instances that occur in a trial and denote the probability of the occurrence by p_k.
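The definitions above are straightforward to spell out in code (a sketch of our own; the function names are not from the note, but the formulas follow the text):

```python
from math import comb

def majority(x):
    # D-dimensional majority function: 1 iff more than half of the inputs are 1
    return 1 if sum(x) > len(x) / 2 else 0

def p_plus(D, k):
    # p_k^+: probability that a uniform random instance lies in A_k^+,
    # i.e. has exactly D/2 + k + 1 ones (D even)
    return comb(D, D // 2 + k + 1) / 2 ** D

def p_minus(D, k):
    # p_k^-: probability of A_k^-, i.e. exactly D/2 - k ones
    return comb(D, D // 2 - k) / 2 ** D
```

For D = 50 this gives p_0^+ ≈ 0.108 and p_0^− ≈ 0.112, consistent with the value p_0 ≈ 0.11 quoted in section 6.1, and the categories A_0^+, …, A_{24}^+ and A_0^−, …, A_{25}^− together exhaust the instance space.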
Figure 1: The learning curve of learning a 50-dimensional majority function with a perceptron whose 49 input-output weights are fixed to 1. The dashed line indicates a theoretical prediction (see equation 6.1).
6.1 Training a Perceptron with D − 1 Weights Fixed. We begin with the results of learning with a perceptron with D − 1 fan-in connections set to 1; that is, only one weight and the threshold need to be set by some learning algorithm. It is not difficult to verify that in this case, d = 2. Since d = 2, one can directly assume that after applying the uniform pruning method, R(w') takes the discrete values 0, p_0, p_0 + p_1, ..., with probabilities p_0, p_1, p_2, ..., when m' = 1. Then, according to equation 4.2 and theorem 1, the learning curve would follow

    E\{R(w')\} \le p_0 (1 - p_0)^{m/2} + p_1 (1 - p_0 - p_1)^{m/2} + \frac{2}{m+2} (1 - p_0 - p_1)^{m/2+1}.   (6.1)
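Equation 6.1 is straightforward to evaluate numerically; a sketch, using the D = 50 values p_0 = 0.11 and p_1 = 0.10 from the simulation as defaults (the function name is ours):

```python
def bound_6_1(m, p0=0.11, p1=0.10):
    """Learning-curve bound of equation 6.1 for the d = 2 case."""
    return (p0 * (1 - p0) ** (m / 2)
            + p1 * (1 - p0 - p1) ** (m / 2)
            + 2.0 / (m + 2) * (1 - p0 - p1) ** (m / 2 + 1))
```

At m = 0 the three terms sum to exactly 1, and the bound decays toward 0 as m grows.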
Figure 1 illustrates the simulation result in the case D = 50 (p_0 = 0.11, p_1 = 0.10). The learning algorithm is the usual reward-punishment rule with a learning rate of 0.00005. We have found that the result depends only weakly on the learning rate. The generalization error is estimated by testing against 4096 new examples. The result is obtained by averaging over 100 runs for each training sample size.

6.2 Learning with Perceptrons. Let the learning machine be a perceptron. Then d = D. In this case, since the understanding of the learning
Exponential or Polynomial Learning Curves?
Figure 2: The learning curve of learning a 50-dimensional majority function with perceptrons. The dashed lines denote two possible theoretical predictions, one of which is valid only for small m.
problem is still very limited, we first present the simulation result and then examine possible theoretical interpretations of the resulting learning curve behavior. Figure 2 illustrates the experimental result of learning a 50-dimensional majority function with perceptrons, where two possible theoretical interpretations (B_1, B_2) are also plotted. The training sample size varies from 60 to 8000. The learning algorithm is reward-punishment with momentum. The learning rate is set to 0.65, and the momentum to 0.5. The generalization error is estimated by testing on 4096 new examples (8000 for training sample sizes greater than 6000). The learning curve is obtained by averaging over 50 runs for each training sample size. The possible theoretical interpretations are described below.

Since d = 50 ≫ 1 and average-case samples obtained via the uniform pruning method are only statistically unbiased, it is not difficult to verify that for m' = 1, if the decision boundary crosses over A_k^+ (or A_k^-), the change in generalization error is nearly p_k. More specifically, if x^{m'} contains A_0^+ or A_0^-, then one can only assert that almost surely R(w') is close to zero. Therefore, a crude bound on the learning curve could be

    B_1 = p_0 (1 - p_0)^{m/50} + p_1 (1 - p_0 - p_1)^{m/50} + \frac{50}{m+50} (1 - p_0 - p_1)^{m/50+1},   (6.2)
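The crude bound B_1 can be sketched as a function of the sample size m; we parameterize the effective dimension d so the same form can be reused with d = 209 in section 6.3 (the function name is ours):

```python
def crude_bound(m, d, p0=0.11, p1=0.10):
    """Crude learning-curve bound of equation 6.2 (B_1), for effective dimension d."""
    return (p0 * (1 - p0) ** (m / d)
            + p1 * (1 - p0 - p1) ** (m / d)
            + d / (m + d) * (1 - p0 - p1) ** (m / d + 1))
```

As a sanity check, the bound equals 1 at m = 0, decays monotonically over the experimental range, and decays more slowly for larger d.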
which should work satisfactorily for small m but fail for large m, since it assumes that w' has perfect performance whenever x^{m'} contains A_0^+ or A_0^-. Note from Figure 2 that the learning curve seems to differ from equation 6.2 only within a small multiplicative factor for sample sizes less than 1500 (m/d < 30).

A refined bound could be obtained by taking x^m − x^{m'} into account. When d is large, for m' = 1, if the decision boundary of w' crosses over A_k^+, then the change in generalization error is almost surely nearly p_k. This is equivalent to saying that in the case m' = 1 the distribution of R(w') is, compared with that used in section 6.1, smeared to a certain extent, so that over some range Pr\{R(w') \le ε\} remains nearly constant. Therefore, for m' = 1, Pr\{R(w') = 0\} would be considerably small, but Pr\{R(w') \le ε\} quickly climbs to p_0 for 0 < ε \le p_0. Moreover, because w' depends on x^m − x^{m'}, even if x^{m'} contains A_0^+ or A_0^-, w' could still misclassify a part of A_1^+ or A_1^-, for example, A_{11}^+ = {x_i : \sum_{j=2}^{D} x_{ij} = D/2 + 1, x_{i1} = 1}. According to this type of reasoning, Pr\{R(w') \le ε\} in the case m' = 1 is possibly characterized by the discrete values p_0 + 0.5p_1, p_0 + p_1 + p_2, p_0 + p_1 + p_2 + p_3 + 0.5p_4, ..., such that there are quick transitions when R(w') is near 0, p_0 + 0.5p_1, p_0 + p_1 + p_2, p_0 + p_1 + p_2 + p_3 + 0.5p_4, .... Then, according to equation 4.2 and theorem 1, the learning curve bound would be characterized by

    B_2 = \frac{m}{m+50}\left(1 - (1 - e_1)^{m/50+1}\right) + (e_2 - e_1)(1 - p_0)^{m/50} + (p_0 + 0.5p_1 - e_2)(1 - p_0 - 0.5p_1)^{m/50}
          + (0.5p_1 + p_2)(1 - p_0 - p_1 - p_2)^{m/50} + \frac{50}{m+50}(1 - p_0 - p_1 - p_2)^{m/50+1},
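Under our reading of this (typographically damaged) expression for B_2, and taking as defaults the fitted values reported just below (e_1 = 0.00028, e_2 − e_1 = 0.005, p_0 = 0.03), the bound can be sketched as follows. The default p_1 = 0.10 is the D = 50 class probability, and p_2 ≈ 0.09 is our own estimate from the binomial formula; the function name is also ours:

```python
def refined_bound(m, e1=0.00028, e2=0.00528, p0=0.03, p1=0.10, p2=0.09, d=50):
    """Refined learning-curve bound B_2, as reconstructed above (fitted defaults from the text)."""
    return (m / (m + d) * (1 - (1 - e1) ** (m / d + 1))
            + (e2 - e1) * (1 - p0) ** (m / d)
            + (p0 + 0.5 * p1 - e2) * (1 - p0 - 0.5 * p1) ** (m / d)
            + (0.5 * p1 + p2) * (1 - p0 - p1 - p2) ** (m / d)
            + d / (m + d) * (1 - p0 - p1 - p2) ** (m / d + 1))
```

At m = 0 the terms sum to 1 − e_1, and the bound decreases over the experimental sample-size range.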
for some e_1, e_2, p_0. A possible fit is obtained when e_1 = 0.00028, e_2 − e_1 = 0.005, and p_0 = 0.03, and is plotted in the figure. Note that the first two terms in B_2 are largely descriptive. The exact interpretation of the learning curve is still far from clear.

6.3 Learning with Multilayer Perceptrons. We illustrate here the result of learning a 50-dimensional majority function with 50-4-1 feedforward, fully connected networks. The output layer uses a threshold unit, and the hidden layer uses sigmoidal units. To calculate the learning curve bounds for the ill-disposed learning algorithm, we simply substitute d with the number of weights; in this case, d = 209. The training algorithm is a mixture of the perceptron learning algorithm (in the output layer) and backpropagation (in the hidden layer) without momentum. The learning rate is set to 0.25. The generalization error is estimated by testing against 4096 new examples (8000 for training sample sizes exceeding 1200). The final learning curve is obtained by averaging over 50 runs for each training sample size and is
Figure 3: The learning curve of learning a 50-dimensional majority function with 50-4-1 networks. The dashed lines indicate two possible theoretical interpretations.
plotted in Figure 3, where two possible theoretical interpretations are also plotted. Similar to the discussion of learning with perceptrons, a crude bound on the learning curve could be

    B_1 = p_0 (1 - p_0)^{m/209} + p_1 (1 - p_0 - p_1)^{m/209} + \frac{209}{m+209} (1 - p_0 - p_1)^{m/209+1}.
Of course, owing to the limitations of the theory, the bound cannot explain the part where m < d. However, it works quite well for sample sizes in the range [500, 1700]. On the other hand, it is interesting that in this case the bound is still valid for sample sizes up to 6000; this may arise from the fact that m/d < 30. More interesting still, the learning curve appears to exhibit an exponential transition for sample sizes in [2000, 3000], and then to return to an exponential behavior with the same exponential rate as in [500, 1700].

One possible explanation of the learning curve behavior may be inherent in the nature of the learning algorithm. It has been shown that the backpropagation algorithm is Bayes optimal to the extent that the algorithm is able to find a global error minimum and to the extent that the network is able to represent the optimal Bayes function (Ruck, Rogers, Kabriskey, Oxley, & Suter, 1990; Wang, 1990). Therefore, if the learning algorithm performs optimally, it is reasonable to assume that for m' = 1, if the decision boundary
of w' crosses over A_k^+, then w' can classify correctly not only A_k^+ but also most of A_{k-1}^+. Then, according to equation 4.2 and theorem 1, a possible bound on the learning curve would be

    B_2 = e_1 (1 - p_0)^{m/209} + (e_2 - e_1)(1 - p_0 - p_1)^{m/209} + \frac{209}{m+209} (1 - e_1)^{m/209+1},
where e_1 < p_0 and e_2 − e_1 < p_1. Since the learning algorithm is a mixture of the perceptron learning algorithm and backpropagation without momentum, when the sample size is small it may not be optimal, so that e_1 ≈ p_0; when the sample size is sufficiently large, the optimality saturates, so that e_1 ≈ a·p_0 for some constant a < 1. Clearly, this is equivalent to saying that there are events of a trial sufficient to make the learning algorithm perform optimally. Therefore, the learning curve could exhibit exponential transitions between B_1 and B_2. Since B_2 ≈ aB_1 for large m/d, a possible bound on the learning curve would be B = P(A)aB_1 + [1 − P(A)]B_1, where P(A) denotes the probability that the events occur in a trial.

Note that the target function (the majority function) is linearly separable, and its decision boundary can be uniquely specified by D points. Hence, because an optimal learning machine is a perceptron with d = D, whether the learning algorithm can perform optimally could depend essentially on whether a reduced sample of size m/D obtained by the uniform pruning method contains A_0^+ (or A_0^-), which is necessary to make the perceptron show perfect performance. This defines one event. Clearly, to ensure that the training algorithm performs optimally, we would need D such events (because D points are needed to specify the decision boundary of the target function). Therefore, by lemma 2, P(A) ≈ [1 − exp(−p_0 m / D)]^D. Indeed, a good fit to the learning curve can be obtained by setting a = 0.17 and P(A) = [1 − exp(−0.0021m)]^{50} (see Figure 3). The exact interpretation is not yet clear and seems challenging.

7 Conclusion
Although our discussion is somewhat descriptive, it accounts for the learning curve behaviors observed in learning majority functions. The results show that these learning curves contain both polynomial and exponential behaviors, and that the exponential transitions are not necessarily toward perfect performance. Similar results can be obtained for learning majority-XOR functions (Gu, 1997). We expect that our results can also be applied to other learning problems.
Acknowledgments
H. Gu acknowledges a student scholarship from the Japanese Government (Monbusho). This work was supported in part by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science and Culture, Japan.

References

Cohn, D., & Tesauro, G. (1992). How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4, 249–269.
Gu, H. (1997). Towards a practically applicable theory of learning curves in neural networks. Unpublished doctoral dissertation, University of Electro-Communications, Tokyo.
Gu, H., & Takahashi, H. (1996). Towards more practical average bounds on supervised learning. IEEE Transactions on Neural Networks, 7, 953–968.
Gu, H., & Takahashi, H. (1997). Estimating learning curves of concept learning. Neural Networks, 10, 1089–1102.
Haussler, D., Seung, H. S., Kearns, M., & Tishby, N. (1994). Rigorous learning curve from statistical mechanics. In Proc. 7th Annu. ACM Workshop on Comput. Learning Theory (COLT) (pp. 76–87).
Ruck, D. W., Rogers, S. K., Kabriskey, M., Oxley, M. E., & Suter, B. W. (1990). The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. Neural Networks, 1, 296–298.
Schwartz, D. B., Samalam, V. K., Solla, S. A., & Denker, J. S. (1990). Exhaustive learning. Neural Computation, 2, 374–385.
Takahashi, H., & Gu, H. (1998). A tight bound on concept learning. IEEE Transactions on Neural Networks, 9, 1191–1202.
Tishby, N., Levin, E., & Solla, S. (1989). Consistent inference of probabilities in layered networks: Predictions and generalizations. IJCNN Proc., 2, 403–409.
Wang, E. (1990). Neural network classification: A Bayesian interpretation. IEEE Trans. Neural Networks, 1, 303–305.
Watkin, T. L. H., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Rev. Mod. Phys., 65, 499–556.

Received February 27, 1998; accepted July 2, 1999.
LETTER
Communicated by Nicol Schraudolph
Training Feedforward Neural Networks with Gain Constraints

Eric Hartman
Pavilion Technologies, 1110 Metric Blvd. #700, Austin, TX 78758-4018, U.S.A.

Inaccurate input-output gains (partial derivatives of outputs with respect to inputs) are common in neural network models when input variables are correlated or when data are incomplete or inaccurate. Accurate gains are essential for optimization, control, and other purposes. We develop and explore a method for training feedforward neural networks subject to inequality or equality-bound constraints on the gains of the learned mapping. Gain constraints are implemented as penalty terms added to the objective function, and training is done using gradient descent. Adaptive and robust procedures are devised for balancing the relative strengths of the various terms in the objective function, which is essential when the constraints are inconsistent with the data. The approach has the virtue that the model domain of validity can be extended via extrapolation training, which can dramatically improve generalization. The algorithm is demonstrated here on artificial and real-world problems with very good results and has been advantageously applied to dozens of models currently in commercial use.

1 Introduction
Neural Computation 12, 811–829 (2000)  © 2000 Massachusetts Institute of Technology

Neural networks and other empirical modeling techniques are generally intended for situations where no physical (first-principles) models are available. However, in many instances, partial knowledge about physical models may be known. Because empirical and physical modeling methods have largely complementary sets of advantages and disadvantages (e.g., physical models generally are more difficult to construct but extrapolate better than neural network models), the ability to incorporate available a priori knowledge into neural network models can capture advantages from both modeling methods.

The nature of partial prior knowledge varies by application. For example, in the continuous process industries, such as refining and chemical manufacturing, operators and process engineers usually know, from physical understanding or experience, a great deal about their processes, including (1) input-output causality relations (knowledge that is easily modeled by network connectivity), (2) input-output nonlinear-linear relations (knowledge easily modeled by network architecture), and (3) approximate bounds
on some or all of the gains. This knowledge can be modeled by adding constraints of the form

    \min_{ki} \le \frac{\partial y_k}{\partial x_i} \le \max_{ki}   (1.1)
to the training algorithm, where y_k is the kth output, x_i is the ith input, and min_ki and max_ki are constants. (Note that a nonzero max_ki − min_ki range allows for nonlinear input-output relationships.) This article describes a robust method for carrying out item 3, which we call gain-constrained training.

Gain-constrained models give superior performance both when calculating predictions and when calculating gains. When calculating predictions, extrapolation (generalization) accuracy can be improved by augmenting gain-constrained training with extrapolation training, in which the gain constraints alone are imposed outside the training region. When calculating gains, accuracy is vital for optimization, control, and other uses. Correlated inputs (substantial cross-correlations among the time series of the input variables) and "weak" (inaccurate, incomplete, or closed-loop) data are very common in real applications and frequently cause incorrect gains in neural network models. When inputs are correlated, many different sets of gains can yield equivalent data-fitting quality (the gain solution is underspecified; see section 5.1 for an example). Constraining the gains breaks this symmetry by specifying which set of gains is known a priori to be correct (the gain solution is fully specified).¹ Such gain constraints are consistent by definition with the data and therefore do not affect the data-fitting quality. In the case of "weak" data, on the other hand, gain constraints are imposed purposefully to override or contradict the data. A simple process-industries example is illustrated in Figure 1, and a simple generic example is given in Sill and Abu-Mostafa (1997). Such gain constraints are not consistent with the data and therefore typically increase the data-fitting error.

Merging prior knowledge with neural networks has been addressed before in the literature (especially relevant references are Lee & Oh, 1997; Sill & Abu-Mostafa, 1997; and Lampinen & Selonen, 1995).
Thompson (1996) provides a broad framework focusing on input-output relations rather than gains. All previous work formulates the problem in terms of equality bounds (point gain targets), as opposed to the inequality-bounds formulation of equation 1.1 (interval gain targets), although the formulation of Lampinen and Selonen (1995) can in effect achieve interval targets by means of an associated certainty coefficient function.

¹ A gain constraint for every input-output pair need not be specified; in this case the gain solution may be more fully, but not necessarily completely, specified.
Figure 1: Dots represent scatter-plot data of pH versus acid for a stirred-tank reactor, where the flow rate of base into the tank is constant and the pH is closed-loop controlled to a constant set-point value by adjusting the flow rate of acid into the tank. (The plot represents a small, linear region of operation.) Whenever the pH rises, the controller increases the acid inflow in order to lower the pH, and vice versa, creating data with a positive pH-acid correlation, as shown; that is, the data pH-to-acid gain is positive. The correct physical relationship, given by the solid line, has a negative pH-to-acid gain, which completely contradicts the data.
This article makes the following contributions. An extrapolation training procedure is presented. Term balancing, which is especially important in the case of conflicting data and constraints, is directly addressed. Gain constraints are formulated explicitly as inequalities (or equalities). Also, the second-derivative expressions for direct input-output connections, which are important in many applications, are new.

2 The Objective Function
A neural network with outputs y_k and inputs x_i is trained subject to the set of constraints of equation 1.1. The objective function to be minimized in training is

    E = E_D + E_C,   (2.1)

where E_D is the data-fitting error and E_C is a penalty for violating the constraints. Weighting factors for E_D and E_C are introduced later in the form
Figure 2: Penalty function f(∂y_k/∂x_i).
of learning rates (section 3). We use the standard sum of squares for E_D:

    E_D = \sum_{k}^{N_{out}} E_{D_k} = \sum_{p}^{N_{trp}} \sum_{k}^{N_{out}} \left( t_k^p - y_k^p \right)^2,   (2.2)
where p is the pattern index, t_k^p is the target for output y_k^p, N_out is the number of output units, and N_trp is the number of training patterns. Although we assume equation 2.2 throughout, the algorithm has no inherent restriction to this form for E_D. We write E_C as

    E_C = \sum_{k}^{N_{out}} \sum_{i}^{N_{inp}} E_{C_{ki}} = \sum_{p}^{N_{trp}} \sum_{k}^{N_{out}} \sum_{i}^{N_{inp}} f\!\left( \frac{\partial y_k^p}{\partial x_i^p} \right),   (2.3)
where N_inp is the number of input units and f measures the degree to which ∂y_k^p/∂x_i^p violates its constraints (see section 2.1 and Figure 2). In the following, the pattern index p is suppressed for clarity.

2.1 The Penalty Function. We define the penalty function f as
    f\!\left(\frac{\partial y_k}{\partial x_i}\right) \equiv f_{ki} =
    \begin{cases}
    \dfrac{\partial y_k}{\partial x_i} - \max_{ki} & \text{if } \dfrac{\partial y_k}{\partial x_i} > \max_{ki} & (\text{max is violated}) \\
    \min_{ki} - \dfrac{\partial y_k}{\partial x_i} & \text{if } \dfrac{\partial y_k}{\partial x_i} < \min_{ki} & (\text{min is violated}) \\
    0 & \text{else} & (\text{no violation})
    \end{cases}   (2.4)

and its derivative as

    f'_{ki} =
    \begin{cases}
    +1 & \text{if } \partial y_k/\partial x_i > \max_{ki} & (\text{max is violated}) \\
    -1 & \text{if } \partial y_k/\partial x_i < \min_{ki} & (\text{min is violated}) \\
    0 & \text{else} & (\text{no violation})
    \end{cases}   (2.5)
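Equations 2.4 and 2.5 amount to a one-sided linear (hinge-like) penalty on the gain; a minimal sketch (the function names are ours):

```python
def penalty(gain, lo, hi):
    """f of equation 2.4: distance outside the interval [lo, hi], zero inside."""
    if gain > hi:
        return gain - hi   # max is violated
    if gain < lo:
        return lo - gain   # min is violated
    return 0.0             # no violation

def penalty_deriv(gain, lo, hi):
    """f' of equation 2.5: the derivative of the penalty with respect to the gain."""
    if gain > hi:
        return 1.0
    if gain < lo:
        return -1.0
    return 0.0
```

For example, with the interval [0.244, 0.256] used in section 5.1, a gain of 0.30 incurs a penalty of 0.044, while a gain of 0.25 incurs none.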
The function f is shown in Figure 2 (such shapes are widely used in penalty functions). No improvement in solution quality resulted from using an everywhere-smooth (and more time-consuming) approximation of equation 2.4.

3 Gradient Descent Training
From equations 2.2, 2.3, and 2.5 and the chain rule, the gradient descent update equation for a single pattern is given by

    \Delta w = -\sum_{k}^{N_{out}} \eta_k \frac{\partial}{\partial w}\left(t_k - y_k\right)^2 - \sum_{k}^{N_{out}} \sum_{i}^{N_{inp}} \lambda_{ki}\, f'_{ki}\, \frac{\partial}{\partial w}\left(\frac{\partial y_k}{\partial x_i}\right),   (3.1)
where w denotes an arbitrary weight or bias, η_k is the learning rate for E_{D_k}, and λ_ki is the learning rate for E_{C_{ki}}. Expressions for the second derivatives ∂/∂w(∂y_k/∂x_i) are given in the appendix. The weights and biases are updated after M patterns (M a parameter) according to

    \Delta w(T) = \Delta w(T) + \mu\, \Delta w(T-1),   (3.2)
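The ∂y_k/∂x_i factors needed in equation 3.1 come from the backward pass described in section 3.2. For a one-hidden-layer network with sigmoidal hidden units and a linear output, that pass reduces to a closed form, sketched below and checked against finite differences (this toy network and all names are our own illustration, not the author's architecture):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, w2, b2):
    """Hidden activations and linear output of a one-hidden-layer network."""
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(len(x))) + b1[j])
         for j in range(len(W1))]
    y = sum(w2[j] * h[j] for j in range(len(h))) + b2
    return y, h

def gains(x, W1, b1, w2, b2):
    """dy/dx_i = sum_j w2_j * h_j * (1 - h_j) * W1_ji, i.e. the backward pass of section 3.2."""
    _, h = forward(x, W1, b1, w2, b2)
    return [sum(w2[j] * h[j] * (1.0 - h[j]) * W1[j][i] for j in range(len(h)))
            for i in range(len(x))]
```

A central-difference check on a small random-ish network confirms the analytic gains to high precision.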
where T is the weight-update index and μ is the momentum coefficient. (Section 3.2 details the steps of the training algorithm.)

An alternative to this gradient-descent formulation would be to train using a constrained nonlinear programming (NLP) method (e.g., SQP or GRG) (Nash & Sofer, 1996; Smith & Lasdon, 1992). We chose our formulation instead because (1) commercial NLP codes use Hessian information (either exact or approximate), and in our experience any such explicitly higher-order training method tends toward overfitting in a way that is difficult to control in a problem-independent manner, and (2) the methods of sections 3.1 and 4 constitute "changes to the problem" for NLPs, which would interfere with their Hessian calculation procedures.

3.1 Extrapolation Training. Some degree of improved model extrapolation may be expected from gain-constrained training solely from enforcing correct gains at the boundaries of the training data set (Lee & Oh, 1997). However, extrapolation training explicitly improves extrapolation accuracy by shaping the response surface outside the training regions. Extrapolation training applies the gain-constraint penalty terms alone to the network for patterns artificially generated anywhere in the entire input space, whether or not training patterns exist there. An extrapolation pattern consists of an input vector only; no output target is needed because the penalty terms do not involve output target values. The number of extrapolation patterns to be included in each training epoch is an adjustable parameter. Extrapolation input vectors are generated at random from a uniform distribution covering all areas of the input space over which the constraints are assumed to
be valid. The effectiveness of this method is shown in the synthetic data example of section 5.1.

3.2 Training Procedure Summary. The training algorithm proceeds as follows:

1. Backpropagation (gradient descent) (LeCun, 1985; Parker, 1985; Rumelhart, Hinton, & Williams, 1986) is performed on E_D, and the Δw's in equation 3.1 are incremented with the first-term contributions.

2. A backward pass for each y_k is performed to compute ∂y_k/∂x_i. This is done by clearing the propagation-error δ's (see Rumelhart et al., 1986) for all units, setting δ_{y_k} = y'_k, and backpropagating to the inputs, which yields δ_{x_i} = ∂y_k/∂x_i. This follows easily from the definition of δ, and the technique has long been known to practitioners. (See Lee & Oh, 1997, for a complete derivation.)

3. Quantities computed in step 2 are used to evaluate the ∂/∂w(∂y_k/∂x_i) expressions given in the appendix, and the Δw's in equation 3.1 are incremented with the second-term contributions.

4. The weights are updated after every M patterns (1 ≤ M ≤ N_trp) according to equation 3.2.

For an extrapolation pattern (section 3.1), step 1 is omitted and step 2 is preceded by a forward pass.

4 Dynamic Balancing of Objective Function Terms
If the constraints do not conflict with the data, balancing the relative strengths (learning rates) of the data and penalty terms in the objective function is relatively straightforward (Lee & Oh, 1997). If the constraints conflict with the data, however, a robust method for term balancing is of key importance. While the constraints have priority over the data in case of conflict, insisting that all constraint violations be zero is ordinarily neither necessary nor conducive to obtaining good models (models of practical value) in real applications. Typically it is not desirable to drive "constraint outliers" of negligible practical consequence into absolute conformance at the expense of significantly worsening the data fit. Instead, we allow for each constraint a specified fraction of training patterns (typically 10–20%) to violate the constraint to any degree before deeming the constraint to be in violation. The violating patterns may be different for each constraint, and the violating patterns for each constraint may be chosen from the entire training set. The resulting combinatorics allows the network enormous degrees of freedom to achieve maximum data fitting while satisfying the constraints.
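The "allowed fraction of violators" test just described can be sketched in a few lines (a simplified illustration; the function name, the 15% default, and the per-pattern gain list are our own assumptions):

```python
def constraint_satisfied(pattern_gains, lo, hi, allowed_fraction=0.15):
    """Deem constraint ki satisfied if at most `allowed_fraction` of the training
    patterns violate the gain interval [lo, hi] to any degree (the 10-20% rule)."""
    violations = sum(1 for g in pattern_gains if g > hi or g < lo)
    return violations <= allowed_fraction * len(pattern_gains)
```

With a 15% allowance, a constraint violated on 1 of 10 patterns is deemed satisfied, while one violated on 3 of 10 is not.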
Dynamic term balancing proceeds as follows. The initial values of the η_k are set to a standard value. For each output, the initial values for the λ_ki are computed before training begins such that the average first-level derivatives of the data term (for that output) are approximately equal to the first-level derivative of the sum of the constraint terms (for that output). The learning rates η_k and λ_ki are then adjusted after each epoch according to the current state of the training process.

A wide variety of situations are possible in gain-constrained training: constraints may be consistent or inconsistent with the data; constraints may be violated at the outset of training or only much later, after the gains have grown in magnitude; or the constraints may be misspecified and cannot be achieved at all. Hence, a variety of rules for adjusting the learning rates are necessary. To adjust η_k and λ_ki after each epoch, we use an extension of the rapid-slow η dynamics described in Vogl et al. (1988) for training backprop nets, where η is slowly (additively) increased if the error is decreasing and rapidly (multiplicatively) decreased if the error is increasing. In the following, a "large" change indicates a multiplicative change, and a "small" change indicates an additive change. The fundamental goal that the constraints should override the data in case of conflict dictates the following basic scheme. (λ_ki is, of course, subject to change only if a constraint for input-output variable pair ki has been specified.)

λ_ki:
Large increase if constraint ki is not satisfied ("satisfaction" is defined above in this section).
Small decrease if constraint ki is satisfied.

η_k:
Large decrease if E_D has worsened (increased).
Small increase if E_D has improved (decreased).
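The basic scheme (ignoring the exceptions discussed next) can be sketched as follows, using the representative parameter values quoted at the end of this section; the function is our own simplification:

```python
def adjust_rates(eta_k, lam_ki, constraint_satisfied, ed_improved):
    """One epoch of the basic rapid-slow balancing scheme (exceptions in the text omitted).

    The "large" multiplicative factors 1.3 and 0.8 and the "small" additive
    steps of -0.03*lam and +0.02*eta follow the representative values in the text.
    """
    if not constraint_satisfied:
        lam_ki = 1.3 * lam_ki            # large increase: constraint ki not satisfied
    else:
        lam_ki = lam_ki - 0.03 * lam_ki  # small decrease: constraint ki satisfied
    if not ed_improved:
        eta_k = 0.8 * eta_k              # large decrease: E_D worsened
    else:
        eta_k = eta_k + 0.02 * eta_k     # small increase: E_D improved
    return eta_k, lam_ki
```

For instance, starting from η = λ = 1.0, one epoch with a violated constraint and an improving E_D yields η = 1.02 and λ = 1.3.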
The exceptions to these rules are presented in correspondence with the rules above.

λ_ki:
No increase is made if E_{C_{ki}} has improved, since, given that E_{C_{ki}} is improving, increasing λ_ki would make it unnecessarily harder for E_D to improve.
No decrease is made if all constraints are exactly satisfied (E_C = 0). In this case, either the constraints agree with the data, in which case retaining the λ_ki level will actually speed convergence, or the constraints are such that they require large gains to be violated and will therefore remain satisfied until later in the training process, when the weights have grown sufficiently to produce large gains. In the latter case, repeatedly decreasing the λ_ki's early in the training process would leave them too small to be effective when such constraints become violated.

η_k:
No decrease is made in two cases: (1) all constraints are satisfied but E_C has increased: both E_D and E_C are increasing, but the decrease in λ_ki (due to the constraints being satisfied) might allow E_D to improve without decreasing η_k; (2) not all constraints are satisfied, but E_C has improved: this improvement may allow E_D to improve without decreasing η_k.
No increase is made if E_C has increased, in accordance with allowing the constraints to dominate.

Parameter values used in these dynamics depend on the network architecture, but for typical networks they are generally in the range of 1.3·λ_ki and 0.8·η_k for multiplicative changes, and (−)0.03·λ_ki and (+)0.02·η_k for additive changes. Also, expected statistical fluctuations in training errors are accounted for when deeming whether E_D, E_C, or E_{C_{ki}} has increased.

5 Examples
This section applies gain-constrained training to two problems. The first is a synthetic example with maximally correlated (identical) inputs, which clearly illustrates the effectiveness of extrapolation training as well as the use of gain constraints to fully specify gain solutions that are otherwise underspecified due to correlated inputs. The second is a real-world process-industry application involving "weak" data.

5.1 Synthetic Example. The synthetic example consisted of 10 maximally correlated (identical) inputs, x_1 through x_10, and one output, y. The target t was also equal to the inputs:

    t = x_1 = x_2 = x_3 = ··· = x_10,   (5.1)
each variable being a ramp from 0 to 1. Figure 3 illustrates the data configuration in input space (in 3 instead of 10 dimensions). The data set consisted of 10,000 patterns. A three-layer architecture was used in all cases, with sigmoid output and hidden units. Although the data were known to be linear, a nonlinear model was used to demonstrate the effectiveness of the algorithm. According to equation 5.1, any set of gains
Figure 3: Data configuration for the synthetic example. All variables are identical; the heavy line represents the training data (the actual input space is a 10-dimensional hypercube). In gain-constrained training, both gain constraints and data fitting are applied to the training data. The dots represent extrapolation training patterns (randomly generated throughout training), where the gain constraints alone are applied (output targets neither exist nor are required).
summing to unity would be consistent with the data. We arbitrarily chose the gains ∂y/∂x_1 = 0.25 and ∂y/∂x_2 = 0.75. Allowing 5% tolerance (typical in real problems), we specified the constraints as

    0.244 ≤ ∂y/∂x_1 ≤ 0.256   (5.2)
    0.731 ≤ ∂y/∂x_2 ≤ 0.769   (5.3)
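The underspecification is easy to see with a linear surrogate model (a sketch of the situation, not the network actually trained here): any gain vector summing to 1 fits the diagonal training data exactly, and different solutions disagree only off the diagonal.

```python
def linear_model(x, g):
    """Linear surrogate y = sum_i g_i * x_i (stand-in for the trained network)."""
    return sum(gi * xi for gi, xi in zip(g, x))

g_constrained = [0.25, 0.75] + [0.0] * 8   # the gains singled out by equations 5.2-5.3
g_uniform = [0.1] * 10                     # the roughly uniform unconstrained solution

# On the training diagonal (all inputs equal), both fit the target t perfectly.
for t in [0.0, 0.3, 1.0]:
    x = [t] * 10
    assert abs(linear_model(x, g_constrained) - t) < 1e-12
    assert abs(linear_model(x, g_uniform) - t) < 1e-12

# Off the diagonal the two solutions extrapolate differently.
x_off = [1.0, 0.0] + [0.5] * 8
```

With `x_off`, the constrained gains predict 0.25 while the uniform gains predict 0.5, which is why the gain curves in Figure 4 diverge away from the data.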
No gain constraints involving inputs x_3 through x_10 were specified. Figure 4 contains gain plots from models trained (a) without gain constraints, (b and c) with gain constraints but without extrapolation training, and (d and e) with both gain constraints and extrapolation training. The constraints were consistent with the data, and in all cases the data-fitting error was 0.0.

Figure 4a shows the gains from the model trained without constraints. Each curve shows the gain of the output with respect to a single input, as a function of that input (ranging from 0 to 1), with the other nine inputs held constant at 0.5. That is, each gain curve is along a "horizontal" or "vertical" line running through the center of the 10-dimensional input-space hypercube (see Figure 3). As expected, the gains of the unconstrained model are distributed roughly uniformly across the 10 identical inputs, with a mean of about 1/10.

Figure 4b shows the gains for a model trained with gain constraints but without extrapolation training. In this case, training (with constraints) occurs only within the training data region, which runs diagonally through
the center of the input space (see Figure 3). The "horizontal" or "vertical" gain curve for an input x_i therefore intersects the training data region only at x_i = 0.5 (the center of the input space; in each curve all other inputs are
held constant at 0.5). Not surprisingly then, Figure 4b shows that the gain constraint specied for @y / @x1 (0.25) is satised only at x1 D 0.5 (where training data exist), whereas away from the data, the gains vary drastically from 0.25. The same comments hold for the @y / @x2 plot: the constraint for @y / @x2 (0.75) is satised only at the midpoint where there are training data and varies drastically away from the data. Figure 4c is identical to Figure 4b, except that the remaining nine inputs in each curve are held constant at values corresponding to a vertex of the 10-dimensional hypercube rather than the midpoint. Thus, the gain plots run along edges of the input space rather than through its center. The vertex was chosen such that the plots nowhere intersect the training data. The gains @y / @x1 and @y / @x2 satisfy their constraints only coincidentally at certain points. Figures 4d and 4e correspond to 4b and 4c, respectively, but with extrapolation training added. Figures 4d and 4e indicate that the model corresponding to the specied gain constraints has been learned everywhere over the 10-dimensional input space, even in corners far distant from any training data. Recalling that constraints were applied only to @y / @x1 and @y / @x2 , Figures 4d and 4e also show that extrapolation training more tightly enforced the “implied” constraints @y / @x3¡10 D 0. Because the constraints in this example were consistent with the data, the number of epochs required to achieve a given data-tting error with the gain constraints was fewer than the number of epochs required to achieve the same error without the constraints. 5.2 Industrial Process Application. In addition to the usual sparsity and inaccuracies, the data for this industrial process example were “weak” in a way that is not uncommon in data gathered from manufacturing processes. Figure 4: Facing page. Gains for the synthetic example (equations 5.1–5.3). 
Each curve shows the gain of the output with respect to a single input while holding the other inputs constant at their average (midpoint) values in (a), (b), and (d) and at a vertex in (c) and (e) (the same vertex in both cases, a vertex distant from the data set). GC = gain constraints, ET = extrapolation training: models were trained with (a) no GC, (b) GC but no ET (midpoint), (c) GC but no ET (vertex), (d) GC with ET (midpoint), (e) GC with ET (vertex). (a) The distribution of unconstrained gains averages 0.10 (as expected). (b) The specified constraints are satisfied only where the gain curves cross the data set, at the horizontal midpoint (cf. Figure 3). (c) The gain curves nowhere cross the data set and are correct only at accidental locations. (d, e) With extrapolation training, the specified constraints are satisfied within tolerance everywhere in the 10-dimensional hypercube input space, and the x3 through x10 gains are driven more closely to zero. The constraints were consistent with the data (constraining the otherwise underspecified gain solutions), and the data-fitting error was 0.0 in all cases. Constraints were applied only to ∂y/∂x1 and ∂y/∂x2.
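Gain curves like those in Figure 4 can be generated for any trained network, since the gains of a three-layer sigmoid-hidden, linear-output network have a simple closed form. The sketch below uses hypothetical random weights rather than a trained model, and checks the analytic gains against central finite differences:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, W1, b1, W2, b2):
    """Three-layer net: sigmoid hidden units, linear output."""
    h = sigmoid(W1 @ x + b1)
    return W2 @ h + b2

def gains(x, W1, b1, W2, b2):
    """Analytic gains dy_k/dx_i = sum_j W2[k,j] * h_j*(1-h_j) * W1[j,i]."""
    h = sigmoid(W1 @ x + b1)
    return (W2 * (h * (1.0 - h))) @ W1   # shape (n_outputs, n_inputs)

# Hypothetical small net: 10 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 10)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

x = np.full(10, 0.5)                     # midpoint of the input space
g = gains(x, W1, b1, W2, b2)

# Verify against central finite differences
eps = 1e-6
fd = np.array([(predict(x + eps * np.eye(10)[i], W1, b1, W2, b2)
              - predict(x - eps * np.eye(10)[i], W1, b1, W2, b2)) / (2 * eps)
               for i in range(10)]).T
assert np.allclose(g, fd, atol=1e-6)
```

Sweeping one coordinate of `x` while holding the rest fixed reproduces a single gain curve of the kind plotted in Figure 4.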
Figure 5: Network architecture for the industrial process application. The output units are linear and the hidden units are sigmoidal.
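The layout of Figure 5 can be sketched directly in code. In the sketch below, the assignment of inputs to outputs is inferred from constraints 5.4 through 5.11 (x3 through x6 feed y1; x1, x2, x3, and x7 feed y2, with x3 shared), and the hidden-layer size is invented for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 0-based input indices feeding each output. Inferred from the gain
# constraints (equations 5.4-5.11), not from an explicit table in the text.
INPUTS_Y1 = [2, 3, 4, 5]   # x3, x4, x5, x6
INPUTS_Y2 = [0, 1, 2, 6]   # x1, x2, x3, x7 (x3 feeds both outputs)

def predict(x, params):
    """Separate sigmoid hidden layer per linear output, as in Figure 5."""
    ys = []
    for idx, (W1, b1, w2, b2) in zip((INPUTS_Y1, INPUTS_Y2), params):
        h = sigmoid(W1 @ x[idx] + b1)   # branch sees only its own inputs
        ys.append(w2 @ h + b2)
    return np.array(ys)

# Hypothetical weights: each branch has 3 hidden units over its 4 inputs
rng = np.random.default_rng(1)
params = [(rng.normal(size=(3, 4)), rng.normal(size=3),
           rng.normal(size=3), rng.normal(size=1)) for _ in range(2)]
y = predict(rng.normal(size=7), params)
```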
The data were collected while parts of the process were under closed-loop control. In this case, an unconstrained neural network will model a mixture of the controller and the process. Because a controller (whether automated or human) essentially represents an inverse model of the process, networks trained without gain constraints often contain gains with reversed signs (see Figure 1 for a simple illustration), as is the case in this example. The process has seven inputs, x1 through x7, and two outputs, y1 and y2 (generic labels are used to protect confidentiality). The data set consisted of approximately 79,000 patterns. Figure 5 shows the three-layer architecture that was used throughout, containing a separate hidden layer for each output. Note that only the x3 input was connected to both outputs. All units were fully connected to their appropriate layers. Linear output and sigmoid hidden units were used. Gain constraints were imposed to override the incorrect gains. The physically correct gain signs, known a priori, were specified as:

0.05 ≤ ∂y1/∂x3 ≤ +1    (positive)    (5.4)
0.05 ≤ ∂y1/∂x4 ≤ +1    (positive)    (5.5)
0.05 ≤ ∂y1/∂x5 ≤ +1    (positive)    (5.6)
−1 ≤ ∂y1/∂x6 ≤ −0.05   (negative)    (5.7)
−1 ≤ ∂y2/∂x1 ≤ −0.05   (negative)    (5.8)
0.05 ≤ ∂y2/∂x2 ≤ +1    (positive)    (5.9)
−1 ≤ ∂y2/∂x3 ≤ −0.05   (negative)    (5.10)
0.05 ≤ ∂y2/∂x7 ≤ +1    (positive)    (5.11)
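The paper's actual objective function and its dynamic balancing of data and constraint terms are described in section 3. Purely as an illustration, a generic quadratic penalty on violations of bounds like equations 5.4 through 5.11 could look as follows (a sketch, not the author's implementation):

```python
import numpy as np

# Bounds from equations 5.4-5.11 as (output index, input index, lo, hi),
# using 0-based indices.
GAIN_BOUNDS = [
    (0, 2,  0.05,  1.0), (0, 3,  0.05,  1.0), (0, 4,  0.05,  1.0),
    (0, 5, -1.0, -0.05),
    (1, 0, -1.0, -0.05), (1, 1,  0.05,  1.0), (1, 2, -1.0, -0.05),
    (1, 6,  0.05,  1.0),
]

def gain_penalty(gains):
    """Quadratic penalty on bound violations. The paper's actual constraint
    term and its weighting against the data term are given in section 3;
    this is only a generic sketch of how gain bounds enter an objective."""
    total = 0.0
    for k, i, lo, hi in GAIN_BOUNDS:
        g = gains[k, i]
        total += max(0.0, lo - g) ** 2 + max(0.0, g - hi) ** 2
    return total

g = np.zeros((2, 7))      # all gains zero: every bound is violated by 0.05
print(gain_penalty(g))    # 8 * 0.05**2 = 0.02
```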
The small margin of 0.05 was used to avoid zero crossings; for bounds other than zero, crossing the bounds is of less significance. Figure 6 shows output responses to selected inputs (the gains are the slopes of these curves, in contrast to Figure 4). To avoid cluttering the plots, only two inputs are shown for each output. Figures 6a and 6b show the model trained without gain constraints, and 6c and 6d show the model trained with gain constraints. The architecture was the same in both cases (see Figure 5). Figure 6a shows the y1 response of the unconstrained model to inputs x3 and x5. In each curve, the remaining input variables were held constant at their average values. The negative slopes in these responses clearly contradict the prior knowledge that they should be positive. (The other two inputs to y1 also violated their constraints.) Figure 6b shows the y2 response of the unconstrained model to inputs x1 and x3. The positive slopes in these responses clearly contradict the prior knowledge that they should be negative. Figures 6c and 6d show the responses (for the same inputs as in Figures 6a and 6b) of a model trained with the above eight gain (sign) constraints (and with extrapolation training). The gains now satisfy the constraints. The training and validation data-fitting error values for the unconstrained model were both 0.50. These are normalized E_D error values, defined as

E_D^norm = sqrt( Σ_pk (t_k^p − y_k^p)² / Σ_pk (t_k^p − t̄_k)² ).  (5.12)
These error values are not unusual in industrial applications and typically represent useful models. The gain-constrained model training and validation data-fitting errors were 1.01 and 0.98, respectively. This large increase in data-fitting error is not unexpected given the many, drastically data-contradicting constraints. Our experience with this algorithm in industrial applications has ranged from very large increases in data-fitting error, such as in this example (primarily overriding "weak" data), to zero increase in data-fitting error (purely constraining underspecified gain solutions). Because the constraints in this example were inconsistent with the data, the number of epochs required to achieve minimal data-fitting error subject to the constraints was somewhat greater than the number of epochs required to achieve minimal data-fitting error without the constraints. The constrained model is one of dozens of such models currently operating successfully in industrial plants.
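Equation 5.12 is straightforward to compute. A minimal sketch (a generic implementation, not the author's code), where rows of `t` and `y` are patterns p and columns are outputs k:

```python
import numpy as np

def normalized_error(t, y):
    """Normalized data-fitting error of equation 5.12: the RMS residual
    divided by the RMS deviation of the targets from their per-output mean,
    summing over patterns p and outputs k."""
    tbar = t.mean(axis=0, keepdims=True)      # per-output target mean
    return np.sqrt(((t - y) ** 2).sum() / ((t - tbar) ** 2).sum())

t = np.array([[0.0], [1.0], [2.0], [3.0]])
print(normalized_error(t, t))                    # 0.0: perfect fit
print(normalized_error(t, 0 * t + t.mean()))     # 1.0: predicting the mean
```

Under this normalization, an error of 0.50 means the residuals have half the spread of the targets, which is why such values can still correspond to useful models.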
Figure 6: Output responses of unconstrained and constrained models of the industrial process application. The gains are the slopes of the plots. The process was under partial closed-loop control during data gathering, which resulted in some wrong-sign gains in the unconstrained model. For clarity, each part of the figure shows the response of an output to only two of its four inputs. Each curve shows the response to a single input while the remaining inputs are held constant at their average values. (a) Output y1 responses of the unconstrained model (slopes should be positive). (b) Output y2 responses of the unconstrained model (slopes should be negative). (c) Output y1 responses of the constrained model (slopes have correct sign). (d) Output y2 responses of the constrained model (slopes have correct sign). In (a) and (b), the responses shown (and others not shown) violate their constraints. In (c) and (d), the responses shown (and those not shown) conform to the specified gain constraints. Not unexpectedly, the validation data-fitting error increased significantly. The constrained model is successfully operating at a continuous-process manufacturing plant.
6 Summary and Outlook
An algorithm to train feedforward neural networks subject to inequality (or equality) bound constraints on the gains of the learned mapping was presented. Neural networks trained without gain constraints often develop wrong gains due to correlated inputs (which cause the gains solution to be underspecified), "weak" (inaccurate or incomplete) data, or both. A method for dynamically balancing mutually inconsistent constraints and data terms in the objective function was described. The robustness of this procedure depends strongly on a certain measure of constraint violation, which was also defined.

The novel extrapolation training technique that was presented applies the gain constraints alone to the model in regions of input space where no training data are available. This procedure improves model extrapolation in both the prediction and the gain modes. The effectiveness of extrapolation training was demonstrated on a 10-dimensional synthetic problem, in which the response surface of the model was correct everywhere over the input space, despite the fact that the training set was confined to a single line running diagonally through the 10-dimensional hypercube. The problem also demonstrated the effectiveness of gain constraints in fully specifying the gain solution, which was otherwise underconstrained due to correlated inputs. The algorithm was also applied to a real process industry problem with "weak" data. Several gain constraints, known a priori, were specified to reverse the incorrect signs of the gains inherent in the data.

The effectiveness of this algorithm has been proved by its use in creating dozens of models that are operating successfully on a variety of units in continuous process plants, such as polymer reactors, boilers for power turbines, paper manufacturing, and regulated emissions monitoring.
These models are being used for continuous steady-state optimization as well as for supplying the gain component in a closed-loop nonlinear dynamic model predictive control system. In many of these cases, neural network modeling would have otherwise been essentially impossible due to critical gain problems caused by correlated inputs and "weak" data. Both sign-only and more restricted magnitude-range constraints have been used in these models. In particular, for projects involving multiple identical (same manufacturer) units, relatively exact gain bounds have been developed, which constitute model templates, allowing very rapid and accurate model development for new such units.

This work was restricted to the assumption that specified gain constraints are valid everywhere in the input space (global constraints). We are currently exploring elaborations of the same algorithmic concepts, such as specifying gain constraints as a function of location (local constraints). Such constraints
may be formulated as

min_kil ≤ (∂y_k/∂x_i)(x⃗_l) ≤ max_kil,  (6.1)
where x⃗_l indicates a location in input space.

Appendix: The Second Derivatives
Gradient descent (see equation 3.1) requires the second derivatives

(∂/∂w)(∂y_k/∂x_i),  (A.1)

where w denotes an arbitrary weight or bias in the network. This appendix gives the expressions for these second derivatives. (See section 3.2 for the complete training procedure.) Direct input-output connections are often important in real applications (e.g., item 2 of section 1). Generalizing the recursive equations of Lee and Oh (1997) to include layer-skipping connections would be relatively straightforward. Instead, we provide the second derivatives in an easy-to-implement nonrecursive form for the usual three-layer network and include direct input-output connections. Full connectivity between layers is not assumed. A second derivative (∂/∂w)(∂y_k/∂x_i) is zero if no path exists from w to y_k. The equations below are valid provided that weights corresponding to nonexisting connections are fixed at zero. Figure 7 indicates the notations for the basic network elements used in the following. In addition, let a_k and a_j denote the total input to output y_k and to hidden unit h_j, respectively:

a_k = Σ_j w_kj h_j + Σ_i w_ki x_i + b_k  (A.2)

a_j = Σ_i w_ji x_i + b_j.  (A.3)
The activation values are then

y_k^s = σ(a_k),   y_k^L = a_k  (A.4)

h_j^s = σ(a_j),   h_j^L = a_j  (A.5)

where the superscripts s and L indicate sigmoid and linear units, respectively, and σ(·) is the usual sigmoid transfer function σ(x) = (1 + e^(−x))^(−1).
Figure 7: Network notations for the second derivative expressions.
The derivatives for output and hidden units are then

y_k' ≡ dy_k/da_k = { y_k(1 − y_k) for y_k^s;  1 for y_k^L }  (A.6)

h_j' ≡ dh_j/da_j = { h_j(1 − h_j) for h_j^s;  1 for h_j^L }.  (A.7)

For conciseness in the equations below, we define the quantities

u_k = { 1 − 2y_k for y_k^s;  0 for y_k^L }  (A.8)

u_j = { 1 − 2h_j for h_j^s;  0 for h_j^L }.  (A.9)
The gains ∂y_k/∂x_i appearing in the equations below are computed according to section 3.2. We write the second derivative equations separately for each type of weight and bias.

Hidden-output weight:

(∂/∂w_kj)(∂y_m/∂x_n) = δ_km ( u_m h_j ∂y_m/∂x_n + y_m' h_j' w_jn ),  (A.10)

where δ_km is a Kronecker delta.
Input-output weight:

(∂/∂w_ki)(∂y_m/∂x_n) = δ_km ( u_m x_i ∂y_m/∂x_n + y_m' δ_in ).  (A.11)

Input-hidden weight:

(∂/∂w_ji)(∂y_m/∂x_n) = h_j' w_mj [ u_m x_i ∂y_m/∂x_n + y_m' ( u_j x_i w_jn + δ_in ) ].  (A.12)

Output bias:

(∂/∂b_k)(∂y_m/∂x_n) = δ_km u_m ∂y_m/∂x_n.  (A.13)

Hidden bias:

(∂/∂b_j)(∂y_m/∂x_n) = h_j' w_mj ( u_m ∂y_m/∂x_n + y_m' u_j w_jn ).  (A.14)
Acknowledgments
I thank L. Lasdon, D. Nehme, and S. Piché for useful discussions and C. Peterson, S. Piché, and the referees for valuable comments on the manuscript.

References

Lampinen, J., & Selonen, A. (1995). Multilayer perceptron training with inaccurate derivative information. In Proceedings of the 1995 IEEE International Conference on Neural Networks (ICNN'95), Perth, WA (Vol. 5, pp. 2811–2815).
LeCun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique (A learning procedure for asymmetric threshold networks). Proc. Cognitiva, 85, 599–604.
Lee, J., & Oh, J. (1997). Hybrid learning of mapping and its Jacobian in multilayer neural networks. Neural Comp., 9, 937–958.
Nash, S. G., & Sofer, A. (1996). Linear and nonlinear programming. New York: McGraw-Hill.
Parker, D. B. (1985). Learning-logic (Tech. Rep. TR-47). Cambridge, MA: MIT, Center for Computational Research in Economics and Management Science.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Sill, J., & Abu-Mostafa, Y. S. (1997). Monotonicity hints. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 634–640). Cambridge, MA: MIT Press.
Smith, S., & Lasdon, L. (1992). Solving large sparse nonlinear programs using GRG. ORSA Journal on Computing, 4, 3–15.
Thompson, M. L. (1996). Combining prior knowledge and nonparametric models of chemical processes. Unpublished doctoral dissertation, MIT.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the back-propagation method. Biol. Cybern., 59, 257.

Received July 14, 1998; accepted March 4, 1999.
LETTER
Communicated by Volker Tresp
Variational Learning for Switching State-Space Models

Zoubin Ghahramani
Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

We introduce a new statistical model for time series that iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time-series models—hidden Markov models and linear dynamical systems—and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact expectation-maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log-likelihood and makes use of both the forward and backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm on artificial data sets and a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

1 Introduction
Neural Computation 12, 831–864 (2000). © 2000 Massachusetts Institute of Technology.

Most commonly used probabilistic models of time series are descendants of either hidden Markov models (HMMs) or stochastic linear dynamical systems, also known as state-space models (SSMs). HMMs represent information about the past of a sequence through a single discrete random variable—the hidden state. The prior probability distribution of this state is derived from the previous hidden state using a stochastic transition matrix. Knowing the state at any time makes the past, present, and future observations statistically independent. This is the Markov independence property that gives the model its name. SSMs represent information about the past through a real-valued hidden state vector. Again, conditioned on this state vector, the past, present, and future observations are statistically independent. The dependency between the present state vector and the previous state vector is specified through the dynamic equations of the system and the noise model. When these equations are linear and the noise model is gaussian, the SSM is also known as a linear dynamical system or Kalman filter model.

Unfortunately, most real-world processes cannot be characterized by either purely discrete or purely linear-gaussian dynamics. For example, an industrial plant may have multiple discrete modes of behavior, each with approximately linear dynamics. Similarly, the pixel intensities in an image of a translating object vary according to approximately linear dynamics for subpixel translations, but as the image moves over a larger range, the dynamics change significantly and nonlinearly. This article addresses models of dynamical phenomena that are characterized by a combination of discrete and continuous dynamics. We introduce a probabilistic model called the switching SSM, inspired by the divide-and-conquer principle underlying the mixture-of-experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991). Switching SSMs are a natural generalization of HMMs and SSMs in which the dynamics can transition in a discrete manner from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between these fields and the relevant literature on neural computation and probabilistic graphical models, and derive a learning algorithm for all the parameters of the model based on a structured variational approximation that rigorously maximizes a lower bound on the log-likelihood. In the following section we review the background material on SSMs, HMMs, and hybrids of the two.
In section 3, we describe the generative model—the probability distribution defined over the observation sequences—for switching SSMs. In section 4, we describe a learning algorithm for switching state-space models that is based on a structured variational approximation to the expectation-maximization algorithm. In section 5 we present simulation results in both an artificial domain, to assess the quality of the approximate inference method, and a natural domain. We conclude with section 6.

2 Background

2.1 State-Space Models. An SSM defines a probability density over time series of real-valued observation vectors {Y_t} by assuming that the observations were generated from a sequence of hidden state vectors {X_t}. (Appendix A describes the variables and notation used throughout this article.) In particular, the SSM specifies that given the hidden state vector at one time step, the observation vector at that time step is statistically independent from all other observation vectors, and that the hidden state vectors
Figure 1: A directed acyclic graph (DAG) specifying conditional independence relations for a state-space model. Each node is conditionally independent from its nondescendants given its parents. The output Y_t is conditionally independent from all other variables given the state X_t; and X_t is conditionally independent from X_1, . . . , X_{t−2} given X_{t−1}. In this and the following figures, shaded nodes represent observed variables, and unshaded nodes represent hidden variables.
obey the Markov independence property. The joint probability for the sequences of states X_t and observations Y_t can therefore be factored as:

P({X_t, Y_t}) = P(X_1) P(Y_1 | X_1) Π_{t=2}^T P(X_t | X_{t−1}) P(Y_t | X_t).  (2.1)
The conditional independencies specified by equation 2.1 can be expressed graphically in the form of Figure 1. The simplest and most commonly used models of this kind assume that the transition and output functions are linear and time invariant and the distributions of the state and observation variables are multivariate gaussian. We use the term state-space model to refer to this simple form of the model. For such models, the state transition function is

X_t = A X_{t−1} + w_t,  (2.2)

where A is the state transition matrix and w_t is zero-mean gaussian noise in the dynamics, with covariance matrix Q. P(X_1) is assumed to be gaussian. Equation 2.2 ensures that if P(X_{t−1}) is gaussian, so is P(X_t). The output function is

Y_t = C X_t + v_t,  (2.3)

where C is the output matrix and v_t is zero-mean gaussian output noise with covariance matrix R; P(Y_t | X_t) is therefore also gaussian:

P(Y_t | X_t) = (2π)^(−D/2) |R|^(−1/2) exp{ −(1/2) (Y_t − C X_t)′ R^(−1) (Y_t − C X_t) },  (2.4)

where D is the dimensionality of the Y vectors.
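Equations 2.2 and 2.3 define a generative process that is easy to simulate. A minimal sketch, with invented rotation dynamics for illustration:

```python
import numpy as np

def sample_ssm(A, C, Q, R, x1_mean, x1_cov, T, rng):
    """Draw a state and observation sequence from the linear-gaussian SSM
    of equations 2.2 and 2.3."""
    d, D = A.shape[0], C.shape[0]
    X, Y = np.zeros((T, d)), np.zeros((T, D))
    X[0] = rng.multivariate_normal(x1_mean, x1_cov)     # gaussian P(X_1)
    for t in range(T):
        if t > 0:
            X[t] = A @ X[t-1] + rng.multivariate_normal(np.zeros(d), Q)
        Y[t] = C @ X[t] + rng.multivariate_normal(np.zeros(D), R)
    return X, Y

# Hypothetical parameters: damped 2-D rotation observed through one channel
th = 0.1
A = 0.99 * np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
C = np.array([[1.0, 0.0]])
X, Y = sample_ssm(A, C, 0.01 * np.eye(2), 0.1 * np.eye(1),
                  np.zeros(2), np.eye(2), T=100, rng=np.random.default_rng(0))
```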
Often the observation vector can be divided into input (or predictor) variables and output (or response) variables. To model the input-output behavior of such a system—the conditional probability of output sequences given input sequences—the linear gaussian SSM can be modified to have a state-transition function,

X_t = A X_{t−1} + B U_t + w_t,  (2.5)

where U_t is the input observation vector and B is the (fixed) input matrix.¹

The problem of inference or state estimation for an SSM with known parameters consists of estimating the posterior probabilities of the hidden variables given a sequence of observed variables. Since the local likelihood functions for the observations are gaussian and the priors for the hidden states are gaussian, the resulting posterior is also gaussian. Three special cases of the inference problem are often considered: filtering, smoothing, and prediction (Anderson & Moore, 1979; Goodwin & Sin, 1984). The goal of filtering is to compute the probability of the current hidden state X_t given the sequence of inputs and outputs up to time t—P(X_t | {Y}_1^t, {U}_1^t).² The recursive algorithm used to perform this computation is known as the Kalman filter (Kalman & Bucy, 1961). The goal of smoothing is to compute the probability of X_t given the sequence of inputs and outputs up to time T, where T > t. The Kalman filter is used in the forward direction to compute the probability of X_t given {Y}_1^t and {U}_1^t. A similar set of backward recursions from T to t completes the computation by accounting for the observations after time t (Rauch, 1963). We will refer to the combined forward and backward recursions for smoothing as the Kalman smoothing recursions (also known as the RTS, or Rauch-Tung-Striebel, smoother). Finally, the goal of prediction is to compute the probability of future states and observations given observations up to time t. Given P(X_t | {Y}_1^t, {U}_1^t) computed as before, the model is simulated in the forward direction using equations 2.2 (or 2.5 if there are inputs) and 2.3 to compute the probability density of the state or output at future time t + τ.

The problem of learning the parameters of an SSM is known in engineering as the system identification problem; in its most general form it assumes access only to sequences of input and output observations. We focus on maximum likelihood learning, in which a single (locally optimal) value of the parameters is estimated, rather than Bayesian approaches that treat the parameters as random variables and compute or approximate the posterior distribution of the parameters given the data. One can also distinguish between on-line and off-line approaches to learning. On-line recursive algorithms, favored in real-time adaptive control applications, can be obtained by computing the gradient or the second derivatives of the log-likelihood (Ljung & Söderström, 1983). Similar gradient-based methods can be obtained for off-line methods. An alternative method for off-line learning makes use of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). This procedure iterates between an E-step that fixes the current parameters and computes posterior probabilities over the hidden states given the observations, and an M-step that maximizes the expected log-likelihood of the parameters using the posterior distribution computed in the E-step. For linear gaussian state-space models, the E-step is exactly the Kalman smoothing problem as defined above, and the M-step simplifies to a linear regression problem (Shumway & Stoffer, 1982; Digalakis, Rohlicek, & Ostendorf, 1993). Details on the EM algorithm for SSMs can be found in Ghahramani and Hinton (1996a), as well as in the original Shumway and Stoffer (1982) article.

¹ One can also define the state such that X_{t+1} = A X_t + B U_t + w_t.
² The notation {Y}_1^t is shorthand for the sequence Y_1, . . . , Y_t.

2.2 Hidden Markov Models. Hidden Markov models also define probability distributions over sequences of observations {Y_t}. The distribution over sequences is obtained by specifying a distribution over observations at each time step t given a discrete hidden state S_t, and the probability of transitioning from one hidden state to another. Using the Markov property, the joint probability for the sequences of states S_t and observations Y_t can be factored in exactly the same manner as equation 2.1, with S_t taking the place of X_t:
P({S_t, Y_t}) = P(S_1) P(Y_1 | S_1) Π_{t=2}^T P(S_t | S_{t−1}) P(Y_t | S_t).  (2.6)
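Summing equation 2.6 over all state sequences gives the likelihood of an observation sequence; the forward recursion (part of the forward-backward algorithm discussed below) computes it in O(TK²) time. A sketch for discrete observations, with hypothetical parameters:

```python
import numpy as np

def hmm_loglik(pi, Trans, Obs, y):
    """log P({y_t}) under the HMM factorization of equation 2.6, via the
    scaled forward recursion. Trans[i, j] = P(S_t = j | S_{t-1} = i);
    Obs[j, v] = P(Y_t = v | S_t = j); pi is the initial state distribution."""
    alpha = pi * Obs[:, y[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()              # rescale to avoid underflow
    for t in range(1, len(y)):
        alpha = (Trans.T @ alpha) * Obs[:, y[t]]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll

# Hypothetical two-state HMM with binary observations
pi = np.array([0.5, 0.5])
Trans = np.array([[0.9, 0.1], [0.2, 0.8]])
Obs = np.array([[0.8, 0.2], [0.3, 0.7]])
ll = hmm_loglik(pi, Trans, Obs, [0, 0, 1, 1])
```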
Similarly, the conditional independencies in an HMM can be expressed graphically in the same form as Figure 1. The state is represented by a single multinomial variable that can take one of K discrete values, S_t ∈ {1, . . . , K}. The state transition probabilities, P(S_t | S_{t−1}), are specified by a K × K transition matrix. If the observables are discrete symbols taking on one of L values, the observation probabilities P(Y_t | S_t) can be fully specified as a K × L observation matrix. For a continuous observation vector, P(Y_t | S_t) can be modeled in many different forms, such as a gaussian, mixture of gaussians, or neural network. HMMs have been applied extensively to problems in speech recognition (Juang & Rabiner, 1991), computational biology (Baldi, Chauvin, Hunkapiller, & McClure, 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner & Juang, 1986). The first computes the posterior probabilities of the hidden states using a recursive algorithm known as the forward-backward algorithm. The computations in the forward pass are exactly analogous to the Kalman filter for SSMs, and the computations in the backward pass are analogous to the backward pass of the Kalman smoothing
equations. As noted by Bridle (pers. comm., 1985) and Smyth, Heckerman, and Jordan (1997), the forward-backward algorithm is a special case of exact inference algorithms for more general graphical probabilistic models (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). The same observation holds true for the Kalman smoothing recursions. The other inference problem commonly posed for HMMs is to compute the single most likely sequence of hidden states. The solution to this problem is given by the Viterbi algorithm, which also consists of a forward and backward pass through the model. To learn maximum likelihood parameters for an HMM given sequences of observations, one can use the well-known Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970). This algorithm is a special case of EM that uses the forward-backward algorithm to infer the posterior probabilities of the hidden states in the E-step. The M-step uses expected counts of transitions and observations to reestimate the transition and output matrices (or linear regression equations in the case where the observations are gaussian distributed). Like SSMs, HMMs can be augmented to allow for input variables, such that they model the conditional distribution of sequences of output observations given sequences of inputs (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996).

2.3 Hybrids. A burgeoning literature on models that combine the discrete transition structure of HMMs with the linear dynamics of SSMs has developed in fields ranging from econometrics to control engineering (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993; Deng, 1993; Kadirkamanathan & Kadirkamanathan, 1996; Chaer, Bishop, & Ghosh, 1997). These models are known alternately as hybrid models, SSMs with switching, and jump-linear systems.
We briefly review some of this literature, including some related neural network models.³ Shortly after Kalman and Bucy solved the problem of state estimation for linear gaussian SSMs, attention turned to the analogous problem for switching models (Ackerson & Fu, 1970). Chang and Athans (1978) derive the equations for computing the conditional mean and variance of the state when the parameters of a linear SSM switch according to arbitrary and Markovian dynamics. The prior and transition probabilities of the switching process are assumed to be known. They note that for M models (sets of parameters) and an observation length T, the exact conditional distribution of the state is a gaussian mixture with M^T components. The conditional mean and variance, which require far less computation, are therefore only summary statistics.

³ A review of how SSMs and HMMs are related to simpler statistical models such as principal components analysis, factor analysis, mixture of gaussians, vector quantization, and independent components analysis (ICA) can be found in Roweis and Ghahramani (1999).
Figure 2: Directed acyclic graphs specifying conditional independence relations for various switching state-space models. (a) Shumway and Stoffer (1991): the output matrix (C in equation 2.3) switches independently between a fixed number of choices at each time step. Its setting is represented by the discrete hidden variable S_t; (b) Bar-Shalom and Li (1993): both the output equation and the dynamic equation can switch, and the switches are Markov; (c) Kim (1994); (d) Fraser and Dimitriadis (1993): outputs and states are observed. Here we have shown a simple case where the output depends directly on the current state, previous state, and previous output.
Shumway and Stoffer (1991) consider the problem of learning the parameters of SSMs with a single real-valued hidden state vector and switching output matrices. The probability of choosing a particular output matrix is a prespecified time-varying function, independent of previous choices (see Figure 2a). A pseudo-EM algorithm is derived in which the E-step, which
in its exact form would require computing a gaussian mixture with M^T components, is approximated by a single gaussian at each time step. Bar-Shalom and Li (1993; sec. 11.6) review models in which both the state dynamics and the output matrices switch, and where the switching follows Markovian dynamics (see Figure 2b). They present several methods for approximately solving the state-estimation problem in switching models (they do not discuss parameter estimation for such models). These methods, referred to as generalized pseudo-Bayesian (GPB) and interacting multiple models (IMM), are all based on the idea of collapsing into one gaussian the mixture of M gaussians that results from considering all the settings of the switch state at a given time step. This avoids the exponential growth of mixture components at the cost of providing an approximate solution. More sophisticated but computationally expensive methods that collapse M² gaussians into M gaussians are also derived. Kim (1994) derives a similar approximation for a closely related model, which also includes observed input variables (see Figure 2c). Furthermore, Kim discusses parameter estimation for this model, although without making reference to the EM algorithm. Other authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter & Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Hamilton (1989, 1994, sec. 22.4) describes a class of switching models in which the real-valued observation at time t, Y_t, depends on both the observations at times t−1 to t−r and the discrete states at times t to t−r. More precisely, Y_t is gaussian with a mean that is a linear function of Y_{t−1}, . . . , Y_{t−r} and of binary indicator variables for the discrete states, S_t, . . . , S_{t−r}.
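The collapsing step at the heart of the GPB and IMM approximations is simple moment matching: replace a mixture of gaussians with the single gaussian having the same overall mean and covariance. A generic sketch:

```python
import numpy as np

def collapse(w, mus, covs):
    """Moment-match a gaussian mixture (weights w, component means mus of
    shape (M, d), covariances covs of shape (M, d, d)) with one gaussian."""
    mu = np.einsum('m,md->d', w, mus)
    diff = mus - mu
    # Law of total covariance: within-component plus between-component spread
    cov = np.einsum('m,mde->de',
                    w, covs + np.einsum('md,me->mde', diff, diff))
    return mu, cov

# Two equally weighted 1-D components at 0 and 2, each with unit variance
w = np.array([0.5, 0.5])
mus = np.array([[0.0], [2.0]])
covs = np.array([[[1.0]], [[1.0]]])
mu, cov = collapse(w, mus, covs)
print(mu, cov)   # [1.] [[2.]]
```

Collapsing after every time step keeps the number of components fixed at M (or M² for the more expensive variants) instead of letting it grow exponentially.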
The system can therefore be seen as an (r+1)th-order HMM driving an rth-order autoregressive process, and is tractable for small r and a small number of discrete states in S. Hamilton's models are closely related to hidden filter HMMs (HFHMM; Fraser & Dimitriadis, 1993). HFHMMs have both discrete and real-valued states. However, the real-valued states are assumed to be either observed or a known, deterministic function of the past observations (i.e., an embedding). The outputs depend on the states and previous outputs, and the form of this dependence can switch randomly (see Figure 2d). Because at any time step the only hidden variable is the switch state, S_t, exact inference in this model can be carried out tractably. The resulting algorithm is a variant of the forward-backward procedure for HMMs. Kehagias and Petridis (1997) and Pawelzik, Kohlmorgen, and Müller (1996) present other variants of this model. Elliott, Aggoun, and Moore (1995; sec. 12.5) present an inference algorithm for hybrid (Markov switching) systems for which there is a separate observable from which the switch state can be estimated. The true switch states, S_t, are represented as unit vectors in R^M, and the estimated switch state is a vector in the unit square with elements corresponding to the estimated probability of being in each switch state. The real-valued state, X_t, is approximated as a gaussian given the estimated switch state by forming a linear combination of the transition and observation matrices for the different SSMs, weighted by the estimated switch state. Elliott et al. also derive control equations for such hybrid systems and discuss applications of the change-of-measures whitening procedure to a large family of models.

Switching State-Space Models
839

With regard to the literature on neural computation, the model presented in this article is a generalization of both the mixture-of-experts neural network (Jacobs et al., 1991; Jordan & Jacobs, 1994) and the related mixture of factor analyzers (Hinton, Dayan, & Revow, 1997; Ghahramani & Hinton, 1996b). Previous dynamical generalizations of the mixture-of-experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore & Nowlan, 1994; Kadirkamanathan & Kadirkamanathan, 1996; Meila & Jordan, 1996). One limitation of this generalization is that the entire past sequence is summarized in the value of a single discrete variable (the gating activation), which for a system with M experts can convey on average at most log M bits of information about the past. In the models we consider here, both the experts and the gating network have Markovian dynamics. The past is therefore summarized by a state composed of the cross-product of the discrete variable and the combined real-valued state-space of all the experts. This provides a much wider information channel from the past. One advantage of this representation is that the real-valued state can contain componential structure. Thus, attributes such as the position, orientation, and scale of an object in an image, which are most naturally encoded as independent real-valued variables, can be accommodated in the state without the exponential growth required of a discretized HMM-like representation.
It is important to place the work in this article in the context of the literature we have just reviewed. The hybrid models, state-space models with switching, and jump-linear systems we have described all assume a single real-valued state vector. The model considered in this article generalizes this to multiple real-valued state vectors.4 Unlike the models described in Hamilton (1994), Fraser and Dimitriadis (1993), and the current dynamical extensions of mixtures of experts, in the model we present, the real-valued state vectors are hidden. The inference algorithm we derive, which is based on making a structured variational approximation, is entirely novel in the context of switching SSMs. Specifically, our method is unlike all the approximate methods we have reviewed in that it is not based on fitting a single gaussian to a mixture of gaussians by computing the mean and covariance of the mixture.5 We derive a learning algorithm for all of the parameters of the model, including the Markov switching parameters. This algorithm maximizes a lower bound on the log-likelihood of the data rather than a heuristically motivated approximation to the likelihood. The algorithm has a simple and intuitive flavor: it decouples into forward-backward recursions on an HMM and Kalman smoothing recursions on each SSM. The states of the HMM determine the soft assignment of each observation to an SSM; the prediction errors of the SSMs determine the observation probabilities for the HMM.

4 Note that the state vectors could be concatenated into one large state vector with factorized (block-diagonal) transition matrices (cf. the factorial hidden Markov model; Ghahramani & Jordan, 1997). However, this obscures the decoupled structure of the model.

5 Both classes of methods can be seen as minimizing Kullback-Leibler (KL) divergences. However, the KL divergence is asymmetrical, and whereas the variational methods minimize it in one direction, the methods that merge gaussians minimize it in the other direction. We return to this point in section 4.2.

3 The Generative Model

In switching SSMs, the sequence of observations {Y_t} is modeled by specifying a probabilistic relation between the observations and a hidden state-space comprising M real-valued state vectors, X_t^{(m)}, and one discrete state vector S_t. The discrete state, S_t, is modeled as a multinomial variable that can take on M values, S_t ∈ {1, ..., M}; for reasons that will become obvious, we refer to it as the switch variable. The joint probability of observations and hidden states can be factored as

P({S_t, X_t^{(1)}, ..., X_t^{(M)}, Y_t}) = P(S_1) ∏_{t=2}^{T} P(S_t | S_{t−1}) · ∏_{m=1}^{M} [ P(X_1^{(m)}) ∏_{t=2}^{T} P(X_t^{(m)} | X_{t−1}^{(m)}) ] · ∏_{t=1}^{T} P(Y_t | X_t^{(1)}, ..., X_t^{(M)}, S_t),   (3.1)

which corresponds graphically to the conditional independencies represented by Figure 3. Conditioned on a setting of the switch state, S_t = m, the observable is multivariate gaussian with output equation given by state-space model m. Notice that m is used both as an index for the real-valued state variables and as a value for the switch state. The probability of the observation vector Y_t is therefore

P(Y_t | X_t^{(1)}, ..., X_t^{(M)}, S_t = m) = |2πR|^{−1/2} exp{ −½ (Y_t − C^{(m)} X_t^{(m)})′ R^{−1} (Y_t − C^{(m)} X_t^{(m)}) },   (3.2)

where R is the observation noise covariance matrix and C^{(m)} is the output matrix for SSM m (cf. equation 2.4 for a single linear-gaussian SSM). Each
Figure 3: (a) Graphical model representation for switching state-space models. S_t is the discrete switch variable, and X_t^{(m)} are the real-valued state vectors. (b) Switching state-space model depicted as a generalization of the mixture of experts. The dashed arrows correspond to the connections in a mixture of experts. In a switching state-space model, the states of the experts and the gating network also depend on their previous states (solid arrows).
real-valued state vector evolves according to the linear-gaussian dynamics of an SSM with differing initial state, transition matrix, and state noise (see equation 2.2). For simplicity we assume that all state vectors have identical dimensionality; the generalization of the algorithms we present to models with different-sized state-spaces is immediate. The switch state itself evolves according to the discrete Markov transition structure specified by the initial state probabilities P(S_1) and the M × M state transition matrix P(S_t | S_{t−1}). An exact analogy can be made to the mixture-of-experts architecture for modular learning in neural networks (see Figure 3b; Jacobs et al., 1991). Each SSM is a linear expert with a gaussian output noise model and linear-gaussian dynamics. The switch state "gates" the outputs of the M SSMs and therefore plays the role of a gating network with Markovian dynamics. There are many possible extensions of the model; we shall consider three obvious and straightforward ones:

(Ex1) differing output covariances, R^{(m)}, for each SSM;

(Ex2) differing output means, μ_Y^{(m)}, for each SSM, such that each model is allowed to capture observations in a different operating range;

(Ex3) conditioning on a sequence of observed input vectors, {U_t}.
4 Learning
An efficient learning algorithm for the parameters of a switching SSM can be derived by generalizing the EM algorithm (Baum et al., 1970; Dempster et al., 1977). EM alternates between optimizing a distribution over the hidden states (the E-step) and optimizing the parameters given the distribution over hidden states (the M-step). Any distribution over the hidden states, Q({S_t, X_t}), where X_t = [X_t^{(1)}, ..., X_t^{(M)}] is the combined state of the SSMs, can be used to define a lower bound, B, on the log probability of the observed data:

log P({Y_t} | θ) = log Σ_{{S_t}} ∫ P({S_t, X_t, Y_t} | θ) d{X_t}   (4.1)

= log Σ_{{S_t}} ∫ Q({S_t, X_t}) [ P({S_t, X_t, Y_t} | θ) / Q({S_t, X_t}) ] d{X_t}   (4.2)

≥ Σ_{{S_t}} ∫ Q({S_t, X_t}) log [ P({S_t, X_t, Y_t} | θ) / Q({S_t, X_t}) ] d{X_t} = B(Q, θ),   (4.3)

where θ denotes the parameters of the model and we have made use of Jensen's inequality (Cover & Thomas, 1991) to establish equation 4.3. Both steps of EM increase the lower bound on the log probability of the observed data. The E-step holds the parameters fixed and sets Q to be the posterior distribution over the hidden states given the parameters,

Q({S_t, X_t}) = P({S_t, X_t} | {Y_t}, θ).   (4.4)

This maximizes B with respect to the distribution, turning the lower bound into an equality, which can easily be seen by substitution. The M-step holds the distribution fixed and computes the parameters that maximize B for that distribution. Since B = log P({Y_t} | θ) at the start of the M-step and since the E-step does not affect log P, the two steps combined can never decrease log P. Given the change in the parameters produced by the M-step, the distribution produced by the previous E-step is typically no longer optimal, so the whole procedure must be iterated. Unfortunately, the exact E-step for switching SSMs is intractable. Like the related hybrid models described in section 2.3, the posterior probability of the real-valued states is a gaussian mixture with M^T terms. This can be seen by using the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), which implies that the hidden state variables in Figure 3, while marginally independent, become conditionally dependent given the observation sequence. This induced dependency effectively couples all of the real-valued hidden state variables to the discrete switch variable, as a
consequence of which the exact posteriors become gaussian mixtures with an exponential number of terms.6 In order to derive an efficient learning algorithm for this system, we relax the EM algorithm by approximating the posterior probability of the hidden states. The basic idea is that since expectations with respect to P are intractable, rather than setting Q({S_t, X_t}) = P({S_t, X_t} | {Y_t}) in the E-step, a tractable distribution Q is used to approximate P. This results in an EM learning algorithm that maximizes a lower bound on the log-likelihood. The difference between the bound B and the log-likelihood is given by the Kullback-Leibler (KL) divergence between Q and P:

KL(Q ‖ P) = Σ_{{S_t}} ∫ Q({S_t, X_t}) log [ Q({S_t, X_t}) / P({S_t, X_t} | {Y_t}) ] d{X_t}.   (4.5)
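The gap in equation 4.3 is exactly the divergence of equation 4.5; that is, log P({Y_t} | θ) = B(Q, θ) + KL(Q ‖ P). This identity is easy to verify numerically on a toy discrete model (the numbers below are invented purely for illustration):

```python
import numpy as np

# Invented joint P(s, y) over a discrete hidden s and one fixed observed y:
# P_joint[s] stands for P(s, y = observed).
P_joint = np.array([0.10, 0.06, 0.04])
log_py = np.log(P_joint.sum())          # log P(y): marginalize out s
P_post = P_joint / P_joint.sum()        # exact posterior P(s | y)

Q = np.array([0.2, 0.3, 0.5])           # an arbitrary approximating distribution

B = np.sum(Q * np.log(P_joint / Q))     # the lower bound of equation 4.3
KL = np.sum(Q * np.log(Q / P_post))     # the divergence of equation 4.5

assert B < log_py                       # Jensen's inequality (Q is not the posterior)
assert np.isclose(log_py, B + KL)       # the bound is tight exactly when KL = 0
```

Setting Q equal to P_post makes KL vanish and B reach log_py, which is the content of the exact E-step of equation 4.4.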
Since the complexity of exact inference in the approximation given by Q is determined by its conditional independence relations, not by its parameters, we can choose Q to have a tractable structure: a graphical representation that eliminates some of the dependencies in P. Given this structure, the parameters of Q are varied to obtain the tightest possible bound by minimizing equation 4.5. Therefore, the algorithm alternates between optimizing the parameters of the distribution Q to minimize equation 4.5 (the E-step) and optimizing the parameters of P given the distribution over the hidden states (the M-step). As in exact EM, both steps increase the lower bound B on the log-likelihood; however, equality is not reached in the E-step. We will refer to the general strategy of using a parameterized approximating distribution as a variational approximation, and to the free parameters of the distribution as variational parameters. A completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean-field approximations to statistical mechanical systems (Parisi, 1988). Theoretical arguments motivating approximate E-steps are presented in Neal and Hinton (1998; originally in a technical report in 1993). Saul and Jordan (1996) showed that approximate E-steps could be used to maximize a lower bound on the log-likelihood and proposed the powerful technique of structured variational approximations to intractable probabilistic networks. The key insight of their work, which this article makes use of, is that by judicious use of an approximation Q, exact inference algorithms can be used on the tractable substructures in an intractable network. A general tutorial on variational approximations can be found in Jordan, Ghahramani, Jaakkola, and Saul (1998).
6 The intractability of the E-step or smoothing problem in the simpler single-state switching model has been noted by Ackerson and Fu (1970), Chang and Athans (1978), Bar-Shalom and Li (1993), and others.
Figure 4: Graphical model representation for the structured variational approximation to the posterior distribution of the hidden states of a switching state-space model.
The parameters of the switching SSM are θ = {A^{(m)}, C^{(m)}, Q^{(m)}, μ_{X_1}^{(m)}, Q_1^{(m)}, R, π, W}, where A^{(m)} is the state dynamics matrix for model m, C^{(m)} is its output matrix, Q^{(m)} is its state noise covariance, μ_{X_1}^{(m)} is the mean of the initial state, Q_1^{(m)} is the covariance of the initial state, R is the (tied) output noise covariance, π = P(S_1) is the prior for the discrete Markov process, and W = P(S_t | S_{t−1}) is the discrete transition matrix. Extensions (Ex1) through (Ex3) can be readily implemented by substituting R^{(m)} for R and adding means μ_Y^{(m)} and input matrices B^{(m)}. Although there are many possible approximations to the posterior distribution of the hidden variables that one could use for learning and inference in switching SSMs, we focus on the following:

Q({S_t, X_t}) = (1/Z_Q) [ ψ(S_1) ∏_{t=2}^{T} ψ(S_{t−1}, S_t) ] ∏_{m=1}^{M} [ ψ(X_1^{(m)}) ∏_{t=2}^{T} ψ(X_{t−1}^{(m)}, X_t^{(m)}) ],   (4.6)

where the ψ are unnormalized probabilities, which we will call potential functions and define soon, and Z_Q is a normalization constant ensuring that Q integrates to one. Although Q has been written in terms of potential functions rather than conditional probabilities, it corresponds to the simple graphical model shown in Figure 4. The terms involving the switch variables S_t define a discrete Markov chain, and the terms involving the state vectors X_t^{(m)} define M uncoupled SSMs. As in mean-field approximations, we have approximated the stochastically coupled system by removing some of the
couplings of the original system. Specifically, we have removed the stochastic coupling between the chains that results from the fact that the observation at time t depends on all the hidden variables at time t. However, we retain the coupling between the hidden variables at successive time steps, since these couplings can be handled exactly using the forward-backward and Kalman smoothing recursions. This approximation is therefore structured, in the sense that not all variables are uncoupled. The discrete switching process is defined by

ψ(S_1 = m) = P(S_1 = m) q_1^{(m)}   (4.7)

ψ(S_{t−1}, S_t = m) = P(S_t = m | S_{t−1}) q_t^{(m)},   (4.8)

where the q_t^{(m)} are variational parameters of the Q distribution. These parameters scale the probabilities of each of the states of the switch variable at each time step, so that q_t^{(m)} plays exactly the same role that the observation probability P(Y_t | S_t = m) would play in a regular HMM. We will soon see that minimizing KL(Q ‖ P) results in an equation for q_t^{(m)} that supports this intuition. The uncoupled SSMs in the approximation Q are also defined by potential functions that are related to probabilities in the original system. These potentials are the prior and transition probabilities for X^{(m)} multiplied by a factor that changes these potentials to try to account for the data:

ψ(X_1^{(m)}) = P(X_1^{(m)}) [ P(Y_1 | X_1^{(m)}, S_1 = m) ]^{h_1^{(m)}}   (4.9)

ψ(X_{t−1}^{(m)}, X_t^{(m)}) = P(X_t^{(m)} | X_{t−1}^{(m)}) [ P(Y_t | X_t^{(m)}, S_t = m) ]^{h_t^{(m)}},   (4.10)

where the h_t^{(m)} are variational parameters of Q. The vector h_t plays a role very similar to that of the switch variable S_t. Each component h_t^{(m)} can range between 0 and 1. When h_t^{(m)} = 0, the posterior probability of X_t^{(m)} under Q does not depend on the observation Y_t. When h_t^{(m)} = 1, the posterior probability of X_t^{(m)} under Q includes a term that assumes that SSM m generated Y_t. We call h_t^{(m)} the responsibility assigned to SSM m for the observation vector Y_t. The difference between h_t^{(m)} and S_t^{(m)} is that h_t^{(m)} is a deterministic parameter, while S_t^{(m)} is a stochastic random variable. To maximize the lower bound on the log-likelihood, KL(Q ‖ P) is minimized with respect to the variational parameters h_t^{(m)} and q_t^{(m)} separately for each sequence of observations. Using the definition of P for the switching state-space model (equations 3.1 and 3.2) and the approximating distribution Q, the minimum of KL satisfies the following fixed-point equations for the variational parameters (see appendix B):

h_t^{(m)} = Q(S_t = m)   (4.11)

q_t^{(m)} = exp{ −½ ⟨ (Y_t − C^{(m)} X_t^{(m)})′ R^{−1} (Y_t − C^{(m)} X_t^{(m)}) ⟩ },   (4.12)
where ⟨·⟩ denotes expectation over the Q distribution. Intuitively, the responsibility h_t^{(m)} is equal to the probability under Q that SSM m generated observation vector Y_t, and q_t^{(m)} is an unnormalized gaussian function of the expected squared error if SSM m generated Y_t.

To compute h_t^{(m)}, it is necessary to sum Q over all the switch variables other than S_t. This can be done efficiently using the forward-backward algorithm on the switch state variables, with q_t^{(m)} playing exactly the same role as an observation probability associated with each setting of the switch variable. Since q_t^{(m)} is related to the prediction error of model m on data Y_t, this has the intuitive interpretation that the switch state associated with models with smaller expected prediction error on a particular observation will be favored at that time step. However, the forward-backward algorithm ensures that the final responsibilities for the models are obtained after considering the entire sequence of observations.

To compute q_t^{(m)}, it is necessary to calculate the expectations of X_t^{(m)} and X_t^{(m)} X_t^{(m)′} under Q. We see this by expanding equation 4.12:

q_t^{(m)} = exp{ −½ Y_t′ R^{−1} Y_t + Y_t′ R^{−1} C^{(m)} ⟨X_t^{(m)}⟩ − ½ tr[ C^{(m)′} R^{−1} C^{(m)} ⟨X_t^{(m)} X_t^{(m)′}⟩ ] },   (4.13)

where tr is the matrix trace operator and we have used tr(AB) = tr(BA). The expectations of X_t^{(m)} and X_t^{(m)} X_t^{(m)′} can be computed efficiently using the Kalman smoothing algorithm on each SSM, where for model m at time t, the data are weighted by the responsibilities h_t^{(m)}.7 Since the h parameters depend on the q parameters, and vice versa, the whole process has to be iterated, where each iteration involves calls to the forward-backward and Kalman smoothing algorithms. Once the iterations have converged, the E-step outputs the expected values of the hidden variables under the final Q. The M-step computes the model parameters that optimize the expectation of the log-likelihood (see equation B.7), which is a function of the expectations of the hidden variables. For switching SSMs, all the parameter reestimates can be computed analytically. For example, taking derivatives of the expectation of equation B.7 with respect to C^{(m)} and setting them to zero,

7 Weighting the data by h_t^{(m)} is equivalent to running the Kalman smoother on the unweighted data using a time-varying observation noise covariance matrix R_t^{(m)} = R / h_t^{(m)}.
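Equation 4.13 is simply the expansion of the expected quadratic form in equation 4.12 in terms of the first and second moments of X_t^{(m)} under Q. A quick numerical check with arbitrary, made-up moments confirms that the two expressions agree:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 3, 2                                # observation and state dimensions
Y = rng.standard_normal(D)                 # a made-up observation Y_t
C = rng.standard_normal((D, K))            # output matrix C^(m)
Rinv = 2.0 * np.eye(D)                     # R^{-1}, here for R = 0.5 I
mu = rng.standard_normal(K)                # <X_t^(m)> under Q
Sigma = 0.2 * np.eye(K)                    # Cov(X_t^(m)) under Q
M2 = Sigma + np.outer(mu, mu)              # <X_t^(m) X_t^(m)'>

# Equation 4.12: expected squared prediction error of SSM m on Y_t.
# For gaussian moments, <(Y - C X)' Rinv (Y - C X)> equals the error at
# the mean plus a trace term accounting for the state uncertainty.
resid = Y - C @ mu
q_412 = np.exp(-0.5 * (resid @ Rinv @ resid + np.trace(C.T @ Rinv @ C @ Sigma)))

# Equation 4.13: the same quantity written via <X> and <XX'>.
q_413 = np.exp(-0.5 * Y @ Rinv @ Y + Y @ Rinv @ C @ mu
               - 0.5 * np.trace(C.T @ Rinv @ C @ M2))

assert np.isclose(q_412, q_413)
```

In the full algorithm these moments would come from the Kalman smoother run on the responsibility-weighted data; here they are stand-ins chosen only to exercise the algebra.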
Figure 5: Learning algorithm for switching state-space models.
we get

Ĉ^{(m)} = [ Σ_{t=1}^{T} ⟨S_t^{(m)}⟩ Y_t ⟨X_t^{(m)}⟩′ ] [ Σ_{t=1}^{T} ⟨S_t^{(m)}⟩ ⟨X_t^{(m)} X_t^{(m)′}⟩ ]^{−1},   (4.14)
which is a weighted version of the reestimation equations for SSMs. Similarly, the reestimation equations for the switch process are analogous to the Baum-Welch update rules for HMMs. The learning algorithm for switching state-space models using the above structured variational approximation is summarized in Figure 5.

4.1 Deterministic Annealing. The KL divergence minimized in the E-step of the variational EM algorithm can have multiple minima in general. One way to visualize these minima is to consider the space of all possible segmentations of an observation sequence of length T, where by segmentation we mean a discrete partition of the sequence between the SSMs. If there are M SSMs, then there are M^T possible segmentations of the sequence. Given one such segmentation, inferring the optimal distribution for the real-valued states of the SSMs is a convex optimization problem, since these real-valued states are conditionally gaussian. So the difficulty in the KL minimization lies in trying to find the best (soft) partition of the data.
As in other combinatorial optimization problems, the possibility of getting trapped in local minima can be reduced by gradually annealing the cost function. We can employ a deterministic variant of the annealing idea by making the following simple modifications to the variational fixed-point equations 4.11 and 4.12:

h_t^{(m)} = (1/T) Q(S_t = m)   (4.15)

q_t^{(m)} = exp{ −(1/2T) ⟨ (Y_t − C^{(m)} X_t^{(m)})′ R^{−1} (Y_t − C^{(m)} X_t^{(m)}) ⟩ }.   (4.16)
Here T is a temperature parameter, which is initialized to a large value and gradually reduced to 1. The above equations maximize a modified form of the bound B in equation 4.3, in which the entropy of Q has been multiplied by T (Ueda & Nakano, 1995).

4.2 Merging Gaussians. Almost all the approximate inference methods described in the literature for switching SSMs are based on the idea of merging, at each time step, a mixture of M gaussians into one gaussian. The merged gaussian is obtained by setting its mean and covariance equal to the mean and covariance of the mixture. Here we briefly describe, as an alternative to the variational approximation methods we have derived, how this more traditional gaussian merging procedure can be applied to the model we have defined. In the switching state-space models described in section 3, there are M different SSMs, with possibly different state-space dimensionalities, so it would be inappropriate to merge their states into one gaussian. However, it is still possible to apply a gaussian merging technique by considering each SSM separately. In each SSM m, the hidden state density produces at each time step a mixture of two gaussians: one for the case S_t = m and one for S_t ≠ m. We merge these two gaussians, weighted by the current estimates of P(S_t = m | Y_1, ..., Y_t) and 1 − P(S_t = m | Y_1, ..., Y_t), respectively. This merged gaussian is used to obtain the gaussian prior for X_{t+1}^{(m)} for the next time step. We implemented a forward-pass version of this approximate inference scheme, which is analogous to the IMM procedure described in Bar-Shalom and Li (1993). This procedure finds at each time step the "best" gaussian fit to the current mixture of gaussians for each SSM. If we denote the approximating gaussian by Q and the mixture being approximated by P, "best" is defined here as minimizing KL(P ‖ Q).
Furthermore, gaussian merging techniques are greedy in that the "best" gaussian is computed at every time step and used immediately for the next time step. For a gaussian Q, KL(P ‖ Q) has no local minima, and it is very easy to find the optimal Q by computing the first two moments of P. Inaccuracies in this greedy procedure arise because the estimates of P(S_t | Y_1, ..., Y_t) are based on this single merged gaussian, not on the real mixture.
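The moment-matching collapse that these methods share can be sketched generically (our own utility code, not from the article): the single gaussian minimizing KL(P ‖ Q) for a mixture P has exactly the mixture's mean and covariance:

```python
import numpy as np

def merge_gaussians(weights, means, covs):
    """Collapse a gaussian mixture into the single gaussian matching its moments.

    This is the Q minimizing KL(P || Q) for mixture P: the merged mean and
    covariance are simply the mean and covariance of the mixture.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = sum(wi * mi for wi, mi in zip(w, means))
    # within-component covariance plus between-component spread
    cov = sum(wi * (Ci + np.outer(mi - mu, mi - mu))
              for wi, mi, Ci in zip(w, means, covs))
    return mu, cov

# Example: collapse the two branches (S_t = m versus S_t != m) for one SSM,
# weighted by the current switch-state estimates (numbers invented).
mu, cov = merge_gaussians([0.7, 0.3],
                          [np.array([0.0]), np.array([2.0])],
                          [np.eye(1), 0.5 * np.eye(1)])
# mu = [0.6]; cov = [[1.69]]  (0.7*1 + 0.3*0.5 + 0.7*0.36 + 0.3*1.96)
```

The between-component term is what makes the merged covariance larger than either component's own covariance whenever the component means disagree.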
In contrast, variational methods seek to minimize KL(Q ‖ P), which can have many local minima. Moreover, these methods are not greedy in the same sense: they iterate forward and backward in time until obtaining a locally optimal Q.

5 Simulations

5.1 Experiment 1: Variational Segmentation and Deterministic Annealing. The goal of this experiment was to assess the quality of solutions
found by the variational inference algorithm and the effect of using deterministic annealing on these solutions. We generated 200 sequences of length 200 from a simple model that switched between two SSMs. These SSMs and the switching process were defined by

X_t^{(1)} = 0.99 X_{t−1}^{(1)} + w_t^{(1)},   w_t^{(1)} ~ N(0, 1)   (5.1)

X_t^{(2)} = 0.9 X_{t−1}^{(2)} + w_t^{(2)},   w_t^{(2)} ~ N(0, 10)   (5.2)

Y_t = X_t^{(m)} + v_t,   v_t ~ N(0, 0.1),   (5.3)

where the switch state m was chosen using priors π^{(1)} = π^{(2)} = 1/2 and transition probabilities W_{11} = W_{22} = 0.95 and W_{12} = W_{21} = 0.05. Five sequences from this data set are shown in Figure 6, along with the true state of the switch variable. We compared three different inference algorithms: variational inference, variational inference with deterministic annealing (section 4.1), and inference by gaussian merging (section 4.2). For each sequence, we initialized the variational inference algorithms with equal responsibilities for the two SSMs and ran them for 12 iterations. The nonannealed inference algorithm ran at a fixed temperature of T = 1, while the annealed algorithm was initialized to a temperature of T = 100, which decayed down to 1 over the 12 iterations using the decay function T_{i+1} = ½ T_i + ½. To eliminate the effect of model inaccuracies, we gave all three inference algorithms the true parameters of the generative model.

The segmentations found by the nonannealed variational inference algorithm showed little similarity to the true segmentations of the data (see Figure 7). Furthermore, the nonannealed algorithm generally underestimated the number of switches, often converging on solutions with no switches at all. Both the annealed variational algorithm and the gaussian merging method found segmentations that were more similar to the true segmentations of the data. Comparing percentage correct segmentations, we see that annealing substantially improves the variational inference method and that the gaussian merging and annealed variational methods perform comparably (see Figure 8). The average performance of the annealed variational method is only about 1.3% better than that of gaussian merging.
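A minimal sketch of this data-generating process (our own reimplementation of equations 5.1 through 5.3 under the parameter settings stated above; function and variable names are invented):

```python
import numpy as np

def generate_sequence(T=200, rng=None):
    """Sample one sequence from the two-process switching model of eqs. 5.1-5.3."""
    rng = rng if rng is not None else np.random.default_rng()
    a = [0.99, 0.9]            # AR coefficients of SSM 1 and SSM 2
    q = [1.0, 10.0]            # state noise variances, w_t^(m) ~ N(0, q[m])
    r = 0.1                    # observation noise variance, v_t ~ N(0, r)
    W = np.array([[0.95, 0.05],
                  [0.05, 0.95]])   # switch transition probabilities
    x = np.zeros(2)            # both AR processes evolve in parallel
    s = rng.choice(2)          # uniform prior pi^(1) = pi^(2) = 1/2
    S = np.zeros(T, dtype=int)
    Y = np.zeros(T)
    for t in range(T):
        if t > 0:
            s = rng.choice(2, p=W[s])
        for m in range(2):
            x[m] = a[m] * x[m] + rng.normal(0.0, np.sqrt(q[m]))
        S[t] = s
        Y[t] = x[s] + rng.normal(0.0, np.sqrt(r))   # the switch gates the output
    return S, Y

S, Y = generate_sequence(rng=np.random.default_rng(0))
```

Note that the two processes have comparable stationary variances (about 50 each), which is what makes segmentation from the dynamics alone difficult, as Figure 6 illustrates.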
Figure 6: Five data sequences of length 200, with their true segmentations below them. In the segmentations, switch states 1 and 2 are represented by the presence and absence of dots, respectively. Notice that it is difficult to segment the sequences correctly based only on knowing the dynamics of the two processes.
5.2 Experiment 2: Modeling Respiration in a Patient with Sleep Apnea.
Switching state-space models should prove useful in modeling time series that have dynamics characterized by several different regimes. To illustrate this point, we examined a physiological data set from a patient tentatively diagnosed with sleep apnea, a medical condition in which an individual intermittently stops breathing during sleep. The data were obtained from the repository of time-series data sets associated with the Santa Fe Time Series Analysis and Prediction Competition (Weigend & Gershenfeld, 1993) and are described in detail in Rigney et al. (1993).8 The respiration pattern in sleep apnea is characterized by at least two regimes: no breathing and gasping breathing induced by a reflex arousal. Furthermore, in this patient there also seem to be periods of normal rhythmic breathing (see Figure 9).

8 The data are available online at http://www.stern.nyu.edu/~aweigend/TimeSeries/SantaFe.html#setB. We used samples 6201–7200 for training and 5201–6200 for testing.
Figure 7: For 10 different sequences of length 200, segmentations are shown, with presence and absence of dots corresponding to the two SSMs generating these data. The rows are the segmentations found using the variational method with no annealing (N), the variational method with deterministic annealing (A), the gaussian merging method (M), and the true segmentation (T). All three inference algorithms give real-valued h_t^{(m)}; hard segmentations were obtained by thresholding the final h_t^{(m)} values at 0.5. The first five sequences are the ones shown in Figure 6.
Figure 8: Histograms of percentage correct segmentations: (a) control, using random segmentation; (b) variational inference without annealing; (c) variational inference with annealing; (d) gaussian merging. Percentage correct segmentation was computed by counting the number of time steps for which the true and estimated segmentations agree.
We trained switching SSMs, varying the random seed, the number of components in the mixture (M = 2 to 5), and the dimensionality of the state-space in each component (K = 1 to 10), on a data set consisting of 1000 consecutive measurements of the chest volume. As controls, we also trained simple SSMs (i.e., M = 1), varying the dimension of the state-space from K = 1 to 10, and simple HMMs (i.e., K = 0), varying the number of discrete hidden states from M = 2 to M = 50. Simulations were run until convergence or for 200 iterations, whichever came first; convergence was assessed by measuring the change in likelihood (or bound on the likelihood) over consecutive steps of EM. The likelihood of the simple SSMs and the HMMs was calculated on a test set consisting of 1000 consecutive measurements of the chest volume. For the switching SSMs, the likelihood is intractable, so we calculated the lower bound on the likelihood, B. The simple SSMs modeled the data very poorly for K = 1, and the performance was flat for values of K = 2 to 10 (see
Figure 9: Chest volume (respiration force) of a patient with sleep apnea during two noncontinuous time segments of the same night (measurements sampled at 2 Hz). (a) Training data. Apnea is characterized by extended periods of small variability in chest volume, followed by bursts (gasping). Here we see such behavior around t = 250, followed by normal rhythmic breathing. (b) Test data. In this segment we find several instances of apnea and an approximately rhythmic region. (The thick lines at the bottom of each plot are explained in the main text.)
Figure 10a). The large majority of runs of the switching state-space model resulted in models with higher likelihood than those of the simple SSMs (see Figures 10b–10e). One consistent exception should be noted: for values of M = 2 and K = 6 to 10, the switching SSM performed almost identically to the simple SSM. Exploratory experiments suggest that in these cases a single component takes responsibility for all the data, so the model effectively has M = 1. This may be a local-minimum problem or a result of poor initialization heuristics. The learning curves for simple and switching SSMs show that there are plateaus at the solutions found by the simple one-component SSMs in which the switching SSM can get caught (see Figure 11). The likelihoods for HMMs with around M = 15 states were comparable to those of the best switching SSMs (see Figure 10f).

Zoubin Ghahramani and Geoffrey E. Hinton

[Figure 10 appears here: six panels of log likelihood — (a) SSM and (b–e) SwSSM with M = 2, 3, 4, 5, plotted against K = 1 to 10, and (f) HMM, plotted against M = 0 to 50.]

Figure 10: Log likelihood (nats per observation) on the test data from a total of almost 400 runs of simple state-space models (a), switching state-space models with differing numbers of components (b–e), and hidden Markov models (f).

Purely in terms of coding efficiency, switching SSMs have little advantage over HMMs on this data. However, it is useful to contrast the solutions learned by HMMs with those learned by the switching SSMs. The thick dots at the bottom of Figures 9a and 9b show the responsibility assigned to one of the two components in a fairly typical switching SSM with M = 2 components of state size K = 2. This component has clearly specialized to modeling the data during periods of apnea, while the other component models the gasps and periods of rhythmic breathing. These two switching components provide a much more intuitive model of the data than the 10 to 20 discrete states needed in an HMM with comparable coding efficiency.9
9 By using further assumptions to constrain the model, such as continuity of the real-valued hidden state at switch times, it should be possible to obtain even better performance on these data.
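The stopping rule used in these experiments — run until convergence or 200 EM iterations, with convergence assessed by the change in the likelihood (or its bound) over consecutive steps — can be sketched generically. Here `em_step` is a hypothetical callback representing one EM iteration:

```python
def run_em(em_step, max_iters=200, tol=1e-4):
    """Iterate EM until the change in log likelihood (or its lower
    bound) between consecutive steps falls below `tol`, or until
    max_iters is reached. `em_step` performs one EM iteration and
    returns the current log likelihood (or bound)."""
    history = []
    for _ in range(max_iters):
        history.append(em_step())
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return history

# Toy "likelihood" sequence converging geometrically toward 0
vals = (-(0.5 ** i) for i in range(1000))
hist = run_em(lambda: next(vals))
print(len(hist))  # stops well before the 200-iteration cap
```

The tolerance `tol` is an illustrative choice; the paper does not state the threshold it used.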
[Figure 11 appears here: log P versus iterations of EM (0–250), with one learning curve per model.]

Figure 11: Learning curves for a state-space model (K = 4) and a switching state-space model (M = 2, K = 2).
6 Discussion
The main conclusion we can draw from the first series of experiments is that even when given the correct model parameters, the problem of segmenting a switching time series into its components is difficult. There are combinatorially many alternatives to be considered, and the energy surface suffers from many local minima, so local optimization approaches like the variational method we used are limited by the quality of the initial conditions. Deterministic annealing can be thought of as a sophisticated initialization procedure for the hidden states: the final solution at each temperature provides the initial conditions at the next. We found that annealing substantially improved the quality of the segmentations found.

The first experiment also indicates that the much simpler gaussian merging method performs comparably to annealed variational inference. The gaussian merging methods have the advantage that at each time step, the cost function minimized has no local minima. This may account for how well they perform relative to the nonannealed variational method. On the other hand, the variational methods have the advantage that they iteratively improve their approximation to the posterior, and they define a lower bound on the likelihood. Our results suggest that it may be very fruitful to use the gaussian merging method to initialize the variational inference procedure. Furthermore, it is possible to derive variational approximations for other switching models described in the literature, and a combination of gaussian merging and variational approximation may provide a fast and robust method for learning and inference in those models.

The second series of experiments suggests that on a real data set believed to have switching dynamics, the switching SSM can indeed uncover multiple regimes. When it captures these regimes, it generalizes to the test set much better than the simple linear dynamical model. Similar coding efficiency can be obtained with HMMs, which, owing to the discrete nature of their state-space, can model nonlinear dynamics. In doing so, however, the HMMs had to use 10 to 20 discrete states, which makes their solutions less interpretable.

Variational approximations provide a powerful tool for inference and learning in complex probabilistic models. We have seen that when applied to the switching SSM, they can incorporate within a single framework well-known exact inference methods like Kalman smoothing and the forward–backward algorithm. Variational methods can be applied to many of the other classes of intractable switching models described in section 2.3. However, training more complex models also makes apparent the importance of good methods for model selection and initialization.

To summarize, switching SSMs are a dynamical generalization of mixture-of-experts neural networks, are closely related to well-known models in econometrics and control, and combine the representations underlying HMMs and linear dynamical systems. For domains in which we have some a priori belief that there are multiple, approximately linear dynamical regimes, switching SSMs provide a natural modeling tool.
Variational approximations provide a method for overcoming the most difficult problem in learning switching SSMs: that the inference step is intractable. Deterministic annealing further improves on the solutions found by the variational method.
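The annealing schedule described in the discussion — solve at each temperature and use that solution to initialize the next, colder solve — can be sketched generically. The geometric cooling rate and `fit_at_temperature` callback are illustrative assumptions, not details from the paper:

```python
def anneal(fit_at_temperature, t_start=10.0, t_end=1.0, rate=0.8):
    """Deterministic annealing as an initialization schedule: solve at a
    sequence of decreasing temperatures, warm-starting each solve with
    the previous solution, then solve once more at the final temperature.
    `fit_at_temperature(T, init)` returns the solution found at T."""
    solution, T = None, t_start
    while T > t_end:
        solution = fit_at_temperature(T, solution)
        T *= rate
    return fit_at_temperature(t_end, solution)

# Toy check: record the temperature schedule actually visited
schedule = []
def fake_fit(T, init):
    schedule.append(round(T, 2))
    return init
anneal(fake_fit)
print(schedule[0], schedule[-1])  # first solve at T=10.0, last at T=1.0
```

Ending at a temperature of 1 matches the Boltzmann form used in Appendix B, where the target distribution corresponds to temperature 1.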
Appendix A: Notation

Variables
  Y_t           D×1    observation vector at time t
  {Y_t}         D×T    sequence of observation vectors [Y_1, Y_2, ..., Y_T]
  X_t^(m)       K×1    state vector of state-space model (SSM) m at time t
  X_t           KM×1   entire real-valued hidden state at time t: X_t = [X_t^(1), ..., X_t^(M)]
  S_t           M×1    switch state variable (represented either as a discrete variable S_t ∈ {1, ..., M}, or as an M×1 vector S_t = [S_t^(1), ..., S_t^(M)]′ where S_t^(m) ∈ {0, 1})

Model parameters
  A^(m)         K×K    state dynamics matrix for SSM m
  C^(m)         D×K    output matrix for SSM m
  Q^(m)         K×K    state noise covariance matrix for SSM m
  μ_{X_1}^(m)   K×1    initial state mean for SSM m
  Q_1^(m)       K×K    initial state noise covariance matrix for SSM m
  R             D×D    output noise covariance matrix
  π             M×1    initial state probabilities for switch state
  W             M×M    state transition matrix for switch state

Variational parameters
  h_t^(m)       1×1    responsibility of SSM m for Y_t
  q_t^(m)       1×1    related to expected squared error if SSM m generated Y_t

Miscellaneous
  X′                   matrix transpose of X
  |X|                  matrix determinant of X
  ⟨X⟩                  expected value of X under the Q distribution

Dimensions
  D   size of observation vector
  T   length of a sequence of observation vectors
  M   number of state-space models
  K   size of state vector in each state-space model
Appendix B: Derivation of the Variational Fixed-Point Equations
In this appendix we derive the variational fixed-point equations used in the learning algorithm for switching SSMs. First, we write out the probability density P defined by a switching SSM. For convenience, we express this probability density in the log domain, through its associated energy function, or hamiltonian, H. The probability density is related to the hamiltonian through the usual Boltzmann distribution (at a temperature of 1),

\[
P(\cdot) = \frac{1}{Z} \exp\{-H(\cdot)\},
\]
where Z is a normalization constant required so that P(·) integrates to unity. Expressing the probabilities in the log domain does not affect the resulting algorithm. We then similarly express the approximating distribution Q through its hamiltonian H_Q. Finally, we obtain the variational fixed-point equations by setting to zero the derivatives of the KL divergence between Q and P with respect to the variational parameters of Q. The joint probability of observations and hidden states in a switching SSM is (see equation 3.1)

\[
P(\{S_t, X_t, Y_t\}) = P(S_1) \left[ \prod_{t=2}^{T} P(S_t \mid S_{t-1}) \right]
\prod_{m=1}^{M} \left[ P\!\left(X_1^{(m)}\right) \prod_{t=2}^{T} P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) \right]
\prod_{t=1}^{T} P(Y_t \mid X_t, S_t). \tag{B.1}
\]
We proceed to dissect this expression into its constituent parts. The initial probability of the switch variable at time t = 1 is given by

\[
P(S_1) = \prod_{m=1}^{M} \left( \pi^{(m)} \right)^{S_1^{(m)}}, \tag{B.2}
\]

where S_1 is represented by an M×1 vector [S_1^{(1)}, ..., S_1^{(M)}]′ with S_1^{(m)} = 1 if the switch state is in state m, and 0 otherwise. The probability of transitioning from a switch state at time t − 1 to a switch state at time t is given by

\[
P(S_t \mid S_{t-1}) = \prod_{m=1}^{M} \prod_{n=1}^{M} \left( W^{(m,n)} \right)^{S_t^{(m)} S_{t-1}^{(n)}}. \tag{B.3}
\]

The initial distribution for the hidden state variable in SSM m is gaussian with mean \(\mu_{X_1}^{(m)}\) and covariance matrix \(Q_1^{(m)}\):

\[
P\!\left(X_1^{(m)}\right) = \left| 2\pi Q_1^{(m)} \right|^{-\frac{1}{2}}
\exp\left\{ -\frac{1}{2} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \right\}. \tag{B.4}
\]

The probability distribution of the state in SSM m at time t given the state at time t − 1 is gaussian with mean \(A^{(m)} X_{t-1}^{(m)}\) and covariance matrix \(Q^{(m)}\):

\[
P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) = \left| 2\pi Q^{(m)} \right|^{-\frac{1}{2}}
\exp\left\{ -\frac{1}{2} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \right\}. \tag{B.5}
\]
Finally, using equation 3.2, we can write

\[
P(Y_t \mid X_t, S_t) = \prod_{m=1}^{M} \left[ \left| 2\pi R \right|^{-\frac{1}{2}}
\exp\left\{ -\frac{1}{2} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \right\} \right]^{S_t^{(m)}}, \tag{B.6}
\]

since the terms with exponent equal to 0 vanish in the product. Combining equations B.1 through B.6 and taking the negative of the logarithm, we obtain the hamiltonian of a switching SSM (ignoring constants):

\[
\begin{aligned}
H = {} & \frac{1}{2} \sum_{m=1}^{M} \log \left| Q_1^{(m)} \right|
+ \frac{1}{2} \sum_{m=1}^{M} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \\
& + \frac{T-1}{2} \sum_{m=1}^{M} \log \left| Q^{(m)} \right| \\
& + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=2}^{T} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \\
& + \frac{T}{2} \log |R|
+ \frac{1}{2} \sum_{m=1}^{M} \sum_{t=1}^{T} S_t^{(m)} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \\
& - \sum_{m=1}^{M} S_1^{(m)} \log \pi^{(m)}
- \sum_{t=2}^{T} \sum_{m=1}^{M} \sum_{n=1}^{M} S_t^{(m)} S_{t-1}^{(n)} \log W^{(m,n)}.
\end{aligned} \tag{B.7}
\]
The hamiltonian for the approximating distribution can be derived analogously from the definition of Q (see equation 4.6):

\[
Q(\{S_t, X_t\}) = \frac{1}{Z_Q} \left[ \psi(S_1) \prod_{t=2}^{T} \psi(S_{t-1}, S_t) \right]
\prod_{m=1}^{M} \psi\!\left(X_1^{(m)}\right) \prod_{t=2}^{T} \psi\!\left(X_{t-1}^{(m)}, X_t^{(m)}\right). \tag{B.8}
\]

The potentials for the initial switch state and the switch state transitions are

\[
\psi(S_1) = \prod_{m=1}^{M} \left( \pi^{(m)} q_1^{(m)} \right)^{S_1^{(m)}}, \tag{B.9}
\]

\[
\psi(S_{t-1}, S_t) = \prod_{m=1}^{M} \prod_{n=1}^{M} \left( W^{(m,n)} q_t^{(m)} \right)^{S_t^{(m)} S_{t-1}^{(n)}}. \tag{B.10}
\]

The potential for the initial state of SSM m is

\[
\psi\!\left(X_1^{(m)}\right) = \left[ P\!\left(X_1^{(m)}\right) P\!\left(Y_1 \mid X_1^{(m)}, S_1 = m\right) \right]^{h_1^{(m)}}, \tag{B.11}
\]
and the potential for the state at time t given the state at time t − 1 is

\[
\psi\!\left(X_{t-1}^{(m)}, X_t^{(m)}\right) = \left[ P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) P\!\left(Y_t \mid X_t^{(m)}, S_t = m\right) \right]^{h_t^{(m)}}. \tag{B.12}
\]
The hamiltonian for Q is obtained by combining these terms and taking the negative logarithm:

\[
\begin{aligned}
H_Q = {} & \frac{1}{2} \sum_{m=1}^{M} \log \left| Q_1^{(m)} \right|
+ \frac{1}{2} \sum_{m=1}^{M} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \\
& + \frac{T-1}{2} \sum_{m=1}^{M} \log \left| Q^{(m)} \right| \\
& + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=2}^{T} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \\
& + \frac{T}{2} \sum_{m=1}^{M} \log |R|
+ \frac{1}{2} \sum_{m=1}^{M} \sum_{t=1}^{T} h_t^{(m)} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \\
& - \sum_{m=1}^{M} S_1^{(m)} \log \pi^{(m)}
- \sum_{t=2}^{T} \sum_{m=1}^{M} \sum_{n=1}^{M} S_t^{(m)} S_{t-1}^{(n)} \log W^{(m,n)}
- \sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)} \log q_t^{(m)}.
\end{aligned} \tag{B.13}
\]
Comparing H_Q with H, we see that the interaction between the S_t^{(m)} and X_t^{(m)} variables has been eliminated, at the cost of introducing two sets of variational parameters: the responsibilities h_t^{(m)} and the bias terms q_t^{(m)} on the discrete Markov chain. To obtain the approximation Q that maximizes the lower bound on the log-likelihood, we minimize the KL divergence KL(Q‖P) as a function of these variational parameters:

\[
\mathrm{KL}(Q \,\|\, P) = \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \log \frac{Q(\{S_t, X_t\})}{P(\{S_t, X_t\} \mid \{Y_t\})} \, d\{X_t\} \tag{B.14}
\]
\[
= \langle H - H_Q \rangle - \log Z_Q + \log Z, \tag{B.15}
\]

where ⟨·⟩ denotes expectation over the approximating distribution Q and Z_Q is the normalization constant for Q. Both Q and P define distributions in the exponential family. As a consequence, the zeros of the derivatives of KL with respect to the variational parameters can be obtained simply by equating the derivatives of ⟨H⟩ and ⟨H_Q⟩ with respect to the corresponding sufficient statistics (Ghahramani, 1997):

\[
\frac{\partial \langle H_Q - H \rangle}{\partial \langle S_t^{(m)} \rangle} = 0, \tag{B.16}
\]
\[
\frac{\partial \langle H_Q - H \rangle}{\partial \langle X_t^{(m)} \rangle} = 0, \tag{B.17}
\]
\[
\frac{\partial \langle H_Q - H \rangle}{\partial P_t^{(m)}} = 0, \tag{B.18}
\]

where \(P_t^{(m)} = \langle X_t^{(m)} X_t^{(m)\prime} \rangle - \langle X_t^{(m)} \rangle \langle X_t^{(m)} \rangle'\) is the covariance of X_t^{(m)} under Q. Many terms cancel when we subtract the two hamiltonians:

\[
H_Q - H = \sum_{m=1}^{M} \sum_{t=1}^{T} \left[ \frac{1}{2} \left( h_t^{(m)} - S_t^{(m)} \right) \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) - S_t^{(m)} \log q_t^{(m)} \right]. \tag{B.19}
\]

Taking derivatives, we obtain

\[
\frac{\partial \langle H_Q - H \rangle}{\partial \langle S_t^{(m)} \rangle} = -\log q_t^{(m)} - \frac{1}{2} \left\langle \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \right\rangle, \tag{B.20}
\]
\[
\frac{\partial \langle H_Q - H \rangle}{\partial \langle X_t^{(m)} \rangle} = -\left( h_t^{(m)} - \langle S_t^{(m)} \rangle \right) \left( Y_t - C^{(m)} \langle X_t^{(m)} \rangle \right)' R^{-1} C^{(m)}, \tag{B.21}
\]
\[
\frac{\partial \langle H_Q - H \rangle}{\partial P_t^{(m)}} = \frac{1}{2} \left( h_t^{(m)} - \langle S_t^{(m)} \rangle \right) C^{(m)\prime} R^{-1} C^{(m)}. \tag{B.22}
\]
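Setting equation B.20 to zero fixes q_t^{(m)} as the exponential of minus half the expected squared prediction error. A minimal numpy sketch of that single update (shapes and inputs are illustrative; under the gaussian Q, the expectation adds a trace term involving the state covariance):

```python
import numpy as np

def update_q(Y_t, C, R, X_hat, P):
    """q_t^(m) = exp{ -0.5 <(Y - C X)' R^{-1} (Y - C X)> } (from eq. B.20).
    The expectation under Q expands to err' R^{-1} err
    + trace(C' R^{-1} C P), with err = Y - C X_hat.
    C: list of (D, K) output matrices, one per SSM;
    X_hat, P: per-SSM state means (K,) and covariances (K, K)."""
    Rinv = np.linalg.inv(R)
    q = []
    for m in range(len(C)):
        err = Y_t - C[m] @ X_hat[m]
        expected_sq_err = err @ Rinv @ err + np.trace(C[m].T @ Rinv @ C[m] @ P[m])
        q.append(np.exp(-0.5 * expected_sq_err))
    return np.array(q)

# Two 1-D SSMs observing a 1-D output: the better-predicting SSM gets larger q
C = [np.eye(1), np.eye(1)]
q = update_q(np.array([1.0]), C, np.eye(1),
             X_hat=[np.array([1.0]), np.array([3.0])],
             P=[np.zeros((1, 1)), np.zeros((1, 1))])
print(q[0] > q[1])  # True
```

These q values then bias the discrete Markov chain, whose forward–backward pass yields ⟨S_t^{(m)}⟩, which in turn sets h_t^{(m)}.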
From equation B.20 we get the fixed-point equation, 4.12, for q_t^{(m)}. Equations B.21 and B.22 are both satisfied when h_t^{(m)} = ⟨S_t^{(m)}⟩. Using the fact that ⟨S_t^{(m)}⟩ = Q(S_t = m), we get equation 4.11.

References

Ackerson, G. A., & Fu, K. S. (1970). On state estimation in switching environments. IEEE Transactions on Automatic Control, AC-15(1), 10–17.
Anderson, B. D. O., & Moore, J. B. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice-Hall.
Athaide, C. R. (1995). Likelihood evaluation and state estimation for nonlinear state space models. Unpublished doctoral dissertation, University of Pennsylvania.
Baldi, P., Chauvin, Y., Hunkapiller, T., & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proc. Nat. Acad. Sci. (USA), 91(3), 1059–1063.
Bar-Shalom, Y., & Li, X.-R. (1993). Estimation and tracking. Boston: Artech House.
Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Bengio, Y., & Frasconi, P. (1995). An input-output HMM architecture. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 427–434). Cambridge, MA: MIT Press.
Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 719–726). San Mateo, CA: Morgan Kaufmann.
Carter, C. K., & Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Chaer, W. S., Bishop, R. H., & Ghosh, J. (1997). A mixture-of-experts framework for adaptive Kalman filtering. IEEE Trans. on Systems, Man and Cybernetics.
Chang, C. B., & Athans, M. (1978). State estimation for discrete systems with switching parameters. IEEE Transactions on Aerospace and Electronic Systems, AES-14(3), 418–424.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38.
Deng, L. (1993). A stochastic model of speech incorporating hierarchical nonstationarity. IEEE Trans. on Speech and Audio Processing, 1(4), 471–474.
Digalakis, V., Rohlicek, J. R., & Ostendorf, M. (1993). ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4), 431–442.
Elliott, R. J., Aggoun, L., & Moore, J. B. (1995). Hidden Markov models: Estimation and control. New York: Springer-Verlag.
Fraser, A. M., & Dimitriadis, A. (1993). Forecasting probability densities by using hidden Markov models with mixed states. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 265–282). Reading, MA: Addison-Wesley.
Ghahramani, Z. (1997). On structured variational approximations (Tech. Rep. No. CRG-TR-97-1). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/struct.ps.gz.
Ghahramani, Z., & Hinton, G. E. (1996a). Parameter estimation for linear dynamical systems (Tech. Rep. No. CRG-TR-96-2). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-2.ps.gz.
Ghahramani, Z., & Hinton, G. E. (1996b). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245–273.
Goodwin, G., & Sin, K. (1984). Adaptive filtering prediction and control. Englewood Cliffs, NJ: Prentice-Hall.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.
Hamilton, J. D. (1994). Time series analysis. Princeton, NJ: Princeton University Press.
Harrison, P. J., & Stevens, C. F. (1976). Bayesian forecasting (with discussion). J. Royal Statistical Society B, 38, 205–247.
Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods in graphical models. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics, 33, 251–272.
Kadirkamanathan, V., & Kadirkamanathan, M. (1996). Recursive estimation of dynamic modular RBF networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 239–245). Cambridge, MA: MIT Press.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction. Journal of Basic Engineering (ASME), 83D, 95–108.
Kanazawa, K., Koller, D., & Russell, S. J. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In P. Besnard & S. Hanks (Eds.), Uncertainty in artificial intelligence: Proceedings of the eleventh conference (pp. 346–351). San Mateo, CA: Morgan Kaufmann.
Kehagias, A., & Petrides, V. (1997). Time series segmentation using predictive modular neural networks. Neural Computation, 9(8), 1691–1710.
Kim, C.-J. (1994). Dynamic linear models with Markov-switching. J. Econometrics, 60, 1–22.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 157–224.
Ljung, L., & Söderström, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Meila, M., & Jordan, M. I. (1996). Learning fine motion by Markov mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Pawelzik, K., Kohlmorgen, J., & Müller, K.-R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8(2), 340–356.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3, 4–16.
Rauch, H. E. (1963). Solutions to the linear smoothing problem. IEEE Transactions on Automatic Control, 8, 371–372.
Rigney, D., Goldberger, A., Ocasio, W., Ichimaru, Y., Moody, G., & Mark, R. (1993). Multi-channel physiological data: Description and analysis. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 105–129). Reading, MA: Addison-Wesley.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.
Saul, L., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4), 253–264.
Shumway, R. H., & Stoffer, D. S. (1991). Dynamic linear models with switching. J. Amer. Stat. Assoc., 86, 763–769.
Smyth, P. (1994). Hidden Markov models for fault detection in dynamic systems. Pattern Recognition, 27(1), 149–164.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227–269.
Ueda, N., & Nakano, R. (1995). Deterministic annealing variant of the EM algorithm. In G. Tesauro, D. Touretzky, & J. Alspector (Eds.), Advances in neural information processing systems, 7 (pp. 545–552). San Mateo, CA: Morgan Kaufmann.
Weigend, A., & Gershenfeld, N. (1993). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Received March 2, 1998; accepted March 16, 1999.
LETTER
Communicated by Misha Tsodyks
Retrieval Properties of a Hopfield Model with Random Asymmetric Interactions

Zhang Chengxiang
Chandan Dasgupta
Department of Physics, Indian Institute of Science, Bangalore 560012, India

Manoranjan P. Singh
Laser Programme, Centre for Advanced Technology, Indore 452013, India

The process of pattern retrieval in a Hopfield model in which a random antisymmetric component is added to the otherwise symmetric synaptic matrix is studied by computer simulations. The introduction of the antisymmetric component is found to increase the fraction of random inputs that converge to the memory states. However, the size of the basin of attraction of a memory state does not show any significant change when asymmetry is introduced in the synaptic matrix. We show that this is because the spurious fixed points, which are destabilized by the introduction of asymmetry, have very small basins of attraction. The convergence time to spurious fixed-point attractors increases faster than that for the memory states as the asymmetry parameter is increased. The possibility of convergence to spurious fixed points is greatly reduced if a suitable upper limit is set for the convergence time. This prescription works better if the synaptic matrix has an antisymmetric component.

1 Introduction
Neural network models of associative memory have received a great deal of attention in recent years (Amit, 1989). The best-known model of this kind is the one proposed by Hopfield (1982, 1984). In the Hopfield model, a set of random binary patterns ("memories") is stored in a network using the modified Hebb rule, which leads to a symmetric matrix of synaptic interactions (two model neurons in the network affect each other in exactly the same way). This assumption of symmetry of the synaptic interaction matrix is known (Eccles, 1964) to be biologically unrealistic. Therefore, it is of interest to try to understand the behavior of neural networks in which the constraint of symmetry is relaxed. Furthermore, it has been suggested (Parisi, 1986; Hertz, Grinstein, & Solla, 1987; Krauth, Nadal, & Mezard, 1988) that the presence of asymmetry in the synaptic interactions may improve the performance of neural network models of associative memory. It is well known (Amit, 1989) that the

Neural Computation 12, 865–880 (2000)
© 2000 Massachusetts Institute of Technology
Hopfield model has a large number of spurious fixed-point attractors that are not strongly correlated with any one of the memories stored in the network. The presence of these attractors adversely affects the performance of the network by "trapping" initial configurations that are not close to one of the memory states. This problem is aggravated by the fact that the network cannot distinguish between successful retrieval of a memory and convergence to a spurious fixed point, because in both cases it settles down to a time-independent state. This problem is not specific to the Hopfield model with binary neurons. It is also present in many other models with symmetric synaptic connections, such as the Hopfield model with analog neurons (Waugh, Marcus, & Westervelt, 1990), Hopfield-type models with more sophisticated learning rules (Sompolinsky, 1987; Forrest, 1988), and networks with sparse representations (Golomb, Rubin, & Sompolinsky, 1990; Erichsen & Theumann, 1991). The presence of asymmetry in the interactions is believed to alleviate this problem in two ways. First, analytic and numerical studies (Treves & Amit, 1988; Fukai, 1990; Singh, Zhang, & Dasgupta, 1995) have shown that the introduction of asymmetry greatly reduces the number of spurious fixed-point attractors while leaving the attractors corresponding to memory states largely unaffected. These results suggest that the basins of attraction of the memory states may be enlarged by the introduction of asymmetry. Second, there are indications (Parisi, 1986) that the presence of asymmetry greatly increases the time of convergence to spurious attractors relative to that for convergence to memory states. If this is true in general, then it may be possible to use asymmetry to discriminate effectively between memory states and spurious attractors by monitoring the temporal behavior of the network.
Although these suggestions are reasonable and physically plausible, they have not been tested by detailed and systematic studies of the retrieval performance of Hopfield-like models with asymmetric interactions. We describe the results of a numerical study of the retrieval properties of a Hopfield-like model in which a random antisymmetric component is added to the symmetric, Hebbian synaptic matrix. The presence of asymmetry in the interactions rules out the use of the methods of equilibrium statistical mechanics that have been employed extensively (Amit, 1989) in the analysis of the behavior of the Hopfield model. Also, it is not clear whether the methods used in rigorous mathematical analysis (Komlós & Paturi, 1993) of Hopfield-type networks can be extended to networks with asymmetric interactions. Analytic studies of the dynamics of neural networks are difficult; exact results can be obtained only in very special limits (e.g., in the limit of extreme asymmetric dilution; Derrida, Gardner, & Zippelius, 1987). For this reason, existing analytic studies (Hertz et al., 1987; Crisanti & Sompolinsky, 1987) of the dynamics of neural networks with asymmetric interactions use simplifications and approximations whose validity for models like the one we consider here is questionable. There are other analytic methods (Gardner, Derrida, & Mottishaw, 1987; Kepler & Abbott, 1988)
that may be used in the study of the dynamical evolution of the state of the system for the first few steps of updating. It is difficult to draw any conclusion about the asymptotic behavior of the network from such studies, because the short-time dynamics of the network may differ significantly from its long-time behavior (Gardner et al., 1987). For these reasons, we have chosen to use numerical simulations in our study of the retrieval properties of this model. Section 2 contains definitions of the model we consider and of several other quantities used in the analysis of our numerical data. The numerical methods used and the results obtained from our simulations are described in detail in section 3. Section 4 contains a summary of the main results of this study and a few concluding remarks.

2 Model
The network under investigation consists of N two-state model neurons ("spins") {S_i}, each of which may assume the values +1 or −1. A configuration or state of the network is defined by giving specific values to all of its N spins. The synaptic interconnection matrix considered here is defined as follows:

\[
J_{ij} = J_{ij}^{s} + k J_{ij}^{as}, \tag{2.1}
\]

where \(J_{ij}^{s}\) is symmetric and is given by the modified Hebb rule (Hopfield, 1982):

\[
J_{ij}^{s} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^{\mu} \xi_j^{\mu}, \quad i \neq j, \qquad J_{ii}^{s} = 0, \tag{2.2}
\]

where the \(\{\xi_i^{\mu}\}\), μ = 1, ..., p, are the stored patterns or memories. Each \(\xi_i^{\mu}\) may take the values ±1 with equal probability. The number of patterns stored in the network is p, and α = p/N is the memory loading of the network. \(J_{ij}^{as}\) is an antisymmetric matrix with independent gaussian elements; that is,

\[
P\!\left(J_{ij}^{as}\right) = \frac{1}{\sqrt{2\pi/N}} \exp\left[ \frac{-\left(J_{ij}^{as}\right)^2}{2/N} \right], \quad i < j, \tag{2.3}
\]

and \(J_{ij}^{as} = -J_{ji}^{as}\). The parameter k of equation 2.1 is a measure of the strength of the antisymmetric part of the coupling relative to that of the symmetric part. As discussed in Singh et al. (1995), a synaptic matrix of this form may arise in the tabula non rasa learning scenario proposed by Toulouse, Dehaene, and Changeux (1986).
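The construction in equations 2.1–2.3 is easy to sketch numerically. A minimal numpy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def synaptic_matrix(patterns, k, rng):
    """J = J^s + k * J^as (eq. 2.1): Hebbian symmetric part (eq. 2.2)
    plus a random antisymmetric part whose independent elements are
    gaussian with variance 1/N (eq. 2.3)."""
    p, N = patterns.shape
    J_s = patterns.T @ patterns / N            # modified Hebb rule
    np.fill_diagonal(J_s, 0.0)                 # J^s_ii = 0
    G = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))
    J_as = np.triu(G, 1) - np.triu(G, 1).T     # enforces J^as_ij = -J^as_ji
    return J_s + k * J_as

rng = np.random.default_rng(0)
xi = rng.choice([-1, 1], size=(6, 60))         # p = 6 memories, N = 60 spins
J = synaptic_matrix(xi, k=0.1, rng=rng)
sym = 0.5 * (J + J.T)                          # symmetrizing recovers J^s
antisym = 0.5 * (J - J.T)                      # recovers k * J^as
print(np.allclose(antisym, -antisym.T))        # True: antisymmetric part
```

Splitting J back into its symmetric and antisymmetric halves this way is a convenient check that the construction matches equation 2.1.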
The network evolves according to the iterated map

\[
S_i(t+1) = \operatorname{sign}\!\left[ \sum_{j=1}^{N} J_{ij} S_j(t) \right], \quad i = 1, \ldots, N, \tag{2.4}
\]

where the discrete "time" t is measured in units of updates per spin. A fixed point of the dynamics is a configuration {S_i} satisfying

\[
S_i = \operatorname{sign}(h_i), \tag{2.5}
\]

where

\[
h_i \equiv \sum_{j=1}^{N} J_{ij} S_j \tag{2.6}
\]

is the local field of the ith spin. Equation 2.5 is equivalent to the conditions

\[
\lambda_i \equiv S_i h_i \geq 0. \tag{2.7}
\]

The quantity \(\lambda_i\), called the local stability parameter of the ith spin of the network, may serve as a measure of the local stability of the fixed point {S_i}. The fractional Hamming distance between two configurations \(\{S_i^{\mu}\}\) and \(\{S_i^{\nu}\}\) of the network is defined as

\[
H_{\mu\nu} = \frac{1}{2} \left( 1 - m_{\mu\nu} \right), \tag{2.8}
\]

where

\[
m_{\mu\nu} = \frac{1}{N} \sum_{i=1}^{N} S_i^{\mu} S_i^{\nu} \tag{2.9}
\]

is the overlap of the two configurations. Although an energy function does not exist for an asymmetric network, one may still define an energy-like function using the symmetric part of the interconnection matrix:

\[
E = -\frac{1}{2} \sum_{i,j=1}^{N} J_{ij}^{s} S_i S_j \equiv \alpha N e. \tag{2.10}
\]

Although asymmetric networks do not always evolve from higher values of E to lower values, the energy function defined above still provides some insight into the dynamics of the network.
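Equations 2.4–2.9 translate directly into a simulation loop. A minimal sketch with random sequential updating, as in the simulations (the sweep-based stopping criterion is our illustrative choice):

```python
import numpy as np

def evolve(J, S0, rng, max_sweeps=100):
    """Random sequential updates S_i <- sign(sum_j J_ij S_j) (eq. 2.4)
    until a full sweep produces no flips, i.e., a fixed point (eq. 2.5)."""
    S = S0.copy()
    N = len(S)
    for t in range(max_sweeps):
        flipped = False
        for i in rng.permutation(N):
            new = 1 if J[i] @ S >= 0 else -1   # sign(h_i), with sign(0) = +1
            if new != S[i]:
                S[i], flipped = new, True
        if not flipped:
            return S, t                         # converged after t sweeps
    return S, max_sweeps

def hamming(S_mu, S_nu):
    """Fractional Hamming distance, eqs. 2.8-2.9."""
    return 0.5 * (1.0 - np.mean(S_mu * S_nu))

rng = np.random.default_rng(0)
xi = rng.choice([-1, 1], size=(3, 100))         # 3 memories, N = 100 (low loading)
J = xi.T @ xi / 100.0                           # symmetric Hebbian case, k = 0
np.fill_diagonal(J, 0.0)
S_final, sweeps = evolve(J, xi[0], rng)
print(hamming(S_final, xi[0]))  # a stored memory should be (near) a fixed point
```

At this low loading the stored pattern is a stable fixed point, so the dynamics started at a memory stays there.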
3 Results of Simulations
We have carried out numerical simulations for network sizes ranging from N = 60 to 500. For asymmetric networks, k was taken to be 0.1, and simulations for each realization of the symmetric component of the synaptic matrix were carried out for two different sets of antisymmetric interconnections. The results reported here are averages over 20 realizations of the symmetric component of the synaptic matrix. The memory loading level α was fixed at 0.1. Asynchronous dynamics with both serial and random sequential updating was employed in our simulations. We found no qualitative difference between the behavior of the network for the two updating procedures, so we present here only the results for random sequential updating. We first looked at the effects of synaptic asymmetry on the number and properties of the fixed-point attractors of the system. All the fixed-point attractors were located by starting the system from a very large number of random initial states and following its evolution until a fixed point is reached. Such exhaustive enumeration (Singh et al., 1995) of the fixed points is possible only for small systems. We present here the results obtained for 60-neuron networks with six memories. The total number of input states used for each realization of the coupling matrix was of the order of 10^6. We find that the average number of fixed points is 151 for k = 0 and 38 for k = 0.1 (a fixed point and its complement, obtained by reversing the signs of all the spins, are not counted separately). In contrast to this dramatic decrease in the total number of stable states, the stored patterns are nearly unaffected by the introduction of asymmetry. In particular, we found only one realization of J_ij^s for which one memory becomes unstable for one of the two realizations of J_ij^as.
The accessibility (Hopfield, Feinstein, & Palmer, 1983) of the stored patterns, defined as the fraction of random initial states converging to the set of stored memory states, increases from 0.64 to 0.70 when the asymmetry is turned on. Second, the influence of asymmetry on the size of the basins of attraction of the stored patterns was investigated. Previous studies (Crisanti & Sompolinsky, 1987; Krauth et al., 1988; Spitzner & Kinzel, 1989) have led to contradictory conclusions about this issue. In our simulations of a network of 500 neurons and 50 memories, we chose each initial state in such a way that it is at a given Hamming distance H from a specific memory state (the target state) and, at the same time, its Hamming distances to other memory states are larger than those to the target state. The total number of initial states used for each value of H was 10,000 for the asymmetric network and 5000 for the symmetric network. Figure 1 shows the probability of correct retrieval (operationally defined as one in which the overlap of the retrieved state with the target memory exceeds 0.95) at different values of H. The curves for k = 0 and k = 0.1 are found to be almost coincident. When H < 0.2, the probability of convergence to the target memory is almost equal to 1. It drops sharply when H exceeds 0.2 and becomes 0 as H approaches 0.4 for both symmetric and asymmetric networks.

Figure 1: Probability of convergence to a memory state as a function of the fractional Hamming distance H of the initial state from the memory. The data shown are for networks with N = 500 and α = 0.1. Results for the symmetric network (k = 0) are shown by the solid line; the triangles represent results for an asymmetric network with k = 0.1.

It is surprising that although a large number of spurious states are suppressed, the size of the basin of attraction of the memory states does not increase when the asymmetry is turned on. Spitzner and Kinzel (1989) found a similar result for a Hopfield model with directed bonds. A probable explanation of the results shown in Figure 1 is that the spurious stable states that become unstable when the asymmetry is turned on usually have high values of the energy E defined in equation 2.10. As noted in Hopfield et al. (1983), the typical size of the basin of attraction of such high-E states is expected to be small. So the disappearance of these high-energy states has no significant influence on the size of the basins of attraction of the memory states.
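The construction of initial states at a fixed Hamming distance from a target memory, described above, can be sketched as follows (a minimal illustration with hypothetical function names, not the authors' code):

```python
import numpy as np

def initial_state_at_distance(patterns, target_idx, H, rng, max_tries=1000):
    """Random +/-1 state at fractional Hamming distance H from the target
    memory, rejected unless it is strictly closer to the target than to
    every other stored memory (the construction described in the text)."""
    p, N = patterns.shape
    n_flip = int(round(H * N))
    target = patterns[target_idx]
    for _ in range(max_tries):
        s = target.copy()
        flip = rng.choice(N, size=n_flip, replace=False)
        s[flip] *= -1
        d = (patterns != s).sum(axis=1)   # Hamming distance to each memory
        if np.all(np.delete(d, target_idx) > d[target_idx]):
            return s
    return None

def overlap(s, pattern):
    """Retrieval is counted as correct when this exceeds 0.95."""
    return abs(s @ pattern) / len(s)
```

For random patterns at loading α = 0.1, the rejection step almost never triggers for small H, since the distance to non-target memories is typically close to N/2.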
The connection between the stability and the energy of a state of the network may be understood from the relation between the local stability parameters {λ_i} defined in equation 2.7 and the energy. The energy of a symmetric network (k = 0) may be written as

E = -\frac{1}{2} \sum_{i,j=1}^{N} J^{s}_{ij} S_i S_j = -\frac{1}{2} \sum_{i=1}^{N} \lambda_i.   (3.1)
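Equation 3.1 is easy to verify numerically. The check below takes λ_i = S_i Σ_j J^s_ij S_j for the local stability parameters; the paper's equation 2.7 lies outside this excerpt, so this definition is an assumption, chosen to be consistent with equation 3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 60
A = rng.normal(size=(N, N))
Js = (A + A.T) / 2            # symmetric couplings J^s_ij
np.fill_diagonal(Js, 0.0)
S = rng.choice([-1, 1], size=N)

lam = S * (Js @ S)            # local stabilities lambda_i = S_i sum_j J^s_ij S_j
E_pair = -0.5 * S @ Js @ S    # E = -(1/2) sum_ij J^s_ij S_i S_j
E_lam = -0.5 * lam.sum()      # E = -(1/2) sum_i lambda_i

assert np.isclose(E_pair, E_lam)
```

The identity holds for any symmetric J^s with zero diagonal, which is why lower energy corresponds to higher average local stability, as argued next.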
It is obvious that stable states with lower energy correspond to a higher local stability on average, and vice versa. The energy of the network does not change when the antisymmetric component is added. However, all the local stability parameters are changed: some are increased, and the others are decreased. If the decrease of one of the local stability parameters is so large that its sign changes from positive to negative, then that state becomes unstable. Since the antisymmetric part of the connection matrix is chosen randomly, it is clear that stable states with higher energy (i.e., with lower average local stability) are more likely to become unstable when synaptic asymmetry is introduced. This proposed explanation has been confirmed by the results of our simulations. In Figures 2 and 3, we show the results obtained from simulations of networks with N = 60 and 6 memories. To obtain the data shown in Figure 2, the entire range of possible energy values was divided into 20 equal parts ("bins"). For each realization of the connection matrix, the number of stable states whose energies fall into a particular bin was counted. The numbers obtained in this way for all the bins were then averaged over all realizations of the connection matrix. We observe that the antisymmetric part of the connection matrix has a stronger effect in reducing the number of stable states with higher energy. Figure 3 shows the dependence of the accessibility of a fixed point on the energy parameter e. The data shown in this figure were obtained in a similar way as those in Figure 2. For each coupling matrix, the fraction of input states that converge to a stable state whose energy falls in a particular bin was calculated. This fraction was then averaged over all the stable states belonging to that bin. Finally, an averaging was done over all realizations of the coupling matrix. We note that the accessibility drops rapidly with increasing e.
Furthermore, the presence of asymmetry in the coupling matrix enhances the accessibility of low-energy stable states because a large number of high-energy stable states are eliminated. Since the total number of random input states used in these simulations is quite large, the fraction of inputs converging to a stable state may be considered a measure of the size of its basin of attraction. It is then clear from the data shown in Figure 3 that the size of the basin of attraction of a stable state decreases significantly with increasing e. Although the introduction of asymmetry in the synaptic interactions does not directly increase the size of the basin of attraction of stored patterns, it
Figure 2: Dependence of the average number of fixed points, N_fp, on the energy parameter e for networks with N = 60 and α = 0.1. Results for the symmetric network (k = 0) and the asymmetric network with k = 0.1 are shown.
does have a positive effect on the retrieval performance of the network. This positive effect stems from the fact that the convergence time for wrong retrievals increases in the presence of asymmetry. This makes it possible to discriminate between a correct retrieval and an incorrect one. Results for the average convergence time τ, obtained from simulations of networks of 500 neurons, are presented in Figure 4. Data for τ for both correct and wrong retrievals are shown for the symmetric network and an asymmetric network with k = 0.1. We find that the average convergence time for wrong retrievals is significantly larger than that for correct retrievals, and that the former increases significantly in the presence of asymmetry, whereas the latter remains almost unchanged. Our results for the convergence times in the symmetric network are consistent with those of Tanaka and Yamada (1993). There may be several reasons for these observations. The typical Hamming distance of an input state to a spurious stable state may be larger than
Figure 3: Dependence of the accessibility of a fixed point (i.e., the fraction of random initial states that converge to it) on the energy parameter e for networks with N = 60 and α = 0.1. Results for the symmetric network (k = 0) and the asymmetric network with k = 0.1 are shown.
that to its target memory state. The former may increase further in the presence of asymmetry in the coupling matrix, which suppresses a large number of spurious states. A second possibility is that the dynamical evolution process for eventual trapping into a spurious state is more complicated than convergence to the target memory. Introduction of synaptic asymmetry further increases the complexity of the processes leading to wrong retrieval. In order to shed some light on this question, we have calculated the average Hamming distance H′ between the input state and the output spurious state to which the system converges for different values of H. We find that H′ is only slightly higher than H and that the introduction of asymmetry has almost no effect on H′. We have also found that for a fixed value of H, the values of H′ obtained for networks of different sizes are nearly the same. This implies that the above observation is not a finite-size effect. It is then reasonable to conclude that the second reason mentioned above plays a crucial role in slowing the process of wrong retrievals.

Figure 4: Average convergence time τ, measured in units of updates per neuron, as a function of the fractional Hamming distance H of the initial state from a memory state. The data shown are for networks with N = 500 and α = 0.1. The two curves at the bottom represent data for convergence to memory states for two different values of the asymmetry parameter k: k = 0 (solid line) and k = 0.1 (dashed line). The two curves at the top represent data for τ when the convergence is to a spurious state: k = 0 (solid line); k = 0.1 (dashed line).

The same mechanism may lead to a transformation of the spurious stable states into chaotic trajectories (Parisi, 1986; Crisanti & Sompolinsky, 1987) when the asymmetry parameter is increased. The dip near H ≈ 0.2 in the curves for the time of convergence to spurious attractors (see Figure 4) arises from the fact that the spurious attractors to which the system converges have significant correlations with the memory states. We have calculated the distribution of the minimum Hamming distance H_m of a spurious stable state from all the memory states. This distribution is found to have a sharp peak at a relatively small value (0.2–0.3) of H_m (Singh et al., 1995). The presence of a large number of spurious stable
states at a Hamming distance of 0.2 to 0.3 from a memory state results in relatively fast convergence to one of them when the Hamming distance of the initial state from its target memory lies in this range. The sharp drop in the probability of convergence to a memory state as H is increased beyond 0.2 (see Figure 1) is also due to the same reason. The peak of the distribution of H_m is found to move to higher values as the sample size is increased. This observation suggests that the effect mentioned above may be an artifact of the smallness of the sample size. This point merits further investigation. The observation that the average convergence times for correct and wrong retrievals are well separated suggests (Tanaka & Yamada, 1993) that if we set a suitable upper limit t_m on the convergence time and allow the network to evolve only up to the time t_m, then wrong retrievals will be mostly discarded and the correct ones will be selected out. However, this strategy may suffer from the overlap of the two "bands" of convergence times. This difficulty may be alleviated by adding an antisymmetric component to the synaptic matrix, because the presence of asymmetry reduces the overlap between the two "bands" by increasing the convergence time for wrong retrievals. The effectiveness of this procedure is illustrated in Figure 5, where we show simulation results for networks with N = 500. The upper limit used in these simulations was t_m = 16 updates per spin. The probability of convergence to memory states was calculated by considering only those runs in which the system reaches a stable state within time t_m. A comparison with Figure 1 clearly shows that this strategy leads to a significant improvement in the retrieval performance of the network. The probability of correct retrieval is close to 65% for H = 0.4 when an appropriate upper limit is set for the convergence time, whereas this probability is close to zero in the absence of any upper limit.
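The convergence-time cutoff can be sketched as follows. This is a minimal illustration with hypothetical names, not the authors' code; the zero-temperature update rule is assumed from the model section, which lies outside this excerpt.

```python
import numpy as np

def retrieve_with_cutoff(J, s0, target, t_max=16, rng=None):
    """Run asynchronous dynamics for at most t_max updates per neuron;
    count the retrieval only if a fixed point with overlap > 0.95 with
    the target memory is reached within the limit (the t_m strategy)."""
    rng = rng or np.random.default_rng()
    N = len(s0)
    s = s0.copy()
    for _ in range(t_max):               # one sweep = one update per neuron
        changed = False
        for i in rng.permutation(N):
            new = 1 if J[i] @ s >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:                  # fixed point reached within t_m
            return abs(s @ target) / N > 0.95
    return False                         # discarded: no convergence in time
```

Runs that fail to reach a fixed point within t_m, which are predominantly the slow wrong retrievals, are simply discarded.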
The probability of correct retrieval for H = 0.4 increases to 73% for k = 0.1, indicating a further improvement caused by the introduction of asymmetry. Similar results were obtained for t_m = 12 and 14 updates per spin. We have also studied the dependence of the convergence times on the network size N. The results are shown in Figure 6. It can be seen that both the gap between the average convergence times for correct and incorrect retrievals and the effectiveness of synaptic asymmetry in widening this gap increase as the size of the system is increased. Therefore, the prescription proposed for improving the retrieval performance of the network is expected to work better for larger networks. The same conclusions hold for serial updating; the only difference is that with serial updating, the average convergence time is shorter for both memory states and spurious states.

4 Summary and Discussion
Figure 5: The probability of convergence to a memory state is plotted as a function of H, the Hamming distance of the initial configuration from the target memory. In these simulations, an upper limit of 16 updates per neuron was set for the time of convergence; only those runs that converged to fixed points within this time limit were used to calculate the probabilities. The results shown are for N = 500, α = 0.1, and two values of k: k = 0 (solid line) and k = 0.1 (triangles).

We have studied the retrieval properties of a Hopfield-like model in which a random antisymmetric part is added to the otherwise symmetric synaptic matrix. We find that the introduction of the antisymmetric component enhances the overall accessibility of the memory states. However, the size of the basin of attraction of a memory state does not show any significant change when asymmetry is introduced in the synaptic matrix. This is quite an unexpected result. We show that it is due to the fact that the spurious fixed points that are destabilized by the introduction of asymmetry are highly inaccessible to begin with. Therefore, their suppression does not have any appreciable effect on the basins of attraction of the more accessible fixed points. The convergence time to spurious fixed-point attractors increases faster than that to the memory states as the asymmetry parameter is increased. The possibility of convergence to spurious fixed points is
Figure 6: Dependence of the average convergence time τ on the number of neurons N of the network. The two curves at the bottom represent data for convergence to memory states; those at the top represent results for convergence to spurious states. The data shown are for α = 0.1. The Hamming distance of the initial state from the target memory was set at H = 0.3. The solid lines represent the results for the symmetric network, and the dashed lines correspond to k = 0.1.
greatly reduced if a suitable upper limit is set on the convergence time. This prescription for suppressing convergence to spurious fixed points becomes more effective in the presence of asymmetry in the synaptic interactions. Thus, the antisymmetric part can be very useful in the associative recall of the stored memories by slowing the process of wrong retrievals. As the antisymmetric part of the coupling matrix is chosen randomly, we cannot expect it to have only the positive effects mentioned above. Analytic studies (Treves & Amit, 1988; Singh et al., 1995) of the behavior of this model in the large-N limit indicate that the introduction of random synaptic asymmetry decreases the storage capacity of the network and marginally degrades the quality of recall. Also, networks with asymmetric interconnections and serial updating may exhibit time-dependent attractors, such
as limit cycles of varying length (Nützel, 1991), in addition to fixed points. Since k = 0.1 corresponds to a small relative strength of the antisymmetric component, and possibly because of the smallness of the systems studied, we found little evidence of such adverse effects. These effects are probably responsible for the slight reduction of the recall probability (see Figure 1) as the asymmetry is turned on. The "optimal" value of k would have to be determined by balancing these negative effects against the positive ones discussed above. Our limited investigation of the behavior of the network for other values of k suggests that its optimal value is higher than 0.1. A more detailed study of this issue would be useful. The retrieval properties of the network may improve if the antisymmetric component of the interconnection matrix is chosen in a systematic way instead of randomly. A recent study (Okada, Fukai, & Shiino, 1998) shows that random asymmetric dilution of the synaptic connections of a Hopfield-type model leads to a decreased storage capacity that can be rectified by a systematic dilution technique. Another indication of the beneficial effects of structured asymmetry is provided by the fact that some of the learning algorithms proposed (Krauth & Mézard, 1987; Gardner, 1988) for increasing the storage capacity and optimizing the local stability of the memory states lead to asymmetric synaptic connections. We close with a few comments on the important question of whether the effects of synaptic asymmetry found in our study are also expected to be present, at least qualitatively, in other neural network models of associative memory. Our work and other similar studies (Crisanti & Sompolinsky, 1988; Gutfreund, Reger, & Young, 1988) of spin models with asymmetric interactions strongly suggest that the presence of synaptic asymmetry greatly reduces the number of spurious attractors.
However, it is not clear whether this reduction in the number of spurious attractors always leads to an enlargement of the basins of attraction of the memory states. This does not happen in our model or in a Hopfield model with asymmetric dilution of the synapses (Spitzner & Kinzel, 1989). In contrast, a recent study (Athithan & Dasgupta, 1997) shows that asymmetric dilution of the synapses of a network constructed using an optimized learning rule (Krauth & Mézard, 1987) based on linear programming causes a nearly total suppression of convergence to spurious attractors. To take another example, extreme asymmetric dilution (Derrida et al., 1987; Gardner, 1989) of the synapses leads to a maximal size of the basins of attraction of the memory states in some but not all models. So the effects of synaptic asymmetry on the size of the basins of attraction of the memory states appear to be somewhat model specific. The asymmetry-induced relative increase of the convergence time to spurious attractors is expected (Crisanti & Sompolinsky, 1988; Nützel, 1991) to be quite general. Therefore, our prescription of introducing a moderate amount of asymmetry and then setting an upper limit on the convergence time is expected to be generally useful for improving the retrieval performance.
Acknowledgments
Z. C. thanks the Department of Physics, Indian Institute of Science, and T. V. Ramakrishnan for their hospitality during the period in which this work was carried out. Facilities for carrying out the numerical work were provided by the Supercomputer Education and Research Centre, Indian Institute of Science.

References

Amit, D. J. (1989). Modeling brain functions. Cambridge: Cambridge University Press.
Athithan, G., & Dasgupta, C. (1997). On the problem of spurious patterns in neural associative memory models. IEEE Trans. Neur. Net., 8, 1483–1491.
Crisanti, A., & Sompolinsky, H. (1987). Dynamics of spin systems with randomly asymmetric bonds: Langevin dynamics and a spherical model. Phys. Rev. A, 36, 4922–4939.
Crisanti, A., & Sompolinsky, H. (1988). Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Phys. Rev. A, 37, 4865–4879.
Derrida, B., Gardner, E., & Zippelius, A. (1987). An exactly solvable asymmetric neural network model. Europhys. Lett., 4, 167–173.
Eccles, J. C. (1964). The physiology of synapses. Berlin: Springer-Verlag.
Erichsen Jr., R., & Theumann, W. K. (1991). Mixture states and storage with correlated patterns in the Hopfield model. Int. J. Neur. Sys., 1, 347–353.
Forrest, B. M. (1988). Content-addressability and learning in neural networks. J. Phys. A, 21, 245–255.
Fukai, T. (1990). Metastable states of neural networks incorporating the physiological Dale hypothesis. J. Phys. A, 23, 249–258.
Gardner, E. (1988). The space of interactions in neural network models. J. Phys. A, 21, 257–270.
Gardner, E. (1989). Optimal basins of attraction in randomly sparse network models. J. Phys. A, 22, 1969–1974.
Gardner, E., Derrida, B., & Mottishaw, P. (1987). Zero temperature dynamics for infinite range spin glasses and neural networks. J. Phys. (France), 48, 741–755.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Gutfreund, H., Reger, J. D., & Young, A. P. (1988). The nature of attractors in an asymmetric spin glass with deterministic dynamics. J. Phys. A, 21, 2775–2797.
Hertz, J. A., Grinstein, G., & Solla, S. A. (1987). Neural nets with asymmetric bonds. In J. L. van Hemmen & I. Morgenstern (Eds.), Heidelberg Colloquium on Glassy Dynamics (pp. 538–546). Heidelberg: Springer-Verlag.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). "Unlearning" has a stabilizing effect in collective memories. Nature, 304, 158–159.
Kepler, T. B., & Abbott, L. F. (1988). Domains of attraction in neural networks. J. Phys. (France), 49, 1657–1662.
Komlós, J., & Paturi, R. (1993). Effects of connectivity in an associative memory model. J. Comput. Sys. Sci., 47, 350–373.
Krauth, W., & Mézard, M. (1987). Learning algorithms with optimal stability in neural networks. J. Phys. A, 20, L745–L752.
Krauth, W., Nadal, J. P., & Mézard, M. (1988). The roles of stability and symmetry in the dynamics of neural networks. J. Phys. A, 21, 2995–3011.
Nützel, K. (1991). The length of attractors in asymmetric random neural networks with deterministic dynamics. J. Phys. A, 24, L151–L157.
Okada, M., Fukai, T., & Shiino, M. (1998). Random and systematic dilutions of synaptic connections in a neural network with a nonmonotonic response function. Phys. Rev. E, 57, 2095–2103.
Parisi, G. (1986). Asymmetric neural networks and the process of learning. J. Phys. A, 19, L675–L680.
Singh, M. P., Zhang, C., & Dasgupta, C. (1995). Fixed points in a Hopfield model with random asymmetric interactions. Phys. Rev. E, 52, 5261–5272.
Sompolinsky, H. (1987). The theory of neural networks: The Hebb rule and beyond. In J. L. van Hemmen & I. Morgenstern (Eds.), Heidelberg Colloquium on Glassy Dynamics. Heidelberg: Springer-Verlag.
Spitzner, P., & Kinzel, W. (1989). Hopfield model with directed bonds. Z. Phys., 74, 539–545.
Tanaka, T., & Yamada, M. (1993). The characteristics of the convergence time of associative neural networks. Neural Computation, 5, 463–472.
Toulouse, G., Dehaene, S., & Changeux, J.-P. (1986). Spin glass model of learning by selection. Proc. Natl. Acad. Sci. USA, 83, 1695–1698.
Treves, A., & Amit, D. J. (1988).
Metastable states in asymmetrically diluted Hopfield networks. J. Phys. A, 21, 3155–3169.
Waugh, F. R., Marcus, C. M., & Westervelt, R. M. (1990). Fixed-point attractors in analog neural computation. Phys. Rev. Lett., 64, 1986–1989.

Received October 26, 1998; accepted May 26, 1999.
LETTER
Communicated by Shun-ichi Amari
On "Natural" Learning and Pruning in Multilayered Perceptrons

Tom Heskes
Foundation for Neural Networks, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands

Several studies have shown that natural gradient descent for on-line learning is much more efficient than standard gradient descent. In this article, we derive natural gradients in a slightly different manner and discuss implications for batch-mode learning and pruning, linking them to existing algorithms such as Levenberg-Marquardt optimization and optimal brain surgeon. The Fisher matrix plays an important role in all these algorithms. The second half of the article discusses a layered approximation of the Fisher matrix specific to multilayered perceptrons. Using this approximation rather than the exact Fisher matrix, we arrive at much faster "natural" learning algorithms and more robust pruning procedures.

1 Introduction
Natural gradient descent has recently received quite a lot of attention (Amari, 1998; Rattray, Saad, & Amari, 1998; Yang & Amari, 1998). The basic idea is to replace the standard gradient by the natural gradient, defined as the steepest descent direction in Riemannian space (Amari, 1998). It has been shown (Rattray et al., 1998; Yang & Amari, 1998) that the use of natural gradients in on-line backpropagation learning for multilayered perceptrons (MLPs) leads to a tremendous speed-up in convergence. One of the important observations is that natural gradients greatly help to break the symmetry between hidden units (Rattray et al., 1998), which, using standard gradients, is the cause of long plateaus (long time spans in which the performance of the MLP barely improves, also referred to as temporary minima; see, among others, Ampazis, Perantonis, & Taylor, 1999; Saad & Solla, 1995; Wiegerinck & Heskes, 1996).

The goal of this article is twofold. First, a different formulation of natural gradients (section 2): we will derive "natural" gradients in a slightly different way. In this formulation, the appropriate metric follows naturally from the error function one tries to minimize (section 2.1). We discuss the overlap and differences with Amari's approach (section 2.2). The natural equivalents of batch-mode learning and pruning can be defined and linked to existing algorithms (sections 2.3 and 2.4). Second, an efficient approximation of the Fisher matrix for MLPs (sections 3 and 4): the Fisher matrix plays a key role in the theory of natural gradients. In practical applications of "natural" rather than "standard" learning or pruning procedures, one constantly has to compute and invert this matrix of size n_weights × n_weights, with n_weights the total number of network weights. The proposed approximation loosens this computational burden by orders of magnitude (section 4.1). For on-line learning, we show that the approximated Fisher matrix is sufficiently powerful to help break the symmetry between hidden units and escape from the corresponding plateau in the same order of iterations as with the exact Fisher matrix (section 4.2). In the context of batch-mode learning, we arrive at an alternative to Levenberg-Marquardt optimization, which can be much more efficient, especially for large networks (section 4.3). Pruning algorithms based on the approximated Fisher matrix already exist (although not interpreted in this way), and although in principle they are less powerful, they are claimed to be more robust than pruning algorithms using the exact Fisher matrix (Laar & Heskes, 1999) (section 4.4).

Neural Computation 12, 881–901 (2000). © 2000 Massachusetts Institute of Technology.

2 Natural Gradient in a Slightly Different Formulation

2.1 On-line Learning. We are concerned here with supervised learning. Given an input x, the output of a model with parameters W reads y(W, x). Both x and y can be vectors. We will use T to denote the transpose. In supervised learning, this output is compared with a given target vector t, and the error between the two is defined as d(y(W, x), t). Well-known error functions are the sum-squared error, d(y, t) = (y − t)²/2, and the cross-entropy, d(y, t) = t log[t/y] + (1 − t) log[(1 − t)/(1 − y)], but other error functions might be chosen as well, as long as they are at least twice differentiable with respect to y.
In on-line learning, a weight update is based on one combination of input x and target t. The standard stochastic gradient procedure is

\Delta W \equiv \hat{W} - W = -\eta\, \frac{\partial d(y(W, x), t)}{\partial W},   (2.1)
with η a (usually small) learning rate, Ŵ the new, and W the current set of parameters. Amari's work on natural gradient learning (see, e.g., Amari, 1998) has made very clear that this particular choice for the gradient is rather arbitrary and should be replaced by natural gradients, defined as the steepest descent direction in Riemannian space. Here we use a slightly different formulation to arrive at more or less the same results. The advantage of our formulation is that we can achieve similar "natural" gradients for any kind of (twice differentiable) error function d(y, t), without the need to introduce (manifolds of) probability distributions or Riemannian spaces. First note that the standard gradient descent procedure (see equation 2.1) is, in lowest order of the learning parameter η, equivalent to¹

\hat{W} = \operatorname{argmin}_{W'} \left[ \eta\, d(y(W', x), t) + \frac{1}{2} \|W' - W\|^2 \right],   (2.2)
with \|W\|^2 \equiv \sum_i W_i^2 the Euclidean distance. The first term expresses the desire to learn the new training example and the second term the need to be conservative, that is, to retain the information acquired in previous learning trials (Kivinen & Warmuth, 1997). The obvious question is, Why use the Euclidean distance and not some other distance measure? Indeed, the "natural" choice would be to use the same distance² function for comparing the outputs of the new model with the given targets as for comparing them with those of the old model: d(y(W′, x), y(W, x)). To make this distance independent of the currently presented input x, we take the average over the distribution of all inputs; that is, we define

D(W', W) = \left\langle d(y(W', x), y(W, x)) \right\rangle_x.   (2.3)
This distance function is independent of the distribution of the targets t. On the other hand, it depends on the distribution of the inputs x, so the underlying assumption is that we know or at least can estimate this distribution. We will discuss a batch-mode procedure, for which this is obviously not a problem. However, in the on-line version, knowledge of the input distribution is indeed an important assumption, on which all studies on natural gradient procedures for supervised learning are based. So replacing in equation 2.2 the Euclidean distance by the more “natural” D(W 0 , W ) of equation 2.3, we obtain £ ¤ O D argmin gd(y(W 0 , x), t) C D( W 0 , W ) , W
(2.4)
W0
¹ For an exact correspondence with the gradient 2.1, we should write

\Delta W = \eta \lim_{\epsilon \to 0} \frac{1}{\epsilon} \left\{ \operatorname{argmin}_{W'} \left[ \epsilon\, d(y(W', x), t) + \frac{1}{2} \|W' - W\|^2 \right] - W \right\}.

In fact, this kind of limit can be viewed as an alternative definition of a gradient, to be compared with the schoolbook definition of a difference in the limit ε to zero, but this seems unnecessarily complicated for the current purposes.

² Since we do not require d(y, t) = d(t, y), the term distance is not always appropriate and could be replaced with divergence. We stick with distance because it does provide the correct sense.
which, in lowest order of η, corresponds to the gradient procedure

\Delta W = -\eta\, F^{-1}(W)\, \frac{\partial d(y(W, x), t)}{\partial W}, \quad \text{with} \quad F(W) \equiv \left. \frac{\partial^2 D(W', W)}{\partial W' \partial W'^T} \right|_{W' = W}.   (2.5)
We refer to F(W) as the Fisher matrix, although it need not be a Fisher matrix according to its stricter definition as in equation 2.8 below. In computing the Fisher matrix, it is quite convenient that all second-order derivatives of the outputs with respect to the parameters W disappear; that is, we have

\left. \frac{\partial^2 d(y(W', x), y(W, x))}{\partial W' \partial W'^T} \right|_{W' = W} = \frac{\partial y(W, x)}{\partial W}\, h(y(W, x))\, \frac{\partial y(W, x)}{\partial W^T},   (2.6)

with

h(y) = \left. \frac{\partial^2 d(y', y)}{\partial y' \partial y'^T} \right|_{y' = y}.   (2.7)
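As a concrete check of equations 2.5 through 2.7 (an illustrative sketch, not from the article): for a linear model y = w·x with sum-squared error, h(y) = 1 and the Fisher matrix reduces to the input correlation ⟨xx^T⟩_x. A single batch natural-gradient step with η = 1 then solves the least-squares problem exactly, which is essentially the Gauss-Newton view behind the Levenberg-Marquardt connection mentioned in the introduction.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d)) * np.array([1.0, 5.0, 0.2])  # anisotropic inputs
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true                                            # noiseless targets

# For sum-squared error d(y, t) = (y - t)^2 / 2, equation 2.7 gives h(y) = 1,
# so equations 2.5-2.6 reduce to F = <x x^T>_x for the linear model y = w.x.
F = X.T @ X / n

# One batch "natural" gradient step with eta = 1 (a Gauss-Newton step):
w = np.zeros(d)
grad = X.T @ (X @ w - t) / n          # average of dd/dw over the inputs
w -= np.linalg.solve(F, grad)         # Delta w = -eta F^{-1} grad, eq. 2.5

assert np.allclose(w, w_true)         # exact in one step for this model
```

With the standard gradient instead, the anisotropy of the inputs would force many small steps; multiplying by F^{-1} whitens the input directions.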
In other words, we only have to compute the second-order derivative h(y) of the error function with respect to the outputs, not the second-order derivative of the outputs with respect to the parameters W, which does appear in the computation of the Hessian matrix (see also section 2.3).

2.2 Relation to Amari's Natural Gradients. In this section we first show that our definition of a "natural" gradient does indeed correspond to Amari's expressions for error functions of the log-likelihood or Kullback-Leibler type. Next we discuss when and how the two definitions of "natural" gradients differ. Let us consider the special case where the outputs y and targets t can be interpreted as probabilities or parameters in a probability density, and the error function d(y, t) is the Kullback-Leibler divergence between the probabilities implied by y and t:

d(y, t) = \int dz\, p(z \mid t) \log \left[ \frac{p(z \mid t)}{p(z \mid y)} \right].
Both the sum-squared error (y refers to the mean of a gaussian distribution with fixed variance, that is, p(z|y) ∼ N(y, σ) with σ fixed and known) and the cross-entropy (y is the class probability, that is, p(z=1|y) = y and p(z=0|y) = 1 − y; the integral can be replaced by a sum over z ∈ {0,1}), described in the beginning of section 2.1, can be written in this way up to irrelevant constants. For this kind of error function, we obtain, after some straightforward manipulations,

$$F(W) = \left.\frac{\partial^2}{\partial W'\,\partial W'^{T}}\left\langle \int dz\; p(z|W,x)\,\log\!\left(\frac{p(z|W,x)}{p(z|W',x)}\right)\right\rangle_x\right|_{W'=W} = \left\langle \int dz\; p(z|W,x)\,\frac{\partial \log p(z|W,x)}{\partial W}\,\frac{\partial \log p(z|W,x)}{\partial W^{T}}\right\rangle_x, \qquad (2.8)$$
the Fisher information (Amari, 1985, 1998). In other words, we have shown here that in cases where the error function to be minimized can be interpreted as a Kullback-Leibler divergence, Amari's natural gradient learning coincides with the formulation in equation 2.4. To understand when and how the two definitions of "natural" differ, let us first rephrase Amari's arguments in our context. Given a manifold of probability distributions parameterized by W, the most general definition of information divergence under some invariance considerations and obvious constraints for divergences (nonnegativity, zero iff the two distributions are equivalent, triangle inequality) is the so-called f-divergence (Amari, 1985),

$$D(W',W) = \left\langle \int dz\; p(z|W,x)\, f\!\left(\frac{p(z|W',x)}{p(z|W,x)}\right)\right\rangle_x, \qquad (2.9)$$
where f(·) is any at least twice differentiable convex function, strictly positive except that f(1) = 0. The Kullback-Leibler divergence follows from f(r) = r − log r − 1. It is easy to show that any choice of f(·) yields, up to an irrelevant multiplying constant, the same Fisher information (see equation 2.8). This makes the Fisher information matrix the preferred metric in the manifold of probability distributions; that is, the only metric that is invariant under reparameterizations. In Amari's formulation, there is no direct link between the first and the second term on the right-hand side of equation 2.4. No matter what the first term is, the second term is an f-divergence of the form of equation 2.9, and thus the gradient of the first term has to be transformed using the Fisher information metric, equation 2.8. In the formulation presented in section 2.1, there does exist a direct link between the two terms. Here the chosen distance function, since it depends only on the model outputs given the model parameters, by definition also ensures invariance under reparameterization. However, it is not unique: it is not the case that any distance function with appropriate invariance conditions gives rise to the same Fisher matrix (see equation 2.5). The remaining ambiguity stems from the fact that different choices for d(y',y) in the definition of D(W',W) can lead to different second-order derivatives h(y) in equation 2.7, which yield different Fisher matrices in equation 2.5. As long as the error function d(y,t) can be interpreted as some kind of f-divergence between probability distributions implied by the model outputs y and targets t, the two formulations yield exactly the same procedure. This is the case for most commonly used error functions, such as sum-squared error and cross-entropy. The formulation of section 2.1 is more general in that it also yields some kind of "natural" gradient if there is no such interpretation.
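For instance (a one-line verification added here), substituting f(r) = r − log r − 1 into equation 2.9, with p = p(z|W,x) and q = p(z|W',x) both normalized, recovers the Kullback-Leibler divergence:

```latex
\int dz\; p\, f\!\left(\frac{q}{p}\right)
  = \int dz\,(q-p) \;-\; \int dz\; p\log\frac{q}{p}
  = \int dz\; p\log\frac{p}{q},
```

since both densities integrate to one, so the first term vanishes.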
Amari’s formulation, on the other hand, is more general in that it can be
applied to learning rules that cannot be derived from an error function, but for which it is possible to define the Riemannian structure (and impossible or more difficult to define an appropriate distance function). An example is the application of natural gradients to independent component analysis (see, e.g., Amari, 1998).

2.3 Batch-Mode Learning. The above description, like Amari's work on natural gradients, focuses on on-line learning. However, exactly the same line of reasoning can be followed for gradient-descent batch-mode learning. In batch-mode learning, a parameter update is obtained by averaging over all inputs x and corresponding targets t; that is, in equations 2.1, 2.2, 2.4, and 2.5, we can simply replace d(y(W,x),t) by ⟨d(y(W,x),t)⟩_{x,t}. In a batch-mode scenario, we can interpret equation 2.2 as a smooth approach to the minimum of ⟨d(y(W,x),t)⟩_{x,t}, trying to keep some history about the initial conditions by staying close to the previous set of parameters. This interpretation might explain some of the popularity of early stopping, where one starts with small parameters (a prior with no functional dependency between inputs and outputs) and terminates the procedure when the model starts overfitting. In this context, the distance measure used for comparing W' and W in equations 2.2 and 2.4 can be viewed as a kind of (dynamic) prior. Pushing this analogy even further, the Euclidean distance in equation 2.2 corresponds to a prior similar to standard weight decay, whereas the "natural" distance D(W',W) of equation 2.3 implements a dynamic version of the Jeffreys prior (see any textbook on Bayesian statistics, e.g., Gelman, Carlin, Stern, & Rubin, 1995). There is another link, maybe not so much in spirit but more in the expressions that have to be computed, with the Levenberg-Marquardt method for least-squares fitting (see, e.g., Luenberger, 1984, and many other textbooks on optimization).
The idea behind the Levenberg-Marquardt method is that the error ⟨d(y(W,x),t)⟩_{x,t} is, at least close to its minimum, quite well approximated by a quadratic form. If this is indeed a good approximation, we can jump to the minimum in a single leap, through (compare with equation 2.5)

$$\Delta W = -H^{-1}(W)\,\frac{\partial \left\langle d(y(W,x),t)\right\rangle_{x,t}}{\partial W}, \quad\text{with}\quad H(W) \equiv \frac{\partial^2 \left\langle d(y(W,x),t)\right\rangle_{x,t}}{\partial W\,\partial W^{T}}. \qquad (2.10)$$
Now, the Levenberg approximation of the Hessian matrix H(W) for sum-squared error is nothing but our definition of the Fisher matrix:

$$H(W) \approx \left\langle \frac{\partial y(W,x)}{\partial W}\,\frac{\partial y(W,x)}{\partial W^{T}}\right\rangle_x = F(W), \qquad (2.11)$$
where in the first step the second-order derivative of y(W,x) with respect to the parameters W is neglected, and where the last step follows from equation 2.6 and h(y) = 1 for d(y,t) = (y − t)²/2 in equation 2.7. In other words, except for some technicalities, like adding a constant to the diagonal, Levenberg-Marquardt least-squares fitting is equivalent to a greedy version of batch-mode natural gradient learning for sum-squared error. In fact, this correspondence need not be restricted to sum-squared error: the Fisher matrix F(W) for general error functions defined in equation 2.6 is indeed a valid approximation of the Hessian for these error functions, similar in spirit to the Levenberg approximation, equation 2.11, for the Hessian of sum-squared error.

2.4 Natural Pruning. Pruning seems quite different from learning. With pruning, one intends to remove from a trained model those parameters that hardly affect, or may even deteriorate, the performance of the model. The reason to discuss pruning here is that most of the existing and popular pruning algorithms can be understood in terms of distance functions and metrics. In fact, as we will see, optimal brain surgeon (OBS) (Hassibi & Stork, 1993) can be interpreted as the natural equivalent of pruning based on weight magnitude. Usually pruning is done in an iterative procedure. Starting with the full model, one successively removes those parameters that appear to be least relevant. The better the performance of the model with a particular parameter (or set of parameters) left out (i.e., the lower the error), the less relevant this parameter. To quantify the relevance of a particular subset of parameters W_k, one has to go through the following two steps:
1. Find the new model Ŵ(k) closest to the original W according to the predefined distance function D(W',W), under the constraint W'_k = 0 (compare with equation 2.4):

$$\hat{W}(k) = \operatorname*{argmin}_{W',\,W'_k=0} D(W',W).$$

Below we will elaborate on specific choices for D(W',W).

2. Compute the error of this new model according to some predefined criterion, E_k = E(Ŵ(k)). Often-used criteria are training, validation, or test error, or an estimate of these (see, e.g., Pedersen, Hansen, & Larsen, 1996), but also the distance to the original model, as, for example, in OBS, E(W') = D(W',W).

Apart from all kinds of more technical details, the principal difference between pruning strategies is their choice for the distance function D(W',W) and, to a lesser extent, their choice for the performance criterion E(W'). A simple, and not recommended, choice would be the Euclidean distance D(W',W) = ‖W' − W‖², combined with E(W') = D(W',W), yielding
E_k = ‖W_k‖². The resulting algorithm is: prune the weights according to their magnitude. OBS (Hassibi & Stork, 1993; see also its "generalized" version, Stahlberger & Riedmiller, 1997) can be interpreted as the "natural" equivalent of pruning based on weight magnitude. The distance function used in OBS, although derived from a different perspective, is the quadratic approximation of the "natural" distance, equation 2.3:

$$D(W',W) = \left\langle d(y(W',x),\,y(W,x))\right\rangle_x \approx \frac{1}{2}\,(W'-W)^{T} F(W)\,(W'-W), \qquad (2.12)$$
with F(W) the Fisher matrix given in equation 2.5. So there is a close link between (natural) pruning and learning: it all boils down to the definition of an appropriate distance between two different models. Until now, everything has been quite general; apart from yielding an output y given an input x, we have not made any assumptions about our model with parameters W. Some of the algorithms mentioned in the previous sections have been derived and introduced specifically for neural networks. But although in practice there may be some advantages to a neural interpretation or implementation, in principle they can be applied to any parameterized model.

3 The Fisher Matrix for Multilayered Perceptrons
The Fisher matrix plays a key role in the theory and algorithms discussed in the previous sections. In this section, we propose a simplification of the Fisher matrix, specific to MLPs. By replacing the exact Fisher matrix with this approximation, we hope to arrive at algorithms that make the most out of the layered structure of MLPs, by either increasing their speed or improving their robustness.

3.1 Notation. Let us consider an MLP with L layers of weights. Our notation will be slightly different from the one used in the previous section. y_l is an n_l × 1 vector with the output activities of the lth layer, where layer 0 is the input layer and layer L the output layer. In other words, the outputs y in the previous section are now called y_L; that is, the error function depends only on the outputs y_L. x_l stands for the input activity of the units in layer l. They are related to y_{l−1} and y_l through
$$x_l = W_l\, y_{l-1} \quad\text{and}\quad y_l = f_l(x_l), \qquad (3.1)$$
with f_l(·) some kind of transfer function (usually a sigmoid for hidden units, by definition the identity for the input units), and W_l an n_l × (n_{l−1} + 1) matrix with weights, where one column has been added to include biases (e.g., by defining y_{l,0} = 1 for all layers l < L). The input x of the previous section
corresponds to x_0 = y_0. (We will still use ⟨...⟩_x to denote an average over all inputs and ⟨...⟩_{x,t} for an average over both inputs and targets.) The set of parameters W is now the combination of all weights {W_1, ..., W_L}. For notational convenience, we assume that there are no connections across layers of hidden units; what follows can easily be generalized to that more general case.

3.2 The Exact Fisher Matrix. Equation 3.1 fully specifies the forward pass through the network. Gradient and Fisher matrix follow from the usual backpropagation rules. With the definitions

$$g \equiv \frac{\partial d(y_L,t)}{\partial y_L} \quad\text{and}\quad \gamma_l \equiv \frac{\partial y_L}{\partial x_l^{T}},$$

it is easy to derive

$$\frac{\partial d(y_L(W,x),t)}{\partial W_l} = [\gamma_l^{T} g]\, y_{l-1}^{T}, \qquad (3.2)$$

where the n_L × n_l matrices γ_l can be computed efficiently using the backprop rule γ_{l−1} = γ_l W_l f'(x_{l−1}). The components of the Fisher matrix follow directly from equation 2.6 combined with the above definitions:

$$F_{ll'}(W) = \left\langle \gamma_l^{T} h\,\gamma_{l'}\;\, y_{l'-1}\, y_{l-1}^{T}\right\rangle_x, \qquad (3.3)$$

where h ≡ h(y_L) from equation 2.7.

3.3 An Approximate Fisher Matrix for MLPs. Although the expressions may seem straightforward, what we have to deal with is a full n_weights × n_weights-dimensional matrix, with n_weights the total number of weights. We have to compute and, perhaps even worse, invert this matrix each iteration. For example, this is what makes Levenberg-Marquardt optimization unpopular for applications with large MLPs. Do we really need this complexity? Considering the way we arrived here, maybe not. The Fisher matrix follows from our choice of a "natural" metric to compute the distance between the network before and after the update. As we will see, an approximation of the Fisher matrix leads to a different choice of this metric. Or, the other way around: maybe we can choose a different metric that leads to a much faster "natural" learning procedure. Here we will arrive at such a metric by simplifying the exact Fisher matrix.
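To make these quantities concrete, the following is a minimal numpy sketch (ours, not from the paper; the layer sizes, tanh hidden units, and identity output are illustrative) of the forward pass of equation 3.1 and the per-layer gradients of equation 3.2 for sum-squared error:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(Ws, x):
    """Forward pass (eq. 3.1): x_l = W_l [y_{l-1}; 1], y_l = f_l(x_l)."""
    y, xs, ys = x, [], [x]
    for i, W in enumerate(Ws):
        xin = np.append(y, 1.0)                       # bias unit y_{l,0} = 1
        xl = W @ xin
        y = np.tanh(xl) if i < len(Ws) - 1 else xl    # identity output layer
        xs.append(xl)
        ys.append(y)
    return xs, ys

def layer_gradients(Ws, x, t):
    """Per-layer gradients (eq. 3.2): dd/dW_l = [gamma_l^T g] y_{l-1}^T
    for d(y_L, t) = ||y_L - t||^2 / 2."""
    xs, ys = forward(Ws, x)
    g = ys[-1] - t                  # g = dd/dy_L for sum-squared error
    gamma = np.eye(len(g))          # gamma_L = dy_L/dx_L (identity output)
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        yprev = np.append(ys[l], 1.0)
        grads[l] = np.outer(gamma.T @ g, yprev)
        if l > 0:                   # backprop rule: gamma_{l-1} = gamma_l W_l f'(x_{l-1})
            gamma = (gamma @ Ws[l][:, :-1]) * (1.0 - np.tanh(xs[l - 1]) ** 2)
    return grads

# hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs
Ws = [0.5 * rng.normal(size=(4, 4)), 0.5 * rng.normal(size=(2, 5))]
```

The backprop rule for γ_l appears in the last line of the loop; the gradient blocks can be checked against finite differences.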
The first simplification is to treat different layers independently; we make the Fisher matrix block diagonal by ignoring the cross-terms that arise from changes in weights across layers:

$$F_{ll'} \approx F_{ll}\,\delta_{ll'} = \left\langle \gamma_l^{T} h\,\gamma_l\;\, y_{l-1}\, y_{l-1}^{T}\right\rangle_x\,\delta_{ll'}.$$
There is a very good reason to do this: these effects (for example, that a change in a weight can be compensated for by a change in a weight in a different layer) are highly nonlinear, so the quadratic approximation implied by the Fisher matrix may not be accurate at all (see Laar & Heskes, 1999, and section 4.4). The other simplification is to neglect correlations between the derivatives γ_l and the activities y_{l−1}:

$$F_{ll'} \approx [\Gamma_l\, C_{l-1}]\,\delta_{ll'} \quad\text{with}\quad \Gamma_l \equiv \left\langle \gamma_l^{T} h\,\gamma_l\right\rangle_x \quad\text{and}\quad C_l \equiv \left\langle y_l\, y_l^{T}\right\rangle_x. \qquad (3.4)$$
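As a numerical illustration (ours; random stand-ins for γ_l and y_{l−1}, h = I as for sum-squared error, and a row-major vec ordering of W_l), equation 3.4 replaces the average of a per-pattern Kronecker product by the Kronecker product of the averages:

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_hid, n_pat = 2, 4, 200

# random stand-ins for the per-pattern Jacobians gamma_l and activities y_{l-1}
gammas = rng.normal(size=(n_pat, n_out, n_hid))
ys = rng.normal(size=(n_pat, n_hid))

# exact block-diagonal term: < (gamma^T h gamma) kron (y y^T) >_x  with h = I
F_exact = np.mean(
    [np.kron(g.T @ g, np.outer(y, y)) for g, y in zip(gammas, ys)], axis=0
)

# approximation (eq. 3.4): Gamma_l = <gamma^T h gamma>_x,  C_{l-1} = <y y^T>_x
Gamma = np.mean([g.T @ g for g in gammas], axis=0)
C = np.mean([np.outer(y, y) for y in ys], axis=0)
F_approx = np.kron(Gamma, C)
```

Both matrices are symmetric and positive semidefinite; the approximation only decouples the two averages, which is what makes the inversion factorize into Γ_l⁻¹ and C_{l−1}⁻¹.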
Again, this seems to be a reasonable approximation: the term γ_l^T h γ_l measures the impact of changes in hidden unit activities on the error. That is, it depends on what happens upstream (picturing an information flow from input to output), whereas the activities y_{l−1} are directly affected by what happens downstream. Now we can go back and derive the metric that corresponds to this approximation of the Fisher matrix. First, we note that the quadratic approximation, equation 2.12, yields exactly the same natural gradient procedure as the original D(W',W), since only the second-order derivative with respect to the first argument matters. By construction it is quadratic in W' but not in W (for nonconstant F(W)). The approximate distance function, which follows from substitution of the approximate Fisher matrix, equation 3.4, in equation 2.12, and some rewriting, has the same asymmetry:

$$D_{\text{approx}}(W',W) = \frac{1}{2}\sum_{l=1}^{L}\left\langle [x_l(W) - W_l'\, y_{l-1}(W)]^{T}\,\Gamma_l(W)\,[x_l(W) - W_l'\, y_{l-1}(W)]\right\rangle_x, \qquad (3.5)$$
with x_l(W) and y_l(W) the incoming and outgoing activities of the units in layer l for a network with parameters W. In other words, the approximate Fisher matrix of equation 3.4 corresponds to a sum-squared error between the activities in all layers of the networks, weighted by Γ_l(W), which measures how, on average, changes in these activities affect the error on the outputs of the network.

4 Implications for Learning and Pruning

4.1 Inversion of the Fisher Matrix. The reduction in computational complexity for inverting the Fisher matrix is enormous. We have to invert
the matrices C_l of size (n_l + 1) × (n_l + 1) for l = 0 to L − 1 and the matrices Γ_l of size n_l × n_l for l = 1 to L. To give a rough order of magnitude, suppose that all layers are of about the same size, n_l = n, and that inverting an n × n matrix requires n³ operations. Then, neglecting biases, inverting the exact Fisher matrix requires L³n⁶ operations (total number of weights cubed),³ whereas inverting the approximated Fisher matrix requires only 2Ln³ operations (2L × the number of units per layer cubed). For a moderate MLP with one layer of hidden units (L = 2) and n = 5 units per layer, we are already talking about a factor 250 speed-up. Computing the forward and backward pass through the network to compute the batch-mode gradient each requires on the order of Ln²p operations (number of weights × the number of patterns p), which, for reasonable numbers of patterns (p of the same order as n), is about as time-consuming as inverting the approximate Fisher matrix. Computing the approximate Fisher matrix is only about twice as time-consuming as computing the batch-mode gradient. Note further that C_0, the covariance matrix of the network inputs, needs to be computed and inverted only once (in principle; see, however, section 4.3). So, roughly speaking, using the "natural" batch-mode gradient with the approximated Fisher matrix makes batch-mode learning only a factor of 2 slower when all layers are of about the same size. Large variations in the size of the layers and a relatively small number of patterns increase the relative overhead, but using the approximate Fisher matrix is always an order of magnitude faster than working with the exact Fisher matrix.

³ In some special cases, where the input distribution is known (e.g., gaussian), the Fisher matrix can be calculated analytically and inverted in order n²_input time, if n_input, the number of inputs, is much larger than the number of hidden and output units (Yang & Amari, 1998).

4.2 Approximate On-line "Natural" Learning. Combining equations 2.5, 3.2, and 3.4, we obtain for the weights W_l in layer l the on-line learning rule

$$\Delta W_l = -\eta\, F_{ll}^{-1}(W)\,\frac{\partial d(y_L(W,x),t)}{\partial W_l} = -\eta\,\left\{\Gamma_l^{-1}\,[\gamma_l^{T} g]\right\}\left\{ y_{l-1}^{T}\, C_{l-1}^{-1}\right\}. \qquad (4.1)$$

The second term between braces is the output activity of the preceding layer of units, y_{l−1}, but then normalized and decorrelated by the inverse of the corresponding covariance matrix C_{l−1}. This idea, to decorrelate the activities of the layers before computing the weight update, is not new (see, e.g., Almeida & Silva, 1992), but here it comes out naturally as an approximation to natural gradient learning. An interpretation of the first term between braces is slightly more involved. Recall that

$$[\gamma_l^{T} g] = \frac{\partial d(y_L(W,x_l),t)}{\partial x_l} \quad\text{and}\quad \Gamma_l = \left\langle\left.\frac{\partial^2 d(y_L(W,x_l'),\,y_L(W,x_l))}{\partial x_l'\,\partial x_l'^{T}}\right|_{x_l'=x_l}\right\rangle_x;$$

that is, [γ_l^T g] measures the impact of a change in x_l on the error d(y_L(W,x_l),t), while Γ_l is some kind of Fisher matrix, but now considering changes with respect to x_l rather than with respect to a set of parameters. The general effect of incorporating curvature information in this way is to make larger steps in flat directions and smaller steps in steep directions. So here, by multiplying the gradient [γ_l^T g] with the inverse of the curvature Γ_l, updates are more focused on creating differences between the activities of units in the same layer. The tendency of hidden units to represent the same features, which often appears in the initial stages of learning, is the cause of the notorious plateaus: long time spans in which the performance of the MLP hardly changes. In previous studies it has been shown that incorporating curvature information helps a great deal to break the symmetry between the hidden units and thus considerably shortens the length of the plateau (see, e.g., Rattray et al., 1998; Yang & Amari, 1998). The obvious question is then whether we need the full Fisher matrix to achieve symmetry breaking or can do with the approximation as well.

4.2.1 Learning the Tent Map. To answer this question, we consider a simple simulation study: on-line learning of the tent map. In various studies it has been shown that standard on-line learning of the tent map is severely hampered by the existence of a long plateau (see, e.g., Hondou & Sawada, 1994; Wiegerinck & Heskes, 1996). In that work, the effect of correlations between subsequently presented patterns was studied; here we simply view it as a static problem, with a fixed training set of 100 patterns. The 100 inputs x were drawn homogeneously from the interval [−1, 1], and the corresponding targets t obey t = 1 − |x|. Our MLP has two hidden units (see Figure 1). Initial weights were drawn at random from a gaussian with zero average and standard deviation 0.1.
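This training set is straightforward to reproduce; a minimal sketch (the seed is our choice, the text specifies only that the 100 inputs were drawn homogeneously from [−1, 1]):

```python
import numpy as np

def tent_map_data(n=100, seed=0):
    """n inputs drawn homogeneously from [-1, 1]; targets t = 1 - |x|."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    t = 1.0 - np.abs(x)
    return x, t

x, t = tent_map_data()
```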
We performed three runs of on-line learning, each starting from the same set of initial weights and with the same sequence of pattern presentations:

Standard on-line. $\Delta W = -\eta\,\partial d(y(W,x),t)/\partial W$ with η = 0.1. With a higher learning parameter, fluctuations start to become unacceptably large; a smaller learning parameter further reduces the chance to escape from the plateau.

Exact natural gradient. $\Delta W = -\eta\,[F(W)+\mu I]^{-1}\,\partial d(y(W,x),t)/\partial W$ with F(W) the Fisher matrix numerically computed based on the 100 inputs x, η = 0.01, and μ = 0.01 a small constant added to the diagonal for stabilization (similar to the additional constant in Levenberg-Marquardt).
Figure 1: On-line learning of the tent map. The architecture of the network (middle), the evolution of all weights and biases (on the sides), and the error (top) for three different learning procedures: standard on-line (dash-dotted; upper time axes), exact Fisher matrix (solid), and approximate Fisher matrix (dashed).
Approximate natural gradient. $\Delta W = -\eta\,[F_{\text{approx}}(W)+\mu I]^{-1}\,\partial d(y(W,x),t)/\partial W$ with the same η = 0.01 and μ = 0.01 as for exact natural gradient learning, and with F_approx(W) the proposed approximation of the Fisher matrix.

In other words, except for the replacement of the exact by the approximate Fisher matrix, the conditions for the latter two runs were completely equivalent. Standard on-line learning is used for reference. The results are displayed in Figure 1. The plot in the upper-left corner displays the error ⟨d(y(W,x),t)⟩_{x,t} as a function of time (number of patterns presented) for the three runs. The upper time axis is for standard on-line
learning (dash-dotted lines); the lower time axis is for exact natural gradient (solid lines) and approximate natural gradient (dashed lines). The other plots show the evolution of all the weights and biases, where the time axes are the same as for the errors (20 standard on-line steps for 1 exact or approximate natural gradient step). Standard on-line learning has great difficulty escaping from the initial plateau. This plateau corresponds to a solution with all weights and biases close to zero except for the output bias, which is roughly 0.5, the average of the targets t; there is barely any functional dependence on the inputs x. The two runs using curvature information break the symmetry between the hidden units fairly rapidly and escape from this plateau. The run using the approximate Fisher matrix manages this even a little faster than the run using the exact Fisher matrix. After breaking this symmetry, both converge to a good solution (error < 0.001), the one for the exact Fisher matrix slightly better than the one for the approximate Fisher matrix. Both are still slowly improving. A closer look shows that the run using the approximate Fisher matrix in fact "jumps" from the initial plateau to a second, much lower plateau. The first plateau, the one that is so hard to escape for standard on-line learning, has to do with breaking the symmetry between the hidden units. Similar plateaus are the subject of many analytic studies of on-line learning in large networks (number of input units going to infinity with a fixed number of hidden units) (Rattray et al., 1998). The second plateau, however, is caused by the fact that small changes in the input-to-hidden weights and small (opposite) changes in the hidden-to-output weights almost cancel.
This effect is neglected by the proposed approximation, yet properly taken into account by the full Fisher matrix, which does "look across" layers of hidden units. In short, the simulation shows that the approximate Fisher matrix is sufficiently powerful to break the symmetry between the units and reduces the length of the plateau by the same order of magnitude as the full Fisher matrix. A second plateau, which has to do with interdependencies between weights in different layers, cannot be tackled efficiently using the approximate Fisher matrix, but disappears when using the exact Fisher matrix.

4.2.2 Escape Times for Symmetry Breaking. The simulation described above is just a single instance of a single problem. To check whether the general picture is more or less reproducible, we have compared the three learning strategies (standard on-line, exact natural gradient, and approximate natural gradient) on three different on-line learning problems and averaged over many different instances, generated by randomizing over both initial conditions and the presentation order of the training patterns. We focus on the number of iterations it takes to escape from the first plateau, the one that requires breaking the symmetry between hidden units. We considered the following problems. Unless specified otherwise, networks were initialized with weights drawn independently from a gaussian distribution with mean zero and standard deviation 0.01, had two hidden
units with hyperbolic tangent transfer functions, and were trained to minimize sum-squared error. For standard on-line, we took learning parameter η = 0.1; for both exact and approximate natural gradient learning, η = 0.01, with a constant μ = 0.01 added to the diagonal to prevent ill-conditioning. In all simulations, these appeared to be close to optimal settings for the learning parameters: with much smaller learning parameters, breaking the symmetry takes (even) more time; much larger learning parameters lead to numerical instabilities.

XOR. Two binary inputs, one output; t = 0 for x = {1, 1} and x = {−1, −1}, t = 1 for x = {1, −1} and x = {−1, 1}. At each iteration, one of the four training patterns is drawn at random.

Tent map. As described above, with x at each iteration drawn homogeneously from the interval [−1, 1] and t = 1 − |x|. Exact and approximate Fisher matrices were obtained through numeric integration.

Soft committee. A kind of standard problem used for analyzing neural network learning in the statistical mechanics framework (see, e.g., Rattray et al., 1998). Here we considered N = 10 inputs, at each iteration independently drawn from a gaussian with zero mean and unit standard deviation. The target comes from a "teacher," another neural network with exactly the same architecture as the "student." Both student and teacher have no thresholds, one output, and all hidden-to-output weights fixed to one. To obtain plateaus of about the same length as for the other problems, we had to initialize with extremely small weights, drawn from a gaussian distribution with standard deviation 10⁻¹⁵ rather than 0.01 (apparently, symmetry breaking in soft-committee machines is relatively much easier than in learning the XOR problem or the tent map). Exact and approximate Fisher matrices were obtained through numeric integration (which is why we considered a relatively small network).

Besides randomizing the initial conditions of the "student" and the inputs x drawn at each iteration, we also randomized over "teachers": at the beginning of each run, the input-to-hidden weights of the teacher network were independently drawn from a gaussian distribution with standard deviation 1/√N. On each of the problems, we performed 1000 independent runs for each of the on-line learning procedures. For each run, we recorded the number of iterations needed to escape from the plateau: the number of iterations before the training error reached a level somewhere in the middle between the error on the plateau (before symmetry breaking) and close to optimal (after symmetry breaking). The results are displayed in Figure 2. With standard on-line learning (dash-dotted lines), only 11 out of 1000 networks managed to escape from the plateau within 10⁴ iterations for the XOR problem, and none for the tent map. The results for natural learning
Figure 2: Number of iterations needed to escape from the plateau for the XOR problem, learning the tent map, and imitating a soft committee machine for three different on-line learning procedures: standard on-line (dash-dotted), exact Fisher matrix (solid), and approximate Fisher matrix (dashed).
using the approximate and exact Fisher matrix are comparable: where the approximate version seems to have somewhat more difficulty with breaking the symmetry between the hidden units for the XOR problem, the exact Fisher matrix has more problems with the tent map. Learning the soft committee machine appeared to be relatively easy for all procedures, and the differences between them are much less striking. As argued in Heskes and Coolen (1997), the plateaus for learning the soft committee machine and related problems studied in the statistical mechanics literature are an order of magnitude easier to tackle than the plateaus that occur when learning the tent map or the XOR function. In the former case, the plateau corresponds to a saddle point of the error: the gradient on the plateau is negligible, but the Hessian has a finite negative eigenvalue that drives the escape. In the latter two cases, the error surface really is flat: the smallest eigenvalue of the Hessian matrix is negligible as well (zero curvature), which makes it much harder to escape.

4.3 Using the Approximate Fisher Matrix in Levenberg-Marquardt.
As explained in section 2.3, there is a close connection between batch-mode natural learning and Levenberg-Marquardt optimization. With Levenberg-Marquardt, weight changes obey (compare with equation 2.10)

$$\Delta W = -\left[F(W) + \mu I\right]^{-1}\frac{\partial \left\langle d(y_L(W,x),t)\right\rangle_{x,t}}{\partial W}, \qquad (4.2)$$

where μ is a constant added to the diagonal of F(W) to ensure that the error after the learning step is lower than the error before it (an extremely large μ corresponds to a standard batch-mode gradient-descent weight update with learning parameter μ⁻¹).
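A small sketch (ours; the quadratic toy objective stands in for ⟨d⟩_{x,t}) of the damped step in equation 4.2, illustrating the two regimes: for tiny μ the step is the full curvature-corrected step, while a very large μ reduces it to a plain gradient step with learning parameter μ⁻¹:

```python
import numpy as np

def lm_step(F, grad, mu):
    """Levenberg-Marquardt-style step (eq. 4.2): dW = -(F + mu I)^{-1} grad."""
    return -np.linalg.solve(F + mu * np.eye(len(grad)), grad)

# toy quadratic error 0.5 w^T F w, so grad = F w and F is the exact curvature
F = np.array([[2.0, 0.3],
              [0.3, 1.0]])
w = np.array([1.0, -2.0])
grad = F @ w

step_small_mu = lm_step(F, grad, 1e-8)  # ~ full natural/Newton step: lands near w = 0
step_large_mu = lm_step(F, grad, 1e6)   # ~ gradient descent with rate 1/mu
```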
Simply replacing the exact F(W) by its approximation, equation 3.4, would yield (compare with equation 4.1)

$$\Delta W_l = -\left[F_{ll}(W) + \mu I\right]^{-1}\frac{\partial \left\langle d(y_L(W,x),t)\right\rangle_{x,t}}{\partial W_l} \quad\text{with}\quad F_{ll}(W) = [\Gamma_l\, C_{l-1}].$$

However, adding the constant to the diagonal destroys the decomposition $F_{ll}^{-1} = \Gamma_l^{-1} C_{l-1}^{-1}$, which is what considerably speeds up the computation of the inversions. The alternative, with similar properties but keeping the decomposition intact, is to add the constant not to the diagonal of F_{ll}(W) but to both Γ_l and C_{l−1}:

$$\Delta W_l = -\left[(\Gamma_l + \sqrt{\mu}\, I)^{-1}\,(C_{l-1} + \sqrt{\mu}\, I)^{-1}\right]\frac{\partial \left\langle d(y_L(W,x),t)\right\rangle_{x,t}}{\partial W_l}. \qquad (4.3)$$
For large m we still obtain a gradient step with learning parameter m ¡1 , whereas for smallm , the update is dominated by the approximated “natural” gradient.4 As an illustration, we compared three different batch-mode procedures for training a neural network with eight hidden units on the Boston housing data set: standard Levenberg-Marquardt as in equation 4.2, “layered” Levenberg-Marquardt as in equation 4.3, and, for reference, (almost) steepest descent (equation 4.2 with F(W ) ´ 0 or equation 4.3 with C l ´ 0 and Cl¡1 ´ 0). All 506 patterns were used for training. The training error as a function of CPU time is plotted in Figure 3. Especially in the early stages of learning, using the approximate Fisher matrix (dashed line) considerably speeds up learning compared with both using the exact Fisher matrix (solid line) and almost steepest descent (dash-dotted line). Other runs (not shown), with a validation set left aside, revealed that the minimum of the validation error is obtained somewhere around 101 seconds CPU time for Levenberg-Marquardt learning using either the approximate or the exact Fisher matrix. The inset in Figure 3 shows how the CPU time per update scales as a function of the number of hidden units for this particular problem (the number of weights is 15 times the number of hidden units plus 1). 4.4 Natural Pruning in the Layered Approximation. Also here, in the context of pruning, one might wonder whether using the approximate Fisher matrix instead of the exact one leads to faster or more robust pruning 4
⁴ Another speed-up can be obtained by using

\[
\Delta W_1 = -\left[ (C_1 + \mu I)^{-1} \otimes C_0^{-1} \right]
\left\langle \frac{\partial\, d(y_L(W,x),t)}{\partial W_1} \right\rangle_{x,t}
\]

for the weight changes in the first layer, in which case we have to compute and invert the input covariance matrix only once.
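As an illustration of the layered update rule of equation 4.3, here is a minimal numerical sketch. The array shapes, function name, and the use of NumPy are our own assumptions for illustration, not part of the original:

```python
import numpy as np

def layered_lm_update(G, C_out, C_in, mu):
    """Layered Levenberg-Marquardt step (cf. equation 4.3):
    dW_l = -(C_l + sqrt(mu) I)^{-1} @ G @ (C_{l-1} + sqrt(mu) I)^{-1},
    where G is the averaged gradient of shape (n_out, n_in)."""
    n_out, n_in = G.shape
    A = np.linalg.inv(C_out + np.sqrt(mu) * np.eye(n_out))
    B = np.linalg.inv(C_in + np.sqrt(mu) * np.eye(n_in))
    return -A @ G @ B
```

With identity covariances and μ = 0, the step reduces to minus the averaged gradient, matching the "almost steepest descent" limit mentioned above.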
Tom Heskes
Figure 3: Levenberg-Marquardt learning on the Boston housing data using the exact Fisher matrix (solid lines), the approximate Fisher matrix (dashed lines), and the identity matrix (dash-dotted lines). The main figure shows the training error as a function of CPU time for a network with eight hidden units. The inset shows the CPU time per update for the three different procedures as a function of the number of hidden units.
procedures. The effect of replacing the exact Fisher matrix by its approximation, equation 3.4, in OBS is most easily understood by looking at the corresponding metric, equation 3.5. Finding the new weights Ŵ boils down to solving single-layer problems of the form (W'_{lk} denotes the weight(s) to be pruned):

\[
\hat W_l(k) = \mathop{\mathrm{argmin}}_{W_l' :\, W_{lk}' = 0} \tfrac{1}{2}
\bigl\langle [\,x_l(W) - W_l'\, y_{l-1}(W)\,]^{T}\, C_l(W)\, [\,x_l(W) - W_l'\, y_{l-1}(W)\,] \bigr\rangle_x
\]
\[
= \mathop{\mathrm{argmin}}_{W_l' :\, W_{lk}' = 0} \tfrac{1}{2}
\bigl\langle \| x_l(W) - W_l'\, y_{l-1}(W) \|^2 \bigr\rangle, \tag{4.4}
\]
where the last step, that the solution is independent of C_l, can be easily verified by taking into account that C_l is positive definite and symmetric. In fact, we are just fitting the weights in order to reproduce the hidden unit activities of the original network as accurately as possible. This idea is present in many independently proposed pruning algorithms (Castellano, Fanelli, & Pelillo, 1997; Egmont-Petersen, Talmon, Hasman, & Ambergen, 1998; Laar & Heskes, 1999). The link provided here, a layered approximation of the exact Fisher matrix, might explain some of their success. In Laar and Heskes (1999), "partial retraining," an algorithm that, apart from a few minor details, is based on equation 4.4, was compared with (generalized) OBS. It turned out that partial retraining is much more robust than
“Natural” Learning and Pruning in Multilayered Perceptrons
OBS, mainly because the quadratic approximation, equation 2.12, breaks down quite rapidly after removing a few weights. Interdependencies between weights in different layers are not accurately approximated, and the flexibility to compensate for pruned weights by changing weights in other layers is overestimated. The more conservative layered approximation (see equation 4.4) of the exact Fisher matrix appeared to be sufficiently powerful and much less sensitive to weight removals.

5 Discussion and Directions
We have described natural learning and pruning procedures from a slightly different perspective than Amari's original formulation. In all of this, the Fisher matrix, measuring the local curvature of the distance function, plays a key role. Making a block-diagonal ("layered") approximation of this Fisher matrix, we arrived at fast learning and robust pruning procedures specific to MLPs. The approximation proposed is rather straightforward and simplistic: it neglects all interdependencies between weights in different layers. Still, it appeared to be sufficiently powerful to break the symmetry between hidden units in the early stages of learning. However, in later stages, where subtle interactions between weights in different layers come into play, one may have to pay the price for this neglect.

The odds for a layered approximation in pruning, discussed at greater length in Laar and Heskes (1999), are even better than for learning. In learning, the Fisher matrix is used only to transform the standard gradient in a smart way. To guarantee that the error indeed decreases (always for batch mode, or on average for on-line learning), some regularization can be added to prevent jumps in weight space that are too large. In pruning algorithms, the role of the Fisher matrix is much more critical. Here the Fisher matrix is used in a quadratic approximation of a distance function, and if this approximation breaks down, there is no feedback from an error function as there is in learning. This might explain why in pruning it pays to be somewhat more conservative and use a more local metric, as in equation 3.5, rather than the full quadratic approximation in equation 2.12.

For on-line natural learning, the important assumption remains that the distribution of inputs is known. Instead, one might try to sample the required statistics dynamically, while learning.
It would be interesting to see under which conditions, and with what kind of schemes, one could best approximate the "true" natural gradient to get the most benefit out of using natural rather than standard gradients. In Yang and Amari (1998), exact expressions for the Fisher matrix are derived for gaussian input probability distributions. One could apply the same expressions as an approximation for nongaussian input distributions as well. The advantage compared to the approach taken in this article is that such an approximation can take into account dependencies among
different layers. The question is how useful the expressions still are when the shape of the input distribution starts to differ from a standard gaussian. One could also think of combining the two approaches, where the layered approximation yields a kind of zeroth order, with interdependencies among weights in different layers estimated assuming gaussian inputs.

Acknowledgments
I thank Bert Kappen and Martijn Leisink for discussions and comments on earlier drafts. This research was supported by the Technology Foundation STW, applied science division of NWO, and the technology program of the Ministry of Economic Affairs.

References

Almeida, L., & Silva, F. (1992). Adaptive decorrelation. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks, 2 (pp. 149–156). Amsterdam: North-Holland.
Amari, S. (1985). Differential-geometric methods in statistics. New York: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Ampazis, N., Perantonis, S., & Taylor, J. (1999). Dynamics of multilayer networks in the vicinity of temporary minima. Neural Networks, 12, 43–58.
Castellano, G., Fanelli, A., & Pelillo, M. (1997). An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8, 519–531.
Egmont-Petersen, M., Talmon, J., Hasman, A., & Ambergen, A. (1998). Assessing the importance of features for multi-layer perceptrons. Neural Networks, 11, 623–635.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (1995). Bayesian data analysis. London: Chapman & Hall.
Hassibi, B., & Stork, D. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. Hanson, J. Cowan, & L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Heskes, T., & Coolen, J. (1997). Learning in two-layered networks with correlated examples. Journal of Physics A, 30, 4983–4992.
Hondou, T., & Sawada, Y. (1994). Analysis of learning processes of chaotic time series by neural networks. Progress in Theoretical Physics, 91, 397–402.
Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1–63.
Laar, P. v. d., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977–993.
Luenberger, D.
(1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Pedersen, M., Hansen, L., & Larsen, J. (1996). Pruning with generalization based weight saliencies: γOBD, γOBS. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 521–527). Cambridge, MA: MIT Press.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461.
Saad, D., & Solla, S. (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74, 4337.
Stahlberger, A., & Riedmiller, M. (1997). Fast network pruning and feature extraction using the unit-OBS algorithm. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 655–661). Cambridge, MA: MIT Press.
Wiegerinck, W., & Heskes, T. (1996). How dependencies between successive examples affect on-line learning. Neural Computation, 8, 1743–1765.
Yang, H., & Amari, S. (1998). The efficiency and robustness of natural gradient descent learning rule. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 385–391). Cambridge, MA: MIT Press.

Received April 26, 1999; accepted July 12, 1999.
LETTER
Communicated by William Lytton
Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations

Michele Giugliano
N.B.T. Neural and Bioelectronic Technologies Group, Department of Biophysical and Electronic Engineering, University of Genova, Genova, Italy

Neural Computation 12, 903–931 (2000)
© 2000 Massachusetts Institute of Technology

Markov kinetic models constitute a powerful framework to analyze patch-clamp data from single-channel recordings and to model the dynamics of ion conductances and synaptic transmission between neurons. In particular, the accurate simulation of a large number of synaptic inputs in wide-scale network models may result in a computationally highly demanding process. We present a generalized consolidating algorithm to simulate efficiently a large number of synaptic inputs of the same kind (excitatory or inhibitory) converging on an isopotential compartment, independently modeling each synaptic current by a generic n-state Markov model characterized by piece-wise constant transition probabilities. We extend our findings to a class of simplified phenomenological descriptions of synaptic transmission that incorporate higher-order dynamics, such as short-term facilitation, depression, and synaptic plasticity.

1 Introduction

Realistic network simulations, involving the description of individual synaptic contributions, generally require a large proportion of the total processing time to be devoted to updating the state variables associated with each synapse and to computing the resulting postsynaptic current at every simulation time step (Lytton, 1996; Koch & Segev, 1998). Such a distributed representation of a set of identical independent physical systems, performing potentially redundant calculations at slightly different times, has previously been analyzed in the case of specific model synapses, in order to improve its performance in terms of the CPU time and memory required by software implementations (Srinivasan & Chiel, 1993; Lytton, 1996; Köhn & Wörgötter, 1998; Giugliano, Bove, & Grattarola, 1999). This has been done by representing the contribution of multiple synaptic sites in a consolidated iterated closed form. In this work, we prove that such an approach can be extended to arbitrary Markov kinetic models of the postsynaptic receptor dynamics. This leads to the synthesis of a general class of algorithms efficiently implementing synaptic transmission in large network simulations. Besides a
theoretical analysis of the performance improvements, we report a few examples and computer simulations to clarify the explored strategy. We show that synaptic transmission models incorporating the spontaneous release of neurotransmitter can be easily implemented and optimized with the same technique. Our findings are then extended to the fast simulation of short-term dynamic synaptic phenomena (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997; Varela et al., 1997; Chance, Nelson, & Abbott, 1998), involving calcium-dependent facilitation, refractoriness of neurotransmitter exocytosis, exhaustion of the readily releasable vesicle pool, desensitization of postsynaptic receptors, and plasticity of the absolute synaptic efficacy (Finkel & Edelman, 1987). A previous approach to the optimization of a large ensemble of identical independent depressing synaptic conductances (Giugliano et al., 1999) is thus a particular case of a more general algorithm.

2 Methods
Biophysical details of the behavior of single channels have been widely investigated by using patch recording techniques (Sakmann & Neher, 1983). It has been possible to demonstrate experimentally the stochastic nature of membrane ion conductances and to characterize the rapid transitions among different conformational states, which depend on the electric potential across the plasma membrane or on the interaction with second-messenger and ligand molecules (Sakmann & Neher, 1983; Hille, 1992). In such a context, Markov models have been used widely and consistently to model the stochastic microscopic properties (e.g., single-channel openings and closings, open time distributions) (Colquhoun & Hawkes, 1983), as well as the deterministic macroscopic features determining membrane ionic conductances and other cellular signaling pathways, such as ionotropic and metabotropic chemical synaptic transmission (Destexhe, Mainen, & Sejnowski, 1994b). It is usually assumed that membrane channels and receptors fluctuate through a succession of discrete conformational configurations, among a finite number of states, separated by large energy barriers (Hille, 1992). The probability of a transition from one state to another is also assumed to depend only on the currently occupied state and not on the whole past history. Membrane channels and receptors can thus be regarded as finite-state Markov systems (Papoulis, 1991; Destexhe, Mainen, & Sejnowski, 1994a, 1994b). In this article we focus on chemical synaptic transmission between neurons (Walmsley, Alvarez, & Fyffe, 1998), and we discuss the optimization of the collective contribution of many synaptic currents converging on an isopotential compartment, independently modeling each synaptic site by an arbitrary n-state Markov model.
This framework proves appropriate for the efficient network simulation of phenomena such as the spontaneous release of neurotransmitter, useful for detailed investigations of the contribution of miniature synaptic potentials (i.e., minis) and spontaneous synaptic activity (Paré, Shink, Gondreau, Destexhe, & Lang, 1998) to the electrophysiological activity, with realistic biophysical models.

2.1 Definition of a Markov Kinetic Model. The formal definition of a Markov kinetic model identifies a set S of n states and a set of transition probabilities among states, also known as transition rates (Howard, 1971):
\[
S = \{ s_1, \dots, s_n \}, \qquad \{ r_{ij} \ge 0,\; i, j = 1, \dots, n \}. \tag{2.1}
\]
The concepts of state and transition are particularly important in the mathematical description of the system configurations, specifying, for instance, transitions and conformational states in which the channel binds neurotransmitter molecules and opens or inactivates, consequently modifying the ionic flows through the membrane. The transition probabilities, and hence the Markov process itself, can be graphically represented by a state diagram: an oriented line segment is drawn from each state and labeled with the probability per unit time r_{ij} that a transition from the ith state to the jth state occurs. Focusing now on a single ionotropic ligand-gated postsynaptic receptor, we define the probability q_i(t) that the generic ith state is occupied at a given time t:

\[
q_i(t) = \Pr\{\text{state } s = s_i \text{ occupied at time } t\}, \qquad i = 1, \dots, n, \quad 0 \le q_i \le 1. \tag{2.2}
\]
These n absolute probabilities evolve in time according to a set of linear differential equations (see equation 2.3) (Colquhoun & Hawkes, 1983; Papoulis, 1991; Destexhe et al., 1994a, 1994b; Grattarola & Massobrio, 1998), which can be equivalently rewritten in vector notation (see equation 2.4):

\[
\dot q_i = \sum_{j=1}^{n} R_{ij}\, q_j, \qquad i = 1, \dots, n, \tag{2.3}
\]
\[
\dot q = R\, q. \tag{2.4}
\]
In the previous equations we introduced the matrix R of coefficients R_{ij} (Gantmacher, 1959), explicitly defined as

\[
R_{ij} =
\begin{cases}
r_{ji} & i \ne j, \\[2pt]
-\displaystyle\sum_{k=1,\, k \ne i}^{n} r_{ik} & i = j,
\end{cases} \tag{2.5}
\]
assuming that r_{ii} = 0. When the coefficients of the matrix R are constant in time, equation 2.4 can be solved in closed form (Gantmacher, 1959). Its solution is

\[
q(t) = e^{(t - t_o) R}\, q_o, \tag{2.6}
\]
with e^{(t−t_o)R} denoting the matrix exponential and q_o the initial condition of the state vector. Equation 2.6 is the analytical solution of equation 2.4, and thus the most complete description of the time course of the probability that each state is occupied at any time. However, in most computer simulations of large networks of neurons, such a description is not strictly required. Because of the architecture of digital computers, these probabilities must be computed at discrete times only. Depending on the numerical method implemented for the integration of the equations that describe the currents and voltages associated with each unit of the network, these discrete time events can be either uniformly spaced over the time axis (in the case of fixed-step numerical methods) or not (in the case of adaptive integration-step implementations). Thanks to the properties of the matrix (t − t_o)·R, with (t − t_o) a scalar multiplying factor, the exact solution of equation 2.4 can be expressed iteratively for an arbitrary sequence of time steps t_1, t_2, …, t_k, t_{k+1}, … (Gantmacher, 1959):

\[
q(t_{k+1}) = e^{\Delta_k R}\, q(t_k), \qquad \Delta_k = t_{k+1} - t_k. \tag{2.7}
\]
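A minimal numerical sketch of the iterated propagation in equation 2.7, using a two-state scheme of the kind introduced later; the rate values are illustrative, and SciPy's matrix exponential is assumed available:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

# Two-state generator (cf. equation 2.12): column j holds the flows out
# of state j, so every column sums to zero and probability is conserved.
alpha_T, beta = 4.0, 1.0
R = np.array([[-beta,  alpha_T],
              [ beta, -alpha_T]])      # state order: (open, closed)

q = np.array([0.0, 1.0])               # all receptors closed at t_0
for dt in (0.1, 0.25, 0.1):            # arbitrary, non-uniform steps
    q = expm(dt * R) @ q               # q(t_{k+1}) = e^{Delta_k R} q(t_k)

assert abs(q.sum() - 1.0) < 1e-12      # probabilities remain normalized
```

The open fraction q[0] relaxes monotonically toward its equilibrium value alpha_T / (alpha_T + beta), regardless of the step sizes chosen.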
This general property of linear systems has already been used in the context of specific model synapses (Srinivasan & Chiel, 1993; Lytton, 1996; Giugliano et al., 1999), and it is fundamental to the development of the fast algorithms.

2.2 Fast Calculation of the Postsynaptic Currents. Let us consider an isopotential compartment of a model neuron. Neglecting propagation delays, whose management could, for instance, be optimized by a direct implementation of the single-queue strategy proposed in Lytton (1996), to a first approximation the deterministic macroscopic synaptic current I_syn, resulting from N independent synapses of the same type (e.g., excitatory), is given by a linear superposition of independent contributions, each expressed by a standard ohmic I–V relationship (Koch & Segev, 1998; Destexhe et al., 1994a, 1994b), as reported in equation 2.8:

\[
I_{syn} = \sum_{i=1}^{N} g_i(t)\, \bigl(E_{syn} - V_m\bigr)
= \left[ \sum_{i=1}^{N} g_i(t) \right] \bigl(E_{syn} - V_m\bigr). \tag{2.8}
\]
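Numerically, equation 2.8 means that the N conductances can be summed once and then multiplied by the common driving force once; a toy sketch with made-up values:

```python
import numpy as np

g = np.array([0.2, 0.5, 0.1])     # g_i(t): per-site conductances (nS)
E_syn, V_m = 0.0, -65.0           # reversal and membrane potential (mV)

# Factor out the common driving force: one sum over N, one product.
I_syn = g.sum() * (E_syn - V_m)   # 0.8 * 65.0 = 52.0 in these toy units
```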
V_m and E_syn indicate the postsynaptic membrane potential and the apparent equilibrium (reversal) potential, respectively. The ith synaptic conductance g_i(t) is related to the fraction of ligand-gated receptors in the particular conformational states that let ions flow through the membrane. The fraction
of receptors in such states can be deterministically expressed under the hypotheses of the law of large numbers. In the terms quantitatively defined by the Tchebyceff inequality (Papoulis, 1991), the probability q_i(t) referred to a single channel represents the fraction of units occupying the ith state at time t, within a large ensemble of receptors. For instance, let us consider a nine-state kinetic model (Standley, Norris, Ramsey, & Usherwood, 1993), characterized by five closed (C) and four open (O) channel configurations:

\[
\begin{array}{ccccccccc}
C & \leftrightarrow & C_1 & \leftrightarrow & C_2 & \leftrightarrow & C_3 & \leftrightarrow & C_4 \\
  &                 & \updownarrow &      & \updownarrow &      & \updownarrow &      & \updownarrow \\
  &                 & O            &      & O_2          &      & O_3          &      & O_4
\end{array} \tag{2.9}
\]

In this example, the ith synaptic site conductance is expressed as

\[
g_i(t) \propto \left[ q_O^{(i)}(t) + q_{O_2}^{(i)}(t) + q_{O_3}^{(i)}(t) + q_{O_4}^{(i)}(t) \right], \tag{2.10}
\]
q_O(t), q_{O_2}(t), …, q_{O_4}(t) being the fractions of the postsynaptic receptors in the open states O, O_2, …, O_4, respectively. In a simpler case, the conductance can be related to a single conformational state,

\[
g_i(t) \propto \left[ q^{(i)}(t) \right], \tag{2.11}
\]

where q is the fraction of receptors in the open conformational state O, evolving, for example, according to a simplified two-state kinetics (see Destexhe et al., 1994a):

\[
C \;\; \underset{\beta}{\overset{\alpha \cdot T}{\rightleftharpoons}} \;\; O. \tag{2.12}
\]
In the general case of equation 2.4, we now partly relax the constraint on R, letting it be time dependent. Specifically, we require it to be piece-wise constant: we assume the existence of a set of M constant matrices {W_1, W_2, …, W_M} such that at any given time t,

\[
\exists\, m \in \{1, \dots, M\}: \quad R(t) = W_m. \tag{2.13}
\]
Given a sequence of time events t_1, t_2, t_3, …, t_k, t_{k+1}, …, let us consider, for example, the following changes in the coefficients of R(t):

\[
R(t) =
\begin{cases}
W_5 & t \in [t_1;\, t_2), \\
W_7 & t \in [t_2;\, t_3), \\
\;\vdots \\
W_m & t \in [t_k;\, t_{k+1}), \\
\;\vdots
\end{cases} \tag{2.14}
\]
Such a sequence of events models steep changes in the concentration of the ligand/neurotransmitter in the cleft, and thus in the value of the coefficients of R, induced by evoked presynaptic responses, as well as determined by spontaneous exocytotic events, neuromodulation, or changes in the concentration of G-protein subunits in simplified models of second-messenger gating (Destexhe et al., 1994b). In the example of equation 2.14, it immediately follows that

\[
q(t + \Delta t) =
\begin{cases}
e^{\Delta t\, W_5}\, q(t) & t \in [t_1;\, t_2), \\
e^{\Delta t\, W_7}\, q(t) & t \in [t_2;\, t_3), \\
\;\vdots \\
e^{\Delta t\, W_m}\, q(t) & t \in [t_k;\, t_{k+1}), \\
\;\vdots
\end{cases} \tag{2.15}
\]
by choosing Δt < Δ_k for each k (see equation 2.7). For small M, this suggests that it might be convenient to compute each exponential matrix off-line. In particular, we define the following M quantities:

\[
U_1 = e^{\Delta t\, W_1}, \quad U_2 = e^{\Delta t\, W_2}, \quad \dots, \quad
U_m = e^{\Delta t\, W_m}, \quad \dots, \quad U_M = e^{\Delta t\, W_M}. \tag{2.16}
\]
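For instance, for the two-state example introduced next (equations 2.17 and 2.18), the M = 2 propagators of equation 2.16 can be prepared once, ahead of the simulation loop. Parameter values here are illustrative placeholders:

```python
import numpy as np
from scipy.linalg import expm

dt, alpha, beta, T_max = 0.5, 2.0, 1.0, 1.0
# W1: transmitter pulse present (T = T_max); W2: pulse absent (T = 0).
W1 = np.array([[-beta, alpha * T_max], [beta, -alpha * T_max]])
W2 = np.array([[-beta, 0.0], [beta, 0.0]])
U = [expm(dt * W1), expm(dt * W2)]   # U_m = e^{dt W_m}, computed off-line

# During the simulation, advancing any state vector q over dt reduces
# to selecting the current class m and computing U[m] @ q.
```

Because every column of W1 and W2 sums to zero, every column of U[0] and U[1] sums to one, so each propagation step conserves total probability.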
The kinetic model indicated in equation 2.12 is thus a particular case of a more general framework. In that example, the two transition probabilities have been made explicit: T indicates the concentration of neurotransmitter molecules in the cleft, and α and β are the forward and backward rate constants for transmitter binding, respectively (Destexhe et al., 1994a). Denoting by ḡ_i the maximal synaptic conductance (i.e., the absolute synaptic efficacy), the differential equation that describes the behavior of each synaptic conductance follows:

\[
g_i(t) = \bar g_i \cdot q^{(i)}(t), \qquad i = 1, \dots, N, \tag{2.17}
\]
\[
\frac{dq^{(i)}}{dt} = -\beta\, q^{(i)} + \alpha\, T_i(t)\, \bigl(1 - q^{(i)}\bigr). \tag{2.18}
\]
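Equation 2.18 is linear in q^{(i)} while T_i is held constant, so one integration step can be taken exactly rather than by Euler integration. A sketch, where the function name and parameter values are our own illustration:

```python
import math

def exact_step(q, dt, T, alpha=2.0, beta=1.0):
    """Advance dq/dt = -beta*q + alpha*T*(1 - q) exactly over dt,
    assuming the transmitter concentration T is constant during dt."""
    k = alpha * T + beta                 # total relaxation rate
    q_inf = alpha * T / k                # fixed point for this value of T
    return q_inf + (q - q_inf) * math.exp(-k * dt)
```

With T = T_max during a transmitter pulse and T = 0 afterwards, the two branches of this step reproduce the iterative solution derived below as equation 2.19.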
Assuming that the transmitter concentration T_i(t) occurs in the ith synaptic cleft as a pulse of amplitude T_max (i.e., M = 2) and duration C_dur (Anderson & Stevens, 1973; Colquhoun et al., 1992), triggered by a presynaptic action potential, we satisfy the piece-wise time invariance previously discussed (see equation 2.13) and derive the iterative form of the solution:
\[
q^{(i)}(t + \Delta t) =
\begin{cases}
q^{(i)}(t)\, e^{-(\alpha T_{\max} + \beta)\Delta t}
+ \dfrac{\alpha T_{\max}}{\alpha T_{\max} + \beta}
\bigl( 1 - e^{-(\alpha T_{\max} + \beta)\Delta t} \bigr)
& \bar t_i < t + \Delta t < \bar t_i + C_{dur}, \\[8pt]
q^{(i)}(t)\, e^{-\beta \Delta t}
& t + \Delta t > \bar t_i + C_{dur},
\end{cases} \tag{2.19}
\]
with t̄_i being the time of occurrence of the last presynaptic action potential at the ith synapse (see equations 2.14 and 2.15). Such a simplified definition of the time course of the postsynaptic conductance implicitly accounts for saturation and summation of multiple presynaptic events, while being as fast to calculate as a single alpha function (Destexhe et al., 1994a). A quantitative comparison with the general case and notation (see equation 2.15) leads to n = 2, M = 2, and

\[
\mathbf{q} = \begin{pmatrix} q \\ 1 - q \end{pmatrix}, \qquad
W_1 = \begin{pmatrix} -\beta & \alpha\, T_{\max} \\ \beta & -\alpha\, T_{\max} \end{pmatrix}, \qquad
W_2 = \begin{pmatrix} -\beta & 0 \\ \beta & 0 \end{pmatrix}.
\]
We focus again on the general case of N independent synaptic sites described by a generic kinetics. A state vector q( i) and a matrix R( i) , i D 1, . . . , N are associated with each site, and the following set of N ¢ n differential equations can be derived (as in equation 2.4):
\[
\dot q^{(1)} = R^{(1)} q^{(1)}, \quad \dots, \quad \dot q^{(N)} = R^{(N)} q^{(N)}.
\]
(2.20)
Under the hypothesis of equation 2.13, we now require that for the ith system, i = 1, …, N, at any given time t the following expression holds:

\[
\exists\, m \in \{1, \dots, M\}: \quad R^{(i)}(t) = W_m. \tag{2.21}
\]
Assuming M considerably smaller than N, the previous statement on the time course of each R( i) matrix implicitly induces an instantaneous partition of the set of N systems into M classes, each characterized by the same matrix Wm , m D 1, . . . , M, at any time. This fact represents the fundamental stage for the fast algorithm and generalizes the previous consolidating strategies
(Lytton, 1996) to the case of n states and M constant transition matrices. More quantitatively, we define the set of all the systems sharing the same transition matrix at a given time t, and we indicate the cardinality of such a set as N_m. We thus introduce indexes i_m = 1, …, N_m, in order to refer independently to each element of the mth class, so that

\[
\forall\, m \in \{1, \dots, M\},\;\; \forall\, i_m \in \{1, \dots, N_m\}: \quad
R^{(i_m)} = W_m, \tag{2.22}
\]
so that

\[
N = \sum_{m=1}^{M} N_m \qquad \forall t, \tag{2.23}
\]

and

\[
\dot q^{(i_m)} = W_m\, q^{(i_m)}, \qquad m = 1, \dots, M. \tag{2.24}
\]
The computer simulation of equation 2.8 in the case of large networks may result in a highly demanding computational task. This is mainly due to the need for tracking the temporal evolution of each synaptic site (e.g., as in equation 2.19) and then computing the overall synaptic conductance by means of a sum over the N sites. Without losing any generality, we assume that a single conducting state exists in the generic Markov model describing the postsynaptic receptors, and we identify it (e.g.) as the fourth component of the state vector q^{(i)} for the ith synaptic site, i = 1, …, N. Let us define the following M quantities:

\[
w_m = \sum_{i_m = 1}^{N_m} q^{(i_m)}, \qquad m = 1, \dots, M. \tag{2.25}
\]
The sum of all the state vectors q^{(i)} can also be expressed as the sum of these quantities:

\[
w = \sum_{i=1}^{N} q^{(i)} = \sum_{m=1}^{M} w_m. \tag{2.26}
\]
4
D
Nm X im D1
( )
q4im .
(2.27)
Synthesis of Generalized Algorithms
911
Because of the linearity of equation 2.15, we can considerably speed up the computation of equation 2.27, by noting that each w m satises the following set of linear differential equations: wP m D Wm w m .
(2.28)
As for equations 2.6 and 2.7, by using the denitions given in equation 2.16, we can either analytically solve such differential equations, specifying their initial conditions w m ( to ), or iteratively calculate their exact solutions. By choosing the second approach, we nally express w in a lumped form, summing over M instead of N terms (see appendix A for the summarized general form of the algorithm): M X
w(t C D t) D
mD1
w m (t C D t) D
M X mD1
Um w m ( t).
(2.29)
In the last expression, an arbitrary time increment D t has been considered. The size of such increment is constrained only by the time interval between two consecutive events (e.g., as in equation 2.14), so that a xedstep updating is not strictly required. However, we note that further performance improvements can be achieved by calculating off-line each Um , m D 1, . . . , M, for a xed D t or for a limited set of D t values. Finally, equation 2.8 can be rewritten as follows: ( Isyn D gN ¢
N h X iD1
( i)
q
i 4
)
(
M h i X ¢ (Esyn ¡ Vm ) D gN ¢ w m mD1
) 4
¢ ( Esyn ¡ Vm ). (2.30)
In the last expression we assumed that all the synapses share the same value of the maximal conductance, but this is not strictly required, as we show later. 3 Results 3.1 Computational Efciency. In this section, an analysis of the number of arithmetic operations required at each time step by the implementation of the fast calculation algorithm (see appendix A) presented in section 2.2 is given.1 As stated by equation 2.7, assuming R time independent and without considering the off-line computation of its exponential matrix, the number of operations required for the iterative calculation of the analytical solution of equation 2.4, consists of n ¢ (n ¡ 1) sums and n2 products needed 1 Computer simulations, implemented in standard ANSI C-code available on request, were performed on a Digital DEC 3000 Alpha workstation and on an Intel Pentium II 333 MHz double processor workstation.
912
Michele Giugliano
to compute the product of the (n ¢ n) matrix eD k ¢R with a column vector ( n ¢ 1), q( tk ). In the case of nonoptimized calculation of the total synaptic conductance (see equation 2.8), the quantity N h i X ( i) w D q4 4
(3.1)
iD1
must be computed by accumulating N terms, whose updating expression is given by ()
i q 4 ( t C D t) D
n £ X
UF(i)
jD1
¤
()
4j
¢ qj i ( t),
(3.2)
generically assuming R( i) (tO) D UF(i) tO 2 [tI t C D t), F( i) 2 f1, . . . , Mg. In order to derive the computational load of the nonoptimized procedure for the summation of a large number of identical independent synaptic conductances, we rewrite equation 2.29 by denition as follows: N X n h i X () w ( t C D t) D [UF( i) ]4j ¢ qj i ( t). 4
(3.3)
iD1 jD1
By indicating with D prod and D sum (D prod > D sum ) the times required by an abstract oating-point unit to compute the result of a two-operand product and sum respectively, and neglecting any pipe-lined architecture, the total estimated time required for equation 3.3 is h i D TSLOW D N ¢ n2 ¢ D prod C n ¢ ( n ¡ 1) ¢ D sum C ( N ¡ 1) ¢ D sum . (3.4) In the fast algorithm, equation 3.1 is replaced by M h h i i X w D w , 4
mD1
(3.5)
m 4
so that the accumulation of only M quantities is required, independently on N. Each term is iteratively updated, immediately leading to M X n h i h i X w ( t C D t) D [Um ]4j ¢ w m ( t). 4
mD1 jD1
4
An abstract floating-point unit would then require

\[
\Delta T_{FAST} = M \cdot \left[ n^2 \cdot \Delta_{prod} + n\,(n-1) \cdot \Delta_{sum} \right] + (M - 1) \cdot \Delta_{sum}. \tag{3.7}
\]
(3.6)
(3.7)
Synthesis of Generalized Algorithms
913
We can thus define a normalized gain factor f as follows (see equation 3.9), by defining

\[
Q = \bigl( c \cdot n^2 - n + 1 \bigr)^{-1}, \qquad
c = \frac{\Delta_{prod}}{\Delta_{sum}} + 1, \quad c > 1, \tag{3.8}
\]
\[
\Delta T_{FAST} = \frac{1}{f} \cdot \frac{1}{N} \cdot \Delta T_{SLOW}. \tag{3.9}
\]
When N is much bigger than Q, f becomes independent of N:

\[
N \gg Q \;\Rightarrow\; f \cong \frac{1}{M - Q}, \tag{3.10}
\]
quantitatively establishing that the acceleration increases as the number of simulated independent synaptic sites increases. Explicitly, the following equation holds:

\[
\Delta T_{FAST} \cong \left[ \frac{c \cdot M \cdot n^2 - M \cdot n + (M - 1)}{c \cdot n^2 - n + 1} \right]
\cdot \frac{1}{N} \cdot \Delta T_{SLOW}. \tag{3.11}
\]
In a real implementation of the algorithm, however, the acceleration improvements do not grow unbounded as N increases, because of the time required to process the temporal events t_1, t_2, …, t_k, t_{k+1}, …, each time executing part I of the algorithm (see appendix A); this cost depends roughly on the product of N and the average rate of activation of the synaptic inputs.

3.2 Numerical Accuracy. Equation 3.11 indicates that this technique improves the computational efficiency, reducing the number of arithmetic operations at each simulation step, since c > 1, n > 1, and N ≫ M in all realistic situations. Such a result has other important consequences concerning the numerical accuracy of the computation (see Figure 1). Referring for the sake of simplicity to the example reported in equation 2.18, two main strategies exist to compute the total synaptic conductance resulting from an ensemble of N independent sites, under the usual hypotheses. The first consists of the computation of the iterated exact solution, given by equation 2.19 (Destexhe et al., 1994a), while the second is the integration of equation 2.18 by means of a numerical technique, such as Euler's method or the Runge-Kutta methods (Press, Teukolsky, Vetterling, & Flannery, 1996). It follows that the fast algorithm must be compared with the fastest computation strategy, namely the numerical integration. Of course, there is a compromise between the performance that a numerical method can achieve in terms of accuracy and its complexity, resulting
Figure 1: A comparison between the numerical accuracy given by two computer implementations is shown. (Top) Time course of the instantaneous arithmetic error of the numerical integration of equations 3.13 with respect to its iterated exact closed form. (Bottom) The higher arithmetic precision of the fast algorithm (equations 3.16–3.20) is evident; it shares the numerical accuracy of the exact iterated solution strategy, and as a consequence the instantaneous arithmetic error is zero most of the time. Both measures have been normalized with respect to the iterated exact closed form (equation 3.14), chosen as a reference.
in higher computational loads. However, our interest is focused not on the general convergence properties of any particular numerical method, but rather on the truncation errors occurring during the accumulation of the N terms in the sum of equation 2.10, because of the finite-precision arithmetic. We chose the fastest numerical method (Euler's method) and verified that an implementation of the proposed fast algorithm shares the numerical accuracy of the exact iterated solution strategy and an optimal CPU-time performance, at parity of time-step size. We can also roughly quantify the generic consequences of the finite-precision arithmetic on the errors accumulating over N sums by emulating such effects with an additive white noise source in equation 2.8, regarding the global synaptic conductance as a random variable (Oppenheim & Shaffer, 1975). From this analysis, it follows that the reduction of the total number of sums from N to M (see equation 2.26) decreases the variance of the global conductance (see equation 2.8) at each time step by a factor M / N, thus improving the signal-to-noise ratio, especially when the number N of synapses is considerably high. An example of such an improvement is reported in Figure 1.

3.3 Efficient Simulation of Spontaneous Neurotransmitter Release.
In this section, we propose an example of a fast implementation of synaptic transmission with a kinetic Markov model, incorporating the spontaneous release of neurotransmitter into the cleft.
Figure 2: (Top) The temporal evolution of the piece-wise constant transition rate. (Bottom) The resulting time course of the synaptic conductance, according to equation 4.5.
We consider a two-state kinetic scheme for neurotransmitter-receptor binding, although any n-state kinetics could have been used. To a first approximation, we assume that T(t) results from the linear superposition of two processes, describing the profile of neurotransmitter in the cleft due to spontaneous exocytotic events and to spike-triggered releases. For simplicity, we assume that deterministic and spontaneous events differ only in the maximal amplitude of the neurotransmitter pulse (Anderson & Stevens, 1973; Shepherd, 1988; Colquhoun et al., 1992), although distinct pulse durations could be considered too. This assumption ensures the piece-wise time invariance of the rate coefficients, since T takes one of four constant values at any given time (see Figure 2, top):
T(t) = \begin{cases} 0 \\ T_1 \\ T_2 \\ T_1 + T_2 \end{cases} \qquad T_1 < T_2.   (3.12)
The dynamics of the postsynaptic conductance is described by the solution of the following equation, where \tau_{ON} and q_\infty are piece-wise constant, according to the values reported in Table 1, depending on the amplitude of neurotransmitter in the cleft (see equation 3.12):

\frac{dq}{dt} = \frac{q_\infty - q}{\tau_{ON}}.   (3.13)
Under this framework, we can express the iterated exact solution of equation 3.13, by implementing the appropriate updating rule, depending on the
Table 1: Constant Parameters Defined in the Example of Spontaneous Neurotransmitter Release.

Symbol          Expression
\tau_{OFF}      \beta^{-1}
q_{\infty 0}    0
\tau_{ON1}      (\alpha T_1 + \beta)^{-1}
q_{\infty 1}    \alpha T_1 \, (\alpha T_1 + \beta)^{-1}
\tau_{ON2}      (\alpha T_2 + \beta)^{-1}
q_{\infty 2}    \alpha T_2 \, (\alpha T_2 + \beta)^{-1}
\tau_{ON3}      [\alpha (T_1 + T_2) + \beta]^{-1}
q_{\infty 3}    \alpha (T_1 + T_2) \, [\alpha (T_1 + T_2) + \beta]^{-1}
value of T(t) in the current time interval. Rewriting equation 2.15 in this context leads to:

q(t + \Delta t) = \begin{cases}
e^{-\Delta t/\tau_{OFF}} \, q(t) & T = 0 \\
e^{-\Delta t/\tau_{ON1}} \, q(t) + (1 - e^{-\Delta t/\tau_{ON1}}) \, q_{\infty 1} & T = T_1 \\
e^{-\Delta t/\tau_{ON2}} \, q(t) + (1 - e^{-\Delta t/\tau_{ON2}}) \, q_{\infty 2} & T = T_2 \\
e^{-\Delta t/\tau_{ON3}} \, q(t) + (1 - e^{-\Delta t/\tau_{ON3}}) \, q_{\infty 3} & T = T_1 + T_2.
\end{cases}   (3.14)

Regarding a large population of such synaptic sites, we induce a partitioning of the ensemble of synapses into four classes (M = 4) by introducing the following index notation (see section 2.2):

i_{OFF} = 1, \ldots, N_{OFF}
i_{ON1} = 1, \ldots, N_{ON1}
i_{ON2} = 1, \ldots, N_{ON2}
i_{ON3} = 1, \ldots, N_{ON3}

The fast algorithm can be immediately instantiated for this case, first by defining, as in equation 2.25, the quantities

Q_{OFF} = \sum_{i_{OFF}=1}^{N_{OFF}} q_{i_{OFF}}, \quad
Q_{ON1} = \sum_{i_{ON1}=1}^{N_{ON1}} q_{i_{ON1}}, \quad
Q_{ON2} = \sum_{i_{ON2}=1}^{N_{ON2}} q_{i_{ON2}}, \quad
Q_{ON3} = \sum_{i_{ON3}=1}^{N_{ON3}} q_{i_{ON3}}   (3.15)
and then by expressing them iteratively:

Q_{OFF}(t + \Delta t) = e^{-\Delta t/\tau_{OFF}} \, Q_{OFF}(t)   (3.16)

Q_{ON1}(t + \Delta t) = e^{-\Delta t/\tau_{ON1}} \, Q_{ON1}(t) + (1 - e^{-\Delta t/\tau_{ON1}}) \, q_{\infty 1} \, N_{ON1}   (3.17)

Q_{ON2}(t + \Delta t) = e^{-\Delta t/\tau_{ON2}} \, Q_{ON2}(t) + (1 - e^{-\Delta t/\tau_{ON2}}) \, q_{\infty 2} \, N_{ON2}   (3.18)

Q_{ON3}(t + \Delta t) = e^{-\Delta t/\tau_{ON3}} \, Q_{ON3}(t) + (1 - e^{-\Delta t/\tau_{ON3}}) \, q_{\infty 3} \, N_{ON3}   (3.19)

The resulting total synaptic conductance is given by the sum of all four terms (see equations 3.15 and 2.29):

g(t) \propto \sum_{i=1}^{N} q^{(i)} = Q_{OFF} + Q_{ON1} + Q_{ON2} + Q_{ON3}.   (3.20)
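The class-wise consolidation above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's code: the per-class time constants and asymptotic values below are arbitrary stand-ins for the expressions of Table 1.

```python
import math
import random

# Sketch of eqs. 3.15-3.20: instead of updating N synaptic variables one by
# one, keep a running sum Q per class (OFF, ON1, ON2, ON3) and iterate each
# sum with its own exponential factor.  (tau, q_inf) values are illustrative.
dt = 0.5
classes = {                      # class -> (tau, q_inf)
    "OFF": (2.0, 0.0),
    "ON1": (1.0, 0.3),
    "ON2": (1.0, 0.6),
    "ON3": (1.0, 0.9),
}

N = 1000
q = [random.random() for _ in range(N)]
label = [random.choice(list(classes)) for _ in range(N)]

# consolidated state: per-class sum Q (eq. 3.15) and class population
Q = {c: sum(qi for qi, li in zip(q, label) if li == c) for c in classes}
n = {c: label.count(c) for c in classes}

# one step, naive (N individual updates of eq. 3.14) ...
for i in range(N):
    tau, q_inf = classes[label[i]]
    a = math.exp(-dt / tau)
    q[i] = a * q[i] + (1 - a) * q_inf

# ... versus consolidated (M = 4 updates, eqs. 3.16-3.19)
for c, (tau, q_inf) in classes.items():
    a = math.exp(-dt / tau)
    Q[c] = a * Q[c] + (1 - a) * q_inf * n[c]

# total conductance (eq. 3.20) agrees either way, up to round-off
print(abs(sum(q) - sum(Q.values())) < 1e-9)
```

Note how the consolidated loop runs over M = 4 classes regardless of N, which is the whole source of the speed-up.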
In Figure 3, time courses of the total synaptic conductance computed by several techniques are compared.

4 Extension to Dynamic Synaptic Transmission and Plasticity

4.1 General Case. The consolidating algorithmic strategy explored in this work turns out to constitute an appropriate framework for the optimization of simplified phenomenological descriptions of higher-order synaptic phenomena (Sen, Jorge-Rivera, Marder, & Abbott, 1996; Abbott & Marder, 1998). Several experimental investigations at the neuromuscular junction (Magleby & Zengel, 1975) and at central synapses (Thomson & Deuchars, 1994; Markram & Tsodyks, 1996; Varela et al., 1997; Chance et al., 1998) have revealed that many dynamic behaviors characterize signal transmission between excitable cells (see Figure 4). Indeed, several frequency-dependent phenomena, together with short-term plasticity, occur over a heterogeneous set of timescales (Markram & Tsodyks, 1996; Varela et al., 1997). This history-dependent signal transmission results from presynaptic calcium accumulation, exhaustion of readily releasable neurotransmitter resources, finite-time mobilization of vesicle pools, desensitization of postsynaptic receptors, and probably many other biophysical mechanisms. For instance, four temporally distinct components of the enhancement of the postsynaptic response amplitude have been identified (facilitation type 1, type 2, augmentation, and potentiation), and at least two components seem to be needed to account for activity-dependent synaptic depression,
Figure 3: Temporal evolution of the total synaptic conductance, simulated using 1000 synapses, randomly activated and spontaneously releasing neurotransmitter molecules. (a) Numerical integration of equation 3.13. (b) The iterated closed form (see equation 3.14). (c) The optimized algorithm. The lines do not coincide in all three cases, because of time-step round-off differences among implementations (see section 3.2).
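The round-off comparison of section 3.2 can be illustrated with a toy experiment. The sketch below is not the paper's benchmark; the parameter values are illustrative assumptions. It contrasts forward-Euler integration of equation 3.13 with the iterated exact update, against the analytic solution.

```python
import math

# For dq/dt = (q_inf - q)/tau the iterated exact solution
# q(t+dt) = e^{-dt/tau} q(t) + (1 - e^{-dt/tau}) q_inf reproduces the
# continuous solution at every step, whereas forward Euler accumulates
# truncation error.  Parameter values are illustrative.
tau, q_inf, dt, q0, steps = 5.0, 0.8, 0.5, 0.0, 20

decay = math.exp(-dt / tau)
q_euler, q_exact = q0, q0
for _ in range(steps):
    q_euler += dt * (q_inf - q_euler) / tau          # forward Euler
    q_exact = decay * q_exact + (1.0 - decay) * q_inf  # exact iterated form

q_true = q_inf * (1.0 - math.exp(-steps * dt / tau))   # analytic solution
print(abs(q_exact - q_true))   # round-off only, ~1e-16
print(abs(q_euler - q_true))   # truncation error, ~1e-2 at this dt
```

Shrinking dt reduces the Euler error but raises the CPU cost, while the exact iterated form is step-size independent, which is why the fast algorithm inherits its accuracy.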
Figure 4: Characteristic temporal evolution of the total synaptic conductance in the case of short-term depression. The simulation refers to the activity of 1000 independent synaptic sites, activated randomly by Poisson point processes with the same mean frequency, synchronously increasing from 20 to 80 Hz (see Tsodyks, Pawelzik, & Markram, 1998), with \tau_F set very small; the fast algorithm reduced the total CPU time by 79%.
involving the refractoriness of the exocytotic processes and a slower depletion of releasable vesicles. Recently, a simplified model has been proposed that describes components of facilitating and depressing synaptic responses (Abbott et al., 1997; Abbott & Marder, 1998; Chance et al., 1998). It consists of a multiplicative composition of variables, each accounting for a different aspect of the response to a train of presynaptic spikes. For instance, the following choice proved appropriate for fitting experimental traces from excitatory synapses in layer 2/3 of rat primary visual cortex (Varela et al., 1997):

g(t) = \bar{g} \cdot [F \cdot D_1 \cdot D_2 \cdot D_3 \cdot q].   (4.1)
Equation 4.1 includes one component for facilitation, F, and three for depression, D_1, D_2, D_3, acting on different timescales. In the case of a large number of synaptic conductances of this kind, the total current would be given by the expression:

I_{syn} = \left[ \sum_{i=1}^{N} \bar{g}^{(i)} \cdot F^{(i)} \cdot D_1^{(i)} \cdot D_2^{(i)} \cdot D_3^{(i)} \cdot q^{(i)} \right] \cdot (V_m - E_{syn}).   (4.2)
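A minimal sketch of this multiplicative bookkeeping for a single synapse follows. The per-spike jump and scaling rules are in the spirit of Varela et al. (1997), but the numerical parameter values are made-up illustrations, not the fitted ones.

```python
import math

# Sketch of the multiplicative model of eq. 4.1 (parameters illustrative).
# Between spikes F and each D_i relax exponentially back to 1; at a spike F
# jumps up and each D_i is scaled down, so g*F*D1*D2*D3 facilitates and
# depresses on several timescales at once.
g_max, f_jump = 1.0, 0.2
d_scale = (0.5, 0.8, 0.95)                    # per-spike depression factors
tau_F, tau_D = 50.0, (100.0, 300.0, 1000.0)   # recovery time constants (ms)

F, D = 1.0, [1.0, 1.0, 1.0]
last = 0.0
amplitudes = []
for t_spike in range(0, 500, 25):             # regular 40 Hz spike train
    dt = t_spike - last
    F = 1.0 + (F - 1.0) * math.exp(-dt / tau_F)
    D = [1.0 + (d - 1.0) * math.exp(-dt / td) for d, td in zip(D, tau_D)]
    amplitudes.append(g_max * F * D[0] * D[1] * D[2])
    F += f_jump                               # facilitation jump
    D = [d * s for d, s in zip(D, d_scale)]   # depression scaling
    last = t_spike

# with these assumed parameters, depression dominates at 40 Hz:
print(amplitudes[0] > amplitudes[-1])
```

Summing such per-synapse products over i, weighted by \bar{g}^{(i)}, gives the bracketed term of equation 4.2; the consolidation of the following section avoids that explicit sum.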
Equation 4.1 can be considered a particular case of a more general multiplicative composition of variables, each a component of a different state vector, with each state vector evolving according to a distinct piece-wise time-invariant linear differential system. For simplicity, we now focus on the generic multiplicative composition of three variables, regarding the case of N identical systems and deriving the general differential equations for each of them (as in equation 2.20):

q^{(1)}, \ldots, q^{(i)}, \ldots, q^{(N)} \qquad \dot{q}^{(i)} = Q^{(i)} q^{(i)} \qquad \dim(q) = n_q
p^{(1)}, \ldots, p^{(i)}, \ldots, p^{(N)} \qquad \dot{p}^{(i)} = P^{(i)} p^{(i)} \qquad \dim(p) = n_p
s^{(1)}, \ldots, s^{(i)}, \ldots, s^{(N)} \qquad \dot{s}^{(i)} = S^{(i)} s^{(i)} \qquad \dim(s) = n_s   (4.3)
Assuming the usual piece-wise time invariance for the transition matrices, we rewrite equation 2.21:
q^{(k)}(t) \leftarrow e^{(t - t_{OLD}^{(k)}) W_m} \, q^{(k)}(t_{OLD})
w_m \leftarrow w_m - q^{(k)}(t) \qquad N_m \leftarrow N_m - 1
w_n \leftarrow w_n + q^{(k)}(t) \qquad N_n \leftarrow N_n + 1
t_{OLD}^{(k)} \leftarrow t

Part 2

FOR m = 1, \ldots, M: \quad w_m(t + \Delta t) \leftarrow U_m \, w_m(t)
Appendix B: Dynamic Synaptic Transmission
As done in equation 2.16, the exponentials of the matrices Q, P, and S can be computed off-line and referred to as

\tilde{W}_{m_q} = e^{\Delta t \, W_{m_q}} \qquad m_q = 1, \ldots, M_q   (B.1)
\tilde{V}_{m_p} = e^{\Delta t \, W_{m_p}} \qquad m_p = 1, \ldots, M_p   (B.2)
\tilde{X}_{m_s} = e^{\Delta t \, W_{m_s}} \qquad m_s = 1, \ldots, M_s   (B.3)
In a similar way to what was discussed for equation 2.22, we define an index i^*(m_q, m_p, m_s), running from 1 to N(m_q, m_p, m_s), identifying the subset of systems sharing the same combination \langle Q, P, S \rangle at any time t. We generalize the formulation given in equation 4.2 by defining e_{jhk} as the sum, over all the N systems, of the multiplicative composition of (e.g.) the jth component of the state vector q, the hth component of the state vector p, and the kth component of the state vector s, and by equivalently expressing it as a sum over the items of the same partition. We first define

e_{jhk}(m_q, m_p, m_s) = \sum_{i^*(m_q, m_p, m_s) = 1}^{N(m_q, m_p, m_s)} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)}   (B.4)
as the sum of the multiplicative composition of the jth component of q, the hth component of p, and the kth component of s over all the systems sharing \langle Q, P, S \rangle = \langle W_{m_q}, W_{m_p}, W_{m_s} \rangle. We thus note that

e_{jhk} = \sum_{i=1}^{N} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)} = \sum_{m_q=1}^{M_q} \sum_{m_p=1}^{M_p} \sum_{m_s=1}^{M_s} e_{jhk}(m_q, m_p, m_s)   (B.5)
and, thanks to the linearity of equations 4.3, we conclude that it is possible to express e_{jhk} iteratively in terms of the (M_q \cdot M_p \cdot M_s) quantities e_{jhk}(m_q, m_p, m_s):

e_{jhk}\big|_{t+\Delta t} = \sum_{m_q=1}^{M_q} \sum_{m_p=1}^{M_p} \sum_{m_s=1}^{M_s} \left\{ \sum_{y=1}^{n_q} \sum_{r=1}^{n_p} \sum_{u=1}^{n_s} [\tilde{W}_{m_q}]_{jy} \, [\tilde{V}_{m_p}]_{hr} \, [\tilde{X}_{m_s}]_{ku} \times e_{yru}(m_q, m_p, m_s)\big|_t \right\}.   (B.6)
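The equivalence between the consolidated update (B.6) and the per-system update (B.8) can be checked numerically for a single class (M_q = M_p = M_s = 1). In the sketch below, all matrix sizes, entries, and the ensemble size are arbitrary illustrations.

```python
import random

# The tensor of consolidated sums e[j][h][k] = sum_i q_j * p_h * s_k can be
# propagated with one triple contraction (eq. B.6) instead of N per-system
# matrix-vector updates followed by re-accumulation (eq. B.8).
nq, np_, ns, N = 2, 2, 2, 50
rnd = random.Random(0)
A = [[rnd.uniform(-1, 1) for _ in range(nq)] for _ in range(nq)]
B = [[rnd.uniform(-1, 1) for _ in range(np_)] for _ in range(np_)]
C = [[rnd.uniform(-1, 1) for _ in range(ns)] for _ in range(ns)]
q = [[rnd.random() for _ in range(nq)] for _ in range(N)]
p = [[rnd.random() for _ in range(np_)] for _ in range(N)]
s = [[rnd.random() for _ in range(ns)] for _ in range(N)]

def tensor(q, p, s):
    return [[[sum(q[i][j] * p[i][h] * s[i][k] for i in range(N))
              for k in range(ns)] for h in range(np_)] for j in range(nq)]

e = tensor(q, p, s)
# fast update: contract e with the three propagators (eq. B.6)
e_new = [[[sum(A[j][y] * B[h][r] * C[k][u] * e[y][r][u]
               for y in range(nq) for r in range(np_) for u in range(ns))
           for k in range(ns)] for h in range(np_)] for j in range(nq)]
# slow update: advance every system, then re-accumulate (eq. B.8)
mv = lambda M, v: [sum(M[a][b] * v[b] for b in range(len(v))) for a in range(len(M))]
q = [mv(A, qi) for qi in q]; p = [mv(B, pi) for pi in p]; s = [mv(C, si) for si in s]
e_ref = tensor(q, p, s)

ok = all(abs(e_new[j][h][k] - e_ref[j][h][k]) < 1e-9
         for j in range(nq) for h in range(np_) for k in range(ns))
print(ok)
```

The contraction costs (n_q n_p n_s)^2 multiplications per class, independent of N, which is where the speed-up condition B.7 comes from.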
Equation B.6 constitutes the main stage of the general optimization technique, and it proves that such a fast algorithm equivalently replaces the
normal procedure, given by equation B.8, and that it leads to a consistent speed-up as soon as

N \gg \frac{5 \cdot \Delta_{prod} + \Delta_{sum}}{3 \cdot \Delta_{prod} + \Delta_{sum}} \cdot (M_q \cdot M_p \cdot M_s)   (B.7)

\sum_{i=1}^{N} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)}\Big|_{t+\Delta t} = \sum_{i=1}^{N} \left\{ \sum_{y=1}^{n_q} \sum_{r=1}^{n_p} \sum_{u=1}^{n_s} [\tilde{W}_{m_q}]_{jy} \, [\tilde{V}_{m_p}]_{hr} \, [\tilde{X}_{m_s}]_{ku} \times q_y^{(i)} \cdot p_r^{(i)} \cdot s_u^{(i)}\Big|_t \right\}.   (B.8)
Concerning the specific example discussed in section 4.2, in the case of the fast calculation of short-term depressing and facilitating synaptic conductances, a few consolidated new state variables (i.e., e_{yru}(ON) and e_{yru}(OFF)) have been defined, as reported in Table 2. In order to calculate equation 4.15 quickly, at each time step it is now required to compute the following linear iterated expressions, independent of N:

w_{ON}(t+\Delta t) = A_{ON1} \cdot w_{ON}(t) + A_{ON2} \cdot C_{ON}(t) + A_{ON3} \cdot V_{ON}(t) + A_{ON4} \cdot G_{ON}(t) + B_1 \cdot \lambda(t) + B_2 \cdot \mathcal{D}(t) + B_3 \cdot \Im(t) + B_4 \cdot S(t)   (B.9)

w_{OFF}(t+\Delta t) = A_{OFF1} \cdot w_{OFF}(t) + A_{OFF2} \cdot C_{OFF}(t) + A_{OFF3} \cdot V_{OFF}(t) + A_{OFF4} \cdot G_{OFF}(t)   (B.10)

G_{ON}(t+\Delta t) = a_{ON} \cdot G_{ON}(t) + b \cdot S(t)
G_{OFF}(t+\Delta t) = a_{OFF} \cdot G_{OFF}(t)
C_{ON}(t+\Delta t) = a_{ON} \cdot c \cdot C_{ON}(t) + a_{ON} \cdot (1-c) \cdot G_{ON}(t) + b \cdot c \cdot \mathcal{D}(t) + b \cdot (1-c) \cdot S(t)
C_{OFF}(t+\Delta t) = a_{OFF} \cdot c \cdot C_{OFF}(t) + a_{OFF} \cdot (1-c) \cdot G_{OFF}(t)
V_{ON}(t+\Delta t) = a_{ON} \cdot h \cdot V_{ON}(t) + a_{ON} \cdot (1-h) \cdot G_{ON}(t) + b \cdot h \cdot \Im(t) + b \cdot (1-h) \cdot S(t)
V_{OFF}(t+\Delta t) = a_{OFF} \cdot h \cdot V_{OFF}(t) + a_{OFF} \cdot (1-h) \cdot G_{OFF}(t)
\mathcal{D}(t+\Delta t) = c \cdot \mathcal{D}(t) + (1-c) \cdot S(t)
\Im(t+\Delta t) = h \cdot \Im(t) + (1-h) \cdot S(t)
\lambda(t+\Delta t) = c \cdot h \cdot \lambda(t) + c \cdot (1-h) \cdot \mathcal{D}(t) + h \cdot (1-c) \cdot \Im(t) + (1-c) \cdot (1-h) \cdot S(t)
When the kth synapse, previously inactive, changes state from OFF to ON at time t, its state variables are first relaxed over the silent interval [t_0^{(k)}, t],

q^{(k)} \leftarrow e^{-\beta \, [t - (t_0^{(k)} + C_{dur})]} \cdot q^{(k)}
D^{(k)} \leftarrow e^{-(t - t_0^{(k)})/\tau_D} \cdot D^{(k)} + (1 - e^{-(t - t_0^{(k)})/\tau_D})
F^{(k)} \leftarrow e^{-(t - t_0^{(k)})/\tau_F} \cdot F^{(k)} + (1 - e^{-(t - t_0^{(k)})/\tau_F})

its contributions are moved from the OFF-class to the ON-class consolidated variables,

S \leftarrow S + \bar{g}^{(k)}
G_{OFF} \leftarrow G_{OFF} - \bar{g}^{(k)} \cdot q^{(k)} \qquad G_{ON} \leftarrow G_{ON} + \bar{g}^{(k)} \cdot q^{(k)}
C_{OFF} \leftarrow C_{OFF} - \bar{g}^{(k)} \cdot D^{(k)} \cdot q^{(k)} \qquad C_{ON} \leftarrow C_{ON} + \bar{g}^{(k)} \cdot D^{(k)} \cdot q^{(k)}
V_{OFF} \leftarrow V_{OFF} - \bar{g}^{(k)} \cdot F^{(k)} \cdot q^{(k)} \qquad V_{ON} \leftarrow V_{ON} + \bar{g}^{(k)} \cdot F^{(k)} \cdot q^{(k)}
w_{OFF} \leftarrow w_{OFF} - \bar{g}^{(k)} \cdot F^{(k)} \cdot D^{(k)} \cdot q^{(k)} \qquad w_{ON} \leftarrow w_{ON} + \bar{g}^{(k)} \cdot F^{(k)} \cdot D^{(k)} \cdot q^{(k)}
\mathcal{D} \leftarrow \mathcal{D} + \bar{g}^{(k)} \cdot D^{(k)} \qquad \Im \leftarrow \Im + \bar{g}^{(k)} \cdot F^{(k)} \qquad \lambda \leftarrow \lambda + \bar{g}^{(k)} \cdot F^{(k)} \cdot D^{(k)}

and finally the spike-triggered jumps are applied and the event time is reset:

D^{(k)} \leftarrow d^{(k)} \cdot D^{(k)} \qquad F^{(k)} \leftarrow f^{(k)} + F^{(k)} \qquad t_0^{(k)} \leftarrow t.

When the kth synapse, already active, turns on again at time t, the same steps are executed, except that q is relaxed according to

q^{(k)} \leftarrow e^{-(t - t_0^{(k)})/\tau} \cdot q^{(k)} + q_\infty \cdot (1 - e^{-(t - t_0^{(k)})/\tau}),

S is left unchanged, and the stale contributions are subtracted from, and the refreshed contributions added back to, the ON-class variables w_{ON}, C_{ON}, V_{ON}, G_{ON}, \mathcal{D}, \Im, and \lambda.
When the jth synapse turns OFF at time t, its state variables are relaxed over the interval [t_0^{(j)}, t],

q^{(j)} \leftarrow e^{-(t - t_0^{(j)})/\tau} \cdot q^{(j)} + q_\infty \cdot (1 - e^{-(t - t_0^{(j)})/\tau})
D^{(j)} \leftarrow e^{-(t - t_0^{(j)})/\tau_D} \cdot D^{(j)} + (1 - e^{-(t - t_0^{(j)})/\tau_D})
F^{(j)} \leftarrow e^{-(t - t_0^{(j)})/\tau_F} \cdot F^{(j)} + (1 - e^{-(t - t_0^{(j)})/\tau_F})

and its contributions are moved from the ON-class to the OFF-class consolidated variables:

S \leftarrow S - \bar{g}^{(j)}
G_{ON} \leftarrow G_{ON} - \bar{g}^{(j)} \cdot q^{(j)} \qquad G_{OFF} \leftarrow G_{OFF} + \bar{g}^{(j)} \cdot q^{(j)}
C_{ON} \leftarrow C_{ON} - \bar{g}^{(j)} \cdot D^{(j)} \cdot q^{(j)} \qquad C_{OFF} \leftarrow C_{OFF} + \bar{g}^{(j)} \cdot D^{(j)} \cdot q^{(j)}
V_{ON} \leftarrow V_{ON} - \bar{g}^{(j)} \cdot F^{(j)} \cdot q^{(j)} \qquad V_{OFF} \leftarrow V_{OFF} + \bar{g}^{(j)} \cdot F^{(j)} \cdot q^{(j)}
w_{ON} \leftarrow w_{ON} - \bar{g}^{(j)} \cdot F^{(j)} \cdot D^{(j)} \cdot q^{(j)} \qquad w_{OFF} \leftarrow w_{OFF} + \bar{g}^{(j)} \cdot F^{(j)} \cdot D^{(j)} \cdot q^{(j)}
\mathcal{D} \leftarrow \mathcal{D} - \bar{g}^{(j)} \cdot D^{(j)} \qquad \Im \leftarrow \Im - \bar{g}^{(j)} \cdot F^{(j)} \qquad \lambda \leftarrow \lambda - \bar{g}^{(j)} \cdot F^{(j)} \cdot D^{(j)}
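This event-driven bookkeeping can be sketched for a stripped-down model that tracks only q and the sums G_ON and G_OFF; the structure and names below are illustrative, not the paper's code.

```python
import math

# Sketch of the appendix bookkeeping: when a synapse changes class, its
# private state is relaxed over the elapsed interval and its contribution is
# moved between consolidated sums, so that G_ON + G_OFF always equals
# sum_k g_bar * q_k.  All values are illustrative.
tau_off, g_bar = 3.0, 0.7
syn = [{"q": 0.5, "on": (k % 2 == 0), "t0": 0.0} for k in range(4)]

G_on = sum(g_bar * s["q"] for s in syn if s["on"])
G_off = sum(g_bar * s["q"] for s in syn if not s["on"])

def turn_on(k, t):
    global G_on, G_off
    s = syn[k]
    q_old = s["q"]
    s["q"] = q_old * math.exp(-(t - s["t0"]) / tau_off)  # relax over [t0, t]
    G_off -= g_bar * q_old          # remove stale OFF contribution
    G_on += g_bar * s["q"]          # add refreshed ON contribution
    s["on"], s["t0"] = True, t

turn_on(1, t=2.0)
total = sum(g_bar * s["q"] for s in syn)
print(abs((G_on + G_off) - total) < 1e-12)
```

The full model additionally maintains the products involving F and D (the sums w, C, V, D, I, and lambda of the appendix), but the transfer pattern is the same: subtract the stale term, relax, add the refreshed term.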
Acknowledgments
I thank A. Destexhe for comments on an earlier version of this article, as well as W. Rocchia for useful discussions and the referees for their helpful comments and suggestions. I am especially grateful to M. Grattarola, M. Bove, and J. Lowenstein for critically reading the manuscript. This work was supported by the University of Genoa (Italy).

References

Abbott, L. F., & Marder, E. (1998). Modeling small networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modelling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press.
Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing, and neuronal response variability. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Anderson, C. R., & Stevens, C. F. (1973). Voltage-clamp analysis of acetylcholine-produced end-plate current fluctuations at frog neuromuscular junction. J. Physiol. (London), 235, 655–691.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1998). Synaptic depression and the temporal response characteristics of V1 simple cells. J. Neurosci., 18, 4785–4799.
Colquhoun, D., & Hawkes, A. G. (1983). The principles of the stochastic interpretation of ion-channel mechanisms. In B. Sakmann & E. Neher (Eds.), Single-channel recording (pp. 135–175). New York: Plenum Press.
Destexhe, A. (1997). Conductance-based integrate-and-fire models. Neural Comput., 9, 503–514.
Destexhe, A., Contreras, D., & Steriade, M. (1998). Mechanisms underlying the synchronizing action of corticothalamic feedback through inhibition of thalamic relay cells. J. Neurophys., 79, 999–1016.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp., 6, 14–18.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994b). Synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci., 1, 195–230.
Finkel, L. H., & Edelman, G. M. (1987). Population rules for synapses in networks. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function (pp. 711–757). New York: Wiley.
Gantmacher, F. R. (1959). The theory of matrices. New York: Chelsea Publishing Company.
Giugliano, M., Bove, M., & Grattarola, M. (1999). Fast calculation of short-term depressing synaptic conductances. Neural Comp., 11, 1413–1426.
Grattarola, M., & Massobrio, G. (1998). Bioelectronics handbook: MOSFETs, biosensors, neurons. New York: McGraw-Hill.
Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer.
Howard, R. A. (1971). Dynamic probabilistic systems. New York: Wiley.
Kirsch, J., & Betz, H. (1998). Glycine-receptor activation is required for receptor clustering in spinal neurons. Nature, 392.
Koch, C., & Segev, I. (Eds.). (1998). Methods in neuronal modelling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press.
Köhn, J., & Wörgötter, F. (1998). Employing the Z-transform to optimize the calculation of the synaptic conductance of NMDA and other synaptic channels in network simulations. Neural Comp., 10, 1639–1651.
Liao, D., Hessler, N. A., & Malinow, R. (1995). Activation of postsynaptically silent synapses during pairing-induced LTP in CA1 region of hippocampal slice. Nature, 375, 400–404.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comp., 8, 501–509.
Magleby, K. L., & Zengel, J. E. (1975). A quantitative description of stimulation-induced changes in transmitter release at the frog neuromuscular junction. J. Gen. Physiol., 80, 613–638.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Oppenheim, A. V., & Shaffer, R. (1975). Digital signal processing. London: Prentice Hall International.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill.
Paré, D., Shink, E., Gondreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. J. Neurophys., 79, 1450–1460.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1996). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Sakmann, B., & Neher, E. (1983). Single-channel recording. New York: Plenum Press.
Sen, K., Jorge-Rivera, J. C., Marder, E., & Abbott, L. F. (1996). Decoding synapses. J. Neurosci., 16, 6307–6318.
Senn, W., Tsodyks, M., & Markram, H. (1997). An algorithm for synaptic modification based on exact timing of pre- and post-synaptic action potentials. In W. Gerstner, A. Germond, M. Hasler, & J. D. Nicoud (Eds.), Artificial neural networks (pp. 121–126). Lausanne: ICANN.
Shepherd, G. M. (1988). Neurobiology (2nd ed.). Oxford: Oxford University Press.
Standley, C., Norris, T. M., Ramsey, R. L., & Usherwood, P. N. R. (1993). Gating kinetics of the quisqualate-sensitive glutamate receptor of locust muscle studied using agonist concentration jumps and computer simulations. Biophys. J., 65, 1379–1386.
Srinivasan, R., & Chiel, H. J. (1993). Fast calculation of synaptic conductances. Neural Comp., 5, 200–204.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Varela, J., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Walmsley, B., Alvarez, F. J., & Fyffe, R. E. W. (1998). Diversity of structure and function at mammalian central synapses. Trends Neurosci., 21, 81–88.

Received April 2, 1999; accepted May 26, 1999.
LETTER
Communicated by C. Lee Giles
Hierarchical Bayesian Models for Regularization in Sequential Learning

J. F. G. de Freitas
M. Niranjan
A. H. Gee
Cambridge University Engineering Department, Cambridge CB2 1PZ, U.K.

We show that a hierarchical Bayesian modeling approach allows us to perform regularization in sequential learning. We identify three inference levels within this hierarchy: model selection, parameter estimation, and noise estimation. In environments where data arrive sequentially, techniques such as cross-validation to achieve regularization or model selection are not possible. The Bayesian approach, with extended Kalman filtering at the parameter estimation level, allows for regularization within a minimum variance framework. A multilayer perceptron is used to generate the extended Kalman filter's nonlinear measurement mapping. We describe several algorithms at the noise estimation level that allow us to implement on-line regularization. We also show the theoretical links between adaptive noise estimation in extended Kalman filtering, multiple adaptive learning rates, and multiple smoothing regularization coefficients.
1 Introduction
Sequential training of neural networks is important in applications where data sequences either exhibit nonstationary behavior or are difficult and expensive to obtain before the training process. Scenarios where this type of sequence arises include tracking and surveillance, control systems, fault detection, signal processing, communications, econometric systems, demographic systems, geophysical problems, operations research, and automatic navigation. Although there has been great interest in the topic of regularization in batch learning tasks, it has not received much attention in sequential learning tasks. In this article, we adopt a hierarchical Bayesian framework in conjunction with dynamical state-space models to derive regularization algorithms for sequential estimation. These algorithms are based on estimating the noise processes on-line, while estimating the network weights with the extended Kalman filter (EKF). In addition, we show that this methodology provides a unifying theoretical framework for many sequential estimation algorithms that attempt to avoid local minima in the error function. In particular, we show that adaptive noise covariances in extended Kalman filtering, multiple adaptive learning rates in on-line backpropagation, and multiple smoothing regularization coefficients are mathematically equivalent.

Section 2 describes the sequential learning task using state-space models and a three-level hierarchical Bayesian structure. The three levels of inference correspond to noise estimation, parameter estimation, and model selection. In section 3, we propose a solution to the parameter estimation level based on the application of the EKF to neural networks. Section 4 is devoted to the noise estimation level and regularization. Finally, we present our experiments in section 5 and point out several areas for further research in section 6.

Neural Computation 12, 933–953 (2000)   © 2000 Massachusetts Institute of Technology

2 Dynamical Hierarchical Bayesian Models
We address the problem of training neural networks sequentially within the following dynamical state-space framework:

w_{k+1} = w_k + d_k   (2.1)

y_k = g_k(w_k, x_k) + v_k,   (2.2)
where k (k = 1, \ldots, L) denotes the discrete time index and y_k denotes the output measurements of the system.

The probability P(\theta > \pi, t \mid \hat{\theta}_s, 0) that at a time t, \theta is greater than \pi is given by

P(\theta > \pi, t \mid \hat{\theta}_s, 0) = \int_{\pi}^{2\pi} p(\theta, t \mid \hat{\theta}_s, 0) \, d\theta,   (4.15)
which is equal to the probability of generating an action potential. By performing an ensemble average, we may equate this quantity with the mean number of spikes per slow-wave cycle. The conditional probability density obeys a Smoluchowski equation (Risken, 1989),

\frac{\partial}{\partial t} p(\theta, t \mid \hat{\theta}_s, 0) = \frac{\partial}{\partial \theta} \left[ U'(\theta, t) + \frac{D}{2} \frac{\partial}{\partial \theta} \right] p(\theta, t \mid \hat{\theta}_s, 0),   (4.16)

where U'(\theta, t) represents the spatial derivative of the potential. The time dependence of the potential forbids a general closed solution to equation 4.16 and furthermore makes a numerical solution difficult to obtain. However,
Model of Mammalian Cold Receptors
1087
a Monte Carlo approximation to the probability density p(\theta, t \mid \hat{\theta}_s, 0) may be obtained. We first numerically iterate an ensemble of N receptors \theta_i(t), i = 1, \ldots, N, for half of one slow-wave period, subject to the initial condition \theta_i(t = 0) = \hat{\theta}_s \equiv \cos^{-1}(-\lambda_{min}) for all i. An approximation to p(\theta, t \mid \hat{\theta}_s, 0) is then given by a normalized histogram of the ensemble of \theta_i(t = \pi / \Omega), and so an estimate of the firing rate may be found from equation 4.15 (see Figure 5).

5 Discussion
We have developed a phase model of slow-wave parabolic bursting cold receptors that qualitatively reproduces the observed firing patterns at different constant temperatures. The stochastic extension of this model accounts for the variability in the number of spikes per burst and for the skipping patterns. Our analysis in terms of a Strutt map allows a description of the regions of parameter space that produce specific firing patterns, in both the deterministic and noise-driven regimes. It is worthwhile to make a few points regarding our model.

5.1 Paths for Other Receptors. There are many different thermally responsive bursting cells (Braun et al., 1984), for example, the feline lingual and infraorbital nerves, the boa constrictor warm fiber, and the dogfish ampullae of Lorenzini. The discharge patterns of all of these cell types exhibit many similar qualitative features, but quantitatively they differ; for example, they have differing burst lengths at a given temperature. In addition, there can also be considerable variation within a single cell type (recall section 2). Therefore, the paradigm of a temperature-dependent noisy trajectory through the Strutt map allows a universal model that might help to explain the discharge patterns of more of these cells.

5.2 Chaos. The existence of chaos in thermoresponsive neuronal spike trains has recently been studied in both real (Braun et al., 1997) and model (Longtin & Hinzer, 1996) neurons. However, the phase model reported here does not support chaos; instead, its spike-train irregularities have a stochastic origin. Is this important? For this class of neurons at least, the answer is "probably not," since it is more likely that the bursting pattern itself is the fundamental carrier of information rather than the timing of individual spikes within a burst. Such patterns may be more reliably detected by higher neurons due to synaptic facilitation.

5.3 Additive Versus Multiplicative Noise.
We introduced thermal and pump noise as a simple additive term on the \theta dynamics. However, there are other ways in which noise might enter the dynamical equations. Consider instead an additive term, \xi_s(t), on the slow dynamics (see also Gutkin &
Ermentrout, 1998), so that equation 3.5 becomes

\frac{d\theta}{dt} = [b + \cos\theta] - [1 - \cos\theta]\,[A \cos(\Omega t) + \xi_s(t)] = F(\theta, t) - [1 - \cos\theta]\,\xi_s(t).   (5.1)
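A noisy phase equation of this form can be simulated with a simple Euler–Maruyama scheme. The sketch below is not the authors' code; the parameter values, noise intensity, and initial condition are illustrative assumptions.

```python
import math
import random

# Euler-Maruyama sketch of a noisy phase equation in the form of eq. 5.1,
# with the noise entering through the multiplicative factor [1 - cos(theta)].
# All parameter values are illustrative, not fitted.
b, A, Omega, D_noise, dt = 0.6, 0.5, 0.05, 0.01, 0.01
rng = random.Random(1)

theta, t, spikes = -math.acos(-0.9), 0.0, 0
steps = int(2 * math.pi / Omega / dt)         # one slow-wave period
for _ in range(steps):
    drift = (b + math.cos(theta)) - (1 - math.cos(theta)) * A * math.cos(Omega * t)
    noise = math.sqrt(2 * D_noise * dt) * rng.gauss(0, 1)
    theta_new = theta + drift * dt - (1 - math.cos(theta)) * noise
    if theta < math.pi <= theta_new:           # a full rotation is a spike
        spikes += 1
        theta_new -= 2 * math.pi
    theta = theta_new
    t += dt
print(spikes)  # spike count over one slow-wave cycle (seed-dependent)
```

Because the noise term is scaled by [1 - cos(theta)], its effect vanishes at theta = 0, which is exactly the point discussed next.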
\xi_s(t) now couples multiplicatively to the fast dynamics, and when \theta = 0, the effect of the noise vanishes. Since the noise cannot carry the system around the limit cycle, it might therefore appear that this sort of noise cannot contribute to the spiking dynamics. However, recall the gradient system (see equation 4.1) and Figure 6, and note that \theta = 0 is halfway down a fairly steep slope. At this time (the peak amplitude of a fast sodium spike) and for the noise levels considered here, the deterministic dynamics dominate and carry the cell around the limit cycle. The additive noise introduced in equation 3.11 would also have very little impact at this time.⁴ Further, following Sigeti (1988), we could replace [1 - \cos\theta]\,\xi_s(t) by a mean noise (Gutkin & Ermentrout, 1998), averaged over some spatial domain (say, between the stable and unstable fixed points), and still get quantitatively similar behavior.

5.4 Different Noise Distributions. In the absence of any electrophysiological motivation, we have introduced a white noise process. Recall that the discharge pattern derives from a random walk through the Strutt map and that spike augmentation and deletion arise when the random walk crosses a tongue boundary. In consequence, we note that other additive noise distributions (such as the Ornstein–Uhlenbeck process considered in Longtin & Hinzer, 1996) would produce qualitatively similar burst patterns, and so the actual noise distribution is not pertinent to an understanding of the general model. However, the choice of noise distribution will be extremely important when describing a specific burst pattern.

5.5 Asymmetric Burst Patterns and Parabolicity. In contrast to our phase model, discharge patterns from real cold receptors exhibit asymmetric burst patterns. Typical (Braun et al., 1980) burst profiles comprise a rapid increase followed by a slower decrease in spike frequency, resulting in a sawtooth profile.
However, parabolic cells generally exhibit burst profiles with firing rates that vary nonmonotonically throughout the burst. Thus, definitively labeling the discharge of these cells as parabolic is problematic. However, according to Rinzel and Lee (1987), the symmetry of the discharge of the Plant model can be manipulated by parameter modification. Furthermore,

⁴ For additive noise, there is in fact a finite probability for \theta to recross zero in the opposite direction. However, such an event is unlikely, and, moreover, simply crossing zero does not constitute a spike. A spike is a full rotation around the limit cycle, and the probability that this will happen is vanishingly small (at least for these noise levels).
Figure 8: A burst pattern from a sawtooth slow wave (the first few terms of the expansion A[1 - \sum_n \frac{1}{n} \sin(nx)]). Parameters are \Omega = 0.001, A = 0.5, and b = 0.6 (cf. Figure 1 of Braun et al., 1980). See also Ermentrout and Kopell (1986).
for our canonical model, the time-dependent term A cos(Ωt) models the oscillatory slow-wave dynamics. We have no electrophysiological guidelines to direct our choice for its functional form, and so we have chosen a cosine for its analytical tractability. However, the asymmetry of the discharge of real cells suggests that if the bursting is of slow-wave type, then the wave is asymmetric. Thus a more complex waveform may better reproduce the finer details of the discharge pattern. Figure 8 shows a single burst driven by a sawtooth slow wave, and the finer details of this pattern are in better agreement with data (compare with Figure 1 of Braun et al., 1980). Our analysis readily extends to slow waves of this form, except that now the Mathieu equation must be replaced by a more general Hill equation (Ermentrout & Kopell, 1986).

6 Conclusions
We have presented a tractable phase model for cold-receptor function. Our phase model can be related to more complex, biophysical models of neuronal operation, and we have shown how certain parameters of the phase model derive from Plant's ionic model. We have investigated the phase model, both numerically and analytically, in the deterministic regime and also when subject to a finite amount of thermal noise. Numerically obtained
1090
Peter Roper, Paul C. Bressloff, and André Longtin
spike trains and interspike interval histograms from the phase model agree well with the experimental data. From our investigations, we have shown that skipping might be caused by noise. We have further shown that both the number of spikes in a burst and the skipping rate at any given temperature may be predicted. Finally, we have studied how altering the noise level affects the dynamics and shown that the skipping regime may be subdivided. The first part of skipping is caused by noise-induced trapping, and the second by noise-induced spiking. Our findings provide a framework in which to study the effect of various physicochemical parameters in the environment of these receptors and analytical insight into the variability of patterns across receptors.

Appendix
Here we analyze the temperature dependence of the slow-wave frequency in a model of cold receptors based on the Plant model described in Longtin and Hinzer (1996). This model is five-dimensional and possesses a clear separation of timescales, as discussed in Rinzel and Lee (1987). The three fast variables are the voltage V and the gating variables h and n of the fast-spiking dynamics (sodium inactivation and potassium activation). The slow variables are the gating variable x of the slow inward current and the intracellular calcium concentration y. The dynamics of V are given by

$$C_m \frac{dV}{dt} = -I_{\rm ion}(V, h, n; x, y). \qquad (A.1)$$
The dynamics of x and y are given by

$$\frac{dx}{dt} = \lambda\,[x_\infty(V) - x]/\tau_x \qquad (A.2)$$

$$\frac{dy}{dt} = \rho\,K_c\left[x(V_{\rm Ca} - V) - y\right]. \qquad (A.3)$$
In the approximation where the full system can be separated into fast and slow subsystems, the slow dynamics are

$$\frac{dx}{dt} = \lambda\,[x_\infty(V_{ss}) - x]/\tau_x \qquad (A.4)$$

$$\frac{dy}{dt} = \rho\,K_c\left[x(V_{\rm Ca} - V_{ss}) - y\right], \qquad (A.5)$$
where V_ss is the pseudo-steady-state membrane voltage, obtained as a solution of

$$0 = I_{\rm ion}\bigl(V_{ss}, h_\infty(V_{ss}), n_\infty(V_{ss}); x, y\bigr). \qquad (A.6)$$
The important fact is that V_ss = V_ss(x, y). Hence, the (x, y) dynamics are coupled together by V_ss. Based on numerical computations of this surface in the absence of fast-spiking dynamics (Rinzel & Lee, 1987; see in particular Figure 2), we then approximate the pseudo-steady-state surface by the plane

$$V_{ss}(x, y) \approx V_o + a_1 x + a_2 y, \qquad (A.7)$$
where V_o is the mean value of the slow-wave potential, and the partial derivatives a_1 and a_2 are constants. These constants are negative according to Rinzel and Lee (1987; see Figure 2). Using this form for V_ss, we can write

$$x_\infty^{-1}(x, y) = \exp\left[0.15(-50 - V_o - a_1 x - a_2 y)\right] + 1. \qquad (A.8)$$

Introducing the coordinates X ≡ x − x* and Y ≡ y − y*, the linearization about the fixed point (x*, y*) of this two-dimensional subsystem is

$$\frac{dX}{dt} = a\lambda X + b\lambda Y \qquad (A.9)$$

$$\frac{dY}{dt} = c\rho X + d\rho Y, \qquad (A.10)$$
where we have isolated the temperature-dependent parameters λ and ρ from the other parameters:

$$a \equiv \left(\left.\frac{\partial x_\infty}{\partial x}\right|_{x^*,y^*} - 1\right)\tau_x^{-1} \qquad (A.11)$$

$$b \equiv \tau_x^{-1}\left.\frac{\partial x_\infty}{\partial y}\right|_{x^*,y^*} \qquad (A.12)$$

$$c \equiv K_c\left(V_{\rm Ca} - V_o - 2a_1 x^* - a_2 y^*\right) \qquad (A.13)$$

$$d \equiv -(K_c\,a_2 x^* + 1). \qquad (A.14)$$
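As a quick numerical check of the linearization in equations A.9 and A.10, one can verify that when aλ + dρ vanishes the eigenvalues become purely imaginary, with frequency √((ad − cb)ρλ) (the Hopf frequency derived next, equation A.15). The coefficient values below are illustrative placeholders with the signs stated in the text, not values computed from the Plant model:

```python
import cmath

def slow_eigenvalues(a, b, c, d, lam, rho):
    """Eigenvalues of the linearized slow subsystem
    dX/dt = a*lam*X + b*lam*Y, dY/dt = c*rho*X + d*rho*Y."""
    tr = a * lam + d * rho
    det = lam * rho * (a * d - b * c)
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

# Illustrative coefficients with a, b < 0 and c > 0 (hypothetical values).
a, b, c, d = -1.0, -0.5, 2.0, 0.2
lam = 0.3
rho = -a * lam / d          # chosen so that a*lam + d*rho = 0 (bifurcation point)
l1, l2 = slow_eigenvalues(a, b, c, d, lam, rho)
omega_H = ((a * d - c * b) * rho * lam) ** 0.5
```

At the bifurcation point the real parts vanish and |Im λ| equals ω_H, which scales as (ρλ)^(1/2).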
We note that x and y are positive quantities, as is c; a and b are negative, and the sign of d depends on the precise parameter values. If (aλ + ρd)² < 4ρλ(ad − cb), the eigenvalues are complex. At the bifurcation point, aλ + ρd = 0, and the frequency is given by

$$\omega_H = \sqrt{(ad - cb)\,\rho\lambda} \sim (\rho\lambda)^{1/2}, \qquad (A.15)$$

which is real since ad − cb > 0.

Acknowledgments
This work was supported by a Loughborough University postgraduate studentship, a traveling fellowship from the John Phillips Guest fund, and NSERC (Canada).
References

Apostolico, F., Gammaitoni, L., Marchesoni, F., & Santucci, S. (1997). Resonant trapping: A failure mechanism in switch transitions. Physical Review E, 55(1), 36–39.
Braun, H. A., Bade, H., & Hensel, H. (1980). Static and dynamic discharge patterns of bursting cold fibres related to hypothetical receptor mechanisms. Pflügers Archiv: European Journal of Physiology, 386, 1–9.
Braun, H. A., Huber, M. T., Dewald, M., Schäfer, K., & Voigt, K. (1998). Computer simulations of neuronal signal transduction: The role of nonlinear dynamics and noise. International Journal of Bifurcation and Chaos, 8(5), 881–889.
Braun, H. A., Schäfer, K., Voigt, K., Peters, R., Bretschneider, F., Pei, X., Wilkens, L., & Moss, F. (1997). Low-dimensional dynamics in sensory biology 1: Thermally sensitive electroreceptors of the catfish. Journal of Computational Neuroscience, 4, 335–347.
Braun, H. A., Schäfer, K., & Wissing, H. (1990). Theories and models of temperature transduction. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 18–29). Berlin: Springer-Verlag.
Braun, H. A., Schäfer, K., Wissing, H., & Hensel, H. (1984). Periodic transduction processes in thermosensitive receptors. In W. Hamman & A. Iggo (Eds.), Sensory receptor mechanisms (pp. 147–156). Singapore: World Scientific Publishing.
Braun, H. A., Wissing, H., Schäfer, K., & Hirsch, M. C. (1994). Oscillation and noise determine signal transduction in shark multimodal sensory cells. Nature, 367, 270–273.
Canavier, C. C., Clark, J. W., & Byrne, J. H. (1991). Simulation of the bursting activity of neuron R15 in Aplysia: Role of ionic currents, calcium balance, and modulatory transmitters. Journal of Neurophysiology, 66, 2107–2124.
Carpenter, D. O. (1967). Temperature effects on pacemaker generation, membrane potential, and critical firing threshold in Aplysia neurons. Journal of General Physiology, 50(6), 1469–1484.
Coddington, E. A., & Levinson, N. (1955). Theory of ordinary differential equations. New York: McGraw-Hill.
DeFelice, L. J. (1981). Introduction to membrane noise. New York: Plenum.
Dykes, R. W. (1975). Coding of steady and transient temperatures by cutaneous "cold" fibers serving the hand of monkeys. Brain Research, 98, 485–500.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM Journal on Applied Mathematics, 46(2), 233–253.
Gang, H., Ditzinger, T., Ning, C. Z., & Haken, H. (1993). Stochastic resonance without external periodic force. Physical Review Letters, 71(6), 807–810.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10, 1047–1065.
Ivanov, K. P. (1990). The location and function of different skin thermoreceptors. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 37–43). Berlin: Springer-Verlag.
Kopell, N., & Ermentrout, G. B. (1986). Subcellular oscillations and bursting. Mathematical Biosciences, 78, 265–291.
Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends in Neurosciences, 20(1), 38–43.
Longtin, A., & Hinzer, K. (1996). Encoding with bursting, subthreshold oscillations, and noise in mammalian cold receptors. Neural Computation, 8, 215–255.
McLachlan, N. W. (1947). Theory and application of Mathieu functions. Oxford: Clarendon Press.
Nayfeh, A. H., & Mook, D. T. (1995). Nonlinear oscillations. New York: Wiley.
Plant, R. E. (1981). Bifurcation and resonance in a model for bursting nerve cells. Journal of Mathematical Biology, 11, 15–32.
Rappel, W., & Strogatz, S. H. (1994). Stochastic resonance in an autonomous system with a nonuniform limit cycle. Physical Review E, 50(4), 3249–3250.
Rinzel, J., & Lee, Y. S. (1987). Dissection of a model for neuronal parabolic bursting. Journal of Mathematical Biology, 25, 653–675.
Risken, H. (1989). The Fokker-Planck equation (2nd ed.). New York: Springer-Verlag.
Schäfer, K., Braun, H. A., & Rempe, L. (1988). Classification of a calcium conductance in cold receptors. Progress in Brain Research, 74, 29–36.
Schäfer, K., Braun, H. A., & Rempe, L. (1990). Mechanisms of sensory transduction in cold receptors. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 30–36). Berlin: Springer-Verlag.
Sigeti, D. E. (1988). Universal results for the effects of noise on dynamical systems. Unpublished doctoral dissertation, University of Texas at Austin.
Soto-Treviño, C., Kopell, N., & Watson, D. (1996). Parabolic bursting revisited. Journal of Mathematical Biology, 35, 114–128.
Strogatz, S. H. (1994). Nonlinear dynamics and chaos. Reading, MA: Addison-Wesley.
Wiederhold, M. L., & Carpenter, D. O. (1982). Possible role of pacemaker mechanisms in sensory systems. In D. O. Carpenter (Ed.), Cellular pacemakers (Vol. 2, pp. 27–58). New York: Wiley-Interscience.
Willis, J. A., Gaubatz, G. L., & Carpenter, D. O. (1974). The role of the electrogenic pump in modulation of pacemaker discharge of Aplysia neurons. Journal of Cellular Physiology, 84, 463–471.

Received November 10, 1998; accepted June 14, 1999.
LETTER
Communicated by Bard Ermentrout
The Number of Synaptic Inputs and the Synchrony of Large, Sparse Neuronal Networks

D. Golomb
Zlotowski Center for Neuroscience and Department of Physiology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er-Sheva 84105, Israel

D. Hansel
Laboratoire de Neurophysique et de Physiologie du Système Moteur, EP 1848 CNRS, Université René Descartes, 45 rue des Saints-Pères, 75270 Paris Cedex 06, France

The prevalence of coherent oscillations in various frequency ranges in the central nervous system raises the question of the mechanisms that synchronize large populations of neurons. We study synchronization in models of large networks of spiking neurons with random sparse connectivity. Synchrony occurs only when the average number of synapses, M, that a cell receives is larger than a critical value, M_c. Below M_c, the system is in an asynchronous state. In the limit of weak coupling, assuming identical neurons, we reduce the model to a system of phase oscillators that are coupled via an effective interaction, Γ. In this framework, we develop an approximate theory for sparse networks of identical neurons to estimate M_c analytically from the Fourier coefficients of Γ. Our approach relies on the assumption that the dynamics of a neuron depend mainly on the number of cells that are presynaptic to it. We apply this theory to compute M_c for a model of inhibitory networks of integrate-and-fire (I&F) neurons as a function of the intrinsic neuronal properties (e.g., the refractory period T_r), the synaptic time constants, and the strength of the external stimulus, I_ext. The number M_c is found to be nonmonotonic with the strength of I_ext. For T_r = 0, we estimate the minimum value of M_c over all the parameters of the model to be 363.8. Above M_c, the neurons tend to fire in smeared one-cluster states at high firing rates and smeared two-or-more-cluster states at low firing rates. Refractoriness decreases M_c at intermediate and high firing rates.
These results are compared with numerical simulations. We show numerically that systems with different sizes, N, behave in the same way provided the connectivity, M, is such that 1/M_eff = 1/M − 1/N remains constant when N varies. This allows extrapolating the large-N behavior of a network from numerical simulations of networks of relatively small sizes (N = 800 in our case). We find that our theory predicts with remarkable accuracy the value of M_c and the patterns of synchrony above M_c, provided the synaptic coupling is not too large. We also study the strong coupling regime of inhibitory

Neural Computation 12, 1095–1139 (2000) © 2000 Massachusetts Institute of Technology
sparse networks. All of our simulations demonstrate that increasing the coupling strength reduces the level of synchrony of the neuronal activity. Above a critical coupling strength, the network activity is asynchronous. We point out a fundamental limitation for mechanisms of synchrony relying on inhibition alone if heterogeneities in the intrinsic properties of the neurons and spatial fluctuations in the external input are also taken into account.

1 Introduction

The ubiquity of coherent oscillations in the central nervous system (for a review, see Gray, 1994) raises fundamental biophysical issues concerning the role of intrinsic neuronal and synaptic characteristics in the emergence of coherent neuronal activity. In recent years, these issues have been addressed in various theoretical studies, focusing mostly on the simplest situation, in which the neurons are identical and coupled all-to-all (Gerstner, 1999; Gerstner, van Hemmen, & Cowan, 1996; Hansel, Mato, & Meunier, 1993a, 1993b, 1993c, 1995; van Vreeswijk, Abbott, & Ermentrout, 1994). These studies have demonstrated that the degree of synchrony that can emerge in a large neural system depends crucially on the nature, whether excitatory or inhibitory, of the synaptic interactions between the neurons. More recently, the effect of heterogeneities in neuron properties, still assuming all-to-all coupling, has been investigated (Chow, 1998; Chow, White, Ritt, & Kopell, 1998; Neltner, Hansel, Mato, & Meunier, 1999; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998). On the basis of these studies, it has been suggested that inhibition should play a crucial role in the emergence of synchrony in the central nervous system. A similar conclusion has been reached, in parallel with these theoretical investigations, in experimental studies of in vitro preparations.
Indeed, Whittington, Traub, and Jefferys (1995) demonstrated that an inhibitory population can fire in synchrony in the gamma band even if the excitatory population is blocked pharmacologically. This result supports the idea that pyramidal cells in vivo may be entrained by synchronous rhythmic inhibition originating from fast-spiking interneurons, as suggested as early as 1983 by Buzsáki, Leung, and Vanderwolf (1983) and by Buzsáki and Chrobak (1995) (see also the theoretical study by Lytton & Sejnowski, 1991). Most of the theoretical investigations of neural synchrony in large networks have focused on the all-to-all architecture. However, in cortex, each neuron is coupled to only a certain number of neurons, which is much smaller than the total number of neurons in the network. This property of the network architecture, referred to as sparseness or dilution, is expected to reduce, or even destroy, the synchrony in the network even if the corresponding network with all-to-all coupling is highly coherent. It is often assumed in models that the coupling between neurons is random. In these cases (Barkai, Kanter, & Sompolinsky, 1990; Wang, Golomb, & Rinzel, 1995), it was found
that when the all-to-all network is synchronized, synchrony will still prevail if the average number of inputs a cell receives is larger than a critical number. Wang and Buzsáki (1996) numerically studied a model of a sparse and heterogeneous network made of spiking neurons with inhibitory coupling. They claimed that synchrony could emerge if the ratio between the synaptic decay time constant and the oscillation period was sufficiently large and the heterogeneity level was modest. For their specific parameter set, they determined that the average number of inputs a cell should receive in order to get synchrony is on the order of 100. However, this number depends on the intrinsic properties of the cells as well as on the synaptic and network properties. In this article, we study models of large, sparse networks made of identical, intrinsically oscillating neurons with inhibitory coupling. Our goal is to find how the critical average number of synaptic inputs, M_c, above which synchrony emerges, and the patterns of synchrony in networks in which neurons receive more inputs than this critical number, depend on the intrinsic and synaptic properties of the neurons. Our strategy is to study the limit of weak synaptic coupling, in which the network dynamics can be reduced to the dynamics of coupled phase oscillators, which, to a large extent, can then be studied analytically. Using an approximate mean-field theory, we estimate the critical average number of inputs for the integrate-and-fire (I&F) neuron model. This result helps us to understand the network dynamics even beyond this limit, as obtained numerically. Simulations of sparse I&F networks are performed to verify the prediction of our approximate theory for finite (but not too strong) coupling, to find the patterns of synchrony in networks for which the asynchronous state is unstable, and to study the model behavior beyond the weak coupling limit.
In the next section we present a general framework for modeling diluted networks. Section 3 is devoted to the weak coupling case. The theory is applied to networks of I&F neurons in section 4, and simulations of such networks are described in section 5. Section 6 summarizes and discusses our findings.

2 The Model

2.1 Architecture of the Network. We are interested in the dynamics of one population of N neurons of the same type, all excitatory or all inhibitory, which are randomly connected. A neuron i (i = 1, …, N) makes one synapse on neuron j with probability M/N, where M < N. We define a connectivity matrix w_ij by

$$w_{ij} = \begin{cases} 1 & \text{if neuron } j \text{ is presynaptic to neuron } i \\ 0 & \text{otherwise.} \end{cases} \qquad (2.1)$$
In this architecture, the number of synaptic inputs fluctuates from neuron to neuron, with a population average M. The total synaptic input received by neuron i is

$$g^{\rm tot}_{{\rm syn},i} = \frac{g_{\rm syn}}{M}\sum_{j=1}^{N} w_{ij}\,s_j, \qquad (2.2)$$
where s_i is the synaptic variable of the ith neuron and g_syn does not depend on M. With this normalization, the system-average synaptic input that a cell receives when all of the synaptic channels are open (s_i = 1 for all i) is g_syn. The standard deviation of the neuron-to-neuron fluctuations is

$$g_{\rm syn}\sqrt{\frac{1}{M} - \frac{1}{N}}. \qquad (2.3)$$

This value is equal to g_syn/√M for N ≫ M. The spatial fluctuations in the synaptic input decrease as M increases, and we expect the system to be more synchronized. Analysis of diluted networks is often carried out in two behavioral regimes:

The ratio M/N remains constant as N → ∞. In this regime, input fluctuations go to zero at large N, and the behavior is similar to the case of all-to-all coupling (M = N) (Golomb, Wang, & Rinzel, 1994).
The number M remains constant as N → ∞. This is the sparse regime, and synchrony is expected to be reduced at small values of M in comparison with the case M = N. In previous work on different systems (Barkai et al., 1990; Wang et al., 1995; Wang & Buzsáki, 1996), it was suggested that the asynchronous state is stable if M is below a critical value, M_c. In this work we study the dynamical properties of sparse networks in the limit N → ∞.

2.2 Integrate-and-Fire Dynamics. In this article we mainly study networks of I&F neurons (Tuckwell, 1988; Abbott & van Vreeswijk, 1993; Bressloff & Coombes, 1998a, 1998b; Brunel & Hakim, 1999; Chow, 1998; Gerstner, 1999; Gerstner et al., 1996; Hansel et al., 1995; Kuramoto, 1991; Mirollo & Strogatz, 1990; Treves, 1993; Tsodyks, Mitkov, & Sompolinsky, 1993; Tsodyks & Sejnowski, 1995; van Vreeswijk et al., 1994). This enables us to investigate the synchronization properties of models without active ionic currents that generate spikes. In the following, we study the simplest I&F model (the Lapicque model), in which the membrane potential of a neuron satisfies the differential equation

$$\frac{dV}{dt} = -\frac{V}{\tau_0} + I_{\rm ext} + I_{\rm syn}(t) \qquad (2.4)$$
for 0 < V < θ, and

$$V(t_0^+) = 0 \quad \text{if} \quad V(t_0^-) = \theta. \qquad (2.5)$$
This last condition corresponds to a fast resetting of the neuron after the firing of a spike at time t_0; this firing occurs when the membrane potential reaches the spiking threshold θ; we choose θ = 10 mV above rest. The time constant of the membrane is τ_0 = 10 ms, and I_ext = Ĩ_ext/C, where Ĩ_ext represents the contribution of an external (e.g., sensory) stimulus to the system. The quantity C is the membrane capacitance. Note that the normalized parameter, I_ext, has units of mV/ms. Refractoriness is introduced into the model by imposing that V(t) remains equal to 0 for a refractory period, T_r, after the firing of a spike. The last term in equation 2.4 is the synaptic current received by the neuron, normalized to the value of C. For this model we adopt

$$I_{\rm syn}(t) = g_{\rm syn}\,s(t), \qquad (2.6)$$
where g_syn is the normalized synaptic current (in units of mV/ms) and s is a synaptic variable given by

$$s = \sum_{\rm spikes} f(t - t_{\rm spike}), \qquad (2.7)$$

where the summation is done over all of the spikes emitted prior to t by all of the neurons presynaptic to the neuron we consider. With this simplified form of synaptic coupling, excitation corresponds to g_syn > 0 and inhibition to g_syn < 0. The function f(t) is the extended α-function:

$$f(t) = \frac{\exp(-t/\tau_2) - \exp(-t/\tau_1)}{\tau_2 - \tau_1}. \qquad (2.8)$$
The characteristic times τ_1 and τ_2 are the rise and decay times of the synapse, respectively: τ_1 ≤ τ_2. If the neurons are not interacting (g_syn = 0), they emit spikes periodically for I_ext > θ/τ_0 with a firing period T_0 (firing frequency f_0 = 1/T_0):

$$T_0 = T_r - \tau_0 \ln\left(1 - \frac{\theta}{\tau_0 I_{\rm ext}}\right). \qquad (2.9)$$
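Equation 2.9 can be checked by directly integrating equation 2.4 with g_syn = 0. This is a minimal sketch (not the authors' code), using θ = 10 mV, τ_0 = 10 ms, and the value I_ext = 1.12 mV/ms quoted later in the Figure 1 caption:

```python
import math

def free_period(theta=10.0, tau0=10.0, Iext=1.12, Tr=0.0):
    """Free firing period T0 of the uncoupled I&F neuron (equation 2.9, in ms)."""
    return Tr - tau0 * math.log(1.0 - theta / (tau0 * Iext))

def simulated_period(theta=10.0, tau0=10.0, Iext=1.12, Tr=0.0, dt=1e-4):
    """Euler integration of dV/dt = -V/tau0 + Iext from reset (V = 0) to threshold."""
    V, t = 0.0, 0.0
    while V < theta:
        V += dt * (-V / tau0 + Iext)
        t += dt
    return Tr + t
```

For these values, T_0 ≈ 22.3 ms, that is, f_0 = 1/T_0 ≈ 44.8 Hz, matching the firing frequency quoted in the Figure 1 caption.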
3 The Weak Coupling Limit

3.1 Reduction to Phase Models. In general, the dynamical equations of large networks of spiking neurons cannot be solved analytically, and
1100
D. Golomb and D. Hansel
the study of synchronization in such networks relies on numerical computations. However, under the following three conditions, the state of each neuron can be described by a phase variable, φ_i, which characterizes, at time t, the location of neuron i along its limit cycle (Kuramoto, 1984):

1. The uncoupled neurons display a periodic behavior (limit cycle) when the external current is above some threshold.

2. Their firing rates all lie in a narrow range; that is, the heterogeneities in the neuronal intrinsic properties and in the external inputs are small.

3. The coupling is weak (and therefore it modifies the firing rate of the neurons only slightly).

The original system of equations for the N oscillators is replaced by a simpler set of N differential equations that governs the time evolution of the φ_i,

$$\frac{d\varphi_i}{dt} = 2\pi f_{0i} + \frac{1}{M}\sum_{j\neq i} w_{ij}\,\Gamma(\varphi_i - \varphi_j), \qquad (3.1)$$

where f_0i is the firing rate of neuron i when uncoupled. Here, Γ is the effective interaction between any two neurons:

$$\Gamma(\varphi_i - \varphi_j) = \frac{1}{2\pi}\int_0^{2\pi} Z(y + \varphi_i)\,I_{\rm syn}(y + \varphi_i,\, y + \varphi_j)\,dy. \qquad (3.2)$$
The function Z is the phase-resetting curve (Kuramoto, 1984; Hansel et al., 1993a) of the neuron in the limit of vanishingly small perturbations of the membrane potential, and I_syn(φ_i, φ_j) is the synaptic interaction, which is a function of the phases of the presynaptic and postsynaptic neurons. The function Γ depends on only the single-neuron dynamics. For simple I&F neurons it can be computed analytically. For conductance-based neuronal models, numerical methods have to be used, as described in Ermentrout and Kopell (1984), Kopell (1988), or Hansel et al. (1993a). Once this effective phase interaction is determined, it can be used to analyze networks of arbitrary complexity. Reduction to a phase model assumes that the coupling is small enough for amplitude effects to be neglected. The more stable the limit cycle is, the weaker this constraint will be, but it is difficult to derive quantitative a priori estimates of the validity of the phase reduction. However, predictions of phase models often remain valid, at least qualitatively, for moderate values of the coupling. This is the case for the examples studied below.
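The reduced dynamics of equation 3.1 are straightforward to integrate numerically. The sketch below uses an illustrative effective interaction Γ(φ) = −sin(φ + α) and illustrative parameters (it is not the interaction derived for the I&F model in section 4); note that the in-degree fluctuations of the random graph supply exactly the kind of disorder discussed in section 3.2:

```python
import math
import random

def simulate_phases(N=80, M=25, f0=0.04, alpha=-0.3, T=80.0, dt=0.25, seed=0):
    """Euler integration of dphi_i/dt = 2*pi*f0 + (1/M) sum_j w_ij Gamma(phi_i - phi_j)
    with Gamma(phi) = -sin(phi + alpha), on a random graph with connection
    probability M/N. Returns the order parameter |<exp(i*phi)>| (1 = full synchrony)."""
    rng = random.Random(seed)
    w = [[1 if (i != j and rng.random() < M / N) else 0 for j in range(N)]
         for i in range(N)]
    phi = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(N)]
    for _ in range(int(T / dt)):
        new = []
        for i in range(N):
            coupling = sum(-math.sin(phi[i] - phi[j] + alpha)
                           for j in range(N) if w[i][j])
            new.append(phi[i] + dt * (2.0 * math.pi * f0 + coupling / M))
        phi = new
    re = sum(math.cos(p) for p in phi) / N
    im = sum(math.sin(p) for p in phi) / N
    return math.hypot(re, im)
```

With |α| < π/2 the first Fourier mode is synchronizing, and for this connectivity the order parameter settles close to 1.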
3.2 Approximate Theory for Randomly and Weakly Coupled Neurons. The Fourier series representation of the Γ function is

$$\Gamma(\varphi) = \Gamma_0 + \tilde\Gamma(\varphi) = \Gamma_0 - |\Gamma_0|\sum_{n=1}^{\infty} c_n \sin(n\varphi + \alpha_n), \qquad (3.3)$$

where Γ_0 is the zeroth Fourier component of Γ(φ), Γ̃(φ) includes all the phasic components of Γ, c_n is the normalized amplitude of the nth Fourier component and is therefore always nonnegative, and α_n is the angle of that mode. This parameterization is chosen in order to be consistent with previous works on networks of phase oscillators (Kuramoto, 1984; Strogatz & Mirollo, 1991). Substituting equation 3.3 in equation 3.1 yields the following dynamics:

$$\frac{d\varphi_i}{dt} = 2\pi f_{0i} + \frac{m_i \Gamma_0}{M} + \frac{1}{M}\sum_{j=1}^{N} w_{ij}\,\tilde\Gamma(\varphi_i - \varphi_j), \qquad i = 1, \ldots, N, \qquad (3.4)$$
where

$$m_i = \sum_{j=1}^{N} w_{ij} \qquad (3.5)$$
is the number of synaptic inputs the ith cell receives from other cells. The second term in this equation is a major source of disorder in the system, which tends to desynchronize it. For example, if Γ_0 < 0, a cell that receives a larger number of inputs, m_i, will in general tend to delay the time of occurrence of its next spike in comparison to another cell that receives a smaller m_i. The third term in equation 3.4 includes the Fourier components, which depend on the difference φ_i − φ_j and therefore can help to synchronize the system. Therefore, the synchronization of the system is determined by a competition between the second and third terms. In most of this article, we assume that all the neurons are identical, that is, f_0i = f_0 for i = 1, …, N. In that case, the first term in equation 3.4 can be gauged out of the equations. We analyze the synchronization properties of equation 3.4 in the limits

$$N \to \infty \qquad (3.6)$$

and

$$1 \ll M \ll N. \qquad (3.7)$$
The number of neurons that receive m inputs follows a binomial distribution that we denote by P_0(m, M). Under the condition N ≫ M ≫ 1 (see equation 3.7),
this distribution (see equations 2.1 and 3.5) becomes a gaussian distribution, P(δm, M):

$$P(\delta m, M) = \frac{1}{\sqrt{2\pi M}}\exp\left(\frac{-(\delta m)^2}{2M}\right), \qquad (3.8)$$
where δm = m − M. In this work, we develop an approximate theory. Our analysis is based on the assumption that the state of the network at time t can be characterized by the density function ρ(φ, t, δm) dφ, which is the fraction of neurons with M + δm inputs whose phase lies between φ and φ + dφ at time t. It is normalized such that

$$\int_{-\pi}^{\pi} \rho(\varphi, t, \delta m)\,d\varphi = 1. \qquad (3.9)$$
In other words, we assume that the dynamics of a neuron depend on only the number of inputs it directly receives from other neurons in the network, and we derive (approximate) dynamics in terms of a Fokker-Planck equation for ρ (see equation A1). As shown in appendix A, this amounts to the assumption that most of the effects of the sparseness are due to the second term in equation 3.4. This is exact in the limit c_n → 0 for all n, but only approximate for a general phase model (Golomb & Hansel, unpublished). However, as we will see, this approximation provides good estimates in the case of neuronal systems. The asynchronous state is given by

$$\rho_0(\varphi, t, \delta m) = \frac{1}{2\pi}. \qquad (3.10)$$
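The gaussian limit in equation 3.8 can be verified against the exact binomial in-degree distribution (a quick sketch with illustrative N and M; note also that the binomial standard deviation √(M(1 − M/N)), scaled by g_syn/M, reproduces the input fluctuations of equation 2.3):

```python
import math

def binom_pmf(m, N, p):
    """Exact in-degree distribution: m inputs out of N possible, each with p = M/N."""
    logp = (math.lgamma(N + 1) - math.lgamma(m + 1) - math.lgamma(N - m + 1)
            + m * math.log(p) + (N - m) * math.log(1.0 - p))
    return math.exp(logp)

def gauss_pmf(dm, M):
    """Equation 3.8: gaussian approximation, with dm = m - M."""
    return math.exp(-dm * dm / (2.0 * M)) / math.sqrt(2.0 * math.pi * M)

N, M = 100000, 100   # illustrative values in the regime N >> M >> 1 of equation 3.7
pairs = [(binom_pmf(M + dm, N, M / N), gauss_pmf(dm, M)) for dm in (-10, 0, 10)]
```

Near the mean the two distributions agree to within a few percent, as equation 3.8 asserts.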
In appendix A we estimate the critical value, M_c, such that for M > M_c the asynchronous state is an unstable solution of the (approximate) dynamics. Here we give the final results. The critical number, M_c, is the minimum of the numbers, M_c,n, over all positive integers, n, for which it is finite,

$$M_c = \min_n M_{c,n}, \qquad (3.11)$$

where M_c,n is determined by the amplitude, c_n, and the phase, α_n, of the nth Fourier term of Γ(φ) (see equation 3.3):

$$M_{c,n} = \begin{cases} 8/\left[\pi^2 c_n^2\, b^2(\alpha_n)\right] & -\pi/2 < \alpha_n < \pi/2 \\ \infty & \text{otherwise.} \end{cases} \qquad (3.12)$$
The function b(α) is universal and does not depend on the model details (see Figure 12B). It is a decreasing function of |α|. As α → ±π/2, b(±π/2) = 0. The decrease is gradual for values of α sufficiently far from ±π/2. Near
these boundaries, it is very steep. For example, the value of b is 0.25 at α = ±0.995π/2. The integer n_c is defined as the value of n for which M_c,n is minimal. Therefore, M_c = M_c,n_c. The eigenmode that destabilizes the stationary distribution is a linear combination of exp(i n_c φ) and exp(−i n_c φ). The strength of the coupling, g_syn, affects only Γ_0, and not the c_n's and α_n's (see equation 3.3). Hence, in the limit of weak coupling, M_c does not depend on g_syn. Intuitively, this result stems from the fact that M_c is determined by the competition between the neuron-to-neuron fluctuations in the level of synaptic input, represented by Γ_0, and the strength of the higher components, represented by Γ_0 c_n (see equation 3.4). Both factors are proportional to Γ_0, and therefore g_syn does not affect the balance between them. From equation 3.12 we see that in order to get a finite M_c,n, the condition −π/2 < α_n < π/2 should hold. Moreover, M_c,n is an increasing function of |α_n|. The larger the coefficient c_n, the smaller M_c,n becomes. If the transition from an asynchronous state to a state with some synchrony is supercritical, n_c determines, for M ≳ M_c, the nature of the state in which the system settles at large time for almost asynchronous initial conditions. For example, if n_c = 1, the system is expected to settle in a "smeared" one-cluster state in which the phase distribution is broad but monomodal. If n_c = 2, the system is expected to behave above M_c as a "smeared" two-cluster state, characterized by bimodal phase distributions. Examples of such states are shown below.

4 Sparse Networks of Integrate-and-Fire Neurons

4.1 Estimates of M_c at Weak Coupling. The synaptic coupling term (see equation 2.6) in our version of the I&F model depends only on the dynamics of the presynaptic neuron via the variable s(t). Therefore, the effective interaction, Γ(φ), for this model is (see equation 3.2)

$$\Gamma(\varphi) = \frac{1}{2\pi}\int_0^{2\pi} dy\, Z(y + \varphi)\, I_{\rm syn}(y). \qquad (4.1)$$

The right-hand side is almost a convolution integral (except for the plus sign in the function Z(y + φ)). The phase coordinates are obtained from the time coordinate according to φ = ω_0 t with ω_0 = 2π f_0. The function Z(φ) for the I&F model is (Hansel et al., 1995)

$$Z(\varphi) = \begin{cases} 0 & 0 < \varphi \le \varphi_r \\ \dfrac{1}{I_{\rm ext}}\, e^{(\varphi - \varphi_r)/(\omega_0 \tau_0)} & \varphi_r < \varphi \le 2\pi, \end{cases} \qquad (4.2)$$

where φ_r = ω_0 T_r. The function I_syn(φ) is (Hansel et al., 1995)

$$I_{\rm syn}(\varphi) = \frac{g_{\rm syn}}{\tau_2 - \tau_1}\left(\frac{e^{-\varphi/(\omega_0 \tau_2)}}{1 - e^{-1/(f_0 \tau_2)}} - \frac{e^{-\varphi/(\omega_0 \tau_1)}}{1 - e^{-1/(f_0 \tau_1)}}\right). \qquad (4.3)$$
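Equations 4.1 through 4.3 can be combined numerically. The sketch below evaluates Γ(φ) by direct quadrature for the inhibitory, T_r = 0 case, with the parameter values of Figure 1 (τ_1 = 5.9 ms, τ_2 = 6 ms, I_ext = 1.12 mV/ms); it is an illustrative reimplementation, not the authors' code:

```python
import math

# Parameters as in Figure 1 (inhibitory coupling, no refractory period).
tau0, tau1, tau2 = 10.0, 5.9, 6.0        # ms
theta, Iext, g_syn, Tr = 10.0, 1.12, -1.0, 0.0
T0 = Tr - tau0 * math.log(1.0 - theta / (tau0 * Iext))   # equation 2.9 (ms)
f0 = 1.0 / T0                            # firing rate in 1/ms
w0 = 2.0 * math.pi * f0
phi_r = w0 * Tr

def Z(phi):
    """Phase-resetting curve, equation 4.2 (phi taken modulo 2*pi)."""
    phi = phi % (2.0 * math.pi)
    if phi <= phi_r:
        return 0.0
    return math.exp((phi - phi_r) / (w0 * tau0)) / Iext

def Isyn(phi):
    """Periodized synaptic input, equation 4.3."""
    return (g_syn / (tau2 - tau1)) * (
        math.exp(-phi / (w0 * tau2)) / (1.0 - math.exp(-1.0 / (f0 * tau2)))
        - math.exp(-phi / (w0 * tau1)) / (1.0 - math.exp(-1.0 / (f0 * tau1))))

def Gamma(phi, n_grid=2000):
    """Effective interaction, equation 4.1, by the rectangle rule."""
    dy = 2.0 * math.pi / n_grid
    return sum(Z(k * dy + phi) * Isyn(k * dy)
               for k in range(n_grid)) * dy / (2.0 * math.pi)
```

With inhibitory coupling (g_syn < 0) the integrand is nonpositive, so Γ(φ) < 0 everywhere, and in particular Γ_0 < 0, the desynchronizing case discussed in section 3.2.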
Figure 1: The functions V(φ) (normalized to θ), s(φ), Z(φ), and Γ(φ) (normalized to |g_syn|) for the I&F model with no refractory period (T_r = 0) and inhibitory coupling (g_syn < 0). Parameters: τ_1 = 5.9 ms, τ_2 = 6 ms, I_ext = 1.12 mV/ms, f_0 = 44.8 Hz.
The functions V(φ), s(φ), Z(φ), and C(φ) for a specific case are plotted in Figure 1. The function Z(φ) is discontinuous at φ = 0 and φ = φr, but the function C(φ) is continuous for all φ. Using the parameterization (see equation 3.3) and the calculation reported in appendix B, one obtains:

    cn = 2 {(1 + n²ω0²τ1²)(1 + n²ω0²τ2²)}^{−1/2} (1 + n²ω0²τ0²)^{−1/2}
         × [1 + (2 Iext (Iext − θ)/θ²)(1 − cos(nω0 Tr))]^{1/2}    (4.4)

    αn = −π + (π/2) sign(gsyn) + arctan(nω0 τ1) + arctan(nω0 τ2) + arctan(nω0 τ0)
         + arctan{ sin(nω0 Tr) / [Iext/(Iext − θ) − cos(nω0 Tr)] }.    (4.5)
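Equations 4.4 and 4.5 translate directly into code. The following is a hedged illustration (θ denotes the threshold; the parameter values are ours and purely illustrative, with f0 in kHz so that ω0 Tr is dimensionless):

```python
import math

def c_alpha(n, f0, tau0, tau1, tau2, Iext, theta, Tr, gsyn):
    """Coefficients c_n and angles alpha_n of Eqs. 4.4-4.5 (sketch)."""
    w0 = 2 * math.pi * f0
    x0, x1, x2 = n * w0 * tau0, n * w0 * tau1, n * w0 * tau2
    cn = (2 * ((1 + x1**2) * (1 + x2**2)) ** -0.5 * (1 + x0**2) ** -0.5
          * (1 + 2 * Iext * (Iext - theta) / theta**2
             * (1 - math.cos(n * w0 * Tr))) ** 0.5)
    an = (-math.pi + (math.pi / 2) * math.copysign(1.0, gsyn)
          + math.atan(x1) + math.atan(x2) + math.atan(x0)
          + math.atan2(math.sin(n * w0 * Tr),
                       Iext / (Iext - theta) - math.cos(n * w0 * Tr)))
    return cn, an

# With Tr = 0 the last factor and term drop out, recovering Eqs. 4.6-4.7
# below; finite Tr multiplies c_1 by a factor >= 1 (cf. Section 4.3).
pars = dict(tau0=10.0, tau1=1.0, tau2=6.0, Iext=2.0, theta=1.0, gsyn=-1.0)
c0, a0 = c_alpha(1, 0.102, Tr=0.0, **pars)   # f0 = 102 Hz
c2, a2 = c_alpha(1, 0.102, Tr=2.0, **pars)
print(c2 > c0)   # True: refractoriness increases c_1
```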
Synchrony of Sparse Networks
The first factor in equation 4.4 depends only on the synaptic properties, whereas the second factor reflects the intrinsic response properties of the neurons. Similarly, the first three nontrivial terms in equation 4.5 depend on the nature of the interactions and on the synaptic dynamics, while the last two terms in this equation depend only on the intrinsic properties of the neurons.

4.2 Inhibitory Coupling, Tr = 0. The angle αn for inhibitory coupling and Tr = 0 is given by

    αn = −3π/2 + arctan(nω0 τ0) + arctan(nω0 τ1) + arctan(nω0 τ2).    (4.6)

The critical number, Mc,n, is finite only if −π/2 < αn < π/2. All the arguments of the arctan functions in equation 4.6 are positive; therefore αn < 0, since the sum of the three last terms is between 0 and 3π/2. To get αn > −π/2, the sum of these three terms should be larger than π (see Figure 2A). The coefficients cn are given by

    cn = 2 [(1 + n²ω0²τ0²)(1 + n²ω0²τ1²)(1 + n²ω0²τ2²)]^{−1/2}.    (4.7)
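The condition αn > −π/2 can be made concrete: for the time constants of Figure 2B, with τ0 = 10 ms assumed, the lowest frequency at which mode n = 1 can destabilize the asynchronous state follows from bisecting equation 4.6. A sketch under those assumptions:

```python
import math

# For inhibitory coupling and Tr = 0, M_{c,1} is finite only when
# alpha_1 > -pi/2 (Eq. 4.6), i.e. the three arctan terms must sum to
# more than pi. Bisect for the threshold frequency (taus in seconds,
# f0 in Hz); alpha_1 is monotonically increasing in f0.
tau0, tau1, tau2 = 10e-3, 5.9e-3, 6e-3

def alpha(n, f0):
    w0 = 2 * math.pi * f0
    return (-1.5 * math.pi + math.atan(n * w0 * tau0)
            + math.atan(n * w0 * tau1) + math.atan(n * w0 * tau2))

lo, hi = 1.0, 200.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if alpha(1, mid) < -math.pi / 2:
        lo = mid
    else:
        hi = mid
f_crit = 0.5 * (lo + hi)
print(30.0 < f_crit < 50.0)   # True: threshold frequency near 40 Hz
```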
Below, the implications of equations 4.6 and 4.7 are listed. While reading the list, it is useful to look at Figure 2B, where αn and cn are plotted as functions of f0 for τ2 = 6 ms, τ1 = 5.9 ms, and n = 1, 2, 3, and at the two examples in Figure 3A, in which Mc,n is shown as a function of f0 for that parameter set (left) and for τ1 = 1 ms (right). The first six conclusions below are valid for τ1 > 0.

1. Competition between αn and cn. The angle αn increases with f0, whereas the amplitude cn decreases with it (see Figure 2). Therefore, for each n, at low firing rates αn is smaller than −π/2, and Mc,n is infinite. There is a critical firing rate at which Mc,n becomes finite, and for slightly larger f0, Mc,n decreases, because near αn = −π/2 the function b is very steep. For large frequencies, αn approaches 0 while cn continues to decrease; therefore, Mc,n increases. As a result, Mc,n has a minimum as a function of f0 at a finite value of f0.

2. The asynchronous state is always unstable for high enough M. The angle αn is a monotonically increasing function of n. For τ1 > 0, αn → 0 as n → ∞. Therefore, there is always an nc such that αn > −π/2 for all n > nc. This implies that the asynchronous state loses stability at a finite value of M, which is Mc,nc.

3. The critical mode is nc = 1 at high enough values of f0. This is because for high enough values of f0, α1 > −π/2 and c1 is the minimal cn. At very high values of f0, α1 ≈ 0, and from equations 3.12, 4.7, and 2.9 one concludes that Mc scales like f0⁶ (or Iext⁶).
4. The critical mode, nc, increases with decreasing f0. This is a consequence of two facts: (I) for each specific n, αn decreases as the firing frequency, f0, decreases; and (II) αn increases with n. When, for some n, αn drops below −π/2, nc is higher than that specific n.

5. All Mc,n's have the same minimum with respect to f0. For fixed τ1/τ0 and τ2/τ0, Mc,n is a function of the variable x = n f0 τ0. Therefore,

    Mc,n(f0) = Mc,1(n f0).    (4.8)
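The identity of equation 4.8 can be verified at the level of αn and cn, which for Tr = 0 depend on n and f0 only through the product n f0. A quick check, using the illustrative time constants of Figure 2B:

```python
import math

# Check of Eq. 4.8: alpha_n and c_n (Eqs. 4.6-4.7) are functions of the
# single variable n*f0 when the time constants are fixed, so the critical
# numbers satisfy M_{c,n}(f0) = M_{c,1}(n*f0). Taus in seconds, f0 in Hz.
taus = (10e-3, 5.9e-3, 6e-3)   # tau0, tau1, tau2

def alpha_c(n, f0):
    w0 = 2 * math.pi * f0
    a = -1.5 * math.pi + sum(math.atan(n * w0 * t) for t in taus)
    c = 2 * math.prod((1 + (n * w0 * t) ** 2) ** -0.5 for t in taus)
    return a, c

for n in (2, 3, 5):
    a_n, c_n = alpha_c(n, 30.0)       # mode n at f0 = 30 Hz
    a_1, c_1 = alpha_c(1, n * 30.0)   # mode 1 at n*f0
    assert abs(a_n - a_1) < 1e-9 and abs(c_n - c_1) < 1e-9
print("invariance holds")
```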
In particular, the minima over the network parameters of all the Mc,n are the same.

6. The critical number of synaptic connections, Mc, is nonmonotonic in f0. Here we describe the behavior of Mc as f0 decreases. At high f0, nc = 1, and Mc decreases with decreasing f0. At a specific value of f0, it reaches its minimum, Mc,1(f0), and then increases again. As f0 continues to decrease, a value of f0 for which nc = 2 is reached, and Mc decreases again until it reaches a second minimum. Then it increases until nc = 3, decreases again, and so on. More generally, Mc oscillates infinitely many times as f0 decreases toward 0, that is, as Iext decreases toward the threshold θ.

7. Positive τ1 is needed for destabilizing the asynchronous state. If τ1 = 0 and in the absence of refractoriness, αn < −π/2 for all n. Therefore the asynchronous state is stable for every M. This extends, to the case of a sparse, weakly coupled network, a similar result established by Van Vreeswijk (1996) in the case of all-to-all coupling.

8. Effects of synaptic times. As τ1 increases from 0 (note that τ1 ≤ τ2), a lower frequency is needed for Mc,1 to be finite. Since c1 is larger at lower frequencies, the minimal value of Mc,1 as a function of f0 is smaller at higher τ1. For the parameters of Figure 3A, it is 418 for
Figure 2: Facing page. (A) Schematic demonstration of the condition needed to achieve αn > −π/2 for the I&F model with inhibitory coupling and Tr = 0. The sum of the three angles, denoted by the three levels of shading, should be larger than π for this condition to hold. The angle arctan(nω0 τ0) stems from the contribution of the response function, Z, of the I&F neuron to the C function (see equation 4.2), whereas the two angles arctan(nω0 τ1) and arctan(nω0 τ2) stem from the contributions of the two exponentials in the synaptic coupling function, Isyn (see equation 4.3). (B) The dependence of αn (I) and cn (II) on f0 for Tr = 0, τ1 = 5.9 ms, τ2 = 6 ms; n = 1, 2, 3. The angle αn increases with Iext, whereas the coefficient cn decreases with it. For a given Iext, αn increases with n while cn decreases with it.
τ1 = 5.9 ms and 2825 for τ1 = 1 ms. The dependence of Mc on the synaptic rise and decay times at high frequency (f0 = 102 Hz) is shown in Figure 3B. In the left panel, Mc,1 varies with τ1 for τ2 = 6 ms. At low τ1, Mc,1 is infinite because α1 < −π/2. It increases with τ1 for high
Figure 3: The first three critical numbers of inputs, Mc,n, for an I&F model with inhibitory coupling and Tr = 0. The graphs of Mc,n are denoted by the solid, dotted, and dashed lines for n = 1, 2, 3, respectively. (A) Mc,n versus f0 for τ2 = 6 ms and τ1 = 5.9 ms (left) or τ1 = 1 ms (right). Both panels demonstrate the nonmonotonicity of Mc, the minimal Mc,n, as a function of Iext. (B) Mc,n versus τ1 for τ2 = 6 ms (left) and versus τ2 for τ1 = 1 ms (right). In both cases, Iext = 1.6 mV/ms (f0 = 102 Hz). In the left panel, Mc,2 and Mc,3 are too large to be plotted; in the right panel, Mc,3 is not plotted for the same reason. The scale for τ2 starts from 1 ms because τ2 ≥ τ1 by definition.
τ1, because c1 decreases. Higher modes have Mc,n's that are beyond the scale of the figure. In the right panel, Mc,1 varies with τ2 for τ1 = 1 ms. Again, it has a minimum with respect to τ2. At small τ2, nc = 2. We have found that the minimal number, over all the parameters of the I&F model with inhibition and Tr = 0, of synaptic inputs M required
to destabilize the asynchronous state is obtained for τ1 = τ2 = τ0, where Mc = 363.8; for τ0 = 10 ms, f0 at the minimum is 31.2 Hz. This shows that I&F neurons are hard to synchronize with inhibitory coupling, in the sense that a large number of inputs is necessary to destabilize the asynchronous state of a sparse network of such neurons.

4.3 The Effect of Refractoriness. In the I&F model with no refractoriness, Mc is on the order of several hundred or more. Can Mc be reduced by the inclusion of refractoriness? To investigate this issue, we analyze the effect of Tr at fixed firing frequency, f0. Finite Tr contributes a factor larger than 1 to cn (see equation 4.4) and an additional term to αn (see equation 4.5). This factor and term depend explicitly on ω0 Tr. They are small at low firing rates, but they can have a major effect at high firing rates. The dependence of Mc,n, n = 1, 2, 3, on Iext is presented in Figure 4A for Tr = 2 ms, τ2 = 6 ms, and two values of τ1: 5.9 ms and 1 ms. As in the case Tr = 0, the stability is determined by the first mode at large Iext and by higher modes as Iext decreases. The effect of Tr on Mc,n is demonstrated by comparing Figure 4A to Figure 3A. This comparison shows that the effect of the refractory period is indeed more important at high frequencies. The dependence of Mc on Tr is shown in Figure 4B for τ2 = 6 ms, τ1 = 1 ms. The level of external current, Iext, is tuned such that the frequency is fixed: f0 = 102 Hz. At small Tr, nc = 1. Mc decreases considerably with Tr (from 2941 at Tr = 0 to 141 at Tr = 4.1 ms), mainly because of an increase of c1. At Tr = 5.5 ms, α1 = −π/2, and hence nc = 2 for higher Tr. In summary, if the mode n = 1 destabilizes the asynchronous state for Tr = 0, refractoriness decreases Mc considerably at high firing rates (as long as Tr is not too large), but only slightly at low firing rates.

5 Numerical Simulations

5.1 Goals and Methods.
The analytical theory that enables us to estimate Mc has been developed under the conditions of weak coupling, 1 ≪ M ≪ N, and in the limit where cn → 0 for all n. There are several questions it cannot answer: How good is the estimate of equation 3.12 when these conditions are not satisfied? What are the patterns of synchrony for M > Mc? Are the qualitative predictions of the theory correct beyond the weak-coupling limit? How does Mc vary when the coupling increases? In order to answer these questions, we carry out numerical simulations of the full model and of the corresponding phase model.
Figure 4: Effects of the refractory period, Tr. The first three critical numbers of inputs, Mc,n, are plotted for an I&F model with inhibitory coupling and τ2 = 6 ms. The graphs of Mc,n are denoted by the solid, dotted, and dashed lines for n = 1, 2, 3, respectively. (A) Mc,n versus f0 for Tr = 2 ms and τ1 = 5.9 ms (left) or τ1 = 1 ms (right). (B) Mc,n versus Tr for τ1 = 1 ms (f0 = 102 Hz).
5.1.1 Phase Model. The simulations of the phase model were done as follows. The phase equations (see equation 3.4), without the term 2π f0, which does not affect synchronization, were integrated using a fourth-order Runge-Kutta method. The time step, Δt, was chosen such that there are at least 20 steps in one period of the simulated model. The initial conditions for the φi were chosen at random from a uniform distribution between −π and π. The state of the network—the level and form of synchrony—was characterized
by the order parameters (Kuramoto, 1984),

    Zn = (1/Tm) ∫₀^{Tm} |zn(t)| dt,    (5.1)

where

    zn = (1/N) Σ_{i=1}^{N} e^{inφi}    (5.2)
and Tm is a long-enough measurement time. The first-order parameter, Z1, measures the tendency of the oscillator population to evolve in full synchrony: Z1 = 1 if and only if φi(t) = φ(t) for all i and t. The higher-order parameters, Zn, measure the tendency of the population to segregate into n clusters. For example, if a large population segregates into two equally populated clusters oscillating locked in antiphase, Z2 = 1 while Z1 = 0. In the limit N → ∞, Zn = δn,0 if and only if the system is in an asynchronous state. In the simulations, zn(t) rises from near 0 until it reaches a steady state. The order parameters, Zn, are computed over a period of 5000 time steps after the transients are gone. All numerical simulations are carried out with 10 realizations for each set of parameters, over which the results are averaged.

5.1.2 Integrate-and-Fire Network. The I&F equations (2.4 and 2.5) were integrated using a second-order Runge-Kutta method supplemented by a first-order interpolation of the firing time at threshold, as explained in Hansel, Mato, Meunier, and Neltner (1998) and Shelley (1999). This algorithm is well suited to I&F neuronal networks and allows one to obtain reliable results with substantially large time steps, Δt. The integration time step is Δt = 0.25 ms. In typical simulations, the dynamics of the network was integrated for 4 × 10⁴ to 3 × 10⁵ time steps. Stability of the results was checked against varying the number of time steps in the simulations and their size. The initial conditions were chosen according to
    Vi(0) = I0 τ0 { 1 − exp[ −b (i − 1)(T0 − Tr) / (N τ0) ] }    (5.3)
(i = 1, . . . , N). The coefficient 0 ≤ b ≤ 1 controls the degree of synchrony of the initial condition. Setting b = 0 yields a perfectly synchronized initial condition. Conversely, b = 1 corresponds to a uniform distribution of the firing times of the uncoupled neurons between 0 and T0 − Tr. All synaptic currents were set to 0 at t = 0. To measure the degree of synchrony in the steady state of the network, we follow the method proposed and developed in Ginzburg and Sompolinsky (1994), Golomb and Rinzel (1993, 1994), and Hansel and Sompolinsky (1992),
that is grounded on the analysis of the temporal fluctuations of macroscopic observables of the network, such as the instantaneous activity of the system or the spatial average of the membrane potential. For the latter, one evaluates at a given time, t, the quantity

    V(t) = (1/N) Σ_{i=1}^{N} Vi(t).    (5.4)

The variance of the temporal fluctuations of V(t) is

    σV² = ⟨[V(t)]²⟩t − [⟨V(t)⟩t]²,    (5.5)

where ⟨...⟩t = (1/Tm) ∫₀^{Tm} dt ... denotes time averaging over a large time, Tm. After normalization of σV² by the population average of the single-cell membrane potential variances,

    σVi² = ⟨[Vi(t)]²⟩t − [⟨Vi(t)⟩t]²,    (5.6)

one defines a synchrony measure, χ(N), for the activity of a system of N neurons by

    χ²(N) = σV² / [ (1/N) Σ_{i=1}^{N} σVi² ].    (5.7)

This synchrony measure, χ(N), is between 0 and 1. In the limit N → ∞, it behaves as

    χ(N) = χ(∞) + a/√N + O(1/N),    (5.8)
where a is some constant between 0 and 1. In particular, χ(N) = 1 if the system is fully synchronized (i.e., Vi(t) = V(t) for all i), and χ(N) = O(1/√N) if the state of the system is asynchronous. Asynchronous and synchronous states are unambiguously characterized in the thermodynamic limit (i.e., when the number of neurons is infinite): in the asynchronous state, χ(∞) = 0, whereas in synchronous states, χ(∞) > 0. In homogeneous, fully connected networks, an important class of partially synchronized states (0 < χ(∞) < 1) are the "cluster states" (Golomb, Hansel, Shraiman, & Sompolinsky, 1992; Golomb & Rinzel, 1994; Golomb et al., 1994; Hansel et al., 1995). In these states, the system segregates into several clusters that may fire alternately. In heterogeneous or sparse networks, the disorder can smear the clustering. For instance, in a "smeared one-cluster state," the population voltage, V(t) (see equation 5.4), oscillates with time and has one peak per period. More generally, in a "smeared n-cluster state," V(t) has n peaks in each period.
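A minimal sketch of the synchrony measure defined by equations 5.4 through 5.7 (the voltage traces below are illustrative test signals, not simulation output):

```python
import numpy as np

# Sketch of the synchrony measure chi of Eqs. 5.4-5.7: it compares the
# temporal variance of the population-averaged voltage with the mean
# single-cell variance.
def chi(V):
    """V has shape (timesteps, N)."""
    Vbar = V.mean(axis=1)          # population average, Eq. 5.4
    num = Vbar.var()               # sigma_V^2, Eq. 5.5
    den = V.var(axis=0).mean()     # (1/N) * sum of sigma_Vi^2, Eq. 5.7
    return np.sqrt(num / den)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 1000)
common = np.sin(2 * np.pi * 10 * t)[:, None]
chi_sync = chi(np.repeat(common, 50, axis=1))       # identical cells
chi_async = chi(rng.standard_normal((1000, 50)))    # independent cells
print(round(chi_sync, 6))        # 1.0: full synchrony
print(chi_async < 0.3)           # True: O(1/sqrt(N)) for N = 50
```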
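The order parameters of equations 5.1 and 5.2 and the cluster states discussed above can likewise be illustrated with a short sketch (the two-cluster phase configuration is ours, for illustration only):

```python
import numpy as np

def z_n(phases, n):
    """Instantaneous order parameter z_n of Eq. 5.2."""
    return np.mean(np.exp(1j * n * phases))

rng = np.random.default_rng(0)
N = 10000
# Two equally populated clusters locked in antiphase: |z_2| ~ 1, |z_1| ~ 0.
phases = np.where(rng.random(N) < 0.5, 0.0, np.pi)
print(abs(z_n(phases, 1)))   # small: O(1/sqrt(N)) finite-size residue
print(abs(z_n(phases, 2)))   # ~1.0
# Uniform (asynchronous) phases: all |z_n| are O(1/sqrt(N)).
uniform = rng.uniform(-np.pi, np.pi, N)
print(abs(z_n(uniform, 1)))  # small
```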
5.1.3 Finite-Size Effects in Sparse Networks. In general, for finite N, χ(N) is not equal to 0 even in the asynchronous state. However, for finite but large N, according to equation 5.8, a synchronous state differs from an asynchronous state in the way χ(N) scales with N—that is, in the way the finite size of the system affects the level of synchrony of the neural activity. As explained in Hansel and Sompolinsky (1996), these finite-size effects can be used in numerical simulations to characterize the parameter regimes in which the network states are asynchronous or synchronous. If, in the thermodynamic limit, a continuous transition occurs between asynchronous and synchronous regimes when a parameter (e.g., M) is changed, this transition is replaced for finite N by a crossover between the two different scalings of χ with N. These finite-size effects occur no matter what the connectivity is. In networks with finite M and N, sparseness is another source of finite-size effects. The reason is that the desynchronizing effect due to the sparseness depends not only on M, but also on the relative values of M and N. For instance, in equation 3.4, the synchrony level in the system is determined by the competition between the term C̃, which may be synchronizing, and the distribution of mi/M, which tends to desynchronize the system. If M is sufficiently large, the binomial distribution of mi/M is approximated by a gaussian with variance 1/M (see equation 3.8). When N is finite but large enough (say, N > 30), and N − M is also large enough, the binomial distribution of mi/M is well approximated by a gaussian with variance

    Var(mi/M) = 1/M − 1/N.    (5.9)

This suggests that the level of synchrony should depend on M and N through the effective variable, Meff, defined by

    1/Meff = 1/M − 1/N.    (5.10)
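Equation 5.10 is simple enough to state as code; the numerical values below are combinations used in the simulations that follow:

```python
def m_eff(M, N):
    """Effective number of inputs per cell, Eq. 5.10."""
    return 1.0 / (1.0 / M - 1.0 / N)

# M = 400 inputs in a network of N = 800 cells gives M_eff = 800: the
# sparse finite network is expected to behave like an infinite network
# with 800 inputs per cell.
print(m_eff(400, 800))            # 800.0
print(round(m_eff(450, 800), 1))  # 1028.6
```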
Relying on this heuristic argument, we expect Meff to be a more appropriate parameter than M for representing simulation results.

5.2 Simulation Results for Weakly Coupled Neurons

5.2.1 Finite-Size Effects. In Figure 5A, we present the dependence of χ on M for I&F networks of different sizes, from N = 200 to N = 3200. The parameter set is: Tr = 0, τ1 = 1 ms, τ2 = 3 ms, Iext = 2 mV/ms, gsyn = −0.2 mV/ms. The initial conditions were chosen such that b = 0.5 (see equation 5.3). When N increases, the curves shift toward higher values of M. Very large values of N would be necessary to see the lines converge to a limit curve. This is due to the fact that, for these parameters,
Figure 5: Simulations of the I&F model with Tr = 0, τ1 = 1 ms, τ2 = 3 ms, Iext = 2 mV/ms (f0 = 144.3 Hz), gsyn = −0.2 mV/ms, b = 0.5. Simulations are carried out with N = 200, averaged over 100 realizations of the network connectivity and random initial conditions (dots); N = 400, 50 realizations (crosses); N = 800, 25 realizations (dashed line); N = 1600, 20 realizations (dotted line); and N = 3200, 10 realizations (solid line). The integration time step is Δt = 0.25 ms. The runs were performed over 4 × 10⁴ time steps. The synchrony measure was computed after discarding a transient of 2 × 10⁴ time steps. (A) Synchrony measure χ versus M. (B) Synchrony measure χ versus Meff (see equation 5.10). The lines for N = 800, 1600, 3200 almost coincide with each other, and the deviations of the lines with N = 400 and N = 200 are small. The analytical estimate for Mc is denoted by the arrow. Note that for Meff ≲ 2500, χ decreases with N: χ ∝ 1/√N.
Mc is very large. In Figure 5B, we present χ as a function of Meff for the same values of N. It is remarkable that the lines for N = 800, 1600, and 3200 collapse onto almost the same curve. Deviations from this curve are more
apparent at small values of M, where one can check that χ depends on N as 1/√N, which means that the network state is asynchronous there. For larger values of Meff, around Meff ≈ 2500, a crossover occurs between a regime in which χ = O(1/√N) and a regime where χ remains finite in the large-N limit. The region in which this crossover occurs is in good correspondence with our weak-coupling estimate for the critical value of M above which, for this set of parameters, the asynchronous state loses stability: Mc = 2785. For N = 400 and N = 200, χ(Meff) is also quite close to the common curve, although deviations are more prominent than for larger values of N. This example shows that it is possible to estimate Mc by studying how χ depends on Meff in simulations with relatively small N (about 200–800), even though Mc is much larger than these values of N.

5.2.2 Weakly Coupled Inhibitory Neurons. To proceed further in the comparison between the theory and the numerical simulations, we study the dependence of the synchrony level on M for three values of the external input current Iext: 1.02 mV/ms, 1.07 mV/ms, and 1.12 mV/ms. We simulated both the phase model and the full I&F model for inhibitory coupling with N = 800 neurons, Tr = 0, τ1 = 5.9 ms, τ2 = 6 ms, and b = 0.5. The rise and decay times were chosen such that Mc would not be too large. Indeed, the theory for weak coupling predicts a value of Mc that is smaller by almost one order of magnitude than for τ1 = 1 ms and τ2 = 3 ms. Another prediction of the theory is that Mc (for N → ∞) varies nonmonotonically with Iext: Mc = 490, 1737, and 418.4 for Iext = 1.02 mV/ms, 1.07 mV/ms, and 1.12 mV/ms, respectively. For the full I&F model, we chose gsyn = −0.025 mV/ms. This is a weak coupling; for example, for Iext = 1.12 mV/ms, the firing rate of the neurons changes by less than 3%. The results for the phase models that correspond to these parameters of the I&F network are shown in Figure 6A.
For all Iext values, Z1 and Z2 are small at small M. Careful analysis shows that they scale as 1/√N, indicating that at small Meff the network is in the asynchronous state. On increasing Meff, a continuous transition to a synchronous state occurs. The analytical estimates for Mc (see equation 5.10) for the three values of the external current are denoted by the arrows in Figure 6. There is very good correspondence between these values and the range of Meff values at which Z1 or Z2 starts to differ substantially from 0. Note, however, that in order to estimate the value of Mc from the simulations in a systematic way, a precise finite-size scaling procedure would be necessary (Binder & Heermann, 1992); this is beyond the scope of this article. The pattern of synchrony displayed by the network in the synchronous regime also depends on the value of Iext. Indeed, for Iext = 1.12 mV/ms, the network settles in a smeared one-cluster state in which both Z1 and Z2 are O(1) in the large-N limit. By contrast, for Iext = 1.02 mV/ms and Iext = 1.07 mV/ms, the pattern of synchrony is a smeared two-cluster state characterized by Z2 = O(1) but
Figure 6: Simulations of inhibitory I&F networks. Parameters: Tr = 0, τ1 = 5.9 ms, τ2 = 6 ms, N = 800. Three values of Iext are used: 1.02 mV/ms (f0 = 25.4 Hz); 1.07 mV/ms (f0 = 36.7 Hz); 1.12 mV/ms (f0 = 44.8 Hz). (A) The order parameters Z1 and Z2 in the reduced phase model versus Meff. The integration time step is 0.1. The results are averaged over 5 × 10³ time steps (after a discarded transient of 5 × 10³ time steps) and 10 realizations of the network architecture and initial conditions (homogeneous distribution between 0 and 2π). The error bars represent the standard deviation of Z. The arrows denote the analytical estimates for Mc. (B) The synchrony measure χ in the full I&F model with gsyn = −0.025 mV/ms. Time averaging over 8 × 10⁴ time steps; discarded transient of 8 × 10⁴ time steps; Δt = 0.025 ms; 25 realizations; the initial conditions are chosen at random with b = 0.5. The error bars represent the standard deviation of χ.
Z1 = O(1/√N). This is in agreement with the theory, which predicts nc = 1 for Iext = 1.12 mV/ms, and nc = 2 for Iext = 1.02 mV/ms and Iext = 1.07 mV/ms. Note that for Iext = 1.07 mV/ms and Iext = 1.02 mV/ms at large values of M, the system displays multistability: smeared two-cluster states coexist with a stable smeared one-cluster state (data not shown). In Figure 6B we present the results of our simulations of the full I&F networks for a synaptic strength gsyn = −0.025 mV/ms. Similar to the
phase model, when M increases, a continuous transition occurs between the asynchronous state and a synchronous state in a range of values of Meff that agrees with our analytical estimate for Mc. The synchrony measure χ does not discriminate between different patterns of synchrony of the full model. To characterize these patterns further, rastergrams are plotted in Figure 7 for Meff = 600 and the three Iext values. We also plot ρ, the percentage of neurons firing as a function of time (1 ms bins). For Iext = 1.12 mV/ms, simulations reveal only one group of neurons firing in every cycle; that is, the state of the system is a smeared one-cluster state (see Figure 7A). By contrast, for Iext = 1.07 mV/ms and Iext = 1.02 mV/ms, the network settles in a smeared two-cluster state (see Figures 7B and 7C). These results are in agreement with what we found for the corresponding phase models. The theory predicts a nonmonotonic variation of Mc with the external current, or equivalently with f0 (see Figure 4). Our simulations of both the phase model and the full model are in agreement with this prediction (see Figure 6). From this result, one expects that for fixed M, the synchrony measure χ is a nonmonotonic function of f0. This is confirmed by our simulations, as shown in Figure 8A, where we have plotted χ versus f0 for Meff = 800 (M = 400 and N = 800). In the range of f0 we studied, χ displayed two peaks of different amplitudes. The peak of smaller amplitude occurs for 20 Hz < f0 < 30 Hz. Raster plots reveal that for these values of f0, the network settles into smeared two-cluster states (results not shown). A second peak, of larger amplitude, occurs for 40 Hz < f0 < 70 Hz. In this range, the network settles into a smeared one-cluster state. For 30 Hz < f0 < 40 Hz, the asynchronous state is stable. This is also in agreement with our analytical predictions (see Figure 4). Interestingly, in a small range of f0 (37 Hz < f0 < 40 Hz), the network state depends on the initial conditions.
Starting with initial conditions that are strongly coherent (b = 0.95), the network settles into a one-cluster state. By contrast, if b = 0.1, the asymptotic state is asynchronous. Therefore, in this range of frequencies, the system is bistable, with a stable asynchronous state coexisting with a stable smeared one-cluster state. The results shown in Figure 4B suggest that, for a fixed firing rate, f, the synchrony measure can be a nonmonotonic function of the refractory period. This prediction is checked in Figure 8B, where χ is plotted versus the refractory period for f0 = 102 Hz and Meff = 800 (other parameters as in Figure 4B, N = 800, M = 400). For a small refractory period, Tr < 1 ms, the network is in an asynchronous state. On increasing Tr, the asynchronous state becomes unstable and the synchrony increases; the network settles in a smeared one-cluster state. However, increasing Tr further causes χ to decrease. In a small range of Tr, namely 5.5 ms < Tr < 6.5 ms, the asynchronous state becomes stable again, as predicted by our analytic results (see Figure 4B). For large values of Tr, the activity of the network
synchronizes again and settles into a smeared two-cluster state (result not shown). We conclude that the phase model and our analytical estimates can be used to predict the behavior of the full model at small coupling strength. Similar behaviors are observed in the simulations of the full model (see Figure 6B) for finite but not too large coupling strength.

5.3 Beyond Weak Coupling. Here we are interested in how Mc depends on the coupling strength in the I&F model beyond the weak-coupling regime. In Figure 9, we plot results from simulations showing the synchrony measure χ versus Meff for a fixed value of the external input, Iext = 2 mV/ms, and different coupling strengths, |gsyn|. For sufficiently low values of |gsyn|, χ(M) depends weakly on |gsyn|, and Mc remains close to the weak-coupling estimate, although it starts to increase. For larger −gsyn (e.g., 0.5 mV/ms), the limit of stability of the asynchronous state becomes substantially different from its weak-coupling limit. Moreover, the system displays multistability: a one-cluster state (upper branch) coexists with an asynchronous or two-cluster state (lower branch). Therefore, increasing |gsyn| opposes synchrony instead of helping it, as one might expect at first. This is also clear from Figure 10A (solid line), where we have plotted χ versus −gsyn for given values of M and Iext. Not only does the synchrony of the system decrease when the coupling increases, but there exists a critical coupling, −gc(M), above which the network settles into an asynchronous state. In these simulations, Iext was kept fixed; therefore, the firing rate decreased when the inhibitory coupling was increased. In order to compare networks with comparable average firing rates but different coupling strengths, we performed another set of simulations in which Iext and −gsyn were increased in such a way that the average firing rate, f, in the asynchronous state was kept fixed. The results are shown in Figure 10A (dotted line).
Here too, χ is a decreasing function of −gsyn, although it varies more slowly than when the firing rate is not fixed. For sufficiently strong coupling, |gsyn| > −gc(M), the network settles into a state in which neurons fire in an asynchronous and (almost) periodic fashion.
Figure 7: Facing page. Simulations of networks of inhibitory I&F neurons. Parameters are the same as in Figure 6, and M = 600. Rastergrams of 400 neurons out of N = 800 are shown in the upper panels. Initial conditions are such that b = 0. A bar denoting the time period, T0, is shown above each rastergram. In each lower panel, ρ, the number of neurons firing within a time bin of 1 ms, is displayed as a function of time. For Iext = 1.12 mV/ms (A), the system is in a smeared one-cluster state, whereas for Iext = 1.07 mV/ms (B) and Iext = 1.02 mV/ms (C), it is in a two-cluster state.
By further increasing −gsyn, one can maintain the population-averaged firing rate constant by increasing Iext. However, the neural activity becomes more and more heterogeneous across the network, and the distribution of the time-averaged firing rates of the neurons becomes broad. Simultane-
Figure 8: Simulations of the full I&F network with N = 800; integration time step Δt = 0.25 ms. (A) Synchrony measure χ versus frequency f0 for Tr = 0, τ1 = 5.9 ms, τ2 = 6 ms, gsyn = −0.025 mV/ms, M = 450 (corresponding to M = 1028.6 in the limit N → ∞). Results were averaged over 40 realizations of the network architecture and initial conditions, b = 0.95 and b = 0.1; 2.4 × 10⁵ time steps (discarded transient of 1.8 × 10⁵ time steps). (B) Synchrony measure χ versus the refractory period, Tr. The applied current was changed such that f0 = 102 Hz. Other parameters: gsyn = −0.1 mV/ms, M = 400 (corresponding to M = 800 in the limit N → ∞), τ1 = 1 ms, τ2 = 6 ms. All the runs had 3 × 10⁵ time steps. Time averages were computed after discarding a transient of 1.5 × 10⁵ time steps. The results were averaged over 50 realizations (b = 0.95 and b = 0.1).
Figure 9: Beyond weak coupling: simulations of the full inhibitory I&F network. The synchrony measure χ is plotted versus Meff for different synaptic strengths. Parameters are Tr = 2 ms, τ1 = 1 ms, τ2 = 3 ms, Iext = 2 mV/ms (f = 112 Hz), N = 800. Simulations with 4 × 10⁴ time steps and Δt = 0.25 ms. Time averages were computed after discarding a transient of 2 × 10⁴ time steps. The results were averaged over 100 realizations. The arrow denotes the analytical estimate for Mc in the weak-coupling limit. In the asynchronous regime, the average firing rate in the network changes from f ≈ 105 Hz for gsyn = −0.1 mV/ms to f ≈ 80 Hz for gsyn = −0.6 mV/ms.
ously, the activity of the neurons becomes less and less periodic. This is shown in Figure 11 for f = 100 Hz and gsyn = −25 mV/ms. In this figure, we present results from simulations of networks of different sizes (N = 400, N = 800, and N = 1600) but the same Meff (Meff = 800). Figure 11A shows the distribution of the firing rates of the neurons. The coefficient of variation (CV) of the interspike-interval distribution of the neurons is displayed versus their firing rates in Figure 11B. Remarkably, the curves for the three sizes superimpose; in Figure 11B, the three curves are indistinguishable. This means that in the strong coupling regime, the dynamic state of the network also depends only on Meff. Note that the variability in the firing pattern is stronger for the neurons with the lower rates. For fixed gsyn, the lower the population average firing
Figure 10: Synchrony measure χ plotted versus coupling strength gsyn for M = 400, N = 800 (Meff = 800). Parameters are: Tr = 2 ms, τ1 = 1 ms, τ2 = 3 ms. Simulations were carried out with 4 × 10^4 time steps and Δt = 0.25 ms. (A) No heterogeneity. Solid line: the external current is fixed at Iext = 1.816 mV/ms. The firing rate decreases with increasing −gsyn, from 100 Hz for gsyn = 0 to 56 Hz for gsyn = −1 mV/ms. Dotted line: the external current increases when −gsyn increases such that the average firing rate in the asynchronous state remains constant (f = 100 Hz). (B) Heterogeneity. The external current varies from neuron to neuron and is homogeneously distributed between Īext − d and Īext + d. The average, Īext, is determined such that the population average of the firing rate, f, in the asynchronous state is kept constant, f = 100 Hz. Solid line: the width of the distribution is d = 0.075 mV/ms. The network activity is synchronized for 0.5 mV/ms ≤ −gsyn ≤ 2.2 mV/ms. Dotted line: the width of the distribution is d = 0.15 mV/ms. The network activity is always asynchronous. Simulations with larger N show that the residual nonzero values of χ are finite-size effects (result not shown). Simulations indicate that in the fully connected network with the same parameters, the asynchronous state becomes unstable at −gsyn ≈ 0.6 mV/ms (result not shown).
rate, or the lower the Meff, the larger the variability of the neural activity (result not shown). This is in agreement with results obtained by previous authors (van Vreeswijk & Sompolinsky, 1996, 1998; Kim, van Vreeswijk, & Sompolinsky, 1998).

5.4 Synchrony in Sparse and Heterogeneous Inhibitory Networks. Until now we have investigated sparse homogeneous networks. Here we briefly consider some effects of heterogeneities. For the sake of simplicity, heterogeneities are introduced as spatial fluctuations in the external input; that
Figure 11: Distribution histogram of the firing rates (A) and coefficient of variation of the interspike-interval distribution (B) for Meff = 800 and N = 400, 800, 1600. Parameters are: Tr = 2 ms, τ1 = 1 ms, τ2 = 3 ms, f = 100 Hz, gsyn = −25 mV/ms. Simulations are done with 3.2 × 10^6 time steps and Δt = 0.125 ms. Results for N = 400, 800, 1600 are averaged over 40, 20, and 10 realizations, respectively, with a bin width of 2 Hz. In both figures, the silent neurons (neurons with firing rate < 2 Hz, about 6.2% of the population) are not represented. In (B), the three curves superimpose almost perfectly and are hardly distinguishable.
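The quantities plotted in Figure 11, time-averaged firing rates and the coefficient of variation (CV) of the interspike-interval distribution, can be computed from spike trains as follows (a generic sketch, not the authors' code; the function name is ours):

```python
import numpy as np

def rate_and_cv(spike_times, duration):
    """Firing rate (Hz) and interspike-interval CV of one spike train.

    spike_times: sorted spike times in seconds; duration: total time in seconds.
    CV = std(ISI) / mean(ISI); CV -> 1 for a Poisson process and CV -> 0 for a
    perfectly periodic train, so larger CV means more irregular firing.
    """
    spike_times = np.asarray(spike_times)
    rate = spike_times.size / duration
    isi = np.diff(spike_times)
    cv = float(isi.std() / isi.mean()) if isi.size > 1 else float("nan")
    return rate, cv

# A perfectly periodic 100 Hz train over 1 s has rate 100 Hz and CV ~ 0:
rate, cv = rate_and_cv(np.arange(0.0, 1.0, 0.01), 1.0)
```

Applying this neuron by neuron and histogramming the results reproduces the kind of summary shown in panels A and B.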
is, we assume that each neuron receives an external input,

Ii,ext = Īext + δi,   (5.11)

where Īext is the population average of the external current and δi is a random number, uniformly distributed between −d and d. These distributions of external current can be thought of as a consequence of spatial heterogeneities
in some external stimulus or, equivalently, as the result of heterogeneities in the thresholds of the neurons in the network. Neltner et al. (2000) have shown that for a given level of heterogeneity, d, the asynchronous state of a fully connected inhibitory network is destabilized if the coupling |gsyn| is larger than a critical value, which is roughly proportional to d. When the network is heterogeneous and sparse, the situation is different. This is shown in Figure 10B, where the synchrony measure, χ, is plotted as a function of −gsyn for two values of d (d = 0.075 mV/ms and d = 0.15 mV/ms) and the same parameters as in Figure 10A. At weak coupling, the network activity is asynchronous in both cases (χ = O(1/√N)). This is a consequence of the heterogeneities, which prevent the neural activity from synchronizing. Indeed, for d = 0 the network activity would synchronize (Meff > Mc). For d = 0.075 mV/ms, χ starts to increase above some critical value of the coupling. This signals that the asynchronous state becomes unstable because the stronger coupling overcomes the desynchronizing effect of the heterogeneities. However, if the coupling increases too much, χ decreases again. This is because the spatial fluctuations that result from the sparseness become sufficiently strong to desynchronize the network activity again. Eventually, at large coupling, the neurons fire asynchronously. For d = 0.15 mV/ms, the network activity remains asynchronous at any coupling value. More generally, this is the situation whenever d is too large. Therefore, in contrast to what happens in a fully connected network, there is a maximum value of d, which depends on M, that is compatible with the emergence of synchrony.

6 Discussion

6.1 Analytical Results. The analytical results reported in this article are based on two approximations. The first is the weak-coupling approximation, in which the model can be reduced to a phase model.
The conditions under which this approximation is valid have been discussed before (Kuramoto, 1984). Simulations of networks of identical neurons with all-to-all coupling have revealed that results based on the weak-coupling limit remain valid for intermediate coupling strengths (Hansel et al., 1995). A similar conclusion is obtained here. In particular, simulations of the full model with moderate coupling strength reveal values of Mc that are similar to what is obtained by simulating the reduced phase model. Our analytical approach also relies on a second assumption: the dynamics of a neuron is governed only by the number of synaptic inputs it receives from other neurons in the network. A more thorough analysis of this assumption (Golomb & Hansel, unpublished) reveals that it is exact in the limit cn → 0 but only approximate for a general phase model. Our simulations indicate that this approximation provides a good estimate of the dynamics of sparse neuronal systems. In all the cases we have
examined, the differences between the theoretically estimated values of Mc and the values of Mc obtained from simulation have been less than 20%. The good correspondence between the analytical and simulation results stems from the fact that in most cases, for I & F neurons firing at rates that are not too small, c1 is 0.2 or less, and the cn values with n > 1 are even smaller. Our approximation means that the difference in the number of inputs between different neurons is the main source of the desynchronizing disorder coming from the sparseness of the connectivity. A second source of disorder, namely the fact that different neurons receive synaptic input from different sources (presynaptic neurons), was found to have a minor effect in our simulations. Another type of sparse architecture that has been used by other authors in modeling neuronal systems (Whittington et al., 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996) consists of neurons connected at random but with a number of inputs that is the same for all neurons. For such models, this second source of disorder is the only desynchronizing factor due to sparseness. Our simulations show that such a network exhibits a high level of synchrony for values of M on the order of 10 or less (data not shown; see also Wang & Buzsáki, 1996, for a conductance-based model). We have also found that the system exhibits bistability even for weak coupling, when a highly synchronized state coexists with an asynchronous state.

6.2 Patterns of Synchrony for Weakly Coupled Neurons. The nth mode of the stability matrix of the asynchronous state is unstable if −π/2 < an < π/2 and cn is large enough. The first condition is identical to the instability criterion of the asynchronous state in a fully connected network. The second condition means that the modulation of the nth Fourier mode of the effective phase interaction is large enough to overcome the disorder caused by the sparse connectivity.
If M = N, this disorder is absent, and therefore any modulation can destabilize the asynchronous state. We first discuss the pattern of synchrony of inhibitory I & F networks. In that case, Mc is a nonmonotonic function of the frequency, f0 (or, equivalently, of the external current Iext), and it has many minima. At high frequency, the asynchronous state is destabilized by the first modes, whereas at lower frequencies it is destabilized by higher modes. These are consequences of the interplay between the effects of the angle, an, and the amplitude, cn. For each mode, cn decreases with frequency, and the phasic modulation of the mode decreases, but an is within the desynchronizing range (−π/2, π/2) only for high enough frequencies. Therefore, for each mode there is a minimal value of its critical number, Mc,n, with respect to f0. Moreover, cn is smaller and an is larger for higher n, and therefore these minima occur at lower frequencies for higher modes. The asynchronous state is always stable for Tr = 0 if the synaptic rise is instantaneous (τ1 = 0). For small τ1, Mc is very high. However, with refractoriness, Mc can be moderate and even small for small or zero τ1.
Our simulations show that destabilization of the asynchronous state by the nth mode leads to a "smeared" n-cluster state. This explains why cluster states are observed at low frequencies. However, for sufficiently large M, these states coexist with smeared one-cluster states. A consequence is that in numerical simulations, the synchrony pattern into which the network settles depends on the initial conditions from which the simulation begins. The existence of cluster states is not specific to sparse networks. Indeed, for I & F network models, cluster states can also be found in fully connected networks when the synaptic timescale is fast in comparison to the firing period (van Vreeswijk, 1996; Neltner et al., 2000). Cluster states have also been found in conductance-based models of inhibitory spiking neurons (Hansel et al., 1993c; Wang & Buzsáki, 1996). Models of thalamic networks, in which inhibitory neurons display postinhibitory rebound, also exhibit cluster states with hyperpolarizing inhibition and fast synaptic kinetics (Golomb & Rinzel, 1993, 1994; Golomb et al., 1994). Cluster states are therefore an important type of state displayed by various models of recurrent neuronal networks. However, their occurrence in the central nervous system has yet to be demonstrated.

6.3 Patterns of Synchrony for Finite Coupling Strength. Increasing the strength of the inhibitory synapses (keeping the firing rate constant by also increasing the external input) has two competing effects. On one hand, it increases the average synaptic modulation, which may synchronize the system; on the other hand, it increases the neuron-to-neuron fluctuations, which tend to desynchronize the neuronal activity. For weak coupling, these two factors cancel each other, and the stability of the asynchronous state is not affected by gsyn; the convergence time to the synchronous state, however, is proportional to 1/|gsyn|.
Beyond weak coupling, the synaptic inputs also distort the neuronal trajectory away from the limit cycle of the isolated neuron; this distortion is different for neurons that receive different numbers of inputs. It is then not clear which of the two competing effects is the stronger. Remarkably, in all the cases we have studied, the synchrony measure, χ, was found to decrease monotonically with the coupling strength, and for |gsyn| larger than a critical value, |gc(M)|, the network is in the asynchronous state. This suggests that the increase in the spatial fluctuations always wins. It would be interesting to have a more rigorous proof of this result. van Vreeswijk and Sompolinsky (1996, 1998) have studied analytically a network of sparsely connected binary neurons in the limit 1 ≪ M ≪ N, with N, M → ∞ and gsyn, Iext ∝ 1/√M. They have shown that in this limit, the network settles into an asynchronous state in which the neuronal firing is random and close to a Poisson process. In this state, the total input to the cells is of order 1, although the excitation (from the external input) and the inhibition are separately of the order O(√M). Therefore, excitation and inhibition
are balancing each other without any fine tuning of the parameters. More recently, Kim et al. (1998) have performed numerical simulations of large, sparse networks of I & F neurons. They have found that under similar conditions, the network settles into a chaotic state very similar to this balanced state. Our finding that the network activity becomes asynchronous for |gsyn| > |gc(M)| is consistent with these results. In order to understand the dynamics of sparse networks further, it would be interesting to know how gc(M) scales with M for large M.

6.4 Behavior of More General I & F Network Models. The model we have studied is restricted, even within the I & F framework. We assume that the postsynaptic neuron receives a postsynaptic current immediately after the presynaptic neuron has fired. In cortex, there is a delay of about 2 ms even between adjacent neurons (Thomson, Deuchars, & West, 1993; Markram, Lübke, Frotscher, Roth, & Sakmann, 1997). With a time delay td, the effective coupling of the phase model, Cd, is related to C by Cd(φ) = C(φ − Δ), where Δ = 2π td / T0 (Hansel et al., 1995; Izhikevich, 1998). Therefore, all the angles, an, are increased by Δ. For inhibitory coupling, a small delay reduces the frequencies at which the destabilization of the asynchronous state occurs. The coupling in equation 2.4 is via "synaptic currents," in the sense that the total conductance of a cell does not depend on the synaptic input. In a version of the I & F network model that is more faithful to the biophysics, the interactions are mediated by conductance changes. The synaptic current is then Isyn(t) = gsyn(t)(V − Vrev), where Vrev is the synaptic reversal potential. In this case, the effective time constant τ0 decreases with increasing synaptic input. This is expected to have quantitative, although not qualitative, effects on the results presented here.
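The effect of a synaptic delay is thus a uniform shift Δ = 2π td/T0 of all the angles an. A minimal illustration (the numbers and the function name are ours):

```python
import math

def delay_shift(t_d, T0):
    """Phase shift Delta = 2*pi*t_d/T0 added to every angle a_n
    when a synaptic delay t_d is introduced (T0 is the firing period)."""
    return 2.0 * math.pi * t_d / T0

# The ~2 ms cortical delay quoted in the text, at 100 Hz firing (T0 = 10 ms),
# shifts every a_n by 0.4*pi, i.e. by a substantial fraction of the
# desynchronizing window (-pi/2, pi/2):
delta = delay_shift(2.0, 10.0)
```

Because the shift is the same for all modes, a delay moves modes into or out of the window (−π/2, π/2) without changing the amplitudes cn.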
The temporal summation of synaptic inputs is assumed here to be linear, and kinetic properties of synapses, such as depression or facilitation, are not considered. This assumption is justified provided one deals with the stationary properties of the network, and the synaptic time constants and strengths fit those of real synapses at long times, not those of the response to an initial stimulus (Markram & Tsodyks, 1996).

6.5 Sparseness Versus Heterogeneity. A random architecture has two consequences (van Vreeswijk & Sompolinsky, 1998): (1) the number of inputs fluctuates from neuron to neuron, and (2) different neurons receive synaptic inputs from different sources. In the approximate analytical theory we have developed in the weak-coupling limit, the latter effect is neglected. That is why the instability condition of the asynchronous state (see equations A.18 and A.19) is identical to what is found for a fully connected network of phase oscillators with random intrinsic frequencies (Sakaguchi & Kuramoto, 1986; Mirollo & Strogatz, 1991; Neltner et al., 2000), distributed according to a gaussian distribution with an average and a variance that depend on the zero mode of the phase interaction and therefore on the coupling strength.
The good agreement of the simulations with this theory is due to the fact that for I & F neurons (and, more generally, for conductance-based neurons; Golomb & Hansel, unpublished), the zero mode of the phase interaction is large in comparison to the amplitude of the first mode. A more exact mean-field theory would also require taking into account the fact that different neurons receive synaptic inputs from different sources. This can be done in the large-M, large-N limit. One then finds that the dynamics of the sparse network is similar to the dynamics of an effective phase model with randomly distributed intrinsic frequencies and a time-correlated multiplicative noise. The properties of the noise and of the frequency distribution have to be determined self-consistently. This shows that a sparse network is not equivalent to a fully connected network with heterogeneities. However, the solution of this exact mean-field theory is difficult. We intend to study it in future work. The difference between the effects of sparseness and of heterogeneities is also clear from our simulations at large coupling. Indeed, with a finite level of heterogeneity, a large, all-to-all inhibitory network with fixed external drive becomes asynchronous at large enough |gsyn| (Wang & Buzsáki, 1996; White et al., 1998). This is related to the decrease of the firing rate when the coupling increases. However, if the external drive is changed such that the firing rate remains fixed as the coupling strength increases, the asynchronous state remains stable at arbitrarily strong coupling in the fully connected case (Neltner et al., 2000). By contrast, our simulations show that when the firing rate is kept fixed, a sparse network becomes asynchronous at large enough |gsyn|. In systems with both heterogeneities and sparseness, we have found that for coupling that is not too strong, χ is an increasing function of |gsyn| because the desynchronizing effect of the heterogeneities is counterbalanced by the interactions.
However, for sufficiently strong coupling, the effect of the sparseness wins. Therefore, there exists an optimal value of gsyn at which χ is maximal. Note that nonmonotonic variations of the synchrony level in sparse and heterogeneous networks have also been found by Wang and Buzsáki (1996). However, they found them without controlling the firing rate of the neurons when the coupling is changed.

6.6 Consequences for Neural Modeling. In many cases, modeling large neuronal systems of the central nervous system (CNS) in their full size would require simulating networks of tens of thousands of neurons interacting through several million synapses. Simulations of such large systems are both time and memory consuming, and systematic studies of their dynamical properties using numerical simulations are hard and frequently impossible due to limitations in computer capabilities. In order to avoid these problems, systems of reduced size are frequently studied. However, it is not always clear how to perform the size reduction of the real system. Our work suggests a simple scaling rule that can be applied to the study of patterns of synchrony in sparse networks. Assume that the system one wants to
study consists of N neurons connected at random, with an average number of synapses per neuron, M. Assume for simplicity that all the synapses have the same strength, Gs. Our study shows that the synchronization properties of this system can be studied by simulating a model network consisting of Nm randomly connected neurons, with average connectivity Mm and synaptic strength Gm, satisfying the following scaling relations:

Gm = (M / Mm) Gs,   (6.1)

1/Mm − 1/Nm = 1/M − 1/N.   (6.2)
The first equation ensures that the average synaptic input is the same in the real and the simulated systems. The second relation guarantees that the spatial fluctuations of this input are the same in both systems. As we have shown, using this scaling allows one to investigate the behavior of very large networks by simulating a network with a relatively small Nm, on the order of a few hundred. Assuming that in the real system 1 ≪ M ≪ N, Nm and Mm will not be very different. In spite of this fact, our simulation results indicate that the synchronization properties of the two systems are similar. Synchrony in I & F inhibitory networks is very sensitive to disorder. In the context of our work, this is expressed in the fact that Mc is large for these models. Similar conclusions have been reached for networks of I & F neurons with heterogeneous properties. This lack of robustness is a consequence of the purely passive integration of the synaptic and external inputs. However, even within the I & F framework, refractoriness can improve the robustness of the synchrony substantially by decreasing Mc, as we have shown. Synchrony in conductance-based models, which take into account more of the biophysics of neurons, seems to be more robust (Wang & Buzsáki, 1996; Neltner et al., 2000; Golomb & Hansel, unpublished). The patterns of synchrony for M > Mc can also depend on the active properties and their kinetics. For example, in the conductance-based model of Wang and Buzsáki (1996), the network settles into a smeared one-cluster state except at very low frequencies. With slower kinetics of the intrinsic currents, the second mode destabilizes the asynchronous state over the entire frequency range (Golomb & Hansel, preliminary results). In that case, the network may settle into a two-cluster state.
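The scaling relations 6.1 and 6.2 can be applied mechanically; a minimal numeric sketch with illustrative parameter values (the function name is ours):

```python
def reduced_network(N, M, Gs, Nm):
    """Map a real network (N neurons, M inputs/neuron, synaptic strength Gs)
    to a reduced model with Nm neurons, preserving the mean synaptic input
    (equation 6.1) and its spatial fluctuations (equation 6.2)."""
    Mm = 1.0 / (1.0 / M - 1.0 / N + 1.0 / Nm)  # equation 6.2 solved for Mm
    Gm = (M / Mm) * Gs                          # equation 6.1
    return Mm, Gm

# Reduce a 100,000-neuron network with 5,000 inputs per neuron to 800 neurons:
Mm, Gm = reduced_network(100_000, 5_000, 0.01, 800)

# Both networks then share the same mean input M*Gs and the same
# Meff = (1/M - 1/N)**-1:
meff_real = 1.0 / (1.0 / 5_000 - 1.0 / 100_000)
meff_model = 1.0 / (1.0 / Mm - 1.0 / 800)
```

Note that the reduced connectivity Mm comes out only slightly below Nm, illustrating the remark that for 1 ≪ M ≪ N, Nm and Mm are not very different.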
Therefore, the role of the intrinsic properties of the neurons is as important as the role of the synaptic properties in determining when synchrony can emerge in the neural activity of large networks. Nevertheless, our study points out qualitative trends regarding the role of active properties, which have the effect of reducing Mc, in the emergence of synchrony. However, one cannot draw general quantitative conclusions regarding the minimal number of synaptic inputs required for synchrony to emerge, since this number can vary by an order of magnitude or more depending on
the model considered. This raises an important issue for neural modeling: under which conditions can a conductance-based dynamics be reduced to an I & F dynamics without losing synchronization properties?

6.7 Consequences for the Mechanisms of Neural Synchrony. Finally, we discuss some consequences of our results for the possible mechanisms underlying neural synchrony in the CNS. It has been proposed recently that inhibitory neurons could play a central role in the emergence of synchrony in large networks in the CNS. According to these ideas, synchronous rhythmic oscillations would emerge in the subnetwork of inhibitory interneurons and would subsequently entrain other populations of neurons in the system. This mechanism is supported by theoretical considerations. Indeed, it has been realized that it is easier to achieve coherent activity in fully connected networks of identical neurons if the interactions are inhibitory than if they are excitatory. For this to be true, no specific active properties of the neurons are required. Recent studies have pointed out that because of heterogeneities, synchrony in inhibitory networks can be a problem, especially at moderate or high firing rates. Nevertheless, as demonstrated in Neltner et al. (2000), the desynchronizing effect of heterogeneities in fully connected networks can be overcome provided the external input and the synaptic interactions are sufficiently strong. We have seen that if the network is sparsely connected, increasing the interaction strength is desynchronizing. Hence, the capability of compensating for the heterogeneities in the strong-coupling regime is opposed by the amplification, in this regime, of the spatial fluctuations due to the sparseness. To characterize this limitation in a more quantitative manner, more specific information is needed about the intrinsic active properties of the interneurons and the degree of connectivity of networks of interneurons in the CNS.
However, this article shows a fundamental limitation for the mechanism of synchrony relying solely on inhibition. An important issue is therefore whether more robust synchrony in the CNS can be achieved through recurrent inhibition between excitatory and inhibitory populations. Work is in progress in this direction.

Appendix A: Analytical Estimate of Mc in the Weak Coupling Limit

We assume that the state of the network at time t can be completely characterized by the density function ρ(φ, t, m) dφ, which is the fraction of neurons with exactly m inputs whose phase lies between φ and φ + dφ at time t. Under this assumption, one can write the continuity equation for ρ(φ, t, m), which reads:

∂ρ(φ, t, m)/∂t = −(∂/∂φ) { ρ(φ, t, m) [ 2π/T0 + m C0/M + (m/M) ∫_0^N dm′ P(m′, M) ∫_{−π}^{π} dφ′ ρ(φ′, t, m′) C̃(φ′ − φ) ] }.   (A.1)
The right-hand-side term in equation A.1 means that a neuron receives a phasic contribution from m other neurons, and the state of these "ancestor" neurons is determined by the number of inputs m′ they receive themselves. Because M ≫ 1, the contribution from the "tails" of the gaussian distribution (see equation 3.8) to the integral ∫_0^N dm′ is negligible, and this integral is therefore well approximated by ∫_{−∞}^{∞} dm′. The transformations

t̂ = |C0| t,   δm = m − M,   φ̂ = φ − (2π/T0 + C0) t,   (A.2)
lead to

∂ρ(φ̂, t̂, δm)/∂t̂ = −s0 (δm/M) ∂ρ(φ̂, t̂, δm)/∂φ̂ − (1/|C0|)(1 + δm/M) (∂/∂φ̂) [ ρ(φ̂, t̂, δm) ∫_{−∞}^{∞} d(δm′) P(δm′, M) ∫_{−π}^{π} dφ̂′ ρ(φ̂′, t̂, δm′) C̃(φ̂′ − φ̂) ],   (A.3)
where s0 = sign(C0), with sign(x) = 1 if x > 0 and −1 if x < 0. From now on, we omit the symbol ˆ in equation A.3 and refer to the values t, φ as the new values defined in equation A.2. A trivial solution of equation A.3 is

ρ(φ, t, δm) = 1/(2π).   (A.4)
It corresponds to the asynchronous state of the network. For analyzing its stability, we follow the approach of Strogatz and Mirollo (1991). We investigate the evolution of small deviations around this solution:

ρ(φ, t, δm) = 1/(2π) + η(φ, t, δm).   (A.5)
Using equation 3.3 and expanding η(φ, t, δm) in a Fourier series,

η(φ, t, δm) = Σ_{n=1}^{∞} [ Bn(t, δm) e^{inφ} + c.c. ],   (A.6)
one finds

∂Bn(t, δm)/∂t = −in s0 (δm/M) Bn(t, δm) + (n cn e^{i an}/2)(1 + δm/M) ∫_{−∞}^{∞} d(δm′) Bn(t, δm′) P(δm′, M).   (A.7)
We seek, for each n, a solution of equation A.7 of the form

Bn(t, δm) = bn(δm) e^{λn t},   (A.8)
where the eigenvalue, λn, does not depend on δm. Solving equations A.7 and A.8 for bn(δm), we find

bn(δm) = (1 + δm/M) An / (λn + in s0 δm/M),   (A.9)
where An is determined by the self-consistency equation:

An = (n cn e^{i an}/2) ∫_{−∞}^{∞} d(δm) P(δm, M) bn(δm).   (A.10)
The complex number λn can be written as λn = μn − inΩn, where μn and Ωn are real. Using this notation, the complex equations A.9 and A.10 are expressed as:

(2/(n cn)) cos an = ∫_{−∞}^{∞} d(δm) P(δm, M) μn (1 + δm/M) / [ μn² + n² (s0 δm/M − Ωn)² ],   (A.11)

(2/(n cn)) sin an = ∫_{−∞}^{∞} d(δm) P(δm, M) n (1 + δm/M)(s0 δm/M − Ωn) / [ μn² + n² (s0 δm/M − Ωn)² ].   (A.12)
The term δm/M is of the order 1/√M. Since M ≫ 1 (see equation 3.6), we neglect this term in equations A.11 and A.12 in comparison to 1, but not in comparison to |Ωn|, which may be small. We want to find the critical value, Mc,n, above which λn has a positive real part (μn > 0). Hence, we look for a solution, Mc,n, of equations A.11 and A.12 in the limit μn → 0+. The right-hand side of equation A.11 is always positive for μn ≥ 0, and therefore this equation has a solution only if

−π/2 < an < π/2.   (A.13)
Using the fact that lim_{μn→0+} μn/(μn² + ν²) = π δ(ν), defining yn = √(Mc,n) Ωn, and using the identity

P(m, M) = (1/√M) P(m/√M, 1),   (A.14)

equation A.11 yields (after some straightforward algebra)

(2/cn) cos an = π P(yn, 1) √(Mc,n).   (A.15)
Similarly, taking the limit μn → 0+ in equation A.12 gives

(1/cn) sin an = √(Mc,n) J(yn, 1),   (A.16)
where, following Sakaguchi and Kuramoto (1986), we have defined

J(Ω, M) = ∫_0^∞ dx [ P(Ω + x, M) − P(Ω − x, M) ] / (2x).   (A.17)
Therefore, Mc,n and Ωn are computed from:

J(y, 1) = (π/2) P(y, 1) tan an,   (A.18)

Mc,n = [ 2 cos an / (π cn P(y, 1)) ]².   (A.19)
These equations are similar to those obtained by Sakaguchi and Kuramoto (1986) for the emergence of a synchronized state in a heterogeneous phase model with coupling via the first mode only: C(φ) = sin(φ + a). The value of y obtained from solving equation A.18 is a function of an only. In Figure 12A, we plot the graphs of the functions J(y, 1) and (π/2)P(y, 1) tan a for a = −π/4. It is clear that y increases with |a| for |a| < π/2, and therefore P(y, 1) decreases with |a| in this range. We define the universal function, b(a), to be

b(a) = √(2π) P(y, 1) / cos a,   (A.20)

(see Figure 12B), where y is the solution of equation A.18. Equations A.19 and A.13 become

Mc,n = 8 / (π cn² b²(an));   −π/2 < an < π/2.   (A.21)
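Equations A.18 through A.21 can be evaluated numerically. The sketch below assumes that P(y, 1) is the standard gaussian density (consistent with the gaussian distribution of the number of inputs, equation 3.8) and uses simple quadrature and bisection; all function names and numerical tolerances are ours:

```python
import math

def P(y):
    """Standard gaussian density: assumed form of P(y, 1)."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def J(y, dx=1e-2, xmax=10.0):
    """J(y, 1) of equation A.17 by midpoint quadrature.
    The integrand [P(y+x) - P(y-x)]/(2x) stays finite as x -> 0."""
    total, x = 0.0, 0.5 * dx
    while x < xmax:
        total += (P(y + x) - P(y - x)) / (2.0 * x) * dx
        x += dx
    return total

def solve_y(a, lo=-30.0, hi=30.0):
    """Solve equation A.18, J(y, 1) = (pi/2) P(y, 1) tan(a), by bisection."""
    f = lambda y: J(y) - 0.5 * math.pi * P(y) * math.tan(a)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def M_c(c_n, a_n):
    """Critical in-degree for mode n (equation A.21); needs |a_n| < pi/2."""
    y = solve_y(a_n)
    b = math.sqrt(2.0 * math.pi) * P(y) / math.cos(a_n)  # equation A.20
    return 8.0 / (math.pi * c_n ** 2 * b ** 2)
```

At a_n = 0 the solution of A.18 is y = 0 and b(0) = 1, so Mc,n reduces to 8/(π cn²); with c1 = 0.2, the typical magnitude quoted in the discussion, this gives Mc ≈ 64. As |a_n| grows toward π/2, b decreases and Mc,n grows, consistent with Figure 12B.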
Figure 12: (A) Functions J(y, 1) (solid line) and (π/2)P(y, 1) tan a (dashed line) for a = −π/4. (B) Graph of the function b(a), referred to in equation 3.12 and defined in equations A.18 and A.20.
Equation A.21 gives the number of inputs, Mc,n, above which the nth Fourier mode destabilizes the asynchronous state. In order to find the number Mc above which this state becomes unstable, we look for the minimal Mc,n:

Mc = min_n Mc,n,   (A.22)

where the minimum is taken over all positive integers n for which −π/2 < an < π/2. Below Mc, the asynchronous state can be stable or marginally stable. In the case of the Kuramoto model (Kuramoto, 1984), the asynchronous state is linearly marginally stable when it is not unstable, in the sense that the stability matrix has a zero eigenvalue. Still, fluctuations of the asynchronous state decay with time (Strogatz & Mirollo, 1991; Strogatz, Mirollo, & Matthews,
1992). This is probably also the situation here. Our numerical results show that below Mc, the system dynamics converge to the asynchronous state when the initial conditions are chosen from a uniform distribution. Note that the instability criterion we have established is formally similar to the equations that determine the stability boundary of the asynchronous state in a fully connected network of neurons with heterogeneous intrinsic properties (Neltner et al., 2000).

Appendix B: The Fourier Transform of C(φ)

The Fourier transform of a periodic function h(φ), in the complex exponential representation, is

h(φ) = Σ_{n=−∞}^{∞} hn e^{−inφ}.   (B.1)
The complex number hn is represented as hn = |hn| exp[i arg(hn)], where |hn| is nonnegative and −π < arg(hn) ≤ π. For a real number x, arg(x) = 0 if x > 0 and arg(x) = π if x < 0. It is straightforward to calculate the Fourier coefficients of the functions Z(φ) (see equation 4.2) and Isyn(φ) (see equation 4.3). One finds:

|Zn| = [ h / (Iext (Iext − h) T0) ] (1 + n²ω0²τ0²)^{−1/2} { 1 + [ 2 Iext (Iext − h) / h² ] [ 1 − cos(nω0 Tr) ] }^{1/2},   (B.2)

arg(Zn) = −arctan(nω0τ0) − arctan{ sin(nω0 Tr) / [ Iext/(Iext − h) − cos(nω0 Tr) ] },   (B.3)

Isyn,n = 2 |gsyn| τ0 / { T0 [ (1 + n²ω0²τ1²)(1 + n²ω0²τ2²) ]^{1/2} },   (B.4)

arg(Isyn,n) = arg(gsyn) + arctan(nω0τ1) + arctan(nω0τ2),   (B.5)

where the function arctan takes values in the interval [−π/2, π/2]. The convolution theorem implies that the Fourier components of C(φ) (in the complex exponential representation) are given by

Cn = Zn I*syn,n,   (B.6)
1136
D. Golomb and D. Hansel
where the asterisk denotes the complex conjugate. Transferring this representation to the sine representation (see equation 3.3), this formula becomes

$$c_n = \frac{2\,|Z_n|\,I_{syn,n}}{|Z_0|\,I_{syn,0}}$$ (B.7)

$$a_n = -\frac{\pi}{2} - \arg(Z_n) + \arg\left(I_{syn,n}\right).$$ (B.8)
Equations B.6, B.7, and B.8 are valid for every model in which the synaptic coupling is expressed in terms of currents (and not in terms of conductances). Substituting equations B.2–B.5 into equations B.7 and B.8 yields equations 4.4 and 4.5.

Acknowledgments

We are thankful to C. van Vreeswijk, G. Mato, H. Sompolinsky, L. Neltner, and G. B. Ermentrout for most helpful discussions. We thank C. van Vreeswijk for a careful reading of the manuscript. D. H. acknowledges the hospitality of the Center for Neural Computation and the Racah Institute of Physics of the Hebrew University. This research was partially supported by the PICS 236 (PIR Cognisciences and M.A.E.), by a grant from AFIRST to D. H., and by a grant from the Israel Science Foundation to D. G.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Barkai, E., Kanter, I., & Sompolinsky, H. (1990). Properties of sparsely connected excitatory neural networks. Phys. Rev. A, 41, 590–597.
Binder, K., & Heermann, D. W. (1992). Monte Carlo simulation in statistical physics. Berlin: Springer-Verlag.
Bressloff, P. C., & Coombes, S. (1998a). Desynchronization, mode locking, and bursting in strongly coupled integrate-and-fire oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (1998b). Spike train dynamics underlying pattern formation in integrate-and-fire oscillator networks. Phys. Rev. Lett., 81, 2384–2387.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11, 1621–1671.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuron networks. Curr. Opin. Neurobiol., 5, 504–510.
Buzsáki, G., Leung, L., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res. Rev., 6, 139–171.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comput. Neurosci., 5, 407–420.
Ermentrout, G. B., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators. SIAM J. Math. Anal., 15, 215–237.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking. Neural Comput., 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neuronal networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (2000). Synchrony in sparse networks of phase oscillators. Unpublished manuscript.
Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Phys. Rev. A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 71, 259–282.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol., 72, 1109–1126.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neurosci., 1, 11–38.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E, 48, 3470–3477.
Hansel, D., Mato, G., & Meunier, C. (1993c). Phase reduction and neural modeling. Concepts in Neuroscience, 4, 192–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comp., 10, 467–485.
Hansel, D., & Sompolinsky, H. (1992). Synchrony and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comp. Neurosci., 3, 7–34.
Izhikevich, E. M. (1998). Phase models with explicit time delays. Phys. Rev. E, 58, 905–908.
Kim, J., van Vreeswijk, C., & Sompolinsky, H. (1998). Stochastic activity in balanced integrate-and-fire neural networks. Unpublished manuscript, Racah Institute of Physics, Hebrew University of Jerusalem.
Kopell, N. (1988). Toward a theory of modeling central pattern generators. In A. Cohen (Ed.), Neural control of rhythmic movements in vertebrates (pp. 369–413). New York: Wiley.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Lytton, W. W., & Sejnowski, T. J. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol., 66, 1059–1079.
Markram, H., Lübke, J., Frotscher, M., Roth, A., & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. J. Physiol. (Lond.), 500, 409–440.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1657.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous networks of spiking neurons. Neural Comput., 12 (in preparation).
Sakaguchi, H., & Kuramoto, Y. (1986). A solvable active rotator model showing phase transitions via mutual entrainment. Prog. Theor. Phys., 56, 576–581.
Shelley, M. (1999). Efficient and accurate integration schemes for systems of integrate-and-fire neurons. Unpublished manuscript, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
Strogatz, S. H., Mirollo, R. E., & Matthews, P. C. (1992). Coupled nonlinear oscillators below the synchronization threshold: Relaxation by generalized Landau damping. Phys. Rev. Lett., 68, 2730–2733.
Thomson, A. M., Deuchars, J., & West, D. C. (1993). Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex display paired pulse and frequency-dependent depression, mediated presynaptically, and self-facilitation, mediated postsynaptically. J. Neurophysiol., 70, 2354–2369.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Treves, A. (1993). Mean field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in integrate-and-fire networks. Phys. Rev. Lett., 71, 1280–1283.
Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. New York: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1372.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. (USA), 92, 5577–5581.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronous oscillations in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received April 30, 1999; accepted June 30, 1999.
LETTER
Communicated by David Lowe
A Neural Network Architecture for Visual Selection

Yali Amit
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.

This article describes a parallel neural net architecture for efficient and robust visual selection in generic gray-level images. Objects are represented through flexible star-type planar arrangements of binary local features, which are in turn star-type planar arrangements of oriented edges. Candidate locations are detected over a range of scales and other deformations, using a generalized Hough transform. The flexibility of the arrangements provides the required invariance. Training involves selecting a small number of stable local features from a predefined pool, which are well localized on registered examples of the object. Training therefore requires only small data sets. The parallel architecture is constructed so that the Hough transform associated with any object can be implemented without creating or modifying any connections. The different object representations are learned and stored in a central module. When one of these representations is evoked, it "primes" the appropriate layers in the network so that the corresponding Hough transform is computed. Analogies between the different layers in the network and those in the visual system are discussed. Furthermore, the model can be used to explain certain experiments on visual selection reported in the literature.
1 Introduction
The issue at hand is visual selection in real, still gray-scale images, guided by a very specific task: detecting a "memorized" or "learned" object class. Assume no "easy" segmentation cues are available, such as color or brightness of the object, or a relatively flat and uniform background. In other words, segmentation cannot precede detection. Moreover, the typical 240 × 320 gray-scale image is highly cluttered with multiple objects at various scales, and dealing with false positives is crucial. The aim of selection is to choose quickly a small number of candidate poses for the object in the scene, from among the millions of possibilities, taking into account all possible locations and a range of scales. Some of these candidate poses may be false and may be further analyzed for final verification, using more intensive procedures, which are not discussed here. Neural Computation 12, 1141–1164 (2000) © 2000 Massachusetts Institute of Technology
1142
Yali Amit
Various biological facts and experiments must be dealt with when constructing a computational model for visual selection. Selection is a very rapid process—approximately 150 milliseconds after presentation of the image (Thorpe, Fize, & Marlot, 1996; Desimone, Miller, Chelazzi, & Lueschow, 1995). Moreover, of these 150 milliseconds, at least 50 milliseconds are spent on processing from the retina through the primary visual cortex V1, which is assumed to be involved with the detection of edges, various local motion cues, and so forth. So very little time is left for the global processing necessary to determine a candidate location of the object. There is a limited number of processing stages between V1 and IT, where there is clear evidence for selective responses to full objects even prior to the saccade (see Desimone et al., 1995). A common schematic description of these stages includes V1 simple cells, V1 complex cells, V2 cells detecting more complex structures (e.g., illusory contours, T-junctions, corners; see von der Heydt, 1995), V4 cells, and finally IT. Receptive field size increases as one moves up this hierarchy. The receptive fields of cells in a given column, which correspond to approximately the same location in the visual field and to the same feature, exhibit significant deviations with respect to each other. These deviations increase with the level in the hierarchy (Zeki, 1993). Selection is not a very accurate process. Mistakes can be made. Often it is necessary to foveate to the location and process the data further to make a final determination as to whether the required object is indeed there. Selection is invariant to a range of scales and locations (Lueschow, Miller, & Desimone, 1994; Ito, Tamura, Fujita, & Tanaka, 1995).
The question is whether there are computational approaches that accommodate the constraints listed above—that is, can efficient selection be achieved at a range of scales, with a minimal number of false negatives and a small number of false positives, using a small number of functional layers? Can the computation be parallelized, making minimal assumptions regarding the information carried by an individual neuron? Such computational approaches, if identified, may serve as models for information processing beyond V1. They can provide a source of hypotheses that can be experimentally tested. The aim of this article is to present such a model in the form of a neural network architecture that is able to select candidate locations for any object representation evoked in a memory model; the network does not need to be changed for detecting different objects. The network has a sequence of layers similar to those found in visual cortex. The operations of all units are simple integer counts. A unit is either on or off, and its status depends
on the number of on units feeding into it. Selection at a fixed resolution is invariant to a range of scales and all locations, as well as to small linear and nonlinear deformations.

1.1 Overview of the Selection Algorithm. Selection involves finding candidate locations for the center of the object, which can be present at a range of poses (location, small rotations, scaling of ±20%, and other small deformations). In other words, the detection is invariant to the range of poses but does not convey information about the parameters of the pose except for location. Various options for recovering pose parameters such as scale and rotation are discussed in Amit and Geman (1999) and Amit (1998). This issue is not dealt with here. Selection can therefore be viewed as a process whereby each location is classified as object or background. By object, we mean that an object is present at one of an admissible range of deformations at that location. There are two pitfalls arising from this point of view that should be avoided. The first is the notion that one simply needs training samples from the object at the predetermined range of scales and samples from background images, which are then processed through some learning algorithm (e.g., tree classifiers, feedforward neural nets, support vector machines) to produce the classifier. The second is that this classifier is subsequently implemented at each location. The problem with this point of view is that it will produce a different global architecture (tree, neural net, etc.) for each new object that needs to be detected; it will require a large training set, both to represent background images and to learn the required range of scales; and it will not directly address the issue of efficiency in computation and resources required for detection of multiple objects. It should be noted, however, that this approach has led to successful computer vision algorithms for specific objects.
For an example in the context of face detection, see Rowley, Baluja, and Kanade (1998) and Sung and Poggio (1998). In view of these issues, and in order to accommodate the constraints described in the previous section, the selection system described below is guided by the following assumptions:
The classification carried out at each image location—object versus background—must be a very simple operation. One cannot expect a sophisticated classifier such as a heavily connected neural net to be present at each location.

Objects are represented through spatial arrangements of local features, not through individual local features. Individual local features are insufficient to discriminate between an object and the rich structure of the background; multiple local features are also insufficient without spatial constraints on their relative locations.
The basic inputs to the network are coarse oriented edge detectors with a high degree of photometric invariance. Local features are defined as functions of these edges.

Training involves determining moderate-probability local features, at specific locations on examples of the object, which are registered to a reference pose in a reference grid.

Invariance to local deformations is hard-wired into the definition of the local features, not learned. Invariance to global transformations is hard-wired into the computation and not learned.

No changes need to be made to the network for detecting new objects, except for learning their representation in a central memory module. The representation does not need to be distributed to all locations.

The network codes for the continuous location parameter through a retinotopic layer, which is active (on) only at the detected locations. Individual units are not expected to code for continuous variables.

The solution proposed here is to represent objects in terms of a collection of local features constrained to lie in certain regions relative to a virtual center. In other words, the spatial conjunction employed here can be expressed as a very simple star-type graph (see Figure 2, right panel). Any constraints involving relative locations of pairs or larger subsets of features significantly increase the computational load or the complexity of the required parallel architecture. Selection involves finding those locations for which there is a sufficient number of such features satisfying the relative spatial constraints. This is done using a version of the generalized Hough transform (see Grimson, 1990). The parallel architecture is constructed so that the Hough transform associated with any object can be implemented without creating or modifying any connections. The different object representations are stored in a central module.
When one of these representations is evoked, it primes the appropriate layers in the network so that the corresponding Hough transform is computed. The network involves two levels of simple- and complex-type neurons. The simple layers detect features (edges and edge conjunctions) at specific locations; the complex layers perform a disjunction over a certain region. These disjunctions are appropriately displaced from the original location of the feature, enabling a simple summation to compute the Hough transform. The disjunction regions are not learned but are hard-wired into the architecture. The features come from a predefined pool. They are constructed so as to have sufficiently low density on generic background. Training for an object class involves taking a small number of examples at the reference pose (i.e., registered to the reference grid) and identifying those features from the pool that are frequent at particular locations in the reference grid. These locations implicitly define the global spatial relations. There is no special learning phase for these relations. The object representation is simply the list of selected features with the selected locations on the reference grid.

The entire architecture uses a caricature of neurons as binary variables that are either on or off. This is clearly a very simplistic assumption, but it has several advantages. The description of the algorithm is greatly simplified, its serial implementation becomes very efficient, and it does not a priori commit to any specific assumption about the information conveyed in the continuous output of any of the neurons. On the other hand, starting from the model presented here, it is possible to incorporate multivalued outputs for the neurons at each stage and carefully study how this affects the performance, stability, and invariance properties of the system.

Two main conclusions can be drawn from the experiments and model described here. There is a way to achieve good object and background discrimination, with low false-negative and false-positive rates, in real images, with a very simple form of spatial conjunctions of local features. The local features are simple functions of the edge maps. This can be viewed as a statistical statement about objects and background in images. The detection algorithm has a simple parallel implementation, based on the well-known Hough transform, that exhibits interesting analogies with the visual system. In other words, efficient visual selection can be done in parallel.

The article is organized as follows. Section 2 discusses relations to other work on detection. Section 3 provides a description of the detection algorithm, the type of local features employed, and how they are identified in training. In section 4 the neural architecture for such detection is described. Section 5 discusses the biological analogies of the model and how the architecture can be used to explain certain experiments in visual selection reported in the literature.

2 Related Work
The idea of using local feature arrangements for detection can also be found in Burl, Leung, and Perona (1995), Wiskott, Fellous, Krüger, and von der Malsburg (1997), and Cootes and Taylor (1996). In these approaches the features, or certain relevant parameters, are also identified through training. One clear difference, however, is that the approach presented here makes use of very simple binary features with hard-wired invariances and employs a very simple form of spatial arrangement for the object representation. This leads to an efficient implementation of the detection algorithm. On the other hand, the representations described in these articles are more detailed, provide more precise registration information, and are useful in the more
intensive verification and recognition processes subsequent to selection and foveation.

In Van Rullen, Gautrais, Delorme, and Thorpe (1998) a similar representation is derived. Features defined in terms of simple functionals of an edge input layer are used to represent components of the face, making use of the statistics of the edges on the training set. They model only three locations (two eyes and mouth) on the object, which are chosen by the user, training multiple features for each. In the approach suggested here, many more locations on the object are represented, using one feature for each. These locations are identified automatically in training. One problem with overdedicating resources to only three locations is the issue of noise and occlusion, which may eliminate one of the three. Our representation for faces, for example, is at a much lower resolution—14 pixels between the eyes on average. At that resolution the eyes and mouth in themselves may be quite ambiguous. In Van Rullen et al. (1998) pose-invariant detection is achieved by having a "face neuron" for each location in the scene. This raises a serious problem of distributing the representation to all locations. One of the main goals of the architecture here is to overcome the need to distribute the representation of each new object that is learned to a dedicated "object neuron" at each location.

The generalized Hough transform has been extensively used in object detection (see Ballard, 1981; Grimson, 1990; Rojer & Schwartz, 1992). However, in contrast to most previous uses of the Hough transform, the local features are more complex and are identified through training, not from the geometric properties of the object. It should be emphasized that this "trained" Hough transform can be implemented with any pool of binary features that exhibit the appropriate invariance properties and the appropriate statistics on object and background—for example, those suggested by Van Rullen et al. (1998).
In the use of simple and complex layers this architecture has similarities with Fukushima (1986) and Fukushima and Wake (1991). Indeed, in both articles the role of ORing in the complex-cell layers as a means of obtaining invariance is emphasized. However, there are major differences between the architecture described here and the neocognitron paradigm. In our model, training is done only for local features. The global integration of local-level information is done by a fixed architecture, driven by top-down information. Therefore, features do not need to get more and more complex. There is no need for a long sequence of layers. Robust detection occurs directly at the local feature level, which has only the oriented edge level below it.

3 Detection
We first describe a simple counting classifier for object versus background at the reference pose. Then, an efficient implementation of this classifier on an entire scene is presented, using the Hough transform.
3.1 A Counting Classifier. Let G denote a d × d reference grid, centered at the origin, with d ≈ 30. An image from the object class is said to be registered to the reference grid if certain well-defined landmarks on the object are located at fixed locations in G. For example, an image of a face would be registered if the two eyes and the mouth are at fixed locations. Let a_1, ..., a_{n_ob} be binary local features (filters) that have been chosen through training (see section 3.4), together with locations y_1, ..., y_{n_ob} in G. This collection constitutes the object representation. For any i = 1, ..., n_ob and any location x in a scene, we write a_i(x) = 1/0 to indicate the presence or absence of local feature a_i in a small neighborhood of x. A threshold τ is chosen. An image in the reference grid is then classified as object if at least τ of these features are present at the correct locations, and as background otherwise. Assume that, conditional on the presence of a registered object in the reference grid, the variables a_i(y_i) are independent, with P(a_i(y_i) = 1 | Object) = p_o, and that, conditional on the presence of a generic background image in the reference grid, these variables are also independent, with P(a_i(y_i) = 1 | Bgd.) = p_b << p_o. Using basic properties of the binomial distribution, we can predict the false-positive and false-negative probabilities for given n_ob, τ, p_o, p_b (see Rojer & Schwartz, 1992; Amit, 1998).

3.2 Finding Candidate Poses. How can this classifier be extended to the full image, taking into account that the object may appear at all locations and under a large range of allowable transformations? Assume all images are presented on a D × D grid L and centered at the origin, with D ≈ 300. Thus, the reference grid is a small subset of the image grid. Henceforth y will denote locations on the reference grid, and x will denote locations in the scene or image.
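The binomial error-rate prediction for the counting classifier of section 3.1 can be written out directly. The following is a hedged sketch; the parameter values (n_ob = 20 features, threshold τ = 10, p_o = 0.8, p_b = 0.05) are illustrative, not values from the article:

```python
from math import comb

def detection_error_rates(n_ob, tau, p_o, p_b):
    """Predict false-negative and false-positive probabilities for a
    counting classifier that declares 'object' when at least tau of the
    n_ob features fire, assuming conditionally independent features."""
    def tail(p):  # P(at least tau successes out of n_ob), binomial tail
        return sum(comb(n_ob, k) * p**k * (1 - p)**(n_ob - k)
                   for k in range(tau, n_ob + 1))
    p_false_neg = 1.0 - tail(p_o)   # object present, fewer than tau features found
    p_false_pos = tail(p_b)         # background reaches the threshold by chance
    return p_false_neg, p_false_pos

# Illustrative numbers: with p_o >> p_b, both error rates are tiny.
fn, fp = detection_error_rates(20, 10, 0.8, 0.05)
```

Raising τ trades false positives for false negatives, which is the design knob discussed in the text.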
A pose of the object in the image is determined by a location x_c and a map A ∈ A, where A is some subset of linear (and perhaps even nonlinear) transformations of the plane. Typically A accommodates a range of scales of ±25% and a small range of rotations. To decide whether the object is present at a location x_c ∈ L, for each transformation A ∈ A we need to check how many of the n_ob features a_i are present at the transformed locations x_i = A y_i + x_c. If at least τ are found for some A, then the object is declared present at location x_c. This operation involves an extensive search through the pose set A or, alternatively, verifying complex constraints on the relative locations of pairs or triples of local features (see Grimson, 1990). It is not clear how to implement this in a parallel architecture. We simplify by decoupling the constraints between the local features. To decide if the object is present at x_c, we check for each feature a_i whether it is present at x_i = A y_i + x_c for any A ∈ A. Namely, the map A may vary from feature to feature. In other words, we count how many of the regions x_c + B_{y_i} contain at least one instance of the corresponding feature a_i, where
B_{y_i} = {A y_i : A ∈ A}. If we find at least τ, we declare the object present at x_c. This is summarized as follows:

(C) There exist at least τ indices i_1, ..., i_τ such that the region x_c + B_{y_{i_j}} contains an instance of local feature a_{i_j}, for j = 1, ..., τ.

The object model can now be viewed as a flexible star-type planar arrangement of the local features, centered at the origin (see Figure 2, right panel). This is a very simple system of spatial constraints on the relative locations of the features. Decoupling the constraints allows us to detect candidate locations with great efficiency. There is no need to proceed location by location and verify (C). Starting from the detected locations of the local features, the following Hough transform provides the candidate locations of the object:

Hough Transform
1. Identify the locations of all n_ob local features in the image.

2. Initialize D × D arrays Q_i, i = 1, ..., n_ob, at 0.
3. For every location x of feature a_i, set to 1 all locations in the region x − B_{y_i} in array Q_i.

4. Choose locations x_c for which Σ_{i=1}^{n_ob} Q_i(x_c) ≥ τ.

The locations of step 4 are precisely the locations where condition (C) is satisfied. For a 30 × 30 reference grid, the size of B_i = B_{y_i} needed to accommodate the range of scales and rotations mentioned above varies depending on the distance of the point y_i from the origin; it is at most on the order of 100 pixels. Note that B_y is entirely determined by y once the set of allowable transformations A is fixed. Under condition (C), both p_o and p_b increase relative to their values on the reference grid. The new background probability p_b is approximated by λ_b |B|, where λ_b is the density of the features in the background and |B| is the area of the regions B_i. The new object probability p_o can be reestimated from the training data using the regions B_i. The main concern is the increase in false positives due to the increase in the background probabilities. This is directly related to the background density λ_b. As long as λ_b is sufficiently small, this "greedy" detection algorithm will not yield too many false positives.

3.3 Local Features. The key question is which local features to use—namely, how to decrease λ_b while maintaining relatively high "on object" probabilities p_o. The first possibility would be to use the oriented edges, which serve as input to the entire system. These are the most basic local features detected in the visual system. The problem with edges, however, is
Figure 1: A vertical edge is present at z if |I(z) − I(y)| > |I(z) − I(z_i)| and |I(z) − I(y)| > |I(y) − I(y_i)| for all i = 1, 2, 3, where y is the horizontal neighbor of z and z_1, ..., z_3 and y_1, ..., y_3 are the remaining neighbors of z and y. The polarity depends on the sign of I(z) − I(y). Horizontal edges are defined by rotating this scheme by 90 degrees.
that they appear with a rather high density in typical scenes. In other words, λ_b is such that λ_b |B| > 1, leading to very large numbers of false positives. (See also the detailed discussion in Grimson, 1990.) The edge detector used here is a simple local maximum of the gradient with four orientations: horizontal and vertical of both polarities (see Figure 1). The edges exhibit strong invariance to photometric transformations and local geometric transformations. A typical λ_b for such edges on generic images is 0.03 per pixel. We therefore move on to flexible edge conjunctions to reduce λ_b while maintaining high p_o. These conjunctions are flexible enough to be stable at specific locations on the object for the previously defined range of deformations, and they have lower density in the background. These conjunctions are simple binary filters computed on the edge maps. We emphasize that the computational approach and associated architecture can be implemented with other families of local features exhibiting similar statistical properties.

3.3.1 Flexible Edge Arrangements. Define a collection of small regions b_v in the immediate vicinity of the origin, indexed by their centers v. Let G_loc denote the collection of these centers. A local feature is defined in terms of n_loc pairs of edges and locations (e_ℓ, v_ℓ), ℓ = 1, ..., n_loc. It is present at point x if there is an edge of type e_ℓ in the neighborhood x + b_{v_ℓ}, for each ℓ = 1, ..., n_loc. Figure 2 shows an example of an arrangement with three edges. It will be convenient to assume that v_1 = 0 and that b_1 = {0}. Invariance to local deformations is explicitly incorporated in the disjunction over the regions b_v. Note that each local feature can be viewed as a flexible star-type planar arrangement centered at x. Arrangements with two edges (n_loc = 2) loosely describe local features such as corners, T-junctions, slits, and various contour segments with different curvature.
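As an illustration, the presence of such a flexible arrangement at a point can be tested directly on boolean edge maps. This is a minimal sketch, not the paper's code; the function name, the map representation, and the (edge type, center, region) encoding are assumptions of mine.

```python
import numpy as np

def arrangement_present(edge_maps, pairs, x):
    """Test whether a flexible edge arrangement is present at point x.

    edge_maps: dict mapping edge type e -> 2D boolean array of detections.
    pairs: list of (e, v, b) triples; edge type e must appear anywhere in
           the region x + v + b, where b is a small set of offsets around v.
           By the convention in the text, the first pair is (e1, (0,0), [(0,0)]).
    """
    for e, v, b in pairs:
        found = False
        for dv in b:
            px, py = x[0] + v[0] + dv[0], x[1] + v[1] + dv[1]
            m = edge_maps[e]
            if 0 <= px < m.shape[0] and 0 <= py < m.shape[1] and m[px, py]:
                found = True
                break
        if not found:
            return False  # one required edge missing -> feature absent
    return True
```

For instance, the three-edge arrangement of Figure 2 (horizontal edge at x, horizontal edge anywhere in x + b_{v_2} with v_2 = (−2, 2), vertical edge anywhere in x + b_{v_3} with v_3 = (2, −3), 3 × 3 regions) is a list of three such triples.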
On generic gray-scale images, the density λ_b is found to decay exponentially with n_loc (see Amit & Geman, 1999), and for fixed n_loc it is more or less the same no matter which arrangement is chosen. For example, for n_loc = 4 the density λ_b lies between 0.002 and 0.003. For n_loc = 2, λ_b ≈ 0.007.
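Steps 1 through 4 of the detection algorithm above (spreading each feature detection into its Q_i array over x − B_{y_i}, then thresholding the summed votes) can be sketched as follows. Array names, the coordinate convention, and the brute-force loops are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def detect_candidates(feature_locations, regions, D, tau):
    """Hough-style candidate detection, steps 1-4.

    feature_locations: one list per model feature i, containing image
        locations x at which feature a_i was detected (step 1).
    regions: one offset set B_{y_i} per feature; a detection at x votes
        for every candidate center in x - B_{y_i} (step 3).
    D: side length of the reference arrays; tau: vote threshold (step 4).
    """
    n_ob = len(feature_locations)
    Q = np.zeros((n_ob, D, D), dtype=bool)           # step 2
    for i in range(n_ob):
        for (x0, x1) in feature_locations[i]:
            for (dy0, dy1) in regions[i]:
                c0, c1 = x0 - dy0, x1 - dy1          # locations in x - B_{y_i}
                if 0 <= c0 < D and 0 <= c1 < D:
                    Q[i, c0, c1] = True
    votes = Q.sum(axis=0)                            # sum over the n_ob arrays
    return np.argwhere(votes >= tau)                 # step 4: condition (C)
```

Because each Q_i is binary, a feature detected at several nearby image locations still contributes at most one vote per candidate center, which keeps the threshold τ interpretable as "number of distinct model features found".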
1150
Yali Amit
Figure 2: An arrangement with three edges. We are using v_1 = (0, 0), v_2 = (−2, 2), v_3 = (2, −3). The regions b_{v_2} and b_{v_3} are 3 × 3; b_1 = {0}. This arrangement is present at a location x if a horizontal edge is present at x, another horizontal edge is present anywhere in x + b_{v_2}, and a vertical edge is present anywhere in x + b_{v_3}. (Right) A graphical representation of the full model. The large circle in the middle represents the center point, each medium-sized circle represents the center of a local feature, and the small circles represent the edges.
3.4 Training. Training images of the object class are registered to the reference grid. A choice of feature complexity is made; n_loc is chosen. At each location y, a search through the pool of features with n_loc edges is carried out to find a feature that is present in over 50% of the data, anywhere in a small neighborhood of y. If one is found, it is recorded together with the location. For a variety of object classes such as faces, synthetically deformed symbols and shapes, and rigid 3D objects, and using n_loc = 4, a greedy search typically yields several tens of locations on the reference grid for which p_o ≈ 0.5. After correcting for the larger regions B_i we obtain p_o ≈ 0.8, and it turns out that, taking into consideration the background statistics, using only 20 of these feature and location pairs is sufficient. These are chosen randomly from all identified feature and location pairs. We then use a threshold τ ≈ 10 in the Hough transform. Alternatively, a full search over all features with two edges (n_loc = 2) yields p_o > 0.9. In this case, 40 local features with τ ≈ 30 are needed for good classification. It is precisely the fact that this hierarchy of local features exhibits such statistics in real gray-level images that makes the algorithm feasible.

3.5 Some Experiments. In order to cover a range of scales of approximately 4:1, the same detection procedure is carried out at 5 to 6 resolutions. On a 200-MHz Pentium II, the computation time for all scales together is about 1.5 seconds for generic 240 × 320 gray-scale images. Example detections are shown in Figures 7, 8, and 9. Also shown in these displays are random samples from the training sets and some of the local features used in each
representation in their respective locations on the reference grid. The detectors for the 2D view of the clip (see Figure 9) and for the "eight" (see Figure 8) were made from only one image, which was subsequently synthetically deformed using some random linear transformations. In the example of the clip, due to the invariances built into the detection algorithm, the clip can be detected in a moderate range of viewing angles around the original one, which were not present in training. The face detector was produced from only 300 faces of the Olivetti database. Still, faces are detected in very diverse lighting conditions and in the presence of occlusion. (More can be viewed online at http://galton.uchicago.edu/~amit/detect.) For each image, we show the detection of the Hough transform using a representation with 40 local features of two edges each. We also show the result of an algorithm that employs one additional stage of filtering of the false positives, as described in Amit and Geman (1999). In this data set, there are on average 13 false positives per image at all six resolutions together. This amounts to 0.00005 false positives per pixel. The false-negative rate for this collection of images is 5%. Note that the lighting conditions of the faces are far more diverse than those in the Olivetti training set.

4 Neural Net Implementation
Our aim is to describe a network that can easily adjust to detect arrangements of local features representing different object classes. All features used in any object representation come from a predetermined pool F = {a_1, ..., a_N}. The image locations of each feature in F are detected in arrays F_i, i = 1, ..., N (see section 4.1). A location x in F_i is on if local feature a_i is present at location x in the image. Let F × G denote the entire collection of pairs of features and reference grid locations, namely F × G = {(a_i, y), i = 1, ..., N, y ∈ G}. Define a module M that has one unit corresponding to each such pair: N × |G| units. An object representation consists of a small collection of n_ob such pairs (a_{i_j}, y_j), j = 1, ..., n_ob. For simplicity, assume that n_ob is the same for all objects and that τ is the same as well. Typically n_ob will be much smaller than N, on the order of several tens. An object representation is a simple binary pattern in M, with n_ob 1s and (N − n_ob) 0s. For each local feature array F_i, introduce a system of arrays Q_{i,y} for each location y ∈ G. These Q arrays lay the ground for the detection of any object representation by performing step 3 in the Hough transform for each of the B_y regions. Thus, a unit at location x ∈ Q_{i,y} receives input from the region x + B_y in F_i. Note that for each unit u = (a_i, y) in M, there is a corresponding Q_{i,y} array. All units in Q_{i,y} receive input from u. This is where the top-down flow of information is achieved. A unit x in Q_{i,y} is on only if both (a_i, y) is on and any unit in the region x + B_y is on in F_i. In other words, the representation
evoked in M primes the appropriate Q_{i,y} arrays. The system of Q arrays sums into an array S. A unit at location x ∈ S receives input from all Q_{i,y} arrays at location x and is on if Σ_{i=1}^{N} Σ_{y∈G} Q_{i,y}(x) ≥ τ. The array S therefore shows those locations for which condition (C) is satisfied. If a unique location needs to be chosen, say, for the next saccade, then picking the x with the highest sum seems a natural choice. In Figures 7, 8, and 9, the red dots show all locations satisfying condition (C), and the green dot shows the one with the most hits. This architecture, called NETGLOB, is described in Figure 3.

For each local feature a_i we need |G| = d² complex Q-type arrays, so that the total number of Q-type arrays is N_Q = N d². For example, if we use arrangements with two edges and assume 4 edge types and 16 possible regions b_v (see section 3.3), the total number of features is N = 4 × 4 × 16 = 256. Taking a 30 × 30 reference grid, N_Q is on the order of 10^5. It is, of course, possible to give up some degree of accuracy in the locations of the y_j's, assuming, for example, that they are on some subgrid of the reference grid. Recall that the S array is only supposed to provide candidate locations, which must then undergo further processing after foveation, including small corrections for locations. Therefore, assuming, for example, that the coordinates of y_j are multiples of three would lead to harmless errors. But then the number of possible locations in the reference grid reduces to 100, and N_Q ≈ 10^4. Using such a coarse subgrid of G also allows the F and Q arrays to have lower resolution (by the same factor of three), thus reducing the space required by a factor of nine. Nonetheless, N_Q is still very large, due to the assumption that all possible local features are detected bottom-up in hard-wired arrays F_i. This is quite wasteful in space and computational resources.
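The primed summation into S described above can be sketched directly: only pairs (a_i, y) that are on in M contribute a Q_{i,y} array, so the top-down pattern selects which bottom-up detections are counted. This is an illustrative, unoptimized rendering; the function name and signature are assumptions of mine.

```python
import numpy as np

def global_sum_layer(F, model, B, tau):
    """NETGLOB-style summation into S.

    F: dict feature index i -> 2D boolean detection array F_i.
    model: set of pairs (i, y) that are on in the module M.
    B: dict y -> list of offsets B_y (region of influence).
    tau: detection threshold applied to S.
    Returns candidate locations and the summation array S.
    """
    shape = next(iter(F.values())).shape
    S = np.zeros(shape, dtype=int)
    for (i, y) in model:                  # top-down: only primed pairs count
        Fi = F[i]
        Q = np.zeros(shape, dtype=bool)   # Q_{i,y}
        for x0 in range(shape[0]):
            for x1 in range(shape[1]):
                # unit x in Q_{i,y} is on if any unit in x + B_y is on in F_i
                for (d0, d1) in B[y]:
                    p0, p1 = x0 + d0, x1 + d1
                    if 0 <= p0 < shape[0] and 0 <= p1 < shape[1] and Fi[p0, p1]:
                        Q[x0, x1] = True
                        break
        S += Q
    return np.argwhere(S >= tau), S
```

The dense double loop over x stands in for what the architecture does in parallel across the Q arrays.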
In the next section we introduce adaptable local feature detectors, which greatly reduce the number of Q-type layers needed in the system.

4.1 Detecting the Local Features. The local features are local star-type arrangements of edges; they are detected in the F_i layers with input from edge detector arrays E_e, e = 1, ..., 4. For each edge type e, define a system of "complex" arrays C_{e,v}, v ∈ G_loc. A unit x ∈ C_{e,v} receives input from the region x + b_v in E_e. It is on if any unit in x + b_v is on. A local feature a_i with pattern (e_ℓ, v_ℓ), ℓ = 1, ..., n_loc, as defined in section 3.3, is detected in the array F_i. Each x ∈ F_i receives input from C_{e_ℓ,v_ℓ}(x), ℓ = 1, ..., n_loc, and is on if all n_loc units are on. The full system of E, C, and F layers, called NETLOC, is shown in Figure 4. Note that in contrast to the global detection scheme, where a top-down model determined which arrays contributed to the summation array S, here everything is hard-wired and calculated bottom-up. Each unit in an F_i array sums the appropriate inputs from the edge detection arrays. Alongside the hard-wired local feature detectors, we introduce adaptable local feature detectors. Here local feature detection is driven by top-down
Figure 3: NETGLOB. The object is defined in terms of four features at four locations. Each feature/location pair (a_i, y) provides input to all units in the corresponding Q_{i,y} array (thick lines). The locations of the feature detections are shown as dots. They provide input to regions in the Q arrays, shown as double-thick lines. At least three local features need to be found for a candidate detection. The fourth feature does not contribute to the detection; it is not present in the correct location. There are N systems of F, Q arrays, one for each a_i ∈ F.
Figure 4: NETLOC. Four E layers detect edge locations. For each E_e layer, there is a C_{e,v} layer corresponding to each v ∈ G_loc: all in all, 4 × |G_loc| C layers. There is an F layer for each local feature; each location in the F layer receives input from the same location in C_{e_ℓ,v_ℓ}, ℓ = 1, ..., n_loc.
information. Only features required by the global model are detected. If only adaptable feature detectors are used in the full system, the number of Q arrays required is greatly reduced and is equal to the size of the reference grid. For each location y in the reference grid, there is a module M_{loc,y}, a detector array F_{loc,y}, one "complex" array Q_y, and a dedicated collection of C_{e,v} arrays. Without a local feature representation being evoked in M_{loc,y}, nothing happens in this system. When a local feature representation is turned on in M_{loc,y}, only the corresponding C_{e,v} arrays are primed. Together with the input from the edge arrays, the local feature locations are detected in a summation array F_{loc,y}. Finally, a detection at x in F_{loc,y} is spread to the region x − B_y in Q_y. (See Figure 5.)
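The bottom-up computation of section 4.1 — complex arrays C_{e,v} as disjunctions (ORs) over displaced regions, and feature arrays F_i as conjunctions (ANDs) of n_loc complex units — can be sketched as follows. Names and the (edge type, offsets) encoding are hypothetical; offsets here combine v_ℓ with the region b_{v_ℓ}.

```python
import numpy as np

def complex_array(E, offsets):
    """C_{e,v}: unit x is on if any unit in the region x + b_v is on in E_e."""
    C = np.zeros_like(E)
    n0, n1 = E.shape
    for x0 in range(n0):
        for x1 in range(n1):
            for (d0, d1) in offsets:
                p0, p1 = x0 + d0, x1 + d1
                if 0 <= p0 < n0 and 0 <= p1 < n1 and E[p0, p1]:
                    C[x0, x1] = True
                    break
    return C

def feature_array(edge_arrays, pattern):
    """F_i: conjunction of the n_loc complex arrays of the feature's pattern.

    edge_arrays: dict edge type e -> boolean array E_e.
    pattern: list of (e_l, offsets_l), the offsets absorbing v_l + b_{v_l}.
    """
    Cs = [complex_array(edge_arrays[e], offs) for (e, offs) in pattern]
    F = Cs[0].copy()
    for C in Cs[1:]:
        F &= C                      # x is on only if all n_loc units are on
    return F
```

The OR over offsets is exactly a binary dilation of the edge map; a production implementation would use a dilation primitive rather than explicit loops.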
Figure 5: NETADAP. The representation of the local feature is turned on in the local feature memory module. This determines which C layers receive input from M_{loc,y}. Units in these C layers that also receive input from the edge layer become activated. The F_{loc,y} layer sums the activities in the C layers and indicates the locations of the local features. F_{loc,y} then feeds into Q_y: x ∈ Q_y is on if any point in x + B_y is on in F_{loc,y}.
A global system with only adaptable feature detections is shown in Figure 6. In this system, the number of Q-type arrays is |G|. Top-down information does not go directly to these Q layers. When an object representation is turned on in M, each unit (a_i, y) that is on activates the representation for a_i in M_{loc,y}, which primes the associated C_{e,v} arrays. The detections of these features are obtained in the F_{loc,y} arrays and appropriately spread to the Q_y arrays. All the arrays Q_y are summed and thresholded in S to find the candidate locations. The adaptable local feature detectors F_{loc,y} can operate alongside the hard-wired F arrays. The arrays Q_y, together with the original system of
Figure 6: NETGLOB (adaptable). The object is defined in terms of four features at four locations. Each location y provides input to an M_{loc,y}. If a pair (a, y) is on in M_glob, the representation of a is on in M_{loc,y}. This is represented by the dots in the corresponding modules. The detections of feature a are then obtained in F_{loc,y}, using a system as in Figure 5, and spread to Q_y. At least three local features need to be found for a candidate detection. The fourth feature does not contribute to the detection; it is not present in the correct location. There was no feature on at location y, so the system of M_{loc,y} is not active.
Figure 7: Each of the top two panels shows 10 of the 20 local features from the face representation, trained from 300 faces of the Olivetti database. The blue-green combinations indicate which edge is involved in the edge arrangement. Blue stands for darker. The three pink dots are the locations of the two eyes and mouth on the reference grid. Each red dot in an image represents a candidate location for the center of the face. The green dot shows the location with most counts. Detections are shown from all six resolutions together. The top image was scanned from Nature, vol. 278, October 1997.
Figure 8: First panel: 12 samples from the training set. Second panel: 10 images of the prototype 8 with two different local features in each. Detections are shown in four randomly generated displays; red dots indicate detected locations, and the green dot indicates the location with the highest count.
Q_{i,y} arrays, sum into S_glob. If a local feature (a_i, y) is on in M, either there is an associated hard-wired detector layer F_i and Q_{i,y} is primed, or it feeds into M_{loc,y} and the required detections are obtained in F_{loc,y} and spread to the associated Q_y. It appears more reasonable to have a small number of recurring local features that are hard-wired in the system alongside adaptable local feature detectors.

Figure 9: Same information for the clip. The clip was trained on 32 randomly perturbed images of one original image, 12 of which are shown. Detection is only for this range of views, not for all possible 3D views of the clip.

5 Biological Analogies

5.1 Labeling the Layers. Analogies can be drawn between the layers defined in the architecture of NETGLOB and existing layers of the visual system. The simple edge-detector layers E_e, e = 1, ..., 4, correspond to simple orientation-selective cells in V1. These layers represent a schematic abstraction of the information provided by the biological cells. However, the
empirical success of the algorithm on real images indicates that not much more information is needed for the task of selection. The C_{e,v} layers correspond to complex orientation-selective cells. One could imagine that all cells corresponding to a fixed edge type e and fixed image location x are arranged in a column. The only difference as one proceeds down the column is the region over which the disjunction (ORing) is performed (i.e., the receptive field). In other words, the units in a column are indexed by v ∈ G_loc, which determines the displaced region b_v. Indeed, any report on vertical electrode probings in V1, in which orientation selectivity is constant, will show a variation in the displacement of the receptive fields. (See Zeki, 1993.) The detection arrays F_i correspond to cells in V2 that respond to more complex structures, including what are termed illusory contours (see von der Heydt, 1995). If an illusory contour is viewed as a loose local arrangement of edge elements, that is precisely what the F_i layers are detecting. The Q_{i,y} layers correspond to V4 cells. These have much larger receptive fields, and the variability of the location of these fields as one proceeds down a column is much more pronounced than in V1 or V2; it is on the order of the size of the reference grid.

5.2 Bottom-Up–Top-Down Interactions. Top-down and bottom-up processing are explicitly modeled. Bottom-up processing is constantly occurring in the simple cell arrays E_e, which feed into the complex cell arrays C_{e,v}, which in turn feed into the F_i (V2-type) arrays. One could even imagine the Q (V4-type) arrays being activated by the F arrays regardless of the input from M. Using the terminology introduced in Ullman (1996), only those units that are primed by receiving input from M contribute to the summation into S. The object pattern that is excited in the main memory module determines which of the Q arrays will have enhanced activity toward their summation into S.
Thus, the final determination of the candidate locations is given by an interaction of the bottom-up processing and the top-down processing, manifested in the array S. With the adaptable local feature detectors, the top-down influence is even greater. Nothing occurs in an F_{loc,y} detector unless a local feature representation is turned on in M_{loc,y}, due to the activation in M of an object representation that contains that feature at location y.

5.3 Gating and Invariant Detection. The summation array S can serve as a gating mechanism for visual attention. Introduce feedback connections from each unit x ∈ S to the unit at the same location in all Q layers. The only way a unit in a Q_{i,y} layer can remain active for an extended period is if it received input from some (a_i, y) ∈ M and from the S layer. This could provide a model for the behavior of IT neurons in delay match-to-sample task experiments (see Chelazzi, Miller, Duncan, & Desimone, 1993, and Desimone et al., 1995). In these experiments, neurons in IT selectively responsive to two different objects are found. The subject is then presented with the object to be detected: the sample. After a delay period, an image with both objects is displayed, both displaced from the center. The subject is supposed to saccade to the sample object. After presentation of the test image, neurons responsive to both objects are active. After about 100 milliseconds, and a few tens of milliseconds prior to the saccade, the activity of the neurons selective for the nonsample object decays. Only neurons selective for the sample object remain active.

If all units in the Q_{i,y} layers feed into (a_i, y) in M, then (a_i, y) receives bottom-up input whenever a_i is present anywhere in the image, no matter what y is. Thus, the units in M do not discriminate between the locations from which the inputs are coming, and at the presentation of a test image, one may observe activity in M due to bottom-up processing in the Q layers, which responds to other objects in the scene. This is consistent with the scale and location invariance observed in IT neuron responses (Ito et al., 1995). When a location is detected in S, gating occurs as described above, and the only input into M from the Q arrays is from the detected location, at the layers corresponding to the model representation. For example, assume that feature a_i is detected at locations x_1, x_2, x_3 in the image (i.e., units x_1, x_2, x_3 are on in F_i). Then all the units in x_1 − B_y, x_2 − B_y, x_3 − B_y will be on in Q_{i,y} for all y ∈ G. This corresponds to the bottom-up flow of information. In the object model, assume a_i is paired with location ŷ. Assume that only x_1 comes from the correct location on an instance of the object, that is, the center of the object lies around x_1 − B_ŷ. Then, if enough other features are detected on this instance of the object, some point x* ∈ x_1 − B_ŷ will be on in S, and subsequently the area around x* in the Q_{i,ŷ} layer will remain active.
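The gating rule itself is simple to state: after detection, a unit x of Q_{i,y} stays active only if it receives both top-down input from a pair (a_i, y) that is on in M and feedback from a detected location x in S. A small set-based sketch (representation and names are mine, for illustration):

```python
def surviving_units(active_q_units, model, s_detections):
    """Apply the gating rule to the bottom-up activity in the Q layers.

    active_q_units: set of (i, y, x) triples that are on after the
        bottom-up spread of feature detections into the Q_{i,y} arrays.
    model: set of (i, y) pairs on in M (the object representation).
    s_detections: set of locations x on in the summation layer S.
    """
    return {(i, y, x) for (i, y, x) in active_q_units
            if (i, y) in model and x in s_detections}
```

In the example above, the units spread from the spurious detections x_2 and x_3, and all units in layers Q_{i,y} with y ≠ ŷ, fail one of the two conditions and decay; only the activity around x* in Q_{i,ŷ} survives.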
After detection occurs in S, followed by the gating process, the only input coming into M from the Q arrays corresponding to feature a_i (Q_{i,y}, y ∈ G) is from Q_{i,ŷ}(x*). The same is true for the other features in the object representation that have been detected at location x*. This means that after detection, the M module is either receiving no input (if no candidate locations are found) or some subset of the units in the object representation is receiving reinforced input. Only the object representation is reinforced in M, signifying that detection has occurred invariant to translation, scale, and other deformations. To summarize, the units in M exhibit selective activity for the sample object only due to the input from a lower-level process in which location has already been detected and gating has occurred.

5.4 Hard-Wired Versus Adaptable Features. The number of hard-wired local feature arrays F_i and Q_{i,y} we would expect to find in the system is limited. It is reasonable to assume that a complementary system of adaptable features exists, as described in section 4.1. These features are learned for more specific tasks and may then be discarded if not needed; for example,
finer angle tuning of the edges may be required, as in the experiments in Ahissar and Hochstein (1997). People show dramatic improvement in detection tasks over hundreds of trials in which they are expected to repeat the task. It may well be that the hard-wired local features are too coarse for these tasks. The repetition is needed to learn the appropriate local features, which are then detected through the adaptable local feature system.

5.5 Learning. Consider the reference grid as the central field of view (fovea), and let each unit (a_i, y) ∈ M be directly connected to F_i(y) for the locations y ∈ G. The unit (a_i, y) ∈ M is on if F_i(y) is on. In other words, the area of the reference grid in the F_i arrays feeds directly into the M module. Learning is then done through images of the object class presented at the reference grid. Each such image generates a binary pattern in the module M. In the current implementation (see section 3.4), a search is carried out at each location y ∈ G to identify stable features for each location. These are then externally stored. However, it is also possible to consider M as a connected neural net that employs Hebbian learning (see, for example, Brunel, Carusi, & Fusi, 1998, and Amit & Fusi, 1994) to store the binary patterns of high-frequency units for an object class internally. These patterns of high-frequency units would be the attractors generated by presentations of random samples from the object class. The attractor pattern produced in M then activates the appropriate Q layers for selection, as described above.

6 Discussion
The architecture presented here provides a model for visual selection when we know what we are looking for. What if the system is not guided toward a specific object class? Clearly, high-level contextual information can restrict the number of relevant object classes to be entertained (see Ullman, 1996). However, even in the absence of such information, visual selection occurs. One possibility is that there is a small collection of generic representations, either learned or hard-wired, that are useful for "generic" object and background discrimination and have dedicated systems for detecting them. For example, we noticed, not surprisingly, that the object representation produced for the "clip" (see Figure 9) often detects generic rectangular, elongated dark areas.

Many psychophysical experiments on detection and visual selection involve measuring time as a function of the number of distractors. In Elder and Zucker (1993) there is evidence that detection of shapes with "closure" properties does not slow linearly with the number of distractors, whereas shapes that are "open" do exhibit slowing; also, in Treisman and Sharon (1990), some detection problems exhibit slowing and others do not. This is directly related to the issue of false positives. The architecture presented above is parallel and is not affected by the number of distractors. However, the number of
distractors can influence the number of false positives. The chance that the first saccade moves to the correct location decreases, and slowing down occurs. How can the effect of the distractors be avoided? Again, this could be achieved using generic object representations with hard-wired detection mechanisms, that is, a summation layer S dedicated to a specific representation and hence responding in a totally bottom-up fashion. This discussion also relates to the question of how the representation for detection interacts with representations for more detailed recognition and classification, which occur after foveation. How many false positives can detection tolerate, leaving final determination to the stage after foveation?

There are numerous feedback connections in the visual system. In the current architecture, there is no feedback to the local feature detectors and edge detectors beyond the basic top-down information provided by the model. Also, the issue of larger rotations has not been addressed. The local features and edges are not invariant to large rotations. How to deal with this issue remains an open question. Finally, in the context of elementary local cues, we have not made use of anything other than spatial discontinuities: edges. However, the same architecture can incorporate local motion cues, color cues, depth cues, texture, and so forth. Local features involving conjunctions of all of these local cues can be used.

Acknowledgments
I am indebted to Daniel Amit and Donald Geman for numerous conversations and suggestions. The experiments in this article would not have been possible without the coding assistance of Kenneth Wilder. This work was supported in part by the Army Research Office under grant DAAH04-96-1-0061 and MURI grant DAAH04-96-1-0445.

References

Ahissar, M., & Hochstein, S. (1997). Task difficulty and visual hierarchy: Reverse hierarchies in sensory processing and perceptual learning. Nature, 387, 401–406.
Amit, Y. (1998). Deformable template methods for object detection (Tech. Rep.). Chicago: Department of Statistics, University of Chicago.
Amit, D. J., & Fusi, S. (1994). Dynamic learning in neural networks with material synapses. Neural Computation, 6, 957.
Amit, Y., & Geman, D. (1999). A computational model for visual selection. Neural Computation, 11, 1691–1715.
Ballard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13, 111–122.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network, 9, 123–152.
Burl, M. C., Leung, T. K., & Perona, P. (1995). Face localization via shape statistics. In M. Bichsel (Ed.), Proc. Intl. Workshop on Automatic Face and Gesture Recognition (pp. 154–159).
Chelazzi, L., Miller, E. K., Duncan, J., & Desimone, R. (1993). A neural basis for visual search in inferior temporal cortex. Nature, 363, 345–347.
Cootes, T. F., & Taylor, C. J. (1996). Locating faces using statistical feature detectors. In Proc. Second Intl. Workshop on Automatic Face and Gesture Recognition (pp. 204–210).
Desimone, R., Miller, E. K., Chelazzi, L., & Lueschow, A. (1995). Multiple memory systems in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 475–486). Cambridge, MA: MIT Press.
Elder, J., & Zucker, S. W. (1993). The effect of contour closure on the rapid discrimination of two-dimensional shapes. Vision Research, 33, 981–991.
Fukushima, K. (1986). A neural network model for selective attention in visual pattern recognition. Biol. Cyber., 55, 5–15.
Fukushima, K., & Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Networks, 2, 355–365.
Grimson, W. E. L. (1990). Object recognition by computer: The role of geometric constraints. Cambridge, MA: MIT Press.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal response in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.
Lueschow, A., Miller, E. K., & Desimone, R. (1994). Inferior temporal mechanisms for invariant object recognition. Cerebral Cortex, 5, 523–531.
Rojer, A. S., & Schwartz, E. L. (1992). A quotient space Hough transform for space-variant visual attention. In G. A. Carpenter & S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press.
Rowley, H. A., Baluja, S., & Takeo, K. (1998). Neural network-based face detection. IEEE Trans. PAMI, 20, 23–38.
Sung, K. K., & Poggio, T. (1998). Example-based learning for view-based face detection. IEEE Trans. PAMI, 20, 39–51.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Treisman, A., & Sharon, S. (1990). Conjunction search revisited. Journal of Experimental Psychology: Human Perception and Performance, 16, 459–478.
Ullman, S. (1996). High-level vision. Cambridge, MA: MIT Press.
Van Rullen, R., Gautrais, J., Delorme, A., & Thorpe, S. (1998). Face processing using one spike per neuron. Biosystems, 48, 229–239.
von der Heydt, R. (1995). Form analysis in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 365–382). Cambridge, MA: MIT Press.
Wiskott, L., Fellous, J.-M., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Trans. on Patt. Anal. and Mach. Intel., 7, 775–779.
Zeki, S. (1993). A vision of the brain. Oxford: Blackwell Scientific Publications.

Received November 19, 1998; accepted July 12, 1999.
LETTER
Communicated by B. E. Stein
Using Bayes’ Rule to Model Multisensory Enhancement in the Superior Colliculus

Thomas J. Anastasio
Beckman Institute and Department of Molecular and Integrative Physiology, University of Illinois at Urbana/Champaign, Urbana, IL 61801, U.S.A.

Paul E. Patton
Kamel Belkacem-Boussaid
Beckman Institute, University of Illinois at Urbana/Champaign, Urbana, IL 61801, U.S.A.

The deep layers of the superior colliculus (SC) integrate multisensory inputs and initiate an orienting response toward the source of stimulation (target). Multisensory enhancement, which occurs in the deep SC, is the augmentation of a neural response to sensory input of one modality by input of another modality. Multisensory enhancement appears to underlie the behavioral observation that an animal is more likely to orient toward weak stimuli if a stimulus of one modality is paired with a stimulus of another modality. Yet not all deep SC neurons are multisensory. Those that are exhibit the property of inverse effectiveness: combinations of weaker unimodal responses produce larger amounts of enhancement. We show that these neurophysiological findings support the hypothesis that deep SC neurons use their sensory inputs to compute the probability that a target is present. We model multimodal sensory inputs to the deep SC as random variables and cast the computation in terms of Bayes’ rule. Our analysis suggests that multisensory deep SC neurons are those that combine unimodal inputs that would be more uncertain by themselves. It also suggests that inverse effectiveness results because the increase in target probability due to the integration of multisensory inputs is larger when the unimodal responses are weaker.
1 Introduction
Neural Computation 12, 1165–1187 (2000) © 2000 Massachusetts Institute of Technology

The power of the brain as an information processor is due in part to its ability to integrate inputs from multiple sensory systems. Nowhere else has neurophysiological research provided a clearer picture of multisensory integration than in the deep layers of the superior colliculus (SC) (see Stein & Meredith, 1993, for a lucid review). Enhancement is a form of multisensory integration in which the neural response to a stimulus of one modality is augmented by a stimulus of another modality. We propose a probabilistic
theory of the functional significance of multisensory enhancement in the deep SC that can account for its most important features. The SC is a layered structure located in the mammalian midbrain (Wurtz & Goldberg, 1989). The deep layers receive inputs from various brain regions composing the visual, auditory, and somatosensory systems (Edwards, Ginsburgh, Henkel, & Stein, 1979; Cadusseau & Roger, 1985; Sparks & Hartwich-Young, 1989). The deep SC integrates this multisensory input and then initiates an orienting response, such as a saccadic eye movement, toward the source of the stimulation. Neurons in the SC are organized topographically according to the location in space of their receptive fields (auditory, Middlebrooks & Knudsen, 1984; visual, Meredith & Stein, 1990; somatosensory, Meredith, Clemo, & Stein, 1991). Individual deep SC neurons can receive inputs from multiple sensory systems (Meredith & Stein, 1983, 1986b; Wallace & Stein, 1996). There is considerable overlap between the receptive fields of individual multisensory neurons for inputs of different modalities (Meredith & Stein, 1996). The responses of multisensory deep SC neurons are dependent on the spatial and temporal relationships of the multisensory stimuli. Stimuli that occur at the same time and place can produce response enhancement (King & Palmer, 1985; Meredith & Stein, 1986a, 1986b; Meredith, Nemitz, & Stein, 1987). Multisensory enhancement is the augmentation of the response of a deep SC neuron to a sensory input of one modality by an input of another modality. Percent enhancement (% enh) is quantified as (Meredith & Stein, 1986a):

$$\%\,\mathrm{enh} = \left( \frac{CM - SM_{\max}}{SM_{\max}} \right) \times 100, \qquad (1.1)$$
where CM and SMmax are the numbers of neural impulses evoked by the combined-modality and best single-modality stimuli, respectively. Multisensory enhancement is dependent on the size of the responses to unimodal stimuli. Smaller unimodal responses are associated with larger percentages of multisensory enhancement (Meredith & Stein, 1986b). This property is called inverse effectiveness (Wallace & Stein, 1994). Percentage enhancement can range upwards of 1000% (Meredith & Stein, 1986b). Yet not all deep SC neurons are multisensory. In cat and monkey, respectively, 46% and 73% of deep SC neurons studied receive sensory input of only one modality (Wallace & Stein, 1996). It is not clear why some neurons in the deep SC exhibit such extreme levels of multisensory enhancement, while others receive only unimodal input. Since the deep SC initiates orienting movements, it is not surprising to find that the orienting responses observed in behaving animals also show multisensory enhancement (Stein, Huneycutt, & Meredith, 1988; Stein, Meredith, Huneycutt, & McDade, 1989). For example, the chances that an orienting response will be made to a stimulus of one modality can be increased by a stimulus of another modality presented simultaneously in the same place, especially when the available stimuli are weak. The rules of multisensory enhancement make intuitive sense in the behavioral context. Strong stimuli that effectively command orienting responses by themselves require no multisensory enhancement. In contrast, the chances that a weak stimulus indicates the presence of an actual source should be greatly increased by another stimulus (even a weak one) of a different modality that is coincident with it. We will develop this intuition into a formal model using elementary probability theory. Our model, based on Bayes’ rule, provides a possible explanation for why some deep SC neurons are multisensory while others are not. It also offers an explanation for the property of inverse effectiveness observed for multisensory deep SC neurons.

2 Bayes’ Rule
We hypothesize that each deep SC neuron computes the probability that a stimulus source (target) is present in its receptive field, given the sensory inputs it receives. The sensory inputs are neural and are modeled as stochastic. Stochasticity implies that the input carried by sensory neurons indicating the presence of a target is to some extent uncertain. Thus, we postulate that each deep SC neuron computes the conditional probability that a target is present given its uncertain sensory inputs. We will model the computation of this conditional probability using Bayes’ rule. The model is schematized in Figure 1, in which the deep SC receives input from the visual and auditory systems. We represent these two sensory systems because the bimodal, visual-auditory combination is the most common one observed in cat and monkey (Wallace & Stein, 1996). The results generalize to any other combination. Each block on the grid in Figure 1 represents a single deep SC neuron. Our model will focus on individual deep SC neurons and on the visual and auditory inputs they receive from within their receptive fields. We begin with the simplest case: a deep SC neuron receives an input of only one sensory modality. We denote the target as random variable T and a visual sensory input as random variable V. Then we hypothesize that a deep SC neuron computes the conditional probability P(T | V). This conditional probability can be computed using Bayes’ rule (Hellstrom, 1984; Applebaum, 1996):

$$P(T \mid V) = \frac{P(V \mid T)\,P(T)}{P(V)}. \qquad (2.1)$$
Bayes’ rule is a fundamental concept in probability that has found wide application (Duda & Hart, 1973; Bishop, 1995). It underlies all modern systems for probabilistic inference in artificial intelligence (Russell & Norvig, 1995).
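The computation in equation 2.1 can be made concrete with a short numerical sketch. The helper name and the likelihood values below are illustrative assumptions, not values from the text; the structure mirrors the equation exactly.

```python
# Sketch of Bayes' rule (equation 2.1) for a binary target T.
# The numeric likelihoods here are illustrative assumptions only.

def posterior_target(p_v_given_t1, p_v_given_t0, p_t1):
    """P(T=1 | V) = P(V | T=1) P(T=1) / P(V), with P(V) from total probability."""
    p_t0 = 1.0 - p_t1
    p_v = p_v_given_t1 * p_t1 + p_v_given_t0 * p_t0  # unconditional P(V)
    return p_v_given_t1 * p_t1 / p_v

# An input that is likely when the target is present (0.8) but unlikely
# otherwise (0.2) raises a prior of 0.1 to about 0.31.
print(posterior_target(0.8, 0.2, 0.1))
```

Note that the prior enters twice: once in the numerator and once inside the unconditional probability P(V), which is why a small prior can dominate the posterior even for an informative input.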
Figure 1: Schematic diagram illustrating multisensory integration in the deep layers of the superior colliculus (SC). Neurons in the SC are organized topographically according to the location in space of their receptive fields. Individual deep SC neurons, represented as blocks on a grid, can receive sensory input of more than one modality. The receptive fields of individual multisensory neurons for inputs of different modalities overlap. V and A are random variables representing input from the visual and auditory systems, respectively. T is a binary random variable representing the target. P(T = 1 | V, A) is the Bayesian probability that a target is present given sensory inputs V and A. The bird image is from Stein and Meredith (1993).
We suggest that a deep SC neuron can be regarded as implementing Bayes’ rule in computing the probability of a target given its sensory input. According to equation 2.1, we suppose that the nervous system can compute P(T | V) given some input V because it already has available the other three probabilities: P(V | T), P(V), and P(T). The conditional probability of V given T [P(V | T)] is the probability of obtaining some visual sensory input V = v given a target T = t (upper- and lowercase italicized letters represent random variables and particular numerical values of those variables, respectively). In the context of Bayes’ rule, P(V | T) is referred to as the likelihood of V given T. The unconditional probability of V [P(V)] is the probability of obtaining some visual sensory input V = v regardless of whether a target is present. Thus, P(V) and P(V | T) are properties of the visual system and the environment, and it is reasonable to suppose that those properties could be represented by the brain.
The unconditional probability of T [P(T)] is the probability that T = t. P(T) is a property of the environment. In the context of Bayes’ rule, P(T) is referred to as the prior probability of T. What Bayes’ rule essentially does is modify the prior probability on the basis of input. In Bayes’ rule (equation 2.1), P(T | V) is computed by multiplying P(T) by the ratio P(V | T) / P(V). Because P(T | V) can be thought of as a modified version of P(T) based on input, P(T | V) can be referred to as the posterior probability of T. Thus, when computed using Bayes’ rule, P(T | V) is referred to as either the Bayesian or the posterior probability of T. It is possible that the prior probability of T [P(T)], and its representation by the brain, could change as circumstances dictate. With Bayes’ rule, we suggest a model of how the brain could use knowledge of the statistical properties of its sensory systems to modify this prior on the basis of sensory input.

3 Bayes’ Rule with One Sensory Input
Equation 2.1 can be used to compute the posterior (Bayesian) probability of a target given sensory input of only one modality. It can be used to simulate the responses of unimodal deep SC neurons. To evaluate equation 2.1, it is necessary to define the probability distributions for the terms on the right-hand side. We define T as a binary random variable where T = 1 if the target is present and T = 0 if it is absent. We define the prior probability distribution of T [P(T)] by arbitrarily assigning the probability 0.1 to P(T = 1) and 0.9 to P(T = 0). These values are not meant to reflect precisely some experimental situation, but are chosen because they offer advantages in illustrating the principles. Bayes’ rule is valid for any prior distribution. We define V as a discrete random variable that represents the number of neural impulses arriving as input to a deep SC neuron from a visual neuron in a unit time interval. The unit interval for the data we model is 250 msec (Meredith & Stein, 1986b). We define the likelihoods of V given T = 1 [P(V | T = 1)] or T = 0 [P(V | T = 0)] as Poisson densities with different means. The Poisson density provides a reasonable first approximation to the discrete distribution of the number of impulses per unit time observed in the spontaneous activity and driven responses of single neurons, and offers certain modeling advantages. The Poisson density function for discrete random variable V with mean λ is defined as (Hellstrom, 1984; Applebaum, 1996):

$$P(V = v) = \frac{\lambda^{v} e^{-\lambda}}{v!}. \qquad (3.1)$$
Example likelihood distributions of V are illustrated in Figure 2A. The likelihood of V given T = 0 [P(V | T = 0)] is the probability distribution of V when a target is absent. Thus, P(V | T = 0) is the likelihood of V under
Figure 2: Likelihood distributions and Bayesian probability in the unimodal case. The target is denoted by the binary random variable T. The prior probabilities of the target [P(T)] are 0.9 that it is absent [P(T = 0) = 0.9] and 0.1 that it is present [P(T = 1) = 0.1]. The discrete random variable V represents the number of impulses arriving at the deep SC from a visual input neuron in a unit time interval (250 msec). Poisson densities, shown in (A), model the likelihood distributions of V [P(V | T)]. A mean 5 Poisson density models the spontaneous likelihood of V [with target absent, P(V | T = 0), dashed line]. A mean 8 Poisson density models the driven likelihood of V [with target present, P(V | T = 1), line with circles]. The unconditional probability of V [P(V), line without symbols in (A)] is computed from the input likelihoods and target priors (using equation 3.2). The posterior (Bayesian) probability that the target is present given input V [P(T = 1 | V)] is computed using equation 3.3, and the result is shown in (B) (line with circles). The target-present prior [P(T = 1) = 0.1] is also shown for comparison in (B) (dashed line).
spontaneous conditions. It is modeled as a Poisson density (equation 3.1) with λ = 5 (dashed line). The mean of 5 impulses per 250-msec interval corresponds to a mean spontaneous firing rate of 20 impulses per second. This spontaneous rate is meant to represent a reasonable generic value, but one that is high enough so that the behavior of the model under spontaneous
conditions can be examined easily. The model is valid for any spontaneous rate. The likelihood of V given T = 1 [P(V | T = 1)] is the probability distribution of V when a target is present. Thus, P(V | T = 1) is the likelihood of V under driven conditions. The driven likelihood takes into account random variability in the stimulus effectiveness of the target, as well as the intrinsic stochasticity of the sensory neurons. It is modeled as a Poisson density (equation 3.1) with λ = 8 (line with circles in Figure 2A). The mean of 8 impulses per 250-msec interval corresponds to a mean driven firing rate of 32 impulses per second. Notice that there is overlap between the spontaneous and driven likelihoods of V (see Figure 2A). Given the prior probability distribution of T [P(T)] and the likelihood distributions of V [P(V | T)], the unconditional probability of V [P(V)] can be computed using the principle of total probability (Hellstrom, 1984; Applebaum, 1996):

$$P(V) = P(V \mid T = 1)\,P(T = 1) + P(V \mid T = 0)\,P(T = 0). \qquad (3.2)$$
Equation 3.2 shows that the unconditional probability of V is the sum of the likelihoods of V weighted by the priors of T. The result is the compound distribution shown in Figure 2A (line without symbols). With P(T) and P(V | T) defined, and P(V) computed from them (equation 3.2), all of the variables needed to evaluate Bayes’ rule in the unimodal case (equation 2.1) have been specified. Our hypothesis is that a deep SC neuron computes the Bayesian (posterior) probability that a target is present in its receptive field given its sensory input. We therefore want to evaluate equation 2.1 for T = 1 with unimodal input V:

$$P(T = 1 \mid V) = \frac{P(V \mid T = 1)\,P(T = 1)}{P(V)}. \qquad (3.3)$$
We evaluate equation 3.3 by taking integer values of v in the range from 0 to 25 impulses per unit time. For each value of v, the values of P(V | T = 1) and P(V) are read off the curves previously defined (see Figure 2A). The unimodal posterior probability that T = 1 [P(T = 1 | V)] is plotted in Figure 2B (line with circles). The prior probability that T = 1 [P(T = 1) = 0.1] is also shown for comparison in Figure 2B (dashed line). For the Poisson density means chosen, the posterior exceeds the prior probability when v ≥ 7 impulses per unit time (see Figure 2B). For the same values of v (v ≥ 7), the driven likelihood of V [P(V | T = 1)] exceeds the unconditional probability of V [P(V)] (see Figure 2A). Because the target-present prior [P(T = 1) = 0.1] is much smaller than the target-absent prior of T [P(T = 0) = 0.9], the unconditional probability of V [P(V)] is dominated by the spontaneous likelihood of V [P(V | T = 0)]
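The unimodal evaluation just described (equations 3.1–3.3) can be sketched in a few lines. The priors (0.1/0.9) and Poisson means (5 spontaneous, 8 driven) are the ones used in the text; the function names are ours.

```python
from math import exp, factorial

def poisson(v, lam):
    """Poisson density (equation 3.1)."""
    return lam ** v * exp(-lam) / factorial(v)

def posterior_unimodal(v, lam_driven=8.0, lam_spont=5.0, p_t1=0.1):
    """P(T=1 | V=v) via total probability (3.2) and Bayes' rule (3.3)."""
    p_v = poisson(v, lam_driven) * p_t1 + poisson(v, lam_spont) * (1.0 - p_t1)
    return poisson(v, lam_driven) * p_t1 / p_v

# As in Figure 2B, the posterior first exceeds the 0.1 prior at v = 7.
for v in range(26):
    print(v, round(posterior_unimodal(v), 4))
```

Reading off the printed values reproduces the sigmoidal curve of Figure 2B: below about v = 7 the input is most likely spontaneous and the posterior sits below the prior; above it, the posterior climbs toward 1.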
(equation 3.2). This is shown in Figure 2A [P(V), line without symbols; P(V | T = 0), dashed line]. For the priors of T as defined, the ratio P(V | T = 1) / P(V) is roughly equal to the ratio P(V | T = 1) / P(V | T = 0), and computing the posterior probability of the target using Bayes’ rule (equation 3.3) is roughly equivalent to multiplying the target-present prior by the ratio of the driven and spontaneous likelihoods of the input. Thus, as the input increases from 0 to 25 impulses per unit time, it changes from being most likely spontaneous to most likely driven (see Figure 2A). On the basis of this increasing unimodal input, Bayes’ rule indicates that the target changes from being most probably absent to most probably present (see Figure 2B). We suggest that the sigmoidal curve (Figure 2B, line with circles) describing the Bayesian probability that a target is present given unimodal input V [P(T = 1 | V)] is proportional to the responses of individual, unimodal deep SC neurons.

4 Bayes’ Rule with Two Sensory Inputs
We can now consider the bimodal case of visual and auditory inputs. We define A as a discrete random variable that represents the number of neural impulses arriving at the SC neuron from an auditory neuron in a unit time interval. The Bayesian probability in the bimodal case (given V and A) can be computed as in the unimodal case, except that in the bimodal case, the input is the conjunction of V and A. The posterior probability that T = 1 given V and A [P(T = 1 | V, A)] can be computed using Bayes’ rule as follows:

$$P(T = 1 \mid V, A) = \frac{P(V, A \mid T = 1)\,P(T = 1)}{P(V, A)}. \qquad (4.1)$$
To evaluate Bayes’ rule in the bimodal case (equation 4.1), we need to define the likelihoods of A given that a target is present or absent [P(A | T = 1) or P(A | T = 0)]. As for V, we define the likelihoods of A given T as Poisson densities with different means. To simplify the analysis, we assume that V and A are conditionally independent given T. On this assumption, the joint likelihoods of V and A given T [P(V, A | T)] can be computed as the products of the corresponding likelihoods of V given T [P(V | T)] and of A given T [P(A | T)]. For the case in which a target is present,

$$P(V, A \mid T = 1) = P(V \mid T = 1)\,P(A \mid T = 1). \qquad (4.2)$$
Also on the assumption of conditional independence, the unconditional joint probability of V and A [P(V, A)] can be computed using the principle of total probability:

$$P(V, A) = P(V \mid T = 1)\,P(A \mid T = 1)\,P(T = 1) + P(V \mid T = 0)\,P(A \mid T = 0)\,P(T = 0). \qquad (4.3)$$
We use the same priors of T as before. With the priors of T and the likelihoods of V and A defined, and the conditional and unconditional joint probabilities of (V, A) computed from them (equations 4.2 and 4.3), we can use Bayes’ rule (equation 4.1) to simulate the bimodal responses of deep SC neurons. We evaluate equation 4.1 by taking integer values of v and a in the range from 0 to 25 impulses per unit time. To simplify the comparison between simulated unimodal and bimodal responses, we consider the hypothetical case in which v and a are equal. The results are qualitatively similar when v and a are unequal as long as they increase together. An example of a bimodal Bayesian probability that a target is present is shown in Figure 3A (line without symbols). In this example, the likelihoods of A are the same as those of V (in Figure 2A). The means of the Poisson densities modeling the spontaneous and driven likelihoods of both V and A are 5 and 8, respectively. The relationship between the Bayesian probability and the inputs in the bimodal case, in which V and A increase together, is qualitatively similar to that in the unimodal case, in which V increases alone. As in the unimodal case (Figure 2), Bayes’ rule in the bimodal case indicates that the target changes from being most probably absent to most probably present as the input changes from being most likely spontaneous to most likely driven. In the bimodal case, however, the spontaneous and driven input distributions reflect the products of the individual likelihood distributions of V and A (equations 4.2 and 4.3). The roughly sigmoidal curves in Figure 3A for the bimodal [P(T = 1 | V, A)] (line without symbols) and the unimodal [P(T = 1 | V)] (line with circles) Bayesian probabilities are similar in that both rise from 0 to 1 as v and a (bimodal) or v only (unimodal) increase from 0 to 25 impulses per unit time. The difference is that the bimodal curve rises faster.
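The bimodal computation (equations 4.1–4.3) can be sketched under the text’s conditional-independence assumption; the function names and parameter packaging are ours.

```python
from math import exp, factorial

def poisson(v, lam):
    """Poisson density (equation 3.1)."""
    return lam ** v * exp(-lam) / factorial(v)

def posterior_bimodal(v, a, lam_v=(8.0, 5.0), lam_a=(8.0, 5.0), p_t1=0.1):
    """P(T=1 | V=v, A=a) via equations 4.1-4.3, assuming V and A are
    conditionally independent given T. lam_* = (driven mean, spontaneous mean)."""
    like1 = poisson(v, lam_v[0]) * poisson(a, lam_a[0])  # P(v, a | T=1)
    like0 = poisson(v, lam_v[1]) * poisson(a, lam_a[1])  # P(v, a | T=0)
    return like1 * p_t1 / (like1 * p_t1 + like0 * (1.0 - p_t1))

# With v = a and identical means (5 spontaneous, 8 driven), the bimodal
# curve rises faster than the unimodal one, as in Figure 3A.
for v in range(26):
    print(v, round(posterior_bimodal(v, v), 4))
```

Because the joint likelihoods are products, the evidence ratio is effectively squared when both inputs agree, which is exactly why the bimodal sigmoid is steeper than the unimodal one.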
There is a range of values of v over which the probability of a target is higher in the bimodal case, when a is also available as an input, than in the unimodal case with v alone. This can be called the bimodal-unimodal difference (BUD) range. Within the BUD range, the Bayesian probability that a target is present will be higher in the bimodal than in the unimodal case at the same level of input. The size of the BUD range is related to the amount of overlap between the spontaneous and driven likelihood distributions of the inputs (e.g., Figure 2A). The amount of overlap decreases as the difference between the driven and spontaneous means increases. As an example, the mean of the Poisson density modeling the driven likelihoods of both V and A is changed from 8 to 20. The spontaneous mean for both V and A remains at 5. The unimodal (line with circles) and bimodal (line without symbols) Bayesian probability curves for this example are shown in Figure 3B. The BUD range is appreciably smaller in this example, with driven and spontaneous means of 20 and 5 (see Figure 3B) than in the previous example with means of 8 and 5 (see Figure 3A).
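The shrinkage of the BUD range with increasing driven mean can be checked numerically. This sketch assumes identical likelihoods for V and A, the 0.1/0.9 priors, and v = a over 0 to 25, matching the setup of the text; the function names are ours.

```python
from math import exp, factorial

def poisson(v, lam):
    """Poisson density (equation 3.1)."""
    return lam ** v * exp(-lam) / factorial(v)

def cumulative_bud(lam_driven, lam_spont=5.0, p_t1=0.1):
    """Sum over v = 0..25 of the bimodal minus unimodal Bayesian probability,
    with v = a and identical likelihoods for V and A."""
    total = 0.0
    for v in range(26):
        l1, l0 = poisson(v, lam_driven), poisson(v, lam_spont)
        uni = l1 * p_t1 / (l1 * p_t1 + l0 * (1.0 - p_t1))
        # Conditional independence squares each likelihood in the bimodal case.
        bi = l1**2 * p_t1 / (l1**2 * p_t1 + l0**2 * (1.0 - p_t1))
        total += bi - uni
    return total

# The cumulative BUD shrinks as the driven mean moves away from the
# spontaneous mean of 5.
print(cumulative_bud(8.0), cumulative_bud(20.0))
```

Running this for driven means from 7 to 25 traces out curves of the kind summarized in Figure 4.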
Figure 3: Bayesian target-present probabilities computed in unimodal and bimodal cases. The prior probabilities of the target are set as before [P(T = 1) = 0.1 and P(T = 0) = 0.9] and the likelihoods of the visual (V) and auditory (A) inputs are modeled using Poisson densities as in Figure 2A. Unimodal Bayesian probabilities (see equation 3.3) are computed as V takes integer values v in the range from 0 to 25 impulses per unit time. Bimodal Bayesian probabilities (equation 4.1) are computed as V and A vary together over the same range with v = a. (A) Spontaneous and driven likelihood means are 5 and 8, respectively, for the unimodal case of V only, and for both V and A in the bimodal case. (B) Spontaneous and driven likelihood means are 5 and 20, respectively. The bimodal exceeds the unimodal Bayesian probability over a much narrower range in (B) than in (A). For (A) and (B), lines with circles and lines without symbols denote unimodal and bimodal Bayesian probabilities, respectively.
Comparison of the curves in Figures 3A and 3B illustrates the general finding that the BUD range decreases as the difference between the driven and spontaneous input means increases. The BUD range is also dependent on the prior probabilities of the target. The influences of changing the spontaneous and driven input means and target priors on the BUD range are illustrated in Figure 4. In this figure, the summed difference between the bimodal and unimodal Bayesian probability curves (cumulative BUD) is
Figure 4: The difference between the bimodal and unimodal Bayesian probabilities decreases rapidly as the difference between the driven and spontaneous input means increases. The spontaneous means for both V and A are fixed at 5 impulses per unit time, and the driven means for V and A are varied together from 7 to 25. Unimodal (equation 3.3) and bimodal (equation 4.1) Bayesian probabilities are computed as v, or v and a together, vary over the range from 0 to 25 impulses per unit time (as in Figure 3). The unimodal is subtracted from the bimodal curve at each point in the input range. The difference (BUD) is summed over the range and plotted. The cumulative BUD is computed with target-present prior probabilities of 0.1 (line without symbols), 0.01 (dashed line), and 0.001 (line with asterisks) (target-absent priors are 0.9, 0.99, and 0.999, respectively).
plotted as the driven means of both V and A are varied together from 7 to 25. The spontaneous means of both V and A are fixed at 5. The target prior distribution is altered so that the target-present prior equals 0.1 (line without symbols), 0.01 (dashed line), or 0.001 (line with asterisks). Figure 4 shows that the cumulative BUD decreases as the difference between the driven and spontaneous input means increases. The difference is higher for all means as the target-present prior decreases. Whether a deep SC neuron receives multimodal or only unimodal input may depend on the amount by which the driven response of a unimodal input differs from its spontaneous discharge. If the spontaneous and driven means are not well separated, then a unimodal sensory input provided to a deep SC neuron indicating the presence of a target would be uncertain and ambiguous. In the ambiguous case, a sensory input of another modality makes a relatively big difference in the computation of the probability that a target is present (e.g., Figure 4 for driven means less than about 15). Conversely, if the spontaneous and driven means are well separated, then a unimodal input provided to a deep SC neuron would be unambiguous. In the unambiguous case, a sensory input of another modality makes relatively little difference in the computation of the probability that a target is present (e.g., Figure 4 for driven means greater than about 15). Unimodal deep SC neurons may be those that receive an unambiguous input of one sensory modality and have no need for an input of another modality.

5 Bayes’ Rule with One Spontaneous and One Driven Input
To compare the Bayes’ rule model with neurophysiological data on multisensory enhancement, we must explore the behavior of the bimodal model (equation 4.1) not only for cases of bimodal stimulation, but also for cases where stimulation of only one or the other modality is available. To model the experimental situation in which sensory stimulation of both modalities is presented, we consider the case in which v and a are equal and vary together over the range from 0 to 25 impulses per unit time (as before). This simulates the driven responses of both inputs (both-driven). To model situations in which sensory stimulation of only one modality is presented, we consider the cases in which v or a is allowed to vary over the range, while the other input is fixed at the mean of its spontaneous likelihood distribution (driven/spontaneous). The results of these simulations are shown in Figure 5A, in which the spontaneous and driven input means are 5 and 8 for both V and A, as before. Because both sensory inputs have the same spontaneous and driven means, the bimodal Bayesian probability curves for the two driven/spontaneous cases are the same. Thus, the curves coincide for the bimodal cases in which v is fixed at 5 while a varies (line with diamonds) and in which a is fixed at 5 while v varies (line with crosses). There is a range of v and a in Figure 5A over which the bimodal Bayesian probability in the both-driven
Figure 5: Bimodal Bayesian probabilities (equation 4.1) are computed with both inputs driven or with one input driven and the other spontaneous. Either v and a vary together with v = a (both-driven), or one input varies while the other is fixed at its spontaneous mean of 5 (v = 5 or a = 5, driven/spontaneous). Target prior probabilities are set as before [P(T = 1) = 0.1 and P(T = 0) = 0.9] and the likelihoods of inputs V and A are modeled using Poisson densities (as in Figure 2A). (A) The driven input mean is set at 8 impulses per unit time for both V and A. The unimodal Bayesian probability (equation 3.3) is computed for comparison. (B) The driven input mean is 8 for A (as in A) but 10 for V. In (A) and (B), the Bayesian probabilities for the cases where both inputs vary together exceed over a broad range those for which one or the other input is fixed at 5. The driven/spontaneous bimodal Bayesian probability curves are denoted by lines with diamonds for v = 5, and lines with crosses for a = 5. The both-driven bimodal probability curves (v = a) are denoted by lines without symbols, and the unimodal Bayesian probability curve in (A) is denoted by a line with circles, as in Figure 3.
case (line without symbols) exceeds that in the driven/spontaneous cases (lines with diamonds and crosses). This can be called the range of enhancement. The bimodal curves (both-driven and driven/spontaneous; equation 4.1) are compared in Figure 5A with the corresponding unimodal Bayesian probability curve computed before (equation 3.3). The figure shows that the Bayesian probabilities in the bimodal, driven/spontaneous cases (lines with diamonds and crosses) are smaller than in the unimodal case (line with circles) for the same inputs v and/or a. This illustrates that Bayes’ rule works in both directions. The bimodal probability can be higher than the unimodal probability when the target most likely drives both inputs. Conversely, the bimodal probability can be lower than the unimodal probability when the target most likely drives one input while the other input is most likely spontaneous. For simplicity, we have considered cases of the bimodal Bayes’ rule where the likelihood distributions are the same for inputs of both modalities. The Bayesian analysis presented is valid for any input mean values, as long as inputs that are intended to represent neurons that are activated by sensory stimulation have driven means that are higher than spontaneous means. Figure 5B presents an example of bimodal Bayesian probabilities computed with the same spontaneous means (of 5), but with different driven means of 10 and 8 for V and A, respectively. This case is qualitatively similar to the previous one except that the driven/spontaneous curves do not coincide. The range of enhancement is much larger relative to the curve for which a varies alone while v = 5 (line with diamonds) than relative to the curve for which v varies alone while a = 5 (line with crosses). There are values of the input for which a driven/spontaneous probability is almost zero while the both-driven probability is almost one.
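The enhancement computation can be sketched under the Figure 5B assumptions (spontaneous means 5, driven means 10 for V and 8 for A, priors 0.1/0.9), with percentage enhancement from equation 1.1. The input pairs below are those used in the simulated experiment of Table 1, but because the procedure for choosing the driven/spontaneous values is only approximately specified in the text, this sketch illustrates the qualitative pattern of inverse effectiveness rather than reproducing Table 1 digit for digit; function names are ours.

```python
from math import exp, factorial

def poisson(v, lam):
    """Poisson density (equation 3.1)."""
    return lam ** v * exp(-lam) / factorial(v)

def posterior(v, a, p_t1=0.1):
    """Bimodal Bayesian probability (equation 4.1); driven means 10 (V)
    and 8 (A), spontaneous means 5, as in Figure 5B."""
    like1 = poisson(v, 10.0) * poisson(a, 8.0)  # P(v, a | T=1)
    like0 = poisson(v, 5.0) * poisson(a, 5.0)   # P(v, a | T=0)
    return like1 * p_t1 / (like1 * p_t1 + like0 * (1.0 - p_t1))

def percent_enh(v, a, spont=5):
    """Equation 1.1: the both-driven probability stands in for CM and the
    larger driven/spontaneous probability (other input at 5) for SMmax."""
    cm = posterior(v, a)
    sm_max = max(posterior(v, spont), posterior(spont, a))
    return (cm - sm_max) / sm_max * 100.0

# Inverse effectiveness: weaker driven inputs yield larger % enhancement.
for v, a in [(8, 9), (12, 15), (16, 21)]:
    print((v, a), round(percent_enh(v, a), 1))
```

The percentages fall steeply from the weakest to the strongest input pair, which is the inverse-effectiveness pattern summarized in Table 1 and Figure 6.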
We hypothesize that the response of a deep SC neuron is proportional to the Bayesian probability that a target is present given its inputs. For the curves in Figures 5A and 5B, the ranges of input over which the bimodal, both-driven probabilities (lines without symbols) exceed the driven/spontaneous probabilities (lines with diamonds and crosses) correspond to ranges of multisensory enhancement. As v and a increase within a range of enhancement, the amount of enhancement first increases and then decreases, falling to zero above the upper limit of the range. This behavior of the bimodal Bayes’ rule model can be used to simulate the results of a neurophysiological experiment (Meredith & Stein, 1986b). The experiment involves recording the responses of a multisensory deep SC neuron to stimuli that by themselves produce minimal, suboptimal, or optimal responses. The simulation is based on the bimodal case shown in Figure 5B, in which the driven means for V and A are different. Inputs for v and a are chosen that give similar minimal, suboptimal, and optimal Bayesian probabilities in the driven/spontaneous case. The both-driven probability is computed using the same driven input values. The percentage enhancement is computed using equation 1.1, in which the Bayesian, bimodal both-driven probability is substituted for CM and the larger of the two driven/spontaneous probabilities is substituted for SMmax. The values are listed in Table 1. The results are illustrated in Figure 6, where the bars correspond to the Bayesian, bimodal probabilities in the driven/spontaneous cases (white bar, visual driven (v); gray bar, auditory driven (a)) or in the both-driven case (black bar, visual and auditory driven (va)). Figure 6 is drawn as a bar graph to facilitate comparisons with published figures based on experimental data (Meredith & Stein, 1986b; Stein & Meredith, 1993). Figure 6 shows that in the model, the percentage enhancement goes down as the driven/spontaneous probabilities go up. These modeling results simulate the property of inverse effectiveness that is observed for multisensory deep SC neurons, by which the percentage of multisensory enhancement is largest when the single-modality responses are smallest.

Multisensory Enhancement in the Superior Colliculus
1179

Table 1: Numerical Values for a Simulated Enhancement Experiment.

             Driven Input    Bimodal Bayesian Probability
             v      a        v Driven   a Driven   va Driven   % enh
Minimal      8      9        0.0476     0.0487     0.3960      713
Suboptimal   12     15       0.4446     0.4622     0.9865      113
Optimal      16     21       0.9276     0.9351     1           7

Notes: Target prior probabilities are set at 0.1 for target present and 0.9 for target absent. The likelihoods of inputs V and A are modeled using Poisson densities with a spontaneous mean of 5 impulses per unit time for both V and A and driven means of 10 for V and 8 for A. With one input fixed at the spontaneous mean, the other input (v or a) is assigned a driven value such that the two driven/spontaneous, bimodal Bayesian probabilities are approximately equal at each of three levels (minimal, suboptimal, and optimal). The both-driven (va) bimodal Bayesian probability is computed using the same driven input values. The percentage enhancement is computed using equation 1.1, in which the both-driven probability is substituted for CM and the larger of the two driven/spontaneous probabilities is substituted for SMmax. The percentage increase in the Bayesian probability (% enh, equation 1.1) when both inputs are driven is largest when the driven/spontaneous Bayesian probabilities are smallest.

6 Discussion
Multisensory enhancement is a particularly dramatic form of multisensory integration in which the response of a deep SC neuron to a stimulus of one modality can be increased many times by a stimulus of another modality. Yet not all deep SC neurons are multisensory. Those that are multisensory exhibit the property of inverse effectiveness, by which the percentage of multisensory enhancement is largest when the single-sensory responses are smallest. We hypothesize that the responses of individual deep SC neurons are proportional to the Bayesian probability that a target is present given their sensory inputs. Our model, based on this hypothesis, suggests that multisensory deep SC neurons are those that receive unimodal inputs that would be more ambiguous by themselves. It also provides an explanation for the property of inverse effectiveness. The probability of a target computed on the basis of a large unimodal input will not be increased much by an input of another modality. However, the increase in target probability due to the integration of inputs of two modalities goes up rapidly as the size of the unimodal inputs decreases. Thus, our hypothesis provides a functional interpretation of multisensory enhancement that can explain its most salient features.

T. J. Anastasio, P. E. Patton, and K. Belkacem-Boussaid

Figure 6: Comparing the driven/spontaneous bimodal Bayesian probabilities at three levels (minimal, suboptimal, and optimal) with the corresponding both-driven probability. Parameters for the target prior and input likelihood distributions are as in Figure 5B. Bimodal probabilities (equation 4.4) are computed when both v and a take various values within the range representing driven responses (va, black bars), or when one input takes the driven value (v, white bars; a, gray bars) but the other is fixed at the spontaneous mean of 5. The percentage increase in the Bayesian probability when both inputs are driven is largest when the driven/spontaneous Bayesian probabilities are smallest. The values are reported in Table 1.
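The inverse-effectiveness behavior just described can be sketched as follows. This is an illustrative reimplementation, not the authors' code; exact probability values depend on implementation details the text does not fully specify (for example, whether the count range is truncated), so only the qualitative trend is checked:

```python
from math import exp, factorial

def poisson(k, lam):
    # Probability of k impulses per unit time under a Poisson with mean lam.
    return lam ** k * exp(-lam) / factorial(k)

def bimodal_posterior(v, a, p_t=0.1, spont=5, mu_v=10, mu_a=8):
    """P(target | V = v, A = a) with Poisson likelihoods (cf. equation 4.4)."""
    num = poisson(v, mu_v) * poisson(a, mu_a) * p_t
    den = num + poisson(v, spont) * poisson(a, spont) * (1 - p_t)
    return num / den

def percent_enhancement(v, a, spont=5):
    """Equation 1.1, with the both-driven probability as CM and the larger
    of the two driven/spontaneous probabilities as SMmax."""
    cm = bimodal_posterior(v, a)            # both inputs driven
    sm_v = bimodal_posterior(v, spont)      # V driven, A spontaneous
    sm_a = bimodal_posterior(spont, a)      # A driven, V spontaneous
    return 100 * (cm - max(sm_v, sm_a)) / max(sm_v, sm_a)

# Minimal, suboptimal, and optimal driven inputs (v, a) as printed in Table 1.
enh = [percent_enhancement(v, a) for v, a in [(8, 9), (12, 15), (16, 21)]]
# Inverse effectiveness: enhancement is largest for the weakest inputs.
assert enh[0] > enh[1] > enh[2] > 0
```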
6.1 Hypothesis, Assumptions, and Model Parameters

Our basic hypothesis is that the responses of deep SC neurons are proportional to the probability that a target is present. A similar hypothesis was put forward by Barlow (1969, 1972), who suggested that the responses of feature detector neurons are proportional to the probability that their trigger feature is present. Barlow suggested a model based on classical statistical inference, in which the response of a neuron is proportional to -log(P), where P is the probability of the results on a null hypothesis (i.e., that the trigger feature is absent). Classical inference differs from Bayesian analysis in many respects, with the main difference being that prior information is used in the latter but not in the former (Applebaum, 1996). In our model, the response of a deep SC neuron is directly proportional to the Bayesian (posterior) probability of a target, based on the prior probability of the target and sensory input of one or more modalities. We constructed our model so as to minimize the number of assumptions needed to evaluate it. That was our motivation in choosing the Poisson density rather than some other probability density function to represent the visual and auditory inputs, and in considering those two inputs as conditionally independent given the target. The Bayesian formulation is valid regardless of the probability density function chosen to represent the inputs, or whether those inputs are independent or dependent. Our particular construction of the model demonstrates its robustness: the parameters of other instantiations, which would require additional assumptions, could easily be set to exaggerate the multisensory enhancement effects that we simulate successfully despite the constraints we impose. Conditional independence given the target implies that the visibility of a target indicates nothing about its audibility, and vice versa.
One can easily imagine real targets for which this assumption is valid and others for which it is not. The advantage of the assumption of conditional independence is that the conditional and unconditional joint distributions of the visual and auditory inputs [P(V, A | T ) and P( V, A)] are then computed directly from the likelihoods of each individual input and the target priors (equations 4.2 and 4.3). The alternative assumption of conditional dependence carries with it a number of additional assumptions that would have to be made in order to specify how V and A are jointly distributed. The parameters of such a dependent, joint distribution could not be constrained on the basis of available data, but could be chosen in such a way that the simulated multisensory enhancement effects would be greatly exaggerated. Thus, our model construction based on conditional independence of the inputs is more restrictive than the dependent alternative, and conclusions based on it are therefore stronger. The spike trains of individual neurons have been described using various stochastic processes (Gerstein & Mandelbrot, 1964; Geisler & Goldberg, 1966; Lansky & Radil, 1987; Gabbiani & Koch, 1998). We use the discrete Poisson density to approximate the number of impulses per unit time in
the discharge of individual neurons. Real spike trains can be well described as Poisson processes in some cases but not in others (Munemori, Hara, Kimura, & Sato, 1984; Turcott et al., 1994; Lestienne, 1996; Rieke, Warland, & de Ruyter van Steveninck, 1997). The advantage of the Poisson density is that it is specified using only one parameter, the mean, and this parameter has been reported for neurons in many of the brain regions that supply sensory input to the deep SC. Other densities require setting two or more parameters, which could not be well constrained on the basis of available data but could be chosen in such a way that the simulated multisensory enhancement effects would be greatly exaggerated. Thus, our model construction using Poisson densities to represent the inputs is better constrained than others that would use alternative densities, and conclusions based on it are therefore more robust.

6.2 Unimodality versus Bimodality

Our analysis suggests that whether a deep SC neuron is multisensory may depend on the statistical properties of its inputs. We use two probability distributions to define each sensory input: one for its spontaneous discharge and the other for its activity as it is driven by the sensory attributes of the target. A critical feature of these distributions is their degree of overlap. The more the spontaneous and driven distributions of an input overlap, the more uncertain and ambiguous that input will be as an indicator of the presence of a target. Our model suggests that if a deep SC neuron receives a unimodal input for which the spontaneous and driven distributions have little overlap, then that neuron receives unambiguous unimodal input and has no need for input of another modality. Conversely, if a deep SC neuron receives a unimodal input for which the spontaneous and driven distributions overlap a lot, then that neuron receives ambiguous unimodal input.
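The role of overlap can be made concrete with a short sketch. The Bhattacharyya coefficient used below is one possible overlap measure, chosen for illustration; the text itself does not commit to a specific one:

```python
from math import exp, log, lgamma

def log_poisson(k, lam):
    # Log-probability of k impulses under a Poisson with mean lam
    # (computed in log space to avoid overflow for large k).
    return k * log(lam) - lam - lgamma(k + 1)

def overlap(spont_mean, driven_mean, kmax=200):
    """Bhattacharyya coefficient between the spontaneous and driven count
    distributions: 1 for identical distributions, near 0 for disjoint ones."""
    return sum(exp(0.5 * (log_poisson(k, spont_mean)
                          + log_poisson(k, driven_mean)))
               for k in range(kmax + 1))

# The closer the driven mean sits to the spontaneous mean, the larger the
# overlap, and the more ambiguous that input is about the target.
assert overlap(5, 8) > overlap(5, 10) > overlap(5, 20)
```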
Such a deep SC neuron could effectively disambiguate this input by combining it with an input of another modality, even if the other input is also ambiguous by itself. About 45% of deep SC neurons in the cat and 21% in the monkey receive input from two sensory modalities (Wallace & Stein, 1996). An input of a third modality could provide further disambiguation. About 9% of all deep SC neurons in the cat and 6% in the monkey receive input of three modalities. Inputs to the deep SC are likely to vary in the degree to which their spontaneous discharges and driven responses overlap. For example, sensory neurons vary in their sensitivity, and overlap would occur for those that are less sensitive to stimulation. Sensory inputs would therefore vary in their degree of ambiguity. This could explain why some deep SC neurons are unimodal while others are multimodal (Wallace & Stein, 1996).

6.3 Spontaneous Activity

In the model, the Bayesian probability of a target increases from zero to one as the input increases. We suggest that the responses of individual deep SC neurons are proportional to this Bayesian probability. The Bayesian probability is zero or nearly zero for inputs that are
most likely spontaneous. For the example shown in Figure 2, the Bayesian probability is nearly zero even for an input of 16 impulses per second (4 impulses per 250-msec interval). The spontaneous activity of deep SC neurons is also zero or nearly zero, despite the fact that most of the neurons that supply input to the deep SC have nonzero spontaneous firing rates. The spontaneous rates of neurons in some of the cortical and subcortical visual and auditory regions that project to the deep SC range from zero to over 100 impulses per second (anterior ectosylvian visual area in anesthetized cat, 1–8 Hz: Mucke, Norita, Benedek, & Creutzfeldt, 1982; pretectum in anesthetized cat, 1–31 Hz: Schmidt, 1996; retinal ganglion cells in anesthetized cat, 0–80 Hz: Ikeda, Dawes, & Hankins, 1992; superior olivary complex in anesthetized cat, 0–140 Hz: Guinan, Guinan, & Norris, 1972; inferior colliculus in alert cat, 2–54 Hz: Bock, Webster, & Aitkin, 1971; trapezoid body in anesthetized cat, 0–104 Hz: Brownell, 1975). For the unanesthetized rabbit, mean spontaneous rates of 36.1 ± 37.8 (SD) and 11.1 ± 11.4 (SD) impulses per second have been observed for single neurons in the superior olivary complex and inferior colliculus, respectively (D. Fitzpatrick and S. Kuwada, pers. comm.). Although neurons discharge spontaneously in most of the regions that provide inputs to the deep SC, most deep SC neurons exhibit low or zero spontaneous discharge, in both the anesthetized preparation and the alert, behaving animal (Peck, Schlag-Rey, & Schlag, 1980; Middlebrooks & Knudsen, 1984; Grantyn & Berthoz, 1985). In the context of the model, we would interpret this low level of deep SC neuron activity as indicating that the probability of a target is low when the inputs are spontaneous.

6.4 Driven Responses

The driven responses of deep SC neurons are influenced by a multidimensional array of stimulus parameters in addition to location in space.
For visual stimuli, these can include size and the rate and direction of motion (Stein & Arigbede, 1972; Gordon, 1973; Peck et al., 1980; Wallace & Stein, 1996). For auditory stimuli, factors influencing driven responses include sound frequency, intensity, and motion of the sound source (Gordon, 1973; Wise & Irvine, 1983, 1985; Hirsch, Chan, & Yin, 1985; Yin, Hirsch, & Chan, 1985). Specificity for these parameters is most likely derived from the sensory systems that provide input to the deep SC. In the model, we represent input activity as the number of impulses per unit time, which increases monotonically as the optimality of the sensory stimulus increases. The multidimensional array of stimulus parameters, as well as the multiplicity of input sources, raises the possibility of enhancement within a single sensory modality. For example, a visual input specific for size might enhance another visual input specific for direction of motion. The basic model, which includes only one input from each of two sensory modalities, can be generalized to include arbitrarily many inputs of various modalities and specificities.
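One way to sketch that generalization, assuming the inputs remain conditionally independent Poisson counts as in the two-input model (the multi-input function below is our illustrative construction, not the paper's):

```python
from math import exp, log, lgamma

def log_poisson(k, lam):
    # Log-probability of k impulses under a Poisson with mean lam.
    return k * log(lam) - lam - lgamma(k + 1)

def posterior(observations, p_t=0.1, spont=5):
    """P(target | all inputs) for any number of conditionally independent
    inputs; `observations` is a list of (count, driven_mean) pairs, one per
    input of whatever modality or specificity."""
    log_lr = sum(log_poisson(k, mu) - log_poisson(k, spont)
                 for k, mu in observations)
    odds = (p_t / (1 - p_t)) * exp(log_lr)
    return odds / (1 + odds)

# Likelihood ratios of independent inputs combine multiplicatively, so a
# third corroborating input (of any modality) raises the probability further.
p2 = posterior([(10, 10), (10, 8)])
p3 = posterior([(10, 10), (10, 8), (10, 8)])
assert p3 > p2
```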
The Bayesian probability increases as the input increases, and it is one or nearly one for inputs that are most likely driven. A Bayesian probability of one could correspond to the maximum activity level of deep SC neurons. The model suggests that unimodal input could maximally drive individual deep SC neurons, even those that are multisensory (e.g., Figure 5). However, in order for some inputs to drive deep SC neurons maximally, they would have to attain very high activity levels. For example, when the auditory input alone is driven in Figure 5B (line with diamonds), it does not produce a Bayesian probability of one even when it has attained 80 impulses per second (20 impulses per unit time). It is possible that some unimodal inputs are unable to attain the activity levels needed to drive deep SC neurons maximally, regardless of the optimality of the unimodal stimulus. The maximal response of a multimodal neuron to such unimodal input would be less than its maximal bimodal or multimodal response. Neurophysiological evidence suggests that this may be the case for some deep SC neurons (Stein & Meredith, 1993).

6.5 Multisensory Enhancement and Inverse Effectiveness

When the responses of deep SC neurons are interpreted as being proportional to the Bayesian probability of a target, multisensory enhancement is analogous to an increase in that probability due to multimodal input. The strength of the evidence of a target provided to a deep SC neuron is related to the size of its driven input. If a driven input of one modality is large, a deep SC neuron receives overwhelming evidence on the basis of that input modality alone. The probability of a target would be high, and driven input of another modality would not increase that probability much. Conversely, if a driven input of one modality is small, then the evidence it provides to a deep SC neuron is weak, and the probability of a target computed on the basis of that input alone would be small.
In that case, the probability of a target could be greatly increased by corroborating input of another modality. According to the probabilistic interpretation, the amount of multisensory enhancement should be larger for smaller unimodal inputs. The experimental paradigm actually used to demonstrate multisensory enhancement would accentuate this effect. Typically in studying a multisensory deep SC neuron, the response to stimuli of two modalities presented together is compared with the response to a stimulus of either modality presented alone (Meredith & Stein, 1986b; Stein & Meredith, 1993). According to the model, the response of a bimodal, deep SC neuron to a unimodal stimulus would be proportional to the probability of a target given that one input is driven and the other is spontaneous. This bimodal, driven/spontaneous probability is lower than that which would be computed by a unimodal deep SC neuron receiving only the driven input (see Figure 5A). The both-driven bimodal probability can be increased greatly over the driven/spontaneous bimodal probabilities, especially when the driven/spontaneous probabilities are smaller.
Thus, the amount of increase in target probability due to the integration of multimodal sensory inputs goes up rapidly as the responses to unimodal inputs decrease. This relationship is illustrated in Figures 5 and 6. It is analogous to the phenomenon of inverse effectiveness, which is the hallmark of multisensory enhancement in deep SC neurons (Stein & Meredith, 1993). The property of inverse effectiveness is precisely what would be expected on the basis of the hypothesis that deep SC neurons use their multisensory inputs to compute the probability of a target.

Acknowledgments
We thank Pierre Moulin, Hao Pan, Sylvian Ray, Jesse Reichler, and Christopher Seguin for many helpful discussions and for comments on the manuscript prior to submission. This work was supported by NSF grant IBN 92-21823 and a grant from the Critical Research Initiatives of the State of Illinois, both to T.J.A.

References

Applebaum, D. (1996). Probability and information: An integrated approach. Cambridge: Cambridge University Press.
Barlow, H. B. (1969). Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 156, 872–881.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bock, G. R., Webster, W. R., & Aitkin, L. M. (1971). Discharge patterns of single units in inferior colliculus of the alert cat. J. Neurophysiol., 35, 265–277.
Brownell, W. E. (1975). Organization of the cat trapezoid body and the discharge characteristics of its fibers. Brain Res., 94(3), 413–433.
Cadusseau, J., & Roger, M. (1985). Afferent projections to the superior colliculus in the rat, with special attention to the deep layers. J. Hirnforsch., 26(6), 667–681.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Edwards, S. B., Ginsburgh, C. L., Henkel, C. K., & Stein, B. E. (1979). Sources of subcortical projections to the superior colliculus in the cat. J. Comp. Neurol., 184(2), 309–329.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 312–360). Cambridge, MA: MIT Press.
Geisler, C. D., & Goldberg, J. M. (1966). A stochastic model of the repetitive activity of neurons. Biophys. J., 6, 53–69.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gordon, B. (1973). Receptive fields in deep layers of cat superior colliculus. J. Neurophysiol., 36(2), 157–178.
Grantyn, A., & Berthoz, A. (1985). Burst activity of identified tecto-reticulospinal neurons in the alert cat. Exp. Brain Res., 57(2), 417–421.
Guinan, J. J., Guinan, S. S., & Norris, B. E. (1972). Single auditory units in the superior olivary complex. I. Responses to sounds and classifications based on physiological properties. Int. J. Neurosci., 4, 101–120.
Hellstrom, C. W. (1984). Probability and stochastic processes for engineers. New York: Macmillan.
Hirsch, J. A., Chan, J. C., & Yin, T. C. (1985). Responses of neurons in the cat’s superior colliculus to acoustic stimuli. I. Monaural and binaural response properties. J. Neurophysiol., 53(3), 726–745.
Ikeda, H., Dawes, E., & Hankins, M. (1992). Spontaneous firing level distinguishes the effects of NMDA and non-NMDA receptor antagonists on the ganglion cells in the cat retina. Eur. J. Pharm., 210, 53–59.
King, A. J., & Palmer, A. R. (1985). Integration of visual and auditory information in bimodal neurones in guinea-pig superior colliculus. Exp. Brain Res., 60, 492–500.
Lansky, P., & Radil, T. (1987). Statistical inference on spontaneous neuronal discharge patterns. Biol. Cybern., 55, 299–311.
Lestienne, R. (1996). Determination of the precision of spike timing in the visual cortex of anaesthetised cats. Biol. Cybern., 74, 55–61.
Meredith, M. A., Clemo, H. R., & Stein, B. E. (1991). Somatotopic component of the multisensory map in the deep laminae of the cat superior colliculus. J. Comp. Neurol., 312(3), 353–370.
Meredith, M. A., Nemitz, J. W., & Stein, B. E. (1987). Determinants of multisensory integration in superior colliculus neurons. I. Temporal factors. J. Neurosci., 7(10), 3215–3229.
Meredith, M. A., & Stein, B. E. (1983). Interactions among converging sensory inputs in the superior colliculus. Science, 221(4608), 389–391.
Meredith, M. A., & Stein, B. E. (1986a). Spatial factors determine the activity of multisensory neurons in cat superior colliculus. Brain Res., 365(2), 350–354.
Meredith, M. A., & Stein, B. E. (1986b). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. J. Neurophysiol., 56(3), 640–662.
Meredith, M. A., & Stein, B. E. (1990). The visuotopic component of the multisensory map in the deep laminae of the cat superior colliculus. J. Neurosci., 10(11), 3727–3742.
Meredith, M. A., & Stein, B. E. (1996). Spatial determinants of multisensory integration in cat superior colliculus neurons. J. Neurophysiol., 75(5), 1843–1857.
Middlebrooks, J. C., & Knudsen, E. I. (1984). A neural code for auditory space in the cat’s superior colliculus. J. Neurosci., 4(10), 2621–2634.
Mucke, L., Norita, M., Benedek, G., & Creutzfeldt, O. (1982). Physiologic and anatomic investigation of a visual cortical area situated in the ventral bank of the anterior ectosylvian sulcus of the cat. Exp. Brain Res., 46, 1–11.
Munemori, J., Hara, K., Kimura, M., & Sato, R. (1984). Statistical features of impulse trains in cat’s lateral geniculate neurons. Biol. Cybern., 50, 167–172.
Peck, C. K., Schlag-Rey, M., & Schlag, J. (1980). Visuo-oculomotor properties of cells in the superior colliculus of the alert cat. J. Comp. Neurol., 194(1), 97–116.
Rieke, F., Warland, D., & de Ruyter van Steveninck, R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall.
Schmidt, M. (1996). Neurons in the cat pretectum that project to the dorsal lateral geniculate nucleus are activated during saccades. J. Neurophysiol., 76(5), 2907–2918.
Sparks, D. L., & Hartwich-Young, R. (1989). The deep layers of the superior colliculus. In R. H. Wurtz & M. Goldberg (Eds.), The neurobiology of saccadic eye movements (pp. 213–255). Amsterdam: Elsevier.
Stein, B. E., & Arigbede, M. O. (1972). A parametric study of movement detection properties of neurons in the cat’s superior colliculus. Brain Res., 45, 437–454.
Stein, B. E., Huneycutt, W. S., & Meredith, M. A. (1988). Neurons and behavior: The same rules of multisensory integration apply. Brain Res., 448(2), 355–358.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., Meredith, M. A., Huneycutt, W. S., & McDade, L. (1989). Behavioral indices of multisensory integration: Orientation to visual cues is affected by auditory stimuli. J. Cog. Neurosci., 1(1), 12–24.
Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C., & Teich, M. C. (1994). A non-stationary Poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biol. Cybern., 70, 209–217.
Wallace, M. T., & Stein, B. E. (1994). Cross-modal synthesis in the midbrain depends on input from cortex. J. Neurophysiol., 71(1), 429–432.
Wallace, M. T., & Stein, B. E. (1996). Sensory organization of the superior colliculus in cat and monkey. Prog. Brain Res., 112, 301–311.
Wise, L. Z., & Irvine, D. R. (1983). Auditory response properties of neurons in deep layers of cat superior colliculus. J. Neurophysiol., 49(3), 674–685.
Wise, L. Z., & Irvine, D. R. (1985). Topographic organization of interaural intensity difference sensitivity in deep layers of cat superior colliculus: Implications for auditory spatial representation. J. Neurophysiol., 54(2), 185–211.
Wurtz, R. L., & Goldberg, M. E. (Eds.). (1989). The neurobiology of saccadic eye movements. Amsterdam: Elsevier.
Yin, T. C., Hirsch, J. A., & Chan, J. C. (1985). Responses of neurons in the cat’s superior colliculus to acoustic stimuli. II. A model of interaural intensity sensitivity. J. Neurophysiol., 53(3), 746–758.

Received August 5, 1998; accepted May 10, 1999.
LETTER
Communicated by Anthony Zador
Choice and Value Flexibility Jointly Contribute to the Capacity of a Subsampled Quadratic Classifier

Panayiota Poirazi
Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

Biophysical modeling studies have suggested that neurons with active dendrites can be viewed as linear units augmented by product terms that arise from interactions between synaptic inputs within the same dendritic subregions. However, the degree to which local nonlinear synaptic interactions could augment the memory capacity of a neuron is not known in a quantitative sense. To approach this question, we have studied the family of subsampled quadratic classifiers: linear classifiers augmented by the best k terms from the set of K = (d^2 + d)/2 second-order product terms available in d dimensions. We developed an expression for the total parameter entropy, whose form shows that the capacity of an SQ classifier does not reside solely in its conventional weight values, the explicit memory used to store constant, linear, and higher-order coefficients. Rather, we identify a second type of parameter flexibility that jointly contributes to an SQ classifier’s capacity: the choice as to which product terms are included in the model and which are not. We validate the form of the entropy expression using empirical studies of relative capacity within families of geometrically isomorphic SQ classifiers. Our results have direct implications for neurobiological (and other hardware) learning systems, where in the limit of high-dimensional input spaces and low-resolution synaptic weight values, this relatively little explored form of choice flexibility could constitute a major source of trainable model capacity.

1 Introduction
Neural Computation 12, 1189–1205 (2000)
© 2000 Massachusetts Institute of Technology

Experimental evidence continues to accumulate indicating that the dendritic trees of many neuron types are electrically “active”; they contain voltage-dependent ionic conductances that can lead to full regenerative propagation of action potentials and other dynamical phenomena (for recent reviews, see Johnston, Magee, Colbert, & Christie, 1996; Stuart, Spruston, Sakmann, & Hausser, 1997; Stuart, Spruston, & Hausser, 1999). Although our state of knowledge regarding the functions of dendritic trees is largely confined to the realm of hypothesis (Koch, Poggio, & Torre, 1982; Rall &
Figure 1: Conceptual illustration of location-dependent nonlinear interactions in an active dendritic tree. Two inputs, A and B, activated at high frequency on a dendritic branch containing a thresholding nonlinearity could be sufficient to cause the branch (and the cell) to “fire” repeatedly, while coactivation of more distantly separated inputs, A and C, might generate only a rare suprathreshold event at any location. This type of nonlinear interaction between A and B is sometimes loosely termed multiplicative.
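The intuition in Figure 1 can be caricatured by a toy sum-of-thresholded-branches model. This is our illustrative abstraction, not the biophysical model used in the cited studies; the threshold value is arbitrary:

```python
def branch_output(inputs, threshold=1.5):
    """A branch contributes only if its summed local input exceeds a
    threshold (toy abstraction of the branch nonlinearity in Figure 1)."""
    s = sum(inputs)
    return s if s >= threshold else 0.0

def cell_response(branches):
    """Somatic response: sum of thresholded branch outputs."""
    return sum(branch_output(b) for b in branches)

# A and B on the same branch drive it past threshold (a supralinear,
# loosely "multiplicative" interaction), while A and C on separate
# branches each remain subthreshold.
same_branch = cell_response([[1.0, 1.0], []])   # A and B together
separate = cell_response([[1.0], [1.0]])        # A and C apart
assert same_branch > separate
```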
Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992; Mel, 1992a, 1992b, 1993, 1994, 1999; Zador, Claiborne, & Brown, 1992; Yuste & Tank, 1996; Segev & Rall, 1998; Mel, Ruderman, & Archie, 1998a, 1998b), results of existing experimental and modeling studies strongly suggest that the electrical compartmentalization provided by the dendritic branching structure, coupled with local thresholding provided by active dendritic channels, is likely to have a profound impact on the way in which synaptic inputs to a dendritic tree are integrated to produce a cell’s output (see Figure 1). A key challenge in deciphering the functions of individual neurons is to properly characterize the form of nonlinear interactions between synapses within a dendritic compartment, and to this end, a variety of biophysical modeling studies have identified possible functional roles for such interactions (Koch et al., 1982; Rall & Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992; Koch & Poggio, 1992; Zador et al., 1992; Mel, 1992a, 1992b, 1993). In this same vein, two recent biophysical modeling studies (Mel et al., 1998a, 1998b) have shown that the time-averaged response of a cortical pyramidal cell with weakly excitable dendrites can mimic the output of a quadratic “energy” model, a formalism that has been used to model a variety of receptive field properties in visual cortex (Pollen & Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa, DeAngelis, & Freeman, 1997). This close connection between dendritic and receptive
field nonlinearities is highly suggestive and supports the notion that active dendrites could contribute to a diverse set of neuronal functions in the mammalian central nervous system that involve a common, quasi-static multiplicative nonlinearity of relatively low order (for further discussion, see Mel, 1999). The potential contribution of active dendritic processing to the brain’s capacity to learn and remember is a fundamental problem in neuroscience that has thus far received relatively little attention. In one study addressing the issue of neuronal capacity, Zador and Pearlmutter (1996) computed the Vapnik-Chervonenkis (1971) dimension of an integrate-and-fire neuron with a passive dendritic tree. However, location-dependent nonlinear synaptic interactions, which are likely to occur within the dendrites of many neuron types, were not considered in this study. To begin to address this issue, we previously developed a spatially extended single-neuron model called the “clusteron,” which explicitly included location-dependent multiplicative synaptic interactions in the neuron’s input-output function (Mel, 1992a), and we illustrated how a neuron with active dendrites could act as a high-dimensional quadratic learning machine. The clusteron abstraction was conceived primarily to help quantify the impact of local dendritic nonlinearities on the usable memory capacity of a cell; an earlier biophysical modeling study had demonstrated empirically that this impact could be significant (Mel, 1992b). However, an additional insight gained from this work was that a cell with s input sites is likely capable of representing only a small fraction of the O(d^2) second-order interaction terms available among d input lines, since in the brain, it is often reasonable to assume that d is very large. In short, if a neuron does indeed pull out higher-order features in its dendrites, it is likely to have to choose very carefully among them.
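The scale of this subsampling problem is easy to illustrate. The sizes d and k below are hypothetical, and the log2 C(K, k) quantity anticipates the choice-entropy idea the paper develops later; it is not the paper's full entropy expression:

```python
from math import comb, log2

def num_product_terms(d):
    """K = (d**2 + d) / 2 second-order product terms (pairwise products
    plus squares) available among d input lines."""
    return (d * d + d) // 2

d, k = 1000, 100            # hypothetical sizes for illustration
K = num_product_terms(d)    # 500,500 candidate terms
# A cell that can include only k of the K terms samples a tiny fraction of
# them, so the choice of *which* k terms to include is itself a form of
# flexibility, worth roughly log2(C(K, k)) bits of entropy.
choice_bits = log2(comb(K, k))
assert K == 500500
assert k / K < 0.001
```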
Inspired in part by this observation, our goal here has been to quantify the degree to which inclusion of a subset of the best available higher-order terms augments the capacity of a subsampled quadratic (SQ) classifier relative to its linear counterpart. Although it has long been appreciated that inclusion of higher-order terms can increase the power of a learning machine for both regression and classification (Poggio, 1975; Barron, Mucciardi, Cook, Craig, & Barron, 1984; Giles & Maxwell, 1987; Ghosh & Shin, 1992; Guyon, Boser, & Vapnik, 1993; Karpinsky & Werther, 1993; Schurmann, 1996), and algorithms for learning polynomials have often included heuristic strategies for subsampling the intractable number of higher-order product terms, existing theory bearing on the learning capabilities of quadratic classifiers is limited. Cover (1965) showed that an rth-order polynomial classifier consisting of all r-wise products of the input variables in d dimensions has a natural separating capacity of

2·C(d + r, r)

(where C(n, m) denotes the binomial coefficient),
Panayiota Poirazi and Bartlett W. Mel
or twice the number of parameters. However, an SQ classifier with k nonzero product terms, which would appear at first to have a Cover capacity of 2(1 + d + k), in fact contains additional covert parameters relating to the choice of nonzero coefficients to be included in the classifier, since this choice is a bona fide form of model flexibility that can be used to fit training data. Viewed in another way, the SQ classifier with k nonzero terms also implicitly specifies zero values on the product terms that were not chosen, so that the capacity of an SQ classifier must necessarily take account of these additional internal degrees of freedom. Our specific objective here has been to obtain an expression that quantifies the separate contributions to the capacity of an SQ classifier that arise from parameter value versus choice flexibility. Beyond the theoretical significance of this question and its broader implications for the learning performance of classifiers that subsample from large sets of nonlinear basis functions, our hope has been that this analysis and its subsequent refinements will lead to further insights into the potential contributions of active dendritic processing to the memory capacity of neural tissue.

2 Toward the Capacity of a Subsampled Quadratic Classifier
We consider the family of SQ classifiers in which only k of the K = (d² + d)/2 available second-order terms in d dimensions are allowed to take on a nonzero value. In the special case where k = 0, we are left with a pure linear classifier, while k = K corresponds to a full quadratic classifier. Crucially, the k nonlinear terms included in the model are not chosen at random, but are chosen instead to be the best subset of k terms—that which minimizes training set error. The capacity of a learning machine depends on the number of distinct input-output functions that can be expressed, which in the case of a classifier corresponds to the number of different labelings (dichotomies) that can be assigned to N patterns in d dimensions (Cover, 1965; Vapnik & Chervonenkis, 1971). Although a general expression for the number of dichotomies realizable by an SQ classifier is difficult to obtain as a function of N, d, and k, an upper bound on the total parameter entropy of an SQ classifier taken over the space of randomly drawn training sets may be expressed as

B(k, d) = (1 + d + k)·w + log₂ C(K, k),    (2.1)
where the first term measures the entropy contained in the stored parameter values, with w an unknown representing the average number of bits of value entropy per parameter, and the second term measures the entropy contained in the choice of higher-order terms, assuming that all combinations of k higher-order terms are chosen equally often.

Figure 2: SQ capacity curves (bits used in classifiers versus the product term ratio p = k/K, for d = 10, 20, and 30) were generated using equation 2.1 with w = 4. Dashed lines indicate the contribution of explicit coefficients (weight bits) to the bit total, while the additional increment (choice bits) is provided by the combinatorial term in equation 2.1 pertaining to the choice of nonzero coefficients over the available collection of product terms. Both terms are O(d²).

The factorization of value versus choice entropy assumes that the choice of higher-order terms, and the values assigned to them, are statistically independent. The relative magnitudes of value versus choice capacity are shown in Figure 2, where B(d, k) is plotted as a function of the product term ratio p = k/K, for w = 4. The contribution of the weight values to the bit total is given by the dashed line connecting the p = 0% and p = 100% points on each curve. The contribution of the choice term has the form of an upside-down "U" and can be seen as the increment to the bit count above the dashed line.¹ When the mapping between learning problems and classifier parameters is deterministic—i.e., when there always exists a single best setting of classifier parameters that minimizes training error—and when all classifier states are equally likely, equation 2.1 represents the mutual information contained in the classifier parameters about training set labels. Unfortunately, equation 2.1 cannot be directly used to predict error rates for SQ classifiers. First, the value of w is generally unknown a priori. Second, actual error rates do not depend on parameter entropy alone, but depend also on geometric factors that determine the suitability of the classifier's parameters to specific training set distributions. Following the maxim that capacity is closely linked to trainable degrees of freedom, however, and the principle that capacity and error rates should be related to each other in consistent (if unknown) ways, we conjectured that there could exist subfamilies of isomorphic SQ classifiers wherein the relative capacity of any two classifiers i and j, at any error rate and for any training set distribution, is given by the ratio of their respective B(k, d) values. Specifically, a family of isomorphic SQ classifiers C is defined such that for any two classifiers {i, j} ∈ C, the relative number of patterns N learnable by each classifier at error rate ε obeys the relation

R_i/j = N_i / N_j = B_i / B_j,  for all ε.    (2.2)

¹ The expression for B assumes that the value bits and the choice bits are independent of each other, which holds only when the k best product terms chosen for inclusion in the classifier are unlikely to include any zero values. This assumption breaks down as k → K, in which limit some of the k largest learned coefficients will have values near zero. This leads in turn to an overprediction of classifier capacity according to equation 2.1 and the resulting slight nonmonotonicity seen in Figure 2 for large p values. A more intricate counting method would be needed to separate these two sources of classifier capacity for large p values.
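The entropy bound of equation 2.1, on which the ratios in equation 2.2 are built, can be evaluated directly. The sketch below separates the value and choice contributions, with w = 4 as used in Figure 2:

```python
from math import comb, log2

def capacity_bits(k, d, w=4.0):
    """Equation 2.1 split into its two terms:
    B(k, d) = (1 + d + k) * w + log2(C(K, k)),
    where K = (d^2 + d) / 2 is the number of available second-order terms."""
    K = (d * d + d) // 2
    value_bits = (1 + d + k) * w     # entropy in the stored parameter values
    choice_bits = log2(comb(K, k))   # entropy in the choice of the k terms
    return value_bits, choice_bits

value, choice = capacity_bits(k=10, d=10)
print(f"d = 10, k = 10: value bits = {value:.1f}, "
      f"choice bits = {choice:.1f}, total B = {value + choice:.1f}")
```

Note that the choice term vanishes at both endpoints (k = 0 and k = K), recovering the "choiceless" linear and full quadratic cases discussed below.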
Put another way, two isomorphic classifiers should yield equal error rates when asked to learn an equivalent number of patterns per parameter bit, for any input distribution. We describe several tests of this scaling conjecture in the following sections.

2.1 Special Cases: Linear and Full Quadratic Classifiers. The scaling relation expressed in equation 2.2 may be easily verified in the limiting cases of k = 0 and k = K, which correspond to linear and full quadratic classifiers, respectively. In both cases, the combinatorial term representing the choice capacity in equation 2.1 vanishes. Assuming that the value-entropy scaling factor w is always equal for the two classifiers at a given error rate, the capacity ratio for two isomorphic "choiceless" classifiers indexed by i and j simplifies to the ratio of their explicit parameter counts:
R_i/j = (1 + d_i) / (1 + d_j)    (2.3)
for the linear case and
R_i/j = C(d_i + 2, 2) / C(d_j + 2, 2) = (d_i² + 3d_i + 2) / (d_j² + 3d_j + 2)    (2.4)
for the full quadratic case. Both outcomes are consistent with the fact that the natural separating capacity for both linear and full polynomial discriminant functions is directly proportional to the number of free parameters
available to the classifier (Cover, 1965). Numerical simulations and supporting calculations confirmed that this capacity scaling holds at every error rate for linear and quadratic classifiers in 10, 20, and 30 dimensions (see Figure 3). The superposition of performance curves for linear classifiers in 10, 20, and 30 dimensions shown in Figure 3C validates the assumption in equation 2.3 that w is independent of d. Thus, randomly labeled samples drawn from a spherical gaussian distribution evoke the same number of bits of entropy in a linear classifier, per parameter, regardless of the dimension in which the classifier sits. A statistical mechanical interpretation of this fact is given in Riegler and Seung (1997). Similarly, the scaling of full quadratic performance curves shown in Figure 3C implies that the value of w is independent of d within this family of learning machines as well. However, it is important to note that the value of w is not necessarily consistent between the two families, since the graphs of error rate versus training set size for linear and quadratic classifiers are not related by any simple scaling factor (see Figure 3C). In fact, full quadratic classifiers performed better per parameter than linear classifiers for small training sets—in this case, up to six patterns per model parameter. The opposite was true for large training sets, for which the training set covariance matrix approaches the identity, thus depleting any information about class boundaries contained in the O(d²) quadratic parameters.

2.2 Capacity Scaling for General SQ Classifiers. We tested the scaling relation in equation 2.2 for SQ classifiers in the general case where 0 < k < K. In all simulations reported here, the coefficients for the k nonzero second-order terms were determined as follows. A conjugate gradient algorithm was used to train a full quadratic classifier to minimize mean squared error (MSE) over the training set.
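A minimal sketch of this fitting-and-subsampling procedure follows. It is an illustrative stand-in, not the authors' code: ordinary least squares replaces their conjugate-gradient MSE minimization, the sigmoidal output stage is omitted, and the data sizes are toy assumptions:

```python
import numpy as np

def quadratic_expand(X):
    """Design matrix with a constant, the d linear terms, and all
    K = (d^2 + d) / 2 second-order products x_i * x_j (i <= j)."""
    n, d = X.shape
    pairs = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.hstack([np.ones((n, 1)), X, np.column_stack(pairs)])

rng = np.random.default_rng(0)
d, n, k = 10, 100, 10
X = rng.standard_normal((n, d))           # zero-mean spherical gaussian patterns
y = rng.choice([-1.0, 1.0], size=n)       # random +/-1 labels

# Step 1: fit the full quadratic classifier by minimizing MSE
# (closed-form least squares stands in for conjugate gradient).
Phi = quadratic_expand(X)                 # shape (n, 1 + d + K); K = 55 here
w = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Step 2: keep the k second-order terms with the largest coefficient
# magnitudes, then retrain the pruned model on the reduced design matrix.
quad_mag = np.abs(w[1 + d:])
keep = np.sort(np.argsort(quad_mag)[::-1][:k])
Phi_sq = np.hstack([Phi[:, :1 + d], Phi[:, 1 + d + keep]])
w_sq = np.linalg.lstsq(Phi_sq, y, rcond=None)[0]

err = float(np.mean(np.sign(Phi_sq @ w_sq) != y))
print(f"kept {k} of {Phi.shape[1] - 1 - d} product terms; "
      f"training error = {err:.1%}")
```

The magnitude-based selection in step 2 mirrors the pruning criterion described next, which the authors justify for spherical training distributions.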
Given the sphericity of the training set distribution, the selection of the k best product terms after training could be made based on weight magnitude, a strategy that is not valid in general (see Bishop, 1995). In this case, however, it was verified empirically that the increase in MSE or classification error when a single weight was set to zero grew roughly monotonically with the weight magnitude (see Figure 4). The pruned classifier, including only the k largest second-order terms, was retrained to minimize MSE. During this second round of training, the output of the classifier was passed through a sigmoidal thresholding function (1 − e^(−x/s)) / (1 + e^(−x/s)) with slope s = 0.51. In some cases, the resulting value parameters were binarized to take on values of ±1. Graphs of error rate versus training set size for SQ classifiers are shown in Figure 5A for a range of k values in 10 dimensions. The capacity boost resulting from the inclusion of varying numbers of second-order terms in a linear classifier was culled from this graph by choosing a fixed error rate
[Figure 3 plots: (A) model fit to performance curves for linear classifiers (d = 10; spherical gaussian model versus learned linear discriminant); (B) model fit to performance curves for quadratic classifiers (d = 10; nonspherical gaussian model versus learned quadratic discriminant); (C) scaling of performance curves across dimension (d = 10, 20, 30; linear and full quadratic). Axes: classification error (%) versus training patterns (A, B) or training patterns per model parameter (C); VC dimension and Cover capacity marked in A.]
Figure 3: We trained linear and full quadratic classifiers using randomly labeled patterns drawn from a d-dimensional zero-mean spherical gaussian distribution and measured average classification error versus training set size. (A) A simple analytical model (crosses) fit numerical simulation results (squares) in the asymptote of large N (see the appendix). Dashed lines indicate the VC dimension 1 + d and Cover capacity 2(1 + d) for the perceptron, which correspond to 1% and 7% error rates, respectively. Error rates slightly above the theoretical minimum (0% at the VC dimension) were due to imperfect control of the learning-rate parameter. (B) A Bayes-optimal classifier based on estimated covariance matrices (crosses) provided a control for the performance of a numerically optimized full quadratic classifier (squares); the performance curves again merge in the limit of large N. (C) Scaling of linear and quadratic error rates across dimension, when plotted against training patterns per model parameter (1 + d for linear, 1 + d + (d² + d)/2 for quadratic).
[Figure 4 plots (d = 10, N = 100): (A) distribution of learned weight values, linear versus quadratic (relative probabilities versus weight values); (B) MSE increase for individual weights, squared error increase versus weight magnitude, linear and quadratic.]
Figure 4: Distributions of learned weight values and increments to MSE with pruning of individual weights. (A) Similar distributions of linear versus second-order weight values after training a full quadratic classifier. (B) Effect of pruning individual weights on the MSE. For each data point, the x-value indicates weight magnitude; the y-value indicates the increment to MSE when that weight was set to 0 (without further retraining). As expected from the symmetry of the training distribution, the increase in MSE is roughly monotonic with weight magnitude for both linear and second-order weights. A similar relationship was seen for increments in classification error as a function of weight magnitude.
and recording the associated capacities for the linear versus various SQ curves (see Figure 5B). The contribution of the choice flexibility is evident here, leading to capacity boosts in excess of those predicted by explicit parameter counts. For example, at a fixed 1% error rate in 10 dimensions, addition of the 10 best quadratic terms to the 11 linear and constant terms boosted the trainable capacity by more than a factor of 3, whereas the number of explicit parameters grew by less than a factor of 2. By analogy with the special cases of linear and full quadratic classifiers, we sought to identify subfamilies of isomorphic SQ classifiers within which the relative capacity of any two classifiers was given by R_i/j. In doing so, however, two complications arise. First, it is a priori unclear what condition(s) on d and k bind together families of isomorphic SQ classifiers in general, if such families exist. Second, even if such families do exist, w no longer cancels in the expression for R_i/j, as it does for both linear and full quadratic classifiers. In cases where w is unknown, therefore, R_i/j cannot be evaluated. Regarding the criterion for classifier isomorphism, we found empirically that error rates for SQ classifiers i and j in dimensions d_i and d_j could be brought into register across dimension with a simple capacity scaling, as
[Figure 5 plots (d = 10): (A) performance of SQ classifiers, classification error (%) versus training patterns, for the linear classifier, k = 10, 20, 30, 40, and the full quadratic; (B) capacity boosts for SQ classifiers relative to linear, capacity ratio (SQ/linear) versus classification error (%), for k = 0 (linear) through k = 55 (full quadratic).]
Figure 5: Performance of SQ classifiers. (A) Performance curves for various values of k in a 10-dimensional input space. (B) Ratio of the memory capacity of SQ classifiers in 10 dimensions relative to a linear classifier (k = 0), plotted as a function of the error rate.
long as the relation

k_i / K_i = k_j / K_j    (2.5)
was obeyed. This suggests that a key geometric invariance in the SQ family involves the proportion of higher-order terms included in the classifier (see Figure 6B). This rule also operates correctly for the special cases of linear (k = 0) and full quadratic (k = K) classifiers. By contrast, another plausible condition on d and k failed to bind together sets of isomorphic SQ classifiers: classifiers that use the same fraction of their complete parameter set (see Figure 6A). Having established that two SQ classifiers with equal product term ratios p = k/K produce performance curves that are isomorphic up to a scaling of their capacities, the key question remaining was whether the scaling factor would be predicted by R_i/j. Assuming w_i = w_j as for the linear and full quadratic cases, we note that the relative capacity of two isomorphic SQ classifiers is again independent of w and approaches (d_i/d_j)² for large d_i and d_j. This occurs because as d_i and d_j grow large, the ratio of the two value capacities considered separately, and the ratio of the two choice capacities considered separately, both individually approach (d_i/d_j)². However, for the lower-dimensional cases included in our empirical studies, the capacity ratio R_i/j does not entirely shed its dependence on w and p. We found that a value of w = 4 produced good fits to the empirically determined scaling factors for p-matched SQ
[Figure 6 plots, comparing parameter bits used for isomorphic capacity curves: (A) parameters used in the target dimension (% of total parameters) versus parameters used in 10 dimensions (% of total parameters); (B) product term ratio in the target dimension (%) versus product term ratio p in 10 dimensions (%); d = 20 and d = 30 in both panels.]
Figure 6: Relationship between SQ classifiers in different dimensions whose performance curves were isomorphic. (A) The nonlinear relation shown indicates that no simple proportionality exists between the total parameter counts (1 + d + k) used by two SQ classifiers whose performance curves were found to be isomorphic up to an x-axis scaling. 20-d (crosses) and 30-d (squares) classifiers were compared to a 10-d reference case. (B) In contrast, a linear relation (equality) holds between the product term ratios p = k/K for two isomorphic SQ classifiers.
classifiers in 10, 20, and 30 dimensions, for a wide range of p values tested, and for training sets drawn from both gaussian and uniform distributions² (see Figure 7). To eliminate uncertainty in the value of w, we trained binary-valued SQ classifiers so that w could be assigned a value of 1 in the expression for R_i/j. In this case, SQ classifiers were trained as before, but prior to testing, the real-valued coefficients were mapped to ±1 according to their sign. This radical quantization of weight values led to large increases in error rates and gross alterations in the form of the error curves; compare Figure 8A to Figure 7A. However, R_i/j could now be evaluated with no free parameters and continued to produce nearly perfect superpositions of error rate curves within isomorphic families (see Figure 8). The nonmonotonicities seen in these binary-valued cases are of unknown origin. However, given that they scale across dimension, we infer that they arise from consistent, family-specific idiosyncrasies in the number, and independence, of the dichotomies realizable for particular training set distributions.
² Fits using w = 3 were nearly as good; the scaling factor's weak dependence on w is explained by the fact that w appears in both the numerator and denominator of R_i/j.
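As a numerical illustration of the predicted scaling, the sketch below evaluates equation 2.1 for the p ≈ 21% group of classifiers used in Figure 7 and forms the predicted capacity ratio R_30/10 of equation 2.2, with w = 4 (the (d, k) pairs are taken from the figure legend; everything else is illustrative):

```python
from math import comb, log2

def B(k, d, w=4.0):
    """Equation 2.1: total parameter entropy bound of an SQ classifier."""
    K = (d * d + d) // 2
    return (1 + d + k) * w + log2(comb(K, k))

# Approximately p-matched (d, k) pairs from the ~21% group of Figure 7A;
# the k values are quantized, so the ratios only roughly match.
group = [(10, 12), (20, 46), (30, 90)]
for d, k in group:
    K = (d * d + d) // 2
    print(f"d = {d:2d}: k/K = {k / K:.3f}, B = {B(k, d):.1f} bits")

# Predicted capacity ratio of the 30-d to the 10-d classifier (equation 2.2):
R = B(90, 30) / B(12, 10)
print(f"R_30/10 = {R:.2f}; the large-d limit would be (30/10)^2 = 9")
```

In these low dimensions the ratio falls short of the asymptotic (d_i/d_j)² value, which is why the residual dependence on w and p discussed in the text still matters.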
[Figure 7 plots: (A) SQ performance curves with analog weights scale across dimension: classification error (%) versus training patterns per parameter bit, for k/K ≈ 21% (d, k = 10, 12; 20, 46; 30, 90) and k/K ≈ 45% (d, k = 10, 25; 20, 95; 30, 210); (B) SQ curves with analog weights have different shapes for uniform versus gaussian distributions, k/K ≈ 18% (d, k = 10, 10; 20, 40).]
Figure 7: Error rates for SQ classifiers in different dimensions could be superimposed when matched for product term ratio p = k/K. (A) Error rates were equivalent for isomorphic SQ classifiers in 10, 20, and 30 dimensions when training set size was normalized by capacity as given by equation 2.1, with w = 4. The two groups of curves shown are for p ≈ 21% and p ≈ 45%. The approximation arises from quantization of k values. The different shapes of the two groups of curves imply that classifiers with different p-ratios perform differently per bit in learning the same training set. (B) When training patterns were drawn from a uniform distribution, normalized capacity curves for isomorphic SQ classifiers continued to match across dimension, although the shapes of the curves were different from those produced by the same classifiers trained with gaussian samples.

3 Discussion
We derived a simple expression for the total parameter flexibility of a subsampled quadratic classifier, with separate factors for value versus choice capacity. To validate the form of the expression, we (i) identified families of geometrically isomorphic SQ classifiers, with linear and full quadratic discriminant functions as special cases; (ii) measured relative capacities within these families; (iii) demonstrated that equation 2.2 predicted capacity scaling factors for SQ classifiers in 10 to 30 dimensions with one free parameter for classifiers with real-valued coefficients, and with no free parameters for classifiers with binary-valued coefficients; and (iv) showed that for classifiers with real-valued or binary-valued coefficients, both the predicted and the empirically derived capacity scaling factors were consistent for two very different training set distributions, suggesting that the expression for capacity scaling in equation 2.2 may hold for any well-behaved input distribution. The quantification of capacity in equation 2.1 is not yet supported by an exact function-counting theorem as has been worked out for linear and full
[Figure 8 plots: (A) SQ performance curves with binary weights scale across dimension: classification error (%) versus training patterns per parameter bit, for k/K ≈ 21% (d, k = 10, 12; 20, 46; 30, 90) and k/K ≈ 45% (d, k = 10, 25; 20, 95; 30, 210); (B) SQ curves with binary weights have different shapes for uniform versus gaussian distributions, k/K ≈ 18% (d, k = 10, 10; 20, 40).]
Figure 8: Error curves for p-matched SQ classifiers trained with a gaussian distribution and constrained to use binary-valued ±1 weights. Superposition of curves was achieved using the capacity scaling factor R_i/j with no free parameters, since w could be set to 1 in this case.
rth-order polynomials (Cover, 1965), nor has it been derived from a fundamental statistical mechanical perspective as has been done, for example, for the simple perceptron (Riegler & Seung, 1997)—and we cannot predict how difficult it will be to see through either of these approaches. In its favor, however, the simple form of equation 2.1, coupled with the condition for classifier isomorphism given in equation 2.5, together account for much of the empirical variance in model capacity expressed by this rather complex class of learning machines. It is therefore likely that these two quantitative relations will be echoed in future theorems and proofs bearing on the class of subsampled polynomial classifiers. While choice capacity is in some sense invisible, it is nonetheless very real. Thus, when a single optimally chosen nonlinear basis function is added to a linear model, the system behaves as if it gained not only an additional w bits of value, or coefficient, capacity, but a bonus capacity of log₂(K) bits owing to the choice that was made. When K is large and w is small, this bonus can be highly significant. It seems plausible that the bit-counting approach of equation 2.1 could be useful in quantifying the capacity of other types of classifiers that adaptively subsample from large families of nonlinear basis functions—for example, a classifier that draws an optimal set of k radial basis functions from a complete set of K basis functions tiling an input space. The practical significance of choice capacity is most acute for hardware-based learning systems, possibly including the brain, where in the limits of low-resolution weight values (Petersen, Malenka, Nicoll, & Hopfield,
1998) and very high-dimensional input spaces, learning capacity may be dominated by the flexibility to choose from a huge set of candidate nonlinear basis functions. As a corollary, the capacity of such learning systems could be significantly underestimated if based solely on counts of existing units or synaptic connections—a strategy that, by contrast, is appropriate when used to estimate the capacity of a fully connected neural network (Baum & Haussler, 1989).

Appendix: Derivation of a Formula for the Error Rate of the Spherical Perceptron in the Limit of Large N
We derived an expression for perceptron error rates using a simple geometric argument valid in the limit of large, randomly labeled training sets. When N patterns are drawn from a zero-mean spherical gaussian distribution G and randomly labeled with ±1, the resulting positive and negative training sets T_pos and T_neg may each be modeled as spherical gaussian blobs G_pos and G_neg, with means X_pos and X_neg slightly shifted from the origin. The sample means X_pos and X_neg are themselves distributed as d-dimensional gaussians, but with reduced standard deviation given by

σ′ = σ / √(N/2).    (A.1)

Given that most of the mass of a high-dimensional spherical gaussian with variance σ² is contained in a thin shell around radius σ√d, we may view X_pos and X_neg as randomly oriented vectors of length

L = σ′√d = σ√(2d/N).    (A.2)
Since in high dimension X_pos and X_neg are also nearly orthogonal, the expected distance between them is given by ΔX = √2·L. The optimal linear discriminant function is thus a hyperplane that bisects, and is perpendicular to, the line connecting X_pos and X_neg. Given that T_pos and T_neg have equal prior probability, the expected classification error is given by the integrated probability contained in G_pos (or G_neg) lying beyond the discriminating hyperplane. Since G_pos is factorial, this calculation reduces to a complementary error function—the total probability lying beyond the distance ΔX/2 = σ√(d/N) from the origin of a one-dimensional gaussian. Thus, in the limit of high dimension with N ≫ d, the expected classification error is

CE = (1/2) erfc(√(d/(2N))),    (A.3)

using the standard substitution z = x/(σ√2), and the complementary error function defined by erfc(z) = (2/√π) ∫_z^∞ e^(−t²) dt.
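Equation A.3 is simple to evaluate. The short sketch below shows how the predicted error approaches the 50% chance level as N grows, consistent with the large-N asymptote in Figure 3A (the specific N values are illustrative):

```python
from math import erfc, sqrt

def expected_error(d, N):
    """Equation A.3: asymptotic classification error of the optimal linear
    discriminant for N >> d randomly labeled spherical gaussian patterns."""
    return 0.5 * erfc(sqrt(d / (2.0 * N)))

for N in (100, 1000, 10000):
    print(f"d = 10, N = {N:5d}: expected error = {expected_error(10, N):.3f}")
```

As expected, the error rises toward 0.5 with increasing N: with randomly assigned labels, a fixed-capacity hyperplane can memorize an ever smaller fraction of the training set.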
In a variant of this classification problem, the N training samples drawn from G are all assigned to T_pos, while T_neg is taken to be G itself. This problem thus requires responding yes to the largest possible fraction of training samples, while saying no to the largest possible fraction of the probability density contained in G. Assuming equal priors for positive and negative exemplars during testing, the optimal separating hyperplane in this case cuts halfway through, and perpendicular to, the line connecting X_pos to the origin. This reduces ΔX, and hence the z score in equation A.3, by a factor of √2, to give a classification error of CE = (1/2) erfc(√(d/(4N))). This was dubbed the "déjà vu" learning problem and analyzed by Pearlmutter (1992) using a somewhat different approach.

Acknowledgments
Thanks to Barak Pearlmutter, Dan Ruderman, Chris Williams, and Lyle Borg-Graham for helpful comments and suggestions. Funding for this work was provided by the NSF, the ONR, and a Myronis fellowship.

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Amer., A2, 284–299.
Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and applications in the United States of algorithms related to GMDH. In S. Farlow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Borg-Graham, L., & Grzywacz, N. (1992). A model of the direction selectivity circuit in retina: Transformations by neurons singly and in concert. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 347–375). Orlando, FL: Academic Press.
Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers, 14, 326–334.
Ghosh, J., & Shin, Y. (1992). Efficient higher order neural networks for classification and function approximation. International Journal of Neural Systems, 3, 323–350.
Giles, C., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972–4978.
Guyon, I., Boser, B., & Vapnik, V. (1993). Automatic capacity tuning of very large VC-dimension classifiers. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in
neural information processing systems, 5 (pp. 147–155). San Mateo, CA: Morgan Kaufmann.
Heeger, D. (1992). Half-squaring in responses of cat striate cells. Visual Neurosci., 9, 427–443.
Johnston, D., Magee, J., Colbert, C., & Christie, B. (1996). Active properties of neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186.
Karpinsky, M., & Werther, T. (1993). VC dimension and uniform learnability of sparse polynomials and rational functions. SIAM J. Comput., 22, 1276–1285.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 315–345). Orlando, FL: Academic Press.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B, 298, 227–264.
Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kaufmann.
Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp., 4, 502–516.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70(3), 1086–1101.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. W. (1999). Why have dendrites? A computational perspective. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 271–289). Oxford: Oxford University Press.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998a). Toward a single cell account of binocular disparity tuning: An energy model may be hiding in your dendrites. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 208–214). Cambridge, MA: MIT Press.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998b).
Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77.
Pearlmutter, B. A. (1992). How selective can a linear threshold unit be? In International Joint Conference on Neural Networks. Beijing, PRC: IEEE.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3–CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Poggio, T. (1975). Optimal nonlinear associative recall. Biol. Cybern., 9, 201.
Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. Sys. Man Cybern., 13, 907–916.
Rall, W., & Segev, I. (1987). Functional possibilities for synapses on dendrites and on dendritic spines. In G. Edelman, W. Gall, & W. Cowan (Eds.), Synaptic
Capacity of a Subsampled Quadratic Classier
1205
function (pp. 605–636). New York: Wiley. Riegler, P., & Seung, H. S. (1997). Vapnik-chervonenkis entropy of the spherical perceptron. Phys. Rev. E, 55, 3283–3287. Schurmann, J. (1996). Pattern classication: A unied view of statistical and neural approaches. New York: Wiley. Segev, I., & Rall, W. (1998). Excitable dendrites and spines—earlier theoretical insights elucidate recent direct observations. Trends Neurosci., 21, 453–460. Shepherd, G., & Brayton, R. (1987). Logic operations are properties of computersimulated interactions between excitable dendritic spines. Neurosci., 21, 151– 166. Stuart, G., Spruston, N., Sakmann, B., & Hausser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian CNS. TINS, 20, 125–131. Stuart, G., Spruston, N., & Hausser, M. (Eds.). (1999). Dendrites. Oxford: Oxford University Press. Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. Yuste, R., & Tank, D. W. (1996). Dendritic integration in mammalian neurons, a century after Cajal. Neuron, 16, 701–716. Zador, A., Claiborne, B., & Brown, T. (1992). Nonlinear pattern separation in single hippocampal neurons with active dendritic membrane. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 51–58). San Mateo, CA: Morgan Kaufmann. Zador, A., & Pearlmutter, B. A. (1996). VC dimension of an integrate-and-re model neuron. Neural Computation, 8, 611–624. Received July 1, 1998; accepted July 20, 1999.
LETTER
Communicated by John Platt
New Support Vector Algorithms

Bernhard Schölkopf*
Alex J. Smola
GMD FIRST, 12489 Berlin, Germany, and Department of Engineering, Australian National University, Canberra 0200, Australia
Robert C. Williamson
Department of Engineering, Australian National University, Canberra 0200, Australia

Peter L. Bartlett
RSISE, Australian National University, Canberra 0200, Australia

We propose a new class of support vector algorithms for regression and classification. In these algorithms, a parameter ν lets one effectively control the number of support vectors. While this can be useful in its own right, the parameterization has the additional benefit of enabling us to eliminate one of the other free parameters of the algorithm: the accuracy parameter ε in the regression case, and the regularization constant C in the classification case. We describe the algorithms, give some theoretical results concerning the meaning and the choice of ν, and report experimental results.

1 Introduction
Support vector (SV) machines comprise a new class of learning algorithms, motivated by results of statistical learning theory (Vapnik, 1995). Originally developed for pattern recognition (Vapnik & Chervonenkis, 1974; Boser, Guyon, & Vapnik, 1992), they represent the decision boundary in terms of a typically small subset (Schölkopf, Burges, & Vapnik, 1995) of all training examples, called the support vectors. In order for this sparseness property to carry over to the case of SV regression, Vapnik devised the so-called ε-insensitive loss function,

    |y − f(x)|_ε = max{0, |y − f(x)| − ε},   (1.1)
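As a quick illustration (a minimal sketch of ours, not code from the article), the ε-insensitive loss of equation 1.1 can be written directly:

```python
def eps_insensitive_loss(y, f_x, eps):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps): deviations inside the
    eps-tube incur zero loss; outside it, the loss grows linearly."""
    return max(0.0, abs(y - f_x) - eps)
```

For example, with ε = 0.1, a residual of 0.05 costs nothing, while a residual of 0.3 costs 0.2.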
which does not penalize errors below some ε > 0, chosen a priori. His algorithm, which we will henceforth call ε-SVR, seeks to estimate functions,

    f(x) = (w · x) + b,  w, x ∈ R^N, b ∈ R,   (1.2)

* Present address: Microsoft Research, 1 Guildhall Street, Cambridge, U.K.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1207–1245 (2000)
1208
B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett
based on independent and identically distributed (i.i.d.) data,

    (x_1, y_1), …, (x_ℓ, y_ℓ) ∈ R^N × R.   (1.3)

Here, R^N is the space in which the input patterns live, but most of the following also applies for inputs from a set X. The goal of the learning process is to find a function f with a small risk (or test error),

    R[f] = ∫ l(f, x, y) dP(x, y),   (1.4)

where P is the probability measure, which is assumed to be responsible for the generation of the observations (see equation 1.3), and l is a loss function, for example, l(f, x, y) = (f(x) − y)², or many other choices (Smola & Schölkopf, 1998). The particular loss function for which we would like to minimize equation 1.4 depends on the specific regression estimation problem at hand. This does not necessarily have to coincide with the loss function used in our learning algorithm. First, there might be additional constraints that we would like our regression estimate to satisfy, for instance, that it have a sparse representation in terms of the training data. In the SV case, this is achieved through the insensitive zone in equation 1.1. Second, we cannot minimize equation 1.4 directly in the first place, since we do not know P. Instead, we are given the sample, equation 1.3, and we try to obtain a small risk by minimizing the regularized risk functional,

    (1/2)‖w‖² + C · R_emp^ε[f].   (1.5)
Here, ‖w‖² is a term that characterizes the model complexity,

    R_emp^ε[f] := (1/ℓ) Σ_{i=1}^ℓ |y_i − f(x_i)|_ε,   (1.6)

measures the ε-insensitive training error, and C is a constant determining the trade-off. In short, minimizing equation 1.5 captures the main insight of statistical learning theory, stating that in order to obtain a small risk, one needs to control both training error and model complexity, that is, explain the data with a simple model. The minimization of equation 1.5 is equivalent to the following constrained optimization problem (see Figure 1): minimize

    τ(w, ξ^(*)) = (1/2)‖w‖² + C · (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*),   (1.7)
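The regularized risk of equation 1.5, with the ε-insensitive training error of equation 1.6, can be evaluated for a given linear model in a few lines (our own sketch, assuming a plain list-of-lists data layout):

```python
def regularized_risk(w, b, X, Y, C, eps):
    """Regularized risk (1.5) for a linear model f(x) = <w, x> + b:
    (1/2)*||w||^2 + C * (1/l) * sum of eps-insensitive losses (1.6)."""
    l = len(X)
    emp = sum(max(0.0, abs(y - (sum(wj * xj for wj, xj in zip(w, x)) + b)) - eps)
              for x, y in zip(X, Y)) / l
    return 0.5 * sum(wj * wj for wj in w) + C * emp
```

Minimizing this quantity over (w, b) is exactly what the constrained problem in equations 1.7 through 1.10 expresses via slack variables.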
Figure 1: In SV regression, a desired accuracy ε is specified a priori. It is then attempted to fit a tube with radius ε to the data. The trade-off between model complexity and points lying outside the tube (with positive slack variables ξ) is determined by minimizing the expression 1.5.
subject to

    ((w · x_i) + b) − y_i ≤ ε + ξ_i   (1.8)
    y_i − ((w · x_i) + b) ≤ ε + ξ_i^*   (1.9)
    ξ_i^(*) ≥ 0.   (1.10)
Here and below, it is understood that i = 1, …, ℓ, and that boldface Greek letters denote ℓ-dimensional vectors of the corresponding variables; (*) is a shorthand implying both the variables with and without asterisks. By using Lagrange multiplier techniques, one can show (Vapnik, 1995) that this leads to the following dual optimization problem. Maximize

    W(α, α*) = −ε Σ_{i=1}^ℓ (α_i^* + α_i) + Σ_{i=1}^ℓ (α_i^* − α_i) y_i − (1/2) Σ_{i,j=1}^ℓ (α_i^* − α_i)(α_j^* − α_j)(x_i · x_j)   (1.11)

subject to

    Σ_{i=1}^ℓ (α_i − α_i^*) = 0   (1.12)
    α_i^(*) ∈ [0, C/ℓ].   (1.13)
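The dual objective of equation 1.11 is just a quadratic form in the multipliers; a self-contained sketch of ours that evaluates it for given α, α* (not an optimizer):

```python
def dual_objective(alpha, alpha_star, X, y, eps):
    """Evaluate the eps-SVR dual objective W (equation 1.11) for given
    multipliers; X is a list of pattern vectors, y the list of targets."""
    l = len(X)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    d = [alpha_star[i] - alpha[i] for i in range(l)]   # alpha_i^* - alpha_i
    quad = sum(d[i] * d[j] * dot(X[i], X[j]) for i in range(l) for j in range(l))
    return (-eps * sum(alpha_star[i] + alpha[i] for i in range(l))
            + sum(d[i] * y[i] for i in range(l))
            - 0.5 * quad)
```

A QP solver would maximize this expression subject to equations 1.12 and 1.13.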
The resulting regression estimates are linear; however, the setting can be generalized to a nonlinear one by using the kernel method. As we will use precisely this method in the next section, we shall omit its exposition at this point.
To motivate the new algorithm that we shall propose, note that the parameter ε can be useful if the desired accuracy of the approximation can be specified beforehand. In some cases, however, we want the estimate to be as accurate as possible without having to commit ourselves to a specific level of accuracy a priori. In this work, we first describe a modification of the ε-SVR algorithm, called ν-SVR, which automatically minimizes ε. Following this, we present two theoretical results on ν-SVR concerning the connection to robust estimators (section 3) and the asymptotically optimal choice of the parameter ν (section 4). Next, we extend the algorithm to handle parametric insensitivity models that allow taking into account prior knowledge about heteroscedasticity of the noise. As a bridge connecting this first theoretical part of the article to the second one, we then present a definition of a margin that both SV classification and SV regression algorithms maximize (section 6). In view of this close connection between both algorithms, it is not surprising that it is possible to formulate also a ν-SV classification algorithm. This is done, including some theoretical analysis, in section 7. We conclude with experiments and a discussion.

2 ν-SV Regression
To estimate functions (see equation 1.2) from empirical data (see equation 1.3) we proceed as follows (Schölkopf, Bartlett, Smola, & Williamson, 1998). At each point x_i, we allow an error of ε. Everything above ε is captured in slack variables ξ_i^(*), which are penalized in the objective function via a regularization constant C, chosen a priori (Vapnik, 1995). The size of ε is traded off against model complexity and slack variables via a constant ν ≥ 0:

    minimize τ(w, ξ^(*), ε) = (1/2)‖w‖² + C · (νε + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*))   (2.1)

    subject to ((w · x_i) + b) − y_i ≤ ε + ξ_i   (2.2)
               y_i − ((w · x_i) + b) ≤ ε + ξ_i^*   (2.3)
               ξ_i^(*) ≥ 0,  ε ≥ 0.   (2.4)
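The ν-SVR primal objective of equation 2.1 can be evaluated for a candidate linear model by taking the slacks at their optimal values given (w, b, ε), that is, as ε-insensitive residuals (a sketch of ours, not the paper's code):

```python
def nu_svr_primal(w, b, X, Y, C, nu, eps):
    """nu-SVR primal objective (2.1) for a linear model, with the slacks
    xi_i, xi_i^* set to their optimal values given (w, b, eps)."""
    l = len(X)
    slack = 0.0
    for x, y in zip(X, Y):
        f = sum(wj * xj for wj, xj in zip(w, x)) + b
        slack += max(0.0, f - y - eps) + max(0.0, y - f - eps)  # xi_i + xi_i^*
    return 0.5 * sum(wj * wj for wj in w) + C * (nu * eps + slack / l)
```

Note the only difference from equation 1.7: the extra penalty term νε, through which ε itself becomes a variable of the optimization.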
For the constraints, we introduce multipliers α_i^(*), η_i^(*), β ≥ 0, and obtain the Lagrangian

    L(w, b, α^(*), β, ξ^(*), ε, η^(*))
      = (1/2)‖w‖² + Cνε + (C/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*) − βε − Σ_{i=1}^ℓ (η_i ξ_i + η_i^* ξ_i^*)
        − Σ_{i=1}^ℓ α_i (ξ_i + y_i − (w · x_i) − b + ε)
        − Σ_{i=1}^ℓ α_i^* (ξ_i^* + (w · x_i) + b − y_i + ε).   (2.5)
To minimize the expression 2.1, we have to find the saddle point of L, that is, minimize over the primal variables w, ε, b, ξ_i^(*) and maximize over the dual variables α_i^(*), β, η_i^(*). Setting the derivatives with respect to the primal variables equal to zero yields four equations:

    w = Σ_i (α_i^* − α_i) x_i   (2.6)
    C · ν − Σ_i (α_i + α_i^*) − β = 0   (2.7)
    Σ_{i=1}^ℓ (α_i − α_i^*) = 0   (2.8)
    C/ℓ − α_i^(*) − η_i^(*) = 0.   (2.9)
In the SV expansion, equation 2.6, only those α_i^(*) will be nonzero that correspond to a constraint, equation 2.2 or 2.3, which is precisely met; the corresponding patterns are called support vectors. This is due to the Karush-Kuhn-Tucker (KKT) conditions that apply to convex constrained optimization problems (Bertsekas, 1995). If we write the constraints as g(x_i, y_i) ≥ 0, with corresponding Lagrange multipliers α_i, then the solution satisfies α_i · g(x_i, y_i) = 0 for all i.

Substituting the above four conditions into L leads to another optimization problem, called the Wolfe dual. Before stating it explicitly, we carry out one further modification. Following Boser et al. (1992), we substitute a kernel k for the dot product, corresponding to a dot product in some feature space related to input space via a nonlinear map Φ,

    k(x, y) = (Φ(x) · Φ(y)).   (2.10)

By using k, we implicitly carry out all computations in the feature space that Φ maps into, which can have a very high dimensionality. The feature space has the structure of a reproducing kernel Hilbert space (Wahba, 1999; Girosi, 1998; Schölkopf, 1997) and hence minimization of ‖w‖² can be understood in the context of regularization operators (Smola, Schölkopf, & Müller, 1998). The method is applicable whenever an algorithm can be cast in terms of dot products (Aizerman, Braverman, & Rozonoer, 1964; Boser et al., 1992; Schölkopf, Smola, & Müller, 1998). The choice of k is a research topic in its
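As an illustration of equation 2.10 (our own sketch), a Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) computes such an implicit feature-space dot product, and satisfies the basic kernel properties of symmetry and k(x, x) = 1:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): an implicit dot product
    Phi(x) . Phi(y) in a reproducing kernel Hilbert space (cf. eq. 2.10)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```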
own right that we shall not touch here (Williamson, Smola, & Schölkopf, 1998; Schölkopf, Shawe-Taylor, Smola, & Williamson, 1999); typical choices include gaussian kernels, k(x, y) = exp(−‖x − y‖²/(2σ²)), and polynomial kernels, k(x, y) = (x · y)^d (σ > 0, d ∈ N).

Rewriting the constraints, noting that β, η_i^(*) ≥ 0 do not appear in the dual, we arrive at the ν-SVR optimization problem: for ν ≥ 0, C > 0,

    maximize W(α^(*)) = Σ_{i=1}^ℓ (α_i^* − α_i) y_i − (1/2) Σ_{i,j=1}^ℓ (α_i^* − α_i)(α_j^* − α_j) k(x_i, x_j)   (2.11)

    subject to Σ_{i=1}^ℓ (α_i − α_i^*) = 0   (2.12)
               α_i^(*) ∈ [0, C/ℓ]   (2.13)
               Σ_{i=1}^ℓ (α_i + α_i^*) ≤ C · ν.   (2.14)
The regression estimate then takes the form (cf. equations 1.2, 2.6, and 2.10),

    f(x) = Σ_{i=1}^ℓ (α_i^* − α_i) k(x_i, x) + b,   (2.15)
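Given solved dual coefficients, the kernel expansion of equation 2.15 is a one-liner; a sketch of ours, with a linear kernel supplied for the demonstration (the coefficients below are placeholders, not the output of a solver):

```python
def svr_predict(x, support_X, d, b, kernel):
    """Kernel expansion (2.15): f(x) = sum_i (alpha_i^* - alpha_i) * k(x_i, x) + b,
    where d[i] = alpha_i^* - alpha_i are given dual coefficients."""
    return sum(di * kernel(xi, x) for di, xi in zip(d, support_X)) + b

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))   # linear kernel
```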
where b (and ε) can be computed by taking into account that equations 2.2 and 2.3 (substitution of Σ_j (α_j^* − α_j) k(x_j, x) for (w · x) is understood; cf. equations 2.6 and 2.10) become equalities with ξ_i^(*) = 0 for points with 0 < α_i^(*) < C/ℓ, respectively, due to the KKT conditions.

Before we give theoretical results explaining the significance of the parameter ν, the following observation concerning ε is helpful. If ν > 1, then ε = 0, since it does not pay to increase ε (cf. equation 2.1). If ν ≤ 1, it can still happen that ε = 0, for example, if the data are noise free and can be perfectly interpolated with a low-capacity model. The case ε = 0, however, is not what we are interested in; it corresponds to plain L1-loss regression. We will use the term errors to refer to training points lying outside the tube¹ and the term fraction of errors or SVs to denote the relative numbers of errors or SVs (i.e., divided by ℓ).

¹ For N > 1, the "tube" should actually be called a slab, the region between two parallel hyperplanes.

In the following proposition, we define the modulus of absolute continuity of a function f as the function ε(δ) := sup Σ_i |f(b_i) − f(a_i)|, where the supremum is taken over all disjoint intervals (a_i, b_i) with a_i < b_i satisfying Σ_i (b_i − a_i) < δ. Loosely speaking, the condition on the conditional density of y given x asks that it be absolutely continuous "on average."

Proposition 1. Suppose ν-SVR is applied to some data set, and the resulting ε is nonzero. The following statements hold:

i. ν is an upper bound on the fraction of errors.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 1.3) were generated i.i.d. from a distribution P(x, y) = P(x) P(y|x) with P(y|x) continuous and the expectation of the modulus of absolute continuity of its density satisfying lim_{δ→0} E ε(δ) = 0. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.

Proof. Ad (i). The constraints, equations 2.13 and 2.14, imply that at most a fraction ν of all examples can have α_i^(*) = C/ℓ. All examples with ξ_i^(*) > 0 (i.e., those outside the tube) certainly satisfy α_i^(*) = C/ℓ (if not, α_i^(*) could grow further to reduce ξ_i^(*)).

Ad (ii). By the KKT conditions, ε > 0 implies β = 0. Hence, equation 2.14 becomes an equality (cf. equation 2.7).² Since SVs are those examples for which 0 < α_i^(*) ≤ C/ℓ, the result follows (using α_i · α_i^* = 0 for all i; Vapnik, 1995).

Ad (iii). The strategy of proof is to show that asymptotically, the probability of a point lying on the edge of the tube vanishes. The condition on P(y|x) means that
    sup_{f,t} E [ P(|f(x) + t − y| < γ | x) ] < δ(γ)   (2.16)

for some function δ(γ) that approaches zero as γ → 0. Since the class of SV regression estimates f has well-behaved covering numbers, we have (Anthony & Bartlett, 1999, chap. 21) that for all t,

    Pr { sup_f ( P̂_ℓ(|f(x) + t − y| < γ/2) < P(|f(x) + t − y| < γ) ) > α } < c₁ c₂^{−ℓ},

² In practice, one can alternatively work with equation 2.14 as an equality constraint.
where P̂_ℓ is the sample-based estimate of P (that is, the proportion of points that satisfy |f(x) − y + t| < γ), and c₁, c₂ may depend on γ and α. Discretizing the values of t, taking the union bound, and applying equation 2.16 shows that the supremum over f and t of P̂_ℓ(f(x) − y + t = 0) converges to zero in probability. Thus, the fraction of points on the edge of the tube almost surely converges to 0. Hence the fraction of SVs equals that of errors. Combining statements i and ii then shows that both fractions converge almost surely to ν.

Hence, 0 ≤ ν ≤ 1 can be used to control the number of errors (note that for ν ≥ 1, equation 2.13 implies 2.14, since α_i · α_i^* = 0 for all i (Vapnik, 1995)). Moreover, since the constraint, equation 2.12, implies that equation 2.14 is equivalent to Σ_i α_i^(*) ≤ Cν/2, we conclude that proposition 1 actually holds for the upper and the lower edges of the tube separately, with ν/2 each. (Note that by the same argument, the number of SVs at the two edges of the standard ε-SVR tube asymptotically agree.)

A more intuitive, albeit somewhat informal, explanation can be given in terms of the primal objective function (see equation 2.1). At the point of the solution, note that if ε > 0, we must have (∂/∂ε) τ(w, ε) = 0, that is, ν + (∂/∂ε) R_emp^ε = 0, hence ν = −(∂/∂ε) R_emp^ε. This is greater than or equal to the fraction of errors, since the points outside the tube certainly do contribute to a change in R_emp^ε when ε is changed. Points at the edge of the tube possibly might also contribute. This is where the inequality comes from. Note that this does not contradict our freedom to choose ν > 1. In that case, ε = 0, since it does not pay to increase ε (cf. equation 2.1).

Let us briefly discuss how ν-SVR relates to ε-SVR (see section 1). Both algorithms use the ε-insensitive loss function, but ν-SVR automatically computes ε.
In a Bayesian perspective, this automatic adaptation of the loss function could be interpreted as adapting the error model, controlled by the hyperparameter ν. Comparing equation 1.11 (substitution of a kernel for the dot product is understood) and equation 2.11, we note that ε-SVR requires an additional term, −ε Σ_{i=1}^ℓ (α_i^* + α_i), which, for fixed ε > 0, encourages that some of the α_i^(*) will turn out to be 0. Accordingly, the constraint (see equation 2.14), which appears in ν-SVR, is not needed. The primal problems, equations 1.7 and 2.1, differ in the term νε. If ν = 0, then the optimization can grow ε arbitrarily large; hence zero empirical risk can be obtained even when all αs are zero.

In the following sense, ν-SVR includes ε-SVR. Note that in the general case, using kernels, w̄ is a vector in feature space.

Proposition 2. If ν-SVR leads to the solution ε̄, w̄, b̄, then ε-SVR with ε set a priori to ε̄, and the same value of C, has the solution w̄, b̄.
Proof. If we minimize equation 2.1, then fix ε and minimize only over the remaining variables. The solution does not change.
3 The Connection to Robust Estimators
Using the ε-insensitive loss function, only the patterns outside the ε-tube enter the empirical risk term, whereas the patterns closest to the actual regression have zero loss. This, however, does not mean that it is only the outliers that determine the regression. In fact, the contrary is the case.

Proposition 3 (resistance of SV regression). Using support vector regression with the ε-insensitive loss function (see equation 1.1), local movements of target values of points outside the tube do not influence the regression.

Proof. Shifting y_i locally does not change the status of (x_i, y_i) as being a point outside the tube. Then the dual solution α^(*) is still feasible; it satisfies the constraints (the point still has α_i^(*) = C/ℓ). Moreover, the primal solution, with ξ_i transformed according to the movement of y_i, is also feasible. Finally, the KKT conditions are still satisfied, as still α_i^(*) = C/ℓ. Thus (Bertsekas, 1995), α^(*) is still the optimal solution.

The proof relies on the fact that everywhere outside the tube, the upper bound on the α_i^(*) is the same. This is precisely the case if the loss function increases linearly outside the ε-tube (cf. Huber, 1981, for requirements on robust cost functions). Inside, we could use various functions, with a derivative smaller than the one of the linear part.

For the case of the ε-insensitive loss, proposition 3 implies that essentially, the regression is a generalization of an estimator for the mean of a random variable that does the following:

- Throws away the largest and smallest examples (a fraction ν/2 of either category; in section 2, it is shown that the sum constraint, equation 2.12, implies that proposition 1 can be applied separately for the two sides, using ν/2).

- Estimates the mean by taking the average of the two extremal ones of the remaining examples.

This resistance concerning outliers is close in spirit to robust estimators like the trimmed mean. In fact, we could get closer to the idea of the trimmed mean, which first throws away the largest and smallest points and then computes the mean of the remaining points, by using a quadratic loss inside the ε-tube. This would leave us with Huber's robust loss function.
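For a scalar sample, the estimator just described can be spelled out in a few lines (a toy illustration of ours, not code from the article): discard a fraction ν/2 of the points on each side, then average the two extremal survivors.

```python
def nu_location_estimate(sample, nu):
    """Location estimate described above: drop a fraction nu/2 of the
    largest and nu/2 of the smallest points, then average the two
    extremal points that remain."""
    s = sorted(sample)
    k = int(nu / 2 * len(s))          # points discarded on each side
    remaining = s[k:len(s) - k]
    return 0.5 * (remaining[0] + remaining[-1])
```

On the sample [0, 1, 2, 3, 100] with ν = 0.4, one point is dropped on each side and the estimate is (1 + 3)/2 = 2, unaffected by the outlier at 100.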
Note, moreover, that the parameter ν is related to the breakdown point of the corresponding robust estimator (Huber, 1981). Because it specifies the fraction of points that may be arbitrarily bad outliers, ν is related to the fraction of some arbitrary distribution that may be added to a known noise model without the estimator failing.

Finally, we add that by a simple modification of the loss function (White, 1994), weighting the slack variables ξ^(*) above and below the tube in the target function, equation 2.1, by 2λ and 2(1 − λ), respectively, with λ ∈ [0, 1], one can estimate generalized quantiles. The argument proceeds as follows. Asymptotically, all patterns have multipliers at bound (cf. proposition 1). The λ, however, changes the upper bounds in the box constraints applying to the two different types of slack variables to 2Cλ/ℓ and 2C(1 − λ)/ℓ, respectively. The equality constraint, equation 2.8, then implies that (1 − λ) and λ give the fractions of points (out of those which are outside the tube) that lie on the two sides of the tube, respectively.

4 Asymptotically Optimal Choice of ν
Using an analysis employing tools of information geometry (Murata, Yoshizawa, & Amari, 1994; Smola, Murata, Schölkopf, & Müller, 1998), we can derive the asymptotically optimal ν for a given class of noise models in the sense of maximizing the statistical efficiency.³

Remark. Denote P a density with unit variance⁴ and 𝒫 a family of noise models generated from P by 𝒫 := {p | p = (1/σ) P(y/σ), σ > 0}. Moreover, assume that the data were generated i.i.d. from a distribution p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Under the assumption that SV regression produces an estimate f̂ converging to the underlying functional dependency f, the asymptotically optimal ν, for the estimation-of-location-parameter model of SV regression described in Smola, Murata, Schölkopf, & Müller (1998), is

    ν = 1 − ∫_{−ε}^{ε} P(t) dt,  where
    ε := argmin_τ (P(−τ) + P(τ))^{−2} (1 − ∫_{−τ}^{τ} P(t) dt).   (4.1)

³ This section assumes familiarity with some concepts of information geometry. A more complete explanation of the model underlying the argument is given in Smola, Murata, Schölkopf, & Müller (1998) and can be downloaded from http://svm.first.gmd.de.

⁴ P is a prototype generating the class of densities 𝒫. Normalization assumptions are made for ease of notation.
To see this, note that under the assumptions stated above, the probability of a deviation larger than ε, Pr{|y − f̂(x)| > ε}, will converge to

    Pr{|y − f(x)| > ε} = ∫_{X × {R\[−ε,ε]}} p(x) p(ξ) dx dξ = 1 − ∫_{−ε}^{ε} p(ξ) dξ.   (4.2)
This is also the fraction of samples that will (asymptotically) become SVs (proposition 1, iii). Therefore, an algorithm generating a fraction ν = 1 − ∫_{−ε}^{ε} p(ξ) dξ of SVs will correspond to an algorithm with a tube of size ε. The consequence is that given a noise model p(ξ), one can compute the optimal ε for it, and then, by using equation 4.2, compute the corresponding optimal value ν. To this end, one exploits the linear scaling behavior between the standard deviation σ of a distribution p and the optimal ε. This result, established in Smola, Murata, Schölkopf, & Müller (1998) and Smola (1998), cannot be proved here; instead, we shall merely try to give a flavor of the argument.

The basic idea is to consider the estimation of a location parameter using the ε-insensitive loss, with the goal of maximizing the statistical efficiency. Using the Cramér-Rao bound and a result of Murata et al. (1994), the efficiency is found to be

    e(ε/σ) = Q² / (G I).   (4.3)
Here, I is the Fisher information, while Q and G are information geometrical quantities computed from the loss function and the noise model. This means that one only has to consider distributions of unit variance, say, P, to compute an optimal value of ν that holds for the whole class of distributions 𝒫. Using equation 4.3, one arrives at

    1/e(ε) ∝ G/Q² = (P(−ε) + P(ε))^{−2} (1 − ∫_{−ε}^{ε} P(t) dt).   (4.4)
The minimum of equation 4.4 yields the optimal choice of ε, which allows computation of the corresponding ν and thus leads to equation 4.1.

Consider now an example: arbitrary polynomial noise models (∝ e^{−|ξ|^p}) with unit variance can be written as

    P(ξ) = (1/2) √(Γ(3/p)/Γ(1/p)) · (p/Γ(1/p)) · exp(−(√(Γ(3/p)/Γ(1/p)) |ξ|)^p),   (4.5)

where Γ denotes the gamma function. Table 1 shows the optimal value of ν for different polynomial degrees. Observe that the more "lighter-tailed"
1218
B. Sch¨olkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett
Table 1: Optimal ν for Various Degrees of Polynomial Additive Noise.

    Polynomial degree p:  1     2     3     4     5     6     7     8
    Optimal ν:            1.00  0.54  0.29  0.19  0.14  0.11  0.09  0.07
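The entries of Table 1 can be reproduced numerically from equation 4.1. The sketch below (our own illustration, not from the article) does this for Gaussian noise (p = 2): it minimizes (P(−τ) + P(τ))⁻² (1 − ∫_{−τ}^{τ} P(t) dt) on a grid of τ and recovers ν ≈ 0.54.

```python
import math

def optimal_nu_gaussian():
    """Evaluate equation 4.1 for the standard normal density
    P(t) = exp(-t^2/2)/sqrt(2*pi): minimize the objective
    (P(-tau)+P(tau))**-2 * (1 - integral_{-tau}^{tau} P) over a grid
    of tau values, and return the corresponding nu."""
    pdf = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    mass = lambda t: math.erf(t / math.sqrt(2.0))   # integral of P over [-t, t]
    taus = [i * 1e-3 for i in range(1, 5000)]
    tau = min(taus, key=lambda t: (1.0 - mass(t)) / (pdf(-t) + pdf(t)) ** 2)
    return 1.0 - mass(tau)                          # equation 4.1

print(round(optimal_nu_gaussian(), 2))  # -> 0.54, matching Table 1 for p = 2
```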
the distribution becomes, the smaller the optimal ν, that is, the tube width increases. This is reasonable as only for very long tails of the distribution (data with many outliers) does it appear reasonable to use an early cutoff on the influence of the data (by basically giving all data equal influence via α_i = C/ℓ). The extreme case of Laplacian noise (ν = 1) leads to a tube width of 0, that is, to L1 regression.

We conclude this section with three caveats: first, we have only made an asymptotic statement; second, for nonzero ε, the SV regression need not necessarily converge to the target f: measured using |·|_ε, many other functions are just as good as f itself; third, the proportionality between ε and σ has only been established in the estimation-of-location-parameter context, which is not quite SV regression.

5 Parametric Insensitivity Models
We now return to the algorithm described in section 2. We generalized ε-SVR by estimating the width of the tube rather than taking it as given a priori. What we have so far retained is the assumption that the ε-insensitive zone has a tube (or slab) shape. We now go one step further and use parametric models of arbitrary shape. This can be useful in situations where the noise is heteroscedastic, that is, where it depends on x.

Let {ζ_q^(*)} (here and below, q = 1, …, p is understood) be a set of 2p positive functions on the input space X. Consider the following quadratic program: for given ν_1^(*), …, ν_p^(*) ≥ 0, minimize

    τ(w, ξ^(*), ε^(*)) = ‖w‖²/2 + C · (Σ_{q=1}^p (ν_q ε_q + ν_q^* ε_q^*) + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*))   (5.1)

    subject to ((w · x_i) + b) − y_i ≤ Σ_q ε_q ζ_q(x_i) + ξ_i   (5.2)
               y_i − ((w · x_i) + b) ≤ Σ_q ε_q^* ζ_q^*(x_i) + ξ_i^*   (5.3)
               ξ_i^(*) ≥ 0,  ε_q^(*) ≥ 0.   (5.4)
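In constraints 5.2 and 5.3, the half-width of the insensitive zone at an input x is no longer a constant ε but the sum Σ_q ε_q ζ_q(x), so it can vary with x. A minimal sketch (ours, with placeholder shape functions ζ_q) makes the idea concrete:

```python
def tube_halfwidth(x, eps_list, zeta_list):
    """Parametric insensitivity (cf. constraint 5.2): the tube half-width
    at input x is sum_q eps_q * zeta_q(x), so the insensitive zone may
    vary with x (useful for heteroscedastic noise). The zeta_q are
    assumed to be positive functions."""
    return sum(e * z(x) for e, z in zip(eps_list, zeta_list))
```

For example, with ζ_1(x) = 1 and ζ_2(x) = x², the tube widens quadratically away from the origin.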
A calculation analogous to that in section 2 shows that the Wolfe dual consists of maximizing the expression 2.11 subject to the constraints 2.12 and 2.13, and, instead of 2.14, the modified constraints, still linear in α^(*),

    Σ_{i=1}^ℓ α_i^(*) ζ_q^(*)(x_i) ≤ C · ν_q^(*).   (5.5)

In the experiments in section 8, we use a simplified version of this optimization problem, where we drop the term ν_q^* ε_q^* from the objective function, equation 5.1, and use ε_q and ζ_q in equation 5.3. By this, we render the problem symmetric with respect to the two edges of the tube. In addition, we use p = 1. This leads to the same Wolfe dual, except for the last constraint, which becomes (cf. equation 2.14),

    Σ_{i=1}^ℓ (α_i + α_i^*) ζ(x_i) ≤ C · ν.   (5.6)

Note that the optimization problem of section 2 can be recovered by using the constant function ζ ≡ 1.⁵

The advantage of this setting is that since the same ν is used for both sides of the tube, the computation of ε, b is straightforward: for instance, by solving a linear system, using two conditions as those described following equation 2.15. Otherwise, general statements are harder to make; the linear system can have a zero determinant, depending on whether the functions ζ_q^(*), evaluated on the x_i with 0 < α_i^(*) < C/ℓ, are linearly dependent. The latter occurs, for instance, if we use constant functions ζ^(*) ≡ 1. In this case, it is pointless to use two different values ν, ν*; for the constraint (see equation 2.12) then implies that both sums Σ_{i=1}^ℓ α_i^(*) will be bounded by C · min{ν, ν*}. We conclude this section by giving, without proof, a generalization of proposition 1 to the optimization problem with constraint (see equation 5.6):

Proposition 4. Suppose we run the above algorithm on a data set with the result that ε > 0. Then

i. νℓ / Σ_i ζ(x_i) is an upper bound on the fraction of errors.

ii. νℓ / Σ_i ζ(x_i) is a lower bound on the fraction of SVs.

⁵ Observe the similarity to semiparametric SV models (Smola, Frieß, & Schölkopf, 1999) where a modification of the expansion of f led to similar additional constraints. The important difference in the present setting is that the Lagrange multipliers α_i and α_i^* are treated equally and not with different signs, as in semiparametric modeling.
iii. Suppose the data in equation 1.3 were generated i.i.d. from a distribution P(x, y) = P(x) P(y|x) with P(y|x) continuous and the expectation of its modulus of continuity satisfying lim_{δ→0} E ε(δ) = 0. With probability 1, asymptotically, the fractions of SVs and errors equal ν · (∫ ζ(x) dP̃(x))⁻¹, where P̃ is the asymptotic distribution of SVs over x.

6 Margins in Regression and Classification
The SV algorithm was first proposed for the case of pattern recognition (Boser et al., 1992) and then generalized to regression (Vapnik, 1995). Conceptually, however, one can take the view that the latter case is actually the simpler one, providing a posterior justification as to why we started this article with the regression case. To explain this, we will introduce a suitable definition of a margin that is maximized in both cases.

At first glance, the two variants of the algorithm seem conceptually different. In the case of pattern recognition, a margin of separation between two pattern classes is maximized, and the SVs are those examples that lie closest to this margin. In the simplest case, where the training error is fixed to 0, this is done by minimizing ‖w‖² subject to y_i · ((w · x_i) + b) ≥ 1 (note that in pattern recognition, the targets y_i are in {±1}). In regression estimation, on the other hand, a tube of radius ε is fitted to the data, in the space of the target values, with the property that it corresponds to the flattest function in feature space. Here, the SVs lie at the edge of the tube. The parameter ε does not occur in the pattern recognition case. We will show how these seemingly different problems are identical (cf. also Vapnik, 1995; Pontil, Rifkin, & Evgeniou, 1999), how this naturally leads to the concept of canonical hyperplanes (Vapnik, 1995), and how it suggests different generalizations to the estimation of vector-valued functions.

Definition 1 (ε-margin). Let (E, ‖·‖_E), (F, ‖·‖_F) be normed spaces, and X ⊂ E. We define the ε-margin of a function f: X → F as

    m_ε(f) := inf{‖x − y‖_E : x, y ∈ X, ‖f(x) − f(y)‖_F ≥ 2ε}.   (6.1)
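On a finite set X, the infimum in equation 6.1 can be computed by brute force over pairs; a small sketch of ours, kept on an integer grid so the arithmetic is exact:

```python
def eps_margin(points, f, eps, dist=lambda x, y: abs(x - y)):
    """m_eps(f) = inf ||x - y|| over pairs x, y in X with
    ||f(x) - f(y)|| >= 2*eps (equation 6.1); the infimum over an
    empty set is reported as None."""
    gaps = [dist(x, y) for x in points for y in points
            if abs(f(x) - f(y)) >= 2 * eps]
    return min(gaps) if gaps else None
```

For f(x) = 2x on X = {0, 1, …, 30} and ε = 10, the pairs with |f(x) − f(y)| ≥ 20 are exactly those with |x − y| ≥ 10, so the margin is 10, in line with the linear-function formula m_ε(f) = 2ε/‖w‖ of example 1.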
m_ε(f) can be zero, even for continuous functions, an example being f(x) = 1/x on X = R₊. There, m_ε(f) = 0 for all ε > 0.

Note that the ε-margin is related (albeit not identical) to the traditional modulus of continuity of a function: given δ > 0, the latter measures the largest difference in function values that can be obtained using points within a distance δ in E.

The following observations characterize the functions for which the margin is strictly positive.

Lemma 1 (uniformly continuous functions). With the above notations, m_ε(f) is positive for all ε > 0 if and only if f is uniformly continuous.
New Support Vector Algorithms
1221
Proof. By definition of $m_\varepsilon$, we have

$$\big(\|f(x) - f(y)\|_F \ge 2\varepsilon \implies \|x - y\|_E \ge m_\varepsilon(f)\big) \quad (6.2)$$

$$\iff \big(\|x - y\|_E < m_\varepsilon(f) \implies \|f(x) - f(y)\|_F < 2\varepsilon\big), \quad (6.3)$$
that is, if $m_\varepsilon(f) > 0$, then $f$ is uniformly continuous. Similarly, if $f$ is uniformly continuous, then for each $\varepsilon > 0$, we can find a $\delta > 0$ such that $\|f(x) - f(y)\|_F \ge 2\varepsilon$ implies $\|x - y\|_E \ge \delta$. Since the latter holds uniformly, we can take the infimum to get $m_\varepsilon(f) \ge \delta > 0$. We next specialize to a particular set of uniformly continuous functions.

Lemma 2 (Lipschitz-continuous functions). If there exists some $L > 0$ such that for all $x, y \in X$, $\|f(x) - f(y)\|_F \le L \cdot \|x - y\|_E$, then $m_\varepsilon \ge \frac{2\varepsilon}{L}$.

Proof. Take the infimum over $\|x - y\|_E \ge \frac{\|f(x) - f(y)\|_F}{L} \ge \frac{2\varepsilon}{L}$.
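Lemma 2 can be checked empirically on a finite sample of points. The sketch below (illustrative choices of the function, Lipschitz constant, and grid are our own) estimates the $\varepsilon$-margin by taking the minimum distance over sampled pairs whose function values differ by at least $2\varepsilon$; the finite-sample estimate can only overestimate the infimum, so the lemma's lower bound must hold for it as well.

```python
import math

def margin_estimate(f, xs, eps):
    """Estimate m_eps(f) = inf ||x - y|| over pairs with
    |f(x) - f(y)| >= 2*eps, using a finite sample of points xs."""
    best = math.inf
    for x in xs:
        for y in xs:
            if abs(f(x) - f(y)) >= 2 * eps:
                best = min(best, abs(x - y))
    return best

# Lipschitz example: f(x) = 2x has Lipschitz constant L = 2,
# so Lemma 2 predicts m_eps(f) >= 2*eps / L = eps.
L, eps = 2.0, 0.1
f = lambda x: L * x
xs = [i / 100.0 for i in range(-100, 101)]   # grid on [-1, 1]
m = margin_estimate(f, xs, eps)
print(m)
```

For this linear function the bound of lemma 2 is tight, and the estimate lands at $2\varepsilon/L = 0.1$.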
Example 1 (SV regression estimation). Suppose that $E$ is endowed with a dot product $(\cdot\,{\cdot}\,\cdot)$ (generating the norm $\|\cdot\|_E$). For linear functions (see equation 1.2), the margin takes the form $m_\varepsilon(f) = \frac{2\varepsilon}{\|w\|}$. To see this, note that since $|f(x) - f(y)| = |(w \cdot (x - y))|$, the distance $\|x - y\|$ will be smallest, given $|(w \cdot (x - y))| = 2\varepsilon$, when $x - y$ is parallel to $w$ (due to Cauchy-Schwarz), that is, if $x - y = \pm 2\varepsilon w / \|w\|^2$. In that case, $\|x - y\| = 2\varepsilon / \|w\|$. For fixed $\varepsilon > 0$, maximizing the margin hence amounts to minimizing $\|w\|$, as done in SV regression: in the simplest form (cf. equation 1.7 without slack variables $\xi_i$), the training on data (equation 1.3) consists of minimizing $\|w\|^2$ subject to

$$|f(x_i) - y_i| \le \varepsilon. \quad (6.4)$$
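The closed form $m_\varepsilon(f) = 2\varepsilon/\|w\|$ for linear functions can be verified numerically. The following sketch (the vector $w$ and $\varepsilon$ are arbitrary illustrative values) constructs the minimizing displacement $x - y = 2\varepsilon w/\|w\|^2$ from example 1 and checks that it achieves the function-value gap $2\varepsilon$ at distance $2\varepsilon/\|w\|$.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w, eps = [3.0, 4.0], 0.5          # f(x) = w . x, with ||w|| = 5
# Closest pair with |f(x) - f(y)| = 2*eps: x - y = 2*eps*w / ||w||^2
d = [2 * eps * c / norm(w) ** 2 for c in w]
print(abs(dot(w, d)))             # function-value gap, equals 2*eps
print(norm(d))                    # pair distance, equals 2*eps / ||w||
```

Any other direction $u$ needs a larger displacement, since $|w \cdot tu| = 2\varepsilon$ forces $t \ge 2\varepsilon/\|w\|$ by Cauchy-Schwarz.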
Example 2 (SV pattern recognition; see Figure 2). We specialize the setting of example 1 to the case where $X = \{x_1, \ldots, x_\ell\}$. Then $m_1(f) = \frac{2}{\|w\|}$ is equal to the margin defined for Vapnik's canonical hyperplane (Vapnik, 1995). The latter is a way in which, given the data set $X$, an oriented hyperplane in $E$ can be uniquely expressed by a linear function (see equation 1.2) requiring that

$$\min\{|f(x)| : x \in X\} = 1. \quad (6.5)$$

Vapnik gives a bound on the VC-dimension of canonical hyperplanes in terms of $\|w\|$. An optimal margin SV machine for pattern recognition can be constructed from data,

$$(x_1, y_1), \ldots, (x_\ell, y_\ell) \in X \times \{\pm 1\}, \quad (6.6)$$
1222
B. Sch¨olkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett
Figure 2: 1D toy problem. Separate x from o. The SV classification algorithm constructs a linear function $f(x) = w \cdot x + b$ satisfying equation 6.5 ($\varepsilon = 1$). To maximize the margin $m_\varepsilon(f)$, one has to minimize $|w|$.

as follows (Boser et al., 1992): minimize $\|w\|^2$ subject to

$$y_i \cdot f(x_i) \ge 1. \quad (6.7)$$

The decision function used for classification takes the form

$$f^*(x) = \mathrm{sgn}((w \cdot x) + b). \quad (6.8)$$

The parameter $\varepsilon$ is superfluous in pattern recognition, as the resulting decision function,

$$f^*(x) = \mathrm{sgn}((w \cdot x) + b), \quad (6.9)$$

will not change if we minimize $\|w\|^2$ subject to

$$y_i \cdot f(x_i) \ge \varepsilon. \quad (6.10)$$
Finally, to understand why the constraint (see equation 6.7) looks different from equation 6.4 (e.g., one is multiplicative, the other one additive), note that in regression, the points $(x_i, y_i)$ are required to lie within a tube of radius $\varepsilon$, whereas in pattern recognition, they are required to lie outside the tube (see Figure 2), and on the correct side. For the points on the tube, we have $1 = y_i \cdot f(x_i) = 1 - |f(x_i) - y_i|$. So far, we have interpreted known algorithms only in terms of maximizing $m_\varepsilon$. Next, we consider whether we can use the latter as a guide for constructing more general algorithms.

Example 3 (SV regression for vector-valued functions). Assume $E = \mathbb{R}^N$. For linear functions $f(x) = Wx + b$, with $W$ being an $N \times N$ matrix and $b \in \mathbb{R}^N$,
we have, as a consequence of lemma 2,

$$m_\varepsilon(f) \ge \frac{2\varepsilon}{\|W\|}, \quad (6.11)$$
where $\|W\|$ is any matrix norm of $W$ that is compatible (Horn & Johnson, 1985) with $\|\cdot\|_E$. If the matrix norm is the one induced by $\|\cdot\|_E$, that is, there exists a unit vector $z \in E$ such that $\|Wz\|_E = \|W\|$, then equality holds in 6.11. To see the latter, we use the same argument as in example 1, setting $x - y = 2\varepsilon z / \|W\|$. For the Hilbert-Schmidt norm $\|W\|_2 = \sqrt{\sum_{i,j=1}^{N} W_{ij}^2}$, which is compatible with the vector norm $\|\cdot\|_2$, the problem of minimizing $\|W\|$ subject to separate constraints for each output dimension separates into $N$ regression problems. In Smola, Williamson, Mika, & Schölkopf (1999), it is shown that one can specify invariance requirements which imply that the regularizers act on the output dimensions separately and identically (i.e., in a scalar fashion). In particular, it turns out that under the assumption of quadratic homogeneity and permutation symmetry, the Hilbert-Schmidt norm is the only admissible one.

7 ν-SV Classification
We saw that ν-SVR differs from ε-SVR in that it uses the parameters ν and $C$ instead of ε and $C$. In many cases, this is a useful reparameterization of the original algorithm, and thus it is worthwhile to ask whether a similar change could be incorporated in the original SV classification algorithm (for brevity, we call it C-SVC). There, the primal optimization problem is to minimize (Cortes & Vapnik, 1995)

$$\tau(w, \boldsymbol{\xi}) = \frac{1}{2}\|w\|^2 + \frac{C}{\ell}\sum_i \xi_i \quad (7.1)$$

subject to

$$y_i \cdot ((x_i \cdot w) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. \quad (7.2)$$
The goal of the learning process is to estimate a function $f^*$ (see equation 6.9) such that the probability of misclassification on an independent test set, the risk $R[f^*]$, is small.⁶ Here, the only parameter that we can dispose of is the regularization constant $C$. To substitute it by a parameter similar to the ν used in the regression case, we proceed as follows. As a primal problem for ν-SVC, we

⁶ Implicitly we make use of the $\{0, 1\}$ loss function; hence the risk equals the probability of misclassification.
consider the minimization of

$$\tau(w, \boldsymbol{\xi}, \rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{\ell}\sum_i \xi_i \quad (7.3)$$

subject to (cf. equation 6.10)

$$y_i \cdot ((x_i \cdot w) + b) \ge \rho - \xi_i, \quad (7.4)$$

$$\xi_i \ge 0, \qquad \rho \ge 0. \quad (7.5)$$
For reasons we shall explain, no constant $C$ appears in this formulation. To understand the role of $\rho$, note that for $\boldsymbol{\xi} = 0$, the constraint (see 7.4) simply states that the two classes are separated by the margin $2\rho / \|w\|$ (cf. example 2). To derive the dual, we consider the Lagrangian

$$L(w, \boldsymbol{\xi}, b, \rho, \boldsymbol{\alpha}, \boldsymbol{\beta}, \delta) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{\ell}\sum_i \xi_i - \sum_i \big(\alpha_i (y_i ((x_i \cdot w) + b) - \rho + \xi_i) + \beta_i \xi_i\big) - \delta\rho, \quad (7.6)$$
using multipliers $\alpha_i, \beta_i, \delta \ge 0$. This function has to be minimized with respect to the primal variables $w, \boldsymbol{\xi}, b, \rho$ and maximized with respect to the dual variables $\boldsymbol{\alpha}, \boldsymbol{\beta}, \delta$. To eliminate the former, we compute the corresponding partial derivatives and set them to 0, obtaining the following conditions:

$$w = \sum_i \alpha_i y_i x_i, \quad (7.7)$$

$$\alpha_i + \beta_i = 1/\ell, \qquad 0 = \sum_i \alpha_i y_i, \qquad \sum_i \alpha_i - \delta = \nu. \quad (7.8)$$
In the SV expansion (see equation 7.7), only those $\alpha_i$ can be nonzero that correspond to a constraint (see 7.4) that is precisely met (KKT conditions; cf. Vapnik, 1995). Substituting equations 7.7 and 7.8 into $L$, using $\alpha_i, \beta_i, \delta \ge 0$, and incorporating kernels for dot products leaves us with the following quadratic optimization problem: maximize

$$W(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad (7.9)$$
subject to

$$0 \le \alpha_i \le 1/\ell, \quad (7.10)$$

$$0 = \sum_i \alpha_i y_i, \quad (7.11)$$

$$\sum_i \alpha_i \ge \nu. \quad (7.12)$$
The resulting decision function can be shown to take the form

$$f^*(x) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i k(x, x_i) + b\Big). \quad (7.13)$$
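The dual 7.9 through 7.12 is small enough to solve by brute force on a toy problem. The sketch below (not the optimizer used in the paper; the two-point data set, grid resolution, and ν value are illustrative choices of our own) enumerates candidate $\boldsymbol{\alpha}$ on a grid, keeps those satisfying constraints 7.10 through 7.12, and picks the maximizer of $W(\boldsymbol{\alpha})$.

```python
# Two 1D points with a linear kernel k(a, b) = a * b.
X = [1.0, -1.0]
Y = [1, -1]
ell, nu = len(X), 0.5
k = lambda a, b: a * b

def W(alpha):
    """Dual objective (7.9)."""
    return -0.5 * sum(alpha[i] * alpha[j] * Y[i] * Y[j] * k(X[i], X[j])
                      for i in range(ell) for j in range(ell))

grid = [i / 100.0 for i in range(0, 51)]   # candidate alpha_i in [0, 1/ell], (7.10)
best, best_alpha = -float("inf"), None
for a1 in grid:
    for a2 in grid:
        alpha = [a1, a2]
        feasible = (abs(sum(alpha[i] * Y[i] for i in range(ell))) < 1e-12  # (7.11)
                    and sum(alpha) >= nu)                                   # (7.12)
        if feasible and W(alpha) > best:
            best, best_alpha = W(alpha), alpha

print(best_alpha)
```

On this symmetric toy problem, constraint 7.11 forces $\alpha_1 = \alpha_2$, and the quadratically homogeneous objective pushes $\sum_i \alpha_i$ down to its lower bound ν, so the solver lands at $\alpha_1 = \alpha_2 = \nu/2 = 0.25$.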
Compared to the original dual (Boser et al., 1992; Vapnik, 1995), there are two differences. First, there is an additional constraint, 7.12, similar to the regression case, 2.14. Second, the linear term $\sum_i \alpha_i$ of Boser et al. (1992) no longer appears in the objective function 7.9. This has an interesting consequence: 7.9 is now quadratically homogeneous in $\boldsymbol{\alpha}$. It is straightforward to verify that one obtains exactly the same objective function if one starts with the primal function $\tau(w, \boldsymbol{\xi}, \rho) = \|w\|^2/2 + C \cdot (-\nu\rho + (1/\ell)\sum_i \xi_i)$ (i.e., if one does use $C$), the only difference being that the constraints 7.10 and 7.12 would have an extra factor $C$ on the right-hand side. In that case, due to the homogeneity, the solution of the dual would be scaled by $C$; however, it is straightforward to see that the corresponding decision function will not change. Hence we may set $C = 1$.

To compute $b$ and $\rho$, we consider two sets $S_\pm$, of identical size $s > 0$, containing SVs $x_i$ with $0 < \alpha_i < 1/\ell$ and $y_i = \pm 1$, respectively. Then, due to the KKT conditions, 7.4 becomes an equality with $\xi_i = 0$. Hence, in terms of kernels,

$$b = -\frac{1}{2s}\sum_{x \in S_+ \cup S_-} \sum_j \alpha_j y_j k(x, x_j), \quad (7.14)$$

$$\rho = \frac{1}{2s}\Big(\sum_{x \in S_+} \sum_j \alpha_j y_j k(x, x_j) - \sum_{x \in S_-} \sum_j \alpha_j y_j k(x, x_j)\Big). \quad (7.15)$$
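Equations 7.14 and 7.15 can be sanity-checked numerically. The sketch below uses a hypothetical already-solved toy problem of our own construction (two SVs at $x = \pm 1$ with a linear kernel and $\alpha_i = 0.25$, strictly inside $(0, 1/\ell)$); by symmetry the offset $b$ should vanish and $\rho$ should be positive.

```python
# Hypothetical solved toy problem: SVs x = +1 (y = +1) and x = -1 (y = -1).
X, Y, alpha = [1.0, -1.0], [1, -1], [0.25, 0.25]
k = lambda a, b: a * b
S_plus, S_minus = [0], [1]     # indices of SVs with y = +1 and y = -1
s = len(S_plus)

def g(x):
    """sum_j alpha_j y_j k(x, x_j), the kernel expansion without offset."""
    return sum(alpha[j] * Y[j] * k(x, X[j]) for j in range(len(X)))

b = -(1.0 / (2 * s)) * sum(g(X[i]) for i in S_plus + S_minus)      # (7.14)
rho = (1.0 / (2 * s)) * (sum(g(X[i]) for i in S_plus)
                         - sum(g(X[i]) for i in S_minus))          # (7.15)
print(b, rho)
```

Here $g(1) = 0.5$ and $g(-1) = -0.5$, so the two averages give $b = 0$ and $\rho = 0.5$, matching the symmetry of the toy problem.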
As in the regression case, the ν parameter has a more natural interpretation than the one we removed, $C$. To formulate it, let us first define the term margin error. By this, we denote points with $\xi_i > 0$, that is, points that are either errors or lie within the margin. Formally, the fraction of margin errors is

$$R_{\mathrm{emp}}^\rho[f] := \frac{1}{\ell}\,|\{i : y_i \cdot f(x_i) < \rho\}|. \quad (7.16)$$
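Equation 7.16 is a simple counting quantity; the following sketch computes it directly (the toy values of $f$ and the labels are illustrative).

```python
def margin_error_fraction(fvals, y, rho):
    """Fraction of points with y_i * f(x_i) < rho (equation 7.16):
    misclassified points plus points strictly inside the margin."""
    ell = len(fvals)
    return sum(1 for fi, yi in zip(fvals, y) if yi * fi < rho) / ell

# Toy: one error (y*f < 0), one margin point (0 < y*f < rho), two clean points.
fvals = [-0.2, 0.3, 1.1, 2.0]
y = [1, 1, 1, 1]
print(margin_error_fraction(fvals, y, rho=0.5))
```

With two of the four points below the margin threshold, the fraction is 0.5.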
Here, $f$ is used to denote the argument of the sgn in the decision function, equation 7.13, that is, $f^* = \mathrm{sgn} \circ f$. We are now in a position to modify proposition 1 for the case of pattern recognition:

Proposition 5. Suppose $k$ is a real analytic kernel function, and we run ν-SVC with $k$ on some data with the result that $\rho > 0$. Then

i. ν is an upper bound on the fraction of margin errors.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 6.6) were generated i.i.d. from a distribution $P(x, y) = P(x)P(y|x)$ such that neither $P(x, y = 1)$ nor $P(x, y = -1)$ contains any discrete component. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.
Proof. Ad (i). By the KKT conditions, $\rho > 0$ implies $\delta = 0$. Hence, inequality 7.12 becomes an equality (cf. equation 7.8). Thus, at most a fraction ν of all examples can have $\alpha_i = 1/\ell$. All examples with $\xi_i > 0$ do satisfy $\alpha_i = 1/\ell$ (if not, $\alpha_i$ could grow further to reduce $\xi_i$).

Ad (ii). SVs can contribute at most $1/\ell$ to the left-hand side of 7.12; hence there must be at least $\nu\ell$ of them.

Ad (iii). It follows from the condition on $P(x, y)$ that apart from some set of measure zero (arising from possible singular components), the two class distributions are absolutely continuous and can be written as integrals over distribution functions. Because the kernel is analytic and non-constant, it cannot be constant in any open set; otherwise it would be constant everywhere. Therefore, functions $f$ constituting the argument of the sgn in the SV decision function (see equation 7.13; essentially functions in the class of SV regression functions) transform the distribution over $x$ into distributions such that for all $f$, and all $t \in \mathbb{R}$, $\lim_{\gamma \to 0} P(|f(x) + t| < \gamma) = 0$. At the same time, we know that the class of these functions has well-behaved covering numbers; hence we get uniform convergence: for all $\gamma > 0$, $\sup_f |P(|f(x) + t| < \gamma) - \hat{P}_\ell(|f(x) + t| < \gamma)|$ converges to zero in probability, where $\hat{P}_\ell$ is the sample-based estimate of $P$ (that is, the proportion of points that satisfy $|f(x) + t| < \gamma$). But then for all $a > 0$, $\lim_{\gamma \to 0} \lim_{\ell \to \infty} P(\sup_f \hat{P}_\ell(|f(x) + t| < \gamma) > a) = 0$. Hence, $\sup_f \hat{P}_\ell(|f(x) + t| = 0)$ converges to zero in probability. Using $t = \pm\rho$ thus shows that almost surely the fraction of points exactly on the margin tends to zero; hence the fraction of SVs equals that of margin errors.
Combining (i) and (ii) shows that both fractions converge almost surely to ν. Moreover, since equation 7.11 means that the sums over the coefficients of positive and negative SVs, respectively, are equal, we conclude that proposition 5 actually holds for both classes separately, with ν/2. (Note that by the same argument, the number of SVs on the two sides of the margin asymptotically agree.)

A connection to standard SV classification, and a somewhat surprising interpretation of the regularization parameter $C$, is described by the following result:

Proposition 6. If ν-SV classification leads to $\rho > 0$, then C-SV classification, with $C$ set a priori to $1/\rho$, leads to the same decision function.

Proof. If one minimizes the function 7.3 and then fixes $\rho$ to minimize only over the remaining variables, nothing will change. Hence the obtained solution $w_0, b_0, \boldsymbol{\xi}_0$ minimizes the function 7.1 for $C = 1$, subject to the constraint 7.4. To recover the constraint 7.2, we rescale to the set of variables $w' = w/\rho$, $b' = b/\rho$, $\boldsymbol{\xi}' = \boldsymbol{\xi}/\rho$. This leaves us, up to a constant scaling factor $\rho^2$, with the objective function 7.1 using $C = 1/\rho$.
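The rescaling step in the proof of proposition 6 can be checked numerically: dividing $(w, b, \xi_i)$ by $\rho$ turns each instance of constraint 7.4 into the C-SVC constraint 7.2. The sketch below uses hypothetical 1D values of our own choosing for $w$, $b$, $\rho$, and the data.

```python
w, b, rho = 2.0, 0.5, 0.8
data = [(0.9, 1), (-0.1, 1), (-1.2, -1), (0.05, -1)]   # hypothetical (x, y) pairs

checks = []
for x, y in data:
    xi = max(0.0, rho - y * (x * w + b))       # smallest slack satisfying 7.4
    lhs_nu = y * (x * w + b) - (rho - xi)      # constraint 7.4, as "lhs >= 0"
    # Rescaled variables w' = w/rho, b' = b/rho, xi' = xi/rho give 7.2:
    lhs_c = y * (x * (w / rho) + b / rho) - (1 - xi / rho)
    checks.append(lhs_nu >= -1e-12 and lhs_c >= -1e-12)

print(all(checks))
```

Since $y((x \cdot w') + b') - 1 + \xi' = \frac{1}{\rho}\big(y((x \cdot w) + b) - \rho + \xi\big)$, both forms of the constraint are satisfied or violated together, which is exactly the equivalence the proof uses.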
As in the case of regression estimation (see proposition 3), linearity of the target function in the slack variables $\boldsymbol{\xi}^{(*)}$ leads to "outlier" resistance of the estimator in pattern recognition. The exact statement, however, differs from the one in regression in two respects. First, the perturbation of the point is carried out in feature space. What it precisely corresponds to in input space therefore depends on the specific kernel chosen. Second, instead of referring to points outside the ε-tube, it refers to margin error points, that is, points that are misclassified or fall into the margin. Below, we use the shorthand $z_i$ for $\Phi(x_i)$.

Proposition 7 (resistance of SV classification). Suppose $w$ can be expressed in terms of the SVs that are not at bound, that is,

$$w = \sum_i \gamma_i z_i, \quad (7.17)$$

with $\gamma_i \ne 0$ only if $\alpha_i \in (0, 1/\ell)$ (where the $\alpha_i$ are the coefficients of the dual solution). Then local movements of any margin error $z_m$ parallel to $w$ do not change the hyperplane.

Proof. Since the slack variable of $z_m$ satisfies $\xi_m > 0$, the KKT conditions (e.g., Bertsekas, 1995) imply $\alpha_m = 1/\ell$. If $\delta$ is sufficiently small, then transforming the point into $z'_m := z_m + \delta \cdot w$ results in a slack that is still nonzero,
that is, $\xi'_m > 0$; hence we have $\alpha'_m = 1/\ell = \alpha_m$. Updating the $\xi_m$ and keeping all other primal variables unchanged, we obtain a modified set of primal variables that is still feasible. We next show how to obtain a corresponding set of feasible dual variables. To keep $w$ unchanged, we need to satisfy

$$\sum_i \alpha_i y_i z_i = \sum_{i \ne m} \alpha'_i y_i z_i + \alpha_m y_m z'_m.$$

Substituting $z'_m = z_m + \delta \cdot w$ and equation 7.17, we note that a sufficient condition for this to hold is that for all $i \ne m$, $\alpha'_i = \alpha_i - \delta\gamma_i y_i \alpha_m y_m$. Since by assumption $\gamma_i$ is nonzero only if $\alpha_i \in (0, 1/\ell)$, $\alpha'_i$ will be in $(0, 1/\ell)$ if $\alpha_i$ is, provided $\delta$ is sufficiently small, and it will equal $1/\ell$ if $\alpha_i$ does. In both cases, we end up with a feasible solution $\boldsymbol{\alpha}'$, and the KKT conditions are still satisfied. Thus (Bertsekas, 1995), $(w, b)$ are still the hyperplane parameters of the solution.

Note that the assumption 7.17 is not as restrictive as it may seem. Although the SV expansion of the solution, $w = \sum_i \alpha_i y_i z_i$, often contains many multipliers $\alpha_i$ that are at bound, it is nevertheless conceivable that, especially when discarding the requirement that the coefficients be bounded, we can obtain an expansion (see equation 7.17) in terms of a subset of the original vectors. For instance, if we have a 2D problem that we solve directly in input space, with $k(x, y) = (x \cdot y)$, then it already suffices to have two linearly independent SVs that are not at bound in order to express $w$. This holds for any overlap of the two classes, even if there are many SVs at the upper bound.

For the selection of $C$, several methods have been proposed that could probably be adapted for ν (Schölkopf, 1997; Shawe-Taylor & Cristianini, 1999). In practice, most researchers have so far used cross validation. Clearly, this could be done also for ν-SVC. Nevertheless, we shall propose a method that takes into account specific properties of ν-SVC. The parameter ν lets us control the number of margin errors, the crucial quantity in a class of bounds on the generalization error of classifiers using covering numbers to measure the classifier capacity. We can use this connection to give a generalization error bound for ν-SVC in terms of ν. There are a number of complications in doing this the best possible way, and so here we will indicate the simplest one.
It is based on the following result:

Proposition 8 (Bartlett, 1998). Suppose $\rho > 0$, $0 < \delta < \frac{1}{2}$, and $P$ is a probability distribution on $X \times \{-1, 1\}$ from which the training set, equation 6.6, is drawn. Then with probability at least $1 - \delta$, for every $f$ in some function class $\mathcal{F}$, the
probability of error of the classification function $f^* = \mathrm{sgn} \circ f$ on an independent test set is bounded according to

$$R[f^*] \le R_{\mathrm{emp}}^\rho[f] + \sqrt{\frac{2}{\ell}\big(\ln \mathcal{N}(\mathcal{F}, l_\infty^{2\ell}, \rho/2) + \ln(2/\delta)\big)}, \quad (7.18)$$

where $\mathcal{N}(\mathcal{F}, l_\infty^\ell, \rho) = \sup_{X = x_1, \ldots, x_\ell} \mathcal{N}(\mathcal{F}|_X, l_\infty, \rho)$, $\mathcal{F}|_X = \{(f(x_1), \ldots, f(x_\ell)) : f \in \mathcal{F}\}$, and $\mathcal{N}(\mathcal{F}|_X, l_\infty, \rho)$ is the $\rho$-covering number of $\mathcal{F}|_X$ with respect to $l_\infty$, the usual $l_\infty$ metric on a set of vectors.

To obtain the generalization bound for ν-SVC, we simply substitute the bound $R_{\mathrm{emp}}^\rho[f] \le \nu$ (proposition 5, i) and some estimate of the covering numbers in terms of the margin. The best available bounds are stated in terms of the functional inverse of $\mathcal{N}$, hence the slightly complicated expressions in the following.

Proposition 9 (Williamson et al., 1998).
Denote by $B_R$ the ball of radius $R$ around the origin in some Hilbert space $F$. Then the covering number $\mathcal{N}$ of the class of functions

$$\mathcal{F} = \{x \mapsto (w \cdot x) : \|w\| \le 1,\ x \in B_R\} \quad (7.19)$$

at scale $\rho$ satisfies

$$\log_2 \mathcal{N}(\mathcal{F}, l_\infty^\ell, \rho) \le \inf\Big\{n \;\Big|\; \frac{c^2 R^2}{\rho^2}\,\frac{1}{n}\,\log_2\Big(1 + \frac{\ell}{n}\Big) \ge 1\Big\} - 1, \quad (7.20)$$

where $c < 103$ is a constant. This is a consequence of a fundamental theorem due to Maurey. For $\ell \ge 2$, one thus obtains

$$\log_2 \mathcal{N}(\mathcal{F}, l_\infty^\ell, \rho) \le \frac{c^2 R^2}{\rho^2}\,\log_2 \ell - 1. \quad (7.21)$$
To apply these results to ν-SVC, we rescale $w$ to length 1, thus obtaining a margin $\rho / \|w\|$ (cf. equation 7.4). Moreover, we have to combine propositions 8 and 9. Using $\rho/2$ instead of $\rho$ in the latter yields the following result.

Proposition 10. Suppose ν-SVC is used with a kernel of the form $k(x, y) = k(\|x - y\|)$ with $k(0) = 1$. Then all the data points $\Phi(x_i)$ in feature space live in a ball of radius 1 centered at the origin. Consequently, with probability at least $1 - \delta$ over the training set (see equation 6.6), the ν-SVC decision function $f^* = \mathrm{sgn} \circ f$,
with $f(x) = \sum_i \alpha_i y_i k(x, x_i)$ (cf. equation 7.13), has a probability of test error bounded according to

$$R[f^*] \le R_{\mathrm{emp}}^\rho[f] + \sqrt{\frac{2}{\ell}\Big(\frac{4c^2\|w\|^2}{\rho^2}\log_2(2\ell) - 1 + \ln(2/\delta)\Big)}$$

$$\le \nu + \sqrt{\frac{2}{\ell}\Big(\frac{4c^2\|w\|^2}{\rho^2}\log_2(2\ell) - 1 + \ln(2/\delta)\Big)}.$$
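The right-hand side of the proposition 10 bound is easy to evaluate. The sketch below computes it for hypothetical parameter values; the constant $c$ is known only to satisfy $c < 103$, so the value used here is a placeholder, and the absolute numbers are illustrative (the bound is typically vacuous for small samples, but its qualitative behavior in $\ell$ is visible).

```python
import math

def nu_svc_bound(nu, w_norm, rho, ell, delta, c=100.0):
    """Right-hand side of the proposition 10 test-error bound.
    c is a placeholder for the unspecified constant c < 103."""
    cap = 4 * c**2 * w_norm**2 / rho**2 * math.log2(2 * ell) - 1
    return nu + math.sqrt(2.0 / ell * (cap + math.log(2 / delta)))

b_small = nu_svc_bound(nu=0.2, w_norm=1.0, rho=0.5, ell=100, delta=0.05)
b_large = nu_svc_bound(nu=0.2, w_norm=1.0, rho=0.5, ell=10000, delta=0.05)
print(b_small > b_large)   # the capacity term shrinks as ell grows
```

Since the capacity term grows only logarithmically in $\ell$ while the prefactor decays as $1/\ell$, the bound tightens toward ν as the sample size increases.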
Notice that in general, $w$ is a vector in feature space.

Note that the set of functions in the proposition differs from support vector decision functions (see equation 7.13) in that it comes without the $+b$ term. This leads to a minor modification (for details, see Williamson et al., 1998).

Better bounds can be obtained by estimating the radius or even optimizing the choice of the center of the ball (cf. the procedure described by Schölkopf et al., 1995; Burges, 1998). However, in order to get a theorem of the above form in that case, a more complex argument is necessary (see Shawe-Taylor, Bartlett, Williamson, & Anthony, 1998, sec. VI, for an indication).

We conclude this section by noting that a straightforward extension of the ν-SVC algorithm is to include parametric models $f_k(x)$ for the margin, and thus to use $\sum_q \rho_q f_q(x_i)$ instead of $\rho$ in the constraint (see equation 7.4), in complete analogy to the regression case discussed in section 5.

8 Experiments

8.1 Regression Estimation. In the experiments, we used the optimizer LOQO.⁷ This has the serendipitous advantage that the primal variables $b$ and $\varepsilon$ can be recovered as the dual variables of the Wolfe dual (see equation 2.11) (i.e., the double dual variables) fed into the optimizer.
8.1.1 Toy Examples. The first task was to estimate a noisy sinc function, given $\ell$ examples $(x_i, y_i)$, with $x_i$ drawn uniformly from $[-3, 3]$, and $y_i = \sin(\pi x_i)/(\pi x_i) + u_i$, where the $u_i$ were drawn from a gaussian with zero mean and variance $\sigma^2$. Unless stated otherwise, we used the radial basis function (RBF) kernel $k(x, x') = \exp(-|x - x'|^2)$, $\ell = 50$, $C = 100$, $\nu = 0.2$, and $\sigma = 0.2$. Whenever standard deviation error bars are given, the results were obtained from 100 trials. Finally, the risk (or test error) of a regression estimate $f$ was computed with respect to the sinc function without noise, as $\frac{1}{6}\int_{-3}^{3} |f(x) - \sin(\pi x)/(\pi x)|\,dx$. Results are given in Table 2 and Figures 3 through 9.

⁷ Available online at http://www.princeton.edu/~rvdb/.
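The toy setup above is easy to reproduce. The sketch below generates the noisy sinc training data and evaluates the risk integral with a simple midpoint Riemann sum (the sample size, noise level, seed, and grid resolution match or stand in for the values stated in the text; the seed and grid are our own choices).

```python
import math, random

def sinc(x):
    """sin(pi*x)/(pi*x), with the removable singularity at 0 filled in."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def make_data(ell=50, sigma=0.2, seed=0):
    """Noisy sinc sample: x uniform on [-3, 3], y = sinc(x) + N(0, sigma^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3, 3) for _ in range(ell)]
    ys = [sinc(x) + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def risk(f, n=2000):
    """Midpoint-rule approximation of (1/6) * int_{-3}^{3} |f - sinc| dx."""
    h = 6.0 / n
    return sum(abs(f(-3 + (i + 0.5) * h) - sinc(-3 + (i + 0.5) * h))
               for i in range(n)) * h / 6.0

xs, ys = make_data()
print(len(xs), risk(sinc))   # the noise-free sinc itself has zero risk
```

Any regression estimate returned by a solver can be plugged into `risk` in place of `sinc` to reproduce the test-error numbers reported below.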
Figure 3: ν-SV regression with ν = 0.2 (top) and ν = 0.8 (bottom). The larger ν allows more points to lie outside the tube (see section 2). The algorithm automatically adjusts ε to 0.22 (top) and 0.04 (bottom). Shown are the sinc function (dotted), the regression $f$, and the tube $f \pm \varepsilon$.
Figure 4: ν-SV regression on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ν = 0.2. The tube width automatically adjusts to the noise (top: ε = 0; bottom: ε = 1.19).
Figure 5: ε-SV regression (Vapnik, 1995) on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ε = 0.2. This choice, which has to be specified a priori, is ideal for neither case. In the upper figure, the regression estimate is biased; in the lower figure, ε does not match the external noise (Smola, Murata, Schölkopf, & Müller, 1998).
Figure 6: ν-SVR for different values of the error constant ν. Notice how ε decreases when more errors are allowed (large ν), and that over a large range of ν, the test error (risk) is insensitive toward changes in ν.
Figure 7: ν-SVR for different values of the noise σ. The tube radius ε increases linearly with σ (largely due to the fact that both ε and the $\xi_i^{(*)}$ enter the cost function linearly). Due to the automatic adaptation of ε, the number of SVs and points outside the tube (errors) are, except for the noise-free case σ = 0, largely independent of σ.
Figure 8: ν-SVR for different values of the constant C. (Top) ε decreases when the regularization is decreased (large C). Only very little, if any, overfitting occurs. (Bottom) ν upper bounds the fraction of errors, and lower bounds the fraction of SVs (cf. proposition 1). The bound gets looser as C increases; this corresponds to a smaller number of examples ℓ relative to C (cf. Table 2).
Table 2: Asymptotic Behavior of the Fraction of Errors and SVs.

ℓ                    10     50     100    200    500    1000   1500   2000
ε                    0.27   0.22   0.23   0.25   0.26   0.26   0.26   0.26
Fraction of errors   0.00   0.10   0.14   0.18   0.19   0.20   0.20   0.20
Fraction of SVs      0.40   0.28   0.24   0.23   0.21   0.21   0.20   0.20

Notes: The ε found by ν-SV regression is largely independent of the sample size ℓ. The fraction of SVs and the fraction of errors approach ν = 0.2 from above and below, respectively, as the number of training examples ℓ increases (cf. proposition 1).
Figure 10 gives an illustration of how one can make use of parametric insensitivity models as proposed in section 5. Using the proper model, the estimate gets much better. In the parametric case, we used ν = 0.1 and $f(x) = \sin^2((2\pi/3)x)$, which, due to $\int f(x)\,dP(x) = 1/2$, corresponds to our standard choice ν = 0.2 in ν-SVR (cf. proposition 4). Although this relies on the assumption that the SVs are uniformly distributed, the experimental findings are consistent with the asymptotics predicted theoretically: for ℓ = 200, we got 0.24 and 0.19 for the fraction of SVs and errors, respectively.

8.1.2 Boston Housing Benchmark. Empirical studies using ε-SVR have reported excellent performance on the widely used Boston housing regression benchmark set (Stitson et al., 1999). Due to proposition 2, the only difference between ν-SVR and standard ε-SVR lies in the fact that different parameters, ε versus ν, have to be specified a priori. Accordingly, the goal of the following experiment was not to show that ν-SVR is better than ε-SVR, but that ν is a useful parameter to select. Consequently, we are interested only in ν and ε, and hence kept the remaining parameters fixed. We adjusted C and the width $2\sigma^2$ in $k(x, y) = \exp(-\|x - y\|^2/(2\sigma^2))$ as in Schölkopf et al. (1997). We used $2\sigma^2 = 0.3 \cdot N$, where N = 13 is the input dimensionality, and $C/\ell = 10 \cdot 50$ (i.e., the original value of 10 was corrected since in the present case, the maximal y-value is 50 rather than 1). We performed 100 runs, where each time the overall set of 506 examples was randomly split into a training set of ℓ = 481 examples and a test set of 25 examples (cf. Stitson et al., 1999). Table 3 shows that over a wide range of ν (note that only 0 ≤ ν ≤ 1 makes sense), we obtained performances that are close to the best performances that can be achieved by selecting ε a priori by looking at the test set.

Finally, although we did not use validation techniques to select the optimal values for C and $2\sigma^2$, the performances are state of the art (Stitson et al., 1999, report an MSE of 7.6 for ε-SVR using ANOVA kernels, and 11.7 for bagging regression trees). Table 3, moreover, shows that in this real-world application, ν can be used to control the fraction of SVs/errors.
Table 3: Results for the Boston Housing Benchmark (top: ν-SVR; bottom: ε-SVR).

ν            0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
automatic ε  2.6   1.7   1.2   0.8   0.6   0.3   0.0   0.0   0.0   0.0
MSE          9.4   8.7   9.3   9.5   10.0  10.6  11.3  11.3  11.3  11.3
STD          6.4   6.8   7.6   7.9   8.4   9.0   9.6   9.5   9.5   9.5
Errors       0.0   0.1   0.2   0.2   0.3   0.4   0.5   0.5   0.5   0.5
SVs          0.3   0.4   0.6   0.7   0.8   0.9   1.0   1.0   1.0   1.0

ε            0     1     2     3     4     5     6     7     8     9     10
MSE          11.3  9.5   8.8   9.7   11.2  13.1  15.6  18.2  22.1  27.0  34.3
STD          9.5   7.7   6.8   6.2   6.3   6.0   6.1   6.2   6.6   7.3   8.4
Errors       0.5   0.2   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
SVs          1.0   0.6   0.4   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.1

Note: MSE: mean squared errors; STD: standard deviations thereof (100 trials); Errors: fraction of training points outside the tube; SVs: fraction of training points that are SVs.
8.2 Classification. As in the regression case, the difference between C-SVC and ν-SVC lies in the fact that we have to select a different parameter a priori. If we are able to do this well, we obtain identical performances. In other words, ν-SVC could be used to reproduce the excellent results obtained on various data sets using C-SVC (for an overview, see Schölkopf, Burges, & Smola, 1999). This would certainly be a worthwhile project; however, we restrict ourselves here to showing some toy examples illustrating the influence of ν (see Figure 11). The corresponding fractions of SVs and margin errors are listed in Table 4.

9 Discussion
We have presented a new class of SV algorithms, which are parameterized by a quantity ν that lets one control the number of SVs and errors. We described ν-SVR, a new regression algorithm that has been shown to be rather
Figure 9: Facing page. ν-SVR for different values of the gaussian kernel width $2\sigma^2$, using $k(x, x') = \exp(-|x - x'|^2/(2\sigma^2))$. Using a kernel that is too wide results in underfitting; moreover, since the tube becomes too rigid as $2\sigma^2$ gets larger than 1, the ε needed to accommodate a fraction $(1 - \nu)$ of the points increases significantly. In the bottom figure, it can again be seen that the speed of the uniform convergence responsible for the asymptotic statement given in proposition 1 depends on the capacity of the underlying model. Increasing the kernel width leads to smaller covering numbers (Williamson et al., 1998) and therefore faster convergence.
useful in practice. We gave theoretical results concerning the meaning and the choice of the parameter ν. Moreover, we have applied the idea underlying ν-SV regression to develop a ν-SV classification algorithm. Just like its regression counterpart, the algorithm is interesting from both a practical and a theoretical point of view. Controlling the number of SVs has consequences for (1) run-time complexity, since the evaluation time of the estimated function scales linearly with the number of SVs (Burges, 1998); (2) training time, e.g., when using a chunking algorithm (Vapnik, 1979) whose complexity increases with the number of SVs; (3) possible data compression applications, since ν characterizes the compression ratio: it suffices to train the algorithm only on the SVs, leading to the same solution (Schölkopf et al., 1995); and (4) generalization error bounds: the algorithm directly optimizes a quantity using which one can give generalization bounds. These, in turn, could be used to perform structural risk minimization over ν. Moreover, asymptotically, ν directly controls the number of support vectors, and the latter can be used to give a leave-one-out generalization bound (Vapnik, 1995).

Figure 10: Toy example, using prior knowledge about an x-dependence of the noise. Additive noise (σ = 1) was multiplied by the function sin²((2π/3)x). (Top) The same function was used as a parametric insensitivity tube (section 5). (Bottom) ν-SVR with standard tube.

Table 4: Fractions of Errors and SVs, Along with the Margins of Class Separation, for the Toy Example Depicted in Figure 11.

ν                   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
Fraction of errors  0.00   0.07   0.25   0.32   0.39   0.50   0.61   0.71
Fraction of SVs     0.29   0.36   0.43   0.46   0.57   0.68   0.79   0.86
Margin 2ρ/‖w‖       0.009  0.035  0.229  0.312  0.727  0.837  0.922  1.092

Note: ν upper bounds the fraction of errors and lower bounds the fraction of SVs, and increasing ν, i.e., allowing more errors, increases the margin.
Figure 11: Toy problem (task: separate circles from disks) solved using ν-SV classification, with parameter values ranging from ν = 0.1 (top left) to ν = 0.8 (bottom right). The larger we select ν, the more points are allowed to lie inside the margin (depicted by dotted lines). As a kernel, we used the gaussian $k(x, y) = \exp(-\|x - y\|^2)$.
In both the regression and the pattern recognition case, the introduction of º has enabled us to dispose of another parameter. In the regression case, this was the accuracy parameter e ; in pattern recognition, it was the regularization constant C. Whether we could have as well abolished C in the regression case is an open problem. Note that the algorithms are not fundamentally different from previous SV algorithms; in fact, we showed that for certain parameter settings, the results coincide. Nevertheless, we believe there are practical applications where it is more convenient to specify a fraction of points that is allowed to become errors, rather than quantities that are either hard to adjust a priori (such as the accuracy e ) or do not have an intuitive interpretation (such as C). On the other hand, desirable properties of previous SV algorithms, including the formulation as a denite quadratic program, and the sparse SV representation of the solution, are retained. We are optimistic that in many applications, the new algorithms will prove to be quite robust. Among these should be the reduced set algorithm of Osuna and Girosi (1999), which approximates the SV pattern recognition decision surface by e -SVR. Here, º-SVR should give a direct handle on the desired speed-up. Future work includes the experimental test of the asymptotic predictions of section 4 and an experimental evaluation of º-SV classication on realworld problems. Moreover, the formulation of efcient chunking algorithms for the º-SV case should be studied (cf. Platt, 1999). Finally, the additional freedom to use parametric error models has not been exploited yet. We expect that this new capability of the algorithms could be very useful in situations where the noise is heteroscedastic, such as in many problems of nancial data analysis, and general time-series analysis applications (Muller ¨ et al., 1999; Mattera & Haykin, 1999). 
If a priori knowledge about the noise is available, it can be incorporated into an error model f; if not, we can try to estimate the model directly from the data, for example, by using a variance estimator (e.g., Seifert, Gasser, & Wolf, 1993) or a quantile estimator (section 3).

Acknowledgments
This work was supported in part by grants of the Australian Research Council and the DFG (Ja 379/7-1 and Ja 379/9-1). Thanks to S. Ben-David, A. Elisseeff, T. Jaakkola, K. Müller, J. Platt, R. von Sachs, and V. Vapnik for discussions and to L. Almeida for pointing us to White's work. Jason Weston has independently performed experiments using a sum inequality constraint on the Lagrange multipliers, but declined an offer of coauthorship.

References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1–47.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Mattera, D., & Haykin, S. (1999). Support vector machines for dynamic reconstruction of a chaotic system. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 211–241). Cambridge, MA: MIT Press.
Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Predicting time series with support vector machines. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 243–253). Cambridge, MA: MIT Press.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks, 5, 865–872.
Osuna, E., & Girosi, F. (1999). Reducing run-time complexity in support vector machines. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 271–283). Cambridge, MA: MIT Press.
Platt, J. (1999). Fast training of SVMs using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Pontil, M., Rifkin, R., & Evgeniou, T. (1999). From regression to classification in support vector machines. In M. Verleysen (Ed.), Proceedings ESANN (pp. 225–230). Brussels: D Facto.
Schölkopf, B. (1997). Support vector learning. Munich: R. Oldenbourg Verlag.
Schölkopf, B., Bartlett, P. L., Smola, A., & Williamson, R. C. (1998). Support vector regression with automatic accuracy control. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (pp. 111–116). Berlin: Springer-Verlag.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Kernel-dependent support vector error bounds. In Ninth International Conference on Artificial Neural Networks (pp. 103–108). London: IEE.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., & Vapnik, V. (1997). Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45, 2758–2765.
Seifert, B., Gasser, T., & Wolf, A. (1993). Nonparametric estimation of residual variance revisited. Biometrika, 80, 373–383.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer.
Smola, A. J. (1998). Learning with kernels. Doctoral dissertation, Technische Universität Berlin. Also: GMD Research Series No. 25, Birlinghoven, Germany.
Smola, A., Frieß, T., & Schölkopf, B. (1999). Semiparametric support vector and linear programming machines. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 585–591). Cambridge, MA: MIT Press.
Smola, A., Murata, N., Schölkopf, B., & Müller, K.-R. (1998). Asymptotically optimal choice of ε-loss for support vector machines. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (pp. 105–110). Berlin: Springer-Verlag.
Smola, A., & Schölkopf, B. (1998). On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22, 211–231.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Williamson, R. C., Mika, S., & Schölkopf, B. (1999). Regularized principal manifolds. In Computational Learning Theory: 4th European Conference (pp. 214–229). Berlin: Springer-Verlag.
Stitson, M., Gammerman, A., Vapnik, V., Vovk, V., Watkins, C., & Weston, J. (1999). Support vector regression with ANOVA decomposition kernels. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 285–291). Cambridge, MA: MIT Press.
Vapnik, V. (1979). Estimation of dependences based on empirical data [in Russian]. Moscow: Nauka. (English translation: Springer-Verlag, New York, 1982.)
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 69–88). Cambridge, MA: MIT Press.
White, H. (1994). Parametric statistical estimation with artificial neural networks: A condensed discussion. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks. Berlin: Springer.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. 19, NeuroCOLT Series). London: Royal Holloway College. Available online at http://www.neurocolt.com.

Received December 2, 1998; accepted May 14, 1999.
Neural Computation, Volume 12, Number 6
CONTENTS

Article
Separating Style and Content with Bilinear Models Joshua B. Tenenbaum and William T. Freeman
1247
Notes
Relationships Between the A Priori and A Posteriori Errors in Nonlinear Adaptive Neural Filters Danilo P. Mandic and Jonathon A. Chambers
1285
The VC Dimension for Mixtures of Binary Classifiers Wenxin Jiang
1293
The Early Restart Algorithm Malik Magdon-Ismail and Amir F. Atiya
1303
Letters
Attractor Dynamics in Feedforward Neural Networks Lawrence K. Saul and Michael I. Jordan
1313
Visualizing the Function Computed by a Feedforward Neural Network Tony A. Plate, Joel Bert, John Grace, and Pierre Band
1337
Discriminant Pattern Recognition Using Transformation-Invariant Neurons Diego Sona, Alessandro Sperduti, and Antonina Starita
1355
Observable Operator Models for Discrete Stochastic Time Series Herbert Jaeger
1371
Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons Shun-ichi Amari, Hyeyoung Park, and Kenji Fukumizu
1399
Nonmonotonic Generalization Bias of Gaussian Mixture Models Shotaro Akaho and Hilbert J. Kappen
1411
Efficient Block Training of Multilayer Perceptrons A. Navia-Vázquez and A. R. Figueiras-Vidal
1429
An Optimization Approach to Design of Generalized BSB Neural Associative Memories Jooyoung Park and Yonmook Park
1449
Nonholonomic Orthogonal Learning Algorithms for Blind Source Separation Shun-ichi Amari, Tian-Ping Chen, and Andrzej Cichocki
1463
ARTICLE
Communicated by John Platt
Separating Style and Content with Bilinear Models

Joshua B. Tenenbaum*
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

William T. Freeman
MERL, a Mitsubishi Electric Research Lab, 201 Broadway, Cambridge, MA 02139, U.S.A.

Perceptual systems routinely separate "content" from "style," classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. Yet a general and tractable computational model of this ability to untangle the underlying factors of perceptual observations remains elusive (Hofstadter, 1985). Existing factor models (Mardia, Kent, & Bibby, 1979; Hinton & Zemel, 1994; Ghahramani, 1995; Bell & Sejnowski, 1995; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Hinton & Ghahramani, 1997) are either insufficiently rich to capture the complex interactions of perceptually meaningful factors such as phoneme and speaker accent or letter and font, or do not allow efficient learning algorithms. We present a general framework for learning to solve two-factor tasks using bilinear models, which provide sufficiently expressive representations of factor interactions but can nonetheless be fit to data using efficient algorithms based on the singular value decomposition and expectation-maximization. We report promising results on three different tasks in three different perceptual domains: spoken vowel classification with a benchmark multispeaker database, extrapolation of fonts to unseen letters, and translation of faces to novel illuminants.

1 Introduction
Perceptual systems routinely separate the "content" and "style" factors of their observations, classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. These and many other basic perceptual tasks have in common the need to process separately two independent factors that underlie a set of observations. This article shows how perceptual systems may learn to solve these crucial two-factor tasks using simple and tractable bilinear models. By fitting such models to a training set of observations, the influences of style and content factors can be efficiently separated in a flexible representation that naturally supports generalization to unfamiliar styles or content classes.

Figure 1 illustrates three abstract tasks that fall under this framework: classification, extrapolation, and translation. Examples of these abstract tasks in the domain of typography include classifying known characters in a novel font, extrapolating the missing characters of an incomplete novel font, or translating novel characters from a novel font into a familiar font. The essential challenge in all of these tasks is the same. A perceptual system observes a training set of data in multiple styles and content classes and is then presented with incomplete data in an unfamiliar style, missing either content labels (see Figure 1A) or whole observations (see Figure 1B) or both (see Figure 1C). The system must generate the missing labels or observations using only the available data in the new style and what it can learn about the interacting roles of style and content from the training set of complete data.

We describe a unified approach to the learning problems of Figure 1 based on fitting models that discover explicit parameterized representations of what the training data of each row have in common independent of column, what the data of each column have in common independent of row, and what all data have in common independent of row and column: the interaction of row and column factors. Such a modular representation naturally supports generalization to new styles or content.

* Current address: Department of Psychology, Stanford University, Stanford, CA 94305, U.S.A.

Neural Computation 12, 1247–1283 (2000) © 2000 Massachusetts Institute of Technology
For example, we can extrapolate a new style to unobserved content classes (see Figure 1B) by combining content and interaction parameters learned during training with style parameters estimated from available data in the new style.

Several models for the underlying factors of observations have recently been proposed in the literature on unsupervised learning. These include essentially additive factor models, as used in principal component analysis (Mardia et al., 1979), independent component analysis (Bell & Sejnowski, 1995), and cooperative vector quantization (Hinton & Zemel, 1994; Ghahramani, 1995), and hierarchical factorial models, as used in the Helmholtz machine and its descendants (Hinton et al., 1995; Dayan et al., 1995; Hinton & Ghahramani, 1997). We model the mapping from style and content parameters to observations as a bilinear mapping. Bilinear models are two-factor models with the mathematical property of separability: their outputs are linear in either factor when the other is held constant. Their combination of representational expressiveness and efficient learning procedures enables bilinear models to overcome two principal drawbacks of existing factor models that might be applied to learning the tasks in Figure 1. In contrast to additive factor models, bilinear models provide for rich factor interactions by allowing factors to modulate each other's contributions multiplicatively (see section 2).
[Figure 1 appears here: letter-grid panels illustrating (A) Classification, (B) Extrapolation, and (C) Translation.]
Figure 1: Given a labeled training set of observations in multiple styles (e.g., fonts) and content classes (e.g., letters), we want to (A) classify content observed in a new style, (B) extrapolate a new style to unobserved content classes, and (C) translate from new content observed only in new styles into known styles or content classes.
Model dimensionality can be adjusted to accommodate data that arise from arbitrarily complex interactions of style and content factors. In contrast to hierarchical factorial models, model fitting can be carried out by efficient techniques well known from the study of linear models, such as the singular value decomposition (SVD) and the expectation-maximization (EM) algorithm, without having to invoke extensive stochastic (Hinton et al., 1995) or deterministic (Dayan et al., 1995) approximations.

Our approach is also related to the "learning-to-learn" research program (Thrun & Pratt, 1998), also known as task transfer or multitask learning (Caruana, 1998). The central insight of "learning to learn" is that learning problems often come in clusters of related tasks, and thus learners may automatically acquire useful biases for a novel learning task by training on many related ones. Rather than families of related tasks, we focus on how learners can exploit the structure in families of related observations, bound together by their common styles, content classes, or style × content interaction, to acquire general biases useful for carrying out tasks on novel observations from the same family. Thus, our work is closest in spirit to the family discovery approach of Omohundro (1995), differing primarily in our focus on bilinear models to parameterize the style × content interaction.

Section 2 explains and motivates our bilinear modeling approach. Section 3 describes how these models are fit to a training set of observations. Sections 4, 5, and 6 present specific applications of these techniques to the three tasks of classification, extrapolation, and translation, using realistic data from three different perceptual domains. Section 7 suggests directions for future work, and section 8 offers some concluding comments.

We will use the terms style and content generically to refer to any two independent factors underlying a set of perceptual observations. For tasks that require generalization to novel classes of only one factor (see Figures 1A and 1B), we will refer to the variable factor (which changes during generalization) as style and the invariant factor (with a fixed set of classes) as content.
For example, in a task of recognizing familiar words spoken in an unfamiliar accent, we would think of the words as content and the accent as style. For tasks that require generalization across both factors (see Figure 1C), the labels style and content are arbitrary, and we will use them as seems most natural.

2 Bilinear Models
We have explored two bilinear models, closely related to each other, which we distinguish by the labels symmetric and asymmetric. This section describes the two models and illustrates them on a simple data set of face images.

2.1 Symmetric Model. In the symmetric model, we represent both style s and content c with vectors of parameters, denoted a^s and b^c and with dimensionalities I and J, respectively. Let y^{sc} denote a K-dimensional observation vector in style s and content class c. We assume that y^{sc} is a bilinear function of a^s and b^c given most generally by the form

    y^{sc}_k = \sum_{i=1}^{I} \sum_{j=1}^{J} w_{ijk} a^s_i b^c_j.    (2.1)

Here i, j, and k denote the components of style, content, and observation vectors, respectively.¹ The w_{ijk} terms are independent of style and content and characterize the interaction of these two factors. Their meaning becomes clearer when we rewrite equation 2.1 in vector form. Letting W_k denote the I × J matrix with entries {w_{ijk}}, equation 2.1 can be written as

    y^{sc}_k = (a^s)^T W_k b^c.    (2.2)

In equation 2.2, the K matrices W_k describe a bilinear map from the style and content vector spaces to the K-dimensional observation space. The interaction terms have another interpretation, which can be seen by writing the symmetric model in a different vector form. Letting w_{ij} denote the K-dimensional vector with components {w_{ijk}}, equation 2.1 can be written as

    y^{sc} = \sum_{i,j} w_{ij} a^s_i b^c_j.    (2.3)

In equation 2.3, the w_{ij} terms represent I × J basis vectors of dimension K, and the observation y^{sc} is generated by mixing these basis vectors with coefficients given by the tensor product of a^s and b^c. Of course, all of these interpretations are formally equivalent, but they suggest different intuitions which we will exploit later.

As a concrete example, Figure 2 illustrates a symmetric model of face images of different people in different poses (sampled from the complete data in Figure 6). Here the basis vector interpretation of the w_{ijk} terms is most natural, by analogy to the well-known work on eigenfaces (Kirby & Sirovich, 1990; Turk & Pentland, 1991). Each pose is represented by a vector of I parameters, a^{pose}_i, and each person by a vector of J parameters, b^{person}_j. To render an image of a particular person in a particular pose, a set of I × J basis images w_{ij} is linearly mixed with coefficients given by the tensor product of these two parameter vectors (see equation 2.3). The symmetric model can exactly reproduce the observations when I and J equal the numbers of observed styles S and content classes C, respectively, as is the case in Figure 2. The model provides coarser but more compact representations as these dimensionalities are decreased.

¹ The model in equation 2.1 may appear trilinear, but we view the w_{ijk} terms as describing a fixed bilinear mapping from a^s and b^c to y^{sc}.
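The three equivalent forms of the symmetric model (equations 2.1 through 2.3) are easy to check numerically. The following is an illustrative numpy sketch (all dimensions and values are synthetic, chosen by us, not taken from the paper); the last line also verifies separability, i.e., that the output is linear in b^c when a^s is held fixed.

```python
# Illustrative sketch of the symmetric bilinear model; I, J, K and all
# values are made up for demonstration purposes.
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 5                      # style dim, content dim, observation dim
W = rng.normal(size=(I, J, K))         # interaction tensor w_ijk
a = rng.normal(size=I)                 # style vector a^s
b = rng.normal(size=J)                 # content vector b^c

# eq. 2.1: y_k = sum_{i,j} w_ijk a_i b_j
y1 = np.einsum('ijk,i,j->k', W, a, b)
# eq. 2.2: y_k = a^T W_k b, one I x J matrix W_k per observation component
y2 = np.array([a @ W[:, :, k] @ b for k in range(K)])
# eq. 2.3: mix I*J basis vectors w_ij with coefficients a_i b_j
y3 = np.tensordot(np.outer(a, b), W, axes=([0, 1], [0, 1]))

assert np.allclose(y1, y2) and np.allclose(y1, y3)
# separability: linear in b when a is held fixed
assert np.allclose(np.einsum('ijk,i,j->k', W, a, 2 * b), 2 * y1)
```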
Figure 2: Illustration of a symmetric bilinear model for a small set of faces (a subset of Figure 6). The two factors for this example are person and pose. One vector of coefficients, a^{pose}_i, describes the pose, and a second vector, b^{person}_j, describes the person. To render a particular person under a particular pose, the vectors a^{pose}_i and b^{person}_j multiply along the four rows and three columns of the array of basis images w_{ij}. The weighted sum of basis images yields the reconstructed faces y^{pose,person}.
2.2 Asymmetric Model. Sometimes linear combinations of a few basis styles learned during training may not describe new styles well. We can obtain more flexible, asymmetric models by letting the interaction terms w_{ijk} themselves vary with style. Then equation 2.1 becomes y^{sc}_k = \sum_{i,j} w^s_{ijk} a^s_i b^c_j. Without loss of generality, we can combine the style-specific terms of equation 2.1 into

    a^s_{jk} = \sum_i w^s_{ijk} a^s_i,    (2.4)

giving

    y^{sc}_k = \sum_j a^s_{jk} b^c_j.    (2.5)

Again, there are two interpretations of the model, corresponding to different vector forms of equation 2.5. First, letting A^s denote the K × J matrix with entries {a^s_{jk}}, equation 2.5 can be written as

    y^{sc} = A^s b^c.    (2.6)

Here, we can think of the a^s_{jk} terms as describing a style-specific linear map from content space to observation space. Alternatively, letting a^s_j denote the K-dimensional vector with components {a^s_{jk}}, equation 2.5 can be written as

    y^{sc} = \sum_j a^s_j b^c_j.    (2.7)

Now we can think of the a^s_{jk} terms as describing a set of J style-specific basis vectors that are mixed according to content-specific coefficients b^c_j (independent of style) to produce the observations.

Figure 3 illustrates an asymmetric bilinear model applied to the face database, with head pose as the style factor. Each pose is represented by a set of J basis images a^{pose}_j and each person by a vector of J parameters b^{person}_j. To render an image of a particular person in a particular pose, the pose-specific basis images are linearly mixed with coefficients given by the person-specific parameter vector. Note that the basis images for each pose look like eigenfaces (Turk & Pentland, 1991) in the appropriate style of each pose. However, they do not provide a true orthogonal basis for any one pose, as in Moghaddam and Pentland (1997), where a distinct set of eigenfaces is computed for each of several poses. Instead, the factorized structure of the model ensures that corresponding basis vectors play corresponding roles across poses (e.g., the first vector holds roughly the mean face for that pose, the second seems to modulate hair distribution, and the third seems to modulate head size), which is crucial for adapting to new styles. Familiar content can be easily translated to a new style by mixing the new style-specific basis functions with the old content-specific coefficients.

Figure 3: The images of Figure 2 represented by an asymmetric bilinear model with head pose as the style factor. Person-specific coefficients b^{person}_j multiply pose-specific basis images a^{pose}_j; the sum reconstructs a given person in a given pose. The basis images are similar to an eigenface representation within a given pose (Moghaddam & Pentland, 1997), except that in this model the different basis images are constrained to allow one set of person coefficients to reconstruct the same face across different poses.

Figure 4 shows the same data represented by an asymmetric model, but with the roles of style and content switched. Now the a^s_j parameters provide a basis set of poses for each person's face. Again, corresponding basis vectors play corresponding roles across styles (e.g., for each person's face, the first vector holds roughly the mean pose, the second modulates head orientation, the third modulates amount of hair showing, and the fourth adds in facial detail), allowing ready stylistic translation.

Figure 4: Asymmetric bilinear model applied to the data of Figure 2, treating pose identity as the style factor. Now pose-specific coefficients b^{pose}_j weight person-specific basis images a^{person}_j to reconstruct the face data. Across different faces, corresponding basis images play corresponding roles in rotating head position.

Finally, we note that because the asymmetric model can be obtained by summing out redundant degrees of freedom in the symmetric model (see equation 2.4),² the three sets of basis images in Figures 2 through 4 are not at all independent. Both the pose- and person-specific basis images in Figures 3 and 4 can be expressed as linear combinations of the symmetric model basis images in Figure 2, mixed according to the pose- or person-specific coefficients (respectively) from Figure 2.

The asymmetric model's high-dimensional matrix representation of style may be too flexible in adapting to data in new styles and cannot support translation tasks (see Figure 1C) because it does not explicitly model the structure of observations that is independent of both style and content (represented by w_{ijk} in equation 2.1). However, if overfitting can be controlled by limiting the model dimensionality J or imposing some additional constraint, asymmetric models may solve classification and extrapolation tasks (see Figures 1A and 1B) that could not be solved using symmetric models with a realistic number of training styles.

² This holds exactly only when the dimensionalities of style and content vectors are equal to the number of observed styles and content classes, respectively.
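The reduction from the symmetric to the asymmetric model (equations 2.4 through 2.6) can likewise be sketched numerically. This is an illustrative numpy toy with made-up dimensions; it folds a style vector into a style-specific linear map A^s and checks that the asymmetric rendering agrees with the symmetric one.

```python
# Illustrative sketch: deriving an asymmetric model A^s from a symmetric
# interaction tensor (eq. 2.4) and rendering via eq. 2.6. All sizes and
# values are synthetic.
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 4, 3, 5
W = rng.normal(size=(I, J, K))        # symmetric interaction tensor w_ijk
a_s = rng.normal(size=I)              # style vector a^s
b_c = rng.normal(size=J)              # content vector b^c

# eq. 2.4: fold the style vector into style-specific terms a^s_jk
A_s = np.einsum('ijk,i->kj', W, a_s)  # K x J style-specific linear map
# eq. 2.6: the observation is a style-specific linear map of content
y = A_s @ b_c
# agrees with the symmetric rendering of eq. 2.1
assert np.allclose(y, np.einsum('ijk,i,j->k', W, a_s, b_c))
```

Translating familiar content to a new style then amounts to replacing A^s while keeping the content coefficients b^c fixed.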
3 Model Fitting
In conventional supervised learning situations, the data are divided into complete training patterns and incomplete (e.g., unlabeled) test patterns, which are assumed to be sampled randomly from the same distribution (Bishop, 1995). Learning then consists of fitting a model to the training data that allows the missing aspects of the test patterns (e.g., class labels in a classification task) to be filled in given the available information. The tasks in Figure 1, however, require that the learner generalize from training data sampled according to one distribution (i.e., in one set of styles and content classes) to test data drawn from a different but related distribution (i.e., in a different set of styles or content classes).

Because of the need to adapt to a different but related distribution of data during testing, our approach to these tasks involves model fitting during both training and testing phases. In the training phase, we learn about the interaction of style and content factors by fitting a bilinear model to a complete array of observations of C content classes in S styles. In the testing or generalization phase, we adapt the same model to new observations that have something in common with the training set, in either content or style or in their interaction. The model parameters corresponding to the assumed commonalities are fixed at the values learned during training, and new parameters are estimated for the new styles or content using algorithms similar to those used in training. New and old parameters are then combined to accomplish the desired classification, extrapolation, or translation task. This section presents the basic algorithms for model fitting during training. The algorithms for model fitting during testing are essentially variants of these training algorithms but are task-specific; they will be presented in the appropriate sections below.
The goal of model fitting during training is to minimize the total squared error over the training set for the symmetric or asymmetric models. This goal is equivalent to maximum likelihood estimation of the style and content parameters given the training data, under the assumption that the data were generated from the models plus independently and identically distributed gaussian noise.
3.1 Asymmetric Model. Because the procedure for fitting the asymmetric model is simpler, we discuss it first. Let y(t) denote the t-th training observation (t = 1, ..., T). Let the indicator variable h^{sc}(t) = 1 if y(t) is in style s and content class c, and 0 otherwise. Then the total squared error E over the training set for the asymmetric model (in the form of equation 2.6) can be written as

    E = \sum_{t=1}^{T} \sum_{s=1}^{S} \sum_{c=1}^{C} h^{sc}(t) \, \| y(t) - A^s b^c \|^2.    (3.1)

If the training set contains equal numbers of observations in each style and in each content class, there exists a closed-form procedure to fit the asymmetric model using the SVD. Although we are the first to use this procedure as the basis for a learning algorithm, it is mathematically equivalent to a family of computer vision algorithms (Koenderink & van Doorn, 1997) best known in the context of recovering structure from motion of tracked points under orthographic projection (Tomasi & Kanade, 1992). Let

    \bar{y}^{sc} = \frac{\sum_t h^{sc}(t) \, y(t)}{\sum_t h^{sc}(t)},

the mean observation in style s and content class c. These observations are most naturally represented in a three-way array, but in order to work with standard matrix algorithms, we must stack these SC K-dimensional (column) vectors into a single (SK) × C observation matrix:

    \bar{Y} = \begin{bmatrix} \bar{y}^{11} & \cdots & \bar{y}^{1C} \\ \vdots & \ddots & \vdots \\ \bar{y}^{S1} & \cdots & \bar{y}^{SC} \end{bmatrix}.    (3.2)

We can then write the asymmetric model (see equation 2.6) in compact matrix form,³

    \bar{Y} = A B,    (3.3)

identifying the (SK) × J matrix A and the J × C matrix B as the stacked style and content parameters, respectively:

    A = \begin{bmatrix} A^1 \\ \vdots \\ A^S \end{bmatrix},    (3.4)

    B = \begin{bmatrix} b^1 & \cdots & b^C \end{bmatrix}.    (3.5)

³ Strictly speaking, equation 3.3 is a model of the mean observations. However, when the data are evenly distributed across styles and content classes, the parameter values that minimize the total squared error for equation 3.3 will also minimize E in equation 3.1.
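Assuming balanced data, the factorization Ȳ = AB of equation 3.3 admits a closed-form least-squares fit via the SVD, taking A from the first J columns of US and B from the first J rows of Vᵀ, as described next. A minimal numpy sketch with synthetic, exactly rank-J data (all dimensions are illustrative):

```python
# Illustrative sketch of the closed-form SVD fit of eq. 3.3 (Y_bar = A B).
# S, C, K, J and the synthetic data are our own choices, not the paper's.
import numpy as np

rng = np.random.default_rng(2)
S, C, K, J = 6, 5, 10, 3
# synthetic mean observations generated by a "true" asymmetric model
A_true = rng.normal(size=(S * K, J))     # stacked style maps A^1..A^S
B_true = rng.normal(size=(J, C))         # content vectors b^1..b^C
Ybar = A_true @ B_true                   # (S K) x C stacked mean matrix

U, s, Vt = np.linalg.svd(Ybar, full_matrices=False)
A = U[:, :J] * s[:J]                     # first J columns of U S
B = Vt[:J, :]                            # first J rows of V^T
assert np.allclose(A @ B, Ybar)          # rank-J data is reproduced exactly
A_style1 = A[:K, :]                      # the fitted K x J map for style 1
```

The recovered A and B differ from A_true and B_true by an invertible J × J transformation, but their product, and hence the model's predictions, is unique.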
To find the least-squares optimal style and content parameters for equation 3.3, we compute the SVD of \bar{Y} = U S V^T. (By convention, we always take the diagonal elements of S to be ordered by decreasing eigenvalue.) We then define the style parameter matrix A to be the first J columns of U S and the content parameter matrix B to be the first J rows of V^T. The model dimensionality J can be chosen in various ways: from prior knowledge, by requiring a sufficiently good approximation to the data, or by looking for an "elbow" in the singular value spectrum.

If the training data are not distributed equally across the different styles and content classes, we must minimize equation 3.1 directly. There are many ways to do this. We use a quasi-Newton method (BFGS; Press, Teukolsky, Vetterling, & Flannery, 1992) with initial parameter estimates determined by the SVD of the mean observation matrix \bar{Y}, as described above. If there happen to be no observations in a particular style s and content class c, \bar{Y} will have some indeterminate (0/0) entries. Before taking the SVD of \bar{Y}, we replace any indeterminate entries by the mean of the observations in the appropriate style s (across all content classes) or content class c (across all styles). In our experience, this method has yielded satisfactory results, although it is at least an order of magnitude slower than the closed-form SVD solution. Note that if the training data are almost equally distributed across styles and content classes, then the closed-form SVD solution found by assuming that the data are exactly balanced will almost minimize equation 3.1, and improving this solution using quasi-Newton optimization may not be worth the much greater effort involved. Because all of the examples presented below have equally distributed training observations, we will need only the closed-form procedure for the remainder of this article.

3.2 Symmetric Model.
The total squared error E over the training set for the symmetric model (in the form of equation 2.2) can be written as
$$E = \sum_{t=1}^{N}\sum_{s=1}^{S}\sum_{c=1}^{C}\sum_{k=1}^{K} h^{sc}(t)\,\bigl(y_k(t) - \mathbf{a}^{s\,T} W_k \mathbf{b}^c\bigr)^2. \tag{3.6}$$
Again, if we assume that the training set consists of an equal number of observations in each style and content class, there are efficient matrix algorithms for minimizing E. The algorithm we use was described for scalar observations by Magnus and Neudecker (1988) and adapted to vector observations by Marimont and Wandell (1992), in the context of characterizing color surface and illuminant spectra. Essentially, we repeatedly apply the SVD algorithm for fitting the asymmetric model, alternating the role of style and content factors within each iteration until convergence. First we need a few matrix definitions. Recall that $\bar{Y}$ consists of the SC K-dimensional mean observation vectors $\bar{y}^{sc}$ stacked into a single SK × C
Separating Style and Content
1259
Figure 5: Schematic illustration of the vector transpose (following Marimont & Wandell, 1992). For a matrix Y considered as a two-dimensional array of stacked vectors (A), its vector transpose $Y^{VT}$ is defined to be the same stacked vectors with their row and column positions in the array transposed (B).
matrix (see equation 3.2). In general, for any AK × B matrix X constructed by stacking AB K-dimensional vectors A down and B across, we can define its vector transpose $X^{VT}$ to be the BK × A matrix consisting of the same K-dimensional vectors stacked B down and A across, where the vector a across and b down in X becomes the vector b across and a down of $X^{VT}$ (see Figure 5). In particular, $\bar{Y}^{VT}$ consists of the means $\bar{y}^{sc}$ stacked into a single (KC) × S matrix:
$$\bar{Y}^{VT} = \begin{bmatrix} \bar{\mathbf{y}}^{11} & \cdots & \bar{\mathbf{y}}^{1S} \\ \vdots & \ddots & \vdots \\ \bar{\mathbf{y}}^{C1} & \cdots & \bar{\mathbf{y}}^{CS} \end{bmatrix}. \tag{3.7}$$
Finally, we define the IK × J stacked weight matrix W, consisting of the IJ K-dimensional basis functions $\mathbf{w}_{ij}$ (see equation 2.3) in the form
$$W = \begin{bmatrix} \mathbf{w}_{11} & \cdots & \mathbf{w}_{1J} \\ \vdots & \ddots & \vdots \\ \mathbf{w}_{I1} & \cdots & \mathbf{w}_{IJ} \end{bmatrix}. \tag{3.8}$$

Its vector transpose $W^{VT}$ is also defined accordingly.
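The vector transpose is a pure reindexing of stacked vectors, so it can be written as a reshape. A minimal NumPy sketch (the function name is ours), verifying that applying the operation twice restores the original matrix:

```python
import numpy as np

def vector_transpose(X, K):
    """Vector transpose: X is an (A*K, B) array of K-dimensional vectors
    (A stacked down, B across); the vector at block-row a, column b moves
    to block-row b, column a, giving a (B*K, A) array."""
    A, B = X.shape[0] // K, X.shape[1]
    return X.reshape(A, K, B).transpose(2, 1, 0).reshape(B * K, A)

# The operation is an involution: applying it twice restores the input.
rng = np.random.default_rng(1)
X = rng.standard_normal((3 * 2, 4))  # A=3 blocks of K=2 vectors, B=4 columns
print(np.allclose(vector_transpose(vector_transpose(X, 2), 2), X))  # True
```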
We can then write the symmetric model (see equation 2.6) in either of these two equivalent matrix forms,^4

$$\bar{Y} = \bigl[W^{VT} A\bigr]^{VT} B, \tag{3.9}$$

$$\bar{Y}^{VT} = [WB]^{VT} A, \tag{3.10}$$

identifying the I × S matrix A and the J × C matrix B as the stacked style and content parameter vectors, respectively:

$$A = \bigl[\,\mathbf{a}^1 \;\cdots\; \mathbf{a}^S\,\bigr], \qquad B = \bigl[\,\mathbf{b}^1 \;\cdots\; \mathbf{b}^C\,\bigr]. \tag{3.11}$$
The iterative procedure for estimating least-squares optimal values of A and B proceeds as follows. We initialize B using the closed-form SVD procedure described above for the asymmetric model (see equation 3.5). Note that this initial B is an orthogonal matrix ($BB^T$ is the J × J identity matrix), so that $[\bar{Y}B^T]^{VT} = W^{VT}A$ (from equation 3.9). Thus, given this initial estimate for B, we can compute the SVD of $[\bar{Y}B^T]^{VT} = USV^T$ and update our estimate of A to be the first I rows of $V^T$. This A is also orthogonal, so that $[\bar{Y}^{VT}A^T]^{VT} = WB$ (from equation 3.10). Thus, given this estimate of A, we can compute the SVD of $[\bar{Y}^{VT}A^T]^{VT} = USV^T$ and update our estimate of B to be the first J rows of $V^T$. This completes one iteration of the algorithm. Typically, convergence occurs within five iterations (around 10 SVD operations). Convergence is also guaranteed. (See Magnus & Neudecker, 1988, for a proof for the scalar case, K = 1, which can be extended to the vector case considered here.) Upon convergence, we solve for $W = [[\bar{Y}B^T]^{VT}A^T]^{VT}$ to obtain the basis vectors independent of both style and content. As with the asymmetric model, if the training data are not distributed equally across the different styles and content classes, we minimize equation 3.1 starting from the same initial estimates for A and B but using a more costly quasi-Newton method.

4 Classification
Many common classification problems involve multiple observations likely to be in one style, for example, recognizing the handwritten characters on an envelope or the accented speech of a telephone voice. People are significantly better at recognizing a familiar word spoken in an unfamiliar voice or a familiar letter character written in an unfamiliar font when it is embedded in the context of other words or letters in the same novel style

4 As with the asymmetric model, when the data are evenly distributed across styles and content classes, the parameter values that minimize the total squared error for the mean observations in equations 3.9 and 3.10 will also minimize E in equation 3.6.
(van Bergem, Pols, & van Beinum, 1988; Sanocki, 1992; Nygaard & Pisoni, 1998), presumably because the added context allows the perceptual system to build a model of the new style and factor out its influence. In this section, we show how a perceptual system may use bilinear models and assumptions of style consistency to factor out the effects of style in content classification and thereby significantly improve classification performance on data in novel styles. We first describe the two concrete tasks investigated. We then present the general classification algorithm and the results of several experiments comparing this algorithm to standard techniques from the pattern recognition literature, such as nearest neighbor classification, which do not explicitly model the effects of style on content classification.

4.1 Task Specifics. We report experiments with two data sets: a benchmark speech data set and a face data set that we collected ourselves. The speech data consist of six samples of each of 11 vowels (content classes) uttered by 15 speakers (styles) of British English.^5 Each data vector consists of K = 10 log area parameters, a standard vocal tract representation computed from a linear predictive coding analysis of the digitized speech. The specific task we must learn to perform is classification of spoken vowels (content) for new speakers (styles). The face data were introduced in section 2 (see Figures 2–4). Figure 6 shows the complete data set, consisting of images of 11 people's faces (styles) viewed under 15 different poses (content classes). The poses span a grid of three vertical positions (up, level, down) and five horizontal positions (far left, left, straight ahead, right, far right). The pictures were shifted to align the nose tip position, found manually. The images were then blurred and cropped to 22 × 32 pixels, and represented simply as vectors of K = 704 pixel values each.
The specific task we must learn to perform is classification of head pose (content) for new people's faces (styles).

4.2 Algorithm. For both speech and face data sets, we train on observations in all content classes but in a subset of the available styles (the "training" styles). We fit asymmetric bilinear models (see equation 3.3) to these training data using the closed-form SVD procedure described in section 3.1. This yields a K × J matrix $A^s$ representing each style s and a J-dimensional vector $\mathbf{b}^c$ representing each content class c. The model dimensionality J is a free parameter, which we discuss in depth at the end of this section. Fitting these asymmetric models to either data set required less than 1 second. This and all other execution times reported below were obtained for nonoptimized MATLAB code running on a Silicon Graphics O2 workstation.
5 The data were originally collected by David Deterding and are now available online from the UC Irvine machine learning repository (http://www.ics.uci.edu/~mlearn).
Figure 6: Classification of familiar content in a new style, illustrated on a data set of face images. During training, we observe the faces of different people (style) in different poses (content), resulting in the matrix of faces shown. During generalization, we observe a new person's face in each of these familiar poses, and we seek to classify the pose of each new image. The separable mixture model allows us to build up a model for the new face at the same time that we classify the new poses, thereby improving classification performance.
The generalization task is to classify observations in the remaining styles (the "test" styles) by filling in the missing content labels for these novel observations using the style-invariant content vectors $\mathbf{b}^c$ fixed during training (see Figure 6). Trying to estimate both content labels and style parameters for the new data presents a classic chicken-and-egg problem, very much like the problems of k-means clustering or mixture modeling (Duda & Hart, 1973). If the content class assignments were known, then it would be easy to estimate parameters for a new style s̃ by simply inserting into the basic asymmetric bilinear model (i.e., equation 2.6) all the observation vectors in style s̃, along with the appropriate content vectors $\mathbf{b}^c$, and solving for the style matrix $A^{\tilde{s}}$. Similarly, if we had a model $A^{\tilde{s}}$ of new style s̃, then we could classify any test observation from this new style based simply on its distance to each of the known content vectors $\mathbf{b}^c$ multiplied by $A^{\tilde{s}}$.^6 Initially, however, both style models and content class assignments are unknown for the test data. To handle this uncertainty, we embed the bilinear model within a gaussian mixture model to yield a separable mixture model (SMM) (Tenenbaum &
6 This is equivalent to maximum likelihood classification, assuming gaussian likelihood functions centered on the predictions of the bilinear model $A^{\tilde{s}} \mathbf{b}^c$.
Freeman, 1997), which can then be fit efficiently to new data using the EM algorithm (Dempster, Laird, & Rubin, 1977). The mixture model has S × C gaussian components, one for each pair of S styles and C content classes, with means given by the predictions of the bilinear model. However, only on the order of (S + C) parameters are needed to represent the means of these S × C gaussians, because of the bilinear model's separable structure. To classify known content in new styles and estimate new style parameters simultaneously, the EM algorithm alternates between calculating the probabilities of all content labels given current style parameter estimates (E-step) and estimating the most likely style parameters given current content label probabilities (M-step), with likelihood determined by the gaussian mixture model. If, in addition, the test data are not segmented according to style, style label probabilities can be calculated simultaneously as part of the E-step. More formally, after training on labeled data from S styles and C content classes, we are given test data from the same C content classes and S̃ new styles, with labels for content (and possibly also style) missing. The new style matrices $A^{\tilde{s}}$ are also unknown, but the content vectors $\mathbf{b}^c$ are known, having been fixed in the training phase. We assume that the probability of a new unlabeled observation y being generated in new style s̃ and old content c is given by a gaussian distribution with variance σ² centered at the prediction of the bilinear model: $p(\mathbf{y} \mid \tilde{s}, c) \propto \exp\{-\|\mathbf{y} - A^{\tilde{s}} \mathbf{b}^c\|^2 / (2\sigma^2)\}$. The total probability of y is then given by the mixture distribution $p(\mathbf{y}) = \sum_{\tilde{s},c} p(\mathbf{y} \mid \tilde{s}, c)\, p(\tilde{s}, c)$. Here we assume equal prior probabilities $p(\tilde{s}, c)$, unless the observations are otherwise labeled.^7 The EM algorithm (Dempster et al., 1977) alternates between two steps in order to find new style matrices $A^{\tilde{s}}$ and style-content labels $p(\tilde{s}, c \mid \mathbf{y})$ that best explain the test data.
In the E-step, we compute the probabilities $p(\tilde{s}, c \mid \mathbf{y}) = p(\mathbf{y} \mid \tilde{s}, c)\, p(\tilde{s}, c) / p(\mathbf{y})$ that each test vector y belongs to style s̃ and content class c, given the current style matrix estimates. In the M-step, we estimate new style matrices by setting $A^{\tilde{s}}$ to maximize the total log-likelihood of the test data, $L^* = \sum_{\mathbf{y}} \log p(\mathbf{y})$. The M-step can be computed in closed form by solving the equations $\partial L^* / \partial A^{\tilde{s}} = 0$, which are linear in $A^{\tilde{s}}$ given the probability estimates from the E-step and the quantities $\mathbf{m}_{\tilde{s}c} = \sum_{\mathbf{y}} p(\tilde{s}, c \mid \mathbf{y})\, \mathbf{y}$ and $n_{\tilde{s}c} = \sum_{\mathbf{y}} p(\tilde{s}, c \mid \mathbf{y})$:

$$A^{\tilde{s}} = \Biggl[\sum_c \mathbf{m}_{\tilde{s}c}\, \mathbf{b}^{c\,T}\Biggr] \Biggl[\sum_c n_{\tilde{s}c}\, \mathbf{b}^c \mathbf{b}^{c\,T}\Biggr]^{-1}. \tag{4.1}$$
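The E- and M-steps above can be sketched as follows. This is a minimal NumPy illustration of the gaussian E-step and the closed-form M-step of equation 4.1; all names are ours, and the initial responsibilities are passed in (the paper initializes them from a 1-nearest-neighbor content assignment, or near-uniformly when styles are unlabeled).

```python
import numpy as np

def em_smm(Y, Bc, R0, sigma2, t_max):
    """EM for the separable mixture model (SMM): adapt new style matrices
    to test data Y (N, K) while inferring content labels, with content
    vectors Bc (C, J) fixed from training. R0 (N, S_new, C) holds the
    initial responsibilities p(style, content | y)."""
    N, K = Y.shape
    C, J = Bc.shape
    S_new = R0.shape[1]
    R = np.array(R0, dtype=float)
    R /= R.sum(axis=(1, 2), keepdims=True)
    for _ in range(t_max):
        # M-step (equation 4.1): closed-form update of each style matrix.
        A = np.empty((S_new, K, J))
        for s in range(S_new):
            m = np.einsum('nc,nk->kc', R[:, s, :], Y)   # m_sc = sum_y p(s,c|y) y
            n = R[:, s, :].sum(axis=0)                  # n_sc = sum_y p(s,c|y)
            A[s] = (m @ Bc) @ np.linalg.inv((Bc.T * n) @ Bc)
        # E-step: gaussian responsibilities around the predictions A^s b^c.
        pred = np.einsum('skj,cj->sck', A, Bc)
        d2 = ((Y[:, None, None, :] - pred[None]) ** 2).sum(axis=-1)
        logp = -d2 / (2 * sigma2)
        logp -= logp.max(axis=(1, 2), keepdims=True)    # numerical stability
        R = np.exp(logp)
        R /= R.sum(axis=(1, 2), keepdims=True)
    return A, R.sum(axis=1)                             # p(c|y) = sum_s p(s,c|y)

# Sanity check: on noise-free data with responsibilities started at the
# true (style, content) labeling, EM stays at the correct content labels.
rng = np.random.default_rng(3)
K, J, C, S_new = 4, 2, 3, 2
Bc = rng.standard_normal((C, J))
A_true = rng.standard_normal((S_new, K, J))
labels = [(s, c) for s in range(S_new) for c in range(C) for _ in range(2)]
Y = np.array([A_true[s] @ Bc[c] for s, c in labels])
R0 = np.zeros((len(labels), S_new, C))
for i, (s, c) in enumerate(labels):
    R0[i, s, c] = 1.0
A, pc = em_smm(Y, Bc, R0, sigma2=0.01, t_max=3)
print(all(pc[i].argmax() == c for i, (s, c) in enumerate(labels)))  # True
```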
The EM algorithm is guaranteed to converge to a local maximum of L*, and this typically takes around 20 iterations for our problems. After each

7 That is, if the style identity s* of y is labeled but the content class is unknown, we let p(s̃, c) = 1/C if s̃ = s* and 0 if s̃ ≠ s*. If both the style and content identities of y are unlabeled, we let p(s̃, c) = 1/(SC) for all s̃ and c.
E-step, test vectors in new styles can be classified by selecting the content class c that maximizes $p(c \mid \mathbf{y}) = \sum_{\tilde{s}} p(\tilde{s}, c \mid \mathbf{y})$.^8 Classification performance is determined by the percentage of test data for which the probability of content class c, as given by EM, is greatest for the actual content class. Note that because content classification is determined as a by-product of the E-step, the standard practice of running EM until convergence to a local maximum in likelihood will not necessarily lead to optimal classification performance. In fact, we have often observed overfitting behavior, in which optimal classification is obtained after only two or three iterations of EM, but the likelihood continues to increase in subsequent iterations as observations are assigned to incorrect content classes. This classification algorithm thus has three free parameters for which good values must somehow be determined. In addition to the model dimensionality J mentioned above, these parameters include the model variance σ² and the maximum number of iterations for which EM is run, t_max. In general, we set these parameters using a leave-one-style-out cross-validation procedure with the training data. That is, given S complete training styles, we train separate bilinear models on each of the S subsets of S − 1 styles and evaluate these models' classification performance on the one style that each was not trained on. For each of these S training set splits, we try a range of values for the parameters J, σ², and t_max. Those values that yield the best average performance over the S training set splits are then used in fitting a new bilinear model to the full training data and generalizing to the designated test set. The main drawback of this cross-validation procedure is that it can be time-consuming. Although each run of EM is quite fast, on the order of seconds, an exhaustive search of the full parameter space in a leave-one-out design may take hours.
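The leave-one-style-out search can be organized as below. To keep the sketch self-contained we tune the k of a k-nearest-neighbor classifier rather than the SMM's (J, σ², t_max); the splitting logic, holding out one complete style per fold, is the point of the illustration, and all names are ours.

```python
import numpy as np

def knn_accuracy(Xtr, ytr, Xte, yte, k):
    """Accuracy of a k-nearest-neighbor classifier (majority vote)."""
    correct = 0
    for x, y in zip(Xte, yte):
        d = ((Xtr - x) ** 2).sum(axis=1)
        votes = ytr[np.argsort(d)[:k]]
        correct += np.bincount(votes).argmax() == y
    return correct / len(yte)

def leave_one_style_out(data, param_grid):
    """Choose the parameter with the best accuracy averaged over all S
    splits that each hold out one complete style. `data` maps a style
    label to (X, y)."""
    styles = list(data)
    best_param, best_score = None, -1.0
    for k in param_grid:
        scores = []
        for held_out in styles:
            Xtr = np.vstack([data[s][0] for s in styles if s != held_out])
            ytr = np.concatenate([data[s][1] for s in styles if s != held_out])
            scores.append(knn_accuracy(Xtr, ytr, *data[held_out], k))
        if np.mean(scores) > best_score:
            best_param, best_score = k, float(np.mean(scores))
    return best_param, best_score

# Toy demo: 3 "styles" are shifted copies of the same 2 content classes.
rng = np.random.default_rng(4)
data = {}
for s in range(3):
    X = np.vstack([rng.normal(3.0 * c + 0.5 * s, 1.0, size=(10, 2))
                   for c in range(2)])
    data[s] = (X, np.repeat([0, 1], 10))
best_k, score = leave_one_style_out(data, [1, 3, 5])
print(best_k, score)
```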
In real applications, such a search would hopefully need to be done only once in a given domain, if at all, and might also be guided by domain-specific knowledge that could narrow the search space significantly. Initialization is an important factor in determining the success of the EM algorithm. Because we are primarily interested in good classification, we initialize EM in the E-step, using the results of a simple 1-nearest-neighbor classifier to set the content-class assignments $p(c \mid \mathbf{y})$. That is, we initially assign each test observation to the content class of the most similar training observation (for which the content labels are known).

4.3 Results. We conducted four experiments: three using the benchmark speech data to investigate different aspects of the algorithm's behavior and one using the face data to provide evidence from a separate domain. In each case, we report results with all three free parameters set using the
8 If the style s* of y is known, then $p(\tilde{s}, c \mid \mathbf{y}) = 0$ for all s̃ ≠ s* (see previous note). In that case, $p(c \mid \mathbf{y})$ is simply equal to $p(s^*, c \mid \mathbf{y})$.
Table 1: Accuracy of Classifying Spoken Vowels from a Benchmark Multispeaker Database.

Classifier                                                   Percentage Correct on Test Data
Multilayer perceptron (MLP)                                  51%
Radial basis function network (RBF)                          53%
1-nearest neighbor (1-NN)                                    56%
Discriminant adaptive nearest neighbor (DANN)
  Generic parameter settings                                 59.7%
  Optimal parameter settings                                 61.7%
Separable mixture models (SMM), test speakers labeled
  All parameters set by CV (J = 3, σ² = 1/64, t_max = 2)     75.8%
  t_max = ∞; J, σ² set by CV (J = 3, σ² = 1/32)              68.2%
  t_max = ∞; optimal J, σ² (J = 4, σ² = 1/16)                77.3%
Separable mixture models (SMM), test speakers not labeled
  All parameters set by CV (J = 3, σ² = 1/64, t_max = 2)     69.8% ± 0.3%
  t_max = ∞; J, σ² set by CV (J = 3, σ² = 1/32)              59.9% ± 1.1%
  t_max = ∞; optimal J, σ² (J = 3, σ² = 1/64)                63.2% ± 1.5%
Notes: MLP, RBF, and 1-NN results were obtained by Robinson (1989). The DANN classifier (Hastie & Tibshirani, 1996) achieves the best performance we know of for an approach that does not adapt to new speakers. The SMM classifiers perform significantly better vowel classification by simultaneously modeling speaker style. "CV" denotes the cross-validation procedure for parameter setting described in the text.
cross-validation procedure described above, as well as for two conditions in which EM was run until convergence (t_max = ∞): both J and σ² determined by cross-validation, and both J and σ² set to their optimal values (as an indicator of best in-principle performance for the maximum likelihood solution).

4.3.1 Train 8, Test 7 on Speech Data: Speakers Labeled. The first experiment with the speech data was the standard benchmark task described in Robinson (1989). Robinson compared many learning algorithms trained to categorize vowels from the first eight speakers (four male and four female) and tested on samples from the remaining seven speakers (four male and three female). The variety and the small number of styles make this a difficult task. Table 1 shows the best results we know of for standard approaches that do not adapt to new speakers. Of the many techniques that Robinson (1989) tested, 1-nearest neighbor (1-NN) performs the best, with 56.3% correct; chance is approximately 9% correct. The discriminant adaptive nearest neighbor (DANN) classifier (Hastie & Tibshirani, 1996) slightly outperforms 1-NN, obtaining 59.7% correct with its generic parameter settings and 61.7% correct for optimal parameter settings.
After fitting an asymmetric bilinear model to the training data, we tested classification performance using our SMM on the seven new speakers' data. We first assumed that style labels were available for the test data (indicating only a change of speaker, but no information about the new speaker's style). Running EM until convergence (t_max = ∞), our best results of 77.3% correct were obtained with J = 4 and σ² = 1/16. Almost comparable results of 75.8% correct were obtained using cross-validation to set all parameters (J, σ², t_max) automatically (see Table 1). The SMM clearly outperforms the many nonadaptive techniques tested, exploiting extra information available in the speaker labels that nonadaptive techniques make no use of.

4.3.2 Train 8, Test 7 on Speech Data: Speakers Not Labeled. We next repeated the benchmark experiment without the assumption that any style labels were available during testing. Thus, our SMM algorithm and the nonadaptive techniques had exactly the same information available for each test observation (although the SMM had style labels available during training). We used EM to figure out both the speaker assignments and the vowel class assignments for the test data. EM requires that the number of style components S in the mixture model be set in advance; we chose S = 7 (the actual number of distinct speakers) for consistency with the previous experiment. In initializing EM in the E-step, we assigned each test observation equally to each new style component, plus or minus 10% random noise to break symmetry and allow each style component to adapt to a distinct subset of the new data. Using cross-validation to set J, σ², and t_max automatically, we obtained 69.8% ± 0.3% correct. Table 1 presents results for other parameter settings. These scores reflect average performance over 10 random initializations. Not surprisingly, SMM performance here was worse than in the previous section, where the correct style labels were assumed to be known.
Nonetheless, the SMM still provided a significant improvement over the best nonadaptive approaches tested.

4.3.3 Train 14, Test 1 on Speech Data. We noticed that the performance of the SMM on the speech data varied widely across different test speakers and also depended significantly on the particular speakers chosen for training and testing. In some cases, the SMM did not perform much better than 1-NN, and in other cases the SMM actually did worse. Thus, we decided to conduct a more systematic study of the effects of individual speaker style on generalization. Specifically, we tested the SMM's ability to classify the speech of each of the 15 speakers in the database individually when the other 14 speakers were used for training. Because only one speaker was presented during testing, we used only a single style model in EM, and thus there was no distinction between the "speakers labeled" and "speakers not labeled" conditions investigated in the previous two sections. Averaged over all 15 possible test speakers, 1-NN obtained 63.9% ± 3.4% correct. Using cross-validation to set J, σ², and t_max automatically, our SMM
obtained 74.3% ± 4.2% correct. Running EM until convergence (t_max = ∞), we obtained 75.6% ± 4.0% correct using the best parameter settings of J = 11, σ² = .01, and 73.5% ± 4.4% correct using cross-validation to select J and σ² automatically. These SMM scores are not significantly different from each other but are all significantly higher than 1-NN as measured by paired t-tests (p < .01 in all cases). The results suggest that the superior performance of SMM over nonadaptive classifiers such as nearest neighbor will hold in general over a range of different test styles. This experiment also provided an opportunity to assess systematically the performance details of EM on the task of speech classification. For all 15 possible test speakers, EM converged within 30 iterations, taking less than 5 seconds. The optimal number of iterations for EM as determined by cross-validation was always between 1 and 10 iterations, with a mean of 3.2 iterations. We can assess the chance of falling into a bad local minimum with EM by comparing the classification performance of the SMM fit using EM with the performance of 1-NN on a speaker-by-speaker basis. For 14 of 15 test speakers, the SMM consistently obtained better classification rates than 1-NN, regardless of which of the three parameter-setting procedures described above was used. The remaining speaker was always classified better by 1-NN. Evidently the vast majority of stylistic variations in this domain can be successfully modeled by the SMM and separated from the relevant content to be classified using the EM algorithm.

4.3.4 Train 10, Test 1 on Face Data. To provide further support for the general advantage of SMM over nonadaptive approaches, we replicated the previous experiment using the face database instead of the speech data. Although the two databases are of roughly comparable size, the nature of the observations is quite different: 704-dimensional vectors of pixel values versus 10-dimensional vectors of vocal tract log area coefficients.
Specifically, we tested the SMM's ability to classify head pose for each of the 11 faces in the database individually when the other 10 people's faces were used for training. There were 15 possible poses, with one image of each face in each pose. Averaged over all 11 possible test faces, 1-NN obtained 53.9% ± 4.3% correct. Using cross-validation to set J, σ², and t_max automatically, our SMM obtained 73.9% ± 6.7% correct. Running EM until convergence (t_max = ∞), we obtained 80.6% ± 7.5% correct using the best parameter settings of J = 6, σ² = 10⁵, and 75.8% ± 6.4% correct using cross-validation to select J and σ² automatically. As on the speech data, these SMM scores are not significantly different from each other but do represent significant improvements over 1-NN as measured by paired t-tests (p < .05 in all cases). The performance details of EM on this face classification task were quite similar to those on the speech classification task reported above. For all 11 possible test faces, EM converged within 30 iterations, taking less than 20 seconds. (The longer convergence times here reflect the higher-dimensional
vectors involved.) The optimal number of iterations for EM as determined by cross-validation was between 9 and 20 iterations, with a mean of 12.6 iterations. For 9 of 11 test faces, the SMM consistently obtained better classification rates than 1-NN over all three parameter-setting procedures we used, indicating that local minima do not prevent the EM algorithm from factoring out the majority of stylistic variations in this domain.

4.4 Discussion. Our approach to style-adaptive content classification involves two significant modeling choices: the use of a bilinear model of the mean observations and the use of a gaussian mixture model (centered on the predictions of the bilinear model) for observations whose content or style assignments are unknown. The mixture model provides a principled probabilistic framework, allowing us to use the EM algorithm to solve the chicken-and-egg problem of simultaneously estimating style parameters for the new data and labeling the data according to content class (and possibly style as well). The bilinear structure of the model allows the M-step to be computed in closed form by solving systems of linear equations. In this sense, bilinear models represent the content of observations independent of their style in a form that can be generalized easily to model data in new styles. To summarize our results, we found that separating style and content with bilinear models improves content classification in new styles substantially over the best nonadaptive approaches to classification, even when no style information is available during testing, and dramatically so when style demarcation is available. Although our SMM classifier has several free parameters that must be chosen a priori, we showed that near-optimal values can be determined automatically using a cross-validation training procedure.
We obtained good results on two very different data sets, low-dimensional speech data and high-dimensional face image data, suggesting that our approach may be widely applicable to many two-factor classification tasks that can be thought of in terms of recognizing invariant content elements under variable style.

5 Extrapolation
The ability to draw analogies across observations in different contexts is a hallmark of human perception and cognition (Hofstadter, 1995; Holyoak & Barnden, 1994). In particular, the ability to produce analogous content in a novel style, and not just recognize it as in the previous section, has been taken as a severe test of perceptual abstraction (Hofstadter, 1995; Grebert, Stork, Keesing, & Mims, 1992). The domain of typography provides a natural place to explore these issues of analogy and production. Indeed, Hofstadter (1995) has argued that the question, "What is the letter 'a'?" may be "the central problem of AI" (p. 633). Following Hofstadter, we study the task of extrapolating a novel font from an incomplete set of letter observations in that font to the remaining unobserved letters. We first describe the
task specifics and our shape representation for letter observations. We then present our algorithm for stylistic extrapolation based on bilinear models and show results on extrapolating a natural font.

5.1 Task Specifics. Given a training set of C = 62 characters (content) in S = 5 standard fonts (style), the task is to generate characters that are stylistically consistent with letters in a novel sixth font. The initial data were obtained by digitizing the uppercase letters, lowercase letters, and digits 0–9 of the six fonts at 38 × 38 pixels using Adobe Photoshop. Successful shape modeling often depends on having an image representation that makes explicit the appropriate structure that is only implicit in raw pixel values. Specifically, the need to represent shapes of different topologies in comparable forms motivates using a particle-based representation (Szeliski & Tonnesen, 1992). We also want the letters in our representation to behave like a linear vector space, where linear combinations of letters also look like letters. Beymer and Poggio (1996) advocate a dense warp map for related problems. Combining these two ideas, we chose to represent each letter shape by a 2 × 38 × 38 = 2888-dimensional vector of (horizontal and vertical) displacements that a set of 38 × 38 = 1444 ink particles must undergo to form the target shape from a reference grid. With identical particles, there are many possible such warp maps. To ensure that similarly shaped letters are represented by similar warps, we use a physical model. We give each particle of the reference shape (taken to be the full rectangular bitmap) unit positive charge, and each pixel of the target letter negative charge proportional to its gray level intensity. The total charge of the target letter is set equal to the total charge of the reference shape. We track the electrostatic force lines from each particle of the reference shape to where they intersect the plane of the target letter (see Figure 7A).
The target shape is positioned opposite the reference shape, 100 pixel units away in depth. The force lines land in a uniform density over the target letter, creating a correspondence between each ink particle of the reference shape and a position on the target letter. The result is a smooth, dense warp map specifying a translation that each ink particle of the reference shape must undergo to reassemble the collection of particles into the target letter shape (see Figure 7B). The electrostatic forces used to find the correspondences are easily calculated from Coulomb's law (Jackson, 1975). We call this a Coulomb warp representation. To render a warp map representation of a shape, we first translate each particle of the reference shape by its warp map value, using a grid at four times the linear pixel resolution. We then blur with a gaussian (standard deviation equal to five pixels), threshold the ink densities at 60% of their maximum, and subsample to the original font resolution. By allowing noninteger charge values and subpixel translations, we can preserve font antialiasing information. Figure 8 shows three pairs of shapes of different topologies, and the average of each pair in a pixel representation and in a Coulomb warp rep-
Figure 7: Coulomb warp representation for shape. A target shape is represented by the displacements that a collection of ink particles would undergo to assemble themselves into the target shape. (A) Correspondences between ink particles of the reference shape and those of the target shape are found by tracing the electrostatic force lines between positively and negatively charged ink particles. (B) The vector from the start to the end of the field line specifies the translation for each particle. The result is a smooth warp, with similar shapes represented by similar warps.
Figure 8: The result of averaging shapes together under different representations. Averaging in a pixel representation always gives a "double exposure" of the two shapes. Under the Coulomb warp representation, a circle averaged with a square gives a rounded square; a filled circle averaged with a square gives a very thick rounded square. The letter A averaged with the letter B yields a shape that is arguably intermediate to the shapes of the two letters. This property of the representation makes it well suited to linear algebraic manipulations with bilinear models.
resentation. Averaging the shapes in a pixel representation simply yields a “double exposure” of the two images; averaging in a Coulomb warp representation results in a shape intermediate to the two being averaged. 5.2 Algorithm. During training, we t the asymmetric bilinear model (see equation 3.3) to ve full fonts using the closed-form SVD procedure described in section 3.1. This yields a K £ J matrix A s representing each font s and a J-dimensional vector b c representing each letter class c, with the observation dimensionality K D 2888. Training on this task took approximately 48 seconds. Adapting the model to an incomplete new style sQ can be carried out in closed form using the content vectors b c learned during training. Suppose we observe M samples of style sQ, in content classes C D fc1 , . . . , cM g. We nd the style matrix A sQ that minimizes the total squared error over the test data,
E^* = \sum_{c \in C} \| y^{\tilde{s}c} - A^{\tilde{s}} b^c \|^2 .    (5.1)
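For concreteness, the closed-form SVD training step mentioned above can be sketched in numpy. This is a hypothetical minimal illustration with toy dimensions and random data, not the authors' implementation: we assume the mean observation vectors are stacked style-wise into an (S·K) × C matrix, whose rank-J truncated SVD factors into stacked style matrices A^s and content vectors b^c.

```python
import numpy as np

# Toy dimensions (the paper uses K = 2888, J = 60, S = 5 fonts, 62 letter classes).
S, K, C, J = 5, 40, 26, 10
rng = np.random.default_rng(0)
Y = rng.normal(size=(S * K, C))  # stacked mean observations: one block of K rows per style

# Rank-J truncated SVD: Y ~ (U_J diag(s_J)) V_J^T.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
A_stack = U[:, :J] * s[:J]       # (S*K) x J: per-style matrices stacked vertically
B = Vt[:J, :]                    # J x C: one content vector b^c per column

# Split the stack into the K x J style matrices A^s.
A = [A_stack[si * K:(si + 1) * K] for si in range(S)]
recon = A[2] @ B[:, 7]           # the model's rendering of content class 7 in style 2
```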
The minimum of E^* is found by solving the linear system ∂E^*/∂A^s̃ = 0. Missing observations in the test style s̃ and known content class c can then be synthesized from y^s̃c = A^s̃ b^c. In order to allow the model sufficient expressive range to produce natural-looking letter shapes, we set the model dimensionality J as high as possible. However, such a flexible model led to overfitting on the available letters of the test font and consequently poor synthesis of the missing letters in that font. To regularize the style fit to the test data and thereby avoid overfitting, we add a prior term to the squared-error cost of equation 5.1 that encourages A^s̃ to be close to the linear combination of training style parameters A^1, . . . , A^S that best fits the test font. Specifically, let A_OLC be the value of A^s̃ that minimizes equation 5.1 subject to the constraint that A_OLC is a linear combination of the training style parameters A^s, that is, A_OLC = \sum_{s=1}^{S} \alpha_s A^s for some values of \alpha_s. (OLC stands for "optimal linear combination.") Without loss of generality, we can think of the \alpha_s coefficients as the best-fitting style parameters of a symmetric bilinear model with dimensionality I equal to the number of styles S. We then redefine E^* to include this new cost,
E^* = \sum_{c \in C} \| y^{\tilde{s}c} - A^{\tilde{s}} b^c \|^2 + \lambda \| A^{\tilde{s}} - A_{OLC} \|^2 ,    (5.2)
and again minimize E^* by solving the linear system ∂E^*/∂A^s̃ = 0. The tradeoff between these two costs is determined by the free parameter λ, which we set by eye to yield results with the best appearance. For this example, we used a model dimensionality of 60 and λ = 2 × 10^4.⁹ All model fitting during this testing phase took less than 10 seconds.

⁹ This large value of λ was necessary to compensate for the differences in scale between the two error norms in equation 5.2 and does not directly represent the relative importance of these two terms.
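The regularized fit has a closed form, which can be sketched as follows (a hypothetical numpy illustration with toy dimensions and random data, not the authors' code). Collecting the observed test-style vectors as columns of Y and the corresponding trained content vectors as columns of B, setting the gradient of equation 5.2 with respect to A to zero gives the linear system A(BBᵀ + λI) = YBᵀ + λA_OLC.

```python
import numpy as np

def adapt_style(Y, B, A_olc, lam):
    """Closed-form minimizer of sum_c ||y_c - A b_c||^2 + lam * ||A - A_olc||^2.

    Y is K x M (observed letters of the test style, one column per observed
    content class), B is J x M (the corresponding content vectors), A_olc is
    the K x J prior style matrix.
    """
    J = B.shape[0]
    return (Y @ B.T + lam * A_olc) @ np.linalg.inv(B @ B.T + lam * np.eye(J))

rng = np.random.default_rng(1)
K, J, M = 30, 8, 12
B = rng.normal(size=(J, M))
A_true = rng.normal(size=(K, J))
Y = A_true @ B                    # noiseless observations of the test style
A_olc = rng.normal(size=(K, J))   # stand-in for the OLC prior matrix

A_unreg = adapt_style(Y, B, A_olc, lam=0.0)   # pure least squares (lambda = 0)
A_heavy = adapt_style(Y, B, A_olc, lam=1e8)   # prior dominates (lambda -> infinity)
```

With λ = 0 the solution reduces to the unregularized least-squares style fit; as λ grows it is pulled toward A_OLC, which is the behavior the regularizer is designed to produce.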
1272
Joshua B. Tenenbaum and William T. Freeman
Figure 9: Style extrapolation in typography. (A) Rows 1–5: The first 13 (of 62) letters of the training fonts. Row 6: The novel test font, with A–I unseen by the model. (B) Row 1: Font extrapolation results. Row 2: The actual unseen letters of the test font.

5.3 Results. Figure 9 shows the results of extrapolating the unseen letters "A"–"I" of a new font, Monaco, using the asymmetric model with a symmetric model (OLC) prior as described above. All characters in the Monaco font except the uppercase letters "A" through "I" were used to estimate its style parameters (via equation 5.2). In contrast to the objective performance scores on the classification tasks reported in the previous section, here the evaluation of our results is necessarily subjective. Given examples of "A" through "I" in only the five training fonts, the model nonetheless has succeeded in rendering these letters in the test font, with approximately correct shapes for each letter class and with the distinctive stylistic features of Monaco: strict upright posture and uniformly thin strokes. Note that each of these stylistic features appears separately in one or more of the training fonts, but they do not appear together in any one training font.

To assess the relative contributions of the asymmetric and symmetric (OLC) terms to the new style matrix A^s̃, we can compare the distances from A^s̃ to the two style matrices that minimize equation 5.2 for λ = 0 and λ = ∞. Call these two matrices A^s̃_asym and A^s̃_sym (= A_OLC), respectively. For λ = 2 × 10^4, the distance ‖A^s̃ − A^s̃_asym‖ is approximately 2.5 times greater than the distance ‖A^s̃ − A^s̃_sym‖, suggesting that the symmetric (OLC) term makes a greater—but not vastly greater—contribution to the new style matrix.

Figure 10: Result of different methods applied to the font extrapolation problem (Figure 9), where unseen letters in a new font are synthesized. The asymmetric bilinear model has too many parameters to fit, and generalization to new letters is poor (second column). The symmetric bilinear model has only 5 degrees of freedom for our data and fails to represent the characteristics of the new font (third column). We used the symmetric model result as a prior to constrain the flexibility of the asymmetric model, yielding the results shown here (fourth column) and in Figure 9. All of these methods used the Coulomb warp representation for shape. Performing the same calculations in a pixel representation requires blurring the letters so that linear combinations can modify shape, and yields barely passable results (first column). The far right column shows the actual letters of the new font.

5.4 Discussion. We have shown that it is possible to learn the style of a font from observations and extrapolate that style to unseen letters, using a hybrid of asymmetric and symmetric bilinear models. Note that the asymmetric model uses 173,280 degrees of freedom (the 2888 × 60 matrix A^s̃) to describe the test style, while the optimal linear combination style model A_OLC uses only five degrees of freedom (the coefficients \alpha_s of the five training styles). Results using only the high-dimensional asymmetric model without the low-dimensional OLC term are far too unconstrained and fail to look like recognizable letters (see Figure 10, second column). Results using only the low-dimensional OLC model without the high-dimensional asymmetric model are clearly recognizable as the correct letters, but fail to capture the distinctive style of Monaco (see Figure 10, third column). The combination of these two terms, with a flexible high-dimensional model constrained to lie near the subspace of known style parameters, is capable of successful stylistic extrapolation on this example (see Figure 9 and Figure 10, fourth column). Perceptually, the hybrid model results appear more similar to the low-dimensional OLC results than to the high-dimensional asymmetric model results, which is consistent with the fact that the hybrid style matrix is somewhat closer in Euclidean distance to the OLC matrix than to the pure asymmetric model (see the distance comparison above). Using an appropriate representation, such as our Coulomb warp, was also important in obtaining visually satisfying results. Applying the same modeling methodology to a pixel space representation of letters resulted in significantly less appealing output (see Figure 10, first column). Previous models of extrapolation and abstraction in typography have been restricted to artificial grid-based fonts, for which the grid elements provide a reasonable distributed representation (Hofstadter, 1995; Grebert et al., 1992), or even simpler "grandmother cell" representations of each letter (Polk & Farah, 1997). In contrast, our shape representation allowed us to work directly with natural fonts. Why did this extrapolation task require a hybrid modeling strategy and a specially designed input representation, while the classification tasks of the previous section did not? Two general differences between classification and extrapolation tasks make the latter kind of task objectively more difficult. First, successful classification requires only that the outputs of the bilinear model be close to the test data under the generic metric of mean squared error, while success in extrapolation is judged by the far more subtle metric of visual appearance (Teo & Heeger, 1994).
Second, successful classification requires that the model outputs be close to the data only in a relative sense: the test data must be closer to the model outputs using the correct content classes than to the model outputs using incorrect content classes. Extrapolation tasks, by contrast, require model and data to be close in an absolute sense: the model outputs must actually look like plausible instances of the correct content classes. Although the choice of model and representation turned out to be essential in this example, our results were obtained without any detailed knowledge or processing specific to the domain of typography. Hofstadter (1995) has been critical of approaches to stylistic extrapolation that minimize the role of domain-specific knowledge and processing, in particular, the connectionist model of Grebert et al. (1992), arguing that models that "don't know anything about what they are doing" (p. 408) cannot hope to capture the subtleties and richness of a human font designer's productions. We agree with Hofstadter's general diagnosis. There are many aspects of typographical competence, and we have tried to model only a subset of those. In particular, we have not tried to model the higher-level creative processes of expert font designers, who draw on elaborate knowledge bases, reflect on the results of their work, and engage in multiple revisions of each synthesized character. However, we do think that our approach captures two
essential aspects of the human competence in font extrapolation: people's representations of letter and font characteristics are modular and independent of each other, and people's knowledge of letters and fonts is abstracted from the ability to perform the particular task of character synthesis. Generic connectionist approaches to font extrapolation such as Grebert et al. (1992) do not satisfy these constraints. Letter and font information is mixed together inextricably in the extrapolation network's weights, and an entirely different network would be needed to perform recognition or classification tasks with the same stimuli. Our bilinear modeling approach, in contrast, captures the perceptual modularity of style and content in terms of the mathematical property of separability that characterizes equations 2.1 through 2.7. Knowledge of style s (in equation 2.6, for example) is localized to the matrix parameter A^s, while knowledge of content class c is localized to the vector parameter b^c, and both can be freely combined with other content or style parameters, respectively. Moreover, during training, our models acquire knowledge about the interaction of style and content factors that is truly abstracted from any particular behavior, and thus can support not only extrapolation of a novel style, but also a range of other synthesis and recognition tasks, as shown in Figure 1.

6 Translation
Many important perceptual tasks require the perceiver to recover simultaneously two unknown pieces of information from a single stimulus in which these variables are confounded. A canonical example is the problem of separating the intrinsic shape and texture characteristics of a face from the extrinsic lighting conditions, which are confounded in any one image of that face. In this section, we show how a perceptual system may, using a bilinear model, learn to solve this problem from a training set of faces labeled according to identity and lighting condition. The bilinear model allows a novel face viewed under a novel illuminant to be translated to its appearance under known lighting conditions, and the known faces to be translated to the new lighting condition. Such translation tasks are the most difficult kind of two-factor learning task, because they require generalization across both factors at once. That is, what is common across both training and test data sets is not any particular style or any particular content class, but only the manner in which these two factors interact. Thus, only a symmetric bilinear model (see equations 2.1–2.3) is appropriate, because only it represents explicitly the interaction between style and content factors, in the W_k parameters.

6.1 Task Specifics. Given a training set of S = 23 faces (content) viewed under C = 3 different lighting conditions (style) and a novel face viewed under a novel light source, the task is to translate the new face to known lighting conditions and the known faces to the new lighting condition. The
face images, provided by Y. Moses of the Weizmann Institute, were cropped to remove nonfacial features and blurred and subsampled to 48 × 80 pixels. Because these faces were aligned and lacked sharp edge features (unlike the typographical characters of the previous section), we could represent the images directly as 3840-dimensional vectors of pixel brightness values.

6.2 Algorithm. We first fit the symmetric model (see equation 2.2) to the training data using the iterated SVD procedure described in section 3.2. This yields vector representations a^s and b^c of each illuminant s and face c, and a matrix of interaction parameters W (defined in equation 3.8). The dimensionalities of a^s and b^c were set equal to S and C, respectively, allowing the bilinear model maximum expressivity while still ensuring a unique solution. Because these dimensionalities were set to their maximal values, only one iteration of the fitting algorithm (taking approximately 10 seconds) was required for complete training. For generalization from a single test image ỹ, we adapt the model simultaneously to both the new face identity c̃ and the new illuminant s̃, while holding fixed the face × illuminant interaction term W learned during training. Specifically, we first make an initial guess for the new face identity vector b^c̃ (e.g., the mean of the training set content vectors) and solve for the least-squares optimal estimate of the illuminant vector a^s̃:
a^{\tilde{s}} = \left( [ W b^{\tilde{c}} ]^{VT} \right)^{-1} \tilde{y} .    (6.1)
Here [·]^{-1} denotes the pseudoinverse function. Given this new value for a^s̃, we then reestimate b^c̃ from
b^{\tilde{c}} = \left( [ W^{VT} a^{\tilde{s}} ]^{VT} \right)^{-1} \tilde{y} ,    (6.2)
and iterate equations 6.1 and 6.2 until both a^s̃ and b^c̃ converge. In the example presented in Figure 11, convergence occurred after 14 iterations (approximately 16 seconds). We can then generate images of the new face under a known illuminant s (from y_k^{s c̃} = a^{sT} W_k b^c̃) and of a known face c under the new illuminant s̃ (from y_k^{s̃ c} = a^{s̃T} W_k b^c).

6.3 Results. Figure 11 shows results, with both the old faces translated to the new illuminant and the new face translated to the old illuminants. For comparison, the true images are shown next to the synthetic ones. Again, evaluation of these results must necessarily be subjective. The lighting and shadows for each synthesized image appear approximately correct, as do the facial features of the old faces translated to the new illuminant. The facial features of the new face translated to the old illuminants appear slightly
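The alternating updates of equations 6.1 and 6.2 can be sketched in numpy as follows (a hypothetical illustration with toy dimensions and random data, not the authors' code). Holding b fixed, y = M_b a is linear in a, and vice versa, so each half-step is a least-squares pseudoinverse solve, and the reconstruction error is nonincreasing across iterations.

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 3, 4, 50                       # style dim, content dim, pixels (toy sizes)
W = rng.normal(size=(I, J, K))           # trained interaction tensor, held fixed
a_true = rng.normal(size=I)
b_true = rng.normal(size=J)
y = np.einsum('ijk,i,j->k', W, a_true, b_true)   # "test image": y_k = a^T W_k b

b = np.ones(J)                           # initial guess for the content vector
errs = []
for _ in range(50):
    Mb = np.einsum('ijk,j->ki', W, b)    # K x I matrix: y = Mb @ a for fixed b
    a = np.linalg.pinv(Mb) @ y           # least-squares style update (cf. eq. 6.1)
    Ma = np.einsum('ijk,i->kj', W, a)    # K x J matrix: y = Ma @ b for fixed a
    b = np.linalg.pinv(Ma) @ y           # least-squares content update (cf. eq. 6.2)
    errs.append(np.linalg.norm(y - np.einsum('ijk,i,j->k', W, a, b)))

y_hat = np.einsum('ijk,i,j->k', W, a, b)  # re-render the image from the fitted vectors
```

Because each half-step exactly minimizes the squared reconstruction error over one factor, the error sequence never increases, mirroring the convergence behavior described in the text.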
Figure 11: Translation across style and content in shape-from-shading. (A) Rows 1–3: The first 13 (of 24) faces viewed under the three illuminants used for training. Row 4: The single test image of a new face viewed under a new light source. (B) Column 1: Translation of the new face to known illuminants. Column 2: The actual (unseen) images. (C) Row 1: Translation of known faces to the new illuminant. Row 2: The actual (unseen) images.
blurred, but otherwise resemble the new face more than any of the old faces. One reason the synthesized images of the new face are not as sharp as the synthesized images of the old faces is that the latter are produced by averaging images of a single face under several lighting conditions—across which all the facial features are precisely aligned—while the former are produced by averaging images of many faces under a single lighting condition—across which the facial features vary significantly in their positions.

6.4 Discussion. A history of applying linear models to face images motivates our bilinear modeling approach. The original work on eigenfaces (Kirby & Sirovich, 1990; Turk & Pentland, 1991) showed that images of many different faces taken under fixed lighting and viewpoint conditions occupy a low-dimensional linear subspace of pixel space. Subsequent work by Hallinan (1994) established the complementary result that images of a single face taken under many different lighting conditions also occupy a low-dimensional linear subspace. Thus, the factors of facial identity and illumination have already been shown to satisfy approximately the definition of bilinearity—the effects of one factor are linear when the other is held constant—so it is natural to integrate them into a bilinear model.
While the general problem of separating shape, texture, and illumination features in an image is underdetermined (Barrow & Tenenbaum, 1978), here the bilinear model learned during training provides sufficient constraint to approximately recover both face and illumination parameters from a single novel image. Atick, Griffin, & Redlich (1996) proposed a related approach to learning shape-from-shading for face images, based on a linear model of head shape in three dimensions and a nonlinear physical model of the image formation process. In contrast, our bilinear model uses only two-dimensional (i.e., image-based) information and requires no prior knowledge about the physics of image formation. Of course, we have not solved the general shape-from-shading recovery problem for arbitrary objects under arbitrary illumination. Our solution (as well as that of Atick et al., 1996) depends critically on the assumption that the new image, like the images in the training set, depicts an upright face under reasonable lighting conditions. In fact, there is evidence that the brain often does not solve the shape-from-shading problem in its most general form, but rather has learned (or evolved) solutions to important special cases such as face images (Cavanagh, 1991). So-called Mooney faces—brightness-thresholded face images in which shading is the only cue to shape—can be easily recognized as images of three-dimensional surfaces when viewed in upright position, but cannot be discriminated from two-dimensional ink blotches when viewed upside down so that the shading conventions are atypical (Shepard, 1990), or when the underlying three-dimensional structure has been distorted away from a globally facelike shape (Moore & Cavanagh, 1998). More generally, the ability to learn constrained solutions to a priori underconstrained inference problems may turn out to be essential for perception (Poggio & Hurlbert, 1994; Nayar & Poggio, 1996).
Bilinear models offer one simple and general framework for how biological and artificial perceptual systems may learn to solve a wide range of such tasks.

7 Directions for Future Work
The most obvious extension of our work is to observations and tasks with more than two underlying factors, via multilinear models (Magnus & Neudecker, 1988). For example, a symmetric trilinear model in three factors q, r, and s would take the form

y_l^{qrs} = \sum_{i,j,k} w_{ijkl} \, a_i^q \, b_j^r \, c_k^s .    (7.1)
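A trilinear model evaluation, together with one of the linear solves used in alternating fitting, can be sketched as follows (a hypothetical numpy illustration with toy dimensions and random data, not from the paper). With b and c held fixed, equation 7.1 becomes linear in a.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, Kdim, L = 3, 4, 5, 60              # factor dims and observation dim (toy sizes)
w = rng.normal(size=(I, J, Kdim, L))     # interaction tensor w_{ijkl}
a, b, c = rng.normal(size=I), rng.normal(size=J), rng.normal(size=Kdim)

# y_l = sum_{i,j,k} w_{ijkl} a_i b_j c_k  (equation 7.1)
y = np.einsum('ijkl,i,j,k->l', w, a, b, c)

# One alternating-least-squares step: with b and c fixed, y = M @ a is linear
# in a, so a is recovered by a pseudoinverse (exactly here, since y is noiseless).
M = np.einsum('ijkl,j,k->li', w, b, c)   # L x I design matrix
a_hat = np.linalg.pinv(M) @ y
```

Cycling such solves over a, b, and c in turn is the direct generalization of the two-factor fitting procedures described in the text.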
The procedures for fitting these models are direct generalizations of the learning algorithms for two-factor models that we describe here. As in the two-factor case, we iteratively apply linear matrix techniques to solve for the parameters of each factor given parameter estimates for the other factors, until all parameter estimates converge (Magnus & Neudecker, 1988).
As with any other learning framework, the success of bilinear models depends on having suitable input representations. Further research is needed to determine what kinds of representations will endow specific kinds of observations with the most nearly bilinear structure. In particular, the font extrapolation task might benefit from a representation that is better tailored to the important features of letter shapes. Also, it would be of interest to develop general procedures for incorporating available domain-specific knowledge into bilinear models, for example, via a priori constraints on the model parameters (Simard, LeCun, & Denker, 1993). Finally, we would like to explore the relevance of bilinear models for research in neuroscience and psychophysics. The essential representational feature of bilinear models is their multiplicative factor interactions. At the physiological level, multiplicative neuronal interactions (Andersen, Essick, & Siegel, 1985; Olshausen, Anderson, & van Essen, 1993; Riesenhuber & Dayan, 1997), arising from nonlinear synaptic (Koch, 1997) or population-level (Salinas & Abbott, 1993) mechanisms, have been proposed for visual computations that require the synergistic combination of two inputs, such as modulating spatial attention (Andersen et al., 1985; Olshausen et al., 1993; Riesenhuber & Dayan, 1997; Salinas & Abbott, 1993) or estimating motion (Koch, 1997). Might these same kinds of circuits be co-opted to perform some of the tasks we study here? It has been suggested that SVD, the essential computational procedure for learning in our framework, can be implemented naturally in neural networks using Hebb-like learning rules (Sanger, 1989, 1994; Oja, 1989). Could these or similar mechanisms support a biologically plausible learning algorithm for bilinear models?
At the level of psychophysics, many studies have demonstrated that the ability of the human visual and auditory systems to factor out contextually irrelevant dimensions of variation, such as lighting conditions or speaker accent, is neither perfect nor instantaneous. Observations in unusual styles are frequently more time-consuming to process and more likely to be processed incorrectly (van Bergem et al., 1988; Sanocki, 1992; Nygaard & Pisoni, 1998; Moore & Cavanagh, 1998). Do bilinear models have difficulty on the same kinds of stimuli that people do? Do the dynamics of adaptation in bilinear models (e.g., EM for classification, or the iterative SVD-based procedure for translation) take longer to converge on stimuli that people are slower to process? Can our approach provide a framework for designing and modeling perceptual learning experiments, in which subjects learn to separate the effects of style and content factors in a novel stimulus domain? These are just a few of the empirical questions motivated by a bilinear modeling approach to studying perceptual inference.

8 Conclusions
In one sense, bilinear models are not new to perceptual research. It has previously been shown that several core vision problems, such as the recovery
of structure from motion under orthographic projection (Tomasi & Kanade, 1992) and color constancy under multiple illuminants (Brainard & Wandell, 1991; Marimont & Wandell, 1992; D'Zmura, 1992) are solvable efficiently because they are fundamentally bilinear at the level of geometry or physics (Koenderink & van Doorn, 1997). These results are important but of limited usefulness, because many two-factor problems in perception do not have this true bilinear character. More commonly, perceptual inferences based on accurate physical models of factor interactions are either very complex (as in speech recognition), underdetermined (as in shape-from-shading), or simply inappropriate (as in typography). Here we have proposed that perceptual systems may often solve such challenging two-factor tasks without detailed domain knowledge, using bilinear models to learn approximate solutions rather than to describe explicitly the intrinsic geometry or physics of the problem. We presented a suite of simple and efficient learning algorithms for bilinear models, based on the familiar techniques of SVD and EM. We then demonstrated the scope of this approach with applications to three two-factor tasks—classification, extrapolation, and translation—using three kinds of signals: speech, typography, and face images. With their combination of broad applicability and ready learning algorithms, bilinear models may prove to be a generally useful component in the tool kits of engineers and brains alike.

Acknowledgments
We thank E. Adelson, B. Anderson, M. Bernstein, D. Brainard, P. Dayan, W. Richards, and Y. Weiss for helpful discussions. Joshua Tenenbaum was a Howard Hughes Medical Institute predoctoral fellow. Preliminary reports of this material appeared in Tenenbaum and Freeman (1997) and Freeman and Tenenbaum (1997).

References

Andersen, R., Essick, G., & Siegel, R. (1985). The encoding of spatial location by posterior parietal neurons. Science, 230, 456–458.
Atick, J. J., Griffin, P. A., & Redlich, A. N. (1996). Statistical approach to shape from shading: Reconstruction of 3D face surfaces from single 2D images. Neural Computation, 8, 1321–1340.
Barrow, H. G., & Tenenbaum, J. M. (1978). Recovering intrinsic scene characteristics from images. In A. R. Hanson & E. M. Riseman (Eds.), Computer vision systems (pp. 3–26). New York: Academic Press.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Beymer, D., & Poggio, T. (1996). Image representations for visual learning. Science, 272, 1905–1909.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Brainard, D., & Wandell, B. (1991). A bilinear model of the illuminant's effect on color appearance. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (chap. 13). Cambridge, MA: MIT Press.
Caruana, R. (1998). Multitask learning. In S. Thrun & L. Pratt (Eds.), Learning to learn (pp. 95–134). Norwell, MA: Kluwer.
Cavanagh, P. (1991). What's up in top-down processing? In A. Gorea (Ed.), Representations of vision: Trends and tacit assumptions in vision research (pp. 295–304). Cambridge: Cambridge University Press.
Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience.
D'Zmura, M. (1992). Color constancy: Surface color from changing illumination. Journal of the Optical Society of America A, 9, 490–493.
Freeman, W. T., & Tenenbaum, J. B. (1997). Learning bilinear models for two-factor problems in vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (pp. 554–560). San Juan, PR.
Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 617–624). Cambridge, MA: MIT Press.
Grebert, I., Stork, D. G., Keesing, R., & Mims, S. (1992). Connectionist generalization for production: An example from gridfont. Neural Networks, 5, 699–710.
Hallinan, P. W. (1994). A low-dimensional representation of human faces for arbitrary lighting conditions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (pp. 995–999).
Hastie, T., & Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 607–616.
Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Royal Soc. B, 352, 1177–1190.
Hinton, G. E., & Zemel, R. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Hofstadter, D. (1985). Metamagical themas. New York: Basic Books.
Hofstadter, D. (1995). Fluid concepts and creative analogies. New York: Basic Books.
Holyoak, K., & Barnden, J. (Eds.). (1994). Advances in connectionist and neural computation theory. Norwood, NJ: Ablex.
Jackson, J. D. (Ed.). (1975). Classical electrodynamics (2nd ed.). New York: Wiley.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–108.
Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–211.
Koenderink, J. J., & van Doorn, A. J. (1997). The generic bilinear calibration-estimation problem. International Journal of Computer Vision, 23(3), 217–234.
Magnus, J. R., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley.
Mardia, K., Kent, J., & Bibby, J. (1979). Multivariate analysis. London: Academic Press.
Marimont, D. H., & Wandell, B. A. (1992). Linear models of surface and illuminant spectra. Journal of the Optical Society of America A, 9(11), 1905–1913.
Moghaddam, B., & Pentland, A. P. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 696–710.
Moore, C., & Cavanagh, P. (1998). Recovery of 3D volume from 2-tone images of novel objects. Cognition, 67(1,2), 45–71.
Nayar, S., & Poggio, T. (Eds.). (1996). Early visual learning. New York: Oxford University Press.
Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception and Psychophysics, 60(3), 355–376.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68.
Olshausen, B., Anderson, C., & van Essen, D. (1993). A neural model of visual attention and invariant pattern recognition. Journal of Neuroscience, 13, 4700–4719.
Omohundro, S. M. (1995). Family discovery. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 402–408). Cambridge, MA: MIT Press.
Poggio, T., & Hurlbert, A. (1994). Observations on cortical mechanisms for object recognition and learning. In C. Koch & J. Davis (Eds.), Large scale neuronal theories of the brain (pp. 152–182). Cambridge, MA: MIT Press.
Polk, T. A., & Farah, M. J. (1997). A simple common contexts explanation for the development of abstract letter identities. Neural Computation, 9(6), 1277–1289.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. New York: Cambridge University Press.
Riesenhuber, M., & Dayan, P. (1997). Neural models for part-whole hierarchies. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 17–23). Cambridge, MA: MIT Press.
Robinson, A. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Cambridge University.
Salinas, E., & Abbott, L. (1993). A model of multiplicative neural responses in parietal cortex. In Proc. Nat. Acad. Sci. USA (pp. 11956–11961).
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Sanger, T. (1994). Two algorithms for iterative computation of the singular value decomposition from input/output samples. In J. Cowan, G. Tesauro, &
J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 144–151). San Mateo, CA: Morgan Kaufmann.
Sanocki, T. (1992). Effects of font- and letter-specific experience on the perceptual processing of letters. American Journal of Psychology, 105(3), 435–458.
Shepard, R. N. (1990). Mind sights. New York: Freeman.
Simard, P. Y., LeCun, Y., & Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, & L. Giles (Eds.), Advances in neural information processing systems, 5. San Mateo, CA: Morgan Kaufmann.
Szeliski, R., & Tonnesen, D. (1992). Surface modeling with oriented particle systems. In Proc. SIGGRAPH 92 (Vol. 26, pp. 185–194).
Tenenbaum, J. B., & Freeman, W. T. (1997). Separating style and content. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 662–668). Cambridge, MA: MIT Press.
Teo, P., & Heeger, D. (1994). Perceptual image distortion. In First IEEE International Conference on Image Processing (Vol. 2, pp. 982–986). New York: IEEE.
Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Norwell, MA: Kluwer.
Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2), 137–154.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
van Bergem, D. R., Pols, L. C., & van Beinum, F. J. K. (1988). Perceptual normalization of the vowels of a man and a child in various contexts. Speech Communication, 7(1), 1–20.

Received October 27, 1998; accepted June 8, 1999.
NOTE
Communicated by Simon Haykin
Relationships Between the A Priori and A Posteriori Errors in Nonlinear Adaptive Neural Filters

Danilo P. Mandic
School of Information Systems, University of East Anglia, Norwich, U.K.

Jonathon A. Chambers
Signal Processing Section, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, London, U.K.

The lower bounds for the a posteriori prediction error of a nonlinear predictor realized as a neural network are provided. These are obtained for a priori adaptation and a posteriori error networks with sigmoid nonlinearities trained by gradient-descent learning algorithms. A contractivity condition is imposed on the nonlinear activation function of a neuron so that the a posteriori prediction error is smaller in magnitude than the corresponding a priori one. Furthermore, an upper bound is imposed on the learning rate η so that the approach is feasible. The analysis is undertaken for both feedforward and recurrent nonlinear predictors realized as neural networks.

1 Introduction
A posteriori techniques have been considered in the area of linear adaptive filters (Treichler, Johnson, & Larimore, 1987; Ljung & Söderström, 1983; Douglas & Rupp, 1997). However, in the area of neural networks, the use of a posteriori techniques is still in its infancy. Recently it has been shown that an a posteriori approach in the neural networks framework exhibits behavior corresponding to the normalised least mean square (NLMS) algorithm in the linear adaptive filters case (Mandic & Chambers, 1998). Consequently, it is expected that the instantaneous a posteriori output error ē(k) is smaller in magnitude than the corresponding a priori error e(k) (Treichler et al., 1987; Mandic & Chambers, 1998). However, little is known about the relationships between the a posteriori and a priori error, the learning rate, the slope of the nonlinear activation function of a neuron, and the feasibility of such a neural predictor.

In the case of a single-node neural network, with a nonlinear activation function Φ, the a priori output of the network y is given by

y(k) = Φ(x^T(k) w(k)),   (1.1)

where x(k), w(k), and (·)^T denote, respectively, the input vector to the network,

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1285–1292 (2000)
1286
Danilo P. Mandic and Jonathon A. Chambers
the weight vector, and the vector transpose operator. The function Φ is assumed to belong to the class of sigmoid functions. The updated weight vector w(k+1) is available from the learning algorithm before the next, updated, input vector x(k+1); therefore an a posteriori estimate ȳ can be calculated as

ȳ(k) = Φ(x^T(k) w(k+1)).   (1.2)

The corresponding instantaneous a priori and a posteriori errors at the output neuron of a neural network are given respectively as e(k) = d(k) − y(k) and ē(k) = d(k) − ȳ(k), where d(k) is some teaching signal. Our aim is to preserve

|ē(k)| ≤ c |e(k)|,  0 < c < 1.   (1.3)
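To make equations 1.1 through 1.3 concrete, the following sketch computes both errors for a single neuron. It is only an illustration: tanh stands in for the sigmoid Φ, the LMS-style update w(k+1) = w(k) + ηe(k)x(k) is an assumed learning algorithm, and all numbers are arbitrary.

```python
import numpy as np

phi = np.tanh                                # stand-in for the contractive sigmoid Phi
x = np.array([0.5, -0.3, 0.8, 0.1, -0.2])    # input vector x(k)
w = np.array([0.2, 0.1, -0.3, 0.4, 0.05])    # weight vector w(k)
d, eta = 0.5, 0.5                            # teaching signal d(k), learning rate

y = phi(x @ w)                               # a priori output, equation 1.1
e = d - y                                    # a priori error e(k)

w_next = w + eta * e * x                     # assumed LMS-style weight update
y_post = phi(x @ w_next)                     # a posteriori estimate, equation 1.2
e_post = d - y_post                          # a posteriori error

print(abs(e_post) < abs(e))                  # True: constraint 1.3 holds here
```

For this choice of learning rate the a posteriori error is roughly half the a priori one, illustrating the NLMS-like contraction discussed above.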
Theorem 1. The a posteriori error satisfies

ē(k) > [1 − η(k) Φ′(x^T(k) w(k)) ‖x(k)‖²₂] e(k),

which is the lower bound for the a posteriori error for a contractive nonlinear activation function.

Corollary 1. The range allowed for the learning rate η(k) in an a priori adaptation a posteriori error algorithm, 3.1, with constraint 1.3, for the conditions given in theorem 1, is

0 < η(k) < 1 / (Φ′(x^T(k) w(k)) ‖x(k)‖²₂).

Theorem 2. Since a contractive Φ satisfies |Φ′| ≤ 1, the a posteriori error also satisfies

ē(k) > [1 − η(k) ‖x(k)‖²₂] e(k).

Corollary 2. The range allowed for the learning rate η(k), for the conditions given in theorem 2, is

0 < η(k) < 1 / ‖x(k)‖²₂.
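The role of the learning-rate bound can be checked numerically. In the sketch below (tanh as the contractive activation, an assumed LMS-style update, illustrative numbers), a rate safely below 1/‖x(k)‖²₂ contracts the error, while a rate far above it overshoots the target:

```python
import numpy as np

phi = np.tanh
x = np.array([0.3, -0.4])                      # ||x||^2 = 0.25
w = np.array([0.1, 0.2])
d = 0.1                                        # teaching signal

def errors(eta):
    e = d - phi(x @ w)                         # a priori error
    e_post = d - phi(x @ (w + eta * e * x))    # a posteriori error
    return e, e_post

norm2 = x @ x
e, e_small = errors(0.5 / norm2)               # learning rate below the bound
_, e_large = errors(3.0 / norm2)               # learning rate far above it

assert abs(e_small) < abs(e)                   # contraction
assert abs(e_large) > abs(e)                   # overshoot: contraction lost
```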
For an expansive activation function, |Φ(ξ)| > |ξ|, ∀ξ ∈ ℝ, and |Φ′(ξ)| ≥ 1. In this case, equation 3.2 becomes

ē(k) ≤ [1 − Φ(η(k) Φ′(x^T(k) w(k)) ‖x(k)‖²₂)] e(k).   (3.10)

For Φ an expansion, |Φ(ξ)| > |ξ|, and equation 3.4 finally becomes

ē(k) < [1 − η(k) Φ′(x^T(k) w(k)) ‖x(k)‖²₂] e(k),   (3.11)

which holds for negative η, which is not feasible.

4 The Case of a Recurrent Neural Filter
In this case, the gradient-descent updating equation for the recurrent neuron can be symbolically expressed as (Haykin, 1994)

∂y(k)/∂w(k) = Π(k+1) = Φ′(x^T(k) w(k)) [x(k) + w(k) Π(k)],   (4.1)

where the vector Π denotes the set of corresponding gradients of the output neuron, and the vector x(k) encompasses both the external and feedback inputs to the recurrent neuron. This case is a generalization of linear adaptive infinite impulse response (IIR) filters. The correction to the weight vector at the time instant k becomes

Δw(k) = η(k) e(k) Π(k).   (4.2)

The real-time recurrent learning (RTRL) (Williams & Zipser, 1989) based learning algorithm for single-node a priori adaptation a posteriori error networks is now given by

w(k+1) = w(k) + η(k) e(k) Π(k),
ē(k) = d(k) − Φ(x^T(k) w(k+1)).   (4.3)
In the spirit of algorithm 1.4, following the same principle as for feedforward networks, we obtain the lower error bound for the a priori adaptation a posteriori error algorithm in single-node recurrent neural networks acting as nonlinear adaptive filters.
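A minimal sketch of this single-node recurrent scheme follows. It is a hedged illustration, not the authors' implementation: tanh stands in for Φ, the inputs and targets are arbitrary, and reading w(k)Π(k) in equation 4.1 as the scalar feedback weight times Π(k) is an assumption.

```python
import numpy as np

phi = np.tanh
w = np.array([0.3, -0.2, 0.1])     # weights on bias, external input, feedback
P = np.zeros(3)                    # gradient vector Pi(k) of equation 4.1
y_prev, eta = 0.0, 0.2

for u, d in [(0.5, 0.2), (-0.1, 0.1), (0.4, 0.3)]:
    x = np.array([1.0, u, y_prev])            # external and feedback inputs
    net = x @ w
    e = d - phi(net)                          # a priori error e(k)
    P = (1 - phi(net) ** 2) * (x + w[2] * P)  # RTRL gradient, equation 4.1
    w = w + eta * e * P                       # weight correction, 4.2 and 4.3
    e_post = d - phi(x @ w)                   # a posteriori error, equation 4.3
    assert abs(e_post) < abs(e)               # contraction for this small eta
    y_prev = phi(x @ w)                       # a posteriori output fed back
```

Feeding back the a posteriori output ȳ(k) rather than y(k) is also a design choice made for this sketch.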
Theorem 3. The lower bound for the a posteriori error obtained by algorithm 4.3 with constraint 1.3, and a contractive nonlinear activation function Φ, is

ē(k) > [1 − η(k) x^T(k) Π(k)] e(k),   (4.4)
whereas the range allowed for the learning rate η(k) is given in the following corollary.

Corollary 3. The range allowed for the learning rate η(k) in an a priori adaptation a posteriori error algorithm, 4.3, with constraint 1.3, and the conditions given in theorem 3, is

0 < η(k) < 1 / (x^T(k) Π(k)).
ē(k) > [1 − η(k) x^T(k) Π₁(k)] e(k),

Corollary 4. whereas the range of allowable learning rates η(k) is

0 < η(k) < 1 / (x^T(k) Π₁(k)).
For the logistic activation function with slope β, the corresponding bounds become

ē(k) > (1/4) [4 − η(k) β ‖x(k)‖²₂] e(k)   (5.1)

and

0 < η(k) < 4 / (β ‖x(k)‖²₂).
r can be shattered by H either.
3. If no sample of length r can be shattered by H, then VCdim(H) < r.
4. If one can find a sample {x_1, x_2, ..., x_n} of length n that can be shattered by H, but no sample of length n + 1 can be shattered by H, then VCdim(H) = n.
5. For any two hypothesis spaces H and H′, if H ⊂ H′ and H shatters a sample S, then H′ also shatters S.
6. For any two hypothesis spaces H and H′, if H ⊂ H′, then VCdim(H) ≤ VCdim(H′).

In the following, we first describe the "experts" to be mixed, then the "gating functions" (the mixing weights) and the ME models.

3 The Binary Experts and Mixtures of Binary Experts
In statistical pattern recognition and in supervised learning (Vapnik, 1992, 1998), one considers a random input-response pair (x, y) ∈ X × Y, where y takes values in Y = {0, 1} and is random (noisy) conditional on x. Learning is based on a data set comprising pairs (x, y) and looks for a classifier C_h from a hypothesis space, such that C_h(x) and y are closest in the sense of risk minimization. The ME is constructed in this context. (This is different from probably approximately correct [PAC] learning, where y conditional on x is noiseless. However, we note that PAC learning can still be performed based on the ME, since, as we will see, the ME generates a hypothesis space, based on which one can certainly consider the learning of a noiseless response (target function) y(x).)

A binary expert proposes a probability model of the response conditional on the input: p(y = 1 | x) = p(x), p(y = 0 | x) = 1 − p(x). One example is the logistic regression expert, where p(x) = {1 + exp(−a − β^T x)}^{−1}, parameterized by an (s+1)-vector (a, β^T). Another example is the Bernoulli expert, where p(x) = p, parameterized by a scalar p in (0, 1). The mixtures of m ≥ 2 experts use a local mixture of the simple experts and propose

p(x) = Σ_{k=1}^m g_k(x) p_k(x),   (3.1)

where the g_k(x)'s are local weights, called the gating functions, and the p_k(x)'s are the probability functions of the simple experts. For example, when the Bernoulli experts are mixed, p_k(x) = p_k, p_k ∈ (0, 1), for each k; when
1296
W. Jiang
the logistic regression experts are mixed, p_k(x) = {1 + exp(−a_k − β_k^T x)}^{−1}, (a_k, β_k^T) ∈ ℝ^{s+1}, for each k. The gating functions take a multinomial logit form,

g_k(x) = e^{v_k + u_k^T x} / (Σ_{j=1}^m e^{v_j + u_j^T x}),  k = 1, ..., m,   (3.2)
where (v_k, u_k^T) ∈ ℝ^{s+1} for each k, and v_m = u_m = 0 is often imposed for the sake of identifiability. A probability model such as equation 3.1 generates a classifier C_h by C_h(x) = I[t(x; h) ≥ 0], where the discriminant function t(x; h) can be taken as p(x) − c, or p(x) − c multiplied by a positive function (such as Σ_{j=1}^m e^{v_j + u_j^T x}), for some c in (0, 1) such as 0.5. Here I[·] is the indicator function defined by I[true] = 1 and I[false] = 0. When m logistic regression experts are mixed, the grand parameter h of the classifier includes all components of (v_k, u_k^T), k = 1, ..., m − 1, and (a_j, β_j^T), j = 1, ..., m; and h takes values in the parameter space Θ =

For t > 0, the probability of producing b (after a_{i_0}, ..., a_{i_{t−1}} have already been produced) is equal to P[X_t = b | X_0 = a_{i_0}, ..., X_{t−1} = a_{i_{t−1}}]. Using equation 2.6, the latter amounts to calculating at time t, for every b ∈ O, the conditional probability
P[X_t = b | X_0 = a_{i_0}, ..., X_{t−1} = a_{i_{t−1}}]
  = P[X_0 = a_{i_0}, ..., X_{t−1} = a_{i_{t−1}}, X_t = b] / P[X_0 = a_{i_0}, ..., X_{t−1} = a_{i_{t−1}}]
  = 1 τ_b τ_{a_{i_{t−1}}} ⋯ τ_{a_{i_0}} w_0 / 1 τ_{a_{i_{t−1}}} ⋯ τ_{a_{i_0}} w_0
  = 1 τ_b (τ_{a_{i_{t−1}}} ⋯ τ_{a_{i_0}} w_0 / 1 τ_{a_{i_{t−1}}} ⋯ τ_{a_{i_0}} w_0)
  =: 1 τ_b w_t,   (3.1)

and producing at time t the outcome b with this conditional probability (the symbol =: denotes that the term on the right-hand side is being defined by the equation). Calculations of equation 3.1 can be carried out incrementally if one observes that for t ≥ 1, w_t can be computed from w_{t−1}:

w_t = τ_{a_{t−1}} w_{t−1} / (1 τ_{a_{t−1}} w_{t−1}).   (3.2)
Note that all vectors w_t thus obtained have a component sum equal to 1. Observe that the operation 1 τ_b · in equation 3.1 can be done effectively by precomputing the vector v_b := 1 τ_b. Computing equation 3.1 then amounts
Observable Operator Models
1377
to multiplying the (row) vector v_b with the (column) vector w_t, or, equivalently, evaluating the inner product ⟨v_b, w_t⟩.

The prediction task is completely analogous to the generation task. Given an initial realization a_{i_0}, ..., a_{i_{t−1}} of the process up to time t − 1, one has to calculate the probability by which an outcome b is going to occur at the next time step t. This is again an instance of equation 3.1, the only difference being that now the initial realization is not generated by oneself but is externally given.

Many-time-step probability predictions of collective outcomes can be calculated by evaluating inner products too. Let the collective outcome A = {b̄_1, ..., b̄_n} consist of n sequences of length s + 1 of outcomes (i.e., outcome A is recorded when any of the sequences b̄_i occurs). Then the probability that A is going to occur after an initial realization ā of length t − 1 can be computed as follows:

P[(X_t, ..., X_{t+s}) ∈ A | (X_0, ..., X_{t−1}) = ā]
  = Σ_{b̄∈A} P[(X_t, ..., X_{t+s}) = b̄ | (X_0, ..., X_{t−1}) = ā]
  = Σ_{b̄∈A} 1 τ_b̄ w_t = Σ_{b̄∈A} ⟨v_b̄, w_t⟩
  = ⟨Σ_{b̄∈A} v_b̄, w_t⟩ =: ⟨v_A, w_t⟩.   (3.3)
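As a concrete sketch of equation 3.3 (Python; the two-dimensional operator matrices below are the example OOM of equation 2.5 as reconstructed from the garbled source, so treat the entries as an assumption):

```python
import numpy as np

# example OOM over O = {a, b}; entries reconstructed (cf. equation 2.5)
tau = {"a": np.array([[1/8, 1/5], [3/8, 0.0]]),
       "b": np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])

def v_of_seq(seq):
    # v_seq = 1 tau_{b_s} ... tau_{b_0}: operators compose in reverse order
    m = np.eye(2)
    for o in seq:
        m = tau[o] @ m
    return m.sum(axis=0)          # left-multiplying by 1 = column sums

# collective outcome A = {aa, ab}, i.e., "the first symbol is a"
v_A = v_of_seq("aa") + v_of_seq("ab")
p_A = v_A @ w0                    # <v_A, w_0> = P[X_0 = a]
```

Since {aa, ab} exhausts all continuations of a, ⟨v_A, w_0⟩ must reduce to P[a] = 2/5, which makes a handy consistency check for the precomputed vector v_A.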
To calculate the future probability of a collective outcome A repeatedly, utilization of equation 3.3 reduces the computational load considerably because the vector v_A needs to be (pre-)computed only once.

The generation procedure shall now be illustrated using the exemplary OOM M from equation 2.5. We first compute the vectors v_a, v_b:

v_a = 1 τ_a = 1 [1/8  1/5; 3/8  0] = (1/2, 1/5),
v_b = 1 τ_b = 1 [1/8  4/5; 3/8  0] = (1/2, 4/5).

Starting with w_0 = (2/3, 1/3), we obtain probabilities ⟨v_a, w_0⟩ = 2/5, ⟨v_b, w_0⟩ = 3/5 of producing a versus b at the first time step. We make a random decision for a versus b, weighted according to these probabilities. Let us assume the dice fall for b. We now compute w_1 = τ_b w_0 / 1 τ_b w_0 = (7/12, 5/12)^T. For the next time step, we repeat these computations with w_1 in place of w_0.
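The generation loop just described can be sketched as follows (Python; the matrix entries are the reconstruction given above, so they are an assumption, and the random seed is arbitrary):

```python
import numpy as np

tau = {"a": np.array([[1/8, 1/5], [3/8, 0.0]]),
       "b": np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])
v = {o: tau[o].sum(axis=0) for o in tau}    # v_o = 1 tau_o, precomputed once

rng = np.random.default_rng(1)
w, seq = w0, []
for t in range(10):
    p_a = v["a"] @ w                         # <v_a, w_t>, equation 3.1
    o = "a" if rng.random() < p_a else "b"   # weighted random decision
    seq.append(o)
    w = tau[o] @ w / (tau[o] @ w).sum()      # state update, equation 3.2

# first-step probabilities and state match the worked example
assert np.isclose(v["a"] @ w0, 2/5) and np.isclose(v["b"] @ w0, 3/5)
assert np.allclose(tau["b"] @ w0 / (tau["b"] @ w0).sum(), [7/12, 5/12])
```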
1378
H. Jaeger
4 From Stochastic Processes to OOMs
This section introduces OOMs again, but this time in a top-down fashion, starting from general stochastic processes. This alternative route claries the fundamental nature of observable operators. Furthermore, the insights obtained in this section will yield a short and instructive proof of the central theorem of OOM equivalence, to be presented in the next section. The material presented here is not required after the next section; readers without a strong interest in probability theory can skip it. In section 2, we described OOMs as structures ( Rm , (ta )a2 , w0 ). In this section, we arrive at isomorphic structures ( G, ( ta )a2 , ge ), where again G is a vector space, ( ta )a2 is a family of linear operators on G, and ge 2 G. However, the vector space G is now a space of certain numerical prediction functions. In order to discriminate OOMs characterized on spaces G from the “ordinary” OOMs, we shall call ( G, ( ta )a2 , ge ) a predictor-space OOM. Let (Xt )t2N , or for short, ( Xt ) be a discrete-time stochastic process with values in a nite set O . Then the distribution of (Xt ) is uniquely characterized by the probabilities of nite initial subsequences, that is, by all probabilities of the kind P[Na], where aN 2 O C . N We introduce shorthand for conditional probabilities, by writing P[Na | b] N for P[( Xn , . . . , XnC s ) D aN | ( X0 , . . . , Xn¡1 ) D b]. We shall formally write the unconditional probabilities as conditional probabilities, too, with the empty condition e ; we use the notation P[Na | e ] :D P[(X0 . . . Xs ) D aN ] D P[Na]. Thus, the distribution of ( Xt ) is uniquely characterized by its conditional N continuation probabilities, that is, by the conditional probabilities P[Na | b], C N ¤ where aN 2 O , b 2 O . For every bN 2 O ¤ , we collect all conditioned continuation probabilities of Nb into a numerical function
gbN : O C ! R,
N if P[b] N D 6 0 aN 7! P[Na | b], N D 0. 7! 0, if P[b]
(4.1)
The set fgbN | bN 2 O ¤ g uniquely characterizes the distribution of (Xt ) too. Intuitively, a function gbN describes the future distribution of the process N after an initial realization b. Let D denote the set of all functions from O C into the real numbers, that is, the numerical functions on nonempty sequences. D canonically becomes a real vector space if one denes scalar multiplication and vector addition as follows: for d1 , d2 2 D, a, b 2 R, aN 2 O C put (ad1 C b d2 )(aN ) :D a( d1 ( aN )) C b(d2 ( aN )). Let G D hfgbN | bN 2 O ¤ gi denote the linear subspace spanned in D by the conditioned continuations. Intuitively, G is the (linear closure of the) space of future distributions of the process (Xt ).
We are now halfway done with our construction of (G, (t_a)_{a∈O}, g_ε). We have constructed the vector space G, which corresponds to ℝ^m in the "ordinary" OOMs from section 2, and we have defined the initial vector g_ε, which is the counterpart of w_0. It remains to define the family of observable operators. In order to specify a linear operator on a vector space, it suffices to specify the values the operator takes on a basis of the vector space. Choose O*₀ ⊆ O* such that the set {g_b̄ | b̄ ∈ O*₀} is a basis of G. Define, for every a ∈ O, a linear function t_a : G → G by putting

t_a(g_b̄) := P[a | b̄] g_b̄a   (4.2)

for all b̄ ∈ O*₀ (b̄a denotes the concatenation of the sequence b̄ with a). It turns out that equation 4.2 carries over from basis elements b̄ ∈ O*₀ to all b̄ ∈ O*:

Proposition 2. For all b̄ ∈ O*, a ∈ O, the linear operator t_a satisfies the condition

t_a(g_b̄) = P[a | b̄] g_b̄a.   (4.3)
The proof is given in appendix B. Intuitively, the operator t_a describes the change of knowledge about a process due to an incoming observation of a. More precisely, assume that the process has been observed up to time n. That is, an initial observation b̄ = b_0, ..., b_n has been made. Our knowledge about the state of the process at this moment is tantamount to the predictor function g_b̄. Then assume that at time n + 1, an outcome a is observed. After that, our knowledge about the process state is expressed by g_b̄a. But this is (up to scaling by P[a | b̄]) just the result of applying t_a to the old state, g_b̄.

The operators (t_a)_{a∈O} are the analog of the observable operators (τ_a)_{a∈O} in OOMs and can be used to compute probabilities of finite sequences:

Proposition 3. Let {g_b̄ | b̄ ∈ O*₀} be a basis of G. Let ā := a_{i_0}, ..., a_{i_k} be an initial realization of (X_t) of length k + 1. Let Σ_{i=1,...,n} α_i g_b̄ᵢ = t_{a_{i_k}} ⋯ t_{a_{i_0}} g_ε be the linear combination of t_{a_{i_k}} ⋯ t_{a_{i_0}} g_ε from basis vectors. Then it holds that

P[ā] = Σ_{i=1,...,n} α_i.   (4.4)

Note that equation 4.4 is valid for any basis {g_b̄ | b̄ ∈ O*₀}. The proof can be found in appendix C. Equation 4.4 corresponds exactly to equation 2.6, since left-multiplication of a vector with 1 amounts to summing the vector components, which in turn are the coefficients of that vector with respect to a vector space basis.
Due to equation 4.4, the distribution of the process (X_t) is uniquely characterized by the observable operators (t_a)_{a∈O}. Conversely, these operators are uniquely defined by the distribution of (X_t). Thus, the following definition makes sense:

Definition 2. Let (X_t)_{t∈ℕ} be a stochastic process with values in a finite set O. The structure (G, (t_a)_{a∈O}, g_ε) is called the predictor-space observable operator model of the process. The vector space dimension of G is called the dimension of the process and is denoted by dim(X_t).
I remarked in section 1 that stochastic processes have previously been characterized in terms of vector spaces. Although the vector spaces were constructed in ways other than G, they led to equivalent notions of process dimension. Heller (1965) called finite-dimensional (in our sense) stochastic processes finitary; in Ito et al. (1992) the process dimension (if finite) was called the minimum effective degree of freedom.

Equation 4.3 clarifies the fundamental character of observable operators: t_a describes how the knowledge about the process's future after an observed past b̄ (i.e., the predictor function g_b̄) changes through an observation of a. The power of the observable operator idea lies in the fact that these operators turn out to be linear (see proposition 2). I have treated only the discrete-time, discrete-value case here. However, predictor-space OOMs can be defined in a similar way also for continuous-time, arbitrary-valued processes (detailed in Jaeger, 1999). It turns out that in those cases, the resulting observable operators are linear too. In the sense of updating predictor functions, the change of knowledge about a process due to incoming observations is a linear phenomenon.

In the remainder of this section, I describe how the dimension of a process is related to the dimensions of ordinary OOMs of that process.

Proposition 4.
1. If (X_t) is a process with finite dimension m, then an m-dimensional ordinary OOM of this process exists.
2. A process (X_t) whose distribution is described by a k-dimensional OOM A = (ℝ^k, (τ_a)_{a∈O}, w_0) has a dimension m ≤ k.

The proof is given in appendix D. Thus, if a process (X_t) has dimension m and we have a k-dimensional OOM A of (X_t), we know that an m-dimensional OOM A′ exists, which is equivalent to A in the sense of specifying the same distribution. Furthermore, A′ is minimal-dimensional in its equivalence class. A minimal-dimensional OOM A′ can be constructively obtained from A in several ways, all of which amount to an implicit
construction of the predictor-space OOM of the process specified by A. Since the learning algorithm presented in later sections can be used for this construction too, I do not present a dedicated procedure for obtaining minimal-dimensional OOMs here.

5 Equivalence of OOMs

Given two OOMs A = (ℝ^k, (τ_a)_{a∈O}, w_0), B = (ℝ^l, (τ′_a)_{a∈O}, w′_0), when are they equivalent in the sense that they describe the same distribution? This question can be answered using the insights gained in the previous section. First, construct minimal-dimensional OOMs A′, B′, which are equivalent to A and B, respectively. If the dimensions of A′, B′ are not equal, then A and B are not equivalent. We can therefore assume that the two OOMs whose equivalence we wish to ascertain have the same (and minimal) dimension. Then the answer to our question is given in the following proposition:

Proposition 5. Two minimal-dimensional OOMs A = (ℝ^m, (τ_a)_{a∈O}, w_0), B = (ℝ^m, (τ′_a)_{a∈O}, w′_0) are equivalent iff there exists a bijective linear map ϱ: ℝ^m → ℝ^m satisfying the following conditions:

1. ϱ(w_0) = w′_0.
2. τ′_a = ϱ τ_a ϱ^{−1} for all a ∈ O.
3. 1v = 1ϱv for all (column) vectors v ∈ ℝ^m.

Sketch of proof. ⇐: trivial. ⇒: We have done all the hard work in the previous section. Let σ_A, σ_B be the canonical projections from A, B on the predictor-space OOM of the process specified by A (and hence by B). Observe that σ_A, σ_B are bijective linear maps that preserve the component sum of vectors. Define ϱ := σ_B^{−1} ∘ σ_A. Then condition 1 in proposition 5 follows from σ_A(w_0) = σ_B(w′_0) = g_ε, condition 2 follows from ∀c̄ ∈ O⁺: σ_A(τ_c̄ w_0) = σ_B(τ′_c̄ w′_0) = P[c̄] g_c̄, and condition 3 from the fact that σ_A, σ_B preserve component sums of vectors.

6 A Non-HMM Linearly Dependent Process
The question of when an LDP can be captured by an HMM has been fully answered in the literature (original result in Heller, 1965; refinements in Ito, 1992), and examples of non-HMM LDPs have been given. I briefly restate the results and then elaborate on a simple example of such a process. It was first described in a slightly different version in Heller (1965). The aim is to provide an intuitive insight into how the class of LDPs is "larger" than the class of processes that can be captured by HMMs.
Characterizing HMMs as LDPs heavily draws on the theory of convex cones and nonnegative matrices. I first introduce some concepts, following the notation of a standard textbook (Berman & Plemmons, 1979). With a set S ⊆ ℝ^n, we associate the set S_G, the set generated by S, which consists of all finite nonnegative linear combinations of elements of S. A set K ⊆ ℝ^n is defined to be a convex cone if K = K_G. A convex cone K_G is called n-polyhedral if K has n elements. A cone K is pointed if for every nonzero v ∈ K, the vector −v is not in K. A cone is proper if it is pointed and closed, and its interior is not empty. Using these concepts, propositions 6.a1 and 6.a2 give two conditions that individually are equivalent to condition 3 in definition 1, and 6.b refines condition 6.a1 for determining when an OOM is equivalent to an HMM. Finally, 6.c states necessary conditions that every τ_a in an OOM must satisfy.

Proposition 6.
a1. Let A = (ℝ^m, (τ_a)_{a∈O}, w_0) be a structure consisting of linear maps (τ_a)_{a∈O} on ℝ^m and a vector w_0 ∈ ℝ^m. Let μ := Σ_{a∈O} τ_a. Assume that the first two conditions from definition 1 hold: 1w_0 = 1 and μ has column sums equal to 1. Then A is an OOM if and only if there exist pointed convex cones (K_a)_{a∈O} satisfying the following conditions:
1. 1v ≥ 0 for all v ∈ K_a (where a ∈ O).
2. w_0 ∈ (∪_{a∈O} K_a)_G.
3. ∀a, b ∈ O: τ_b K_a ⊆ K_b.

a2. Using the same assumptions as before, A is an OOM if and only if there exists a pointed convex cone K satisfying the following conditions:
1. 1v ≥ 0 for all v ∈ K.
2. w_0 ∈ K.
3. ∀a ∈ O: τ_a K ⊆ K.

b. Assume that A is an OOM. Then there exists an HMM equivalent to A if and only if a pointed convex cone K according to condition a2 exists that is n-polyhedral for some n. n can be selected such that it is not greater than the minimal state number for HMMs equivalent to A.

c. Let A be a minimal-dimensional OOM, let τ_a be one of its observable operators, and let K be a cone according to a2. Then (i) the spectral radius ρ(τ_a) of τ_a is an eigenvalue of τ_a, (ii) the degree of ρ(τ_a) is greater than or equal to the degree of any other eigenvalue λ with |λ| = ρ(τ_a), and (iii) an eigenvector of τ_a corresponding to ρ(τ_a) lies in K. (The degree of an eigenvalue λ of a matrix is the size of the largest diagonal block in the Jordan canonical form of the matrix that contains λ.)
The proof of parts a1 and b goes back to Heller (1965) and has been reformulated in Ito (1992).¹ The equivalence of proposition 6.a1 with 6.a2 is an easy exercise. The conditions collected in proposition 6.c are equivalent to the statement that τ_a K ⊆ K for some proper cone K (proof in theorems 3.2 and 3.5 in Berman & Plemmons, 1979). It is easily seen that for a minimal-dimensional OOM, the cone K required in proposition 6.a2 is proper. Thus, proposition 6.c is a direct implication of 6.a2.

Proposition 6 has two simple but interesting implications. First, every two-dimensional OOM is equivalent to an HMM (since all cones in two dimensions are polyhedral). Second, every nonnegative OOM (i.e., one whose matrices τ_a have only nonnegative entries) is equivalent to an HMM (since nonnegative matrices map the positive orthant, which is a polyhedral cone, on itself). Proposition 6.c is sometimes useful to rule out a structure A as an OOM by showing that some τ_a fails to satisfy the conditions given. Unfortunately, even if every τ_a of a structure A satisfies the conditions in proposition 6.c, A need not be a valid OOM. Imperfect as it is, however, proposition 6.c is the strongest result available at this moment in the direction of characterizing OOMs.

Proposition 6 is particularly useful to build OOMs from scratch, starting with a cone K and constructing observable operators satisfying τ_a K ⊆ K. Note, however, that the theorem provides no means to decide, for a given structure A, whether A is a valid OOM, since the theorem is nonconstructive with respect to K. More specifically, proposition 6.b yields a construction of OOMs that are not equivalent to any HMM. This will be demonstrated in the remainder of this section.

Let τ_φ : ℝ³ → ℝ³ be the linear mapping that right-rotates ℝ³ by an angle φ around the first unit vector e_1 = (1, 0, 0). Select some angle φ that is not a rational multiple of 2π. Then put τ_a := α τ_φ, where 0 < α < 1.
Any three-dimensional process described by a three-dimensional OOM containing such a τ_a is not equivalent to any HMM: let K_a be a convex cone corresponding to a according to proposition 6.a1. Due to condition 3, K_a must satisfy τ_a K_a ⊆ K_a. Since τ_a rotates any set of vectors around (1, 0, 0) by φ, this implies that K_a is rotation-symmetric around (1, 0, 0) by φ. Since φ is not a rational multiple of 2π, K_a cannot be polyhedral. According to proposition 6.b, this implies that an OOM that features this τ_a cannot be equivalent to any HMM.

I describe now such an OOM with O = {a, b}. The operator τ_a is fixed, according to the previous considerations, by selecting α = 0.5 and φ = 1.0. For τ_b, we take an operator that projects every v ∈ ℝ³ on a multiple of
¹ Heller and Ito use a different definition for HMMs, which yields a different version of the minimality statement in proposition 6.b.
Figure 3: Rise and fall of the probability to obtain a from the probability clock C. The horizontal axis represents time steps t, and the vertical axis represents the probabilities P[a | a^t] = 1 τ_a (τ_a^t w_0 / 1 τ_a^t w_0), which are rendered as dots. The solid line connecting the dots is given by f(t) = 1 τ_a (τ_t w_0 / 1 τ_t w_0), where τ_t is the rotation by angle t described in the text.
(.75, 0, .25), such that μ = τ_a + τ_b has column vectors with component sums equal to 1 (cf. definition 1.2). The circular convex cone K whose border is obtained from rotating (.75, 0, .25) around (1, 0, 0) obviously satisfies the conditions in proposition 6.a2. Thus, we obtain a valid OOM provided that we select w_0 ∈ K. Using the abbreviations s := sin(1.0), c := cos(1.0), the matrices read as follows:

τ_a = 0.5 ( 1    0    0
            0    c    s
            0   −s    c ),

τ_b = ( .75 · .5    .75(1 − .5c + .5s)    .75(1 − .5s − .5c)
        0           0                     0
        .25 · .5    .25(1 − .5c + .5s)    .25(1 − .5s − .5c) ).   (6.1)

As starting vector w_0, we take (.75, 0, .25) to obtain an OOM C = (ℝ³, (τ_a, τ_b), w_0).

I will briefly describe the phenomenology of the process generated by C. The first observation is that every occurrence of b "resets" the process to the initial state w_0. Thus, we only have to understand what happens after uninterrupted sequences of a's. We should look at the conditional probabilities P[· | ε], P[· | a], P[· | aa], ..., that is, at P[· | a^t], where t = 0, 1, 2, .... Figure 3 gives a plot of P[a | a^t]. The process could aptly be called a probability clock or probability oscillator. Rotational operators can be exploited for timing effects. In our example, for instance, if the process were started in the state according to t = 4 in Figure 3, there would be a high chance that two initial a's would occur, with a rapid drop in probability for a third or fourth. Such nonexponential-decay duration patterns for identical sequences are difficult to achieve with HMMs. Essentially, HMMs offer two possibilities for identical sequences: (1) recurrent transitions into a state where a is emitted, or (2) transitions along sequences of states, each of which can emit a. Option 1 is cumbersome because recurrent transitions imply exponential decay of state, which is unsuitable for many empirical processes. Option 2 blows up model size. A more detailed discussion of this problem in HMMs can be found in Rabiner (1990).

If one defines a three-dimensional OOM C′ in a similar manner but with a rational fraction of 2π as the angle of rotation, one obtains a process that can be modeled by an HMM. It follows from proposition 6.b that the minimal HMM state number in this case is at least k, where k is the smallest integer such that kφ is a multiple of 2π. Thus, the smallest HMM equivalent to a suitably chosen three-dimensional OOM can have an arbitrarily great number of states. In a similar vein, any HMM that would give a reasonable approximation to the process depicted in Figure 3 would require at least six states, since in this process it takes approximately six time steps for one full rotation.

7 Interpretable OOMs
Some minimal-dimensional OOMs have a remarkable property: their state-space dimensions can be interpreted as probabilities of certain future outcomes. These interpretable OOMs are described in this section.

Let (X_t)_{t∈ℕ} be an m-dimensional LDP. For some suitably large k, let O^k = A_1 ∪ ⋯ ∪ A_m be a partition of the set of sequences of length k into m disjoint nonempty subsets. The collective outcomes A_i are called characteristic events if some sequences b̄_1, ..., b̄_m exist such that the m × m matrix (P[A_i | b̄_j])_{i,j} is nonsingular, where P[A_i | b̄_j] denotes

Σ_{ā∈A_i} P[ā | b̄_j].   (7.1)

Every LDP has characteristic events:
Proposition 7. Let (X_t)_{t∈ℕ} be an m-dimensional LDP. Then there exists some k ≥ 1 and a partition O^k = A_1 ∪ ⋯ ∪ A_m of O^k into characteristic events.
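The nonsingularity condition is easy to check numerically. For a two-dimensional example OOM (the operator entries below are reconstructions from an earlier worked example, so treat them as an assumption), even k = 1 works: the events A_1 = {a}, A_2 = {b}, with conditioning sequences b̄_1 = ε and b̄_2 = a, already give a nonsingular matrix (P[A_i | b̄_j]):

```python
import numpy as np

tau = {"a": np.array([[1/8, 1/5], [3/8, 0.0]]),
       "b": np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])

def state_after(seq, w=w0):
    # normalized state after observing seq (equation 3.2, applied repeatedly)
    for o in seq:
        w = tau[o] @ w / (tau[o] @ w).sum()
    return w

def p(o, w):
    return tau[o].sum(axis=0) @ w        # P[o | state w] = <v_o, w>

states = [state_after(""), state_after("a")]    # conditions eps and "a"
M = np.array([[p(o, w) for w in states] for o in "ab"])
assert abs(np.linalg.det(M)) > 1e-9             # nonsingular: characteristic
```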
The proof is given in appendix E. Let A = (ℝ^m, (τ_a)_{a∈O}, w_0) be an m-dimensional OOM of the process (X_t). Using the characteristic events A_1, ..., A_m, we shall now construct from A an equivalent, interpretable OOM A(A_1, ..., A_m), which has the property that the m state vector components represent the probabilities that the m characteristic events will occur. More precisely, if during the generation procedure described in section 3, A(A_1, ..., A_m) is in state w_t = (w_t^1, ..., w_t^m) at time t, the probability P[(X_{t+1}, ..., X_{t+k}) ∈ A_i | w_t] that the collective outcome A_i is generated in the next k time steps is equal to w_t^i. In shorthand notation:

P[A_i | w_t] = w_t^i.   (7.2)

We shall use proposition 5 to obtain A(A_1, ..., A_m). Define τ_{A_i} := Σ_{ā∈A_i} τ_ā. Define a mapping ϱ: ℝ^m → ℝ^m by

ϱ(x) := (1 τ_{A_1} x, ..., 1 τ_{A_m} x).   (7.3)

The mapping ϱ is obviously linear. It is also bijective, since the matrix (P[A_i | b̄_j]) = (1 τ_{A_i} x_j), where x_j := τ_{b̄_j} w_0 / 1 τ_{b̄_j} w_0, is nonsingular. Furthermore, ϱ preserves component sums of vectors, since for j = 1, ..., m, it holds that 1 x_j = 1 = 1 (P[A_1 | x_j], ..., P[A_m | x_j]) = 1 (1 τ_{A_1} x_j, ..., 1 τ_{A_m} x_j) = 1 ϱ(x_j). (Note that a linear map preserves component sums if it preserves the component sums of basis vectors.) Hence, ϱ satisfies the conditions of proposition 5. We therefore obtain an OOM equivalent to A by putting

A(A_1, ..., A_m) = (ℝ^m, (ϱ τ_a ϱ^{−1})_{a∈O}, ϱ w_0) =: (ℝ^m, (τ′_a)_{a∈O}, w′_0).   (7.4)
In A(A_1, ..., A_m), equation 7.2 holds. To see this, let w'_t be a state vector obtained in a generation run of A(A_1, ..., A_m) at time t. Then conclude w'_t = ϱϱ^{-1} w'_t = (1 τ_{A_1}(ϱ^{-1} w'_t), ..., 1 τ_{A_m}(ϱ^{-1} w'_t)) = (P[A_1 | ϱ^{-1} w'_t], ..., P[A_m | ϱ^{-1} w'_t]) (computed in A) = (P[A_1 | w'_t], ..., P[A_m | w'_t]) (computed in A(A_1, ..., A_m)). The m × m matrix corresponding to ϱ can easily be obtained from the original OOM A by observing that

    ϱ = (1 τ_{A_i} e_j),    (7.5)
where e_j denotes the jth unit vector. The following fact lies at the heart of the learning algorithm presented in the next section:

Proposition 8. In an interpretable OOM A(A_1, ..., A_m) it holds that

1. w_0 = (P[A_1], ..., P[A_m]).
2. τ_{b̄} w_0 = (P[b̄A_1], ..., P[b̄A_m]).
The proof is trivial.

The state dynamics of interpretable OOMs can be graphically represented in a standardized fashion, which makes it possible to compare the dynamics of different processes visually. Any state vector w_t occurring in a generation run of an interpretable OOM is a probability vector. It lies in the nonnegative hyperplane H_{≥0} := {(v_1, ..., v_m) ∈ R^m | v_1 + ··· + v_m = 1, v_i ≥ 0 for i = 1, ..., m}.

Observable Operator Models
1387

Figure 4: (a) Positioning of H_{≥0} within state-space. (b) State sequence of the probability clock, corresponding to Figure 3.

Therefore, if one wishes to depict a state sequence w_0, w_1, w_2, ..., one needs only to render the bounded area H_{≥0}. Specifically, in the case m = 3, H_{≥0} is the triangular surface shown in Figure 4a. We can use it as the drawing plane, putting the point (1/3, 1/3, 1/3) at the origin. For orientation, we include the contours of H_{≥0} in the graphical representation. This is an equilateral triangle whose edges have length √2. If w = (w_1, w_2, w_3) ∈ H_{≥0} is a state vector, its components can be recovered from its position within this triangle by exploiting w_i = √(2/3) d_i, where the d_i are the distances to the edges of the triangle. A similar graphical representation of states was first introduced in Smallwood and Sondik (1973) for HMMs. When one wishes to graphically represent states of higher-dimensional interpretable OOMs (i.e., where m > 3), one can join some of the characteristic events until three merged events are left. State vectors can then be plotted in a way similar to the one just outlined.

To see an instance of interpretable OOMs, consider the probability clock example from equation 6.1. With k = 2, the following partition (among others) of {a, b}^2 yields characteristic events: A_1 := {aa}, A_2 := {ab}, A_3 := {ba, bb}. Using equation 7.5, one can calculate the matrix ϱ, which we omit here, and compute the interpretable OOM C(A_1, A_2, A_3) using equation 7.4. These are the observable operators τ_a and τ_b thus obtained:
    τ_a = [ 0.645  -0.395   0.125 ]        τ_b = [ 0   0   0.218 ]
          [ 0.355   0.395  -0.125 ],             [ 0   0   0.329 ].    (7.6)
          [ 0       1       0     ]              [ 0   0   0.452 ]
Figure 4b shows a 30-step state sequence obtained by iterated applications of τ_a in this interpretable equivalent of the probability clock. This sequence corresponds to Figure 3.
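As an illustration, the interpretable probability clock can be run directly from the two operator matrices of equation 7.6. The following NumPy sketch is ours, not from the paper: it reconstructs the invariant starting state w_0 as the eigenvalue-1 eigenvector of μ = τ_a + τ_b (justified by proposition 1's invariant state), checks proposition 8(1), w_0 = (P[A_1], P[A_2], P[A_3]), and then iterates normalized applications of τ_a as in Figure 4b. The rounded three-decimal entries make all identities hold only approximately.

```python
import numpy as np

# Operators of the interpretable probability clock C(A1, A2, A3), equation 7.6.
ta = np.array([[0.645, -0.395,  0.125],
               [0.355,  0.395, -0.125],
               [0.0,    1.0,    0.0  ]])
tb = np.array([[0.0, 0.0, 0.218],
               [0.0, 0.0, 0.329],
               [0.0, 0.0, 0.452]])
one = np.ones(3)

# Invariant state: eigenvector of mu = ta + tb for the eigenvalue closest to 1,
# normalized to component sum 1 (cf. proposition 1). This reconstruction is ours.
mu = ta + tb                 # column sums of mu are 1 up to rounding
vals, vecs = np.linalg.eig(mu)
w0 = np.real(vecs[:, np.argmin(abs(vals - 1))])
w0 = w0 / w0.sum()

# Interpretability check (proposition 8): w0 = (P[aa], P[ab], P[ba] + P[bb]),
# where P[xy] = 1 tau_y tau_x w0 (first symbol's operator applied first).
P = {x + y: one @ T2 @ T1 @ w0
     for x, T1 in [("a", ta), ("b", tb)] for y, T2 in [("a", ta), ("b", tb)]}
print(np.round(w0, 3))       # agrees with (P[aa], P[ab], P[ba] + P[bb]) up to rounding

# The state sequence of Figure 4b: iterated, renormalized applications of tau_a.
w, states = w0, []
for _ in range(30):
    w = ta @ w
    w = w / (one @ w)        # new state after observing another a
    states.append(w)
```

Each renormalization step implements the generation-procedure update w ↦ τ_a w / (1 τ_a w), so every state in `states` is again a probability vector in H_{≥0}.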
8 Learning OOMs
This section describes a constructive algorithm for learning OOMs from data. For demonstration, it is applied to learning the probability clock from data.

There are two standard situations where one wants to learn a model of a stochastic system: (1) from a single long data sequence (or a few of them), one wishes to learn a model of a stationary process, and (2) from many short sequences, one wants to induce a model of a nonstationary process. OOMs can model both stationary and nonstationary processes, depending on the initial state vector w_0 (cf. proposition 1). The learning algorithms presented here are applicable in both cases. For the sake of notational convenience, we treat only the stationary case.

We shall address the following learning task. Assume that a sequence S = a_0 a_1 ... a_N is given and that S is a path of an unknown stationary LDP (X_t). We assume that the dimension of (X_t) is known to be m (the question of how to assess m from S is discussed after the presentation of the basic learning algorithm). We select m characteristic events A_1, ..., A_m of (X_t) (again, selection criteria are discussed after the presentation of the algorithm). Let A(A_1, ..., A_m) be an OOM of (X_t) that is interpretable with respect to A_1, ..., A_m. The learning task, then, is to induce from S an OOM Ã that is an estimate of A(A_1, ..., A_m) = (R^m, (τ_a)_{a∈O}, w_0). We require that the estimation be consistent almost surely; that is, for almost every infinite path S_∞ = a_0 a_1 ... of (X_t), the sequence (Ã_n)_{n≥n_0} obtained from estimating OOMs from initial sequences S_n = a_0 a_1 ··· a_n of S_∞ converges to A(A_1, ..., A_m) (in some matrix norm). An algorithm meeting these requirements shall now be described.

As a first step, we estimate w_0. Proposition 8(1) states that w_0 = (P[A_1], ..., P[A_m]). Therefore, a natural estimate of w_0 is w̃_0 = (P̃[A_1], ..., P̃[A_m]), where P̃[A_i] is the estimate for P[A_i] obtained by counting frequencies of occurrence of A_i in S, as follows:

    P̃[A_i] = (number of ā ∈ A_i occurring in S) / (number of ā occurring in S)
            = (number of ā ∈ A_i occurring in S) / (N − k + 1),    (8.1)
where k is the length of the events A_i.

In the second step, we estimate the operators τ_a. According to proposition 8(2), for any sequence b̄_j it holds that

    τ_a(τ_{b̄_j} w_0) = (P[b̄_j aA_1], ..., P[b̄_j aA_m]).    (8.2)
An m-dimensional linear operator is uniquely determined by the values it takes on m linearly independent vectors. This basic fact from linear algebra
leads us directly to an estimation of τ_a, using equation 8.2. We estimate m linearly independent vectors v_j := τ_{b̄_j} w_0 by putting ṽ_j = (P̃[b̄_j A_1], ..., P̃[b̄_j A_m]) (j = 1, ..., m). For the estimation, we use a counting procedure similar to equation 8.1:

    P̃[b̄_j A_i] = (number of b̄_j ā (where ā ∈ A_i) occurring in S) / (N − l − k + 1),    (8.3)

where l is the length of b̄_j. Furthermore, we estimate the results v'_j := τ_a(τ_{b̄_j} w_0) of applying τ_a to v_j by ṽ'_j = (P̃[b̄_j aA_1], ..., P̃[b̄_j aA_m]), where

    P̃[b̄_j aA_i] = (number of b̄_j a ā (where ā ∈ A_i) occurring in S) / (N − l − k).    (8.4)
Thus we obtain estimates (ṽ_j, ṽ'_j) of the m argument-value pairs (v_j, v'_j) = (v_j, τ_a v_j) of applications of τ_a. From these estimated pairs, we can compute an estimate τ̃_a of τ_a through an elementary linear algebra construction: first collect the vectors ṽ_j as columns in a matrix Ṽ and the vectors ṽ'_j as columns in a matrix W̃_a; then obtain τ̃_a = W̃_a Ṽ^{-1}.

This basic idea can be augmented in two respects:

1. Instead of simple sequences b̄_j, one can just as well take collective events B_j of some common length l to construct Ṽ = (P̃[B_j A_i]), W̃_a = (P̃[B_j aA_i]) (exercise). We will call the B_j indicative events.

2. Instead of constructing Ṽ, W̃_a as described above, one can also use raw count numbers, which saves the divisions on the right-hand sides of equations 8.1, 8.3, and 8.4. That is, use V^# = (#_butlast B_j A_i), W_a^# = (#B_j aA_i), where #_butlast B_j A_i is the raw number of occurrences of B_j A_i in S_butlast := s_1 ... s_{N−1}, and #B_j aA_i is the raw number of occurrences of B_j aA_i in S. This gives the same matrices τ̃_a as the original procedure.

Assembled in an orderly fashion, the entire procedure works as follows (assume that the model dimension m, indicative events B_j, and characteristic events A_i have already been selected):

Step 1. Compute the m × m matrix V^# = (#_butlast B_j A_i).

Step 2. Compute, for every a ∈ O, the m × m matrix W_a^# = (#B_j aA_i).

Step 3. Obtain τ̃_a = W_a^#(V^#)^{-1}.
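The three steps above can be condensed into a few lines of code. The sketch below is our own illustrative implementation (the helper names are ours; the paper gives no code); it anticipates the toy example discussed next in the text, S = abbbaaaabaabbbabbbbb with characteristic events A_1 = {a}, A_2 = {b} and indicative events B_1 = {a}, B_2 = {b}:

```python
import numpy as np
from itertools import product

def count(s, pattern):
    """Raw number of (possibly overlapping) occurrences of pattern in s."""
    return sum(s.startswith(pattern, i) for i in range(len(s) - len(pattern) + 1))

def count_event(s, *events):
    """Occurrences of any concatenation e_1 ... e_n with e_i drawn from events[i]."""
    return sum(count(s, "".join(p)) for p in product(*events))

def learn_oom(S, A, B, alphabet):
    """Core OOM estimation: events A_i, B_j are given as lists of strings."""
    m, k = len(A), len(A[0][0])
    # Step 1: V#[i][j] = #_butlast B_j A_i, counted in S without its last symbol.
    V = np.array([[count_event(S[:-1], B[j], A[i]) for j in range(m)]
                  for i in range(m)], float)
    Vinv = np.linalg.inv(V)
    # Steps 2 and 3: W_a#[i][j] = #B_j a A_i, then tau_a = W_a# (V#)^{-1}.
    taus = {a: np.array([[count_event(S, B[j], [a], A[i]) for j in range(m)]
                         for i in range(m)], float) @ Vinv
            for a in alphabet}
    # Estimate of the invariant vector w0 via equation 8.1.
    w0 = np.array([count_event(S, A[i]) for i in range(m)], float) / (len(S) - k + 1)
    return taus, w0

S = "abbbaaaabaabbbabbbbb"
taus, w0 = learn_oom(S, A=[["a"], ["b"]], B=[["a"], ["b"]], alphabet="ab")
print(taus["a"])  # ≈ [[0.375, 0.125], [0.625, -0.125]], i.e., [[3/8, 1/8], [5/8, -1/8]]
print(taus["b"])  # ≈ [[-1/16, 5/16], [1/16, 11/16]]
print(w0)         # ≈ [0.4, 0.6]
```

Computing 1 τ̃_b τ̃_a τ̃_b τ̃_a τ̃_b w̃_0 with these estimates yields approximately −0.00029, the negative "probability" mentioned at the end of this section when noting that the estimate is not a valid OOM.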
The computational demands of this procedure are modest compared to today’s algorithms used in HMM parameter estimation. The counting for
V^# and W_a^# can be done in a single sweep of an inspection window (of length k + l + 1) over S. Multiplying or inverting m × m matrices essentially has a computational cost of O(m³/p) (this can be slightly improved, but the effects become noticeable only for very large m), where p is the degree of parallelization. The counting and inverting/multiplying operations together give a time complexity of this core procedure of O(N + nm³/p), where n is the size of O.

We shall now demonstrate the mechanics of the algorithm with an artificial toy example. Assume that the following path S of length 20 is given: S = abbbaaaabaabbbabbbbb. We estimate a two-dimensional OOM. We choose the simplest possible characteristic events A_1 = {a}, A_2 = {b} and indicative events B_1 = {a}, B_2 = {b}. First we estimate the invariant vector w_0 by putting w̃_0 = (#a, #b)/N = (8/20, 12/20). Then we obtain V^#, W_a^#, and W_b^# by counting occurrences of subsequences in S:

    V^#   = [ #_butlast aa   #_butlast ba ]  =  [ 4  3 ]
            [ #_butlast ab   #_butlast bb ]     [ 4  7 ],

    W_a^# = [ #aaa  #baa ]  =  [ 2  2 ],        W_b^# = [ #aba  #bba ]  =  [ 1  2 ].
            [ #aab  #bab ]     [ 2  1 ]                 [ #abb  #bbb ]     [ 3  5 ]

From these raw counting matrices, we obtain estimates of the observable operators by

    τ̃_a = W_a^#(V^#)^{-1} = [ 3/8   1/8 ],      τ̃_b = W_b^#(V^#)^{-1} = [ -1/16   5/16 ].
                             [ 5/8  -1/8 ]                               [  1/16  11/16 ]

That is, we have arrived at an estimate

    Ã = ( R², {τ̃_a, τ̃_b}, (8/20, 12/20) ).    (8.5)
This concludes the presentation of the learning algorithm in its core version. Obviously, before one can start the algorithm, one has to fix the model dimension and select characteristic and indicative events. These two questions shall now be briefly addressed.

In practice, the problem of determining the "true" model dimension seldom arises. Empirical systems quite likely are very high-dimensional or even infinite-dimensional. Given finite data, one cannot capture all of the true dimensions. Rather, the task is to determine m such that learning an m-dimensional model reveals m significant process dimensions, while any contributions of higher process dimensions are insignificant in the face of estimation error. Put bluntly, the task is to fix m such that the data are neither overfitted nor underexploited. A practical solution to this task capitalizes on the idea that a size-m model does not overfit the data in S if the "noisy" matrix V^# has full rank. Estimating the true rank of a noisy matrix is a common task in numerical linear algebra, for which heuristic solutions are known (Golub & Loan, 1996). A practical procedure for determining an appropriate model dimension, which exploits these techniques, is detailed in Jaeger (1998).

Selecting optimal characteristic and indicative events is an intricate issue. The quality of estimates Ã varies considerably with different characteristic and indicative events. The ramifications of this question are not fully understood, and only some preliminary observations can be supplied here. A starting point toward a principled handling of the problem is certain conditions on A_i, B_j reported in Jaeger (1998). If A_i, B_j are chosen according to these conditions, it is guaranteed that the resulting estimate Ã is itself interpretable with respect to the chosen characteristic events A_1, ..., A_m; that is, it holds that Ã = Ã(A_1, ..., A_m). In turn, this warrants that Ã yields an unbiased estimate of certain parameters that define (X_t).
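For concreteness, one common SVD-based heuristic for the numerical rank of a noisy counting matrix can be sketched as follows. This is an illustrative sketch with an ad hoc relative tolerance of our own choosing; it is not the specific procedure of Jaeger (1998) or Golub & Loan (1996), which are more refined:

```python
import numpy as np

def numerical_rank(V, rel_tol=1e-2):
    """Count singular values above a fraction of the largest one."""
    s = np.linalg.svd(V, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# A rank-2 matrix perturbed by small "sampling noise" is still detected as rank 2:
rng = np.random.default_rng(0)
true = np.outer([4., 2., 6.], [1., 2., 3.]) + np.outer([1., 5., 2.], [2., 1., 1.])
noisy = true + 0.001 * rng.standard_normal((3, 3))
print(numerical_rank(noisy))  # -> 2
```

In the model-selection setting above, one would apply such a heuristic to V^# and take the estimated numerical rank as an upper bound on a safe model dimension m.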
A second observation is that the variance of the estimates Ã depends on the selection of A_i, B_j. For instance, characteristic events should be chosen such that sequences ā_x, ā_y ∈ A_i collected in the same characteristic event are highly correlated, in the sense that the probabilities P̃_S[ā_x | B_j], P̃_S[ā_y | B_j] (where j = 1, ..., m) correlate strongly. The better this condition is met, the smaller the variance of P̃_S[A_i] and, accordingly, the variance of Ṽ and W̃_a. Several other rules of thumb similar to this one are discussed in Jaeger (1998).

A third and last observation is that characteristic events can be chosen such that available domain knowledge is exploited. For instance, in estimating an OOM of a written English text string S, one might collect in each characteristic or indicative event those letter strings that belong to the same linguistic (phonological or morphological) category. This would result in an interpretable model whose state dimensions represent the probabilities for these categories to be realized next in the process.
We now apply the learning procedure to the probability clock C introduced in sections 6 and 7.² (We provide only a sketch here. A more detailed account is contained in Jaeger, 1998.) C was run to generate a path S of length N = 30,000. C was started in an invariant state (cf. proposition 1); therefore S is stationary. We shall construct from S an estimate C̃ of C. Assume that we know that the process dimension is m = 3 (this value was also obtained by the dimension estimation heuristics mentioned above). The rules of thumb noted above led to characteristic events A_1 = {aa}, A_2 = {ab}, A_3 = {ba, bb} and indicative events B_1 = {aa}, B_2 = {ab, bb}, B_3 = {ba} (details of how these events were selected are documented in Jaeger, 1998). Using these characteristic and indicative events, an OOM C̃ = (R³, (τ̃_a, τ̃_b), w̃_0) was estimated using the algorithm described in this section.

We briefly highlight the quality of the model thus obtained. First, we compare the operators of C̃ with the operators of the interpretable version C(A_1, A_2, A_3) = (R³, (τ_a, τ_b), w_0) of the probability clock (cf. equation 7.6). The average absolute error of matrix entries in τ̃_a, τ̃_b versus τ_a, τ_b was found to be approximately .0038 (1.7% of the average correct value). A comparison of the matrices τ̃_a, τ̃_b versus τ_a, τ_b is not too illuminating, however, because some of the matrix entries have little effect on the process (i.e., varying them greatly would only slightly alter the distribution of the process). Conversely, the information in S contributes only weakly to estimating these matrix entries, which are therefore likely to deviate considerably from the corresponding entries in the original matrices. Indeed, taken as a sequence generator, the estimated model is much closer to the original process than a matrix entry deviation of 1.7% might suggest. Figure 5 illustrates the fit of probabilities and states computed with C̃ versus the true values.
Figure 5a is similar to Figure 3, plotting the probabilities P̃[a | a^t] against time steps t. The solid line represents the probabilities of the original probability clock. Even after 30 time steps, the predictions are very close to the true values. Figure 5b is an excerpt of Figure 4b and shows the eight most frequently taken states of the original probability clock. Figure 5c is an overlay of Figure 5b with a 100,000-step run of the estimated model C̃: every hundredth step of that run was additionally plotted into Figure 5b to obtain Figure 5c. The original states closely agree with the states of the estimated model.

I conclude this section by emphasizing a shortcoming of the learning algorithm in its present form. An OOM is characterized by the three conditions stated in definition 1. The learning algorithm guarantees only that
² The calculations were done using the Mathematica software package. Data and Mathematica programs are available online at www.gmd.de/People/Herbert.Jaeger/.
Figure 5: Comparison of estimated model versus original probability clock. (a) Probabilities P(a | ā) as captured by the learned model (dots) versus the original (solid line). (b) The most frequent states of the original. (c) States visited by the learned model. See the text for details.
the estimated structure Ã satisfies the first two conditions (an easy exercise). Unfortunately, it may happen that Ã does not satisfy condition 3 and thus might not be a valid OOM. That is, Ã might produce negative "probabilities" P[ā]. For instance, the structure Ã estimated in equation 8.5 is not a valid OOM; one obtains 1 τ_b τ_a τ_b τ_a τ_b w_0 = −0.00029. A procedure for transforming an "almost-OOM" into a nearest (in some matrix norm) valid OOM would be highly desirable. Progress is currently blocked by the lack of a decision procedure for determining whether an OOM-like structure A satisfies condition 3. In practice, however, even an "almost-OOM" can be useful. If the model dimension is selected properly, the learning procedure yields either valid OOMs or pseudo-OOMs that are close to valid ones. Even the latter yield good estimates P̃[ā] if ā is not too long and P[ā] is not too close to zero. Further examples and discussion can be found in Jaeger (1997b, 1998).
9 Conclusion
The results reported in this article are all variations on a single insight: the change of predictive knowledge that we have about a stochastic system is a linear phenomenon. This leads to the concept of observable operators, which has been detailed here for discrete-time, finite-valued processes but is applicable to every stochastic process (Jaeger, 1999). The linear nature of observable operators makes it possible to cast the system identification task purely in terms of numerical linear algebra.

The proposed system identification technique proceeds in two stages: first, determine the appropriate model dimension and choose characteristic and indicative events; second, construct the counting matrices and from them the model. While the second stage is purely mechanical, the first involves some heuristics. The situation is reminiscent of HMM estimation, where in a first stage a model structure has to be determined (usually by hand), or of neural network learning, where a network topology has to be designed before the mechanical parameter estimation can start. Although the second stage of model construction for OOMs is transparent and efficient, it remains to be seen how well the subtleties involved with the first stage can be mastered in actual applications. Furthermore, the problem remains to be solved of how to deal with the fact that the learning algorithm may produce invalid OOMs. One reason for optimism is that these questions can be posed in terms of numerical linear algebra, which arguably is the best understood of all areas of applied mathematics.

Appendix A: Proof of Proposition 1
A numerical function P on the set of finite initial sequences over O can be uniquely extended to the probability distribution of a stochastic process if the following two conditions are met:

1. P is a probability measure on the set of initial sequences of length k + 1, for all k ≥ 0. That is, (a) P[a_{i_0} ... a_{i_k}] ≥ 0, and (b) Σ_{a_{i_0}···a_{i_k} ∈ O^{k+1}} P[a_{i_0} ... a_{i_k}] = 1.

2. The values of P on the initial sequences of length k + 1 agree with continuation of the process in the sense that P[a_{i_0} ... a_{i_k}] = Σ_{b∈O} P[a_{i_0} ... a_{i_k} b].

The process is stationary if additionally the following condition holds:

3. P[a_{i_0} ... a_{i_k}] = Σ_{b_0...b_s ∈ O^{s+1}} P[b_0 ... b_s a_{i_0} ... a_{i_k}] for all a_{i_0} ... a_{i_k} ∈ O^{k+1} and s ≥ 0.

Point 1a is warranted by virtue of condition 3 from definition 1. Point 1b is a consequence of conditions 1 and 2 from the definition (exploit that condition 2 implies that left-multiplying a vector by μ does not change the sum of components of the vector):

    Σ_{ā∈O^{k+1}} P[ā] = Σ_{ā∈O^{k+1}} 1 τ_ā w_0
                       = 1 (Σ_{a∈O} τ_a) ··· (Σ_{a∈O} τ_a) w_0    [k + 1 factors]
                       = 1 μ ··· μ w_0 = 1 w_0 = 1.
For proving point 2, again exploit condition 2:

    Σ_{b∈O} P[āb] = Σ_{b∈O} 1 τ_b τ_ā w_0 = 1 μ τ_ā w_0 = 1 τ_ā w_0 = P[ā].
Finally, the stationarity criterion 3 is obtained by exploiting μ w_0 = w_0:

    Σ_{b̄∈O^{s+1}} P[b̄ā] = Σ_{b̄∈O^{s+1}} 1 τ_ā τ_b̄ w_0
                          = 1 τ_ā μ ··· μ w_0    [s + 1 factors μ]
                          = 1 τ_ā w_0 = P[ā].

Appendix B: Proof of Proposition 2
Let b̄ ∈ O*, and let g_b̄ = Σ_{i=1}^n α_i g_{c̄_i} be the linear combination of g_b̄ from basis elements of G. Let d̄ ∈ O^+. Then we obtain the statement of the proposition through the following calculation:

    (τ_a(g_b̄))(d̄) = (τ_a(Σ_{i=1}^n α_i g_{c̄_i}))(d̄) = (Σ_i α_i τ_a(g_{c̄_i}))(d̄)
                  = (Σ_i α_i P[a | c̄_i] g_{c̄_i a})(d̄) = Σ_i α_i P[a | c̄_i] P[d̄ | c̄_i a]
                  = Σ_i α_i P[a | c̄_i] (P[c̄_i a d̄] / (P[a | c̄_i] P[c̄_i]))
                  = Σ_i α_i (P[c̄_i] P[ad̄ | c̄_i] / P[c̄_i]) = Σ_i α_i P[ad̄ | c̄_i]
                  = g_b̄(ad̄) = P[ad̄ | b̄] = P[a | b̄] P[d̄ | b̄a] = P[a | b̄] g_{b̄a}(d̄).
Appendix C: Proof of Proposition 3
From an iterated application of equation 4.2, it follows that τ_{a_{i_k}} ··· τ_{a_{i_0}} g_ε = P[a_{i_0} ... a_{i_k}] g_{a_{i_0} ... a_{i_k}}. Therefore, it holds that

    g_{a_{i_0} ... a_{i_k}} = Σ_{i=1,...,n} (α_i / P[a_{i_0} ... a_{i_k}]) g_{b̄_i}.

Interpreting the vectors g_{b̄_i} and g_{a_{i_0} ... a_{i_k}} as probability distributions (cf. equation 4.1), it is easy to see that Σ_{i=1,...,n} α_i / P[a_{i_0} ... a_{i_k}] = 1, from which the statement immediately follows.
Appendix D: Proof of Proposition 4
To see proposition 4.1, let (G, (τ_a)_{a∈O}, g_ε) be the predictor-space OOM of (X_t). Choose {b̄_1, ..., b̄_m} ⊂ O* such that the set {g_{b̄_i} | i = 1, ..., m} is a basis of G. Then it is an easy exercise to show that an OOM A = (R^m, (τ_a)_{a∈O}, w_0) and an isomorphism π: G → R^m exist such that (1) π(g_ε) = w_0, (2) π(g_{b̄_i}) = e_i, where e_i is the ith unit vector, and (3) π(τ_a d) = τ_a π(d) for all a ∈ O, d ∈ G. These properties imply that A is an OOM of (X_t).

In order to prove proposition 4.2, let again (G, (τ_a)_{a∈O}, g_ε) be the predictor-space OOM of (X_t). Let C be the linear subspace of R^k spanned by the vectors {w_0} ∪ {τ_ā w_0 | ā ∈ O^+}. Let {τ_{b̄_1} w_0, ..., τ_{b̄_l} w_0} be a basis of C. Define a linear mapping σ from C to G by putting σ(τ_{b̄_i} w_0) := P[b̄_i] g_{b̄_i}, where i = 1, ..., l. σ is called the canonical projection of A on the predictor-space OOM. By a straightforward calculation, it can be shown that σ(w_0) = g_ε and that for all c̄ ∈ O^+ it holds that σ(τ_c̄ w_0) = P[c̄] g_c̄ (cf. Jaeger, 1997a, for these and other properties of σ). This implies that σ is surjective, which in turn yields m ≤ k.

Appendix E: Proof of Proposition 7
From the definition of process dimension (see definition 2), it follows that m sequences ā_1, ..., ā_m and m sequences b̄_1, ..., b̄_m exist such that the m × m matrix (P[ā_j | b̄_i]) is nonsingular. Let k be the maximal length occurring in the sequences ā_1, ..., ā_m. Define collective events C_j of length k by using the sequences ā_j as initial sequences; that is, put C_j := {ā_j c̄ | c̄ ∈ O^{k−|ā_j|}}, where |ā_j| denotes the length of ā_j. It holds that (P[C_j | b̄_i]) = (P[ā_j | b̄_i]).

We transform the collective events C_j in two steps in order to obtain characteristic events. In the first step, we make them disjoint. Observe that due to their construction, two collective events C_{j_1}, C_{j_2} are either disjoint, or one is properly included in the other. We define new, nonempty, pairwise disjoint collective events C'_j := C_j \ ∪_{C_x ⊊ C_j} C_x by taking away from C_j all collective events properly included in it. It is easily seen that the matrix (P[C'_j | b̄_i]) can be obtained from (P[C_j | b̄_i]) by subtracting certain rows from others. Therefore, this matrix is nonsingular too.

In the second step, we enlarge the C'_j (while preserving disjointness) in order to arrive at collective events A_j that exhaust O^k. Put C'_0 := O^k \ (C'_1 ∪ ··· ∪ C'_m). If C'_0 = ∅, then A_j := C'_j (j = 1, ..., m) are characteristic events. If C'_0 ≠ ∅, consider the m × (m + 1) matrix (P[C'_j | b̄_i])_{i=1,...,m, j=0,...,m}. It has rank m and column vectors v_j = (P[C'_j | b̄_1], ..., P[C'_j | b̄_m]). If v_0 is the null vector, put A_1 := C'_0 ∪ C'_1, A_2 := C'_2, ..., A_m := C'_m to obtain characteristic events. If v_0 is not the null vector, let v_0 = Σ_{ν=1,...,m} α_ν v_ν be its linear combination from the other column vectors. Since all v_ν are nonnull, nonnegative vectors, some α_{ν_0} must be properly greater than 0. A basic linear algebra argument (exercise) shows that the m × m matrix made from the column vectors v_1, ..., v_{ν_0} + v_0, ..., v_m has rank m. Put A_1 := C'_1, ..., A_{ν_0} := C'_{ν_0} ∪ C'_0, ..., A_m := C'_m to obtain characteristic events.

Acknowledgments
I am deeply grateful to Thomas Christaller for his confidence and great support. This article has benefited enormously from advice given by Hisashi Ito, Shun-Ichi Amari, Rafael Núñez, Volker Mehrmann, and two extremely constructive anonymous reviewers, whom I heartily thank. The research described in this article was carried out under a postdoctoral grant from the German National Research Center for Information Technology (GMD).

References

Bengio, Y. (1996). Markovian models for sequential data (Tech. Rep. No. 1049). Montreal: Département d'Informatique et Recherche Opérationnelle, Université de Montréal. Available online at: http://www.iro.umontreal.ca/labs/neuro/pointeurs/hmmsTR.ps.
Berman, A., & Plemmons, R. (1979). Nonnegative matrices in the mathematical sciences. San Diego, CA: Academic Press.
Doob, J. (1953). Stochastic processes. New York: Wiley.
Elliott, R., Aggoun, L., & Moore, J. (1995). Hidden Markov models: Estimation and control. New York: Springer-Verlag.
Gilbert, E. (1959). On the identifiability problem for functions of finite Markov chains. Annals of Mathematical Statistics, 30, 688–697.
Golub, G., & Loan, C. van. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
Heller, A. (1965). On stochastic processes derived from Markov chains. Annals of Mathematical Statistics, 36, 1286–1291.
Iosifescu, M., & Theodorescu, R. (1969). Random processes and learning. New York: Springer-Verlag.
Ito, H. (1992). An algebraic study of discrete stochastic systems. Unpublished doctoral dissertation, University of Tokyo, Bunkyo-ku, Tokyo. Available by ftp from http://kuro.is.sci.toho-u.ac.jp:8080/english/D/.
Ito, H., Amari, S.-I., & Kobayashi, K. (1992). Identifiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Transactions on Information Theory, 38(2), 324–333.
Jaeger, H. (1997a). Observable operator models and conditioned continuation representations (Arbeitspapiere der GMD No. 1043). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1997b). Observable operator models II: Interpretable models and model induction (Arbeitspapiere der GMD No. 1083). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1998). Discrete-time, discrete-valued observable operator models: A tutorial (GMD Report No. 42). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1999). Characterizing distributions of stochastic processes by linear operators (GMD Report No. 62). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Narendra, K. (1995). Identification and control. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 477–480). Cambridge, MA: MIT Press/Bradford Books.
Rabiner, L. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. In A. Waibel and K.-F. Lee (Eds.), Readings in speech recognition (pp. 267–296). San Mateo, CA: Morgan Kaufmann.
Smallwood, R., & Sondik, E. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1071–1088.

Received January 5, 1998; accepted April 8, 1999.
LETTER

Communicated by David Saad

Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons

Shun-ichi Amari
Hyeyoung Park
Kenji Fukumizu
RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-0198, Japan

The natural gradient learning method is known to have ideal performances for on-line training of multilayer perceptrons. It avoids plateaus, which give rise to slow convergence of the backpropagation method. It is Fisher efficient, whereas the conventional method is not. However, for implementing the method, it is necessary to calculate the Fisher information matrix and its inverse, which is practically very difficult. This article proposes an adaptive method of directly obtaining the inverse of the Fisher information matrix. It generalizes the adaptive Gauss-Newton algorithms and provides a solid theoretical justification of them. Simulations show that the proposed adaptive method works very well for realizing natural gradient learning.

1 Introduction
Natural gradient (Amari, 1998; Yang & Amari, 1998) gives an on-line learning algorithm that takes the Riemannian metric of the parameter space into account. The natural gradient method is based on information geometry (Amari, 1985; Amari & Nagaoka, 2000) and uses the Riemannian metric of the parameter space to define the steepest direction of a cost function. It may be regarded as a version of the stochastic descent method. It uses a matrix learning rate equal to the inverse of the Fisher information matrix, which plays the role of the Riemannian metric in the space of perceptrons. It overcomes two types of shortcomings involved in backpropagation learning: efficiency and slow convergence.

On-line learning may use a training example once when it is observed, while batch learning stores all the examples, and any example can be reused later. Therefore, it is in general true that the performance of on-line learning is worse than that of batch learning. This is true for the conventional backpropagation method. Amari (1998) shows that natural gradient achieves Fisher efficiency. According to the Cramér-Rao bound, this is the best asymptotic performance that any unbiased learning algorithm can achieve.

The backpropagation method is known to be very slow in convergence. This is because of plateaus in which the parameter is trapped in the process of learning, taking a long time to get rid of them.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1399–1409 (2000)

1400
S. Amari, H. Park, and K. Fukumizu

Statistical-mechanical analysis (Saad & Solla, 1995) has made it clear that plateaus are ubiquitous in backpropagation learning, and various acceleration methods so far proposed cannot avoid or escape from them. Amari (1998) and Yang and Amari (1998) suggested that the natural gradient algorithm has the possibility of avoiding plateaus or quickly escaping from them. This has been theoretically confirmed (Rattray, Saad, & Amari, 1998), where an ideal performance of the natural gradient method was theoretically demonstrated in the thermodynamical limit.

However, it is difficult to calculate the Fisher information matrix of multilayer perceptrons. Even when it is obtained, its inversion is computationally expensive. Yang and Amari (1998) gave an explicit form of the Fisher information matrix and the computational complexity of its inversion by assuming that the distribution of input signals is gaussian. Rattray et al. (1998) gave it in terms of statistical-mechanical order parameters. These results show that it is difficult to implement natural gradient learning for practical large-scale problems, however excellent its performance is.

This article proposes an adaptive method of obtaining the inverse of the Fisher information matrix directly, without any matrix inversion, by applying the Kalman filter technique. The proposed adaptive method generalizes the adaptive Gauss-Newton method (LeCun, Bottou, Orr, & Müller, 1998) and provides a solid theoretical justification based on different philosophical ideas. Computer experiments demonstrate that the proposed method has almost the same performance as the original natural gradient method and that its convergence speed is surprisingly faster than the conventional backpropagation method. This is a preliminary study; more general performances of the proposed method will be reported in the future by Park, Amari, and Fukumizu (2000).
Let us consider a multilayer perceptron (MLP) that receives an n-dimensional input signal x and emits a scalar output signal y. When it has m hidden units and one linear output unit, its input-output behavior is written as

y = \sum_{a=1}^{m} v_a \varphi(w_a \cdot x + b_a) + b_0 + \xi,   (2.1)
where w_a is an n-dimensional connection weight vector from the input to the a-th hidden unit (a = 1, ..., m); v_a is the connection weight from the a-th hidden unit to the output unit; b_a and b_0 are biases of the a-th hidden node and the output node, respectively; and \xi is random noise subject to N(0, \sigma^2). The function \varphi is a sigmoidal activation function. The parameters \{w_1, ..., w_m; b; v\} can be summarized into a single \{m(n+2)+1\}-dimensional vector \theta. The network is stochastic because of the noise \xi. The set
Natural Gradient Learning for Multilayer Perceptrons
1401
S of all such stochastic multilayer perceptrons forms a manifold in which \theta plays the role of a coordinate system. A multilayer perceptron with parameter \theta is associated with the conditional probability distribution of the output y given the input x,

p(y | x, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( y - f(x, \theta) \right)^2 \right\},   (2.2)

where

f(x, \theta) = \sum_{a=1}^{m} v_a \varphi(w_a \cdot x + b_a) + b_0   (2.3)
is the mean value of y given input x. The space S of multilayer perceptrons is identified with the set of all conditional probability distributions of equation 2.2. The logarithm of equation 2.2 is

l(y | x, \theta) = -\frac{1}{2\sigma^2} \left\{ y - f(x, \theta) \right\}^2 - \log(\sqrt{2\pi}\,\sigma)   (2.4)
and is called the log-likelihood. Up to a scale factor and a constant term, it is the negative of the squared error when y is the target value given x. Hence, maximizing the likelihood is equivalent to minimizing the squared error,

l^*(y, x, \theta) = \frac{1}{2} \left\{ y - f(x, \theta) \right\}^2.   (2.5)
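The model of equations 2.1 and 2.3 and the squared error of equation 2.5 can be sketched as follows. This is an illustrative implementation only; it assumes tanh for the activation \varphi, which the text leaves unspecified:

```python
import numpy as np

def f(theta, x):
    """Mean output f(x, theta) of equation 2.3, assuming tanh for the activation."""
    W, b, v, b0 = theta              # W: (m, n) input weights, b: (m,), v: (m,), b0: scalar
    return v @ np.tanh(W @ x + b) + b0

def sq_error(theta, x, y):
    """Squared-error loss l*(y, x, theta) of equation 2.5."""
    return 0.5 * (y - f(theta, x)) ** 2
```

A noisy output as in equation 2.1 is obtained by adding gaussian noise of variance \sigma^2 to f.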
Let us consider the training of a perceptron. Given a sequence of training examples \{(x_1, y_1^*), (x_2, y_2^*), ...\}, the conventional on-line backpropagation learning algorithm is written as

\theta_{t+1} = \theta_t - \eta_t \nabla l^*(x_t, y_t^*; \theta_t),   (2.6)
where \theta_t is the network parameter at time t, \eta_t is a learning rate that may depend on t,

\nabla l^*(x, y^*, \theta) = \left\{ \frac{\partial}{\partial \theta_i} l^*(x, y^*; \theta) \right\}   (2.7)

is the gradient of the loss function l^*, and y_t^* is the desired output signal given by the teacher.
The gradient \nabla l^* is generally believed to point in the steepest direction of the scalar function l^*. This is true only when the space S is Euclidean and \theta is an orthonormal coordinate system. In our case of stochastic perceptrons, the
1402
S. Amari, H. Park, and K. Fukumizu
parameter space S has a Riemannian structure (Amari, 1998; Yang & Amari, 1998), so the ordinary gradient does not represent the steepest direction of the loss function. The steepest descent direction of the loss function l^*(\theta) in a Riemannian space is given (Amari, 1998) by

-\tilde{\nabla} l^*(\theta) = -G^{-1}(\theta) \nabla l^*(\theta),   (2.8)

where G^{-1} is the inverse of the matrix G = (g_{ij}) called the Riemannian metric tensor. This gradient is called the natural gradient of the loss function l^*(\theta) in the Riemannian space, and it suggests the natural gradient descent algorithm of the form

\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} l^*(\theta_t)   (2.9)
(Amari, 1998; Edelman, Arias, & Smith, 1998). In the case of a stochastic multilayer perceptron, the Riemannian metric tensor G(\theta) = (g_{ij}(\theta)) is given by the Fisher information matrix (Amari, 1998),

g_{ij}(\theta) = E\left[ \frac{\partial l(y|x; \theta)}{\partial \theta_i} \frac{\partial l(y|x; \theta)}{\partial \theta_j} \right],   (2.10)
where E denotes expectation with respect to the input-output pair (x, y) distributed according to equation 2.2. Note that the teacher signal y^* is not used in this definition. It has been proved that on-line learning based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. It has also been suggested that natural gradient learning escapes from plateaus much more easily than ordinary gradient learning (Amari, 1998; Yang & Amari, 1998). The statistical-physics approach elucidated the ideal, plateau-avoiding behavior of natural gradient learning (Rattray et al., 1998).

2.1 Adaptive Estimation of the Natural Gradient. The Fisher information matrix of equation 2.10 at \theta_t can be rewritten, by using equation 2.4, as
G_t = E\left[ \frac{\partial l(y|x; \theta_t)}{\partial \theta_t} \frac{\partial l(y|x; \theta_t)}{\partial \theta_t}' \right]   (2.11)

= \frac{1}{\sigma^4} E\left[ \{ y - f(x; \theta_t) \}^2 \right] E\left[ \frac{\partial f(x; \theta_t)}{\partial \theta_t} \frac{\partial f(x; \theta_t)}{\partial \theta_t}' \right]   (2.12)

= \frac{1}{\sigma^2} E\left[ \frac{\partial f(x; \theta_t)}{\partial \theta_t} \frac{\partial f(x; \theta_t)}{\partial \theta_t}' \right],   (2.13)
where the prime denotes transposition of a vector or matrix. To calculate the expectation, we have to know the probability distribution q(x) of the input x. Such knowledge of the input distribution, however, is hardly available in practical problems. In addition, it is difficult to obtain an explicit form of
G even if we know it. Moreover, inversion of G is computationally costly when the number m of hidden units is large. All of this suggests that natural gradient learning is not practical. In order to overcome these difficulties, we propose an adaptive method of directly obtaining an estimate of G^{-1}(\theta_t) that does not require costly inversion of G. Since the Fisher information at \theta_t is obtained by the expectation over inputs x of \nabla f (\nabla f)', we make use of the adaptive method of obtaining \hat{G}_t, an estimate of G(\theta_t), given by

\hat{G}_t = (1 - \varepsilon_t) \hat{G}_{t-1} + \varepsilon_t \nabla f(x_{t-1}, \theta_{t-1}) \nabla f(x_{t-1}, \theta_{t-1})',   (2.14)
where \varepsilon_t is a time-dependent learning rate. Typical examples are \varepsilon_t = c/t and \varepsilon_t = \varepsilon_0. When we put \varepsilon_t = 1/t, \hat{G}_t is equal to the arithmetic mean of \nabla f_i (\nabla f_i)' except for the scalar factor 1/\sigma^2, where \nabla f_i = \nabla f(x_i, \theta_i), i = 1, ..., t. When \theta_t converges to \theta^*, \hat{G}_t converges to G(\theta^*). Our purpose is to obtain \hat{G}_t^{-1} directly. To this end, we use the well-known Kalman filter technique, applying the following identity, which holds for a nonsingular symmetric n \times n matrix A and an n \times k matrix B (k < n):

(aA + bBB')^{-1} = \frac{1}{a}\left[ A^{-1} - \left( \sqrt{\tfrac{b}{a}}\, A^{-1} B \right) \left( I + \tfrac{b}{a}\, B' A^{-1} B \right)^{-1} \left( \sqrt{\tfrac{b}{a}}\, A^{-1} B \right)' \right],   (2.15)
where a and b are scalars (see Bottou, 1998). We apply the identity to the right-hand side of equation 2.14 with A = \hat{G}_t and B = \nabla f_t. We then have the following adaptive algorithm for estimating the inverse of the Fisher information:

\hat{G}_{t+1}^{-1} = \frac{1}{1 - \varepsilon_t}\, \hat{G}_t^{-1} - \frac{\varepsilon_t}{1 - \varepsilon_t} \cdot \frac{\hat{G}_t^{-1} \nabla f_t (\nabla f_t)' \hat{G}_t^{-1}}{(1 - \varepsilon_t) + \varepsilon_t (\nabla f_t)' \hat{G}_t^{-1} \nabla f_t}.   (2.16)

When \varepsilon_t is small, we may approximate it by the simpler

\hat{G}_{t+1}^{-1} = (1 + \varepsilon_t) \hat{G}_t^{-1} - \varepsilon_t \hat{G}_t^{-1} \nabla f_t (\nabla f_t)' \hat{G}_t^{-1}.   (2.17)
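The matrix identity of equation 2.15 is easy to verify numerically. The following check, with arbitrary random matrices A and B chosen for illustration, confirms the form used above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
a, b = 0.9, 0.1                       # scalars, e.g. a = 1 - eps_t and b = eps_t
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # nonsingular symmetric n x n matrix
B = rng.standard_normal((n, k))       # n x k matrix, k < n

Ainv = np.linalg.inv(A)
C = np.sqrt(b / a) * Ainv @ B         # the sqrt(b/a) A^{-1} B factor of equation 2.15
rhs = (Ainv - C @ np.linalg.inv(np.eye(k) + (b / a) * B.T @ Ainv @ B) @ C.T) / a
lhs = np.linalg.inv(a * A + b * B @ B.T)
# lhs and rhs agree to machine precision
```

This is the Woodbury (Sherman-Morrison) identity in disguise: inverting the n x n matrix on the left is replaced by inverting a k x k matrix on the right.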
The related natural gradient learning algorithm is given by

\theta_{t+1} = \theta_t - \eta_t \hat{G}_t^{-1} \nabla l^*(x_t, y_t^*; \theta_t).   (2.18)

Equations 2.17 and 2.18 constitute the new method that we propose for implementing natural gradient learning.
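In outline, one step of the proposed method (equation 2.18 followed by the update 2.17) for a squared-error loss can be sketched as below. This is a minimal sketch, not the authors' implementation; the functions `f` and `grad_f` (the network output and its gradient with respect to the flattened parameter vector) are assumptions supplied by the user:

```python
import numpy as np

def angl_step(theta, Ginv, x, y_star, eta, eps, f, grad_f):
    """One adaptive natural-gradient step.

    theta: flattened parameter vector; Ginv: current estimate of the
    inverse Fisher matrix; f(theta, x): network output; grad_f(theta, x):
    gradient of f with respect to theta."""
    g = grad_f(theta, x)                                 # nabla f(x_t, theta_t)
    # eq. 2.18: natural-gradient step on the squared error of equation 2.5
    theta_new = theta - eta * Ginv @ ((f(theta, x) - y_star) * g)
    # eq. 2.17: rank-one update of the inverse Fisher estimate for the next step
    Gg = Ginv @ g
    Ginv_new = (1.0 + eps) * Ginv - eps * np.outer(Gg, Gg)
    return theta_new, Ginv_new
```

For a linear model f(\theta, x) = \theta \cdot x with unit gaussian inputs, the Fisher matrix is proportional to the identity, and the iteration recovers the teacher parameters.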
Since we have introduced one more learning rate, \varepsilon_t, in addition to \eta_t, we need to study how to choose \varepsilon_t. The value of \varepsilon_t determines how quickly and how accurately \hat{G}_t^{-1} converges to G^{-1}(\theta_t). In the final stage, where \theta_t is close to \theta^*, an error in G^{-1} is not serious, since the estimator \theta_t is consistent even when G^{-1} is misspecified. Here, an adequate choice of \eta_t is much more important. There are many theories concerning the determination of \eta_t (see, for example, Amari, 1998, for its adaptive determination). Avoidance of plateaus is an important benefit of the natural gradient. If \varepsilon_t is too small, there is a serious delay in adjusting \hat{G}^{-1}, so that the parameter might approach plateaus, even though it eventually escapes from them. On the other hand, a large \varepsilon_t causes numerical instability. All in all, we suggest using a constant \varepsilon_t that is not too small. The performance is insensitive to \varepsilon_t within a reasonable constant range. These points are confirmed by the computer simulations.
It should be noted here that the natural gradient method given by equations 2.8 and 2.9 is completely different from the Newton-Raphson method, in which G(\theta) is replaced by the Hessian

H(\theta) = E\left[ \nabla \nabla l^*(x, y^*; \theta) \right].   (2.19)
H(\theta) depends on the specific cost function l^*, which includes the teacher signal y^* explicitly, whereas G(\theta) is the metric of the underlying space S, independent of the target function to be approximated. However, when the log-likelihood is used as the cost function and the teacher signal y is generated by a network having parameter \theta^*,

H(\theta^*) = G(\theta^*)   (2.20)
holds at \theta^*. Hence, the natural gradient is equivalent to the Newton method around the optimal point.
The Hessian H(\theta) is not necessarily positive-definite. Moreover, it includes second derivatives. To avoid calculating the second derivatives and to keep the matrix positive-definite, the Gauss-Newton method approximates H(\theta) by neglecting the second-derivative terms and taking the sample average,

G_N(\theta) = \frac{1}{T} \sum_{t=1}^{T} \nabla f(x_t, \theta) \nabla f(x_t, \theta)'.   (2.21)
LeCun et al. (1998) also describe a stochastic or adaptive Gauss-Newton method for the case where l^* is a quadratic error function. The proposed method generalizes the adaptive Gauss-Newton algorithms and provides a solid theoretical justification even when l^* is not quadratic. The proposed adaptive method is computationally inexpensive. It should be noted that when \theta is close to a plateau, G_N(\theta) is almost singular, and its inverse diverges.
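For concreteness, the Gauss-Newton matrix of equation 2.21 is just the sample average of gradient outer products; a minimal sketch:

```python
import numpy as np

def gauss_newton_matrix(grads):
    """G_N(theta) of equation 2.21.

    grads: array of shape (T, dim), one gradient of f per sample."""
    return grads.T @ grads / grads.shape[0]
```

Near a plateau, several hidden units carry nearly identical gradients, the rows of `grads` become nearly collinear, and G_N becomes nearly singular, which is why its inverse diverges there.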
3 Experimental Results

3.1 Simple Toy Model. We conducted an experiment comparing the convergence speeds of the conventional gradient, the exact natural gradient, and the proposed adaptive method. We used a simple MLP model,

y = \sum_{a=1}^{2} v_a \varphi(w_a \cdot x) + \xi,   (3.1)
having two-dimensional input and no bias terms. Assuming that the input is subject to the gaussian distribution N(0, I), where I is the identity matrix, we can obtain the Fisher information matrix and its inverse explicitly, so that we have the exact natural gradient learning algorithm. The training data (x_t, y_t^*), t = 1, 2, ..., are given at each time t by the noisy output y_t^* of a teacher network. The teacher network is an orthogonal committee machine having the same structure as the student network, with the true parameters defined by

|w_1| = |w_2| = 1,  w_1' w_2 = 0,  v_a = 1.   (3.2)
The variance of the output noise was set to 0.1. We chose the initial values of the parameters randomly, subject to the uniform distribution on small intervals:

w_{ai} ~ 0.3 + U[-10^{-8}, 10^{-8}],
v_a ~ 0.2 + U[-10^{-4}, 10^{-4}],  a, i = 1, 2.   (3.3)

We ran the three learning algorithms simultaneously with the same initial parameter values and the same training data sequence under various conditions. A typical result for the convergence speed is shown in Figure 1. As the figure shows, the proposed learning algorithm converges as fast as the exact natural gradient learning algorithm. Both escape from plateaus much faster than the conventional algorithm in this example. The learning rate was selected empirically for each algorithm to obtain an accurate estimator and a good convergence speed. Figure 1 is a typical case with a small, constant learning rate \eta. We also tried the case \eta_t = c/t and others. Various choices of \varepsilon_t were also checked. When G is close to a singular matrix, G^{-1} and hence \tilde{\nabla} l^* become too large. In such a case, we made \eta_t smaller to avoid too large an adjustment of the parameter vector at one time. The adaptive method works very well in this example but has a tendency to be trapped in a plateau when \varepsilon_t is chosen too small, while the exact natural gradient method does not. This is due to the time delay in adjusting \hat{G}^{-1}. We checked the effect of different choices of \varepsilon_t; there is little change in performance within an adequate range of \varepsilon_t, namely 0.05-0.5. When \varepsilon_t is too large, numerical instability emerges. When it is too small, for example 0.01, the plateau phenomena become pronounced.
Figure 1: Average learning curves for the three algorithms: OGL, ordinary gradient learning; NGL, natural gradient learning; ANGL, adaptive natural gradient learning.

3.2 Extended XOR Problem. To show that the proposed learning algorithm can be applied to practical problems, we treated a pattern classification problem: the extended exclusive-OR problem, a benchmark problem for pattern classifiers. Figure 2 shows the pattern set to be classified into two categories. It consists of nine clusters, each assigned to one of the two classes, marked by the symbols o and +. Each cluster is generated from a gaussian distribution specified by a mean and a common covariance matrix. Each cluster has 200 elements, and neighboring clusters of different classes overlap in part. We used an MLP with 12 hidden units and one nonlinear output unit. Since it is difficult to obtain an explicit form of the natural gradient here, we conducted experiments with ordinary gradient learning and the proposed learning method. The desired outputs y were set to 1 and 0 for the two classes. Since the size of the training set is fixed, we used the examples repeatedly, and the mean squared training error over the set was calculated at each cycle. The algorithms started with initial values generated randomly, subject to
w_{ai}, v_a, b_a, b_0 ~ U[-10^{-1}, 10^{-1}],  a, i = 1, ..., m.   (3.4)
We conducted 10 independent runs. With ANGL the generalization error sometimes failed to decrease to the desired level, but with the conventional method it failed in most runs. We do not know whether the state converged
Figure 2: Extended XOR problem.
to a poor local minimum or was trapped at a plateau. Figure 3 shows the best learning curve for each algorithm. Since zero training error was barely attainable, we stopped learning when the classification rate on the training data set no longer increased. The resulting training errors were 0.0754 and 0.0675 for ordinary gradient learning and the proposed method, respectively. The numbers of learning cycles needed to reach the same classification rate (98.11%) were 154,300 and 1,500, respectively. Thus, the convergence of the proposed method was remarkably fast in this example.

4 Conclusions and Discussions
We have proposed an adaptive calculation of the inverse of the Fisher information matrix that requires no matrix inversion, together with an adaptive natural gradient learning algorithm that uses it. On a simple toy problem, we showed that the proposed learning algorithm is as fast as the exact natural gradient learning algorithm, which is provably Fisher efficient. The proposed method is remarkably faster than the conventional learning method. The experiment with a benchmark pattern recognition problem showed that the proposed
Figure 3: Learning curves for extended XOR problem: OGL, ordinary gradient learning; ANGL, adaptive natural gradient learning.
learning algorithm can also be applied successfully to practical problems. The improvement in convergence speed over ordinary gradient learning is remarkable. This article is a preliminary study of the topological and metrical structure of the parameter space of MLPs. More detailed theoretical and experimental results will be published in the future.

Acknowledgments
We express our gratitude to Léon Bottou and Noboru Murata for their discussions and suggestions.

References

Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.
Amari, S., & Nagaoka, H. (2000). Information geometry. New York: American Mathematical Society and Oxford University Press.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning in neural networks (pp. 9-42). Cambridge: Cambridge University Press.
Edelman, A., Arias, T., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications,
20, 303-353.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 5-50). Berlin: Springer-Verlag.
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Submitted.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461-5464.
Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225-4243.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10, 2137-2157.

Received December 9, 1998; accepted May 14, 1999.
LETTER
Communicated by Shun-ichi Amari and Todd Leen
Nonmonotonic Generalization Bias of Gaussian Mixture Models

Shotaro Akaho
Electrotechnical Laboratory, Information Science Division, Ibaraki 305-8568, Japan

Hilbert J. Kappen
RWCP Theoretical Foundation SNN, Department of Medical Physics and Biophysics, University of Nijmegen, NL 6525 EZ Nijmegen, The Netherlands

Theories of learning and generalization hold that the generalization bias, defined as the difference between the generalization error and the training error, increases on average with the number of adaptive parameters. This article, however, shows that this general tendency is violated for a gaussian mixture model. For temperatures just below the first symmetry breaking point, the effective number of adaptive parameters increases while the generalization bias decreases. We compute the dependence of the neural information criterion on temperature around the symmetry breaking. Our results are confirmed by numerical cross-validation experiments.

1 Introduction
An important problem in learning is to optimize the model so that the best generalization performance is obtained. Generalization performance depends on the number of adaptive parameters in the model. We can reduce the training error as much as desired by increasing the number of adaptive parameters. However, the performance of the model should be evaluated by the generalization error, which measures the error on test samples. The best generalization performance is usually not obtained by maximizing the number of adaptive parameters (Vapnik, 1984; Rissanen, 1986).
The generalization bias is defined as the difference between the generalization error and the training error. The expected generalization bias can be measured approximately by the neural information criterion (NIC) or numerically through cross-validation on test data (Moody, 1992; Amari, 1993; Murata, Yoshizawa, & Amari, 1991, 1994). Typically, if the number of model parameters increases, the training error decreases because a better fit can be obtained. At the same time, the generalization bias is expected to increase due to the larger model variability. In this article we present an example in which this intuition is violated: a case in which the generalization bias decreases while the number of model parameters increases.

Neural Computation 12, 1411-1427 (2000). © 2000 Massachusetts Institute of Technology
1412
Shotaro Akaho and Hilbert J. Kappen
We consider radial basis Boltzmann machines (RBBMs), a special class of gaussian mixture models (Kappen, 1995). Gaussian mixture models have attracted a lot of attention in the neural network community (Barkai, Seung, & Sompolinsky, 1993; Titterington, 1985) because they are closely related to several neural network models, such as radial basis function (RBF) networks (Poggio & Girosi, 1990) and hierarchical mixture of experts (HME) networks (Jacobs & Jordan, 1991; Jordan & Jacobs, 1994). For gaussian mixtures, complexity and generalization performance are controlled by the number of mixture components. In RBBMs, the complexity of the model is controlled by a continuous parameter \beta. When \beta is small, the maximum likelihood (ML) solution of the RBBM degenerates into one gaussian. At a critical value of \beta, the ML solution becomes a mixture of several gaussians. This phenomenon of symmetry breaking is repeated recursively for increasing \beta. Thus, \beta controls the effective number of mixture components and, in this way, the generalization performance.
In this article we study the generalization bias of RBBMs around the first symmetry breaking point. In section 2, we define RBBMs and show the symmetry breaking mechanism. In section 3, we derive the condition under which the symmetry breaking is two-way or h-way, where h is the number of mixture components in the RBBM. In section 4, we show analytically that the generalization bias of the model, as measured by the NIC, decreases if the symmetry breaking is two-way, even though the superficial effective number of parameters increases. If we can assume that the training error does not change significantly around the symmetry breaking point, this anomaly implies that the ML solution just below the critical temperature is expected to realize a smaller generalization error than just above the critical temperature. In section 5, we confirm this effect numerically.

2 Radial Basis Boltzmann Machines
Let us consider a gaussian mixture model with equal priors, in which the variance of every component is spherically symmetric and identical:

p(x | W; \beta) = \frac{1}{h} \sum_{i=1}^{h} \sqrt{\frac{\beta}{\pi}} \exp(-\beta \|x - w_i\|^2).   (2.1)
W denotes the set of adaptive parameters \{w_1, ..., w_h\}. The complexity of the model is determined by h, the number of gaussian components, and by \beta, a control parameter called the inverse temperature in physics. Statistically, \beta is equal to half the inverse variance. We chose spherical covariance matrices because we are interested in the relation between complexity and symmetry breaking; we expect that our results generalize to mixture models with full covariance matrices.
Generalization Bias of Gaussian Mixture Models
1413
This unsupervised model was originally proposed by Rose, Gurewitz, and Fox (1990) and generalized by Kappen (1993, 1995; Nijman & Kappen, 1997) to the supervised case. The model is also related to fuzzy clustering (Bezdek, 1980). In this article, we refer to model 2.1 as an RBBM. Equation 2.1 can be written as

p(x) = \sum_{i=1}^{h} p(x, i) = \sum_{i=1}^{h} p(x | i)\, p(i),

with i labeling the individual clusters and p(i) = 1/h the prior probability of each cluster. p(x | i) is simply a gaussian distribution in x centered on w_i. For fixed \beta, the ML solution for the cluster means is easily derived (Rose et al., 1990) and is given by

w_i = \frac{\langle x\, p(i | x) \rangle}{\langle p(i | x) \rangle},

where \langle \cdot \rangle denotes the expectation value with respect to q(x),

\langle \cdot \rangle \equiv \int \cdot\, q(x)\, dx,

and q(x) is the target distribution from which the training and test samples are generated. These coupled equations can be solved by a variety of methods, such as the expectation-maximization (EM) algorithm. An example of the ML solutions as a function of temperature is shown in Figure 1. The training data are generated with equal probability from two distributions, u[0.5, 1.5] and N[-1, 0.3^2], where u[a, b] is the uniform distribution on [a, b] and N[m, v] is the gaussian distribution with mean m and variance v. The number of training samples is 100, and h = 100. For small \beta, the ML solution is of the form w_1 = w_2 = ... = w_h; that is, the solution corresponds to one cluster. At a critical value of \beta, the cluster splits into smaller parts. These symmetry breakings recur at higher values of \beta. Therefore, the number of gaussian components can be controlled by adjusting the temperature in this model. At any \beta, although the total number of kernels is h, only a smaller effective number of kernels is used. For this reason, we assume h to be sufficiently large. The complexity is then controlled by \beta only.

3 Symmetry Breaking Point of the RBBM
Since the effective number of parameters changes at the symmetry breaking points (SBPs), we study the symmetry breaking process in detail. Although the behavior of symmetry breaking is complicated to analyze, we have some results on the first SBP, the highest temperature at which a phase transition occurs. These results are expected to apply qualitatively to the other SBPs.
Figure 1: An example of the ML solution and the symmetry breaking phenomenon. Horizontal axis: log(\beta); vertical axis: w or x. Dots show the ML solution at each temperature; the dashes on the right show the training samples.
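ML solutions such as those plotted in Figure 1 can be computed by iterating the fixed-point condition w_i = \langle x\, p(i|x) \rangle / \langle p(i|x) \rangle of section 2, with the expectations replaced by sample averages. A minimal one-dimensional sketch (an illustrative EM-style sweep, not the authors' code):

```python
import numpy as np

def rbbm_em_sweep(X, W, beta):
    """One fixed-point sweep for the cluster means of the RBBM (equation 2.1).

    X: samples, shape (P,); W: current cluster means, shape (h,)."""
    logk = -beta * (X[:, None] - W[None, :]) ** 2        # log kernels, up to a constant
    r = np.exp(logk - logk.max(1, keepdims=True))
    r /= r.sum(1, keepdims=True)                         # responsibilities p(i | x)
    return (r.T @ X) / r.sum(0)                          # new means <x p(i|x)> / <p(i|x)>
```

Repeating this sweep to convergence at each \beta on an increasing schedule of \beta values reproduces the splitting pattern of Figure 1.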
When the temperature is higher than 1/\beta_c, all the gaussian components degenerate into one gaussian, and the ML solution is given by w_i = \langle x \rangle. The first SBP can be computed analytically:

\beta = \beta_c \equiv \frac{1}{2\lambda_1},   (3.1)
where \lambda_1 is the largest eigenvalue of the covariance matrix of x. This means that the first symmetry breaking occurs when the component variance 1/(2\beta) equals the variance of the samples along the first principal component axis (Rose et al., 1990). This result can be applied to the other SBPs as follows. After a breaking, the two clusters drift apart as a function of \beta. If the distance between the clusters is sufficiently large, each data point contributes only to the covariance matrix of the closest cluster. The eigenvalues of this reduced matrix determine the next critical temperature, as in equation 3.1.

3.1 Below the First SBP. We can characterize the behavior of the symmetry breaking under the following assumption:
Assumption. The target distribution q(x) is defined on R (one dimension) and is assumed to be symmetric. The number of gaussian components h in the RBBM is taken to be even.

Let s_2 \equiv \langle (x - \langle x \rangle)^2 \rangle and s_4 \equiv \langle (x - \langle x \rangle)^4 \rangle denote, respectively, the second- and fourth-order moments of the distribution q(x). The fourth-order cumulant is defined as \kappa_4 \equiv s_4 - 3(s_2)^2 (\kappa_4 / s_4 is called the kurtosis in statistics). Under the assumption, the behavior of the first symmetry breaking can be classified into the following two cases:

Proposition 1.

1. If \kappa_4 \neq 0, the symmetry breaking is two-way: the components split into two clusters. The relation between the ML solution w_i and \beta in the neighborhood of the first SBP \beta_c is given by

\Delta\beta \simeq \frac{s_4}{6 (s_2)^4} (\Delta w_i)^2,   (3.2)

where \Delta\beta = \beta - \beta_c > 0 and \Delta w_i = w_i - \langle x \rangle.

2. If \kappa_4 = 0, the symmetry breaking is h-way: the individual components \Delta w_i are underdetermined up to the third-order approximation that we computed. This suggests that no preferred breaking pattern exists, although fourth-order corrections may somewhat limit this freedom. The relation between the ML solution w_i and \beta in the neighborhood of the first SBP \beta_c is written as

\Delta\beta \simeq \frac{1}{2 (s_2)^2}\, s_w^2,   (3.3)

where s_w^2 = \frac{1}{h} \sum_i (\Delta w_i)^2 and \Delta w_i = w_i - \langle x \rangle.
\kappa_4 is equal to zero when q(x) is gaussian; hence the condition measures the similarity between q(x) and a gaussian in the sense of the fourth-order cumulant. An outline of the proof of proposition 1 is given in appendix A. The result can be understood intuitively as follows. When \beta increases toward the critical point, the stability of the symmetric solution w_i = \langle x \rangle decreases. The stability is measured by the eigenvalues of the Hessian, the matrix of second derivatives of the likelihood. At \beta = \beta_c, some eigenvalues of the Hessian become zero, and the symmetric solution becomes unstable. When \kappa_4 \neq 0, only one eigenvalue becomes zero. The corresponding eigenvector is the direction of symmetry breaking, and the breaking is twofold: one cluster moves in the positive eigendirection and one in the negative eigendirection. When \kappa_4 = 0, however, all eigenvalues become zero simultaneously, and thus all directions become unstable. The breaking is h-fold, because the initial movement of each cluster center can be in any arbitrary direction.
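As a small numerical illustration of equation 3.1 (with hypothetical synthetic data), the first SBP follows directly from the largest eigenvalue of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# anisotropic gaussian cloud with variances 4.0 and 0.25 along the axes
X = rng.standard_normal((5000, 2)) * np.array([2.0, 0.5])
lam1 = np.linalg.eigvalsh(np.cov(X.T)).max()   # largest covariance eigenvalue
beta_c = 1.0 / (2.0 * lam1)                    # first SBP, equation 3.1
```

Below beta_c all kernels sit at the sample mean; just above it, they split along the first principal axis.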
Equation 3.3 can be interpreted as follows. Suppose that we have an infinite number of kernels. The model

p(x | w) = \frac{1}{h} \sum_{i=1}^{h} \varphi(x - w_i, \beta),

with \varphi(x, \beta) a gaussian density function with mean 0 and variance 1/(2\beta), can be written as

p(x | w) = \int p(w)\, \varphi(x - w, \beta)\, dw,

with p(w) some distribution over the kernel means. As an example of \kappa_4 = 0, consider the case in which the target distribution q(x) is gaussian. At large \beta, p(x | w) will then approach a gaussian distribution. Since p(x | w) is the convolution of p(w) with a gaussian distribution, it follows that p(w) for large \beta will also approach a gaussian distribution.

4 Nonmonotonic Generalization Bias
Since for the RBBM the effective number of adaptive parameters changes as a function of temperature, we must find the temperature that gives the best generalization. Overfitting results in a suboptimal effective number of components and a suboptimal clustering.
Although the generalization bias is a statistical quantity and unknown in general, we can estimate it through cross-validation. Alternatively, we can approximate the mean generalization bias, which is known as the NIC (Murata et al., 1991, 1994) or the effective number of parameters (Moody, 1992). Using this value, we can select the model that minimizes the sum of the training likelihood and the mean generalization bias. Another well-known generalization criterion is Akaike's information criterion (AIC), which counts the number of independent parameters. However, AIC assumes that the target distribution is an RBBM, which is a poor assumption around the SBPs.

4.1 Neural Information Criterion. Given a set of P training samples X^{(P)} = \{x_1, ..., x_P\} from a target probability distribution q(x), the ML solution W = W^{(P)} maximizes the empirical likelihood over the training samples,

R_{emp}^{(P)}(W; \beta) \equiv \frac{1}{P} \sum_{i=1}^{P} \log p(x_i | W; \beta).   (4.1)
The true likelihood is defined as the expectation value of the likelihood over the target distribution q(x):

R_{exp}(W; \beta) \equiv \langle \log p(x | W; \beta) \rangle.   (4.2)
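For concreteness, the two likelihoods of equations 4.1 and 4.2 can be estimated for the one-dimensional RBBM, with \langle \cdot \rangle approximated by a large held-out sample. The gaussian target and the parameter values below are illustrative assumptions:

```python
import numpy as np

def avg_loglik(X, W, beta):
    """Average log p(x | W; beta) over samples X for the 1-D RBBM of equation 2.1."""
    logk = 0.5 * np.log(beta / np.pi) - beta * (X[:, None] - W[None, :]) ** 2
    m = logk.max(1, keepdims=True)
    return np.mean(m[:, 0] + np.log(np.exp(logk - m).mean(1)))  # log-sum-exp mixture

rng = np.random.default_rng(0)
train, test = rng.standard_normal(100), rng.standard_normal(100000)
W = np.array([train.mean()])       # one-cluster ML solution, valid below the first SBP
beta = 0.25                        # below beta_c = 1 / (2 s_2) = 0.5 for this target
bias = avg_loglik(train, W, beta) - avg_loglik(test, W, beta)  # eq. 4.1 minus eq. 4.2
```

Averaged over many training sets, `bias` equals h_NIC / P; for a single run, the U / sqrt(P) fluctuation term of equation 4.7 below the definitions dominates.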
The generalization bias can be approximated by making a Taylor expansion of R_{emp}^{(P)}(W; \beta) around W^{(P)} and using the fact that the asymptotic distribution of
W^{(P)} is locally gaussian (Murata et al., 1991, 1994). Consequently, when P is large enough, the mean generalization bias is asymptotically given by

\langle R_{emp}^{(P)}(W^{(P)}; \beta) \rangle - \langle R_{exp}(W^{(P)}; \beta) \rangle \simeq \frac{h_{NIC}}{P},   (4.3)

where the average is taken over all training sets X^{(P)} of size P. h_{NIC} is the NIC, defined by

h_{NIC}(\beta) \equiv \mathrm{Tr}[H(W^*)^{-1} D(W^*)],   (4.4)
where W^* denotes the ML solution of the true likelihood. H(W) and D(W) are the matrices

H_{ij}(W) \equiv -\langle \partial_i \partial_j \log p(x; W, \beta) \rangle,   (4.5)

D_{ij}(W) \equiv \langle \partial_i \log p(x | W, \beta)\, \partial_j \log p(x | W, \beta) \rangle,   (4.6)

where \partial_i \equiv \partial / \partial w_i. If q(x) belongs to the model set, the NIC is equal to the AIC, because H(W^*) = D(W^*). In practice, when we apply the NIC to model selection, we need to evaluate the bias using only one training set. In this case, the bias is given by

R_{emp}^{(P)}(W^{(P)}; \beta) - R_{exp}(W^{(P)}; \beta) \simeq \frac{h_{NIC}}{P} + \frac{U}{\sqrt{P}},   (4.7)
where U = \sqrt{P}\, \{ R_{emp}^{(P)}(W^*; \beta) - R_{exp}(W^*; \beta) \} is a random variable of order 1 with zero mean. It can be shown that U is the same for all models within a nested set of models (Murata et al., 1994), which holds for the RBBM. Therefore, although the U/\sqrt{P} term dominates the NIC term, it does not affect the model selection.
In the following sections, we compute the behavior of the NIC near the SBP of the RBBM. The effective number of adjustable parameters in the RBBM is constant between symmetry breakings and increases stepwise with increasing \beta. Since the NIC measures the complexity of the model, it is expected to increase as \beta increases. Our analysis indeed shows a linear increase of the NIC with \beta just before the symmetry breaking (\beta < \beta_c). Just after the symmetry breaking (\beta > \beta_c), however, our analysis shows a decrease of the NIC, depending on \kappa_4.

4.2 Above the First SBP. When there is only one cluster (\beta < \beta_c), we can compute h_{NIC} explicitly. The following proposition shows that the generalization bias increases linearly in proportion to \beta:
1418
Shotaro Akaho and Hilbert J. Kappen
Proposition 2. If β < β_c, the NIC is given by

$$h_{\mathrm{NIC}}(\beta) = 2\beta\,\mathrm{Tr}[V_x],$$  (4.8)
where V_x is the covariance matrix of q(x).

An outline of the proof of proposition 2 is given in appendix B. Intuitively, the NIC measures the sensitivity of the likelihood against sample fluctuation. Proposition 2 explains this intuition because for small β, one has broad kernels whose centers are less sensitive to sample fluctuations.

4.3 Below the First SBP. Because the situation below the critical temperature is very complicated, we analyze the NIC under the same assumption as in section 3.1:

Proposition 3. Under the assumption, and also if $k_4 \neq 0$ and $s_4 \neq (s^2)^2$,

$$\lim_{\beta \downarrow \beta_c} \frac{\partial}{\partial \beta}\, h_{\mathrm{NIC}}(\beta) = -\infty.$$  (4.9)

The condition $s_4 \neq (s^2)^2$ applies to all distributions except for a mixture distribution of 2 δ-functions. If $q(x) = (\delta(x-1) + \delta(x+1))/2$, one obtains $\partial h_{\mathrm{NIC}}(\beta_c)/\partial\beta = -4$.
An outline of the proof of proposition 3 is given in appendix C. Proposition 3 states that the NIC of the RBBM decreases even though the effective number of parameters increases when the symmetry breaking at the first SBP is two-way. Since the training error is approximately constant around the SBP, we conclude that this model gives a slightly better generalization error (in terms of the NIC) just below the critical temperature than at or just above the critical temperature. Whether this anomalous behavior affects the optimal value of β depends on the functional dependence of the training error on β. Although this effect causes a local minimum in the generalization error as a function of β around the SBP, its global minimum might be attained at much higher or lower values of β. It is not easy to analyze the case k_4 = 0, since the symmetry breaking is then h-way and the ML solution is underdetermined by the third-order approximation, as shown in proposition 1.

5 Experiments
In this section, we show some computer simulation results for the two cases with different fourth-order cumulants presented in proposition 1. In both cases, the target distribution q(x) is created so that the mean is 0.0 and the variance is 1.0. Therefore, log β_c ≈ −0.693, and the ML solution for training
Figure 2: ML solution versus log(β). Data are generated from a distribution with k_4 ≠ 0. Horizontal axis: log(β); vertical axis: w or x. Dots show the ML solution for each temperature; dashes on the right show the training samples.
samples breaks around this value. Both the number of training samples (P) and the number of gaussian components (h) are set to 100. We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to optimize the empirical likelihood. We initialize the EM algorithm with one gaussian component on each of the training samples. The test set consists of 100,000 samples, which are generated from the same distribution as the training samples.

5.1 The Case of k_4 ≠ 0. The target distribution is a mixture of two gaussians,

$$q(x) = \frac{1}{2}\sqrt{\frac{C_1}{\pi}}\left[\exp\{-C_1 (x - C_2)^2\} + \exp\{-C_1 (x + C_2)^2\}\right],$$

where C_1 = 12.5, C_2 = 0.96. C_1 and C_2 are chosen such that the variance of q(x) is 1.0. An example of the ML solution as a function of the temperature around the first SBP is shown in Figure 2. The approximate generalization bias, as given by the NIC, as a function of temperature is shown in Figure 3. Note that the vertical tangent at the breaking point is in agreement with proposition 3.
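To make this setup concrete, the following sketch (our own illustration, not the authors' code) draws P samples from the stated two-gaussian target and runs a plain EM update for an equal-weight mixture of fixed-width kernels, initialized with one kernel per sample as described above. At β = 0.4 < β_c, all centers are expected to merge into a single cluster at the sample mean, and the empirical fourth cumulant k₄ is clearly nonzero.

```python
import numpy as np

rng = np.random.default_rng(1)
C1, C2 = 12.5, 0.96                      # constants from the text
sigma = np.sqrt(1.0 / (2.0 * C1))        # component standard deviation of q(x)

# Draw P samples from q(x): equal mixture of N(+C2, sigma^2) and N(-C2, sigma^2)
P = 100
signs = rng.choice([-1.0, 1.0], size=P)
x = signs * C2 + rng.normal(0.0, sigma, size=P)
k4_hat = np.mean(x ** 4) - 3.0 * np.mean(x ** 2) ** 2   # clearly nonzero here

# EM for an equal-weight mixture of h gaussian kernels with common fixed
# width 1/sqrt(2 beta); only the centers w are re-estimated, and the
# initialization puts one kernel on each training sample, as in the text.
beta = 0.4                               # below beta_c = 0.5: one cluster expected
w = x.copy()                             # h = P = 100 centers
for _ in range(300):
    logr = -beta * (x[:, None] - w[None, :]) ** 2
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)    # E-step: responsibilities r[p, i]
    w = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)   # M-step: new centers
print(k4_hat, w.min(), w.max())
```

Above the critical temperature, all responsibilities become equal, so every center converges to the sample mean, in line with proposition 2's "one cluster" regime.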
Figure 3: NIC as a function of log(β) (magnified around the symmetry breaking point). Data are generated from a distribution with k_4 ≠ 0. Horizontal axis: log(β); vertical axis: NIC multiplied by the number of training samples; dashed line: β_c.
The generalization bias averaged over 50 experiments with different training sets is shown in Figure 4. It differs from Figure 3 in two ways. First, it does not show the vertical tangent seen in the individual runs, because the vertical tangent appears only very close to the symmetry breaking point and the symmetry breaking point fluctuates between training sets. Second, the term U in equation 4.7 is rather different for different training sets, giving a smoothing effect on the average generalization bias.

5.2 The Case of k_4 = 0. The target distribution is one gaussian with unit variance,

$$q(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2).$$

In this case, the convergence of the EM algorithm is much more unstable than in the case of the previous section, because of the reasons mentioned in proposition 1. A typical ML solution as a function of temperature is shown in Figure 5.
Figure 4: Generalization bias from cross-validation, averaged over 50 experiments, as a function of log(β). Data are generated from a distribution with k_4 ≠ 0. Horizontal axis: log(β); vertical axis: generalization bias multiplied by the number of training samples; dashed line: average of empirical values of β_c.
The generalization bias averaged over 50 experiments with different random numbers is shown in Figure 6. We did not observe a single instance where the likelihood is maximal around the first SBP, as in the case of k_4 ≠ 0.

6 Conclusion
We have shown the nonmonotonic behavior of the generalization bias of a special class of gaussian mixture models, the radial basis Boltzmann machines. For high temperature (large variance in the gaussian kernels), the generalization error increases linearly with β. On the other hand, below the critical temperature, the symmetry breaking phenomenon depends critically on the value of the fourth cumulant k_4.

If k_4 ≠ 0, the generalization bias decreases with β below the critical temperature. This means that the NIC decreases during the symmetry breaking process. After symmetry breaking is complete, the NIC increases again. While the NIC decreases, the effective number of adaptive parameters increases because the kernels split up. It is normally assumed that the NIC measures the
Figure 5: ML solution versus log(β). Data are generated from a gaussian distribution (k_4 = 0). Horizontal axis: log(β); vertical axis: w or x. Dots show the ML solution for each temperature; dashes on the right show the training samples.
effective number of adaptive parameters. We conclude that this relation is violated around the SBPs. This anomalous behavior affects the optimal model selection, which in this article is the choice of the optimal β. When the optimal generalization error occurs around an SBP, increasing β will decrease (or at least not increase) the training error as well as decrease the NIC.

If k_4 ≈ 0, we predict theoretically that the symmetry breaking is h-way, but this is not observed numerically. We attribute this discrepancy to the delicacy of the symmetry breaking process and the numerical instability of the optimization procedure.

Although our analysis was restricted to one dimension, we expect our results to hold in higher dimensions as well. If the number of kernels is large, the condition of an even number of kernels is not very strict. If the target distribution is nonsymmetric and contains odd moments, other results could be observed.

Appendix A: Outline of the Proof of Proposition 1
We assume ⟨x⟩ = 0 without loss of generality. Since we assume an even number of gaussian components, let h = 2h′.
Figure 6: Generalization bias from cross-validation, averaged over 50 experiments, as a function of log(β). Data are generated from a gaussian distribution (k_4 = 0). Horizontal axis: log(β); vertical axis: generalization bias multiplied by the number of training samples; dashed line: average of experimental values of β_c.
Since the target distribution is symmetric and the number of gaussians is even, the symmetry breaking will be symmetric, and we can write

$$p(x\,|\,W;\beta) = \frac{1}{2h'} \sum_{i=1}^{h'} \sqrt{\frac{\beta}{\pi}}\; p_i(x\,|\,w_i,\beta),$$  (A.1)

where

$$p_i(x\,|\,w_i,\beta) \equiv \exp(-\beta(x - w_i)^2) + \exp(-\beta(x + w_i)^2).$$  (A.2)
The derivative of the log-likelihood is defined by

$$L_i(x\,|\,W,\beta) \equiv \frac{\partial}{\partial w_i} \log p(x\,|\,W,\beta) = \frac{\partial_i p_i(x\,|\,w_i,\beta)}{p(x\,|\,W,\beta)}.$$  (A.3)

The expectation value of L_i under the target distribution q is equal to zero at the ML solution,

$$\langle L_i(x\,|\,W^*,\beta)\rangle = 0.$$  (A.4)
Above the critical temperature, W = 0. Below the critical temperature, W will become nonzero. Let Δw_i denote the weight describing the mean of kernel i. In order to obtain a nontrivial solution for Δw_i as a function of Δβ = β − β_c, we expand L_i(x | W, β) to third order in W and to first order in β around β = β_c and W = 0:

$$\langle L_i(x\,|\,W,\beta)\rangle = \frac{2}{h'}\,\Delta\beta\,\Delta w_i - \frac{1}{2h'^2 (s^2)^2}\left(\frac{s_4}{(s^2)^2} - 1\right)\Delta w_i \sum_{j\neq i} \Delta w_j^2 + \frac{1}{6h'}\left\{\frac{s_4}{(s^2)^4} - \frac{3}{(s^2)^2} - \frac{3}{h'(s^2)^2}\left(\frac{s_4}{(s^2)^2} - 1\right)\right\}\Delta w_i^3 + \text{higher-order terms}.$$  (A.5)
Setting ⟨L_i(x | W, β)⟩ = 0 and neglecting higher-order terms, we obtain h′ simultaneous equations in Δβ and the Δw_i. Since the solution Δw_i = 0 gives a local minimum of the likelihood, we can assume Δw_i ≠ 0. If s_4 ≠ 3(s²)², we obtain Δw_i² = Δw_j² and Δβ = {s_4 / 6(s²)⁴} Δw_i², which is the first case of proposition 1. On the other hand, if s_4 = 3(s²)², the simultaneous equations degenerate to the single equation Δβ = Σ_i Δw_i² / {2h′(s²)²}. This means that the w_i cannot be determined uniquely from β when we neglect higher-order terms.

Appendix B: Outline of the Proof of Proposition 2
If the temperature is higher than the first SBP, all gaussians degenerate to one gaussian; therefore, the only thing we have to do is calculate the NIC for the one-gaussian model.¹ We derive D(W^*) and H(W^*) for the one-gaussian model as follows:

$$D_{ij}(W^*) = 4\beta^2 V_{ij},$$  (B.1)
$$H_{ij}(W^*) = 2\beta\,\delta_{ij},$$  (B.2)

where V_{ij} is the covariance between x_i and x_j and δ_{ij} is Kronecker's delta. Therefore, the NIC is given by

$$h_{\mathrm{NIC}}(\beta) = \mathrm{Tr}[H(W^*)^{-1} D(W^*)] = 2\beta\,\mathrm{Tr}[V_x].$$  (B.3)
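The one-gaussian computation in equations B.1 through B.3 is easy to verify numerically in one dimension. The sketch below is our own illustration (the values of β and of the target variance are assumptions): it estimates D from the per-sample scores of p(x | w) ∝ exp(−β(x − w)²) and checks that h_NIC = 2β Var(x).

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.7                                # assumed inverse temperature
x = rng.normal(0.0, 1.3, size=200_000)    # assumed target q(x), variance 1.69

w_ml = x.mean()                           # ML center of the single gaussian kernel
score = 2.0 * beta * (x - w_ml)           # d log p / dw for p ~ exp(-beta (x - w)^2)

D = np.mean(score ** 2)                   # equation B.1: 4 beta^2 Var(x)
H = 2.0 * beta                            # equation B.2
h_nic = D / H                             # equation B.3: 2 beta Var(x)
print(h_nic, 2.0 * beta * x.var())
```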
Appendix C: Outline of the Proof of Proposition 3

Similar to appendix A, we assume ⟨x⟩ = 0 without loss of generality. The symmetry breaking is two-way from the assumption of proposition 3. As a

¹ This is not strictly true, because the matrices H and D are still h-dimensional. However, one can show that this h dependence drops out in equation 4.4.
result, one can show that the expansion of h_NIC around the SBP is identical to the case of only two gaussian kernels. Therefore, we analyze the NIC of the model of two gaussians. The model of two gaussians is written as

$$p(x\,|\,w_1, w_2;\beta) = \frac{1}{2}\sqrt{\frac{\beta}{\pi}}\left[\exp\{-\beta(x - w_1)^2\} + \exp\{-\beta(x + w_2)^2\}\right].$$  (C.1)
D(w_1, w_2) and H(w_1, w_2) can be calculated from their definitions. At the ML solution, we have w_1 = w_2 = w. Therefore, we obtain

$$D(w, w) = \begin{pmatrix} d_0 & d_2 \\ d_2 & d_0 \end{pmatrix},$$  (C.2)
$$H(w, w) = \begin{pmatrix} d_1 & d_2 \\ d_2 & d_1 \end{pmatrix},$$  (C.3)

where

$$d_0 = 4\beta^2\left\langle (x - w)^2\, \frac{(p_1)^2}{p^2} \right\rangle,$$  (C.4)
$$d_1 = d_0 + 2\beta\left\langle \frac{p_1}{p} \right\rangle - 4\beta^2\left\langle (x - w)^2\, \frac{p_1}{p} \right\rangle,$$  (C.5)
$$d_2 = -4\beta^2\left\langle (x - w)(x + w)\, \frac{p_1 p_2}{p^2} \right\rangle,$$  (C.6)

where p_1 = exp(−β(x − w)²), p_2 = exp(−β(x + w)²), and p = p_1 + p_2. Let

$$\hat h_{\mathrm{NIC}}(\beta, w) = \mathrm{Tr}[H(w, w)^{-1} D(w, w)],$$  (C.7)
which is equal to h_NIC(β) for w = w^*. Using

$$\hat h_{\mathrm{NIC}}(\beta, w) = \mathrm{Tr}[H^{-1} D] = \frac{2(d_0 d_1 - d_2^2)}{d_1^2 - d_2^2},$$  (C.8)

and expanding $\hat h_{\mathrm{NIC}}(\beta, w)$ around the first SBP (β = β_c, w^* = 0) with respect to β and w, we obtain

$$\hat h_{\mathrm{NIC}}(\beta, w) = \hat h_{\mathrm{NIC}}(\beta_c, 0) + \left\{\frac{\partial}{\partial\beta}\hat h_{\mathrm{NIC}}(\beta_c, 0)\right\}\Delta\beta + \frac{1}{2}\left\{\frac{\partial^2}{\partial w^2}\hat h_{\mathrm{NIC}}(\beta_c, 0)\right\}\Delta w^2 + \text{higher-order terms},$$  (C.9)
where both the second and the third terms on the right-hand side of equation C.9 are of order Δβ, since Δw² ≃ {6(s²)⁴ / s_4} Δβ at the ML solution from equation 3.2. Substituting d_0, d_1, d_2 by their values, the coefficient of the second term is given by

$$\frac{\partial}{\partial\beta}\hat h_{\mathrm{NIC}}(\beta_c, 0) = 2s^2,$$  (C.10)

and the coefficient of the third term, before substituting β = β_c, is given by

$$\frac{1}{2}\frac{\partial^2}{\partial w^2}\hat h_{\mathrm{NIC}}(\beta, 0) = 4\beta\left(1 - 2\beta s^2 - \frac{1 - 4\beta^2 s_4}{1 - 2\beta s^2}\right).$$  (C.11)

When s_4 ≠ (s²)², using the facts that β_c = 1/(2s²) and s_4 ≥ (s²)², we infer that equation C.11 diverges to −∞ as β converges to β_c from the right. The only case in which s_4 = (s²)² is when q(x) is equal to δ(x) or (δ(x − a) + δ(x + a))/2. Since there is no SBP in the former case, we need to consider only the latter case. Without loss of generality, we assume a = 1, and the right derivative follows from a simple calculation:

$$\frac{\partial}{\partial\beta}\hat h_{\mathrm{NIC}}(\beta_c, 0) = -4.$$  (C.12)
References

Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Barkai, N., Seung, H. S., & Sompolinsky, H. (1993). Scaling laws in learning of classification tasks. Physical Review Letters, 70, 3167–3170.
Bezdek, J. C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans. on PAMI, 2, 1–8.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. Ser. B, 39, 1–38.
Jacobs, R. A., & Jordan, M. I. (1991). A competitive modular connectionist architecture. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 767–773). San Mateo, CA: Morgan Kaufmann.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kappen, B. (1993). Using Boltzmann machines for probability estimation: A general framework for neural network learning. In S. Gielen et al. (Eds.), Proc. of ICANN'93 (pp. 521–526). Berlin: Springer-Verlag.
Kappen, B. (1995). Deterministic learning rules for Boltzmann machines. Neural Networks, 8, 537–548.
Moody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S.
J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Murata, N., Yoshizawa, S., & Amari, S. (1991). A criterion for determining the number of parameters in an artificial neural network model. In T. Kohonen et al. (Eds.), Artificial neural networks (ICANN) (pp. 9–14). Amsterdam: Elsevier.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of parameters for an artificial neural network model. IEEE Trans. on Neural Networks, 5, 865–872.
Nijman, M. J., & Kappen, H. J. (1997). Symmetry breaking and training from incomplete data with radial basis Boltzmann machines. International Journal of Neural Systems, 8, 301–316.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
Rose, K., Gurewitz, E., & Fox, G. (1990). Statistical mechanics of phase transitions in clustering. Physical Review Letters, 65, 945–948.
Titterington, D. M. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Vapnik, V. A. (1984). Estimation of dependences based on empirical data. New York: Springer-Verlag.

Received October 14, 1998; accepted May 5, 1999.
LETTER

Communicated by Chris Williams

Efficient Block Training of Multilayer Perceptrons

A. Navia-Vázquez
A. R. Figueiras-Vidal
DTC, Universidad Carlos III de Madrid, 28911-Leganés (Madrid), Spain

The attractive possibility of applying layerwise block training algorithms to multilayer perceptrons (MLPs), which offers initial advantages in computational effort, is refined in this article by means of introducing a sensitivity correction factor in the formulation. This results in a clear performance advantage, which we verify in several applications. The reasons for this advantage are discussed and related to implicit relations with second-order techniques, natural gradient formulations through Fisher's information matrix, and sample selection. Extensions to recurrent networks and other research lines are suggested at the close of the article.

1 Introduction
Multilayer perceptrons (MLPs) offer a powerful approach for solving estimation and classification problems, and gradient descent with error backpropagation (BP) is the classical algorithm for training them. The BP algorithm has problems not only with local minima (also present in other search methods) but also with the speed of convergence. Similar drawbacks exist to some extent in many BP variants. The so-called block learning methods represent an interesting alternative because they proceed blockwise, decomposing the global problem into simpler layer-by-layer minimizations, such that training consists of iteratively solving those minimizations by means of efficient methods. This approach makes gradient descent unnecessary (hence avoiding its problems) and has foreseeable additional benefits, such as lower computational cost along with easier analysis and control of the training process.

We will disregard as block training techniques those hybrid (between block and gradient) algorithms, such as the moving targets (MT) algorithm (Rohwer, 1991), the LS-based algorithm proposed by Di Claudio, Parisi, and Orlandi (1993), and the dynamic hidden targets (DHT) algorithm (Song & Hassoun, 1990), because they use gradient descent to some extent. We will therefore consider as pure block training methods either those that rely on linearization of the network (Douglas & Meng, 1991; Deller & Hunt, 1992) or those that use the inverse of the activation function to propagate target values to the previous layer (Pethel & Bowden, 1992; Biegler-König & Bärmann, 1993). Furthermore, we will focus here on the latter because of

Neural Computation 12, 1429–1447 (2000) © 2000 Massachusetts Institute of Technology
their proven superior convergence. Basically, these methods decompose the global problem into minimization problems at every layer—mainly, least squares (LS) minimizations—with a twofold purpose: obtaining optimal weight values for every layer and backpropagating target values to previous layers.

In the following section, we present the block training framework in more detail and qualitatively analyze a direct implementation known as least squares backpropagation (LSB) (Biegler-König & Bärmann, 1993). In section 3, we propose a modification to LSB by solving reduced-sensitivity minimizations, thereby obtaining the reduced-sensitivity least-squares block (RS-LSB) algorithm. Section 4 compares the performance of the proposed algorithm with some other well-known training methods, using some standard simulation examples. In section 5, we compare the computational cost per epoch of these algorithms. In section 6, we give some theoretical insights about the relationship between RS-LSB and some other efficient minimization techniques, such as natural gradient or second-order methods; additionally, we comment on its relationship with sample selection and regularization strategies. Finally, we present our conclusions and propose some further work.

2 Block Training

2.1 General Formulation. To preserve generality, a feedforward neural network (FFNN) scheme with K layers has been adopted, layer k containing N_k neurons. The goal of the network is to learn the associations (x_p, d_p) for a total of P patterns, where x_p is the input and d_p the corresponding desired output. From now on, matrix notation is adopted by storing the input and output patterns as the rows of matrices X (P × N_0) and D (P × N_K), respectively. Using this notation and taking into account that every layer contains linear (weighted sum) and nonlinear (sigmoid-like activation function) components, the update equations for layer k are:

$$Z^{(k)} = O^{(k-1)} W^{(k)},$$  (2.1)
$$O^{(k)} = [\mathbf{1}\,|\tanh Z^{(k)}],$$  (2.2)

where | denotes augmentation, O^{(k−1)} is the output of the preceding layer, W^{(k)} is the weight matrix interconnecting layers k − 1 and k, and Z^{(k)} is the matrix formed with the state values. In equation 2.2, a hyperbolic tangent function, which maps state values to the range (−1, +1), has been used without loss of generality, and a unit vector (1) has been appended to O^{(k)} such that offset values for layer k + 1 can be stored as the first row of W^{(k+1)}. Additionally, matrices T^{(k)} and Q^{(k)} store output and state target values, respectively. Thus, O^{(0)} = X (at layer 0) and T^{(K)} = D (at layer K).

In block training, the global minimization problem is suboptimally decomposed by solving at every layer (and iteratively until convergence)
“double” least-squares problems—that is, with respect to the weights (to solve the linear part, equation 2.1, optimally) and with respect to the previous layer outputs (to propagate errors into the preceding layer). These adjustments are usually carried out by solving LS minimizations because, for linear problems, they lead to well-known robust and efficient algorithms. A more detailed description of a block training framework is given in the next section, where we describe a basic implementation.

2.2 A Qualitative Analysis of Direct Application. One of the simplest block training methods is the LSB algorithm (Biegler-König & Bärmann, 1993). It proceeds by applying at every layer the following three-step procedure. First, the inverse of the activation function is applied to the target outputs in order to compute target states,

$$Q^{(k)} = \tanh^{-1}\left(T^{(k)}\right).$$  (2.3)
Then the optimal weight matrix $\hat W^{(k)}$ for that layer can be obtained by solving

$$\min_{\hat W^{(k)}} \left\| O^{(k-1)} \hat W^{(k)} - Q^{(k)} \right\|_2.$$  (2.4)
Once the new weights are computed, a second minimization problem is solved to obtain output target values for the previous layer (T^{(k−1)}):

$$\min_{T^{(k-1)}} \left\| T^{(k-1)} \hat W^{(k)} - Q^{(k)} \right\|_2.$$  (2.5)
Thus, a training epoch is completed after applying this three-step procedure at layers K, K − 1, . . . , 1. This procedure is repeated until an appropriate convergence criterion is reached.

A final consideration has to be taken into account when implementing LSB. Specifically, the target values obtained for the previous layer usually lie outside the range (−1, +1), requiring a normalization procedure to map them into the appropriate interval before applying the inverse activation function, equation 2.3. A scaling matrix C^{(k)} has to be computed at every layer if normalization is necessary (a detailed description of the computation of C^{(k)}, as well as of its inverse, can be found in Biegler-König & Bärmann, 1993). The new matrices are

$$\tilde T^{(k-1)} = T^{(k-1)} \left(C^{(k)}\right)^{-1}, \qquad \tilde W^{(k)} = C^{(k)} W^{(k)},$$

which ensures that $\tilde T^{(k-1)} \tilde W^{(k)} = T^{(k-1)} W^{(k)}$ and that the values in $\tilde T^{(k-1)}$ lie in (−1, +1), such that equation 2.3 can be applied.
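As an illustration of the three-step procedure (equations 2.3 through 2.5), the following sketch applies one layer step in matrix form. It is our own simplified rendering, not the authors' implementation: instead of the scaling matrix C^{(k)}, targets are merely clipped into tanh's invertible range, and the network and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsb_layer(O_prev, T, eps=1e-6):
    """One LSB step at a layer (equations 2.3-2.5).  Simplification: instead
    of the scaling matrix C^(k), targets are clipped into tanh's range."""
    Q = np.arctanh(np.clip(T, -1 + eps, 1 - eps))        # equation 2.3
    Oa = np.hstack([np.ones((O_prev.shape[0], 1)), O_prev])
    W, *_ = np.linalg.lstsq(Oa, Q, rcond=None)           # equation 2.4
    # Equation 2.5: back-propagate targets to the previous layer's outputs,
    # solving W[1:].T @ t_p = q_p - offsets for every pattern p at once
    T_prev, *_ = np.linalg.lstsq(W[1:].T, (Q - W[0]).T, rcond=None)
    return W, T_prev.T

# Hypothetical example: propagate the desired outputs D through the output
# layer of a 2-4-1 network with random first-layer weights
P, N0, N1 = 50, 2, 4
X = rng.uniform(-1, 1, (P, N0))
D = np.sign(X[:, :1] * X[:, 1:])                         # toy targets in {-1, +1}
O1 = np.tanh(np.hstack([np.ones((P, 1)), X]) @ rng.uniform(-0.1, 0.1, (N0 + 1, N1)))
W2, T1 = lsb_layer(O1, D)
print(W2.shape, T1.shape)
```

Repeating this step at layers K, K − 1, . . . , 1 completes one LSB epoch in the sense described above.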
Figure 1: Contour plot and comparison of convergence trajectories of LSB, BP, and RS-LSB for the simple test case.
In some practical situations, we have found that LSB has problems related to the final error and weight values. To analyze the behavior of LSB, a very simple model has been selected which, in spite of its simplicity, allows us to highlight the weak points of LSB and to propose an improved block training algorithm. The test problem is the following: Given a network with a single hidden neuron, linear activation in the second layer,¹ and specific target outputs, find the optimal weights using LSB with a random weight initialization. A test case can be created with random input values chosen in the interval (−1, 1) and desired outputs computed using a model with w^{(1)} = 4 and w^{(2)} = 1.5.² The associated error surface can easily be computed for this case and is depicted in Figure 1 as a contour plot with respect to w^{(1)} and w^{(2)}. As can be observed, there exists a minimum at (1.5, 4), and the surface resembles a valley with a curved bottom (marked with a dashed line); in this area, partial derivatives become very small, and the convergence of
¹ In this case, no offset values have been used, and the weight matrices reduce to real numbers w^{(1)} and w^{(2)}.
² The choice of these values is not critical whenever w^{(1)} is not very small; otherwise, the model would be nearly linear and the experiment irrelevant.
gradient-descent algorithms slows dramatically. This effect is corroborated by observing the gradient-descent trajectory in Figure 1, where the first 20 weight values have been represented by circles. When LSB is used, it converges to the bottom of the valley in a few iterations (mainly adjusting w^{(2)}). Once there, it moves along the bottom, not toward the minimum but in the opposite direction (see Figure 1). We observe that LSB is not able to find a reasonable solution and, furthermore, exhibits the undesirable behavior of moving away from the minimum.³ The final situation reached by LSB is a model in which the hidden neuron works in its linear region (small weight values in the first layer and, hence, small state values for all the patterns). Similar effects can be observed in networks with many neurons, where the LSB algorithm fails to obtain a good solution, as will be shown in section 4 by means of several real-world applications.

3 Reduced Sensitivity Block Training
In an MLP architecture, an error in state $z_p^{(k)}$ is propagated to the output of the neuron with a different gain $s_p^{(k)}$ (referred to as sensitivity) for every pattern. This “error gain” can be approximately computed as the derivative of the activation function at $z_p^{(k)}$; that is, sensitivity is high in the linear region of the activation function and low in the saturation regions. In a block training framework, the inverse of the activation function is used to compute target states from target outputs (see equation 2.3), and problems appear when $s_p^{(k)}$ is very small, because the target state values (Q^{(k)}) then show high variability even for small perturbations in T^{(k)} (the “inverse error gain” is $1/s_p^{(k)}$). We propose to take advantage of a weighted LS (WLS) cost function (the weighting values being the sensitivities s_p) to obtain an improved solution:⁴

$$E(w) = \sum_{p=1}^{P} \left[ s_p\, e_p(w) \right]^2 = \sum_{p=1}^{P} \left[ s_p \left( o_p^T w - q_p \right) \right]^2.$$  (3.1)

Minimizing E(w) with respect to w can be carried out using classical LS techniques (such as the Moore-Penrose pseudoinverse, QR or SVD decomposition, and gaussian elimination) by redefining q and O as follows,

$$q_s = \mathrm{diag}(s_p)\, q,$$  (3.2)
$$O_s = \mathrm{diag}(s_p)\, O,$$  (3.3)

³ This happens even when the weights are initialized very close to the optimal values.
⁴ Indexes referring to neuron and layer have been dropped for convenience. This error is local, and the WLS minimization has to be repeated for every neuron in the network.
and solving

$$\min_{w} \left\| O_s\, w - q_s \right\|_2.$$  (3.4)
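The WLS solution of equations 3.1 through 3.4 amounts to row-scaling the LS system by the sensitivities before solving. A minimal sketch (the inputs, targets, and states for this single neuron are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reduced-sensitivity weighted LS for a single neuron (equations 3.1-3.4)
P = 200
O = np.hstack([np.ones((P, 1)), rng.uniform(-1, 1, (P, 2))])   # augmented inputs
q = rng.uniform(-2, 2, P)            # target states for this neuron (made up)
z = rng.uniform(-3, 3, P)            # current states (made up)
s = 1.0 / np.cosh(z) ** 2            # sensitivity tanh'(z), i.e., b = 1

O_s = s[:, None] * O                 # equation 3.3
q_s = s * q                          # equation 3.2
w, *_ = np.linalg.lstsq(O_s, q_s, rcond=None)   # equation 3.4

# Stationarity of the WLS cost (3.1): its gradient O_s^T (O_s w - q_s) vanishes
grad = O_s.T @ (O_s @ w - q_s)
print(w, np.abs(grad).max())
```

The vanishing gradient confirms that the returned w minimizes the weighted cost of equation 3.1.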
If we apply this WLS formulation to the LSB algorithm, we obtain the RS-LSB algorithm. In RS-LSB, a generalized expression for the sensitivity can be used:

$$s_{p,j}^{(k)} = \left( \tanh'\!\left( z_{p,j}^{(k)} \right) \right)^{b},$$  (3.5)
where tanh′ is the derivative of the activation function. Expression 3.5 is qualitatively equivalent to simply evaluating the derivative, but it provides an additional adjustable parameter b. Some interesting characteristics arise with this sensitivity definition, because every neuron will selectively focus on the patterns with maximal sensitivity.⁵ At the first layer, the maximal sensitivity locus for neuron j is a strip in the N_0-dimensional input space, centered around the hyperplane $o^{(0)} w_j^{(1)} = 0$ and vanishing as the distance to the hyperplane increases. In the following layers, this locus can take different shapes, because states are computed using nonlinear combinations of previous layer outputs.

For illustration, we present a simple surface modeling problem. We constructed a training set using 400 samples from the surface depicted in Figure 2a and trained a 2-4-1 MLP using RS-LSB. The resulting surface is depicted in Figure 2b; in Figure 2c we represent the four sensitivity strips in the input space using gray shades (one strip for every neuron in the hidden layer, drawn superimposed), where every hyperplane (the center of every strip) has been marked using dashed lines. Several trials yielded different arrangements of strips at the first layer, but in every case a circular sensitivity strip at the output layer (see Figure 2d) was obtained, and the problem was solved. If we regard the problem represented in Figure 2a as a binary classification task (with a circular boundary between classes as solution), we observe that the network concentrates on patterns closer to such a boundary (see Figure 2d), thereby implementing a pattern selection strategy (this aspect will be discussed further in section 6.4).
We have observed through extensive simulation that the final performance is not very dependent on the exact location of the strips and that the block training method manages to generate a strip disposition which, combined in following layers, leads to a satisfactory solution (the low variance of the experiments described in section 4 also suggests this fact). Note that

⁵ Standard BP behaves in an analogous manner (neurons are more strongly sensitive to being trained by patterns that fall in their “active” region), while LSB does not; hence, one of the benefits of RS-LSB is to incorporate this pattern discrimination behavior into a block training framework.
Figure 2: Visualization of the sensitivity strips in a simple case: (a) original surface and (b) NN approximation using a 2-4-1 MLP trained with RS-LSB. The sensitivity strips for the input layer can be observed in (c) (four superimposed, one for every neuron); the maximal sensitivity locus at the output layer is depicted in (d).
if b = 0, the algorithm reduces to LSB (strips having infinite width) and the network loses approximation capabilities, as will be shown in section 4 by means of several examples.

If small random values are used to initialize the weights, the neuron states are usually close to zero (z_p ≈ 0) at the initial stages, yielding sensitivity values very close to unity. In this case, there is very little discrimination (as in the LSB algorithm), and ill conditioning in the next layer may arise (especially when N_k > N_{k−1}). To avoid this, sensitivity is computed using modified state values in matrix Z^{(k)}, scaled and shifted to the range (−a, a):

$$s_{p,j}^{(k)} = \left( \tanh'\!\left( a\, \frac{2 z_{p,j}^{(k)} - (m_1 + m_2)}{m_2 - m_1} \right) \right)^{b},$$  (3.6)
where

$$m_1 = \min_{p,j}\left\{ z_{p,j}^{(k)};\; p = 1, \ldots, P;\; j = 1, \ldots, N_k \right\},$$
$$m_2 = \max_{p,j}\left\{ z_{p,j}^{(k)};\; p = 1, \ldots, P;\; j = 1, \ldots, N_k \right\}.$$
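A sketch of the rescaled sensitivity computation of equation 3.6 follows (the value a = 2 and the test states are our own choices): even when all states are tiny after random weight initialization, the affine map onto (−a, a) keeps the sensitivities spread over a useful range, so pattern discrimination is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

def sensitivities(Z, a=2.0, b=1.0):
    """Equation 3.6: map states affinely onto (-a, a) before taking (tanh')^b."""
    m1, m2 = Z.min(), Z.max()
    Zs = a * (2.0 * Z - (m1 + m2)) / (m2 - m1)
    return (1.0 / np.cosh(Zs) ** 2) ** b      # tanh'(u) = 1 / cosh(u)^2

Z = rng.normal(0.0, 1e-3, size=(50, 4))       # tiny states after random init
S = sensitivities(Z)
print(S.min(), S.max())                       # spread out despite tiny states
```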
The sensitivity computation depends not only on b but also on the weight values of every neuron (s = (tanh′(o^T w))^b), and therefore the algorithm is not severely constrained by a particular choice of b (whenever it is greater than zero; otherwise, the LSB algorithm is obtained). In fact, we have found that similar sensitivity values can be obtained with different values of b given appropriate weight values. Because the method is not highly sensitive to the selection of b, we did not consider it necessary to explore this further analytically (a nontrivial task). We suggest, instead, trying several values and retaining the best results (usually it is enough to try b = 1, 2, 3), a procedure that we have followed in the experiments.

It is possible to simplify the RS method using incomplete data set analysis (Rao & Toutenburg, 1995), where some patterns (those with sensitivity below a threshold) can be purposely discarded or truncated. Analogous simplified procedures have been applied in Azimi-Sadjadi and Liou (1992) and Wang and Chen (1996); unfortunately, there patterns are discarded if they lie outside the inverse activation function domain, which is clearly a suboptimal criterion.

4 Application Examples
Before presenting the experimental results in detail, we describe some features common to all of the experiments, such as the NN architecture, error measures, experimental setup, convergence criterion, and the particular implementations of the classical training algorithms used for comparison. Most of the specifications conform to those defined in Proben, an NN benchmark specification document and database (Prechelt, 1994), although we state them explicitly for clarity.

With respect to the NN architecture, we use throughout an MLP with a single hidden layer and feedforward connections only (no intralayer connections or shortcuts); the numbers of inputs and hidden neurons differ in every experiment. A hyperbolic tangent is used in the hidden layer, and the output layer uses linear activation functions for estimation problems and sigmoidal ones for classification problems. Initialization is random, with weights chosen in [−0.1, 0.1]. We restrict ourselves to single-hidden-layer MLPs due to their universal approximation capabilities, with the number of hidden neurons chosen without much care. This may lead to suboptimal solutions but poses no problem for efficiency comparisons
Block Training of Multilayer Perceptrons
1437
of the training algorithms given a fixed architecture, the matter of main concern here. Performance is evaluated using the normalized mean square error (NMSE = MSE/σ², σ² being the variance of the output training values) for approximation problems, and the classification error (CE) for classification problems. The convergence criterion used is early stopping by cross-validation. In particular, we interrupt the training process when the generalization loss, GL(n) = 100·(E_valid(n)/E_opt(n) − 1), with E_opt(n) = min{E_valid(i), i = 1, …, n}, as defined in Prechelt (1994), exceeds 1%. Data sets are split into training, validation, and test sets using a specified number of patterns, and the data are normalized to the interval (−1, 1). Ten runs are conducted in every experiment, and the resulting data are summarized using mean values and standard deviations. We have chosen Matlab to set up the experiments because it is a widely used tool and many NN training algorithms are readily available. Although faster implementations of these algorithms are usually available in C or C++, the use of Matlab is reasonable for comparison purposes if we measure the number of floating-point operations (FLOPS) consumed by every algorithm until convergence. The gradient descent with error BP, fast backpropagation (FBP), and Levenberg-Marquardt (LM) algorithms are available in the Neural Networks Toolbox (trainbp, trainbpx, and trainlm routines, respectively), while the conjugate gradient (CG) and quasi-Newton (with Broyden-Fletcher-Goldfarb-Shanno update, BFGS) methods were not available there and had to be extracted from Bishop's NETLAB library (routines conjgrad and quasinew, respectively, also coded in Matlab).⁶ Block learning methods are also implemented in Matlab, using Gaussian elimination to solve every LS minimization. Unless otherwise specified, all of these algorithms have been run with the default parameter settings included in the original libraries.
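As an illustration, the early-stopping rule just described can be sketched as follows (our own minimal sketch, not the authors' code; the function names are assumptions), given the validation errors recorded at each epoch:

```python
def generalization_loss(valid_errors):
    """GL(n) = 100 * (E_valid(n) / E_opt(n) - 1), where E_opt(n) is the
    lowest validation error observed in epochs 1..n (Prechelt, 1994)."""
    e_opt = min(valid_errors)
    return 100.0 * (valid_errors[-1] / e_opt - 1.0)

def should_stop(valid_errors, threshold=1.0):
    """Interrupt training once the generalization loss exceeds 1%."""
    return generalization_loss(valid_errors) > threshold
```

For example, validation errors [0.30, 0.25, 0.26] give GL = 4%, so training would be interrupted.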
We now describe the data sets used, as well as the results obtained with each of the algorithms.

4.1 The "Cancer" Data Set. This data set has been extracted from the Proben benchmark data set collection (Prechelt, 1994) and is also part of the UCI machine learning repository under the description "breast cancer Wisconsin."⁷ The purpose here is to use some cell characteristics obtained by microscopic examination (nine in total) in order to classify a tumor as benign or malignant. The total number of patterns is 699, and we have chosen a data partition that uses the first 50% of the patterns for train-
6. NETLAB can be obtained online at http://www.ncrg.aston.ac.uk/netlab.
7. Proben is available for anonymous FTP from the Neural Bench archive at Carnegie Mellon University (ftp.cs.cmu.edu, /afs/cs/project/connect/bench/contrib/prechelt).
1438
A. Navia-Vázquez and A. R. Figueiras-Vidal
Table 1: Performance Comparison for the "Cancer" Data Set.

              BP           FBP        CG          BFGS       LM         LSB        RS-LSB
Train CE %    3.7 (0.12)   4 (0.3)    3.7 (0.2)   4.6 (0.7)  3.9 (0.5)  4.9 (0.4)  3.8 (0.3)
Test CE %     1.7 (0.1)    2.2 (0.5)  1.8 (0.2)   2.8 (0.9)  1.8 (0.4)  2.8 (0.4)  1.7 (0.2)
No. Iters.    2517 (887)   79 (36)    12.8 (0.5)  8.8 (4.2)  5.7 (2.9)  8 (1.2)    9.2 (0.9)
Load % BP     100%         3.5%       5.5%        3.5%       11%        1.1%       1.7%

Notes: We show classification error (mean values and standard deviations, in parentheses) corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage of the BP load.
ing, 25% for validation, and 25% for testing (the cancer1 partition in Proben). The MLP architecture used is 9-8-1 in all cases, and the parameters used are the default ones in the implementation, except for the learning rate in BP, which was set to 10⁻³ (otherwise training was exceedingly slow). In RS-LSB we used α = 3 and β = 1. The number of runs is 10, and the results of the experiment are shown in Table 1, where we present means and standard deviations of the training and test classification errors. We observe that the best-performing methods are BP, CG, LM, and RS-LSB, with LSB yielding a suboptimal solution (CE almost twice that of RS-LSB). Although further explanation will be given in the next section, we can see that block learning methods are advantageous in computational load with respect to the other algorithms, RS-LSB being about three times faster than CG and six times faster than LM, the two fastest algorithms yielding similar performance.

4.2 The "Boston Housing" Data Set. This data set has been extracted from the Delve⁸ database and is also part of the StatLib database at Carnegie Mellon University.⁹ It contains information about housing in the Boston area. The purpose here (the prototask described as "price") is to use some variables (for example, per capita crime rate by town, nitric oxides concentration, average number of rooms per dwelling, and pupil-teacher ratio by town; 13 mixed categorical and continuous variables in total) to pre-
8. Delve (Data for Evaluating Learning in Valid Experiments) is a standardized environment designed to evaluate the performance of methods that learn relationships based primarily on empirical data; it can be accessed online at the Computer Science Department of the University of Toronto (http://www.cs.toronto.edu/~delve/).
9. The location is http://lib.stat.cmu.edu/datasets/boston.
Table 2: Results for the "Boston Housing" Data Set.

              BP           FBP          CG           BFGS         LM           LSB          RS-LSB
Train NMSE    0.09 (0.05)  0.41 (0.12)  0.17 (0.05)  0.26 (0.06)  0.12 (0.2)   0.21 (0.02)  0.12 (0.03)
Test NMSE     0.18 (0.01)  0.45 (0.1)   0.27 (0.05)  0.3 (0.07)   0.21 (0.11)  0.31 (0.01)  0.19 (0.05)
No. Iters.    688 (68)     51 (35)      15 (4.7)     7.6 (3.9)    7.4 (2.9)    10 (2.7)     31 (6.5)
Load % BP     100%         7%           24%          15%          363%         7%           19%

Notes: We illustrate performance by means of the NMSE (mean values and standard deviations, in parentheses) corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage of the BP load.
dict house prices. The total number of patterns is 506 in this case, and we again used a data partition with the first half of the patterns for training, 25% for validation, and 25% for testing. The MLP architecture used is 13-20-1, and the training parameters are the default ones, except for the learning rate in BP, which was set to 10⁻³. In RS-LSB we have used α = 3 and β = 1. The number of runs is 10, and the results are shown in Table 2, where we present means and standard deviations of the training and test NMSE. We observe that the best-performing methods are BP, LM, and RS-LSB, with LSB yielding a suboptimal solution (NMSE almost double that of RS-LSB). In this case, RS-LSB is about 5 times faster than BP and 20 times faster than LM, the two algorithms yielding similar performance.

4.3 The "Computer Activity" Data Set. This data set is also included in the Delve collection and contains information about computer system measures (e.g., number of system calls per second, number of "write" calls) used to predict CPU consumption. The number of available patterns is 8192, and we have used the first half of them for training, 25% for validation, and the rest for testing (with the original ordering). The MLP architecture used is 12-8-1, and the parameters are the default ones, except for the learning rate in BP, which was set to 5 × 10⁻⁵ (the default value was too high for this application and caused instabilities). In RS-LSB we have used α = 3 and β = 3. The number of runs is 10, and the results are shown in Table 3, where we present means and standard deviations of the training and test NMSE. We observe that the best-performing methods are BP, FBP, CG, and RS-LSB, with LSB yielding a suboptimal solution (NMSE about four times that of RS-LSB). This time, RS-LSB is about 13 times faster than BP, 6 times faster than FBP, and 11 times faster than CG.
Table 3: Experiment Results for the "Computer Activity" Data Set.

              BP            FBP          CG            BFGS         LM           LSB          RS-LSB
Train NMSE    0.04 (0.004)  0.07 (0.01)  0.04 (0.002)  0.15 (0.14)  0.3 (0.2)    0.26 (0.05)  0.08 (0.02)
Test NMSE     0.05 (0.005)  0.07 (0.01)  0.05 (0.002)  0.15 (0.13)  0.26 (0.19)  0.25 (0.03)  0.06 (0.01)
No. Iters.    395 (120)     153 (57)     30 (5.4)      13.6 (7.9)   16 (5.6)     6.2 (1.7)    10.9 (4)
Load % BP     100%          39%          85%           37%          187%         4%           8%

Notes: We show the NMSE (mean values and standard deviations, in parentheses) corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage of the BP load.
5 Computational Considerations
In order to evaluate the speed gain of the block learning methods with respect to more classical approaches (BP, FBP, CG, BFGS, and LM), we have registered the number of floating-point operations per epoch consumed by every algorithm and the number of epochs needed to converge. In Tables 1 through 3 we show the computational load with respect to BP. It can be observed that from this point of view, block methods always appear to be advantageous, RS-LSB being from 3 to 13 times faster than the fastest competitor (among those with comparable final performance), the moderate computational excess of RS-LSB with respect to LSB being justified by the improved quality of its solutions. Although the numbers collected in Tables 1 through 3 reflect actual computational cost, it is convenient to give approximate expressions for the computational load per epoch of every one of the algorithms under test. It is rather complicated to derive exact analytical expressions because of the numerous operations involved, and therefore we have obtained empirical expressions assuming a computational cost linear in the number of weights for the BP, FBP, CG, and BFGS cases, quadratic in the number of weights for the LM case, and quadratic in the number of neurons for LSB and RS-LSB; the free (multiplying) parameters are adjusted afterward by LS fitting on a logarithmic scale (i.e., computational loads are first converted to a log scale in order to obtain a similar fitting error for problems of different complexity). This estimation gives, as shown in Figure 3, estimates (-o-) close to the measured numbers (-*-) (the analytical expressions are also included in Figure 3). The accuracy of these expressions serves to confirm that, as we had guessed, BP, FBP, CG, and BFGS behave linearly in the number of weights (N_W), LM is quadratic in N_W, and block training methods are quadratic in the number of neurons (N_N).
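The log-scale fitting procedure can be sketched as follows (a hypothetical illustration, not the authors' code): for a model load ≈ c·N^d with the exponent d fixed in advance (1 or 2), the multiplying constant c is the only free parameter, and the LS fit in log scale reduces to averaging the log-residuals:

```python
import math

def fit_load_constant(sizes, loads, power):
    """Fit c in load ~= c * size**power by least squares on a log scale:
    log(load) = log(c) + power * log(size), so the LS estimate of log(c)
    is the mean residual log(load) - power * log(size)."""
    residuals = [math.log(l) - power * math.log(n)
                 for n, l in zip(sizes, loads)]
    return math.exp(sum(residuals) / len(residuals))
```

With synthetic loads generated exactly as 5·N², the fit recovers c = 5; with real measurements, fitting in log scale keeps the relative error comparable across problems of very different sizes.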
Figure 3: (a) Computational load (number of floating-point operations per epoch) as a function of P·N_W for the three data sets under study (o) and the empirical estimation (*). The estimates are accurate enough. (b) Estimated load expressions for every algorithm, where N_W is the number of weights in the network (including biases) and N_N the number of neurons (including the input layer).
6 Some Theoretical Insights

6.1 Points of Tangency with Second-Order Searching. The basic local technique to find the minimum of a nonquadratic error function using
second-order information is the well-known Newton's method. It assumes that the error function E(w) can be locally approximated by a second-order polynomial and computes the Newton step Δw_N (such that E(w + Δw_N) is minimal) by solving

H(w) Δw_N = −∇E(w),    (6.1)

where ∇E(w) is the gradient vector and H(w) the Hessian matrix. In our case, the vector w represents one of the columns of W^(k); we are solving a local minimization problem at every neuron. If we consider the error term r(w) = tanh(Ow) − t, then E(w) = (1/2) r(w)ᵀ r(w), the gradient vector is

∇E(w) = J(w)ᵀ r(w),    (6.2)

and the Hessian matrix can be reasonably approximated (Battiti, 1992) using the Jacobian matrix J(w) = ∂r(w)/∂w:

H(w) ≈ J(w)ᵀ J(w).    (6.3)

We obtain the Gauss-Newton step by solving equation 6.1 with these values,

Δw_GN = −(J(w)ᵀ J(w))⁻¹ J(w)ᵀ r(w),    (6.4)
such that strong similarities (and also qualitative differences) appear between the Gauss-Newton method and the RS-LSB algorithm (with β = 1 in equation 3.5). To facilitate their comparison, we obtain the analogous RS-LSB step by conveniently rewriting equation 3.4 and minimizing with respect to the weight update¹⁰ Δw_RS-LSB:

min over Δw_RS-LSB of ‖O_s w + O_s Δw_RS-LSB − q_s‖²,    (6.5)

and, if we use the Moore-Penrose generalized inverse to solve equation 6.5, we obtain

Δw_RS-LSB = −(O_sᵀ O_s)⁻¹ O_sᵀ (O_s w − q_s)
          = −(O_sᵀ O_s)⁻¹ O_sᵀ r_RS-LSB(w).    (6.6)
10. Note that block learning methods can be formulated in direct or incremental form, the former yielding a new set of weights in every step (solving equation 3.4) and the latter obtaining weight increments (solving equation 6.5) to be added to the present weight values, both formulations leading to identical solutions.
It is straightforward to verify that O_s = J(w) (using equation 3.5 with β = 1), so that, to compare the RS-LSB and Gauss-Newton methods, only the following residual terms have to be considered:

r_GN(w) = r(w) = tanh(Ow) − t,    (6.7)

r_RS-LSB(w) = O_s w − q_s = diag(s)(Ow − q).    (6.8)
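As an illustration of the comparison above, the Gauss-Newton step of equation 6.4 for a single neuron can be sketched numerically as follows (our own sketch, with assumed variable names: the rows of O are the input patterns o_pᵀ and t is the target vector):

```python
import numpy as np

def gauss_newton_step(O, w, t):
    """One Gauss-Newton step for the residual r(w) = tanh(O @ w) - t.
    The Jacobian is J = diag(tanh'(O w)) O, with tanh'(x) = 1 - tanh(x)^2."""
    a = O @ w
    r = np.tanh(a) - t
    J = (1.0 - np.tanh(a) ** 2)[:, None] * O
    # Solve (J^T J) dw = -J^T r in least-squares form (equations 6.1-6.4).
    dw, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return dw
```

On a well-conditioned problem each step reduces the residual; iterating it is the accurate local method discussed above.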
The multiplication by the sensitivity in equation 6.8 emphasizes those error terms o_pᵀw − q_p with low |o_pᵀw|, because s_p becomes negligible when |o_pᵀw| → ∞, thereby yielding the localization properties mentioned in section 3. Thus, the RS-LSB algorithm, considered as a whole, combines an accurate local method for adjusting the weights of every neuron with the layerwise training approach that decomposes the global optimization problem into smaller LS problems. When we use the general expression 3.6 for the sensitivity, which means using β ≠ 1 and the scaling procedure, the RS-LSB and Gauss-Newton methods are no longer directly comparable, but the qualitative behavior of RS-LSB is maintained.

6.2 Relationship with Regularization Techniques. Taking advantage of the LS formulation, it would be easy to include an additional regularization term α‖w‖² in equation 3.4, such that the minimization becomes

min over w of {‖O_s w − q_s‖² + α‖w‖²},    (6.9)
which leads to well-known regularization techniques that either add a small value (α) to the diagonal of O_sᵀO_s or discard those singular values smaller than α. We have observed through experimentation that this extra regularization mechanism sometimes limits the approximation capabilities of the network excessively (and is also highly sensitive to the choice of α). As the RS-LSB algorithm already relies on minimal-norm solutions, it generally obtains solutions with good generalization, and we found no reason to include this additional mechanism. (Further comments on regularization are included in section 6.4.)

6.3 Relationship with Natural Gradient. It has recently been shown that the parameter space of perceptrons is not Euclidean but has a Riemannian structure (Amari, 1998). Thus, the ordinary gradient does not provide the steepest direction, which is given instead by the natural (contravariant) gradient. Preliminary results (Amari, 1998) show that the performance of this method is good, freeing BP from many of its convergence problems. The Riemannian steepest-descent direction of E(w) is
−∇̃E(w) = G⁻¹(w)(−∇E(w)),    (6.10)
where G(w) is the Riemannian metric tensor. If we formulate every neuron using a statistical model t = tanh(oᵀw) + n, with n an N(0, σ_n²) random variable and p_o(oᵀ) the distribution of inputs, the input-output joint probability is

p(oᵀ, t; w) = p_o(oᵀ) · (1/(√(2π) σ_n)) · exp{−(tanh(oᵀw) − t)²/σ_n²},    (6.11)
and the geometry of the space is given by the Fisher information matrix

G(w) = E{(∂ log p(oᵀ, t; w)/∂w)(∂ log p(oᵀ, t; w)/∂w)ᵀ},    (6.12)

which becomes (Amari, 1998)

G(w) = (1/σ_n²) E{(tanh′(oᵀw))² o oᵀ}.    (6.13)
The expectation in equation 6.13 has to be estimated using an average over the finite training set of P patterns:

Ĝ(w) = (1/(σ_n² P)) Σ_{p=1}^{P} (tanh′(o_pᵀw) o_p)(tanh′(o_pᵀw) o_p)ᵀ
     = (1/(σ_n² P)) J(w)ᵀ J(w).    (6.14)
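The empirical estimate of equation 6.14 can be sketched as follows (our own sketch, assuming the rows of O hold the P input patterns):

```python
import numpy as np

def empirical_fisher(O, w, sigma_n=1.0):
    """Estimate G(w) = (1/(sigma_n^2 P)) J(w)^T J(w), where the rows of
    J are tanh'(o_p^T w) o_p^T, by averaging outer products over the
    P training patterns (equation 6.14)."""
    P = O.shape[0]
    rows = (1.0 - np.tanh(O @ w) ** 2)[:, None] * O
    return (rows.T @ rows) / (sigma_n ** 2 * P)
```

The matrix product rows.T @ rows accumulates the per-pattern outer products in a single vectorized operation, which is equivalent to the explicit sum over p.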
Block methods assume normally distributed errors in state-space and, in particular, RS-LSB uses the weighted error terms e_p = s_p(o_pᵀw − q_p) (see equation 3.1), such that the new joint probability function is

p(oᵀ, t; w) = p_o(oᵀ) · (1/(√(2π) σ_n)) · exp{−(diag(s)(oᵀw − q))²/σ_n²}.    (6.15)
It is also easy to find the Fisher information matrix in this case:

Ĝ_RS-LSB(w) = (1/(σ_n² P)) Σ_{p=1}^{P} (s_p o_p)(s_p o_p)ᵀ = (1/(σ_n² P)) O_sᵀ O_s,    (6.16)

which is completely equivalent to equation 6.12, given the relationship J(w) = O_s. We observe that the RS-LSB update, equation 6.6, contains a factor proportional to Ĝ_RS-LSB⁻¹(w). Therefore, it truly takes into account the non-orthonormal geometry of the space, benefiting from the good convergence properties attributed to the natural gradient. On the other hand, note that in LSB the estimated Fisher matrix is Ĝ_LSB(w) = (1/(σ_n² P)) Oᵀ O (it lacks the tanh′(o_pᵀw) terms), which is clearly different from equation 6.12.
6.4 On Implicit Sample Selection and Regularization. Sample selection is a well-known strategy for improving the efficiency of neural networks. The first proposal for introducing sample selection (Munro, 1992) demonstrated this, as have many other modifications (Cachin, 1994), some using other cost objectives (Telfer & Szu, 1994), according to the discussion in Cid-Sueiro, Arribas-Sánchez, Urbán-Muñoz, and Figueiras-Vidal (1999). Recently, powerful learning architectures, the support vector machines (SVMs), have appeared as a result of formulations that include generalization objectives (Vapnik, 1995; Vapnik, Golowich, & Smola, 1997; Schölkopf et al., 1997). These architectures also include (implicit) sample selection. Our approach differs from these in some respects. Specifically, the selection is carried out independently at every neuron, and none of the patterns is (fully) discarded at any point. These characteristics can be advantageous; for example, the second is a softer way of doing things, more adequate when the training data may include misclassified patterns or outliers. Finally, note that in the above references for SVMs (and also in Bartlett, 1997), good performance is attributed to controlling the weight sizes (as in the first proposal of regularization for pruning purposes; Hinton, 1987). In this sense, since LS minimizations yield minimum-norm solutions, LSB, and in particular RS-LSB, also follow this principle.
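The regularized LS variant of equation 6.9 discussed in section 6.2 amounts to diagonal loading and can be sketched as follows (our own sketch with assumed names; recall that the text found this extra mechanism unnecessary for RS-LSB):

```python
import numpy as np

def regularized_ls(Os, qs, alpha):
    """Solve min_w ||Os w - qs||^2 + alpha * ||w||^2 by adding alpha to
    the diagonal of Os^T Os (Tikhonov regularization / diagonal loading)."""
    A = Os.T @ Os + alpha * np.eye(Os.shape[1])
    return np.linalg.solve(A, Os.T @ qs)
```

With alpha = 0 (and Os of full column rank) this reduces to the ordinary LS solution; increasing alpha shrinks ‖w‖, illustrating the weight-size control discussed above.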
7 Conclusions
We have seen that block training methods constitute a powerful approach to MLP training, promising a great reduction in computational effort with respect to other techniques. Some problems arise in direct implementations such as LSB, mainly concerning the final error and the generalization capability. We have proposed and developed a reduced-sensitivity approach, which has proved advantageous with respect to other well-known training algorithms, as shown by means of several real-world examples. Our approach is also related to other efficient training techniques, such as second-order approaches and natural gradient descent. Additionally, reducing the sensitivity of the model to some of the training patterns can be considered an implicit sample selection strategy, which also benefits the global learning process. Extending this approach to other scenarios, such as recurrent neural networks, could yield simpler and practically applicable designs, addressing one of the limitations of this type of neural network. Applying other cost functions and additional sample selection strategies are also interesting tasks, as are analyzing sensitivities and variable selection under these new training procedures.
Acknowledgments
This work has been partially supported by the III National R&D Plan of the Spanish Government (CICYT), under grant TIC96-0500-C10-03.

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Azimi-Sadjadi, M., & Liou, R. (1992). Fast learning process of multilayer neural networks using recursive least squares method. IEEE Trans. on Signal Processing, 40, 446–450.
Bartlett, P. (1997). For valid generalization, the size of the weights is more important than the size of the network. In M. M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 134–140). San Mateo, CA: Morgan Kaufmann.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4, 141–166.
Biegler-König, F., & Bärmann, F. (1993). A learning algorithm for multilayered neural networks based on linear least squares problems. Neural Networks, 6, 127–131.
Cachin, C. (1994). Pedagogical pattern selection strategies. Neural Networks, 7, 175–181.
Cid-Sueiro, J., Arribas-Sánchez, I., Urbán-Muñoz, S., & Figueiras-Vidal, A. R. (1999). Cost functions to estimate "a posteriori" probabilities in multi-class problems. IEEE Trans. on Neural Networks, 10(3), 645–656.
DiClaudio, E. D., Parisi, R., & Orlandi, G. (1993). LS-based training algorithms for neural networks. In C. Kamm & G. Kuhn (Eds.), Neural networks for signal processing III (pp. 22–29). New York: IEEE Press.
Deller, J., & Hunt, S. (1992). A simple "linearized" algorithm which outperforms back-propagation. In Proc. Intl. Joint Conf. on Neural Networks (Vol. 3, pp. 133–138).
Douglas, S., & Meng, T. (1991). Linearized least-squares training of multilayer feedforward neural networks. In Proc. Intl. Joint Conference on Neural Networks (Vol. 1, pp. 307–312).
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. Bakker, A. Nijman, & P. Treleaven (Eds.), Proc. PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Munro, P. (1992). Repeat until bored: A pattern selection strategy. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 1001–1008). San Mateo, CA: Morgan Kaufmann.
Pethel, S., & Bowden, C. (1992). Neural network applications to nonlinear time series analysis. In Proc. Intl. Conf. on Industrial Electronics, Control, Instrumentation and Automation, Power Electronics and Motion Control (Vol. 2, pp. 1094–1098).
Prechelt, L. (1994). Proben1—A set of neural network benchmark problems and benchmarking rules (Tech. Rep. No. 21/94). Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany. Available online via anonymous FTP: ftp.ira.uka.de, /pub/papers/techreports/1994/1994-21.ps.Z.
Rao, C., & Toutenburg, H. (1995). Linear models: Least squares and alternatives. New York: Springer-Verlag.
Rohwer, R. (1991). The moving targets training algorithm. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 558–565). San Mateo, CA: Morgan Kaufmann.
Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., & Vapnik, V. (1997). Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Trans. on Signal Processing, 45, 2758–2765.
Song, J., & Hassoun, M. (1990). Learning with hidden targets. In Proc. Intl. Joint Conf. on Neural Networks (Vol. 3, pp. 93–98).
Telfer, B., & Szu, H. (1994). Energy functions for minimizing misclassification error with minimum-complexity networks. Neural Networks, 7, 809–818.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., Golowich, S., & Smola, A. (1997). Support vector method for function approximation, regression, estimation and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 281–287). Cambridge, MA: MIT Press.
Wang, G.-J., & Chen, C.-C. (1996). A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures. IEEE Trans. Neural Networks, 7, 768–775.

Received June 26, 1998; accepted May 15, 1999.
LETTER
Communicated by Richard Golden
An Optimization Approach to Design of Generalized BSB Neural Associative Memories

Jooyoung Park
Department of Control and Instrumentation Engineering, Korea University, Chochiwon, Chungnam, 339-800, Korea

Yonmook Park
Department of Information Engineering, Graduate School, Korea University, Chochiwon, Chungnam, 339-800, Korea

This article is concerned with the synthesis of the optimally performing GBSB (generalized brain-state-in-a-box) neural associative memory given a set of desired binary patterns to be stored as asymptotically stable equilibrium points. Based on some known qualitative properties and newly observed fundamental properties of the GBSB model, the synthesis problem is formulated as a constrained optimization problem. Next, we convert this problem into a quasi-convex optimization problem called a GEVP (generalized eigenvalue problem). This conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods. Design examples are given to illustrate the proposed approach and to compare it with existing synthesis methods.

1 Introduction
Recently, the problem of realizing associative memories via recurrent neural networks has received much attention (Hassoun, 1993; Michel & Farrell, 1990). In general, the key properties required for these neural associative memories are the following (Lillo, Miller, Hui, & Zak, 1994):

- The desired binary patterns, which will be called the prototype patterns in this article, are stored as asymptotically stable equilibrium points of the system.
- When a vector sufficiently close to a stored prototype pattern is applied as an initial condition, the system trajectory converges to that prototype pattern.
- The number of asymptotically stable equilibrium points that do not correspond to the prototype patterns (i.e., spurious states) is minimal.
- The system is globally stable; every trajectory of the system converges to some equilibrium point.

Neural Computation 12, 1449–1462 (2000) © 2000 Massachusetts Institute of Technology
1450
Jooyoung Park and Yonmook Park
Among the various kinds of promising neural models suitable for achieving these properties are the so-called brain-state-in-a-box (BSB) neural networks. This model was first proposed by Anderson, Silverstein, Ritz, and Jones (1977) and has attracted substantial interest among researchers. Cohen and Grossberg (1983) proved a theorem on the global stability of the continuous-time, continuous-state BSB dynamical systems with real symmetric weight matrices. Golden (1986) proved that all trajectories of the BSB dynamical systems with real symmetric weight matrices approach the set of equilibrium points under a certain condition, which will be used as the global stability condition in this article. Marcus and Westervelt (1989) reported results on global stability for a large class of BSB-type systems. Perfetti (1995) analyzed qualitative properties of the BSB model and formulated the design of BSB-based associative memories as a constrained optimization in the form of a linear program with an additional nonlinear constraint. He also proposed an ad hoc iterative algorithm to solve the constrained optimization and illustrated the algorithm with some design examples. Park, Cho, and Park (1999) recast Perfetti's formulation into a semidefinite programming (SDP) problem by converting its nonlinear constraints into a set of linear matrix inequalities (LMIs). We are concerned here with developing a synthesis procedure for associative memories based on an advanced form of the BSB model, often referred to as the GBSB (generalized BSB). Our synthesis procedure will be given in the form of an LMI-based optimization problem called a generalized eigenvalue problem (GEVP). This article may be viewed as extending the design method of Park, Cho, and Park (1999) to the GBSB memories. Both articles use readily available software (the MATLAB LMI Control Toolbox) rather than implementing ad hoc algorithms for the memory synthesis.
The most significant difference between them is that the synthesis procedure we present is based on the GEVP, while the method of Park et al. (1999) uses the SDP. A design example in section 4 will demonstrate the superiority of the proposed GEVP-based method over the SDP method. The GBSB model was proposed and studied by Hui and Zak (1992) and Golden (1993). Lillo, Miller, Hui, and Zak (1994) analyzed the dynamics of the GBSB model and presented a novel synthesis procedure for GBSB-based associative memories. Their procedure uses a decomposition of the weight matrix, which results in an asymmetric interconnection structure, asymptotic stability of the prototype patterns, and a small number of spurious states. Later, Zak, Lillo, and Hui (1996) incorporated learning and forgetting capabilities into the synthesis method of Lillo et al. (1994). Also, Chan and Zak (1997) proposed a "designer" neural network to solve the problem of realizing associative memories via GBSB neural networks. In this article, attention is focused on how to find the parameters of the optimally performing GBSB given a set of prototype patterns. After introducing fundamental properties of the GBSB model, we formulate the synthesis of the GBSB that can store the prototype patterns with optimal
Optimization Approach to Design
1451
performance as a constrained optimization problem. Next, we convert the synthesis problem into an optimization problem called a generalized eigenvalue problem (GEVP). This conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods (Boyd, El-Ghaoui, Feron, & Balakrishnan, 1994; Vandenberghe & Balakrishnan, 1997). For specific algorithms belonging to this family of interior point methods, one can refer to Boyd and El-Ghaoui (1993) and Nesterov and Nemirovskii (1994), for example. An implementation of the Nesterov and Nemirovskii algorithm is also provided in the LMI Control Toolbox (Gahinet, Nemirovskii, Laub, & Chilali, 1995). Since GEVPs can be efficiently solved by interior point methods, converting the synthesis problem into a GEVP amounts to solving the original problem. As the optimum searcher for the GEVP appearing in the proposed approach, we use the function "gevp" of the LMI Control Toolbox. Throughout this article, we use the following definitions and notation, in which Rⁿ denotes the set of real n-vectors. A symmetric matrix A ∈ Rⁿˣⁿ is positive definite if xᵀAx > 0 for any x ≠ 0, and A > 0 denotes this. Also, A > B denotes that A − B is positive definite. Hⁿ denotes the hypercube [−1, +1]ⁿ. By a binary vector, we mean a bipolar binary vector whose elements are either +1 or −1, and the set of all such binary vectors in Hⁿ is denoted by Bⁿ. The usual Hamming distance between two vectors x* and x in Bⁿ is denoted by H(x*, x). By an l-neighborhood of a vertex c ∈ Bⁿ, we mean the set {c* ∈ Bⁿ | H(c*, c) ≤ l}. In section 2, we present some known qualitative properties and newly observed fundamental properties of the GBSB model. Also, based on these properties, we formulate the synthesis of the GBSB with optimal performance as a constrained optimization problem. Section 3 describes how this synthesis problem can be converted into a GEVP.
In section 4, we present numerical examples to show the superiority of the proposed method over existing synthesis methods. Finally, section 5 contains concluding remarks.

2 Background Results
The dynamics of the GBSB model is described by the following state equation:

x(k+1) = g(x(k) + α(Wx(k) + b)),   (2.1)

where x(k) ∈ R^n is the state vector at time k, α > 0 is the step size, W ∈ R^{n×n} is the weight matrix, b ∈ R^n is the bias vector, and g: R^n → R^n is a linear saturating function whose ith component is defined as follows:

g_i(x_i) = +1 if x_i ≥ 1;  g_i(x_i) = x_i if −1 < x_i < 1;  g_i(x_i) = −1 if x_i ≤ −1.

An equilibrium point x_e of system 2.1 is stable if for any ε > 0 there exists δ > 0 such that

‖x(0) − x_e‖ < δ implies ‖x(k) − x_e‖ < ε, ∀k > 0.

An equilibrium point x_e of system 2.1 is asymptotically stable if it is stable and there exists δ > 0 such that x(k) → x_e as k → ∞ whenever ‖x(0) − x_e‖ < δ.
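The discrete-time iteration in equation 2.1 can be sketched in a few lines of NumPy (this is our own illustration of the model, not code from the paper; the function names are ours):

```python
import numpy as np

def g(x):
    # Linear saturating activation: each component is clipped to [-1, +1].
    return np.clip(x, -1.0, 1.0)

def gbsb_step(x, W, b, alpha):
    # One iteration of equation 2.1: x(k+1) = g(x(k) + alpha*(W x(k) + b)).
    return g(x + alpha * (W @ x + b))
```

Iterating `gbsb_step` from an initial state inside the hypercube traces one trajectory of the model.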
System 2.1 is globally stable if every trajectory of the system converges to some equilibrium point. The criteria on the stability of the GBSB model are now well established (Lillo et al., 1994; Golden, 1993; Perfetti, 1995):

A vertex c of the hypercube H^n is an equilibrium point of system 2.1 if and only if

c_i (Σ_{j=1}^n w_ij c_j + b_i) ≥ 0, ∀i ∈ {1, ..., n}.   (2.2)

A vertex c of the hypercube H^n is an asymptotically stable equilibrium point of system 2.1 if

c_i (Σ_{j=1}^n w_ij c_j + b_i) > 0, ∀i ∈ {1, ..., n}.   (2.3)

If condition 2.3 holds for a vertex c ∈ B^n and w_ii = 0, ∀i ∈ {1, ..., n}, then other asymptotically stable equilibrium points of system 2.1 can exist only on the vertices c* ∈ B^n such that H(c, c*) > 1.¹

¹ In fact, this was proved only for the BSB model in theorem 1 and corollary 2 of Perfetti (1995). However, one can easily prove that it still holds for the GBSB model by following the arguments of Perfetti (1995).
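Conditions 2.2 and 2.3 are inexpensive to verify numerically for any candidate vertex. A small sketch (the helper names are ours):

```python
import numpy as np

def is_equilibrium_vertex(c, W, b):
    # Condition 2.2: c_i * (sum_j w_ij c_j + b_i) >= 0 for every i.
    return bool(np.all(c * (W @ c + b) >= 0))

def is_asymptotically_stable_vertex(c, W, b):
    # Condition 2.3: the same quantity is strictly positive for every i.
    return bool(np.all(c * (W @ c + b) > 0))
```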
Optimization Approach to Design
1453
System 2.1 is globally stable if the weight matrix W is symmetric and

λ_min > −1,   (2.4)

where λ_min is the smallest eigenvalue of I + αW.

Designing GBSB neural associative memories on the basis of the above criteria alone is, in general, not good enough. Guidelines addressing the attraction domains of the prototype patterns and the number of spurious states should be added. We present our main theorem, which will be used to provide an additional guideline for performance optimization:

Theorem. Let c ∈ B^n be an asymptotically stable equilibrium point of system 2.1, and let l be an integer in {1, ..., n}. If W and b satisfy

w_ii = 0, ∀i ∈ {1, ..., n},   (2.5)

and

c_i (Σ_{j=1}^n w_ij c_j + b_i) > 2(l − 1) max_j |w_ij|, i = 1, ..., n,   (2.6)

then any binary vector c* ∈ B^n such that 1 ≤ H(c*, c) ≤ l has the following properties:

c* is not an equilibrium point.

If x(0) = c* and c_i* ≠ c_i, then at the next time step, x_i moves toward c_i (i.e., x_i(1) is closer to c_i than x_i(0) is).
Proof. Let c* ∈ B^n be any binary vector satisfying 1 ≤ H(c*, c) ≤ l, and let i be any index such that c_i* ≠ c_i (i.e., c_i* = −c_i). The vector d ≜ c* − c has entries in {−2, 0, +2}, at most l of which are nonzero; one of the nonzero entries is d_i, and its term is eliminated by w_ii = 0. Hence d satisfies

|Σ_{j=1}^n w_ij d_j| = |w_i1 d_1 + ··· + 0 × d_i + ··· + w_in d_n| ≤ 2(l − 1) max_j |w_ij|,

and we have

c_i* (Σ_{j=1}^n w_ij c_j* + b_i) = −c_i ((Σ_{j=1}^n w_ij c_j + b_i) + (Σ_{j=1}^n w_ij d_j))
1454
Jooyoung Park and Yonmook Park
≤ −c_i (Σ_{j=1}^n w_ij c_j + b_i) + |Σ_{j=1}^n w_ij d_j|
≤ −c_i (Σ_{j=1}^n w_ij c_j + b_i) + 2(l − 1) max_j |w_ij|.

This, together with inequality 2.6, implies that

c_i* (Σ_{j=1}^n w_ij c_j* + b_i) < 0.   (2.7)

Thus, condition 2.2 does not hold for c* ∈ B^n. Therefore, c* cannot be an equilibrium point of system 2.1. Now, let system 2.1 start from x(0) = c*. From inequality 2.7, we see that c_i* and (Σ_{j=1}^n w_ij c_j* + b_i) have different polarities. Since

x_i(1) = g(x_i(0) + α(Σ_{j=1}^n w_ij x_j(0) + b_i)) = g(c_i* + α(Σ_{j=1}^n w_ij c_j* + b_i)),

we have |x_i(1) − c_i| < |x_i(0) − c_i| = 2.
From our theorem, we see that W and b satisfying conditions 2.5 and 2.6 ensure that the vertices in the l-neighborhood of the prototype pattern c cannot become spurious states and are very likely to be in the attraction domain of c. Thus, a possible strategy for decreasing the number of spurious states and increasing the attraction domains of the prototype patterns is to maximize the l in constraints 2.6 for each prototype pattern c ∈ B^n. This observation and the other background results lead to the following formulation for the optimal design, in which δ plays the role of l − 1, relaxed to a real variable: Given a set of prototype patterns x^(1), ..., x^(m), the parameters of the optimally performing GBSB can be found by solving

max δ (> 0)
s.t. x_i^(k) (Σ_{j=1}^n w_ij x_j^(k) + b_i) > 2δ max_j |w_ij|, i = 1, ..., n, k = 1, ..., m,
     w_ii = 0, i = 1, ..., n,
     W = W^T,
     λ_min > −1,   (2.8)
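For a candidate pair (W, b), the margin achieved in problem 2.8 can be evaluated directly. The sketch below (the helper name `design_margin` is ours) returns the largest δ for which the first constraint set holds over all prototype patterns, or None when the structural constraints fail:

```python
import numpy as np

def design_margin(W, b, patterns, alpha, eps=1e-12):
    # Largest delta with x_i(sum_j w_ij x_j + b_i) > 2*delta*max_j|w_ij|
    # for every pattern; None if w_ii = 0, W = W^T, or lambda_min > -1 fails.
    n = W.shape[0]
    if not (np.allclose(np.diag(W), 0.0) and np.allclose(W, W.T)):
        return None
    if np.linalg.eigvalsh(np.eye(n) + alpha * W).min() <= -1.0:
        return None
    row_max = np.maximum(np.abs(W).max(axis=1), eps)  # max_j |w_ij| per row
    return min((x * (W @ x + b) / (2.0 * row_max)).min() for x in patterns)
```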
where λ_min is the smallest eigenvalue of I + αW. This optimization problem has two groups of nonlinear constraints, which prevent us from applying readily available optimization techniques. In the next section, we show that this problem can be converted into an optimization problem that can be efficiently solved.

3 A GEVP Approach to Design of GBSBs
In this section, we convert problem 2.8 into an optimization problem called a GEVP. The general form of a GEVP is given as follows (Boyd et al., 1994):

min λ
s.t. λB(z) − A(z) > 0, B(z) > 0, C(z) > 0,

where A(z), B(z), and C(z) are symmetric matrices that are affine functions of the variable z. This GEVP is to minimize the maximum generalized eigenvalue of the pair (A(z), B(z)) subject to the linear matrix inequality (LMI) constraints B(z) > 0 and C(z) > 0. It is a quasi-convex optimization problem and can be efficiently solved by recently developed interior point methods (Boyd et al., 1994; Vandenberghe & Balakrishnan, 1997). Before proceeding further, note that multiple LMIs C^(1)(z) > 0, ..., C^(p)(z) > 0 can be expressed as the single LMI diag(C^(1)(z), ..., C^(p)(z)) > 0. In order to convert problem 2.8 into a GEVP, we first introduce additional variables q_i, i = 1, ..., n, satisfying

x_i^(k) (Σ_{j=1}^n w_ij x_j^(k) + b_i) > 2δq_i > 2δ max_j |w_ij|, i = 1, ..., n, k = 1, ..., m.   (3.1)

Note that inequalities 3.1 are only a restatement of the first constraint set of problem 2.8 with the deliberately chosen terms 2δq_i placed between x_i^(k) (Σ_{j=1}^n w_ij x_j^(k) + b_i) and 2δ max_j |w_ij|. Also note that inequalities 3.1 can be divided into the LMIs

q_i − w_ij > 0, i = 1, ..., n, j = 1, ..., n,
w_ij + q_i > 0, i = 1, ..., n, j = 1, ..., n,

and a set of inequalities,

−2δq_i + x_i^(k) (Σ_{j=1}^n w_ij x_j^(k) + b_i) > 0, i = 1, ..., n, k = 1, ..., m.
Next, we consider the eigenvalue condition, inequality 2.4. Since I + αW is real symmetric, its eigenvalues are real, and the corresponding eigenvectors can be chosen to be real orthonormal (Strang, 1988). Thus, its spectral decomposition can be written as

I + αW = UΛU^T,

where the eigenvalues of I + αW appear on the diagonal of Λ, and U, whose columns are the real orthonormal eigenvectors of I + αW, satisfies UU^T = U^T U = I (Strang, 1988). Note that λ_min > −1 is equivalent to

Λ > −I.   (3.2)

By pre- and postmultiplying matrix inequality 3.2 by U and U^T, respectively, we obtain the equivalent condition I + αW > −I, which is again equivalent to the following LMI:

2I + αW > 0.

As a result of these conversion processes, the problem of finding the GBSB that can store the prototype patterns x^(1), ..., x^(m) with optimal performance can now be reformulated as the following GEVP:

min (−δ)
s.t. (−δ)(2q_i) + x_i^(k) (Σ_{j=1}^n w_ij x_j^(k) + b_i) > 0, ∀i = 1, ..., n, k = 1, ..., m,
     { q_i − w_ij > 0, i = 1, ..., n, j = 1, ..., n,
       w_ij + q_i > 0, i = 1, ..., n, j = 1, ..., n,
       2I + αW > 0,
       w_ii = 0, i = 1, ..., n,
       W = W^T }.   (3.3)
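Because problem 3.3 is quasi-convex, it can also be solved without the LMI Control Toolbox by bisecting on δ and testing LMI feasibility at each trial value. The sketch below (plain NumPy; the function name is ours) implements only that inner feasibility test for a given candidate point (W, q):

```python
import numpy as np

def gevp_constraints_hold(W, b, q, patterns, alpha, delta):
    # Numerically checks every constraint of GEVP 3.3 at objective value
    # -delta for a candidate (W, q).
    n = W.shape[0]
    ok = np.allclose(np.diag(W), 0.0) and np.allclose(W, W.T)
    ok = ok and np.all(q[:, None] - W > 0) and np.all(W + q[:, None] > 0)
    ok = ok and np.linalg.eigvalsh(2.0 * np.eye(n) + alpha * W).min() > 0
    for x in patterns:
        ok = ok and np.all(-2.0 * delta * q + x * (W @ x + b) > 0)
    return bool(ok)
```

In a bisection, δ is increased while this test (with W and q re-optimized by an SDP solver at each trial δ) remains feasible.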
Note that the constraints inside "{·}" are LMIs in q_i, i = 1, ..., n, and the independent scalar entries of W satisfying w_ii = 0, i = 1, ..., n, and W = W^T. As a further requirement that should be considered in designing GBSB neural associative memories, the magnitudes of the weights |w_ij| should have a reasonable upper bound so that the resulting networks can be implemented without difficulty. To reflect this practical concern, the following LMIs are added to the list of constraints of problem 3.3:

L < q_i < U, i = 1, ..., n.   (3.4)

Note that the resulting optimization problem is still in the form of a GEVP.
4 Design Examples
To demonstrate the applicability of the GEVP approach proposed in this article and to compare it with existing synthesis methods, we consider three design examples. In these examples, we consider the GBSB model or the BSB model with n = 10 neurons. The performance of each designed memory system is evaluated by simulations, in which every possible binary vector is applied as an initial condition for the system, and the system is allowed to evolve from the initial condition to a final state. Analyzing the simulation results, one can obtain average recall probabilities and the convergence rate to correct patterns, which are defined as follows:

Given the pair of a prototype pattern x^(k) and an integer l ∈ {0, 1, ..., n}, the probability that a binary initial condition vector at a Hamming distance of l away from x^(k) is attracted to x^(k) is denoted by P(x^(k), l). By the average recall probability Prob(l), we mean the average of the P(x^(k), l) over all the prototype patterns x^(1), ..., x^(m) (i.e., Prob(l) ≜ {Σ_{k=1}^m P(x^(k), l)}/m).

By the convergence rate to correct patterns, we mean the percentage of binary initial condition vectors that are attracted to the prototype patterns nearest in the sense of Hamming distance.
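The exhaustive evaluation described above can be sketched as follows (the function name is ours, and the tie-handling when several prototypes are equally near is our own assumption, since the text does not specify it):

```python
import numpy as np
from itertools import product

def convergence_rate(W, b, patterns, alpha, n_steps=200):
    # Runs system 2.1 from every vertex of B^n and returns the percentage
    # of initial conditions whose final state equals a prototype nearest
    # (in Hamming distance) to the initial condition.
    n = W.shape[0]
    hits = 0
    for bits in product([-1.0, 1.0], repeat=n):
        x0 = np.array(bits)
        x = x0.copy()
        for _ in range(n_steps):
            x = np.clip(x + alpha * (W @ x + b), -1.0, 1.0)
        dists = np.array([np.sum(p != x0) for p in patterns])
        nearest = dists == dists.min()
        reached = np.array([np.allclose(x, p) for p in patterns])
        hits += bool(np.any(nearest & reached))
    return 100.0 * hits / 2 ** n
```

The average recall probability Prob(l) can be obtained the same way, grouping the initial conditions by their Hamming distance from each prototype.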
In the first example, which is taken from Lillo et al. (1994), we wish to store the following six prototype patterns:

x^(1) = [ −1 +1 +1 +1 +1 +1 −1 −1 −1 −1 ]^T
x^(2) = [ +1 +1 −1 −1 −1 +1 −1 −1 +1 −1 ]^T
x^(3) = [ −1 +1 +1 +1 −1 −1 +1 −1 −1 −1 ]^T
x^(4) = [ −1 +1 −1 −1 −1 −1 +1 −1 +1 +1 ]^T
x^(5) = [ +1 −1 −1 +1 +1 +1 +1 +1 +1 −1 ]^T
x^(6) = [ +1 +1 −1 +1 −1 +1 +1 +1 −1 −1 ]^T
The step size of the GBSB model was set to α = 0.3, as in Lillo et al. (1994). Solving the corresponding GEVP with the bounds in inequalities 3.4 set to L = 0.1 and U = 0.2, we obtained a GBSB memory for the first example. According to the simulation results, the GBSB designed by the proposed method has no spurious states, while the GBSB of Lillo et al. (1994) has two spurious states in B^n. The convergence rates to correct patterns and the average recall probabilities of these GBSBs are shown in Table 1 and Figure 1, respectively. In the second example, which is taken from Chan and Zak (1997), we want to store the following five prototype patterns in the GBSB model:

x^(1) = [ −1 +1 −1 +1 +1 +1 −1 +1 +1 +1 ]^T
x^(2) = [ +1 +1 −1 −1 +1 −1 +1 −1 +1 +1 ]^T
x^(3) = [ −1 +1 +1 +1 −1 −1 +1 −1 +1 −1 ]^T
x^(4) = [ +1 +1 −1 +1 −1 +1 −1 +1 +1 +1 ]^T
x^(5) = [ +1 −1 −1 −1 +1 +1 +1 −1 −1 −1 ]^T

Table 1: Convergence Rates to Correct Patterns of the GBSBs Designed for the First Example.

                         Convergence Rate to Correct Patterns (%)
Lillo et al. (1994)      13.6
Proposed method          92.7
As in Chan and Zak (1997), the step size of the GBSB model was set to α = 0.3. Solving the corresponding GEVP with L = 0.1 and U = 0.2, we obtained a GBSB memory for the second example. From the simulations, we found that the GBSB designed by the proposed method and the GBSB of Chan and Zak (1997) both have no spurious states. The convergence rates to correct patterns and the average recall probabilities of these GBSBs are shown in Table 2 and Figure 2, respectively.

Figure 1: Average recall probabilities of the GBSBs designed for the first example.
Table 2: Convergence Rates to Correct Patterns of the GBSBs Designed for the Second Example.

                         Convergence Rate to Correct Patterns (%)
Chan and Zak (1997)      80.1
Proposed method          85.3
In the third and final example, which was considered in Perfetti (1995) and Park et al. (1999), the prototype patterns of the second example are again considered. The design objective of this example is to store the prototype patterns in the BSB model instead of the GBSB model. For a fair comparison, the bias vector and the step size were fixed at b = 0 and α = 1, respectively, as in Perfetti (1995) and Park et al. (1999). Solving the corresponding GEVP with L = 0.1 and U = 0.2, we obtained a BSB memory. From the simulations for the BSB designed by the proposed method, the BSB of Perfetti (1995), and BSB I of Park et al. (1999), we observed that each BSB has five spurious states, which consist of the negatives of the prototype patterns.

Figure 2: Average recall probabilities of the GBSBs designed for the second example.

Table 3: Convergence Rates to Correct Patterns of the BSBs Designed for the Third Example.

                              Convergence Rate to Correct Patterns (%)
Perfetti (1995)               46.7
BSB I of Park et al. (1999)   46.4
Proposed method               48.9

The convergence rates to correct patterns and the average recall probabilities of these BSBs are shown in Table 3 and Figure 3, respectively. The performance comparisons in the above design examples show that the memories designed by the proposed method generally have higher average recall probabilities, higher convergence rates to correct patterns, and fewer or a comparable number of spurious states than the memories designed by other recently developed synthesis methods.
Figure 3: Average recall probabilities of the BSBs designed for the third example.
5 Conclusion
We have addressed the problem of finding the parameters of GBSB neural networks that can store given prototype patterns with optimal performance. Based on known qualitative properties and newly derived fundamental properties of the GBSB model, the problem was formulated as a nonlinear constrained optimization problem. This problem was then converted into a quasi-convex optimization problem called a GEVP. This conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods. Three design examples were presented to illustrate the proposed method and to compare it with existing methods. The simulation results showed that the memories obtained by the proposed method perform better than the memories designed by existing methods.
References

Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Boyd, S., & El-Ghaoui, L. (1993). Method of centers for minimizing generalized eigenvalues. Linear Algebra and Its Applications, 188, 63–111.
Boyd, S., El-Ghaoui, L., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in systems and control theory. Philadelphia: SIAM.
Chan, H. Y., & Zak, S. H. (1997). On neural networks that design neural associative memories. IEEE Transactions on Neural Networks, 8, 360–372.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, 13, 815–826.
Gahinet, P., Nemirovskii, A., Laub, A. J., & Chilali, M. (1995). LMI Control Toolbox. Natick, MA: MathWorks.
Golden, R. M. (1986). The brain-state-in-a-box neural model is a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73–80.
Golden, R. M. (1993). Stability and optimization analyses of the generalized brain-state-in-a-box neural network model. Journal of Mathematical Psychology, 37, 282–298.
Hassoun, M. H. (Ed.). (1993). Associative neural memories: Theory and implementation. New York: Oxford University Press.
Hui, S., & Zak, S. H. (1992). Dynamical analysis of the brain-state-in-a-box (BSB) neural models. IEEE Transactions on Neural Networks, 3, 86–100.
Lillo, W. E., Miller, D. C., Hui, S., & Zak, S. H. (1994). Synthesis of brain-state-in-a-box (BSB) based associative memories. IEEE Transactions on Neural Networks, 5, 730–737.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
Michel, A. N., & Farrell, J. A. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, 10, 6–17.
Nesterov, Y., & Nemirovskii, A. (1994). Interior point polynomial methods in convex programming. Philadelphia: SIAM.
Park, J., Cho, H., & Park, D. (1999). On the design of BSB neural associative memories using semidefinite programming. Neural Computation, 11, 1985–1994.
Perfetti, R. (1995). A synthesis procedure for brain-state-in-a-box neural networks. IEEE Transactions on Neural Networks, 6, 1071–1080.
Strang, G. (1988). Linear algebra and its applications (3rd ed.). Orlando, FL: HBJ Publishers.
Vandenberghe, L., & Balakrishnan, V. (1997). Algorithms and software for LMI problems in control. IEEE Control Systems Magazine, 17, 89–95.
Zak, S. H., Lillo, W. E., & Hui, S. (1996). Learning and forgetting in generalized brain-state-in-a-box (GBSB) neural associative memories. Neural Networks, 9, 845–854.

Received February 4, 1999; accepted April 20, 1999.
LETTER
Communicated by Klaus Obermayer
Nonholonomic Orthogonal Learning Algorithms for Blind Source Separation

Shun-ichi Amari
Tian-Ping Chen
Andrzej Cichocki
RIKEN Brain Science Institute, Brain-Style Information Systems Group, Japan

Independent component analysis or blind source separation extracts independent signals from their linear mixtures without assuming prior knowledge of their mixing coefficients. It is known that the independent signals in the observed mixtures can be successfully extracted except for their order and scales. In order to resolve the indeterminacy of scales, most learning algorithms impose some constraints on the magnitudes of the recovered signals. However, when the source signals are nonstationary and their average magnitudes change rapidly, the constraints force a rapid change in the magnitude of the separating matrix. This is the case with most applications (e.g., speech sounds, electroencephalogram signals). It is known that this causes numerical instability in some cases. In order to resolve this difficulty, this article introduces new nonholonomic constraints in the learning algorithm. This is motivated by the geometrical consideration that the directions of change in the separating matrix should be orthogonal to the equivalence class of separating matrices due to the scaling indeterminacy. These constraints are proved to be nonholonomic, so that the proposed algorithm is able to adapt to rapid or intermittent changes in the magnitudes of the source signals. The proposed algorithm works well even when the number of sources is overestimated, whereas existing algorithms do not (assuming the sensor noise is negligibly small), because they amplify the null components not included in the sources. Computer simulations confirm this desirable property.

1 Introduction
Neural Computation 12, 1463–1484 (2000). © 2000 Massachusetts Institute of Technology.

Blind source separation is the problem of recovering a number of independent random or deterministic signals when only their linear mixtures are available. Here, "blind" implies that the mixing coefficients, the original signals, and their probability distributions are not known a priori. The problem is also called independent component analysis (ICA) or blind extraction of source signals. The problem has become increasingly important because of many potential applications, such as those in speech recognition and enhancement, telecommunication, and biomedical signal analysis and processing. However, there still remain open theoretical problems to be solved. The source signals can be identified only up to a certain ambiguity, that is, up to a permutation and arbitrary scaling of the original signals. In order to resolve the scaling indeterminacy, most learning algorithms developed so far impose some constraints on the power or amplitude of the separated signals (Jutten & Hérault, 1991; Hérault & Jutten, 1986; Bell & Sejnowski, 1995; Comon, 1994; Douglas, Cichocki, & Amari, 1997; Sorouchyari, 1991; Chen & Chen, 1994; Cichocki, Unbehauen, & Rummert, 1994; Cichocki, Unbehauen, Moszczynski, & Rummert, 1994; Cichocki, Amari, Adachi, & Kasprzak, 1996; Cichocki, Bogner, Moszczynski, & Pope, 1997; Cardoso & Laheld, 1996; Girolami & Fyfe, 1997a, 1997b; Amari, Chen, & Cichocki, 1997; Amari, Cichocki, & Yang, 1995, 1996; Amari & Cardoso, 1997; Amari, 1967; Amari & Cichocki, 1998). When the source signals are stationary and their average magnitudes do not change over time, the constraints work well to extract the original signals in fixed magnitudes. However, in many practical applications, such as evoked potentials and human speech, the source signals change their local average magnitudes. In such cases, when a component source signal suddenly becomes very small, the corresponding row of the separating matrix tends to grow large in the learning process, to compensate for this change and to emit an output signal of the same amplitude. In particular, when one source signal becomes silent, the separating matrix diverges. This causes serious numerical instability. This article proposes an efficient learning algorithm with equivariant properties that works well even when the magnitudes of the source signals change quickly. To this end, we first introduce an equivalence relation in the set of all the demixing matrices.

Two matrices are defined to be equivalent when the recovered signals differ only by scaling. When the trajectory of the learning dynamics of the demixing matrix moves in a single equivalence class, the recovered signals remain the same except for scaling; such motion is therefore ineffective for learning. In order to reduce ineffective movements in the learning process, we modify the natural gradient learning rule (Amari, Cichocki, & Yang, 1996; Amari, 1998) such that the trajectory of learning is orthogonal to the equivalence class. This imposes constraints on the directions of movement in learning. However, these constraints are proved to be nonholonomic, implying that there are no integral submanifolds corresponding to the constraints. This implies that the trajectory can reach any demixing matrix in spite of the constraints, a property that is also useful in robotic design; nonholonomic dynamics is a current research topic in robotics. Since the trajectory governed by the proposed algorithm does not include useless (redundant) components, the algorithm can be more efficient (with respect to convergence and efficiency) than those algorithms with other
constraints. It is also robust against sudden changes in the magnitudes of the source signals. Even when some components disappear, the algorithm still works well. Therefore, our algorithm can be used even when the number of source signals is overestimated. These features have been confirmed by computer simulations. The stability conditions for the proposed algorithm are also derived.

2 Critical Review of Existing Adaptive Learning Algorithms

The problem of blind source separation is formulated as follows (Hérault & Jutten, 1986; Jutten & Hérault, 1991). Let n source signals s_1(t), ..., s_n(t) at time t be represented by a vector s(t) = [s_1(t) ··· s_n(t)]^T, where the superscript T denotes transposition of a vector or matrix. They are assumed to be stochastically independent. We assume that the source signals are not directly accessible; instead, we observe n signals x_1(t), ..., x_n(t), summarized in a vector x(t) = [x_1(t) ··· x_n(t)]^T, which are instantaneous linear mixtures of the n source signals. Therefore, we have

x(t) = As(t),   (2.1)

where A = [a_ij]_{n×n} is an unknown nonsingular mixing matrix. The source signals s(1), s(2), ..., s(T) are generated at discrete times t = 1, 2, ..., T subject to unknown probability distributions. In order to recover the original source signals s(t), we make new linear mixtures of the sensor signals x(t) by

y(t) = W(t)x(t),   (2.2)

where W = [w_ij]_{n×n} is called the demixing or separating matrix. If W(t) = A^{-1}, the original source signals are recovered exactly. However, A or W is unknown, so we need to estimate W from the available observations x(1), ..., x(t). Here, W(t) is an estimate. It is well known that the mixing matrix A, and hence the source signals, are not completely identifiable. There are two types of indeterminacies or ambiguities: the order of the original source signals is indefinite, and the scales of the original signals are indefinite. The first indeterminacy implies that a permutation or reordering of the recovered signals remains possible, but this is not critical, because what is wanted is to extract n independent signals; their ordering is not important. The second indeterminacy can be explained as follows. When the signals s are transformed as s′ = Λs, where Λ is a diagonal scaling matrix with nonzero diagonal entries, the observed signals are kept invariant if the mixing matrix A is changed into A′ = AΛ^{-1}, because x = As = A′s′. This implies that W and W′ = ΛW are equivalent in the sense that they give the same estimated source signals except for their scales. Taking these indeterminacies into account, we can estimate
W = A^{-1}PΛ at best, where P is a permutation matrix and Λ is a diagonal matrix. These intrinsic (unavoidable) ambiguities do not seem to be a severe limitation, because in most applications, most of the information is contained in the waveforms of the signals (for instance, speech signals and multisensor biomedical recordings) rather than in their magnitudes or the order in which they are arranged. In order to make the scaling indeterminacies clearer, we introduce an equivalence relation in the n²-dimensional space of nonsingular matrices Gl(n) = {W}. We define W and ΛW as equivalent,

W ∼ ΛW,   (2.3)

where Λ is an arbitrary nonsingular diagonal matrix. Given W, its equivalence class

C_W = {W′ | W′ = ΛW}   (2.4)

is an n-dimensional subspace of Gl(n) consisting of all the matrices equivalent to W. Therefore, a learning algorithm need not search for the nonidentifiable W = A^{-1} but can instead search for the equivalence class C_W that contains the true W = A^{-1} except for permutations.

In existing algorithms, in order to determine the scales, one of two constraints is imposed explicitly or implicitly. The first is a rigid constraint. Let us introduce n restrictions on W,

f_i(W) = 0, i = 1, ..., n.   (2.5)

A typical one is

f_i(W) = w_ii − 1, i = 1, ..., n.

Define the set Q_r of W's satisfying the above constraints to be

Q_r = {W | f_i(W) = 0, i = 1, ..., n}.   (2.6)

The set Q_r is supposed to form an (n² − n)-dimensional manifold transversal to C_W. In order to resolve the scaling indeterminacies, we require that W(t) = [w_ij(t)] always be included in Q_r. Such a rigid constraint is implemented, for example, by the learning algorithm

w_ij(t+1) = w_ij(t) − η(t)φ_i(y_i(t))y_j(t), i ≠ j,   (2.7)

where the φ_i(y_i) are suitably chosen nonlinear functions, η(t) is the step size, and all the diagonal elements w_ii are fixed equal to 1.
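A minimal sketch of one update of rule 2.7 (treating y = Wx as the output is a simplification on our part — in the original Hérault–Jutten setting the network is recurrent — and φ = tanh is assumed here only for illustration):

```python
import numpy as np

def rigid_constraint_step(W, x, eta, phi=np.tanh):
    # One update of rule 2.7: off-diagonal weights move by
    # -eta * phi(y_i) * y_j; the diagonal stays fixed at 1.
    y = W @ x
    dW = -eta * np.outer(phi(y), y)
    np.fill_diagonal(dW, 0.0)               # enforce the rigid constraint
    return W + dW
```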
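For contrast with such scale-fixing constraints, the scaling equivalence W ∼ ΛW itself is easy to see numerically (the mixing matrix below is arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 500))           # two independent source signals
A = np.array([[1.0, 0.5], [0.3, 1.0]])      # unknown mixing matrix, eq. 2.1
x = A @ s                                   # observed mixtures
W = np.linalg.inv(A)                        # ideal separating matrix
y = W @ x                                   # exact recovery, eq. 2.2
L = np.diag([2.0, -0.5])                    # nonsingular diagonal rescaling
y_scaled = (L @ W) @ x                      # the equivalent separator L @ W
# L @ W recovers the same sources, merely rescaled row by row.
```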
This idea of reducing the complexity of a learning algorithm slightly is a simple one and has been used in several papers (Hérault & Jutten, 1986; Jutten & Hérault, 1991; Sorouchyari, 1991; Chen & Chen, 1994; Moreau & Macchi, 1995; Matsuoka, Ohya, & Kawamoto, 1995; Oja & Karhunen, 1995). However, the constraints (see equation 2.5) do not allow equivariant properties (Cardoso & Laheld, 1996). Therefore, when A is ill conditioned, the algorithm does not work well.

The second constraint specifies the power or energy of the recovered (estimated) signals. Such soft constraints can be expressed in the general form

E[h_i(y_i)] = 0, i = 1, ..., n,   (2.8)

where E denotes the expectation and h_i(y_i) is a nonlinear function; typically

h_i(y_i) = 1 − φ_i(y_i)y_i   (2.9)

or

h_i(y_i) = 1 − y_i².   (2.10)

The latter guarantees that the variances or powers of the recovered signals y_i are 1. Such soft constraints are implemented automatically in the adaptive learning rule

W(t+1) = W(t) + η(t)F[y(t)]W(t),   (2.11)

where F[y(t)] is an n × n matrix of the form (Cichocki, Unbehauen, & Rummert, 1994; Amari et al., 1996)

F[y(t)] = I − φ(y(t))y^T(t),   (2.12)

or (Amari et al., 1997)

F[y(t)] = I − αφ(y(t))y^T(t) + βy(t)φ^T(y(t))   (2.13)

with adaptively or adequately chosen constants α and β, or (Cardoso & Laheld, 1996)

F[y(t)] = I − y(t)y^T(t) − φ(y(t))y^T(t) + y(t)φ^T(y(t)).   (2.14)

The diagonal elements of the matrix F(y) in equation 2.11 can be regarded as specifying the constraints

E[h_i(y_i)] = 0.   (2.15)
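One update of rule 2.11 with the choice 2.12 can be sketched as follows (φ = tanh is assumed here only for illustration; the text leaves φ generic):

```python
import numpy as np

def soft_constraint_step(W, x, eta, phi=np.tanh):
    # Rule 2.11 with F[y] = I - phi(y) y^T (eq. 2.12):
    # W <- W + eta * (I - phi(y) y^T) W.
    y = W @ x
    F = np.eye(W.shape[0]) - np.outer(phi(y), y)
    return W + eta * F @ W
```

At an equilibrium of the averaged dynamics, the diagonal of E[F] vanishes, giving E[φ(y_i)y_i] = 1, which is exactly the soft constraint 2.8 with h_i from equation 2.9.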
Then, although W(t) can move freely in C_W in the learning process, any equilibrium point of the averaged learning equation satisfies the constraints of equation 2.15. Let us define the set Q_s of W's by

Q_s = {W | E[h_i(y_i)] = 0, i = 1, ..., n}.   (2.16)

Then Q_s is an (n² − n)-dimensional attracting submanifold, and W(t) is attracted toward Q_s, although it does not stay exactly in Q_s because of stochastic fluctuations.

Summarizing, rigid constraints do not permit algorithms with equivariant properties (Cardoso & Laheld, 1996). Soft constraints enable us to construct algorithms with equivariant properties. However, for source signals whose amplitudes or powers change over time (e.g., signals with long silent periods and spikes or rapid jumps, like speech signals), it is more difficult to achieve rapid convergence to equilibria than in the stationary case, because of inadequate specification of the constraints. Special tricks or techniques may be employed, like random selection of training samples and dilation of silent periods (Bell & Sejnowski, 1995; Amari, Douglas, & Cichocki, 1998). Performance may be poorer for such nonstationary signals. Moreover, it is easy to show that algorithms 2.11 through 2.14 are unstable (in the sense that the weights w_ij(t) diverge to infinity) when the number of true sources is smaller than the number of observed signals, where W is a square n × n matrix. Such a case arises often in practice, when one or more source signals are silent or decay to zero in an unpredictable way. In this article, we elucidate the roles of such constraints geometrically. We then show that these constraints and drawbacks can be lifted, or at least relaxed, by introducing a new algorithm. In other words, we impose no explicit constraints on the separating matrix W and search for the equivalence class C_W directly. This article is theoretical in nature, and the simulations are limited. Larger-scale simulation studies and applications to real data are necessary to prove the practical usefulness of the proposed method.

3 Derivation of a New Learning Algorithm
In order to derive a new algorithm, we recapitulate the derivation of the natural gradient algorithm due to Amari, Chen, and Cichocki (1997). To begin, we consider the loss function

l(y, W) = −log |det(W)| − Σ_{i=1}^n log p_i(y_i),   (3.1)

where the p_i(y_i) are some positive probability density functions (p.d.f.) of output
signals and det( W ) denotes the determinant of matrix W . We can then apply the stochastic gradient descent learning method to derive a learning rule. In order to calculate the gradient of l, we derive the total differential dl of l when W is changed from W to W C dW . In component form, dl D l( y , W C dW ) ¡ l( y , W ) D
X @l i, j
@wij
dw ij ,
(3.2)
where the coefcients @l / @wij of dwij represent the gradient of l. Simple algebraic and differential calculus yields dl D ¡tr(dW W ¡1 ) C ’( y )T dy ,
(3.3)
where tr denotes the trace of a matrix and φ(y) is a column vector whose components are

\[ \varphi_i(y_i) = -\frac{d \log p_i(y_i)}{dy_i} = -\frac{p_i'(y_i)}{p_i(y_i)}, \tag{3.4} \]

the prime denoting differentiation. From y = Wx, we have

\[ dy = dW \, x = dW \, W^{-1} y. \tag{3.5} \]
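As a sanity check, the differential identity 3.3 can be verified numerically for a concrete choice of marginals. Here Laplacian densities p_i(y) ∝ exp(−|y|), for which φ_i(y) = sign(y), are an illustrative assumption (the paper leaves the p_i general):

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(W, x):
    # Loss (3.1), dropping the additive constant, for Laplacian marginals
    # p_i(y) ~ exp(-|y|):  l = -log|det W| + sum_i |y_i|
    y = W @ x
    return -np.log(abs(np.linalg.det(W))) + np.sum(np.abs(y))

W = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
x = rng.normal(size=3)
dW = 1e-6 * rng.normal(size=(3, 3))

# Differential form (3.3): dl = -tr(dW W^{-1}) + phi(y)^T dy, with phi(y) = sign(y)
y = W @ x
dy = dW @ x
dl_formula = -np.trace(dW @ np.linalg.inv(W)) + np.sign(y) @ dy
dl_numeric = loss(W + dW, x) - loss(W, x)
print(abs(dl_formula - dl_numeric))  # agrees to first order in dW
```

The residual is of second order in dW, confirming that 3.3 is the correct first-order expansion of 3.1.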
Hence, we define

\[ dX = dW \, W^{-1}, \tag{3.6} \]

whose components dx_ij are linear combinations of the dw_ij. The differentials dx_ij form a basis of the tangent space at W of Gl(n), since they are linear combinations of the basis dw_ij. The space Gl(n) of all nonsingular n × n matrices is a Lie group. An element W is mapped to the identity I when W^{-1} is multiplied from the right. A small deviation dW at W is mapped to dX = dW W^{-1} at I. By using dX, we can compare deviations dW_1 and dW_2 at two different points W_1 and W_2, using dX_1 = dW_1 W_1^{-1} and dX_2 = dW_2 W_2^{-1}. The magnitude of the deviation dX is measured by the inner product

\[ \langle dX, dX \rangle_I = \operatorname{tr}\left( dX \, dX^T \right) = \sum_{i,j} \left( dx_{ij} \right)^2. \tag{3.7} \]

This implies that the inner product of dW at W is given by

\[ \langle dW, dW \rangle_W = \operatorname{tr}\left\{ dW \, W^{-1} \left( dW \, W^{-1} \right)^T \right\}, \tag{3.8} \]
1470
Shun-ichi Amari, Tian-Ping Chen, and Andrzej Cichocki
which is different from Σ_{i,j} (dw_ij)². Therefore, the space Gl(n) has a Riemannian metric. It should be noted that dX = dW W^{-1} is a nonintegrable differential form, so that we do not have a matrix function X(W) whose differential gives equation 3.6. Nevertheless, the nonholonomic basis dX is useful. Since the basis dX is orthonormal in the sense that ‖dX‖² is given by equation 3.7, it is effective to analyze the learning equation in terms of dX. The natural Riemannian gradient (Amari, Cichocki, & Yang, 1995; Amari et al., 1996, 1998) is automatically implemented by it, and the equivariant properties automatically hold. It is easy to rewrite the results in terms of dW by using equation 3.6. The gradient dl in equation 3.3 is expressed in the differential form

\[ dl = -\operatorname{tr}(dX) + \varphi(y)^T dX \, y. \tag{3.9} \]
This leads to the stochastic gradient learning algorithm,

\[ \Delta X(t) = X(t+1) - X(t) = -\eta(t) \frac{dl}{dX} = \eta(t) \left[ I - \varphi(y(t)) \, y^T(t) \right], \tag{3.10} \]

in terms of ΔX(t) = ΔW(t) W^{-1}(t). This is rewritten as

\[ W(t+1) = W(t) + \eta(t) \left[ I - \varphi(y(t)) \, y^T(t) \right] W(t) \tag{3.11} \]
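A minimal NumPy sketch of one update of rule 3.11 is given below. The cubic score is an illustrative choice (suited to subgaussian sources); the paper leaves φ_i = −p_i′/p_i general at this point:

```python
import numpy as np

def phi(y):
    # Cubic score function, an illustrative assumption; any component-wise
    # score phi_i(y_i) = -p_i'(y_i)/p_i(y_i) fits rule (3.11).
    return y ** 3

def natural_gradient_step(W, x, eta):
    """One step of rule (3.11): W <- W + eta * [I - phi(y) y^T] W, with y = W x."""
    y = W @ x
    n = W.shape[0]
    return W + eta * (np.eye(n) - np.outer(phi(y), y)) @ W
```

Note that the update multiplies W from the left by a matrix that depends on W and x only through y; this is the source of the equivariant behavior.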
in terms of W(t). It was shown that the diagonal term in equation 3.11 can be set arbitrarily (Cichocki, Unbehauen, Moszczynski, & Rummert, 1994; Cichocki & Unbehauen, 1996; Amari & Cardoso, 1997). The problem has also been analyzed based on the information geometry of semiparametric statistical models (Amari & Kawanabe, 1997a, 1997b; Bickel, Klaassen, Ritov, & Wellner, 1993). Therefore, the above algorithm can be generalized in a more flexible and universal form as

\[ W(t+1) = W(t) + \eta(t) \left[ \Lambda(t) - \varphi(y(t)) \, y^T(t) \right] W(t), \tag{3.12} \]

where Λ(t) is any positive definite diagonal scaling matrix. By determining Λ appropriately depending on y, we obtain various algorithms. We propose

\[ \Lambda(t) = \operatorname{diag}\left\{ y_1(t) \varphi_1(y_1(t)), \ldots, y_n(t) \varphi_n(y_n(t)) \right\}. \]

This algorithm is effective when the nonlinear activation functions φ_i(y_i) are suitably chosen. This corresponds to setting the new constraints

\[ dx_{ii}(t) = 0, \qquad i = 1, \ldots, n. \tag{3.13} \]
The new constraints lead to a new learning algorithm, written as

\[ W(t+1) = W(t) - \eta(t) F[y(t)] W(t), \tag{3.14} \]

where all the elements on the main diagonal of the n × n matrix F(y) are set equal to zero, that is,

\[ f_{ii} = 0, \tag{3.15} \]

and

\[ f_{ij} = \varphi_i(y_i) y_j, \qquad i \neq j. \tag{3.16} \]
The learning rule, 3.14, is called the orthogonal nonholonomic natural gradient descent algorithm, for the reason shown in the next section. Algorithm 3.14 can be generalized as follows (Amari et al., 1997):

\[ W(t+1) = W(t) - \eta(t) F\{y(t)\} W(t), \tag{3.17} \]

with the entries of the matrix F(y) given by

\[ f_{ij} = \delta_{ij} \lambda_{ii} + \alpha_1 y_i y_j + \alpha_2 \varphi_i(y_i) y_j - \alpha_3 y_i \varphi_j(y_j), \tag{3.18} \]

where δ_ij is the Kronecker delta, λ_ii = −α_1 y_i² − α_2 φ_i(y_i) y_i + α_3 y_i φ_i(y_i), and α_1, α_2, and α_3 are parameters that may be determined adaptively.

4 Nonholonomicity and Orthogonality
The space Gl(n) = {W} is partitioned into equivalence classes C_W, where any W′ belonging to the same C_W is regarded as an equivalent demixing matrix. Let dW_C be a direction tangent to C_W; that is, W + dW_C and W belong to the same equivalence class. The learning equation determines ΔW(t) depending on the current y and W. When ΔW(t) includes components belonging to the tangent directions of C_W, such components are ineffective, because they drive W within the equivalence class. Therefore, it is better to design a learning rule such that its trajectories ΔW(t) are always orthogonal to the equivalence classes. Since the C_W are n-dimensional subspaces, if we could find a family of (n² − n)-dimensional subspaces Q such that C_W and Q are orthogonal, we could impose the constraint that the learning trajectories belong to Q. Therefore, one interesting question arises: Is there a family of (n² − n)-dimensional submanifolds Q that are orthogonal to the equivalence classes C_W? The answer is no. We can now prove that there does
not exist a submanifold that is orthogonal to the family of submanifolds C_W.

Theorem 1. The direction dW is orthogonal to C_W if and only if

\[ dx_{ii} = 0, \qquad i = 1, \ldots, n, \tag{4.1} \]

where dX = dW W^{-1}.

Proof. Since the equivalence class C_W consists of matrices Λ W, Λ is regarded as a coordinate system in C_W. A small deviation of W within C_W is written as
\[ dW_C = d\Lambda \, W, \tag{4.2} \]

where dΛ = diag(dλ_1, …, dλ_n). The tangent space of C_W is spanned by these. The inner product of dW and dW_C is given by

\[ \langle dW, dW_C \rangle_W = \langle dW \, W^{-1}, dW_C \, W^{-1} \rangle_I = \langle dX, d\Lambda \rangle_I = \sum_{i,j} dx_{ij} \, d\lambda_{ij} = \sum_i dx_{ii} \, d\lambda_{ii}. \tag{4.3} \]
Therefore, dW is orthogonal to C_W when and only when dW satisfies dx_ii = 0 for all i.

We next show that dX is not integrable; that is, there are no matrix functions G(W) such that

\[ dX = \frac{\partial G}{\partial W} \circ dW, \tag{4.4} \]

where

\[ \frac{\partial G}{\partial W} \circ dW = \sum_{i,j} \frac{\partial G}{\partial w_{ij}} \, dw_{ij}. \tag{4.5} \]
If such a G exists, X = G(W) defines another coordinate system in Gl(n). Even when such a G does not exist, dX is well defined, and it forms a basis in the tangent space of Gl(n) at W. Such a basis is called a nonholonomic basis (Schouten, 1954; Frankel, 1997).

Theorem 2. The basis defined by dX = dW W^{-1} is nonholonomic.
Proof. Let us assume that there exists a function G(W) such that
\[ dX = dG(W) = dW \, W^{-1}. \tag{4.6} \]

We now consider another small deviation from W to W + δW. We then have

\[ \delta dX = \delta dG = dW \, \delta(W^{-1}). \tag{4.7} \]

We have

\[ \delta(W^{-1}) = (W + \delta W)^{-1} - W^{-1} \simeq -W^{-1} \delta W \, W^{-1}. \tag{4.8} \]

Hence,

\[ \delta dX = -dW \, W^{-1} \delta W \, W^{-1}. \tag{4.9} \]

Since matrices are in general noncommutative, we have

\[ \delta dX \neq d \delta X. \tag{4.10} \]
This shows that there does not exist such a matrix X, because δdG = dδG always holds when a matrix function G exists.

Our constraints,

\[ dx_{ii} = 0, \qquad i = 1, \ldots, n, \tag{4.11} \]

restrict the possible directions of ΔW and define (n² − n)-dimensional movable directions at each point W of Gl(n). These directions are orthogonal to C_W. However, by the same reasoning, there do not exist functions g_i(W) such that

\[ dx_{ii} = dg_i(W) = \sum_{j,k} \frac{\partial g_i(W)}{\partial w_{jk}} \, dw_{jk}. \tag{4.12} \]

This implies that there exist no subspaces defined by g_i(W) = 0. Such constraints are said to be nonholonomic. The learning equation, 3.14, with equation 3.15 defines a learning dynamics with nonholonomic constraints. At each point W, ΔW is constrained to (n² − n) directions. However, the trajectories are not confined to any (n² − n)-dimensional subspace, and they can reach any point in Gl(n).
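Theorem 2 can also be checked numerically: if a matrix function X(W) with dX = dW W^{-1} existed, the accumulated dX would be path independent. The sketch below (the way-points are arbitrary illustrative choices) integrates dX along two different discrete paths in Gl(2) joining the same endpoints and shows that the results differ:

```python
import numpy as np

def accumulate_X(path):
    """Sum dX = dW W^{-1} along a discrete path of invertible matrices."""
    X = np.zeros_like(path[0])
    for W_prev, W_next in zip(path, path[1:]):
        X += (W_next - W_prev) @ np.linalg.inv(W_prev)
    return X

def refine(waypoints, steps=200):
    """Linearly interpolate between way-points to approximate a smooth path."""
    out = []
    for A, B in zip(waypoints, waypoints[1:]):
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            out.append((1 - t) * A + t * B)
    out.append(waypoints[-1])
    return out

W0 = np.eye(2)
W1 = np.array([[2.0, 1.0], [0.5, 3.0]])
Wm_a = np.array([[2.0, 0.0], [0.0, 1.0]])   # intermediate point, path A
Wm_b = np.array([[1.0, 0.0], [0.5, 1.0]])   # intermediate point, path B

X_a = accumulate_X(refine([W0, Wm_a, W1]))
X_b = accumulate_X(refine([W0, Wm_b, W1]))
# Path dependence: no potential X(W) with dX = dW W^{-1} can exist
print(np.linalg.norm(X_a - X_b))
```

For n = 1 (the commuting case) the two sums would agree and X = log W would be a potential; the discrepancy is purely a noncommutativity effect, as in equations 4.9 and 4.10.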
This property is important when the amplitudes of the s_i(t) change over time. Consider soft constraints of the form

\[ E\left[ h_i(y_i) \right] = 0, \tag{4.13} \]

for example,

\[ E\left[ y_i^2 - 1 \right] = 0. \tag{4.14} \]

When E[s_i²] suddenly becomes 10 times smaller, the learning dynamics with the soft constraints makes the ith row of W larger (by a factor of √10) in order to compensate for this change and keep E[y_i²] = 1. Therefore, even when W converges to the true C_W, large fluctuations emerge from the ineffective movements caused by changes in the amplitudes of the s_i. This sometimes causes numerical instability. On the other hand, our nonholonomic dynamics is always orthogonal to C_W, so that such ineffective fluctuations are suppressed. This is confirmed by computer simulations, a few of which are presented in section 6.

Remark 1. There is a great deal of research on nonholonomic dynamics. Classical analytical dynamics uses nonholonomic bases to analyze the spinning gyro (Whittaker, 1940). They were also used in relativity (Whittaker, 1940). Kron (1948) used nonholonomic constraints to present a general theory of the electromechanical dynamics of generators and motors. Nonholonomic properties play a fundamental role in the continuum mechanics of distributed dislocations (see, e.g., Amari, 1962). An excellent explanation is found in Brockett (1993), where the controllability of nonlinear dynamical systems is shown by using the related Lie algebra. Recently, robotics researchers have been eager to analyze dynamics with nonholonomic constraints (Suzuki & Nakamura, 1995).
Remark 2. Theorem 2 shows that the orthogonal natural gradient descent algorithm evolves along a trajectory path that does not include redundant (useless) components in the directions of C_W. Therefore, it seems likely that the orthogonal algorithm can be more efficient than other algorithms, as has been confirmed by preliminary computer simulations.
5 Stability Analysis
In this section, we discuss the local stability of the orthogonal natural gradient descent algorithm, 3.14, defined by

\[ \Delta W(t) = -\eta(t) F\{y(t)\} W(t), \tag{5.1} \]

with f_ii = 0 and f_ij = φ_i(y_i) y_j for i ≠ j, in the vicinity of the desired equilibrium submanifold C_{A^{-1}}. Similar to a theorem established in Amari et al. (1997), we formulate the following theorem:
Theorem 3. When the zero-mean independent source signals satisfy the conditions

\[ E\{\varphi_i(s_i) s_j\} = 0, \qquad i \neq j, \tag{5.2} \]

\[ E\{\dot\varphi_i(s_i) s_j s_k\} = 0, \qquad k \neq j, \tag{5.3} \]

and when the following inequalities hold,

\[ E\{\dot\varphi_i(s_i) s_j^2\} \, E\{\dot\varphi_j(s_j) s_i^2\} > E\{s_i \varphi_i(s_i)\} \, E\{s_j \varphi_j(s_j)\}, \qquad i, j = 1, \ldots, n, \tag{5.4} \]
then the desired equivalence class C_W, W = A^{-1} P Λ, is a stable equilibrium. However, if conditions 5.2 and 5.3 are still satisfied but

\[ E\{\dot\varphi_i(s_i) s_j^2\} \, E\{\dot\varphi_j(s_j) s_i^2\} < E\{s_i \varphi_i(s_i)\} \, E\{s_j \varphi_j(s_j)\}, \tag{5.5} \]

then, replacing f_ij = φ_i(y_i) y_j by

\[ f_{ij} = y_i \varphi_j(y_j), \qquad i \neq j, \tag{5.6} \]

it is guaranteed that the class C_W, W = A^{-1} P Λ, contains stable separating equilibria.

Proof. Let W = A^{-1}, let (dW)_i denote the ith row of dW, and let (W^{-1})^j denote the jth column of W^{-1}. Then, taking into account that dx_ij = φ_i(y_i) y_j and dx_ii = 0, we evaluate the second differentials in the Hessian matrix as follows:

\[
\begin{aligned}
d^2 x_{ij} &= E\{ d(\varphi_i(y_i) y_j) \} \\
&= E\{ \dot\varphi_i(y_i) y_j \, dy_i \} + E\{ \varphi_i(y_i) \, dy_j \} \\
&= E\{ \dot\varphi_i(y_i) y_j \, (dW)_i (W^{-1} s) \} + E\{ \varphi_i(y_i) \, (dW)_j (W^{-1} s) \} \\
&= E\{ \dot\varphi_i(y_i) y_j \, (dW)_i (W^{-1})^j \} + E\{ \varphi_i(y_i) y_i \, (dW)_j (W^{-1})^i \} \\
&= -E\{ \dot\varphi_i(s_i) s_j^2 \} \, dx_{ij} - E\{ \varphi_i(s_i) s_i \} \, dx_{ji}. \tag{5.7}
\end{aligned}
\]
Hence

\[ d^2 x_{ij} = -E\{\dot\varphi_i(s_i) s_j^2\} \, dx_{ij} - E\{\varphi_i(s_i) s_i\} \, dx_{ji}, \tag{5.8} \]

\[ d^2 x_{ji} = -E\{\varphi_j(s_j) s_j\} \, dx_{ij} - E\{\dot\varphi_j(s_j) s_i^2\} \, dx_{ji}, \tag{5.9} \]

\[ d^2 x_{ii} = 0. \tag{5.10} \]
By the assumptions made in theorem 3, we see that the equilibrium point W = A^{-1} is stable. The proof for W = A^{-1} P Λ is the same and is omitted here for brevity. The proof for the other case is similar.

From the proof of theorem 3, we have the following corollary.

Corollary. Let us consider the special case where all neurons have the same activation function, the cubic nonlinearity φ_i(y_i) = y_i³. In this case, if all the source signals are subgaussian (with negative kurtosis), then the learning algorithm, 3.14, with f_ii = 0 and f_ij = y_i³ y_j for i ≠ j ensures local stability of the desired equilibrium points W = A^{-1} P Λ. Analogously, if all the source signals are supergaussian (with positive kurtosis), then the algorithm, 3.14, with f_ii = 0 and f_ij = y_i y_j³ for i ≠ j ensures that all the desired equilibrium points W = A^{-1} P Λ are stable equilibrium separating points.
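The corollary can be checked by Monte Carlo estimation of both sides of condition 5.4 for the cubic score. The particular source distributions below, uniform (negative kurtosis) and Laplace (positive kurtosis), are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def stability_lhs_rhs(s_i, s_j, phi=lambda s: s**3, dphi=lambda s: 3 * s**2):
    """Monte Carlo estimate of both sides of condition (5.4), using the
    independence of s_i and s_j to factor the expectations."""
    lhs = (np.mean(dphi(s_i)) * np.mean(s_j**2)
           * np.mean(dphi(s_j)) * np.mean(s_i**2))
    rhs = np.mean(s_i * phi(s_i)) * np.mean(s_j * phi(s_j))
    return lhs, rhs

n = 200_000
# Subgaussian: uniform on [-sqrt(3), sqrt(3)] has unit variance, negative kurtosis
u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
# Supergaussian: unit-variance Laplace has positive kurtosis
l = rng.laplace(scale=1 / np.sqrt(2), size=(2, n))

lhs_u, rhs_u = stability_lhs_rhs(u[0], u[1])
lhs_l, rhs_l = stability_lhs_rhs(l[0], l[1])
print(lhs_u > rhs_u)   # True: (5.4) holds for subgaussian sources
print(lhs_l > rhs_l)   # False: per the corollary, use f_ij = y_i phi_j(y_j) instead
```

Analytically, for unit-variance sources with φ(s) = s³ the left side is 9, while the right side is (E[s⁴])², so 5.4 reduces exactly to the sign of the kurtosis, as the corollary states.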
6 Computer Simulations
In our computer experiments, we have assumed that the source signals have generalized gaussian distributions, and the corresponding nonlinear activation functions have the form

\[ \varphi(y_i) = |y_i|^{r_i - 1} \operatorname{sgn}(y_i) \quad \text{for } r_i \geq 1, \qquad \varphi(y_i) = \frac{y_i}{|y_i|^{2 - r_i} + \varepsilon} \quad \text{for } 0 \leq r_i < 1, \]

where ε is a small positive constant added to prevent singularity of the nonlinear function (Cichocki, Bogner, Moszczynski, & Pope, 1997; Cichocki, Sabala, & Amari, 1998). For spiky supergaussian sources, the optimal parameter r_i lies between zero and one. In the illustrative examples that follow, we have assumed r_i = 0.5, and the learning rate is fixed at η = 0.001. Preliminary computer simulation experiments have confirmed the high performance of the proposed algorithm. We give two illustrative examples to show the efficiency of the orthogonal natural gradient adaptive algorithm and to verify the validity of our theorems. We also show that the orthogonal natural gradient descent algorithm is usually more efficient than soft-constraint natural gradient descent algorithms for nonstationary
Figure 1: Source and sensor signals for example 1. (a) Original source signals. (b) Mixed (sensor) signals.
signals, especially when the number of sources changes dynamically in time.

6.1 Example 1. To show the effectiveness of our algorithm, we assume that the number of sources is unknown (but not greater than the number of sensors). Three natural speech signals, shown in Figure 1a, are mixed into four sensors by an ill-conditioned 4 × 3 mixing matrix. It is assumed that the mixing matrix and the number of sources are unknown. It should be noted that the sensor signals look almost identical (see Figure 1b). Recovered signals are shown in Figure 2a.
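A minimal implementation of the generalized gaussian activation functions used in these experiments might look as follows (the default ε is an illustrative choice):

```python
import numpy as np

def activation(y, r, eps=1e-4):
    """Score function for generalized gaussian sources:
    phi(y) = |y|^(r-1) * sign(y)       for r >= 1,
    phi(y) = y / (|y|^(2-r) + eps)     for 0 <= r < 1,
    where eps prevents the singularity at y = 0."""
    y = np.asarray(y, dtype=float)
    if r >= 1:
        return np.abs(y) ** (r - 1) * np.sign(y)
    return y / (np.abs(y) ** (2 - r) + eps)
```

Note that r = 2 gives a linear score, r = 4 gives the cubic score, and r = 0.5 is the spiky supergaussian setting used in the examples.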
Figure 2: Separated signals for example 1. (a) Separated source signals for zero noise using the ONGD (orthogonal natural gradient descent) algorithm, 3.14. (b) Separated signals using the standard algorithm, 3.11.
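A toy version of example 1 can be sketched as follows. Synthetic subgaussian waveforms stand in for the speech recordings, and the mixing matrix, learning rate, iteration count, and cubic score are all illustrative assumptions. The sketch compares how the Frobenius norm of W evolves under the standard rule, 3.11, and the orthogonal rule, 3.14, when W is 4 × 4 but only three sources are present:

```python
import numpy as np

rng = np.random.default_rng(0)

phi = lambda y: y ** 3            # cubic score, suited to subgaussian sources
n_src, n_sen, T, eta = 3, 4, 3000, 0.01

# Synthetic subgaussian stand-ins for the three speech sources of example 1
t = np.arange(T)
S = np.vstack([np.sin(0.05 * t),
               np.sign(np.sin(0.031 * t)),
               rng.uniform(-1, 1, T)])
A = rng.normal(size=(n_sen, n_src))        # unknown 4 x 3 mixing matrix
X = A @ S                                   # sensor signals

W_std = 0.5 * np.eye(n_sen)
W_onh = 0.5 * np.eye(n_sen)
for x in X.T:
    y = W_std @ x                           # standard rule (3.11)
    W_std += eta * (np.eye(n_sen) - np.outer(phi(y), y)) @ W_std
    y = W_onh @ x                           # orthogonal rule (3.14)
    F = np.outer(phi(y), y)
    np.fill_diagonal(F, 0.0)
    W_onh -= eta * F @ W_onh

print(np.linalg.norm(W_std), np.linalg.norm(W_onh))
```

With only three sources exciting the four sensors, the identity term in 3.11 keeps inflating W along directions that the data never constrain, while the zero-diagonal rule 3.14 keeps the norm bounded.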
The algorithm was able, after 5000 iterations, to estimate not only the waveforms of the original speech signals but also their number. It should be noted that one output signal quickly decays to zero. For comparison, we applied the standard independent component analysis learning algorithm, 3.11, to the same noiseless data. The standard algorithm also recovered the original three signals well, but it created a false signal, as shown in Figure 2b. Moreover, the norm of the separating matrix W increases exponentially, as illustrated in Figure 3a. This problem can be circumvented by preliminarily computing the rank of the covariance matrix of the sensor signals and then using only a number of outputs equal to that rank, provided noise
Figure 3: Convergence behavior of the Frobenius norm of the separating matrix W as a function of the number of iterations for (a) the standard algorithm, 3.11, in the noiseless case, (b) the standard algorithm, 3.11, with 1% additive gaussian noise, and (c) the orthogonal natural gradient descent algorithm, 3.14.
is negligible. However, such an approach is not very practical in a nonstationary environment where the number of sources changes continuously in time. When a small amount (1%) of additive noise is added to the sensors, the standard algorithm can be stabilized, as shown in Figure 3b. However, the fluctuation of the norm of the separating matrix W is larger than for the orthogonal natural gradient descent algorithm (compare Figures 3b and 3c).

6.2 Example 2. In this simulation, we separated seven highly nonstationary natural speech signals, shown in Figure 4a, using seven sensors (microphones), assuming that two of the sources are temporarily zero (silence periods). It is assumed again that the number of active source signals and the randomly chosen mixing matrix A are completely unknown. The new learning algorithm, 3.14, was able to separate all the signals, as illustrated in Figure 4c. Again we found by simulation that the standard learning algorithm, 3.11, exhibits higher fluctuation of the synaptic weights of the separating matrix. Moreover, the absolute values of the weights increase exponentially to infinity in the noiseless case, so we are not able to compare the performance of the two algorithms, because the standard ICA algorithm, 3.11, is in principle unstable in the noiseless case (without computing the rank of the covariance matrix of the sensor signals). Even for small noise, the weights of the separating matrix take very large values and are unstable for the standard ICA method.

7 Conclusion
We have pointed out several drawbacks in existing adaptive algorithms for blind separation of signals, for example, the problem of separation of nonstationary signals and the problem of stability when some sources decay to zero. To avoid these problems, we developed a novel orthogonal natural gradient descent algorithm with the equivariant property. We have demonstrated by computer simulation experiments that the proposed algorithm ensures a good convergence speed even for ill-conditioned problems. The algorithm solves the problem of on-line estimation of the waveforms of the source signals even when the number of sources is unknown and smaller than the number of sensors. Moreover, the algorithm works properly and precisely for highly nonstationary signals like speech signals and biomedical signals. It is proved theoretically that the trajectories governed by the algorithm do not include redundant (useless) components, and the algorithm is potentially more efficient than algorithms with soft or rigid constraints. The main features of the algorithm are its orthogonality and its convergence to an equivalence class of separating matrices rather than to isolated critical points. Local stability conditions for the developed algorithm are also given.
Figure 4: Computer simulation for example 2. (a) Original active speech signals. (b) Mixed (microphone) signals. (c) Separated (estimated) source signals.
References

Amari, S. (1962). On some primary structures of non-Riemannian elasticity theory. In K. Kondo (Ed.), RAAG memoirs of the unifying study of basic problems in engineering and physical sciences (Vol. 3, pp. 99–108). Tokyo: Gakujutsu Bunken Fukyu-kai.
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. on Electronic Computers, 16(3), 299–307.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S., & Cardoso, J. F. (1997). Blind source separation—Semi-parametric statistical approach. IEEE Trans. on Signal Processing, 45(11), 2692–2700.
Amari, S., Chen, T. P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S., & Cichocki, A. (1998). Blind signal processing—neural network approaches. Proceedings of the IEEE, special issue on Blind Identification and Estimation, 86, 2026–2048.
Amari, S., Cichocki, A., & Yang, H. H. (1995). Recurrent neural networks for blind separation of sources. In Proceedings of the International Symposium on Nonlinear Theory and Its Applications, NOLTA-95 (pp. 37–42).
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Amari, S., Douglas, S. C., & Cichocki, A. (1998). Multichannel blind deconvolution and equalization using the natural gradient. Manuscript submitted for publication.
Amari, S., & Kawanabe, M. (1997a). Information geometry of estimating functions in semiparametric statistical models. Bernoulli, 3, 29–54.
Amari, S., & Kawanabe, M. (1997b). Estimating functions in semi-parametric statistical models. In I. V. Basawa, V. P. Godambe, & R. L. Taylor (Eds.), Estimating functions (Vol. 3, pp. 65–81).
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semi-parametric models. Baltimore, MD: Johns Hopkins University Press.
Brockett, R. W. (1993). Differential geometry and design of gradient algorithms. In R. E. Green & S. T. Yau (Eds.), Differential geometry: Partial differential equations on manifolds (pp. 69–92). Providence, RI: American Mathematical Society.
Cardoso, J. F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Chen, T. P., & Chen, R. (1994). Blind extraction of stochastic and deterministic signals by neural network approach. In Proc. of the 28th Asilomar Conference on Signals, Systems and Computers (pp. 892–896).
Cichocki, A., Amari, S., Adachi, S., & Kasprzak, W. (1996). Self-adaptive neural networks for blind separation of sources. In Proceedings of the Int. Symp. on Circuits and Systems (ISCAS-96) (Vol. 2, pp. 157–161).
Cichocki, A., Bogner, R. E., Moszczynski, L., & Pope, K. (1997). Modified Hérault-Jutten algorithms for blind separation of sources. Digital Signal Processing, 7(2), 80–93.
Cichocki, A., Sabala, I., & Amari, S. (1998). Intelligent neural networks for blind signal separation with unknown number of sources. In Proc. of the Conf. on Engineering of Intelligent Systems, ESI-98 (pp. 148–154).
Cichocki, A., Sabala, I., Choi, S., Orsier, B., & Szupiluk, R. (1997). Self-adaptive independent component analysis for sub-gaussian and super-gaussian mixtures with unknown number of sources and additive noise. In Proceedings of the 1997 International Symposium on Nonlinear Theory and Its Applications (NOLTA-97) (Vol. 2, pp. 731–734).
Cichocki, A., & Unbehauen, R. (1996). Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Transactions on Circuits and Systems—I: Fundamental Theory and Applications, 43, 894–906.
Cichocki, A., Unbehauen, R., Moszczynski, L., & Rummert, E. (1994). A new on-line adaptive algorithm for blind separation of source signals. In Proc. of the 1994 Int. Symposium on Artificial Neural Networks ISANN-94 (pp. 406–411).
Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386–1387.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Douglas, S. C., Cichocki, A., & Amari, S. (1997). Multichannel blind separation and deconvolution of sources with arbitrary distributions. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing (pp. 436–444). New York: IEEE Press.
Frankel, T. (1997). The geometry of physics. Cambridge: Cambridge University Press.
Girolami, M., & Fyfe, C. (1997a). Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proceedings on Vision, Image and Signal Processing, 14, 299–306.
Girolami, M., & Fyfe, C. (1997b). Independence is far from normal. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 297–302).
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. S. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206–211). New York: American Institute of Physics.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–20.
Kron, G. E. (1948). Geometrical theory of electrodynamics. New York: Wiley.
Matsuoka, K., Ohya, M., & Kawamoto, M. (1995). A neural net for blind separation of non-stationary signals. Neural Networks, 8(3), 411–419.
Moreau, E., & Macchi, O. (1995). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami et al. (Eds.), Computational intelligence—A dynamic system perspective (pp. 83–97). New York: IEEE Press.
Schouten, J. A. (1954). Ricci calculus. Berlin: Springer-Verlag.
Sorouchyari, E. (1991). Blind separation of sources, part III: Stability analysis. Signal Processing, 24, 21–30.
Suzuki, T., & Nakamura, Y. (1995). Spiral motion of nonholonomic space robots. Journal of the Robotics Society of Japan, 13, 1020–1029.
Whittaker, E. T. (1940). Analytical dynamics. Edinburgh: Proc. Roy. Soc.

Received September 15, 1997; accepted June 4, 1998.
Neural Computation, Volume 12, Number 7

CONTENTS

Article
The Minimal Local-Asperity Hypothesis of Early Retinal Lateral Inhibition. Rosario M. Balboa and Norberto M. Grzywacz. 1485

Note
Multidimensional Encoding Strategy of Spiking Neurons. Christian W. Eurich and Stefan D. Wilke. 1519

Letters
Synergy in a Neural Code. Naama Brenner, Steven P. Strong, Roland Koberle, William Bialek, and Rob R. de Ruyter van Steveninck. 1531
Increased Synchrony with Increase of a Low-Threshold Calcium Conductance in a Model Thalamic Network: A Phase-Shift Mechanism. Elizabeth Thomas and Thierry Grisar. 1553
Multispikes and Synchronization in a Large Neural Network with Temporal Delays. Jan Karbowski and Nancy Kopell. 1573
Synchrony in Heterogeneous Networks of Spiking Neurons. L. Neltner, D. Hansel, G. Mato, and C. Meunier. 1607
Dynamics of Spiking Neurons with Electrical Coupling. Carson C. Chow and Nancy Kopell. 1643
A Model for Fast Analog Computation Based on Unreliable Synapses. Wolfgang Maass and Thomas Natschläger. 1679
Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces. Aapo Hyvärinen and Patrik Hoyer. 1705
Tilt Aftereffects in a Self-Organizing Model of the Primary Visual Cortex. James A. Bednar and Risto Miikkulainen. 1721

Errata. 1741
ARTICLE
Communicated by J. van Hateren
The Minimal Local-Asperity Hypothesis of Early Retinal Lateral Inhibition

Rosario M. Balboa
Norberto M. Grzywacz
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115-1813, U.S.A.

Recently we found that the theories related to information theory existent in the literature cannot explain the behavior of the extent of the lateral inhibition mediated by retinal horizontal cells as a function of background light intensity. These theories can explain the fall of the extent from intermediate to high intensities, but not its rise from dim to intermediate intensities. We propose an alternate hypothesis that accounts for the extent's bell-shape behavior. This hypothesis proposes that the lateral-inhibition adaptation in the early retina is part of a system to extract several image attributes, such as occlusion borders and contrast. To do so, this system would use prior probabilistic knowledge about the biological processing and relevant statistics in natural images. A key novel statistic used here is the probability of the presence of an occlusion border as a function of local contrast. Using this probabilistic knowledge, the retina would optimize the spatial profile of lateral inhibition to minimize attribute-extraction error. The two significant errors that this minimization process must reduce are due to the quantal noise in photoreceptors and the straddling of occlusion borders by lateral inhibition.

1 Introduction
Theories for lateral inhibition existent in the literature that are based on information-theory-like ideas¹ are not consistent with the behavior of lateral inhibition in the retina's outer plexiform layer (OPL) (Balboa & Grzywacz, 2000). These theories can only explain qualitatively the fall of the lateral-inhibition extent as a function of intensity beginning at intermediate intensities (Baldridge & Ball, 1991; Myhr, Dong, & McReynolds, 1994; Lankheet, Rowe, Van Wezel, & van der Grind, 1996). To account for this intensity behavior, three of these theories use second-order statistics (autocorrelation or power spectrum) in natural images (Srinivasan, Laughlin, & Dubs, 1982;

¹ We call these theories information-theory-like because, although they essentially deal with signal-to-noise ratio, as does conventional information theory, they do not quantify information per se, as that theory does.
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1485–1517 (2000).
1486
Rosario M. Balboa and Norberto M. Grzywacz
Atick & Redlich, 1992; McCarthy & Owen, 1996), while the fourth (Field, 1994) uses a fourth-order statistic (the image kurtosis). These theories explain this behavior because, as intensity decreases, the relative noise due to random photon absorption increases. Hence, the theories must increase the lateral-inhibition extent to average out the noise. No one can rule out that some of the computations performed in the OPL are related to the statistics used by these theories. However, something is missing to explain the increase of the extent of the OPL's lateral inhibition with intensity at low intensities (Mangel & Dowling, 1985; Baldridge, 1993; Jamieson, 1994; Xin, Bloomfield, & Persky, 1994). The complete behavior from low to high intensities was only recently published (Xin & Bloomfield, 1999); Figure 1 shows this behavior. (Previous experiments were not sufficiently broad to cover carefully the full range of intensities, eliminating artifacts due to adaptation to the test stimuli.) The upper panel shows the electrophysiological measurements of the horizontal cells' receptive-field sizes for two different cell types in the rabbit's retina. The lower panel shows the extension of the neurobiotin coupling for the same horizontal-cell classes. In both panels, the extent is small at low intensities, grows with intensity up to a maximum, and then falls as the intensity increases further. This same intensity behavior was also observed in goldfish (Jamieson, Baldridge, & Ball, 1994). From these results, one can see that different experiments in different species lead to the same bell-shape behavior for horizontal cells and thus, quite likely, for the lateral inhibition that they mediate. Moreover, given that it was observed in different types of horizontal cells, one can say that this is a general behavior, which is not explained by previous theories. In this work, we propose a new information-theory-like hypothesis for the role of the OPL's lateral inhibition.
We focus on the OPL because it is the first stage of visual processing and therefore must deal with general aspects of natural images. The new hypothesis will be shown to be consistent with the OPL's intensity behavior. Relatedly, the hypothesis can account for the disappearance of lateral inhibition at extremely low intensities (Barlow, Fitzhugh, & Kuffler, 1957; Bowling, 1980) and the shrinkage of its extent for stimuli that are not spatially homogeneous (Reifsnider & Tranchina, 1995; Kamermans, Haak, Habraken, & Spekreijse, 1996). Moreover, the hypothesis explains the inhibition's division-like mechanism of action (Merwine, Amthor, & Grzywacz, 1995; Verweij, Kamermans, & Spekreijse, 1996) and its linearity under low contrasts (Tranchina, Gordon, Shapley, & Toyoda, 1981). Finally, the hypothesis is consistent with the psychophysics of Stevens's power law of intensity perception (Stevens, 1970) and the asymmetry of positive and negative Mach bands at the two sides of an intensity border (Fiorentini & Radici, 1957; Ratliff, 1965). The hypothesis and its predictions have already appeared in abstract form (Balboa & Grzywacz, 1998).
Early Retinal Lateral Inhibition
1487
Figure 1: (A) Electrophysiological receptive-field-extent measurements for two different types of horizontal cells in the rabbit retina: type A (HA; empty circles) and type B (HB; filled circles). When dark adapted (−∞), the receptive-field extent is small. It then rises with background intensity up to a maximum and falls again as the intensity continues to increase. (B) Tracer-coupling extent with neurobiotin in the same cell types. The coupling has the same behavior as the receptive fields in A. (Data from Xin & Bloomfield, 1999.)
2 Hypothesis Rationale

2.1 Philosophy. There is a major philosophical difference between the new hypothesis and some of the theories in the literature (Atick & Redlich, 1992; McCarthy & Owen, 1996). Past theories prescribe extracting the largest
1488
Rosario M. Balboa and Norberto M. Grzywacz
amount possible of nonredundant signal. The new hypothesis prescribes extracting as well as possible the signal that is relevant for the animal. An example of what we mean by "relevant" is provided by the work of Kamermans and colleagues with the goldfish's OPL (Kamermans, Kraaij, & Spekreijse, 1998). If it were to extract all the color information from the scene, then it should obtain information about the color of the objects and illuminant. However, the goldfish's OPL appears to discard, or at least pay much less attention to, the illuminant's color. Why would the visual system play down some types of information so early in the processing? We propose two answers to this question. First, by devoting the available output channels of information to fewer types of signal, larger quantities of these signals can be transmitted. Second, if one assumes that the central nervous system has limited capacity, then the system might be better off dealing with fewer types of signal than dividing its energy among many types, some of which may not be relevant for the animal's survival. What types of signal should lateral inhibition play down or stress? The most obvious type is illumination. For example, the predictive coding theory (Srinivasan et al., 1982) proposes to reduce information about the intensity of illumination of points whose neighbors have the same intensity. The goal is to fit the wide range of intensities in natural scenes into the narrow dynamic range of the second-order neuron, for instance, the bipolar cell. However, the retina must achieve this goal before lateral inhibition, since the photoreceptor itself has a narrow dynamic range (Fain & Dowling, 1973; Baylor, Lamb, & Yau, 1979). Hence, different from the predictive coding theory, we suggest that the main goal of lateral inhibition is not the codification of intensity.
Rather, we propose that the nonchromatic goals of the OPL's lateral inhibition are to aid in the optimization of border detection (Shapley & Tolhurst, 1973) and of contrast (Campbell & Robson, 1968) and intensity (Stevens, 1970) estimation. Contrast would be particularly important at borders, and intensity would be important away from them. Border position is an important piece of information, because it is from borders that the visual system characterizes objects. As such, lateral inhibition is the first step of image segmentation, by being the first step of edge detection. Because we emphasize object segmentation, we place particular importance on occluding edges. However, this does not mean that we think that the retina discriminates different types of edges (such as occlusions and shadows). Edge discrimination requires relatively high-level cues (Adelson, 1993; Wishart, Frisby, & Buckley, 1997) beyond the scope of the retina. All that we are proposing is that the retina contributes directly to border extraction, with the retinal circuitries designed specifically to deal with the statistics of occluding borders. The idea of lateral inhibition being important for border extraction is not new (Ratliff, 1965). What is new here is quantifying the cost of missing borders and specifying the information obtained from them and away from them.
2.2 Lateral Inhibition and Image-Attribute Extraction. Figure 2A illustrates the structure of the new hypothesis for lateral inhibition as a process in the extraction of image attributes. From the stimulus (I(r), where r denotes two-dimensional position), the retina wishes to estimate a few attributes (the functions H_1, H_2, ...), particularly the position and contrast of edges and the intensity outside edges. To do so, it first transduces the stimulus through a noisy process, getting the photoreceptor responses (T(r)). Then lateral inhibition modulates these responses, giving rise to bipolar responses (B(r)). Consider the spatial layout of these responses across the population of the bipolar cells. Wherever there are edges in the image, there are fluctuations in the layout caused by the lateral inhibition, the so-called Mach bands (Ratliff,
1965). The contrast of an edge and its Mach band's amplitude (or largest spatial gradient) are linearly related. Because the probability that there is an edge due to occlusions at a point of the image increases with the contrast at that point (Balboa & Grzywacz, 1999c), the spatial bipolar gradient at any given point codes the probability that there is an edge there. Therefore, for a point in the image with a high spatial gradient across bipolar cells, the position of these cells indicates the position of a likely edge, and the gradient is proportional to the presumed edge's contrast. For points with low gradient, the bipolar cells would code an amplitude-compressed representation of T(r). We propose that the visual system implements recovery functions (R_1(B(r)), R_2(B(r)), ...) to go from the bipolar responses to estimates (Ĥ_1, Ĥ_2, ...) of the image attributes. For each attribute, the visual system computes the expected error (E_1, E_2, ...) by taking into account the prior probability of the attribute in natural images (P(H_1), P(H_2), ...), the probabilistic model relating the attribute to the estimate (P(Ĥ_1|H_1), P(Ĥ_2|H_2), ...), and Bayesian theory. These models and prior probabilities would come from learning during development or evolution. The models amount to knowledge about the biology linking the stimulus to the estimates of the attributes. The prior probabilities amount to knowledge of the attributes in natural images. The new hypothesis postulates that the profile of lateral inhibition changes to minimize the total error, that is,

E = C_1 E_1 + C_2 E_2 + ···,    (2.1)
Figure 2: Facing page. Rationale of the minimal local-asperity hypothesis. (A) Schematics of the hypothesis. Images are transduced and transformed to bipolar-cell output with the participation of lateral inhibition. Several recovery functions (R_1, R_2, ...) implemented neurally transform the bipolar output to a new signal (Ĥ_1, Ĥ_2, ...), which represents estimates of desired attributes of the image (H_1, H_2, ...). The system then computes the estimation error by using prior knowledge about images and the underlying biology. A signal representing the error feeds back to the OPL to modulate lateral inhibition. (B) Schematics of stimuli and OPL. The dashed line represents the intensities of objects in the world, while the solid line adds the quantal noise to these intensities. Photoreceptors (whose responses are the solid lines) and horizontal and bipolar cells are represented by the shapes labeled P, H, and B, respectively. The figure uses three labeled sites to illustrate three cases of the relationship between these cells and the stimulus. In site 1, the lateral inhibition mediated by the horizontal cells is confined to one object, helping to compress its intensity into the neural signal. This signal suffers from the quantal-noise error. In site 2, the local circuit straddles one border, producing a Mach band signal, which helps to code the border's location and contrast. Finally, in site 3, lateral inhibition straddles more than one border, leading to an erroneous border-detection process (the local-asperity error).
where the C_i's are constants, which weigh the cost associated with each error. They are inversely related to "fitness" in population genetics² (Ridley, 1993) and directly related to the "loss function" in economics (Berger, 1985) and computer science (Gallant, 1995). Figure 2B is a one-dimensional (1D) scheme, which illustrates the main errors in the estimation of the desired visual attributes. This scheme shows the intensity levels of five 1D objects (dashed line), with the photon-absorption noise superimposed (solid line). Lateral inhibition mediated by horizontal cells (H) acts directly on the photoreceptors (P), modulating their intensity estimate before it reaches the bipolar cell (B). For reasons of clarity, we show only three nonoverlapping sites of lateral inhibition to describe the dominant errors. Depending on the site, what would the response of the bipolar cell be? In site 1, the bipolar response would provide an estimate of the intensity inside the object. The only error in this site would be due to the photon-absorption noise. In site 2, the bipolar cell and its neighbors would be coding something like the position of a border and its contrast. Again, the main source of error for contrast estimation is the photon-absorption noise. In site 3, the borders of the small object might be altogether missed, since the bipolar cell would be receiving inputs related to three different objects, corrupting its signals. Consequently, there are two dominant sources of errors: the photon-absorption noise and the straddling of multiple borders by lateral inhibition. We call the errors due to the first source the quantal-noise errors and the error due to the second source the local-asperity error. (Asperity means "roughness of a surface." If one thinks of images as three-dimensional representations of the intensities at each point, then an asperous image would have a harsh profile.)

2.3 Lateral-Inhibition Extent.
The spatial extent of the inhibitory filter is designed to minimize the error in equation 2.1 given a model like that in Figure 2A. Mathematical analysis confirms that the two dominant sources of errors are those illustrated in Figure 2B (see section 3.2). Theories in the literature all take into account the quantal-noise error (Rushton, 1961). Its process is well known, and the probability of photon absorption follows a Poisson distribution (Fuortes & Yeandle, 1964; Baylor et al., 1979; Grzywacz, Hillman, & Knight, 1988). We estimate the error of this process (section 3) as a function of intensity and the extent of the lateral-inhibition filter. Reducing the intensity produces a relatively noisier signal. To compensate for the error being larger, the system should average the signal over a wider sample of photoreceptor responses and therefore increase the

2 Population genetics quantifies the evolution of a population of certain genes, starting from their initial frequencies, and how fit they are for reproduction and survival. In our case, the measure of "fitness" is related to the contribution of visual behavior to these genes' success.
extent of the lateral-inhibition filter. The problem with such an average is that more borders would be straddled, causing the local-asperity error. Hence, ideally, to minimize the local-asperity error, the retina would have to use narrow lateral-inhibition extents. It follows that the quantal-noise and local-asperity errors impose opposite demands on lateral inhibition. To alleviate this problem, the minimal local-asperity hypothesis proposes that the retina chooses the extent of lateral inhibition to minimize a weighted sum of these errors (see equation 2.1), that is, that the retina reaches a compromise.

2.4 Lateral Inhibition's Mechanism. We chose a division-like mechanism of lateral inhibition to implement the hypothesis. In other words, the inhibition corresponds not to the subtraction of the signal from the surround (as in a difference-of-gaussians model; Rodieck & Stone, 1965) but rather to a division (reduction of gain) of the center's signal. We chose this form of inhibition for three reasons. First, it would be useful to maintain intensity information for regions away from borders. A division-like inhibition ensures the maintenance of this information, whereas a subtractive inhibition may not (Srinivasan et al., 1982). Second, mathematical analyses of subtractive and division-like models (Grzywacz & Balboa, 1999) reveal that the latter but not the former allows the extraction of contrast locally. In the linear model, a bipolar cell's signals confound contrast, mean intensity, and amplitude of inhibition. The only way to disambiguate these signals is to obtain global information about the mean intensity. Third, biophysical evidence indicates that the OPL's lateral inhibition works through a Ca²⁺-dependent Cl⁻ conductance at the photoreceptors' synapses (Verweij, Kamermans, & Spekreijse, 1996). Such a conductance operates in a division-like manner.

3 Mathematical Formulation

3.1 Lateral Inhibition's Mechanism.
Let the intensity-dependent input and output of the photoreceptor's synapse at position (i, j) be T⁰_{i,j} and B_{i,j}, respectively. One can think of T⁰_{i,j} as the current generated by transduction and of B_{i,j} as the voltage at the bipolar cell or presynaptic site. (The variable B_{i,j} can be pre- or postsynaptic, since we assume here for simplicity that the synapse is linear, but our arguments do not depend on that.) The output is
B_{i,j} = T⁰_{i,j} / (g⁰_r + g⁰_I),    (3.1)
where g⁰_r > 0 is inversely proportional to the resting gain of the photoreceptor's synapse (the resting conductance) and g⁰_I is inversely proportional to the gain change due to lateral inhibition (the varying conductance). (For the justification of using a division-like inhibition, see the text following
equation 2.1.) Next, we simplify this equation by dividing the numerator and denominator by g⁰_r and defining T_{i,j} = T⁰_{i,j}/g⁰_r and g_I = g⁰_I/g⁰_r. We propose to set g_I = (h_{i,j} ∗ B_{i,j})^n, where ∗ stands for convolution, h_{i,j} ≥ 0 is the lateral-inhibition filter, and n ≥ 1 is a constant. From this proposition and equation 3.1, we have
B_{i,j} = T_{i,j} / (1 + (h_{i,j} ∗ B_{i,j})^n).    (3.2)
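Because B appears on both sides of equation 3.2, the bipolar responses are defined only implicitly. They can be computed numerically; the following sketch is our own construction (the function names, the damped fixed-point iteration, and the periodic boundary handling are illustrative assumptions, not from the text):

```python
import numpy as np

def convolve2d_wrap(x, k):
    # Periodic 2-D convolution by summing shifted copies; adequate for small
    # kernels, and for the isotropic (symmetric) filters the hypothesis
    # assumes, convolution and correlation coincide.
    ky, kx = k.shape
    out = np.zeros_like(x, dtype=float)
    for dy in range(ky):
        for dx in range(kx):
            out += k[dy, dx] * np.roll(x, (dy - ky // 2, dx - kx // 2),
                                       axis=(0, 1))
    return out

def bipolar_response(T, h, n=2, iters=300, damping=0.5):
    # Damped fixed-point iteration of equation 3.2:
    #   B = T / (1 + (h * B)^n), with '*' a spatial convolution.
    B = T.astype(float).copy()
    for _ in range(iters):
        inhib = np.maximum(convolve2d_wrap(B, h), 0.0)
        # The damping term stabilizes the iteration when inhibition is strong.
        B = damping * B + (1.0 - damping) * T / (1.0 + inhib ** n)
    return B
```

For a spatially uniform input, the iterate settles onto the steady state of equation 4.1 below; the damping factor guards against the oscillations the raw fixed-point map can produce when the inhibitory gain is high.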
That this equation uses a convolution means that lateral inhibition pools the photoreceptor outputs linearly. It is also imposed, for simplicity and because there is no evidence against it, that if i² + j² = k² + l², then h_{i,j} = h_{k,l}; that is, the inhibitory filter is isotropic. Finally, that equation 3.2 raises h_{i,j} ∗ B_{i,j} to the power of n could mean the cooperation of n molecules of inhibitory transmitter or, nonexclusively, nonlinearities in the calcium or chloride conductances.

3.2 Lateral Inhibition's Extent. The goal is to find the lateral-inhibition filter that minimizes the error in equation 2.1 in a model like that in Figure 2. Elsewhere (Grzywacz & Balboa, 1999), we performed a mathematical analysis to estimate this error. The results of the analysis are similar to the intuition provided in Figure 2B, which shows that formulas for errors similar to those presented below are expected from models like that in Figure 2A no matter what approximations one uses. Briefly, the analysis used the following mathematical procedures: Errors were typically quantified as spatial integrals of the square of the relative differences between desired attributes and their estimates. We used a Bayesian formulation to weigh the various possible images and the attributes estimated from these images. For this purpose, it was assumed that the attribute estimation from a given image was corrupted by Poisson noise. We then linearized the errors with Taylor expansions around the mean intensity, assuming that the images did not have intensities that were too low (relatively little noise) or contrasts that were too high (Ruderman & Bialek, 1994; Vu, McCarthy, & Owen, 1997; Zhu & Mumford, 1997; Balboa & Grzywacz, 1999b). Finally, it was assumed that the width of the inhibitory filter is much narrower than the mean size of objects in the images, making simultaneous crossing of two borders when centering the filter on a point rare but not impossible.
This was a simplifying assumption that was probably at best rough for "cluttered" images (Zhu & Mumford, 1998). However, this assumption probably held for most images, since the receptive fields of retinal cells are typically narrow. This assumption allowed approximating the number of local-asperity errors (multiple-border crossings) with a Poisson distribution (large number of positions and low probability of crossings). After this analysis, a good
approximation for the error (see equation 2.1) was

E = E_Q + E_A,    (3.3)
where E_Q and E_A are the quantal-noise and local-asperity errors, respectively. (Compared to equation 2.1, equation 3.3 arbitrarily leaves out the C_i coefficients, because E_Q and E_A have free, multiplicative gain parameters.) The analysis (Grzywacz & Balboa, 1999) also revealed that the quantal-noise error is to a good approximation
E_Q = Q_Q ( Σ_{i,j} ĥ_{i,j}² / ⟨I_{i,j}⟩^{1/2} )^{1/2},    (3.4)
where Q_Q > 0 is a free parameter of the hypothesis, ĥ_{i,j} = h_{i,j} / Σ_{i,j} h_{i,j} is the normalized filter, and ⟨I_{i,j}⟩ is the mean input to the photoreceptor's synapse from the photoreceptor's outer segment, averaged over all positions (i, j). Although we do not derive this equation here, understanding its interpretation will help clarify its appropriateness. The equation's parameter Q_Q indicates the noisiness of the photoreceptors. The quantal noisiness varies with the amplitude and time course of the single-photon response (Fuortes & Yeandle, 1964; Lillywhite, 1977; Baylor et al., 1979; Grzywacz et al., 1988), and thus, Q_Q has a wide range of values in nature. Expectedly, as the noisiness increases, the quantal-noise error (E_Q) becomes larger. Also as expected, the quantal-noise error falls with the square root of the input. This is because the quantal-noise error is a relative error and the noise (standard deviation)-to-signal (mean) ratio in a Poisson process is inversely proportional to the square root of the signal (Burington, 1973). Finally, the error in equation 3.4 can be reduced by changing ĥ_{i,j}. As ĥ_{i,j} becomes more spatially extended, its values become smaller, since Σ_{i,j} ĥ_{i,j} = 1 and ĥ_{i,j} ≤ 1 (because h_{i,j} ≥ 0). This causes ĥ_{i,j}² to fall rapidly, reducing Σ_{i,j} ĥ_{i,j}² and thus E_Q. In other words, the larger the spatial extent of the inhibitory filter (h_{i,j}) is, the smaller the quantal-noise error is. We also analyzed the local-asperity error (E_A in equation 3.3; Grzywacz & Balboa, 1999). Because this error depends on the size of the inhibitory filter, one needs to estimate its width. One example of such an estimate is the spatial standard deviation (r) of ĥ_{i,j}. (Because Σ_{i,j} ĥ_{i,j} = 1 and 0 ≤ ĥ_{i,j} ≤ 1, ĥ_{i,j} is a probability function and thus has a well-defined standard deviation.) It should not be surprising that under the assumptions described above, the
local-asperity error is to a good approximation

E_A = Q_A Σ_{k,l} [ Σ_{(i−k)² + (j−l)² ≤ πr²} P_O(|∇I/I|_{i,j}) ],    (3.5)
where Q_A > 0 is a constant and P_O(|∇I/I|_{i,j}) is the probability that there is an occluding border at position (i, j) given that the contrast there is |∇I/I|_{i,j} (footnote 3). This equation is proportional to the mean number of occluding borders found when a window of radius r is placed over an arbitrary point in the image. The "k" and "l" sums compute the mean over the retina (with the normalization constant included in Q_A). The "i" and "j" sum is proportional (also through Q_A) to the number of borders around position (k, l). This is because P_O gives the probability (fraction) of a border at each position. Elsewhere (Balboa & Grzywacz, 1999b), we present results of measurements of P_O(|∇I/I|_{i,j}), which are needed for estimating the local-asperity error through equation 3.5. In that work, we show that one of the main properties of this function is that it is a sigmoidal function of local contrast. Because the vast majority of contrasts are low (Ruderman & Bialek, 1994; Vu et al., 1997; Zhu & Mumford, 1997; Balboa & Grzywacz, 1999b), we here neglect the saturation portion of the sigmoid and approximate

P_O(|∇I/I|_{i,j}) = |∇I/I|^m_{i,j},    (3.6)
where m ≥ 1. A multiplicative constant that should appear on the right-hand side of this equation is set to 1 without loss of generality, since the balance between E_Q and E_A is controlled by the parameters Q_Q and Q_A (see equations 3.4 and 3.5).

3.3 Lateral Inhibition's Amplitude. Why does the amplitude of h not matter much for the computation of error? Although the lateral-inhibition extent matters for the quantal-noise error, the amplitude of h_{i,j} does not, since equation 3.4 depends only on the normalized filter, ĥ_{i,j}. (However, this amplitude matters for the OPL's output.) The amplitude is eliminated in the derivation of equation 3.4, since the quantal-noise error is relative, that is, it is standard deviation over mean. Similarly, errors made by the

3 This is the most natural definition of local contrast for the problem at hand. One thinks of local contrast as the ratio between the difference of intensities at neighboring points and the mean intensity. However, for an image on a continuous domain, neighboring points are not well defined. Therefore, a differential definition of contrast using the gradient operation is natural.
local-asperity term do not depend on h's amplitude. They are just a count of how many multiple borders are crossed on average when centering the filter on a point in the image. This is because the error depends only on the probability that borders pass through points inside the filter, and this probability is independent of the filter's amplitude.

4 Results

4.1 Bell-Shape Behavior. We simulated our hypothesis with 89 natural underwater and atmospheric images collected from the World Wide Web (Balboa, 1997). In the simulations, we found the inhibitory filter that minimized the attribute-estimation error expressed in equation 3.3, using equations 3.4 and 3.5 and the methods in the appendix. The main result of the simulations appears in Figure 3. It reveals a bell-shape behavior of the lateral-inhibition extent as a function of intensity. The rise at low intensities begins at around 3 to 10 photons per integration time of the photoreceptor. Therefore, the hypothesis predicts that the bell-shape behavior is due to cone processing, because rods are saturated at these intensity levels (Rao, Buchsbaum, & Sterling, 1994). This behavior contrasted with that stemming from other theories of lateral inhibition in the literature (Srinivasan et al., 1982; Atick & Redlich, 1992; Field, 1994). Those theories predicted that the lateral-inhibition extent falls with intensity (Balboa & Grzywacz, 2000). The stepwise appearance of this extent as a function of intensity in Figure 3 was due to our searching for the extent of the optimal filter in jumps of powers of 2 (see the appendix). However, averaging the lateral-inhibition extent across images eliminated this stepwise appearance. Figure 3 also presents the dependence of the bell-shape behavior of the lateral-inhibition extent on Q, which is proportional to the ratio between the weight of the quantal-noise error (Q_Q; see equation 3.4) and that of the local-asperity error (Q_A; see equation 3.5) (see the appendix).
The effect of a rise in Q is to increase the lateral-inhibition extent. Figure 4 illustrates the effect of the shape of P_O(|∇I/I|_{i,j}) on the bell-shape curves. The main free parameter quantifying this shape is m, which gives the power of the rise of P_O as a function of |∇I/I| (see equation 3.6). The effect of this parameter is to increase the lateral-inhibition extent as m rises. Moreover, for small m (little curvature), the new hypothesis no longer predicts a bell-shape behavior but only a fall of the lateral-inhibition extent as a function of intensity. For m = 1, there never was a bell-shape behavior; for m = 2, this behavior occurred in 80 images out of 89; and for m = 4, this behavior was always there.
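The minimization behind these curves can be sketched in miniature. The following Python fragment is our own illustration, not the authors' simulation code: the isotropic Gaussian filter shape, the toy two-region image, and the parameter values Q = 3 and m = 4 are assumptions chosen for the demonstration. It combines equations 3.3 through 3.6 and scans the filter width:

```python
import numpy as np

def total_error(I, sigma, Q=3.0, m=4):
    # E = E_Q + E_A (equation 3.3) for an isotropic Gaussian inhibitory
    # filter of spatial extent sigma; Q plays the role of Q_Q, and Q_A = 1.
    R = max(1, int(3 * sigma))
    dy, dx = np.mgrid[-R:R + 1, -R:R + 1]
    h = np.exp(-(dy ** 2 + dx ** 2) / (2.0 * sigma ** 2))
    h_hat = h / h.sum()                                        # normalized filter
    E_Q = Q * (np.sum(h_hat ** 2) / np.sqrt(I.mean())) ** 0.5  # equation 3.4
    gy, gx = np.gradient(I)
    P_O = (np.hypot(gy, gx) / I) ** m                          # equation 3.6
    window = (dy ** 2 + dx ** 2) <= np.pi * sigma ** 2         # window of eq. 3.5
    E_A = window.sum() * P_O.sum()   # equation 3.5, treating the image as periodic
    return E_Q + E_A

# Toy image: two regions separated by one occluding border.
I = np.full((32, 32), 100.0)
I[:, 16:] = 120.0
errors = {s: total_error(I, s) for s in (0.8, 1.5, 2.0, 3.0, 6.0)}
```

E_Q falls roughly as 1/sigma while E_A grows with the window area, so the minimum of `errors` sits at an intermediate filter width rather than at either end of the scan, the compromise of section 2.3.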
4.2 Habitats. Besides depending on the hypothesis parameters, the lateral inhibition’s bell-shape behavior also depends on the visual habitat.
Figure 3: Lateral-inhibition extent as a function of intensity and parametric on Q, which weighs the quantal-noise error relative to the local-asperity error. Standard curve (solid): Q = 1000, m = 4. The other curves vary only Q. There is a bell-shape behavior in the lateral-inhibition extent for the two images shown. By increasing Q, the lateral-inhibition extent increases.
There is a small statistical difference in the lateral-inhibition extent between underwater and atmospheric images (one-sided Mann-Whitney test, DF = 87, U = 727, z = 2.12, p < 0.0172). The difference is such that the mean maximal extent of lateral inhibition is 33% larger in underwater than in atmospheric images. One can explain this by observing that light scattering and absorption are so high in water (Mobley, 1995) that underwater images
Figure 4: Lateral-inhibition extent as a function of intensity and parametric on m, which models the curvature of the probability-of-real-borders-as-a-function-of-contrast curve. Standard curve (solid): Q = 1000, m = 4. The other curves vary only m. There is a bell-shape behavior in the lateral-inhibition extent for the two images shown, but only when m > 1, that is, when there is a curvature in the probability-of-real-borders distribution.
often have large regions of relatively constant intensity. Relatedly, another aspect of visual habitats that seems relevant is whether they have many small objects (as in tropical foliage) or large regions of relatively constant reflectance (as in Arctic scenes). To test whether such habitats lead to different receptive fields according to the minimal local-asperity hypothesis,
we reclassified the images into smooth and rugged categories.⁴ The maximal extent of lateral inhibition for smooth images is larger than for rugged images (one-sided Mann-Whitney test, DF = 34, U = 101, p < 0.05). The mean maximal extent was 45% larger for smooth than for rugged images. (Such a difference was observed in horizontal cells when using smooth and rugged backgrounds; Reifsnider & Tranchina, 1995; Kamermans et al., 1998.)

4.3 Other Results Consistent with Biology. As defined in equation 3.2, lateral inhibition would help with luminance adaptation, complementing mechanisms such as pupil-size change and photoreceptor adaptation (for a review, see Dowling, 1987). To see this, let T_{i,j} = T_0 and B_{i,j} = B_0 be constants. (This simplistic example corresponds to a steady-state stimulus with no borders or noise.) Then,
B_0 = T_0 / (1 + h_0^n B_0^n),  where h_0 = Σ_{i,j} h_{i,j}.    (4.1)

From this equation one gets

B_0 + h_0^n B_0^{n+1} = T_0.    (4.2)
This equation shows that B_0 grows monotonically with T_0, since

dT_0/dB_0 = 1 + (n + 1) h_0^n B_0^n > 0,    (4.3)
and that if T_0 = 0, then B_0 = 0. At low light levels, B_0 ≫ h_0^n B_0^{n+1} (because B_0 grows with T_0 and B_0(T_0 = 0) = 0) and therefore,

B_0 = T_0.    (4.4)
Consequently, the response grows proportionally with intensity at low intensities, which is consistent with the biological data (Ashmore & Falk, 1980; Robson & Frishman, 1995). Another biologically correct consequence of lim_{T_0 → 0} B_0 = 0 is that lateral inhibition disappears at low intensities (Barlow et al., 1957; Bowling, 1980).

4 Four individuals (two were the authors) performed a classification of all the images according to the following instruction: "We are giving you a set of eighty-nine natural images. Imagine that at each pixel, one inserts a column perpendicular to the plane of the image to a height proportional to the pixel's gray level. This would form a three-dimensional surface. Please classify the images into two sets of approximately equal sizes, such that one set contains those images with the largest regions of relatively low roughness in the imaginary three-dimensional surface." After the individuals performed this classification, we placed into the "smooth" and "rugged" categories only those images that all the individuals judged to have low and high roughness, respectively.
At high intensities, h_0^n B_0^{n+1} ≫ B_0, and

B_0 = T_0^{1/(n+1)} / h_0^{n/(n+1)},    (4.5)
which, if h_0 does not depend strongly on intensity, corresponds to a strong intensity compression, that is, luminance adaptation. This compression is caused by the division in equation 3.1, not by n, since there is also compression for n = 1. However, the compression is stronger when n > 1. This compression leads to a power-law behavior with intensity and is thus directly consistent with Stevens's power law (Stevens, 1970). A connection between this law and a similar feedback mechanism was noted previously for invertebrate photoreceptors (Grzywacz & Hillman, 1988). Therefore, one of the roles proposed by the minimal local-asperity hypothesis for lateral inhibition in the OPL is the compression of the wide range of natural intensities into the narrow dynamic range of the bipolar cells (their total range is about 20 mV, and their noise is of the order of 1 mV; Werblin, 1977). This compression should occur in regions of the image away from borders (see Figure 2B, site 1). We observe such a compression, which is strengthened as the inhibitory feedback's power (n) increases (see Figure 5). As mentioned before, this form of compression is identical to Stevens's power law (Stevens, 1970). However, observe that there is no compression until, on average, about one photon is absorbed per integration time of the photoreceptor (see Figure 5). Besides the bell-shape behavior (see Figures 3 and 4) and the adaptation-related Stevens's-power-law behavior (see Figure 5), the minimal local-asperity hypothesis is also consistent with at least four other aspects of retinal biology (two of which were discussed above but are recapitulated here to condense all the biological implications of the hypothesis in a single place). First, at low background intensity, lateral inhibition disappears, as discussed after equation 4.4 (Barlow et al., 1957; Bowling, 1980).
Second, the mechanism of action should be division-like, as discussed after equation 2.1 (Merwine et al., 1995; Verweij, Kamermans, & Spekreijse, 1996). Third, the same mathematical analysis that led to equation 3.3 (Grzywacz & Balboa, 1999) shows that despite being division-like, lateral inhibition would lead to responses proportional to contrast even when relatively high gradients of illumination exist (Tranchina et al., 1981). Fourth, that analysis also shows that positive Mach bands tend to be larger than negative ones, particularly at low background intensities (Fiorentini & Radici, 1957; Ratliff, 1965). An intuition for this asymmetry arises when one considers the extreme case of an edge for which one of the two sides has zero intensity. Because inhibition is division-like, the zero-intensity side will not respond (zero divided by another number is still zero) and thus will not develop a Mach band (see equation 3.1), whereas the other side will.
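The two limiting regimes of the steady-state relation are easy to check numerically. This sketch is our own (the bisection solver and its names are not from the text): it solves equation 4.2 for B_0 and verifies that B_0 tracks T_0 at low input (equation 4.4) while the high-intensity log-log slope approaches 1/(n + 1) (equation 4.5):

```python
import numpy as np

def steady_state_B0(T0, h0=1.0, n=2):
    # Solve equation 4.2, B0 + h0^n * B0^(n+1) = T0, for B0 by bisection.
    # The left-hand side grows monotonically in B0 (equation 4.3),
    # so the root is unique and bisection is safe.
    f = lambda B: B + h0 ** n * B ** (n + 1) - T0
    lo, hi = 0.0, max(T0, 1.0)       # f(lo) <= 0 and f(hi) >= 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Low intensity: B0 tracks T0 almost exactly (equation 4.4).
b_low = steady_state_B0(1e-4)
# High intensity: log-log slope approaches 1/(n + 1) (equation 4.5).
b1, b2 = steady_state_B0(1e6), steady_state_B0(1e8)
slope = np.log(b2 / b1) / np.log(1e8 / 1e6)
```

With n = 2, the high-intensity slope comes out near 1/3, the cube-root compression implied by equation 4.5.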
Figure 5: Output at the bipolar cell level as a function of intensity and parametric on n, which models the stoichiometry of the horizontal cell feedback. The units of intensity are arbitrary, since the curves were obtained from equation 4.2 with an arbitrary but constant value of h0 . At low intensities, there is no compression (linear intensity-output relationship; slope = 1), but at higher intensities, there is (sublinear intensity-output relationship; slope < 1). The compression increases with n (lower slopes). In the no-compression regime, lateral inhibition is negligible.
5 Discussion
This article shows that the minimal local-asperity hypothesis can account for the bell-shape behavior with light intensity of the OPL's lateral-inhibition extent (see Figures 3 and 4; Baldridge, 1993; Jamieson, 1994; Xin et al., 1994; Xin & Bloomfield, 1999). Moreover, this hypothesis is consistent with several other physiological and psychophysical behaviors (see section 4.3). A testable prediction of this hypothesis is that the details of the bell-shape behavior will depend on the habitat where the animal lives (see section 4.2). We also studied the role of three independent parameters of the hypothesis: the weight of the quantal-noise error relative to the local-asperity error (see Figure 3), the power of the dependence on contrast of the probability that an occluding border passes through a point (see Figure 4), and the exponential power of the inhibitory action (see Figure 5).

5.1 Limitations. Before embarking on a discussion of the implications of the results, we address some of the limitations of the minimal local-asperity hypothesis. Perhaps the most important variable that this hypothesis did
Rosario M. Balboa and Norberto M. Grzywacz
not consider is time. Adaptation of lateral inhibition must be performed taking into account the different ranges of visual speeds of different animals, and thus possibly different timescales. Therefore, the hypothesis may have to be reformulated to incorporate the animals' speeds. In particular, one may have to study the distribution of occluding borders in images with movement at different speeds (movement may cause border blurring and therefore losses of contrast). Besides movement, another issue that the hypothesis does not address, which is particularly relevant for the OPL, is color processing. This issue may not be major for primates, for which the OPL's lateral inhibition is not color coded (Calkins & Sterling, 1996; Dacey, 1996), but may be key for other vertebrates (Ogden, Mascetti, & Pierantoni, 1985; Ammermüller, Müller, & Kolb, 1995). Teleost fish, for example, have complexly color-coded lateral inhibition mediated by three types of horizontal cells (for a review, see Dowling, 1987). And these fish's lateral inhibition appears to contribute to color constancy, discarding the illuminant information (Kamermans et al., 1998). Hence, it may be simplistic to consider the spatial processing role of lateral inhibition without considering its chromatic one. The data of Verweij, Kamermans, van den Aker, and Spekreijse (1996) highlight this conclusion by showing that the receptive field's spatial profiles in the goldfish's outer retina are color dependent. In our defense, we were trying to explain a spatial behavior and succeeded where previous theories failed (Balboa & Grzywacz, 2000). Moreover, the ideas developed here can be extended from the spatial domain to formulate a spatiochromatic version of the minimal local-asperity hypothesis. Another limitation of the hypothesis comes from a top-down consideration.
Estimation of the local-asperity error requires prior knowledge about the probability of occluding borders as a function of contrast in natural images, and knowledge about the computations performed by the neural processes involved (see Figure 2.2). Although the hypothesis specifies the type of knowledge needed (probability distributions) and its mode of use (Bayesian estimation), it does not say how to obtain the information. All we can say is that it should probably come from evolutionary or developmental learning, or both. Studies on the learning of prior distributions are under way in biological and computer vision (Weiss, Edelman, & Fahle, 1993; Zhu & Mumford, 1997; Jones, Sinha, Vetter, & Poggio, 1997; Bartlett & Sejnowski, 1998). Another computationally related limitation is that we implement the hypothesis using individual images but propose that local asperity is an error learned over many images. The assumption is that an individual image provides a sufficiently rich sample of its habitat. In practice, though, the retina may not make this assumption, instead pooling a large sample of images to control adaptation. This would explain why the timescale of adaptation is so slow (for a review, see Dowling, 1987). One more limitation that we wish to address has to do with our natural images. The images come from the World Wide Web, and thus lack intensity calibration and may miss energy at high spatial frequencies due
to JPEG (Joint Photographic Experts Group) compression (Fernandez, 1997) and the photographic lens's modulation transfer function (Kodak Professional Photographic Catalog, Kodak, Rochester, NY). We next argue that the lack of intensity calibration and the high-frequency loss are not serious problems for our results. To test whether intensity calibration limits our conclusions, we applied the hypothesis to nine calibrated images (Mackeown, 1994) from the Sowerby Image Database (Sowerby Research Centre, British Aerospace). The simulations yielded a bell-shape behavior for all images. The reason that lack of intensity calibration should not affect the bell-shape behavior is that the main effect of passing images through photographic media is to raise the intensities to a power (the gamma of the emulsion; Altman, 1995). According to our definition of contrast (see note 3), the effect of raising the image to a power would be to multiply the contrast by that power. Hence, from equations 3.5 and 3.6, the local-asperity error would simply be multiplied by a constant, which could be absorbed in Q_A, with no change in the bell-shape behavior (see Figure 3). To test the implications of high-frequency loss, we saved the nine Sowerby images in JPEG with the highest compression factor allowed in Photoshop (Adobe, San Jose, CA), which produced files approximately 9.5 times smaller than before compression. As before, the simulations showed the bell-shape behavior for all images. The only effect of JPEG storage was the generation of bell-shape curves with the same peak but slightly broader tuning, with the mean difference between the integrals over intensity before and after JPEG conversion being smaller than 25%. The explanation for the loss of tuning may be related to losses of contrast due to the compression algorithm (Fernandez, 1997). Loss of high spatial frequencies would smear edges and thus cause the loss of some of them due to noise.
With fewer borders, the images would be less rugged and would thus lead to larger lateral-inhibition extents than if the images did not lose high frequencies (see section 4.2). Finally, the approximations made to simulate the hypothesis may also limit the validity of the results. For instance, it would be interesting to include more realistic statistics about natural scenes (Zhu & Mumford, 1997) rather than approximate their distribution as homogeneous in our mathematical derivations (see section 3.2). Moreover, although we argue that the amplitude does not matter (see section 3.3), we could eliminate the square-function approximation for the inhibitory filter (see the appendix). Below, we provide arguments for the bell-shape behavior that are essentially independent of the square-function assumption for the filter.

5.2 Why Does the Hypothesis Account for the Bell-Shape Behavior?
We now illustrate with a simple 1D stepwise model of the image and a 1D approximation of the hypothesis what accounts for the bell-shape behavior of the lateral-inhibition extent observed in the biology. This model and the approximation simplify the mathematics and consequently will
help explain the bell-shape behavior and the meaning of the parameters involved. The 1D image model to which we apply the new hypothesis is illustrated by the dashed line in Figure 2.2B. The model specifies the distributions of intensity (around a mean intensity) and object size (step length), but these distributions are not important here. By adding Poisson noise to the intensity, we can compute the quantal-noise and local-asperity errors needed for equation 3.3. The approximation here uses a continuum version of the image intensity and inhibitory filter. Substituting equation 3.6 for P_O in equation 3.5 and setting Q_A = 1 (see the appendix for a justification) in the 1D version of equation 3.5, we obtain

E_A = \sum_k \left( \sum_{i=k-r}^{k+r} \left| \frac{\nabla I}{I} \right|_i^m \right).   (5.1)

The internal sum in this 1D simplification of the model can be approximated as the sum of two terms (E_A = E_{A1} + E_{A2}). The first (E_{A1}) is due to occluding borders (with no noise), and the second (E_{A2}) to fluctuations caused by noise (with no occluding borders). We use a mean-field approximation for the first term; that is, we substitute the mean value of |\nabla I / I|_i^m for its particular value at each border. In this case, if we define \langle |\nabla I / I|_i^m \rangle = \Lambda / 4L, where \Lambda > 0 is a constant and L is the image's size, then from equation 5.1, the first term is

E_{A1} = \frac{\Lambda r}{2}.   (5.2)
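The linear dependence of E_{A1} on r can be seen in a small noiseless sketch (the array sizes, step values, and helper name asperity_sum below are our own illustrative choices, not the paper's simulation): each border point is counted once per window that straddles it, so the sum in equation 5.1 grows in proportion to the window size 2r + 1.

```python
import numpy as np

# 1D noiseless stepwise image: three flat segments with two occluding borders.
I = np.concatenate([np.full(50, 2.0), np.full(50, 6.0), np.full(50, 3.0)])
grad = np.abs(I[2:] - I[:-2]) / 2.0   # centered difference, pixel spacing 1
rel = grad / I[1:-1]                  # |grad(I)/I| at interior pixels
m = 2

def asperity_sum(r):
    """Sum over k of the windowed sum of |grad(I)/I|**m (equation 5.1)."""
    window = np.ones(2 * r + 1)
    return np.convolve(rel ** m, window, mode="same").sum()

sums = [asperity_sum(r) for r in (5, 10, 20)]
```

Because the border points sit far from the array ends, each is covered by exactly 2r + 1 windows, so the sums scale as 11 : 21 : 41, matching E_{A1} = \Lambda r / 2 up to the 2r + 1 versus 2r discretization.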
(See Balboa, 1997, for further justification of equation 5.2.) The second term is the quantal effect in E_A without the effect of the occluding borders. Therefore, to calculate this term, we can assume a homogeneous illumination of the retina and compute the quantal error only for the internal sum in equation 5.1, since in this case the effect of the external sum is only to insert a multiplicative constant. For the internal sum, we approximately have a variance

\mathrm{Var}\left( \sum_{i=k-r}^{k+r} \left| \frac{\nabla I}{I} \right|_i^m \right) = \frac{c\, r\, \sigma^{2m}}{\langle I_i \rangle^{2m}},   (5.3)

where c > 0 is a constant and \sigma^2 is the variance corresponding to the mean intensity \langle I_i \rangle if one adds Poisson noise. The explanation for this equation is as follows: The variance of the sum of independent variables is the sum of the variances, which explains the factor r. To approximate the variance of |\nabla I / I|_i^m, we can write both the numerator and denominator inside the absolute value
as a mean plus a small error, and extract the first-order term of the Taylor expansion. Because here the mean of \nabla I is zero, the expansion yields the error of |\nabla I|^m divided by the mean of |I|^m. This explains the \langle I_i \rangle^{2m} term in equation 5.3 ("m" for the power of |I| and "2" for the variance). Next, we need to evaluate the variance of the error of |\nabla I|^m. If we approximate I's Poisson noise as gaussian (not too low intensities), then the distribution of \nabla I is also gaussian, with zero mean and variance proportional to \sigma^2. Transforming such distributions to |\nabla I|^m and calculating the variance (Gradshtein & Ryzhik, 1994) yields a variance proportional to \sigma^{2m}. This explains the \sigma^{2m} term in the numerator of equation 5.3. Hence, by taking the square root of this equation and remembering that for Poisson noise \sigma^2 is proportional to \langle I_i \rangle, the E_{A2} error is

E_{A2} = \frac{g\, r^{1/2}}{\langle I \rangle_x^{m/2}},   (5.4)
where g > 0 is a constant. Besides these two terms of the local-asperity error (E_{A1} and E_{A2}), one needs the 1D quantal-noise error. To estimate this error, we use for the inhibitory filter a 1D version of the square profile defined in equation A.1 in the appendix (\hat{h}_i = 1/(2r) if |i| \le r; 0 otherwise). Substituting this profile for \hat{h}_i in equation 3.4, we obtain

E_Q = \frac{Q^*}{(r \langle I \rangle_x)^{1/2}},   (5.5)
where Q^* > 0 is a constant. The error function to minimize (the approximation of equation 3.3) is the sum of equations 5.2, 5.4, and 5.5, that is,

E = \frac{\Lambda r}{2} + \frac{g\, r^{1/2}}{\langle I \rangle_x^{m/2}} + \frac{Q^*}{(r \langle I \rangle_x)^{1/2}}.   (5.6)
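That equation 5.6 yields the bell-shape behavior can be checked by minimizing E over r numerically at each mean intensity. The sketch below uses arbitrary illustrative constants of our own choosing (\Lambda = 1, g = 1, Q^* = 50, m = 2), not the values used in the paper's simulations:

```python
import numpy as np

def total_error(r, I, Lam=1.0, g=1.0, Qstar=50.0, m=2):
    """Equation 5.6: two local-asperity terms plus the quantal-noise term."""
    return Lam * r / 2 + g * np.sqrt(r) / I ** (m / 2) + Qstar / np.sqrt(r * I)

def optimal_extent(I, r_grid=np.linspace(0.01, 200.0, 20000)):
    """Exhaustive search for the extent r that minimizes equation 5.6."""
    return r_grid[np.argmin(total_error(r_grid, I))]

intensities = np.logspace(-2, 6, 17)
extents = np.array([optimal_extent(I) for I in intensities])
```

The optimal extent rises with intensity at low intensities and falls at high intensities, peaking in between: the bell shape of Figures 3 and 4.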
To find the optimal lateral-inhibition filter, one must find the optimal r. From equation 5.6, the optimal r varies with intensity (\langle I \rangle_x) and with the power (m) of the probability that there is an occluding border at a point as a function of its contrast. To study the dependence of the optimal r on intensity, one takes the derivative of equation 5.6 with respect to r, equates the result to zero, and makes the variable change z = 1 / \langle I \rangle_x^{1/2}, to obtain

g r z^m - Q^* z + \Lambda r^{3/2} = 0.   (5.7)

Let us first consider m = 2. The equation has the roots

z = \frac{Q^* \pm \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r}.   (5.8)
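These roots can be sanity-checked numerically (with arbitrary illustrative constants \Lambda = 1, g = 1, Q^* = 50 of our own choosing): as r \to 0, the "+" root diverges and the "\-" root vanishes, previewing the case analysis below.

```python
import numpy as np

# Roots of g*r*z**2 - Qstar*z + Lam*r**1.5 = 0 (equation 5.7 with m = 2),
# as given by equation 5.8; Lam, g, Qstar are arbitrary illustrative values.
Lam, g, Qstar = 1.0, 1.0, 50.0

def roots(r):
    disc = np.sqrt(Qstar ** 2 - 4 * Lam * g * r ** 2.5)
    return (Qstar - disc) / (2 * g * r), (Qstar + disc) / (2 * g * r)

z_minus_big, z_plus_big = roots(1.0)       # larger extent
z_minus_small, z_plus_small = roots(0.01)  # extent approaching zero
```

As r decreases, z_plus grows without bound (high z, i.e., low intensity: on this branch the extent grows with intensity), while z_minus shrinks toward zero (high intensity: the extent shrinks with intensity).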
Therefore, for a sufficiently small extent (Q^{*2} > 4 \Lambda g r^{5/2}), there are two intensities that minimize the error. We will now demonstrate that with the lower of these two intensities, the extent increases with intensity, while with the other one, the extent falls with intensity. Let us study the roots when r \to 0, in which case Q^{*2} > 4 \Lambda g r^{5/2} and there are two positive roots to equation 5.7.

Case 1: z = \left( Q^* + \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}} \right) / (2 g r) (high z corresponds to low intensity). Using the first two terms of the Taylor expansion of z with respect to 4 \Lambda g r^{5/2}, one can show

\lim_{r \to 0} \frac{Q^* + \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} = \lim_{r \to 0} \frac{Q^* + Q^* - \frac{1}{2} \frac{4 \Lambda g r^{5/2}}{Q^*}}{2 g r} = \infty.

Therefore, at low intensities, z increases when r decreases, and thus, r increases with I.

Case 2: z = \left( Q^* - \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}} \right) / (2 g r) (low z corresponds to high intensity). At the limit, we obtain

\lim_{r \to 0} \frac{Q^* - \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} = \lim_{r \to 0} \frac{Q^* - Q^* + \frac{1}{2} \frac{4 \Lambda g r^{5/2}}{Q^*}}{2 g r} = 0.

Therefore, at high intensities z decreases when r decreases, and thus, r decreases as I increases. Hence, if one sets m = 2 in this 1D approximation, then the filter behaves with intensity as we want when minimizing the error. Furthermore, we will show that this 1D approximation accounts for essentially all the other behaviors of lateral inhibition obtained with full natural images.

To understand why the extent of the lateral-inhibition filter always falls with intensity when m = 1 (see Figure 4), one must consider equation 5.7. In this case,

z = \frac{1}{\langle I \rangle_x^{1/2}} = \frac{-\Lambda r^{3/2}}{g r - Q^*},   (5.9)

where, because \langle I \rangle_x > 0, it follows that g r - Q^* < 0. To show that r falls with \langle I \rangle_x, one takes the reciprocal of equation 5.9 and differentiates the result with respect to r to get

\frac{d \langle I \rangle_x^{1/2}}{dr} = \frac{g r - 3 Q^*}{2 \Lambda r^{5/2}}.

Hence, because g r - Q^* < 0, also g r - 3 Q^* < 0, and thus d\langle I \rangle_x / dr < 0. This inequality demonstrates that the extent of the lateral-inhibition filter (r) falls with mean intensity (\langle I \rangle_x) when m = 1.
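For m = 1, this monotonic fall can be verified directly from the reciprocal of equation 5.9: \langle I \rangle_x^{1/2} = (Q^* - g r) / (\Lambda r^{3/2}) should decrease over the whole admissible range g r < Q^*. A quick sketch, again with arbitrary constants of our own choosing:

```python
import numpy as np

Lam, g, Qstar = 1.0, 1.0, 50.0

def root_intensity(r):
    """Reciprocal of equation 5.9: sqrt of the mean intensity at which a
    given extent r is optimal for m = 1 (valid while g*r - Qstar < 0)."""
    return (Qstar - g * r) / (Lam * r ** 1.5)

r_values = np.linspace(1.0, 40.0, 400)
intensity_curve = root_intensity(r_values)
```

intensity_curve decreases monotonically, so larger optimal extents correspond to lower mean intensities, as the sign of (g r - 3 Q^*) / (2 \Lambda r^{5/2}) dictates.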
One can also use equation 5.7 to understand why the bell-shape behavior becomes more pronounced as m increases (see Figure 4). The only negative term in equation 5.7 (the second one) must balance the other two. Let us say that the balance occurs at some low intensity (high z). If we were to lower the intensity further, then the first term would rapidly rise if m \ge 1, and more so with larger m. Hence, r would have to fall steeply to lower the positive terms of equation 5.7 and compensate for the rise due to the first term. Consequently, the effect of raising m is to produce a faster increase of the lateral-inhibition extent as a function of intensity at low intensities. We can also explain the increase of the lateral-inhibition extent with Q (see Figure 3) by using the 1D equation 5.7. As Q^* increases (more weight to the quantal-noise error), the negative term of equation 5.7 also increases in magnitude. Therefore, r must rise so that the positive terms balance this increase in negativity. From the success of these results, we believe that we can use the 1D approximation in this section to explain why, even when using full natural images, the new hypothesis accounts for the bell-shape behavior of lateral-inhibition extent. It turns out that this behavior arises from the dependence of the quantal-noise and local-asperity errors on intensity. Not surprisingly, if the lateral-inhibition filter were fixed, then the quantal-noise error would fall slowly as the inverse of the square root of intensity (see Figure 6 and the third term of equation 5.6). In turn, the local-asperity error would fall rapidly with intensity at low intensities (see Figure 6 and the second term of equation 5.6) and would stabilize at high intensities (see Figure 6 and the first term of equation 5.6). As a result, at low and high intensities, the local-asperity error would dominate, forcing the filter to be narrow to minimize the total error by straddling fewer boundaries. In contrast, at intermediate intensities, the most important error would be the quantal-noise error, which would force a wide filter to reduce the total error by averaging out the noise. Why does the local-asperity error fall fast with intensity at low intensities and stabilize at high intensities? At high intensities, the quantal-noise error is negligible, and the only errors are due to the straddling of borders, which does not depend on intensity. At low intensities, the quantal-noise error is large, producing many spurious high-contrast points that reduce the chance that a true Mach band is detected as an outlier, causing local-asperity errors. This explanation for the bell-shape behavior is in terms of the dependence of the local-asperity and quantal-noise errors on intensity. The explanation follows from the 1D analysis and its equation 5.6. Consequently, as the analysis of this equation showed, the existence of the bell-shape behavior is independent of parameters for m \ge 2. In other words, no change of parameters can cause the two curves in Figure 6 to stop intersecting at two points.

5.3 Retinal Biology. To discuss whether the minimal local-asperity hypothesis is biologically plausible, we must address two issues. First, can the
Figure 6: Relationship between the two types of errors used to compute the lateral-inhibition extent as a function of intensity. The units of intensity are arbitrary, since the curves were obtained from equation 5.6 with an arbitrary choice of parameters and lateral-inhibition filter radius (r). The local-asperity curve corresponds to the first two terms of that equation and the quantal-noise curve to the third term. At low and high intensities, the local-asperity error dominates, but its behavior is different in each zone. This error falls fast with intensity at low intensities but is stable at high intensities. In contrast, the quantal-noise error falls steadily and slowly with intensity and dominates at intermediate intensities. When the local-asperity error dominates, the lateral-inhibition filter tends to be narrow to avoid straddling boundaries, and when the quantal-noise error dominates, the filter tends to be wide to average out the noise.
retina implement the mathematical operations specified in equations 3.4 and 3.5 to measure the error, and, relatedly, given that the postulated inhibition is nonlinear (division-like), can the retina access the image intensities or photoreceptor outputs necessary to perform these operations? Second, where in the retina are the various computations performed, and what cellular mechanisms does the retina use for them? We address the first issue elsewhere (Balboa, 1997). There, we describe simple and plausible neurophysiological mechanisms that would take care of it. We will not address it any further here, instead devoting the rest of this section to retinal-location and cellular-mechanism issues.

The minimal local-asperity hypothesis explains spatial modulations of the horizontal cell receptive field when it adapts to light intensity. There are at least four known mechanisms that could participate in this modulation. One of the mechanisms is the variation of the gap junctions between horizontal cells (Mangel & Dowling, 1985; Weiler & Akopian, 1992; Myhr et al., 1994). This variation seems to be partially controlled by dopamine (Lasater & Dowling, 1985; Witkovsky & Dearry, 1991; Hampson, Weiler, & Vaney, 1994). Dopamine comes to the horizontal cell from interplexiform cells (for a review, see Dowling, 1987) or by diffusion from the inner plexiform layer (Hare & Owen, 1995). In both cases, amacrine cells control the dopamine level. Therefore, the place where the retina computes the local-asperity and quantal-noise errors may be the inner plexiform layer (where amacrine cells synapse). Another modulatory neurotransmitter system that contributes to horizontal cell coupling adaptation involves nitric oxide (NO; Miyachi, Murakami, & Nakaki, 1990; McMahon & Ponomareva, 1996; Pottek, Schultz, & Weiler, 1997). This agent appears to complement dopamine in its effects on the gap junctions (Pottek et al., 1997). Perhaps most interesting, while NO reduces the sensitivity of horizontal cells to glutamate, dopamine increases this sensitivity (McMahon & Ponomareva, 1996). Hence, the NO system may add to the mechanism proposed here for the modulation of lateral-inhibition strength in light and dark adaptation (see Figure 5). Another such mechanism may be the modulation of pH (Hampson et al., 1994). A third mechanism contributing to horizontal cell adaptation involves voltage-dependent conductances in this cell (Winslow & Knapp, 1991; McMahon, 1994). In this case, the control of these conductances does not come from the inner plexiform layer (IPL), but rather from the photoreceptors. Increasing the intensity hyperpolarizes the horizontal cells, modulating these conductances. When the conductances increase, it becomes harder to propagate current from cell to cell, narrowing the receptive field of the horizontal cell (Winslow & Knapp, 1991).
Because there are many types of conductances in each cell, they could contribute to a complex behavior such as the bell-shape behavior of the lateral-inhibition extent as a function of intensity. The last mechanism that we will discuss for lateral-inhibition adaptation is a specialization of teleost fish. For these animals, increasing the intensity causes the horizontal cell to send finger-like processes to the pedicles of cones. These processes, which are called spinules (Wagner, 1980; De Juan, Íñiguez, & Dowling, 1991; Weiler & Janssen-Bienhold, 1993), increase the area of contact between the cells, and thus spinules can increase the feedback strength in the OPL. It is interesting that the control of spinule growth is partially through extraretinal efferent fibers (De Juan, García, & Cuenca, 1996; De Juan & García, 1998). This suggests that an extraretinal structure may implement some of the computations of our hypothesis in teleost fish. In particular, the external sums in equation 3.5 to estimate the local-asperity error require a computation over the entire retina and therefore may require an integration over larger distances than is available in retinal processes.
5.4 Multiple Tasks. Is lateral inhibition in the inner plexiform layer, in extraretinal parts of the brain, or in invertebrates doing something different from that in the OPL? We ask this question because other lateral-inhibition models fit data from these systems well (Srinivasan et al., 1982; Atick & Redlich, 1992). It is possible, and likely, that these systems use lateral inhibition for different tasks. These tasks may include extraction of different visual attributes such as direction of motion (Barlow & Levick, 1965), orientation of edges (Sillito, 1975, 1979), and temporal modulation of intensity (Werblin, Maguire, Lukasiewicz, Eliasof, & Wu, 1988). And because these tasks are different, they may require different mechanisms of lateral inhibition, as observed experimentally (Merwine et al., 1995; Cook & McReynolds, 1998). These differences may explain why lateral inhibition of ganglion cells may not display a bell-shape behavior (Van Ness & Bouman, 1967). Similarly, lateral inhibition in insects does not appear to have a bell-shape behavior, since one can account for the properties of this inhibition with the predictive-coding theory (Srinivasan et al., 1982; van Hateren, 1992; Balboa & Grzywacz, 2000). Why would the tasks be different in insects? Perhaps the answer is that their "retina" must achieve in one or two layers many of the goals of several layers of the vertebrate system. Moreover, the brains of insects are much smaller than those of vertebrates, making neural resources more constraining (Brooke, Hanley, & Laughlin, 1998) and thus forcing "careful consideration" of tasks. Our philosophy is that for each system one must analyze tasks, limitations, and internal knowledge. There is no reason to believe that these factors are identical across systems, therefore allowing for different behavior in different systems.

Appendix
The first section of this appendix describes the computational approximations performed to implement the minimal local-asperity hypothesis. The second section describes the error-minimization procedure and the choice of the parameters.

A.1 Approximations in the Error Estimation.
A.1.1 Approximation for the Lateral-Inhibition Filter. To simplify the computation of \hat{h}_{i,j} from the properties of P_O(|\nabla I / I|_{i,j}) (see equation 3.5) and the quantal noise (see equation 3.4), we approximate \hat{h}_{i,j} with a step (square) function. In other words, we use a disk in two dimensions defined as

\hat{h}_{i,j} = \begin{cases} \dfrac{1}{\pi r^{*2}} & \text{if } (i^2 + j^2)^{1/2} \le r^* \\ 0 & \text{otherwise.} \end{cases}   (A.1)

From this approximation, the radius r in equation 3.5 is r = r^* / 2.
A.1.2 Approximation for |\nabla I / I|_{i,j}. The discrete approximation for the absolute value of the intensity's gradient used in the simulations is

|\nabla I|_{i,j} = \left( \frac{(I_{i+1,j} - I_{i-1,j})^2}{4 \Delta x^2} + \frac{(I_{i,j+1} - I_{i,j-1})^2}{4 \Delta y^2} \right)^{1/2},

where \Delta x = \Delta y is the distance between neighboring pixels. Without loss of generality, we set \Delta x = \Delta y = 1. Because we have to calculate |\nabla I / I|_{i,j}, we estimate the local intensity as

I_{i,j} = \frac{I_{i,j+1} + I_{i,j-1} + I_{i+1,j} + I_{i-1,j}}{4},

to reduce the effect of noise. When I_{i,j+1} = I_{i,j-1} = I_{i+1,j} = I_{i-1,j} = 0, there is a problem with the computation of \nabla I / I. We assume that this occurs at very dim intensities, such that the probability of each photoreceptor's absorbing a photon is small but not zero. One can approximate this case as if one of the photoreceptors absorbs a photon but the others do not. Therefore, approximately,

\left| \frac{\nabla I}{I} \right|_{i,j} = \left( \frac{1}{2^2} \left( \frac{4}{1} \right)^2 \right)^{1/2} = 2,
if the four values are zero. We justify this approximation by noting that if we wait a sufficiently long time, some photons will be absorbed, but the probability of absorbing more than one photon in the integration time of the four photoreceptors is very small.

A.2 Optimal Lateral-Inhibition Extent.
A.2.1 Error-Minimization Process. The program has to find the lateral-inhibition extent (r = r^* / 2; see equations 3.4, 3.5, and A.1) that minimizes equation 3.3. The inputs to the program are the natural image, the mean intensity, and the parameters (see the next section). The program first adds noise to the image (Balboa & Grzywacz, 2000) and then computes |\nabla I / I|_{i,j}. Next, the program plugs |\nabla I / I|^m_{i,j} into equation 3.6, uses it to calculate equation 3.5, and with the aid of this equation and equation 3.4 minimizes equation 3.3. The minimization procedure is an exhaustive search over r in geometric steps of 2, starting at 2 pixels. After completion of the exhaustive search, the program estimates the minimizing r through interpolation with a parabola around the minimum. We use a parabola as the simplest polynomial interpolation (Burden & Faires, 1989). This minimization was performed over the [10^{-1}, 10^6] range of intensities in steps of 0.5 log units.
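Two pieces of this procedure can be sketched as follows (the function names are ours; the gradient step follows section A.1.2, and the refinement step uses the standard three-point parabola formula, which must handle the unequally spaced abscissae produced by the geometric search steps):

```python
import numpy as np

def relative_gradient(I, i, j):
    """Discrete |grad(I)/I| at pixel (i, j) with dx = dy = 1 (section A.1.2):
    centered differences over the four-neighbor average of the intensity."""
    neighbors = (I[i, j + 1], I[i, j - 1], I[i + 1, j], I[i - 1, j])
    local = sum(neighbors) / 4.0
    if local == 0.0:
        return 2.0  # single-photon approximation when all four neighbors are 0
    grad = np.hypot((I[i + 1, j] - I[i - 1, j]) / 2.0,
                    (I[i, j + 1] - I[i, j - 1]) / 2.0)
    return grad / local

def parabolic_min(x0, x1, x2, y0, y1, y2):
    """Abscissa of the vertex of the parabola through three points,
    used to refine the exhaustive-search minimum (x1 is the grid minimum)."""
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return x1 - 0.5 * num / den
```

For example, for error samples y = (r - 3)^2 at the geometric grid r = 2, 4, 8, parabolic_min(2, 4, 8, 1, 1, 25) recovers the true minimum at r = 3.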
A.2.2 Parameters. There are four parameters in the equations (n, Q_Q, Q_A, and m), but they are not all independent. Because the parameters Q_Q and Q_A weigh two terms whose sum is to be minimized, these parameters are not independent: multiplying them by a common factor does not change the minimizing filter. Hence, without loss of generality, we set the parameter Q_A = 1. We also define the parameter Q = Q_Q / (2 \pi^{1/2}) so that for the filter in equation A.1, E_Q has the simple form E_Q = Q / (r \langle I \rangle^{1/2}). Thus, the only true free parameters are n, Q, and m. To learn their effects on the theory, we vary them in the ranges [1, 3], [100, 10,000], and [1, 4], respectively, in the simulations.

Acknowledgments
We thank Stewart Bloomfield for sending us his data on horizontal cell adaptation before publication and Alexander Ball for loaning us his only copy of Jamieson's doctoral dissertation. Thanks are also due to Andrew Wright for allowing us to use the Sowerby Image Database (British Aerospace) and to Alan Yuille and Scott Konishi for sharing with us their TIFF version of the images from this database. We thank Joaquín De Juan, Russell Hamer, and the reviewers of this article for critical comments. Alan Yuille, Pierre-Ives Burgi, and Davi Geiger wrote letters of support for R.B.'s work before it was defended as a thesis in computer science. Finally, we thank José Oncina for financial support during R.B.'s Ph.D. work at the University of Alicante, Alicante, Spain. This work was supported by National Eye Institute grants EY08921 and EY11170 and the William A. Kettlewell Chair to N.M.G., by Grant TIC97-0941 to Dr. José Oncina, and by a National Eye Institute Core grant to Smith-Kettlewell.

References

Adelson, E. H. (1993). Perceptual organization and the judgment of brightness. Science, 262, 2042–2044.
Altman, J. H. (1995). Photographic films. In M. Bass (Ed.), Handbook of optics, Vol. 1: Fundamentals, techniques and design. New York: McGraw-Hill.
Ammermüller, J., Müller, J. F., & Kolb, H. (1995). The organization of the turtle inner retina. II. Analysis of color-coded and directionally selective cells. J. Comp. Neurol., 358, 35–62.
Ashmore, J. F., & Falk, G. (1980). The single-photon signal in rod bipolar cells of the dogfish retina. J. Physiol. Lond., 300, 151–166.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comp., 4, 196–210.
Balboa, R. M. (1997). Teorías para la inhibición lateral en la retina basadas en autocorrelación, curtosis y aspereza local de las imágenes naturales. Unpublished doctoral dissertation, University of Alicante, Alicante, Spain.
Balboa, R. M., & Grzywacz, N. M. (1998). The minimal local-asperity theory of retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 39, S562.
Balboa, R. M., & Grzywacz, N. M. (1999a). Occlusions contribute to scaling in natural images. Manuscript submitted for publication.
Balboa, R. M., & Grzywacz, N. M. (1999b). Occlusions and their relationship with the distribution of contrasts in natural images. Manuscript submitted for publication.
Balboa, R. M., & Grzywacz, N. M. (1999c). Biological evidence for an ecological-based theory of early retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 40, S386.
Balboa, R. M., & Grzywacz, N. M. (2000). The role of early retinal lateral inhibition: More than maximizing luminance information. Vis. Neurosci., 17:1.
Baldridge, W. H. (1993). Effect of background illumination on horizontal cell receptive field size in the retina of the goldfish (Carassius auratus). Unpublished doctoral dissertation, McMaster University, Hamilton, Ontario, Canada.
Baldridge, W. H., & Ball, A. K. (1991). Background illumination reduces horizontal cell receptive field size in both normal and 6-hydroxydopamine-lesioned goldfish retinas. Visual Neurosci., 7, 441–450.
Barlow, H. B., Fitzhugh, R., & Kuffler, S. W. (1957). Change of organisation in the receptive fields of the cat's retina during dark adaptation. J. Physiol., 137, 338–354.
Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in the rabbit's retina. J. Physiol., 178, 477–504.
Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network, 9, 399–417.
Baylor, D. A., Lamb, T. D., & Yau, K.-W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 613–634.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.
Bowling, D. B. (1980). Light responses of ganglion cells in the retina of the turtle. J. Physiol., 209, 173–196.
Brooke, M. de L., Hanley, S., & Laughlin, S. B. (1998). The scaling of eye size with body mass in birds. Proc. R. Soc. Lond. B, 266, 405–412.
Burden, R. L., & Faires, J. D. (1989). Numerical analysis (4th ed.). Boston: PWS-KENT Publishing Company.
Burington, R. S. (1973). Handbook of mathematical tables and formulas. New York: McGraw-Hill.
Calkins, D. J., & Sterling, P. (1996). Absence of spectrally specific lateral inputs to midget ganglion cells in primate retina. Nature, 381(6583), 613–615.
Campbell, F. W., & Robson, J. G. (1968). Application of Fourier analysis to the visibility of gratings. J. Physiol. (Lond.), 197, 551–566.
Cook, P. B., & McReynolds, J. S. (1998). Modulation of sustained and transient lateral inhibitory mechanisms in the mudpuppy retina during light adaptation. J. Neurophysiol., 79, 197–204.
1514
Rosario M. Balboa and Norberto M. Grzywacz
Dacey, D. M. (1996). Circuitry for color coding in the primate retina. Proc. Natl. Acad. Sci. USA, 93, 582–588. De Juan, J., & Garc´õ a, M. (1998). Interocular effect of actin depolymerization on spinule formation in teleost retina. Brain Res., 792, 173–177. De Juan, J., Garc´õ a, M., & Cuenca, N. (1996). Formation and dissolution of spinules and changes in nematosome size require optic nerve integrity in black bass (Micropterus salmoides) retina. Brain Res., 707, 213–220. De Juan, J., In´ ˜ õ guez, C., & Dowling, J. E. (1991). Nematosomes in external horizontal cells of white perch (Roccus americana) retina: Changes during dark and light adaptation. Brain Res., 546, 176–180. Dowling, J. E. (1987).The retina: An approachablepart of the brain. Cambridge, MA: Belknap Press of Harvard University Press. Fain, G., & Dowling, J. E. (1973). Intracellular recordings from single rods and cones in the mudpuppy retina. Science, 180, 1178–1181. Fernandez, J. N. (1997). GIFs, JPEGs and BMPs: Handling Internet graphics. Cambridge, MA: MIT Press. Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559– 601. Fiorentini, A., & Radici, T. (1957). Binocular measurements of brightness on a eld presenting a luminance gradient. Atti. Fond. Giorgio Ronchi, XII, 453– 461. Fuortes, M. G. F., & Yeandle, S. (1964). Probability of occurrence of discrete potentials waves in the eye of the Limulus. J. Physiol., 47, 443–463. Gallant, S. I. (1995).Expert systems and decision systems using neural networks. In M. A. Arbib, (Ed.), The handbook of brain theory and neural networks (pp. 377389). Cambridge, MA: MIT Press. Gradshtein, I. S., & Ryzhik, I. M. (1994). Table of integrals, series, and products (5th ed.). San Diego, CA: Academic Press. Grzywacz, N. M., & Balboa, R. M. (1999). The minimal-local asperity theory of early retinal lateral inhibition. Manuscript submitted for publication. Grzywacz, N. M., & Hillman, P. (1988). 
Biophysical evidence that light adaptation in Limulus photoreceptors is due to a negative feedback. Biophys. J., 53, 337–348. Grzywacz, N. M., Hillman, P., & Knight, B. W. (1988). The quantal source of area supralinearity of ash responses in Limulus photoreceptors. J. Gen. Physiol., 91, 659–684. Hampson, E. C., Weiler, R., & Vaney, D. I. (1994). pH-gated dopaminergic modulation of horizontal cell gap junctions in mammalian retina. Proc. R. Soc. Lond. B., 255, 67–72. Hare, W. A., & Owen, W. G. (1995).Similar effects of carbachol and dopamine on neurons in the distal retina of the tiger salamander. Vis. Neurosci., 12, 443–455. Jamieson, M. S. (1994). Mechanisms regulating horizontal cell receptive eld size in goldsh retina. Unpublished doctoral dissertation, McMaster University, Hamilton, Ontario, Canada. Jamieson, M. S., Baldridge, W. H., & Ball, A. K. (1994). Modulation of horizontal cell receptive eld size in light-adapted godlsh retinas. In Proceedings of the Association for Research in Vision and Ophthalmology (pp. 1821, 2623).
Early Retinal Lateral Inhibition
1515
Jones, M. J., Sinha, P., Vetter, T., & Poggio, T. (1997). Top-down learning of lowlevel vision tasks. Curr. Biol., 7, 991–994. Kamermans, M., Haak, J., Habraken, J. B., & Spekreijse, H. (1996). The size of the horizontal cell receptive elds adapts to the stimulus in the light adapted goldsh retina. Vision Res., 36, 4105–4119. Kamermans, M., Kraaij, D. A., & Spekreijse, H. (1998). The cone/horizontal cell network: A possible site for color constancy. Vis. Neurosci., 15, 787–797. Lankheet, M. J., Rowe, M. H., Van Wezel, R. J., & van de Grind, W. A. (1996). Spatial and temporal properties of cat horizontal cells after prolonged dark adaptation. Vision Res., 36, 3955–3967. Lasater, E. M., & Dowling, J. E. (1985). Dopamine decreases conductance of the electrical junctions between cultured retinal horizontal cells. Proc. Natl. Acad. Sci. USA, 82, 3025–3029. Lillywhite, P. G. (1977). Single photon signals and transduction in an insect eye. Journal of Comparative Physiology, 122, 189–200. Mackeown, W. P. J. (1994). Labelling image database and its application to outdoor scene analysis. Unpublished doctoral dissertation, University of Bristol. Mangel, S. C., & Dowling, J. E. (1985). Responsiveness and receptive eld size of carp horizontal cells are reduced by prolonged darkness and dopamine. Science, 229, 1107–1109. McCarthy, S. T., & Owen, W. G. (1996). Preferential representation of natural scenes in the salamander retina. Invest. Ophthalmol. Vis. Sci., 37, S674. McMahon, D. G. (1994). Modulation of electrical synaptic transmission in zebrash retinal horizontal cells. J. Neurosci., 14, 1722–1734. McMahon, D. G., & Ponomareva, L. V. (1996). Nitric oxide and cGMP modulate retinal glutamate receptors. J. Neurophysiol., 76, 2307–2315. Merwine, D. K., Amthor, F. R., & Grzywacz, N. M. (1995). Interaction between center and surround in rabbit retinal ganglion cells. J. Neurophysiol., 73, 1547– 1567. Miyachi, E., Murakami, M., & Nakaki, T. (1990). 
Arginine blocks gap junctions between retinal horizontal cells. Neuroreport, 1, 107–110. Mobley, C. (1995).Optical properties of water. In M. Bass (Ed.),Handbook of optics, Vol. 1: Fundamentals, techniques and design. New York: McGraw-Hill. Myhr, K. L., Dong, C. J., & McReynolds, J. S. (1994). Cones contribute to lightevoked, dopamine-mediated uncoupling of horizontal cells in the mudpuppy retina. J. Neurophysiol., 72, 56–62. Ogden, T. E., Mascetti, G. G., & Pierantoni, R. (1985). The outer horizontal cell of the frog retina: Morphology, receptor input, and function. Invest. Ophthalmol. Vis. Sci., 26, 643–656. Pottek, M., Schultz, K., & Weiler, R. (1997). Effects of nitric oxide on the horizontal cell network and dopamine release in the carp retina. Vision Res., 37, 1091–1102. Rao, R., Buchsbaum, G., & Sterling, P. (1994). Rate of quantal transmitter release at the mammalian rod synapse. J. Biophys., 67, 57–63. Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden-Day.
1516
Rosario M. Balboa and Norberto M. Grzywacz
Reifsnider, E. S., & Tranchina, D. (1995).Background contrast modulates kinetics and lateral spread of responses to superimposed stimuli in outer retina. Vis. Neurosci., 12, 1105–1126. Ridley, M. (1993). Evolution. Boston: Blackwell Scientic. Robson, J. G., & Frishman L. J. (1995). Response linearity and kinetics of the cat retina: The bipolar cell component of the dark-adapted electroretinogram. Vis. Neurosci., 12, 837–850. Rodiek, R. W., & Stone, J. (1965). Analysis of receptive elds of cat retinal ganglion cells. J. Neurophysiol., 28, 832–849. Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Letter, 73, 814–817. Rushton, W. A. H. (1961). The intensity factor in vision. In W. D. McElroy & H. B. Glass (Eds.), Light and Life (pp. 706–722). Baltimore: Johns Hopkins University Press. Shapley, R. M., & Tolhurst, D. J. (1973).Edge-detectors in human vision. J. Physiol. (Lond.), 229, 165–183. Silito, A. M. (1975). The contribution of inhibitory mechanisms to the receptive eld properties of neurones in the striate cortex of the cat. J. Physiol. (Lond.), 250, 305–322. Silito, A. M. (1979). Inhibitory mechanisms inuencing complex cell orientation selectivity and their modication at high resting discharge levels. J. Physiol. (Lond.) 289, 33–53. Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B., 216, 427–459. Stevens, S. S. (1970). Neural events and the psychophysical law. Science, 170, 1043–1050. Tranchina, D., Gordon, J., Shapley, R., & Toyoda, J. (1981). Linear information processing in the retina: A study of horizontal cell responses. Proc. Natl. Acad. Sci. USA, 78, 6540–6542. van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive elds of y LMC’s, and experimental validation. J. Comp. Physiol. A, 171, 157–170. Van Ness, F. L., & Bouman, M. A. (1967). Spatial modulation transfer in the human eye. J. Opt. Soc. 
Am., 57, 401–406. Verweij, J., Kamermans, M., & Spekreijse, H. (1996). Horizontal cells feed back to cones by shifting the cone calcium-current activation range. Vision Res., 36(24), 3943–3953. Verweij, J., Kamermans, M., van den Aker, E. C., & Spekreijse, H. (1996). Modulation of horizontal cell receptive elds in the light adapted goldsh retina. Vision Res., 36, 3913–3923. Vu, T. Q., McCarthy, S. T., & Owen, W. G. (1997). Linear transduction of natural stimuli by dark-adapted and light-adapted rods of the salamander, it Ambystoma tigrinum. J. Physiol., 505, 193–204. Wagner, H. J. (1980).Light-dependent plasticity of the morphology of horizontal cell terminals in cone pedicles of sh retinas. J. Neurocytol., 9, 573–590. Weiler, R., & Akopian, A. (1992). Effects of background illuminations on the receptive eld size of horizontal cells in the turtle retina are mediated by
Early Retinal Lateral Inhibition
1517
dopamine. Neurosci. Lett., 140, 121–124. Weiler, R., & Janssen-Bienhold, U. (1993). Spinule-type neurite outgrowth from horizontal cells during light adaptation in the carp retina: An actindependent process. J. Neurocytol., 22, 129–139. Weiss, Y., Edelman, S., & Fahle, M. (1993). Models of perceptual learning in vernier hyperacuity. Neural Comp., 5, 695–718. Werblin, F. S. (1977). Synaptic interactions mediating bipolar response in the retina of the tiger salamander. In H. B. Barlow and P. Fatt (Eds.), Vertebrate Photoreception (pp. 205–230). San Diego, CA: Academic Press. Werblin, F. S., Maguire, G., Lukasiewicz, P., Eliasof, S., & Wu, S. M. (1988). Neural interactions mediating the detection of motion in the retina of the tiger salamander. Vis. Neurosci., 1, 317–329. Winslow, R. L., & Knapp, A. G. (1991). Dynamic models of the retinal horizontal cell network. Prog. Biophys. Mol. Biol., 56, 107–133. Wishart, K. A., Frisby, J. P., & Buckley, D. (1997). The role of 3-D surface slope in a lightness/brightness effect. Vis. Res., 37, 467–473. Witkovsky, P., & Dearry, A. (1991)Functional roles of dopamine in the vertebrate retina. Prog. Ret. Res., 11, 247–292. Xin, D., & Bloomeld, S. A. (1999). Dark- and light-induced changes in coupling between horizontal cells in mammalian retina. J. Comp. Neurol., 405, 75–87. Xin, D., Bloomeld, S. A., & Persky, S. E. (1994). Effect of background illumination on receptive eld and tracer-coupling size of horizontal cells and AII amacrine cells in the rabbit retinas. In Proceedings of the Association for Research in Vision and Ophthalmology (p. 1363). Zhu, S. C., & Mumford, D. (1997). Prior learning and Gibbs reaction-diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 1236–1250. Zhu, S. C., & Mumford, D. (1998). GRADE: Gibbs reaction and diffusion equations. A framework for pattern synthesis, image denoising, and removing clutter. In Proceedings of the 6th International Conference on Computer Vision (pp. 847–854). 
Received October 21, 1998; accepted August 6, 1999.
NOTE
Communicated by Emilio Salinas
Multidimensional Encoding Strategy of Spiking Neurons

Christian W. Eurich
Stefan D. Wilke
Institut für Theoretische Physik, Universität Bremen, D-28334 Bremen, Germany

Neural responses in sensory systems are typically triggered by a multitude of stimulus features. Using information theory, we study the encoding accuracy of a population of stochastically spiking neurons characterized by different tuning widths for the different features. The optimal encoding strategy for representing one feature most accurately consists of narrow tuning in the dimension to be encoded, to increase the single-neuron Fisher information, and broad tuning in all other dimensions, to increase the number of active neurons. Extremely narrow tuning without sufficient receptive field overlap will severely worsen the coding. This implies the existence of an optimal tuning width for the feature to be encoded. Empirically, only a subset of all stimulus features will normally be accessible. In this case, relative encoding errors can be calculated that yield a criterion for the function of a neural population based on the measured tuning curves.

1 Introduction
The question of an optimal tuning width for the representation of a stimulus by a neural population is still controversial. On the one hand, narrowly tuned cells are frequently encountered, emphasizing the importance of single neurons for perception and motor control (Lettvin, Maturana, McCulloch, & Pitts, 1959; Barlow, 1972). On the other hand, theoretical arguments suggest that in most cases, broadly tuned units and distributed information processing are better suited for accurate representations (Hinton, McClelland, & Rumelhart, 1986; Georgopoulos, Schwartz, & Kettner, 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994; Snippe, 1996; Eurich & Schwegler, 1997; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999). A useful measure of the information content of a set of spike trains emitted by a population of neurons is the Fisher information matrix (Deco & Obradovic, 1997; Brunel & Nadal, 1998), which yields a lower bound on the mean squared error for unbiased estimators of the encoded quantity (Cramér-Rao inequality). In a recent approach using Fisher information, Zhang and Sejnowski (1999) derived an expression for the encoding accuracy of a population of spiking neurons as a function of the tuning width,

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1519–1529 (2000)
$\sigma$, for radially symmetric receptive fields in a $D$-dimensional space. The calculation shows that the encoding accuracy increases with $\sigma$ if $D \ge 3$. This result can be complemented by the following consideration. Neurons generally respond to many stimulus features. The main function of a neural population, however, may be the processing of only a subset of these features. In the following, we derive an optimal coding strategy for a population of neurons whose tuning widths differ in the different dimensions. We also study the case of extremely small receptive fields, where the population approach breaks down, and demonstrate the existence of an optimal tuning width if only one of the stimulus features is to be encoded accurately. Furthermore, we consider the situation that only a part of all encoded stimulus properties is accessible to an observer. General formulas will be illustrated by the example of a population of neurons with gaussian tuning functions and Poissonian spike statistics. Since part of this article is a continuation of Zhang and Sejnowski (1999), we adopt much of their formalism.

2 Model
Consider a stimulus characterized by a position $\mathbf{x} = (x_1, \ldots, x_D)$ in a $D$-dimensional stimulus space, where $x_i$ ($i = 1, \ldots, D$) is measured relative to the total range of values in the $i$th dimension such that it is dimensionless. Furthermore, consider a population of $N$ identical stochastically spiking neurons that fire $\mathbf{n} = (n^{(1)}, \ldots, n^{(k)}, \ldots, n^{(N)})$ spikes in a time interval $\tau$ following the presentation of the stimulus. The joint probability distribution, $P(\mathbf{n}; \mathbf{x})$, is assumed to take the form

$$P(\mathbf{n}; \mathbf{x}) = \prod_{k=1}^{N} P^{(k)}\!\left(n^{(k)}; \mathbf{x}\right), \tag{2.1}$$

that is, the neurons have independent spike generation mechanisms. Note that the neural firing rates may still be correlated; the neurons may have common input or even share the same tuning function. The tuning function of neuron $k$, $f^{(k)}(\mathbf{x})$, gives the mean firing rate of neuron $k$ in response to the stimulus at position $\mathbf{x}$. Unlike Zhang and Sejnowski (1999), we assume here a form of the tuning function that is not necessarily radially symmetric,

$$f^{(k)}(\mathbf{x}) = F\phi\!\left(\sum_{i=1}^{D} \frac{(x_i - c_i^{(k)})^2}{\sigma_i^2}\right) =: F\phi\!\left(\xi^{(k)2}\right), \tag{2.2}$$

where $\xi_i^{(k)2} := (x_i - c_i^{(k)})^2 / \sigma_i^2$ for $i = 1, \ldots, D$, and $\xi^{(k)2} := \xi_1^{(k)2} + \cdots + \xi_D^{(k)2}$. $F > 0$ denotes the maximal firing rate of the neurons, which requires that $\max_z \phi(z) = 1$. The firing rates depend on the stimulus only through the local values of the tuning functions, so that $P^{(k)}(n^{(k)}; \mathbf{x})$ can be written in the form $P^{(k)}(n^{(k)}; \mathbf{x}) = S(n^{(k)}, f^{(k)}(\mathbf{x}), \tau)$. The function $S : \mathbb{N}_0 \times [0; F] \times\, ]0; \infty[\ \longrightarrow\ ]0; 1]$
is required to be logarithmically differentiable with respect to its second argument but is otherwise arbitrary. For a population of tuning functions with centers $\mathbf{c}^{(1)}, \ldots, \mathbf{c}^{(N)}$, a density $\eta(\mathbf{x})$ is introduced according to $\eta(\mathbf{x}) := \sum_{k=1}^{N} \delta(\mathbf{x} - \mathbf{c}^{(k)})$. The Fisher information matrix, $(J_{ij})$, is defined as

$$J_{ij}(\mathbf{x}) := E\!\left[\left(-\frac{\partial}{\partial x_i} \ln P(\mathbf{n}; \mathbf{x})\right)\left(-\frac{\partial}{\partial x_j} \ln P(\mathbf{n}; \mathbf{x})\right)\right] \tag{2.3}$$

(Deco & Obradovic, 1997), where $E[\ldots]$ denotes the expectation value over the probability distribution $P(\mathbf{n}; \mathbf{x})$. The Cramér-Rao inequality gives a lower bound on the expected estimation error in the $i$th dimension, $\epsilon^2_{i,\min}$ ($i = 1, \ldots, D$), provided that the estimator is unbiased. In the case of a diagonal Fisher information matrix, it is given by $\epsilon^2_{i,\min} = 1 / J_{ii}(\mathbf{x})$.

3 Information Content of Neural Responses

3.1 Population Fisher Information. For a single model neuron $k$ described in the previous section, the Fisher information (see equation 2.3) reduces to

$$J_{ij}^{(k)}(\mathbf{x}) = \frac{1}{\sigma_i \sigma_j}\, A_\phi\!\left(\xi^{(k)2}, F, \tau\right) \xi_i^{(k)} \xi_j^{(k)}. \tag{3.1}$$

The function $A_\phi$, which is independent of $k$, abbreviates the expression

$$A_\phi(z, F, \tau) := 4F^2 \phi'(z)^2 \sum_{n=0}^{\infty} S(n, F\phi(z), \tau)\, T^2[n, F\phi(z), \tau], \tag{3.2}$$

where $T[n, z, \tau] := \frac{\partial}{\partial z} \ln S(n, z, \tau)$ and $\phi'(z) := \frac{d}{dz}\phi(z)$.

The independence assumption (see equation 2.1) implies that the population Fisher information, $J_{ij}(\mathbf{x})$, is the sum of the contributions of the individual neurons, $J_{ij}(\mathbf{x}) = \sum_{k=1}^{N} J_{ij}^{(k)}(\mathbf{x})$. For a constant distribution of tuning curves, $\eta(\mathbf{x}) \equiv \eta \equiv \mathrm{const.}$, the population Fisher information becomes independent of $\mathbf{x}$, and the off-diagonal elements vanish (Zhang & Sejnowski, 1999). In this case, the diagonal elements $J_i := J_{ii}$ are given by

$$J_i = \eta\, D\, K_\phi(F, \tau, D)\, \frac{\prod_{k=1}^{D} \sigma_k}{\sigma_i^2}, \tag{3.3}$$

where $K_\phi$ is defined to be

$$K_\phi(F, \tau, D) := \frac{1}{D} \int_{-\infty}^{\infty} d\xi_1 \cdots \int_{-\infty}^{\infty} d\xi_D\, A_\phi(\xi^2, F, \tau)\, \xi_1^2. \tag{3.4}$$
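Equation 3.3 can be checked numerically for the gaussian-tuning, Poissonian-spiking case worked out in section 3.3 (equations 3.13 and 3.14): summing the single-neuron contributions over a dense grid of centers should reproduce the closed form. The Python sketch below is illustrative only; all parameter values are invented and do not come from the article.

```python
import numpy as np

# Check of equation 3.3 for gaussian tuning and Poissonian spiking (D = 2):
# the summed single-neuron Fisher information J_1 should approach
# (2*pi)^(D/2) * eta * F * tau * (prod_k sigma_k) / sigma_1^2 (equation 3.14).
F, tau = 50.0, 0.2                 # maximal rate, counting window (assumed values)
sigma1, sigma2 = 0.05, 0.15        # tuning widths in the two dimensions
delta = 0.02                       # center spacing -> density eta = 1/delta^2
c = np.arange(-1.0, 1.0, delta)
c1, c2 = np.meshgrid(c, c)         # regular grid of tuning-curve centers
x1, x2 = 0.0, 0.0                  # stimulus position

xi1 = (x1 - c1) / sigma1
xi2 = (x2 - c2) / sigma2
xi_sq = xi1**2 + xi2**2
# equation 3.13 with i = j = 1:
J11 = (F * tau / sigma1**2) * np.exp(-xi_sq / 2.0) * xi1**2
J1_numeric = J11.sum()

eta = 1.0 / delta**2
J1_theory = 2.0 * np.pi * eta * F * tau * sigma2 / sigma1   # equation 3.14, D = 2
print(J1_numeric, J1_theory)       # the two values should nearly coincide
```

For a spacing `delta` well below `sigma1`, the grid sum is an accurate Riemann approximation of the integral behind equation 3.4, so the agreement is very close.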
For identical tuning widths in all dimensions, $\sigma_i \equiv \sigma$ ($i = 1, \ldots, D$), the total Fisher information, $J := (\sum_{i=1}^{D} J_i^{-1})^{-1}$, is given by $J = \eta K_\phi(F, \tau, D)\, \sigma^{D-2}$, that is, equation 2.3 from Zhang and Sejnowski (1999) is recovered.

Equation 3.3 shows that the Fisher information in the $i$th dimension is determined by a trade-off between the product of the tuning widths in the remaining dimensions, $\prod_{k \neq i} \sigma_k$, and the tuning width in dimension $i$, $\sigma_i$. In order to assess the consequences of equation 3.3 for neural encoding strategies, we provide an intuitive interpretation of the ratio of tuning widths by introducing effective receptive fields.

3.2 Effective Receptive Fields and Encoding Subpopulation. The tuning functions $f^{(k)}(\mathbf{x})$ encountered empirically typically have a single maximum. For such curves, large values of the single-neuron Fisher information (see equation 3.1) are generally restricted to a region around the center of the tuning function, $\mathbf{c}^{(k)}$. The fraction $p(b)$ of Fisher information that falls into the region $E_b$ given by $\sqrt{\xi^{(k)2}} \le b$ around $\mathbf{c}^{(k)}$ is

$$p(b) := \frac{\int_{E_b} d^D x \sum_{i=1}^{D} J_{ii}(\mathbf{x})}{\int_{\mathbb{R}^D} d^D x \sum_{i=1}^{D} J_{ii}(\mathbf{x})}, \tag{3.5}$$

where the index $(k)$ was dropped because the tuning curves are assumed to have identical forms. A straightforward calculation shows that $p(b)$ is given by

$$p(b) = \frac{\int_0^b d\xi\, \xi^{D+1} A_\phi(\xi^2, F, \tau)}{\int_0^\infty d\xi\, \xi^{D+1} A_\phi(\xi^2, F, \tau)}. \tag{3.6}$$

Equation 3.6 allows the definition of an effective receptive field, $\mathrm{RF}^{(k)}_{\mathrm{eff}}$, inside of which neuron $k$ conveys a major fraction $p_0$ of Fisher information,

$$\mathrm{RF}^{(k)}_{\mathrm{eff}} := \left\{ \mathbf{x} \,\middle|\, \sqrt{\xi^{(k)2}} \le b_0 \right\}, \tag{3.7}$$

where $b_0$ is chosen such that $p(b_0) = p_0$.

The Fisher information a neuron $k$ carries is small unless $\mathbf{x} \in \mathrm{RF}^{(k)}_{\mathrm{eff}}$. This has the consequence that a fixed stimulus $\mathbf{x}$ is actually encoded by only a subpopulation of neurons. If the distribution of tuning functions does not vary in the proximity of the stimulus position $\mathbf{x}$, $\eta(\mathbf{x}') \equiv \eta = \mathrm{const.}$ for $|x_i - x'_i| < b_0 \sigma_i$ ($i = 1, \ldots, D$), the point $\mathbf{x}$ in stimulus space is covered by

$$N_{\mathrm{code}} := \eta\, \frac{2\pi^{D/2} (b_0)^D}{D\, \Gamma(D/2)} \prod_{j=1}^{D} \sigma_j \tag{3.8}$$
Multidimensional Encoding Strategy of Spiking Neurons
1523
receptive fields. With the help of equation 3.8, the population Fisher information (see equation 3.3) can be rewritten as

$$J_i = K_\phi(F, \tau, D)\, \frac{D^2\, \Gamma(D/2)}{2\pi^{D/2} (b_0)^D}\, \frac{N_{\mathrm{code}}}{\sigma_i^2}. \tag{3.9}$$

Equation 3.9 can be interpreted as follows: We assume that the population of neurons encodes stimulus dimension $i$ accurately, while all other dimensions are of secondary importance. The minimal encoding error for dimension $i$, $J_i^{-1}$, is determined by the shape of the tuning curve, which enters through $b_0$ and $K_\phi(F, \tau, D)$; by the tuning width in dimension $i$, $\sigma_i$; and by the size of the active subpopulation, $N_{\mathrm{code}}$. There is a trade-off between $\sigma_i$ and $N_{\mathrm{code}}$. On the one hand, the encoding error can be decreased by decreasing $\sigma_i$, which enhances the Fisher information carried by each single neuron. Decreasing $\sigma_i$, on the other hand, will also shrink the active subpopulation via equation 3.8. This impairs the encoding accuracy, because the stimulus position is evaluated by fewer independent estimators. If equation 3.9 is valid due to a sufficient receptive field overlap, $N_{\mathrm{code}}$ can be increased by increasing the tuning widths, $\sigma_j$, in all other dimensions $j \neq i$. This effect is illustrated in Figure 1.

3.3 Narrow Tuning. In order to study the effects of narrow tuning in dimension $i$, $\sigma_i \longrightarrow 0$, we consider a constant distribution of stimuli, $\rho(\mathbf{x}) = \mathrm{const.}$, in the region of stimulus space containing the receptive fields of the neural population. A straightforward calculation shows that in this case, the stimulus-averaged Fisher information,

$$\langle J_i \rangle := \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_D\, \rho(\mathbf{x})\, J_i(\mathbf{x}), \tag{3.10}$$

is given by $\langle J_i \rangle = J_i$, that is, the average Fisher information for arbitrary distributions of tuning functions $\eta(\mathbf{x})$ is equal to the Fisher information (see equation 3.3) for the uniformly distributed population. Even if $\sigma_i$ becomes so small that gaps appear between the receptive fields, the mean Fisher information still increases with decreasing $\sigma_i$. This property is due to those few stimuli that can be localized extremely accurately by the firing of the narrowly tuned cells. The majority of stimuli, however, fall into the gaps and are not well represented any more. A neural system with these properties is obviously a bad encoder, which shows that $\langle J_i \rangle$ is not a suitable measure of the system performance. Consider instead the stimulus-averaged squared minimal encoding error for unbiased estimators,

$$\langle \epsilon^2_{i,\min} \rangle := \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_D\, \rho(\mathbf{x}) \left[ J(\mathbf{x})^{-1} \right]_{ii}. \tag{3.11}$$
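The diagonal elements of $J(\mathbf{x})^{-1}$ in equation 3.11 can be expanded in the eigenbasis of the Fisher matrix, which is the identity behind the bound in equation 3.12. The following sketch checks both on a random symmetric positive-definite matrix standing in for $J(\mathbf{x})$ (purely illustrative; not the authors' computation):

```python
import numpy as np

# A random symmetric positive-definite matrix stands in for the Fisher
# matrix J(x); the shift by 3*I guarantees positive eigenvalues.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
J = A @ A.T + 3.0 * np.eye(3)

lam, U = np.linalg.eigh(J)             # J = U diag(lam) U^T, columns of U
Jinv = np.linalg.inv(J)
for i in range(J.shape[0]):
    # spectral identity: [J^-1]_ii = sum_j U_ij^2 / lambda_j
    from_spectrum = (U[i, :] ** 2 / lam).sum()
    assert np.isclose(Jinv[i, i], from_spectrum)
    # the squared entries of each row of U sum to one, so this expression
    # is bounded below by 1 / lambda_max, as used in equation 3.12
    assert Jinv[i, i] >= 1.0 / lam.max() - 1e-12
print("spectral identity and lower bound verified")
```

The bound explains the divergence argument that follows: wherever gaps in receptive-field coverage drive the largest eigenvalue of $J(\mathbf{x})$ toward zero, the diagonal of the inverse, and hence the minimal encoding error, blows up.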
Figure 1: Population response to the presentation of a stimulus characterized by parameters $x_{1,s}$ and $x_{2,s}$. Feature $x_1$ is to be encoded accurately. Effective receptive field shapes are indicated for both populations. If neurons are narrowly tuned in $x_2$ (A), the active population (shaded) is small (here: $N_{\mathrm{code}} = 3$). Broadly tuned receptive fields for $x_2$ (B) yield a much larger population (here: $N_{\mathrm{code}} = 15$), thus increasing the encoding accuracy.
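The subpopulation sizes sketched in Figure 1 follow from equation 3.8. A minimal sketch, with hypothetical values for the center density $\eta$ and the cutoff $b_0$ (chosen only for illustration):

```python
import math

# N_code = eta * [2 * pi^(D/2) * b0^D / (D * Gamma(D/2))] * prod_j sigma_j
# (equation 3.8). All numerical values are illustrative assumptions.
def n_code(eta, b0, sigmas):
    D = len(sigmas)
    unit_volume = 2.0 * math.pi ** (D / 2.0) * b0 ** D / (D * math.gamma(D / 2.0))
    return eta * unit_volume * math.prod(sigmas)

eta, b0 = 2500.0, 2.0
narrow = n_code(eta, b0, [0.05, 0.05])   # narrow tuning in x2
broad = n_code(eta, b0, [0.05, 0.25])    # broad tuning in x2
print(narrow, broad)                      # broad tuning recruits 5x as many neurons
```

Broadening $\sigma_2$ by a factor of five enlarges the encoding subpopulation by the same factor, which is exactly the effect illustrated in Figure 1.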
For uniformly distributed tuning curves and a sufficient receptive field overlap, equation 3.11 can be simplified using equation 3.9, which yields $\langle \epsilon^2_{i,\min} \rangle = 1 / J_i$ as expected.

For narrowly tuned cells, however, the condition $\eta(\mathbf{x}) \equiv \eta = \mathrm{const.}$ breaks down, and equation 3.9 is no longer valid. The following argument shows that in contrast to the high-overlap approximation $1 / J_i$, $\langle \epsilon^2_{i,\min} \rangle$ will diverge for $\sigma_i \to 0$. Let $\lambda_k(\mathbf{x}) \ge 0$ be the $k$th eigenvalue of the real-valued, symmetric Fisher information matrix, and $U(\mathbf{x})$ the orthogonal transformation that diagonalizes $J(\mathbf{x})$, that is, $U(\mathbf{x})\, J(\mathbf{x})\, U(\mathbf{x})^T = \mathrm{diag}[\lambda_1(\mathbf{x}), \ldots, \lambda_D(\mathbf{x})]$. If $\lambda_{\max}(\mathbf{x}) := \max_j \{\lambda_j(\mathbf{x})\}$, one has

$$\langle \epsilon^2_{i,\min} \rangle = \left\langle \sum_{j=1}^{D} \frac{\left[U(\mathbf{x})_{ij}\right]^2}{\lambda_j(\mathbf{x})} \right\rangle \ge \left\langle \sum_{j=1}^{D} \frac{\left[U(\mathbf{x})_{ij}\right]^2}{\lambda_{\max}(\mathbf{x})} \right\rangle = \left\langle \frac{1}{\lambda_{\max}(\mathbf{x})} \right\rangle \tag{3.12}$$
independent of $i$. For $\sigma_i \to 0$, regions of stimulus space emerge that are not covered by any receptive fields. These gaps are characterized by very small eigenvalues of $J(\mathbf{x})$, that is, $\lambda_{\max}(\mathbf{x}) \to 0$. Thus, the minimal encoding error $\langle \epsilon^2_{i,\min} \rangle$ diverges in the presence of unrepresented areas in stimulus space. This implies that the total error $\langle \epsilon^2_{\min} \rangle := \sum_{i=1}^{D} \langle \epsilon^2_{i,\min} \rangle$ will also become infinite if any of the tuning widths approaches zero. Equation 3.3 shows that the accuracy also decreases for large $\sigma_i$. Consequently, there must be an optimal tuning width between the two regimes of broad and small tuning. In the next section, we calculate this optimal tuning width in a specific example.

Example: Poissonian Spiking and Gaussian Tuning Functions. For Poissonian spike generation and gaussian tuning functions, the single-neuron Fisher information, equation 3.1, becomes

$$J_{ij}^{(k)}(\mathbf{x}) = \frac{1}{\sigma_i \sigma_j}\, F\tau \exp\!\left(-\xi^{(k)2}/2\right) \xi_i^{(k)} \xi_j^{(k)}. \tag{3.13}$$

The population Fisher information, equation 3.3, reduces to

$$J_i = (2\pi)^{D/2}\, \eta F \tau\, \frac{\prod_{k=1}^{D} \sigma_k}{\sigma_i^2}. \tag{3.14}$$

The definition of the effective receptive field is obtained from $p(b)$ in equation 3.6, for which we have

$$p(b) = \begin{cases} 1 - e^{-b^2/2} \displaystyle\sum_{k=0}^{D/2} \frac{b^{2k}}{2^k k!}, & D \text{ even} \\[2ex] \dfrac{D!!}{2^{D/2}\, \Gamma(1 + D/2)} \left[ \sqrt{\dfrac{\pi}{2}}\, \operatorname{Erf}\!\left(\dfrac{b}{\sqrt{2}}\right) - e^{-b^2/2} \displaystyle\sum_{k=0}^{(D-1)/2} \frac{b^{D-2k}}{(D-2k)!!} \right], & D \text{ odd}, \end{cases} \tag{3.15}$$

where $\operatorname{Erf}(x) := \frac{2}{\sqrt{\pi}} \int_0^x dt\, \exp(-t^2)$ is the gaussian error function.

For a specific distribution of tuning curves in the stimulus space, the optimal tuning width mentioned in the previous paragraph can be calculated explicitly. Consider a population of neurons with tuning curves aligned on a regular grid with fixed spacing $\Delta$, such that $\mathbf{c}^{(k)} = \sum_{l=1}^{D} \Delta\, k_l \mathbf{e}_l$, where $k_l \in \mathbb{Z}$, and $\mathbf{e}_l$ is the $l$th unit vector. For simplicity, we restrict the calculation to the case $D = 2$ and study the stimulus-averaged minimal encoding error $\langle \epsilon^2_{1,\min} \rangle$
defined in equation 3.11 as a function of the tuning widths $\hat\sigma_1 := \sigma_1 / \Delta$ and $\hat\sigma_2 := \sigma_2 / \Delta$. As a consequence of the grid regularity, it can be written in the form

$$\langle \epsilon^2_{1,\min} \rangle = \Delta^2\, \frac{4 \hat\sigma_1^3 \hat\sigma_2}{F\tau} \int_0^{1/(2\hat\sigma_1)} d\xi_1 \int_0^{1/(2\hat\sigma_2)} d\xi_2 \left[ E_2(\xi_1, \hat\sigma_1) E_0(\xi_2, \hat\sigma_2) - \frac{E_1(\xi_1, \hat\sigma_1)^2 E_1(\xi_2, \hat\sigma_2)^2}{E_2(\xi_2, \hat\sigma_2) E_0(\xi_1, \hat\sigma_1)} \right]^{-1}, \tag{3.16}$$

where $E_l(\xi, \hat\sigma) := \sum_{k \in \mathbb{Z}} \exp[-(\xi - k/\hat\sigma)^2/2]\, (\xi - k/\hat\sigma)^l$. For large tuning width in either direction, $\hat\sigma \gg 1$, the sum in $E_l(\xi, \hat\sigma)$ may be approximated by an integral that is independent of $\xi$, and one finds $\lim_{\hat\sigma \to \infty} E_1(\xi, \hat\sigma) \to 0$ and $\lim_{\hat\sigma \to \infty} E_l(\xi, \hat\sigma) \to \sqrt{2\pi}\, \hat\sigma$ for $l = 0, 2$. Thus, the broad-tuning limit $\hat\sigma_1 \gg 1$ and $\hat\sigma_2 \gg 1$ yields $\langle \epsilon^2_{1,\min} \rangle = \Delta^2 \hat\sigma_1 / (2\pi F \tau \hat\sigma_2) = 1 / J_1$; one recovers the broad-tuning Fisher information $J_1$ from equation 3.14 with $D = 2$ and $1/\eta = \Delta^2$.

The encoding error (see equation 3.16) is plotted in Figure 2 as a function of $\hat\sigma_1$ for different values of $\hat\sigma_2$. The agreement with the broad-tuning approximation for large tuning widths is clearly visible. If one of the tuning widths is small, however, the minimal encoding error $\langle \epsilon^2_{1,\min} \rangle$ no longer equals the inverse of equation 3.14. The deviations can be observed in Figure 2 for small $\hat\sigma_1$ (right part of the plotted functions) and small $\hat\sigma_2$ (upper curve). Thus, Figure 2 reflects two main results of this article. First, the broader the tuning is for feature 2, the smaller the encoding error for feature 1. This illustrates our proposed encoding strategy. Second, the encoding error for feature 1 has a unique minimum as a function of $\hat\sigma_1$. From the arguments of the previous paragraph, one expects the optimal tuning width $\sigma_1^{\mathrm{opt}}$ to be approximately equal to the smallest possible tuning width for which no gaps appear between the receptive fields of neighboring neurons. Indeed, the numerical calculation yields an optimal tuning curve width of $2\sigma_1^{\mathrm{opt}} \approx 0.8\Delta$, which corresponds to the array of tuning curves shown in the inset of Figure 2.

For very small tuning widths $\hat\sigma_i$, one practically leaves the validity range of our Fisher information analysis. If the stimulus falls into a gap between receptive fields such that the cells do not fire within a reasonable time interval $\tau$, there is no possibility of making an unbiased estimation of the stimulus features. Thus, the Fisher information measure cannot be applied for $\hat\sigma_i \to 0$. At the calculated optimal tuning width in our example, however, there is still a sufficient receptive field overlap to allow unbiased estimates (cf. the inset of Figure 2). Therefore, we argue that $\sigma_1^{\mathrm{opt}}$ is well within the range where Fisher information is a valid measure of encoding accuracy.
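Equation 3.16 lends itself to direct numerical evaluation. The sketch below (truncation and quadrature settings are ad hoc choices, and this is not the authors' code) reproduces the broad-tuning limit $\hat\sigma_1/(2\pi\hat\sigma_2)$ and the drop in the error of feature 1 as $\hat\sigma_2$ is broadened:

```python
import numpy as np

# Numerical evaluation of equation 3.16. The lattice sums E_l are truncated
# at |k| <= K, and a simple midpoint rule is used for the double integral;
# both settings are ad hoc. Units: delta = F = tau = 1.
def E(l, xi, s_hat, K=50):
    k = np.arange(-K, K + 1)
    u = xi[..., None] - k / s_hat
    return (np.exp(-u**2 / 2.0) * u**l).sum(axis=-1)

def err1_min(s1, s2, n=80):
    a1, a2 = 1.0 / (2.0 * s1), 1.0 / (2.0 * s2)      # integration limits
    x1 = (np.arange(n) + 0.5) * a1 / n               # midpoint abscissae
    x2 = (np.arange(n) + 0.5) * a2 / n
    X1, X2 = np.meshgrid(x1, x2, indexing="ij")
    bracket = (E(2, X1, s1) * E(0, X2, s2)
               - E(1, X1, s1)**2 * E(1, X2, s2)**2
               / (E(2, X2, s2) * E(0, X1, s1)))
    integral = (1.0 / bracket).sum() * (a1 / n) * (a2 / n)
    # <eps^2_1,min> in units of delta^2 / (F * tau):
    return 4.0 * s1**3 * s2 * integral

# broad tuning approaches the analytical limit s1/(2*pi*s2):
print(err1_min(3.0, 3.0))
# broadening the second dimension lowers the encoding error of the first:
print(err1_min(0.5, 0.5), err1_min(0.5, 2.0))
```

Scanning `err1_min` over a range of `s1` at fixed `s2` traces out curves of the kind shown in Figure 2, including the minimum at intermediate $\hat\sigma_1$.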
Figure 2: Numerical results for the stimulus-averaged squared minimal encoding error from equation 3.16. Dashed line: $\langle \epsilon^2_{1,\min} \rangle / (\Delta^2 F^{-1} \tau^{-1})$ as a function of $\hat\sigma_1$ for $D = 2$ and different values of $\hat\sigma_2$ (from top to bottom): $\hat\sigma_2 = 0.25$, 0.5, 1, 2. Solid line: analytical broad-tuning result, that is, $(1/J_1) / (\Delta^2 F^{-1} \tau^{-1})$ from equation 3.14. The inset shows gaussian tuning curves of optimal width, $\sigma^{\mathrm{opt}} \approx 0.4\Delta$.
3.4 The Problem of Hidden Dimensions. A situation that is frequently encountered empirically is that of incomplete knowledge of the neural behavior. Usually an observer will study a neural population with respect to a number of well-defined stimulus dimensions and be unable to find all relevant dimensions or unable to find a suitable parameterization for certain stimulus properties. We assume that only $d \le D$ of the $D$ total dimensions are known. The Fisher information (see equation 3.3) can be written as $J_i = X\, (\prod_{j=1}^{d} \sigma_j) / \sigma_i^2$, where $X := \eta D K_\phi(F, \tau, D) \prod_{j=d+1}^{D} \sigma_j$ is an experimentally unknown constant. If the neurons' tuning curves have been measured in more than one dimension, $X$ can be eliminated by considering the tuning widths as a relative measure of information content,

$$\frac{\epsilon^2_{i,\min}}{\sum_{j=1}^{d} \epsilon^2_{j,\min}} = \frac{\sigma_i^2}{\sum_{j=1}^{d} \sigma_j^2}, \tag{3.17}$$
where $i$ is one of the known dimensions $1, \ldots, d$. Equation 3.17 is independent of the unknown (or experimentally ignored) stimulus dimensions $d+1, \ldots, D$, which are to be held fixed during the measurement of the $d$ tuning widths. On the basis of the accessible tuning widths only, it gives a relative quantitative measure of how accurately the individual stimulus dimensions are encoded by the neural population. Equation 3.17 states that if $\sigma_i^2$ is small compared to the sum of the squares of all measured tuning widths, the population activity allows an accurate reconstruction of the corresponding stimulus feature: the population contains much information about $x_i$. A large ratio, on the other hand, indicates that the population response is unspecific with respect to feature $x_i$.

As an example, consider the different pathways of signal processing that have been suggested for the visual system (Livingstone & Hubel, 1988). The model states that information about form, color, movement, and depth is in part processed separately in the visual cortex. Based on the physiological properties of visual cortical neurons, equation 3.17 yields a quantitative assessment of the specificity of their encoding with respect to the above-mentioned properties; our method thus provides a test criterion for the validity of the pathway model.
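In practice, equation 3.17 reduces to a one-line computation on measured tuning widths. A sketch with invented widths standing in for measured tuning curves (they are not data from any experiment):

```python
# Relative encoding errors from measured tuning widths, as in equation 3.17:
# eps^2_i,min / sum_j eps^2_j,min = sigma_i^2 / sum_j sigma_j^2 over the d
# accessible dimensions.
def relative_errors(sigmas):
    total = sum(s ** 2 for s in sigmas)
    return [s ** 2 / total for s in sigmas]

# a cell narrowly tuned for feature 1 and broadly tuned for feature 2:
ratios = relative_errors([0.1, 0.4])
print(ratios)   # feature 1 carries the far smaller relative error
```

The smaller ratio identifies the dimension the population encodes most specifically, which is the proposed test criterion.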
4 Conclusion
We have calculated the Fisher information for a population of stochastically spiking neurons encoding a multitude of stimulus features. If a single feature is to be encoded accurately, narrow tuning in this dimension and broad tuning in all other dimensions is found to be the optimal encoding strategy. The narrow tuning increases the information content of individual neurons for the feature to be encoded, while the broad tuning in the other dimensions increases the population of neurons that actively take part in the representation. However, extremely narrow tuning functions impair performance because the tuning functions must always have sufficient overlap. For an accurate representation of a single feature, there is an optimal tuning width no matter how many stimulus dimensions are encoded in addition.

Our results are suitable for applications to sensory or motor systems. Equation 3.17 provides a criterion that can be used to derive quantitative statements on the functional significance of neuron classes based on the measurement of tuning curves.
Acknowledgments
We thank Klaus Pawelzik, Matthias Bethge, and Helmut Schwegler for fruitful discussions. This work was supported by Deutsche Forschungsgemeinschaft, SFB 517.
Multidimensional Encoding Strategy of Spiking Neurons
References

Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Deco, G., & Obradovic, D. (1997). An information-theoretic approach to neural computing (2nd ed.). New York: Springer-Verlag.
Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76, 357–363.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proceedings of the Institute of Radio Engineers (New York), 47, 1940–1951.
Livingstone, M. S., & Hubel, D. H. (1988). Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science, 240, 740–749.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–539.
Snippe, H. P., & Koenderink, J. J. (1992). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66, 543–551.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received April 13, 1999; accepted August 13, 1999.
LETTER
Communicated by Laurence Abbott
Synergy in a Neural Code

Naama Brenner
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Steven P. Strong
Institute for Advanced Study, Princeton, NJ 08544, and NEC Research Institute, Princeton, NJ 08540, U.S.A.

Roland Koberle
Instituto di Física de São Carlos, Universidade de São Paulo, 13560-970 São Carlos, SP, Brasil, and NEC Research Institute, Princeton, NJ 08540, U.S.A.

William Bialek
Rob R. de Ruyter van Steveninck
NEC Research Institute, Princeton, NJ 08540, U.S.A.

We show that the information carried by compound events in neural spike trains (patterns of spikes across time or across a population of cells) can be measured, independent of assumptions about what these patterns might represent. By comparing the information carried by a compound pattern with the information carried independently by its parts, we directly measure the synergy among these parts. We illustrate the use of these methods by applying them to experiments on the motion-sensitive neuron H1 of the fly's visual system, where we confirm that two spikes close together in time carry far more than twice the information carried by a single spike. We analyze the sources of this synergy and provide evidence that pairs of spikes close together in time may be especially important patterns in the code of H1.

1 Introduction
Neural Computation 12, 1531–1552 (2000). © 2000 Massachusetts Institute of Technology.

Throughout the nervous system, information is encoded in sequences of identical action potentials or spikes. The representation of sense data by these spike trains has been studied for 70 years (Adrian, 1928), but there remain many open questions about the structure of this code. A full understanding of the code requires that we identify its elementary symbols and characterize the messages that these symbols represent. Many different possible elementary symbols have been considered, implicitly or explicitly, in previous work on neural coding. These include the numbers of spikes in time windows of fixed size and individual spikes themselves. In cells that
produce bursts of action potentials, these bursts might be special symbols that convey information in addition to that carried by single spikes. Yet another possibility is that patterns of spikes, across time in one cell or across a population of cells, can have a special significance, a possibility that has received renewed attention as techniques emerge for recording the activity of many neurons simultaneously. In many methods of analysis, questions about the symbolic structure of the code are mixed with questions about what the symbols represent. Thus, in trying to characterize the feature selectivity of neurons, one often makes the a priori choice to measure the neural response as the spike count or rate in a fixed window. Conversely, in trying to assess the significance of synchronous spikes from pairs of neurons, or bursts of spikes from a single neuron, one might search for a correlation between these events and some particular stimulus features. In each case, conclusions about one aspect of the code are limited by assumptions about another aspect. Here we show that questions about the symbolic structure of the neural code can be separated out and answered in an information-theoretic framework, using data from suitably designed experiments. This framework allows us to address directly the significance of spike patterns or other compound spiking events. How much information is carried by a compound event? Is there redundancy or synergy among the individual spikes? Are particular patterns of spikes especially informative? Methods to assess the significance of spike patterns in the neural code share a common intuitive basis: patterns of spikes can play a role in representing stimuli if and only if the occurrence of patterns is linked to stimulus variations.
The patterns have a special role only if this correlation between sensory signals and patterns is not decomposable into separate correlations between the signals and the pieces of the pattern, such as the individual spikes. We believe that these statements are not controversial. Difficulties arise when we try to quantify this intuitive picture: What is the correct measure of correlation? How much correlation is significant? Can we make statements independent of models and elaborate null hypotheses? The central claim of this article is that many of these difficulties can be resolved using ideas from information theory. Shannon (1948) proved that entropy and information provide the only measures of variability and correlation that are consistent with simple and plausible requirements. Further, while it may be unclear how to interpret, for example, a 20% increase in correlation between spike trains, an extra bit of information carried by patterns of spikes means precisely that these patterns provide a factor-of-two increase in the ability of the system to distinguish among different sensory inputs. In this work, we show that there is a direct method
of measuring the information (in bits) carried by particular patterns of spikes under given stimulus conditions, independent of models for the stimulus features that these patterns might represent. In particular, we can compare the information conveyed by spike patterns with the information conveyed by the individual spikes that make up the pattern and determine quantitatively whether the whole is more or less than the sum of its parts. While this method allows us to compute unambiguously how much information is conveyed by the patterns, it does not tell us what particular message these patterns convey. Making the distinction between two issues, the symbolic structure of the code and the transformation between inputs and outputs, we address only the first of these two. Constructing a quantitative measure for the significance of compound patterns is an essential first step in understanding anything beyond the single-spike approximation, and it is especially crucial when complex multi-neuron codes are considered. Such a quantitative measure will be useful for the next stage of modeling the encoding algorithm, in particular as a control for the validity of models. This is a subject of ongoing work and will not be discussed here.

2 Information Conveyed by Compound Patterns
In the framework of information theory (Shannon, 1948), signals are generated by a source with a fixed probability distribution and encoded into messages by a channel. The coding is in general probabilistic, and the joint distribution of signals and coded messages determines all quantities of interest; in particular, the information transmitted by the channel about the source is an average over this joint distribution. In studying a sensory system, the signals generated by the source are the stimuli presented to the animal, and the messages in the communication channel are sequences of spikes in a neuron or a population of neurons. Both the stimuli and the spike trains are random variables, and they convey information mutually because they are correlated. The problem of quantifying this information has been discussed from several points of view (MacKay & McCulloch, 1952; Optican & Richmond, 1987; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). Here we address the question of how much information is carried by particular "events" or combinations of action potentials.

2.1 Background. A discrete event E in the spike train is defined as a specific combination of spikes. Examples are a single spike, a pair of spikes separated by a given time, spikes from two neurons that occur in synchrony, and so on. Information is carried by the occurrence of events at particular times and not others, implying that they are correlated with some stimulus features and not with others. Our task is to express the information
conveyed, on average, by an event E in terms of quantities that are easily measured experimentally.¹ In experiments, as in nature, the animal is exposed to stimuli at each instant of time. We can describe this sensory input by a function s(t′), which may have many components to parameterize the time dependence of multiple stimulus features. In general, the information gained about s(t′) by observing a set of neural responses is

    I = \sum_{\text{responses}} \int Ds(t')\, P[s(t')\,\&\,\text{response}]
        \log_2 \frac{P[s(t')\,\&\,\text{response}]}{P[s(t')]\, P[\text{response}]},   (2.1)
where information is measured in bits. It is useful to note that this mutual information can be rewritten in two complementary forms:

    I = -\int Ds(t')\, P[s(t')] \log_2 P[s(t')]
        + \sum_{\text{responses}} P[\text{response}] \int Ds(t')\, P[s(t')|\text{response}] \log_2 P[s(t')|\text{response}]
      = S[P(s)] - \langle S[P(s|\text{response})] \rangle_{\text{response}},   (2.2)

or

    I = -\sum_{\text{responses}} P(\text{response}) \log_2 P(\text{response})
        + \int Ds(t')\, P[s(t')] \sum_{\text{responses}} P[\text{response}|s(t')] \log_2 P[\text{response}|s(t')]
      = S[P(\text{response})] - \langle S[P(\text{response}|s)] \rangle_s,   (2.3)
where S denotes the entropy of a distribution; by ⟨···⟩_s we mean an average over all possible values of the sensory stimulus, weighted by their probabilities of occurrence, and similarly for ⟨···⟩_responses. In the first form, equation 2.2, we focus on what the responses are telling us about the sensory stimulus (de Ruyter van Steveninck & Bialek, 1988): different responses point more or less reliably to different signals, and our average uncertainty about the sensory signal is reduced by observing the neural response. In the second form, equation 2.3, we focus on the variability and reproducibility of the neural response (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong et al., 1998). The range of possible responses provides the system with a capacity to transmit information, and the variability that remains when the sensory stimulus is specified constitutes noise; the difference between the capacity and the noise is the information. We would like to apply this second form to the case where the neural response is a particular type of event. When we observe an event E, information is carried by the fact that it occurs at some particular time t_E. The range of possible responses is then the range of times 0 < t_E < T in our observation window. Alternatively, when we observe the response in a particular small time bin of size Δt, information is carried by the fact that the event E either occurs or does not. The range of possible responses then includes just two possibilities. Both of these points of view have an arbitrary element: the choice of bin size Δt and the window size T. Characterizing the properties of the system, as opposed to our observation power, requires taking the limit of high time resolution (Δt → 0) and long observation times (T → ∞). As will be shown in the next section, in this limit, the two points of view give the same answer for the information carried by an event.

¹ In our formulation, an event E is a random variable that can have several outcomes; for example, it can occur at different times in the experiment. The information conveyed by the event about the stimulus is an average over the joint distribution of stimuli and event outcomes. One can also associate an information measure with individual occurrences of events (DeWeese & Meister, 1998). The average of this measure over the possible outcomes is the mutual information, as defined by Shannon (1948) and used in this article.

2.2 Information and Event Rates. A crucial role is played by the event rate r_E(t), the probability per unit time that an event of type E occurs at time t, given the stimulus history s(t′). Empirical construction of the event rate r_E(t) requires repetition of the same stimulus history many times, so that a histogram can be formed (see Figure 1). For the case where events are single spikes, this is the familiar time-dependent firing rate or poststimulus time histogram (see Figure 1c); the generalization to other types of events is illustrated by Figures 1d and 1e.
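The equivalence of the two decompositions in equations 2.2 and 2.3 can be checked numerically on a toy discrete joint distribution. The helper names below are hypothetical, and the distribution is random rather than taken from any experiment.

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits of a discrete distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    # joint[i, j] = P[stimulus i & response j]; returns both forms of I.
    ps = joint.sum(axis=1)                  # P(s)
    pr = joint.sum(axis=0)                  # P(response)
    # equation 2.2: stimulus entropy minus mean conditional stimulus entropy
    i_22 = entropy(ps) - sum(pr[j] * entropy(joint[:, j] / pr[j])
                             for j in range(len(pr)) if pr[j] > 0)
    # equation 2.3: response entropy minus mean noise entropy
    i_23 = entropy(pr) - sum(ps[i] * entropy(joint[i, :] / ps[i])
                             for i in range(len(ps)) if ps[i] > 0)
    return i_22, i_23

rng = np.random.default_rng(0)
joint = rng.random((4, 3))
joint /= joint.sum()
i_22, i_23 = mutual_information(joint)
```

Both forms agree to machine precision, as the derivation requires.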
Intuitively, a uniform event rate implies that no information is transmitted, whereas the presence of sharply defined features in the event rate implies that much information is transmitted by these events (see, for example, the discussion by Vaadia et al., 1995). We now formalize this intuition and show how the average information carried by a single event is related quantitatively to the time-dependent event rate. Let us take the first point of view about the neural response variable, in which the range of responses is described by the possible arrival times t of the event.² What is the probability of finding the event at a particular time t? Before we know the stimulus history s(t′), all we can say is that the event can occur anywhere in our experimental window of size T, so that the probability is uniform, P(response) = P(t) = 1/T, with an entropy of S[P(t)] = log₂ T. Once we know the stimulus, we also know the event rate r_E(t), and so our uncertainty about the occurrence time of the event is reduced. Events will occur preferentially at times where the event rate is large, so the probability distribution should be proportional to r_E(t); with proper
² We assume that the experiment is long enough so that a typical event is certain to occur at some time.
Figure 1: Generalized event rates in the stimulus-conditional response ensemble. A time-dependent visual stimulus is shown to the fly (a), with the time axis defined to be zero at the beginning of the stimulus. This stimulus runs for 10 s and is repeatedly presented 360 times. The responses of the H1 neuron to 60 repetitions are shown as a raster (b), in which each dot represents a single spike. From these responses, time-dependent event rates r_E(t) are estimated: (c) the firing rate (poststimulus time histogram); (d) the rate for spike pairs with interspike time τ = 3 ± 1 ms; and (e) the rate for pairs with τ = 17 ± 1 ms. These rates allow us to compute directly the information transmitted by the events, using equation 2.5.
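The construction of these event rates from a raster can be sketched as follows. The helper names are hypothetical, and the convention of timing a pair by its second spike is an assumption made here for illustration, not something the text specifies.

```python
import numpy as np

def single_spike_rate(trials, t_max, bin_width):
    # PSTH (as in Figure 1c): histogram spike times across repeated trials.
    edges = np.linspace(0.0, t_max, int(round(t_max / bin_width)) + 1)
    counts = np.zeros(len(edges) - 1)
    for t in trials:
        counts += np.histogram(t, edges)[0]
    return counts / (len(trials) * bin_width)

def pair_event_rate(trials, t_max, bin_width, tau, tol):
    # Rate of the compound event "two spikes separated by tau +/- tol"
    # (as in Figures 1d and 1e), with no constraint on intervening spikes.
    # Assumed convention: each pair is timed by its second spike.
    edges = np.linspace(0.0, t_max, int(round(t_max / bin_width)) + 1)
    counts = np.zeros(len(edges) - 1)
    for t in trials:
        t = np.sort(np.asarray(t, dtype=float))
        d = t[:, None] - t[None, :]          # all pairwise separations
        ii, _ = np.nonzero(np.abs(d - tau) <= tol)
        counts += np.histogram(t[ii], edges)[0]
    return counts / (len(trials) * bin_width)

# Ten identical trials, each containing one spike pair separated by 3 ms.
trials = [[0.102, 0.105]] * 10
r = single_spike_rate(trials, 0.2, 0.01)
r_pair = pair_event_rate(trials, 0.2, 0.01, tau=0.003, tol=0.001)
```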
normalization, P(response|s) = P(t|s) = r_E(t)/(T r̄_E). Then the conditional entropy is

    S[P(t|s)] = -\int_0^T dt\, P(t|s) \log_2 P(t|s)
              = -\frac{1}{T} \int_0^T dt\, \frac{r_E(t)}{\bar{r}_E}
                \log_2\!\left( \frac{r_E(t)}{\bar{r}_E T} \right).   (2.4)
In principle one should average this quantity over the distribution of stimuli s(t′); however, if the time T is long enough and the stimulus history sufficiently rich, the ensemble average and the time average are equivalent. The validity of this assumption can be checked experimentally (see Figure 2c). The reduction in entropy is then the gain in information, so

    I(E; s) = S[P(t)] - S[P(t|s)]
            = \frac{1}{T} \int_0^T dt\, \frac{r_E(t)}{\bar{r}_E}
              \log_2\!\left( \frac{r_E(t)}{\bar{r}_E} \right),   (2.5)

or equivalently

    I(E; s) = \left\langle \frac{r_E(t)}{\bar{r}_E}
              \log_2\!\left( \frac{r_E(t)}{\bar{r}_E} \right) \right\rangle_s,   (2.6)
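A minimal numerical sketch of equation 2.5, assuming the event rate is sampled on a uniform time grid (so the time integral reduces to a sample mean); the rates used here are illustrative, not experimental values.

```python
import numpy as np

def info_per_event(rate):
    # Equation 2.5: (1/T) * integral dt (r_E/rbar) log2(r_E/rbar),
    # for a rate sampled on a uniform time grid.
    x = np.asarray(rate, dtype=float) / np.mean(rate)
    terms = np.zeros_like(x)
    nz = x > 0
    terms[nz] = x[nz] * np.log2(x[nz])
    return float(np.mean(terms))

# A uniform event rate conveys no information about the stimulus ...
flat = info_per_event(np.full(1000, 37.0))
# ... while a sharply modulated rate with the same mean conveys several bits.
peaky = info_per_event(np.concatenate([np.full(900, 1.0), np.full(100, 361.0)]))
```

This matches the intuition stated above: sharply defined features in the event rate, not the mean rate itself, carry the information.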
where the average is over all possible values of s, weighted by their probabilities P(s). In the second view, the neural response is a binary random variable, s_E ∈ {0, 1}, marking the occurrence or nonoccurrence of an event of type E in a small time bin of size Δt. Suppose, for simplicity, that the stimulus takes on a finite set of values s with probabilities P(s). These in turn induce the event E with probability p_E(s) = P(s_E = 1|s) = r_E(s)Δt, with an average probability for the occurrence of the event p̄_E = Σ_s P(s) r_E(s)Δt = r̄_E Δt. The information is the difference between the prior entropy and the conditional entropy: I(E; s) = S(s) − ⟨S(s|s_E)⟩, where the conditional entropy is an average over the two possible values of s_E. The conditional probabilities are found from Bayes' rule,

    P(s|s_E = 1) = \frac{P(s)\, p_E(s)}{\bar{p}_E}, \qquad
    P(s|s_E = 0) = \frac{P(s)\,(1 - p_E(s))}{1 - \bar{p}_E},   (2.7)
and with these one finds the information,

    I(E; s) = -\sum_s P(s) \log_2 P(s)
              + \sum_{s_E = 0,1} P(s_E) \sum_s P(s|s_E) \log_2 P(s|s_E)
            = \sum_s P(s) \left[ p_E(s) \log_2\!\left( \frac{p_E(s)}{\bar{p}_E} \right)
              + (1 - p_E(s)) \log_2\!\left( \frac{1 - p_E(s)}{1 - \bar{p}_E} \right) \right].   (2.8)
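Equation 2.8 can be checked against equation 2.6 numerically: for a very small bin Δt, the per-bin information of equation 2.8, divided by the mean event probability r̄_E Δt, approaches the per-event information of equation 2.6. The stimulus probabilities and rates below are invented for illustration.

```python
import numpy as np

def info_per_bin(P_s, r_E, dt):
    # Equation 2.8: information (bits) carried by occurrence or
    # non-occurrence of event E in one bin of size dt, discrete stimuli s.
    p = r_E * dt
    pbar = float(np.sum(P_s * p))
    term = (p * np.log2(p / pbar)
            + (1.0 - p) * np.log2((1.0 - p) / (1.0 - pbar)))
    return float(np.sum(P_s * term))

def info_per_event(P_s, r_E):
    # Equation 2.6: <(r_E/rbar) log2(r_E/rbar)>_s.
    x = r_E / float(np.sum(P_s * r_E))
    return float(np.sum(P_s * x * np.log2(x)))

# Illustrative values, not from the experiment.
P_s = np.array([0.25, 0.25, 0.5])
r_E = np.array([5.0, 20.0, 60.0])       # event rate (per second) per stimulus
dt = 1e-6                               # a very small bin

approx = info_per_bin(P_s, r_E, dt) / (float(np.sum(P_s * r_E)) * dt)
exact = info_per_event(P_s, r_E)
```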
Taking the limit Δt → 0, consistent with the requirement that the event can occur at most once, one finds the average information conveyed in a small time bin; dividing by the average probability of an event, one obtains equation 2.6 as the information per event. Equation 2.6 and its time-averaged form, equation 2.5, are exact formulas that are valid in any situation where a rich stimulus history can be presented repeatedly. They enable the evaluation of the information for arbitrarily complex events, independent of assumptions about the encoding algorithm. The information is an average over the joint distribution of stimuli and responses. But rather than constructing this joint distribution explicitly and then averaging, equation 2.5 expresses the information as a direct empirical average: by estimating the function r_E(t) as a histogram, we are sampling the distribution of the responses given a stimulus, whereas by integrating over time, we are sampling the distribution of stimuli. This formula is general and can be applied to different systems under different experimental conditions. The numerical result (measured in bits) will, of course, depend on the neural system as well as on the properties of the stimulus distribution. The error bars on the measurement of the information are affected by the finiteness of the data; for example, sampling must be sufficient to construct the rates r_E(t) reliably. These and other practical issues in using equation 2.5 are illustrated in detail in Figure 2.

2.3 The Special Case of Single Spikes. Let us consider in more detail the simple case where events are single spikes. The average information conveyed by a single spike becomes an integral over the time-dependent spike rate r(t),
    I(1\ \text{spike}; s) = \frac{1}{T} \int_0^T dt\, \frac{r(t)}{\bar{r}}
                            \log_2\!\left( \frac{r(t)}{\bar{r}} \right).   (2.9)
Figure 2: Facing page. Finite size effects in the estimation of the information conveyed by single spikes. (a) Information as a function of the bin size Δt used for computing the time-dependent rate r(t) from all 360 repetitions (circles) and from 100 of the repetitions (crosses). A linear extrapolation to the limit Δt → 0 is shown for the case where all repetitions were used (solid line). (b) Information as a function of the inverse number of repetitions N, for a fixed bin size Δt = 2 ms. (c) Systematic errors due to finite duration of the repeated stimulus s(t). The full 10-second length of the stimulus was subdivided into segments of duration T. Using equation 2.9, the information was calculated for each segment and plotted as a function of 1/T (circles). If the stimulus is sufficiently long that an ensemble average is well approximated by a time average, the convergence of the information to a stable value as T → ∞ should be observable. Further, the standard deviation of the information measured from different segments of length T (error bars) should decrease as a square root law, σ ∝ 1/√T (dashed line). These data provide an empirical verification that for the distribution used in this experiment, the repeated stimulus time is long enough to approximate ergodic sampling.

It makes sense that the information carried by single spikes should be related to the spike rate, since this rate as a function of time gives a complete description of the "one-body" statistics of the spike train, in the same way that the single particle density describes the one-body statistics of a gas or liquid. This is true independent of any models or assumptions, and we suggest the use of equations 2.9 and 2.5 to test directly in any given experimental situation whether many-body effects increase or reduce information transmission. We note that considerable attention has been given to the problem of making accurate statistical models for spike trains. In particular, the simplest model, the modulated Poisson process, is often used as a standard against which real neurons are compared. It is tempting to think that Poisson behavior is equivalent to independent transmission of information by successive spikes, but this is not the case (see below). Of course, no real neuron is precisely Poisson, and it is not clear which deviations from Poisson behavior are significant. Rather than trying to fit statistical models to the data and then trying to understand the significance of spike train statistics for neural coding, we see that it is possible to quantify directly the information carried by single spikes and compound patterns. Thus, without reference to statistical models, we can determine whether successive spikes carry independent, redundant, or synergistic information. Several previous works have noted the relation between spike rate and information, and the formula in equation 2.9 has an interesting history. For a spike train generated by a modulated Poisson process, equation 2.9 provides an upper bound on information transmission (per spike) by the spike train as a whole (Bialek, 1990). Even in this simple case, spikes do not necessarily carry independent information: slow modulations in the firing rate can cause them to be redundant.
In studying the coding of location by cells in the rat hippocampus, Skaggs, McNaughton, Gothard, and Markus (1993) assumed that successive spikes carried independent information and that the spike rate was determined by the instantaneous location. They obtained equation 2.9 with the time average replaced by an average over locations. DeWeese (1996) showed that the rate of information transmission by a spike train could be expanded in a series of integrals over correlation functions, where successive terms would be small if the number of spikes per correlation time were small; the leading term, which would be exact if spikes were uncorrelated, is equation 2.9. Panzeri, Biella, Rolls, Skaggs, and Treves (1996) show that equation 2.9, multiplied by the mean spike rate to give an information rate (bits per second), is the correct information rate if we count spikes in very brief segments of the neural response, which is equivalent to asking for the information carried by single spikes. For further discussion of the relation to previous work, see appendix A. A crucial point here is the generalization to equation 2.5, and this result applies to the information content of any point events in the neural response (pairs of spikes with a certain separation, coincident spikes from two cells, and so on), not just single spikes. In the analysis of experiments, we will emphasize the use of this formula as an exact result for the information content of single events, rather than an approximate result for the spike train as a whole, and this approach will enable us to address questions concerning the structure of the code and the role played by various types of events.

3 Experiments in the Fly Visual System
In this section, we use our formalism to analyze experiments on the movement-sensitive cell H1 in the visual system of the blowfly Calliphora vicina. We address the issue of the information conveyed by pairs of spikes in this neuron, as compared to the information conveyed independently by single spikes. The quantitative results of this section (information content in bits, effects of synergy, and redundancy among spikes) are specific to this system and to the stimulus conditions used. The theoretical approach, however, is valid generally and can be applied similarly to other experimental systems, to find out the significance of various patterns in single cells or in a population.

3.1 Synergy Between Spikes. In the experiment (see appendix B for details), the horizontal motion across the visual field is the input sensory stimulus s(t), drawn from a probability distribution P[s(t)]. The spike train recorded from H1 is the neural response. Figure 1a shows a segment of the stimulus presented to the fly, and Figure 1b illustrates the response to many repeated presentations of this segment. The histogram of spike times across the ensemble of repetitions provides an estimate of the spike rate r(t) (see Figure 1c), and equation 2.5 gives the information carried by a single spike, I(1 spike; s) = 1.53 ± 0.05 bits. Figure 2 illustrates the details of how the formula was used, with an emphasis on the effects of the finiteness of the data. In this experiment, a stimulus of length T = 10 sec was repeated 360 times. As seen from Figure 2, the calculation converges to a stable result within our finite data set. If each spike were to convey information independently, then with the mean spike rate r̄ = 37 spikes per second, the total information rate would be R_info = 56 bits per second.
We used the variability and reproducibility of continuous segments in the neural response (de Ruyter van Steveninck et al., 1997; Strong et al., 1998) to estimate the total information rate in the spike train in this experiment and found that R_info = 75 bits per second. Thus, the information conveyed by the spike train as a whole is larger than the sum of contributions from individual spikes, indicating cooperative information transmission by patterns of spikes in time. This synergy among spikes motivates the search for especially informative patterns in the spike train. We consider compound events that consist of two spikes separated by a time τ, with no constraints on what happens between them. Figure 1 shows segments of the event rate r_τ(t) for τ = 3 (±1) ms (see Figure 1d),
Figure 3: Information about the signal transmitted by pairs of spikes, computed from equation 2.5, as a function of the time separation between the two spikes. The dotted line shows the information that would be transmitted by the two spikes independently (twice the single-spike information).
and for τ = 17 (±1) ms (see Figure 1e). The information carried by spike pairs as a function of the interspike time τ, computed from equation 2.5, is shown in Figure 3. For large τ, spikes contribute independent information, as expected. This independence is established within approximately 30 to 40 ms, comparable to the behavioral response times of the fly (Land & Collett, 1974). There is a mild redundancy (about 10–20%) at intermediate separations and a very large synergy (up to about 130%) at small τ. Related results were obtained using the correlation of spike patterns with stimulus features (de Ruyter van Steveninck & Bialek, 1988). There the information carried by spike patterns was estimated from the distribution of stimuli given each pattern, thus constructing a statistical model of what the patterns "stand for" (see the details in appendix A). Since the time-dependent stimulus is in general of high dimensionality, its distribution cannot be sampled directly, and some approximations must be made. de Ruyter van Steveninck and Bialek (1988) made the approximation that patterns of a few spikes encode projections of the stimulus onto low-dimensional subspaces, and further that the conditional distributions P[s(t)|E] are gaussian. The information obtained with these approximations provides a lower bound to the true information carried by the patterns, as estimated directly with the methods presented here.
3.2 Origins of Synergy. Synergy means, quite literally, that two spikes together tell us more than two spikes separately. Synergistic coding is often discussed for populations of cells, where extra information is conveyed by patterns of coincident spikes from several neurons (Abeles, Bergmann, Margalit, & Vaadia, 1993; Hopfield, 1995; Meister, 1995; Singer & Gray, 1995). Here we see direct evidence for extra information in pairs of spikes across time. The mathematical framework for describing these effects is the same, and a natural question is: What are the conditions for synergistic coding? The average synergy Syn[E₁, E₂; s] between two events E₁ and E₂ is the difference between the information about the stimulus s conveyed by the pair and the information conveyed by the two events independently:

    \text{Syn}[E_1, E_2; s] = I[E_1, E_2; s] - \left( I[E_1; s] + I[E_2; s] \right).   (3.1)

We can rewrite the synergy as

    \text{Syn}[E_1, E_2; s] = I[E_1; E_2 | s] - I[E_1; E_2].   (3.2)
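The identity between equations 3.1 and 3.2 holds for any joint distribution of (s, E1, E2), which a quick numerical check confirms; the helper names and the random distribution below are illustrative only.

```python
import numpy as np

def H(p):
    # Shannon entropy in bits of a (possibly multi-dimensional) distribution.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi(joint):
    # Mutual information between the two axes of a 2-D joint distribution.
    joint = np.asarray(joint, dtype=float)
    return H(joint.sum(axis=0)) + H(joint.sum(axis=1)) - H(joint)

rng = np.random.default_rng(1)
p = rng.random((3, 2, 2))                  # joint over (s, E1, E2)
p /= p.sum()
ps = p.sum(axis=(1, 2))                    # P(s)

# Equation 3.1: pair information minus the two single-event informations.
syn_31 = mi(p.reshape(3, 4)) - (mi(p.sum(axis=2)) + mi(p.sum(axis=1)))

# Equation 3.2: I[E1;E2|s] - I[E1;E2].
i_cond = sum(ps[k] * mi(p[k] / ps[k]) for k in range(3))
syn_32 = i_cond - mi(p.sum(axis=0))
```

The two expressions agree exactly, confirming that synergy requires the mutual information among events to grow when the stimulus is specified.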
The first term is the mutual information between the events, computed across an ensemble of repeated presentations of the same stimulus history. It describes the gain in information due to the locking of the compound event (E₁, E₂) to particular stimulus features. If events E₁ and E₂ are correlated individually with the stimulus but not with one another, this term will be zero, and these events cannot be synergistic on average. The second term is the mutual information between events when the stimulus is not constrained or, equivalently, the predictability of event E₂ from E₁. This predictability limits the capacity of E₂ to carry information beyond that already conveyed by E₁. Synergistic coding (Syn > 0) thus requires that the mutual information among the spikes is increased by specifying the stimulus, which makes precise the intuitive idea of "stimulus-dependent correlations" (Abeles et al., 1993; Hopfield, 1995; Meister, 1995; Singer & Gray, 1995). Returning to our experimental example, we identify the events E₁ and E₂ as the arrivals of two spikes and consider the synergy as a function of the time τ between them. In terms of event rates, we compute the information carried by a pair of spikes separated by a time τ, equation 2.5, as well as the information carried by two individual spikes. The difference between these two quantities is the synergy between two spikes, which can be written as
\[
\mathrm{Syn}(\tau) = -\log_2\!\left( \frac{\bar r_\tau}{\bar r^{\,2}} \right)
+ \frac{1}{T} \int_0^T dt\, \frac{r_\tau(t)}{\bar r_\tau} \log_2\!\left[ \frac{r_\tau(t)}{r(t)\, r(t-\tau)} \right]
+ \frac{1}{T} \int_0^T dt \left[ \frac{r_\tau(t)}{\bar r_\tau} + \frac{r_\tau(t+\tau)}{\bar r_\tau} - \frac{2\, r(t)}{\bar r} \right] \log_2\!\left[ \frac{r(t)}{\bar r} \right]. \tag{3.3}
\]
The first term in this equation is the logarithm of the normalized correlation function, and hence measures the rarity of spike pairs with separation τ;
1544
Brenner, Strong, Koberle, Bialek, and de Ruyter van Steveninck
the average of this term over τ is the mutual information between events (the second term in equation 3.2). The second term is related to the local correlation function and measures the extent to which the stimulus modulates the likelihood of spike pairs. The average of this term over τ gives the mutual information conditional on knowledge of the stimulus (the first term in equation 3.2). The average of the third term over τ is zero, and numerical evaluation of this term from the data shows that it is negligible at most values of τ. We thus find that the synergy between spikes is approximately a sum of two terms, whose averages over τ are the terms in equation 3.2. A spike pair with a separation τ then has two types of contributions to the extra information it carries: the two spikes can be correlated conditional on the stimulus, or the pair could be a rare and thus surprising event. The rarity of brief pairs is related to neural refractoriness, but this effect alone is insufficient to enhance information transmission; the rare events must also be related reliably to the stimulus. In fact, conditional on the stimulus, the spikes in rare pairs are strongly correlated with each other, and this is visible in Figure 1a: from trial to trial, adjacent spikes jitter together as if connected by a stiff spring. To quantify this effect, we find for each spike in one trial the closest spike in successive trials, and measure the variance of the arrival times of these spikes. Similarly, we measure the variance of the interspike times. Figure 4a shows the ratio of the interspike time variance to the sum of the arrival time variances of the spikes that make up the pair. For large separations, this ratio is unity, as expected if spikes are locked independently to the stimulus, but as the two spikes come closer, it falls below one-quarter.
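The variance-ratio measurement just described can be sketched in a few lines. This is our own toy reconstruction, not the authors' analysis code: we generate hypothetical arrival times for a spike pair whose members share a common trial-to-trial jitter (the "stiff spring"), and compare the variance of the interspike time to the summed variances of the arrival times.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 2000

# Hypothetical arrival times (ms): a large jitter component shared by
# both spikes, plus a small independent jitter for each spike.
shared = rng.normal(0.0, 1.0, n_trials)
t1 = 10.0 + shared + rng.normal(0.0, 0.1, n_trials)
t2 = 13.0 + shared + rng.normal(0.0, 0.1, n_trials)

# The ratio plotted in Figure 4a: near 1 for independently locked
# spikes, well below 1 when the two spikes jitter together.
ratio = np.var(t2 - t1) / (np.var(t1) + np.var(t2))
```

With the shared jitter dominating, the interspike time is far more reproducible than either arrival time, so the ratio falls well below one-quarter, as in the data.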
Both the conditional correlation among the members of the pair (see Figure 4a) and the relative synergy (see Figure 4b) depend strongly on the interspike separation. This dependence is nearly invariant to changes in image contrast, although the spike rate and other statistical properties are strongly affected by such changes. Brief spike pairs seem to retain their identity as specially informative symbols over a range of input ensembles. If particular temporal patterns are especially informative, then we would lose information if we failed to distinguish among different patterns. Thus, there are two notions of time resolution for spike pairs: the time resolution with which the interspike time is defined and the absolute time resolution with which the event is marked. Figure 5 shows that for small interspike times, the information is much more sensitive to changes in the interspike time resolution (open symbols) than to the absolute time resolution (filled symbols). This is related to the slope in Figure 2: in regions where the slope is large, events should be finely distinguished in order to retain the information.

3.3 Implications of Synergy. The importance of spike timing in the neural code has been under debate for some time now. We believe that some issues in this debate can be clarified using a direct information-theoretic approach. Following MacKay and McCulloch (1952), we know that marking
Synergy in a Neural Code
1545
Figure 4: (a) Ratio between the variance of the interspike time and the sum of the variances of the two spike times. Variances are measured across repeated presentations of the same stimulus, as explained in the text. This ratio is plotted as a function of the interspike time τ for two experiments with different image contrast. (b) Extra information conveyed cooperatively by pairs of spikes, expressed as a fraction of the information conveyed by the two spikes independently. While the single-spike information varies with contrast (1.5 bits per spike for C = 0.1 compared to 1.3 bits per spike for C = 1), the fractional synergy is almost contrast independent.
spike arrival times with higher resolution provides an increased capacity for information transmission. The work of Strong et al. (1998) shows that for the fly's H1 neuron, the increased capacity associated with spike timing is indeed used with nearly constant efficiency down to millisecond resolution. This efficiency can be the result of a tight locking of individual spikes to a rapidly varying stimulus, and it could also be the result of temporal patterns providing information beyond rapid rate modulations. The analysis given here shows that for H1, pairs of spikes can provide much more information than two individual spikes, and information transmission is much more sensitive to the relative timing of spikes than to their absolute timing. The synergy is an inherent property of particular spike pairs, as it persists when averaged over all occurrences of the pair in the experiment. One may ask about the implications of synergy for behavior. Flies can respond to changes in a motion stimulus over a time scale of approximately 30 to 40 ms (Land & Collett, 1974). We have found that the spike train of H1 as a whole conveyed about 20 bits per second more than the information conveyed independently by single spikes (see Section 3.1). This implies that over a behavioral response time, synergistic effects provide about 0.8 bit of additional information, equivalent to almost a factor of two in resolving power for distinguishing different trajectories of motion across the visual field.

Figure 5: Information conveyed by spike pairs as a function of time resolution. An event (a pair of spikes) can be described by two times: the separation between spikes (relative time) and the occurrence time of the event with respect to the stimulus (absolute time). The information carried by the pair depends on the time resolution in these two dimensions, both specified by the bin size Δt along the abscissa. Open symbols are measurements of the information for a fixed 2 ms resolution of absolute time and a variable resolution Δt of relative time. Closed symbols correspond to a fixed 2 ms resolution of relative time and a variable resolution Δt of absolute time. For short intervals, the sensitivity to coarsening of the relative time resolution is much greater than to coarsening of the absolute time resolution. In contrast, sensitivity to relative and absolute time resolutions is the same for the longer, nonsynergistic interspike separations.
4 Summary
Information theory allows us to measure synergy and redundancy among spikes independent of the rules for translating between spikes and stimuli. In particular, this approach tests directly the idea that patterns of spikes are special events in the code, carrying more information than expected by adding the contributions from individual spikes. It is of practical importance that the formulas rely on low-order statistical measures of the neural response, and hence do not require enormous data sets to reach meaningful conclusions. The method is of general validity and is applicable to patterns of spikes across a population of neurons, as well as across time. Specific results (existence of synergy or redundancy, number of bits per spike, etc.) depend on the neural system as well as on the stimulus ensemble used in each experiment. In our experiments on the fly visual system, we found that an event composed of a pair of spikes can carry far more than the information carried independently by its parts. Two spikes that occur in rapid succession appear to be special symbols that have an integrity beyond the locking of individual spikes to the stimulus. This is analogous to the encoding of sounds in written English: each of the compound symbols th, sh, and ch stands for a sound that is not decomposable into the sounds represented by the constituent letters. For spike pairs to act effectively as special symbols, mechanisms for "reading" them must exist at subsequent levels of processing. Synaptic transmission is sensitive to interspike times in the 2–20 ms range (Magleby, 1987), and it is natural to suggest that synaptic mechanisms on this timescale play a role in such reading. Recent work on the mammalian visual system (Usrey, Reppas, & Reid, 1998) provides direct evidence that pairs of spikes close together in time can be especially efficient in driving postsynaptic neurons.
Appendix A: Relation to Previous Work
Patterns of spikes and their relation to sensory stimuli have been quantified in the past using correlation functions. The event rates that we have defined here, which are connected directly to the information carried by patterns of spikes through equation 2.5, are just properly normalized correlation functions. The event rate for pairs of spikes from two separate neurons is related to the joint poststimulus time histogram (PSTH) defined by Aertsen, Gerstein, Habib, and Palm (1989; see also Vaadia et al., 1995). Making this connection explicit is also an opportunity to see how the present formalism applies to events defined across two cells. Consider two cells, A and B, generating spikes at times {t_i^A} and {t_i^B}, respectively. It will be useful to think of the spike trains as sums of unit
impulses at the spike times,

\[
\rho_A(t) = \sum_i \delta(t - t_i^A), \tag{A.1}
\]
\[
\rho_B(t) = \sum_i \delta(t - t_i^B). \tag{A.2}
\]
Then the time-dependent spike rates for the two cells are

\[
r_A(t) = \langle \rho_A(t) \rangle_{\mathrm{trials}}, \tag{A.3}
\]
\[
r_B(t) = \langle \rho_B(t) \rangle_{\mathrm{trials}}, \tag{A.4}
\]
where ⟨···⟩_trials denotes an average over multiple trials in which the same time-dependent stimulus s(t′) is presented, or the same motor behavior is observed. These spike rates are the probabilities per unit time for the occurrence of a single spike in either cell A or cell B, also called the PSTH. We can define the probability per unit time for a spike in cell A to occur at time t and a spike in cell B to occur at time t′, and this will be the joint PSTH,

\[
\mathrm{JPSTH}_{AB}(t, t') = \langle \rho_A(t)\, \rho_B(t') \rangle_{\mathrm{trials}}. \tag{A.5}
\]
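In practice, the averages in equations A.3 through A.5 are estimated from binned spike trains. A minimal sketch (our own code; the bin width, firing probabilities, and array shapes are made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)
dt = 0.002                                  # s, bin width (assumed)
n_trials, n_bins = 100, 500

# Binary spike words (trials x time bins) for two, here independent, cells.
spk_A = (rng.random((n_trials, n_bins)) < 0.06).astype(float)
spk_B = (rng.random((n_trials, n_bins)) < 0.06).astype(float)

# Equations A.3 and A.4: trial-averaged rates (the PSTHs), in spikes/s.
r_A = spk_A.mean(axis=0) / dt
r_B = spk_B.mean(axis=0) / dt

# Equation A.5: the joint PSTH, a probability per unit time squared.
jpsth = np.einsum('ti,tj->ij', spk_A, spk_B) / (n_trials * dt * dt)
```

For the independent cells simulated here the joint PSTH simply fluctuates around the product of the mean rates; stimulus-locked correlations would appear as structure along its diagonal strips.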
Alternatively, we can consider an event E defined by a spike in cell A at time t and a spike in cell B at time t − τ, with the relative time τ measured with a precision of Δt. Then the rate of these events is

\[
r_E(t) = \int_{-\Delta t/2}^{\Delta t/2} dt'\, \mathrm{JPSTH}_{AB}(t, t - \tau + t') \tag{A.6}
\]
\[
r_E(t) \approx \Delta t\, \mathrm{JPSTH}_{AB}(t, t - \tau), \tag{A.7}
\]
where the last approximation is valid if our time resolution is sufficiently high. Applying our general formula for the information carried by single events, equation 2.5, the information carried by pairs of spikes from two cells can be written as an integral over diagonal "strips" of the JPSTH matrix,

\[
I(E; s) = \frac{1}{T} \int_0^T dt\, \frac{\mathrm{JPSTH}_{AB}(t, t-\tau)}{\bigl\langle \mathrm{JPSTH}_{AB}(t, t-\tau) \bigr\rangle_t} \log_2\!\left[ \frac{\mathrm{JPSTH}_{AB}(t, t-\tau)}{\bigl\langle \mathrm{JPSTH}_{AB}(t, t-\tau) \bigr\rangle_t} \right], \tag{A.8}
\]

where ⟨JPSTH_AB(t, t−τ)⟩_t is an average of the JPSTH over time; this average is equivalent to the standard correlation function between the two spike trains. The discussion by Vaadia et al. (1995) emphasizes that modulations of the JPSTH along the diagonal strips allow correlated firing events to convey information about sensory signals or behavioral states, and this information is quantified by equation A.8. The information carried by the individual cells is related to the corresponding integrals over spike rates, equation 2.9. The difference between the information conveyed by the compound spiking events E and the information conveyed by spikes in the two cells independently is precisely the synergy between the two cells at the given time lag τ. For τ = 0, it is the synergy, or extra information, conveyed by synchronous firing of the two cells.

We would like to connect this approach with previous work that focused on how events reduce our uncertainty about the stimulus (de Ruyter van Steveninck & Bialek, 1988). Before we observe the neural response, all we know is that stimuli are chosen from a distribution P[s(t′)]. When we observe an event E at time t_E, this should tell us something about the stimulus in the neighborhood of this time, and this knowledge is described by the conditional distribution P[s(t′) | t_E]. If we go back to the definition of the mutual information between responses and stimuli, we can write the average information conveyed by one event in terms of this conditional distribution,

\[
I(E; s) = \int \mathcal{D}s(t') \int dt_E\, P[s(t'), t_E] \log_2 \frac{P[s(t'), t_E]}{P[s(t')]\, P[t_E]}
        = \int dt_E\, P[t_E] \int \mathcal{D}s(t')\, P[s(t') \mid t_E] \log_2 \frac{P[s(t') \mid t_E]}{P[s(t')]}. \tag{A.9}
\]
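The strip integral of equation A.8 has the same form as the single-event information of equation 2.5: normalize the diagonal strip by its time average and integrate x log₂ x. As a toy numerical illustration (entirely our own; the modulated strip below is made up, not measured data):

```python
import numpy as np

# Hypothetical diagonal strip of a joint PSTH, JPSTH_AB(t, t - tau),
# sampled in 2 ms bins over a 1 s stimulus period.
t = np.arange(0.0, 1.0, 0.002)
strip = 50.0 * (1.0 + 0.8 * np.sin(2.0 * np.pi * 3.0 * t)) ** 2

# Equation A.8: normalize by the time average, then average x * log2(x).
x = strip / strip.mean()
info_pair = np.mean(x * np.log2(x))        # bits per compound event
```

Because the strip is strongly modulated in time, the compound event carries a positive amount of information; a flat strip would give exactly zero bits.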
If the system is stationary, then the coding should be invariant under time translations:

\[
P[s(t') \mid t_E] = P[s(t' + \Delta t') \mid t_E + \Delta t']. \tag{A.10}
\]
This invariance means that the integral over stimuli in equation A.9 is independent of the event arrival time t_E, so we can simplify our expression for the information carried by a single event,

\[
I(E; s) = \int dt_E\, P[t_E] \int \mathcal{D}s(t')\, P[s(t') \mid t_E] \log_2 \frac{P[s(t') \mid t_E]}{P[s(t')]}
        = \int \mathcal{D}s(t')\, P[s(t') \mid t_E] \log_2 \frac{P[s(t') \mid t_E]}{P[s(t')]}. \tag{A.11}
\]
This formula was used by de Ruyter van Steveninck and Bialek (1988). To connect with our work here, we express the information in equation A.11 as an average over the stimulus,

\[
I(E; s) = \left\langle \frac{P[s(t') \mid t_E]}{P[s(t')]} \log_2 \frac{P[s(t') \mid t_E]}{P[s(t')]} \right\rangle_s. \tag{A.12}
\]
Using Bayes' rule,

\[
\frac{P[s(t') \mid t_E]}{P[s(t')]} = \frac{P[t_E \mid s(t')]}{P[t_E]} = \frac{r_E(t_E)}{\bar r_E}, \tag{A.13}
\]
where the last term results from the distributions of event arrival times being proportional to the event rates, as defined above. Substituting this back into equation A.11, one finds equation 2.6.

Appendix B: Experimental Setup
In the experiment, we used a female blowfly, which was a first-generation offspring of a wild fly caught outside. The fly was put inside a plastic tube and immobilized with wax, with the head protruding out. The proboscis was left free so that the fly could be fed regularly with some sugar water. A small hole was cut in the back of the head, close to the midline on the right side. Through this hole, a tungsten electrode was advanced into the lobula plate. This area, which is several layers back from the compound eye, includes a group of large motion-detector neurons with wide receptive fields and strong direction selectivity. We recorded spikes extracellularly from one of these, the contralateral H1 neuron (Franceschini, Riehle, & le Nestour, 1989; Hausen & Egelhaaf, 1989). The electrode was positioned such that spikes from H1 could be discriminated reliably and converted into pulses by a simple threshold discriminator. The pulses were fed into a CED 1401 interface (Cambridge Electronic Design), which digitized the spikes at 10 μs resolution. To keep exact synchrony over the duration of the experiment, the spike timing clock was derived from the same internal CED 1401 clock that defined the frame times of the visual stimulus. The stimulus was a rigidly moving bar pattern, displayed on a Tektronix 608 high-brightness display. The radiance at average intensity Ī was about 180 mW/(m² · sr), which amounts to about 5 × 10⁴ effectively transduced photons per photoreceptor per second (Dubs, Laughlin, & Srinivasan, 1984). The bars were oriented vertically, with intensities chosen at random to be Ī(1 ± C), where C is the contrast. The distance between the fly and the screen was adjusted so that the angular subtense of a bar equaled the horizontal interommatidial angle in the stimulated part of the compound eye. This setting was found by determining the eye's spatial Nyquist frequency through the reverse reaction (Götz, 1964) in the response of the H1 neuron.
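The diffusion constant quoted below for the random-walk stimulus can be recovered from the step statistics; a quick check (our own arithmetic, assuming independent steps drawn uniformly from the stated range):

```python
# Independent displacements, uniform on [-0.47, +0.47] degrees, one per
# 2 ms frame: for a 1-D random walk <x^2> = 2 D t, so D = Var(step) / (2 dt).
half_range = 0.47                            # degrees
dt = 0.002                                   # s per frame
var_step = (2.0 * half_range) ** 2 / 12.0    # variance of a uniform step
D = var_step / (2.0 * dt)                    # degrees^2 per second
```

This gives D ≈ 18.4 (°)² per second, consistent with the 18.1 (°)² per second quoted in the text; the small discrepancy presumably reflects rounding of the step range.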
For this fly, the horizontal interommatidial angle was 1.45 degrees, and the distance to the screen 105 mm. The fly viewed the display through a round 80 mm diameter diaphragm, showing approximately 30 bars. From this we estimate the number of stimulated ommatidia in the eye's hexagonal raster to be about 612. Frames of the stimulus pattern were refreshed every 2 ms, and with each new frame, the pattern was displayed at a new position. This resulted in an apparent horizontal motion of the bar pattern, which is suitable to excite the H1 neuron. The pattern position was defined by a pseudorandom sequence, simulating independent random displacements at each time step, uniformly distributed between −0.47 degrees and +0.47 degrees (equivalent to −0.32 to +0.32 omm, horizontal ommatidial spacings). This corresponds to a diffusion constant of 18.1 (°)² per second, or 8.6 omm² per second. The sequence of pseudorandom numbers contained a repeating part and a nonrepeating part, each 10 seconds long, with the same statistical parameters. Thus, in each 20 second cycle, the fly saw a 10 second movie that it had seen 20 seconds before, followed by a 10 second movie that was generated independently.

Acknowledgments
We thank G. Lewen and A. Schweitzer for their help with the experiments and N. Tishby for many helpful discussions. Work at the IAS was supported in part by DOE grant DE–FG02–90ER40542, and work at the IFSC was supported by the Brazilian agencies FAPESP and CNPq.

References

Abeles, M., Bergmann, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1638.

Adrian, E. (1928). The basis of sensation. London: Christophers.

Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.

Bialek, W. (1990). Theoretical physics meets experimental neurobiology. In E. Jen (Ed.), 1989 lectures in complex systems (pp. 513–595). Menlo Park, CA: Addison-Wesley.

de Ruyter van Steveninck, R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. Lond. Ser. B, 234, 379.

de Ruyter van Steveninck, R. R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.

DeWeese, M. (1996). Optimization principles for the neural code. Network, 7, 325–331.

DeWeese, M., & Meister, M. (1998). How to measure information gained from one symbol. APS Abstracts.

Dubs, A., Laughlin, S., & Srinivasan, M. (1984). Single photon signals in fly photoreceptors and first order interneurones at behavioural threshold. J. Physiol., 317, 317–334.

Franceschini, N., Riehle, A., & le Nestour, A. (1989). Directionally selective motion detection by insect neurons. In D. Stavenga & R. Hardie (Eds.), Facets of vision (p. 360). Berlin: Springer-Verlag.

Götz, K. (1964). Optomotorische Untersuchung des visuellen Systems einiger Augenmutanten der Fruchtfliege Drosophila. Kybernetik, 2, 77–92.
Hausen, K., & Egelhaaf, M. (1989). Neural mechanisms of visual course control in insects. In D. Stavenga & R. Hardie (Eds.), Facets of vision (p. 391). Berlin: Springer-Verlag.

Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.

Land, M. F., & Collett, T. S. (1974). Chasing behavior of houseflies (Fannia canicularis): A description and analysis. J. Comp. Physiol., 89, 331–357.

MacKay, D., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.

Magleby, K. (1987). Short-term synaptic plasticity. In G. M. Edelman, V. E. Gall, & K. M. Cowan (Eds.), Synaptic function (pp. 21–56). New York: Wiley.

Meister, M. (1995). Multineuronal codes in retinal signaling. Proc. Nat. Acad. Sci. (USA), 93, 609–614.

Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III: Information theoretic analysis. J. Neurophys., 57, 162–178.

Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656.

Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.

Skaggs, W. E., McNaughton, B. L., Gothard, K. M., & Markus, E. J. (1993). An information-theoretic approach to deciphering the hippocampal code. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains.
Phys. Rev. Lett., 80, 197–200.

Usrey, W. M., Reppas, J. B., & Reid, R. C. (1998). Paired-spike interactions and synaptic efficiency of retinal inputs to the thalamus. Nature, 395, 384.

Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neural interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.

Received February 22, 1999; accepted May 20, 1999.
LETTER
Communicated by Alain Destexhe
Increased Synchrony with Increase of a Low-Threshold Calcium Conductance in a Model Thalamic Network: A Phase-Shift Mechanism

Elizabeth Thomas
Thierry Grisar
Université de Liège, Institut Léon Fredericq, Department of Neurochemistry, Liège, Belgium

A computer model of a thalamic network was used to examine the effects of an isolated augmentation of a low-threshold calcium current. Such an isolated augmentation has been observed in the reticular thalamic (RE) nucleus of the genetic absence epilepsy rat from Strasbourg (GAERS) model of absence epilepsy. An augmentation of the low-threshold calcium conductance in the RE neurons (g_Ts) of the model thalamic network was found to lead to an increase in the synchronized firing of the network. This supports the hypothesis that the isolated increase in g_Ts may be responsible for epileptic activity in the GAERS rat. The increase of g_Ts in the RE neurons led to a slight increase in the period of the isolated RE neuron firing. In contrast, the low-threshold spike of the RE neuron remained relatively unchanged by the increase of g_Ts. This suggests that the enhanced synchrony in the network was primarily due to a phase shift in the firing of the RE neurons with respect to the thalamocortical neurons. The ability of this phase-shift mechanism to lead to changes in synchrony was further examined using the model thalamic network. A similar increase in the period of RE neuron oscillations was obtained through an increase in the conductance of the calcium-mediated potassium channel. This change was once again found to increase synchronous firing in the network.

1 Introduction
Absence epilepsy is a generalized epilepsy characterized by the presence of bilateral and synchronous spike-and-wave discharges in the electroencephalogram of the patient. Studies of the disease have been greatly facilitated by the development of animal models of absence epilepsy. One such successful model has been the genetic absence epilepsy rat of Strasbourg (GAERS). Neurophysiological, behavioral, pharmacological, and genetic studies have demonstrated that GAERS is a good animal model for the study of human absence epilepsy (Marescaux, Vergnes, & Depaulis, 1992).

Neural Computation 12, 1553–1571 (2000) © 2000 Massachusetts Institute of Technology
1554
Elizabeth Thomas and Thierry Grisar
In vivo studies have revealed the presence of synchronized oscillations in the thalamus during paroxysmal activity in the cortex (Buzsáki, 1991; Steriade & Contreras, 1995). Many attempts to understand epilepsy have therefore focused on the generation of synchronous, rhythmic activity in the thalamus (von Krosigk, Ball, & McCormick, 1993; Huguenard, Coulter, & Prince, 1990; Crunelli, Turner, Williams, Guyon, & Leresche, 1994; Steriade & Llinas, 1988). Investigations of the generation of rhythmic activity within the thalamus have revealed that the relay nuclei and the reticular thalamic nucleus are involved in the activity (Steriade, Jones, & Llinas, 1990; von Krosigk et al., 1993). Of particular importance in the generation of rhythmic activity in the thalamus is the presence of a low-threshold calcium current that allows the thalamic neurons to fire a low-threshold calcium spike (LTS). This spike is characterized by a slowly rising and falling triangle-like potential crowned by one to several fast spikes (Jahnsen & Llinas, 1984; Crunelli, Lightowler, & Pollard, 1989; Huguenard & McCormick, 1992). The availability of the GAERS model of epilepsy allows for the comparison of neuronal membrane properties between the epileptic and nonepileptic rats. A study by Tsakiridou, Bertollini, de Curtis, Avanzini, and Pape (1995), comparing the voltage-dependent calcium channels of the two rats, found that the amplitude of the low-threshold calcium current of the GAERS rat was higher. The whole-cell patch clamp analysis revealed that the amplitude of the low-threshold calcium current in the reticular thalamic (RE) neurons (I_Ts) of the GAERS rat was −198 ± 19 pA, as opposed to −128 ± 14 pA for the nonepileptic rat. This augmentation was found to be isolated to particular areas of the thalamus. While the I_Ts current was found augmented in the RE nucleus, the low-threshold current in the thalamocortical (TC) neurons of the two rats was found not to be significantly different.
No significant differences were found in the high-threshold Ca²⁺ current, I_L, between the nonepileptic and GAERS rat. More detailed tests, conducted to understand the reason for the increased amplitude of I_Ts in the GAERS rat, revealed no significant differences in the activation and inactivation curves for I_Ts between the normal and epileptic rat. The kinetics of the current for the two rats was also similar. The augmentation of the I_Ts current therefore reflected an increase in the number of T-type Ca²⁺ channels or an increase in single-channel conductance. The purpose of our project was to investigate the effects of an isolated increase of I_Ts with the use of a computer model of a thalamic network. This was done by examining the effects of increasing the maximum conductance of the low-threshold calcium current in the RE neurons (ḡ_Ts) of the model. The transition we observed to increased synchronous firing with increased ḡ_Ts suggests that this may be responsible for the epileptic activity of the GAERS rat. The second part of the investigation focused on the mechanism by which the increase in ḡ_Ts caused the transition in network activity. Much of this was done by examining the effects of an increase in ḡ_Ts on isolated RE neurons.
Thalamic Synchrony via Phase Shifts
1555
The low-threshold spike of the RE neurons was found not to be significantly altered by the increase. Instead, the frequency of the RE neuron oscillations was found to decrease slightly with each increase in ḡ_Ts. There was now a phase lag in the points at which the RE neurons fired an LTS. In other computational studies of thalamic networks, enhanced synchrony mediated by a decrease in network GABA_A has been accompanied by changes in the LTS of the thalamic neurons and large changes in frequency (Golomb, Wang, & Rinzel, 1994; Destexhe, Bal, McCormick, & Sejnowski, 1996). In this study, however, the nature of the low-threshold spikes was relatively unchanged with the enhanced synchrony. The changes in frequency accompanying the enhanced synchrony were slight, but led to changes in the phase relationship between the network RE and TC neurons. This strongly suggests that the phase relationship between the RE and TC neurons is an important control parameter in the network synchrony. These studies may therefore contain some insights for understanding experimental investigations in which changes in network synchrony were accompanied by only small changes in the LTS and frequency of the constituent neurons (Huguenard & Prince, 1994). The ability of this mechanism to bring about other enhancements in synchrony was tested. An increase in the calcium-mediated potassium conductance (ḡ_K[Ca]) also led to a similar increase in the period of RE neuron firing. An increase in synchronous firing was once again observed.
In particular, Lytton and Thomas (1997) review Hodgkin-Huxley models constructed by numerous investigators in the field. These models have succeeded in reproducing various experimental results, among them intrinsic oscillations in isolated thalamic neurons, low-frequency synchronous oscillations in the thalamic network, and enhanced synchrony with a decrease in network GABA_A conductance. The success of these models in reproducing various experimental results encourages their use to gain further insight into the underlying ionic channel and single-cell activities that finally lead to bifurcations in network activity.

2 Methods

2.1 Model Construction. The simulated network consisted of 13 RE neurons and 39 TC neurons in a columnar arrangement. Each column consisted of 1 RE neuron reciprocally connected to 3 TC neurons. Each RE neuron also received input from the RE and TC neurons in the column immediately adjacent to it, and the TC neurons received input from the RE neurons of the adjacent column.
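The columnar wiring just described can be written down explicitly. The sketch below is our own reconstruction of that connectivity rule (the tuple naming of neurons is our invention, not from the paper):

```python
# 13 columns, each with one RE neuron reciprocally connected to 3 TC
# neurons; each RE neuron also receives input from the RE and TC neurons
# of adjacent columns, and TC neurons receive input from adjacent RE
# neurons.
N_COL, TC_PER_COL = 13, 3
connections = []                     # directed (source, target) pairs

for c in range(N_COL):
    re = ("RE", c)
    for k in range(TC_PER_COL):      # reciprocal RE <-> TC within a column
        connections.append((re, ("TC", c, k)))
        connections.append((("TC", c, k), re))
    for n in (c - 1, c + 1):         # neighboring columns
        if 0 <= n < N_COL:
            connections.append((("RE", n), re))          # adjacent RE -> RE
            for k in range(TC_PER_COL):
                connections.append((("TC", n, k), re))   # adjacent TC -> RE
                connections.append((re, ("TC", n, k)))   # RE -> adjacent TC
```

Under this reading of the rule, the network has 78 within-column synapses plus 168 between adjacent columns, with no duplicated connections.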
Each neuron was represented by a single compartment with a 1000 μm² area. The physiology of the neurons was constructed using the Hodgkin-Huxley formalism. Most of the parameters and equations necessary to describe the ionic currents of the model were obtained from previous computer models of the thalamic network. The RE neurons were constructed using the following intrinsic currents: a low-threshold calcium current I_Ts, the calcium-mediated potassium current I_K[Ca], the leak current I_l, the fast sodium current I_Na, and the fast potassium current I_K. Inhibition for the RE neurons was mediated by the GABA_A synapse acted on by the neurotransmitter γ-aminobutyric acid (GABA). The excitatory current I_AMPA was due to the fast-acting α-amino-3-hydroxy-5-methyl-4-isoxazole-propionic acid (AMPA). The TC neurons were constructed using the low-threshold calcium current I_T, the hyperpolarization-activated cationic current I_h, the fast sodium current I_Na, the fast potassium current I_K, and a leak current I_l. The synaptic currents for the TC neurons were the two inhibitory currents I_GABA_A and I_GABA_B. The internal calcium concentration of each neuron was also computed. The equivalent circuit representation for the neuronal membrane comprises a capacitor and several resistors in parallel (Johnston & Wu, 1994). In accordance with this parallel conductance model for ionic currents in a membrane, the following equations describe the activity of the neurons:

\[
C_m \dot V_{RE} = -I_{Ts} - I_{K[Ca]} - I_{Na} - I_K - I_l - I_{GABA_A} - I_{GABA_B} - I_{AMPA} \tag{2.1}
\]
\[
C_m \dot V_{TC} = -I_T - I_h - I_{Na} - I_K - I_l - I_{GABA_A} - I_{GABA_B}. \tag{2.2}
\]

V_RE is the membrane potential of the RE neuron, and V_TC is the membrane potential of the TC neuron; C_m is the capacitance of the neuron and had a value of 1 μF/cm². The value of each intrinsic current I was obtained from Ohm's law,

\[
I = g\, \bar g\, (V - E_{eq}), \tag{2.3}
\]
where g is the conductance of the channel, ḡ is the maximum conductance of the channel, and E_eq is the reversal potential of the current. The magnitude of the current depends on the state of the gates controlling the channels. The opening of these gates is a voltage- and time-dependent process. With the exception of the fast sodium, fast potassium, and calcium-mediated potassium currents, it was not necessary to obtain the forward and reverse rate constants for the gates. Instead, the state of the gate at time infinity, gate_∞, and the gate time constant τ_gate were directly obtained from available experimental results. It was then possible to compute the state of the activation or inactivation gate via the following equation (Hodgkin & Huxley, 1952):

gate_t = gate_∞ − (gate_∞ − gate_{t−1}) exp[−Δt / τ_gate].   (2.4)
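For illustration, the basic update machinery of equations 2.1 through 2.4 can be written in a few lines of Python. This is a minimal sketch for the reader, not the authors' Visual C++ implementation; the function names and the forward-Euler membrane step are ours:

```python
import math

def ionic_current(gate_frac, g_max, v, e_rev):
    """Eq. 2.3: I = g * g_bar * (V - E_eq), with gate_frac the open fraction."""
    return gate_frac * g_max * (v - e_rev)

def update_gate(gate_prev, gate_inf, tau_gate, dt):
    """Eq. 2.4: exponential relaxation of a gating variable toward gate_inf."""
    return gate_inf - (gate_inf - gate_prev) * math.exp(-dt / tau_gate)

def step_membrane(v, currents, cm=1.0, dt=0.01):
    """Eqs. 2.1-2.2: forward-Euler step of Cm dV/dt = -(sum of currents)."""
    return v - dt * sum(currents) / cm
```

With only a leak-like current present, repeated calls to `step_membrane` relax the potential toward the current's reversal potential, and `update_gate` is exact for a constant gate_inf and tau_gate over one step.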
Thalamic Synchrony via Phase Shifts
1557
The gate controlling a channel could be an activation gate m or an inactivation gate h. The conductance of a channel g depended on the state of these gates,

g = m^N h,   (2.5)
where N is the number of activation gates. In the following sections we describe the intrinsic currents that were used to model the neurons. In some cases, the equations and parameters used to model the currents were the same for both the TC and the RE neurons. The neuron to which an equation applies is given in parentheses next to the name of the current.

2.2 Low-Threshold Calcium Current, I_T (TC Neuron). The presence of a low-threshold Ca²⁺ current responsible for the LTS has been reported in various investigations of thalamocortical cells (Coulter, Huguenard, & Prince, 1989; Jahnsen & Llinas, 1984; Huguenard & McCormick, 1992). Equations for the low-threshold current that were used in this model were obtained from voltage clamp experiments conducted by Huguenard and McCormick (1992) on acutely dissociated rat thalamic relay neurons:
I_T = ḡ_T m² h (V − E_Ca)   (2.6)
m_∞(V) = 1.0 / {1.0 + exp[−(V + 57.0)/6.2]}   (2.7)
τ_m(V) = 0.612 + 1.0 / {exp[−(V + 132.0)/16.7] + exp[(V + 16.8)/18.2]}   (2.8)
h_∞(V) = 1.0 / {1.0 + exp[(V + 81)/4.0]}   (2.9)
τ_h(V) = exp[(V + 467.0)/66.6],   V < −80 mV   (2.10)
τ_h(V) = 28.0 + exp[−(V + 22.0)/10.5],   V ≥ −80 mV.   (2.11)
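As a sketch, equations 2.6 through 2.11 translate directly into code; this is an illustrative Python reimplementation of the published fits, not the authors' simulation code:

```python
import math

def i_t_kinetics(v):
    """Eqs. 2.7-2.11: steady states and time constants (ms) for I_T
    at membrane potential v (mV)."""
    m_inf = 1.0 / (1.0 + math.exp(-(v + 57.0) / 6.2))
    tau_m = 0.612 + 1.0 / (math.exp(-(v + 132.0) / 16.7)
                           + math.exp((v + 16.8) / 18.2))
    h_inf = 1.0 / (1.0 + math.exp((v + 81.0) / 4.0))
    if v < -80.0:                       # piecewise tau_h, Eqs. 2.10-2.11
        tau_h = math.exp((v + 467.0) / 66.6)
    else:
        tau_h = 28.0 + math.exp(-(v + 22.0) / 10.5)
    return m_inf, tau_m, h_inf, tau_h

def i_t(v, m, h, g_max=1.75, e_ca=120.0):
    """Eq. 2.6: I_T = g_max * m^2 * h * (V - E_Ca), in (mS/cm^2) * mV."""
    return g_max * m * m * h * (v - e_ca)
```

Note that m_∞ passes through 0.5 at the half-activation voltage −57 mV and h_∞ at −81 mV, which is a quick sanity check on the fits.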
The reversal potential for I_T, E_Ca, was fixed at 120 mV, as had been done in the simulation by Golomb et al. (1994). The maximum I_T conductance was ḡ_T = 1.75 mS/cm².

2.3 Low-Threshold Calcium Current, I_Ts (RE Neuron). The RE neurons have also been observed to undergo burst activity during rhythmic activity (Contreras, Curro Dossi, & Steriade, 1993; Avanzini et al., 1992). The low-threshold calcium current that mediates this burst has been described from voltage clamp experiments by Huguenard and Prince (1992). The equations that describe these currents were obtained from Destexhe, Contreras, Sejnowski, and Steriade (1994). Unlike that work, the reversal potential of Ca²⁺ was not computed; it was fixed at 120 mV, as had been done in the thalamic model of Golomb et al. (1994):
I_Ts = ḡ_Ts m² h (V − E_Ca)   (2.12)
m_∞(V) = 1.0 / {1.0 + exp[−(V + 52.0)/7.4]}   (2.13)
τ_m(V) = 0.44 + 0.15 / {exp[(V + 27)/10] + exp[−(V + 102)/15]}   (2.14)
h_∞(V) = 1.0 / {1.0 + exp[(V + 80)/5.0]}   (2.15)
τ_h(V) = 22.7 + 0.27 / {exp[(V + 48)/4] + exp[−(V + 407)/50]}.   (2.16)
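Since ḡ_Ts is the conductance that is varied in the simulations below, a sketch of equations 2.12 through 2.16 with ḡ_Ts left as a free parameter may be useful. As before, this is our illustrative Python, not the authors' code:

```python
import math

def i_ts_kinetics(v):
    """Eqs. 2.13-2.16: kinetics of the RE low-threshold Ca current I_Ts."""
    m_inf = 1.0 / (1.0 + math.exp(-(v + 52.0) / 7.4))
    tau_m = 0.44 + 0.15 / (math.exp((v + 27.0) / 10.0)
                           + math.exp(-(v + 102.0) / 15.0))
    h_inf = 1.0 / (1.0 + math.exp((v + 80.0) / 5.0))
    tau_h = 22.7 + 0.27 / (math.exp((v + 48.0) / 4.0)
                           + math.exp(-(v + 407.0) / 50.0))
    return m_inf, tau_m, h_inf, tau_h

def i_ts(v, m, h, g_max, e_ca=120.0):
    """Eq. 2.12, with g_max (mS/cm^2) exposed: it is the swept parameter."""
    return g_max * m * m * h * (v - e_ca)
```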
Unlike the case for the low-threshold current of the TC neurons, ḡ_Ts was varied in order to simulate increases in I_Ts found in the RE nucleus. In the simulations involving changes of I_K[Ca], ḡ_Ts was fixed at 1.75 mS/cm².

2.4 Hyperpolarization-Activated Current I_h (TC Neuron). The critical role played by this current in the rhythmic activity of thalamic neurons has been observed by several investigators in the field. Soltesz et al. (1991) observed that the application of Cs⁺ could transform spindle oscillations in some thalamic neurons to pacemaker oscillations. In other thalamic neurons, they found that the same procedure eliminated the low-frequency pacemaker oscillations. McCormick and Pape (1990) found that the results from the application of Cs⁺ in thalamic slices were dose dependent. The application could lead to an increased number of spikes in a burst or to the elimination of the bursts and oscillations. The equations for this current were obtained from voltage clamp measurements made by Huguenard and McCormick (1992) on guinea pig thalamic slices:
I_h = ḡ_h m (V − E_h)   (2.17)
m_∞(V) = 1.0 / {1.0 + exp[(V + 75)/5.5]}   (2.18)
τ_m(V) = 1 / {1.25 (exp[−14.59 − 0.086V] + exp[−1.87 + 0.0701V])}.   (2.19)
The value of ḡ_h was 0.15 mS/cm².

2.5 Calcium-Mediated Potassium Current I_K[Ca] (RE Neurons). Equations for the Ca²⁺-mediated potassium channel came from Destexhe, Mainen, and Sejnowski (1994) and Wallenstein (1994):
I_K[Ca] = ḡ_K[Ca] m² (V − E_K)   (2.20)
m_∞([Ca²⁺]) = α[Ca²⁺]^n / {α[Ca²⁺]^n + β}   (2.21)
τ_m([Ca²⁺]) = 1 / {α[Ca²⁺]^n + β}.   (2.22)
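Equations 2.20 through 2.22 can be sketched as follows, using the parameter values (n, α, β, ḡ_K[Ca]) given in the next paragraph. Note that the text does not state E_K, so the −90 mV default below is a placeholder of ours, not the authors' value:

```python
def k_ca_kinetics(ca, n=2, alpha=48.0, beta=0.03):
    """Eqs. 2.21-2.22: m_inf and tau_m (ms) as functions of [Ca2+] (mM).
    Defaults n=2, alpha=48 /ms/mM^2, beta=0.03 /ms are from the text."""
    x = alpha * ca ** n
    return x / (x + beta), 1.0 / (x + beta)

def i_k_ca(v, m, g_max=10.0, e_k=-90.0):
    """Eq. 2.20: I_K[Ca] = g_max * m^2 * (V - E_K).
    e_k = -90 mV is an assumed placeholder (not given in the text)."""
    return g_max * m * m * (v - e_k)
```

Half-activation occurs where α[Ca²⁺]^n = β, i.e. at [Ca²⁺] = (β/α)^(1/n) = 0.025 mM with these parameters.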
The value for n was 2, α = 48 ms⁻¹ mM⁻², and β = 0.03 ms⁻¹. The value of ḡ_K[Ca], unless otherwise specified, was 10 mS/cm².

2.6 Leak Current (RE and TC Neuron). McCormick and Huguenard (1992) found the rhythmic activity of their model TC neurons to depend
critically on the leak currents. The following equation was used to model the leak I_l:

I_l = g_l (V − E_l).   (2.23)
The value of g_l was 0.01 mS/cm² for the TC neurons and 0.05 mS/cm² for the RE neurons.

2.7 Fast Na⁺ and K⁺ Currents: I_Na and I_K (RE and TC Neuron). The fast spikes riding on top of the depolarizing envelope during a low-threshold spike were found by Jahnsen and Llinas (1984) to be mediated by a fast Na⁺ current. Both a fast Na⁺ current and a delayed rectifier current were included in the model in order to reproduce this feature of the low-threshold spikes. The equations for the fast sodium current I_Na were obtained from a model by McCormick and Huguenard (1992). The equations for the delayed rectifier I_K were obtained from a thalamic model by Wallenstein (1994). The maximum conductance for the fast sodium channel ḡ_Na was 20 mS/cm² for the RE neurons, and a value of 30 mS/cm² was used for the TC neurons. In the case of ḡ_K, a value of 8 mS/cm² was used for both the RE and TC neurons.

2.8 Intrinsic Calcium Dynamics. The internal calcium dynamics and the values for the internal calcium concentration [Ca²⁺] were obtained from a model of the TC neuron constructed by McCormick and Huguenard (1992). The value of [Ca²⁺] was calculated for a depth of 100 nm in a single compartmental model of area 1000 µm². The following discretized equation was used:
[Ca²⁺]_t = [Ca²⁺]_{t−1} + Δt {[−5.18 × 10⁻³ (I_T or I_Ts)] / (area × depth) − β[Ca²⁺]_{t−1}}.   (2.24)

The constant −5.18 × 10⁻³ was used to convert current (in nanoamperes), time (in milliseconds), and volume (in cubic micrometers) to concentration of Ca²⁺ ions. The value of β used in the simulations was 1.

2.9 Synaptic Currents. The synaptic currents were modeled using a kinetic scheme of the binding of transmitter to postsynaptic receptor developed by Destexhe, Mainen, and Sejnowski (1994). According to this scheme, the fraction of postsynaptic receptors in the open state, r, can be obtained from the following equation:
dr/dt = α[T](1 − r) − βr,   (2.25)
where [T] is the concentration of neurotransmitter in the synapse and α and β are the forward and backward binding rates, respectively. Both the GABA_A and AMPA receptors were affected by neurotransmitters that were released in a pulse of 1 ms duration and 1 mM amplitude. For the GABA_B receptor, the neurotransmitter was released in a pulse of 85 ms duration and 1.0 µM amplitude. The following parameters were used for the GABA_A synapses: α = 0.53, β = 0.1, and reversal potential E_GABAA = −80 mV. For the GABA_B synapses, α = 0.016, β = 0.0047, and E_GABAB = −103 mV. The AMPA conductances were computed from the values α = 2.0, β = 0.3, and E_AMPA = 0 mV. The following maximum conductances were used for the synapses: on the RE neurons, ḡ_GABAA = 5 nS and ḡ_AMPA = 0.5 nS, and on the TC neurons, ḡ_GABAA = 5 nS and ḡ_GABAB = 0.01 nS.

2.10 Autocorrelation. The frequency of neuronal firing was computed with the use of an autocorrelation function. The membrane potential V of a neuron was measured at intervals of 0.1 ms. The autocorrelation function C(τ) was then calculated for each delay τ:
C(τ) = (1/N_τ) Σ_{n=1}^{N_τ} V(n) V(n + τ),
where N_τ is the number of bins available for the particular delay τ that is being used. The value of τ that gave a maximum for C(τ) was then taken as the period of the neuronal oscillation.

2.11 Rastergrams. Rastergrams were used to display the synchronized activity of the network (see Figures 1a–d). Each black dot on the rastergram marks a point where a neuron had a membrane potential above 0 mV. Time is on the x-axis. On the y-axis, the TC neurons are numbered 0 to 38, and the RE neurons are numbered 39 to 51.

2.12 Counting the Number of Cycles of Synchronized Activity. The number of cycles of synchronized activity was computed by constructing autocorrelograms from the discrete activity represented in the rastergrams. Each value of C(τ) greater than 200 was counted as one cycle of synchronized activity. (Note that one action potential can be represented by more than one dot, since neuronal activity was sampled every 0.1 ms.) Using this method, the number of cycles of synchronized activity in Figure 1a was 3, while the number in Figure 1d was 13. The simulation was developed using Microsoft Visual C++. The equations were integrated using a variable time-step Runge-Kutta method (Press, Teukolsky, Vetterling, & Flannery, 1992). The simulations were run on a Silicon Graphics IRIX machine as well as on a 233 MHz Pentium.
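The period-estimation procedure of section 2.10 can be sketched as follows; this is an illustrative Python reimplementation, not the authors' Visual C++ code, and the lag-search bounds are a parameter we introduce:

```python
def autocorr(v, lag):
    """Section 2.10: C(lag) = (1/N_lag) * sum_{n=1}^{N_lag} V(n) V(n+lag)."""
    n_lag = len(v) - lag
    return sum(v[i] * v[i + lag] for i in range(n_lag)) / n_lag

def oscillation_period(v, lag_min, lag_max):
    """Take the lag maximizing C(lag) within [lag_min, lag_max] as the
    oscillation period, in samples (0.1 ms per sample in the model)."""
    return max(range(lag_min, lag_max + 1), key=lambda lag: autocorr(v, lag))
```

For example, a periodic trace sampled every 0.1 ms with a 5 ms period yields a maximum of C at a lag of 50 samples.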
Figure 1: Increase in network synchrony as the value of ḡ_Ts is increased. The values of ḡ_Ts for panels A–D are: (A) 1.7 mS/cm², (B) 1.9 mS/cm², (C) 2.1 mS/cm², (D) 2.3 mS/cm².
Figure 2: Number of cycles of synchronous activity in the network for each value of ḡ_Ts. The number of cycles of synchronous activity was computed as described in section 2.
3 Results

3.1 Enhancement of Synchrony with Increased ḡ_Ts. We investigated the effects of an isolated increase in the low-threshold current of the RE neurons by progressively increasing the maximum conductance of the low-threshold calcium channel in the RE neurons (ḡ_Ts). The value of ḡ_Ts was progressively increased from 1.7 to 2.7 mS/cm². The initial stimulus to the network in each case was a pulse in the RE layer of the model. Figure 1 demonstrates the progressive increase in the number of cycles of synchronized activity by the TC neurons. Each rastergram displays the activity of the neuronal populations for 2000 ms. In Figure 1a, at the ḡ_Ts value of 1.7 mS/cm², the TC neurons fire approximately three cycles of synchronized activity. This number progressively increases until, at the ḡ_Ts value of 2.3 mS/cm², the TC neurons fire in a synchronized manner for every cycle of oscillatory activity (see Figure 1d). The number of cycles of synchronized activity for each value of ḡ_Ts can also be seen in Figure 2. The number of cycles of synchronized activity was computed as described in section 2. Although there was an overall increase in the cycles of synchronized activity as ḡ_Ts was increased, some increases of ḡ_Ts were accompanied by a slight decrease in network synchrony. In Figure 1a, the TC neurons fire approximately three cycles of synchronized activity and subsequently stop participating in the firing activity of the network. The TC neurons participate in the network firing in the time interval between 150 and 640 ms. An explanation for their failure to reach threshold following 640 ms may be found by examining the average period of the RE neuron population oscillations within this interval and comparing it to the times outside the interval. It was found that the average period of the RE neuron population firing was approximately 122 ms during the interval when the TC neurons were firing. Outside this interval, however, when the TC neuron oscillations were subthreshold, the RE neuron population was firing with an average period of approximately 67 ms. This reduced period in the oscillations of the RE neuron population probably led to the inhibition of the TC neurons before threshold was reached.

3.2 Single Unit Changes in RE Neurons with Increase in ḡ_Ts. We attempted to understand the mechanism by which the increase in ḡ_Ts led to enhanced synchrony in the network. This was done by studying the effects of the increased ḡ_Ts on the activity of the isolated RE neurons. Figure 3 displays the activity of one isolated RE neuron for each of the ḡ_Ts values in Figure 1. The changes affecting this RE neuron are representative of the changes to the other RE neurons of the network. The nature of the LTS in the isolated RE neuron remained essentially unchanged as ḡ_Ts was increased. The number of fast spikes riding on top of each depolarized hump remained one to two. Figure 3 displays the oscillatory activity of this RE neuron for the ḡ_Ts values 1.7, 1.9, 2.1, and 2.3 mS/cm². A very small increase in the amplitude of the LTS was observed with the alterations in ḡ_Ts. The maximum amplitude of the RE neuron LTS displayed in Figure 3a was 41.9 mV, and the maximum amplitude of the same RE neuron LTS in Figure 3d was 45.7 mV. More notable than the changes to the LTS were the changes in the period of the isolated RE neuron oscillations with each increase in ḡ_Ts. The number of cycles in the RE oscillations of Figure 3a was 17, while in Figure 3d the number had decreased to 14.
The period of the isolated RE neuron oscillations had therefore increased with each increase in ḡ_Ts. An autocorrelogram was constructed for the RE neuron oscillations in Figure 3 for each value of ḡ_Ts. The peaks of the autocorrelograms are displayed in Figure 4. The period of the RE oscillations increased from a value of 128.7 ms at a ḡ_Ts value of 1.7 mS/cm² to a period of 157.63 ms at a ḡ_Ts value of 2.7 mS/cm².

3.3 Altered Phasic Relationships with Increased ḡ_Ts. The change in the firing frequency of the isolated RE neurons also affected their phasic relationship with the TC neurons in the connected network. Figure 5 displays the change in the relationship of RE neuron to TC neuron firing as ḡ_Ts is changed. It displays the changing phasic relationship between the RE neuron in Figure 3 and a connected TC neuron. The time at which the TC neuron reached threshold in the third cycle of oscillations (TC phase) is displayed along with the time at which the RE neuron reached threshold in the fourth cycle of oscillations (RE phase). At the ḡ_Ts value of 1.7 mS/cm², the TC neuron reaches threshold for the third cycle of oscillations at 148.28 ms.
Figure 3: Activity of an isolated RE neuron for the ḡ_Ts values (A) 1.7 mS/cm², (B) 1.9 mS/cm², (C) 2.1 mS/cm², (D) 2.3 mS/cm².
Figure 4: Change in the period of oscillations of an isolated RE neuron as ḡ_Ts is increased. The period of the oscillations was computed as described in section 2.
Figure 5: Phase of TC and RE neuron firing. The TC neuron firing phase was measured at the point at which the TC neuron reached threshold in the third cycle of network activity (the TC neuron activity was subthreshold until then). The firing phase of a connected RE neuron was measured at the point where the RE neuron reached threshold in the fourth cycle of network activity. The phase difference is the difference in the timing of the two events described above.
The delay to this event had decreased to 143.66 ms at the ḡ_Ts value of 2.7 mS/cm². In contrast to this small change in timing, the RE neuron reaches threshold for its fourth cycle of oscillations at 253.11 ms for the ḡ_Ts value of 1.7 mS/cm², while for the ḡ_Ts value of 2.7 mS/cm², threshold is reached at
Figure 6: Number of cycles of synchronized activity (left) and the period of an isolated RE neuron oscillation (right) as the value of ḡ_K[Ca] is varied from 10 to 18 mS/cm².
332.53 ms. The difference in the effect of changes in ḡ_Ts on RE neurons as opposed to TC neurons is best demonstrated by the difference in the slopes of the firing phase versus ḡ_Ts curves for the two types of neurons. The slope of the RE phase versus ḡ_Ts curve is 52.75 s cm²/mS. On the other hand, the slope of the TC phase versus ḡ_Ts curve is much lower: −4.49 s cm²/mS. The increasing delay of RE neuron firing with the increase in ḡ_Ts, along with the smaller changes in TC neuron firing, led to an increase in the phase lag of RE neuron firing with respect to TC neuron firing. Also displayed in Figure 5 is the increasing phase lag of RE neuron firing with respect to the TC neuron firing as ḡ_Ts is increased (phase difference).

3.4 Enhanced Synchrony by Increased ḡ_K[Ca] of RE Neurons. The ability of this increased phase lag in the firing of the RE neurons to increase the synchrony of the network was also tested with alterations to the calcium-mediated potassium channels of the RE neurons. The value of this conductance was increased from 10 to 18 mS/cm² in the RE neurons of the model. Figure 6 displays the gradual increase in the period of an isolated RE neuron in the model as well as the accompanying increase in synchrony of the network. The number of cycles of synchronized activity in the network increased from three to nine as the value of ḡ_K[Ca] was increased from 10 to 18 mS/cm². The frequency of the population oscillations lay between 6 and 7 Hz. Although there was an overall increase in the number of cycles of synchronized activity, there was in some instances a slight decrease as ḡ_K[Ca] was increased. At the ḡ_K[Ca] value of 15 mS/cm², the cycles of synchronized activity dropped to 5 from a former value of 7 at the ḡ_K[Ca] value of 14 mS/cm². The period of oscillations for an isolated RE neuron of the network was measured for each increment in ḡ_K[Ca]. The period of RE neuron oscillations increased from 128.7 ms to 145.09 ms as ḡ_K[Ca] increased from 10 to 18 mS/cm².

4 Discussion
We found in our model of low-frequency oscillations in the thalamic network that an isolated increase of the low-threshold calcium conductance in the RE neurons (ḡ_Ts) was able to enhance synchrony in the network. The thalamus has been found to be an important area in the maintenance of synchronous, oscillatory firing during epilepsy. Electrophysiological recordings from the GAERS have demonstrated that the spike and wave discharges (SWD) in the cortex are accompanied by paroxysmal activity in the thalamus. In some cases, paroxysmal activity in the thalamus was found to precede that in the cortex (Marescaux et al., 1992). The enhanced synchrony found in the model thalamic network with increased ḡ_Ts provides support for the hypothesis that the increased I_Ts observed in the GAERS may be responsible for the initiation of epileptic activity in the rat. This enhanced synchrony in the thalamus may then be responsible for increased feedback from the cortex and the further synchronization that leads to SWD. The increased ḡ_Ts was not found to cause any notable changes in the number of fast spikes riding on top of the low-threshold spike of the isolated RE neuron (see Figure 3). The relatively unaltered timing at which the TC neuron reached threshold in Figure 5 is further evidence of an almost unchanged inhibitory postsynaptic potential (IPSP) in the TC neuron, and therefore of the unvaried nature of the LTS in the model RE neuron as ḡ_Ts is increased. The frequency of the RE neuron oscillations, however, undergoes a more notable decrease with the increase in ḡ_Ts. The ability of such a mechanism to enhance synchrony is further demonstrated by Figure 6. The increased ḡ_K[Ca] causes a lowering in the firing frequency of the RE neurons and an accompanying increase in network synchrony. A possible explanation for the increased synchrony in the network with increasing ḡ_Ts may lie in the increasing phase lag of the RE neuron firing (see Figure 5).
The slight delay in RE neuron firing ensures that a TC neuron is now able to fire. Much of the TC neuron activity that had previously been subthreshold is now able to reach threshold before inhibition by an RE neuron. The increase in the period of RE neuron oscillations does not increase synchrony in a monotonic fashion. Around the RE oscillation period of 140 ms, the increase in both ḡ_Ts and ḡ_K[Ca] led to a slight decrease in the number of cycles of synchronized oscillations. This is evidence that other factors play a role in the dynamics of the system. One possibility is a
slight increase in the amplitude of the RE neuron LTS (even if the number of fast spikes on the LTS remains much the same) and a concurrent increase in the TC neuron IPSP. This may then lead to a delay in the TC neuron's firing that once again subjects it to inhibition by a preceding RE neuron LTS. The overall increase in the number of cycles of synchronous activity with the increase in the period of RE neuron oscillations, however, establishes that the period is an important control parameter that can be increased in isolation to enhance the synchrony of the thalamic network. Another indication that the increasing period of the RE neurons is not the only factor determining the state of the system may come from comparing Figures 2 and 5. In Figure 2, an abrupt transition in the number of cycles of synchrony occurs at the ḡ_Ts value of 2.3 mS/cm². In Figure 5, however, an abrupt transition in the phase difference between the RE and TC neurons occurs at the ḡ_Ts value of 2.6 mS/cm². A more likely explanation for the difference in these transition points, however, is that the values in Figure 2 represent the state of the network over 2000 ms. Figure 5, on the other hand, examines the phase difference between the RE and TC neurons occurring within the first 350 ms of the simulation. (This had to be done because at some values of ḡ_Ts, the TC neurons do not reach threshold until very long after this.) An examination of Figure 1, and especially Figure 1a, shows that the early dynamics of the system (such as the firing interval between the RE neurons) are different from the later dynamics of the system. We found in the model that the RE neuron oscillation frequency decreased with the increase in ḡ_Ts. This may not seem plausible in light of the depolarizing nature of the I_Ts current. The I_Ts current, however, requires hyperpolarization for the deinactivation of the channel.
The higher depolarization achieved with a higher level of ḡ_Ts may finally lead to a lower level of deinactivation of the current. A more likely explanation for the decrease in the frequency of the RE neurons may be an increase in the conductance of the calcium-mediated potassium channels (g_K[Ca]) with the increase in ḡ_Ts. The increase in ḡ_Ts may lead to an increase in the internal calcium concentration, which may then lead to an increase in g_K[Ca] and a concomitant increase in the afterhyperpolarization of the RE neurons. In a model of the TC neuron constructed by McCormick and Huguenard (1992), an increase of I_T was found to increase the firing frequency of the model neurons. This was found to be the case in the TC neurons of our model as well. For the RE neurons of our model, the latency to the first LTS following the initial stimulus is decreased by the increase in ḡ_Ts, but the overall frequency of the oscillations is decreased. From experimental work, indications of the effects of changes in ḡ_Ts could come from studies on the effects of ethosuximide on thalamic slices (Huguenard & Prince, 1994). Huguenard and Prince (1994) found that the addition of ethosuximide to the rat thalamic slice increased the latency to the crowning fast spikes of the LTS for the RE neurons. Leresche et al. (1998) found that the use of ethosuximide increased the latency of the first action potential in the burst of a TC neuron.
They also found, however, that in some cases, the frequency of tonic firing in the TC neurons was increased by the addition of ethosuximide. The studies by Leresche et al. (1998), however, have cast doubts on the question of which channel ethosuximide actually exerts its effects on. It may also appear from both of these studies that an increase in the delay to peak of the LTS in the thalamic neurons should cause desynchrony rather than synchrony, since ethosuximide is an antiepileptic. In our study, however, the increased delay in the LTS occurred only in an isolated manner (only in the RE neurons) and would therefore have a different network effect than a delay for all the thalamic neurons. The enhanced synchrony in the model with increased ḡ_Ts is compatible with the results of a previous model that displayed a lowered threshold for SWD with an augmentation of the same conductance (Destexhe, 1998). Direct comparisons between this work and the previous investigation are difficult, however. The model by Destexhe (1998) included a cortical network. It was unclear whether the transition to SWD had taken place due to the effects of cortical feedback on the altered RE neurons, or whether an enhanced synchrony could be seen in the isolated thalamic network. The results from the model predict that mechanisms that lead to a phase lag in the firing of the RE neurons with respect to the TC neurons would lead to enhanced synchrony. The capacity of this mechanism to lead to an augmentation in synchrony was further demonstrated by the increasing period in the oscillations of the RE neurons with the increase in g_K[Ca] and the concomitant enhancement of network synchrony. An increasing phase lag between the firing of the RE neurons and TC neurons can also be achieved by a phase advance of the TC neurons. The results of the model therefore predict that, within certain limits, changes in the TC neuron that lead to a decrease in the delay to reach threshold would enhance synchrony in the network.
If the increasing phase lag between the firing of the TC and RE neurons leads to enhanced network synchrony, then drugs that cause the opposite effect (a decrease in the phase lag between RE and TC neuron oscillations) have the potential to bring about antiepileptic effects.

Acknowledgments
This work was partly supported by a grant from the National Research Foundation (FNRS) of Belgium.

References

Avanzini, G., de Curtis, M., Marescaux, C., Panzica, F., Spreafico, R., & Vergnes, M. (1992). Role of the thalamic reticular nucleus in the generation of rhythmic thalamo-cortical activities subserving spike and waves. J. Neural Transm., 35, 85–95.
Buzsáki, G. (1991). The thalamic clock: Emergent network properties. Neuroscience, 41, 351–364.
Contreras, D., Curro Dossi, R., & Steriade, M. (1993). Electrophysiological properties of cat reticular thalamic neurones in vivo. J. Physiol. (Lond.), 470, 273–294.
Coulter, D. A., Huguenard, J. R., & Prince, D. A. (1989). Calcium currents in rat thalamocortical relay neurons: Kinetic properties of the transient, low-threshold current. J. Physiol. (Lond.), 414, 587–604.
Crunelli, V., Lightowler, S., & Pollard, C. E. (1989). A T-type Ca²⁺ current underlies low-threshold Ca²⁺ potentials in cells of cat and rat thalamic geniculate nucleus. J. Physiol. (Lond.), 413, 543–561.
Crunelli, V., Turner, J. P., Williams, S. R., Guyon, A., & Leresche, N. (1994). Potential role of intrinsic and GABA_B IPSP-mediated oscillations of thalamic neurons in absence epilepsy. In A. Malafosse, P. Genton, E. Hirsch, C. Marescaux, D. Broglin, & R. Bernasconi (Eds.), Idiopathic generalized epilepsies: Clinical, experimental and genetic aspects (pp. 201–213). London: John Libbey & Company.
Destexhe, A. (1998). Spike-and-wave oscillations based on the properties of GABA_B receptors. J. Neurosci., 18, 9099–9111.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994). A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiol., 72, 803–818.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 1–4.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticularis nucleus model. J. Neurophysiol., 72, 1109–1126.
Hodgkin, A. L., & Huxley, A. F. (1952).
A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Huguenard, J. R., Coulter, D. A., & Prince, D. A. (1990). Physiology of thalamic relay neurons: Properties of calcium currents involved in burst-firing. In M. Avoli, P. Gloor, G. Kostopoulos, & R. Naquet (Eds.), Generalized epilepsy: Neurobiological approaches (pp. 181–189). Boston: Birkhäuser.
Huguenard, J. R., & McCormick, D. A. (1992). Simulation of the currents involved in rhythmic oscillations in thalamic relay neurons. J. Neurophysiol., 68, 1373–1383.
Huguenard, J. R., & Prince, D. A. (1992). A novel T-type current underlies prolonged Ca²⁺-dependent burst firing in GABAergic neurons of rat thalamic reticular neurons. J. Neurosci., 12, 3804–3817.
Huguenard, J. R., & Prince, D. A. (1994). Nominal T-current modulation causes robust antioscillatory effects. J. Neurosci., 14, 5485–5502.
Jahnsen, H., & Llinas, R. (1984). Ionic basis for the electroresponsiveness and
oscillatory properties of guinea-pig thalamic neurons in vitro. J. Physiol. (Lond.), 349, 227–247.
Johnston, D., & Wu, S. (1994). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Leresche, N., Parri, H. R., Erdemli, G., Guyon, A., Turner, J. P., Williams, S. R., Asprodini, E., & Crunelli, V. (1998). On the action of the anti-absence drug ethosuximide in the rat and cat thalamus. J. Neurosci., 18, 4842–4853.
Lytton, W., & Thomas, E. (1997). Modeling thalamocortical oscillations. In P. S. Ulinski & E. G. Jones (Eds.), Cerebral cortex. New York: Plenum Press.
Marescaux, C., Vergnes, M., & Depaulis, A. (1992). Genetic absence epilepsy rat from Strasbourg: A review. J. Neural Transm. [Suppl.], 35, 37–69.
McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. J. Neurophysiol., 68, 1384–1400.
McCormick, D. A., & Pape, H.-C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.
Patton, P., Thomas, E., & Wyatt, R. (1992). A computational model of vertical signal propagation in the primary visual cortex. Biol. Cybern., 68, 43–52.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in FORTRAN. Cambridge: Cambridge University Press.
Soltesz, I., Lightowler, S., Leresche, N., Jassik-Gerschenfeld, D., Pollard, C., & Crunelli, V. (1991). Two inward currents and the transformation of low-frequency oscillations of rat and cat thalamocortical cells. J. Physiol. (Lond.), 441, 175–197.
Steriade, M., & Contreras, D. (1995). Relations between cortical and thalamic cellular events during transition from sleep patterns to paroxysmal activity. J. Neurosci., 15, 623–642.
Steriade, M., Jones, E. G., & Llinas, R. R. (1990). Thalamic oscillations and signalling. New York: Wiley.
Steriade, M., & Llinas, R. R. (1988).
The functional states of the thalamus and the associated neuronal interplay. Physiological Review, 68, 649–742. Thomas, E., & Lytton, W. (1997). Computer model of antiepileptic effects mediated by alterations in GABAA mediated inhibition. NeuroReport, 9, 691–696. Thomas, E., Patton, P., & Wyatt, R. (1991).A computational model of the vertical anatomical organization of primary visual cortex. Biol. Cybern., 65, 189–202. Thomas, E., & Wyatt, R. (1995). A computational model of thalamocortical spindle oscillations. Mathematics and Computers in Simulation, 40, 35–69. Tsakiridou, E., Bertollini, L., de Curtis, M., Avanzini, G., & Pape, H. C. (1995).Selective increase in T-type calcium conductance of reticular thalamic neurons in a rat model of absence epilepsy. J. Neurosci., 15(4), 3110–3117. von Krosigk, M., Ball, T., & McCormick, D. A. (1993). Cellular mechanisms of a synchronized oscillation in the thalamus. Science, 261, 361–364. Wallenstein, G. V. (1994). A model of the electrophysiological properties of nucleus reticularis thalami neurons. Biophysical J., 66, 978–988. Received April 20, 1999; accepted August 26, 1999.
LETTER
Communicated by Roger Traub
Multispikes and Synchronization in a Large Neural Network with Temporal Delays

Jan Karbowski
Center for Biodynamics, Department of Mathematics, Boston University, Boston, MA 02215, U.S.A., and Center for Theoretical Physics, Polish Academy of Sciences, 02-668 Warsaw, Poland

Nancy Kopell
Center for Biodynamics, Department of Mathematics, Boston University, Boston, MA 02215, U.S.A.

Coherent rhythms in the gamma frequency range are ubiquitous in the nervous system and thought to be important in a variety of cognitive activities. Such rhythms are known to be able to synchronize with millisecond precision across distances with significant conduction delay; it is mysterious how this can operate in a setting in which cells receive many inputs over a range of times. Here we analyze a version of a previously proposed mechanism by which the synchronization in the CA1 region of the hippocampus depends on the firing of "doublets" by the interneurons. Using a network of local circuits that are arranged in a possibly disordered lattice, we determine the conditions on parameters for existence and stability of synchronous solutions in which the inhibitory interneurons fire single spikes, doublets, or triplets per cycle. We show that the synchronous solution is only marginally stable if the interneurons fire singlets. If they fire doublets, the synchronous state is asymptotically stable in a larger subset of parameter space than if they fire triplets. An unexpected finding is that a small amount of disorder in the lattice structure enlarges the parameter regime in which the doublet solution is stable. Synaptic noise reduces the regime in which the doublet configuration is stable, but only weakly.

1 Introduction
Coherent rhythms in the gamma range of frequencies (30–80 Hz) have been found in many parts of the cortex (Gray, Konig, Engel, & Singer, 1989; Llinas & Ribary, 1993; Bragin et al., 1995; Singer & Gray, 1995; Steriade, Amzica, & Contreras, 1996) and are hypothesized to be important in the creation of cell assemblies (Engel, Konig, & Singer, 1991; Singer & Gray, 1995; Vaadia et al., 1995; Stopfer, Bhagavan, Smith, & Laurent, 1997) and in cognitive function (Bressler, Coppola, & Nakamura, 1993; Joliot, Ribary, & Llinas, 1994; Fries, Roelfsema, Engel, Konig, & Singer, 1997).
Neural Computation 12, 1573–1606 (2000) © 2000 Massachusetts Institute of Technology
Coherent rhythms with no
phase lags among the participating cells have a potentially important role to play in plasticity, since synchronous activity is known to encourage the strengthening of mutual connections (Hebb, 1949; Ahissar et al., 1992; Bliss & Collingridge, 1993; Magee & Johnston, 1997; Markram, Lubke, Frotscher, & Sakmann, 1997); though synchronous behavior is possible without rhythms, it has been reported for the visual cortex that cells separated by more than 2 mm synchronize their activity only in the presence of rhythms (Konig, Engel, & Singer, 1995). In addition, coherent activity of a set of cells potentiates their effects downstream in the processing. Synchronization is known to exist between areas of the neocortex that are widely separated (Engel, Kreiter, Konig, & Singer, 1991; Bressler et al., 1993; Konig et al., 1995; Roelfsema, Engel, Konig, & Singer, 1997). In hippocampal slice preparations, synchronization of local field potentials and single-unit recordings in the gamma frequency range is observed, with areas up to several millimeters apart synchronized within a millisecond in spite of conduction delays that are many times that long (Andersen, Silfvenius, Sundberg, Sveen, & Wigstrom, 1978; Traub, Whittington, Stanford, & Jefferys, 1996).

This article takes up the question of how such synchronization over long distances can take place in a stable manner. For networks of spiking neurons, this question has been addressed in two closely related articles. In a pioneering article about the CA1 region of the hippocampus, Traub, Whittington, Stanford, and Jefferys (1996) used large-scale simulations of models containing excitatory pyramidal cells and inhibitory interneurons. They showed that synchronization was invariably accompanied by double spikes (or "doublets") per cycle in many of the interneurons. The role of the doublets in the synchronization at a pair of separated sites was deduced by Ermentrout and Kopell (1998).
In a much reduced setting, the latter showed that in order for the doublets to aid in the synchronization process, it was important that the first spike of the doublet be due to excitation from the local pyramidal cells and the second spike be due to excitation from the remote site. The delay between the two spikes of a doublet encoded not just the conduction and synaptic delays but also the longer time to spiking that a cell experiences when it is not fully recovered from a previous spike. Ermentrout and Kopell (1998) demonstrated that the timing between the spikes could be used by the system to adjust the timing of the pyramidal cells automatically on the next cycle, in such a way as to make the coherent rhythm stable. (Related articles not as directly relevant are referred to in section 5.) The hypothesis that the first spike be due to excitation leads to a phase lag between the local excitatory and inhibitory cells. The value of this lag plays no role in the analysis here or in Ermentrout and Kopell (1998) and can be arbitrarily small. Indeed, as in Whittington, Traub, and Jefferys (1995), the I-cells can be in a parameter regime in which they would fire by themselves in the absence of excitation; the only restriction is that they not fire before the E-cells, a situation that would produce suppression of the E-cells.
The work of Ermentrout and Kopell (1998) gave an answer to how the mechanism of Traub, Whittington, Stanford, and Jefferys (1996) could work if all the remote inputs to a local set of cells arrive with the same conduction delay. It left open the question of how the mechanism could work if a cell is to receive many inputs with a wide range of conduction delays. The role of doublets, in particular, is mysterious in such a larger network because there are now many sources of excitation, not just two. This article addresses that question and shows, surprisingly, that doublets still provide information crucial for synchronization, and that in large parameter regimes, bursts with more than two spikes cannot support stable synchronization.

The network investigated in this article is a one-dimensional array of N identical local circuits of excitatory and inhibitory cells that mimics the CA1 region of the hippocampus. There are two different architectures: a spatially ordered network and one with some disorder in the positions of the local circuits. Each local circuit receives input from a large number of others, with conduction delays proportional to the distances traveled; thus, the inputs to a local circuit come over a range of times. The connection strengths are scaled with the number of inputs, so it takes many inputs to create a response, unlike the network of Ermentrout and Kopell (1998), in which all distant inputs are lumped, and the strengths are such that a single signal from the distant circuit creates a response in the local one. The distributed network is potentially much more flexible in its response; depending on conduction times, strengths of synapses, and other parameters like the voltage threshold, the amount of disorder, and the range of connections, the number of spikes fired by an inhibitory cell in a cycle (defined by the periodic firing of the excitatory cells) can be one, two, or more, and there can be multiple solutions.
The actual outcome of the network therefore depends on the stability of the solutions. In order to perform the complicated analytical calculations associated with this extended network, we introduce a simple model of a neuron that mimics nonlinear behavior caused by ionic channels and is somewhat more complex than the standard "integrate-and-fire" neuron: a large input causes an immediate spike (and reset), and a smaller one that (with the help of previous small inputs) causes the voltage to pass some threshold eventually causes a spike, but possibly with a delay. This delay, which depends on the previous spiking history of the neuron as well as the timing of the inputs, contains the information used by the network to create the coherence. We also introduce a probability of inputs arriving at a given cell, depending on distance from that cell. With the explicit neuronal dynamics and a lattice-like (though possibly disordered) structure, we are able to carry out computations for the probabilistic behavior. The computations lead to explicit (though complicated) formulas for the time difference between spikes of the excitatory cells at different points of the lattice as a function of their differences at the previous cycle. This is exactly the information needed to decide if synchronous behavior is stable when the I-cells fire single spikes, doublets, or more spikes per cycle. That is, we do a linear stability analysis near configurations in which the spiking order and the idea of cycles are well defined. From the analysis, we can get bounds on the parameter regions in which one or more of the above synchronous solutions exists and is stable. We are able to see from this that doublets appear to provide more stability than either single spikes or a large number of spikes per cycle. Furthermore, we get the unintuitive result that a small amount of disorder in the lattice structure helps to stabilize the synchronous state. The article gives the central ideas and a broad outline of the computation; a more detailed outline is given in the appendixes. Finally, we analyze the effect of synaptic noise and find that it reduces the parameter range in which the doublet configuration has stable synchrony, although rather weakly.

The list of symbols used in this article is as follows:

R: the linear size of the network measured as a conduction time
N: the number of local circuits
L_{i,j}: distance in time units between the i and j circuits
v_l(t): membrane potential of the lth excitatory or inhibitory cell, specified by the context
v_th: threshold potential for firing
v_0: reset potential after a spike
k_l: number of inputs to the lth inhibitory cell after it fires its first spike until its voltage crosses v_l = 0
n_l: number of inputs to the lth inhibitory cell after its first spike until it fires the doublet spike (n_l > k_l)
a: strength of excitatory-to-inhibitory synaptic coupling between circuits
b: potential value of the inhibitory cell after receiving the k_l-th pulse
c: driving current in excitatory cells
g: amplitude of inhibition in excitatory cells
s: range of synaptic excitatory-to-inhibitory connections
f: measure of spatial disorder in positions of circuits
τ_m^i: membrane time constant for inhibitory cells
τ_s^i: inhibitory synaptic time constant between local inhibitory and excitatory cells
τ_m^e: membrane time constant for excitatory cells
τ_s^e: excitatory synaptic time constant between nonlocal excitatory and inhibitory cells
t_l^E: spiking time of the lth excitatory cell in a previous cycle
t̄_l^E: spiking time of the lth excitatory cell in a next cycle
W_l^I: delay time for firing of the lth inhibitory cell
W_l^E: delay time for firing of the lth excitatory cell
Figure 1: Network architecture. Local circuits, containing one excitatory (E) and one inhibitory (I) cell, are connected only via E→I synapses. These connections are probabilistic in nature. The distances between neighboring circuits are constant for the ordered network. For the disordered network, these distances are slightly distorted.
2 Network Model
Assumptions concerning our neural network model can be decomposed into two classes: assumptions about the network architecture and assumptions about the network dynamics.

2.1 Network Architecture. The network we analyze is a one-dimensional array of linear size R composed of N identical local circuits, each containing one excitatory (E) and one inhibitory (I) cell with synaptic connections from E to I and I to E (see Figure 1). For nonlocal connections, we consider only the synapses from excitatory to inhibitory cells. As in the smaller network analogue of this work (Ermentrout & Kopell, 1998), we neglect the E→E connections, which are sparse in the CA1 region of the hippocampus (Knowles & Schwartzkroin, 1981). The I→I local and nonlocal connections are treated as effectively decreasing the E→I connections, and hence are not introduced explicitly. One simplification from the connectivity in Ermentrout and Kopell (1998) is the neglect of nonlocal I→E connections, whose range in the CA1 area is smaller than that of the E→I connections (Tamamaki & Nojyo, 1990; Freund & Buzsaki, 1996). Ermentrout and Kopell (1998) showed that the nonlocal E→I connections sufficed to produce synchrony between a pair of distant circuits (though the added connection made this synchrony more robust); we therefore investigated whether such connections would also suffice in a much larger extended network. We treat the network architecture as probabilistic and assume that the probability P(L_ij) of a connection between the ith and jth circuits separated by distance L_ij depends on only the distance between them (translational invariance in a statistical sense). In what we describe as an ordered network, that distance is |i − j|; in what we shall call the disordered network, the distances are a perturbation of that quantity. In the calculations, only the times of arrival, not actual distances, are relevant, so what we call "distances"
will carry units of time. Finally, for computational reasons, we use periodic boundary conditions for our network architecture.

2.2 Network Dynamics. We denote by v_l(t) the membrane potential of the lth I-cell and the lth E-cell. Its dynamics is modeled by
τ_m^α dv_l/dt = |v_l| + I_l,   α = e, i,   (2.1)
where l = 1, …, N, and I_l represents the input current. The symbols τ_m^i and τ_m^e are effective membrane time constants for the I- and E-cells. These are much smaller than the passive membrane time constant determined by the capacitance and leak current, since there are many active conductances that dominate the leak conductance (see section 5 for estimates showing that for the part of the cycle relevant to the calculation, the effective time constant of the I-cell is a fraction of a msec). We add to the membrane dynamics represented by equation 2.1 the following property. When the potential crosses the threshold value v_th (> 0), a cell immediately fires, and its potential is reset to v_0 < 0, which mimics the effect of the AHP (afterhyperpolarization) current. We assume that if the I-cell is close to recovery, a single E-spike elicits a spike from the I-cell. Thus, in the local network, the E-I circuit forms an oscillator with one spike per cycle in each cell; the I-cell has recovered by the time it receives excitation in the next cycle. When the I-cell is partially refractory, the same input (or several inputs) to that cell may cause the potential to cross only the resting zero value; equation 2.1 then implies that the cell will also fire, but with a variable delay time that depends on the difference between the times of the excitation and the time of the last spike of that cell. This delay is a central feature of the analysis in Ermentrout and Kopell (1998) for a two-circuit network; for our more complicated case, we chose an explicit model that has this delay.

We chose the following form for the I_l current. For an I-cell, the current is completely of synaptic origin and is given by

I_l = a Σ_k δ(t − t_{l(k)}).   (2.2)
Here the delta functions represent pulses (inputs) coming with delays from nonlocal E-cells with amplitude a at times t_{l(k)} from cell l + k to cell l. For an E-cell, the current has the form

I_l = c − g Σ_j e^{−(t − t_{lj})/τ_s^i} h(t − t_{lj}),   (2.3)
where c is a driving current (e.g., driven by a stimulus) and the second term represents inhibitory synaptic inputs with a positive amplitude g coming
from the local I-cell at times t_{lj}, with synaptic I→E time constant τ_s^i. The function h(x) is a standard step function defined as h(x) = 1 for x ≥ 0 and h(x) = 0 for x < 0. We assume that for I-cells, the synaptic E→I amplitude a is smaller than both |v_0| and v_th. For E-cells, we assume the condition g ≫ c > v_th + |v_0|. The first inequality means that the local inhibition g is strong enough to prevent firing (Traub, Jefferys, & Whittington, 1999); the second implies that the E-cells fire frequently in its absence. The values τ_m^e, τ_m^i, v_0, and v_th can be different for I- and E-cells, but we do not use this explicitly. We assume that the synaptic E→I time course τ_s^e is not important. This assumption is consistent with experimental studies on the CA1 area (Buhl, Halasy, & Somogi, 1994) that reveal that excitatory postsynaptic potentials to I-cells decay rapidly. As a consequence, incoming excitatory inputs to I-cells are taken as structureless (cf. equation 2.2). As we will show, the effective membrane time constant τ_m^i for the I-cells plays a role in determining parameter regimes in which the doublets are stable. The corresponding time constant τ_m^e for E-cells does not play as important a role. For some calculations, notably of W^E defined below, we cannot get specific answers in full generality. However, we do the computation in the complementary regimes τ_m^e/τ_s^i ≪ 1 and τ_m^e/τ_s^i ≫ 1. Because of the large number of active conductances that lower τ_m^e, we believe τ_m^e/τ_s^i ≪ 1 to be the more relevant regime. The qualitative features are shown to be the same in each, with a different functional dependence of the period of oscillations on τ_s^i in the two regimes. The qualitative shape of the stability diagrams does not change at all. We do not put any constraints on the relative value of τ_s^i and τ_m^i.
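The variable firing delay implied by equation 2.1 with threshold and reset can be made concrete with a short numeric sketch. All parameter values below are illustrative assumptions, not taken from the paper; the closed forms simply solve equation 2.1 with I = 0, namely τ dv/dt = −v for v < 0 (relaxation from the reset value toward zero) and τ dv/dt = v for v > 0 (explosive growth toward v_th):

```python
import math

# Illustrative parameters (assumed, not from the paper); arbitrary units.
tau = 1.0      # membrane time constant
v0 = -1.0      # reset potential after a spike (v0 < 0)
vth = 1.5      # firing threshold
a = 1.2        # size of a single excitatory pulse

def firing_delay(t_pulse):
    """Delay from pulse arrival to the spike it eventually triggers.

    After a spike, eq. 2.1 with I = 0 gives v(t) = v0*exp(-t/tau) for v < 0.
    A pulse of size a at time t_pulse lifts the potential to
    b = a + v0*exp(-t_pulse/tau).  If b > 0, the explosive branch gives
    v(s) = b*exp(s/tau), so the cell fires after tau*ln(vth/b); if b <= 0,
    the pulse fails to elicit a spike."""
    b = a + v0 * math.exp(-t_pulse / tau)
    if b <= 0.0:
        return None
    return tau * math.log(vth / b)

# The same pulse arriving earlier (cell less recovered) fires much later:
early, late = firing_delay(0.5), firing_delay(3.0)
```

The delay shrinks monotonically as the cell recovers, which is exactly the history dependence that the waiting-time function W^I captures below.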
Note that the membrane dynamics represented by equation 2.1 is piecewise linear and slightly different from standard integrate-and-fire-type models (McCulloch & Pitts, 1943; Knight, 1972; Abbott & van Vreeswijk, 1993; Usher, Stemmler, Koch, & Olami, 1994; Gerstner, 1995) and from the Hopfield model (Hopfield, 1984; Amit, Gutfreund, & Sompolinsky, 1985; Hertz, Krogh, & Palmer, 1991). Our simple model mimics some nonlinear behavior caused by ionic channels. For negative voltages, it acts as a standard leaky integrate-and-fire neuron. For positive voltages, it models the explosive depolarization generated by the opening of active voltage-gated sodium channels. Though we take the same time constants for v < 0 and v > 0, we note that the analysis below has the same form in the more general case in which these constants are allowed to be different. As in "type 1 neurons" (Ermentrout, 1996), for neurons obeying equation 2.1 there can be a significant delay between receipt of a suprathreshold signal and the response of the neuron.

In summary, in our model, the E-cells fire continuously in the absence of local inhibitory connections. With inhibition, an E-cell can fire only when the level of inhibition drops below a certain critical value. I-cells fire when they are sufficiently recovered from their last spike that the excitation (from
whatever source) pushes them above their first (v = 0) threshold. Thus, if the excitation arrives long after the last spike of an I-cell, the cell fires almost immediately because it is recovered. However, the same excitation arriving shortly after a spike will not elicit another spike. We are assuming that the parameter ranges are such that an I-cell recovers well before the end of a cycle; the recovery time is shorter than the decay of inhibition, since the recovery time depends on the membrane time constant of the cell, which is small.

3 Network Properties Are Revealed by Maps
The above assumptions enable us to obtain analytical results concerning synchronous oscillations and their stability. We associate the synchronous state with the situation in which the E-cells fire simultaneously. Below we construct maps that show how the network uses the timing of spikes to accomplish synchronization.

The full cycle for E-cells is as follows. An E-cell fires and sends a pulse to its local I-cell and many nonlocal I-cells. The local I-cell fires almost immediately after receiving the pulse, since the I-cell is recovered from the last cycle. Simultaneously (or very quickly) it sends inhibition to the original E-cell, prohibiting it from further firing. In the meantime, the I-cell gets pulses from many nonlocal E-cells, but it must wait to fire, because it is not yet recovered from its first spike. We assume that the range of connectivity of the nonlocal E→I connections (measured in time units) is significantly smaller than the inhibitory time constant τ_s^i, which is about 10 msec. This implies that all the nonlocal EPSPs arrive when the local E-cell is not yet recovered from its local inhibition. The parameter range relevant to the doublet case is one in which the combined excitation from the nonlocal E-cells is adequate to cause the I-cell to fire again, after a delay we call W^I, before the E-cell recovers from inhibition and fires again. In that case, the doublet spike sends a second inhibitory pulse to the local E-cell, which will fire in the next cycle when the level of inhibition falls sufficiently. We denote the time from the second inhibitory spike to the E-cell firing by W^E. Thus for the doublet case, we have the following equations:

t̄_l^E = t_l^E + W_l^I(t_l^E, t_{l(1)}, …, t_{l(n_l)}) + W_l^E(W_l^I),   (3.1)
for l = 1, …, N, where t_l^E and t̄_l^E are the times of spiking of the lth E-cell in the previous and next cycles. W_l^I is a function of t_{l(1)}, …, t_{l(n_l)}, which are the times of arriving pulses from nonlocal E-cells enumerated as l(1), l(2), …, l(n_l). We have t_{l(k)} = t_{l(k)}^E + L_{l,l(k)}, where L_{l,l(k)} is the distance between the lth and l(k)th circuits. The I-cell fires at some moment between the arrival of the n_l and n_l + 1 pulses. We look for a synchronous solution: one in which t̄_i^E = t̄_j^E provided t_i^E = t_j^E for every i and j. Equation 3.1 is the map that gives the
timing of the E-spikes in one cycle as a function of spikes in the previous cycle. Note that the hypothesis on the extent of the excitatory connectivity is critical for well-defined cycles, which are needed to define t̄_l^E. In principle, one can generate multispikes in I-cells when the decay of inhibition is sufficiently slow and the range of the E→I connectivity is sufficiently large. In the case of triplets, equation 3.1 is modified to

t̄_l^E = t_l^E + W_l^{I,1}(t_l^E, t_{l(1)}, …, t_{l(n_l)}) + W_l^{I,2}(t_l^E + W_l^{I,1}, t_{l(n_l+1)}, …, t_{l(n_l+m_l)}) + W_l^E(W_l^{I,1}, W_l^{I,2}),   (3.2)
where W^{I,1} and W^{I,2} denote the waiting times of an I-cell for the second and third spikes, measured from the previous spike. The integer m_l denotes the number of inputs between the occurrence of the second and third spikes. Generalization to higher-order multispikes is straightforward. The same formalism also applies to the singlet case. In the latter, the I-cells do not receive enough excitation to overcome their partial refractoriness and reach the threshold v = 0. Recall that the decay time of the inhibition is long enough for the I-cell to recover. Thus, an E-cell spike elicits a spike from the now-recovered I-cell, which in turn inhibits the E-cell. The formalism is as in equation 3.1, with the waiting time W^I set to zero.

The first goal is to compute the W^I and W^E functions. We compute W_l^I (the firing time of an I-cell started from time t_l^E) using equations 2.1 and 2.2. At time t = t_l^E the potential of the I-cell is instantaneously brought to the value v_th by the local E-cell and equally instantaneously reset to the negative value v_0 after firing a spike. Now the I-cell begins to get excitatory pulses coming from nonlocal E-cells. The time when the I-cell crosses v_th is equal to W_l^I, which takes the form

W_l^I = t_{l(n_l)} − t_l^E − τ_m^i ln[ (b/v_th) exp[(t_{l(n_l)} − t_l^E − δt_l)/τ_m^i] + (a/v_th) Σ_{i=k_l}^{n_l} exp[(t_{l(n_l)} − t_{l(i)})/τ_m^i] ],   (3.3)
where k_l is the number of inputs needed for the I-cell to cross v_l = 0, and δt_l is the time at which this happens. The k_l-th pulse takes the voltage from a negative value to some positive value b satisfying 0 < b < a. The I-cell fires after receiving n_l excitatory inputs. The details of the derivation of equation 3.3 are presented in appendix A.

Computation of W_l^E using equations 2.1 and 2.3 is presented in appendix B. Here we briefly sketch (Ermentrout & Kopell, 1998) how to derive an approximate formula for W_l^E. In the limit of fast membrane dynamics of the E-cell relative to the synaptic I→E time course, that is, when τ_m^e/τ_s^i ≪ 1,
we can set the left-hand side of equation 2.1 with α = e to zero. The inhibitory current felt by an E-cell decreases with time as g exp(−t/τ_s^i). In the case of doublets, one has two inhibitory spikes separated by time W_l^I; the total inhibition is g[exp(−t/τ_s^i) + exp(−(t − W^I)/τ_s^i)]. Similarly, for triplets one has three exponential terms. Our W_l^E is the time interval from the occurrence of the second inhibitory spike to the moment when the total inhibition felt by the local E-cell drops below a certain critical value. That value is the one for which the total current I_l in equation 2.3 becomes positive. After the current I_l becomes positive, equation 2.1 implies that the membrane potential starts to grow, reaching the threshold very quickly, at a time of the order of τ_m^e. In that limit W_l^E takes the form

W_l^E = τ_s^i ln[ (g/c) (1 + e^{−W_l^I/τ_s^i}) ];   (3.4)
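Equation 3.4 can be cross-checked numerically: with illustrative values (assumed here, not taken from the paper) for g, c, τ_s^i, and W^I, the closed form should agree with the time at which the summed doublet inhibition g[e^{−t/τ_s^i} + e^{−(t − W^I)/τ_s^i}] first drops to the drive c, measured from the second inhibitory spike. A minimal sketch:

```python
import math

# Assumed illustrative values satisfying g >> c (times in msec).
g, c, tau_s, WI = 10.0, 1.0, 10.0, 3.0

# Closed form of equation 3.4.
WE_closed = tau_s * math.log((g / c) * (1.0 + math.exp(-WI / tau_s)))

def total_inhibition(t):
    """Summed doublet inhibition; t measured from the first inhibitory spike."""
    return g * (math.exp(-t / tau_s) + math.exp(-(t - WI) / tau_s))

# Bisection for the time at which the (decreasing) inhibition equals c,
# then shift the origin to the second spike at t = WI.
lo, hi = WI, 500.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if total_inhibition(mid) > c:
        lo = mid
    else:
        hi = mid
WE_numeric = 0.5 * (lo + hi) - WI
```

The two values agree to numerical precision, confirming the algebra behind equation 3.4 in this regime.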
at that time, the E-cell fires in the next cycle. In the opposite limit, τ_m^e/τ_s^i ≫ 1, W_l^E is of the order of the membrane time constant τ_m^e, with a similar functional dependence (cf. appendix B).

In the disordered network, the arrival times of pulses from E-cells to the lth I-cell form an irregular pattern. Therefore, we must introduce averaging over those times. We assume that the inputs are independent of one another and that their temporal pattern depends only on the initial conditions (t_l^E) and the architecture of the network {L_{l,l(k)}}. We average equations 3.1 and 3.2 over the network architecture in which l(k) = l + k (mod N); this corresponds to the assumption about periodic boundary conditions. Since arrival times do not depend on each other, our probability distribution ρ of {L_{l,l+k}} is of the form

ρ(L_{l,l+1}, …, L_{l,l+N−1}) = Π_{k=1}^{N−1} P(L_{l,l+k}).   (3.5)
We choose the probability P(L_{l,l+k}) of an input arriving from the (l+k)th circuit to the lth circuit at time L_{l,l+k} to be

P(L_{l,l+k}) ∼ (1/2)[ δ(L_{l,l+k} − kR/N − f) + δ(L_{l,l+k} − kR/N + f) ] e^{−L_{l,l+k}/s},   (3.6)
where R is the linear size of the system, f is a measure of the spatial disorder in the positions of the neurons and is the same for every pair of neurons, and s represents the range of synaptic connections between neurons. s is chosen so that if the E-cells are synchronous, all the excitation from the nonlocal E-cells to a local circuit comes during the recovery of the local E-cell from its local inhibition. Note that ρ does not depend on the position of the lth circuit, which is a consequence of the assumption about translational invariance. The probability P(L_{l,l+k}) is a product of two terms: the one with
the delta functions represents the probability of arrival of an input at a certain time, and the one with the exponential decay represents the probability of a synaptic connection between circuits. The latter factor mimics the locality of connections. When s = ∞, we have a network with global E→I connections. The advantage of formula 3.6 lies in the fact that it captures different types of network architecture, yet is very simple and suitable for analytical investigations. When f = 0, one has an ordered network with regularly arriving pulses separated by time R/N. The case with f ≠ 0 corresponds to a disordered network in which the positions of the neurons are irregular, or alternatively in which the positions of the neurons are regular but the lengths of the axonal connections between them are irregular. In such a network, pulses come in at times that are random but symmetrically distributed around kR/N for every k. In principle, the range of variability of f can be from zero to R/2N. However, in order to make our calculations easier, we adopted the weak disorder limit corresponding to fN/R ≪ 1. In this limit, the moments of distribution 3.6 are the same as those of a gaussian distribution for arrival times (cf. appendix C). Since the results of this article depend only on the moments, this implies that the results are equivalent for the two distributions.

For the sake of convenience we define Δ_{l−1} ≡ ⟨t_l^E⟩ − ⟨t_1^E⟩, where the symbol ⟨…⟩ denotes averaging with respect to the probability distributions 3.5 and 3.6, that is,

⟨H⟩ = ∫_0^∞ dL_{l,l+1} … ∫_0^∞ dL_{l,l+N−1} ρ(L_{l,l+1}, …, L_{l,l+N−1}) · H,   (3.7)
for any function H depending on {L_{l,l+k}}. The next step is to rewrite equations 3.1 and 3.2 in terms of the {Δ}. This gives us (N − 1) equations for Δ̄_l ≡ ⟨t̄_{l+1}^E⟩ − ⟨t̄_1^E⟩, which constitute the multidimensional map

Δ̄_l = f_l(Δ_1, …, Δ_{N−1}).   (3.8)
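The stability question for the synchronous fixed point Δ = 0 of a map of this kind can be probed numerically by iterating the map on a small perturbation and checking that it contracts, which is equivalent to all eigenvalues of the linearization lying inside the unit circle. The sketch below uses an assumed toy contracting map on a ring, not the paper's f_l:

```python
import math

def F(D, self_w=0.4, neigh_w=0.2):
    """Assumed toy map standing in for eq. 3.8: each phase difference
    relaxes (self_w < 1) and mixes with its neighbor on the ring."""
    n = len(D)
    return [self_w * D[i] + neigh_w * D[(i + 1) % n] for i in range(n)]

# Perturb the synchronous fixed point D = 0 and iterate the map.
D = [0.01 * math.sin(i) for i in range(6)]
norms = []
for _ in range(20):
    D = F(D)
    norms.append(max(abs(x) for x in D))

# Since |self_w| + |neigh_w| = 0.6 < 1, perturbations decay geometrically:
# the synchronous state is asymptotically stable for this toy map.
```

For the paper's map, the same decay-or-growth criterion is applied analytically through the eigenvalues λ_l of the linearized f.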
(Recall that the overbar denotes a quantity associated with the next cycle.) In the case of the doublet configuration, the nonlinear function f = (f_1, f_2, …, f_{N−1}) is given by

f_l(Δ_1, …, Δ_{N−1}) = ⟨t̄_{l+1}^E({Δ})⟩ − ⟨t̄_1^E({Δ})⟩,   (3.9)
where

⟨t̄_l^E({Δ})⟩ = ⟨t_l^E⟩ + ⟨W_l^I⟩ + ⟨W_l^E⟩.   (3.10)
This means that the average of the next E-cell firing time is a function only of the last E-cell average firing time, with no previous history dependence. In
equation 3.10, W_l^I(L_{l,l(1)}, …, L_{l,l(n_l)}; {Δ}) and W_l^E(L_{l,l(1)}, …, L_{l,l(n_l)}; {Δ}) are now functions of {L_{l,l(k)}} and {Δ} instead of the arrival times {t_{l(k)}}. The formulas for ⟨W_l^I⟩ and ⟨W_l^E⟩ are complicated; their sum has the form

⟨W_l^I⟩ + ⟨W_l^E⟩ = [τ_s^i / (2^{n_l} cosh^{n_l}(f/s))] Σ_{a_1,…,a_{n_l}=1}^{2} exp(−Σ_{j=1}^{n_l} (−1)^{a_j} f/s)
    × ln[ 1 + (g/c) exp( W_l^I(R/N + (−1)^{a_1} f, …, n_l R/N + (−1)^{a_{n_l}} f; {Δ}) / τ_s^i ) ],   (3.11)
with W_l^I(R/N + (−1)^{a_1} f, …, n_l R/N + (−1)^{a_{n_l}} f; {Δ}) given by equation 3.3. The derivation of equation 3.11, including the details of how to compute the average values of k_l and n_l, which we denote as k and n, respectively, is given in appendix D. We are interested in the existence and stability of a synchronous state, which corresponds to the fixed point Δ̄_l = Δ_l with Δ_l = 0, for every l. For the triplet case, the function f can be obtained explicitly as well but is more complex.

In principle, the I-cells may produce many spikes in a cycle, provided the size R of the system is sufficiently large. Some of these spiking configurations may be stable; some may not (see below). In the case of multistability, the actual configuration depends on the values of the parameters used in the system. Especially critical is the interplay between the synaptic strength a and the linear size of the system R. The synaptic strength determines how many inputs from E-cells are needed to trigger firing of an I-cell. To have only doublets, one has to have a sufficient number of inputs to trigger the second spike in the I-cell, that is, n < N, and an insufficient number to trigger a third spike, which means (n + k) > N. Using the formulas for k_l and n_l in appendix D, we can use these inequalities to get bounds on the synaptic amplitude a such that the system has only doublets. When R/Nτ_m^i ≫ 1 and 0 < f/τ_m^i ≲ 1, which imply weak disorder, these bounds may be written as
$$
\cdots\;\frac{\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)}{\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}\;e^{-R/\tau_m^i}\;\cdots \qquad (3.12)
$$

Figure 2: Stability diagrams. (A) Ordered network: for the values $b = a/2$ and $|v_0|/a = 5$, one finds $x_0 \approx 2$. (B) Disordered network: note the appearance of the vertical asymptote and the increase in the stability region. The values $x_0$ and $y_0$ are of the same order of magnitude as for the ordered network. In the complementary regime (i.e., when $\tau_m^e/\tau_s^i \gg 1$), the shapes of the diagrams remain the same with the substitution $\tau_s^i \mapsto \tau_m^e$ in the coordinates.
for arbitrary values of $\tau_m^i/\tau_s^i$ (see Figure 2B). In the limit $R/N\tau_m^i \gg \phi/\tau_m^i \gg 1$, this critical value takes the form

$$
\Bigl(\frac{R}{N\tau_m^i}\Bigr)_{cr} \approx \frac{R}{N\phi}\biggl[\frac{R}{N\sigma} + \ln\Bigl(\frac{8|v_0|}{a}\cosh^2\frac{\phi}{\sigma}\Bigr)\biggr], \qquad (3.13)
$$
which is the point where the vertical asymptote crosses the horizontal axis in Figure 2B. Notice that when $\phi = 0$, namely, for the ordered network, this asymptote shifts to infinity, reducing the stability region. Also note that the stability region grows with an increasing range of connections $\sigma$ and synaptic strength $a$. Details of how to compute $(R/N\tau_m^i)_{cr}$ are given in appendix E. For the triplet case in the ordered network, we find that the upper limit of $|\lambda_l|$ is always greater than unity (computation not given). This suggests that the triplet configuration in such a network is unstable. On the contrary, in the disordered network we find that synchronized oscillations with triplets are stable for all $\tau_m^i/\tau_s^i$ provided the ratio $R/N\tau_m^i$ is sufficiently large, that is, $R/N\tau_m^i \gg 1$, although we are unable to determine the exact values analytically. We see from the above that spatial disorder in a network helps to stabilize the synchronized oscillations. So does increasing the range of synaptic connections (see equation 3.13). Based on the comparison between doublets and triplets, we conjecture that higher-order multispikes are less stable than doublets, if they exist at all. The analysis also allows us to determine the period $T$ of oscillations, which is equal to $\langle W^I\rangle + \langle W^E\rangle$ (see equation 3.10). In the ordered network, with the doublet configuration, the formula is relatively simple. In the limit $\tau_m^e/\tau_s^i \ll 1$ we obtain

$$
T = \tau_s^i \ln\Biggl(\frac{g}{c}\Biggl[1 + \biggl(\frac{\bigl[(v_{th}-a)(1-e^{-R/N\tau_m^i})+a\bigr]\bigl[a+|v_0|(1-e^{-R/N\tau_m^i})\bigr]}{a\bigl[b+(a-b)e^{-R/N\tau_m^i}\bigr]}\biggr)^{\tau_m^i/\tau_s^i}\Biggr]\Biggr). \qquad (3.14)
$$
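As a numerical sanity check of the period formula 3.14 as reconstructed above, the following sketch (with illustrative parameter values of our own choosing, not taken from the paper's figures) verifies that the period grows with the system size $R$ and reduces to $\tau_s^i\ln(2g/c)$ as $\tau_m^i/\tau_s^i \to 0$:

```python
import math

# Sketch of the period formula 3.14 as reconstructed above; parameter
# values here are illustrative, not fitted to the paper's figures.
def period(R_over_Ntm, tm_over_ts, g=2.0, c=0.1, a=0.5, b=0.25, v0=-2.5, vth=1.0, ts=1.0):
    e = math.exp(-R_over_Ntm)
    # argument of the (tau_m^i / tau_s^i) power in equation 3.14
    X = ((vth - a) * (1 - e) + a) * (a + abs(v0) * (1 - e)) / (a * (b + (a - b) * e))
    return ts * math.log((g / c) * (1 + X ** tm_over_ts))

# The period grows with R and saturates; for tau_m^i/tau_s^i -> 0 it
# approaches the local-circuit value ts * ln(2g/c).
periods = [period(x, 0.2) for x in (0.5, 1.0, 2.0, 4.0, 8.0)]
print(periods)
print(period(2.0, 1e-9), math.log(2 * 2.0 / 0.1))
```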
Note that the period of oscillations of the whole network is larger than the period of oscillations of an isolated local circuit, $\tau_s^i\ln(g/c)$, but both are of the same order of magnitude. In the opposite limit, $\tau_m^e/\tau_s^i \gg 1$, one should substitute $\tau_m^e$ for $\tau_s^i$ and $g(1+v_{th}/c)\tau_s^i/\tau_m^e$ for $g$. The period grows with increasing size of the system, saturating for $R \to \infty$ (for fixed $N$). This means that the frequency of oscillations is bounded from below. The dependence on the membrane time constant $\tau_m^i$ is essentially the same as on $R$. $T$ decreases with increasing strength of the E$\to$I
synapses, which corresponds to increasing the amplitude $a$. The same conclusions are valid in the case of the disordered network with the doublet solution, but the expression for the period is more complex, with similar logarithmic terms. In the regime $\tau_m^e/\tau_s^i \ll 1$, we obtain the following expressions for the period of oscillations. For the ordered network, in the limits $\tau_m^i/\tau_s^i \ll 1$ and $R/N\tau_m^i \gg 1$, that is, in the parameter range where there is a stable doublet configuration, the period is governed only by the inhibitory synaptic time course $\tau_s^i$ (Whittington et al., 1995; Jefferys, Traub, & Whittington, 1996; Traub, Whittington, Colling, Buzsaki, & Jefferys, 1996) and reads $T \approx \tau_s^i\ln(2g/c)$. (Proportionality between the period and $\tau_s^i$ has also been obtained before in a network of inhibitory neurons; Chow, White, Ritt, & Kopell, 1998.) In the above limits, we also obtain a very simple formula for the ratio of the average timing between doublets $\langle W^I\rangle$ to the period, $\langle W^I\rangle/T \approx (\tau_m^i/\tau_s^i)\ln(v_{th}|v_0|/ab)/\ln(2g/c)$, which depends logarithmically on the nonlocal (E$\to$I) synaptic strength. In the same limit for the disordered network, with the additional constraint $\phi/\tau_m^i \lesssim 1$, we obtain the period
$$
T \approx \tau_s^i \ln\Bigl(\frac{2g}{c}\Bigr) - \frac{\phi}{2}\tanh\frac{\phi}{\sigma}. \qquad (3.15)
$$
This implies that spatial disorder reduces the period and, in addition, that the longer the range of connections $\sigma$, the greater is the period. In the complementary regime $\tau_m^e/\tau_s^i \gg 1$, we obtain the following expression for the period:

$$
T \approx \tau_m^e \ln\biggl(\frac{2g\tau_s^i}{c\tau_m^e}\Bigl[1+\frac{v_{th}}{c}\Bigr]\biggr) - \frac{\phi}{2}\tanh\frac{\phi}{\sigma}. \qquad (3.16)
$$
Note that now the period depends more weakly (i.e., logarithmically) on the synaptic time constant $\tau_s^i$. The triplet case occurs only in the disordered network. Here, too, an analytical expression for $T$ can be obtained, and again we find that the leading dependence is logarithmic, although the formula is more complicated than for doublets. Nevertheless, the qualitative dependences on $R$, $\tau_s^i$, $\tau_m^i$, $a$, and the disorder remain valid here.

4 Inclusion of Synaptic Noise
In this section we estimate the effect of synaptic noise on the stability of the doublet configuration in the regime $\tau_m^e/\tau_s^i \ll 1$. To simplify the considerations, we consider a network with $N = 2$ local circuits with the same type of connections as before. For $N = 2$ circuits, the assumption about the smallness of the excitatory E$\to$I synaptic strength $a$ is removed; in this case, a single nonlocal excitatory input
must trigger the firing of the local I-cell. We assume, however, that $a < v_{th}$, so that an I-cell has a delay between the receipt of an input and firing. We can make use of equations 3.3 and 3.4, derived before for the $W^I$ and $W^E$ functions. The formula for $W^I$ is modified because now there is only one strong source of excitation instead of many weak sources. Thus, the sum in the logarithmic term in equation 3.3 disappears. We note that the time $\delta t_l$ when the membrane potential crosses zero is exactly the moment when the nonlocal excitatory pulse arrives, $\delta t_l = t_2^E - t_1^E + R$, where $R$ is now the conduction delay time between the circuits. The value $b$ of the membrane potential right after the reception of an input can also be substituted in equation 3.3. This value is now $b = v_0\exp(-\delta t_l/\tau_m^i) + a$. Taking all these changes into account, we obtain for the first circuit

$$
W_1^I = t_2^E + R - t_1^E + \tau_m^i \ln\biggl(\frac{v_{th}}{v_0\, e^{-(t_2^E+R-t_1^E)/\tau_m^i} + a}\biggr). \qquad (4.1)
$$
A similar formula holds for the second circuit with the substitution $t_1^E \leftrightarrow t_2^E$. The equation for $W_{1,2}^E$ remains the same as equation 3.4. In the presence of synaptic noise, the synaptic strengths $a$ and $g$ fluctuate rapidly around their average values, reducing the precision of the values of $W^I$ and $W^E$. To estimate the effect of synaptic noise, we treat $a$ and $g$ as random variables. Repeating the steps from section 3, we have

$$
\bar{\Delta} = \Delta + \tau_s^i \ln\Bigl(\frac{h(-\Delta)}{h(\Delta)}\Bigr), \qquad (4.2)
$$

where now $\Delta \equiv t_2^E - t_1^E$ and $\bar{\Delta} \equiv \bar{t}_2^E - \bar{t}_1^E$, and

$$
h(\Delta) = \bigl(v_0\, e^{-(R+\Delta)/\tau_m^i} + a\bigr)^{\tau_m^i/\tau_s^i} + \bigl(e^{(R+\Delta)/\tau_m^i}\bigr)^{\tau_m^i/\tau_s^i} v_{th}. \qquad (4.3)
$$
Notice that the synaptic I$\to$E strength $g$ cancels out. Averaging the above equation over the different possible values of $a$ with the simple probability distribution (cf. appendix C) $P(a)_{noise} = \frac{1}{2}[\delta(a-a_0-\delta a) + \delta(a-a_0+\delta a)]$, where $\delta a$ is a measure of the synaptic noise, and solving the stability inequality $|\partial\bar{\Delta}/\partial\Delta|_{\Delta=0}| < 1$ in terms of $R$, in the limit $\delta a/a_0 \ll 1$ we obtain, after some tedious algebra,

$$
R/\tau_m^i > \ln\Biggl(\frac{2|v_0|}{a_0}\biggl[1 + \Bigl(\frac{\delta a}{a_0}\Bigr)^2 r\biggr]\Biggr), \qquad (4.4)
$$

with

$$
r = \frac{(4|v_0|v_{th})^{\tau_m^i/\tau_s^i} + a_0^{2\tau_m^i/\tau_s^i}\,(1+\tau_m^i/\tau_s^i)}{(4|v_0|v_{th})^{\tau_m^i/\tau_s^i} + a_0^{2\tau_m^i/\tau_s^i}}. \qquad (4.5)
$$
When the synaptic noise is absent ($\delta a = 0$), the right-hand side of equation 4.4 is smaller than when $\delta a > 0$. Hence, the inclusion of synaptic noise decreases the parameter regime in which there is stability. However, the effect of noise is rather weak, since the dependence in equation 4.4 is logarithmic. Finally, let us note that since $g$ dropped out of equations 4.4 and 4.5, synaptic noise in a local I$\to$E inhibitory synapse is irrelevant. This is a consequence of the fact that we assumed slow I$\to$E inhibition, that is, $\tau_m^e/\tau_s^i \ll 1$. When inhibition is slow, fast synaptic fluctuations do not have a great influence on synaptic transmission, since the synapse has time to average over the fluctuations. This is not the case for the E$\to$I nonlocal synaptic connections, since in that case excitation is fast (Buhl et al., 1994) and the system lacks the time to average. Therefore, synaptic fluctuations in the latter case are important.

5 Discussion
The literature on synchronization is very large; some related papers using the properties of inhibition are Lytton and Sejnowski (1991), Wang and Rinzel (1992, 1993), Friesen (1994), van Vreeswijk, Abbott, and Ermentrout (1994), Bush and Sejnowski (1996), Gerstner, van Hemmen, and Cowan (1996), Wang and Buzsaki (1996), Terman, Kopell, and Bose (1998), Terman and Rubin (1998), White, Chow, Ritt, Soto, and Kopell (1998), and Chow et al. (1998). In most articles on synchronization, conduction delays do not play a part. One that explicitly addresses this question is Konig and Schillen (1991), which gives simulations of a firing-rate model showing that synchronization could be accomplished with delays up to roughly a third of the period. It differs in two respects from our article. First, we consider a spiking-neuron-type model, in which the timing between single spikes is very important; in firing-rate models, the fine temporal structure is lost. Second, our synchronous state is stable even for infinite-range delays, that is, for $\sigma \to \infty$ (see equation 3.13 and Figure 2B); if there is heterogeneity in the driving current, the range could be reduced (Ermentrout & Kopell, 1998). Crook, Ermentrout, Vanier, and Bower (1997) consider temporal delays caused by axons in a continuous model of excitatory oscillators. They found that the synchronous state is stable provided the delays are not too large. These authors do not consider the effect of inhibition. Other articles, like Campbell and Wang (1998) and Ernst, Pawelzik, and Geisel (1995), deal with pulse-coupled relaxation oscillators. The first of these uses excitation with homogeneous global inhibition. The second uses either excitatory or inhibitory homogeneous connections, but not both, as a mechanism for the generation of long-range synchronization. Neither of these articles takes into account the time structure of inhibition. Thus, they are quite different in spirit from the models of Whittington et al. (1995), Traub, Whittington, Stanford, and Jefferys (1996), and Ermentrout and Kopell (1998), which deal with synchronization of excitatory and inhibitory cells that spike once or twice per cycle,
and make central use of the time course of the inhibition. Another recent related article is Traub, Whittington, Stanford, and Jefferys (1999), which includes a more detailed model than the one in Traub, Whittington, Stanford, and Jefferys (1996) and summarizes a body of work on gamma and beta rhythms. (See also Traub, Jefferys, & Whittington, 1999.) This article extends the work of Ermentrout and Kopell (1998) on gamma oscillations in a smaller network. We have analyzed the mechanism for producing long-range synchrony in a distributed network of spiking neurons, in which each interneuron receives a large number of inputs arriving at different times. We showed that even with many inputs, a firing configuration in which interneurons produce doublets yields stable synchronization over a larger parameter range than other configurations. These doublets have been seen in large-scale simulations, as well as in in vitro studies (Traub, Whittington, Stanford, & Jefferys, 1996; Traub, Whittington, Buhl, Jefferys, & Faulkner, 1999), and might provide additional support for a temporal coding hypothesis (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). We note that (as in Tsodyks, Mitkov, & Sompolinsky, 1993, van Vreeswijk et al., 1994, and Gerstner et al., 1996) our calculations deal only with linear stability and hence do not give information about global stability. In Ermentrout and Kopell (1998), the synchronization mechanism depends on the properties of the waiting-time function called $W^I$ in this article and $T_I$ in Ermentrout and Kopell (1998). The analysis in that article was valid for all functions $T_I$ that have the qualitative property that they decay in time. That is, the longer the time from the spike of an I-cell until it receives further excitation, the shorter the wait until it then fires. This property was shown to hold for a class of biophysical models reduced from those of Traub and Miles (Ermentrout & Kopell, 1998; Traub & Miles, 1991).
The analysis in this article uses a particular form for the dynamics (see equation 2.1), a one-dimensional system having two thresholds. This form was chosen as the simplest equation for which the waiting-time function $T_I$ has a form similar in shape to that computed from the reduced model of Traub and Miles (1991). This was used in the computation of equation 3.3. The essential role in our mechanism is played by local inhibitory neurons that filter inputs coming from nonlocal excitatory cells to a local E-cell, inhibiting it from firing by producing appropriately timed multispikes. Especially in the regime $\tau_m^e/\tau_s^i \ll 1$, this inhibition governs the period of oscillations of the excitatory cells, as well as stabilizing the synchronous state. For the ordered network (see Figure 2A), this inhibition must be slower than the effective membrane time constant of the inhibitory cells in order to preserve stability. In the case of the disordered network (see Figure 2B), this inhibition does not have to be slow for sufficiently large $R/N\tau_m^i$. In the complementary regime, $\tau_m^e/\tau_s^i \gg 1$, the time course of the inhibition does not play any role in the stability. Our stability diagrams and the period of oscillations scale with the factor $R/N$, the average time needed for a pulse to travel between neighboring
circuits. This scaling is a direct consequence of equation 3.6 and indicates that the stability diagrams, as well as the period, do not depend on the network size $R$, provided the density of circuits $R/N$ is fixed. The stability diagrams indicate that below a minimal value $x_0$ of $R/N\tau_m^i$, the doublet configuration is not stable. From equation E.9 (in appendix E), using the parameter values $b = a/2$ and $|v_0|/a = 5$, we get $x_0 \approx 2$, with $x_0$ only weakly dependent on $a$, $b$, and $v_0$. This gives a minimal size for the distance $R/N$ between the local circuits: $R/N > x_0\tau_m^i$. If the effective time constant $\tau_m^i$ is very small (from the conductances in Traub, Whittington, Stanford, & Jefferys, 1999, we estimate $\tau_m^i$ to be a fraction of a msec), the local circuits can be taken to be very close. This estimate is made as follows: the active conductances include an excitatory metabotropic conductance, the AHP conductance, and the synaptic conductances from local interactions (Traub, Whittington, Stanford, & Jefferys, 1999). For interneurons, the input conductance is about 10 nS. After the first spike, there is a large, fast AHP. At the peak of the fast AHP, the input conductance is at least 500 to 1000 nS (Traub & Miles, 1995). That means that after the first action potential, the input conductance increases by at least a factor of 50, which implies that the time constant goes down by such a factor. An estimate of the baseline time constant in Traub and Miles (1995) is 37.5 msec; after the first action potential and before the doublet, the time constant would then be less than a msec. This takes into account only the AHP current. If one also includes the metabotropic glutamate conductances activated by the tetanic stimulation and the GABA_A conductances from the other interneurons, as documented in Whittington, Stanford, Colling, Jefferys, and Traub (1997), one has an effective time constant that is considerably lower.
We note that all the computations done in this article concern the time between the first and second spikes of a doublet (or the third, in the triplet configuration). Thus we are using an effective time constant that reflects only a small part of the oscillation cycle. Another property of equation 3.6 is that the disorder parameter $\phi$ is the same for every pair of circuits. This is a consequence of the assumed translational invariance in the system and corresponds to the situation in which the disorder is not too strong. We showed that for weak spatial disorder, the results of this article are identical if equation 3.6 is replaced by a gaussian distribution. In the limit of very strong disorder, however, one should not expect the assumption of translational invariance to be correct. In this case, the proper approach would require taking $\phi$ as a space-dependent variable, that is, different for every pair of circuits. Our analysis concerns perfect (probabilistic) synchrony, in which every cell (excitatory and inhibitory) participates in the synchronization process. For such complete synchrony, we find that weak disorder ($\phi N/R \ll 1$) enlarges the stability region of the synchronous state in parameter space in comparison to the ordered network. Our argument probably does not work for very strong disorder; the existence argument would be especially difficult to make, since the pattern of inputs arriving at every cell would be very different. Nevertheless, in such a network there may exist some partially coherent state in which only a fraction of the cells participate at a given time (Tsodyks et al., 1993; van Vreeswijk, 1996). In vitro and in large-scale simulations (Traub, Whittington, Stanford, & Jefferys, 1996), doublets are produced by only some interneurons in some cycles. It remains to understand how the mechanism explained in this article can work in the more disordered situation in which only a fraction of those interneurons that fire in a cycle also fire a doublet spike. Synaptic noise in I- and E-cells could also have a drastic effect on the stability of the various multispike configurations. The analysis for the two-circuit case suggests that noise in general reduces stability, although rather weakly. Thus, noise has an effect opposite to that of weak spatial randomness in the network, which in general makes long-range synchronization more robust. The results of this article seem to be consistent with recent experimental in vitro studies concerning the effect of morphine on long-range synchrony in the hippocampal CA1 area (Whittington, Traub, Faulkner, Jefferys, & Chettiar, 1998). These authors found that adding morphine to the system, which effectively reduces the level of inhibition, causes bursting of inhibitory interneurons accompanied by a lack of long-range synchrony. They suggest that this loss of synchrony may be the reason that morphine causes cognitive dysfunction clinically (Whittington et al., 1998). In our model, a decrease in I$\to$I inhibition corresponds to decreasing $|v_0|$ and/or $v_{th}$ (so that less input is required to trigger the firing of an I-cell). This means that the ratio $ab/|v_0|v_{th}$ would increase in equation 3.12, opening the possibility of generating higher-order multispikes in inhibitory neurons.
Because such higher-order multispikes are probably unstable (or stable only in a limited parameter space), one could observe bursts without the synchrony. Also notice that the stability regions in Figure 2 get smaller with decreasing strength of the nonlocal synaptic E$\to$I coupling. This is in qualitative agreement with the large-scale simulations involving multicompartmental realistic neurons of the hippocampal CA1 area (Traub, Whittington, Stanford, & Jefferys, 1996). In those simulations, it was found that reduction of the E$\to$I AMPA conductance resulted in disruption of the coherent global oscillations. For large values of the synaptic coupling, there is some chance that more than one multispike configuration may exist and be stable. This may be undesirable for the network if a fixed configuration is needed for downstream processing. We note that architecture-dependent synaptic depression (Thomson & Deuchars, 1994; Markram & Tsodyks, 1996; Abbott, Sen, Varela, & Nelson, 1997; Koch, 1997) might then help to decrease the synaptic coupling to values at which only the doublets are stable. We also note that since the period of the network oscillation depends on the extra inhibition produced by the doublet spike, modulation of the percentage of interneurons firing doublets may provide a mechanism for modulation of the period.
Appendix A
In this appendix we show how to derive equation 3.3. The local $l$th I-cell receives a spike train from nonlocal excitatory cells at times $t_{l(1)}, \ldots, t_{l(k_l)}, \ldots, t_{l(n_l)}$, after which it fires, where $l(k) = l + k$. Solving equations 2.1 and 2.2 with the initial condition $v(t_l^E) = v_0$, we obtain recursive formulas for the values of the potential $v(t)$:

$$
\begin{aligned}
v(t_{l(1)}) &= v_0\, e^{-(t_{l(1)}-t_l^E)/\tau_m^i} + a,\\
v(t_{l(2)}) &= v(t_{l(1)})\, e^{-(t_{l(2)}-t_{l(1)})/\tau_m^i} + a,\\
&\;\;\vdots\\
v(t_{l(k_l)}) &= v(t_{l(k_l-1)})\, e^{-(t_{l(k_l)}-t_{l(k_l-1)})/\tau_m^i} + a,
\end{aligned} \qquad (A.1)
$$

up to the time $\delta t_l \equiv t_{l(k_l)}$ when $v(\delta t_l) = 0$. This time is given by

$$
\delta t_l = \tau_m^i \ln\Biggl[\frac{|v_0|}{a} - \sum_{i=1}^{k_l-1} e^{(t_{l(i)}-t_l^E)/\tau_m^i}\Biggr]. \qquad (A.2)
$$
At that time, $v(t)$ in equation 2.1 changes its sign. This causes exponential growth described by the recursive equations

$$
\begin{aligned}
v(t_{l(k_l+1)}) &= b\, e^{(t_{l(k_l+1)}-t_{l(k_l)})/\tau_m^i} + a,\\
v(t_{l(k_l+2)}) &= v(t_{l(k_l+1)})\, e^{(t_{l(k_l+2)}-t_{l(k_l+1)})/\tau_m^i} + a,\\
&\;\;\vdots\\
v(t_{l(n_l)}) &= v(t_{l(n_l-1)})\, e^{(t_{l(n_l)}-t_{l(n_l-1)})/\tau_m^i} + a.
\end{aligned} \qquad (A.3)
$$

The I-cell fires between receiving the spikes numbered $n_l$ and $n_l+1$. Thus the time of firing $t_f$ is determined from the condition

$$
v(t_f) = v(t_{l(n_l)})\, e^{(t_f - t_{l(n_l)})/\tau_m^i} \equiv v_{th}. \qquad (A.4)
$$
This yields $t_f = t_{l(n_l)} + \tau_m^i\ln(v_{th}/v(t_{l(n_l)}))$. Inserting the values of $v(t_{l(n_l)})$ and $t_{l(k_l)}$, and using the fact that $W_l^I = t_f - t_l^E$, we obtain equation 3.3. We should mention that $v(t_{l(n_l)}) \approx v_{th}$; hence the logarithmic term in the formula for $t_f$ is very small. Thus, an approximate formula for $W_l^I$ is $W_l^I \approx t_{l(n_l)} - t_l^E$. We will make use of this fact in appendix D. Nevertheless, one should be cautious about dropping the logarithmic term too quickly; this term is important in the stability considerations, that is, in the calculation of the stability matrix $A$.
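The recursion A.1 and the crossing-time formula A.2 can be checked numerically. In the sketch below (our own illustrative setup, with $t_l^E = 0$ and hypothetical function names), the parameters are chosen so that the third input brings the potential exactly to zero:

```python
import math

def crossing_time(v0, a, tau, input_times):
    # Equation A.2, with times measured from t_l^E = 0: the sum runs over
    # the k_l - 1 inputs received before the one that produces the crossing.
    s = sum(math.exp(t / tau) for t in input_times[:-1])
    return tau * math.log(abs(v0) / a - s)

def iterate_potential(v0, a, tau, input_times):
    # Recursion A.1: decay toward zero between inputs, jump of +a at each input.
    v, t_prev = v0, 0.0
    for t in input_times:
        v = v * math.exp(-(t - t_prev) / tau) + a
        t_prev = t
    return v

tau, a, dt = 1.0, 1.0, math.log(2.0)
times = [dt, 2 * dt, 3 * dt]
v0 = -(math.exp(dt) + math.exp(2 * dt) + math.exp(3 * dt))  # chosen so v = 0 at the 3rd input

print(iterate_potential(v0, a, tau, times))   # ~0: the third input hits zero exactly
print(crossing_time(v0, a, tau, times))       # ~3*dt, the time of that input
```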
Appendix B
In this appendix we show how to derive the $W_l^E$ function in equation 3.4. At time $t = t_l^E$ the $l$th E-cell fires and its potential is reset to $v_0\ (<0)$. We solve equations 2.1 and 2.3 with two inhibitory inputs coming from the local I-cell, separated by the time $W_l^I$, with the initial condition $v(t_l^E) = v_0$. For simplicity, we assume that the time $t_0$ when the potential of the E-cell crosses zero is larger than $t_l^E + W_l^I$. The former is determined from the condition $v(t_0) = 0$, which produces

$$
0 = v_0\, e^{-(t_0-t_l^E)/\tau_m^e} + c\bigl(1 - e^{-(t_0-t_l^E)/\tau_m^e}\bigr) - \frac{g\tau_s^i}{\tau_s^i - \tau_m^e}\Bigl[e^{-(t_0-t_l^E)/\tau_s^i} - e^{-(t_0-t_l^E)/\tau_m^e} + e^{-(t_0-t_l^E-W_l^I)/\tau_s^i} - e^{-(t_0-t_l^E-W_l^I)/\tau_m^e}\Bigr]. \qquad (B.1)
$$

In the limit $\tau_m^e/\tau_s^i \ll 1$ this equation yields

$$
t_0 - t_l^E \approx \tau_s^i \ln\Bigl[\frac{g}{c}\bigl(1 + e^{W_l^I/\tau_s^i}\bigr)\Bigr], \qquad (B.2)
$$
and in the limit $\tau_m^e/\tau_s^i \gg 1$ (with $c/g \ll 1$),

$$
t_0 - t_l^E \approx \tau_m^e \ln\Bigl[\frac{g\tau_s^i}{c\tau_m^e}\bigl(1 + e^{W_l^I/\tau_m^e}\bigr)\Bigr]. \qquad (B.3)
$$
At $t = t_0$, $v(t)$ changes sign and begins to grow exponentially, reaching the threshold $v_{th}$. Solving equations 2.1 and 2.3 as before with the initial condition $v(t_0) = 0$, we obtain, in the limit $\tau_m^e/\tau_s^i \ll 1$, the time $t_{th}$ at which $v(t_{th}) = v_{th}$:

$$
t_{th} - t_0 \approx \tau_m^e \ln\Bigl(1 + \frac{v_{th}\tau_s^i}{c\tau_m^e}\Bigr). \qquad (B.4)
$$

In the limit $\tau_m^e/\tau_s^i \gg 1$, the time $t_{th}$ is given by

$$
t_{th} - t_0 \approx \tau_m^e \ln\Bigl(1 + \frac{v_{th}}{c}\Bigr). \qquad (B.5)
$$
Since $W_l^E = t_{th} - t_l^E - W_l^I$, we can neglect the $t_{th} - t_0$ part in the limit $\tau_m^e/\tau_s^i \ll 1$, which then yields

$$
W_l^E \approx t_0 - t_l^E - W_l^I = \tau_s^i \ln\Bigl[\frac{g}{c}\bigl(1 + e^{-W_l^I/\tau_s^i}\bigr)\Bigr]. \qquad (B.6)
$$
We cannot neglect the $t_{th} - t_0$ part in the limit $\tau_m^e/\tau_s^i \gg 1$, because both parts are of the same order. As a result, in this limit, we obtain

$$
W_l^E \approx \tau_m^e \ln\Bigl[\frac{g\tau_s^i}{c\tau_m^e}\Bigl(1 + \frac{v_{th}}{c}\Bigr)\bigl(1 + e^{-W_l^I/\tau_m^e}\bigr)\Bigr]. \qquad (B.7)
$$
Formulas B.6 and B.7 differ mainly by the substitution $\tau_s^i \leftrightarrow \tau_m^e$; this is the reason that the stability diagrams in the two regimes look the same. Both formulas are similar in form to that adopted by Ermentrout and Kopell (1998).

Appendix C
In this appendix we demonstrate that the probability distribution $P_1(x)$ with delta functions (see equation 3.6) and the gaussian distribution $P_2 = \exp[-(x-x_0)^2/2\phi^2]$ are equivalent in the limit $\phi/x_0 \ll 1$, which is the limit of weak disorder. To prove this statement, it is sufficient to show that the $n$th moment of $x$ computed using the two distributions is the same in this limit. For the gaussian distribution we have

$$
\langle x^n\rangle_{P_2} = Z^{-1}\int_{-\infty}^{\infty} dx\, P_2(x)\, x^n = Z^{-1}\int_{-\infty}^{\infty} dx\, x^n\, e^{-(x-x_0)^2/2\phi^2}, \qquad (C.1)
$$

where $Z$ is the normalization factor. This integral can be computed using

$$
x^n\, e^{-(x-x_0)^2/2\phi^2} \equiv e^{-x_0^2/2\phi^2}\,\phi^{2n}\Bigl(\frac{\partial}{\partial x_0}\Bigr)^{n}\Bigl[e^{x_0^2/2\phi^2}\, e^{-(x-x_0)^2/2\phi^2}\Bigr]. \qquad (C.2)
$$

The result is

$$
\langle x^n\rangle_{P_2} = \phi^{2n}\, e^{-x_0^2/2\phi^2}\Bigl(\frac{\partial}{\partial x_0}\Bigr)^{n} e^{x_0^2/2\phi^2}, \qquad (C.3)
$$
which yields $\langle x\rangle_{P_2} = x_0$, $\langle x^2\rangle_{P_2} = x_0^2(1+\phi^2/x_0^2)$, $\langle x^3\rangle_{P_2} = x_0^3(1+3\phi^2/x_0^2)$, and so forth. On the other hand, for the $P_1$ distribution we have

$$
\langle x^n\rangle_{P_1} = \frac{1}{2}\int_{-\infty}^{\infty} dx\, x^n\bigl[\delta(x-x_0+\phi) + \delta(x-x_0-\phi)\bigr] = x_0^n\Biggl[1 + \frac{n(n-1)}{2}\frac{\phi^2}{x_0^2}\Biggr] + O(\phi^4/x_0^4). \qquad (C.4)
$$
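The moment-matching claim can be illustrated numerically: for $\phi/x_0 \ll 1$, the low moments of the two-point distribution $P_1$ agree with the gaussian moments up to $O(\phi^4/x_0^4)$. A sketch with parameters of our own choosing; the gaussian moments $\langle x\rangle = x_0$, $\langle x^2\rangle = x_0^2 + \phi^2$, $\langle x^3\rangle = x_0^3 + 3x_0\phi^2$ follow from C.3:

```python
# n-th moment under P1 = (1/2)[delta(x - x0 + phi) + delta(x - x0 - phi)]
def moment_p1(n, x0, phi):
    return 0.5 * ((x0 - phi) ** n + (x0 + phi) ** n)

# Exact gaussian moments from equation C.3 for n = 1, 2, 3
def moment_gauss(n, x0, phi):
    return {1: x0, 2: x0**2 + phi**2, 3: x0**3 + 3 * x0 * phi**2}[n]

x0, phi = 10.0, 0.5          # weak disorder: phi / x0 = 0.05
for n in (1, 2, 3):
    print(n, moment_p1(n, x0, phi), moment_gauss(n, x0, phi))
# the first discrepancy appears at n = 4, where the two differ by 2*phi**4
print(moment_p1(4, x0, phi) - (x0**4 + 6 * x0**2 * phi**2 + 3 * phi**4))
```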
Thus, both probability distributions give the same results in the limit $\phi/x_0 \ll 1$, which in our case corresponds to the limit $\phi N/R \ll 1$. This is the reason
that we have used the simple $P_1$ distribution with delta functions instead of the more complicated gaussian distribution. In that way we avoided complex multidimensional integrals in equation 3.7. Notice that our proof is valid for a random variable $x$ assuming both positive and negative values. One can also show that the above conclusion is valid for a random variable defined only on the positive interval $0 < x < \infty$, although the calculations in this case are more tedious.

Appendix D
In this appendix we show how to derive the average value of any function when the probability distribution is given by equations 3.5 and 3.6. We also show how to obtain the average numbers of inputs $k_l$ and $n_l$ needed for crossing the zero potential value and the threshold, respectively. For an arbitrary function $H(\Lambda_{l,l(1)},\ldots,\Lambda_{l,l(k)};\{\Delta_l\})$ we want to calculate its average value $\langle H\rangle$, that is, carry out the multidimensional integrals in equation 3.7. We obtain

$$
\langle H\rangle = \frac{1}{2^k\cosh^k(\phi/\sigma)} \sum_{\alpha_1,\ldots,\alpha_k=1}^{2} H\Bigl(\frac{R}{N}+(-1)^{\alpha_1}\phi,\ldots,\frac{kR}{N}+(-1)^{\alpha_k}\phi;\{\Delta_l\}\Bigr) \exp\Biggl(-\sum_{j=1}^{k}(-1)^{\alpha_j}\frac{\phi}{\sigma}\Biggr), \qquad (D.1)
$$
where the prefactor comes from the normalization of the probability distribution $\rho(\{\Lambda\})$. Using this formula and the facts that $t_{l(k)} = t_{l(k)}^E + \Lambda_{l,l(k)}$ and $t_k^E - t_1^E \equiv \Delta_{k-1}$, we can easily write formulas for $\langle W^I\rangle$ and $\langle W^E\rangle$ based on equations 3.3 and 3.4. Their forms are as follows:

$$
\begin{aligned}
\langle W_l^I\rangle = {} & \langle t_{l(n_l)}^E\rangle - \langle t_l^E\rangle + \frac{1}{2\cosh(\phi/\sigma)}\sum_{\alpha=1}^{2}\Bigl[\frac{\langle n_l\rangle R}{N}+(-1)^{\alpha}\phi\Bigr] e^{-(-1)^{\alpha}\phi/\sigma} \\
& - \frac{\tau_m^i}{2^{n_l}\cosh^{n_l}(\phi/\sigma)} \sum_{\alpha_1,\ldots,\alpha_{n_l}=1}^{2} \exp\Biggl(-\sum_{j=1}^{n_l}(-1)^{\alpha_j}\frac{\phi}{\sigma}\Biggr) \\
& \times \ln\Biggl(\frac{b}{v_{th}}\exp\bigl[(n_l R/N + (-1)^{\alpha_{n_l}}\phi - \delta t_l + \Delta_{l+n_l-1} - \Delta_{l-1})/\tau_m^i\bigr] \\
& \quad + \frac{a}{v_{th}}\sum_{i=k_l}^{n_l}\exp\bigl[((n_l-i)R/N + [(-1)^{\alpha_{n_l}} - (-1)^{\alpha_i}]\phi + \Delta_{l+n_l-1} - \Delta_{l+i-1})/\tau_m^i\bigr]\Biggr),
\end{aligned} \qquad (D.2)
$$
and

$$
\langle W_l^E\rangle = \frac{\tau_s^i}{2^{n_l}\cosh^{n_l}(\phi/\sigma)} \sum_{\alpha_1,\ldots,\alpha_{n_l}=1}^{2} \exp\Biggl(-\sum_{j=1}^{n_l}(-1)^{\alpha_j}\frac{\phi}{\sigma}\Biggr)\, \ln\biggl[\frac{g}{c}\Bigl(1+\exp\bigl(-W_l^I\bigl(\tfrac{R}{N}+(-1)^{\alpha_1}\phi,\ldots,\tfrac{n_l R}{N}+(-1)^{\alpha_{n_l}}\phi;\{\Delta\}\bigr)/\tau_s^i\bigr)\Bigr)\biggr], \qquad (D.3)
$$
where $\delta t_l$ is given by equation A.2. We can also write the sum $\langle W^I\rangle + \langle W^E\rangle = \bigl\langle \tau_s^i \ln\bigl[\frac{g}{c}\bigl(1+\exp(W_l^I/\tau_s^i)\bigr)\bigr]\bigr\rangle$, which is equation 3.11 in the text. Note that $\langle W^E\rangle$ and $\langle W^I\rangle + \langle W^E\rangle$ differ only by the sign of $W^I$ in the logarithmic term; compare equations D.3 and 3.11. We can also derive the average values of $k_l$ and $n_l$, defined below, in the synchronous state using equation D.1. From equation A.1, we obtain

$$
\langle v(t_{l(k_l)})\rangle = v_0\bigl\langle e^{-\Lambda_{l,l(k_l)}/\tau_m^i}\bigr\rangle + a\sum_{i=1}^{k_l}\bigl\langle e^{-(\Lambda_{l,l(k_l)}-\Lambda_{l,l(i)})/\tau_m^i}\bigr\rangle \qquad (D.4)
$$
for the synchronous state with $\Delta_1 = \cdots = \Delta_{N-1} = 0$. In deriving equation D.4, we again have used the relation $t_{l(k)} = t_{l(k)}^E + \Lambda_{l,l(k)}$. Invoking equation D.1 yields

$$
\bigl\langle e^{-\Lambda_{l,l(k_l)}/\tau_m^i}\bigr\rangle = \frac{\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)}{\cosh\frac{\phi}{\sigma}}\,\bigl\langle e^{-k_l R/N\tau_m^i}\bigr\rangle \qquad (D.5)
$$

and

$$
\bigl\langle e^{-(\Lambda_{l,l(k_l)}-\Lambda_{l,l(i)})/\tau_m^i}\bigr\rangle = \frac{\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}{\cosh^2\frac{\phi}{\sigma}}\,\bigl\langle e^{-(k_l-i)R/N\tau_m^i}\bigr\rangle. \qquad (D.6)
$$
The condition $\langle v(t_{l(k_l)})\rangle = 0$ determines the average value of $k_l$, denoted by $k$, which we define as $k = (N\tau_m^i/R)\ln[\langle\exp(k_l R/N\tau_m^i)\rangle]$. The formula for $k$ takes the form

$$
k = \frac{N\tau_m^i}{R}\ln\Biggl(\frac{\bigl[|v_0|(1-e^{-R/N\tau_m^i})\cosh\frac{\phi}{\sigma} + a\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)\bigr]\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)}{a\bigl[\cosh^2\frac{\phi}{\sigma} + e^{-R/N\tau_m^i}\sinh^2\frac{\phi}{\tau_m^i}\bigr]}\Biggr). \qquad (D.7)
$$
In a similar manner we obtain the average value of $n_l$, denoted by $n$, from the condition $\langle v(t_{l(n_l)})\rangle = v_{th}$. Our $n$, defined as $n = (N\tau_m^i/R)\ln[\langle\exp(n_l R/N\tau_m^i)\rangle]$, takes the form

$$
n = \frac{N\tau_m^i}{R}\ln\Biggl(\frac{(v_{th}-a)(1-e^{-R/N\tau_m^i})\cosh^2\frac{\phi}{\sigma} + a\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}{\bigl[b+(a-b)e^{-R/N\tau_m^i}\bigr]\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)} \times \frac{|v_0|(1-e^{-R/N\tau_m^i})\cosh\frac{\phi}{\sigma} + a\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}{a\bigl[\cosh^2\frac{\phi}{\sigma} + e^{-R/N\tau_m^i}\sinh^2\frac{\phi}{\tau_m^i}\bigr]}\Biggr). \qquad (D.8)
$$
From the average values $k$ and $n$, we can compute $\langle W_l^I(\{\Delta = 0\})\rangle$ of a synchronous state, depending only on the parameters of the network architecture. This function plays an important role in the discussion of stability. Making use of the fact that the fourth, logarithmic term in equation D.2 is small (cf. appendix A), we can neglect it. Thus, in the synchronous state, equation D.2 has the approximate form

$$
\langle W_l^I\rangle \approx \frac{\langle n_l\rangle R}{N} - \phi\tanh\frac{\phi}{\sigma}. \qquad (D.9)
$$
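A Monte Carlo sketch of the discrete average D.1 (our own illustrative check, with hypothetical function names): taking $H$ to be the arrival time of the $n$th input itself, each link delay is $nR/N \pm \phi$ with weight $e^{\mp\phi/\sigma}/(2\cosh(\phi/\sigma))$, and the sampled mean reproduces the closed form in D.9:

```python
import math, random

# Monte Carlo sketch of the average D.1 for the simplest case H = Lambda:
# each delay is n*R/N + phi or n*R/N - phi, weighted by
# exp(-+ phi/sigma) / (2*cosh(phi/sigma)); the mean reproduces D.9.
def avg_delay(n, R_over_N, phi, sigma, samples=200000, seed=0):
    rng = random.Random(seed)
    p_plus = math.exp(-phi / sigma) / (2 * math.cosh(phi / sigma))  # weight of +phi
    total = 0.0
    for _ in range(samples):
        sign = 1 if rng.random() < p_plus else -1
        total += n * R_over_N + sign * phi
    return total / samples

n, R_over_N, phi, sigma = 4, 1.0, 0.3, 0.5
mc = avg_delay(n, R_over_N, phi, sigma)
exact = n * R_over_N - phi * math.tanh(phi / sigma)   # equation D.9
print(mc, exact)   # the Monte Carlo estimate approaches the closed form
```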
We expect that for large $n_l$, its higher-order cumulants are small, so the relation $\langle n_l\rangle \approx n$ holds. Using this fact, we can substitute $n$ for $\langle n_l\rangle$ in the above equation. We obtain

$$
\langle W^I\rangle \approx \tau_m^i\ln\Biggl(\frac{(v_{th}-a)(1-e^{-R/N\tau_m^i})\cosh^2\frac{\phi}{\sigma} + a\cosh\bigl(\frac{\phi}{\sigma}+\frac{\phi}{\tau_m^i}\bigr)\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}{\bigl[b+(a-b)e^{-R/N\tau_m^i}\bigr]\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)} \times \frac{|v_0|(1-e^{-R/N\tau_m^i})\cosh\frac{\phi}{\sigma} + a\cosh\bigl(\frac{\phi}{\sigma}-\frac{\phi}{\tau_m^i}\bigr)}{a\bigl[\cosh^2\frac{\phi}{\sigma} + e^{-R/N\tau_m^i}\sinh^2\frac{\phi}{\tau_m^i}\bigr]}\Biggr) - \phi\tanh\frac{\phi}{\sigma}. \qquad (D.10)
$$

Note that $\langle W^I\rangle$ is of the order of the membrane time constant $\tau_m^i$, and additionally that $\langle W^I\rangle$ can be arbitrarily small provided the disorder parameter $\phi$ is sufficiently large.
Appendix E
In this appendix we sketch how to compute the stability diagrams depicted in Figure 2, as well as how to derive equation 3.13. The formulas below are provided only in the regime $\tau_m^e/\tau_s^i \ll 1$. In the opposite regime, $\tau_m^e/\tau_s^i \gg 1$, the formulas are the same with the substitution $\tau_s^i \mapsto \tau_m^e$. According to the Gershgorin theorem (Horn & Johnson, 1985), the $l$th eigenvalue $\lambda_l$ of the stability matrix $A$ lies in a disk such that

$$
|\lambda_l - A_{l,l}| \le \sum_{m\neq l} |A_{l,m}|, \qquad (E.1)
$$

where $A_{l,m} \equiv \partial\bar{\Delta}_l/\partial\Delta_m|_{\Delta=0}$. Using the inequality $|\lambda_l - A_{l,l}| \ge |\lambda_l| - |A_{l,l}|$, we obtain $|\lambda_l| \le \sum_{m=1}^{N-1} |A_{l,m}|$. The system is stable when $|\lambda_l| < 1$ for every $l$. Thus, a sufficient condition for stability is

$$
\sum_{m=1}^{N-1} |A_{l,m}| < 1 \qquad (E.2)
$$
for every $l = 1, \ldots, N-1$. After some algebra, the sum in equation E.2 can be found to be

$$
\sum_{m=1}^{N-1} |A_{l,m}| = \frac{2}{1+e^{-nR/N\tau_s^i}}\Bigl(1 + \frac{b|v_0|}{av_{th}}\bigl\langle e^{[\Lambda_{l,l(n_l)}-2\Lambda_{l,l(k_l)}]/\tau_m^i}\bigr\rangle - \frac{b}{v_{th}}\bigl\langle e^{[\Lambda_{l,l(n_l)}-\Lambda_{l,l(k_l)}]/\tau_m^i}\bigr\rangle\Bigr) + \frac{1}{2}\Bigl|\frac{b}{v_{th}}\bigl\langle e^{[\Lambda_{l,l(n_l)}-\Lambda_{l,l(k_l)}]/\tau_m^i}\bigr\rangle - e^{-nR/N\tau_s^i}\Bigr|. \qquad (E.3)
$$
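The Gershgorin estimate behind E.1 and E.2 can be illustrated on a small matrix whose eigenvalues are known in closed form (the matrix entries below are illustrative, not derived from the model):

```python
# Gershgorin bound: every eigenvalue of A lies within distance
# sum_{m != l} |A[l][m]| of some diagonal entry A[l][l], hence
# |lambda| <= max_l sum_m |A[l][m]|; if that row sum is < 1, the
# linearized map is stable.
A = [[0.2, 0.3],
     [0.3, 0.2]]
# symmetric 2x2 matrix [[d, o], [o, d]]: eigenvalues are d + o and d - o
eigenvalues = [A[0][0] + A[0][1], A[0][0] - A[0][1]]
row_bound = max(sum(abs(v) for v in row) for row in A)
print(eigenvalues, row_bound)   # eigenvalues 0.5 and -0.1, bound 0.5 < 1
```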
Next, invoking equations D.1, D.5, and D.6, we can find the averages in the above equation. Combining all this, equation E.2 decomposes into two complementary cases, each described by a pair of inequalities, owing to the modulus present in equation E.3. We discuss the first case in detail and comment later on the complementary second case. The inequalities for the first case are given by

\[
e^{(n-k)R/N\tau_m^i} >
\frac{v_{th}\cosh^2\frac{\phi}{s}}
{b\cosh\!\left(\frac{\phi}{s}+\frac{\phi}{\tau_m^i}\right)\cosh\!\left(\frac{\phi}{s}-\frac{\phi}{\tau_m^i}\right)}\,
e^{-nR/N\tau_s^i} \tag{E.4}
\]
and

\[
e^{nR/N\tau_s^i}\left[
1 + 2\,\frac{b|v_0|}{a v_{th}}\,
\frac{\cosh\!\left(\frac{\phi}{s}+\frac{2\phi}{\tau_m^i}\right)\cosh\!\left(\frac{\phi}{s}-\frac{\phi}{\tau_m^i}\right)}{\cosh^2\frac{\phi}{s}}\,
e^{(n-2k)R/N\tau_m^i}
- \frac{b}{v_{th}}\,
\frac{\cosh\!\left(\frac{\phi}{s}+\frac{\phi}{\tau_m^i}\right)\cosh\!\left(\frac{\phi}{s}-\frac{\phi}{\tau_m^i}\right)}{\cosh^2\frac{\phi}{s}}\,
e^{(n-k)R/N\tau_m^i}
\right] < 2, \tag{E.5}
\]
Multispikes and Synchronization for Networks with Delays
where n and k are given by equations D.7 and D.8. Inequalities E.4 and E.5 must be satisfied simultaneously. Notice that after inserting the expressions for k and n, exp(nR/Nτ_s^i) becomes a function of the ratios τ_m^i/τ_s^i and R/Nτ_m^i, and exp(nR/Nτ_m^i) becomes a function of the ratio R/Nτ_m^i for given values of the other parameters. This means that we can express a sufficient condition for stability in terms of τ_m^i/τ_s^i and R/Nτ_m^i. In the ordered network case, with φ = 0, equations E.4 and E.5 reduce to

\[
\tau_m^i/\tau_s^i > \frac{1}{K}\ln\!\left(
\frac{v_{th}\left[b+(a-b)e^{-R/N\tau_m^i}\right]}
{b\left[(v_{th}-a)\left(1-e^{-R/N\tau_m^i}\right)+a\right]}
\right) \tag{E.6}
\]
and

\[
\tau_m^i/\tau_s^i < -\frac{1}{K}\ln\!\left(
\frac{1}{2}\left[1 +
\frac{b\left[(v_{th}-a)\left(1-e^{-R/N\tau_m^i}\right)+a\right]}
{v_{th}\left[b+(a-b)e^{-R/N\tau_m^i}\right]}
\times
\frac{|v_0|\left(1+e^{-R/N\tau_m^i}\right)-a}
{|v_0|\left(1-e^{-R/N\tau_m^i}\right)+a}
\right]\right), \tag{E.7}
\]
with

\[
K = \ln\!\left(
\frac{\left[(v_{th}-a)\left(1-e^{-R/N\tau_m^i}\right)+a\right]\left[|v_0|\left(1-e^{-R/N\tau_m^i}\right)+a\right]}
{a\left[b+(a-b)e^{-R/N\tau_m^i}\right]}
\right). \tag{E.8}
\]
The two curves given by the right-hand sides of equations E.6 and E.7 cross at the point with coordinates

\[
R/N\tau_m^i \approx \ln\!\left[(a+2b)|v_0| / 2ab\right] \tag{E.9}
\]

and

\[
\tau_m^i/\tau_s^i \approx -\ln\!\left[1 - 2a^2/(a+2b)|v_0|\right] / \ln\!\left(v_{th}|v_0| / ab\right), \tag{E.10}
\]
denoted as x0 and y0 in Figure 2, respectively. Above this value of R/Nτ_m^i, the set of inequalities E.6 and E.7 is satisfied. The complementary case differs from the first by inverting the inequality sign in equation E.4, multiplying the third term in the brackets in equation E.5 by 3, and setting the right-hand side of that inequality equal to zero. In this case we also obtain a stability region for R/Nτ_m^i greater than the value given by equation E.9. Combining these two cases, we find the stability region to be the shaded area in Figure 2A. Notice that the ratio τ_m^i/τ_s^i must be small in order to preserve stability.
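The crossing point (x0, y0) follows directly from equations E.9 and E.10. A quick numerical sketch, with placeholder parameter values (a, b, v0, vth below are illustrative, not taken from the paper):

```python
import math

# Illustrative parameter values (placeholders, not from the paper).
a, b, v0, vth = 0.5, 4.0, 2.0, 1.5

# Equation E.9: x0 = R/(N * tau_m^i) at the crossing of E.6 and E.7.
x0 = math.log((a + 2 * b) * abs(v0) / (2 * a * b))

# Equation E.10: y0 = tau_m^i / tau_s^i at the same point.
y0 = (-math.log(1.0 - 2.0 * a**2 / ((a + 2 * b) * abs(v0)))
      / math.log(vth * abs(v0) / (a * b)))

# For these values the crossing lies in the physical quadrant and y0 is
# small, consistent with the requirement that tau_m^i / tau_s^i be small.
assert x0 > 0 and 0 < y0 < 1
```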
The disordered network case with φ > 0 is algebraically more complicated to analyze. In this case, there also exists some minimum value of R/Nτ_m^i, of the order of that in equation E.9 for weak disorder, below which the doublet configuration is unstable. The disordered case differs from the ordered case by the appearance of a vertical asymptote in Figure 2B, which crosses the horizontal axis at (R/Nτ_m^i)_cr given by equation 3.13. The asymptote appears because the term in the bracket in equation E.5 can now change sign. For small R/Nτ_m^i, the bracket term is positive, whereas for R/Nτ_m^i → ∞, it is negative. This does not happen for the ordered network, where the bracket term is always positive. When the bracket term is negative, inequality E.5 is satisfied for every τ_m^i/τ_s^i (recall that this ratio enters through the exp(nR/Nτ_s^i) term). Setting the bracket term equal to zero and substituting for n and k the values given by equations D.7 and D.8 determines the vertical asymptote in Figure 2B. In the limit R/Nτ_m^i ≫ φ/τ_m^i ≫ 1, we obtain the (R/Nτ_m^i)_cr given by equation 3.13. Notice that for the ordered network with φ = 0, the asymptote shifts to infinity (that is, it disappears), which corresponds to a decrease of the stability region. For the triplet configuration, analysis shows that in the ordered network case, condition E.2 cannot be satisfied for any τ_m^i/τ_s^i and R/Nτ_m^i. In the disordered network case, we can show that there is some stability region for R/Nτ_m^i ≫ 1.

Acknowledgments
We thank B. Ermentrout, C. Chow, and particularly R. Traub for discussions. The work was supported by NSF DMS 9706694 and NIMH R01 MH47150. J. K. thanks the Fulbright Foundation for partial support.

References

Abbott, L. F., Sen, K., Varela, J. A., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Physical Review E, 48, 1483–1490.
Ahissar, E., Vaadia, E., Ahissar, M., Bergman, H., Arieli, A., & Abeles, M. (1992). Dependence of cortical plasticity on correlated activity of single neurons and on behavioral context. Science, 257, 1412–1415.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Spin-glass models of neural networks. Physical Review A, 32, 1007–1018.
Andersen, P., Silfvenius, H., Sundberg, S., Sveen, O., & Wigström, H. (1978). Functional characteristics of unmyelinated fibres in the hippocampal cortex. Brain Res., 144, 11–18.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–34.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillations in the hippocampus of the behaving rat. J. Neurosci., 15(1), 47–60.
Bressler, S. L., Coppola, R., & Nakamura, R. (1993). Episodic multiregional cortical coherence at multiple frequencies during visual task performance. Nature, 366, 153–156.
Buhl, E. H., Halasy, K., & Somogyi, P. (1994). Diverse sources of hippocampal unitary inhibitory postsynaptic potentials and the number of synaptic release sites. Nature, 368, 823–828.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns of realistic network models. J. Comp. Neurosci., 3, 91–110.
Campbell, S. R., & Wang, D. (1998). Relaxation oscillators with time delay coupling. Preprint.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comp. Neurosci., 5, 407–420.
Crook, S. M., Ermentrout, G. B., Vanier, M. C., & Bower, J. M. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. J. Comp. Neurosci., 4, 161–172.
Engel, A. K., König, P., & Singer, W. (1991). Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. USA, 88, 9136–9140.
Engel, A. K., Kreiter, A. K., König, P., & Singer, W. (1991). Synchronization of oscillatory neuronal responses between striate and extrastriate visual cortical areas of the cat. Proc. Natl. Acad. Sci. USA, 88, 6048–6052.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves and synchrony. Neural Comp., 8, 979–1002.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Physical Review Lett., 74, 1570–1573.
Freund, T. F., & Buzsáki, G. (1996). Interneurons of the hippocampus. Hippocampus, 6, 347–470.
Fries, P., Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in interocular rivalry. Proc. Natl. Acad. Sci. USA, 94, 12699–12704.
Friesen, W. (1994). Reciprocal inhibition, a mechanism underlying oscillatory animal movements. Neurosci. Behavior, 18, 547–553.
Gerstner, W. (1995). Time structure of the activity in neural network models. Physical Review E, 51, 738–758.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Horn, R. A., & Johnson, C. A. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Jefferys, J. G. R., Traub, R. D., & Whittington, M. A. (1996). Neuronal networks for induced "40 Hz" rhythms. Trends Neurosci., 19, 202–208.
Joliot, M., Ribary, U., & Llinás, R. (1994). Human oscillatory brain activity near 40 Hz coexists with cognitive temporal binding. Proc. Natl. Acad. Sci. USA, 91, 11748–11751.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knowles, W., & Schwartzkroin, P. (1981). Local circuit synaptic interactions in hippocampal brain slices. J. Neurosci., 1, 318–322.
Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–210.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
König, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp., 3, 155–166.
Llinás, R., & Ribary, U. (1993). Coherent 40-Hz oscillation characterizes dream state in humans. Proc. Natl. Acad. Sci. USA, 90, 2078–2081.
Lytton, W., & Sejnowski, T. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory neurons. J. Neurophysiol., 66, 1059–1097.
Magee, J. C., & Johnston, D. (1997).
A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5, 115–133.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Steriade, M., Amzica, F., & Contreras, D. (1996). Synchronization of fast (30–40 Hz) spontaneous cortical rhythms during brain activation. J. Neurosci., 16(1), 392–417.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Tamamaki, N., & Nojyo, T. (1990). Disposition of the slab-like modules formed by axon branches originating from single CA1 pyramidal neurons in the rat hippocampus. J. Comp. Neurol., 291, 509–519.
Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275.
Terman, D., & Rubin, J. (1998). Geometric analysis of population rhythms in synaptically coupled neuronal networks. Neural Comput. (in press).
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Traub, R. D., & Miles, R. (1995). Pyramidal cell-to-inhibitory cell spike transduction explicable by active dendritic conductances in inhibitory cell. J. Comp. Neurosci., 2, 291–298.
Traub, R. D., Whittington, M. A., Buhl, E. H., Jefferys, J. G. R., & Faulkner, H. J. (1999). On the mechanism of the gamma-beta frequency shift in neuronal oscillations induced in rat hippocampal slices by tetanic stimulation. J. Neurosci., 19, 1088–1105.
Traub, R. D., Whittington, M. A., Colling, S., Buzsáki, G., & Jefferys, J. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Traub, R. D., Whittington, M. A., Stanford, M., & Jefferys, J. G. R. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Physical Review Lett., 71, 1280–1283.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Vreeswijk, C. (1996). Partial synchrony in populations of pulse-coupled oscillators. Physical Review E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neuronal firing. J. Comp. Neurosci., 1, 313–321.
Wang, X. J., & Buzsáki, G. (1996). Gamma oscillations by synaptic inhibition in an interneuronal network. J. Neurosci., 16, 6402–6413.
Wang, X. J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp., 4, 84–97.
Wang, X. J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
White, J., Chow, C., Ritt, J., Soto, C., & Kopell, N. (1998). Synchronous oscillation in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16.
Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G. R., & Traub, R. D. (1997). Spatiotemporal patterns of γ frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol., 502, 591–607.
Whittington, M. A., Traub, R. D., Faulkner, H. J., Jefferys, J. G. R., & Chettiar, K. (1998). Morphine disrupts long-range synchrony of gamma oscillations in hippocampal slices. Proc. Natl. Acad. Sci. USA, 95, 5807–5811.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received January 11, 1999; accepted September 10, 1999.
LETTER
Communicated by Carl van Vreeswijk
Synchrony in Heterogeneous Networks of Spiking Neurons

L. Neltner, D. Hansel
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France

G. Mato
Comisión Nacional de Energía Atómica and CONICET, Centro Atómico Bariloche and Instituto Balseiro (CNEA and UNC), 8400 San Carlos de Bariloche, R.N., Argentina

C. Meunier
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France

The emergence of synchrony in the activity of large, heterogeneous networks of spiking neurons is investigated. We define the robustness of synchrony by the critical disorder at which the asynchronous state becomes linearly unstable. We show that at low firing rates, synchrony is more robust in excitatory networks than in inhibitory networks, but excitatory networks cannot display any synchrony when the average firing rate becomes too high. We introduce a new regime where all inputs, external and internal, are strong and have opposite effects that cancel each other when averaged. In this regime, the robustness of synchrony is strongly enhanced, and robust synchrony can be achieved at a high firing rate in inhibitory networks. On the other hand, in excitatory networks, synchrony remains limited in frequency due to the intrinsic instability of strong recurrent excitation.

1 Introduction
Synchrony in large-scale neural systems is evidenced by significant fluctuations in time of the ensemble activity. Complementary experimental techniques (EEG, magnetoencephalography, recording of local field potentials, and multiunit and multielectrode recordings) have shown it to be a ubiquitous feature of neural activity in the central nervous system of vertebrates. It occurs, for instance, in the limbic, motor, olfactory, and visual systems. Different manifestations of synchrony have been shown to be associated with specific physiological states. Synchrony is also observed in pathological conditions (e.g., epileptic seizures, Parkinson's disease) (Gray, 1994; Ritz & Sejnowski, 1997). Recent studies have largely focused on spike-to-spike synchrony and strong oscillatory activity, but the concept of synchrony actually covers a

Neural Computation 12, 1607–1641 (2000) © 2000 Massachusetts Institute of Technology
very large span of phenomena, as is evidenced by a few examples. During epileptic seizures, hippocampal pyramidal cells may exhibit global low-frequency synchronous bursting. In the same structure, high-frequency episodes, at about 200 Hz, are observed during sharp-wave events (Buzsáki & Chrobak, 1995). Oscillatory synchrony, involving neural assemblies that evolve in time, was demonstrated in the olfactory system of insects (Laurent, 1996) and shown to play a role in odor recognition. Indeed, disrupting synchrony in the antennal lobe through GABA_A antagonists impedes the discrimination between similar odors (Stopfer, Bhagavan, Smith, & Laurent, 1997). This remains the most convincing evidence on the function of synchrony in neural systems. In the past 10 years, following the pioneering work of Gray and Singer (1989) and Eckhorn et al. (1988), many experiments have been devoted to the investigation of stimulus-induced oscillatory activity in the visual system, from the retina to the primary visual cortex. It has been suggested that the γ rhythm might play an important function in the processing of visual stimuli by providing a flexible way to implement the concepts of feature binding and segmentation (for a review, see Gray, 1994), but the functional significance of synchrony in this framework remains a subject of debate (Gray, 1994). A fundamental issue is to understand when synchrony can emerge and how cell properties, synaptic interactions, and network architecture interact to determine the nature of synchronous states in a large neural network. Much theoretical work in the past few years has focused on the emergence of synchrony through the collective dynamics of neural networks. In this setting, the correlations of pairs of neurons originate neither from the direct influence of one neuron on another nor from a common external input to the system. Rather, they stem from the common synaptic input generated by the very activity of the network (Hansel & Sompolinsky, 1996).
In this mean-field framework, the respective roles of synaptic interactions and cell properties in building synchrony were largely unraveled by studying networks of integrate-and-fire neurons (Abbott & van Vreeswijk, 1993; Gerstner, van Hemmen, & Cowan, 1996) and by analyzing the dynamics of conductance-based models in the weak coupling limit (Hansel, Mato, & Meunier, 1993a, 1993b). In this limit, the collective behavior of the system is determined only by the linear response of the neurons to small depolarizations and the nature of the interaction between neurons. It was thus shown, for a large class of homogeneous neural systems, that slow excitatory interactions could not lead to full synchrony, while inhibitory interactions did (van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995). These systems are characterized by the fact that the neurons respond to a brief depolarization by advancing the firing of their next spike, with a phase excitability curve that peaks near the end of the interspike interval. This type of response to perturbations is exhibited by integrate-and-fire models, as well as many conductance-based models, such as the
model of a hippocampal inhibitory interneuron introduced in Wang and Buzsáki (1996). It was experimentally demonstrated that regular spikers in slices of neocortex have such response properties (Reyes & Fetz, 1993a, 1993b). On the basis of these results, it was suggested that mutual inhibition between neurons might play an important role in the emergence of synchrony in large neuronal systems (van Vreeswijk et al., 1994; Hansel, Mato, & Meunier, 1993b; Wang & Buzsáki, 1996). This theoretical idea is supported by several experimental works. In particular, Whittington, Traub, and Jefferys (1995) have shown that a collective oscillation around 40 Hz emerges in hippocampal and neocortical slices due to fast inhibition between interneurons, when slow GABA_B inhibition and ionotropic glutamate excitation are pharmacologically blocked. Therefore, the rhythm observed in pyramidal cells could result from their entrainment by the inhibitory network (Cobb, Buhl, Halasy, Paulsen, & Somogyi, 1995). Most theoretical work on synchrony has focused on homogeneous or mildly heterogeneous neural systems. For heterogeneous systems of weakly coupled phase oscillators (Kuramoto, 1984), a transition to the asynchronous state is found at a value of the disorder proportional to the strength of the synaptic coupling. This suggests that increasing the coupling strength might help to synchronize heterogeneous systems. However, increasing the coupling also has adverse effects, such as silencing low-frequency neurons in inhibitory networks (suppression) or strongly increasing the firing rates of neurons in excitatory networks. Therefore, understanding how synchrony can be achieved in neural populations when the spread of the intrinsic properties of neurons is large (see Gustafsson & Pinter, 1984a, 1984b, for a thorough study of the spread in the intrinsic properties of lumbar α motoneurons in anesthetized cats) remains an open issue.
The dependence of synchrony on the synaptic time course in inhibitory systems was recently investigated by several authors (Golomb & Rinzel, 1993; Hansel & Mato, 1993; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, 1998). When inhibition is fast with respect to the typical firing rate of neurons (i.e., the phasic regime of White et al., 1998), the network settles in cluster states that are easily destabilized when heterogeneities are introduced. In contrast, when inhibition is slow, homogeneous systems display stable in-phase synchrony (White et al., 1998). Yet it was shown that when mild heterogeneities in the cell properties throughout the network are taken into account, a transition to asynchrony occurs in this tonic regime for disorder levels of the order of 5% (White et al., 1998). This is easily understood, since the modulation of the synaptic input is small in this regime. Thus, it seems a priori difficult to explain synchrony phenomena at high frequency on the basis of mutual inhibition. Accordingly, recent works on hippocampal high-frequency ripples have investigated the possible role of gap junctions in the emergence of synchrony (Draguhn, Traub, Schmitz, & Jefferys, 1998).
The goal of this article is to investigate the robustness of synchrony, that is, the maximal spread of cell properties compatible with synchrony, in the framework of fully coupled one-population networks. We define robustness as the value of the disorder at which the asynchronous state becomes linearly stable. For lesser values of the disorder, the system then evolves at large times to some synchronous state. This is a conservative definition of robustness; synchrony may still be observed at higher disorder if the stable asynchronous state coexists with some synchronous state (multistability). We especially address the question of which interactions, excitatory or inhibitory, are better able to synchronize neurons at a given firing rate. After reviewing the models of neurons and the methods we use to investigate network dynamics, we study the impact of heterogeneities on synchrony. The question is first addressed in the framework of phase-reduction techniques, where analytical results on the onset of synchrony can be derived. These results are extended to finite coupling by performing numerical simulations. We then study a strong coupling regime that has never been investigated by previous authors, where balancing the synaptic current experienced by the neurons with an external drive of opposite sign prevents the firing rates from drifting. Finally, we discuss our results and their implications.

2 Methods

2.1 Models. We analyze the collective behavior of large networks consisting of a single population of neurons, either excitatory or inhibitory. The neurons are coupled all-to-all with the same coupling strength throughout the network. All the neurons are identical, but we assume that the external input varies from neuron to neuron in such a way that the distribution of the firing rates is either gaussian or Lorentzian when the neurons are uncoupled. Firing frequencies of neurons must be positive.
Therefore, we use firing-rate distributions restricted to the interval between 0 and 2f̄, obtained by truncating the original gaussian or Lorentzian distributions symmetrically with respect to the average frequency, f̄, and modifying the normalization constant accordingly (see Figure 1). The standard deviation of the truncated distribution differs from the standard deviation of the original distribution. However, for a gaussian distribution, the relative difference between these two quantities is less than 1% as long as the standard deviation of the original distribution is less than 0.3f̄. Individual neurons are described by a single-compartment model: either a simple Lapicque integrate-and-fire (IF) model or a Hodgkin-Huxley-like model, namely the Wang-Buzsáki (WB) model, introduced in Wang and Buzsáki (1996). The WB model was designed as a minimal model of
Figure 1: Probability density of the firing frequency f. Symmetrically truncated gaussian distribution: σ = 0.3f̄ (full line) and σ = 0.5f̄ (dashed line). Symmetrically truncated Lorentzian distribution of half-width 0.5f̄ (dotted line).
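The truncation scheme of Figure 1 is easy to reproduce, and rejection sampling confirms the claim that for σ = 0.3f̄ the standard deviation of the truncated gaussian differs from σ by less than 1% (a sketch; f̄ = 40 Hz is an arbitrary choice, not a value from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
fbar = 40.0                 # arbitrary average rate (Hz)
sigma = 0.3 * fbar          # width of the parent gaussian

# Rejection-sample the gaussian truncated symmetrically on [0, 2*fbar].
draws = rng.normal(fbar, sigma, size=2_000_000)
truncated = draws[(draws > 0.0) & (draws < 2.0 * fbar)]

# Relative change of the standard deviation caused by the truncation:
# below 1% for sigma = 0.3 * fbar, as stated in the text.
rel_diff = abs(truncated.std() - sigma) / sigma
assert rel_diff < 0.01
```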
hippocampal inhibitory interneuron and presents a wide firing frequency range, from 0 to several hundred Hz. The membrane potential of the IF model evolves in the subthreshold regime according to the differential equation

\[
C\,\frac{dV}{dt} = I_\ell + I_{syn}(t) + I, \tag{2.1}
\]
where I_ℓ = g_ℓ(V_ℓ − V) is a passive leak current of conductance g_ℓ and reversal potential V_ℓ, and I_syn is the synaptic current due to the action of the other neurons of the network. The current I is a constant external drive. The membrane capacitance C is set to 1 μF/cm². Requiring a passive membrane time constant τ = C/g_ℓ of 10 ms leads to a leak conductance g_ℓ = 0.1 mS/cm². The resting membrane potential V_rest = V_ℓ is equal to −60 mV in all our simulations. Whenever the membrane potential, V, reaches the threshold θ (−40 mV in our simulations), a spike is fired and V is instantaneously reset to the resting level. In the WB model, the membrane potential evolves according to

\[
C\,\frac{dV}{dt} = I_{Na}(t) + I_K(t) + I_\ell(t) + I_{syn}(t) + I, \tag{2.2}
\]
where I_Na and I_K are voltage-dependent sodium and potassium currents. Their kinetics, as well as the passive parameters of the model, are the same as in Wang and Buzsáki (1996) and are recalled in appendix A. In both cases (IF and WB models), the dynamics of the networks are integrated using an adaptive second-order Runge-Kutta algorithm, as explained in Hansel, Mato, Neltner, and Meunier (1998). The synaptic current induced by a given presynaptic neuron in a postsynaptic neuron is modeled as

\[
I_{syn}(t) = g \sum_{spikes} f(t - t_{spike}), \tag{2.3}
\]
where the summation is performed over all the spikes emitted prior to time t by the presynaptic neuron. Each synaptic activation then induces a single event, described by the function

\[
f(t) = \frac{1}{\tau_1 - \tau_2}\left(e^{-t/\tau_1} - e^{-t/\tau_2}\right) H(t), \tag{2.4}
\]
where H(t) is the Heaviside function. The larger of the two time constants, τ₁ and τ₂, characterizes the rate of exponential decay of the synaptic potentials, while their time to peak is t_peak = (τ₁τ₂/(τ₁ − τ₂)) ln(τ₁/τ₂). The normalization adopted here ensures that the time integral of f(t) is always unity. The coupling strength g is then proportional to the maximal synaptic conductance but not equal to it; it is actually the total charge transferred during a single synaptic event. In a fully connected network of size N, g must scale as 1/N, so that the total synaptic input remains finite in the limit N → +∞. Inhibition corresponds to a negative coupling, g < 0, and excitation to a positive coupling, g > 0. Our description of the synaptic coupling does not take into account the shunting effect of the synapses. However, we have checked that the conclusions of our study still hold when synapses are described in terms of conductances.

2.2 Synchrony. We measure the degree of synchrony as proposed and developed in Hansel and Sompolinsky (1992), Golomb and Rinzel (1994), and Ginzburg and Sompolinsky (1994): by analyzing the temporal fluctuations of the network activity. One evaluates the average membrane potential in the network at a given time t,
\[
A_N(t) = \frac{1}{N}\sum_{i=1}^{N} V_i(t). \tag{2.5}
\]

Its fluctuations in time can be characterized by the variance

\[
\Delta_N = \bigl\langle A_N(t)^2 \bigr\rangle_t - \bigl\langle A_N(t) \bigr\rangle_t^2. \tag{2.6}
\]
Synchrony in Heterogeneous Networks
1613
When this variance is normalized by the population-averaged variance of single-cell activity,

\[
D = \frac{1}{N}\sum_{i=1}^{N}\left(\bigl\langle V_i(t)^2 \bigr\rangle_t - \bigl\langle V_i(t) \bigr\rangle_t^2\right), \tag{2.7}
\]

the resulting parameter,

\[
S_N = \frac{\Delta_N}{D}, \tag{2.8}
\]

generally behaves for large N as

\[
S_N = \chi + \frac{a}{N} + O\!\left(\frac{1}{N^2}\right). \tag{2.9}
\]
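Equations 2.5 through 2.8 translate line by line into code. The sketch below evaluates S_N on surrogate data: identical sinusoidal "membrane potentials" give S_N = 1, while sinusoids with random phases give S_N of order 1/N (the surrogate signals are illustrative, not the models of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def S_N(V):
    """Coherence measure of eqs. 2.5-2.8 for V of shape (N, T)."""
    A = V.mean(axis=0)              # eq. 2.5: population-averaged potential
    delta_N = A.var()               # eq. 2.6: variance of the average
    D = V.var(axis=1).mean()        # eq. 2.7: average single-cell variance
    return delta_N / D              # eq. 2.8

N, T = 512, 4096
t = np.arange(T)
base = np.sin(2 * np.pi * t / 64.0)

sync = base * np.ones((N, 1))                       # identical traces
phases = rng.uniform(0.0, 2.0 * np.pi, size=(N, 1))
asyn = np.sin(2 * np.pi * t / 64.0 + phases)        # random phases

assert abs(S_N(sync) - 1.0) < 1e-9     # fully synchronous: S_N = 1
assert S_N(asyn) < 0.05                # asynchronous: S_N ~ 1/N
```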
The constant χ, between 0 and 1, measures the degree of coherence in the system in the infinite-size limit, while a yields an estimate of finite-size effects. The coherence parameter χ is equal to 1 if the system is totally synchronous (i.e., V_i(t) = V(t) for all i) and vanishes if the state of the system is asynchronous in the large-N limit. This definition of coherence encompasses all possible forms of synchrony in the network. In practice, χ was estimated by calculating S_N for large networks (N ranging from 128 to 4096) and checking that finite-size effects were small (in some cases, we have checked for finite-size effects for network sizes up to N = 65,536).

2.3 Phase Models. The dynamics of a conductance-based neuron involves several variables, including the membrane potential and gating variables. However, if all the neurons in the network fire periodically when they are uncoupled, with their firing rates, f_i, lying in a narrow range, and if they are weakly coupled together, the dynamics of each neuron in the network can be described by a single angle variable, φ_i, varying on [0, 2π]. This variable is the phase of the neuron along its unperturbed limit cycle (i.e., when decoupled from the network). Thanks to this reduction (Hansel et al., 1995), the network dynamics is then described by the simpler differential system

\[
\frac{d\varphi_i}{dt} = 2\pi f_i + \frac{g}{N}\sum_{j=1}^{N} \Gamma(\varphi_i - \varphi_j), \tag{2.10}
\]
where g is the coupling strength and Γ is the effective interaction between any two neurons. This interaction function is obtained by convolving the synaptic current generated by the presynaptic neuron, j, with the "response
function” Z of the target neuron, i, over one period of the single-neuron dynamics, C(w i ¡ w j ) D
1 2p
Z
2p 0
Z( u )Isyn( u C w j ¡ w i )du.
(2.11)
The function Z is nothing more than the linear phase response of the neuron to very brief current pulses. It indicates the proportionality between the amplitude of a small current pulse and the phase advance (or delay) it induces in the neuron. In the special case of IF neurons, the response function can be derived analytically. It reads

\[
Z(\varphi) = \frac{1}{I}\, e^{\varphi/2\pi f \tau}. \tag{2.12}
\]
For conductance-based models, Z and its Fourier expansion have to be determined numerically from the single-neuron dynamics (Hansel et al., 1993a; Ermentrout, 1996).

Large networks may display a transition from the asynchronous state to some synchronous state when control parameters are varied. In particular, such a transition may occur as the level of disorder is decreased or the coupling strength is changed. This can be analytically investigated in networks of phase oscillators, as shown in appendix B (see Strogatz & Mirollo, 1991). Let h(ω) be the distribution of the natural angular frequencies of the phase oscillators, ω = 2π f. Let the Fourier expansion of Γ(φ) be

$$\Gamma(\varphi) = \sum_{n=0}^{+\infty} a_n \sin(n\varphi) + b_n \cos(n\varphi). \tag{2.13}$$
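Given Z and the synaptic waveform, the coefficients a_n and b_n can be computed numerically. The sketch below builds the effective interaction of equation 2.11 from the IF response function (equation 2.12, dropping the overall 1/I factor) and a periodized alpha-function synaptic current, then projects it on the Fourier modes of equation 2.13; all parameter values are illustrative, not taken from the article.

```python
import numpy as np

def gamma_fourier(f=40.0, tau_m=0.01, tau_syn=0.003, n_max=4, M=1024):
    """Fourier coefficients a_n, b_n (eq. 2.13) of the effective
    interaction Gamma (eq. 2.11) for an IF neuron (eq. 2.12)."""
    T = 1.0 / f
    u = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)
    Z = np.exp(u / (2.0 * np.pi * f * tau_m))      # eq. 2.12, up to the 1/I factor
    t = u * T / (2.0 * np.pi)
    # synaptic current over one period: periodized alpha function
    Isyn = sum((t + k * T) / tau_syn**2 * np.exp(-(t + k * T) / tau_syn)
               for k in range(10))
    # Gamma(psi) = (1/2pi) int_0^{2pi} Z(u) Isyn(u - psi) du   (eq. 2.11)
    gamma = np.array([(Z * np.roll(Isyn, m)).mean() for m in range(M)])
    a = np.array([0.0] + [2.0 * np.mean(gamma * np.sin(n * u))
                          for n in range(1, n_max + 1)])
    b = np.array([np.mean(gamma)] + [2.0 * np.mean(gamma * np.cos(n * u))
                                     for n in range(1, n_max + 1)])
    return a, b
```

The signs of the a_n obtained this way are what enter the stability conditions derived next.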
For a fixed value of the disorder, as the coupling strength, g, is changed, the asynchronous state becomes linearly unstable as soon as one of the Fourier modes, n, of the interaction function, Γ, satisfies the two conditions

$$h(-g b_0 - \Omega_n) = -\frac{2 a_n}{\pi g \,(a_n^2 + b_n^2)}, \tag{2.14}$$

$$\lim_{\epsilon \to 0} \int_{\epsilon}^{+\infty} \frac{h(\omega - g b_0 - \Omega_n) - h(-\omega - g b_0 - \Omega_n)}{\omega}\, d\omega = \frac{2 b_n}{g \,(a_n^2 + b_n^2)}. \tag{2.15}$$
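As a sanity check of conditions 2.14 and 2.15, consider the special case Γ(φ) = −sin(φ), that is, a_1 = −1 with all other coefficients zero, and a Lorentzian h of width σ. For a symmetric h centered at zero (working in the rotating frame), equation 2.15 is satisfied by Ω_1 = 0, and equation 2.14 reduces to h(0) = 2/(πg). The short sketch below solves this for g:

```python
import math

def kuramoto_critical_coupling(sigma):
    """Condition 2.14 with a_1 = -1, b_n = 0 and a Lorentzian
    h(w) = sigma / (pi * (w**2 + sigma**2)): solve h(0) = 2/(pi*g)."""
    h0 = 1.0 / (math.pi * sigma)        # peak value of the Lorentzian
    return 2.0 / (math.pi * h0)         # critical coupling g_c
```

This yields g_c = 2σ, the classical critical coupling of the Kuramoto model, confirming that the conditions reduce correctly in this limiting case.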
Equations 2.14 and 2.15 determine the critical value of the disorder, σ_c, at which the mode n is marginally stable, and the corresponding purely imaginary eigenvalue iΩ_n. The mode n is then linearly unstable for all levels of disorder smaller than the critical value (σ < σ_c) (see appendix B). Alternatively, equations 2.14 and 2.15 give the critical value, g_c, of the coupling at which the mode n becomes unstable, when the level of disorder σ is kept fixed. Note
Synchrony in Heterogeneous Networks
1615
that condition 2.14 can be fulfilled only if a_n and g have opposite signs. This means that the mode n is always linearly stable in an inhibitory network if a_n < 0 and cannot be involved in the destabilization of the asynchronous state. Conversely, the mode n is linearly stable in an excitatory network if a_n > 0. Note that the functions Z and I_syn depend on the intrinsic firing rates of the presynaptic and postsynaptic neurons. In equations 2.14 and 2.15, the Fourier coefficients are computed for the effective interaction of neurons with intrinsic angular frequency ω̄ = 2π f̄. Taking into account the spread of frequencies would result only in higher-order corrections.

2.4 Robustness of Synchrony.

For phase models, σ_c is strictly proportional to g. Beyond the weak coupling limit, the critical disorder also depends on the coupling strength, albeit in a more complex way. In all cases we define, for any given value of the coupling strength, g, the robustness of synchrony, R(g), as the critical disorder at which the asynchronous state becomes linearly stable. This critical disorder is always expressed in what follows as a fraction of the average frequency, f̄, in the uncoupled network.

Meaningful comparisons of the robustness of the synchronous state for two neuron models require that numerical simulations be performed in equivalent conditions. We therefore do not choose the same value of the coupling strength for both models, as the effect of a given synaptic current largely depends on the membrane properties of the postsynaptic neuron. Instead, we set the coupling strength at different values for the two models, chosen so that a single synaptic event leads in both cases to a postsynaptic potential of 0.5 mV in a neuron at rest. This guarantees that the effects of synaptic interactions between neurons are similar in the two models.

3 Results

3.1 Weak Coupling.
Figures 2A and 2B, respectively, display the robustness of synchrony in an excitatory network (R_E) of IF neurons and in an inhibitory network (R_I) of IF neurons as a function of the average firing frequency, f̄. These results were obtained in the weak coupling limit and for a gaussian distribution of firing rates. Both R_E and R_I go to infinity as the firing rate goes to zero, which means that synchrony can always be achieved at very low rates in a network of IF neurons, even if heterogeneities are strong. This result is also valid for a Lorentzian distribution of frequencies. In this simpler case, it can be analytically derived, as shown in appendix C. In the excitatory case, the robustness of synchrony, R_E, monotonically decreases with increasing frequency (see Figure 2A) and eventually vanishes at a finite frequency. This means that no synchrony is possible at a high firing rate, even in a homogeneous network. On the contrary, in the inhibitory case, R_I is a nonmonotonic function of the firing frequency and never vanishes, as
Figure 2: Weakly coupled IF neurons: robustness of synchrony, R, versus the average firing frequency, f̄. Firing rates in the uncoupled network are spread according to a symmetrically truncated gaussian distribution (see section 2). Results are displayed for three sets of synaptic time constants: τ_1 = 1 ms, τ_2 = 3 ms (full line); τ_1 = 1 ms, τ_2 = 7 ms (dashed line); τ_1 = 6 ms, τ_2 = 7 ms (dotted line). (A) Excitatory network. (B) Inhibitory network. Results obtained for a symmetrically truncated Lorentzian distribution are also plotted in the inhibitory case (τ_1 = 6 ms, τ_2 = 7 ms, empty squares).
can be seen in Figure 2B. This entails that a mildly heterogeneous inhibitory network always displays some degree of synchrony. The discontinuities in the derivative visible on all the curves correspond to the frequencies at which changes in the order of the critical mode, n_c, occur. As can be seen in Figures 3A and 3B, the order of the critical mode monotonically decays with increasing firing rate, in both the excitatory and the inhibitory cases, and eventually it becomes 1. By contrast, it diverges at low frequency, as shown above. This means that at low frequency, destabilization of the asynchronous state will lead to cluster states, where the system splits into several groups of in-phase locked neurons (Golomb, Hansel, Shraiman, & Sompolinsky, 1992).

To compare the efficiency of inhibitory and excitatory couplings at synchronizing neuronal activity in IF networks for given values of the synaptic time constants, we have computed the ratio R_E/R_I. In Figure 4, this ratio is plotted as a function of the average firing rate, f̄. At high frequency, R_E/R_I vanishes because the asynchronous state is stable for excitatory interactions but not for inhibitory ones. When the frequency is decreased, R_E/R_I rapidly becomes larger than 1. This means that at low firing rate, excitation is much more efficient than inhibition at synchronizing neural activity in heterogeneous IF networks. Note that R_E/R_I remains finite in the limit of vanishing firing rates, in spite of the low-frequency divergence of both R_E and R_I.

We now analyze how synchrony depends on the synaptic time course. For the sake of simplicity, we assume that τ_1 = τ_2 = τ_syn. The synaptic interaction is then described by an "alpha function,"

$$f(t) = \frac{t}{\tau_{\mathrm{syn}}^2}\, e^{-t/\tau_{\mathrm{syn}}}\, H(t). \tag{3.1}$$
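The alpha function of equation 3.1 rises linearly, peaks at t = τ_syn, decays with the same time constant, and is normalized to unit area. A small sketch (H is the Heaviside step):

```python
import numpy as np

def alpha_synapse(t, tau_syn):
    """Eq. 3.1: f(t) = (t / tau_syn**2) * exp(-t / tau_syn) * H(t)."""
    tp = np.clip(t, 0.0, None)                 # H(t): zero response for t < 0
    return np.where(t >= 0.0, tp / tau_syn**2 * np.exp(-tp / tau_syn), 0.0)
```

The peak value is 1/(e τ_syn), reached one synaptic time constant after the presynaptic spike; the unit integral makes interactions with different τ_syn directly comparable.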
Comparing Figures 4 and 5, we see that the ratio R_E/R_I diverges when f̄τ_syn goes to 0 at fixed f̄, while it tends to a constant value when f̄τ_syn goes to 0 at fixed τ_syn. Thus, f̄τ_syn is not a scaling variable for R_E/R_I. This is because R_E/R_I depends not only on the temporal modulation of the synaptic current (controlled by f̄τ_syn), but also on the phase response of the neurons, which explicitly depends on f̄. This can be analytically shown for a Lorentzian distribution (see appendix C).

Are the above results modified when active properties of neurons are taken into account? To investigate this issue, we have studied networks of WB neurons. We have numerically computed the response function for various values of the external drive, I, leading to different firing rates of the neurons. Using the Fourier expansion of the response function, we have analytically computed the robustness of synchrony. The results are shown for excitatory and inhibitory couplings, respectively, in Figures 6A and 6B. Qualitatively, these results are similar to those obtained for the IF model, except that both R_E and R_I remain finite in the limit of vanishing firing rates
Figure 3: Weakly coupled IF neurons: order of the critical mode versus the average firing frequency, f̄. (A) Excitatory network. (B) Inhibitory network. Same parameters and conventions as in Figure 2.
for the WB model. In particular, R_I never vanishes; R_E is zero above some critical frequency, f_c; and R_E is larger than R_I at low frequency (compare Figures 6A and 6B).
Figure 4: Weakly coupled IF neurons: relative robustness R_E/R_I versus the average firing frequency, f̄. Same parameters and conventions as in Figure 2.
Figure 5: Weakly coupled IF neurons: relative robustness R_E/R_I versus the synaptic time τ_syn for a fixed average firing frequency f̄ = 20 Hz and an alpha-function synaptic interaction.
Figure 6: Weakly coupled WB neurons: robustness of synchrony, R, versus the average firing frequency, f̄. (A) Excitatory network. (B) Inhibitory network. Results are displayed for two sets of synaptic time constants: τ_1 = 1 ms, τ_2 = 3 ms (full line and filled squares); τ_1 = 1 ms, τ_2 = 7 ms (dashed line and empty squares).
3.2 Beyond Weak Coupling.

The results of the previous section were derived under the assumption of weak coupling. To assess their range of validity, we numerically compute R_E and R_I as a function of the coupling
Figure 7: IF neurons: coherence χ as a function of the relative frequency spread σ/f̄ (gaussian disorder) for excitatory interactions (filled squares) and inhibitory interactions (empty squares and full line); τ_1 = 3 ms, τ_2 = 1 ms, g = 0.05 nC/cm², f̄ = 80 Hz.
strength for both IF and WB neurons and compare those results to the predictions obtained using the phase reduction technique. Figure 7 displays the coherence, χ, versus the spread, σ, of firing rates for excitatory and inhibitory networks of IF neurons. The average firing rate was kept fixed at 80 Hz. For that value, partial synchrony (χ ≈ 0.36) is achieved in the homogeneous (σ = 0) excitatory case, and the fully synchronous state (χ = 1) is stable in the homogeneous inhibitory case. Remarkably, χ decays to 0 much faster for inhibition than for excitation when disorder is increased, although it starts from much higher values. Networks of WB neurons display a qualitatively similar behavior, although the range over which χ decays to zero is larger by one order of magnitude.

Figures 8A and 8B, respectively, display R_E and R_I as a function of the coupling strength for IF networks and a fixed value of the average firing rate in the uncoupled network. Note in Figure 8B the multistability present in the inhibitory case, as the transition to the asynchronous state occurs at different values of the disorder when starting from an initial condition close to the fully synchronous state and from an initial condition where the membrane voltages of the neurons are widely spread. For small coupling strength, g, the robustness varies linearly with g, with a slope that is correctly predicted by the phase model limit. For larger values of g, R_E and R_I monotonically decrease with increasing g. In excitatory networks, this stems from
Figure 8: IF neurons: robustness versus coupling strength g (in nC/cm²). The average free firing rate is f̄ = 50 Hz. Filled squares: numerical simulations. Solid line: results from the phase model. (A) Excitatory interactions, τ_1 = 3 ms, τ_2 = 1 ms. (B) Inhibitory interactions, τ_1 = 7 ms, τ_2 = 1 ms. Results are shown in that latter case for two initial conditions differing by the spread of the initial voltages in the neuronal population. Filled squares: spread equal to 50% of the subthreshold voltage range. Empty squares: spread equal to 1% of the subthreshold voltage range.
the increase of the firing rate with the coupling; in inhibitory networks, this is related to clustering (see appendix D). However, in the inhibitory case, the robustness of synchrony may eventually increase again when the coupling strength is further increased. This is because low-frequency neurons are progressively suppressed when the coupling strength is increased, which decreases the frequency spread of the spiking population and increases the maximum level of disorder required to stabilize the asynchronous state. WB networks behave in a similar way, although synchrony can be achieved in heterogeneous excitatory networks only in a much more restricted range of firing rates (see Figures 9A and 9B).

The results of this section show that for both models and both types of interactions, excitatory and inhibitory, when the external input is kept fixed and the strength of the interaction is varied, the conclusions derived using the phase-reduction method remain qualitatively valid over a finite range of coupling strength. In this range, excitation is more efficient than inhibition at synchronizing heterogeneous networks at low firing rates. The opposite is true at large firing rates, where only inhibition may synchronize heterogeneous networks.

3.3 Synchrony at Strong Coupling.

In the weak coupling limit, the interactions between neurons have a negligible effect on the average firing rate. This is not so at strong coupling. This is particularly obvious in excitatory networks, where recurrent excitation strongly increases the average firing rate with respect to its value in the uncoupled network. Consequently, the synaptic interaction becomes closer and closer to a constant as the oscillation period becomes much smaller than the synaptic time constants, and it can no longer destabilize the asynchronous state. This is one reason that no synchrony can be achieved at large coupling strengths in an excitatory network.
In inhibitory networks, increasing the coupling strength decreases the firing rate and would seem to favor synchrony. However, this also leads to the suppression of low-frequency neurons. Moreover, the states of partial synchrony exhibited by inhibitory networks depend on both the firing rate and the coupling strength, and synchrony may also be reduced due to a higher degree of clustering when the coupling strength becomes larger. To analyze the sole effect of the coupling strength, g, on the robustness of synchrony in excitatory and inhibitory networks, independently of the change of firing rate it provokes, we shall simultaneously change g and the external drive, I, acting on the neurons, so that the average frequency in the network remains constant, whatever the value of g. This can be achieved by making the drive I_i on neuron i depend on g as

$$I_i(f, g) = \hat{I}(f) - g \bar{f}, \tag{3.2}$$

where Î(f) is the constant drive required for neuron i to fire at frequency f, and f̄ is the average frequency of neurons across the uncoupled network. In the asynchronous state, the total synaptic input on a neuron due to the rest of the network is then exactly balanced by the coupling-dependent term of the applied current, so that the frequency remains that of the single neuron.

Figure 9: WB neurons: robustness of synchrony versus coupling strength, g (in nC/cm²). τ_1 = 3 ms, τ_2 = 1 ms, and f̄ = 3 Hz. (A) Excitatory interactions. Only the increase of R_E with g is displayed; at higher values of g (outside the range of the figure), R_E actually decreases, as in Figure 8A for IF neurons, and eventually vanishes. (B) Inhibitory interactions. R_I again increases at higher values of g, as in Figure 8B (not shown). As in Figure 8, the line indicates the predictions of the phase model.

The results obtained for excitatory and inhibitory WB networks under the above constraint (see equation 3.2) are shown in Figure 10 for a constant low firing rate, f̄ = 3 Hz. Increasing the synaptic coupling allows us to improve substantially the robustness for both excitation and inhibition, within certain limits. Also note that a larger robustness can be achieved in the excitatory case than in the inhibitory case. Similar, though quantitatively different, results are obtained for IF networks (results not shown). For large enough values of f̄, the situation changes. Indeed, for f̄ larger than a critical value that depends on the synaptic time constants, the asynchronous state is the only stable active state of the excitatory WB network, even in the homogeneous case (it coexists with the stable quiescent state of the network). In other words, R_E = 0. By contrast, partial synchrony is achieved in an inhibitory network, even at a high rate: R_I monotonically increases with g up to values of more than 20% when f̄ = 50 Hz (results not shown). Thus, the conclusions previously established for weak coupling generalize to all coupling strengths, when the average frequency is kept fixed by appropriately tuning the external drive. Excitation synchronizes neurons more efficiently at low firing rates, but above some threshold frequency, only inhibition can synchronize neural activity.

3.4 The Balanced Synchronous State.

The previous section might give the sense that a precise tuning of the external drive, so that it exactly compensates for the shift of the average frequency in the asynchronous state induced by the network dynamics, is required to achieve synchrony at large coupling.
Here we show that when both the external drive and the coupling strength are strong enough, and under some restrictions that are elaborated on in appendix D, this balancing automatically takes place in the asynchronous state, without any need to fine-tune the parameters. For simplicity, we define all currents relative to the current threshold, I_c, so that any current above (respectively below) this threshold is positive (respectively negative). We shall assume that both the coupling strength, g, and the average external drive, I, depend linearly on a scaling parameter, K, and we shall investigate the large-K regime where both coupling and average drive are strong. More specifically, we shall assume that

$$g = -KJ, \tag{3.3}$$

where J is some positive constant, and that the drive I_i on neuron i satisfies

$$I_i = K I_0 + \delta_i, \tag{3.4}$$
Figure 10: Robustness of synchrony versus coupling strength, g (in nC/cm²), in a WB network for a constant average firing rate f̄ = 3 Hz in the coupled network. τ_1 = 3 ms, τ_2 = 1 ms. (A) Excitatory interactions. Here again, as in Figure 8A, only the increase of R_E with g is shown. (B) Inhibitory interactions. Results are shown in that latter case for two initial conditions differing by the (uniform) spread of the initial voltages in the neuronal population. Filled squares: spread of ±10 mV. Empty squares: spread of ±0.1 mV. In both figures, the line indicates the prediction of the phase model (as in Figure 8).
where I_0 is some constant current, I_0 > 0, and the δ_i are identically distributed independent random variables with zero mean. For an excitatory network, K is negative, so that the external drive is below the current threshold and opposes the frequency increase due to recurrent excitation. In inhibitory networks, K is positive. The external drive is then positive and opposes the frequency decrease due to the inhibitory synaptic interactions. In both cases, when the network is in the asynchronous state with an average firing rate f̄, the total current input on neuron i is equal to

$$I_i + g\bar{f} = K(I_0 - J\bar{f}) + \delta_i. \tag{3.5}$$

Both the firing rate, f_i, of neuron i and the average firing rate, f̄, are then determined by the self-consistent equations

$$f_i = G\!\left[K(I_0 - J\bar{f}) + \delta_i\right], \tag{3.6}$$

$$\bar{f} = \int P(\delta)\, G\!\left[K(I_0 - J\bar{f}) + \delta\right] d\delta, \tag{3.7}$$
where f = G[I] is the frequency–current relation of the neurons and P is the probability density of the random variable δ. In the large-K limit, the average firing rate of the neurons, f̄, remains finite and is equal to

$$\bar{f} = \frac{I_0}{J} + O(1/K). \tag{3.8}$$

The correction of order 1/K to f̄, −A/K, is determined by the nonlinear equation

$$\frac{I_0}{J} = \int P(\delta)\, G\!\left[JA + \delta\right] d\delta. \tag{3.9}$$
This correction to the average frequency is positive for excitatory networks (K < 0) and negative for inhibitory networks (K > 0). Substituting the order-1/K approximation of f̄ thus obtained into equation 3.6 gives the firing rates, f_i, of the neurons up to order 1/K,

$$f_i = G\!\left[JA + \delta_i\right]. \tag{3.10}$$
Note that the firing rates depend on the single-neuron dynamics only through the frequency–current relation. Also note that in the asynchronous state, the width of the frequency distribution depends only on the probability density, P, and the frequency–current relation, G. It is independent of the average drive on the neurons.
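The self-consistency argument of equations 3.6 through 3.9 can be checked numerically. The sketch below assumes a hypothetical square-root frequency–current relation, G[I] = √(max(I, 0)), and gaussian quenched disorder, and solves equation 3.7 by bisection in the inhibitory case (g = −KJ, I_i = K I_0 + δ_i); all parameter values are illustrative, not taken from the article.

```python
import numpy as np

def balanced_rate(K, I0=2.0, J=1.0, sigma_d=0.1, n=2000, seed=1):
    """Solve eq. 3.7 for fbar, with the hypothetical f-I curve
    G[I] = sqrt(max(I, 0)) and quenched gaussian disorder delta_i."""
    rng = np.random.default_rng(seed)
    delta = sigma_d * rng.standard_normal(n)
    G = lambda I: np.sqrt(np.clip(I, 0.0, None))
    rhs = lambda fbar: G(K * (I0 - J * fbar) + delta).mean()
    lo, hi = 0.0, I0 / J + 10.0       # rhs(fbar) - fbar changes sign on this bracket
    for _ in range(100):              # bisection on the fixed point
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rhs(mid) > mid else (lo, mid)
    return 0.5 * (lo + hi)
```

As K grows, the solution approaches the balanced-state value I_0/J predicted by equation 3.8, with a discrepancy that shrinks like the 1/K correction of equation 3.9.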
Equation 3.8 implies that although both the external drive on the neurons and the synaptic couplings are large, the net current input on the neurons and their firing rates remain finite in the limit of large K. Moreover, in this limit, the average firing rate no longer depends on K. Therefore, if the external drive and the coupling strength are sufficiently strong and have opposite effects on the firing rate, the system settles into a "balanced state" where the average firing rate depends only on the ratio of these two quantities (up to small corrections). The equations satisfied by the average drive and the average firing rate in this state are the same as in the "chaotic balanced state" studied in van Vreeswijk and Sompolinsky (1996, 1998). However, at variance with the chaotic balanced state, the neurons fire periodically in the asynchronous balanced state we consider here. This is because the network is fully connected in our study (see also Golomb & Hansel, in press).

The robustness, R_I, of an inhibitory WB network in the balanced state is shown in Figures 11A and 11B as a function of the scaling parameter K for two values of the average frequency in the large-K limit, f̄ = 50 Hz (A) and f̄ = 200 Hz (B). In both cases, a much larger robustness than at weak coupling can be achieved by sufficiently increasing K. This is because, in spite of the averaging effect of the high firing rate, the synaptic current remains strongly modulated, as the coupling strength is large. One can even show analytically that as the firing rate increases, the index of the mode that destabilizes the asynchronous state decreases, and synchrony becomes more robust. The behavior of excitatory networks is different, as the firing rate can no longer be controlled when the coupling gets too large. This can be analytically proved for a homogeneous IF network (see appendix D). As a consequence, there is a limiting value of the firing rate beyond which a purely excitatory network cannot exhibit synchrony.
Therefore, we are led in the strong coupling regime to the same conclusions as before: excitatory couplings give rise to more robust synchrony at low firing rates, while only inhibition can synchronize networks at high rates.

4 Discussion
This study has addressed the issue of synchrony in heterogeneous neural networks made up of only excitatory or only inhibitory spiking neurons. Previous work has stressed the role of inhibitory interactions, showing that mutually inhibiting neurons can synchronize (Wang & Buzsáki, 1996), even in the presence of mild heterogeneities (White et al., 1998). We have shown here, in contrast with the results of White et al. (1998), that synchrony is also possible in heterogeneous inhibitory networks when the spread in the firing rates of the neurons is a substantial fraction of the average firing rate, a situation that is probably frequent in biological preparations. Synchrony is even possible at high firing rates, provided that the average
Figure 11: Inhibitory network of WB neurons: robustness, R_I, versus the scaling factor K (see text). τ_1 = 7 ms, τ_2 = 1 ms, f̄ = 50 Hz (A) and 200 Hz (B). In both cases R_I saturates with increasing K, as is slightly visible in A. Same conventions as in Figure 8.
synaptic current is balanced by a constant external input of the same order of magnitude. In this way, the coupling strength can be strongly increased, and the modulation of the synaptic current maintained, while suppression is
avoided. This situation, which can be achieved without finely tuning the parameters, is different from the asymptotic regimes defined in Chow, White, Ritt, and Kopell (1998), except at very low frequency, where it corresponds to the fast regime. It is neither the phasic regime, because the membrane time constant and the inhibition decay time are of the same order, nor the tonic regime, because the oscillation period is not much longer than the inhibition decay time. By contrast, balancing cannot control the firing rate in excitatory networks when the average firing rate gets too high. As a consequence, synchrony at high firing rates cannot be achieved in excitatory networks. Still, excitation is more efficient than inhibition at synchronizing neurons at sufficiently low firing rates. Note that this robustness of synchrony in excitatory networks, relative to the inhibitory situation, does not entail that excitatory networks can display synchrony at low firing rates in spite of a large spread of parameters. Indeed, near the bifurcation point, the firing rates of the neurons are strongly sensitive to intrinsic neuronal properties (e.g., the leak conductance) and to the value of the external drive.

We have reached these conclusions by analyzing under which conditions the asynchronous state is unstable in large networks, so that the system settles in some synchronous state. This is in contrast with most previous theoretical works (see, however, Abbott & van Vreeswijk, 1993, and van Vreeswijk, 1996), which mainly focused on synchronous states in homogeneous networks, in particular on the fully synchronous state, and on what these states became when heterogeneities were introduced. More specifically, we have studied the destabilization of the asynchronous state to the benefit of some coherent state when network parameters are varied.
To address this question, we have used the weak coupling limit, in which the stability of the asynchronous state and its dependence on the synaptic parameters and on the average firing frequency can be analytically investigated. Thus, we could assess the robustness of synchrony, defined as the relative spread in the natural frequencies of the neurons beyond which the incoherent state becomes stable, for excitatory and inhibitory networks. This procedure was performed for both the IF model, in which the inputs are passively integrated below the firing threshold, and the WB model, which incorporates voltage-dependent sodium and potassium currents. We reached the same qualitative conclusions for both models. At sufficiently low frequency, synchrony can be achieved in both excitatory and inhibitory networks but is more robust in the excitatory case. At high frequency, only inhibition can lead to synchrony, provided that the degree of heterogeneity is sufficiently small. The critical frequency at which excitatory couplings can no longer synchronize neural activity depends on the synaptic time constants and the neuron model.

Quantitatively, the IF model nonetheless behaves differently from the WB model in several important respects. First, for excitatory couplings and given synaptic time constants, the critical frequency is larger, by one order
of magnitude, in IF networks. Second, the robustness of synchrony is one order of magnitude smaller in inhibitory IF networks than in inhibitory WB networks. Third, the robustness of synchrony diverges as the average firing rate goes to zero in IF networks, while it decreases in WB networks. Many conductance-based models start to fire at vanishingly low frequency at the current threshold, like the WB model. All such models that we have studied so far behave qualitatively like the WB model, but the value of the critical frequency may strongly change with the model. The atypical behavior of the IF model stresses the importance of active properties for the emergence of synchrony in networks. A similar conclusion was reached by Golomb and Hansel (in press) in a study of synchrony in sparsely connected neuronal networks.

The small robustness of synchrony at low firing rates in inhibitory WB networks is due to the fact that the order of the Fourier mode responsible for the linear instability of the asynchronous state increases with decreasing firing rate. Robustness then decreases with the amplitude of the corresponding component in the Fourier spectrum of the interaction function. For the WB model (and, more generally, for neurons with similar frequency–current relations), all Fourier components of the response function vanish with the firing rate (Ermentrout, 1996). On the other hand, the modulation of the synaptic current becomes too small at high frequency to entrain neurons efficiently, and robustness becomes low. Between these two regimes, an optimal value of the average firing rate can be found where robustness is maximal. Similarly, for a given firing frequency, synaptic time constants can be found that maximize the robustness of synchrony in inhibitory networks. Note, however, that robustness is not determined by the ratios f̄τ_1 and f̄τ_2 alone (see Figures 4 and 5).
Using numerical simulations of large networks, we have shown that the picture that emerges from the weak coupling limit remains valid over a finite range of coupling strength. Moreover, for a given external drive, the robustness increases linearly with the coupling strength in this range. However, when the coupling strength becomes too large, robustness again decreases in both excitatory and inhibitory networks, albeit for different reasons. In excitatory networks, the firing rate dramatically increases with the coupling due to recurrent excitation between neurons, and the synaptic current is no longer modulated in time. By contrast, the firing rate decreases with increasing coupling strength in inhibitory networks, which favors cluster states. The higher the order of these cluster states, the more easily they are destabilized and replaced by the asynchronous state when heterogeneities are introduced. These desynchronizing effects can be prevented by introducing an additional current proportional to the coupling and of opposite sign. When both the synaptic interactions and the external drive are strong, the synaptic current on the neurons automatically adjusts so that their firing rate remains nearly constant when the coupling strength is varied. This leads
to robust synchrony in inhibitory networks, even at a high firing rate. By contrast, the firing frequency cannot be controlled in excitatory networks beyond some limiting value of the coupling strength, due to an instability arising from recurrent excitation (see appendix D). The behavior of the networks in the regime where both the synaptic currents and the external drive are strong still depends on the coupling strength. For instance, the frequency at which the maximum robustness is achieved in the inhibitory case shifts to higher frequencies when the scaling factor, K, increases. For synaptic times τ_1 = 3 ms, τ_2 = 1 ms, the maximum robustness for the WB model is observed around 40 Hz for K = 16, and around 60 Hz for K = 32 (for a reference value of the coupling constant, J, always set to unity). Moreover, the transients become shorter with increasing K. For a WB network firing at 50 Hz, around 100 cycles are required to reach the asymptotic state when K = 1, whereas only 4 or 5 are required when K = 32.

In conclusion, we have brought to light a strong coupling regime where the robustness of synchrony is strongly enhanced. In particular, we have shown that significant synchrony can be achieved in heterogeneous inhibitory networks, even at high firing rate, thanks to the balancing mechanism that occurs when the external drive is also strong and scales as the coupling strength. However, the physiological relevance of this regime remains to be established. The values of the coupling strength we use in our numerical studies of fully connected networks correspond to relevant values of the coupling strength in a network with more realistic connectivity. Consider, for instance, a network of WB neurons, where each neuron is contacted by K = 100–1000 other neurons firing incoherently with an average frequency of f Hz. Assume that each synapse evokes postsynaptic potentials on the order of 0.2 mV.
This corresponds, in the WB model, to a synaptic strength between two neurons on the order of g_s = 0.4 nC/cm². In a state in which the neurons fire incoherently with an average frequency f Hz, the total synaptic input received by any neuron is of the order of K × g_s × f. To obtain a comparable total synaptic input in the fully connected WB networks we have investigated, we must set the coupling parameter g such that g ∼ K g_s ∼ 40–400 nC/cm². This range of coupling corresponds to the strong coupling regime we have studied in this article. Note, however, that if K is too small compared to the total number of neurons in the system, the stability of the asynchronous state can be significantly affected by the sparseness of the connectivity pattern (see Golomb & Hansel, in press, for a discussion of this issue). In that case, an asynchronous state can be stable even in the strong coupling regime we have studied (Golomb & Hansel, in press). Finally, we also stress that our work did not cover more realistic situations where two populations, one excitatory and the other inhibitory, interact. Our preliminary results (not shown here) on heterogeneous two-population systems show that the loop between excitatory and inhibitory neurons can lead to an astonishingly high robustness of synchrony.
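The order-of-magnitude estimate above is simple arithmetic and can be checked directly; a minimal sketch (the values K = 100–1000 and g_s = 0.4 nC/cm² are those quoted in the text; the script itself is illustrative and not part of the original study):

```python
# Effective coupling estimate g ~ K * g_s for a fully connected network,
# using the values quoted in the text.
g_s = 0.4  # synaptic strength between two neurons, nC/cm^2

for K in (100, 1000):       # range of presynaptic contacts per neuron
    g = K * g_s             # effective coupling, nC/cm^2
    print(f"K = {K:4d}  ->  g ~ {g:.0f} nC/cm^2")
```

This reproduces the quoted range g ∼ 40–400 nC/cm².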
Synchrony in Heterogeneous Networks
1633
Appendix A: The Wang–Buzsáki Model
Parameters of the Wang–Buzsáki model are as follows. The capacitance is C = 1 μF/cm². The leak current is given by I_ℓ = g_ℓ(V_ℓ − V), where g_ℓ = 0.1 mS/cm² and V_ℓ = −65 mV. The transient sodium current is given by I_Na = g_Na m_∞³ h (V_Na − V), where g_Na = 35 mS/cm² and V_Na = 55 mV. The activation variable m instantaneously adjusts to its steady-state value m_∞ = α_m/(α_m + β_m), where α_m(V) = −0.1(V + 35)/(exp(−0.1(V + 35)) − 1) and β_m(V) = 4 exp(−(V + 60)/18). The inactivation variable h evolves according to dh/dt = α_h(1 − h) − β_h h, where α_h(V) = 0.35 exp(−(V + 58)/20) and β_h(V) = 5/(exp(−0.1(V + 28)) + 1). The delayed rectifier current is given by I_K = g_K n⁴ (V_K − V), where g_K = 9 mS/cm² and V_K = −90 mV. The activation variable n evolves according to dn/dt = α_n(1 − n) − β_n n, with α_n(V) = −0.05(V + 34)/(exp(−0.1(V + 34)) − 1) and β_n(V) = 0.625 exp(−(V + 44)/80).

Appendix B: Instability of the Asynchronous State in Heterogeneous Networks of Phase Oscillators
We consider a population of neurons in which the intrinsic frequencies are distributed according to h(ω). The phases of the neurons evolve according to

φ̇_i = ω_i + (g/N) Σ_{j=1}^{N} Γ(φ_i − φ_j).   (B.1)
The evolution of the one-phase probability density, ρ(φ, t, ω), is then given by the continuity equation,

∂ρ/∂t + ∂(ρv)/∂φ = 0,   (B.2)
where

v = ω + g Σ_{n=0}^{+∞} [r_n sin(n(φ − ψ_n)) + s_n cos(n(φ − ψ_n))],   (B.3)

and the coefficients r_n, s_n, and ψ_n can be expressed in terms of the Fourier coefficients, a_n and b_n, of the interaction function,

r_n e^{inψ_n} = a_n ∫_0^{2π} ∫_{−∞}^{+∞} ρ(φ, t, ω) h(ω) e^{inφ} dω dφ   (B.4)

s_n e^{inψ_n} = b_n ∫_0^{2π} ∫_{−∞}^{+∞} ρ(φ, t, ω) h(ω) e^{inφ} dω dφ.   (B.5)
1634
L. Neltner, D. Hansel, G. Mato, and C. Meunier
To analyze the linear stability of the incoherent state, ρ_0(φ, t, ω) = 1/(2π), we write ρ(φ, t, ω) as

ρ(φ, t, ω) = ρ_0(φ, t, ω) + ε η(φ, t, ω),   (B.6)

where

η(φ, t, ω) = Σ_{n=−∞}^{+∞} c_n(t, ω) e^{inφ},   (B.7)
and linearize the continuity equation around ρ_0. Thus, we get

∂c_n/∂t = −i(ω + g b_0) n c_n − g (n a_n/2) ∫_{−∞}^{+∞} h(ω) c_n(t, ω) dω + g (i n b_n/2) ∫_{−∞}^{+∞} h(ω) c_n(t, ω) dω.   (B.8)
We then seek eigenvectors of the form

c_n(t, ω) = d_n(ω) e^{λ_n t},   (B.9)

where λ_n = μ_n + inΩ_n. Equation B.8 then leads to

1 = g (n/2)(−a_n + i b_n) ∫_{−∞}^{+∞} h(ω)/[λ_n + i(ω + g b_0) n] dω.   (B.10)
This equation determines the eigenvalues λ_n for n ≥ 1. Separating the real and imaginary parts of equation B.10, we obtain two equations that determine the bifurcation point and the value of Ω_n:

∫_{−∞}^{+∞} h(ω) μ_n/[μ_n² + n²(ω + g b_0 + Ω_n)²] dω = −(2/(gn)) a_n/(a_n² + b_n²)   (B.11)

∫_{−∞}^{+∞} h(ω) n(ω + g b_0 + Ω_n)/[μ_n² + n²(ω + g b_0 + Ω_n)²] dω = (2/(gn)) b_n/(a_n² + b_n²).   (B.12)
When the stability of the mode n changes, the associated eigenvalue is purely imaginary: λ_n = inΩ_n. Taking the corresponding limit, μ_n → 0, in equations B.11 and B.12, we get:

h(−g b_0 − Ω_n) = −(2/(πg)) a_n/(a_n² + b_n²)   (B.13)

lim_{ε→0} ∫_ε^{+∞} [h(ω − g b_0 − Ω_n) − h(−ω − g b_0 − Ω_n)]/ω dω = (2/g) b_n/(a_n² + b_n²).   (B.14)
In the case of a gaussian distribution of frequencies,

h(ω) = (1/√(2πσ²)) exp(−(ω − ω̄)²/(2σ²)),   (B.15)

equations B.13 and B.14 lead to

√(2/π) ∫_0^{(Ω_n + g b_0 + ω̄)/σ_c} exp(x²/2) dx = −b_n/a_n   (B.16)

σ_c = −g √(π/8) [(a_n² + b_n²)/a_n] exp(−(Ω_n + g b_0 + ω̄)²/(2σ_c²)),   (B.17)
where σ_c is the critical value of the disorder at which the mode n becomes unstable. For a Lorentzian distribution,

h(ω) = (1/π) σ/[σ² + (ω − ω̄)²],   (B.18)

equations B.13 and B.14 take the simpler form:

σ/[σ² + (Ω_n + g b_0)²] = −(2/g) a_n/(a_n² + b_n²)   (B.19)

(Ω_n + g b_0)/[σ² + (Ω_n + g b_0)²] = (2/g) b_n/(a_n² + b_n²).   (B.20)
Therefore:

σ_c = −g a_n/2   (B.21)

Ω_n + g b_0 = ± g b_n/2.   (B.22)
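As a quick numerical sanity check, the closed-form critical point can be substituted back into the Lorentzian conditions B.19 and B.20; a minimal sketch (the coefficient values g = 2, a_n = −0.5, b_n = 0.8 are arbitrary illustrations, not from the text, and the plus branch of B.22 is taken):

```python
import math

# Illustrative Fourier coefficients and coupling (not values from the text).
g, a_n, b_n = 2.0, -0.5, 0.8

sigma_c = -g * a_n / 2.0      # eq. B.21
x = g * b_n / 2.0             # eq. B.22 (plus branch), x = Omega_n + g*b_0

denom = sigma_c**2 + x**2
# Substituting back into eqs. B.19 and B.20:
assert math.isclose(sigma_c / denom, -2*a_n / (g * (a_n**2 + b_n**2)))
assert math.isclose(x / denom,        2*b_n / (g * (a_n**2 + b_n**2)))
print("B.19 and B.20 hold at the critical point given by B.21/B.22")
```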
In that case, the critical level of disorder, σ_c, depends only on the Fourier coefficient a_n.

Appendix C: Lorentzian Distribution of Intrinsic Frequencies
In this appendix, we calculate the robustness at low frequency or for fast synaptic interactions, for excitatory or inhibitory IF networks and a Lorentzian distribution of heterogeneities. Let us first investigate the robustness in the low-frequency limit. We write the Fourier expansion of the response function as:

Z(φ) = Σ_{n=0}^{+∞} [A_n sin(nφ) + B_n cos(nφ)].   (C.1)
A straightforward calculation using equations 2.4 and 2.11 shows that the Fourier coefficients of the phase interaction are given by:

a_n = [A_n(1 − n²ω²τ_1τ_2) − B_n nω(τ_1 + τ_2)] / [(1 + n²ω²τ_1²)(1 + n²ω²τ_2²)]   (C.2)

b_n = [A_n nω(τ_1 + τ_2) + B_n(1 − n²ω²τ_1τ_2)] / [(1 + n²ω²τ_1²)(1 + n²ω²τ_2²)].   (C.3)

For the IF model, the coefficients A_n and B_n are

B_n = (ω²τ/(2πI)) (e^{2π/(ωτ)} − 1) / (1 + (nωτ)²)   (C.4)

A_n = −nωτ B_n.   (C.5)

The Fourier coefficient a_n of Γ can be written in the form a_n = K(nω, τ, τ_1, τ_2) L(ω, τ), where

K(x, τ, τ_1, τ_2) = −[x³ τ τ_1 τ_2 − x(τ + τ_1 + τ_2)] / [(1 + x²τ_1²)(1 + x²τ_2²)(1 + x²τ²)]   (C.6)

and

L(ω, τ) = (ω²τ/(2πI)) (e^{2π/(ωτ)} − 1).   (C.7)
The function K(x) reaches its (negative) minimum, K_min, at x = x_min, and its (positive) maximum, K_max, at x = x_max, while L keeps a constant sign. For a Lorentzian distribution of angular frequencies, the critical level of disorder depends only on a_n; see equation B.21. For excitatory coupling (g > 0), the first mode, n_c, that destabilizes the asynchronous state is then given by the best integer approximation of x_min/ω̄. Indeed, K is then closest to K_min, and a_n takes its most negative value there. When the average frequency goes to zero, K(n_c ω̄, τ, τ_1, τ_2) goes to K_min, and the instability condition becomes σ_c = −g K_min L(ω̄, τ)/2. The quantity L(ω̄, τ) diverges in this limit, and so does the critical disorder, σ_c. By contrast, for inhibitory coupling (g < 0), the critical mode, n_c, is given by the best integer approximation of x_max/ω̄. The critical value of the disorder, σ_c, then diverges at low firing rate as −g K_max L(ω̄, τ)/2. In both cases, excitatory and inhibitory, the critical mode, n_c, goes to infinity at low frequency as 1/ω̄. This stems from the fact that in that limit, the membrane potential of an IF neuron remains exponentially close to the threshold over a large part of the period, so that small depolarizations can induce large changes in the phase of the neuron. This is a particular feature of IF neurons, which passively integrate their inputs over the whole subthreshold potential range.
To investigate the limit of fast synaptic interactions, we proceed in a similar way. For simplicity, we assume that τ_1 = τ_2 ≡ τ_syn. The Fourier coefficient a_n can then be written as a_n = K(nω, τ, τ_syn) L(ω, τ), where

K(x, τ, τ_syn) = −[x³ τ τ_syn² − x(τ + 2τ_syn)] / [(1 + x²τ²)(1 + x²τ_syn²)²].   (C.8)

When τ_syn is much smaller than τ, K can be rewritten as K_1(xτ) K_2(xτ_syn), where

K_1(y) = y/(1 + y²)   (C.9)

K_2(y) = −(y² − 1)/(1 + y²)².   (C.10)

Consequently, the critical mode, n_c, which destabilizes the incoherent state, satisfies n_c ωτ ∼ 1 in the excitatory case. It keeps a constant value, independent of τ_syn, when τ_syn goes to 0. On the contrary, the critical mode is larger than 1/(ωτ_syn) for fast inhibitory interactions, so that it diverges when τ_syn goes to 0. As a result, R_E/R_I then scales like τ/τ_syn for a fixed firing rate f̄.
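The factorization K ≈ K_1(xτ) K_2(xτ_syn) for τ_syn ≪ τ is easy to verify numerically; a minimal sketch (the parameter values τ = 10, τ_syn = 0.1, x = 0.5 are illustrative, not from the text):

```python
def K(x, tau, tau_syn):
    # Exact coefficient, eq. C.8.
    num = x**3 * tau * tau_syn**2 - x * (tau + 2*tau_syn)
    den = (1 + x**2 * tau**2) * (1 + x**2 * tau_syn**2)**2
    return -num / den

def K1(y):
    # eq. C.9
    return y / (1 + y**2)

def K2(y):
    # eq. C.10
    return -(y**2 - 1) / (1 + y**2)**2

# Illustrative values with tau_syn << tau:
tau, tau_syn, x = 10.0, 0.1, 0.5
exact = K(x, tau, tau_syn)
approx = K1(x*tau) * K2(x*tau_syn)
rel_err = abs(exact - approx) / abs(exact)
print(f"exact = {exact:.5f}, factorized = {approx:.5f}, rel. err. = {rel_err:.3%}")
assert rel_err < 0.05
```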
Appendix D: Stability of the Asynchronous State in IF Networks
The stability of the asynchronous state of a fully connected homogeneous (δ_i = 0) IF network was studied by Abbott and van Vreeswijk (1993). They showed that the eigenvalues, λ, that determine the linear stability of the asynchronous state are the solutions of the equation

f_0 (e^{λ/f_0} − 1)(λ + α_1)(λ + α_2) = λ α_1 α_2 ∫_0^1 dy F(y) e^{λy/f_0},   (D.1)

where f_0 is the firing rate of the neurons in units of 1/τ, α_1 = τ/τ_1, α_2 = τ/τ_2, and

F(y) = g f_0 e^{y/f_0}/(g f_0 + I).   (D.2)
In the balanced state (see section 3.4), any neuron in the network is subjected to the external drive KI_0 and the synaptic drive −KJf_0, so that it receives the total current K(I_0 − Jf_0). The interaction function then reads

F(y) = f_0 J e^{y/f_0}/(I_0 − f_0 J).   (D.3)
Integrating the evolution equation of any neuron of the network over one period, one also derives the equation

1/[K(I_0 − Jf_0)] = 1 − e^{−1/f_0},   (D.4)

which enables us to express I_0 as a function of f_0. The eigenvalue equation can then be rewritten as

(e^{λ/f_0} − 1)(λ + α_1)(λ + α_2) = −α_1 α_2 f_0 KJ (1 − e^{−1/f_0}) [λ/(λ + 1)] [e^{(λ+1)/f_0} − 1].   (D.5)
Let us investigate the high-frequency limit f_0 ≫ 1. Considering the complex part of the spectrum and looking for solutions of equation D.5 of the form

λ_n = 2πin f_0 + γ_n,   (D.6)

where γ_n ≪ f_0, one obtains

γ_n = 1/[4π²n²f_0²/(α_1 α_2 KJ) − 1].   (D.7)

The ansatz, D.6, is valid when K ≠ a f_0², with a = 4π²n²/(α_1 α_2 J). Then the bounded quantity γ_n is just the real part of the nth eigenvalue and determines the linear stability of the asynchronous state with respect to the Fourier mode of order n. For inhibitory interactions (K > 0), γ_n is positive for modes of order n larger than

n_c = √(α_1 α_2 a J)/(2π),   (D.8)

while low-order Fourier modes are stable. Accordingly, the homogeneous network settles into cluster states. Let us also note that the coupling strength at which the mode n becomes stable scales as n², so that it is difficult to stabilize high-order Fourier modes by increasing the coupling. Higher-order cluster states are easily destabilized in favor of the asynchronous state when the disorder is increased. This is why the robustness of synchrony, R_I, in heterogeneous inhibitory networks may be small, although the asynchronous state is never stable in homogeneous networks.
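The mode-by-mode stability picture of equations D.7 and D.8 can be illustrated numerically; a minimal sketch (the parameter values α_1 = α_2 = 2, J = 1, f_0 = 1, K = 61.7 are illustrative, not from the text):

```python
import math

# Illustrative parameters (not from the text): inhibitory case, K > 0.
alpha1 = alpha2 = 2.0
J, f0, K = 1.0, 1.0, 61.7

def gamma(n):
    # Real part of the nth eigenvalue, eq. D.7.
    return 1.0 / (4 * math.pi**2 * n**2 * f0**2 / (alpha1 * alpha2 * K * J) - 1.0)

a = K / f0**2
n_c = math.sqrt(alpha1 * alpha2 * a * J) / (2 * math.pi)   # eq. D.8

# Low-order modes are stable (gamma_n < 0); modes above n_c are unstable.
first_unstable = next(n for n in range(1, 100) if gamma(n) > 0)
print(f"n_c = {n_c:.2f}; first mode with gamma_n > 0: n = {first_unstable}")
assert first_unstable == math.ceil(n_c)
```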
The behavior of excitatory networks (K < 0) is quite different. Indeed, it is not possible to increase the coupling strength to compensate for the decrease in the modulation of the synaptic current that occurs at high frequency. If we consider the real part of the spectrum, the eigenvalues satisfy

λ² + λ(α_1 + α_2) + α_1 α_2 (1 + KJ) = 0   (D.9)

in the high-frequency limit. When the actual coupling strength, KJ, becomes smaller than −1, the larger real eigenvalue becomes positive, and an instability develops. Balancing is then impossible, as the inhibitory external drive cannot oppose recurrent excitation. For this reason, excitatory interactions cannot synchronize neuronal systems at high firing rates.

Acknowledgments
G. M. was partially supported by grant PICT97 03-00000-00131 from ANPCyT and by Fundación Antorchas. We thank D. Golomb and C. van Vreeswijk for fruitful discussions.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuronal networks. Curr. Opin. Neurobiol., 5, 504–510.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comp. Neuro., 5, 407–420.
Cobb, S. R., Buhl, E. H., Halasy, K., Paulsen, O., & Somogyi, P. (1995). Synchronization of neuronal activity in hippocampus by individual GABAergic interneurons. Nature, 378, 75–78.
Draguhn, A., Traub, R. D., Schmitz, D., & Jefferys, J. G. (1998). Electrical coupling underlies high-frequency oscillations in the hippocampus in vitro. Nature, 394, 189–192.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analyses in the cat. Biol. Cybern., 60, 121–130.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (in press). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation.
Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Physical Review A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neuro., 1, 11–38.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Gustafsson, B., & Pinter, M. (1984a). Relations among passive electrical properties of lumbar α-motoneurones of the cat. J. Physiol., 356, 401–434.
Gustafsson, B., & Pinter, M. (1984b). An investigation of threshold properties among cat spinal α-motoneurones. J. Physiol., 357, 453–483.
Hansel, D., & Mato, G. (1993). Patterns of synchrony of a Hodgkin–Huxley neural network at weak coupling. Physica A, 200, 662.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin–Huxley neurons. Europhysics Letters, 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Phase reduction and neural modeling. Concepts Neurosci., 4, 193–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hansel, D., Mato, G., Neltner, L., & Meunier, C. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comp., 10, 467–484.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comp. Neurosc., 3, 7–34.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Laurent, G. (1996). Dynamical representation of odors by oscillating and evolving neural assemblies. Trends Neurosci., 19, 489–496.
Reyes, A. D., & Fetz, E. E. (1993a). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672.
Reyes, A. D., & Fetz, E. E. (1993b). Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol., 69, 1673–1683.
Ritz, R., & Sejnowski, T. J. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Curr. Opin. Neurobiol., 7, 536–546.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1372.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillations in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neuro., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received April 7, 1999; accepted August 4, 1999.
LETTER
Communicated by Roger Traub
Dynamics of Spiking Neurons with Electrical Coupling

Carson C. Chow
Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15206, U.S.A.

Nancy Kopell
Department of Mathematics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.

We analyze the existence and stability of phase-locked states of neurons coupled electrically with gap junctions. We show that spike shape and size, along with driving current (which affects network frequency), play a large role in determining which phase-locked modes exist and are stable. Our theory makes predictions about biophysical models using spikes of different shapes, and we present simulations to confirm the predictions. We also analyze a large system of all-to-all coupled neurons and show that the splay-phase state can exist only for a certain range of frequencies.
1 Introduction
Electrical coupling between neurons has long been thought to have the effect of synchronizing oscillatory neurons, especially if the neurons involved are similar to one another. Here we analyze in more detail the effects of coupling periodically spiking cells by electrical synapses and show that gap junctions can actively foster asynchrony. We focus on the effects of spike shape and size, along with driving current (which influences the network frequency). Even when the spikes are very thin, the current flow during the spike is shown to have a significant effect on the self-organization of the network, and the current flow during the afterpotentials is an important part of the synchronizing process. Indeed, the frequency of the network plays a significant role in whether the circuit will synchronize, changing the balance of these processes by altering the percentage of time occupied by the spike in a cycle. We analyze what modes of stable locking are possible for the network and show that synchronization is possible at much higher frequencies than for coupling via inhibitory synapses. However, at very high frequencies, gap junctions can be asynchronizing if the strength of the synapse is sufficiently low; at intermediate frequencies, asynchronous modes can stably exist with the synchronous ones.

Neural Computation 12, 1643–1678 (2000). © 2000 Massachusetts Institute of Technology

The models we use are described by an integrate-and-fire formalism, with the addition of action potentials that are inserted when a cell reaches
threshold. The electrical synapses are modeled as giving currents proportional to the differences in the voltages of the two cells. The analysis is done by means of the spike response method (Gerstner & van Hemmen, 1992; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996; Chow, 1998), in which the effects of the coupling and the spikes are encoded in response kernels. Bressloff and Coombes (1998, 2000) have a similar formalism for analyzing these types of networks. Although the spike response method was invented for synaptic interactions, we show here that it can be used for electrical interactions as well. Indeed, the spike response method allows us to consider the simultaneous effects of both kinds of coupling in a uniform formalism. We consider the consequences of the interaction of electrical and inhibitory synapses in a future publication. Here, we consider only electrical coupling and focus on the different effects of spike shape on synchronization in different regimes. We start in section 2 with the equations of the neurons and the synapses as coupled differential equations. Because the equations are piecewise linear between spikes, they can be explicitly integrated to compute the response kernels. We compute those kernels in terms of the spike parameters, the strength of the electrical synapses, and the intrinsic recovery rate of the uncoupled neuron. This gives explicit solutions for the coupled equations in terms of those parameters and the driving currents to each of the cells (which need not be the same). These explicit solutions are the basis for the analysis in the rest of the article. We note that the solutions have the same form as those analyzed in Chow (1998) describing interactions of cells via synaptic interactions. Thus, the general stability criteria developed there can also be used for the current problem. In section 3, we consider phase-locked solutions between two spiking neurons connected by gap junctions.
We find that, depending on the parameters of the neurons and the gap junction strength, the neurons can either synchronize, antisynchronize, be phase locked at an arbitrary phase, or lose periodic firing. One nonintuitive result is that changing the shape of the spikes may have different effects on the network in different parameter regimes. For example, increasing the amplitude and width of the spikes diminishes the range over which synchrony is stable when the frequency is relatively high. However, at low frequencies in which there is bistability, it can enhance synchrony by diminishing the range of parameters over which the competing antisynchronous solutions are stable. For nonsynchronous modes such as antisynchrony or splay phase, electrical coupling can change the period of the network. Not only does the period of each cell change in such nonsynchronous modes, but the network frequency also increases significantly; for example, in the antisynchronous mode, the network frequency is twice that of the cell frequency. Thus, relatively weak electrical coupling may be functionally important in creating appropriate frequency ranges for oscillations in a coupled network. Whether a given mode of locking has a stable existence also depends on the network frequency. The theory predicts a sequence of bifurcations as the driving current to the cells, and hence the frequency of the network, is changed. We test that theory against two biophysical models and find agreement for a model that has large spikes and another that has small spikes (see the appendix for details of the models). In section 4 we consider an all-to-all electrically coupled network of neurons. We show that a large family of periodic phase-locked states can exist, including synchronous and clustered states. We also show that the splay-phase state, in which all the neurons fire in sequence, can exist and be stable over only a finite range of periods. Finally, we review some related modeling work and place our results in the context of physiological systems that are electrically coupled.

2 Equations and Methods

2.1 Spiking Neuron Model. We consider a simple integrate-and-fire model of the neuron with the form
C dV/dt̃ = Ĩ − g_L V + Σ_l Ã(t̃ − t̃_l),   (2.1)
where V is the voltage, C is the capacitance, Ĩ is the applied current, g_L is an effective passive leak conductance, Ã(t̃) is a term representing spiking and restoring currents, and t̃_l represents the times that V(t̃) reaches a threshold V = V_T from below. When V reaches threshold, a new term Ã(t̃ − t̃_l) is added, which generates a spike (action potential) and resets the potential V to V_0. The current Ã(t̃) associated with an individual spike and recovery is chosen to be

Ã(t̃) = { g_L V_A e^{ξ̃t̃},   0 < t̃ ≤ Δ̃
       { −C(V_T − V_0 + V_M) δ(t̃ − Δ̃),   Δ̃ < t̃.   (2.2)
Here, V_A is a spiking amplitude scale, ξ̃ is the rise rate of the spiking current, Δ̃ is the width of the spike from threshold to the peak, V_M is the maximum amplitude above threshold that the spike reaches before the potential is reset, and δ(·) is the Dirac delta function (which is used to reset the potential). Ã(t̃) represents nonlinear currents that mediate a fast activation to generate a spike, followed by an even faster inactivation and hyperpolarization to bring the potential back to V_0. V_M is completely determined by the membrane dynamics and Ã(t̃). The potential can be shifted and rescaled with v = (V − V_0)/(V_T − V_0), so that the threshold has a value of v = 1 and the reset potential is v = 0. We
also rescale time by t = t̃/τ_m, where τ_m = C/g_L. We then arrive at a rescaled system of the form

dv/dt = I − v + A(t),   (2.3)

where

I = (Ĩ − g_L V_0)/[g_L (V_T − V_0)],   (2.4)

and

A(t) = { v_A e^{ξt},   0 < t ≤ Δ
       { −(1 + v_M) δ(t − Δ),   Δ < t,   (2.5)
with v_A = V_A/(V_T − V_0), v_M = V_M/(V_T − V_0), Δ = Δ̃/τ_m, and ξ = ξ̃τ_m. The membrane equation has four dimensionless parameters: the applied current I, the spike rise rate ξ, the spike width Δ, and the spike amplitude scale v_A. The latter three parameters control the shape and amplitude of the spike. In this form, I must be larger than unity in order for the potential to reach the threshold for firing. To compute the scaled maximum amplitude v_M, we integrate equation 2.3 over the width of one spike. Taking initial conditions to be the potential at threshold (v = 1 at t = 0), we obtain

v(t) = 1 + (I − 1)(1 − e^{−t}) + [v_A/(1 + ξ)] (e^{ξt} − e^{−t}),   t < Δ,   (2.6)
with v_M defined by v_M = v(Δ) − 1. For a fast-rising and narrow spike (compared to the membrane timescale), we can assume v_M ≈ v_A e^{ξΔ}/(1 + ξ). Figure 1 shows four examples of the voltage traces for different parameters of the model given in equation 2.3.

2.2 Spike Response Model for Coupled Neurons. We use the spike response formalism for a system of two neurons coupled with resistive gap junctions. (We can also include synaptic coupling within this formalism.) In doing so, the effects of the gap junction coupling will be separated from the intrinsic dynamics. After the equations are derived, we will apply the techniques previously used to understand the dynamics of synaptically coupled neurons. We will also show how this method can be generalized to a network of N all-to-all coupled neurons. The equations are

dv_1/dt = I_1 − v_1 − g(v_1 − v_2) + Σ_l A(t − t_1^l),   (2.7)

dv_2/dt = I_2 − v_2 − g(v_2 − v_1) + Σ_l A(t − t_2^l),   (2.8)
Figure 1: Examples of voltage traces for the integrate-and-fire model with parameters (a) ξ = 50, Δ = 0.1, v_A = 1, I = 1.3, (b) ξ = 12, Δ = 0.1, v_A = 1, I = 1.3, (c) ξ = 12, Δ = 0.5, v_A = 1, I = 1.55, (d) ξ = 12, Δ = 0.5, v_A = 0.1, I = 1.55.
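As a check on the single-cell model, the closed-form trace during the spike and the fast-spike approximation for v_M can be compared; a minimal sketch (parameter values are those of Figure 1a; the solution of dv/dt = I − v + v_A e^{ξt} is written with v(0) = 1):

```python
import math

# Parameters of Figure 1a (taken from the caption).
xi, Delta, v_A, I = 50.0, 0.1, 1.0, 1.3

def v(t):
    # Solution of dv/dt = I - v + v_A*exp(xi*t) with v(0) = 1, valid for t < Delta.
    return 1 + (I - 1)*(1 - math.exp(-t)) + v_A/(1 + xi)*(math.exp(xi*t) - math.exp(-t))

v_M_exact = v(Delta) - 1.0                          # v_M = v(Delta) - 1
v_M_approx = v_A * math.exp(xi*Delta) / (1 + xi)    # fast, narrow-spike approximation
print(f"v_M exact = {v_M_exact:.3f}, approx = {v_M_approx:.3f}")
assert abs(v_M_exact - v_M_approx) < 0.05
```

For these values the approximation is accurate to well under one percent.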
where g is a gap junction strength (scaled by g_L) and t_i^l represents the times when v_i crosses the threshold from below. The spiking kernel A(t) generates a spike and resets the potential to zero. Our strategy is to express the dynamics for v_i in terms of a set of response kernels, which we will explicitly calculate. We first transform into normal modes v_+ = v_1 + v_2 and v_− = v_1 − v_2 to obtain

dv_+/dt = I_+ − v_+ + Σ_l A(t − t_1^l) + Σ_m A(t − t_2^m),   (2.9)

dv_−/dt = I_− − r v_− + Σ_l A(t − t_1^l) − Σ_m A(t − t_2^m),   (2.10)
where r = 1 + 2g, I_+ = I_1 + I_2, and I_− = I_1 − I_2. Integrating gives

v_+ = I_+ (1 − e^{−t}) + Σ_l g_+(t − t_1^l) + Σ_m g_+(t − t_2^m),   (2.11)
v_− = (I_−/r)(1 − e^{−rt}) + Σ_l g_−(t − t_1^l) − Σ_m g_−(t − t_2^m).   (2.12)
Since we are interested in the steady state, we can start the interaction at any initial conditions; we have taken initial conditions of v_+ = v_− = 0. The kernels are nonzero only for positive argument. They are given by

g_α(t) = ∫_0^t e^{−r_α(t − t′)} A(t′) dt′,   (2.13)
and after integrating,

g_α(t) = { v_A (r_α + ξ)^{−1} [e^{ξt} − e^{−r_α t}],   0 < t ≤ Δ
         { −[1 + v_M − g_α(Δ)] e^{−r_α(t − Δ)},   Δ < t,   (2.14)
where α = ±, r_+ = 1, and r_− = r. We note that for fast-rising and narrow spikes, v_M ≈ g_+(Δ). Returning to the original coordinates and assuming that the neurons have been spiking for a long time, so that initial conditions have decayed away, we obtain the spike response equations,

v_1(t) = Î_1 + Σ_l γ_s(t − t_1^l) + Σ_m γ_c(t − t_2^m),   (2.15)

v_2(t) = Î_2 + Σ_l γ_s(t − t_2^l) + Σ_m γ_c(t − t_1^m),   (2.16)

where
Î_1 = (1/2)(1 + 1/r) I_1 + (1/2)(1 − 1/r) I_2,   (2.17)

Î_2 = (1/2)(1 − 1/r) I_1 + (1/2)(1 + 1/r) I_2,   (2.18)
and

γ_s(t) = (1/2)[g_+(t) + g_−(t)],   (2.19)

γ_c(t) = (1/2)[g_+(t) − g_−(t)].   (2.20)
γ_s is the spike generation and reset kernel and γ_c is the coupling kernel in the spike response method (Gerstner et al., 1996; Chow, 1998). We note that Σ_m γ_c(t − t_i^m) can be interpreted as the contribution to the membrane potential due to the gap junction coupling. It is the gap junction analog of the postsynaptic potential.
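The coupling kernel and the parameter d_c = 1 + 2γ_c(Δ) can be evaluated directly from the kernel definitions; a minimal sketch under the parameters of Figure 2 (g = 0.5), comparing the exact d_c with the fast-spike approximation of equation 2.24:

```python
import math

# Illustrative parameters matching Figure 2 (g = 0.5 case).
xi, Delta, v_A, g = 50.0, 0.1, 1.0, 0.5
r = 1 + 2*g

def g_kernel(t, r_a):
    # g_alpha on 0 < t <= Delta, eq. 2.14 (rising part of the spike).
    return v_A / (r_a + xi) * (math.exp(xi*t) - math.exp(-r_a*t))

# gamma_c(Delta) from eq. 2.20, and d_c from eq. 2.22:
gamma_c_Delta = 0.5 * (g_kernel(Delta, 1.0) - g_kernel(Delta, r))
d_c = 1 + 2*gamma_c_Delta

# Fast-spike approximation, eq. 2.24, with v_M ~ v_A * e^{xi*Delta}/(1+xi):
v_M = v_A * math.exp(xi*Delta) / (1 + xi)
d_c_approx = 1 + v_M * (1 - (1 + xi)/(r + xi))
print(f"d_c = {d_c:.4f}, fast-spike approx = {d_c_approx:.4f}")
assert abs(d_c - d_c_approx) < 0.01
```

With these values d_c ≈ 1.05, i.e., a weakly coupled, "small spike" regime.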
Figure 2: (a) Two examples of the γ_s kernel with parameters Δ = 0.1, ξ = 50, v_A = 1, and gap junction conductances g = 0.5 (solid line) and g = 5 (dashed line). (b) Two examples of the γ_c kernel with Δ = 0.1, ξ = 50, v_A = 1, and gap junction conductances g = 0.5 (solid line) and g = 5 (dashed line).
Written out explicitly, the γ_s kernel is given by

γ_s(t) = { (v_A/2) {(1 + ξ)^{−1} [e^{ξt} − e^{−t}] + (r + ξ)^{−1} [e^{ξt} − e^{−rt}]},   0 < t ≤ Δ
         { −(1/2) [e^{−(t−Δ)} + d_c e^{−r(t−Δ)}],   Δ < t,   (2.21)

where

d_c = 1 + 2γ_c(Δ).   (2.22)
The γ_s kernel has the shape of a single spike, as seen in Figure 2a. Increasing the gap junction strength through r mainly affects the recovery phase.
The γ_c kernel is given by

γ_c(t) = { (v_A/2) {(1 + ξ)^{−1} [e^{ξt} − e^{−t}] − (r + ξ)^{−1} [e^{ξt} − e^{−rt}]},   0 < t ≤ Δ
         { −(1/2) [e^{−(t−Δ)} − d_c e^{−r(t−Δ)}],   Δ < t.   (2.23)
The γ_c kernel is the contribution to the electrical coupling due to a single spike. Examples of γ_c are shown in Figure 2b. For short times, γ_c is positive, rising until a cusp, which occurs at the peak of the spike. After the cusp, γ_c begins to decrease, becoming negative for long times. The sum over the γ_c kernels can be thought of as setting a background potential level on which the spiking takes place. The parameter d_c, a measure of the maximum amount of excitatory input received, is important for controlling the background potential level. Later, it will be seen that d_c is the critical parameter for determining the synchronizing behavior of the network. For fast-rising spikes, we can assume exp(ξΔ) ≫ exp(−Δ), leading to
d_c ≈ 1 + v_M [1 − (1 + ξ)/(r + ξ)],   (2.24)
where we have used v_M ≡ g_+(Δ) ≈ v_A e^{ξΔ}/(1 + ξ). d_c can be increased by increasing r through the gap junction strength g or by increasing the maximum spike amplitude v_M. (Note that v_M is measured relative to a fixed postspike hyperpolarization.) While increasing the spike rise rate ξ increases v_M, if v_M is kept fixed, increasing ξ actually decreases d_c. For weak coupling strength (g ≪ 1), d_c has the form
2gvM . 1Cj
(2.25)
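These kernel formulas are straightforward to evaluate numerically. The sketch below (Python; function names and defaults are ours, with parameter values borrowed from the Figure 2 caption) implements equations 2.21–2.23 and compares δ_χ from equation 2.22 against the fast-rise approximation, equation 2.24:

```python
import math

def chi(t, sign, v_A=1.0, gamma=50.0, g=0.5, delta=0.1):
    """Spike-response kernels of eqs. 2.21 and 2.23.

    sign=+1 gives chi_s, sign=-1 gives chi_c; rho = 1 + 2g.
    Defaults follow the Figure 2 caption (v_A = 1, gamma = 50, Delta = 0.1).
    """
    rho = 1.0 + 2.0 * g

    def rising(s, sgn):
        # common rising branch for 0 <= s <= Delta
        return (v_A / 2.0) * ((math.exp(gamma * s) - math.exp(-s)) / (1.0 + gamma)
                              + sgn * (math.exp(gamma * s) - math.exp(-rho * s)) / (rho + gamma))

    if t <= delta:
        return rising(t, sign)
    d_chi = 1.0 + 2.0 * rising(delta, -1)  # eq. 2.22: delta_chi = 1 + 2 chi_c(Delta)
    return -0.5 * (math.exp(-(t - delta)) + sign * d_chi * math.exp(-rho * (t - delta)))

def d_chi_exact(v_A=1.0, gamma=50.0, g=0.5, delta=0.1):
    """delta_chi from eq. 2.22."""
    return 1.0 + 2.0 * chi(delta, -1, v_A, gamma, g, delta)

def d_chi_fast_rise(v_A=1.0, gamma=50.0, g=0.5, delta=0.1):
    """Fast-rising-spike approximation, eq. 2.24, with v_M = v_A e^{gamma Delta}/(1 + gamma)."""
    rho = 1.0 + 2.0 * g
    v_M = v_A * math.exp(gamma * delta) / (1.0 + gamma)
    return 1.0 + v_M * (1.0 - (1.0 + gamma) / (rho + gamma))
```

With the Figure 2 parameters, the exact and approximate δ_χ agree to a few parts per thousand, and χ_c is positive on the rising phase and negative in the long-time tail, as described above.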
In the ensuing text, we will refer to spikes with δ_χ ≫ 1 as large spikes and spikes with δ_χ ∼ 1 as small spikes.

3 Phase-Locked States

In this section we apply the formalism of the previous section and analyze the existence and stability of phase-locked states for two neurons coupled through gap junctions. We focus on the states of synchrony (S) and antisynchrony (AS), although a third phase-locked state can also exist. We show that by changing the frequency of the network, the neurons can make transitions between these phase-locked states. In some instances, bistability is possible.

3.1 Conditions for Periodic Phase Locking. We are interested in the existence and stability of phase-locked periodic solutions. Here, we describe
Dynamics of Spiking Neurons
a self-consistent method for determining the period T of a periodic solution and the phase difference φ between the cells in that periodic solution. We show that there is a function G(φ, T) whose zeros for fixed T give the phase difference; there is a companion equation that is used to determine T. We consider two neurons firing in a phase-locked pattern. We analyze this situation by supposing that the neurons fire periodically at t_1^l = −lT and t_2^l = (φ − l)T. Without loss of generality, we assume that cell 1 is ahead of or at the same position as cell 2, that is, φ ≥ 0. The next time the ith neuron fires is when the potential v_i reaches threshold from below (i.e., v̇_i > 0). Thus, using the spike response equations, we can derive a condition for phase-locked firing. Noting that neuron 1 will next fire at t = 0 and neuron 2 will next fire at t = φT, we obtain from equations 2.15 and 2.16

v_1(0) = 1 = Î_1 + Σ_{l≥1} χ_s(lT) + Σ_{l≥1} χ_c(lT − φT),   (3.1)
v_2(φT) = 1 = Î_2 + Σ_{l≥1} χ_s(lT) + Σ_{l≥0} χ_c(lT + φT).   (3.2)
This is a set of two equations with two unknowns. Note that the second sum in equation 3.2 starts with the index l = 0 because cell 2 feels the effect of cell 1 in the first cycle, but not vice versa. We can rewrite this system in a more convenient form. Subtracting equation 3.2 from 3.1 yields

G(φ, T) = Î_1 − Î_2 = (I_1 − I_2)/(1 + 2g),   (3.3)

where

G(φ, T) = χ_c(φT) + Σ_{l≥1} [χ_c(lT + φT) − χ_c(lT − φT)].   (3.4)
This equation can be interpreted as a condition for the phase φ given a fixed period T (provided a solution with that T is possible). Adding equation 3.2 to 3.1 and dividing by two yields

F(φ, T) = 1 − Ī,   (3.5)

where

F(φ, T) = (1/2) χ_c(φT) + Σ_{l≥1} [ χ_s(lT) + (1/2)(χ_c(lT − φT) + χ_c(lT + φT)) ]   (3.6)

and Ī = (Î_1 + Î_2)/2 = (I_1 + I_2)/2. This equation gives a condition for the period T given a fixed phase φ.
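The locking functions can be evaluated directly by truncating the geometrically decaying sums. A minimal sketch (Python; function names and defaults are ours, with the kernels of equations 2.21 and 2.23):

```python
import math

def make_kernels(v_A=1.0, gamma=50.0, g=0.5, delta=0.1):
    """Return (chi_s, chi_c) per eqs. 2.21 and 2.23, with rho = 1 + 2g."""
    rho = 1.0 + 2.0 * g

    def rising(t, sgn):
        return (v_A / 2.0) * ((math.exp(gamma * t) - math.exp(-t)) / (1.0 + gamma)
                              + sgn * (math.exp(gamma * t) - math.exp(-rho * t)) / (rho + gamma))

    d_chi = 1.0 + 2.0 * rising(delta, -1)  # eq. 2.22

    def kernel(t, sgn):
        if t <= delta:
            return rising(t, sgn)
        return -0.5 * (math.exp(-(t - delta)) + sgn * d_chi * math.exp(-rho * (t - delta)))

    return (lambda t: kernel(t, +1)), (lambda t: kernel(t, -1))

def G(phi, T, chi_c, L=300):
    """Eq. 3.4, truncated after L terms."""
    total = chi_c(phi * T)
    for l in range(1, L + 1):
        total += chi_c(l * T + phi * T) - chi_c(l * T - phi * T)
    return total

def F(phi, T, chi_s, chi_c, L=300):
    """Eq. 3.6, truncated after L terms."""
    total = 0.5 * chi_c(phi * T)
    for l in range(1, L + 1):
        total += chi_s(l * T) + 0.5 * (chi_c(l * T - phi * T) + chi_c(l * T + phi * T))
    return total
```

For homogeneous input, G vanishes at φ = 0 and φ = 0.5, recovering S and AS as solutions of equation 3.3, and F(0, T) reproduces the single-neuron period condition derived below.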
Although equation 3.5 could possibly allow multiple solutions for a given phase φ, there is only one "physical" solution, which is given by the smallest solution for T. This is because the neuron is considered to fire and reset immediately on crossing the threshold from below. A condition that must also be satisfied is that a neuron remain subthreshold throughout its period. We examine this condition in detail for AS in section 3.3. For homogeneous systems (I_1 = I_2), S (φ = 0) and AS (φ = 0.5) both satisfy equation 3.3. For S, the period condition, equation 3.5, is the same as that for a single neuron. As we will show in section 3.2, a solution to equation 3.5 can almost always be found for S. However, the existence of AS is not guaranteed. As will be shown in section 3.3, a neuron may not remain subthreshold while the other neuron fires. Stability of these periodic phase-locked solutions can also be analyzed. In Chow (1998), a sufficient condition and a separate necessary condition for stable phase locking were derived. The sufficient condition for stability of a locked solution with phase φ is that the slopes of the kernels (χ_s and χ_c) are both positive at times t = lT ± φT; that is, they are rising at the time of firing (Gerstner et al., 1996; Chow, 1998). The necessary condition for stability is that the derivative of G(φ, T) with respect to φ must be positive (van Vreeswijk, Abbott, & Ermentrout, 1994; Chow, 1998).

3.2 Period of Phase Locking. The network period depends on the parameters and the phase-locked state. For a given fixed phase φ, the period T is determined implicitly from equation 3.5. Here we will show that for S, a solution for the period can always be found if the input current is suprathreshold and not too large. However, for AS, condition 3.5 may not always have a solution for T. We show that if I is too small (depending on the other parameters), there is no solution, implying that AS does not exist in that regime.

As will be seen in section 3.3, even if a solution to equation 3.5 exists, AS still may not exist if the value obtained for the period is too large. When both S and AS exist, we analyze equation 3.5 to see the relationship between their periods, T_S and T_AS, respectively. The result is that for large spikes, T_AS < T_S for fixed parameter values, but the opposite is true for small spikes. We also look at how the period changes with parameters. We show that increasing the gap junction strength g decreases T_AS for large spikes but increases it for small spikes. Since equation 3.5 is a continuous function of φ, T changes continuously with changing φ. For S, using equations 2.19 and 2.20, equation 3.6 becomes
F(0, T_S) = Σ_{l≥1} g_+(lT_S).   (3.7)

Inserting the kernel g_+ from equation 2.14 into 3.7 and evaluating the sums, condition 3.5 becomes

F(0, T_S) = e^{Δ}/(1 − e^{T_S}) = 1 − Ī   (3.8)
for T_S > Δ. It can be easily shown that this is the period condition for an uncoupled neuron (Chow, 1998; Chow, White, Ritt, & Kopell, 1998). The period is defined implicitly by equation 3.8. The function F(0, T_S) is negative and monotone increasing in T_S. For 1 < Ī ≤ 1 + (1 − e^{−Δ})⁻¹ there is always a solution for T_S. (We note that the period is well defined only for T_S > Δ.) Solving equation 3.8 for T_S yields

T_S = ln( (Ī − 1 + e^{Δ}) / (Ī − 1) ).   (3.9)

The threshold for firing is Ī = 1. The period decreases with increasing Ī. For AS, equation 3.6 gives

F(1/2, T_AS) = Σ_{l≥1} χ_s(lT_AS) + Σ_{m≥0} χ_c(mT_AS + T_AS/2).   (3.10)
Inserting the kernels 2.21 and 2.23 for t > Δ into equation 3.10 and evaluating the sums leads to the period condition, equation 3.5:

F(1/2, T_AS) = (δ_χ/2) · e^{ρΔ}/(e^{ρT_AS/2} + 1) − (1/2) · e^{Δ}/(e^{T_AS/2} − 1) = 1 − Ī   (3.11)
for T_AS > 2Δ. The smallest T_AS satisfying equation 3.11 is the only physical solution. There are situations where a solution to equation 3.11 does not exist. From that equation, we see that for a given Ī, for δ_χ large enough and ρ small enough, T_AS does not exist, since F(1/2, T_AS) stays above 1 − Ī. By the same reasoning, solutions for T_AS can then exist for subthreshold input (Ī < 1); the excitation provided by each neuron through the gap junction would sustain firing. Except for cases where δ_χ and ρ are very large, near the smallest solution of equation 3.11 we have ∂F(1/2, T_AS)/∂T > 0. This implies that increasing Ī decreases the period. From equation 3.11 we see that increasing δ_χ increases F(1/2, T_AS) and thus decreases T_AS. This can be understood heuristically, since the spike contributes an effective excitation through the gap junction and δ_χ increases with increasing width and amplitude of the spike. We note that δ_χ increases when ρ = 1 + 2g increases. On the other hand, the factor e^{ρΔ}/(e^{ρT_AS/2} + 1) increases only if T_AS < 2Δ and decreases otherwise. However, for relatively small ρ and T_AS (recall that small is in comparison to the effective leak time), the factor decreases slowly. Thus, for cells with large spikes and periods that are fast compared to the leak time, increasing ρ generally decreases the period for AS, while for cells with small spikes and long periods, increasing ρ increases the period. We can compare the periods of AS to S by rewriting F(1/2, T) as

F(1/2, T) = F(0, T) + ψ(T)   (3.12)
where

ψ(T) = (δ_χ/2) · e^{ρΔ}/(e^{ρT/2} + 1) − (1/2) · e^{Δ}/(e^{T/2} + 1).   (3.13)
Comparing to equation 3.8, we find T_S = T_AS when ψ = 0. For a given period, there is a set of possible parameters δ_χ and ρ for which ψ = 0. The trivial solution is δ_χ = 1 and ρ = 1, which corresponds to uncoupled neurons. For weak coupling and a small spike amplitude, T_S ∼ T_AS. If ψ > 0, then T_S > T_AS, and vice versa. From equation 3.13, we find that ψ can be positive if δ_χ is large enough. Again, this can be understood heuristically. For strong spikes and weak coupling, AS has a shorter period, but for weak spikes and strong coupling, the opposite is true.

3.3 Existence of the Antisynchronous State. In order for AS to exist, the potential of the neurons must remain below threshold for the duration of a period. This can be violated if the spikes have large enough amplitude or the electrical coupling is strong enough that when one of the neurons spikes, it induces the second neuron to cross threshold; that is, as soon as one of the neurons fires, the other will be induced to fire nearly immediately, and the neurons will tend to synchronize. This effect is similar to fast threshold modulation (Somers & Kopell, 1993). The spike mediated through the gap junction acts like a brief excitatory pulse that synchronizes the two neurons, prohibiting antisynchrony. We can make this more concrete by considering AS where t_1^l = −lT and t_2^l = T/2 − lT. The potential of neuron 1 obeys

v_1(t) = I_1 + Σ_{l≥0} [χ_s(t + lT) + χ_c(t − T/2 + lT)],   t > 0.   (3.14)
We require v_1(t) < 1 for 0 < t < T. When the spike of neuron 2 is felt by neuron 1 at t = T/2 + Δ, there is a chance that neuron 1 may be induced to fire. To prevent this, we must have

v_1(T/2 + Δ) = I_1 + Σ_{l≥0} [χ_s(T/2 + Δ + lT) + χ_c(Δ + lT)] < 1.   (3.15)
Using equations 2.21 and 2.23 in 3.15 and evaluating the sums yields an upper bound on δ_χ for the existence of AS. For the synchronous state, evaluating the corresponding sum gives

G′(0, T)/T = e^{Δ}/(e^{T} − 1) − δ_χ ρ e^{ρΔ}/(e^{ρT} − 1).   (3.22)
Note that δ_χ > 1, since χ_c(Δ) > 0 and ρ > 1. Thus, for T small enough, δ_χ large enough, and ρ not too large, the second term will dominate the first term in equation 3.22, and G′(0, T) will become negative, indicating the instability of S. The lower bound of the period for stable S is given by the critical period T_C^S, which satisfies G′(0, T_C^S) = 0. For T below T_C^S, the necessary condition for stability is violated, and stable S cannot exist. We examine how T_C^S behaves when we change the parameters. The critical condition (see equation 3.22) can be rearranged to take the form

(e^{ρT_C^S} − 1)/(e^{T_C^S} − 1) = ρ δ_χ e^{(ρ−1)Δ}.   (3.23)
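Because the left-hand side of equation 3.23 is monotone increasing in T_C^S, the critical period can be found by bisection. A sketch (parameter values illustrative):

```python
import math

def critical_period_S(g, d_chi, delta, lo=1e-9, hi=50.0, iters=100):
    """Solve eq. 3.23 for T_C^S by bisection.

    The LHS is monotone increasing in T (for rho > 1), so bisection on
    f(T) = LHS - RHS converges; keep rho*hi modest to avoid overflow.
    """
    rho = 1.0 + 2.0 * g
    target = rho * d_chi * math.exp((rho - 1.0) * delta)

    def f(T):
        return (math.exp(rho * T) - 1.0) / (math.exp(T) - 1.0) - target

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For g = 0.5 (ρ = 2), equation 3.23 reduces to the closed form e^{T} + 1 = 2δ_χ e^{Δ}, which the solver reproduces; T_C^S also increases with δ_χ and Δ, as discussed next.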
For large ρ, T_C^S decreases with increasing ρ. Thus, for very strong gap junctions, the lower bound is the width of the spike, implying that synchrony can be stable at any allowable period (periods smaller than the width of the spike are not allowable in our model). This synchronizing tendency is the general presumption of gap junctions. However, if the gap junction is not strong, then there can be a range of frequencies for which S is unstable. The left-hand side of equation 3.23 is monotone increasing in T_C^S. Thus, T_C^S increases with increasing δ_χ or Δ. The conditions for stable synchrony depend importantly on the spike width Δ and on δ_χ. As either of these is increased, a longer period (lower frequency) is required for stable S. A larger-amplitude spike leads to a larger δ_χ and hence makes it more difficult to synchronize two neurons, in the sense that the range of frequencies where they can synchronize is reduced. We now consider sufficient conditions for stability. In Chow (1998), it was shown that χ̇_s(lT) > 0, χ̇_c(0) ≥ 0, and χ̇_c(mT) > 0, for l ≥ 0, m ≥ 1, is sufficient for stability. From equation 2.21 and Figure 2a, the slope of χ_s is always positive. From equation 2.23 and Figure 2b, the slope of χ_c is negative between the maximum at t = Δ and the minimum at t = t_min but positive everywhere else. Minimizing equation 2.23 for t > Δ gives

t_min = Δ + (1/2g) ln[δ_χ(1 + 2g)].   (3.24)
Recall that χ̇_c(0) = 0. For T > t_min, we have χ̇_c(mT) > 0, ensuring stability. For T < t_min, the slope of χ_c(T) can be negative, and stability is no longer ensured. (Note that T > Δ; the period must be longer than the width of a spike.) For strong gap junctions, t_min approaches Δ. This implies that as the gap junction becomes stronger, S is guaranteed to be stable at higher and higher frequencies. For g ≪ 1, we can use equation 2.25 for δ_χ in 3.24 and expand to obtain

t_min ≈ Δ + 1 + v_M/(1 + γ).   (3.25)
Thus, for weak gap junction strength, g has no effect on t_min at linear order.

3.6 Antisynchrony. The behavior of G(φ, T) indicates that AS could be stable for either long periods or very short periods. We now investigate these possibilities in more detail. A necessary condition for stability of AS is that G′(0.5, T) > 0. The sufficient conditions for stability are given by χ̇_s(lT + T/2) > 0 and χ̇_c(lT + T/2) > 0. From equation 3.4, we find that

G′(0.5, T)/T = 2 Σ_{l≥0} χ̇_c(lT + T/2).   (3.26)
After evaluating the sum, we obtain

G′(0.5, T)/T = 2χ̇_c(T/2) + e^{−(3T/2−Δ)}/(1 − e^{−T}) − δ_χ ρ e^{−ρ(3T/2−Δ)}/(1 − e^{−ρT}),   T ≤ 2Δ,
G′(0.5, T)/T = e^{−(T/2−Δ)}/(1 − e^{−T}) − δ_χ ρ e^{−ρ(T/2−Δ)}/(1 − e^{−ρT}),   T > 2Δ,   (3.27)
where

χ̇_c(T/2) = (v_A/2) [ (γ e^{γT/2} + e^{−T/2})/(1 + γ) − (γ e^{γT/2} + ρ e^{−ρT/2})/(ρ + γ) ],   T ≤ 2Δ.   (3.28)
First consider the behavior for T ≤ 2Δ ≡ T_C^{AS2}. This is the situation where the spikes are overlapped: one neuron reaches threshold while the other neuron is still in a spiking phase. From equation 3.27, and as seen in Figure 3d, AS can be stable for T = T_C^{AS2}, provided ρ is small enough. However, χ̇_c(0) = 0 and is monotone increasing in T until T = 2Δ. Thus, if T is reduced, there is a bifurcation at T_C^{AS3} when G′(0.5, T_C^{AS3}) = 0 (not shown in Figure 3). Stability of AS is possible for T_C^{AS3} < T ≤ T_C^{AS2}. For small ρ, increasing ρ will increase T_C^{AS3}. Thus, the regime for stable AS at these short periods is reduced and could possibly be eliminated by increasing the gap junction strength. The sufficient conditions for stability are satisfied if T/2 > t_min, where t_min is given in equation 3.24. This sets a minimum on the period that depends on δ_χ. In the short period regime, we have shown that AS could satisfy the necessary condition for stability when the period is reduced to less than twice the width of the spike. However, at this point the sufficient conditions are not satisfied, since T/2 < Δ < t_min. Although stable solutions are possible if the sufficient condition is violated, it may be that for extremely high frequencies neither S nor AS is stable. For T slightly larger than 2Δ, G′(0.5, T) is negative. Once the spikes are no longer overlapped, AS immediately loses stability. This bifurcation is seen in Figures 3c and 3d. The bifurcation is discontinuous because the
spike shapes are not smooth. Had a smoother spike shape been chosen, the transition from stability to instability would be continuous. As the period is increased, the necessary condition for AS stability begins to be satisfied at the critical period T_C^{AS1}, satisfying G′(0.5, T_C^{AS1}) = 0. Thus, T_C^{AS1} gives a lower bound for stable AS in the long period regime (T > 2Δ). Applying this to equation 3.27 in this regime gives the following condition for this lower bound:

sinh(ρT_C^{AS1}/2) / sinh(T_C^{AS1}/2) = ρ δ_χ e^{(ρ−1)Δ}.   (3.29)
The left-hand side of equation 3.29 is monotone increasing in T_C^{AS1}, so increasing δ_χ or Δ increases T_C^{AS1}. Thus, large spikes increase the lower bound on period for stable AS in the long period range. For strong gap junctions, T_C^{AS1} decreases, and in the limit ρ → ∞, T_C^{AS1} → 2Δ. The maximum period allowed for existence of AS, as given by equation 3.18, also decreases as ρ is increased, and at a slightly faster rate. This implies that as the gap junction strength increases, AS can be stable at shorter periods in the long period regime, although it likely loses existence before the period can get too short. To summarize, the necessary condition for stability of AS is satisfied for periods longer than T_C^{AS1}, provided the gap junction is not too strong. It is also satisfied for short periods between T_C^{AS3} and T_C^{AS2}, although the sufficient conditions are not satisfied. This region is reduced with increasing gap junction strength. For long periods, existence is lost if the gap junction is too strong or the period is too long. Results from section 3.5 lead to the conclusion that in the short period regime, large spikes diminish the range of periods over which synchrony is stable. From this section we see that in the long period regime, large spikes diminish the range over which AS is stable. These results are borne out in numerical solutions, as will be shown in section 3.9.

3.7 Bifurcations for I and g. In section 3.4, we showed that for weak gap junctions, the bifurcation sequence shown in Figure 4 takes place for changes in the period T. We now relate this sequence to bifurcations using as parameters the applied current I and the gap junction strength g. This is accomplished by identifying the locations of the four bifurcation points: T_C^{AS1}, T_C^{AS2}, T_C^{AS3}, and T_C^S. In section 3.2 we showed that increasing the applied current I decreases the period T. Thus, moving from right to left on the bifurcation plot in Figure 4 corresponds to increasing I. However, for a fixed I, the states S, AS, and the third mode will have different periods, so the diagram will not carry over directly. The bifurcation sequence will vary according to whether the period decreases or increases as the phase of the periodic locked state moves
Figure 5: Bifurcation sequence for phase-locked states with phase φ and applied current I for large spikes. Solid lines indicate stable states, dashed lines unstable, and dotted lines unknown behavior. Moving from left to right corresponds to decreasing period. There are four bifurcation points: I_C^{AS1}, I_C^{AS2}, I_C^{AS3}, and I_C^S.
away from φ = 0. For each T bifurcation point in Figure 4, there is a corresponding I bifurcation point, which we label I_C^{AS1}, I_C^{AS2}, I_C^{AS3}, and I_C^S. The bifurcation diagram for large spikes is shown in Figure 5. In this case, the period decreases with increasing phase (i.e., for a fixed I, AS has a shorter period than the third mode, which has a shorter period than S). Thus, increasing I will lead to almost the same bifurcation sequence as decreasing T. We simply reverse the plot in Figure 4. The one difference is that for very large I, beyond the bifurcation point T_C^{AS3}, AS cannot bifurcate to the third mode, because the latter has a longer period and may not exist at the bifurcation point. AS may lose stability to a nonperiodic or non-phase-locked state that our analysis does not consider. The situation changes for the bifurcation diagram for small spikes (δ_χ small), which is shown in Figure 6. In this case, for fixed parameters, S has a shorter period than the third mode and AS. The bifurcation picture given in Figure 4 breaks down, since the S state cannot bifurcate into the third mode if the third mode does not exist at that parameter. The third mode could exist for higher values of I, but it may not be connected to S through a simple bifurcation. There may also be nonperiodic or non-phase-locked behavior. At very large I, AS could possibly exist and bifurcate into the third mode. However, the sufficient conditions for stable AS are not satisfied in this regime. Recall from the previous two sections that the critical points T_C move toward larger values as δ_χ and Δ are increased (except for T_C^{AS2} = 2Δ, which changes only with changing Δ). This implies that the corresponding critical points in the I diagram (I_C) move toward lower values. Thus, for larger δ_χ and Δ, the bifurcation points have lower values of I if the spikes are large and higher values if the spikes are small. The bifurcation sequences for g
Figure 6: Bifurcation sequence for phase-locked states with phase φ and applied current I for small spikes. Solid lines indicate stable states, dashed lines unstable, and dotted lines unknown behavior.
can be deduced by examining Figures 5 and 6. As the gap junction strength g is increased, the critical points T_C will decrease (and the I_C will increase). Thus, the bifurcations occur at a higher value of I. Hence, the region for stable S increases. The region in which the necessary condition for AS stability is satisfied increases in the low-I regime. However, as discussed in section 3.3, AS also loses existence at lower I. Thus, for very strong g, only S will exist stably. As g is decreased, the state can bifurcate into AS or the third mode, depending on the value of I.

3.8 Adding Heterogeneity. With the addition of heterogeneity, the locking condition (see equation 3.3) shows that neither S nor AS is a solution. However, it was shown in Chow (1998) that near synchrony and near antisynchrony are possible for weak enough heterogeneity. As long as G′(φ, T) is positive at the phase-locked solution, stability is possible. The stronger the gap junction is, the more likely the two neurons will synchronize. This is expected heuristically but can also be seen from the behavior of G(φ, T). Recall from equation 3.4 that G(φ, T) is constructed from a sum over χ_c(t) at periodic intervals. As the gap junction strength increases, χ_c(t) begins to look more and more like g_+(t), which has positive slope everywhere. Since χ̇_c(0) = 0, it does not contribute to stability of the synchronous state at φ = 0. However, χ̇_c(t) increases with t for the duration of the spike. This can cause G(φ, T) to rise fairly steeply for φ > 0. So for φ near to but away from zero, the spike contributes positively to G′(φ, T). This also implies that a fast-rising spike may make synchrony easier to maintain in the presence of heterogeneity.

3.9 Application to Biophysical Neuron Models. We compared the predictions of our analysis of the integrate-and-fire model to biophysical conductance-based neuron models. We considered two different models that represent a range of possible spike shapes. Figure 7 shows examples of the spike forms for an interneuron model of White, Chow, Ritt, Soto-Trevino, and Kopell (1998) and a reduced Traub-Miles (RTM) model (Traub, Jefferys, & Whittington, 1997; Ermentrout & Kopell, 1998). The dynamical equations are given in the appendix.

Figure 7: (a) Voltage trace for the interneuron model with Ĩ = 0. (b) Voltage trace for the reduced Traub-Miles model with Ĩ = 2.0.

The interneuron model has a wider and larger spike (the size of the spike is measured in reference to a postspike hyperpolarization) than the RTM model. This implies that the interneuron model has a larger δ_χ than the RTM model. In the RTM model, the spiking currents play a very small role in the recovery phase, and so a passive decay to threshold is a good approximation. However, for the interneuron model, the spiking currents seem to play an important role in the recovery. From the figures, it appears that the effective leak time of the interneuron model is longer than that of the RTM model. Recall that time has been scaled by the leak
Table 1: Period T and Stable States for the Interneuron Model with g = 0.025 as a Function of the Applied Current I.

I             T          State
< −0.55       > 68.1     S (only)
−0.55–1.65    68.1–9.6   S
−0.55–1.65    24.0–7.7   AS
1.65–19.0     9.6–3.4    S
19.0–23.0     3.4–3.1    Third mode
> 23.0        < 3.1      AS
25.0          2.7        AS
> 25.4        —          Nonperiodic
time in the analysis. Thus, for a fixed period, lengthening the leak time (i.e., reducing the leak rate) effectively shortens the scaled period. We looked for steady-state behavior over a range of applied currents I and gap junction conductances g, and we ran for many hundreds of periods to ensure that the observed states were not transient. We first summarize the analysis of our simplified model. For very strong gap junctions, synchrony is the only state that can exist. For weak gap junctions, a bifurcation sequence can be observed as the applied current is changed. The bifurcation sequence will be different for large and small spikes, as seen in Figures 5 and 6. We predict that the interneuron model should behave like a large-spike cell and the RTM model like a small-spike cell. From section 3.2, we predict that increasing the gap junction strength should decrease the period of AS for the interneuron model, since the effective leak time of the interneuron model is long and δ_χ is large. On the other hand, it should increase the RTM period, since the leak time is short and δ_χ is small. The same arguments also predict that for a fixed set of neuronal parameters, the period of S will be longer than the period of AS for the interneuron model, but the opposite will hold for the RTM model. For the interneuron model with weak coupling (g = 0.025), we numerically found a bifurcation sequence (see Table 1) that matched our analytical results. For low levels of applied current, only the synchronous (S) state could be found. At a current of I = −0.55, AS appeared. S and AS coexisted for −0.55 ≤ I < 1.65. An example of AS in the bistable regime is shown in Figure 8. At I = 1.65, AS lost stability. S persisted with increasing current until I = 19, where it bifurcated into a stable third mode, as seen in Figure 9. The phase difference of the third mode approached φ = 0.5 until I = 23, when it bifurcated into AS. Figure 10 shows the voltage traces for AS at Ĩ = 25. At I ≈ 25.4, AS lost stability to a nonperiodic state (see Figure 11).
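For readers who want to experiment, this kind of transition can be probed qualitatively with a toy discretization. The sketch below is not the conductance-based model used here, nor the spike-response formalism: it is a crude Euler-integrated pair of integrate-and-fire cells in which a fired cell is held at a square "spike" plateau before resetting, so the partner feels a spike-shaped gap-junction current. All names and parameter values are ours and purely illustrative.

```python
import numpy as np

def simulate_pair(I=1.5, g=0.2, v_spike=3.0, width=0.2, t_end=60.0, dt=0.001,
                  v0=(0.0, 0.5)):
    """Two LIF cells, dv_i/dt = I - v_i - g (v_i - v_j), threshold 1, reset 0.

    After firing, a cell is clamped at v_spike for `width` time units (a crude
    square spike), so the partner receives a strong depolarizing current.
    """
    v = np.array(v0, dtype=float)
    plateau = [0.0, 0.0]            # time left in each cell's spike plateau
    spikes = ([], [])
    t = 0.0
    while t < t_end:
        dv = I - v - g * (v - v[::-1])   # leak + gap-junction coupling
        for i in range(2):
            if plateau[i] > 0.0:
                v[i] = v_spike           # mid-spike: clamp to the plateau
                plateau[i] -= dt
                if plateau[i] <= 0.0:
                    v[i] = 0.0           # post-spike reset
            else:
                v[i] += dt * dv[i]
                if v[i] >= 1.0:          # threshold crossing
                    spikes[i].append(t)
                    plateau[i] = width
        t += dt
    return spikes

def last_phase(spikes):
    """Phase difference of the most recent firing pair, folded into [0, 0.5]."""
    T = spikes[0][-1] - spikes[0][-2]
    phi = abs(spikes[0][-1] - spikes[1][-1]) / T % 1.0
    return min(phi, 1.0 - phi)
```

Sweeping I and g in such a toy and recording `last_phase` gives a crude analog of the scans summarized in Tables 1 and 2, though quantitative agreement with the analysis should not be expected.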
Figure 8: Interneuron model: Antisynchrony in the bistable regime, g = 0.025, Ĩ = −0.55, T = 24 ms.
Figure 9: Interneuron model: Third mode, g = 0.025, Ĩ = 23, T = 3.1.
Although they are difficult to perceive in the figure, the amplitudes of the spikes are slowly increasing and decreasing. As the gap junction strength was increased in any of the above states, the cells fell into S, as expected. The period of AS was always shorter than that of S in the bistable regime, also as expected. In the bistable regime, the periods differed
Figure 10: Interneuron model: Antisynchrony in the high-frequency regime, g = 0.025, Ĩ = 25, T = 2.7.
Figure 11: Interneuron model: Nonperiodic state in the high-frequency regime, g = 0.025, Ĩ = 25.4.
by roughly 20%. Increasing the gap junction strength shortened the AS period, also as predicted (see Figure 12). The bifurcation sequence for the RTM model with weak coupling (g = 0.01) is given in Table 2. For very low currents, only the synchronous state
Figure 12: AS period as a function of the gap junction strength for the interneuron model. The S period has the value 19.0 for I = 0 and 50.8 for I = −0.05 and does not change with g. AS has a shorter period than S for a fixed I and g.

Table 2: Period T and Stable States for the RTM Model with g = 0.01 as a Function of the Applied Current I.

I            T         State
< 0.08       > 114     S (only)
0.08–0.55    158–42    AS
0.08–0.55    114–39    S
> 0.55       < 39      S
Very large   —         Non-S

Note: The non-S behavior found at very high I was for smaller values of g.
existed. At Ĩ = 0.08 and a period of T = 158 ms, the AS state came into existence, and both S and AS were stable. Figure 13 shows an example of the AS state. At Ĩ = 0.55 the AS state became unstable. Our numerics for the RTM model found that synchrony extended to very high frequencies. This was to be expected, since the spike was extremely narrow. (We note that the sufficient conditions for stable AS are not satisfied for T < Δ, so AS need not be observed at high frequencies.) In our analytical model, the width Δ corresponds to the time for the spike to reach the peak from
Figure 13: Reduced Traub-Miles model: Antisynchrony in the bistable regime, g = 0.01, Ĩ = 0.55, T = 42 ms.
threshold. For the RTM model, this was on the order of 0.1 ms, which would require a biologically improbable period near T ≈ 0.2 ms, or 5000 Hz. On the other hand, the time spent above threshold for the spike was approximately 0.5 ms, which would require T ≈ 1 ms, or 1000 Hz. We were unable to drive the RTM model fast enough for S to lose stability. At very high applied current (I ≈ 1370, T ≈ 0.8 ms), a Hopf bifurcation took place, and continuous spiking was replaced by a steady state. (This is often observed in conductance-based models but does not occur in our analytical model.) However, when we lowered the gap junction strength to g = 0.0001, we were able to find nonsynchronous behavior for very large I. Figure 14 shows a third-mode state for I = 1000. In the bistable regime, the AS period was longer than the S period, opposite to the interneuron model and the predicted result for small spikes (i.e., δ_χ small). The periods of S and AS differed by roughly 10%. Increasing the gap junction strength increased the period of the AS state, also as expected (see Figure 15).

4 Larger Networks
We consider an all-to-all coupled network of N neurons with gap junctions. Although this is probably unrealistic for very large N, it does serve to gain some qualitative understanding of large network effects. We consider the
Figure 14: Reduced Traub-Miles model: Third mode for g = 0.0001, Ĩ = 1000, T = 5.761 ms.
Figure 15: AS period as a function of the gap junction strength for the RTM model. The S period has the value 41.8 for I = 0.3 and 55.8 for I = 0.5. AS has a longer period than S.
model

dv_i/dt = I − v_i − g Σ_{j≠i} (v_i − v_j) + Σ_l A(t − t_i^l),   (4.1)
where i is the neuron index. We consider a network of homogeneous neurons, although the analysis can be done as well with heterogeneous applied currents. We rewrite equation 4.1 in the form of a matrix equation,

dv⃗/dt = M · v⃗ + f⃗,   (4.2)

where

v⃗ = (v_1, v_2, …, v_N)ᵀ,   f⃗ = ( I + Σ_l A(t − t_1^l), I + Σ_l A(t − t_2^l), …, I + Σ_l A(t − t_N^l) )ᵀ,   (4.3)

and M is the N × N matrix with diagonal entries −1 − (N − 1)g and off-diagonal entries g:

M = [ −1 − (N−1)g        g        ⋯        g
           g        −1 − (N−1)g   ⋯        g
           ⋮             ⋮        ⋱        ⋮
           g             g        ⋯   −1 − (N−1)g ].   (4.4)
Proceeding as we had for two neurons, we diagonalize system 4.2. Due to the high degree of symmetry, the matrix M has only two distinct eigenvalues: −1 and −1 − Ng (which is N − 1 times degenerate). Transforming to the diagonalized system, integrating, and transforming back gives us the spike response form

v_i(t) = I + Σ_l C_s(t − t_i^l) + Σ_{j≠i} Σ_m C_c(t − t_j^m),   (4.5)

where

C_s(t) = g_+(t) + (N − 1) g_−(t),   (4.6)
C_c(t) = g_+(t) − g_−(t).   (4.7)
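The claimed eigenstructure of M is easy to verify numerically; a sketch (function name ours):

```python
import numpy as np

def coupling_matrix(N, g):
    """Matrix M of eq. 4.4: leak plus all-to-all gap-junction coupling."""
    return g * np.ones((N, N)) - (1.0 + N * g) * np.eye(N)

N, g = 5, 0.3
vals = np.sort(np.linalg.eigvalsh(coupling_matrix(N, g)))
# expect N - 1 copies of -1 - N g and a single eigenvalue -1 (the uniform mode)
```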
The kernel g_+(t) is identical to the two-neuron network kernel (see equation 2.14), and g_−(t) has the same form but with ρ_− = 1 + Ng. The equations are similar to those of the N-neuron synaptically coupled neuronal network studied in Chow (1998). We now consider periodic phase-locked states in a network of homogeneous neurons. Suppose the neurons fire at t_i^l = (φ_i − l)T. At threshold they satisfy the condition

1 = I + Σ_l C_s(lT) + Σ_{j,m} C_c(mT + (φ_i − φ_j)T).   (4.8)
Due to the permutation symmetry of the neuron index, any symmetric combination of the phases is a possible phase-locked periodic solution (Chow, 1998). Examples include the synchronous state, where all the neurons fire together; antisynchrony, where half the neurons fire and then the other half fires; the splay-phase state, where the neurons fire periodically in sequence; and clustered states. The synchronous solution always exists if the driving current is above threshold. The period is given by the single-neuron period. The analysis for the existence of the antisynchronous solution follows as in the two-neuron case. We now examine conditions under which the splay-phase solution can exist. The splay-phase state is defined by phases φ_i = i/N, where i varies from 0 to N − 1. The splay-phase state can exist if a solution for the period T can be found for

1 = I + Σ_{l≥1} C_s(lT) + Σ_{j,m≥1} C_c(mT − jT/N).   (4.9)
For simplicity, suppose the spikes are separated by at least $\Delta$ (i.e., spikes do not overlap). Then

$$C_s(t) = -e^{-(t-\Delta)} - (N-1)\,d_c\,e^{-r(t-\Delta)}, \qquad (4.10)$$
$$C_c(t) = -e^{-(t-\Delta)} + d_c\,e^{-r(t-\Delta)}. \qquad (4.11)$$
We may now substitute this into equation 4.9 and sum the resulting geometric series to obtain

$$I - 1 = \frac{e^{\Delta}}{e^{T} - 1} + \frac{e^{\Delta}}{e^{T/N} - 1} + (N-1)\,d_c\,\frac{e^{r\Delta}}{e^{rT} - 1} - d_c\,\frac{e^{r\Delta}}{e^{rT/N} - 1}. \qquad (4.12)$$
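As a sketch, the smallest root of equation 4.12 can be located numerically; the parameter values ($\Delta$, $d_c$, $r$) below are illustrative stand-ins, not values taken from the article.

```python
import math

def splay_rhs(T, N, Delta=0.1, dc=0.5, r=2.0):
    """Right-hand side of equation 4.12; Delta, dc, and r are illustrative
    stand-ins for the kernel parameters of equations 4.10-4.11."""
    eD, erD = math.exp(Delta), math.exp(r * Delta)
    return (eD / (math.exp(T) - 1.0)
            + eD / (math.exp(T / N) - 1.0)
            + (N - 1) * dc * erD / (math.exp(r * T) - 1.0)
            - dc * erD / (math.exp(r * T / N) - 1.0))

def splay_period(N, I, lo=0.5, hi=50.0, tol=1e-10, **kw):
    """Smallest T > 0 solving I - 1 = splay_rhs(T, N): scan upward for a
    sign change of the residual, then bisect.  Returns None if no root is
    bracketed in [lo, hi]."""
    f = lambda T: splay_rhs(T, N, **kw) - (I - 1.0)
    step, a = 0.01, lo
    while a + step <= hi and f(a) * f(a + step) > 0.0:
        a += step
    if a + step > hi:
        return None
    b = a + step
    while b - a > tol:                 # standard bisection on the bracket
        m = 0.5 * (a + b)
        if f(a) * f(m) <= 0.0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)
```

For these illustrative parameters the right-hand side decreases monotonically in T, so the scan-and-bisect search recovers the unique (hence smallest) root.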
The period of the splay-phase state is given by the smallest solution for T of equation 4.12. For a fixed N, a solution can always be found for some I and T. However, there will be a maximum period that can sustain the splay phase due to the fast threshold modulation effect, as described for antisynchrony in section 3.1; that is, the neurons must remain subthreshold when the other neurons fire. The sufficient conditions for stability are that $C_c'\!\left(mT - \frac{j}{N}T\right) > 0$ for all m and j (Chow, 1998). Thus the separation between the neurons must exceed $t_{\min}$ from equation 3.24 to ensure stability. This sets a minimum period of $T = N t_{\min}$. It also suggests that there is a region of allowed periods for a stable splay phase. If the period is too short, the state loses the (sufficient condition for) stability, and if the period is too long, it loses existence. The allowed period for the splay phase must also scale with the number of neurons in the network N, since the separation between the neurons must remain relatively constant independent of network size. This means that the applied
1672
Carson C. Chow and Nancy Kopell
current must be reduced as N increases in order to sustain a stable splay-phase state. Numerical simulations of the interneuron model confirmed these observations. We simulated networks of up to N = 8 neurons and found the existence of the splay-phase state only over a midrange of periods.

5 Discussion

5.1 Related Modeling Work. The first articles to point out that electrical coupling can be antisynchronizing were by Sherman and Rinzel (1992) and Sherman (1994), using simulations done in the context of pancreatic beta cells, which have bursting electrical behavior. Later work on this system (de Vries, Zhu, & Sherman, 1998) used the fact that the envelopes of the bursts were roughly sinusoidal in shape and used an analysis near a Hopf bifurcation to see how the coupling could destabilize synchrony. Two other papers (Han, Kurrer, & Kuramoto, 1995, 1997) showed that electrical coupling can give rise to antisynchronous solutions if the components of the cells have trajectories close to a homoclinic bifurcation. The analysis that we do in this article concerns networks of spiking neurons, focusing on the shapes of the spikes. The most similar work deals with excitatory and inhibitory chemical synapses, using the spike response method (Gerstner & van Hemmen, 1992; Gerstner, 1995; Gerstner et al., 1996; Chow, 1998). These and other related models (van Vreeswijk et al., 1994; Hansel et al., 1995; Bressloff & Coombes, 1998, 1999) showed that for integrate-and-fire models, inhibitory synapses can stabilize synchrony (provided the timescales of rise and fall of the synapse are slow enough), while excitation generally destabilizes the synchronous solution. (See also Terman, Kopell, & Bose, 1998, and Bose, Terman, & Kopell, in press, for related results about inhibition and excitation in bursting neurons.) One way to understand intuitively the result that gap junctions can be antisynchronizing is to think of the effects of the electrical coupling as combining those of excitation and inhibition.
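This excitation-then-inhibition combination can be checked in a toy calculation; the waveform below is a stylized periodic spike followed by a decaying afterhyperpolarization, and every number in it is an illustrative assumption rather than part of the article's model.

```python
import math

def spike_waveform(t, period=10.0, width=1.0, amp=100.0, ahp=-15.0, rest=-65.0):
    """Stylized periodic membrane potential (mV): a spike of height amp
    above rest lasting `width` ms, then an exponentially decaying
    afterhyperpolarization of depth ahp."""
    ph = t % period
    if ph < width:
        return rest + amp * math.sin(math.pi * ph / width)
    return rest + ahp * math.exp(-(ph - width) / 3.0)

def gap_current(t, lag=2.0, g=0.1):
    """Gap-junction current g*(V_A - V_B) into a cell B whose waveform
    lags the leading cell A by `lag` ms."""
    return g * (spike_waveform(t) - spike_waveform(t - lag))
```

During A's spike (e.g., t = 0.5 ms) the current into the lagging cell is strongly positive, acting like excitation; once A has entered its afterhyperpolarization (e.g., t = 1.5 ms) the current reverses sign, acting like inhibition.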
To see this, consider the coupling currents for a nonrectified electrical synapse when the spikes are not (yet) synchronous. In the spike phase of the cycle of one cell, the coupling currents to the other cell are depolarizing, acting like excitation; in the postpolarization phase, these currents can be hyperpolarizing (depending on the phase of the other cell). This intuition suggests that wide and tall spikes should help to destabilize the synchronous state, whereas a long and deep postpolarization phase should encourage synchrony. This is indeed what the mathematics confirms, while adding rigor and details about the effects of spike sizes and shapes. This intuition also suggests why the frequency of the network plays a role in synchronization: as the frequency increases, the size of the interspike interval decreases much more than any change in the shape of the spike, favoring the effects of the spike over those of the postpolarization phase and thus favoring destabilization of the synchronous solution. This argument shows that one should not expect frequency to play an important role in the
stability of sinusoidal-like oscillators coupled electrically. There are several other articles on electrically coupled neurons (or compartments) that are related to the current work. Kepler, Marder, and Abbott (1990) considered the electrical coupling of a bursting cell and a passive cell to show that the coupling may increase or decrease the frequency of the oscillation, depending on the shape of the waveform of the oscillator. Kopell, Abbott, and Soto-Trevino (1998) showed that if one of the elements of the network is bistable and the other is an oscillator, the network can exhibit much greater complexity; they introduced new geometric techniques very different from the ones in this article or in Kepler et al. (1990). Manor, Rinzel, Segev, and Yarom (1997) showed that cells with heterogeneous properties, none of them oscillators, could produce an oscillation in an electrically coupled system that is the appropriate average of the dynamics. Electrical coupling is also relevant to understanding the dynamics of compartmental models in which the conductances vary between compartments (Booth & Rinzel, 1995; Li, Bertram, & Rinzel, 1996; Mainen & Sejnowski, 1996; Pinsky & Rinzel, 1994). Finally, we point out that the literature on neural oscillators is large and rapidly growing, so the above list constitutes only the work that is most directly related to the themes of this article. Ritz and Sejnowski (1997) provide a review of some recent work on neural oscillators.

5.2 Gap Junctions in Physiological Systems. Gap junctions are found in many tissues of the body, including the nervous system. (For reviews, see Bennett, 1997, and Dermietzel & Spray, 1993.) Even in the nervous system, electrical coupling is found in a wide variety of cells, including astrocytes and oligodendrocytes, as well as neurons.
Many functions have been attributed to these electrical synapses, including exchanges of metabolites and second messengers, and buffering the K+ activity surrounding active neurons (Dermietzel & Spray, 1993). For neurons, the most common function ascribed to electrical synapses is mediating synchrony among active cells or relaying signals quickly. Although gap junctions are thought to be most prevalent (at least in vertebrates) during early development, they are also known to exist in the adult mammalian central nervous system, for example, in the neocortex (Gibson, Beierlein, & Connors, 1999), the hippocampus (Draguhn, Traub, Schmitz, & Jefferys, 1998), the inferior olive (Llinas, Baker, & Sotelo, 1974), and the retina (Dowling, 1991). In this article, we have shown that electrical coupling can organize a rhythm to be asynchronous, especially at low coupling strengths. The ability of a collection of spiking neurons to synchronize is dependent on the size and shape of the spike form, as well as the frequency at which the cells are firing. At low coupling strengths and very high firing rates (dependent on the shape of the spike waveform), the synchronized state is unstable, and a pair of cells fires in antisynchrony. For a lower range of frequencies, the synchronized and antisynchronized states are bistable. For a population,
the network behavior can have the phases splaying out over the circle. One preparation in which there are well-documented gap junctional connections and oscillator cells firing at high rates is the pacemaker nucleus of the weakly electric fish, Apteronotus (Dye, 1988, 1991; Moortgat, Bullock, & Sejnowski, 2000a, 2000b). In this tissue, cells fire very precisely and synchronously (Dye, 1988). Evidence for the distribution of phases of the pacemaker cells of that nucleus is given in Figure 8 of Dye (1988). Pharmacological manipulations that increase the internal concentration of Ca2+ and are presumed to decrease the strength of the gap junctional coupling (Spray & Bennett, 1985) do partially desynchronize the population; however, the desynchronization may be due to differences in the natural frequencies rather than the ability of the weak coupling to actively desynchronize even identical cells, as discussed above. In contrast to Dye (1991), Moortgat et al. (2000a) observe that adding gap junction blockers results in a general reduction of the frequency of oscillation of the pacemaker nucleus. This is predicted above for the model with large spikes. Gap junctions have recently been documented within two distinct subsets of interneurons in the rat neocortex (Gibson et al., 1999), one of which (the fast-spiking interneurons) receives strong inputs from the thalamus. In recent work, Gibson and Beierlein (pers. comm.) have injected depolarizing current into pairs of electrically coupled fast-spiking interneurons to modulate their frequency. In one pair of cells at a frequency of 40 Hz, the pair oscillated synchronously; with further depolarization that caused the frequency of each cell to go up to approximately 100 Hz, the cells fired in antisynchrony. This is in agreement with predictions from our theory. However, other pairs did not show this effect. Another potential application of this work concerns bursting behavior of interneurons in the hippocampus.
In recent simulations of data from the work of Zhang et al. (1998), Skinner, Zhang, Velasquez, and Carlen (1999) investigated a model network of cells coupled by gap junctions and inhibitory synapses. Blocking the inhibition increased the frequency of each of the cells; at the higher frequencies, the spikes within a burst of the electrically coupled cells had an antisynchronous relationship. In a future publication, we will analyze this system and show that one important effect of the inhibition on the network can be to change its frequency, which affects what configurations are stable under the electrical coupling. The techniques used for the analysis are similar to those used in Chow (1998) to analyze the effects of chemical synapses on model neurons. For these neurons, the effects of the chemical and electrical synapses in the spike-response equations are additive. Hence, the same formalism can be used for situations involving both kinds of synapses acting in parallel. High-frequency spiking is also found in hippocampal interneurons during ripples, and gap junctional coupling is implicated in the coordination of these rhythms (Draguhn et al., 1998). More recent work (Traub, Schmitz, Jefferys, & Draguhn, 1999) suggests that the mechanism of production of the
rhythms involves spontaneous production of spikes in the axons, with transmittal through a sparsely connected network of axo-axonal connections. Although the mechanism that produces the frequency in that case is different (it depends on the network connectivity and the spontaneous rate of action potentials), it remains to be understood whether the mechanism that produces the coherence of the network is similar to the one described in this article. In any physiological situation, the network behavior is affected by heterogeneity, spatial structure of the neurons, connectivity of the network, and interaction with chemical synapses. The analysis presented considered only connections between somata and/or axons, in which spatial and delay effects are not included. For dendro-dendritic connections, such delays would be important. Although it is not within the scope of this article to present the complete analysis, we report here that a similar (but more complicated) analysis of neurons connected with gap junctions at the end of a passive dendrite leads to a change in the parameter ranges in which different configurations are stable. In particular, the regime in which asynchrony is stable is greatly increased and that in which synchrony is stable is reduced. We believe that this is due to the filtering properties of the dendrite, which change the shape of the spike at the synapse. The ability to fire in an asynchronous way increases the flexibility of network dynamics. This article provides insight into how such asynchrony can be fostered by electrical coupling, even in the absence of the above complexities, all of which create a much richer dynamical environment. It remains to understand how these extra features might interact with the mechanisms described in this article to allow flexible modulation of network behavior.

Appendix: Neuron Dynamics
In our simulations, we considered a network of N conductance-based single-compartment neuron models coupled electrically with gap junctions. The membrane potential obeyed the current balance equation,
$$C \frac{dV_i}{dt} = \tilde{I} - I_{Na} - I_K - I_L - \sum_{j \neq i} g\,(V_i - V_j), \qquad (A.1)$$
where i is the neuron index that runs from 1 to N, g is the gap junction conductance, $\tilde{I}$ is the applied current, $I_{Na} = g_{Na} m^3 h (V_i - V_{Na})$ and $I_K = g_K n^4 (V_i - V_K)$ are the spike-generating currents, $I_L = g_L (V_i - V_L)$ is the leak current, and $C = 1\ \mu\mathrm{F/cm^2}$. The interneuron model in White et al. (1998) used the following parameters: $g_{Na} = 30$ mS/cm², $g_K = 20$ mS/cm², $g_L = 0.1$ mS/cm², $V_{Na} = 45$ mV, $V_K = -80$ mV, and $V_L = -60$ mV. The activation variable m was assumed fast and substituted with its asymptotic value $m = m_\infty(v) = (1 + \exp[-0.08(v + 26)])^{-1}$. The gating variables h and n obey

$$\frac{dh}{dt} = \frac{h_\infty(v) - h}{\tau_h(v)}, \qquad \frac{dn}{dt} = \frac{n_\infty(v) - n}{\tau_n(v)}, \qquad (A.2)$$

with $h_\infty(v) = (1 + \exp[0.13(v + 38)])^{-1}$, $\tau_h(v) = 0.6/(1 + \exp[-0.12(v + 67)])$, $n_\infty(v) = (1 + \exp[-0.045(v + 10)])^{-1}$, and $\tau_n(v) = 0.5 + 2.0/(1 + \exp[0.045(v - 50)])$. The reduced Traub–Miles model (Traub et al., 1997; Ermentrout & Kopell, 1998) used the following parameters: $g_{Na} = 100$ mS/cm², $g_K = 80$ mS/cm², $g_L = 0.05$ mS/cm², $V_{Na} = 50$ mV, $V_K = -100$ mV, $V_L = -67$ mV; $m = m_\infty(v) = \tilde{\alpha}_m(v)/(\tilde{\alpha}_m(v) + \tilde{\beta}_m(v))$, where $\tilde{\alpha}_m(v) = 0.32(54 + v)/(1 - \exp(-(v + 54)/4))$ and $\tilde{\beta}_m(v) = 0.28(v + 27)/(\exp((v + 27)/5) - 1)$;

$$\frac{dn}{dt} = \tilde{\alpha}_n(v)(1 - n) - \tilde{\beta}_n(v)\,n, \qquad (A.3)$$

with $\tilde{\alpha}_n(v) = 0.032(v + 52)/(1 - \exp(-(v + 52)/5))$, $\tilde{\beta}_n(v) = 0.5\exp(-(v + 57)/40)$; $h = h_\infty(v) = \max[1 - 1.25n,\ 0]$. The ODEs were integrated using the CVODE method with the program XPPAUT written by G. B. Ermentrout and available online at http://www.pitt.edu/~phase/.

Acknowledgments
We thank John White and Bard Ermentrout for many helpful discussions. This work was supported by NIH grant K01 MH01508 (CC), NIMH grant 47150 (NK), NSF grant 9200131 (NK), and the A. P. Sloan Foundation (CC).

References

Bennett, M. (1997). Gap junctions as electrical synapses. J. Neurocytology, 26, 349–366.
Booth, V., & Rinzel, J. (1995). A minimal compartment model for a dendritic origin of bistability of motoneuron firing patterns. J. Comp. Neurosci., 2, 1–14.
Bose, A., Terman, D., & Kopell, N. (in press). Almost-synchronous solutions in networks of neurons for mutually coupled excitatory synapses. Physica D.
Bressloff, P. C., & Coombes, S. (1998). Desynchronization, mode locking, and bursting in strongly coupled integrate-and-fire oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Comp., 12, 91–129.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronous networks of inhibitory neurons. J. Comp. Neurosci., 5, 407–420.
Dermietzel, R., & Spray, D. (1993). Gap junction in the brain: Where, what type, how many and why? TINS, 16, 186–192.
de Vries, G., Zhu, H.-R., & Sherman, A. (1998). Diffusively coupled bursters: Effects of heterogeneity. Bull. Math. Biol., 60, 1167–1200.
Dowling, J. E. (1991). Retinal neuromodulation: The role of dopamine. Visual Neuroscience, 7, 87–97.
Draguhn, A., Traub, R. D., Schmitz, D., & Jefferys, J. G. R. (1998). Electrical coupling underlies high-frequency oscillations in the hippocampus in vitro. Nature, 394, 189–192.
Dye, J. (1988). An in vitro physiological preparation of a vertebrate communicatory behavior: Chirping in the weakly electric fish, Apteronotus. J. Comp. Physiol. A, 163, 445–458.
Dye, J. (1991). Ionic and synaptic mechanisms underlying a brainstem oscillator: An in vitro study of the pacemaker nucleus of Apteronotus. J. Comp. Physiol. A, 168, 521–532.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Nat. Acad. Sci. USA, 95, 1259–1264.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., van Hemmen, J. L., & Cowen, J. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Gibson, J. R., Beierlein, M., & Connors, B. (1999). Two networks of electrically coupled neurons in the neocortex. Nature, 402, 75–79.
Han, S. K., Kurrer, C., & Kuramoto, Y. (1995). Dephasing and bursting in coupled neural oscillators. Phys. Rev. Lett., 75, 3190–3193.
Han, S. K., Kurrer, C., & Kuramoto, Y. (1997). Diffusive interaction leading to dephasing of coupled neural oscillators. Int. J. Bifurcations and Chaos, 7, 869–876.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Kepler, T. B., Marder, E., & Abbott, L. F. (1990). The effect of electrical coupling on the frequency of model neuronal oscillators. Science, 6, 83–85.
Kopell, N., Abbott, L. F., & Soto-Trevino, C. (1998). On the behavior of a neural oscillator electrically coupled to a bistable element. Physica D, 121, 367–395.
Li, Y.-X., Bertram, R., & Rinzel, J. (1996). Modeling N-methyl-D-aspartate–induced bursting in dopamine neurons. Neuroscience, 71, 397–410.
Llinas, R., Baker, R., & Sotelo, C. (1974). Electrotonic coupling between neurons in the cat inferior olive. J. Neurophysiol., 37, 560–571.
Mainen, Z. F., & Sejnowski, T. J. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Manor, Y., Rinzel, J., Segev, I., & Yarom, Y. (1997). Low amplitude oscillation in the inferior olive: A model based on electrical coupling of neurons with heterogeneous channel densities. J. Neurophysiol., 77, 2736–2752.
Moortgat, K. T., Bullock, T. H., & Sejnowski, T. J. (2000a). Precision of the pacemaker nucleus in a weakly electric fish: Network vs. cellular influences. J. Neurophysiol., 83, 971–983.
Moortgat, K. T., Bullock, T. H., & Sejnowski, T. J. (2000b). Gap junction effects on precision and frequency of a model pacemaker network. J. Neurophysiol., 83, 984–997.
Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Comp. Neurosci., 1, 39–60.
Ritz, R., & Sejnowski, T. J. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Current Opinion in Neurobiology, 7, 536–546.
Sherman, A. (1994). Anti-phase, asymmetric and aperiodic oscillations in excitable cells—I. Coupled bursters. Bull. Math. Biology, 56, 811–835.
Sherman, A., & Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. U.S.A., 89, 2471–2474.
Skinner, F. K., Zhang, L., Velasquez, J. L. P., & Carlen, P. L. (1999). Bursting in inhibitory interneurons: A role for gap-junctional coupling. J. Neurophysiol., 81, 1274–1283.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407.
Spray, D. C., & Bennett, M. V. (1985). Physiology and pharmacology of gap junctions. Annu. Rev. Physiol., 47, 281–303.
Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1997). Simulation of gamma rhythms in networks of interneurons and pyramidal cells. J. Comp. Neurosci., 4, 141–150.
Traub, R. D., Schmitz, D., Jefferys, J. G. R., & Draguhn, A.
(1999). High-frequency population oscillations are predicted to occur in hippocampal pyramidal neuronal networks interconnected by axo-axonal gap junctions. Neuroscience, 92, 407–426.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
White, J. A., Chow, C. C., Ritt, J., Soto-Trevino, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16.
Zhang, Y., Velazquez, J. L. P., Tian, G. F., Wu, C. P., Skinner, F. K., Carlen, P. L., & Zhang, L. (1998). Slow oscillations (≤ 1 Hz) mediated by GABAergic interneuronal networks in rat hippocampus. J. Neurosci., 18, 9256–9268.

Received January 29, 1999; accepted September 2, 1999.
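As a minimal sketch of the appendix's gap-junction-coupled interneuron network (equations A.1–A.2, with the White et al. 1998 parameters): the applied current, coupling strength, initial conditions, forward-Euler integrator, and spike-detection threshold below are illustrative assumptions, not the CVODE/XPPAUT setup used in the article.

```python
import math

# Parameters of the White et al. (1998) interneuron model from the appendix.
gNa, gK, gL = 30.0, 20.0, 0.1      # mS/cm^2
VNa, VK, VL = 45.0, -80.0, -60.0   # mV
Cm = 1.0                           # uF/cm^2

def m_inf(v): return 1.0 / (1.0 + math.exp(-0.08 * (v + 26.0)))
def h_inf(v): return 1.0 / (1.0 + math.exp(0.13 * (v + 38.0)))
def tau_h(v): return 0.6 / (1.0 + math.exp(-0.12 * (v + 67.0)))
def n_inf(v): return 1.0 / (1.0 + math.exp(-0.045 * (v + 10.0)))
def tau_n(v): return 0.5 + 2.0 / (1.0 + math.exp(0.045 * (v - 50.0)))

def simulate(n_neurons=2, I_app=5.0, g_gap=0.02, t_max=100.0, dt=0.01,
             v_thresh=-20.0):
    """Forward-Euler integration of eq. A.1 with all-to-all gap junctions;
    returns, per neuron, the times of upward crossings of v_thresh."""
    v = [-64.0 + 5.0 * i for i in range(n_neurons)]  # desynchronized start
    h = [0.8] * n_neurons
    n = [0.1] * n_neurons
    spikes = [[] for _ in range(n_neurons)]
    above = [False] * n_neurons
    for k in range(int(t_max / dt)):
        t = k * dt
        dv = []
        for i in range(n_neurons):
            INa = gNa * m_inf(v[i]) ** 3 * h[i] * (v[i] - VNa)
            IK = gK * n[i] ** 4 * (v[i] - VK)
            IL = gL * (v[i] - VL)
            Igap = g_gap * sum(v[i] - v[j] for j in range(n_neurons) if j != i)
            dv.append((I_app - INa - IK - IL - Igap) / Cm)
        for i in range(n_neurons):
            h[i] += dt * (h_inf(v[i]) - h[i]) / tau_h(v[i])
            n[i] += dt * (n_inf(v[i]) - n[i]) / tau_n(v[i])
            v[i] += dt * dv[i]
            if v[i] > v_thresh and not above[i]:
                spikes[i].append(t)
            above[i] = v[i] > v_thresh
    return spikes
```

Comparing the offsets between the two cells' spike trains for different g_gap and I_app gives a quick way to probe the synchrony/antisynchrony regimes discussed above.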
LETTER
Communicated by Zachary Mainen
A Model for Fast Analog Computation Based on Unreliable Synapses

Wolfgang Maass
Thomas Natschläger
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria

We investigate through theoretical analysis and computer simulations the consequences of unreliable synapses for fast analog computations in networks of spiking neurons, with analog variables encoded by the current firing activities of pools of spiking neurons. Our results suggest a possible functional role for the well-established unreliability of synaptic transmission on the network level. We also investigate computations on time series and Hebbian learning in this context of space-rate coding in networks of spiking neurons with unreliable synapses.

1 Introduction
This article explores links between two levels of modeling computation in biological neural systems: the level of individual synapses and spiking neurons and the network level. Such links are of interest under the hypothesis that important aspects of the computational function of neuronal activity can be understood only on the larger scale of networks consisting of hundreds and more neurons. One particular challenge is the task to provide models for fast analog computation in neuronal systems that are consistent with experimental data. Thorpe, Fize, and Marlot (1996) and others have demonstrated that biological neural systems involving 10 or more synaptic stages are able to carry out complex computations within 100 to 150 ms. This cannot be explained through models based on an encoding of analog variables through firing rates of spiking neurons, since the firing rates in these neural systems are typically well below 100 Hz and interspike intervals are highly variable (Koch, 1999). In addition, it has recently been argued that synaptic depression makes firing rates above 20 Hz indistinguishable for the postsynaptic neurons (Abbott, Varela, Sen, & Nelson, 1997). One approach for explaining the possibility of fast analog computation relies on the assumption that the relevant analog variables are encoded in small temporal differences among the firing times of neurons (Thorpe et al., 1996; Hopfield, 1995; Maass, 1997). These models are able to explain the possibility of fast analog computation in networks where neuronal firing and synaptic transmission are highly reliable, or where the average firing

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1679–1704 (2000)
1680
W. Maass and T. Natschläger
Table 1: Relationships Between Details of Neuronal Hardware and Possible Computational Functions on the Network Level.

Neuron level: Synaptic unreliability
Network level: Graded response of firing activity in pools of neurons, yielding the power to approximate arbitrary continuous functions in space-rate coding

Neuron level: Decaying parts of excitatory and inhibitory postsynaptic potentials in combination with refractory effects and recurrent connections
Network level: Emulation of arbitrary given linear filters with finite and infinite impulse response with regard to space-rate coding

Neuron level: Time-sensitive rule for long-term potentiation
Network level: Hebbian learning for space-rate coding
times of pools of neurons encode analog variables on a timescale of a few milliseconds. Although some evidence for such coding has been found, a more common type of coding encountered in vertebrate cortex is a population coding where information about the stimulus or subsequent responses of the organism is encoded in a space-rate code, that is, in the fractions of neurons in various pools that fire within some short time interval (say, of length Δ between 5 and 10 ms). We refer to this coding scheme as space-rate coding. We analyze in this article possible links between well-known response characteristics of individual neurons and synapses on one hand, and computational functions of networks of neurons with regard to space-rate coding on the other hand, as indicated in Table 1. In particular we address the possible computational role of the unreliability of synaptic transmission, which has so far been ignored in this context. In section 2 we show that unreliability of synaptic transmission can be used as an essential ingredient of a model for fast analog computation in space-rate coding that spreads firing activity uniformly among all neurons in a pool. We exhibit a suitable tool from probability theory for analyzing large-scale effects of unreliable synaptic transmission (the Berry-Esseen theorem), and we analyze inherent noise sources of this computational model. Our theoretical analysis is complemented by results of computer simulations. In section 3 we turn to the question of what other types of computations (besides approximating arbitrary continuous functions between vectors of analog input and output values) can be induced by synaptic unreliability in combination with other details of the neuronal hardware on the larger scale of networks of neurons with space-rate coding. We show that for computations on time series, the interaction of the time courses of excitatory postsynaptic potentials (EPSPs) and inhibitory postsynaptic potentials (IPSPs),
Model for Fast Analog Computation
1681
refractory effects of neurons, and recurrent excitation and inhibition provides, in combination with unreliable synapses, the means for realizing in space-rate coding a rich class of linear filters with finite and infinite impulse response. Finally, in section 4 we briefly address the relationship between temporal learning rules for individual neurons and Hebbian learning in space-rate coding.

2 A Model for Fast Analog Computation in a Space-Rate Code
We say the analog variable x ∈ [0, 1] is encoded by a pool U of N neurons in a space-rate code if during a short time interval of length Δ, a total of Nx neurons fire. If the time interval is short enough, say Δ = 5 ms, one can assume that each neuron fires at most once during this time interval. Although there exists substantial empirical evidence that many cortical systems encode relevant analog variables by such a space-rate code, it has remained unclear how networks of spiking neurons compute in terms of such a code. Some of the difficulties become apparent if one just wants to understand, for example, how the trivial linear function f(x) = x/2 can be computed by such a network if the input x ∈ [0, 1] is encoded by a space-rate code in a pool U of neurons and the output f(x) ∈ [0, 1/2] is supposed to be encoded by a space-rate code in another pool V of neurons. If one assumes that all neurons in V have the same firing threshold and that reliable synaptic connections from all neurons in U to all neurons in V exist with approximately equal weights, a firing of a fraction x of neurons in U during a short time interval will typically trigger almost none or almost all neurons in V to fire, since they all receive about the same input from U. Several mechanisms have already been suggested that could in principle achieve a smooth, graded response in terms of a space-rate code in V instead of a binary all-or-none firing, such as strongly varying firing thresholds or different numbers of synaptic connections from U for different neurons v ∈ V (Wilson & Cowan, 1972). Neither option is completely satisfactory, since firing thresholds of biological neurons appear to be rather homogeneous, and both options would fail to spread average activity over all neurons in V. Hence they would fail to spread the energy consumption in the neuronal ensembles and also would make the computation less robust against failures of individual neurons.
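The all-or-none failure mode just described, and the way synaptic unreliability can repair it, can be illustrated with a toy simulation; the pool size, release probability, EPSP amplitude, and threshold below are illustrative choices, not parameters from the article.

```python
import random

def pool_response(x, N=200, r=0.15, amp=1.0, theta=15.0, seed=0):
    """Fraction y of neurons in pool V firing when a fraction x of the N
    neurons in pool U fires.  Every presynaptic neuron connects to every
    v in V, each synapse releases independently with probability r, and a
    neuron v fires if its summed EPSP reaches the threshold theta."""
    rng = random.Random(seed)
    n_active = round(x * N)
    fired = 0
    for _ in range(N):  # neurons v in pool V
        # number of successful releases onto this v (binomial count)
        releases = sum(1 for _ in range(n_active) if rng.random() < r)
        if releases * amp >= theta:
            fired += 1
    return fired / N
```

With fully reliable synapses (r = 1 and theta rescaled accordingly), every v receives identical input, so y jumps from 0 to 1 as x crosses the threshold, an all-or-none response. With r = 0.15, release counts fluctuate independently across the neurons in V, and y becomes a graded, roughly sigmoidal function of x.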
In this section we introduce an alternative model for analog computing in space-rate coding that takes the well-known stochastic properties of biological synapses into account. We show that the unreliability of synaptic transmission provides a very useful mechanism for reliable analog computation in space-rate coding. Our model does not require that the firing thresholds of the neurons involved be different, and it automatically spreads the firing activity uniformly over all neurons within a pool. In section 2.3, inherent noise sources of this model for analog computation are investigated. In section 2.5, it is shown that our model induces an analog version of the
familiar model of a synfire chain that is computationally more powerful and simultaneously more consistent with experimental data than the traditional "digital" version of a synfire chain.

2.1 Definition of the Computational Model. We start by addressing the most basic question concerning computations in space-rate coding: how can the firing activity in a pool V of neurons be related to the firing activity in n presynaptic pools U1, . . . , Un of neurons? For simplicity, we assume that n pools U1, . . . , Un consisting of N neurons each are given, and we also assume that all neurons in these pools have synaptic connections to all neurons in another pool V of N neurons.1 We write xi for the analog variable ranging over [0, 1] that is encoded by space-rate coding in pool Ui and y for the analog variable ranging over [0, 1] that is encoded by space-rate coding in pool V. We assume throughout this article that for each pool Ui, all neurons in Ui are excitatory or all neurons in Ui are inhibitory. By choosing the number n of pools Ui sufficiently large, one can approximate with this model another type of computational model where, instead of discrete pools of neurons that encode one analog variable each, one has a continuous spatial pattern of firing activity. We address in section 2.4 the question of what functions can be computed by multilayer feedforward networks in terms of space-rate coding, and we address in section 3 the question of which additional computational capabilities arise when one takes recurrent connections involving excitatory and inhibitory neurons into account. In accordance with standard results from neurophysiology, we assume in our model that an action potential ("spike") from a neuron u ∈ Ui triggers with a certain probability rvu ("nonfailure probability") the release of one or several vesicles filled with neurotransmitter at one or several release sites of the synapses between neurons u ∈ Ui and v ∈ V.
The data from Markram, Lübke, Frotscher, Roth, and Sakmann (1997), Larkman, Jack, and Stratford (1997), and Dobrunz and Stevens (1997) strongly suggest that in the case of a release, the amplitude of the resulting EPSP in neuron v is a stochastic quantity. Consequently we model the amplitude of the EPSP (or IPSP) in the case of a release by a random variable a_vu with probability density function φ_vu. Empirical data show that the variance of the distribution φ_vu is typically rather high (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997; Larkman et al., 1997; Stevens & Zador, 1998). This high variability can be traced back to two sources. First, it is uncertain whether a vesicle is released at a single synaptic release site in response to an action potential from the presynaptic neuron. Thus, the amplitude of the postsynaptic potential (PSP) in the postsynaptic neuron v varies in dependence of the actual number of vesicle releases that take place at the different synaptic release sites between neurons u and v. Second, even at a single release site, the amplitude of the EPSP ("quantal size") caused by a vesicle release at this site varies from trial to trial (Dobrunz & Stevens, 1997; Auger, Kondo, & Marty, 1998).

Most of our results (in particular the results in the following section) hold for arbitrary probability density functions φ_vu and arbitrary values of the parameters r_vu. In section 2.3 we address the question of what effect specific probability density functions φ_vu may have on analog computations in terms of space-rate coding. The equation for the probability density function φ_vu for the case of multiple release sites and a gaussian distribution of the quantal size at each release site, which is modeled in our computer simulations, is discussed in the appendix. Figure 1C shows an example of φ_vu for a synapse with five release sites.

¹ Our results remain valid for connection patterns given by sparser random graphs.

Model for Fast Analog Computation    1683

Figure 1: (A) Computer simulation of the model described in section 2.1 with a time interval Δ of length Δ = 5 ms for space-rate coding and neurons modeled by the spike-response model of Gerstner (1999b). We have chosen n = 6, a pool size N = 200, and ⟨w_1, ..., w_6⟩ = ⟨10, −20, −30, 40, 50, 60⟩ for the effective weights. Each dot is the result of a simulation with an input ⟨x_1, ..., x_6⟩ selected randomly from [0, 1]^6 in such a way that Σ_{i=1}^n w_i x_i covers the range [−10, 70] almost uniformly. The y-axis shows the fraction y of neurons in pool V that fire during a 5 ms time interval in response to the firing of a fraction x_i of neurons in pool U_i for i = 1, ..., 6 during an earlier time interval of length 5 ms. The solid line is a plot of σ(Σ_{i=1}^n w_i x_i) as described in the text (cf. equation 2.6). (B) Distribution of nonfailure probabilities r_vu for synapses between the pools U_4 and V underlying this simulation. (C) Example of a probability density function φ_vu of EPSP amplitudes as used for these simulations. This corresponds to a synapse with five release sites and a release probability of 0.3. For details about the simulation, see the appendix.
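The shape of such a multiple-release-site density can be sketched as follows. This is an illustrative Monte Carlo sketch in the spirit of the Figure 1C example (5 release sites, release probability 0.3); the quantal mean and standard deviation used here are assumptions, not the paper's simulation parameters.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of the amplitude density phi_vu for a synapse with several release
# sites: each site independently releases a vesicle with probability 0.3
# (as in the Figure 1C example), and each released quantum has a gaussian
# size. The quantal mean and sd below are illustrative assumptions.
sites, p_release = 5, 0.3
q_mean, q_sd = 1.0, 0.3

def epsp_amplitudes(trials):
    """Sample PSP amplitudes: sum of the quanta of the sites that released."""
    releases = rng.random((trials, sites)) < p_release
    quanta = rng.normal(q_mean, q_sd, size=(trials, sites))
    return (releases * quanta).sum(axis=1)

amps = epsp_amplitudes(100_000)
released = amps[amps != 0.0]          # condition on at least one release

# The resulting amplitude distribution is broad (high coefficient of
# variation), in line with the empirical variability discussed above
print(float(released.mean()), float(released.std()))
```

The broad, multimodal mixture over the number of released quanta is what produces the high variance of φ_vu noted in the text.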
1684    W. Maass and T. Natschläger
2.2 Graded Response Through Unreliable Synapses.

In this section we introduce analytical tools for estimating the firing activity y in pool V in terms of the firing activities x_1, ..., x_n in n presynaptic pools U_1, ..., U_n. We show that y can basically be approximated by a function of a weighted sum of the x_i, although some unexpected complications will turn up in this analysis. First, we demonstrate that the model defined above is indeed able to produce a graded output y encoded in a space-rate code in pool V such that the activity is uniformly distributed over all neurons v ∈ V. We prove this fact theoretically by employing a special version of the central limit theorem of probability theory, the Berry-Esseen theorem. Our theoretical results are then compared with computer simulations of our model.

Consider an idealized mathematical model where all neurons that fire in the pool U_i fire synchronously at time T_in, and the probability that a neuron v ∈ V fires (at time T_out) can be described by the probability that the sum h_v of the amplitudes of EPSPs and IPSPs resulting from firing of neurons in the pools U_1, ..., U_n exceeds the firing threshold Θ (which is assumed to be the same for all neurons v ∈ V). We assume in this section that the firing rates of neurons in pool V are relatively low, so that the impact of their refractory period can be neglected. We investigate refractory effects in section 3.

The random variable (r.v.) h_v is the sum of random variables h_vu for all neurons u ∈ ⋃_{i=1}^n U_i, where h_vu models the contribution of neuron u to h_v. We assume that h_vu is nonzero only if neuron u ∈ U_i fires at time T_in (which occurs with probability x_i)² and if the synapse between u and v releases one or several vesicles (which occurs with probability r_vu whenever u fires). If both events occur, then the value of h_vu is chosen according to some probability density function φ_vu.
The functions φ_vu, as well as the parameters r_vu, are allowed to vary arbitrarily for different pairs u, v of neurons. For each neuron v ∈ V, we consider the sum h_v = Σ_{i=1}^n Σ_{u∈U_i} h_vu of the r.v.'s h_vu, and we assume that v fires at time T_out if and only if h_v ≥ Θ. Although the r.v.'s h_vu may have quite different distributions (for example, due to different φ_vu and r_vu), their stochastic independence allows us to approximate the firing probability P{h_v ≥ Θ} through a normal distribution Φ. The Berry-Esseen theorem (Petrov, 1995) implies that

|P{h_v ≥ Θ} − (1 − Φ(Θ; μ_v, σ_v))| ≤ 0.7915 ρ_v / σ_v³,   (2.1)

where Φ(Θ; μ_v, σ_v) denotes the normal distribution function with mean μ_v and variance σ_v². The three moments occurring in equation 2.1 can be related to the r.v.'s h_vu through the equations μ_v = Σ_{i=1}^n Σ_{u∈U_i} E[h_vu], σ_v² = Σ_{i=1}^n Σ_{u∈U_i} Var[h_vu], and ρ_v = Σ_{i=1}^n Σ_{u∈U_i} E[|h_vu − E[h_vu]|³]. According to

² This holds if the pool size is large enough that we can treat x_i (y) as the probability that a neuron u ∈ U_i (v ∈ V) will fire once during a certain input (output) interval of length Δ.
the definition of the r.v. h_vu we have E[h_vu] = x_i r_vu ā_vu and E[h_vu²] = x_i r_vu â_vu, where ā_vu = ∫ a φ_vu(a) da denotes the mean EPSP (IPSP) amplitude and â_vu = ∫ a² φ_vu(a) da denotes the second moment. Hence we can assign to μ_v and σ_v in equation 2.1 the values

μ_v = Σ_{i=1}^n Σ_{u∈U_i} x_i r_vu ā_vu,   (2.2)

σ_v² = Σ_{i=1}^n Σ_{u∈U_i} (x_i r_vu â_vu − x_i² r_vu² ā_vu²).   (2.3)
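The moment formulas of equations 2.2 and 2.3 can be checked directly against a Monte Carlo simulation of h_v. The parameter values below (pool sizes, ranges of r_vu and ā_vu, and a gaussian stand-in for the density φ_vu) are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of equations 2.2 and 2.3: compute mu_v and sigma_v^2 analytically
# from x_i, r_vu, abar_vu, ahat_vu, and verify both against a direct Monte
# Carlo simulation of h_v = sum of stochastic PSP contributions h_vu.
n, N = 3, 400
x = np.array([0.2, 0.5, 0.8])                  # firing fractions x_i
r = rng.uniform(0.2, 0.8, size=(n, N))         # nonfailure probabilities r_vu
abar = rng.uniform(0.1, 0.3, size=(n, N))      # mean amplitudes abar_vu
ahat = abar**2 + 0.01                          # second moments (variance 0.01)

# mu_v = sum_i sum_u x_i r_vu abar_vu                               (eq. 2.2)
# sigma_v^2 = sum_i sum_u (x_i r_vu ahat_vu - x_i^2 r_vu^2 abar_vu^2)  (2.3)
mu_v = float((x[:, None] * r * abar).sum())
var_v = float((x[:, None] * r * ahat - (x[:, None] * r * abar) ** 2).sum())

# Monte Carlo check: amplitudes drawn from a normal density (an assumed
# stand-in for phi_vu) with mean abar_vu and matching second moment ahat_vu
trials = 5000
fires = rng.random((trials, n, N)) < x[:, None]
rel = fires & (rng.random((trials, n, N)) < r)
amps = rng.normal(abar, 0.1, size=(trials, n, N))
h = (rel * amps).sum(axis=(1, 2))

print(mu_v, float(h.mean()))       # analytic vs empirical mean
print(var_v, float(h.var()))       # analytic vs empirical variance
```

The empirical mean and variance of h_v agree with the analytic moments up to sampling error, as the Berry-Esseen argument requires.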
A closer look reveals that the right-hand side of equation 2.1 scales like N^{−1/2}.³ Hence, equation 2.1 implies that for large N, we can approximate the firing probability P{h_v ≥ Θ} by the term 1 − Φ(Θ; μ_v, σ_v), which smoothly grows with μ_v. The gain of this sigmoidal function depends on the size of σ_v. In particular, if synaptic transmission were reliable, this function would degenerate to a step function. With the definition of the formal weights w_vi := Σ_{u∈U_i} r_vu ā_vu, we have μ_v = Σ_{i=1}^n w_vi x_i, and hence 1 − Φ(Θ; μ_v, σ_v) smoothly grows with the weighted sum Σ_{i=1}^n w_vi x_i of the inputs x_i.

So far we have considered only the probability P{h_v ≥ Θ} that a single neuron v ∈ V will fire, but we are really interested in the expected fraction y of neurons in pool V that will fire in response to a firing of a fraction x_i of neurons in the pools U_i for i = 1, ..., n. According to equation 2.1, one can approximate y for sufficiently large pool sizes N by

y = (1/N) Σ_{v∈V} P{h_v ≥ Θ} = (1/N) Σ_{v∈V} (1 − Φ(Θ; μ_v, σ_v)).
Hence y is approximated by an average of the N sigmoidal functions 1 − Φ(Θ; μ_v, σ_v). If the weights w_vi have similar values for different v ∈ V, one can expect that y grows smoothly with the weighted sum μ̄ = Σ_{i=1}^n w_i x_i, where we write w_i = Σ_{v∈V} w_vi / N for the "effective weights" w_i between the pools of neurons U_i and V.

In order to test these theoretical predictions for an idealized mathematical model, we have carried out computer simulations of a more detailed model consisting of more realistic models for spiking neurons and time intervals

³ More precisely, the right-hand side of equation 2.1 scales like N^{−1/2} if for all N, the average value of the terms E[|h_vu − E[h_vu]|³] is uniformly bounded from above and the average value of the terms Var[h_vu] is uniformly bounded from below by a constant > 0 for i ∈ {1, ..., n} and u ∈ ⋃_{i=1}^n U_i. The latter can be achieved only for inputs where x_j > 0 for some j, since otherwise Var[h_vu] = 0 for all u ∈ U_i and all i ∈ {1, ..., n}. But in the case x_i = 0 for all i ∈ {1, ..., n}, both P{h_v ≥ Θ} and 1 − Φ(Θ; μ_v, σ_v) have value 0 if Θ > 0, and hence the left-hand side of equation 2.1 has value 0.
I_in and I_out of length Δ = 5 ms for space-rate coding. The resulting fraction y of neurons in pool V that fired during I_out was measured for a large variety of different inputs ⟨x_1, ..., x_6⟩ ∈ [0, 1]^6 encoded through the fractions of firing neurons in U_1, ..., U_6 during an earlier time interval I_in.⁴ It is shown in Figure 1A that y can be approximated quite well by a suitable sigmoidal "activation function" σ (more precisely, by the function σ derived in equation 2.6 applied to a weighted sum Σ_{i=1}^6 w_i x_i of the inputs x_1, ..., x_6). Note that σ has not been implemented explicitly in our computational model, but rather emerges implicitly through the large-scale statistics of the firing activity.

The deviation of the data points in Figure 1A from the sigmoidal function σ(Σ_{i=1}^n w_i x_i) (solid line) can be traced back to two independent sources of noise. One source of noise is stochastic fluctuations due to the finite pool size N = 200. This type of noise is guaranteed to disappear for N → ∞ with N^{−1/2} according to equation 2.1. Another source of noise is of a more systematic nature and is predicted by our preceding theoretical analysis (cf. equation 2.1). For large pool sizes N, the firing probability of a neuron v in pool V converges to 1 − Φ(Θ; μ_v, σ_v). This function depends not just on the weighted sum μ_v = Σ_{i=1}^n w_vi x_i, but also on another function σ_v of the input vector ⟨x_1, ..., x_n⟩. This fact, which causes a noise for computing in space-rate coding that dominates the sampling noise already for moderate pool sizes (cf. Figure 2), will be addressed in the next section. In the following, the term systematic noise refers to the fact that the output y in space-rate code does not merely depend on a single variable μ̄ = Σ_{i=1}^n w_i x_i but is actually a function that depends in a more complex way on the input ⟨x_1, ..., x_n⟩.

2.3 Analysis of the Systematic Noise.
At the end of the preceding section, we pointed out that the output y of our model for computing in space-rate coding depends not just on the weighted sum Σ_{i=1}^n w_i x_i of the inputs x_1, ..., x_n. Here we provide tools for analyzing on what other quantities the output y may depend, and we exhibit a specific type of synaptic connectivity that minimizes the impact of this type of systematic noise.

Equation 2.1 shows that if one wants to approximate the firing probability P{h_v ≥ Θ} of an arbitrary neuron v in pool V by a function of the weighted sum μ_v of the firing probabilities x_i in pools U_i, it is indeed necessary that σ_v also can be approximated by a function of μ_v. This follows easily from the equation Φ(Θ; μ_v, σ_v) = Φ((Θ − μ_v)/σ_v; 0, 1). In order to analyze the possibilities for approximating σ_v by a function of μ_v, we restate σ_v² as

σ_v² = Σ_{i=1}^n ŵ_vi x_i − Σ_{i=1}^n x_i² Σ_{u∈U_i} r_vu² ā_vu²,  with ŵ_vi := Σ_{u∈U_i} r_vu â_vu.   (2.4)

⁴ The firing times of the individual neurons are distributed uniformly over I_in.
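Equation 2.4 makes the systematic noise concrete: two inputs with the same weighted sum μ_v can produce different values of σ_v, and hence different firing probabilities. The following sketch illustrates this with assumed parameter values (they are not the paper's simulation parameters).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

# Illustrative sketch of the systematic noise: two inputs with the SAME
# weighted sum mu_v can yield different sigma_v via equation 2.4, and hence
# different firing probabilities 1 - Phi(Theta; mu_v, sigma_v). All values
# here are assumptions for illustration.
n, N = 6, 200
r = rng.uniform(0.1, 0.9, size=(n, N))       # nonfailure probabilities r_vu
abar = rng.uniform(0.05, 0.4, size=(n, N))   # mean amplitudes abar_vu
ahat = 1.25 * abar**2                        # second moments (CV 0.5 assumed)
theta = 15.0                                 # firing threshold Theta

w = (r * abar).sum(axis=1)                   # weights w_vi of neuron v

def moments(x):
    """mu_v and sigma_v for input fractions x, per equations 2.2-2.4."""
    mu = float(np.dot(w, x))
    var = float((x[:, None] * r * ahat - (x[:, None] * r * abar) ** 2).sum())
    return mu, sqrt(var)

def p_fire(x):
    mu, sig = moments(x)
    return 1.0 - 0.5 * (1 + erf((theta - mu) / (sig * sqrt(2))))

x_a = np.full(n, 0.1)                        # input spread over all pools
x_b = np.zeros(n)
x_b[0] = 0.1 * w.sum() / w[0]                # same mu_v, concentrated on U_1

mu_a, sig_a = moments(x_a)
mu_b, sig_b = moments(x_b)
print(mu_a, mu_b)        # identical weighted sums
print(sig_a, sig_b)      # but different sigma_v
print(p_fire(x_a), p_fire(x_b))
```

This is the same effect that the λ_t/λ_0 ratio in Figure 2 quantifies: inputs constrained to a common μ̄ still spread the output y.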
Figure 2: Quantitative study of the noise underlying the computation in our model. The model is constructed in exactly the same way as described in Figure 1, with the same values of the weights w_i. (A) The empirical standard deviation (std) of the outputs y of 200 individual simulations in dependence of the pool size N for two cases. We used the same input x⁽⁰⁾ = ⟨x_1⁽⁰⁾, ..., x_n⁽⁰⁾⟩ for all 200 simulations and measured the standard deviation λ_0 of the observed outputs. Thus, λ_0 describes the contribution of the stochastic fluctuations to the total noise. In order to test to what extent the response depends just on μ̄, we generated 200 different inputs x = ⟨x_1, ..., x_n⟩ such that Σ_{i=1}^n w_i x_i always was equal to μ̄⁽⁰⁾ = Σ_{i=1}^n w_i x_i⁽⁰⁾ = 24 and measured the standard deviation λ_t of the resulting output y, which describes the amount of the total noise. (B) The ratio λ_t/λ_0 between the total noise and the stochastic fluctuations.
We will focus on scenarios where the second term Σ_{i=1}^n x_i² Σ_{u∈U_i} r_vu² ā_vu² of equation 2.4 can be neglected for sufficiently large N. This can be achieved if either the nonfailure probabilities r_vu or the mean EPSP (IPSP) amplitudes ā_vu scale such that the weights w_vi remain bounded for large N.⁵ Under this assumption, one can approximate σ_v² by some second-order polynomial Aμ_v² + Bμ_v + C in μ_v if the angle θ between the weight vector w_v = ⟨w_v1, ..., w_vn⟩ and the "virtual" weight vector ŵ_v = ⟨ŵ_v1, ..., ŵ_vn⟩ that is defined in equation 2.4 is rather small.⁶

One can easily show that σ_v² can be expressed exactly as a function of μ_v if and only if θ = 0 or θ = π. Hence it is worthwhile to analyze what type of

⁵ The term Σ_{i=1}^n x_i² Σ_{u∈U_i} r_vu² ā_vu² converges to 0 for N → ∞ if, for example, the empirical standard deviation of the terms r_vu ā_vu over u ∈ U_i does not grow much faster than the mean w_vi/N = Σ_{u∈U_i} r_vu ā_vu / N for N → ∞, for example, if (1/N) Σ_{u∈U_i} (r_vu ā_vu − w_vi/N)² ≤ (w_vi/N)². This appears to be a quite reasonable assumption, and it furthermore implies that Σ_{u∈U_i} r_vu² ā_vu² ≤ (2/N) · w_vi². Hence the second term Σ_{i=1}^n x_i² Σ_{u∈U_i} r_vu² ā_vu² in equation 2.4 converges to 0 for N → ∞ if the "weights" w_vi remain bounded for N → ∞.

⁶ θ is defined by cos θ = (w_v · ŵ_v)/(‖w_v‖ · ‖ŵ_v‖).
synaptic connectivity structure can achieve this. Obviously θ = 0 (θ = π) holds if and only if there exists a constant c > 0 (c < 0) such that ŵ_vi = c·w_vi for all i ∈ {1, ..., n}. Note that this implies that all pools U_i consist of excitatory (θ = 0) or inhibitory (θ = π) neurons, since ŵ_vi ≥ 0 by definition of ŵ_vi. In the following, we consider the case where all neurons are excitatory.

The condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} with a common constant c > 0 is quite difficult to achieve in the case of heterogeneous weights w_vi for different i. However, one scenario in which this can in principle be achieved even with heterogeneous nonnegative weights w_vi is that where there exists for each neuron u ∈ ⋃_{i=1}^n U_i just a single release site between u and v with a common stereotyped probability density function φ for the amplitudes a_vu of the EPSPs caused by a synaptic release. For such an architecture, the nonfailure probabilities r_vu (equal to the release probability in that case) can be chosen arbitrarily in order to achieve a desired heterogeneous assignment of nonnegative weights w_vi for i ∈ {1, ..., n}, while the condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} will automatically be satisfied with a common constant c > 0. This follows easily from the fact that for a common stereotyped probability density function φ, one gets w_vi = ā Σ_{u∈U_i} r_vu and ŵ_vi = â Σ_{u∈U_i} r_vu with ā = ∫ a φ(a) da and â = ∫ a² φ(a) da.

A connection from a neuron u ∈ U_i to a neuron v ∈ V via a single release site also seems to be advantageous with regard to synaptic plasticity. In the case of a single release site, it is a reasonable assumption that the nonfailure probability r_vu (equal to the release probability in that case) and the distribution of the EPSP amplitudes φ_vu can be chosen independently.⁷ This implies that one can approximate the effective weights w_vi by (1/N)(Σ_{u∈U_i} ā_vu)(Σ_{u∈U_i} r_vu) and the "virtual weights" ŵ_vi by (1/N)(Σ_{u∈U_i} â_vu)(Σ_{u∈U_i} r_vu). Thus, the condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} reduces to

(1/N) Σ_{u∈U_i} â_vu = (c/N) Σ_{u∈U_i} ā_vu  for all i ∈ {1, ..., n}.   (2.5)

Assume that for a given set of effective and virtual weights, the relation ŵ_vi = c·w_vi holds for all i ∈ {1, ..., n}, and subsequently the weights w_vi are subject to change due to some plasticity mechanism. In order to maintain the property that ŵ_vi = c·w_vi for all i, such a change is best implemented by changes in the nonfailure probabilities r_vu, since equation 2.5 holds independent of their values. If the weight change were implemented by changes in the mean EPSP amplitudes ā_vu, then the average of the second moments â_vu would have to change proportionally to keep equation 2.5 valid. If one considers the well-known relation â_vu = ā_vu² + ã_vu², where ã_vu² is

⁷ For synapses with more than one release site, the assumption that r_vu can be chosen independently from φ_vu does not hold (see the appendix for an example).
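The single-release-site argument above can be verified numerically. In this sketch, a gamma density is an assumed choice for the common stereotyped amplitude density φ; the point is only that ā and â are then the same constants for every synapse, so ŵ_vi = c·w_vi holds exactly, whatever the release probabilities are.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the single-release-site argument: with one stereotyped amplitude
# density phi shared by all synapses (a gamma density here, an assumed
# choice), abar and ahat are the same constants for every synapse, so
# w_vi = abar * sum_u r_vu and what_vi = ahat * sum_u r_vu, i.e.
# what_vi = c * w_vi with c = ahat / abar, whatever the r_vu are.
shape, scale = 4.0, 0.1
abar = shape * scale                        # mean of the gamma density
ahat = shape * scale**2 * (shape + 1)       # second moment of the gamma density

n, N = 5, 200
r = rng.uniform(0.05, 0.95, size=(n, N))    # arbitrary release probabilities

w = abar * r.sum(axis=1)                    # effective weights w_vi
w_hat = ahat * r.sum(axis=1)                # virtual weights what_vi
c = ahat / abar

cos_theta = np.dot(w, w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
print(bool(np.allclose(w_hat, c * w)), float(cos_theta))
```

The proportionality makes cos θ = 1 (footnote 6), i.e. θ = 0, which is exactly the condition under which σ_v² is a function of μ_v alone.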
Figure 3: Quantitative study of the noise underlying the computation in our model if all connections consist of just a single release site. The model is constructed in a similar way as described in Figure 1, with ⟨w_1, ..., w_6⟩ = ⟨10, 20, 30, 40, 50, 60⟩ for the effective weights. The simulation protocol is the same as described in the caption of Figure 2. (A) Stochastic fluctuations (λ_0) and the total noise (λ_t). (B) The ratio between λ_0 and λ_t has a value of about 1 for N ≥ 200. This shows that the output y of the network depends just on the weighted sum μ̄ = Σ_{i=1}^n w_i x_i.
the variance of the distribution φ_vu (ã_vu² = ∫ φ_vu(z)(z − ā_vu)² dz), it becomes even clearer that such a change would involve complex changes in the distributions φ_vu. Hence, in the case of connections with single release sites, it is advantageous to implement weight changes through changes in the nonfailure probabilities rather than through changes in the distribution of the EPSP amplitudes.

Thus, our preceding analysis exhibits a possible computational advantage of synaptic connections consisting of single release sites. At first sight, such a connectivity structure appears to be quite undesirable because of the high unreliability of such synaptic connections. However, our preceding arguments show that such a connectivity structure supports precise analog computations in space-rate coding better than multiple release sites. Together with equation 2.5, it induces an angle θ = 0 between the vectors ⟨w_v1, ..., w_vn⟩ and ⟨ŵ_v1, ..., ŵ_vn⟩, thereby allowing that the resulting firing activity y in pool V depends only on μ_v = Σ_{i=1}^n w_vi x_i for arbitrary values x_1, ..., x_n ∈ [0, 1] of presynaptic pool activities (see Figure 3).

Now we consider the case where the angle θ between the weight vector w_v = ⟨w_v1, ..., w_vn⟩ and the "virtual" weight vector ŵ_v = ⟨ŵ_v1, ..., ŵ_vn⟩ is rather small but ≠ 0. We will show that in this case, we can still approximate σ_v² well by the expression B′μ_v + C′ for B′ := (w_v · ŵ_v)/‖w_v‖² and C′ := (1/2) Σ_{i=1}^n (ŵ_vi − B′w_vi). In fact, this specific linear expression in μ_v provides an approximation to σ_v² that is in a heuristic sense optimal
among all second-order polynomials Aμ_v² + Bμ_v + C in μ_v. To see this, we choose the parameters A, B, and C such that the average error

E = ∫_{[0,1]^n} (σ_v² − (Aμ_v² + Bμ_v + C))² dx

of such an approximation is minimized. As shown in the appendix, the conditions dE/dA = 0, dE/dB = 0, and dE/dC = 0 imply for N → ∞ that A = 0, B = B′, and C = C′. For these values of A, B, and C, the average error E has the minimum value

E_min = ∫_{[0,1]^n} (σ_v² − (B′μ_v + C′))² dx = ‖ŵ_v‖² (1 − cos²θ)/12.

E_min decreases with decreasing θ; in particular, E_min = 0 if θ = 0. This observation indicates that one can approximate the firing probability P{h_v ≥ Θ} of neuron v in pool V by the sigmoidal function 1 − Φ(Θ; μ_v, (B′μ_v + C′)^{1/2}) of the single variable μ_v if the angle θ is not too large. Furthermore, if we assume that the values w_vi and ŵ_vi do not differ too much for different neurons v ∈ V, we can approximate the output y of the network by

y ≈ 1 − Φ(Θ; μ̄, (B′μ̄ + C′)^{1/2})   (2.6)

for μ̄ = Σ_{i=1}^n w_i x_i and w_i = (1/N) Σ_{v∈V} w_vi. This theoretical prediction is supported by our computer simulations. The function of μ̄ = Σ_{i=1}^n w_i x_i that appears on the right-hand side of equation 2.6 is drawn as a solid line in Figure 1A.

Complications arise if some of the weights w_vi are positive and some are negative. In this case, it is impossible to satisfy the condition ŵ_vi = c·w_vi with a common constant c ≠ 0 for all i ∈ {1, ..., n}. Hence, σ_v² depends not only on μ_v, and y depends not only on μ_v. This indicates that the systematic noise is larger in the case where one considers a combination of inhibitory and excitatory input. However, the simulations reported in Figure 1 show that even in this case, the output y of the network approximates a sigmoidal function of μ̄ = Σ_{i=1}^n w_i x_i.

2.4 Multilayer Computations.

The preceding arguments show that approximate computations of functions of the form ⟨x_1, ..., x_n⟩ → y = σ(Σ_{i=1}^n w_i x_i), with inputs and output in space-rate code, can be carried out within 10 ms by a network of spiking neurons. Hence, the universal approximation theorem for multilayer perceptrons implies that arbitrary continuous functions f: [0, 1]^n → [0, 1]^m can be approximated with a computation time of not more than 20 ms by a network of spiking neurons with three layers. Thus, our model provides a possible theoretical explanation for the empirically observed very fast multilayer computations in biological neural systems that were mentioned in section 1. Results of computer simulations of the computation of a specific function f in space-rate coding that requires a multilayer network because it interpolates the boolean function XOR are shown in Figure 4.
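A multilayer computation of this kind can be sketched in a few lines. The pool activation below has the form suggested by equation 2.6, s(m) = 1 − Φ(Θ; m, (B′m + C′)^{1/2}); the weights, thresholds, and the coefficients B′ = 0.3, C′ = 0.1 are illustrative assumptions, not the parameters of the paper's simulation in Figure 4.

```python
from math import erf, sqrt

# A minimal sketch of a multilayer space-rate computation interpolating XOR,
# using a sigmoidal pool activation of the form suggested by equation 2.6.
# All numeric parameters are illustrative assumptions.
def s(m, theta):
    """Firing fraction of a pool with mean input m and threshold theta."""
    sd = sqrt(max(0.3 * m + 0.1, 0.01))       # sigma_v ~ sqrt(B'*m + C')
    return 1.0 - 0.5 * (1 + erf((theta - m) / (sd * sqrt(2))))

def f(x1, x2):
    """Three-layer network: OR-like and AND-like hidden pools, then output."""
    h_or = s(10 * x1 + 10 * x2, theta=5.0)
    h_and = s(10 * x1 + 10 * x2, theta=15.0)
    return s(10 * h_or - 10 * h_and, theta=5.0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(f(a, b), 3))   # close to XOR(a, b) at the corners
```

Since the output pool combines an excitatory and an inhibitory hidden pool, no single sigmoidal unit could compute this map, which is exactly why the XOR-interpolating f of Figure 4 requires a multilayer network.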
Figure 4: (A) Plot of a function f(x_1, x_2): [0, 1]² → [0, 1], which interpolates XOR. Since f interpolates XOR, it cannot be computed by a single sigmoidal unit. (B) Computation of f by a three-layer network in space-rate coding with spike-response model neurons (N = 200) according to our model (for details, see the appendix).
2.5 Consequences for Synfire Chains.

Our computational model also throws new light on the familiar model of a synfire chain (Abeles, 1991). There one considers a longer chain of pools of neurons, with large diverging and converging connectivity between adjacent pools in the chain. Abeles, Bergman, Margalit, and Vaadia (1993) pointed out that this architecture has the property that relatively synchronous firing in the first pool triggers even more synchronous firing in the subsequent pools of the chain. We have shown in this article that if one takes synaptic unreliability into account, a synchronous firing of a certain fraction x of neurons in the first pool will cause a certain fraction y of neurons in the subsequent pools to fire synchronously. Whereas Abeles (1991) considered only the situation where x and y are close to 1, our analysis shows that more subtle information processing can be carried out by a synfire chain. One can arrange that y is a smooth function of x, for a wide range of values for x. Such "graded activation" of synfire chains allows substantially more complex computations in networks of linked and/or reverberating synfire chains, since it supports the implementation of "graded pointers" from one synfire chain to other ones.

The original version and our analog version of a synfire chain (based on unreliable synapses) make slightly different predictions regarding the firing behavior of individual neurons in the synfire chain. In the analog version of a synfire chain, only a certain fraction y of neurons will fire in a pool V of a chain, where y may assume arbitrary values between 0 and 1. Furthermore, for repeated activations of the synfire chain with the same firing activity x in the first pool, the actual set of neurons in V that fire will change from trial to trial. Hence, precisely timed firing patterns among neurons in different pools of the chain would not occur every time when the first pool is activated
with the same x, but just with a certain probability that is somewhat higher than for completely randomly firing neurons. This prediction appears to be consistent with published data (Abeles et al., 1993).

3 Analog Computation on Time Series in a Space-Rate Code
We showed in section 2 that biological neural systems with space-rate coding have at least the computational power of multilayer perceptrons. In this section, we demonstrate that they have strictly more computational power. This becomes clear if one considers computations on time series rather than on static batch inputs, as in section 2. We now analyze the behavior of our computational model if the firing probabilities in the pools U_i change with time. Writing x_i(t) (y(t)) for the probability that a neuron in pool U_i (V) fires during the t-th time window of length Δ (e.g., for Δ in the range between 1 and 5 ms), our computational model from section 2 maps a vector of n analog time series {x_i(t)}_{t∈ℕ} onto an output time series {y(t)}_{t∈ℕ} (where ℕ = {0, 1, 2, ...}).

As an example, consider a network that consists of one presynaptic pool U_1 connected to the output pool V with the same type of synapses as discussed in section 2. In addition, there are feedback connections between individual neurons v ∈ V. The results of simulations reported in Figures 5 and 6 show that this network computes an interesting map in the time-series domain: the space-rate code in pool V represents a sigmoidal function σ (as in section 2) applied to the output of a bandpass filter. Figure 5B shows the response of such a network to a sine wave with some bursts of activity added on top (see Figure 5A). Figures 6A and 6B show the frequency response of the bandpass filter that is implemented by this network of spiking neurons (see the appendix for the definition of the frequency response). Figure 5C shows the output of another network of spiking neurons (which approximates a low-pass filter) to the same input (shown in Figure 5A). The theoretical analysis of these networks will be presented in section 3.3.

To analyze theoretically the computational power of such networks of spiking neurons in the time-series domain, we use the spike-response model (Gerstner, 1999b).
We will focus on the situation where one can neglect the nonlinear effects of the sigmoidal function σ; we consider time series of the form x_i(t) = x_0 + x̃_i(t) and y(t) = y_0 + ỹ(t), where the magnitudes of the signals x̃_i(t) and ỹ(t) are small enough that σ can be approximated in this range by a linear function x ↦ Kx.

3.1 Taking the Time Course of Postsynaptic Potentials into Account.
We model the effect of a firing of a neuron u ∈ U_i at time k on the membrane potential of a neuron v ∈ V at time t by a "response function" ε_i(t − k), as in the spike-response model (Gerstner, 1999b). Such a response function has the typical shape of an EPSP or IPSP. It is usually described mathematically as a difference of two exponentially decaying functions (see the appendix).
Figure 5: (A) Response of two different networks to the same input. (B) Response of a network that approximates a bandpass filter. (C) Response of a network that approximates a low-pass filter. The gray-shaded bars in B and C show the actual measured fraction of neurons that fire in pool V of the network during a time interval of length 5 ms in response to the input activity in pool U_1 shown in A. The solid lines in B and C are plots of the output predicted by equation 3.4. For details about the parameters, see the appendix.
Figure 6: (A) Amplitude response |H(f)| and (B) phase response ψ(f) of the filter (bandpass) implicitly implemented by the network of spiking neurons described at the beginning of section 3 (see the appendix for the definition of |H(f)|, ψ(f), and the parameters of the network). Solid lines are plots of the amplitude and phase response predicted by the theory presented in section 3.3. Dots are simulation results for the approximating network of spiking neurons (see the appendix for details).
If one takes this time course of EPSPs and IPSPs into account but ignores the impact of v's firing on its own membrane potential, our arguments from section 2 imply that one can write the expected value μ(t) of the membrane potentials of neurons v ∈ V at time t as

μ(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x_i(t − k).   (3.1)
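Equation 3.1 is a causal discrete-time convolution and can be sketched directly. The time constants, weights, and inputs below are illustrative assumptions; the kernels are the conventional difference-of-exponentials response functions mentioned above.

```python
import numpy as np

# Sketch of equation 3.1 in discrete time: the mean membrane potential mu(t)
# is a weighted sum of the input activities x_i(t) convolved with EPSP-shaped
# response kernels eps_i(k), each modeled as a difference of two decaying
# exponentials. Time constants, weights, and inputs are illustrative.
k = np.arange(30.0)                             # 1 ms time steps, 30 ms kernel

def eps(tau_m, tau_s):
    """Response function: difference of two exponentials, peak-normalized."""
    kern = np.exp(-k / tau_m) - np.exp(-k / tau_s)
    return kern / kern.max()

w = np.array([1.5, -0.8])                       # weights w_i (one inhibitory)
kernels = [eps(10.0, 2.0), eps(8.0, 1.0)]       # response functions eps_i

T = 200
t_axis = np.arange(T)
x = [0.2 + 0.1 * np.sin(2 * np.pi * t_axis / 50),   # x_1(t)
     0.2 * np.ones(T)]                              # x_2(t)

# mu(t) = sum_i w_i sum_k eps_i(k) x_i(t - k)   (causal convolution)
mu = sum(wi * np.convolve(xi, kern)[:T]
         for wi, kern, xi in zip(w, kernels, x))

print(mu[:3])
```

Truncating the full convolution to the first T samples keeps exactly the causal terms Σ_k ε_i(k) x_i(t − k) of equation 3.1.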
The "weights" w_i result from the release probabilities and distributions of PSP amplitudes as described in section 2. We will ignore the impact of synaptic dynamics and assume that these weights w_i do not depend on t.

3.2 Taking Refractory Effects into Account.

When the resulting firing probabilities in pool V become sufficiently large, the refractory effects of neurons v ∈ V start to affect the resulting computation on time series. We then have to replace equation 3.1 by

μ(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x_i(t − k) + Σ_{k=1}^t g(k) y(t − k),   (3.2)

where g(k) typically assumes a very large negative value for a few ms, and then decays exponentially with a relatively small time constant (Gerstner, 1999b). This yields a special case of an infinite impulse response time-invariant linear filter, as we show in the next section in a more general context. Since different specific firing properties of specific neurons, such as adaptation or rhythmic firing, correspond in this spike-response model to different refractory functions g(k) of specific neurons, a rich class of different infinite impulse response filters may arise in this way in biological neural systems.
where g(k) typically assumes a very large negative value for a few ms, and then decays exponentially with a relatively small time constant (Gerstner, 1999b). This yields a special case of an innite impulse response timeinvariant linear lter, as we show in the next section in a more general context. Since different specic ring properties of specic neurons, such as adaptation or rhythmic ring, correspond in this spike-response model to different refractory functions g(k) of specic neurons, a rich class of different innite impulse response lters may arise in this way in biological neural systems. 3.3 Adding Recurrent Connections in Pool V . We now assume that in addition to the model dened in section 2 there are m different kinds of recurrent connections among neurons in pool V, with different response functions r j ( k ) and weights wQ j , j D 1, . . . , m. We assume that the large-scale structure of these recurrent connections is of the same type as those between pools Ui and V; that is, we assume that these are connections with unreliable synapses. Therefore, the weights wQ j also result from release probabilities and PSP amplitude distributions as described in section 2. We then arrive at the following equation for the average membrane potential of neurons in pool V:
m (t) D
n X iD1
wi
t X kD0
ei ( k ) xi ( t ¡ k ) C
t X kD1
g(k)y(t ¡ k ) C
m X jD1
wQ j
t X kD1
r j ( k)y( t ¡ k ).
If t is sufficiently large such that ε_i(t′) = r_j(t′) = 0 for t′ ≥ t, then for x_i(t) = x_0 + x̃_i(t) and y(t) = y_0 + ỹ(t), one can rewrite μ(t) as μ(t) = μ_0 + μ̃(t) with

μ_0 = x_0 Σ_{i=1}^n w_i Σ_{k=0}^∞ ε_i(k) + y_0 Σ_{k=1}^∞ g(k) + y_0 Σ_{j=1}^m w̃_j Σ_{k=1}^∞ r_j(k)

and

μ̃(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x̃_i(t − k) + Σ_{k=1}^t g(k) ỹ(t − k) + Σ_{j=1}^m w̃_j Σ_{k=1}^t r_j(k) ỹ(t − k).
We showed in section 2 that under certain conditions, one can approximate the firing activity y(t) in pool V by a function of the form σ(μ(t)), for some suitable sigmoidal function σ (see equation 2.6). Since we consider only small signals x̃_i(t) and ỹ(t), we can make the approximation ỹ(t) = K · μ̃(t) for some constant K, which depends on the sigmoidal function σ. This yields for ỹ(t) the recursive equation

ỹ(t) = K Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x̃_i(t − k) + K Σ_{k=1}^t g(k) ỹ(t − k) + K Σ_{j=1}^m w̃_j Σ_{k=1}^t r_j(k) ỹ(t − k).   (3.3)
We now show that equation 3.3 describes a very powerful computational model for computations on time series. This becomes clear if one considers just the special case where x_i(t) = x(t) (hence, x̃_i(t) = x̃(t) for all i ∈ {1, ..., n}). We then have

ỹ(t) = Σ_{k=0}^t b_k x̃(t − k) − Σ_{k=1}^t a_k ỹ(t − k)   (3.4)
with b_k = K Σ_{i=1}^n w_i ε_i(k) and a_k = −K g(k) − K Σ_{j=1}^m w̃_j r_j(k). For arbitrary real-valued numbers b_k and a_k for k ∈ ℕ with b_k = a_k = 0 for k > k_0 (where k_0 is some sufficiently large constant),⁸ this is the general form of an infinite impulse response (IIR) time-invariant linear filter (see Haykin, 1996; Back & Tsoi, 1991). Furthermore, if each response function ε_i(·) and r_j(·) is represented as a difference of two exponentially decaying functions and g(·) is also represented as an exponentially decaying function, then the kernels K Σ_{i=1}^n w_i ε_i(k) and −K g(k) − K Σ_{j=1}^m w̃_j r_j(k) in equation 3.4 are

⁸ Since in biological systems the responses to individual spikes vanish after some time, it is a reasonable approximation to set ε_i(k) = r_j(k) = g(k) = 0 for k > k_0.
1696
W. Maass and T. Natschläger
weighted sums of exponentially decaying functions of the form $e^{-k/\tau}$ for various different values of $\tau$. In that case, one can easily show that the map from arbitrary input functions $\tilde{x}(t)$ to the output $\tilde{y}(t)$ can in principle approximate any given time-invariant linear filter with IIR for any finite number $t$ of discrete time steps.^9 In the case where $a_k = 0$ for all $k \in \mathbb{N}$, that is, $\tilde{y}(t) = \sum_{k=0}^{t} b_k \tilde{x}(t-k)$, the resulting mapping from $\tilde{x}(t)$ to $\tilde{y}(t)$ is a time-invariant linear filter with finite impulse response (FIR). Hence, if there are no recurrent connections and one can neglect the refractory effects, our model can still approximate any given FIR filter. One can build a large class of practically relevant filters with the help of such FIR and IIR filters, including good approximations to low-pass and bandpass filters (see Figure 6). Note that different architectures of recurrent circuits with different delays and different types of neurons involved in the recurrent loop give rise to a large variety of different coefficient vectors $\langle a_1, \ldots, a_{k_0} \rangle$. The variety of different vectors $\langle b_0, \ldots, b_{k_0} \rangle$ appears to be more limited in comparison, since the time courses of PSPs in isolated neurons are relatively uniform. However, it is shown in Bernander, Douglas, Martin, and Koch (1991) that the level of background activity in cortical neurons may change the time constants of PSPs considerably. Hence, such a mechanism supports the implementation of a large variety of vectors $\langle b_0, \ldots, b_{k_0} \rangle$. The preceding analysis implies that, assuming that one can approximate any given real-valued sequences $b_k$ and $a_k$ on any finite interval, the output of the network of spiking neurons can approximate in space-rate coding, on any finite time interval, the saturated version of any given linear time-invariant filter with finite and infinite impulse response. One may view this result as a universal approximation theorem for linear filters by networks of spiking neurons.
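Since equation 3.4 is the standard IIR difference equation, it can be iterated directly. The following sketch (with purely illustrative weights, time constants, and recurrent coefficient, not values from the simulations) builds the kernel $b_k$ as a weighted sum of exponentially decaying functions $e^{-k/\tau}$ and evaluates the amplitude response $|H(f)|$ of the resulting filter:

```python
# Sketch of the IIR difference equation of equation 3.4,
#   y[t] = sum_k b_k x[t-k] - sum_k a_k y[t-k],
# with b_k built from exponentially decaying kernels e^{-k/tau}
# (all weights and time constants below are illustrative).
import cmath
import math

def iir_filter(x, b, a):
    """Iterate y[t] = sum b[k] x[t-k] - sum a[k] y[t-k]; a[0] is unused."""
    y = []
    for t in range(len(x)):
        acc = sum(b[k] * x[t - k] for k in range(len(b)) if t >= k)
        acc -= sum(a[k] * y[t - k] for k in range(1, len(a)) if t >= k)
        y.append(acc)
    return y

def amplitude_response(b, a, f):
    """|H(f)| for H(z) = (sum_k b_k z^-k) / (1 + sum_{k>=1} a_k z^-k)."""
    z = cmath.exp(2j * math.pi * f)
    num = sum(bk * z ** (-k) for k, bk in enumerate(b))
    den = 1 + sum(a[k] * z ** (-k) for k in range(1, len(a)))
    return abs(num / den)

k0 = 10
# b_k as a weighted sum of two exponentially decaying kernels
b = [0.6 * math.exp(-k / 2.0) + 0.4 * math.exp(-k / 5.0) for k in range(k0)]
a = [0.0, -0.5]                            # a single recurrent coefficient a_1

h = iir_filter([1.0] + [0.0] * 19, b, a)   # impulse response of the filter
gain_dc = amplitude_response(b, a, 0.0)    # gain at f = 0
gain_hi = amplitude_response(b, a, 0.5)    # gain at the highest frequency
```

With these illustrative coefficients, `gain_dc` is much larger than `gain_hi`, that is, the filter has a low-pass characteristic of the kind shown in Figure 6.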
Back and Tsoi (1991) have shown that an artificial neural network consisting of IIR filters as "synapses" and two layers of sigmoidal gates as "neurons" can be adjusted (via gradient descent for the parameters involved) so that, for low-pass-filtered white noise as input, its output approximates a very complex given time series. Our combined results from section 2 and the results of this section show that in principle, an arbitrary network of the type considered by Back and Tsoi (1991) can be implemented by a network of spiking neurons in space-rate coding. Note, however, that in this implementation, each IIR filter is implemented not by a single synapse but by a pool of spiking neurons.

^9 For a rigorous proof of that result, one just needs to observe that any function $g\colon \{0, \ldots, k_0\} \to \mathbb{R}$ can be approximated arbitrarily closely by functions of the form $f(k) = \sum_{i=1}^{n} w_i e_i(k)$, where each function $e_i(\cdot)$ is a difference of two exponentially decaying functions. This approximation result relies on two facts: that any continuous function can be approximated arbitrarily closely by a polynomial on any bounded interval (approximation theorem of Weierstrass) and that the variable transformation $u = -\log x$ for $u \in [0, \infty)$ transforms an exponentially decaying function $e^{-u/\tau}$ over $[0, \infty)$ into the function $x^{1/\tau}$ over $(0, 1]$. It becomes clear from this proof that it suffices to consider just time constants $\tau \le 1$ for that purpose.

Model for Fast Analog Computation

1697

An interesting aspect of this analysis is that these computations in the time-series domain with space-rate coding make essential use of two aspects of neuronal dynamics that had received little attention in preceding computational models for neural systems: the specific forms of the refractory functions $g$ and of the decaying parts of PSPs.

4 Hebbian Learning for Space-Rate Coding
The model for space-rate coding based on unreliable synapses that we analyzed in section 2 has the characteristic property that the actual sets of neurons that fire will vary from trial to trial, even for the same input. In this section, we show that this property is quite desirable from the point of view of learning. Most theoretically attractive learning rules for a computational unit in an artificial neural net, such as the Hebb rule, require a weight change $\Delta w_i$ that depends on the product $x_i y$ of the $i$th analog input $x_i$ and the analog output $y$ of the unit. If these values $x_i$ and $y$ are encoded in a network of spiking neurons by a space-rate code, it is not clear how this product $x_i y$ can be "computed" locally by a biological synapse or by the two spiking neurons that the synapse connects. On the other hand, Markram, Lübke, Frotscher, and Sakmann (1997) have shown that biological synapses regulate their efficacies in dependence on the temporal relationship between pre- and postsynaptic firing. For example, a biological synapse may increase its release probability by an amount $b$ if the postsynaptic neuron $v$ fires within 100 ms after the presynaptic neuron $u$. When we apply this rule in the context of our computational model from section 2, we see that the fraction of synapses between neurons $u \in U_i$ and $v \in V$ such that $v$ fires within 100 ms after the firing of $u$ has an expected value of $x_i y$. Hence the expected fraction of synapses between $U_i$ and $V$ whose release probability is increased by $b$ according to the rule from Markram, Lübke, Frotscher, and Sakmann (1997) is $x_i y$. This causes an average change in release probability between the pools $U_i$ and $V$ proportional to $x_i y$, although the value of the product $x_i y$ is nowhere explicitly computed in the network.
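The expected-fraction argument can be illustrated with a small Monte Carlo sketch (pool size, firing probabilities, and number of trials are all hypothetical choices for illustration):

```python
# Monte Carlo check of the claim that the fraction of synapse pairs
# (u, v) with u in U_i firing and v in V firing has expected value
# x_i * y, when each neuron in U_i fires with probability x_i and
# each neuron in V with probability y, independently across trials.
# All numerical values are illustrative.
import random

random.seed(0)
N = 100                 # neurons per pool (illustrative)
x_i, y = 0.4, 0.3       # firing probabilities in space-rate coding
trials = 300

total = 0.0
for _ in range(trials):
    n_pre = sum(random.random() < x_i for _ in range(N))
    n_post = sum(random.random() < y for _ in range(N))
    # each of the N*N pairs co-fires iff both of its neurons fire
    total += (n_pre * n_post) / (N * N)

avg_fraction = total / trials   # concentrates around x_i * y = 0.12
```

Note that, as in the text, no unit ever computes the product $x_i y$ explicitly; it emerges only as an average over the unreliable, trial-to-trial varying firing of the pools.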
A closer look shows that this average change in release probability has the desired effect of increasing the response $y$ after a few repetitions only under a special condition: if the actual sets of neurons in the pools that fire vary from trial to trial. Such trial-to-trial variability of the specific neurons that fire is inherent in our model for analog computation in space-rate coding presented in section 2, in contrast to previous models, which ignore the large-scale effects of synaptic unreliability. We have compared the relationship between weight changes resulting from Hebb's learning rule for an artificial neural network and local changes of synaptic efficacy based on the temporal relationship of pre- and postsynaptic firing according to Markram, Lübke, Frotscher, and Sakmann (1997) for
Figure 7: (A, B) Typical change in the distribution of release probabilities at synapses between $U_1$ and $V$ during learning as described in the text, (A) before and (B) after learning. (C) Distribution of the normalized errors $|w_i^{new} - W_i^{new}| / |w^{max}|$ for all $i \in \{1, \ldots, 6\}$, where $W_i^{new}$ is the ideal value of the updated weight according to the Hebb rule and $w_i^{new} = (1/N) \sum_{v \in V} \sum_{u \in U_i} r_{vu} \bar{a}_{vu}$ is the "effective weight" resulting from applications of Markram's rule in our simulated network of spiking neurons ($|w^{max}| = 100$ in this case; the mean is 0.0062). See the appendix for details.
our computational model from section 2 through computer simulations (see the appendix). A typical change in the distribution of release probabilities between pools $U_1$ and $V$ resulting from repeated applications of Markram's rule is shown in Figures 7A and 7B. After 10 learning steps, the weights $w_i := (1/N) \sum_{v \in V} \sum_{u \in U_i} r_{vu} \bar{a}_{vu}$ for all $i \in \{1, \ldots, n\}$ resulting from Markram's rule differ on average by 0.0062 (normalized error; see Figure 7C) from the weights $W_i$ resulting from applications of Hebb's rule for an artificial neural network with the corresponding parameter values and the same initial weights. For the application of Hebb's learning rule, $y$ was computed by $s(\sum_{i=1}^{n} w_i x_i)$ for the sigmoidal function $s$ given by the right-hand side of equation 2.6, with the same inputs $x_i$ that are used as firing probabilities for the pools $U_1, \ldots, U_n$. These results show that due to the large trial-to-trial variability of neuronal firing in our model from section 2, iterated local applications of Markram's temporal learning rule implement with high fidelity Hebb's learning rule for space-rate coding on the network level.

5 Discussion
We have addressed specific links between details of realistic models for biological neurons and synapses on the one hand and the resulting large-scale effects for computations with populations of neurons on the other hand. In particular, we have investigated possible macroscopic effects of the well-known stochastic nature of synaptic transmission. This aspect has been neglected in previous analytical studies of the dynamics of populations of spiking neurons (see, for example, the references in Gerstner, 1999a, 1999b).
In section 2 we showed that the unreliability of synaptic transmission suffices to explain the possibility of fast analog computations in space-rate coding on the network level. In fact, theoretically any given bounded continuous function can be approximated arbitrarily closely by such networks, with a computation time of not more than 20 ms. In contrast to other possible explanations of analog computation in space-rate coding, this model spreads firing activity uniformly among the neurons and therefore is able to explain the large trial-to-trial variability in neuronal firing patterns. Whereas it is occasionally conjectured that synaptic unreliability "averages out" on the network level and hence causes no significant macroscopic effects, we showed in section 2 through rigorous theoretical analysis and extensive computer simulations that synaptic unreliability may be essentially involved in generating a "graded response" for computations on the network level. Furthermore, we showed that most of these results are quite robust, since they hold for arbitrary assignments of synaptic failure probabilities and arbitrary distributions of PSP amplitudes. We also investigated a specific source of systematic noise for analog computation in space-rate coding that arises in this model. In contrast to sampling errors, this systematic noise is not likely to disappear when the pool size $N$ is chosen very large. However, our computer simulations suggest that the computational effect of this systematic noise is rather small. We showed in section 2.3 that this source of systematic noise is removed if synaptic connections consist of single release sites and the "weights" of synaptic connections are encoded in the release probabilities of synapses rather than in the amplitudes of the postsynaptic responses caused by the releases. We also showed in section 2.5 that our computational model gives rise to an analog version of the familiar model of a synfire chain.
This analog version appears to be computationally more powerful than the classical binary version, and its firing behavior appears to be more consistent with experimental data. In section 3 we addressed the question of what additional computational operations on the network level are supported by other prominent features of biological neurons and microcircuits, such as the refractory behavior of neurons and local recurrent connections. For that purpose, we have moved from computations on static inputs toward an analysis of computations on time series, which is arguably a particularly relevant computational domain biologically. Various filtering properties of biological neural systems have already been addressed in Koch (1999). We have exhibited specific macroscopic effects for analog computations on time series in space-rate coding that are caused by unreliable synapses in combination with the decaying parts of postsynaptic potentials and refractory effects on the neuron level, and in combination with recurrent connections in microcircuits. We have shown that through these features, a rich repertoire of linear filters, especially arbitrary time-invariant linear filters with finite and infinite impulse response, can be approximated on the level of space-rate coding in populations of neurons. In combination with the results from section 2, this shows that for computations on time series, in principle the full computational power of the artificial neural networks proposed in Back and Tsoi (1991) can be emulated by networks of biologically realistic neurons with space-rate coding.

In section 4 we investigated macroscopic effects of a local learning rule for spiking neurons that is empirically supported by the experiments of Markram, Lübke, Frotscher, and Sakmann (1997). Whereas this learning rule is usually discussed only in the context of temporal coding in small neuronal circuits, we showed that the same learning rule gives rise to a Hebbian learning rule on the network level with space-rate coding (hence without a coding of relevant information in firing times). This link between the Markram rule and Hebbian learning for space-rate coding appears to be obvious at first sight. A closer look, however, shows that it requires the large trial-to-trial variability of neuronal firing that is provided, in contrast to previous models, by the model for analog computation with space-rate coding based on unreliable synapses that was presented in section 2.

Appendix

A.1 Synapse Model. In all the simulations reported in this article, a connection between a neuron $u \in U_i$ and $v \in V$ is modeled by $d_{vu}$ independent release sites, where $d_{vu}$ may vary for different pairs $u, v$ of neurons. The release probability $p_{vu}$ is assumed to be equal at all $d_{vu}$ release sites of one connection. In the case of a release at a single release site, the amplitude of the resulting EPSP (IPSP), also known as the quantal size, is drawn from a gaussian distribution with mean $\bar{q}_{vu}$ and variance $\tilde{q}_{vu}$. For such a connection, the probability density function $w_{vu}$ of EPSP (IPSP) amplitudes in the case of a release is given by (we skip the indices $vu$ for the sake of brevity)
$$ w(a) = \frac{1}{r} \sum_{k=1}^{d} \binom{d}{k} p^{k} (1-p)^{d-k}\, Q\!\left(a;\; k\bar{q},\; \sqrt{k\tilde{q}^{2}}\right), $$

where $Q(\cdot\,; \bar{q}, \tilde{q})$ is the normal probability density function with mean $\bar{q}$ and variance $\tilde{q}$, and $r = 1 - (1-p)^{d}$ is the probability that at least at one of the release sites some vesicle is released; that is, $r$ is the "nonfailure probability." The mean $\bar{a} := \int a\, w(a)\, da$ and the second moment $\hat{a} := \int a^{2} w(a)\, da$ of such a distribution are given by $\bar{a} = \frac{1}{r}\,\bar{q}dp$ and $\hat{a} = \frac{1}{r}\left(\bar{q}^{2}(dp^{2}(d-1) + dp) + \tilde{q}^{2} dp\right)$. Note that $d = 1$ yields $\bar{a} = \bar{q}$ and $\hat{a} = \bar{q}^{2} + \tilde{q}^{2}$.

A.2 Neuron Model. For all simulations we have used the spike-response model as described in Gerstner (1999b). The time course of an EPSP (IPSP) is modeled by $\frac{a}{t_2 - t_1}\left(e^{-t/t_2} - e^{-t/t_1}\right)$, where $t_1$ and $t_2$ describe the rise and fall
time of the PSP, and $a$ defines the amplitude.^10 If not stated otherwise, we have set $t_1 = 5$ ms, $t_2 = 12$ ms for the EPSPs and $t_1 = 10$ ms, $t_2 = 12$ ms for the slower IPSPs. The refractory behavior is modeled by the function $g(t) = -R\,e^{-t/\tau_r}$, where $\tau_r = 4$ ms was used for all simulations. $R$ was chosen to be equal to the threshold of the neuron.

A.3 Synaptic Parameters. The parameters $d_{vu}$, $p_{vu}$, $\bar{q}_{vu}$, and $\tilde{q}_{vu}$ for a connection between a neuron $u \in U_i$ and $v \in V$ are chosen independently from distributions reported in the literature. The number of release sites $d_{vu}$ was chosen randomly from the set $\{1, 2, 3, 4, 5\}$ (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997). The release probabilities $p_{vu}$ are drawn from an exponential distribution (Huang & Stevens, 1997) with mean $p_i$ (see Figure 1B for an example). The mean quantal sizes $\bar{q}_{vu}$ are drawn from a normal distribution with mean $\bar{q}_i$ and a variance of $0.1|\bar{q}_i|$. In accordance with the results reported in Auger et al. (1998), we have chosen the standard deviation of the quantal sizes $\tilde{q}_{vu} = 0.05\bar{q}_{vu}$. $p_i$ and $\bar{q}_i$ were chosen to reflect different "effective weights" $w_i$ between pool $U_i$ and pool $V$, as specified in the caption of Figure 1.

A.4 Second-Order Approximation of $\sigma_v^2$. We define the average error $E$ between $\sigma_v^2$ and the second-order polynomial $A m_v^2 + B m_v + C$ as $E := \int_{[0,1]^n} \left(\sigma_v^2 - (A m_v^2 + B m_v + C)\right)^2 dx$. We want to find the values $A_0$, $B_0$, and $C_0$ for the parameters $A$, $B$, and $C$ such that $E$ assumes its minimum $E_{\min}$. Therefore, we simultaneously solve the equations $\frac{dE}{dA} = 0$, $\frac{dE}{dB} = 0$, and $\frac{dE}{dC} = 0$. After some calculations one gets
$$ A_0 = -\frac{1}{N}\,\frac{\sum_i w_i^4}{\sum_i w_i^4 + \frac{5}{2}\sum_i \sum_{j \neq i} w_i^2 w_j^2}, $$
$$ B_0 = \frac{\sum_i \hat{w}_i w_i}{\sum_i w_i^2} - \frac{1}{N}\,\frac{\sum_i w_i^3}{\sum_i w_i^2} - \frac{A_0}{\sum_i w_i^2}\left(\sum_i w_i^3 + \sum_i \sum_{j \neq i} w_i^2 w_j\right), $$
$$ C_0 = \frac{1}{2}\sum_i (\hat{w}_i - B_0 w_i) - \frac{1}{3N}\sum_i w_i^2 - \frac{A_0}{4}\sum_i \sum_{j \neq i} w_i w_j - \frac{A_0}{3}\sum_i w_i^2. $$
In the limit $N \to \infty$ these equations reduce to $A_0 = 0$, $B_0 = \frac{\sum_i \hat{w}_i w_i}{\sum_i w_i^2}$, and $C_0 = \frac{1}{2}\sum_i (\hat{w}_i - B_0 w_i)$. These solutions lead to a minimal value of the average error $E$ of
$$ E_{\min} = \frac{1}{12}\,\frac{\left(\sum_i \hat{w}_i^2\right)\left(\sum_i w_i^2\right) - \left(\sum_i \hat{w}_i w_i\right)^2}{\sum_i w_i^2}. $$

^10 If the two time constants of these two exponentially decaying functions converge to a common value, then their difference converges to an "α-function," which is another common description of the time course of a PSP.
With the definitions $w_v = \langle w_{v1}, \ldots, w_{vn}\rangle$, $\hat{w}_v = \langle \hat{w}_{v1}, \ldots, \hat{w}_{vn}\rangle$, and $\cos\theta = (w_v \cdot \hat{w}_v)/(\|w_v\|\,\|\hat{w}_v\|)$, this reduces to $E_{\min} = \frac{\|\hat{w}_v\|^2}{12}\,(1 - \cos^2\theta)$.

A.5 Frequency Response of a Time-Invariant Linear Filter. The transfer function (in terms of the z-transformation; Haykin, 1996) of a time-invariant linear filter that transforms a time series $x(t)$ into a time series $y(t) = \sum_{k=0}^{k_0} b_k x(t-k) - \sum_{k=1}^{k_0} a_k y(t-k)$ is given by $H(z) = \left(\sum_{k=0}^{k_0} b_k z^{-k}\right) / \left(1 + \sum_{k=1}^{k_0} a_k z^{-k}\right)$, where $z$ is a complex variable. Setting $z = e^{j2\pi f}$ ($j = \sqrt{-1}$), we get the filter frequency response, denoted by $H(e^{j2\pi f})$, where $f$ denotes the frequency in hertz. Expressing $H(e^{j2\pi f})$ in its polar form $|H(f)|\,e^{j\psi(f)}$, one defines the frequency response of the filter in terms of two components: the amplitude response $|H(f)|$ and the phase response $\psi(f)$. These two quantities have the following meaning: if one applies a sine wave with frequency $f$ at the input, the output of a linear filter is also a sine wave, but with different amplitude and phase. The amplitude response $|H(f)|$ is the ratio between the amplitudes of the output and the input sine wave. The phase response $\psi(f)$ is the difference between the phases of the output and the input sine waves. In that way, one can measure for each frequency $f$ the amplitude and phase response, as was done in the simulations reported in Figure 6.

A.6 Details of Simulations Reported in Figures 5 and 6. The network that approximates a low-pass filter (cf. Figure 5C) consists of a single excitatory presynaptic pool $U_1$ and the postsynaptic pool $V$. The effective weight between $U_1$ and $V$ is $w_1 = 20$. For the time constants of the EPSPs we have chosen $t_1 = 25$ ms and $t_2 = 26$ ms. For the network that approximates a bandpass filter (cf. Figure 5B), there were in addition inhibitory recurrent connections from pool $V$ to pool $V$ with an effective weight $\tilde{w}_1 = -40$ ($t_1 = 30$ ms and $t_2 = 26$ ms).
The effective weight from pool $U_1$ to pool $V$ was $w_1 = 130$ ($t_1 = 10$ ms and $t_2 = 11$ ms).

A.7 Details of Simulations Reported in Figure 7. We performed 300 simulations for the model from section 2 for $n = 6$, $N = 200$ with different values of $b$ and different sets of initial release probabilities $p_{vu}$ (and hence different sets of initial weights $w_i = (1/N)\sum_{v \in V}\sum_{u \in U_i} r_{vu}\bar{a}_{vu}$). Throughout these simulations, we used synapses with just a single release site. Each of these 300 simulations consisted of 10 individual trials with different inputs $\langle x_1^{(l)}, \ldots, x_6^{(l)}\rangle$, $l = 1, \ldots, 10$, where the inputs $x_i^{(l)}$ are drawn from normal distributions with mean $\bar{x}_i$ and variance $\hat{x}_i^2$ ($\bar{x}_i \in [0,1]$ and $\hat{x}_i \in [0.1, 0.3]$ were chosen uniformly from the given intervals for each simulation). In each trial, we simulated a network of spiking neurons and applied Markram's rule locally. The release probability $p_{vu}$ of a synapse between neuron $u \in U_i$ and $v \in V$ was increased by $b$ if the postsynaptic neuron fired within 20 ms after the presynaptic neuron. After these 10 trials, we compared the resulting
effective weights $w_i^{new} = (1/N)\sum_{v \in V}\sum_{u \in U_i} r_{vu}\bar{a}_{vu}$ with the weights $W_i^{new} = W_i^{old} + Nb\,\bar{a}_i \sum_{l=1}^{10} x_i^{(l)} y^{(l)}$ predicted by the Hebb rule, where $\bar{a}_i$ is the mean PSP amplitude resulting from spikes of neurons $u \in U_i$ and $y^{(l)}$ is given by $y^{(l)} = s(\sum_{i=1}^{6} w_i x_i^{(l)})$ for the sigmoidal function $s$ given by the right-hand side of equation 2.6.

Acknowledgments
We thank Peter Auer for helpful discussions and an anonymous referee for helpful comments. Research for this article was partially supported by the ESPRIT Working Group NeuroCOLT, No. 8556, and the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and gain control. Science, 275, 220–224.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology, 70(1), 1629–1638.
Auger, C., Kondo, S., & Marty, A. (1998). Multivesicular release at single functional synaptic sites in cerebellar stellate and basket cells. Journal of Neuroscience, 18(12).
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3, 375–385.
Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity determines spatial-temporal integration in pyramidal cells. Proceedings of the National Academy of Sciences, 88, 11569–11573.
Dobrunz, L., & Stevens, C. (1997). Heterogeneity of release probability, facilitation and depletion at central synapses. Neuron, 18, 995–1008.
Gerstner, W. (1999a). Populations of spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 261–293). Cambridge, MA: MIT Press.
Gerstner, W. (1999b). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 3–53). Cambridge, MA: MIT Press.
Haykin, S. (1996). Adaptive filter theory. Englewood Cliffs, NJ: Prentice Hall.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 367, 33–36.
Huang, E. P., & Stevens, C. F. (1997). Estimating the distribution of synaptic reliabilities. J. Neurophysiology, 78(6), 2870–2880.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Larkman, A. U., Jack, J. J. B., & Stratford, K. J. (1997). Quantal analysis of excitatory synapses in rat hippocampal CA1 in vitro during low-frequency depression. J. Physiology, 505, 457–471.
Maass, W. (1997). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304.
Markram, H., Lübke, J., Frotscher, M., Roth, A., & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurons in the developing rat neocortex. J. Physiology, 500(2), 409–440.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Petrov, V. V. (1995). Limit theorems of probability theory. New York: Oxford University Press.
Stevens, C. F., & Zador, A. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 3, 210–217.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysics Journal, 12, 1–24.

Received July 2, 1998; accepted March 18, 1999.
LETTER
Communicated by Bartlett Mel
Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces

Aapo Hyvärinen
Patrik Hoyer
Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02015 HUT, Finland

Olshausen and Field (1996) applied the principle of independence maximization by sparse coding to extract features from natural images. This leads to the emergence of oriented linear filters that have simultaneous localization in space and in frequency, thus resembling Gabor functions and simple cell receptive fields. In this article, we show that the same principle of independence maximization can explain the emergence of phase- and shift-invariant features, similar to those found in complex cells. This new kind of emergence is obtained by maximizing the independence between norms of projections on linear subspaces (instead of the independence of simple linear filter outputs). The norms of the projections on such "independent feature subspaces" then indicate the values of invariant features.

1 Introduction
A fundamental approach in signal processing is to design a statistical generative model of the observed signals. Such an approach is also useful for modeling the properties of neurons in primary sensory areas. Modeling visual data by a simple linear generative model, Olshausen and Field (1996) showed that the principle of maximizing the sparseness (or nongaussianity) of the underlying image components is enough to explain the emergence of Gabor-like filters that resemble the receptive fields of simple cells in mammalian primary visual cortex (V1). Maximizing sparseness is in this context equivalent to maximizing the independence of the image components (Comon, 1994; Bell & Sejnowski, 1997; Olshausen & Field, 1996). We show in this article that this same principle can also explain the emergence of phase- and shift-invariant features: the principal properties of complex cells in V1. Using the method of feature subspaces (Kohonen, 1995, 1996), we model the response of a complex cell as the norm of the projection of the input vector (image patch) onto a linear subspace, which is equivalent to the classical energy models. Then we maximize the independence between the norms of such projections, or energies. Thus we obtain features that are localized

Neural Computation 12, 1705–1720 (2000) © 2000 Massachusetts Institute of Technology
in space, oriented, and bandpass (selective to scale/frequency), like those given by simple cells, or Gabor analysis. In contrast to simple linear filters, however, the obtained features also show the emergence of phase invariance and (limited) shift or translation invariance. Phase invariance means that the response does not depend on the Fourier phase of the stimulus; the response is the same for a white bar and a black bar, as well as for a bar and an edge. Limited shift invariance means that a near-maximum response can be elicited by identical bars or edges at slightly different locations. These two latter properties closely parallel the properties that distinguish complex cells from simple cells in V1. Maximizing the independence, or equivalently, the sparseness of the norms of the projections to feature subspaces thus allows for the emergence of exactly those invariances that are encountered in complex cells, indicating their fundamental importance in image data.

2 Independent Component Analysis of Image Data
The basic models that we consider here express a static monochrome image $I(x,y)$ as a linear superposition of some features or basis functions $b_i(x,y)$,
$$ I(x,y) = \sum_{i=1}^{m} b_i(x,y)\, s_i, \tag{2.1} $$
where the $s_i$ are stochastic coefficients, different for each image $I(x,y)$. The crucial assumption here is that the $s_i$ are nongaussian and mutually independent. This type of decomposition is called independent component analysis (ICA) (Comon, 1994; Bell & Sejnowski, 1997; Hyvärinen & Oja, 1997) or, from an alternative viewpoint, sparse coding (Olshausen & Field, 1996, 1997). Estimation of the model in equation 2.1 consists of determining the values of $s_i$ and $b_i(x,y)$ for all $i$ and $(x,y)$, given a sufficient number of observations of images, or in practice, image patches $I(x,y)$. We restrict ourselves here to the basic case where the $b_i(x,y)$ form an invertible linear system. Then we can invert the system as
$$ s_i = \langle w_i, I\rangle, \tag{2.2} $$
where the $w_i$ denote the inverse filters, and $\langle w_i, I\rangle = \sum_{x,y} w_i(x,y)\, I(x,y)$ denotes the dot product. The $w_i(x,y)$ can then be identified as the receptive fields of the model simple cells, and the $s_i$ are their activities when presented with a given image patch $I(x,y)$. Olshausen and Field (1996) showed that when this model is estimated with input data consisting of patches of natural scenes, the obtained filters $w_i(x,y)$ have the three principal properties of simple cells in V1: they are localized, oriented, and bandpass. van Hateren and van der Schaaf (1998) compared quantitatively the obtained filters $w_i(x,y)$ with those measured by single-cell recordings of the macaque cortex and found a good match for most of the parameters.
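As a minimal numerical sketch of equations 2.1 and 2.2, using a hypothetical two-pixel "image" and two basis vectors instead of real image patches:

```python
# Sketch of the generative model I = sum_i b_i s_i (equation 2.1) and
# its inversion s_i = <w_i, I> with W = B^{-1} (equation 2.2), for a
# tiny 2-pixel image and m = 2 basis functions.  The basis vectors and
# source coefficients are hypothetical illustrative values.

# basis vectors b_i as the columns of B
B = [[2.0, 0.0],
     [1.0, 1.0]]
s = [0.5, -1.5]          # stochastic (nongaussian) source coefficients

# synthesize the image: I[p] = sum_i B[p][i] * s[i]
I = [sum(B[p][i] * s[i] for i in range(2)) for p in range(2)]

# invert the 2x2 system: the filters w_i are the rows of W = B^{-1}
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
W = [[ B[1][1] / det, -B[0][1] / det],
     [-B[1][0] / det,  B[0][0] / det]]

# recover the sources: s_i = <w_i, I>
s_rec = [sum(W[i][p] * I[p] for p in range(2)) for i in range(2)]
```

In the actual model, estimating the $b_i(x,y)$ (and hence the filters $w_i$) from image data is the hard part; this sketch only illustrates the forward synthesis and the exact inversion once the basis is known.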
3 Decomposition into Independent Feature Subspaces
In addition to the essentially linear simple cells, another important class of cells in V1 is complex cells. Complex cells share the above-mentioned properties of simple cells but have the two principal distinguishing properties of phase invariance and (limited) shift invariance (Hubel & Wiesel, 1962; Pollen & Ronner, 1983), at least for the preferred orientation and frequency. Note that although invariance with respect to shift and global Fourier phase are equivalent, they are different properties when phase is computed from a local Fourier transform. Another distinguishing property of complex cells is that the receptive fields are larger than in simple cells, but this difference is only quantitative and of less consequence here. (For more details, see Heeger, 1992; Pollen & Ronner, 1983; Mel, Ruderman, & Archie, 1998.) To this date, very few attempts have been made to formulate a statistical model that would explain the emergence of the properties of visual complex cells. It is simple to see why ICA as in equation 2.1 cannot be directly used for modeling complex cells. This is due to the fact that in that model, the activations of the neurons $s_i$ can be used linearly to reconstruct the image $I(x,y)$, which is not true for complex cells due to their two principal properties of phase invariance and shift invariance. The responses of complex cells do not give the phase or the exact position of the stimulus, at least not as a linear function, as in equation 2.1. (See von der Malsburg, Shams, & Eysel, 1998, for a nonlinear reconstruction of the image from complex cell responses.) The purpose of this article is to explain the emergence of phase- and shift-invariant features using a modification of the ICA model. The modification is based on combining the technique of multidimensional independent component analysis (Cardoso, 1998) and the principle of invariant-feature subspaces (Kohonen, 1995, 1996). We first describe these two recently developed techniques.
3.1 Invariant Feature Subspaces. The classical approach for feature extraction is to use linear transformations, or filters. The presence of a given feature is detected by computing the dot product of the input data with a given feature vector. For example, wavelet, Gabor, and Fourier transforms, as well as most models of V1 simple cells, use such linear features. The problem with linear features, however, is that they necessarily lack any invariance with respect to such transformations as spatial shift or change in (local) Fourier phase (Pollen & Ronner, 1983; Kohonen, 1996). Kohonen (1996) developed the principle of invariant-feature subspaces as an abstract approach to representing features with some invariances. This principle states that one may consider an invariant feature as a linear subspace in a feature space. The value of the invariant, higher-order feature is given by (the square of) the norm of the projection of the given data point on that subspace, which is typically spanned by lower-order features.
1708
Aapo Hyvärinen and Patrik Hoyer
A feature subspace, like any linear subspace, can always be represented by a set of orthogonal basis vectors, say w_i(x, y), i = 1, ..., n, where n is the dimension of the subspace. Then the value F(I) of the feature F with input vector I(x, y) is given by

    F(I) = \sum_{i=1}^{n} \langle w_i, I \rangle^2 .    (3.1)
(For simplicity of notation and terminology, we do not distinguish clearly between the norm and the square of the norm in this article.) In fact, this is equivalent to computing the distance between the input vector I(x, y) and a general linear combination of the basis vectors (filters) w_i(x, y) of the feature subspace (Kohonen, 1996). A graphical depiction of feature subspaces is given in Figure 1. Kohonen (1996) showed that this principle, when combined with competitive learning techniques, can lead to the emergence of invariant image features.

3.2 Multidimensional Independent Component Analysis. In multidimensional independent component analysis (Cardoso, 1998), a linear generative model as in equation 2.1 is assumed. In contrast to ordinary ICA, however, the components (responses) s_i are not assumed to be all mutually independent. Instead, it is assumed that the s_i can be divided into couples, triplets, or in general n-tuples, such that the s_i inside a given n-tuple may depend on each other, but dependencies among different n-tuples are not allowed. Every n-tuple of s_i corresponds to n basis vectors b_i(x, y). We call a subspace spanned by a set of n such basis vectors an independent (feature) subspace. In general, the dimensionalities of the independent subspaces need not be equal, but we assume so for simplicity. The model can be simplified by two additional assumptions. First, although the components s_i are not all independent, we can always define them so that they are uncorrelated and of unit variance. In fact, linear dependencies inside a given n-tuple of dependent components could always be removed by a linear transformation. Second, we can assume that the data are whitened (sphered); this can always be accomplished by, for example, PCA (Comon, 1994). Whitening is a conventional preprocessing step in ordinary ICA, where it makes the basis vectors b_i orthogonal (Comon, 1994; Hyvärinen & Oja, 1997), if we ignore any finite-sample effects.
These two assumptions imply that the b_i are orthonormal and that we can take b_i = w_i, as in ordinary ICA with whitened data. In particular, the independent subspaces become orthogonal after whitening. These facts follow directly from the proof in Comon (1994), which applies here as well, due to our above assumptions.
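As a minimal illustration (our own sketch, not the authors' code), the projection-norm feature of equation 3.1 can be computed as follows; the subspace dimension, input dimension, and random data are arbitrary assumptions chosen for the example:

```python
import numpy as np

def subspace_response(W, I):
    """Invariant-feature response of eq. 3.1: the squared norm of the
    projection of input vector I on the subspace spanned by the rows of W."""
    s = W @ I                      # dot products <w_i, I>, one per basis vector
    return float(np.sum(s ** 2))

# A 4-dimensional feature subspace in a 16-dimensional input space
# (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 4)))
W = Q.T                            # 4 orthonormal basis vectors as rows
I = rng.standard_normal(16)
F = subspace_response(W, I)
```

Because only the projection norm is used, the response is unchanged by any rotation of the basis within the subspace, which is the source of the invariance.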
Emergence of Phase- and Shift-Invariant Features
1709
Figure 1: Graphical depiction of the feature subspaces. First, dot products of the input data with a set of basis vectors are taken. Here, we have two subspaces with four basis vectors in each. The dot products are then squared, and the sums are taken inside the feature subspaces. Thus we obtain the (squares of the) norms of the projections on the feature subspaces, which are considered as the responses of the subspaces. Square roots may be taken for normalization. This scheme represents features that go beyond simple linear filters, possibly obtaining invariance with respect to some transformations of the input, for example, shift and phase invariance. The subspaces, or the underlying vectors w_i, may be learned by the principle of maximum sparseness, which coincides here with maximum likelihood of a generative model.
Let us denote by J the number of independent feature subspaces and by S_j, j = 1, ..., J, the sets of the indices of the s_i belonging to the subspace of index j. Assume that the data consist of K observed image patches I_k(x, y), k = 1, ..., K. Then we can express the likelihood L of the data given the model as follows:

    L\bigl( I_k(x, y), k = 1, \ldots, K ;\; w_i(x, y), i = 1, \ldots, m \bigr)
      = \prod_{k=1}^{K} \Bigl[ \, |\det W| \, \prod_{j=1}^{J} p_j\bigl( \langle w_i, I_k \rangle, i \in S_j \bigr) \Bigr] ,    (3.2)
where p_j(.), which is a function of the n arguments ⟨w_i, I_k⟩, i ∈ S_j, gives the probability density inside the jth n-tuple of s_i, and W is a matrix containing the filters w_i(x, y) as its columns. The term |det W| appears here as in any expression of the probability density of a transformation, giving the change in volume produced by the linear transformation (see, e.g., Pham, Garrat, & Jutten, 1992). The n-dimensional probability density p_j(.) is not specified in advance in the general definition of multidimensional ICA (Cardoso, 1998).

3.3 Combining Invariant Feature Subspaces and Independent Subspaces. Invariant-feature subspaces can be embedded in multidimensional
independent component analysis by considering probability distributions for the n-tuples of s_i that are spherically symmetric, that is, depend only on the norm. In other words, the probability density p_j(.) of the n-tuple with index j ∈ {1, ..., J} can be expressed as a function of the sum of the squares of the s_i, i ∈ S_j, only. For simplicity, we assume further that the p_j(.) are equal for all j, that is, for all subspaces. This means that the logarithm of the likelihood L of the data, that is, the K observed image patches I_k(x, y), k = 1, ..., K, given the model, can be expressed as

    \log L\bigl( I_k(x, y), k = 1, \ldots, K ;\; w_i(x, y), i = 1, \ldots, m \bigr)
      = \sum_{k=1}^{K} \sum_{j=1}^{J} \log p\Bigl( \sum_{i \in S_j} \langle w_i, I_k \rangle^2 \Bigr) + K \log |\det W| ,    (3.3)
where p( \sum_{i \in S_j} s_i^2 ) = p_j( s_i, i \in S_j ) gives the probability density inside the jth n-tuple of s_i. Recall that prewhitening allows us to consider the w_i(x, y) to be orthonormal, which implies that log |det W| is zero. This shows that the likelihood in equation 3.3 is a function of the norms of the projections of I_k(x, y) on the subspaces indexed by j, which are spanned by the orthonormal basis sets given by w_i(x, y), i ∈ S_j. Since the norm of the projection of visual data
on practically any subspace has a supergaussian distribution, we need to choose the probability density p in the model to be sparse (Olshausen & Field, 1996), that is, supergaussian (Hyvärinen & Oja, 1997). For example, we could use the following probability distribution,

    \log p\Bigl( \sum_{i \in S_j} s_i^2 \Bigr) = -a \Bigl[ \sum_{i \in S_j} s_i^2 \Bigr]^{1/2} + b ,    (3.4)
which could be considered a multidimensional version of the exponential distribution (Field, 1994). The scaling constant a and the normalization constant b are determined so as to give a probability density that is compatible with the constraint of unit variance of the s_i, but they are irrelevant in the following. Thus, we see that the estimation of the model consists of finding subspaces such that the norms of the projections of the (whitened) data on those subspaces have maximally sparse distributions.

The introduced independent feature subspace analysis is a natural generalization of ordinary ICA. In fact, if the projections on the subspaces are reduced to dot products, that is, projections on one-dimensional subspaces, the model reduces to ordinary ICA, provided that, in addition, the independent components are assumed to have symmetric distributions. It is to be expected that the norms of the projections on the subspaces represent some higher-order, invariant features. The exact nature of the invariances has not been specified in the model but will emerge from the input data, using only the prior information on their independence.

When independent feature subspace analysis is applied to natural image data, we can identify the norms of the projections (\sum_{i \in S_j} s_i^2)^{1/2} as the responses of the complex cells. If the individual filter vectors w_i(x, y) are identified with the receptive fields of simple cells, this can be interpreted as a hierarchical model where the complex cell response is computed from simple cell responses s_i, in a manner similar to the classical energy models for complex cells (Hubel & Wiesel, 1962; Pollen & Ronner, 1983; Heeger, 1992). It must be noted, however, that our model does not specify the particular basis of a given invariant-feature subspace.

3.4 Learning Independent Feature Subspaces. Learning the independent feature subspace representation can be simply achieved by gradient ascent of the log-likelihood in equation 3.3.
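As an illustrative sketch (our own, not the authors' implementation), the log-likelihood of equation 3.3 with the density of equation 3.4 can be evaluated as follows for whitened data and orthonormal filters, in which case the K log |det W| term vanishes; the dimensions, a = 1, and the random data are assumptions made for the example:

```python
import numpy as np

def log_likelihood(W, X, subspaces, alpha=1.0):
    """Log-likelihood of eq. 3.3 under the density of eq. 3.4,
    log p(u) = -alpha * sqrt(u) + const.  X holds one whitened patch
    per row; W holds one filter w_i per row.  For orthonormal W the
    log|det W| term is zero, and the additive constant b is omitted."""
    S = X @ W.T                                  # responses s_i = <w_i, I_k>
    total = 0.0
    for idx in subspaces:                        # idx: indices of one n-tuple S_j
        norms = np.sqrt(np.sum(S[:, idx] ** 2, axis=1))
        total -= alpha * np.sum(norms)           # sum of log p over patches
    return total

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = Q.T                                          # orthonormal filter matrix
X = rng.standard_normal((50, 4))                 # 50 "whitened" patches
L = log_likelihood(W, X, [[0, 1], [2, 3]])       # two 2-dimensional subspaces
```

Note that the likelihood depends only on the projection norms, so it cannot distinguish between different orthonormal bases of the same subspace, which is exactly the indeterminacy mentioned in the text.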
Due to whitening, we can constrain the vectors w_i to be orthogonal and of unit norm, as in ordinary ICA; these constraints usually speed up convergence. A stochastic gradient ascent of the log-likelihood can be obtained as
    \Delta w_i(x, y) \propto I(x, y) \, \langle w_i, I \rangle \, g\Bigl( \sum_{r \in S_{j(i)}} \langle w_r, I \rangle^2 \Bigr) ,    (3.5)
where j(i) is the index of the subspace to which w_i belongs, and g = p'/p is a nonlinear function that incorporates our information on the sparseness of the norms of the projections. For example, if we choose the distribution in equation 3.4, we have g(u) = -(1/2) a u^{-1/2}, where the positive constant (1/2)a can be ignored. After every step of equation 3.5, the vectors w_i need to be orthonormalized; for a variety of methods to perform this, see Hyvärinen and Oja (1997) and Karhunen, Oja, Wang, Vigario, and Joutsensalo (1997).

The learning rule in equation 3.5 can be considered as "modulated" nonlinear Hebbian learning. If the subspace that contains w_i were just one-dimensional, this learning rule would reduce to the learning rules for ordinary ICA given in Hyvärinen and Oja (1998), which are closely related to those in Bell and Sejnowski (1997), Cardoso and Laheld (1996), and Karhunen et al. (1997). The difference is that in the general case, the Hebbian term is divided by a function of the output of the complex cell, given by \sum_{r \in S_{j(i)}} \langle w_r, I \rangle^2, if we assume the terminology of the energy models. In other words, the Hebbian term is modulated by a top-down feedback signal. In addition to this modulation, the neurons interact in the form of the orthogonalizing feedback.

4 Experiments
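A minimal sketch of one learning step is given below, using the density of equation 3.4 (so g(u) = -u^{-1/2} up to the ignored constant a/2) and symmetric orthonormalization via the SVD; the filter and subspace dimensions, the learning rate, and the random data are illustrative assumptions:

```python
import numpy as np

def learning_step(W, I, subspaces, lr=0.05, eps=1e-8):
    """One stochastic gradient step of eq. 3.5 followed by symmetric
    orthonormalization of the filter matrix W (rows = filters w_i)."""
    W = W.copy()
    s = W @ I                                   # responses <w_i, I>
    for idx in subspaces:                       # idx: indices of subspace S_j
        u = float(np.sum(s[idx] ** 2)) + eps    # complex-cell output (squared norm)
        g = -u ** -0.5                          # g(u) for the density of eq. 3.4
        for i in idx:
            W[i] += lr * I * s[i] * g           # modulated Hebbian term
    # symmetric orthonormalization: W <- (W W^T)^{-1/2} W
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((16, 8)))
W0 = Q.T                                        # 8 filters in 16 dimensions
subspaces = [[0, 1, 2, 3], [4, 5, 6, 7]]        # two 4-dimensional subspaces
W1 = learning_step(W0, rng.standard_normal(16), subspaces)
```

The top-down modulation is visible directly in the code: every filter of a subspace is updated with the same factor g computed from the subspace (complex-cell) output.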
To test our model, we used patches of natural images as input data I_k(x, y) and estimated the model of independent feature subspace analysis.

4.1 Data and Methods. The data were obtained by taking 16 × 16 pixel image patches at random locations from monochrome photographs depicting wildlife scenes (animals, meadows, forests, etc.). The images were taken directly from PhotoCDs and are available on the World Wide Web.¹ The mean gray-scale value of each image patch (i.e., the DC component) was subtracted. The data were then low-pass filtered by reducing the dimension of the data vector by principal component analysis, retaining the 160 principal components with the largest variances. Next, the data were whitened by the zero-phase whitening filter, which means multiplying the data by C^{-1/2}, where C is the covariance of the data (after PCA) (see, e.g., Bell and Sejnowski, 1997). These preprocessing steps are essentially similar to those used in Olshausen and Field (1996) and van Hateren and van der Schaaf (1998).

The likelihood in equation 3.3 for 50,000 such observations was maximized under the constraint of orthonormality of the filters in the whitened space, using the averaged version of the learning rule in equation 3.5; that is, we used the ordinary gradient of the likelihood instead of the stochastic gradient. The fact that the data were contained in a 160-dimensional subspace meant that the 160 basis vectors w_i now formed an orthonormal system for that subspace and not for the original space, but this did not necessitate any
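The preprocessing pipeline described above can be sketched as follows; random data stand in for the natural image patches, and the patch count and number of retained components are scaled down from the paper's 50,000 patches and 160 components (both assumptions for the example):

```python
import numpy as np

def preprocess(patches, n_components=160):
    """DC removal, PCA dimension reduction, and zero-phase whitening
    (multiplication by C^{-1/2}) as described in section 4.1.
    `patches` holds one flattened image patch per row."""
    X = patches - patches.mean(axis=1, keepdims=True)  # subtract DC component
    C = np.cov(X, rowvar=False)
    d, E = np.linalg.eigh(C)
    order = np.argsort(d)[::-1][:n_components]         # largest variances first
    Y = X @ E[:, order]                                # reduced-dimension data
    Cy = np.cov(Y, rowvar=False)
    dy, Ey = np.linalg.eigh(Cy)
    return Y @ (Ey @ np.diag(dy ** -0.5) @ Ey.T)       # multiply by C^{-1/2}

# Random data standing in for 16 x 16 natural image patches.
rng = np.random.default_rng(5)
patches = rng.standard_normal((1000, 256))
Xw = preprocess(patches, n_components=40)
```

After this step the sample covariance of the output is the identity, which is what licenses the orthonormality constraint on the filters in the whitened space.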
¹ WWW address: http://www.cis.hut.fi/projects/ica/data/images
Figure 2: Linear filter sets associated with the feature subspaces (model complex cells), as estimated from natural image data. Every group of four filters spans a single feature subspace (for whitened data).
changes in the learning rule. The density p was chosen as in equation 3.4. The algorithm was initialized as in Bell and Sejnowski (1997) by taking as w_i the 160 middle columns of the identity matrix. We also tried random initial values for W. These yielded qualitatively identical results, but using a localized filter set as the initial value considerably improved the convergence of the method, especially avoiding some of the filters getting stuck in local minima. This initialization led, incidentally, to a weak topographical organization of the filters. The computations took about 10 hours on a single RISC processor. Experiments were made with different dimensions for the subspaces S_j: 2, 4, and 8 (in a single run, all the subspaces had the same dimension). The results we show are for four-dimensional subspaces, but the results are similar for other dimensions.

4.2 Results. Figure 2 shows the filter sets of the 40 feature subspaces (complex cells) when the subspace dimension was chosen to be 4. The results are shown in the zero-phase whitened space. Note that due to orthogonality, the filters are equal to the basis vectors. The filters look qualitatively similar in the original, unwhitened space. The only difference is that in the original space, the filters are sharper, that is, more concentrated on higher frequencies.
Figure 3: Typical stimuli used in the experiments in Figures 4 to 6. The middle column shows the optimal Gabor stimulus. One of the parameters was varied at a time. (Top row) Varying phase. (Middle row) Varying location (shift). (Bottom row) Varying orientation.
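Stimuli like those of Figure 3 can be sketched as Gabor patterns in which one parameter is varied at a time; the parameter values below (patch size, frequency, envelope width) are our own illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def gabor(size=16, freq=0.3, theta=0.0, phase=0.0, shift=0.0, sigma=3.0):
    """Gabor stimulus on a size x size grid. `theta` is the orientation,
    `phase` the sinusoid phase, and `shift` a displacement along the
    direction perpendicular to the preferred orientation."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size] - c
    u = x * np.cos(theta) + y * np.sin(theta) - shift  # across the stripes
    v = -x * np.sin(theta) + y * np.cos(theta)         # along the stripes
    envelope = np.exp(-(u ** 2 + v ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * freq * u + phase)

# Vary one parameter at a time around the "optimal" stimulus, as in Figure 3.
phase_sweep = [gabor(phase=p) for p in np.linspace(-np.pi / 2, np.pi / 2, 5)]
shift_sweep = [gabor(shift=d) for d in np.linspace(-3, 3, 5)]
```

Feeding such sweeps through a linear filter versus a feature subspace reproduces the qualitative comparison of Figures 4 to 6: the dot-product response oscillates with phase, while the projection norm stays nearly constant.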
The linear filters associated with a single complex cell all have approximately the same orientation and frequency. Their locations are not identical, but are close to each other. The phases differ considerably. Every feature subspace can thus be considered a generalization of a quadrature-phase filter pair as found in the classical energy models (Pollen & Ronner, 1983), enabling the cell to be selective to a given orientation and frequency, but invariant to phase and somewhat invariant to shifts. Using four filters instead of a pair greatly enhances the shift invariance of the feature subspace. In fact, when the subspace dimension was 2, we obtained approximately quadrature-phase filter pairs.

To demonstrate the properties of the model quantitatively, we compared the responses of a representative feature subspace and the associated linear filters for different stimulus configurations. First, an optimal stimulus for the feature subspace was computed in the set of Gabor filters. One of the stimulus parameters was changed at a time to see how the response changes, while the other parameters were held constant at their optimal values. Some typical stimuli are depicted in Figure 3. The investigated parameters were phase, orientation, and location (shift). Figure 4 shows the results for one typical feature subspace. The four linear filters spanning the feature subspace are shown in Figure 4a. The optimal stimulus values (for the feature subspace) are represented by 0 in these results; that is, the values given here are departures from the optimal values. The responses are in arbitrary units. For different phases, ranging from
Figure 4: Responses elicited from a feature subspace and the underlying linear filters by different stimuli, for stimuli as in Figure 3. (a) The four (arbitrarily chosen) filters spanning the feature subspace. (b) Effect of varying phase. Upper four rows: absolute responses of linear filters (simple cells) to stimuli of different phases. Bottom row: response of feature subspace (complex cell). (c) Effect of varying location (shift), as in b. (d) Effect of varying orientation, as in b.
−π/2 to π/2, we thus obtained Figure 4b. On the bottom row, we have the response curve of the feature subspace; for comparison, the absolute values of the responses of the associated linear filters are also shown in the four plots
above it. Similarly, we obtained Figure 4c for different locations (the location was moved along the direction perpendicular to the preferred orientation; the shift units are arbitrary) and Figure 4d for different orientations (ranging from −π/2 to π/2). The response curve of the feature subspace shows clear invariance with respect to phase and location; note that the response curve for location consists of an approximately gaussian envelope, as observed in complex cells (von der Heydt, 1995). In contrast, no invariance for orientation was observed. The responses of the linear filters, on the other hand, show no invariances with respect to any of these parameters, which is in fact a necessary consequence of the linearity of the filters (Pollen & Ronner, 1983; von der Heydt, 1995). Thus, this feature subspace shows the desired properties of phase invariance and limited shift invariance, contrasting with the properties of the underlying linear filters.

To see how the above results hold for the whole population of feature subspaces, we computed the same response curves for all the feature subspaces and, for comparison, all the underlying linear filters. Optimal stimuli were computed separately for all the feature subspaces. In contrast to the above results, we computed the optimal stimuli separately for each linear filter as well; this facilitates the quantitative comparison. The response values were normalized so that the maximum response for each subspace or linear filter was equal to 1. Figure 5 shows the responses for 10 typical feature subspaces and 10 typical linear filters, whereas Figure 6 shows the median responses of the whole population of 40 feature subspaces and 160 linear filters, together with the 10th and 90th percentiles. In Figures 5a and 6a, the responses are given for varying phases. The top row shows the absolute responses of the linear filters; in the bottom row, the corresponding results for the feature subspaces are depicted.
The figures show that phase invariance is a strong property of the feature subspaces; the minimum response was usually at least two-thirds of the maximum response. Figures 5b and 6b show the results for location shift. Clearly, the receptive field of a typical feature subspace is larger and more invariant than that of a typical linear filter. As for orientation, Figures 5c and 6c depict the corresponding results, showing that orientation selectivity was approximately equivalent in linear filters and feature subspaces. Thus, we see that invariances with respect to translation and especially phase, as well as orientation selectivity, hold for the population of feature subspaces in general.

5 Discussion
An approach closely related to ours is the adaptive-subspace self-organizing map (Kohonen, 1996), which is based on competitive learning of an invariant-feature subspace representation similar to ours. Using two-dimensional subspaces, the emergence of filter pairs with phase invariance and limited shift invariance was shown in Kohonen (1996). However, the
Figure 5: Some typical response curves of the feature subspaces and the underlying linear filters when one parameter of the optimal stimulus is varied, as in Figure 3. Top row: responses (in absolute values) of 10 linear filters (simple cells). Bottom row: responses of 10 feature subspaces (complex cells). (a) Effect of varying phase. (b) Effect of varying location (shift). (c) Effect of varying orientation.
emergence of shift invariance in Kohonen (1996) was conditional on restricting consecutive patches to come from nearby locations in the image, giving the input data a temporal structure, as in a smoothly changing image sequence. Similar developments were given by Földiák (1991). In contrast to
Figure 6: Statistical analysis of the properties of the whole population of feature subspaces, with the corresponding results for linear filters given for comparison. In all plots, the solid line gives the median response in the population of all cells (filters or subspaces), and the dashed lines give the 90th and 10th percentiles of the responses. Stimuli were as in Figure 3. Top row: responses (in absolute values) of linear filters (simple cells). Bottom row: responses of feature subspaces (complex cells). (a) Effect of varying phase. (b) Effect of varying location (shift). (c) Effect of varying orientation.
these two theories, we formulated an explicit image model, showing that emergence is possible using patches at random, independently selected locations, which proves that there is enough information in static images to explain the properties of complex cells.

In conclusion, we have provided a model of the emergence of phase- and shift-invariant features, the principal properties of visual complex cells, using the same principle of maximization of independence (Comon, 1994; Barlow, 1994; Field, 1994) that has been used to explain simple cell properties (Olshausen & Field, 1996). Maximizing the independence or, equivalently, the sparseness of linear filter outputs, the model gives simple cell properties; maximizing the independence of the norms of the projections on linear subspaces, complex cell properties emerge. This provides further evidence for the fundamental importance of dependence reduction as a strategy for sensory information processing. It is, in fact, closely related to such fundamental objectives as minimizing code length and reducing redundancy
(Barlow, 1989, 1994; Field, 1994; Olshausen & Field, 1996). It remains to be seen if our framework could also account for the movement-related, binocular, and chromatic properties of complex cells; the simple cell model has had some success in this respect (van Hateren & Ruderman, 1998). Extending the model to include overcomplete bases (Olshausen & Field, 1997) may be useful for such purposes.

References

Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Barlow, H. (1994). What is the computational goal of the neocortex? In C. Koch & J. Davis (Eds.), Large-scale neuronal theories of the brain. Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Cardoso, J.-F. (1998). Multidimensional independent component analysis. In Proc. ICASSP'98 (Seattle, WA).
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–198.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 73, 218–226.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.
Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis.
IEEE Trans. on Neural Networks, 8(3), 486–504.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics, 75, 281–291.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. Journal of Neuroscience, 18, 4325–4334.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D.-T., Garrat, P., & Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO (pp. 771–774).
Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. on Systems, Man, and Cybernetics, 13, 907–916.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. Royal Society ser. B, 265, 2315–2320.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Society ser. B, 265, 359–366.
von der Heydt, R. (1995). Form analysis in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 365–382). Cambridge, MA: MIT Press.
von der Malsburg, C., Shams, L., & Eysel, U. (1998). Recognition of images from complex cell responses. Abstract presented at Proc. of Society for Neuroscience Meeting.

Received September 29, 1998; accepted May 7, 1999.
LETTER
Communicated by Jack Cowan
Tilt Aftereffects in a Self-Organizing Model of the Primary Visual Cortex

James A. Bednar
Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, U.S.A.

RF-LISSOM, a self-organizing model of laterally connected orientation maps in the primary visual cortex, was used to study the psychological phenomenon known as the tilt aftereffect. The same self-organizing processes that are responsible for the long-term development of the map are shown to result in tilt aftereffects over short timescales in the adult. The model permits simultaneous observation of large numbers of neurons and connections, making it possible to relate high-level phenomena to low-level events, which is difficult to do experimentally. The results give detailed computational support for the long-standing conjecture that the direct tilt aftereffect arises from adaptive lateral interactions between feature detectors. They also make a new prediction that the indirect effect results from the normalization of synaptic efficacies during this process. The model thus provides a unified computational explanation of self-organization and both the direct and indirect tilt aftereffect in the primary visual cortex.
1 Introduction
The tilt aftereffect (TAE; Gibson & Radner, 1937) is a simple but intriguing visual phenomenon. After staring at a pattern of tilted lines or gratings, subsequent lines appear to have a slight tilt in the opposite direction (see Figure 1). The effect resembles an afterimage from staring at a bright light, but it represents changes in orientation perception rather than in color or brightness. Most modern explanations of the TAE are loosely based on the feature-detector model of the primary visual cortex (V1), which characterizes this area as a set of orientation-detecting neurons. Experiments showed that these neurons became more difficult to excite during repeated presentation of oriented visual stimuli, and the desensitization persisted for some time (Hubel & Wiesel, 1968). This observation led to the fatigue theory of the TAE: perhaps active neurons become fatigued due to repeated firing, causing the response to a test figure to change during adaptation. Assuming the perceived orientation is some sort of average over the orientation preferences of the activated neurons, the final perceived orientation would thus show the direct TAE (Coltheart, 1971).

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1721–1740 (2000)

Figure 1: Tilt aftereffect patterns. Fixate your gaze on the circle inside the central diagram for at least 30 seconds, moving your eye slightly inside the circle to avoid developing strong afterimages. Now fixate on the diagram at the left. The vertical lines should appear slightly tilted clockwise; this phenomenon is called the direct tilt aftereffect. If you instead fixate on the horizontal lines at the right, they should appear barely tilted counterclockwise, due to the indirect tilt aftereffect. (Adapted from Campbell & Maffei, 1971.)

The fatigue theory has been discredited for a number of reasons, chief among which is that individual V1 neurons actually do not appear to fatigue (Finlayson & Cynader, 1995; McLean & Palmer, 1996). In fact, their response to direct stimulation is essentially unchanged by adaptation to a visual stimulus (Vidyasagar, 1990). The now-popular inhibition theory postulates that the TAE instead results from changing inhibition between orientation-detecting neurons (Tolhurst & Thompson, 1975).

The inhibition hypothesis has recently been incorporated into theories of the larger purpose and function of the cortex. Barlow (1990) and Földiák (1990) have proposed that the early cortical regions act to reduce the amount of redundant information present in the visual input. They suggest that aftereffects are not flaws in an otherwise well-designed system, but an unavoidable result of a self-organizing process that aims at producing an efficient, sparse encoding of the input through decorrelation (see also Field, 1994; Miikkulainen, Bednar, Choe, & Sirosh, 1997; Sirosh, Miikkulainen, & Bednar, 1996). Based on these theories, Dong (1995) has shown analytically that perfect decorrelation can result in direct tilt aftereffects that are similar to those found in humans. Only very recently, however, has it become computationally feasible to test the inhibition-decorrelation theory of the TAE in a detailed model of cortical function, with limitations similar to those known to be present in the cortex. A Hebbian self-organizing process (the receptive-field laterally interconnected synergetically self-organizing map, or RF-LISSOM; Miikkulainen et al., 1997; Sirosh, 1995; Sirosh & Miikkulainen, 1994a, 1997) has been shown
to develop feature detectors and specific lateral connections that could produce such aftereffects. The RF-LISSOM model gives rise to anatomical and functional characteristics of the cortex such as topographic maps; ocular dominance, orientation, and size preference columns; and the patterned lateral connections between them. Although other models exist that explain how the feature detectors and afferent connections could develop by input-driven self-organization, RF-LISSOM is the first model that also shows how the long-range lateral connections self-organize as an integral part of the process. The laterally connected model has also been shown to account for many of the dynamic aspects of the adult visual cortex, such as reorganization following retinal and cortical lesions (Miikkulainen et al., 1997; Sirosh & Miikkulainen, 1994b; Sirosh et al., 1996). These findings suggest that the same self-organizing processes that drive development may be acting in the adult (see Fregnac, 1996, and Gilbert, 1998, for reviews). We explore this hypothesis here with respect to the TAE.

The current work is a first study of the functional behavior of the RF-LISSOM model, specifically its response to stimuli similar to those known to cause the TAE in humans. The model permits simultaneous observation of activation and connection patterns between large numbers of neurons. This makes it possible to relate higher-level phenomena to low-level events, which is difficult to do experimentally. The results provide detailed computational support for the idea that the TAE results from a general self-organizing process that aims at producing an efficient, sparse encoding of the input through decorrelation.
2 Architecture
The cortical architecture of the model has been simplified and reduced to the minimum configuration necessary to account for the observed phenomena. Because the focus is on the two-dimensional organization of the cortex, each "neuron" in the model cortex corresponds to a vertical column of cells through the six layers of the primate cortex. The cortical network is modeled as a sheet of interconnected neurons, and the retina as a sheet of retinal ganglion cells (see Figure 2). Neurons receive afferent connections from broad, overlapping circular patches on the retina. The N × N network is projected onto a central region of the R × R retinal ganglion cells, and each neuron is connected to ganglion cells in an area of radius r around the projections. Thus, neurons at a particular cortical location receive afferents from the corresponding location on the retina. Since the lateral geniculate nucleus (LGN) accurately reproduces the receptive fields of the retina, it has been bypassed for simplicity. Each neuron also has reciprocal excitatory and inhibitory lateral connections with itself and other neurons. Lateral excitatory connections are short range, connecting each neuron with itself and its close neighbors. Lateral inhibitory connections run for comparatively long distances, but also include connections to the neuron itself and to its neighbors.[1]

[1] For high-contrast inputs, long-range interactions must be inhibitory for proper self-organization to occur (Sirosh, 1995). Recent optical imaging and electrophysiological studies have indeed shown that long-range interactions in the cortex are inhibitory at high contrasts, even though individual lateral connections are primarily excitatory (Grinvald, Lieke, Frostig, & Hildesheim, 1994; Hata, Tsumoto, Sato, Hagihara, & Tamura, 1993; Hirsch & Gilbert, 1991; Weliky, Kandler, Fitzpatrick, & Katz, 1995). The model uses explicit inhibitory connections for simplicity, since all inputs used are high contrast, and since it is the high-contrast inputs that primarily drive adaptation in the Hebbian model (Bednar, 1997).

Tilt Aftereffects
James A. Bednar and Risto Miikkulainen

Figure 2: Architecture of the RF-LISSOM network. A small RF-LISSOM network and retina are shown, along with the connections to a single neuron (shown as a large circle). The input is an oriented gaussian activity pattern on the retinal ganglion cells (shown by grayscale coding); the LGN is bypassed for simplicity. The afferent connections form a local anatomical receptive field (RF) on the simulated retina. Neighboring neurons have different but highly overlapping RFs. Each neuron computes an initial response as a scalar product of the activation in its RF and its afferent weight vector. The responses then repeatedly propagate within the cortex through the lateral connections and evolve into activity bubbles. After the activity stabilizes, the weights of the active neurons are adapted.

The input to the model consists of two-dimensional ellipsoidal gaussian patterns representing retinal ganglion cell activations (as in Figures 2 and 3a). For training, the orientations of the gaussians are chosen randomly from the uniform distribution in the full range [0, π), and the positions are chosen randomly within the retina. The elongated spots approximate natural visual stimuli after the edge detection and enhancement mechanisms in the retina. They can also be seen as a model of the intrinsic retinal activity waves that occur in late prenatal development in mammals (Bednar & Miikkulainen, 1998; Shatz, 1990). The RF-LISSOM network models the self-organization of the visual cortex based on these natural sources of elongated features.

The afferent weights are initially set to random values, and the lateral weights are preset to a smooth gaussian profile. The connections are organized through an unsupervised learning process. At each training step, neurons start out with zero activity. The initial response η_ij of neuron (i, j) is calculated as a weighted sum of the retinal activations:
\eta_{ij} = \sigma\Big( \sum_{a,b} \xi_{ab}\, \mu_{ij,ab} \Big),    (2.1)
where ξ_ab is the activation of retinal ganglion cell (a, b) within the anatomical receptive field (RF) of the neuron, μ_ij,ab is the corresponding afferent weight, and σ is a piecewise linear approximation of the sigmoid activation function. The response evolves over a very short timescale through lateral interaction. At each time step, the neuron combines the above afferent activation Σξμ with lateral excitation and inhibition:

\eta_{ij}(t) = \sigma\Big( \sum \xi\mu + \gamma_e \sum_{k,l} E_{ij,kl}\, \eta_{kl}(t-1) - \gamma_i \sum_{k,l} I_{ij,kl}\, \eta_{kl}(t-1) \Big),    (2.2)
where E_ij,kl is the excitatory lateral connection weight on the connection from neuron (k, l) to neuron (i, j), I_ij,kl is the corresponding inhibitory connection weight, and η_kl(t−1) is the activity of neuron (k, l) during the previous time step. The scaling factors γ_e and γ_i determine the relative strengths of the excitatory and inhibitory lateral interactions. While the cortical response is settling, the retinal activity remains constant. The cortical activity pattern starts out diffuse and spread over a substantial part of the map (as in Figure 3c), but within a few iterations of equation 2.2, it converges into a small number of stable, focused patches of activity, or activity bubbles (see Figure 3d). After the activity has settled, the connection weights of each neuron are modified. Both afferent and lateral weights adapt according to the same mechanism: the Hebb rule, normalized so that the sum of the weights remains constant:

w_{ij,mn}(t + \delta t) = \frac{w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn}}{\sum_{mn} \big[ w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn} \big]},    (2.3)

where η_ij stands for the activity of neuron (i, j) in the final activity bubble,
w_ij,mn is the afferent or lateral connection weight (μ, E, or I), α is the learning rate for each type of connection (α_A for afferent weights, α_E for excitatory, and α_I for inhibitory), and X_mn is the presynaptic activity (ξ for afferent, η for lateral). The larger the product of the presynaptic and postsynaptic activity η_ij X_mn, the larger the weight change. Therefore, when the presynaptic and postsynaptic neurons fire together frequently, the connection becomes stronger. Both excitatory and inhibitory connections strengthen with correlated activity; normalization then redistributes the changes so that the sum of each weight type for each neuron remains constant.

At long distances, very few neurons have correlated activity, and therefore most long-range connections eventually become weak. The weak connections are eliminated periodically, resulting in patchy lateral connectivity similar to that observed in the visual cortex. The radius of the lateral excitatory interactions starts out large, but as self-organization progresses, it is decreased until it covers only the nearest neighbors. Such a decrease is one way of varying the balance between excitation and inhibition so that global topographic order develops while the receptive fields simultaneously become well tuned (Miikkulainen et al., 1997; Sirosh, 1995). Similar effects can be achieved by changing the scaling factors γ_e and γ_i over time, or by using connections whose sign depends on the activation level of the neuron (as in Stemmler, Usher, & Niebur, 1995).
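As a concrete illustration, the response settling (equations 2.1 and 2.2) and the normalized Hebbian update (equation 2.3) can be sketched in a few lines of NumPy. This is a simplified sketch rather than the authors' implementation; the function names, the example sigmoid thresholds, and the toy dimensions are our own assumptions:

```python
import numpy as np

def sigma(x, lower=0.1, upper=0.65):
    # Piecewise linear approximation of the sigmoid activation function:
    # 0 below `lower`, 1 above `upper`, linear in between (thresholds assumed).
    return np.clip((x - lower) / (upper - lower), 0.0, 1.0)

def settle(afferent_sum, E, I, gamma_e=0.9, gamma_i=0.9, steps=9):
    # afferent_sum: per-neuron weighted sum of retinal activations (eq. 2.1).
    # E, I: lateral excitatory/inhibitory weight matrices; entry [j, k] is
    # the weight of the connection from neuron k to neuron j.
    eta = sigma(afferent_sum)                  # initial response (eq. 2.1)
    for _ in range(steps):                     # recurrent settling (eq. 2.2)
        eta = sigma(afferent_sum + gamma_e * (E @ eta) - gamma_i * (I @ eta))
    return eta

def hebb_update(w, eta_post, x_pre, alpha):
    # Normalized Hebbian rule (eq. 2.3) for one neuron's weight vector of a
    # single connection type; the sum of the weights is held constant (= 1).
    raised = w + alpha * eta_post * x_pre
    return raised / raised.sum()
```

Because of the normalization in `hebb_update`, strengthening the connections between frequently co-active neurons necessarily weakens the same neuron's other connections of that type, which is the redistribution described above.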
Figure 3: Facing page. Orientation map activation. After being trained on inputs like the one in (a) with random positions and orientations, the RF-LISSOM network developed an orientation map whose central 96 × 96 region is shown in (b). As in (c, d), short white lines in (b) indicate the orientation preference of selected neurons; longer lines denote a stronger preference. The shading also gives information about the preference: dark shading denotes a preference close to vertical (0 degrees), and light shading denotes one close to horizontal (±90 degrees). The white outline shows the extent of the patchy self-organized lateral inhibitory connections of one neuron, marked with a white square, which has a vertical orientation preference. The strongest connections of each neuron extend along its preferred orientation and link columns with similar orientation preferences, avoiding those with orthogonal preferences. The shading in (c, d) shows the strength of each neuron's response to pattern (a). The initial response of the organized map is spatially broad and diffuse (c, top), like the input, and its cortical location at, above, and below the center of the cortex indicates that the input is vertically extended around the center of the retina. The response is patchy because the network is also encoding orientation, and the neurons that encode orientations far from the vertical do not respond (compare c to b). The histogram (c, bottom) sums up the orientation coding of the response. Each bin represents a range of 5 degrees, which is the precision to which the orientation map was measured. A wide range of neurons preferring orientations around 0 degrees are activated, but the average orientation is approximately 0 degrees (−0.6 degrees for this particular run). After the network settles through lateral interactions, the activation is much more focused, both spatially (d, top) and in representing orientation (d, bottom), but the spatial and orientation averages continue to match the position and orientation of the input, respectively. The average orientation of the settled response (−1.3 degrees here) is taken to be the perceived orientation for the TAE experiments. Color and animated versions of these figures can be seen online at http://www.cs.utexas.edu/users/nn/pages/research/selforg.html.
3 Experiments
The model consisted of an array of 192 × 192 neurons and a retina of 36 × 36 ganglion cells. The center of the anatomical RF of each neuron was placed at the location in the central 24 × 24 portion of the retina corresponding to the location of the neuron in the cortex, so that every neuron would have a complete set of afferent connections. The RF had a circular shape, consisting of random-strength connections to all ganglion cells within 6 units of the RF center. The cortex was self-organized for 20,000 iterations on oriented gaussian inputs with major and minor axes of half-width σ = 7.5 and 1.5, respectively.[2] The training took 2.5 hours on 16 processors of a Cray T3E supercomputer at the Texas Advanced Computing Center. The model requires 1.5 gigabytes of physical memory to represent the 400 million connections in this small section of the cortex.

3.1 Orientation Map Organization

In the self-organization process, the neurons developed oriented RFs organized into orientation columns very similar to those observed in the primary visual cortex (see Figure 3b). The strongest lateral connections of highly tuned cells link areas of similar orientation preference and avoid neurons with the orthogonal orientation preference. Furthermore, the connection patterns of highly oriented neurons are typically elongated along the direction in the map that corresponds to the neuron's preferred stimulus orientation (as subsequently found experimentally by Bosking, Zhang, Schofield, & Fitzpatrick, 1997). This organization reflects the activity correlations caused by the elongated gaussian input pattern: such a stimulus activates primarily those neurons that are tuned to the same orientation as the stimulus and located along its length (Sirosh et al., 1996). Since the long-range lateral connections are inhibitory, the net result is decorrelation: redundant activation is removed, resulting in a sparse representation of the novel features of each input (Barlow, 1990; Field, 1994;
[2] The initial lateral excitation radius was 19 and was gradually decreased to 1. The lateral inhibitory radius of each neuron was 47, and inhibitory connections whose strength was below 0.00005 were pruned away at 20,000 iterations. The lateral inhibitory connections were initialized to a gaussian profile with σ = 100, and the lateral excitatory connections to a gaussian with σ = 15, with no connections outside the nominal circular radius. The lateral excitation strength γ_e and inhibition strength γ_i were both 0.9. The learning rate α_A was gradually decreased from 0.007 to 0.0015, α_E from 0.002 to 0.001, and α_I was a constant 0.00025. The lower and upper thresholds of the sigmoid were increased from 0.1 to 0.24 and from 0.65 to 0.88, respectively. The number of iterations for which the lateral connections were allowed to settle at each training iteration was initially 9 and was increased to 13 over the course of training. These parameter settings were used by Sirosh (pers. comm.) to model development of the orientation map, and were not tuned or tweaked for the TAE simulations. Small variations produce roughly equivalent results (Sirosh, 1995).
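For reference, the parameter settings above can be collected into a single configuration sketch. This is a summary we assembled from Section 3 and footnote 2; the key names and the (start, end) tuple convention for gradually varied parameters are our own:

```python
# Simulation parameters summarized from Section 3 and footnote 2.
# (start, end) tuples denote values varied gradually over training;
# input_sigma gives the (major, minor) half-widths of the input gaussian.
PARAMS = {
    "cortex_size": 192,             # 192 x 192 neurons
    "retina_size": 36,              # 36 x 36 ganglion cells
    "rf_center_area": 24,           # RF centers within central 24 x 24 region
    "rf_radius": 6,                 # connect to ganglion cells within 6 units
    "training_iterations": 20000,
    "input_sigma": (7.5, 1.5),      # major, minor axes of the gaussian input
    "excitation_radius": (19, 1),   # gradually decreased
    "inhibition_radius": 47,
    "prune_threshold": 0.00005,     # weak inhibitory connections pruned
    "gamma_e": 0.9,                 # lateral excitation strength
    "gamma_i": 0.9,                 # lateral inhibition strength
    "alpha_A": (0.007, 0.0015),     # afferent learning rate, decreased
    "alpha_E": (0.002, 0.001),      # excitatory learning rate, decreased
    "alpha_I": 0.00025,             # inhibitory learning rate, constant
    "sigmoid_lower": (0.1, 0.24),   # increased over training
    "sigmoid_upper": (0.65, 0.88),  # increased over training
    "settling_steps": (9, 13),      # increased over training
}
```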
Sirosh et al., 1996). As a side effect, illusions and aftereffects may sometimes occur, as will be shown below.

3.2 Aftereffect Simulations

In psychophysical measurements of the TAE, a fixed stimulus is presented at a particular location on the retina. To simulate these conditions in the RF-LISSOM model, the position and angle of the inputs were fixed to a single value for a number of iterations. To permit more detailed analysis of behavior at short timescales, the learning rates were reduced from those used during self-organization, to α_A = α_E = α_I = 0.000005. All other parameters remained as in self-organization. To compare with the psychophysical experiments, it is necessary to determine what orientation the model "perceives" for any given input. Precisely how neural responses are interpreted for perception remains quite controversial (see Parker & Newsome, 1998, for review), but results in primates suggest that behavioral performance approaches the statistical optimum given the measured properties of cortical neurons (Geisler & Albrecht, 1997). Accordingly, we extracted the perceived orientation using a vector-sum procedure, which has been shown to be optimal under the conditions present in the model (Snippe, 1996). In this procedure, each active neuron is represented by a vector whose magnitude corresponds to its activation level and whose direction corresponds to its orientation preference. The perceived orientation is then measured as the direction of the vector sum over all neurons that responded to the input. In the model, the perceived orientation was found to match the absolute orientation of the input pattern to within a few degrees (Bednar, 1997). To determine the TAE in the model, the perceived orientation was first computed for inputs of all orientations. The model then adapted to a fixed input for 90 iterations, and the perceived orientation was again computed for all inputs.
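The vector-sum readout can be sketched as follows. This is an illustrative fragment rather than the original code; because orientation is 180-degree periodic, the sketch doubles the angles before summing and halves the angle of the resultant, a standard device for averaging orientations (the function and variable names are our own):

```python
import numpy as np

def perceived_orientation(prefs_deg, activations):
    # prefs_deg: preferred orientation of each responding neuron, in degrees.
    # activations: activity of each neuron (the vector magnitudes).
    # Orientation is 180-degree periodic, so double the angles, sum the
    # resulting vectors, and halve the angle of the resultant.
    theta = np.deg2rad(2.0 * np.asarray(prefs_deg, dtype=float))
    act = np.asarray(activations, dtype=float)
    x = np.sum(act * np.cos(theta))
    y = np.sum(act * np.sin(theta))
    return np.rad2deg(np.arctan2(y, x)) / 2.0
```

The TAE for a given test angle would then simply be the perceived orientation after adaptation minus the perceived orientation before it.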
The difference between the initial perceived angle and the one perceived after adaptation was taken as the magnitude of the TAE. Figure 4 plots these differences for the different angles. For comparison, it also shows the most detailed data available for the TAE in human foveal vision (Mitchell & Muir, 1976). The results from the RF-LISSOM simulation are strikingly similar to the psychophysical results. For the range 5 to 40 degrees, all subjects in the human study exhibited angle repulsion effects nearly identical to those found in the RF-LISSOM model; the data were most complete for the subject shown. The magnitude of this direct TAE increases very rapidly to a maximum angle repulsion at 8 to 10 degrees, falling off somewhat more gradually to zero as the angular separation increases. The simulations with larger angular separations (from 40 to 85 degrees) show a smaller angle attraction—the indirect effect. Although there is a greater intersubject variability in the psychophysical literature for the indirect effect than the direct
Figure 4: Tilt aftereffect at different angles. The open circles represent the TAE for a single human subject (DEM) from Mitchell and Muir (1976), averaged over 10 trials. For each angle in each trial, the subject adapted for 3 minutes on a sinusoidal grating of a given angle and then was tested for the effect on a horizontal grating. Error bars indicate ±1 standard error of measurement. The subject shown had the most complete data of the four in the study. All four showed very similar effects in the x-axis range ±40 degrees; the indirect TAE for the larger angles varied widely between ±2.5 degrees. The graph is roughly antisymmetric around 0 degrees, so the TAE is essentially the same in both directions relative to the adaptation line. For comparison, the heavy line shows the average magnitude of the TAE in the RF-LISSOM model over 10 trials with different adaptation angles. Error bars indicate ±1 standard error of measurement; in most cases they are too small to be visible, since the results were very consistent among runs. The network adapted to an oriented line at the center of the retina for 90 iterations; the TAE was then measured for test lines oriented at each angle. The duration of adaptation was chosen so that the magnitudes of the human data and the model match; this was the only parameter fit to the data. The result from the model closely resembles the curve for humans, showing both direct and indirect TAEs.
effect, those found for the RF-LISSOM model are well within the range seen for human subjects. In parallel with the angular changes in the TAE, its peak magnitude in humans increases logarithmically with adaptation time (Gibson & Radner, 1937), eventually saturating at a level that depends on the experimental protocol used (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986).
Figure 5: Direct TAE over time. The circles show the magnitude of the TAE as a function of adaptation time for human subjects MWG (unfilled circles) and SM (filled circles) from Greenlee and Magnussen (1987). They were the only subjects tested in the study. Each subject adapted to a single +12 degree line for the time period indicated on the horizontal axis (bottom). To estimate the magnitude of the aftereffect at each point, a vertical test line was presented at the same location, and the subject was requested to set a comparison line at another location to match it. The plots represent averages of five runs; the data for 0 to 10 minutes were collected separately from the rest. For comparison, the heavy line shows the average TAE in the RF-LISSOM model for a +12 degree test line over 9 trials (with parameters as in Figure 4). The horizontal axis (top) represents the number of iterations of adaptation, and the vertical axis represents the magnitude of the TAE at that time step. The RF-LISSOM results show a similar logarithmic increase in TAE magnitude with time, but do not show the saturation that is seen for the human subjects.
As the number of adaptation iterations is increased in the model, the magnitude of the TAE also increases logarithmically (see Figure 5). (The single curve that best matched the magnitude of the human data was shown in Figure 4, but the ones for different amounts of adaptation all had the same basic shape.) The time course of the TAE in the RF-LISSOM model is qualitatively similar to the human data, but it does not completely saturate over the adaptation amounts tested so far.
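The measurement loop behind this time course can be sketched generically. This is an illustrative protocol sketch rather than the original code; `perceive` and `adapt_step` are hypothetical callables standing in for the settling and Hebbian-update machinery of equations 2.1–2.3:

```python
def tae_time_course(perceive, adapt_step, test_angle, iterations=90):
    # perceive(angle): perceived orientation (degrees) for a test line,
    # e.g., computed by a vector-sum readout over the settled response.
    # adapt_step(): present the fixed adaptation line once and apply the
    # (reduced-rate) Hebbian weight update.
    baseline = perceive(test_angle)
    course = []
    for _ in range(iterations):
        adapt_step()
        course.append(perceive(test_angle) - baseline)  # TAE after this step
    return course
```

Plotting such a `course` against the iteration number would give the analogue of the model curve in Figure 5.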
3.3 How Does the TAE Arise in the Model?

The TAE seen in Figure 4 must result from changes in the connection strengths between neurons, since no other component of the model changes as adaptation progresses. Simulations performed with only one type of weight adapting (either afferent, lateral excitatory, or lateral inhibitory) show that the inhibitory weight changes determine the shape of the curve for all angles (Bednar, 1997). In what way do the changing inhibitory connections cause these effects? We will demonstrate this process in a simulation where only the inhibitory weights adapted, the adaptation period was longer, and a higher learning rate was used, all in order to exaggerate the effect and make its causes more clearly visible.

First, the connections change in a way that leaves the perceived orientation of the adaptation line unchanged. Figure 6a shows the initial response (I), the settled response (S), and the settled response after adaptation (A) for a vertical input (0 degrees, marked with a vertical line). The response after adaptation is more diffuse because the most active neurons have become inhibited (see equation 2.3). Throughout adaptation, the distribution of active orientation detectors remains centered around approximately the same angle, and a constant angle is perceived. The initial and settled responses to a test line with a slightly different orientation (e.g., 10 degrees, marked with a dotted line in Figure 6b) are again centered around that orientation (6bI and S). However, comparing the histograms of settled activity before and after adaptation (6bS and A), it is clear that fewer neurons close to 0 degrees respond after adaptation, while an increased number of those representing distant angles (over 10 degrees) do. This is because during adaptation, inhibition was strengthened primarily between neurons close to the 0 degree adaptation angle, and not between those that prefer larger orientations.
Such adaptation increases the ability of the map to detect small differences from the adaptation line. Before adaptation, the settled histograms for 0 degrees and 10 degrees are fairly similar, with averages differing by only 9.2 degrees in this simulation (6aS and 6bS). After adaptation, the histograms are very different, resulting in a 23 degree difference in perceived orientation (compare 6aA with 6bA). This adaptation is manifested at the psychological level as a shift of the perceived orientation away from the adaptation line, that is, the direct TAE. Meanwhile, the response to a very different test line (e.g., 50 degrees, the dotted line in Figure 6c) becomes broader and stronger after adaptation (compare 6cS and A). Adaptation occurred only in activated neurons, so neurons with orientation preferences greater than 50 degrees are unchanged. However, those with preferences less than 50 degrees actually now respond more strongly (6cS and A). The reason is that during adaptation, the inhibitory connections of these neurons with each other became stronger. Because of normalization (see equation 2.3), their connections to other neurons, those representing distant angles such as 50 degrees, became weaker. As a result, the 50 degree line now inhibits them less than before
Figure 6: Explanation of the TAE. This figure shows activation histograms similar to those in Figures 3c and 3d for several test lines. The histograms focus on the orientation-specific aspects of the cortical response by abstracting out the spatial content, and demonstrate how changes in the response cause the TAE. The top row of histograms (marked I, for Initial) shows the initial response to a vertical line (0 degrees), a 10 degree line, and a 50 degree line, each marked by a dotted line. In each case, the initial response is roughly centered around the orientation of the input line. The next row (marked S, for Settled) shows the settled response, which has been focused by the lateral connections but is still centered around the input orientation. The bottom row (marked A, for Adapted) shows the settled response to the same input after adapting to a vertical (0 degree) line, marked by a vertical line on the plots. To magnify and clarify the effect for explanatory purposes, only the inhibitory weights were modifiable in this simulation, their learning rate was increased to 0.00005, and the adaptation lasted for 256 iterations. (a) The settled response to the 0 degree test line broadens with adaptation, as the inhibition between the active units increases (aS→A). Since the response remains centered around 0 degrees, there is little change in the perceived orientation, and the TAE is close to 0 degrees (as can also be seen in Figure 4). (b) In contrast, a dramatic orientation shift is evident for the 10 degree test line: while the settled histogram before adaptation was centered around 7.9 degrees, after adaptation it is centered around 20.3 degrees (bS→A), inducing a direct effect of 12.4 degrees. The direct TAE is caused by the same changes that caused the broadening in (a): the activity around 0 degrees has decreased, while the activity at larger angles has increased. (c) The changes are more subtle for the indirect effect. For the 50 degree stimulus, only the neurons around 0 degrees in (cI) fall within the range of orientations initially activated by the adaptation line (aI), and thus those are the only ones that change their behavior between (cS) and (cA). During adaptation, their inhibition to and from neurons around 0 degrees was increased, and the weight normalization caused a corresponding decrease to other neurons, including those around 50 degrees. As a result, they are now less inhibited than before adaptation, and the average response shifts toward the adaptation angle (the indirect TAE). Color and animated demos of these examples can be seen online at http://www.cs.utexas.edu/users/nn/pages/research/selforg.html.
adaptation. Thus, they are more active, and the perceived orientation shifts toward 0 degrees, causing the indirect tilt aftereffect. The indirect effect is therefore true to its name: it is caused indirectly, by the strengthening of inhibitory connections during adaptation. This explanation of the indirect effect is novel and emerges automatically from the RF-LISSOM model. The model thus shows computationally how both the direct and indirect effects can be caused by the same activity-dependent adaptation process, and that it is the same process that drives the development of the map.

4 Discussion and Future Work
Our results suggest that the same self-organizing principles that result in sparse coding and reduce redundant activation in the visual cortex may also be operating over short time intervals in the adult, with quantifiable psychological consequences such as the TAE. This finding demonstrates a potentially important computational link between development, structure, and function. Although the RF-LISSOM model was not originally developed as an explanation for the TAE, it exhibits TAEs that have nearly all of the features of those measured in humans. The effect of varying the angular separation between the test and adaptation lines is similar to the human data, the time course is approximately logarithmic, and the TAE is localized to the retinal location that experienced the stimulus. With minor extensions, the model should account for other features of the TAE as well, such as higher variance at oblique orientations, frequency localization, movement direction specificity, and ocular transfer (Bednar, 1997). One difference, however, shows up in prolonged adaptation: the TAE in humans eventually saturates near 4 to 5 degrees (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986; Wolfe & O'Connell, 1986). The TAE in the RF-LISSOM model does not saturate quite as quickly (see Figure 5), and it also appears to saturate for different reasons. In the model, saturation occurs only after the inhibitory weights have been strengthened so much that the cortical response to the adaptation line is entirely suppressed. However, in humans, the adaptation line is still easily detectable even after saturation (Magnussen & Johnsen, 1986). Apparently the human visual system has additional constraints on how much adaptation can occur in the short term, and those constraints are currently not part of the model.
Another feature of the human TAE that does not directly follow from the model is the gradual recovery of accurate perception even in complete darkness (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986). With the purely Hebbian learning used in the model, only active units are adapted, and thus no weight changes will occur for blank inputs. One possible explanation for both saturation and dark recovery is that the inhibitory weights modified during tilt adaptation are a set of small, temporary weights adding
to or multiplying more permanent connections. Changes to these weights would be limited in magnitude, and the changes would gradually decay in the absence of visual stimulation. Such a mechanism was proposed by von der Malsburg (1987) as an explanation of visual object segmentation, and it was implemented in the RF-LISSOM model of segmentation by Choe and Miikkulainen (1998) and Miikkulainen et al. (1997). A variety of physical substrates for temporary modifications in synaptic efficacy have been found (reviewed in Zucker, 1989). Such effects could also be included in the TAE model, which would allow it to replicate the saturation and dark-recovery aspects of the human TAE. The main contribution of the RF-LISSOM model of the TAE is its novel explanation of the indirect aftereffect. Proponents of the lateral inhibitory theory of the TAE and tilt illusion have generally ignored indirect effects or postulated that they occur only at higher cortical levels (Wenderoth & Johnstone, 1988; van der Zwan & Wenderoth, 1995), partly because it has not been clear how they could arise through inhibition in V1. However, recent theoretical models have suggested that indirect effects could occur as early as V1 for simultaneous stimuli (Mundel, Dimitrov, & Cowan, 1997), and RF-LISSOM demonstrates that a quite simple, local mechanism in V1 is also sufficient to produce indirect aftereffects: if the total synaptic strength at each neuron is limited, bolstering the lateral inhibitory connections between active neurons eventually weakens their inactive inhibitory connections, causing the indirect aftereffect. Such weight normalization is computationally necessary: weights governed by a pure Hebbian rule would otherwise increase indefinitely or would have to reach a fixed maximum strength (Rochester, Holland, Haibt, & Duda, 1956; von der Malsburg, 1973; Miller & MacKay, 1994), and neither of these outcomes is biologically plausible.
Furthermore, very recent experimental results demonstrate that several processes of normalization are actually occurring in visual cortex neurons near the timescales at which the indirect effect is seen (Turrigiano, Abbott, & Marder, 1994; Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Turrigiano, 1999). Such regulation changes the efficacy of each synapse while keeping their relative weights constant (Turrigiano et al., 1998), as in the RF-LISSOM model. The current RF-LISSOM results predict that similar whole-cell regulation of total synaptic strength underlies the indirect tilt aftereffect in V1. The RF-LISSOM results may also help explain why there is relatively large intersubject variability in the shape and magnitude of the angular function of the indirect effect (Mitchell & Muir, 1976). Although the overall fit in Figure 4 is quite good, there is a small discrepancy at very large angles. This may be an interesting artifact of the visual environment of modern humans. As humans orient themselves with respect to man-made objects, such as books and pictures, they often see angles near 90 degrees, whereas the model was trained with only single lines. If the model were trained on more realistic visual inputs that included corners, crosses, and approximately orthogonal angles, it would develop more connections between neurons with orthogonal preferences. These connections would have an inhibitory effect, and they would decrease through normalization during adaptation. As a result, the model would show an increased indirect effect near ±90 degrees, matching the human data. The prediction is therefore that the indirect-effect differences between subjects arise from differences in their long-term visual experience, due to factors such as attention and different environments. Through mechanisms similar to those causing the TAE, the RF-LISSOM model should also be able to explain tilt illusions between two stimuli presented simultaneously. Such an explanation was originally proposed by Carpenter and Blakemore (1973), and the principles have been demonstrated recently in an abstract model of orientation (Mundel et al., 1997). Testing this hypothesis in RF-LISSOM for spatially separated stimuli would require a much larger inhibitory connection radius than was used in these experiments; such simulations are not yet practical because of computational constraints. Bayesian analysis may offer a way to extract the response to two simultaneously presented overlapping lines (Zemel, Dayan, & Pouget, 1998), which would allow the tilt illusion to be measured in the current model. The tilt illusion can also be tested with existing computing hardware by first combining the current orientation map simulation with an ocular dominance simulation (such as that of Sirosh & Miikkulainen, 1997). Perceived orientations can then be computed for inputs to the two eyes; if the tilt illusion occurs, the perceived orientation for an input in one eye will change when a different orientation is presented to the other eye (as found in humans; Carpenter & Blakemore, 1973). In addition, many similar phenomena, such as aftereffects of curvature, motion, spatial frequency, size, position, and color, have been documented in humans (Barlow, 1990).
Since specific detectors for most of these features have been found in the cortex, RF-LISSOM should be able to account for them by the same process of decorrelation mediated by self-organizing lateral connections.

5 Conclusion
The experiments reported here lend strong computational support to the theory that tilt aftereffects result from Hebbian adaptation of the lateral connections between neurons. Furthermore, the aftereffects occur as a result of the same decorrelating process that is responsible for the initial development of the orientation map. The same model should also apply to other aftereffects and to simultaneous tilt illusions. Computational models such as RF-LISSOM can demonstrate in high detail many visual phenomena that are difficult to measure experimentally, thus presenting a view of the cortex that is otherwise not available. This
type of analysis can provide an essential complement both to high-level theories and to experimental work with humans and animals, significantly contributing to our understanding of the cortex.
Acknowledgments
Thanks to Joseph Sirosh for supplying the original RF-LISSOM code. This research was supported in part by the National Science Foundation under grants IRI-9309273 and IIS-9811478. Computer time for the simulations was provided by the Texas Advanced Computing Center at the University of Texas at Austin and by the Pittsburgh Supercomputing Center under grant IRI-940004P. Software for the RF-LISSOM model and demos of the data presented here are available online at http://www.cs.utexas.edu/users/nn/.
References

Barlow, H. B. (1990). A theory about the functional role and synaptic mechanism of visual after-effects. In C. Blakemore (Ed.), Vision: Coding and efficiency (pp. 363–375). New York: Cambridge University Press.
Bednar, J. A. (1997). Tilt aftereffects in a self-organizing model of the primary visual cortex. Master's thesis, The University of Texas at Austin. Technical Report AI97-259.
Bednar, J. A., & Miikkulainen, R. (1998). Pattern-generator-driven development in self-organizing models. In J. M. Bower (Ed.), Computational neuroscience: Trends in research, 1998 (pp. 317–323). New York: Plenum.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6), 2112–2127.
Campbell, F. W., & Maffei, L. (1971). The tilt aftereffect: A fresh look. Vision Research, 11, 833–840.
Carpenter, R. H. S., & Blakemore, C. (1973). Interactions between orientations in human vision. Experimental Brain Research, 18, 287–303.
Choe, Y., & Miikkulainen, R. (1998). Self-organization and segmentation in a laterally connected orientation map of spiking neurons. Neurocomputing, 21, 139–157.
Coltheart, M. (1971). Visual feature-analyzers and aftereffects of tilt and curvature. Psychological Review, 78, 114–121.
Dong, D. W. (1995). Associative decorrelation dynamics: A theory of self-organization and optimization in feedback networks. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 925–932). Cambridge, MA: MIT Press.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Finlayson, P. G., & Cynader, M. S. (1995). Synaptic depression in visual cortex tissue slices: An in vitro model for cortical neuron adaptation. Experimental Brain Research, 106(1), 145–155.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Fregnac, Y. (1996). Dynamics of functional connectivity in visual cortical networks: An overview. Journal of Physiology (Paris), 90, 113–139.
Geisler, W. S., & Albrecht, D. G. (1997). Visual cortex neurons in monkeys and cats: Detection, discrimination, and identification. Visual Neuroscience, 14(5), 897–919.
Gibson, J. J., & Radner, M. (1937). Adaptation, after-effect and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20, 453–467.
Gilbert, C. D. (1998). Adult cortical dynamics. Physiological Reviews, 78(2), 467–485.
Greenlee, M. W., & Magnussen, S. (1987). Saturation of the tilt aftereffect. Vision Research, 27(6), 1041–1043.
Grinvald, A., Lieke, E. E., Frostig, R. D., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. Journal of Neuroscience, 14, 2545–2568.
Hata, Y., Tsumoto, T., Sato, H., Hagihara, K., & Tamura, H. (1993). Development of local horizontal interactions in cat visual cortex studied by cross-correlation analysis. Journal of Neurophysiology, 69, 40–56.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. Journal of Neuroscience, 11, 1800–1809.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195, 215–243.
Magnussen, S., & Johnsen, T. (1986). Temporal aspects of spatial adaptation: A study of the tilt aftereffect. Vision Research, 26(4), 661–672.
McLean, J., & Palmer, L. A. (1996). Contrast adaptation and excitatory amino acid receptors in cat striate cortex. Visual Neuroscience, 13(6), 1069–1087.
Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (1997). Self-organization, plasticity, and low-level visual phenomena in a laterally connected map model of the primary visual cortex. In R. L. Goldstone, P. G. Schyns, & D. L. Medin (Eds.), Psychology of learning and motivation, 36 (pp. 257–308). San Diego, CA: Academic Press.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Mitchell, D. E., & Muir, D. W. (1976). Does the tilt aftereffect occur in the oblique meridian? Vision Research, 16, 609–613.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annual Review of Neuroscience, 21, 227–277.
Rochester, N., Holland, J. H., Haibt, L. H., & Duda, W. L. (1956). Tests on a cell assembly theory of the action of the brain, using a large digital computer. Reprinted in J. A. Anderson & E. Rosenfeld (Eds.), 1988, Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
Shatz, C. J. (1990). Impulse activity and the patterning of connections during CNS development. Neuron, 5, 745–756.
Sirosh, J. (1995). A self-organizing neural network model of the primary visual cortex. Doctoral dissertation, The University of Texas at Austin. Technical Report AI95-237.
Sirosh, J., & Miikkulainen, R. (1994a). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78.
Sirosh, J., & Miikkulainen, R. (1994b). Modeling cortical plasticity based on adapting lateral interaction. In J. M. Bower (Ed.), The neurobiology of computation: The proceedings of the Third Annual Computation and Neural Systems Conference (pp. 305–310). Dordrecht: Kluwer.
Sirosh, J., & Miikkulainen, R. (1997). Topographic receptive fields and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neural Computation, 9, 577–594.
Sirosh, J., Miikkulainen, R., & Bednar, J. A. (1996). Self-organization of orientation maps, lateral connections, and dynamic receptive fields in the primary visual cortex. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral interactions in the cortex: Structure and function. Austin, TX: UTCS Neural Networks Research Group. Available online at: http://www.cs.utexas.edu/users/nn/web-pubs/htmlbook96.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8(3), 511–529.
Stemmler, M., Usher, M., & Niebur, E. (1995). Lateral interactions in primary visual cortex: A model bridging physiology and psychophysics. Science, 269, 1877–1880.
Tolhurst, D. J., & Thompson, P. G. (1975). Orientation illusions and aftereffects: Inhibition between channels. Vision Research, 15, 967–972.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neuroscience, 22, 221–227.
Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264, 974–977.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 845–846.
van der Zwan, R., & Wenderoth, P. (1995). Mechanisms of purely subjective contour tilt aftereffects. Vision Research, 35(18), 2547–2557.
Vidyasagar, T. R. (1990). Pattern adaptation in cat visual cortex is a co-operative phenomenon. Neuroscience, 36, 175–179.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100. Reprinted in J. A. Anderson & E. Rosenfeld (Eds.), 1988, Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
von der Malsburg, C. (1987). Synaptic plasticity as basis of brain organization. In J.-P. Changeux & M. Konishi (Eds.), The neural and molecular bases of learning (pp. 411–432). New York: Wiley.
Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. C. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron, 15, 541–552.
Wenderoth, P., & Johnstone, S. (1988). The different mechanisms of the direct and indirect tilt illusions. Vision Research, 28, 301–312.
Wolfe, J. M., & O'Connell, K. M. (1986). Fatigue and structural change: Two consequences of visual pattern adaptation. Investigative Ophthalmology and Visual Science, 27(4), 538–543.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430.
Zucker, R. S. (1989). Short-term synaptic plasticity. Annual Review of Neuroscience, 12, 13–31.

Received November 28, 1998; accepted July 13, 1999.
Errata
In the paper entitled "Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations" by Michele Giugliano, which appeared in Neural Computation 12, 903–931 (2000), the subscripts placed under the left-facing arrows used in the description of the computer implementation of a sample algorithm on pages 924, 926, 927, 928, and 929 (i.e., "x7", "x12", "x28", "x11") are typos; they carry no meaning related to the assignment statements indicated by the arrows themselves.
Neural Computation, Volume 12, Number 8
CONTENTS

Article
Neural Systems as Nonlinear Filters, by Wolfgang Maass and Eduardo D. Sontag (p. 1743)

Note
Analytical Model for the Effects of Learning on Spike Count Distributions, by Gianni Settanni and Alessandro Treves (p. 1773)

Letters
Calculation of Interspike Intervals for Integrate-and-Fire Neurons with Poisson Distribution of Synaptic Inputs, by A. N. Burkitt and G. M. Clark (p. 1789)
Statistical Procedures for Spatiotemporal Neuronal Data with Applications to Optical Recording of the Auditory Cortex, by O. François, L. Mohamed Abdallahi, J. Horikawa, I. Taniguchi, and T. Hervé (p. 1821)
Probabilistic Motion Estimation Based on Temporal Coherence, by Pierre-Yves Burgi, Alan L. Yuille, and Norberto M. Grzywacz (p. 1839)
Boosting Neural Networks, by Holger Schwenk and Yoshua Bengio (p. 1869)
Gradient-Based Optimization of Hyperparameters, by Yoshua Bengio (p. 1889)
A Signal-Flow-Graph Approach to On-line Gradient Calculation, by Paolo Campolucci, Aurelio Uncini, and Francesco Piazza (p. 1901)
Bootstrapping Neural Networks, by Jürgen Franke and Michael H. Neumann (p. 1929)
Information Geometry of Mean-Field Approximation, by Toshiyuki Tanaka (p. 1951)
Measuring the VC-Dimension Using Optimized Experimental Design, by Xuhui Shao, Vladimir Cherkassky, and William Li (p. 1969)
ARTICLE
Communicated by James Williamson
Neural Systems as Nonlinear Filters

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria

Eduardo D. Sontag
Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, U.S.A.

Experimental data show that biological synapses behave quite differently from the symbolic synapses in all common artificial neural network models. Biological synapses are dynamic; their "weight" changes on a short timescale by several hundred percent in dependence of the past input to the synapse. In this article we address the question of how this inherent synaptic dynamics (which should not be confused with long-term learning) affects the computational power of a neural network. In particular, we analyze computations on temporal and spatiotemporal patterns, and we give a complete mathematical characterization of all filters that can be approximated by feedforward neural networks with dynamic synapses. It turns out that even with just a single hidden layer, such networks can approximate a very rich class of nonlinear filters: all filters that can be characterized by Volterra series. This result is robust with regard to various changes in the model for synaptic dynamics. Our characterization result provides, for all nonlinear filters that are approximable by Volterra series, a new complexity hierarchy related to the cost of implementing such filters in neural systems.

1 Introduction
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1743–1772 (2000)

Synapses in common artificial neural network models are static: the value wi of a synaptic weight is assumed to change only during "learning." In contrast to that, the "weight" wi(t) of a biological synapse at time t is known to be strongly dependent on the inputs xi(t − τ) that this synapse has received from the presynaptic neuron i at previous time steps t − τ. Varela et al. (1997) have shown that a model of the form

wi(t) = wi · D(t) · (1 + F(t))    (1.1)

with a constant wi, a depression term D(t) with values in (0, 1], and a facilitation term F(t) ≥ 0 can be fitted remarkably well to experimental data for synaptic dynamics. The facilitation term F(t) is usually modeled as a linear
filter with exponential decay: if xi(t − τ) is the output of the presynaptic neuron (typically modeled by a sum of δ-functions), then the current value of this facilitation term is of the form

F(t) = r ∫_0^∞ xi(t − τ) · e^(−τ/c) dτ    (1.2)
for certain parameters r, c > 0 that vary from synapse to synapse. A few other models have been proposed for synaptic dynamics (see, e.g., Dobrunz & Stevens, 1997; Murthy, Sejnowski, & Stevens, 1997; Tsodyks, Pawelzik, & Markram, 1998; Koch, 1999; Maass & Zador, 1998, 1999) that are all quite similar. Closely related models had already been proposed and investigated in Grossberg (1969, 1972, 1984) and Francis, Grossberg, and Mingolla (1994). Our analysis in this article is primarily based on the model of Varela et al. (1997). However, we will prove that our results also hold for the somewhat more complex model for synaptic dynamics in a mean-field context of Tsodyks et al. (1998). We show that such inherent synaptic dynamics empowers neural networks with a remarkable capability for carrying out computations on temporal patterns (i.e., time series) and spatiotemporal patterns. This computational mode, where inputs and outputs consist of temporal or spatiotemporal patterns rather than static vectors of numbers, appears to provide a more adequate framework for analyzing computations in biological neural systems. Furthermore, their capability for processing temporal and spatiotemporal patterns in a very efficient manner may be linked to their superior capabilities for real-time processing of sensory input; hence, our analysis may provide new ideas for designing artificial neural systems with similar capabilities. We consider not just computations of neural systems with a single temporal pattern as input, but also characterize their computational power for the case where several different temporal patterns u1(t), ..., un(t) are presented in parallel as input to the neural system.
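As an illustration of equations 1.1 and 1.2, the facilitation trace F(t) can be simulated in discrete time as a leaky integrator: each presynaptic spike (a δ-function input) increments F by r, and F decays with time constant c between spikes. The sketch below is our own minimal illustration (function names and the depression-free special case D(t) = 1 are our choices, not the authors'):

```python
import math

def facilitation_trace(spike_times, r, c, t_end, dt=0.1):
    """Discretized F(t) = r * integral_0^inf x(t - tau) exp(-tau/c) dtau
    for a spike-train input x modeled as a sum of delta functions:
    each spike increments F by r, and F decays with time constant c."""
    steps = int(round(t_end / dt))
    F = [0.0] * steps
    spikes = set(int(round(s / dt)) for s in spike_times)
    for n in range(1, steps):
        F[n] = F[n - 1] * math.exp(-dt / c)  # exponential decay
        if n in spikes:
            F[n] += r                        # spike-driven increment
    return F

def effective_weight(w, F):
    """Effective weight under facilitation alone: w_i(t) = w_i * (1 + F(t))."""
    return [w * (1.0 + f) for f in F]
```

With a single spike at t = 5 ms, r = 0.5, and c = 10 ms, the trace jumps by 0.5 at the spike and then relaxes back toward zero, so the effective weight transiently grows by 50 percent.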
Hence we also provide a complete characterization of the computational power of feedforward neural systems for the case where salient information is encoded in temporal correlations of firing activity in different pools of neurons (represented by correlations among the corresponding continuous functions u1(t), ..., un(t)). Therefore, various informal suggestions for computational uses of such a code can be placed on a rigorous mathematical foundation. It is easy to see that a large variety of computational operations that respond in a particular manner to correlations in temporal input patterns define time-invariant filters with fading memory; hence they can in principle be implemented on each of the various kinds of dynamic networks considered in this article. Previous standard models for computations on temporal patterns in artificial neural networks are time-delay neural networks (where temporal structure is transformed into spatial structure) and recurrent neural networks, both being based on standard "static" synapses (Hertz, Krogh, &
Palmer, 1991). Such a transformation makes it impossible to let "time represent itself" (Mead, 1989) in subsequent computations, which tends to result in a loss of computational efficiency. The results of this article suggest that feedforward neural networks with simple dynamic synapses provide an attractive alternative. Various questions regarding artificial neural networks with more general recurrent structure, in which the time-series character of the data plays a central role, were answered within the framework of computational learning theory in Dasgupta and Sontag (1996; hard-threshold filters with a discrete timescale), Koiran and Sontag (1998; discrete-time recurrent networks), and Sontag (1998; continuous-time recurrent networks). Sontag (1997) summarizes some of the approximation capabilities and other properties of these classes of recurrent networks. In section 2, we introduce the formal notion of a dynamic network, which combines biologically realistic synaptic dynamics according to Varela et al. (1997) with standard sigmoidal neurons (modeling firing activity in a population of neurons), and we review some basic concepts regarding filters. In section 3 we characterize the computational power of feedforward dynamic networks for computations on temporal patterns (i.e., functions of time), and we show that our result can be extended to the model of Tsodyks et al. (1998) for synaptic dynamics. The formal proofs of the characterization results in this article rely on standard techniques from mathematical analysis. In section 4 we extend our investigation to computations on spatiotemporal patterns. Section 5 discusses some conclusions.

2 Basic Concepts
In contrast to the static output of gates in feedforward artificial neural networks, the output of biological neurons consists of action potentials ("spikes"): stereotyped events that mark certain points in time. These spikes are transmitted by synapses to other neurons, where they cause changes in the membrane potential that affect the times when these other neurons fire and thereby emit a spike. We will focus in this article on the implications of one type of temporal dynamics provided by the components of such neural computations: the inherent temporal dynamics of synapses. The empirical data of Varela et al. (1997) describe the amplitudes of excitatory postsynaptic currents (EPSCs) in a neuron in response to a spike train from a presynaptic neuron. These two neurons are likely to be connected by multiple synapses, and the resulting EPSC amplitude can be understood as a population response of these multiple synapses. Therefore it is justified to employ a deterministic model for synaptic dynamics in spite of the stochastic nature of synaptic transmission at a single release site (Dobrunz & Stevens, 1997). The EPSC amplitude in response to a spike is modeled in Varela et al. (1997) by terms of the form w · (1 + F) and w · D · (1 + F), where F
is a linear filter with impulse response r · e^(−τ/c) modeling facilitation, and D is some nonlinear filter modeling depression at synapses. In some versions of the model considered in Varela et al. (1997), this filter D consists of several depression terms. However, it only assumes values > 0, and it is always time invariant and has fading memory. We analyze the impact of this synaptic dynamics in the context of common models for computations in populations of neurons, where one can ignore the stochastic aspects of computation in individual neurons in favor of the deterministic response of pools of neurons that receive similar input ("population coding" or "space rate coding"; see Georgopoulos, Schwartz, & Kettner, 1986; Abbott, 1994; Gerstner, 1999). More precisely, our subsequent neural network model is based on a mean-field analysis of networks of biological neurons, where pools P of neurons serve as computational units, whose time-varying firing activity (measured as the number of neurons in P that fire during a short time interval [t, t + Δ]) is represented by a continuous bounded function y(t). In case pool P receives inputs from m other pools of neurons P1, ..., Pm, we assume that

y(t) = s( Σ_{i=1}^m wi(t) xi(t) + w0 ),

where xi(t) represents the time-varying firing activity in pool Pi and wi(t) represents the time-varying average "weight" of the synapses from neurons in pool Pi to neurons in pool P.¹ In the context of neural computation with population coding considered in this article, we have to expand the model of Varela et al. (1997) to populations of synapses that connect two pools of neurons, where presynaptic activity is described not by spike trains but by continuous functions xi(t) ranging over some bounded interval [B0, B1] with 0 < B0 < B1.
Therefore, we generalize their model for the dynamics of synapses from a nonlinear filter applied to a sequence of δ-functions (i.e., to a spike train) to a corresponding nonlinear filter applied to a continuous input function xi(t).² Thus if xi(t) is a continuous function describing the firing

1 The function s: R → R is some activation function, for example, s(x) = 1/(1 + e^(−x)). For the following, it suffices to assume that s is continuous and not a polynomial. In sections 3.2 and 3.3 we have to assume in addition that s assumes nonnegative values only. We refer to Maass and Natschläger (in press) for theoretical arguments and computer simulations that support the use of a sigmoidal activation function in this context.
2 So far, no empirical data are available for the temporal dynamics of a population of synapses (connecting two pools of neurons in a feedforward direction) in dependence of the pool activity of the presynaptic pool of neurons. It is not completely unproblematic to assume that synaptic dynamics can be modeled on the level of pool activity in the same way as for spiking neurons, although this is commonly done. The exact formula for the firing activity y(t) in the postsynaptic pool P of neurons requires multiplying, for each presynaptic pool Pi of neurons, the vector of spike activity of the individual neurons ν_{i,k} in pool Pi with the matrix of current synaptic coupling strengths w_{i,k,j}(t) for neurons ν_j in pool P. The resulting firing activity y(t) of pool P is the average of the current firing activities of the neurons ν_j in pool P. In our mean-field model, we assume that this average over j can be expressed in terms of products of the average wi(t) of the synaptic weights w_{i,k,j}(t) over j and k with the average firing activity xi(t) in the presynaptic pool Pi. In particular, this mean-field model ignores that the value of w_{i,k,j}(t) will in general
activity in the ith presynaptic pool Pi of neurons, we model the size of the resulting synaptic input to a subsequent pool P of neurons by terms of the form wi(t) · xi(t), with wi(t) := wi · (1 + F xi(t)) or wi(t) := wi · D xi(t) · (1 + F xi(t)), where the filters F and D are defined as in Varela et al. (1997). The first equation, which models facilitation, gives rise to the definition of the class DN of dynamic networks in definition 1, and the second equation, which models the more common co-occurrence of facilitation and depression, gives rise to the definition of the class DN*.

Definition 1. We define the class DN of dynamic networks (see Figure 1) as the class of arbitrary feedforward networks consisting of sigmoidal gates that map input functions x1(t), ..., xm(t) to a function

y(t) = s( Σ_{i=1}^m wi(t) xi(t) + w0 ),

with

wi(t) = wi · (1 + r ∫_0^∞ xi(t − τ) e^(−τ/c) dτ)

for parameters wi ∈ R and r, c > 0. Here s is some "activation function" from R into R, for example, the logistic sigmoid function defined by s(x) = 1/(1 + e^(−x)). We will assume in the following only that s is continuous and not a polynomial.³ The slightly different class DN* is defined in the same way, except that wi(t) is of the form

wi(t) = wi · D xi(t) · (1 + r ∫_0^∞ xi(t − τ) e^(−τ/c) dτ),

where D is some arbitrary given time-invariant fading-memory filter⁴ with values D xi(t) ∈ (0, 1].⁵

depend on the specific firing history of the specific presynaptic neuron ν_{i,k}. We refer to Tsodyks et al. (1998) for a detailed mathematical analysis of this problem. It is shown in that article, through computer simulations and theoretical arguments, that for the slightly different model for synaptic dynamics considered there, the error resulting from generalizing the model from presynaptic individual neurons to presynaptic pools is benign. We will discuss the model from Tsodyks et al. (1998) in sections 3.2 and 3.3, and we will show in theorems 2 and 3 that our results can be extended to their model.
3 According to Leshno, Lin, Pinkus, and Schocken (1993), the subsequent theorems would hold under even weaker conditions on s.
4 See the remainder of this section for a review of these notions.
5 This filter D models synaptic depression and can, for example, be defined as in
Figure 1: A dynamic network with one hidden layer consisting of two hidden neurons G1 and G2. The synapse from the ith input to G1 computes the filter ui(·) ↦ wi(·) · ui(·); the synapse from the ith input to G2 computes the filter ui(·) ↦ w̃i(·) · ui(·). The output of the network is of the form z(t) = a1 · s(Σ_{i=1}^n wi(t) · ui(t) + w0) + a2 · s(Σ_{i=1}^n w̃i(t) · ui(t) + w̃0) + a0 with a0, a1, a2 ∈ R. Thus the network computes a filter that maps the input functions u1(·), ..., un(·) onto the output function z(·).
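In discrete time, a one-hidden-layer network of this kind can be sketched as follows. This is our own minimal illustration of definition 1 (the function names, the Euler-style update of the facilitation integral, and all parameter values are our choices, not the authors'); each synapse keeps a facilitation trace approximating r ∫ xi(t − τ) e^(−τ/c) dτ, so its effective weight is wi · (1 + F(t)):

```python
import math

def sigma(x):
    """Logistic activation s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_network(inputs, w, w0, a, a0, r, c, dt=0.5):
    """Forward pass of a one-hidden-layer dynamic network (class DN).

    inputs: list of n input time series u_i (equal-length lists)
    w:      hidden-unit weights, w[j][i] for hidden unit j and input i
    w0, a, a0: hidden biases, output weights, and output bias
    Returns the output series z(t) = sum_j a[j] * y_j(t) + a0.
    """
    n_hidden, n_in = len(w), len(inputs)
    T = len(inputs[0])
    F = [[0.0] * n_in for _ in range(n_hidden)]  # facilitation traces
    decay = math.exp(-dt / c)
    z = []
    for t in range(T):
        out = a0
        for j in range(n_hidden):
            s = w0[j]
            for i in range(n_in):
                # leaky integration of presynaptic activity (facilitation)
                F[j][i] = F[j][i] * decay + r * inputs[i][t] * dt
                s += w[j][i] * (1.0 + F[j][i]) * inputs[i][t]
            out += a[j] * sigma(s)
        z.append(out)
    return z
```

With r = 0 the traces stay at zero and the network reduces to a static sigmoidal network; with r > 0 a constant input produces a gradually rising output as facilitation builds up.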
Thus dynamic networks in DN or DN* are simply feedforward neural networks consisting of sigmoidal neurons, where static weights wi are replaced by biologically realistic history-dependent functions wi(t). The input to a dynamic network consists of an arbitrary vector of functions u1(·), ..., un(·). The output of a dynamic network is defined as a weighted sum

z(t) = Σ_{i=1}^k ai yi(t) + a0

of the time-varying outputs y1(t), ..., yk(t) of certain sigmoidal neurons in the network, where the "weights" a0, ..., ak can be assumed to be static. Thus a dynamic network with n inputs maps n input functions u1(·), ..., un(·) onto some output function z(·).⁶

Varela et al. (1997). Our subsequent results are independent of the specific definition of D.
6 In principle, one is also interested in a more general type of operators that map vectors u of real-valued functions onto vectors y of m real-valued functions, where m is larger than 1. However, in order to answer the questions addressed in this article for the case m > 1, it suffices to focus on the case m = 1. The reason is that operators that output vectors of m real-valued functions can be viewed as vectors of m operators that each output one real-valued function. In this way, our results for the case m = 1 will imply a complete characterization of all operators that can be approximated by a more generalized type of
A somewhat related network model has been investigated in Back and Tsoi (1991). They exhibited a learning algorithm for this model, but no characterization of the computational power of such networks was given there. Temporal patterns are modeled in mathematics as functions of time. Hence networks that operate on temporal patterns map functions of time onto functions of time. We will refer to such maps from functions to functions (or from vectors of functions to functions) as filters (in mathematics, they are usually referred to as operators). We will reserve the letters F, H, S for filters, and we write F u for the function resulting from an application of the filter F to a vector u of functions. Notice that when we write F u(t), we mean (F u)(t), that is, the function F u evaluated at time t. We write C(A, B) for the class of all continuous functions f: A → B. We will consider suitable subclasses U ⊆ C(A, B) for A ⊆ R^k and B ⊆ R, and study filters that map U^n into R^R (where R^R is the class of all functions from R into R), that is, filters that map n functions u1(·), ..., un(·) onto another function z(·). In this section and in section 3, we will focus on the case k = 1, where the input functions u1(·), ..., un(·) are functions of a single variable, which we will interpret as time. The case k > 1 will be considered in section 4. A trivial special case of a filter is the shifting filter S_{t0} with S_{t0} u(t) = u(t − t0). An arbitrary filter F: U^n → R^R is called time invariant if a shift of the input functions by a constant t0 causes a shift of the output function by the same constant t0, that is, if for any t0 ∈ R and any u = ⟨u1, ..., un⟩ ∈ U^n, one has F u_{t0}(t) = F u(t − t0), where u_{t0} = ⟨S_{t0} u1, ..., S_{t0} un⟩. All filters considered in this article will be time invariant. Note that if U is closed under S_{t0} for all t0 ∈ R, then a time-invariant filter F: U^n → R^R is fully characterized by the values F u(0) for u ∈ U^n.
Another essential property of filters considered in this article is fading memory. If a filter F has fading memory, then the value of F v(0) can be approximated arbitrarily closely by the value of F u(0) for functions u that approximate the functions v over sufficiently long bounded intervals [−T, 0]. The formal definition is as follows:

Definition 2. We say that a filter F: U^n → R^R has fading memory if for every v = ⟨v1, ..., vn⟩ ∈ U^n and every ε > 0 there exist δ > 0 and T > 0 so that |F v(0) − F u(0)| < ε for all u = ⟨u1, ..., un⟩ ∈ U^n with the property that ‖v(t) − u(t)‖ < δ for all t ∈ [−T, 0].⁷
dynamic networks that output m real-valued functions instead of just one.
7 We will reserve ‖·‖ for the max-norm on R^n; that is, for x = ⟨x1, ..., xn⟩ ∈ R^n we write ‖x‖ for max{|xi|: i = 1, ..., n}.
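Definition 2 can be checked numerically for a concrete filter. The sketch below (our own illustration; the exponential kernel, the truncation horizon, and the tolerances are arbitrary choices, not from the paper) evaluates a linear filter with exponentially fading kernel at t = 0 and shows that an input u agreeing with v only on a window [−T, 0] already pins down F v(0) up to an error of order c · e^(−T/c), which is exactly the fading-memory property:

```python
import math

def exp_filter_at_zero(u, c, horizon=200.0, dt=0.01):
    """Numerically evaluate F u(0) = integral_0^horizon u(-tau) e^(-tau/c) dtau,
    a linear time-invariant filter whose kernel fades exponentially."""
    total, tau = 0.0, 0.0
    while tau < horizon:
        total += u(-tau) * math.exp(-tau / c) * dt
        tau += dt
    return total

c = 5.0
v = lambda t: 1.0                          # reference input
u = lambda t: 1.0 if t >= -20.0 else 0.0   # agrees with v only on [-20, 0]
err = abs(exp_filter_at_zero(v, c) - exp_filter_at_zero(u, c))
# err is of order c * exp(-T/c) with T = 20, i.e. roughly 0.09 here
```

Increasing the agreement window T shrinks err exponentially, so any ε in definition 2 can be met by choosing T large enough.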
Remark 1. Interesting examples of linear and nonlinear filters F: U → R^R can be generated with the help of representations of the form

F u(t) = ∫_0^∞ ... ∫_0^∞ u(t − τ1) · ... · u(t − τk) h(τ1, ..., τk) dτ1 ... dτk

for measurable and essentially bounded functions u: R → R. We will always assume in this article that h ∈ L¹. One refers to such an integral as a Volterra term of order k. Note that for k = 1, it yields the usual representation for a linear time-invariant filter. The class of filters that can be represented by Volterra series, that is, by finite or infinite sums of Volterra terms of arbitrary order, has been investigated for quite some time in neurobiology and engineering (see Palm & Poggio, 1977; Palm, 1978; Marmarelis & Marmarelis, 1978; Schetzen, 1980; Poggio & Reichardt, 1980; Rugh, 1981; Rieke, Warland, Bialek, & de Ruyter van Steveninck, 1997). It is obvious that any filter F that can be represented by a sum of finitely many Volterra terms of any order (i.e., by a Volterra polynomial or finite Volterra series) is time invariant and has fading memory. This holds for any class U of uniformly bounded input functions u. According to the subsequent lemma 1, both of these properties are inherited by filters F that can be approximated by some arbitrary infinite sequence of such filters. This implies that any filter that can be approximated by finite or infinite Volterra series (which converge in the sense used here) is time invariant and has fading memory (over any class U of uniformly bounded functions u). Boyd and Chua (1985) have shown that under some additional assumptions about U (for example, the assumptions in theorem 1 below), the converse also holds: any time-invariant filter F: U → R^R with fading memory can be approximated arbitrarily closely by Volterra polynomials.

Remark 2.
1. It is easy to see that for classes U of functions that are uniformly bounded (i.e., U µ C( A, B ) for some bounded set B µ R) our denition of fading memory agrees with that considered in Boyd and Chua (1985). All classes U considered in this article are uniformly bounded. 2. It is obvious that any time-invariant lter F that has fading memory is causal; u (t ) D v (t ) for all t · t0 implies that F u (t0 ) D F v (t0 ) for all t0 2 R . 3. All dynamic synapses considered in this article are modeled as lters that map an input function xi (¢) onto an output function wi (¢) ¢ xi (¢). Furthermore, all of these lters turn out to be time invariant with fading memory. This has the consequence that all models for dynamic networks considered in this article compute time-invariant lters with fading memory.
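As a concrete illustration of a Volterra term, the k-fold integral above can be approximated on a uniform grid. The following sketch (all names are our own, not from the text) evaluates an order-1 term with an exponential kernel and an order-2 term with a product kernel:

```python
import numpy as np

def volterra_term(u, h, dt):
    """Discretized Volterra term of order k = h.ndim.

    u  : samples u(t - j*dt), j = 0..M-1 (most recent sample first)
    h  : k-dimensional kernel array of shape (M,)*k, assumed to be in L1
    dt : grid spacing
    Approximates the k-fold integral of u(t-t1)...u(t-tk) h(t1,...,tk).
    """
    k = h.ndim
    prod = u
    for _ in range(k - 1):          # outer product u(t-t1)*...*u(t-tk)
        prod = np.multiply.outer(prod, u)
    return float(np.sum(prod * h) * dt ** k)

dt = 0.01
tau = np.arange(0, 5, dt)
u = np.exp(-tau)                    # some past input segment
h1 = np.exp(-tau / 0.5)             # order-1 (linear filter) kernel
y1 = volterra_term(u, h1, dt)

h2 = np.outer(h1, h1)               # order-2 product kernel h(t1,t2) = h1(t1)*h1(t2)
y2 = volterra_term(u, h2, dt)
assert abs(y2 - y1 ** 2) < 1e-9     # a product kernel factorizes
```

For a product kernel the order-2 term is just the square of the order-1 term, which the final assertion verifies; a general kernel h(τ1, τ2) of course does not factorize.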
Neural Systems as Nonlinear Filters
1751
4. If one considers recurrent versions of such networks, then in the absence of noise, such networks can theoretically also compute filters without fading memory. Consider, for example, some filter F with F u(0) = 0 if u(t) = 0 for all t ≤ 0 and F u(0) = 1 if there exists some t0 ≤ 0 so that u(t0) ≥ 1. It is obvious that such a filter does not have fading memory. But a network where some "self-exciting" recurrent subcircuit is turned on (and stays on permanently) whenever the input u reaches a value ≥ 1 for some t0 ∈ R can compute such a filter. Alternatively, a feedforward network can also compute a non-fading-memory filter if any of its components (synapses or neurons) have some permanent memory feature.

5. A special case of time-invariant filters F with fading memory are those defined by F u(0) = f(u(0)) for arbitrary continuous functions f : R^n → R. Therefore the universal approximation theorem for filters that follows from our subsequent theorem 1 contains as a special case the familiar universal approximation theorem for functions from Hornik, Stinchcombe, and White (1989).

6. It is obvious that a filter F on U^n has fading memory if and only if the functional F̃ : U^n → R defined by F̃u := F u(0) is continuous on U^n with regard to the topology T generated by the neighborhoods {u ∈ U^n : ‖v(t) − u(t)‖ < δ for all t ∈ [−T, 0]} for arbitrary v ∈ U^n and δ, T > 0.
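The fading-memory property can be checked numerically in the simplest case, an order-1 Volterra term with an exponential kernel. In this sketch (our own naming), two inputs that agree on the recent window [−10, 0] but differ arbitrarily before it yield outputs at time 0 that agree up to a tolerance fixed in advance:

```python
import numpy as np

def linear_filter(u_past, c, dt):
    """F u(0) = integral over tau >= 0 of u(-tau) * exp(-tau/c)  (order-1 Volterra term)."""
    tau = np.arange(len(u_past)) * dt
    return np.sum(u_past * np.exp(-tau / c)) * dt

dt, c = 0.01, 0.5
tau = np.arange(0, 50, dt)
u_past = np.sin(tau) + 1.0          # u(-tau) for tau in [0, 50)
v_past = u_past.copy()
v_past[tau > 10.0] = 0.0            # v agrees with u only on the window [-10, 0]
# fading memory: agreement on a sufficiently long recent window
# forces the outputs at time 0 to be close
assert abs(linear_filter(u_past, c, dt) - linear_filter(v_past, c, dt)) < 1e-6
```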
Lemma 1. Assume that U is closed under S_{t0} for all t0 ∈ R and a sequence (F_n)_{n∈N} of filters converges to a filter F in the sense that for every ε > 0 there exists an n0 ∈ N so that |F_n u(t) − F u(t)| < ε for all n ≥ n0, u ∈ U^n, and t ∈ R. Then the following holds:
1. If all the filters F_n are time invariant, then F is time invariant.

2. If all the filters F_n have fading memory, then F has fading memory.
Proof. The first claim follows immediately from the fact that F u(t) = lim_{n→∞} F_n u(t) for all u ∈ U^n and t ∈ R. In order to prove the second claim, we can assume that some ε > 0 and some v ∈ U^n have been given. We fix some n0 ∈ N so that |F_{n0} u(t) − F u(t)| < ε/3 for all u ∈ U^n, t ∈ R. Since F_{n0} has fading memory, there exist some T > 0 and some δ > 0 so that |F_{n0} u(0) − F_{n0} v(0)| < ε/3 for all u ∈ U^n with the property that ‖u(t) − v(t)‖ < δ for all t ∈ [−T, 0]. By our choice of n0 this implies that |F u(0) − F v(0)| < ε for all u ∈ U^n with ‖u(t) − v(t)‖ < δ for all t ∈ [−T, 0]. Hence F has fading memory.
3 Computations on Temporal Patterns

3.1 Characterizing the Computational Power of Neural Networks with Dynamic Synapses. Our subsequent theorem 1 shows that simple filters
that model only synaptic facilitation (as considered in the definition of DN) provide the networks with sufficient dynamics to approximate arbitrary given time-invariant filters with fading memory. We show that the simultaneous occurrence of depression (as in DN*) is not needed for that, but it also does not hurt. This appears to be of some interest for the analysis of computations in biological neural systems, since a fairly large variety of different functional roles have already been proposed for synaptic depression: explaining psychological data on conditioning and reinforcement (Grossberg, 1972), boundary formation in vision and visual persistence (Francis et al., 1994), switching between different neural codes (Tsodyks & Markram, 1997), and automatic gain control (Abbott, Varela, Sen, & Nelson, 1997). As a complement to these conjectured roles for synaptic depression, our subsequent theorem 1 points to a possible functional role for synaptic facilitation: it empowers even very shallow feedforward neural systems with the capability to approximate basically any linear or nonlinear filter that appears to be of interest in a biological context. Furthermore, we show that this possible functional role for facilitation can coexist with independent other functional roles for synaptic depression. Our result shows that one can first choose the parameters that control synaptic depression to serve some other purpose and can then still choose the parameters that control synaptic facilitation, so that the resulting neural system can approximate any given time-invariant filter with fading memory.⁸

Theorem 1. Assume that U is the class of functions from R into [B0, B1] that satisfy |u(t) − u(s)| ≤ B2 · |t − s| for all t, s ∈ R, where B0, B1, B2 are arbitrary real-valued constants with 0 < B0 < B1 and 0 < B2. Let F be an arbitrary filter that maps vectors u = ⟨u1, . . . , un⟩ ∈ U^n into functions from R into R. Then the following are equivalent:⁹
(a) F can be approximated by dynamic networks S ∈ DN (i.e., for any ε > 0 there exists some S ∈ DN such that |F u(t) − S u(t)| < ε for all u ∈ U^n and all t ∈ R).

(b) F can be approximated by dynamic networks S ∈ DN with just a single layer of sigmoidal neurons.
8 We show in section 3.3 that alternatively one can employ just depressing synapses for approximating any such filter by a neural system.
9 The implication "(c) ⇒ (d)" was already shown in Boyd and Chua (1985).
(c) F is time invariant and has fading memory.

(d) F can be approximated by a sequence of (finite or infinite) Volterra series.

These equivalences remain valid if DN is replaced by DN*.

The following result stems from the proof of theorem 1. It shows that the class of filters that can be approximated by dynamic networks is very stable with regard to changes in the definition of a dynamic network.

Corollary 1. Dynamic networks with just one layer of dynamic synapses and one subsequent layer of sigmoidal gates can approximate the same class of filters as dynamic networks with an arbitrary finite number of layers of dynamic synapses and sigmoidal gates. Even with a sequence of dynamic networks that have an unboundedly growing number of layers, one cannot approximate more filters. Furthermore, if one restricts the synaptic dynamics in the definition of dynamic networks to the simplest form

$$w_i(t) = w_i \cdot \Big(1 + r \int_0^\infty x_i(t - \tau)\, e^{-\tau/c}\, d\tau\Big)$$

with some arbitrarily fixed r > 0 and time constants c from some arbitrarily fixed interval [a, b] with 0 < a < b, the resulting class of dynamic networks can still approximate (with just one layer of sigmoidal neurons) any filter that can be approximated by a sequence of arbitrary dynamic networks considered in definition 1. In the case of DN* one can choose to fix r > 0 or arbitrarily fix the interval [a, b] for the value of c.
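The simplest synaptic dynamics admitted by corollary 1 is easy to discretize. The following sketch (our own naming, a left Riemann sum over a finite history) computes w_i(t) = w_i·(1 + r ∫₀^∞ x_i(t−τ)e^{−τ/c} dτ) and checks it against the closed-form value w_i·(1 + r·c·x0) for a constant input rate x0:

```python
import numpy as np

def facilitating_weight(x_past, w, r, c, dt):
    """w(t) = w * (1 + r * integral of x(t - tau) * exp(-tau/c)), discretized.

    x_past: samples x(t - j*dt), j = 0..M-1 (most recent sample first).
    """
    tau = np.arange(len(x_past)) * dt
    return w * (1.0 + r * np.sum(x_past * np.exp(-tau / c)) * dt)

# a constant presynaptic rate x0 gives w * (1 + r*c*x0) in the limit dt -> 0
dt, c, r, w, x0 = 0.001, 0.4, 0.5, 1.0, 2.0
x_past = np.full(20000, x0)             # 20 s of history, effectively infinite
w_t = facilitating_weight(x_past, w, r, c, dt)
assert abs(w_t - w * (1 + r * c * x0)) < 1e-2
```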
In addition we will show in section 3.2 that the claim of theorem 1 remains valid if we replace the model from Varela et al. (1997) for synaptic dynamics (employed in the definition of the classes DN and DN* of dynamic networks) by the model from Tsodyks et al. (1998). Furthermore, we show in section 3.3 that the claim of theorem 1 also holds for networks where synapses exhibit just depression, not facilitation.

Remark 3. The proof of theorem 1 shows that its claim as well as the claims of corollary 1 hold under much weaker conditions on the class U. Apart from the requirement that U is closed under translation, it suffices to assume that U is some arbitrary class of uniformly bounded and equicontinuous¹⁰ functions that is closed with regard to the topology defined in part 6 of remarks 2, since this assumption is sufficient for the application of the Arzelà-Ascoli theorem (see Dieudonné, 1969, or Boyd & Chua, 1985) in the proof.
Proof of Theorem 1. According to lemma 1, any filter that can be approximated by finite or infinite Volterra series is time invariant and has fading memory. This implies (d) ⇒ (c). Furthermore, it is shown in Boyd and Chua
10 U is equicontinuous if for any ε > 0 there exists a δ > 0 so that |t − s| ≤ δ implies |u(t) − u(s)| ≤ ε for all t, s ∈ R and all u ∈ U.
(1985) that for the classes U considered in this article, any time-invariant filter F : U → R^R with fading memory can be approximated by a sequence of finite Volterra series (i.e., by Volterra polynomials). This argument can be trivially extended to filters F : U^n → R^R with n ≥ 1. This implies (c) ⇒ (d). Hence we have shown that (c) ⇔ (d). The implication (b) ⇒ (a) is obvious. In order to prove (a) ⇒ (c) we observe that all filters occurring at synapses of a dynamic network (see definition 1) are time invariant and have fading memory. This implies that all filters S defined by dynamic networks (i.e., all S ∈ DN ∪ DN*) are time invariant and have fading memory. According to lemma 1, this implies that any filter F that can be approximated by such networks is time invariant and has fading memory.

For the proof of (c) ⇒ (b) we first consider the case n = 1. We assume that F is some arbitrary given filter that is time invariant and has fading memory. We will first show that F can be approximated by filters S ∈ DN. The proof is based on an application of the Stone-Weierstrass theorem (see, e.g., Dieudonné, 1969, or Folland, 1984), similarly as in Boyd and Chua (1985). That article extends earlier arguments by Sussmann (1975), Fliess (1975), and Gallman and Narendra (1976) from a bounded to an unbounded time interval. Furthermore, our proof exploits the fact that any continuous function can be uniformly approximated on any compact set by weighted sums of sigmoidal gates (Hornik et al., 1989; Sandberg, 1991; Leshno et al., 1993).¹¹ We will apply the Stone-Weierstrass theorem to functionals from U⁻ := {u|_(−∞,0] : u ∈ U} into R. For that purpose we have to show that the filters H of the form
$$H u(t) = u(t) \cdot \Big(1 + r \int_0^\infty u(t - \tau)\, e^{-\tau/c}\, d\tau\Big)$$

separate points in U⁻; that is, for any u, v ∈ U⁻ with u ≠ v there exists a filter H of this form such that Hu(0) ≠ Hv(0). Thus, we consider some arbitrary given u, v ∈ U with u(t) ≠ v(t) for some t ≤ 0. Then the function u(0) · u(−τ) − v(0) · v(−τ) assumes a value ≠ 0 for some τ ≥ 0. This implies
that

$$q(\lambda) = \int_0^\infty \big(u(0) \cdot u(-\tau) - v(0) \cdot v(-\tau)\big)\, e^{-\tau/\lambda}\, d\tau$$
does not assume a constant value for all arguments λ in [a, b]. This follows because if q(λ) = c for all such λ, then q, being an analytic function of λ ∈ C with real part > 0, would equal c for all real λ > 0. Since the limit of q(λ) as λ → 0 is zero, this means c = 0. However, the Laplace transform is one-to-one. (This is a standard fact; one way to prove it is using that the Laplace transform of a bounded measurable function w, evaluated at any
point of the form s = 1 + iω, coincides, as a function of ω, with the Fourier transform of w(t)e^{−t}, and the Fourier transform is one-to-one on integrable functions; cf. Hewitt & Stromberg, 1965, corollary 21.47.) Hence, q(λ) does not assume a constant value for all arguments λ in [a, b]. Since q is analytic, it therefore assumes, in any interval [a, b] with 0 < a < b, infinitely many different values. This implies that for any fixed r > 0,

$$u(0) + r \cdot \int_0^\infty u(0) \cdot u(-\tau)\, e^{-\tau/c}\, d\tau \;\neq\; v(0) + r \cdot \int_0^\infty v(0) \cdot v(-\tau)\, e^{-\tau/c}\, d\tau$$
for some c ∈ [a, b]. Therefore we have H_c u(0) ≠ H_c v(0) for the filter H_c defined by

$$H_c u(t) = u(t) \cdot \Big(1 + r \cdot \int_0^\infty u(t - \tau)\, e^{-\tau/c}\, d\tau\Big).$$
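The separation argument can be illustrated numerically (a sketch with hypothetical inputs and our own naming): two pasts u, v that agree at t = 0 but differ for t < 0 are distinguished by H_c at time 0 for some c in [a, b]:

```python
import numpy as np

def H(u_past, r, c, dt):
    """H_c u(0) = u(0) * (1 + r * integral of u(-tau) * exp(-tau/c)), left Riemann sum."""
    tau = np.arange(len(u_past)) * dt
    return u_past[0] * (1.0 + r * np.sum(u_past * np.exp(-tau / c)) * dt)

dt = 0.001
tau = np.arange(0, 20, dt)
u_past = 1.0 + 0.5 * np.exp(-tau)       # u(-tau); u(0) = 1.5
v_past = 1.0 + 0.5 * np.exp(-2 * tau)   # v(-tau); v(0) = 1.5 as well
r = 1.0
# the filters H_c separate u and v for some c in [0.1, 1.0]
assert any(abs(H(u_past, r, c, dt) - H(v_past, r, c, dt)) > 1e-6
           for c in np.linspace(0.1, 1.0, 10))
```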
In order to apply the Stone-Weierstrass theorem, we also need to show that U⁻ is a compact metric space with regard to the topology T defined in part 6 of remarks 2. Obviously this topology T coincides with the topology generated on U⁻ by the metric

$$d(u, v) := \sup_{t \le 0} \frac{|u(t) - v(t)|}{1 + |t|}$$
(since all functions in U are assumed to be uniformly bounded). The compactness of U⁻ with regard to this metric follows by a routine argument, applying the Arzelà-Ascoli theorem successively to the sequence of restrictions U|_[−T,0] := {u|_[−T,0] : u ∈ U} for T ∈ N and by diagonalizing over converging subsequences for these restrictions (see, for instance, lemma 1 in Boyd & Chua, 1985). The Stone-Weierstrass theorem implies that there exists for every given ε > 0 some m ∈ N, filters H_{c_1}, . . . , H_{c_m} as specified above, and a polynomial p such that |F u(0) − p(H_{c_1} u(0), . . . , H_{c_m} u(0))| < ε for all u ∈ U.
For the case of filters S ∈ DN* one shows similarly that there exists for every r > 0 some c ∈ [a, b] so that

$$u(0) \cdot D_u(0) \cdot \Big(1 + r \cdot \int_0^\infty u(-\tau)\, e^{-\tau/c}\, d\tau\Big) \;\neq\; v(0) \cdot D_v(0) \cdot \Big(1 + r \cdot \int_0^\infty v(-\tau)\, e^{-\tau/c}\, d\tau\Big),$$

where D_u and D_v denote the outputs of the corresponding depression filters for the inputs u and v. Case 2: u(0) · D_u(0) ≠ v(0) · D_v(0). Then the claim follows since r · ∫₀^∞ u(−τ)e^{−τ/c} dτ − r · ∫₀^∞ v(−τ)e^{−τ/c} dτ converges to 0 if r → 0 or c → 0. The rest of the argument is exactly as in the preceding argument for filters S ∈ DN. This completes the proof of (c) ⇒ (b) also for the case of dynamic networks that define filters S ∈ DN* and n = 1.

In the claim of the theorem, we had considered a slightly more general class of filters F that are defined over U^n for some given n ≥ 1. In order to extend the preceding proof of (c) ⇒ (b) to the more general input space for n ≥ 1, one just has to note that U^n is a compact metric space with regard to the product topology generated by the topology T over U as in part 6 of remarks 2, and that our preceding arguments imply that filters over U^n of
11 These approximation results were previously applied in this context by Sandberg (1991).
the form ⟨u1, . . . , un⟩ → H u_i with i ∈ {1, . . . , n} (and H modeling synaptic dynamics according to definition 1) separate points in (U⁻)^n.

3.2 Extension of the Result to the Model for Synaptic Dynamics by Tsodyks, Pawelzik, and Markram. Tsodyks et al. (1998) propose a slightly
different temporal dynamics for depression and facilitation in populations of synapses. In contrast to the model from Varela et al. (1997) that underlies our definition 1, this model has been explicitly formulated for a mean-field analysis, where the input to a population of synapses consists of a continuous function x_i(t) that models firing activity in a presynaptic pool P_i of neurons rather than a spike train from a single presynaptic neuron. We show in this section that our characterization result from the preceding section also holds for this model for synaptic dynamics.

The first difference to the synapse model from Varela et al. (1997) is a use-dependent discount factor e^{−r ∫_{−τ}^0 x_i(t′) dt′} · e^{−τ/c} instead of just e^{−τ/c} in the model for facilitation, which reduces the facilitating impact of a preceding large input x_i(−τ) on the value of the synaptic weight at time 0. In other words, facilitation is no longer modeled by a linear filter; instead, one assumes that facilitation has less impact on a synapse that has already been facilitated by preceding inputs.

For a precise definition of the resulting variation DN⁺ of our definition of dynamic networks from definition 1, we replace w_i(t) = w_i · (1 + r ∫₀^∞ x_i(t−τ) e^{−τ/c} dτ) by w_i(t) = w_i · ŵ_i(t), where

$$\hat{w}_i(t) = r \cdot \int_0^\infty x_i(t - \tau) \cdot e^{-r \int_{t-\tau}^{t} x_i(t')\, dt'} \cdot e^{-\tau/c}\, d\tau. \tag{3.1}$$
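Equation 3.1 can be discretized directly; the running integral in the discount factor becomes a cumulative sum. The sketch below (our own naming) also checks the qualitative effect of the use-dependent discount: for a constant rate x0 the integral evaluates to r·x0·c/(r·x0·c + 1), which saturates instead of growing linearly as in the model of definition 1.

```python
import numpy as np

def w_hat(x_past, r, c, dt):
    """Equation 3.1, discretized: facilitation with use-dependent discount.

    x_past: samples x(-j*dt), j = 0..M-1 (most recent sample first).
    """
    tau = np.arange(len(x_past)) * dt
    use = np.cumsum(x_past) * dt        # integral of x over [-tau, 0] for each tau
    return r * np.sum(x_past * np.exp(-r * use) * np.exp(-tau / c)) * dt

dt, r, c = 0.001, 0.5, 0.4
x_lo = np.full(20000, 1.0)              # constant pool activity, 20 s of history
x_hi = np.full(20000, 4.0)              # four times the activity
# sublinearity: quadrupling the rate less than quadruples the facilitation
assert w_hat(x_hi, r, c, dt) < 4 * w_hat(x_lo, r, c, dt)
```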
This is the model for facilitation proposed in equation 3.5 of Tsodyks et al. (1998) for a mean-field setting, where x_i(t − τ) models firing activity at time t − τ in a presynaptic pool P_i of neurons (ŵ_i(t) is denoted by U¹_SE, r is denoted by U_SE, c is denoted by τ_facil, and w_i is denoted by A_SE). We show in the subsequent result that for any given value of the parameter r (which models the normal use of synaptic resources caused by input to a "rested" synapse) and for any given interval [a, b] one can choose the values w_i ∈ R and time constants c from [a, b] so that a network consisting of facilitating synapses in combination with one layer of sigmoidal neurons can approximate any time-invariant filter with fading memory.

Tsodyks et al. (1998) also propose a model for populations of synapses that exhibit both depression and facilitation (one substitutes equation 3.5 for U_SE in equation 3.3 of Tsodyks et al., 1998). A new feature of this model is that one can no longer express the current synaptic weight w_i(t) as a product of the outputs of two separate filters: one for depression and one for facilitation. Rather, the output of the filter for facilitation (see our equation 3.1) enters the computation of the current output of the filter for depression.
This is biologically plausible, since a facilitated synapse spends its resources more quickly—and hence is subject to stronger depression. In our notation, the model from Tsodyks et al. (1998) for depression and facilitation in a mean-field setting (equations 3.3 and 3.5 in Tsodyks et al., 1998) yields the following formula for the value w_i(t) of the current weight of a population of synapses (with ŵ_i(t) defined according to equation 3.1):

$$w_i(t) := w_i \cdot \hat{w}_i(t) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{t-\tau}^{t} \hat{w}_i(t')\, x_i(t')\, dt'}\, d\tau. \tag{3.2}$$
This formula involves another parameter, τ_rec: the time constant for the recovery from using synaptic resources. We will write DN⁺⁺ for the class of feedforward networks consisting of sigmoidal neurons with dynamic weights w_i(t) according to equation 3.2. In order to make sure that the integrals in equations 3.1 and 3.2 assume a finite value for bounded synaptic inputs x_i(·), one has to make sure that not only the network inputs but also the outputs of sigmoidal units in networks from the classes DN⁺ and DN⁺⁺ are always nonnegative. For that purpose, we assume in this section and the next that the sigmoidal activation function σ assumes nonnegative values only. This is no real restriction since the output of a sigmoidal unit models the current firing activity in a pool of neurons. Any filter that maps x_i(·) onto w_i(·) · x_i(·) with w_i(·) defined according to equations 3.1 and 3.2 is time invariant and has fading memory. Hence every network in DN⁺ and DN⁺⁺ computes a time-invariant filter with fading memory.

Theorem 2. Assume that U is the class of functions from R into [B0, B1] that satisfy |u(t) − u(s)| ≤ B2 · |t − s| for all t, s ∈ R, where B0, B1, B2 are arbitrary real-valued constants with 0 < B0 < B1 and 0 < B2. Let F be an arbitrary filter that maps vectors u = ⟨u1, . . . , un⟩ ∈ U^n into functions from R into R. Then the following are equivalent:
(a) F can be approximated by dynamic networks S ∈ DN⁺ (i.e., for any ε > 0 there exists some S ∈ DN⁺ such that |F u(t) − S u(t)| < ε for all u ∈ U^n and all t ∈ R).

(b) F can be approximated by dynamic networks S ∈ DN⁺ with just a single layer of sigmoidal neurons.

(c) F is time invariant and has fading memory.

(d) F can be approximated by a sequence of (finite or infinite) Volterra series.

These equivalences remain valid if DN⁺ is replaced by DN⁺⁺. It will be obvious from the proof of theorem 2 that in principle quite small ranges suffice for the "free" parameters c and τ_rec that control the synaptic dynamics according to equations 3.1 and 3.2.
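Equation 3.2 composes the two mechanisms: the facilitation output ŵ_i enters the depletion exponent of the depression integral. A direct, unoptimized discretization (our own naming; O(M²) because ŵ_i must be evaluated at every past time point) looks as follows:

```python
import numpy as np

def w_hat(x_past, r, c, dt):
    """Facilitation, equation 3.1 (most recent sample of x first)."""
    tau = np.arange(len(x_past)) * dt
    use = np.cumsum(x_past) * dt
    return r * np.sum(x_past * np.exp(-r * use) * np.exp(-tau / c)) * dt

def weight(x_past, w, r, c, t_rec, dt):
    """Equation 3.2: w(0) = w * w_hat(0) * integral over tau of
    exp(-tau/t_rec) * exp(-integral of w_hat * x over [-tau, 0])."""
    tau = np.arange(len(x_past)) * dt
    # w_hat(-j*dt) for every point of the history (truncated tails; a sketch)
    wh = np.array([w_hat(x_past[j:], r, c, dt) for j in range(len(x_past))])
    depletion = np.cumsum(wh * x_past) * dt
    return w * wh[0] * np.sum(np.exp(-tau / t_rec) * np.exp(-depletion)) * dt

dt = 0.01
x = np.full(400, 2.0)                   # 4 s of constant pool activity
w0 = weight(x, 1.0, 0.5, 0.4, 0.8, dt)
assert 0.0 < w0 < 1.0                   # depression keeps the weight below w
```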
Corollary 2. In order to approximate an arbitrary given time-invariant fading-memory filter F by dynamic networks S from DN⁺, one can choose for any given r > 0 the parameters c of the synapses in S (defined according to equation 3.1) from some arbitrarily given interval [a, b]. In order to approximate F by networks S from DN⁺⁺, one can choose for any given r > 0 the parameters c from some arbitrarily given interval [a, b] and the parameters τ_rec according to equation 3.2 from some arbitrarily given interval [a′, b′].
Proof of Theorem 2. It suffices to describe how the proof of (c) ⇒ (b) from theorem 1 has to be changed. For the case of networks from the class DN⁺, we have to show that the filters H_c⁺ of the form

$$H_c^{+} u(t) = u(t) \cdot r \int_0^\infty u(t - \tau) \cdot e^{-r \int_{t-\tau}^{t} u(t')\, dt'} \cdot e^{-\tau/c}\, d\tau$$
separate points in U⁻. We show that for any given r > 0, a, b ∈ R with 0 < a < b, and any u, v ∈ U⁻ with u ≠ v there exists some c ∈ [a, b] such that H_c⁺ u(0) ≠ H_c⁺ v(0). Thus we consider some arbitrary given u, v ∈ U with u(t) ≠ v(t) for some t ≤ 0. According to our argument in the proof of theorem 1, it suffices for that to show that

$$u(0) \cdot u(-\tau) \cdot e^{-r \int_{-\tau}^{0} u(t')\, dt'} \;\neq\; v(0) \cdot v(-\tau) \cdot e^{-r \int_{-\tau}^{0} v(t')\, dt'} \quad \text{for some } \tau \ge 0, \tag{3.3}$$
because this implies that the function q⁺ defined by

$$q^{+}(\ell) := \int_0^\infty \Big( u(0) \cdot u(-\tau) \cdot e^{-r \int_{-\tau}^{0} u(t')\, dt'} - v(0) \cdot v(-\tau) \cdot e^{-r \int_{-\tau}^{0} v(t')\, dt'} \Big)\, e^{-\tau/\ell}\, d\tau$$
assumes infinitely many different values for ℓ ∈ [a, b]. Assume for a contradiction that equation 3.3 does not hold. This implies that u(0) = v(0). Consider some t0 < 0 with u(t0) ≠ v(t0). We assume without loss of generality that u(t0) > v(t0). Set

t0⁺ := inf{t > t0 : u(t) ≤ v(t)},
t0⁻ := sup{t < t0 : u(t) ≤ v(t)}.

We have t0⁺ ≤ 0 since u(0) = v(0).

Case 1: t0⁻ > −∞ (i.e., ∃ t < t0 : u(t) ≤ v(t)). Then t0⁻ < t0 < t0⁺, u(t0⁻) = v(t0⁻), u(t0⁺) = v(t0⁺), and u(t) > v(t) for all t ∈ (t0⁻, t0⁺). According to our assumption this implies that ∫_{t0⁺}^0 u(t) dt = ∫_{t0⁺}^0 v(t) dt and ∫_{t0⁻}^0 u(t) dt = ∫_{t0⁻}^0 v(t) dt, although ∫_{t0⁻}^{t0⁺} u(t) dt > ∫_{t0⁻}^{t0⁺} v(t) dt. This yields a contradiction.
Case 2: u(t) > v(t) for all t < t0. Our assumptions imply then that ∫_{t0⁺}^0 u(t) dt = ∫_{t0⁺}^0 v(t) dt and u(t) > v(t) for all t < t0⁺. Therefore there exists some ε > 0 such that e^{r ∫_t^0 (u(t′)−v(t′)) dt′} ≥ 1 + ε for all t ≤ t0. Hence we can conclude from our assumption that u(t)/v(t) ≥ 1 + ε for all t ≤ t0. This implies that e^{r ∫_t^0 (u(t′)−v(t′)) dt′} → ∞ for t → −∞, hence u(t)/v(t) → ∞ for t → −∞. This provides a contradiction to our definition of the class U of functions to which u and v belong, since all functions in U have values in [B0, B1] for 0 < B0 < B1. This completes our proof of the direction (c) ⇒ (b). The remainder of the proof for the case of dynamic networks from the class DN⁺ is the same as for theorem 1.

In order to prove (c) ⇒ (b) for networks from the class DN⁺⁺, we have to show that the filters that map an input function x_i(·) onto the output function w_i(·) · x_i(·) with w_i(t) defined according to equation 3.2 separate points in U⁻. Thus we fix some u, v ∈ U with u(t) ≠ v(t) for some t ≤ 0. According to the preceding proof for DN⁺, there exists some c ∈ [a, b] such that H_c⁺ u(0) ≠ H_c⁺ v(0). We want to show for this c that there exists for any given a′, b′ with 0 < a′ < b′ some τ_rec ∈ [a′, b′] so that the resulting filter defined by the synapse according to equation 3.2 can separate u and v. More precisely, we show for the filter G_c that is defined in analogy to equation 3.1 by

$$G_c u(t) := r \cdot \int_0^\infty u(t - \tau) \cdot e^{-r \int_{t-\tau}^{t} u(t')\, dt'} \cdot e^{-\tau/c}\, d\tau$$

(thus H_c⁺ u(t) = u(t) · G_c u(t)) that

$$u(0) \cdot G_c u(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} G_c u(t') \cdot u(t')\, dt'}\, d\tau \;\neq\; v(0) \cdot G_c v(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} G_c v(t') \cdot v(t')\, dt'}\, d\tau. \tag{3.4}$$
It is obvious that the function h: R ! R dened by R0 Gc u (t 0 )¢u (t 0 ) dt 0 ¡t (0) ¢ u e c R0 ¡ G v (t 0 )¢v (t 0 ) dt 0 (0) (0) ¡ v ¢ G c v ¢ e ¡t c
h (t ) : D u (0) ¢ G
6 0 for some t ¸ 0, since h(0) D 6 0 by our choice of c . Hence assumes a value D
by the argument via the Laplace transform from the proof of theorem 1, there exists some τ_rec ∈ [a′, b′] (for any given a′, b′ ∈ R with 0 < a′ < b′) so that

$$\int_0^\infty e^{-\tau/\tau_{rec}} \cdot h(\tau)\, d\tau \;\neq\; 0,$$

which is equivalent to the desired inequality (see equation 3.4). Thus we have shown that the filters defined by the temporal dynamics of synapses in dynamic networks from the class DN⁺⁺ separate points in U⁻. The rest of the proof is the same as for theorem 1.

3.3 Universal Approximation of Filters with Depressing Synapses Only.
We show in this section that the computational power of feedforward neural networks with dynamic synapses remains the same if the synapses exhibit just depression, not facilitation. This holds provided that the time constants τ_rec for their recovery from depression can be chosen individually from some interval [a, b] (this holds for any values of a, b with 0 < a < b). This result is of interest since according to Tsodyks et al. (1998), all synapses between pyramidal neurons exhibit depression, not facilitation. We will employ the model from Tsodyks et al. (1998) for synaptic depression in a mean-field setting, which is specified in equation 3.3 of Tsodyks et al. (1998). We write DN⁻ for the class of feedforward neural networks consisting of sigmoidal neurons (whose activation function σ assumes nonnegative values only) with weights w_i(t) evolving according to

$$w_i(t) = w_i \cdot U_{SE} \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{t-\tau}^{t} U_{SE} \cdot x_i(t')\, dt'}\, d\tau \tag{3.5}$$
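A discretized sketch of equation 3.5 (our own naming). For constant activity x0 the integral evaluates to w·U_SE/(1/τ_rec + U_SE·x0), so the effective weight decreases with the activity level, which is the hallmark of depression:

```python
import numpy as np

def depressing_weight(x_past, w, U_SE, t_rec, dt):
    """Equation 3.5, discretized.

    x_past: pool activity x(-j*dt), j = 0..M-1 (most recent sample first).
    """
    tau = np.arange(len(x_past)) * dt
    depletion = np.cumsum(U_SE * x_past) * dt   # integral of U_SE * x over [-tau, 0]
    return w * U_SE * np.sum(np.exp(-tau / t_rec) * np.exp(-depletion)) * dt

dt, w, U_SE, t_rec = 0.001, 1.0, 0.5, 0.8
lo = depressing_weight(np.full(20000, 1.0), w, U_SE, t_rec, dt)
hi = depressing_weight(np.full(20000, 5.0), w, U_SE, t_rec, dt)
assert hi < lo                                  # stronger activity, weaker weight
```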
in dependence on the presynaptic pool activity x_i(t), where U_SE > 0 is some given constant. Note that w_i(t) · x_i(t) agrees with the term A_SE · ⟨y(t)⟩ with ⟨y(t)⟩ defined by equation 3.3 in Tsodyks et al. (1998), which models the average value of the postsynaptic current caused in pool P by the firing activity x_i(t) in the pool P_i in the case of depressing synapses between pools P_i and P (the parameter A_SE is denoted by w_i in our notation).

Theorem 3. Assume that U is the class of functions from R into [B0, B1] that satisfy |u(t) − u(s)| ≤ B2 · |t − s| for all t, s ∈ R, where B0, B1, B2 are arbitrary real-valued constants with 0 < B0 < B1 and 0 < B2. Let F be an arbitrary filter that maps vectors u = ⟨u1, . . . , un⟩ ∈ U^n into functions from R into R. Then the following are equivalent:
(a) F can be approximated by dynamic networks S ∈ DN⁻ (i.e., for any ε > 0 there exists some S ∈ DN⁻ such that |F u(t) − S u(t)| < ε for all u ∈ U^n and all t ∈ R).
(b) F can be approximated by dynamic networks S ∈ DN⁻ with just a single layer of sigmoidal neurons.

(c) F is time invariant and has fading memory.

(d) F can be approximated by a sequence of (finite or infinite) Volterra series.

Proof. It is obvious that all filters defined by dynamic networks from the class DN⁻ are time invariant and have fading memory. Hence it suffices to show how the proof of (c) ⇒ (b) has to be changed in comparison with the proof of theorem 1. Assume that parameters a, b, U_SE ∈ R with 0 < a < b and 0 < U_SE have been fixed in some arbitrary manner. We have to show that for any two functions u, v ∈ U with u(t0) > v(t0) for some t0 ≤ 0, there exists some τ_rec ∈ [a, b] so that the filter that models synaptic dynamics according to equation 3.5 differs at time 0 for the two input functions u, v (instead of x_i); we have to show that
$$u(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot u(t')\, dt'}\, d\tau \;\neq\; v(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot v(t')\, dt'}\, d\tau.$$
According to our argument with the Laplace transform in the proof of theorem 1, it suffices to show that h(τ) ≠ 0 for some τ ≥ 0, where h: R → R is the function defined by

$$h(\tau) := u(0) \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot u(t')\, dt'} - v(0) \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot v(t')\, dt'}.$$
If u(0) ≠ v(0), this is obvious, since then h(0) ≠ 0. Hence we assume that u(0) = v(0). Furthermore, we assume for a contradiction that h(τ) = 0 for all τ ≥ 0. Set
t0⁺ := inf{t > t0 : u(t) ≤ v(t)}.

Then we have t0 < t0⁺ ≤ 0, ∫_{t0⁺}^0 u(t′) dt′ = ∫_{t0⁺}^0 v(t′) dt′, and ∫_{t0}^0 u(t′) dt′ = ∫_{t0}^0 v(t′) dt′. This yields a contradiction to the fact that u(t′) > v(t′) for all t′ ∈ [t0, t0⁺), and hence ∫_{t0}^{t0⁺} u(t′) dt′ > ∫_{t0}^{t0⁺} v(t′) dt′.

3.4 Focusing on Excitatory Synapses. In the preceding dynamic network models, we had assumed that the constant factors w_i could be chosen to be positive or negative, thus yielding excitatory or inhibitory synapses in a biological interpretation. This formal symmetry between excitatory and inhibitory synapses is not adequate for most biological neural systems—for example, the cortex of primates, where just 15% of the synapses are inhibitory. We would like to point out that according to Maass (in press), one can replace the dynamic networks considered in the preceding sections
by an alternative type of network where just the dynamics of excitatory synapses matters—which can be just depressing, just facilitating, or depressing and facilitating, as in the preceding sections. The key observation is that instead of approximating the given polynomial p by a weighted sum of sigmoidal neurons in the proof of theorem 1 (and analogously in the subsequent theorems), one can approximate p by a single soft winner-take-all module applied to several weighted sums of the filters H_{c_1}, . . . , H_{c_m} with nonnegative weights w_i only.¹² The resulting network structure corresponds to a biological neural system where the filters H_{c_1}, . . . , H_{c_m} are realized exclusively by excitatory dynamic synapses, and the role of inhibitory synapses is restricted to the realization of the subsequent soft winner-take-all module. We refer to Maass (in press) for details of this alternative style of network construction.

3.5 Allowing the Input Functions to Vanish. We had assumed in theorem 1 that all input functions u_i satisfy u_i(t) ≥ B0 for all t ∈ R, where B0 > 0 is some arbitrary constant. This assumption is usually met in the sketched application to biological neural systems, because the minimum firing rate of neurons is larger than 0 (typically in the range of 5 Hz). Alternatively, one can assume that all input functions u_i are superimposed with some positive constant input (which could be interpreted as background firing activity). The following result shows that from a mathematical point of view, the assumption B0 > 0 is necessary at least in the case of single-layer networks, since in the case B0 = 0 a strictly smaller class of filters is approximated by dynamic networks with a single layer of sigmoidal neurons. Theorem 4 gives a precise characterization of this smaller class of filters.
Theorem 4. Assume that U is the class of functions u in C(R, [0, B1]) that satisfy |u(t) − u(s)| ≤ B2 · |t − s| for all t, s ∈ R, where B1 and B2 are arbitrary real-valued constants with 0 < B1 and 0 < B2. Let F be an arbitrary filter that maps functions from U into R^R. We consider here only the version of dynamic networks giving rise to filters in DN. Then F can be approximated by dynamic networks with a single layer of sigmoidal gates if and only if F is time invariant, has fading memory, and there exists a constant c_F such that F u(t) = c_F for all u ∈ U and t ∈ R with u(t) = 0.
Proof of Theorem 4. The form of the filters defining the class DN implies that when u(t) = 0, all filters output the value 0 at the given time t, and hence the sigmoidal gate outputs the value G(t) = σ(w0), irrespective of the values of u at other times. It is easy to see that all filters approximated by
12 If one prefers, one can replace the nonnegative weighted sums of the filters H_{c_1}, . . . , H_{c_m} in this alternative approximation result by sigmoidal neurons applied to nonnegative weighted sums of the filters H_{c_1}, . . . , H_{c_m}.
such networks must also have the same property. The converse implication is established in almost exactly the same manner as in the proof of theorem 1. The only difference is as follows. It could be that u(0) = v(0) = 0, in which case our argument fails to provide a separating filter. However, this separation is not needed if we only need to approximate filters that are constant on the set of inputs with zero value at t = 0. This follows from the following lemma, which is a small variation of the Stone-Weierstrass theorem. Given a compact Hausdorff topological space U and a closed subset S of U, we say that a function f : U → R is S-constant if the restriction of f to S is a constant function. We say that a class F of real-valued functions on U is S-separating if, for each u, v ∈ U, u ≠ v, such that not both of u and v are in S, there is some f ∈ F such that f(u) ≠ f(v).

Lemma 2. Suppose that F is a class consisting of continuous and S-constant functions that S-separates. Then polynomials in elements of F approximate every S-constant continuous function U → R.
This lemma is proved as follows. We consider the quotient space U_S := U/S, where we collapse all points of S to one point, endowed with the quotient topology (its open sets are those open sets V in U for which V ∩ S ≠ ∅ implies S ⊆ V). The topological space U_S is compact, because the canonical projection onto the quotient is continuous, and U was assumed to be compact. In addition, U_S is a Hausdorff space, as follows from the fact that for each x ∉ S there are disjoint neighborhoods of x and S (since S is compact). The lemma is established by noticing that continuous S-constant functions induce continuous functions on U_S, so that we may apply the standard Stone-Weierstrass theorem to this quotient space. Now the theorem follows, using as S the set consisting of all inputs u with u(0) = 0. The S-separation property is established just as in the proof of theorem 1; we omit the routine details.

3.6 Combining Synaptic Dynamics with Membrane Dynamics. One other source of temporal dynamics in biological neural systems is the dynamics of the membrane potential of neurons. Hence it is of interest to consider a variation of our notion of a dynamic network where a function x_i(t) is processed at a connection of the network first by a filter H that maps x_i(·) onto w_i(·) · x_i(·) (modeling synaptic dynamics) and then by another filter G modeling the membrane dynamics of the receiving neurons. Since each single excitatory postsynaptic potential and inhibitory postsynaptic potential can be fitted very well by a function of the form b1 e^(−t/c1) − b2 e^(−t/c2), it appears justified to model membrane dynamics, in the context of our model for population coding with pools of neurons, by first-order linear filters G with an impulse response g(t) consisting of a weighted sum of functions
Neural Systems as Nonlinear Filters
1765
of the form e^(−t/c). In the resulting variation of our notion of a dynamic network, one replaces the filters H that model synaptic dynamics according to definition 1 at the connections of the network by compositions G ∘ H with such linear filters G. All preceding results remain valid for this variation of the network model. In order to approximate arbitrary given time-invariant filters with fading memory by such networks, one just has to show that for any two functions u, v ∈ U with u(t) ≠ v(t) for some t ≤ 0 there exist filters H, G of the desired type such that (G ∘ H)u(0) ≠ (G ∘ H)v(0). This holds even for u, v ∈ C(R, [B0, B1]) with B0 = 0, since we only have to find a filter H modeling synaptic dynamics according to definition 1 so that Hu(t) ≠ Hv(t) for some t ≤ 0 (hence we can allow that u(0) = v(0) = 0). We can then apply to the functions Hu(t) and Hv(t) for t ≤ 0 the argument from the proof of theorem 1 to find a linear filter G with impulse response e^(−t/c) so that (G ∘ H)u(0) ≠ (G ∘ H)v(0). The same argument shows that theoretically the same class of filters can be approximated by dynamic networks if one relies on only linear filters G modeling membrane dynamics.

4 Computations on Spatiotemporal Patterns
A closer look shows that many temporal patterns that are relevant for biological neural systems are not just temporal but spatiotemporal patterns. For example, in auditory systems the additional spatial dimension parameterizes different frequency bands. These are represented by the spatial locations of the inner hair cells on the cochlea and by corresponding spatial maps in higher parts of the auditory system. In visual systems, it is obvious that the analysis of moving objects (and/or the stabilization of visual input in spite of body movements of the receiving organism) requires the processing of complex spatiotemporal patterns. In this context, two spatial dimensions correspond directly to retinal locations. But one should note that other "spatial" dimensions in the subsequent definition 3 need not correspond to spatial locations in the outer world (or on the retina); they can also correspond to scales in a more abstract feature space, for example, to spatial frequency or phase. Therefore, we will consider spatiotemporal patterns with an arbitrary finite number d ∈ N of spatial dimensions. The transformation and classification of complex spatiotemporal patterns appear to be relevant also for higher cortical areas, since recordings from larger populations of neurons via voltage-sensitive dyes, MEG, EEG, and so forth suggest that sensory input is encoded in spatiotemporal activation patterns of the associated cortical neural systems. These spatiotemporal activation patterns provide the input to higher cortical areas. The output of various systems in the motor cortex can also be viewed as spatiotemporal patterns (Georgopoulos, 1995). Hence, one may argue that higher cortical areas also carry out computations on spatiotemporal patterns. We will show that our analysis of computations on temporal patterns extends to an analysis of computations on spatiotemporal patterns.
For that purpose, we introduce a suitable extension of our definition of a dynamic network that allows for d spatial dimensions in the input functions u. We define a spatial dynamic network and the corresponding classes SDN and SDN* of filters as a variation of definition 2 of a dynamic network.

Definition 3. Fix some arbitrary d ∈ N. A spatial dynamic network with d spatial input dimensions (in addition to the time dimension) assigns to vectors u of n input functions u: R^d × R → R an output function z: R → R. The only difference to the preceding definition of a dynamic network is that now there exists for each network a finite set X of vectors x ∈ R^d such that the actual input to the network consists of functions of the form u(x, ·) for x ∈ X.
According to this definition, any spatial dynamic network samples the input functions u: R^d × R → R just at a fixed finite set X of points x. Nevertheless, we will show in theorem 5 that these networks can approximate a very large class of filters on functions u: R^d × R → R. The notion of a Volterra series (see remark 1) can be readily extended to input functions u: R^d × R → R (again we assume that u is measurable and essentially bounded). In this case a kth-order Volterra term is of the form

    ∫_{−∞}^{∞} … ∫_{−∞}^{∞} ∫_{0}^{∞} … ∫_{0}^{∞} u(x_1, …, x_d, t − τ_1) · … · u(x_1, …, x_d, t − τ_k)
        · h(x_1, …, x_d, τ_1, …, τ_k) dx_1 … dx_d dτ_1 … dτ_k        (4.1)
for some function h ∈ L1. Analogously as before, we refer to a series consisting of finitely many such terms as a Volterra polynomial or finite Volterra series.

Theorem 5. Let U be the class of functions in C(R^d × R, [B0, B1]) that satisfy |u(x, t) − u(x̃, t̃)| ≤ B2 · ‖⟨x, t⟩ − ⟨x̃, t̃⟩‖ for all ⟨x, t⟩, ⟨x̃, t̃⟩ ∈ R^d × R, where d ∈ N and B0, B1, B2 are arbitrary real-valued constants with 0 < B0 < B1 and 0 < B2. Then the following holds for any filter F: U^n → R^R: F can be approximated by spatial dynamic networks (i.e., for any ε > 0 there exists some S ∈ SDN such that |Fu(t) − Su(t)| < ε for all u ∈ U^n and t ∈ R) if and only if F can be approximated by a sequence of (finite or infinite) Volterra series. The claim remains valid if SDN is replaced by SDN*.
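A discretized version of such a Volterra term is straightforward to compute. The sketch below is our illustration, not taken from the paper: the input, the gaussian kernel h, and the grid sizes are arbitrary choices. It evaluates a second-order (k = 2) term with d = 1 spatial dimension by replacing the integrals in equation 4.1 with Riemann sums on a truncated grid:

```python
import numpy as np

def volterra_term_k2(u, h, dx, dt):
    """Riemann-sum approximation of a 2nd-order spatial Volterra term
    (equation 4.1 with d = 1, k = 2) at the current time t:
        ∫∫∫ u(x, t−τ1) u(x, t−τ2) h(x, τ1, τ2) dx dτ1 dτ2.
    u: array of shape (nx, nt), u[i, j] = u(x_i, t − τ_j) with τ_0 = 0.
    h: kernel array of shape (nx, nt, nt)."""
    # products u(x, t−τ1) * u(x, t−τ2) for all lag pairs, per location x
    prod = u[:, :, None] * u[:, None, :]          # shape (nx, nt, nt)
    return np.sum(prod * h) * dx * dt * dt

# toy input history and a decaying gaussian kernel (arbitrary choices)
nx, nt, dx, dt = 32, 64, 0.1, 0.01
x = np.arange(nx) * dx
tau = np.arange(nt) * dt
u = np.cos(2 * np.pi * x)[:, None] * np.exp(-tau)[None, :]
h = (np.exp(-(tau[None, :, None] + tau[None, None, :]) / 0.2)
     * np.exp(-x[:, None, None] ** 2))
y = volterra_term_k2(u, h, dx, dt)
print(float(y))
```

Because the term is a product of two copies of u, scaling the input by a factor c scales this term by c², which is a quick sanity check on the implementation.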
Proof. We show that the two alternative conditions on F in the claim of the theorem are equivalent by proving that both conditions are equivalent to a third condition: the condition that F is time invariant and has fading
memory, for the straightforward extension of this notion to filters F: U^n → R^R, where U is now a class of functions from R^d × R into R. We say that such a filter F has fading memory if for every ⟨v1, …, vn⟩ ∈ U^n and every ε > 0 there exist δ > 0, T > 0, and K > 0 so that |Fv(0) − Fu(0)| < ε for all u = ⟨u1, …, un⟩ ∈ U^n with the property that ‖v(x, t) − u(x, t)‖ < δ for all t ∈ [−T, 0] and all x ∈ [−K, K]^d. This condition implies a "fading influence" of u(x, t) for arguments ⟨x, t⟩ where |t| or ‖x‖ is very large. It is obvious that any kth-order Volterra term of the form 4.1 is time invariant and has fading memory. Furthermore, this property is preserved by taking sums and limits (analogously as in lemma 2). Hence also for filters with inputs u: R^d × R → R we have that any filter that can be approximated by a sequence of finite or infinite Volterra series is time invariant and has fading memory. On the other hand, one can extend the proof of theorem 1 in Boyd and Chua (1985) in a straightforward manner to show that any time-invariant filter that has fading memory and receives inputs from U^n (for a class U with the properties stated in the claim of the theorem) can be approximated arbitrarily closely by Volterra polynomials. For this extension of the argument of Boyd and Chua (1985), one just has to verify that this class U^n is compact with regard to the topology generated by the neighborhoods {u ∈ U^n : ‖v(x, t) − u(x, t)‖ < ε for all t ∈ [−T, 0] and all x ∈ [−K, K]^d} for arbitrary v ∈ U^n and ε, T, K > 0. It is clear that any spatial dynamic network according to definition 3 is time invariant and has fading memory. Thus it remains only to show that any time-invariant filter F: U^n → R^R with fading memory can be approximated arbitrarily closely by spatial dynamic networks.
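The fading-memory condition can be checked numerically for a concrete filter. The sketch below is ours (the exponential kernel, time step, and horizon are arbitrary choices): it evaluates a first-order linear filter with impulse response e^(−s/τ) on two input histories that agree on a recent window but differ arbitrarily before it, so the outputs at time 0 should nearly coincide, exactly as the definition requires.

```python
import numpy as np

def exp_filter(u, dt=0.01, tau=0.5):
    """Output at time 0 of the filter (Fu)(0) = ∫_0^∞ e^(−s/tau) u(−s) ds,
    approximated on a grid; u[k] = u(−k·dt) for k = 0 .. len(u)−1."""
    s = np.arange(len(u)) * dt
    return np.sum(np.exp(-s / tau) * u) * dt

dt, n = 0.01, 2000                        # history of 20 time units
rng = np.random.default_rng(0)
v = rng.uniform(0.0, 1.0, n)              # a reference input history
u = v.copy()
u[500:] = rng.uniform(0.0, 1.0, n - 500)  # change v only for t < −5
gap = abs(exp_filter(v, dt) - exp_filter(u, dt))
print(gap)                                # small: the distant past has faded
```

The gap is bounded by the kernel's tail mass ∫_5^∞ e^(−s/0.5) ds ≈ 2·10⁻⁵, illustrating why exponentially decaying impulse responses always yield fading memory.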
In order to extend the proof of theorem 1 in Boyd and Chua (1985) to this case, one just has to observe that the proof of theorem 1 implies the following: for any two functions u, v ∈ U with u(x, t) ≠ v(x, t) for some t ≤ 0 and x ∈ R^d, there exist some x ∈ R^d and some filter H modeling synaptic dynamics as in definition 2 that satisfy Hu(x, ·)(0) ≠ Hv(x, ·)(0). Thus we have shown that a filter F: U^n → R^R can be approximated by spatial dynamic networks if and only if F is time invariant and has fading memory. This completes the proof of theorem 5.

Remark 4.
1. Analogous versions of corollary 1, remark 3, and theorems 2 through 4 also hold for the framework of computations on spatiotemporal patterns considered in theorem 5.
2. If one considers a system consisting of many spatial dynamic networks that provide separate outputs for different spatial output locations, one can also produce spatiotemporal patterns in the output of such systems. Theorem 5 implies that exactly those maps A from spatiotemporal patterns to spatiotemporal patterns can be realized by such systems for which the restriction of the output of A to any fixed output location is a time-invariant filter F with fading memory.
5 Conclusion
We have analyzed the power of feedforward models of biological neural systems for computations on temporal and spatiotemporal patterns. We have identified the class of filters that can be approximated by such models and shown that this result is quite stable with regard to changes in the model. In particular, we have shown that all filters that can be approximated by Volterra series can be approximated by models consisting of a single layer of dynamic synapses and neurons. Furthermore, the class of filters that can be approximated does not change if one considers feedforward networks with arbitrarily many layers. In addition, the filters in this class are characterized by a very simple property (time invariance and fading memory) that is in general easy to check for a concrete filter. This class of filters is very rich. In fact, one might argue that any filter that is possibly useful for a function of a biological organism belongs to this class. Since we have included in our analysis the case where several temporal patterns are presented simultaneously as input to a neural system, our approach also provides a new foundation for analyzing computations that respond in a particular manner to temporal correlations in the firing activity of different pools of neurons. We show that any such computation that can be described by time-invariant filters with fading memory (which is, for example, the case for most conjectured computations involving the binding of features belonging to the same object via temporal correlations) can in principle be carried out by a feedforward neural system. So far, the analysis of the possible functional role of short-term synaptic dynamics has found the most convincing computational role for synaptic depression. Our results point to a possible computational role for the other important dynamical mechanism in biological synapses: facilitation.
We show that via facilitation, models for neural systems gain the power to approximate any filter in the very large class of linear and nonlinear filters described above. Furthermore, this possible function of facilitation does not interfere with any other computational role of synaptic depression, since we have shown that for any fixed depression mechanism, one can find parameters for the synaptic facilitation mechanisms that allow the approximation of arbitrary filters from this class. Apart from this, we have also shown that the same very rich class of linear and nonlinear filters can be approximated by models for neural systems whose synapses exhibit just depression, not facilitation. The view of neural systems as computational models for computations on temporal or spatiotemporal patterns (rather than on static vectors of numbers) is likely to have significant consequences for the analysis of learning in neural systems. It suggests that learning should be analyzed in the context of adaptive filters. While we do not contribute directly to any learning result in this article, our results identify exactly the class of filters within which such filter adaptation would take place, and thereby prepare
the ground for a closer analysis of learning in neural systems from the point of view of adaptive filtering. Finally, we point out that our "universal approximation results" for computations on temporal and spatiotemporal patterns suggest a new complexity measure and a new parameterization for nonlinear filters in this domain, which may be more appropriate in the context of biological neural systems. We show that instead of measuring the complexity of a nonlinear filter F by the degree of the Volterra polynomial or Wiener polynomial that is needed to approximate F within a given ε, one can also measure the complexity of F by the number of sigmoidal gates and dynamic synapses that are needed to approximate F within ε. Our results show that both complexity hierarchies characterize the same class of linear and nonlinear filters. However, the latter measure is more adequate in the context of neural computation, because even a single sigmoidal gate requires a high-order Volterra polynomial for a good approximation. Hence, the order of the Volterra polynomial required to approximate a given nonlinear filter F is in general a poor measure of the cost of implementing F in neural hardware. On the other hand, the alternative complexity measure for filters F that is suggested by our results is closely related to the cost of implementing F in neural hardware. In addition, our approach via formal models for dynamic networks provides a new parameterization for all filters that are approximable by Volterra series, in terms of parameters that control the architecture as well as the temporal dynamics and scale of synaptic dynamics. Such a parameterization is of particular interest for the analysis of learning (if the goal is to learn a map from spatiotemporal patterns to spatiotemporal patterns), especially since the parameters that occur in our new parameterization appear to be related to those parameters that are "plastic" in biological neural systems.
This article also prepares the ground for an investigation of the complexity of models for neural systems required for approximating specific filters that are of particular interest in this context. Preliminary computer simulation results (Natschläger, Maass, & Zador, 1999) suggest that in fact quite small instantiations of the dynamic network models considered in this article suffice to approximate specific quadratic filters. Other topics of current research are the role of noise in this context and the possible role of lateral and recurrent connections in the network (see Maass, 1999).

Acknowledgments
We thank Anthony M. Zador for bringing the problems addressed in this article to our attention. In addition, we thank Larry Abbott and two anonymous referees for helpful comments. W. M. thanks the Sloan Foundation (USA), the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153, and the NeuroCOLT project of the EC for partial
support, and the Computational Neurobiology Lab at the Salk Institute for its hospitality.

References

Abbott, L. F. (1994). Decoding neuronal firing and modelling neural networks. Quarterly Review of Biophysics, 27, 291–331.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3, 375–385.
Boyd, S., & Chua, L. O. (1985). Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Trans. on Circuits and Systems, 32, 1150.
Dasgupta, B., & Sontag, E. D. (1996). Sample complexity for learning recurrent perceptron mappings. IEEE Trans. Inform. Theory, 42, 1479–1487.
Dieudonné, J. (1969). Foundations of modern analysis. New York: Academic Press.
Dobrunz, L., & Stevens, C. (1997). Heterogeneous release probabilities in hippocampal neurons. Neuron, 18, 995–1008.
Fliess, M. (1975). Un outil algébrique: les séries formelles non commutatives. In G. Marchesini (Ed.), Mathematical systems theory (pp. 122–148). New York: Springer-Verlag.
Folland, G. B. (1984). Real analysis. New York: Wiley.
Francis, G., Grossberg, S., & Mingolla, E. (1994). Cortical dynamics of feature binding and reset: Control of visual persistence. Vision Research, 34, 1089–1104.
Gallman, P. G., & Narendra, K. S. (1976). Representations of nonlinear systems via the Stone-Weierstrass theorem. Automatica, 12, 618–622.
Georgopoulos, A. P. (1995). Reaching: Coding in motor cortex. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A. P., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gerstner, W. (1999). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 32–53). Cambridge, MA: MIT Press.
Grossberg, S. (1969). On the production and release of chemical transmitters and related topics in cellular control. J. Theor. Biol., 22, 325–364.
Grossberg, S. (1972). A neural theory of punishment and avoidance, II: Quantitative theory. Mathematical Biosciences, 15, 253–285.
Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In R. Karrer, J. Cohen, & P. Tueting (Eds.), Brain and information: Event related potentials (pp. 58–142). New York: New York Academy of Science.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hewitt, E., & Stromberg, K. (1965). Real and abstract analysis. Berlin: Springer-Verlag.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koiran, P., & Sontag, E. D. (1998). Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Math., 86, 63–79.
Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Maass, W. (in press). On the computational power of winner-take-all. Neural Computation.
Maass, W. (1999). On the role of noise and recurrent connections in neural networks with dynamic synapses. Unpublished manuscript.
Maass, W., & Natschläger, T. (in press). A model for fast analog computation based on unreliable synapses. Neural Computation.
Maass, W., & Zador, A. (1998). Dynamic stochastic synapses as computational units. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 194–200). Cambridge, MA: MIT Press.
Maass, W., & Zador, A. (1999). Computing and learning with dynamic synapses. In W. Maass & C. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.
Marmarelis, P. Z., & Marmarelis, V. Z. (1978). Analysis of physiological systems: The white-noise approach. New York: Plenum Press.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Murthy, V., Sejnowski, T., & Stevens, C. (1997). Heterogeneous release properties of visualized individual hippocampal synapses. Neuron, 18, 599–612.
Natschläger, T., Maass, W., & Zador, A. (1999). Temporal processing with dynamic synapses (Tech. Rep.). Graz: Institute for Theoretical Computer Science, Technische Universität Graz.
Palm, G. (1978). On representation and approximation of nonlinear systems. Biol. Cybernetics, 31, 119–124.
Palm, G., & Poggio, T. (1977). The Volterra representation and the Wiener expansion: Validity and pitfalls. SIAM J. Appl. Math., 33(2), 195–216.
Poggio, T., & Reichardt, W. (1980). On the representation of multi-input systems: Computational properties of polynomial algorithms. Biol. Cybernetics, 37, 167–186.
Rieke, F., Warland, D., Bialek, W., & de Ruyter van Steveninck, R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rugh, W. J. (1981). Nonlinear system theory. Baltimore: Johns Hopkins University Press.
Sandberg, I. W. (1991). Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2, 267–286.
Schetzen, M. (1980). The Volterra and Wiener theories of nonlinear systems. New York: Wiley.
Sontag, E. D. (1997). Recurrent neural networks: Some systems-theoretic aspects. In M. Karny, K. Warwick, & V. Kurkova (Eds.), Dealing with complexity: A neural network approach (pp. 1–12). Berlin: Springer-Verlag.
Sontag, E. D. (1998). A learning result for continuous-time recurrent neural networks. Systems and Control Letters, 34, 151–158.
Sussmann, H. J. (1975). Semigroup representations, bilinear approximations of input-output maps, and generalized inputs. In G. Marchesini (Ed.), Mathematical systems theory (pp. 172–192). New York: Springer-Verlag.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci., 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. Journal of Neuroscience, 17, 7926–7940.

Received October 19, 1998; accepted September 16, 1999.
NOTE
Communicated by Carl van Vreeswijk
Analytical Model for the Effects of Learning on Spike Count Distributions

Gianni Settanni
Alessandro Treves
SISSA, Programme in Neuroscience, Trieste, Italy

The spike count distribution observed when recording from a variety of neurons in many different conditions has a fairly stereotypical shape, with a single mode at zero or close to a low average count and a long, quasi-exponential tail to high counts. Such a distribution has been suggested to be the direct result of three simple facts: the firing frequency of a typical cortical neuron is close to linear in the summed input current entering the soma, above a threshold; the input current varies on several timescales, both faster and slower than the window used to count spikes; and the input distribution at any timescale can be taken to be approximately normal. The third assumption is violated by associative learning, which generates correlations between the synaptic weight vector on the dendritic tree of a neuron and the input activity vectors it is repeatedly subject to. We show analytically that for a simple feedforward model, the normal distribution of the slow components of the input current becomes the sum of two quasi-normal terms. The term important below threshold shifts with learning, while the term important above threshold does not shift but grows in width. These deviations from the standard distribution may be observable in appropriate recording experiments.

1 Spike Counts and the SCF Model
Neural Computation 12, 1773–1787 (2000) © 2000 Massachusetts Institute of Technology

The variability in the emission of action potentials by nerve cells may be characterized by several measures, one of which is the distribution of spike counts in a time window of fixed length. While other measures, such as the distribution of consecutive interspike intervals, are more direct descriptions of the irregularity in the firing, spike count distributions provide estimates of the entropy of neural codes, inasmuch as the code used by a particular neuron is thought to be expressed simply by its firing frequency. Because of this interpretation, the observation that spike count distributions often have quasi-exponential tails (Abeles, Vaadia, & Bergman, 1990; Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990) has been linked to an optimal coding principle (Levy & Baxter, 1996). If the spike count is taken to be the symbol coding for the message represented
by the input arriving at the cell at any given time, and the mean count is constrained by a fixed metabolic budget, then optimal usage of symbols occurs, in the noiseless case, when their probabilities are exponentially distributed, that is, have maximal entropy under the constraint (Shannon, 1948). It is not clear, however, how this coding principle could apply to the more meaningful situation of noisy coding and, more important, how it could be compatible with deviations from the pure exponential shape often prominent at the low-count end of the distribution, that is, the nonzero mode. An alternative suggestion (Panzeri, Booth, Wakeman, Rolls, & Treves, 1996; Treves, Panzeri, Rolls, Booth, & Wakeman, 1999) is that the stereotypical shape does not reflect any design or optimization, but rather reflects precisely the lack of any principle capable of imparting significant structure to the distribution; that is, it represents a sort of null hypothesis against which any organizing principle could be tested. The suggestion is embodied in a crude model of the variability of the input current into the soma of a typical neuron, which assumes that (1) the input current translates linearly into a spike count, above a firing threshold, of course; (2) its variability has frequency components at several different timescales, both slower and faster (hence the name SCF model) than the time window used to count spikes; and (3) each component is normally distributed. The SCF model generates a formula for the spike count distribution with three free parameters: the position of the mean current relative to firing threshold and the standard deviations of the slow and fast components of its variability. The formula was found to fit adequately spike counts recorded from monkey inferior temporal cortex neurons responding to quasi-ecological stimulation with a video of natural images, counts that could not instead be fitted by the exponential or other simple models (Treves et al., 1999).
The SCF model was then found to provide satisfactory fits in other experiments, such as with primate hippocampal neurons responding to continuously changing views of the monkey's environment (Panzeri, Rolls, Treves, Robertson, & Georges-François, 1997) or with a class of rat somatosensory neurons spontaneously active under anesthesia (Irina Erchova, pers. comm.). This raises the issue of whether it is at all possible to identify situations in which a well-defined principle or process does affect firing statistics, and in which its effects can be quantitatively demonstrated in the spike count distributions as deviations from the null hypothesis expectation. We consider here a model of associative learning of discrete input vectors by a (feedforward) network comprising a single output neuron. While the SCF model neglects correlations between afferent input patterns and synaptic weights, the learning process considered here produces precisely such correlations, and we have calculated analytically the resulting changes in spike count distributions. The possibility of observing these changes in experiments is discussed briefly at the end.
2 An SCF Model Refined by Learning
The correlations induced by associative learning could affect any frequency component in the variability of the input current. To keep the model simple, we consider only the case in which they affect solely the low-frequency components. This corresponds to the simplified scenario in which the slow variability in the input (and output) of our neuron codes for meaningful, slowly varying stimuli, while fast fluctuations reflect only noise. We thus take fast fluctuations to be normally distributed, as in the standard SCF model, and in fact do not include them, leaving them to be added up at the end, after deriving the distribution of the slow components. This should not be taken to imply that fast variability is negligible (it is not; see Treves et al., 1999), but the analytical convenience of concentrating on slow variability warrants the minor imprecision in the transparent expression obtained at the end. The single output unit in the model receives N inputs, through synaptic weights J that are modified by a learning rule that models associative plasticity. Learning is one shot, in that the weights are taken to have been modified by a single presentation of each of p input patterns. The learning rule includes balanced potentiation and depression components, so that the net average change of each weight is zero. The variance in the value of each weight is also taken to remain constant. The input patterns are uncorrelated among themselves and with the preexisting synaptic weight vector. Under these conditions, the distribution of the summed input current over all novel input pattern vectors remains the same (normal in the N → ∞ limit) after learning. What changes, and what we are going to compute, is the distribution of the input current over the p familiar input vectors, those that have been learned. The output distribution is calculated with the mean-field analysis detailed in the appendix.
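The constraints stated above on the learning step (one-shot modification, balanced potentiation and depression, zero net mean weight change) can be sketched numerically. The rule below is our hypothetical illustration of those constraints, not the rule from the appendix; the rate distribution, the learning rate gamma, and the mean-subtraction form of the update are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5000, 20
eta = rng.exponential(1.0, size=(p, N))   # p input patterns (firing rates, Hz)
J = rng.normal(0.0, 1.0 / np.sqrt(N), N)  # preexisting synaptic weight vector

# hypothetical one-shot associative rule: potentiate synapses whose input is
# above the pattern mean, depress those below it, so potentiation and
# depression balance and the net average weight change is zero
gamma = 0.1 / N                           # small learning rate (arbitrary)
dJ = gamma * (eta - eta.mean(axis=1, keepdims=True)).sum(axis=0)
J_after = J + dJ
print(dJ.mean())                          # ≈ 0 by construction
```

Mean subtraction makes the balance exact by construction; keeping the weight variance constant, as the text also requires, would additionally need a normalization step not shown here.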
For the sake of clarity, we try to keep the notation consistent with Treves (1995), where a similar calculation was reported. Storage and retrieval (S and R) here refer, respectively, to the first presentation of one of the p input patterns and to the subsequent presentation that generates the output distribution we aim for. η is the input vector during storage. It represents firing rates computed over a time window of, say, a few hundred milliseconds, and may be measured in Hz. We take each of its N components to be distributed independently,

    P(η) = ∏_i P_η(η_i),        (2.1)
according to some distribution P_η, which for consistency should itself be of the stereotypical form mentioned above. One result we find, though, is that the precise form of P_η is not critical. V is the input vector during retrieval. The stimulus that has been previously learned is taken to have been reproduced with some added gaussian noise δ (with zero mean and variance σ_δ²), followed by rectification:

    V_i = [η_i + δ_i]⁺.        (2.2)
Therefore,

    P(V | η) = ∏_i [ δ(V_i) Φ(−η_i / σ_δ) + (H(V_i) / (√(2π) σ_δ)) exp( −(V_i − η_i)² / (2σ_δ²) ) ],        (2.3)
where d (x ) is the Dirac delta function, H (x) is the Heaviside function, and W(x) is the normalized probability integral, that p is, the integral of the normal Rx distribution of unit variance: W(x) D ¡1 (dx / 2p ) exp ¡x2 / 2. The rst term in each factor of the product represents the probability that the input unit i be below threshold, and the second term that it be above threshold. The noise level sd parameterizes the variability in the ring frequency of input units, measured over a trial of, say, a few hundred milliseconds, among trials with the same stimulus. It is thus a measure of slow components in the noise, which could again be taken to be similar, like the frequency distribution, between input and output units, and it could in principle be evaluated experimentally. Note that the distributions P (g) and P (V ) need not be assumed to be identical, even if both take the general stereotypical form; the addition of the noise term d, which induces a difference between the two, could be thought to reect the altered attentional state, for example, of successive with respect to the rst presentation of a novel stimulus. Z is the output during storage resulting from the product between input and synaptic vectors followed by thresholding and rectication. Gaussian noise 2 S with zero mean and variance s2 2 is also added to the output, h iC Z D JS ¢ g C Z± C 2 S .
(2.4)
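A single storage trial (equations 2.1, 2.2, and 2.4) can be sketched numerically. The binary P_g (active level 50 Hz, sparseness a) and the 0.5 Hz noise levels anticipate the parameter choices reported in section 3; all values here are illustrative assumptions.

```python
import numpy as np

# Sketch of one storage trial of the model (equations 2.1, 2.2, and 2.4).
# Binary P_g, the 50 Hz level, and the noise values are assumptions taken
# from the parameter choices reported later in the text.
rng = np.random.default_rng(0)
N, a = 1000, 0.5
sigma_delta, sigma_eps, Z0 = 0.5, 0.5, 0.0

g = 50.0 * (rng.random(N) < a)                             # g_i drawn from P_g (eq. 2.1)
V = np.maximum(g + rng.normal(0.0, sigma_delta, N), 0.0)   # retrieval input (eq. 2.2)
J_S = rng.normal(0.0, 1.0 / np.sqrt(N), N)                 # weights: zero mean, variance 1/N
Z = max(J_S @ g + Z0 + rng.normal(0.0, sigma_eps), 0.0)    # storage output (eq. 2.4)

print(f"mean V = {V.mean():.2f} Hz, Z = {Z:.2f} Hz")
```

With these values Z is a rectified gaussian around Z₀ whose spread is set by the product J^S · g, of order √(mean g²) ≈ 35 Hz.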
The threshold (−Z₀) may lump together a bona fide current threshold, inhibitory terms due to nonselective effects of interneurons and competition among output cells, and the baseline mean value of the product J^S · g, taken to be constant. The gain of the threshold-linear transfer function has been set to 1 by rescaling synaptic weights to pure numbers of order 1/N, so that Z, ε^S, and Z₀, like g, may be measured in Hz. The output distribution during storage is
P(Z|g) = δ(Z) Φ(−(J^S · g + Z₀)/σ_ε)
       + (H(Z)/(√(2π) σ_ε)) exp(−(Z − J^S · g − Z₀)²/(2σ_ε²)),   (2.5)

[x]⁺ = x for x > 0 and 0 otherwise,
Spike Count Distributions After Learning
1777
where the rectification has been applied directly, without first adding fast fluctuations. This omission, which is carried over in the synaptic modification terms below, is deliberate; it simplifies analytically what was already a rather crude model. u is the steady component of the summed input current to the neuron during retrieval. It is convenient to calculate its distribution before adding fast noise and rectification, after which operations one would have the actual output during retrieval, U. If we take u to include slow noise with the same variance σ_ε², it will follow the conditional probability density

P(u|Z, V, g) = (1/(√(2π) σ_ε)) exp(−(u − J^R · V − U₀)²/(2σ_ε²)),   (2.6)
which we have to average appropriately in order to find the target distribution. J^S and J^R are the weight vectors during storage and retrieval. With respect to the pattern being considered, they are assumed to have random components, except for a term, present in J^R but absent in J^S, that reflects the storage of the pattern and produces the effect on the spike count distribution, during retrieval, which is the aim of the calculation. Each component J_i^S is then taken to have zero mean (any baseline value can be incorporated into the constant threshold −Z₀) and fixed variance σ_J² = 1/N (so that the scalar product J^S · g is of order 1). Apart from the relevant pattern, the components of the new vector J_i^R reflect also the storage of many other patterns that intervened between storage and retrieval. A stable regime, in which storing new patterns and gradually forgetting old ones does not alter the mean strength and variance of the vector, can be described (after Treves, 1995) by a pseudo-rotation from the random direction J^S into a new random direction J^N,

J_i^R = cos(θ) J_i^S + η √(γ/N) (g_i − ḡ)(Z − Z₀) + sin(θ) J_i^N,   (2.7)

where J_i^S is multiplied by a factor cos(θ) that reduces its relative importance with time, parameterized by θ. J_i^N, again of zero mean and variance σ_J² = 1/N, is multiplied by sin(θ), which grows with the successive learning of new patterns. The relation of θ to real time need not be detailed, except that obviously θ = 0 implies that no other pattern has modified the weights between the storage and retrieval of the one being considered, while θ = π/2 implies complete oblivion of the original weights. The associative modification term is multiplied by a normalizing factor η √(γ/N), designed to ensure that the variance of the modification term
is set to γσ_J². Since σ_J² = 1/N, one needs to adjust 1/η² to the value of the variance of the g and Z factors. In this way the plasticity γ can be taken to be the average proportion of the synaptic weight variance accounted for by a single learned pattern (see note 2): if all that J^R encodes are p patterns, with equal strength, γ = 1/p. The quantity calculated, in the large-N limit, is the distribution ⟨P(u)⟩ of the static or low-frequency component of the current into the output neuron. The brackets indicate averaging over all possible values of the weight vectors. To convert u into an output spike count, one would need to consider additional high-frequency components of the noise (as explained by Treves et al., 1999), to multiply by a gain (which, if different among patterns, cannot be taken to equal 1), and to discretize the resulting output frequency U into a number of emitted spikes. These steps are needed when analyzing experimental results, but since they could be handled in a number of alternative ways, they are beyond the scope of this article, which focuses on how ⟨P(u)⟩ differs from a normal distribution.

3 Result and Parameters
The calculation reported in the appendix yields the expression

⟨P(u)⟩ = (1/√(2π det T)) { Φ(−[b₁(u − U₀ + Z₀g) + b₀Z₀]/√b₀)
           × (1/√b₀) exp(−(u − U₀ + Z₀g)²/(2b₀ det T))
       + Φ([(b₁ + b₂g)(u − U₀) + (b₀ + 2b₁g + b₂g²)Z₀]/√(b₀ + 2b₁g + b₂g²))
           × (1/√(b₀ + 2b₁g + b₂g²)) exp(−(u − U₀)²/(2(b₀ + 2b₁g + b₂g²) det T)) },   (3.1)

where the matrix T and the matrix elements b₀, b₁, b₂ are

T = ( σ_ε² + z₀      cos(θ) w₀
      cos(θ) w₀      σ_ε² + y₀ ),   (3.2)

T⁻¹ = (  b₀   −b₁
        −b₁    b₂ ),   (3.3)
2 For consistency, a decay factor cos(θ) should express the gradual forgetting of this learned pattern, along with the others. We assume such a factor to be incorporated into the √γ factor, purely to simplify the resulting formulas.
and the averages x₀, y₀, w₀, and z₀ are defined in the appendix. Learning is now parameterized by the scalar g = η x₀ √(Nγ) (not to be confused with the input vector g), which, apart from the factor η x₀ of order unity, can be seen to measure roughly the inverse square root of the storage load, that is, of the effective number of learned patterns 1/γ divided by the number of inputs N. The expression is a sum of two terms: the first originating from cases in which the original unthresholded response f was below threshold, and the second from those with f > 0 (see note 3). Each term is a gaussian modulated by a Φ(x) factor, which suppresses the gaussian for values of u not matching the corresponding f. Thus, the first term is more important for u values below threshold, and once the current is converted into a spike count, its detailed form will be largely uninfluential. The second term is more important above threshold and is more directly observable. The partition into two terms therefore stems from the thresholding (rectification) of the response f, which is the crucial assumption, while the detailed form of each term reflects all the other ingredients of the model. In the limit of zero plasticity, γ → 0 so that also g → 0, both terms reduce to normal distributions of mean U₀ and variance b₀ det T ≡ σ_ε² + y₀, with the two prefactors adding up to unity. The model then reduces to a normal distribution of slow fluctuations, as in the standard S+F model, with their variance given by the sum of that of the "slow noise," σ_ε², and of the "signal" (y₀ can be seen as the product of the variance of the synaptic weights and that of the activity of the input units). The two terms depend on learning (i.e., on plasticity) in two simple but different ways. One should first realize what range of values is accessible to the parameter g.
The parameter γ in principle can range from 0 (no plasticity) to 1 (the entire synaptic variance is due to the storage of the single memory pattern being examined); a meaningful set of values, though, is around 1/N, since p ≃ O(N) is the memory capacity of associative nets (Rolls & Treves, 1998). Given that, besides √(γN), the other factors that determine g reduce to a number of order unity, g itself is nonnegative and can be considered to range from 0 to values of order 1. As g increases from 0, the first gaussian has its mean shifted by an amount −Z₀g, that is, toward negative values if the mean output during the learning phase, Z₀, is positive, and toward positive values in the (also common) case in which the distribution of responses to novel stimuli (ideally derived from the ⟨P(Z)⟩ distribution) corresponds to a negative mean of the slow component of the S+F current. The width of this first gaussian does not change. The modulating factor Φ contributes to making the former effect difficult to detect, as it further suppresses, for Z₀ > 0, the gaussian peak.
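Equation 3.1 can be evaluated numerically. The sketch below uses illustrative (assumed) values for b₀, b₁, b₂, U₀, and Z₀, and takes det T = 1/(b₀b₂ − b₁²), the inverse of det T⁻¹ from equation 3.3; it checks that the two terms together remain normalized, as stated in note 3.

```python
import numpy as np
from math import erf

# Numerical sketch of the two-term distribution of equation 3.1. The b0, b1,
# b2, U0, Z0 values are illustrative assumptions, not fitted parameters.
def Phi(x):
    """The W(x) factor of the text: cumulative normal of unit variance."""
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def p_u(u, g, b0=1.0, b1=0.3, b2=1.0, U0=0.0, Z0=2.0):
    detT = 1.0 / (b0 * b2 - b1 ** 2)          # det T = 1/det(T^-1), eq. 3.3
    q = b0 + 2.0 * b1 * g + b2 * g ** 2
    t1 = (Phi(-(b1 * (u - U0 + Z0 * g) + b0 * Z0) / np.sqrt(b0)) / np.sqrt(b0)
          * np.exp(-((u - U0 + Z0 * g) ** 2) / (2.0 * b0 * detT)))
    t2 = (Phi(((b1 + b2 * g) * (u - U0) + q * Z0) / np.sqrt(q)) / np.sqrt(q)
          * np.exp(-((u - U0) ** 2) / (2.0 * q * detT)))
    return (t1 + t2) / np.sqrt(2.0 * np.pi * detT)

u = np.linspace(-30.0, 30.0, 6001)
du = u[1] - u[0]
norms = {g: sum(p_u(x, g) for x in u) * du for g in (0.0, 1.5)}
print(norms)
```

At g = 0 the two terms collapse into a single gaussian of variance b₀ det T; at g = 1.5 the first term is shifted by −Z₀g and the second is broadened, while the total integral stays at 1.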
3 Integrating ⟨P(f, u)⟩ first over u and then over f, one can check that the normalization of equation 3.1 is correct.
The second gaussian has its mean unaffected by learning, while its width grows, roughly doubling in value when g reaches values of order 1 (if the b factors, which are all positive, are also of similar magnitude). The modulating prefactor has a complex dependence on g, but is in any case larger for Z₀ > 0. The distribution depends on two additional parameters: the noise on the input variables, σ_δ, and the input distribution P_g(g), which theoretically has infinite degrees of freedom. However, it depends on both in a rather mild way, only through the averages x₀, y₀, w₀, and z₀. In fact, it can be seen that for σ_δ → 0, that is, negligible input noise, the specific form of P_g becomes irrelevant, and only its first two moments matter. For the sake of simplicity, we use a binary P_g in the graphs, but any other choice gives similar results. The binary P_g reduces to a probability 1 − a for the input activity to be 0, and a probability a for it to be at the arbitrary level of 50 Hz; a parameterizes the sparseness of the activity (Treves & Rolls, 1991) in that it gives, for such a binary distribution, the fraction of active units. The graphs show the effect of increasing values of the learning parameter g for three cases that correspond to negative, zero, and positive mean levels of the output current, and thus reproduce three typical regimes of firing statistics. In each case, we set Z₀ = U₀, and equal, and very low, noise levels in the input and output, σ_ε = σ_δ = 0.5 Hz. Three values of the learning parameter are included, which correspond to √(Nγ) = 0.0, √(Nγ) = 1.45, and √(Nγ) = 5.8. The resulting g values differ in each figure, as they depend on the value of the η x₀ factor, and are reported in the legend. In any case, the three values correspond to no plasticity, intense plasticity, and very strong plasticity (√(Nγ) = 5.8 implies that even taking N ≃ 10⁴, a single pattern accounts for about 1/300 of the variance of the synaptic weights, on

Figure 1: Facing page.
The effects of learning on spike count distributions that fall largely (a) below threshold, (b) around threshold, or (c) above threshold. Each graph shows as solid curves the normal distribution P(u) for g = 0 and the distributions obtained with two different values of the learning parameter, while the dashed lines indicate the gaussian curves that best fit each distribution. (a) The mean value of the original distribution is U₀ = Z₀ = −20 Hz, and the two nonzero values of the plasticity correspond to g = 1.21 and g = 4.86. (b) The mean value is U₀ = Z₀ = 0 Hz, while the same plasticity values result in g = 1.44 and g = 5.77. (c) U₀ = Z₀ = 20 Hz, g = 1.30 and g = 5.21. The g values used in each of the three panels are somewhat different because of the different η factor, resulting from a different ⟨(Z − Z₀)²⟩ average. Other parameters: a = 0.5, θ = 0, σ_δ = σ_ε = 0.5 Hz. Since differences in the portion of the distribution below threshold are difficult to detect in practice, the most favorable situation for experimentally observing learning effects is that exemplified in panel a, with the original distribution largely below threshold and the modified distribution substantially shifted to higher spike counts.
average). The sparseness is set in each case at a = 0.5; different sparseness values change the distributions quantitatively but not the qualitative effects of learning. In the first case, Figure 1a, the original output distribution is
above threshold only with its upper tail, Z₀ = −20 Hz. The main effect is a shift of the bulk of the distribution to higher values, with its width remaining constant, while the upper tail is broadened. This situation (mode of the original distribution below threshold, that is, quasi-exponential spike count) corresponds to a fraction of cortical recordings and may be more typical of hippocampal cells (Panzeri et al., 1997). In the second case, Figure 1b, the mean output in the absence of learning is exactly at threshold, Z₀ = 0 Hz. The half distribution below threshold then stays unchanged, while the half distribution above is broadened by learning. The latter effect is the dominant one in the third case, Figure 1c, in which the original output spike count has been taken to have a peak above threshold, Z₀ = 20 Hz. In this case, only the lower tail of the distribution shifts, and in the opposite direction, to lower values below threshold (an effect difficult to detect in practice). The effects on the distribution are somewhat more complex if U₀ ≠ Z₀, and it would take several figures to describe the detailed dependence on θ, σ_δ, σ_ε, and so on. However, Figure 1 provides a useful indication of the main effects, those that could be hoped to be observed in experiments, and the low noise levels used make such effects particularly salient.

4 Can Deviations from Normality Be Observed?
Figure 1 also shows, with dashed lines, the normal distributions in u that most closely match (in the least-mean-square sense) the actual distribution for the two nonzero values of the learning parameter. Visual inspection of the figure clarifies the relative likelihood of observing the effects of learning in experiments. In experiments in which the only data are the firing statistics to presumably well-learned visual stimuli, the effects of learning have to be demonstrated as mismatches between the observed spike count and the closest underlying normal distribution of (slow) fluctuations. Figure 1 indicates that such a mismatch will be substantial only below threshold, and then more so for distributions concentrated below threshold (as the one in the example of Figure 1a). The shape of the distribution of slow fluctuations below threshold is difficult to extract from the observed spike count, particularly with the limited data available in practice. It is therefore likely that deviations from normality, if they indeed occur as an effect of learning, will be observable only in experiments designed for that purpose, in particular, involving extensive sampling (recording sessions). A thorough analysis will be needed to disentangle, among deviations from normality, if observed, those due to learning from those due to any of the many simplifying assumptions of our model. The situation is different if the effects of learning can also be gauged by the deviations of the observed spike count from a "control" spike count obtained in equivalent conditions, but with the cell stimulated with novel (nonlearned) stimuli. In that case the parameters of the normal distribution
assumed by the S+F model for g = γ = 0 can in principle be extracted from the data and used to try to fit the spike count obtained with the familiar (presumably learned) stimuli. Therefore, not only the shape of the distribution of slow fluctuations but also its estimated mean and variance will offer indications about the effects of learning: whether they are consistent with those predicted by the analysis and, if so, yield estimates of the learning parameters g and γ. Now deviations from normality due to other effects can hopefully be subtracted out. Experiments that in principle allow for such a comparison have been carried out in the laboratory of James Ringo (Ringo & Nowicka, 1996). The idea is to train a monkey to discriminate among a set of 12 to 40 visual images, which are generated as simple combinations of elementary shapes and colors, with a given algorithm. These images are used throughout training and become familiar to the monkey. During testing, the statistics of single-cell responses to such images are contrasted with the statistics of responses to a much larger set of images, each of which is novel but generated by the same algorithm. Under such conditions, any difference in the statistics can be used, in particular the mean and variance of the spike count (Ringo & Nowicka, 1996) and, after analysis with the S+F model, the estimated mean and variance of the distribution of slow fluctuations. An analysis along these lines is in progress (facing subtleties posed, as usual, by limited sampling) and will be reported elsewhere. In addition to limited sampling, the analysis of experimental recordings has to confront effects inherent in the behavior of real neurons, which have not been considered in this simple model.
For example, for experiments in which steady stimuli are presented in successive trials, each for a fixed time, in principle one has to take into account the variability in the latency of the response, adaptation in the firing, across-trial trends in the response to the same stimulus, and so on. The quantification of these effects requires a study of its own, beyond the scope of this article, but their presence should clearly be borne in mind. In conclusion, a simple (threshold-linear) model predicts simple effects of learning on the statistics of trial-to-trial fluctuations in the input current to a neuron. These effects can be summarized by the next-order-cumulant rule: below threshold, where the output, fast noise aside, is zero (i.e., constant with respect to the input current), the effect of learning is a linear increase in the first-order cumulant of the input distribution (i.e., its mean value); above threshold, where the output is roughly linear in the input, the effect is on the second-order cumulant (i.e., the variance). Experimental constraints make the validation of these model predictions not quite straightforward; but if the effects turn out to be observable, they will allow an estimate, at least as an order of magnitude, of the plasticity parameter, that is, a measure of the amount of learning stored in the synapses to a neuron.
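The next-order-cumulant rule can be illustrated with a direct Monte Carlo sketch of one-shot storage (equations 2.2, 2.4, and 2.7 with θ = 0). The parameter values, and the empirical pilot estimate of the normalizing factor η, are illustrative assumptions rather than the settings used for Figure 1.

```python
import numpy as np

# Monte Carlo sketch of the next-order-cumulant rule: store one pattern
# (eq. 2.7, theta = 0), then measure u = J_R . V + Z0 across realizations.
# All parameter values are illustrative assumptions.
rng = np.random.default_rng(1)
N, a, sd, se = 500, 0.5, 0.5, 0.5

def sample_storage(Z0):
    g = 50.0 * (rng.random(N) < a)
    J_S = rng.normal(0.0, 1.0 / np.sqrt(N), N)
    Z = max(J_S @ g + Z0 + rng.normal(0.0, se), 0.0)   # storage output (eq. 2.4)
    return g, J_S, Z

def run(Z0, gammaN, trials=1500):
    # pilot estimate of eta = 1 / std((g_i - gbar)(Z - Z0)), cf. the text
    prods = np.concatenate([(g - g.mean()) * (Z - Z0)
                            for g, _, Z in (sample_storage(Z0) for _ in range(100))])
    eta = 1.0 / prods.std()
    us = np.empty(trials)
    for t in range(trials):
        g, J_S, Z = sample_storage(Z0)
        J_R = J_S + eta * np.sqrt(gammaN) / N * (g - g.mean()) * (Z - Z0)  # eq. 2.7
        V = np.maximum(g + rng.normal(0.0, sd, N), 0.0)                   # eq. 2.2
        us[t] = J_R @ V + Z0 + rng.normal(0.0, se)
    return us

below_ctrl, below_learn = run(-20.0, 0.0), run(-20.0, 2.0)
above_ctrl, above_learn = run(20.0, 0.0), run(20.0, 2.0)
print(f"below threshold: mean {below_ctrl.mean():.1f} -> {below_learn.mean():.1f} Hz")
print(f"above threshold: std  {above_ctrl.std():.1f} -> {above_learn.std():.1f} Hz")
```

In the below-threshold case (Z₀ = −20 Hz) learning mainly shifts the mean of u upward (the first cumulant); in the above-threshold case (Z₀ = 20 Hz) it mainly broadens the distribution (the second cumulant), in line with the rule stated above.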
Appendix
The quantity we have evaluated, ⟨P(u)⟩, is the average across all possible values of the synaptic weight vector of the distribution P(u), which is itself already integrated over all values of g, V, and Z. It can thus be written

⟨P(u)⟩ = ∫ ∏_i (dJ_i^S/(√(2π) σ_J)) e^{−(J_i^S)²/(2σ_J²)} ∏_i (dJ_i^N/(√(2π) σ_J)) e^{−(J_i^N)²/(2σ_J²)}
       × ∫ dV dZ dg P(u|Z, V, g) P(V|g) P(Z|g) P(g).   (A.1)
To proceed, one has just to insert the conditional probabilities from equations 2.3, 2.5, and 2.6 and carry out a succession of integrals. It is convenient to write P(V|g) and P(Z|g) as pure gaussian integrals over the dummy variables v_i and f, of which V_i and Z are the rectified parts (e.g., Z = f H(f)). Moreover, the normal distributions of the variables f and u around their mean values can be written as gaussian integrals in the noise terms ε^S and ε^R. One can then integrate over the synaptic weights, obtaining

⟨P(u)⟩ = ∫_{−∞}^{+∞} (dε^R/(2πσ_ε²)) exp[ iε^R(u − U₀)/σ_ε² − (ε^R)²/(2σ_ε²) ]
       × ∫_{−∞}^{+∞} (dε^S df/(2πσ_ε²)) exp[ iε^S(f − Z₀)/σ_ε² − (ε^S)²/(2σ_ε²) ] ∏_i ∫ dg_i P_g(g_i) ∫ (dδ_i dv_i/(2πσ_δ²))
       × exp{ i Σ_i δ_i(v_i − g_i)/σ_δ² + i(ε^R η/σ_ε²) √(γ/N) [f H(f) − Z₀] Σ_i (g_i − ḡ) v_i H(v_i) }
       × exp{ −Σ_i [ δ_i²/(2σ_δ²) + σ_J²(g_i ε^S + cos(θ) v_i H(v_i) ε^R)²/(2σ_ε⁴) + σ_J²(sin(θ) v_i H(v_i) ε^R)²/(2σ_ε⁴) ] }.   (A.2)
One may then introduce, in order to separate the integration variables, the average input parameters

x = (1/N) Σ_i (g_i − ḡ) V_i,   (A.3)
y = (1/N) Σ_i V_i²,   (A.4)
w = (1/N) Σ_i g_i V_i,   (A.5)
z = (1/N) Σ_i g_i²,   (A.6)
which are also integrated over, and constrained to take the above values by delta functions defined in terms of the conjugated parameters x̂, ŷ, ŵ, and ẑ. The intermediate formula becomes

⟨P(u)⟩ = ∫ (N dx dx̂/2π) ∫ (N dy dŷ/2π) ∫ (N dw dŵ/2π) ∫ (N dz dẑ/2π) ∫_{−∞}^{+∞} (dε^R/(2πσ_ε²)) ∫_{−∞}^{+∞} (dε^S df/(2πσ_ε²)) exp(NF)
       × exp{ −[(ε^R)² + (ε^S)²]/(2σ_ε²) − (σ_J² N/(2σ_ε⁴))[ z(ε^S)² + 2w cos(θ) ε^S ε^R + y(ε^R)² ] }
       × exp{ iε^R(u − U₀)/σ_ε² + i(ε^R η/σ_ε²) √(γN) [f H(f) − Z₀] x + iε^S(f − Z₀)/σ_ε² },   (A.7)

where

F = i(x x̂ + y ŷ + w ŵ + z ẑ) + ln ∫ dg P_g(g) ∫ (dδ dv/(2πσ_δ²)) exp[ −δ²/(2σ_δ²) + iδ(v − g)/σ_δ² − i x̂ (g − ḡ) v H(v) − i ŷ v² H(v) − i ŵ g v H(v) − i ẑ g² ].   (A.8)
Having set in the model σ_J² N = 1, it is possible to calculate the integrals in the N → ∞ limit with the saddle-point method, which amounts to using the formula

lim_{N→∞} ∫ dⁿx g(x) exp(NF(x)) ≈ g(x_max) exp[NF(x_max)] (2π/N)^{n/2} (1/√(−det H[F(x_max)])),   (A.9)
where x is a shorthand for the n = 8-dimensional vector (x, y, w, z, x̂, ŷ, ŵ, ẑ), which maximizes the exponent F at x_max, and H[F(x_max)] is the Hessian of F at the maximum. One finds the maximum of F at

x̂₀ = ŷ₀ = ŵ₀ = ẑ₀ = 0,   (A.10)
x₀ = ∫ dg P_g(g) (g − ḡ) [ g Φ(g/σ_δ) + (σ_δ/√(2π)) exp(−g²/(2σ_δ²)) ],   (A.11)
y₀ = ∫ dg P_g(g) [ (g² + σ_δ²) Φ(g/σ_δ) + (g σ_δ/√(2π)) exp(−g²/(2σ_δ²)) ],   (A.12)
w₀ = ∫ dg P_g(g) g [ g Φ(g/σ_δ) + (σ_δ/√(2π)) exp(−g²/(2σ_δ²)) ],   (A.13)
z₀ = ∫ dg P_g(g) g²,   (A.14)

in which Φ is, as above, the normal distribution function. At the maximum,

F(x₀, y₀, w₀, z₀, x̂₀, ŷ₀, ŵ₀, ẑ₀) = 0,   (A.15)
det H[F(x₀, y₀, w₀, z₀, x̂₀, ŷ₀, ŵ₀, ẑ₀)] = −1,   (A.16)
so we are left with

⟨P(u)⟩ ≈ g(x₀, y₀, w₀, z₀, x̂₀ = 0, ŷ₀ = 0, ŵ₀ = 0, ẑ₀ = 0)
       = ∫_{−∞}^{+∞} (dε^R/(2πσ_ε²)) ∫_{−∞}^{+∞} (dε^S df/(2πσ_ε²))
       × exp{ iε^R(u − U₀)/σ_ε² + i(ε^R η/σ_ε²) √(γN) [f H(f) − Z₀] x₀ + iε^S(f − Z₀)/σ_ε² }
       × exp{ −[(ε^R)² + (ε^S)²]/(2σ_ε²) − (σ_J² N/(2σ_ε⁴))[ z₀(ε^S)² + 2w₀ cos(θ) ε^S ε^R + y₀(ε^R)² ] }.   (A.17)

The above expression is but a gaussian integral in the variables ε^S and ε^R; one has to carry out the final integral in df to obtain the expression in the text.
Acknowledgments
We are grateful to James Ringo, Stefano Panzeri, and Andrea Benucci for discussions of our results. This work was in partial fulfillment of requirements for an M.Sc. thesis (G.S.) and was supported in part by CNR, INFM, and HFSP.

References

Abeles, M., Vaadia, E., & Bergman, H. (1990). Firing patterns of single units in the prefrontal cortex and neural network models. Network, 1, 13–25.
Barnes, C. A., McNaughton, B. L., Mizumori, S. J. Y., Leonard, B. W., & Lin, L.-H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. In J. Storm-Mathisen, J. Zimmer, & O. P. Ottersen (Eds.), Understanding the brain through the hippocampus (pp. 287–300). Amsterdam: Elsevier.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Comp., 8, 531–543.
Panzeri, S., Booth, M., Wakeman, E. A., Rolls, E. T., & Treves, A. (1996). Do firing rate distributions reflect anything beyond just chance? Society for Neuroscience Abstracts, 22, 1124.
Panzeri, S., Rolls, E. T., Treves, A., Robertson, R. G., & Georges-François, P. (1997). Efficient encoding by the firing of hippocampal spatial view cells. Society for Neuroscience Abstracts, 23, 195.4.
Ringo, J. L., & Nowicka, A. (1996). Long term memory for visual images in the hippocampus and inferotemporal cortex. Society for Neuroscience Abstracts, 22, 281.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Labs. Tech. J., 27, 379–423.
Treves, A. (1995). Quantitative estimate of the information relayed by the Schaffer collaterals. J. Comp. Neurosci., 2, 259–272.
Treves, A., Panzeri, S., Rolls, E. T., Booth, M. C. A., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Comp., 11, 611–641.
Treves, A., & Rolls, E. T. (1991). What determines the capacity of autoassociative memories in the brain? Network, 2, 371–397.

Received November 2, 1998; accepted June 8, 1999.
LETTER
Communicated by Fabrizio Gabbiani
Calculation of Interspike Intervals for Integrate-and-Fire Neurons with Poisson Distribution of Synaptic Inputs

A. N. Burkitt
G. M. Clark
Bionic Ear Institute, East Melbourne, Victoria 3002, Australia

We present a new technique for calculating the interspike intervals of integrate-and-fire neurons. There are two new components to this technique. First, the probability density of the summed potential is calculated by integrating over the distribution of arrival times of the afferent postsynaptic potentials (PSPs), rather than using conventional stochastic differential equation techniques. A general formulation of this technique is given in terms of the probability distribution of the inputs and the time course of the postsynaptic response. The expressions are evaluated in the gaussian approximation, which gives results that become more accurate for large numbers of small-amplitude PSPs. Second, the probability density of output spikes, which are generated when the potential reaches threshold, is given in terms of an integral involving a conditional probability density. This expression is a generalization of the renewal equation, but it holds for both leaky neurons and situations in which there is no time-translational invariance. The conditional probability density of the potential is calculated using the same technique of integrating over the distribution of arrival times of the afferent PSPs. For inputs with a Poisson distribution, the known analytic solutions for both the perfect integrator model and the Stein model (which incorporates membrane potential leakage) in the diffusion limit are obtained. The interspike interval distribution may also be calculated numerically for models that incorporate both membrane potential leakage and a finite rise time of the postsynaptic response.
Plots of the relationship between input and output firing rates, as well as the coefficient of variation, are given, and inputs with varying rates and amplitudes, including inhibitory inputs, are analyzed. The results indicate that neurons functioning near their critical threshold, where the inputs are just sufficient to cause firing, display a large variability in their spike timings.

1 Introduction
The stream of action potentials (spikes) generated by a neuron, which is the means by which neurons communicate, plays an important role in neural information processing. The generation of an action potential can be described

Neural Computation 12, 1789–1820 (2000) © 2000 Massachusetts Institute of Technology
1790
A. N. Burkitt and G. M. Clark
in terms of the membrane potential of the neuron reaching a threshold value, as in the integrate-and-fire model. This model, which has both a long history (Lapicque, 1907) and wide application (Tuckwell, 1988a), has recently been compared with the Hodgkin-Huxley model for the case where both models receive a stochastic input current, and it was found that the threshold model provides a good approximation (Kistler, Gerstner, & van Hemmen, 1997). One of the most universal characteristics of neuronal behavior is the apparent stochastic nature of the timing of action potentials (Tuckwell, 1988b, 1989). The traditional view has long been that the mean rate of firing of a neuron provides an adequate description of the information it conveys (Adrian, 1928). There are, however, a number of instances within the nervous system where the temporal information contained in the timing of individual spikes plays an important role, such as the encoding of sound frequency information in the auditory pathway, where spikes become locked to the phase of the incoming sound wave (see Theunissen & Miller, 1995, and Carr & Friedman, 1999, for reviews of temporal coding). There has also been considerable interest in studies of cross-correlational activity of neurons in the visual cortex, which provide data suggesting a functional role for the synchronization of neural activity, as reviewed by Engel, König, Kreiter, Schillen, and Singer (1992) and Singer (1993). This has led to a recent reappraisal of the information that can be contained in the timing of neuronal spikes (Bialek & Rieke, 1992), and a variety of mathematical models of networks of spiking neurons have been proposed (Abeles, 1982a; Judd & Aihara, 1993; Gerstner, 1995) in which the timing of individual spikes encodes information (Hopfield, 1995; Gabbiani & Koch, 1996; Maass, 1996a, 1996b).
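The threshold-crossing mechanism described above can be sketched with a minimal discrete-time simulation of a leaky integrate-and-fire (Stein-type) neuron driven by Poisson inputs. All parameter values are illustrative assumptions, chosen so that the mean steady-state potential, a · rate_in · τ, sits exactly at the threshold.

```python
import numpy as np

# Minimal discrete-time sketch of a leaky integrate-and-fire neuron with
# Poisson excitatory inputs: leak with time constant tau, instantaneous PSPs
# of amplitude a, spike and reset at threshold. Parameter values are
# illustrative assumptions, not those used in the letter.
rng = np.random.default_rng(0)
dt, tau, theta = 0.1, 10.0, 1.0       # time step (ms), membrane tau (ms), threshold
a, rate_in, T = 0.05, 2.0, 50000.0    # PSP amplitude, input rate (1/ms), total time (ms)

n_steps = int(T / dt)
psp_counts = rng.poisson(rate_in * dt, n_steps)   # Poisson arrivals per time step
decay = np.exp(-dt / tau)

v, t_last, isis = 0.0, 0.0, []
for step in range(n_steps):
    v = v * decay + a * psp_counts[step]          # leak, then add incoming PSPs
    if v >= theta:                                # threshold crossing: spike
        t = step * dt
        isis.append(t - t_last)
        t_last, v = t, 0.0                        # record the ISI and reset

isis = np.array(isis)
cv = isis.std() / isis.mean()
print(f"{isis.size} spikes, mean ISI = {isis.mean():.1f} ms, CV = {cv:.2f}")
```

With these near-critical settings the simulated CV is large, in line with the conclusion quoted in the abstract; removing the leak (decay = 1) recovers the perfect integrator of Gerstein and Mandelbrot (1964).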
Studies suggest that the integrate-and-fire mechanism is unable to account for the irregularity observed in the interspike intervals (ISIs) in spike trains from neurons in the visual cortex of behaving monkeys (Softky & Koch, 1992, 1993). One possible explanation is that the neurons act as coincidence detectors (Abeles, 1982b), which may be made possible with active dendritic conductances (Softky, 1994). However, subsequent studies (Shadlen & Newsome, 1994, 1998) suggest that the observed irregularity may be accounted for by the effect of inhibitory postsynaptic potentials (IPSPs) and that the timing of neuronal spikes conveys little, if any, information. The resolution of such issues requires the development and analysis of neuronal models that are sufficiently complex to capture the essential features of the neuronal processing, including stochastic input, while still being mathematically tractable. The earliest solution of the integrate-and-fire model that incorporated stochastic activity was to model the incoming postsynaptic potentials (PSPs) as a random walk (Gerstein & Mandelbrot, 1964). Subsequent developments have largely built on this diffusion approach using stochastic differential equations and the Ornstein-Uhlenbeck process (OUP). Stein formulated the integrate-and-fire model with stochastic input to include the decay of the membrane potential (Stein, 1965), and a number of other authors subsequently have investigated the model using both
Calculation of Interspike Intervals
1791
stochastic differential equations and numerical techniques (Tuckwell, 1977; Wilbur & Rinzel, 1982; Lánský, 1984). These techniques have been used to examine the role of inhibition in the Stein model (Tuckwell, 1978, 1979; Cope & Tuckwell, 1979; Tuckwell & Cope, 1980) and a number of other features, such as the effect of reversal potentials (Tuckwell, 1979; Hanson & Tuckwell, 1983; Musila & Lánský, 1994). We present here an alternative formulation of the stochastic integrate-and-fire model that allows the analysis of arbitrary synaptic response functions (which describe the time course of the incoming PSP, including a finite rise time and subsequent decay). The first step is the calculation of the probability density of the summed potential, which is given by integrating over the arrival times of the incoming PSPs. The probability density function at any time is calculated in the gaussian approximation (i.e., expanding in powers of the PSP amplitude, a, and neglecting terms of order higher than quadratic in the expansion), and it is found to be a gaussian distribution, with the (time-dependent) mean and width of the distribution depending on the details of the distribution of PSP arrival times. In the case of a Poisson distribution of PSP arrival times, our results reproduce the known expression for the probability density of the membrane potential in the diffusion limit. In order to calculate the probability density of output spikes, it is necessary to know when the potential reaches threshold for the first time. This first-passage density is usually related to the probability density of the potential by a renewal equation (Tuckwell, 1988b), a procedure that works only when there is no decay of the membrane potential (the perfect integrator model).
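The first step of the technique, the gaussian approximation for the summed potential, can be illustrated for Poisson arrivals and an exponential postsynaptic response, for which the mean and variance follow from Campbell's theorem; the parameter values below are illustrative assumptions.

```python
import numpy as np

# Sketch of the gaussian approximation for the summed potential: for Poisson
# PSP arrivals of rate lam and an exponential response a*exp(-t/tau),
# Campbell's theorem gives
#   E[V(t)]   = a*lam*tau*(1 - exp(-t/tau)),
#   Var[V(t)] = (a^2*lam*tau/2)*(1 - exp(-2*t/tau)),
# which we cross-check by direct simulation. Values are assumptions.
rng = np.random.default_rng(0)
a, lam, tau, t = 0.05, 2.0, 10.0, 30.0   # amplitude, rate (1/ms), tau (ms), time (ms)

mean_th = a * lam * tau * (1.0 - np.exp(-t / tau))
var_th = a * a * lam * tau / 2.0 * (1.0 - np.exp(-2.0 * t / tau))

vals = np.empty(20000)
for k in range(vals.size):
    n = rng.poisson(lam * t)                       # number of PSPs in (0, t)
    s = rng.uniform(0.0, t, n)                     # their arrival times
    vals[k] = np.sum(a * np.exp(-(t - s) / tau))   # summed potential at time t
print(f"mean: theory {mean_th:.3f}, sim {vals.mean():.3f}")
print(f"var:  theory {var_th:.4f}, sim {vals.var():.4f}")
```

For many small-amplitude PSPs the distribution of this sum is close to gaussian with exactly these moments, which is the regime in which the approximation described above becomes accurate.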
However, it is possible to formulate a more general integral relationship involving the probability density of output spikes and a conditional probability density of the potential (Plesser & Tanaka, 1997; Burkitt & Clark, 1999a). We show how this conditional probability density can also be calculated using the technique of integrating over the distribution of arrival times of the afferent PSPs. In addition to reproducing the results of the perfect integrator and Stein models, this method allows us to investigate models that incorporate not only the membrane potential leakage but also a finite rise time of the postsynaptic response. Numerical solutions of the spike output distribution (i.e., the interspike interval distribution) of these models are given. We have also applied this technique to the study of the synchronization problem, in which the relationship between a number of approximately synchronized inputs (with a spread of arrival times giving the temporal jitter of the inputs) and the spread in time of the output spike distribution is examined (Burkitt & Clark, 1999a). The analysis supports previous studies (Bernander, Koch, & Usher, 1994; Diesmann, Gewaltig, & Aertsen, 1996; Maršálek, Koch, & Maunsell, 1997) showing that the ratio of the output jitter to the input jitter is consistently less than one and that it decreases for increasing numbers of inputs. Moreover, we identified the variation in the spike-generating thresholds of the neurons and the variation in the number of active inputs as being important factors that determine the timing jitter in layered networks, in addition to the previously identified factors of axonal propagation times and synaptic jitter. That study of the synchronization problem is complementary to the present one, since the problems addressed represent two extreme cases of the temporal information that the inputs contain: in the synchronization problem, all of the input arrival times are related, whereas for a Poisson distribution of PSP inputs, the arrival times are completely independent. Details of the technique are presented in section 2, and it is then applied to various neural models in section 3. The result for the perfect integrator (Gerstein & Mandelbrot, 1964), in which the decay of the membrane potential over time is neglected, is reproduced in section 3.1. The solution for the moments of the firing distribution in the Stein model (Gluss, 1967) is reproduced in section 3.2, and plots of the firing probability, mean interspike interval, width of the interspike interval distribution, and coefficient of variation are given. We also plot the output rate of firing as a function of the input rate for a number of thresholds. More general synaptic response functions, which include not only the decay of the membrane potential but also its rise time, are considered in section 3.3. The effect of a distribution of PSP amplitudes and rates is examined in section 3.4, and the effect of including inhibitory PSPs is examined in section 3.5. The final section contains a discussion of the results and the conclusions.

2 Technique for Calculating Output Spike Density
We calculate the interspike interval for an integrate-and-fire neuron with N afferent fibers. The time of arrival of the inputs on the kth fiber is modeled as a renewal process, and the results in sections 2.1 and 2.2 hold for any renewal process, that is, the successive time intervals between inputs are independent and identically distributed. The times of arrival of the inputs are described by the probability density function, which we denote by p_k(t); this is the probability density of the interspike interval distribution on the kth fiber. In section 2.3 we examine the case in which the arrival times form a
Poisson process with constant rate λ_k. However, the technique outlined here is applicable to probability densities other than the Poisson distribution, which is useful when refractory effects are considered and for nonstationary processes. The membrane potential V(t) is assumed to be reset to its initial value, V(0) = v_0, at time t = 0, after an action potential (AP) has been generated. The potential is the sum of the excitatory and inhibitory postsynaptic potentials (EPSPs and IPSPs, respectively),

V(t) = v_0 + \sum_{k=1}^{N} a_k \sum_{m=1}^{\infty} u_k(t - t_{km}),    t \ge 0,    (2.1)
where the index k = 1, . . . , N denotes the afferent fiber and the index m denotes the mth input from the particular fiber, whose time of arrival is t_{km} (0 < t_{k1} < t_{k2} < · · · < t_{km} < · · ·). The amplitude of the inputs from fiber k is a_k (positive for EPSPs and negative for IPSPs), and the time course of an input at the site of spike generation is described by the synaptic response function u_k(t), which is zero for t < 0 and positive for t ≥ 0 with a peak value normalized to one (the definitions for the three cases examined here are given in equations 3.1, 3.6, and 3.17). In section 3 we examine a number of synaptic response functions corresponding to various neural models, including the perfect integrator model and the Stein model (also called the shot noise model). More general models of the synaptic response function enable us to model the time course of the incoming PSPs, taking into account both the leakage of the potential across the membrane and the effect of propagation along the dendritic tree. The various versions of the integrate-and-fire neural model have as their key components a passive integration of the synaptic inputs, a voltage threshold for spike generation, and a lack of the specific currents that underlie spiking. The complex mechanisms by which the sodium and potassium conductances cause action potentials to be generated are not part of the model. An important assumption is also that the synaptic inputs interact only in a linear manner, so that phenomena involving nonlinear interaction of synaptic inputs, such as synaptic currents that depend on postsynaptic currents and pulse-paired facilitation and depression (which require higher-order terms in u_k(t)), are neglected. A recent review of the integrate-and-fire neural model, giving details of the various variations of the model and comparisons with both experimental data and other models, is given in Koch (1999).
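As a concrete illustration of equation 2.1, the sketch below (not from the original paper; all parameter values are arbitrary) sums a synaptic response function over Poisson arrival times on N fibers, using the exponential kernel of the Stein model (equation 3.6) as the example u(t):

```python
import numpy as np

def u(t, tau=1.0):
    """Synaptic response function: zero for t < 0, peak normalized to one.
    The exponential form is the Stein-model choice (equation 3.6)."""
    return np.where(t >= 0.0, np.exp(-t / tau), 0.0)

def potential(t, arrival_times, amplitudes, v0=0.0):
    """V(t) = v0 + sum_k a_k sum_m u(t - t_km), equation 2.1."""
    v = v0
    for a_k, times_k in zip(amplitudes, arrival_times):
        v += a_k * np.sum(u(t - np.asarray(times_k)))
    return v

rng = np.random.default_rng(0)
N, lam, a = 8, 1.0, 0.1
# Poisson arrival times on each fiber: cumulative sums of exponential gaps.
arrivals = [np.cumsum(rng.exponential(1.0 / lam, size=50)) for _ in range(N)]
print(potential(5.0, arrivals, [a] * N))
```

Since all arrival times are strictly positive, the potential at t = 0 is exactly v_0, as required by the reset condition.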
In the results that follow, we express the relationship between the inputs and the threshold in terms of the threshold ratio R, which is defined as

R = \frac{\theta}{N a} = \frac{V_{th} - v_0}{N a},    (2.2)

where N is the number of afferent fibers that contribute EPSPs, a is the amplitude of the individual PSPs (taken here to all be equal in magnitude), and θ is the difference between the threshold potential, V_th, and the resting potential, v_0. We choose the units of voltage to be set by the difference between the threshold and reset values, θ = V_th − v_0 = 1. The utility of this parameterization of the threshold will become apparent when we consider how the results scale with increasing numbers of inputs, N, and decreasing amplitudes, a, in section 3.

2.1 The Probability Density of the Potential. The probability density of the potential having the value v at time t is evaluated by considering the fraction of inputs for which this holds. This is given by the integral over the
probability distribution of the inputs,

p(v, t | v_0) = \prod_{k=1}^{N} \left\{ \int_0^{\infty} dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^{\infty} dt_{k2}\, p_k(t_{k2} - t_{k1}) \int_{t_{k2}}^{\infty} dt_{k3}\, p_k(t_{k3} - t_{k2}) \cdots \right\} \delta(V(t) - v),    (2.3)
where p(v, t | v_0) is the probability that the potential, V(t), takes the value v at time t, given the initial condition V(0) = v_0. This initial condition corresponds to the reset of the membrane potential to v_0 immediately after the previous spike, which is assumed to occur at time t = 0. The dots indicate the sequence of integrals, each corresponding to the successive arrival of inputs on the particular fiber (note that the integrals are normalized), and δ(x) is the Dirac delta function. Since the contributions from the afferent fibers are independent, the above probability density may be written as

p(v, t | v_0) = \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\{ix(v - v_0)\} \prod_{k=1}^{N} F_k(x, t)
             = \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\left\{ ix(v - v_0) + \sum_{k=1}^{N} \ln F_k(x, t) \right\},    (2.4)

where the Fourier representation of the Dirac delta function is used:

\delta(z) = \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\{-ixz\}.    (2.5)
The function F_k(x, t) is given by

F_k(x, t) = \left\{ \int_0^{\infty} dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^{\infty} dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\} \exp\left\{ -ix a_k \sum_{m=1}^{\infty} u_k(t - t_{km}) \right\}.    (2.6)
The exponential in the above expression for Fk (x, t) can be expanded in terms of the individual amplitudes, ak , and only the linear and quadratic terms are retained. A formal solution to the resulting equations is given (we do not address such questions as the convergence of the expansion). However, in order to obtain a handle on the accuracy of this approximation, the results are compared in the following sections with those obtained by
numerical simulations. The expression for F_k(x, t) gives

\ln F_k(x, t) = \ln\left\{ 1 - ix a_k D_k(t) - \frac{x^2 a_k^2}{2} E_k(t) + O(a_k^3) \right\}
             = -ix a_k D_k(t) - \frac{x^2 a_k^2}{2} \left( E_k(t) - D_k^2(t) \right) + O(a_k^3),    (2.7)
where O(y) denotes a remainder term that is smaller than C × |y| when y lies in a neighborhood of 0, for some positive constant C. The functions D_k(t) and E_k(t) are given by

D_k(t) = \left\{ \int_0^{\infty} dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^{\infty} dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\} \sum_{m=1}^{\infty} u_k(t - t_{km})

E_k(t) = \left\{ \int_0^{\infty} dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^{\infty} dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\} \sum_{m,m'=1}^{\infty} u_k(t - t_{km})\, u_k(t - t_{km'}).    (2.8)
The expressions for D_k(t) and E_k(t) may be solved using Laplace transforms, as shown in appendix A. Neglecting the O(a_k^3) terms in equation 2.7 and retaining only the linear and quadratic terms (the gaussian approximation), the probability density function can then be evaluated as

p(v, t | v_0) = \int_{-\infty}^{\infty} \frac{dx}{2\pi} \exp\left\{ ix\left( v - v_0 - \sum_{k=1}^{N} a_k D_k(t) \right) - \frac{x^2}{2} \sum_{k=1}^{N} a_k^2 \left( E_k(t) - D_k^2(t) \right) \right\}
             = \frac{1}{\sqrt{2\pi C(t)}} \exp\left\{ -\frac{(v - v_0 - U(t))^2}{2 C(t)} \right\},    (2.9)

with

U(t) = \sum_{k=1}^{N} a_k D_k(t),    C(t) = \sum_{k=1}^{N} a_k^2 \left( E_k(t) - D_k^2(t) \right).    (2.10)
In the case of the Poisson process, the expressions for U(t) and C(t) take particularly simple forms, as will be shown in section 2.3. (We assume here
that C(t) > 0, which is indeed the case for Poisson inputs, as shown later in equation 2.23.) The gaussian approximation for F_k(x, t), equation 2.7 (i.e., keeping only terms to second order in the amplitude a_k of the individual postsynaptic contributions), is an approximation that is expected to work best for large values of N and small values of the amplitude (voltage is measured in units of the difference θ between the threshold, V_th, and reset, v_0, potentials). We assume that the amplitude, a_k, scales with the number of inputs, N (i.e., that the amplitude is of order 1/N), and that U(t) is therefore of the same order as θ, thus giving a reasonable output rate of spikes. Consequently the threshold ratio, equation 2.2, is a useful quantity with which to compare neurons that sum different numbers of inputs of various amplitudes. The results in the following sections do indeed support this assumption. The question of how many inputs and how small the individual synaptic amplitudes must be in order for the analysis to provide an accurate evaluation of the neural response will be addressed in subsequent sections, where we compare the results with those obtained from numerical simulations over a range of numbers of inputs and relative amplitudes. However, the situation in which a large number of small-amplitude postsynaptic potentials is required for the firing threshold to be reached is one that occurs in many neurons (Abeles, 1991; Douglas & Martin, 1991). We consider here the case in which the amplitudes of the contributions from the inputs are equal in magnitude, |a_k| = a, the rates on each of the incoming fibers are equal, λ_k = λ, and all of the inputs are excitatory. It is straightforward to generalize the results to any distribution of amplitudes, as discussed in section 3.4, since the amplitude dependence of the functions U(t), C(t) is given explicitly in equation 2.10 (note that D_k(t), E_k(t) are independent of the amplitude).
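The gaussian approximation can be checked numerically. The sketch below (an illustration, not part of the original analysis) simulates the freely evolving potential for Poisson inputs with the exponential Stein kernel and compares the sample mean and variance of V(t) with the closed forms for U(t) and C(t) that follow, for that kernel, from the Poisson-input results of sections 2.3 and 3.2; all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam, tau, a, t = 100, 1.0, 1.0, 0.02, 2.0

def sample_potential():
    # Superposition of the N fibers is a Poisson process of rate N*lam;
    # given the event count, arrival times are uniform on [0, t].
    n_events = rng.poisson(N * lam * t)
    t_m = rng.uniform(0.0, t, size=n_events)
    return a * np.sum(np.exp(-(t - t_m) / tau))   # exponential kernel u(t)

v = np.array([sample_potential() for _ in range(20000)])

# Closed forms for the Stein kernel (equations 2.23 and 3.7):
U = N * a * lam * tau * (1.0 - np.exp(-t / tau))
C = 0.5 * N * a**2 * lam * tau * (1.0 - np.exp(-2.0 * t / tau))
print(v.mean(), U, v.var(), C)
```

The agreement of the first two moments is exact for any filtered Poisson process; the gaussian approximation concerns the shape of the distribution, which improves as N grows and a shrinks.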
In section 3.5 the generalization to the case in which there are both excitatory and inhibitory inputs is considered. Since there is no interaction between the contributions of different incoming fibers in this model, it is possible to consider having different synaptic response functions on different afferent fibers. This may enable us to model inputs coming from fibers that are situated on the dendritic tree at different distances from the site (typically the axon hillock) at which the summed potential generates the output spike. However, we consider here only the case in which the synaptic response function from each input is the same, u_k(t) = u(t). Consequently all input fibers and their characteristics are equivalent, and therefore the index k is dropped in the analysis of the following section.

2.2 First-Passage Time to Threshold. A spike is generated when the summed membrane potential reaches the threshold, V_th (V_th = v_0 + θ), for the first time. The probability density of output spikes, which is the interspike interval (ISI) density, is therefore equivalent to the probability density of the first time that the potential, V(t), reaches the threshold, which
is called the first-passage time to threshold density, denoted by f_θ(t). This may be obtained from the integral equation (Plesser & Tanaka, 1997; Burkitt & Clark, 1999a),

p(v, t | v_0) = \int_0^t dt'\, f_\theta(t')\, \hat{p}(v, t | V_{th}, t', v_0),    v \ge V_{th},    (2.11)
where \hat{p}(v_2, t_2 | v_1, t_1, v_0) is an approximation to the conditional probability density of the potential. The conditional probability density, p(v_2, t_2 | v_1, t_1, v_0), is defined as the probability that the potential has the value v_2 at time t_2, given that V(t_1) = v_1 and that the reset value of the potential after a spike is v_0. Our approximation to the conditional probability density, \hat{p}(v_2, t_2 | v_1, t_1, v_0), has the same definition, with the added restriction that the potential is subthreshold in the time interval 0 ≤ t < t'. If the fluctuations in the input currents are not temporally correlated, then the approximate expression is exact, as is the case for both the perfect integrator model and the Stein model with Poisson-distributed inputs. However, the two expressions will differ if there are temporal correlations, which occur if the presynaptic spikes have correlations or the synaptic currents have a finite duration. The integral expression for f_θ(t) given above is analogous to the renewal equation (Tuckwell, 1988b),

p(v, t | v_0) = \int_0^t dt'\, f_\theta(t')\, p(v, t - t' | V_{th}).    (2.12)
This renewal equation is a special case of equation 2.11 that is valid only for the perfect integrator (i.e., nonleaky) neural model; it arises because for the perfect integrator the conditional probability density is related to the original probability density by p(v, t | V_th, t', v_0) = p(v, t − t' | V_th) for v ≥ V_th (in more general neural models, there is no such spatial and temporal homogeneity). There is also no assumption about the stationarity of the conditional probability density, \hat{p}(v, t | V_th, t', v_0), in equation 2.11; it does not have any time-translational invariance. The conditional probability density takes account of possible multiple crossings of the threshold between times t' and t. The conditional probability density is given by Bayes' theorem,

p(v_2, t_2 | v_1, t_1, v_0) = \frac{p(v_2, t_2, v_1, t_1 | v_0)}{p(v_1, t_1 | v_0)},    (2.13)
and the joint probability density p (v2 , t2 , v1 , t1 | v0 ), which is the probability that V (t) takes the value v1 at time t1 and takes the value v2 at time t2 (given that the reset value of the potential after a spike is generated is again at v0 ),
may be evaluated in a similar way to the probability density, equations 2.3 and 2.4:

p(v_2, t_2, v_1, t_1 | v_0) = \prod_{k=1}^{N} \left\{ \int_0^{\infty} dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^{\infty} dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\} \delta(V(t_1) - v_1)\, \delta(V(t_2) - v_2)
= \int_{-\infty}^{\infty} \frac{dx_2}{2\pi} \int_{-\infty}^{\infty} \frac{dx_1}{2\pi} \exp\{ ix_2(v_2 - v_0) + ix_1(v_1 - v_0) + N \ln F(x_2, x_1, t_2, t_1) \}.    (2.14)
The function \ln F(x_2, x_1, t_2, t_1) is expanded in the amplitude a of the synaptic response function as before (see equation 2.7) and contains cross-terms in x_2 x_1:

\ln F(x_2, x_1, t_2, t_1) = \ln\left\{ 1 - ix_2 a D(t_2) - ix_1 a D(t_1) - \frac{x_2^2}{2} a^2 E(t_2) - \frac{x_1^2}{2} a^2 E(t_1) - x_2 x_1 a^2 G(t_2, t_1) + O(a^3) \right\}
= -ix_2 a D(t_2) - ix_1 a D(t_1) - \frac{x_2^2}{2} a^2 \left( E(t_2) - D^2(t_2) \right) - \frac{x_1^2}{2} a^2 \left( E(t_1) - D^2(t_1) \right) - x_2 x_1 a^2 \left( G(t_2, t_1) - D(t_2) D(t_1) \right) + O(a^3),    (2.15)
where D(t), E(t) are given by equation 2.8 (the index k is dropped, since all incoming fibers are treated identically) and G(t_2, t_1) is given by

G(t_2, t_1) = \left\{ \int_0^{\infty} dt'_1\, p(t'_1) \int_{t'_1}^{\infty} dt'_2\, p(t'_2 - t'_1) \cdots \right\} \sum_{m=1}^{\infty} u(t_1 - t'_m) \sum_{m'=1}^{\infty} u(t_2 - t'_{m'}).    (2.16)
This expression for G(t_2, t_1) may also be solved in terms of Laplace transforms (see appendix B). Once again we retain only the linear and quadratic terms in the amplitude, a, in equation 2.14. In order to evaluate the integrals in equation 2.14, we make a change of variable y_1 = x_1 + κ(t_2, t_1) x_2, where κ(t_2, t_1) is chosen in order to eliminate the cross-term x_2 y_1:

\kappa(t_2, t_1) = \frac{N a^2 \left[ G(t_2, t_1) - D(t_2) D(t_1) \right]}{C(t_1)}.    (2.17)
The y_1 and x_2 integrals are now independent and may be evaluated. The y_1 integral yields exactly p(v_1, t_1 | v_0), and the x_2 integral gives the conditional probability density,

p(v_2, t_2 | v_1, t_1, v_0) = \frac{1}{\sqrt{2\pi c(t_2, t_1)}} \exp\left\{ -\frac{\left[ v_2 - v_0 - U(t_2) - \kappa(t_2, t_1)(v_1 - v_0 - U(t_1)) \right]^2}{2 c(t_2, t_1)} \right\},    (2.18)

where

c(t_2, t_1) = C(t_2) - \kappa^2(t_2, t_1)\, C(t_1).    (2.19)
The value of c(t_2, t_1) is guaranteed to be positive (for t_2 > t_1) for Poisson-distributed inputs (see the end of section 2.3). The first-passage-time density is evaluated using equation 2.11. For inputs described by a Poisson process, the equation may be solved explicitly for two neural models: the perfect integrator and the Stein model. In the case of the perfect integrator, the method reproduces the known result of the inverse gaussian density (Tuckwell, 1988b), as discussed in section 3.1. In the case of the Stein model, this technique reproduces the known solution (Gluss, 1967), in which the moments of the first-passage-time density are expressed in terms of the Laplace transforms of the probability density and the conditional probability density (section 3.2). However, apart from these special cases, equation 2.11 has not been solved analytically for f_θ(t). We solve equation 2.11 for f_θ(t) numerically using standard methods for Volterra integral equations (Press, Flannery, Teukolsky, & Vetterling, 1992; Plesser & Tanaka, 1997). The moments of this distribution are then straightforward to evaluate. The zeroth moment of the distribution is the integral of the probability density itself and gives the probability ρ of a spike being produced. The first moment, t_f, is the average time of the first threshold crossing and hence the mean ISI. The second moment, σ_out, is the spread of the distribution in time of the output spikes (i.e., the width of the ISI distribution). Although the ISI distributions are frequently characterized by their moments, the numerical results show clearly that they are somewhat skewed, resembling more the generalized inverse gaussian distribution (Iyengar & Liao, 1997).
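A minimal sketch of such a numerical solution, specialized to the Stein model (where the closed forms of section 3.2 make the conditional density exact), is given below. It discretizes equation 2.11 with a midpoint rule, which keeps the kernel finite as t' → t; the scheme, step size, and parameter values (R = 0.5, N = 64, λ = τ = θ = 1) are illustrative choices, not those of the original paper:

```python
import numpy as np

N_f, lam, tau, theta = 64, 1.0, 1.0, 1.0
a = theta / (N_f * 0.5)          # amplitude from threshold ratio R = theta/(N a) = 0.5

def U(t):
    return N_f * a * lam * tau * (1.0 - np.exp(-t / tau))

def C(t):
    return 0.5 * N_f * a**2 * lam * tau * (1.0 - np.exp(-2.0 * t / tau))

def p_density(v, t, v0=0.0):
    """Gaussian density of the free potential (equation 2.9)."""
    c = C(t)
    return np.exp(-(v - v0 - U(t))**2 / (2.0 * c)) / np.sqrt(2.0 * np.pi * c)

def q_density(dt, v2, v1, v0=0.0):
    """Stationary conditional density of the Stein model (equation 3.9)."""
    c = C(dt)
    kappa = np.exp(-dt / tau)
    mean = kappa * v1 + (1.0 - kappa) * v0 + U(dt)
    return np.exp(-(v2 - mean)**2 / (2.0 * c)) / np.sqrt(2.0 * np.pi * c)

dt, T = 0.01, 10.0
n = int(T / dt)
mid = (np.arange(n) + 0.5) * dt       # f_theta is evaluated at interval midpoints
f = np.zeros(n)
for i in range(1, n + 1):             # forward substitution for the Volterra equation
    t_i = i * dt
    ker = q_density(t_i - mid[:i], theta, theta)
    s = np.dot(f[:i - 1], ker[:i - 1]) * dt
    f[i - 1] = (p_density(theta, t_i) - s) / (dt * ker[i - 1])

rho = np.sum(f) * dt                  # zeroth moment: firing probability
t_f = np.sum(mid * f) * dt / rho      # first moment: mean ISI
print(rho, t_f)
```

With R = 0.5 (below the critical threshold ratio discussed in section 3.2), the firing probability should come out close to one.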
2.3 Poisson Process. In a renewal process, the interarrival times of the presynaptic inputs are independent and identically distributed with an arbitrary distribution. The Poisson process is a particular case of a renewal process in which the distribution is exponential; the probability density of the first input on fiber k is

p_k(t_k) = \lambda_k e^{-\lambda_k t_k},    (2.20)
where t_k is the time since the previous input and λ_k is the constant rate of input on the kth fiber. Likewise, the probability density of the mth input on fiber k is

p_k(t_{km} - t_{k(m-1)}) = \lambda_k e^{-\lambda_k (t_{km} - t_{k(m-1)})},    (2.21)
where t_{k(m−1)} is the time of the preceding input (i.e., the (m−1)th input). The Laplace transform of p_k(t) is

p_{k,L}(s) = \frac{\lambda_k}{s + \lambda_k},    (2.22)
and D_{k,L}(s), E_{k,L}(s), equations A.1 and A.2, may be evaluated to yield the functions U(t) and C(t), equation 2.10,

U(t) = \sum_k a_k \lambda_k \int_0^t dt'\, u(t') = N a \lambda \int_0^t dt'\, u(t'),
C(t) = \sum_k a_k^2 \lambda_k \int_0^t dt'\, u^2(t') = N a^2 \lambda \int_0^t dt'\, u^2(t'),    (2.23)

where the second equality in each line holds for a_k = a and λ_k = λ.
For a Poisson process, these expressions may also be arrived at by noting that a superposition of Poisson processes gives a Poisson process (with a rate that is the sum of the contributing rates) and by integrating the synaptic response function to obtain the average and variance of the depolarization (Stein, 1967; Abeles, 1982b). The conditional probability density, p(v_2, t_2 | v_1, t_1, v_0), may be obtained in closed analytic form for any synaptic response function, u(t), since the expression for κ(t_2, t_1) follows from equation 2.17 and the solution of G(t_2, t_1) in appendix B, equation B.1:

\kappa(t_2, t_1) = \frac{N a^2 \lambda \int_0^{t_1} dt'\, u(t')\, u(\Delta + t')}{C(t_1)},    \Delta = t_2 - t_1 \ge 0.    (2.24)
Note that the rate, λ, given here is the average rate on each afferent fiber, and therefore the total rate of inputs is Nλ. For a Poisson process, κ(t_2, t_1) and C(t_1) are related to the autocovariance of the sequence of Poisson inputs, \mathcal{C}(t_2, t_1), by the relationship (Papoulis, 1991),

\kappa(t_2, t_1) = \frac{\mathcal{C}(t_2, t_1)}{\mathcal{C}(t_1, t_1)},    C(t_1) = \mathcal{C}(t_1, t_1),    (2.25)
which ensures that the expression for c(t_2, t_1), equation 2.19, is positive definite.

3 Neural Models

3.1 Perfect Integrator Model. The perfect integrator model (also called the leakless integrate-and-fire model) is the simplest of the neural models to analyze, since there is no decay of the potential with time. Although this is clearly an unrealistic assumption, it provides a reasonable approximation for situations in which the integration occurs over a much shorter timescale than the decay constant. The synaptic response function for the perfect integrator simply takes the form

u(t) = \begin{cases} 1 & \text{for } t \ge 0 \\ 0 & \text{for } t < 0. \end{cases}    (3.1)
The probability density of the potential for this model is well known (Tuckwell, 1988b) and may be evaluated as a Wiener process with drift (Gerstein & Mandelbrot, 1964). The resulting expressions are identical to those obtained using the above technique. The expressions for U(t) and C(t), for the case in which all the inputs are excitatory and the input rate on each fiber is identical, follow from equation 2.23:

U(t) = N a \lambda t,    C(t) = N a^2 \lambda t.    (3.2)
Moreover, the expressions for κ(t_2, t_1) and c(t_2, t_1) are

\kappa(t_2, t_1) = 1,    c(t_2, t_1) = C(t_2) - C(t_1) = N a^2 \lambda (t_2 - t_1),    (3.3)

and therefore the conditional probability density is related to the probability density in this particular case by

p(v_2, t_2 | v_1, t_1, v_0) = p(v_2, t_2 - t_1 | v_1),    v_2 > v_1.    (3.4)
Equation 2.11 is therefore a genuine renewal equation, as we would expect for the perfect integrator. The density of the output spike distribution is
given by standard Laplace transform techniques as the inverse gaussian density (Gerstein & Mandelbrot, 1964),

f_\theta(t) = \frac{\theta}{\sqrt{2\pi N a^2 \lambda t^3}} \exp\left\{ -\frac{(\theta - N a \lambda t)^2}{2 N a^2 \lambda t} \right\},    (3.5)
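A quick simulation check of this result (illustrative, with arbitrary parameter values): for equal jump sizes a, the pooled input is a Poisson process of rate Nλ, and the perfect-integrator potential first reaches θ after exactly ⌈θ/a⌉ events, so the first-passage time is Gamma-distributed with mean θ/(Naλ), matching the mean of the inverse gaussian density in the diffusion limit:

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam, theta = 64, 1.0, 1.0
a = theta / 64.0                       # so exactly 64 pooled events are needed
m = int(np.ceil(theta / a))

# First-passage time = sum of m exponential gaps of the pooled process (rate N*lam).
trials = 5000
fpt = rng.exponential(1.0 / (N * lam), size=(trials, m)).sum(axis=1)
print(fpt.mean(), theta / (N * a * lam))
```

For finite a the hitting-time distribution is exactly Gamma; the inverse gaussian form of equation 3.5 is its diffusion-limit counterpart (a → 0, N → ∞ with Na fixed).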
where the threshold is given by V_th = v_0 + θ.

3.2 Solution of the Stein Model. The passive decay of the membrane potential may be incorporated in the model by a synaptic response function of the form (Stein, 1965),

u(t) = \begin{cases} e^{-t/\tau} & \text{for } t \ge 0 \\ 0 & \text{for } t < 0, \end{cases}    (3.6)
where τ is the time constant of the membrane potential decay. This form of the integrate-and-fire model with stochastic inputs was first studied by Stein (1965) and is known as the Stein model or shot-noise threshold model. For a Poisson process in the case in which all the inputs are excitatory (inhibitory inputs are considered in section 3.5) and the input rate on each fiber is identical, it is straightforward, using equations 2.19, 2.23, and 2.24, to calculate the functions

U(t) = N a \lambda \tau \left( 1 - e^{-t/\tau} \right)
C(t) = \tfrac{1}{2} N a^2 \lambda \tau \left( 1 - e^{-2t/\tau} \right)
\kappa(t_2, t_1) = e^{-(t_2 - t_1)/\tau}
c(t_2, t_1) = \tfrac{1}{2} N a^2 \lambda \tau \left( 1 - e^{-2(t_2 - t_1)/\tau} \right) = C(t_2 - t_1).    (3.7)
Thus for the Stein model the conditional probability density, equation 2.18, possesses temporal homogeneity (i.e., it is stationary),

p(v_2, t_2 | v_1, t_1, v_0) \equiv q(t_2 - t_1;\, v_2, v_1, v_0),    (3.8)

where the function q(t; v_2, v_1, v_0), which is introduced here to make clearer the equivalence with the known result (Gluss, 1967), is

q(t; v_2, v_1, v_0) = \frac{1}{\sqrt{2\pi C(t)}} \exp\left\{ -\frac{\left[ v_2 - v_1 e^{-t/\tau} - v_0 \left( 1 - e^{-t/\tau} \right) - U(t) \right]^2}{2 C(t)} \right\}.    (3.9)
The equation for the probability density of the output spike distribution, equation 2.11, is therefore a convolution equation that allows a solution in terms of Laplace transforms,

f_{\theta L}(s) = \frac{p_L(v, s | v_0)}{q_L(s; v, V_{th}, v_0)} = \frac{q_L(s; v, v_0, v_0)}{q_L(s; v, V_{th}, v_0)},    (3.10)

where we have used the following straightforward general identity between the probability density and conditional probability density,

p(v, t | v_0) = p(v, t | v_0, 0, v_0) = q(t; v, v_0, v_0).    (3.11)
This result for f_{θL}(s), equation 3.10, valid for v ≥ V_th, agrees exactly with the known result obtained for f_{θL}(s) in the case v = V_th in the diffusion limit (Gluss, 1967). The first three moments of the probability density f_θ(t) may be obtained from the Laplace transform by the relationships

\int_0^{\infty} dt\, f_\theta(t) = f_{\theta L}(0)
\int_0^{\infty} dt\, t\, f_\theta(t) = -\left. \frac{d f_{\theta L}(s)}{ds} \right|_{s=0}
\int_0^{\infty} dt\, t^2 f_\theta(t) = \left. \frac{d^2 f_{\theta L}(s)}{ds^2} \right|_{s=0}.    (3.12)
Denoting derivatives with respect to s by a prime and taking the value of each of the functions at s = 0, and with values of v ≥ V_th, we obtain solutions for ρ, the probability of a spike's being generated in a finite time interval; t_f, the mean ISI; and σ_out, the standard deviation of the output distribution of spikes (the width of the ISI distribution):

\rho = \int_0^{\infty} dt\, f_\theta(t) = \frac{p_L}{q_L}
t_f = \frac{1}{\rho} \int_0^{\infty} dt\, t\, f_\theta(t) = \frac{q'_L}{q_L} - \frac{p'_L}{p_L}    (3.13)
\sigma_{out}^2 = \frac{1}{\rho} \int_0^{\infty} dt\, (t - t_f)^2 f_\theta(t) = \frac{p''_L}{p_L} - \frac{q''_L}{q_L} + \frac{q'^2_L}{q_L^2} - \frac{p'^2_L}{p_L^2},

where the primes denote derivatives with respect to s and

p_L \equiv p_L(v, 0 | v_0),    q_L \equiv q_L(0; v, V_{th}, v_0).    (3.14)

The evaluation of the expressions for p'_L, q'_L, p''_L, and q''_L as numerical integrals follows from the standard properties of Laplace transforms, as in equation 3.12.
While it is perfectly possible to evaluate the parameters ρ, t_f, and σ_out numerically using the above Laplace transform methods, we have used direct numerical methods for Volterra integral equations (Press et al., 1992; Plesser & Tanaka, 1997) to evaluate the first-passage-time density f_θ(t) from its definition in equation 2.11 using equation 2.19. This equation has been solved for some typical parameter values in order to illustrate the nature of the solutions and compare them with results in subsequent sections, where we include synaptic response functions with finite rise times (section 3.3) and inhibition (section 3.5). A typical plot of the firing probability, ρ (see equation 3.13), as a function of the threshold ratio, R (see equation 2.2), is given in Figure 1 for various numbers of afferent fibers, N. The input rate of EPSPs on each of the incoming fibers is λ_in = 1.0, and time is measured in units of the membrane time constant, τ = 1.0 (rates are given as spikes per unit of τ). The curves represent interpolations between solutions that were evaluated at intervals of 0.01 on the R-axis. The steepest curve corresponds to the largest number of inputs (N = 1024) and indicates an abrupt change between a firing regime at low threshold ratios and a quiescent (nonfiring) regime at high threshold ratios. In contrast, where there are fewer inputs (smaller N values), this transition between firing and nonfiring regimes extends over a broader range of threshold ratios, R. We define the value of the threshold ratio at which this transition from firing to nonfiring behavior takes place as the critical threshold ratio, R_crit. The criterion used to define the critical threshold ratio in this plot was that the probability of firing, ρ, equal 0.05, a value chosen to signify that the probability had risen significantly above zero (in principle, the criterion could be set at any intermediate value between zero and one).
The dependence of the critical threshold ratio on the rate of inputs, λ_in, is shown in Figure 2, where it is plotted for N = 16, 32, 64, 128, 256, 512, and 1024 inputs, and the results for equal input rates, λ_in, are connected by lines. The results indicate that higher rates of input will generate an output spike over a larger range of threshold ratios, as expected. Moreover, the plots in Figure 2 show that the critical threshold ratio, R_crit, decreases only slightly as the number of inputs, N, increases. Consequently a neuron with a small number of large-amplitude inputs has a similar critical threshold ratio to one with a large number of small-amplitude inputs. In the large-N limit, the behavior becomes deterministic as the variance of the inputs decreases (lim_{N→∞} C(t) = 0), and consequently a spike output is produced when the mean input reaches the threshold. The critical threshold ratio is therefore given by the condition lim_{N→∞} U(t→∞) = θ, which gives a value of lim_{N→∞} R_crit = τλ_in (from the first expression in equation 3.7), consistent with the finite-N results in Figure 2 (for which τ = 1). The mean ISI, t_f (equation 3.13), is plotted in Figure 3 for a number of input rates, λ_in (the lines represent interpolations between closely spaced individual numerical solutions). The solid lines are the results for N =
Figure 1: Plots of the probability of firing, ρ (equation 3.13), versus threshold ratio, R (equation 2.2), for the Stein model. The rate of incoming EPSPs is λ_in = 1.0 on each afferent fiber, with time in units of the membrane time constant, τ (i.e., rates given as spikes per unit of τ), and number of inputs N = 1024 (solid line), 256 (dotted line), 64 (dash-dotted line), and 16 (dashed line).
1024 inputs, which indicate that for all threshold ratios, R, the mean ISI, t_f, decreases as the input rate, λ_in, increases. Likewise, lower threshold ratios, R, give shorter mean ISIs, t_f, for any fixed input rate, λ_in. The results have only a very slight dependence on the number of inputs, as illustrated by the dash-dotted line for N = 16 inputs at λ_in = 1.5, which is almost indistinguishable from the N = 1024 results at the same input rate. The dotted lines give the values of the threshold ratio, R, where the rate of the incoming PSPs, λ_in, equals the rate of the output spikes, λ_out = 1/t_f, that they generate. In general we expect that in the normal operating regime of a network of such neurons, there would indeed be some approximate equality between the input and output rates, in order for the neural information to be propagated to further stages of processing. The relationship between the input rate, λ_in, and the output rate, λ_out, is illustrated in Figure 4 (joined by solid lines) for the Stein model with N = 1024 inputs and three values of the threshold ratio, R (the dashed lines connect results for the synaptic response function discussed in section 3.3). This relationship for the Stein model is substantially linear, with lower threshold ratios, R, resulting in higher output rates, λ_out. Note that refractory effects, which result in a saturation of the output rate, have not been included in this plot.
1806
A. N. Burkitt and G. M. Clark
Figure 2: Dependence of the critical threshold ratio, R_crit, on the number of inputs, N, and the rate of inputs, λ_in (curves for λ_in = 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, and 5.0), for the Stein model (with units τ = 1.0). R_crit was defined as the threshold ratio, R, above which the firing probability, r, fell below 0.05.
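The operational definition of R_crit used here (the value of R at which the firing probability drops below 0.05) can be located by bisection for any monotone firing-probability curve. The sigmoidal curve below is purely hypothetical, standing in for the analytical r of equation 3.13:

```python
import math

def critical_threshold(firing_prob, r_lo=0.0, r_hi=2.0, tol=1e-6):
    """Bisection for R_crit: the threshold ratio R above which
    firing_prob(R) falls below 0.05.  Assumes firing_prob decreases
    monotonically from near 1 to near 0 on [r_lo, r_hi]."""
    while r_hi - r_lo > tol:
        mid = 0.5 * (r_lo + r_hi)
        if firing_prob(mid) >= 0.05:
            r_lo = mid          # still firing: R_crit lies to the right
        else:
            r_hi = mid          # below 0.05: R_crit lies to the left
    return 0.5 * (r_lo + r_hi)

# Hypothetical sigmoidal firing-probability curve; a steeper fall-off
# mimics the sharper transition seen for larger N.
r_crit = critical_threshold(lambda R: 1.0 / (1.0 + math.exp(40.0 * (R - 1.0))))
```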
The width of the ISI distribution, σ_out (equation 3.13), as a function of the input rate, λ_in, is illustrated in Figure 5 for a threshold ratio value of R = 0.5, input rates in the range 0.5 ≤ λ_in ≤ 2.0, and various numbers of inputs, N. Also plotted is the coefficient of variation of the ISI distribution,

CV_ISI = σ_out / t_f = σ_out λ_out.   (3.15)
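The quantities in equation 3.15 are straightforward to estimate from a train of ISIs; a minimal sketch using sample estimators (the paper itself works with the analytical moments):

```python
import random
import statistics

def isi_statistics(isis):
    """Sample estimates of the quantities in equation 3.15: mean ISI t_f,
    width sigma_out, and coefficient of variation CV = sigma_out / t_f
    (equivalently sigma_out * lambda_out, with lambda_out = 1 / t_f)."""
    t_f = statistics.fmean(isis)
    sigma_out = statistics.pstdev(isis)
    return t_f, sigma_out, sigma_out / t_f

# Sanity checks: a perfectly regular train has CV = 0, while a Poisson
# train (exponentially distributed ISIs) has CV near 1.
_, _, cv_regular = isi_statistics([0.5] * 1000)
random.seed(1)
_, _, cv_poisson = isi_statistics([random.expovariate(2.0) for _ in range(20000)])
```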
The lower value of λ_in on each of the lines in this plot is determined by the critical threshold ratio R_crit. The error bars joined by dotted lines represent the results, with their statistical errors, of numerical simulations involving 10,000 trials for values of N = 16, 32, and 64. A stream of Poisson-distributed inputs was generated using a pseudorandom number generator, and the error bars, which represent the statistical error, are only barely discernible in this plot at large values of λ_in, since the agreement between simulation and analytical results is so close. Near the critical threshold, the error bars become larger, and there is a small difference for the smaller values of N in the results for CV between the simulations and the analytical results given here. These results indicate that the width of the output spike distribution, σ_out, decreases with both increasing number of inputs, N, and increasing rate of inputs, λ_in. The coefficient of variation behaves in much the same
Figure 3: Dependence of the mean ISI, t_f, on the threshold ratio, R, for the Stein model for N = 1024 (solid lines) with various input rates λ_in (0.5, 0.8, 1, 1.5, 2, and 5). Also plotted are results for N = 16 with λ_in = 1.5 (dash-dotted line). The dotted lines indicate the values of the threshold ratio, R, at which the input rate, λ_in, is equal to the output rate, λ_out. Time is measured in units of the membrane time constant, τ (i.e., rates are given as spikes per unit of τ).
way for values of the input rate above the critical threshold; that is, CV decreases as N increases in this domain. However, in the neighborhood of the critical threshold (i.e., λ_in < 0.5), the coefficient of variation increases substantially. These results indicate that there are two quite different regimes: (1) above the critical threshold, τλ_in > R_crit, and (2) near the critical threshold, τλ_in ≈ R_crit. In the region substantially above the critical threshold, the coefficient of variation decreases with increasing N in the way observed in Softky and Koch (1993), which led them to conclude that there was a conflict between the values of CV observed experimentally and the results of the integrate-and-fire model. However, in the region near the critical threshold, the coefficient of variation is substantially larger and well within the range of experimental observations (Softky & Koch, 1992; Shadlen & Newsome, 1998). Softky and Koch (1992) analyzed recordings from the primary visual cortex and extrastriate cortex of awake, behaving macaque monkeys and found values of CV in the range 0.5–1.0. Shadlen and Newsome (1998) found somewhat higher values of CV, in the range 1.0–1.5, in their recordings from the visual cortex of alert, behaving rhesus monkeys. However, Gur, Beylin, and Snodderly (1997) found that a large component of the response vari-
Figure 4: Mean output rate, λ_out, of spikes generated as a function of the input rate, λ_in, of afferent PSPs for N = 1024 inputs. The solid lines are the results for the Stein model with the three threshold ratios given (R = 0.5, 0.8, and 1.0). Also plotted in dashed lines are the results for the alpha model (equation 3.17) with the two threshold ratios given (R = 0.2 and 0.5). Time is measured in units of the membrane time constant, τ (i.e., rates are given as spikes per unit of τ).
ability in alert monkeys was caused by fixational eye movements, and that when this was controlled for, the resulting values of CV were found to be in the range 0.1–0.5, with a systematic effect of stimulus strength (the variance was independent of the stimulus strength, whereas the output spike rate increased as the stimulus strength increased, resulting in lower values of CV). Shadlen and Newsome (1998) simulated three models for the integration of synaptic input, which in the units used here correspond to: (1) membrane time constant τ = 20 ms, no inhibition, excitatory inputs with a rate λ_in = 50 spikes/sec = 1.0 spikes/τ, N = 300 inputs, and amplitude a = 1/150, giving a threshold ratio of R = 0.5 and an output rate of λ_out = 50 spikes/sec = 1.0 spikes/τ; (2) membrane time constant τ = 1 ms, no inhibition, excitatory inputs with a rate λ_in = 50 spikes/sec = 0.05 spikes/τ, N = 300 inputs, and amplitude a = 1/16, giving a threshold ratio of R = 16/300 ≈ 0.05 and an output rate of λ_out = 50 spikes/sec = 0.05 spikes/τ; and (3) membrane time constant τ = 20 ms, balanced excitatory and inhibitory inputs with input rate λ_in = 50 spikes/sec = 1.0 spikes/τ, a lower barrier to the potential at the reset potential, N = 300 inputs, and amplitude a = 1/15, giving a threshold ratio
of R = 0.05 and an output rate of λ_out = 50 spikes/sec = 1.0 spikes/τ. Their results for case (1) correspond to the value of the threshold ratio, R = 0.5, plotted in Figure 5, and the results plotted in this figure are consistent with their results and extend them to a whole range of input rates and numbers of inputs (their results are for λ_in = 1.0 spikes/τ, N = 300). These results and their implications are discussed further when we look at the effect of including inhibitory inputs in section 3.5.

3.3 Postsynaptic Response Functions with Finite Rise Times. Both the perfect integrator model and the Stein model analyzed above have discontinuous voltage trajectories in which an arriving EPSP causes an instantaneous jump of amplitude a in the voltage. There are a number of models of the time course of EPSPs that have a smoothly varying voltage similar to that observed in intracellular recordings (Rhode & Smith, 1986; Paolini, Clark, & Burkitt, 1997). One such model is that in which the synaptic input current has a time course given by the alpha function (Jack, Noble, & Tsien, 1985),
I(t) = k t e^{−αt},  α > 0,   (3.16)

which corresponds to delivering a total charge of k/α² to the cell. The synaptic response function u(t) of this model (which we call the alpha model) may be evaluated by integrating the voltage in the equivalent circuit representation of the integrate-and-fire neural model. The resulting expression, given u(0) = 0, is (Tuckwell, 1988a),

u(t) = (k e^{−t/τ} / (BC)) [ t e^{Bt} − (e^{Bt} − 1)/B ],   B ≠ 0, t ≥ 0,
u(t) = (k t² / (2C)) e^{−t/τ},   B = 0, t ≥ 0,   (3.17)
u(t) = 0,   t < 0,
where τ = RC and B = 1/τ − α. The Stein model may be recovered in the limit α → ∞. Models such as this, which have a finite rise time, provide an approximation of the postsynaptic potential at the soma (or specifically at the site at which the action potential is generated, typically the axon hillock) that incorporates the time course of the diffusion of the current along the dendritic tree (Tuckwell, 1988a). The equation for the density of output spikes, f_θ(t) (see equation 2.11), may be solved directly to obtain r (the probability of a spike's being produced), t_f (the mean ISI), and σ_out (the width of the ISI distribution), as described in section 2.2. The effect of using this more general synaptic response function, which incorporates a finite rise time of the EPSP, is illustrated in Figure 4, where the output rate of spikes generated is plotted as a function
Figure 5: Analytical results for the Stein model using equations 2.11 and 3.9 with threshold ratio R = 0.5 and number of inputs N = 16, 32, 64, 128, 256, 512, 1024 (lines plotted in order from top to bottom), plotted as a function of the input rate, λ_in. (a) Output jitter, σ_out. (b) Coefficient of variation, equation 3.15. The error bars joined by dotted lines represent the results of numerical simulations involving 10,000 trials.
of the incoming EPSPs (only excitatory inputs are included here). The solid lines show the results of the Stein model at three threshold ratios, and the dashed lines are the results for the above model with α = 5 at the threshold ratios R = 0.2 and 0.5. The results indicate that the alpha model produces spikes at a lower rate than the Stein model for the same threshold ratio and that the input-output relationship differs substantially from the linear behavior evident for the Stein model. The width of the output spike distribution, σ_out, was also found to increase relative to that for the Stein model at the same threshold ratio. The value of α = 5 chosen here is somewhat on the low side of physiologically realistic values, but it serves to illustrate the effect of including a finite rise time of the EPSP.

3.4 Distribution of PSP Amplitudes. The expressions for the mean and width of the probability density function, equation 2.10, are written in a form that makes their dependence on the distribution of PSP amplitudes transparent, since the functions D_k(t) and E_k(t) are independent of the amplitude, a_k. In the discussion so far, we have considered only the case where the amplitudes are positive (i.e., excitatory PSPs) and equal, a_k = a. The amplitude of the PSPs, however, will in general depend on their location on the dendritic tree, quantal fluctuations, and other variables (Softky & Koch, 1993). However, it is straightforward, using equation 2.10, to consider the effect of various distributions of amplitudes. In particular, if we consider a distribution of amplitudes, P(a), with mean ā and standard deviation σ_a,
ā = ∫_0^∞ a P(a) da,   σ_a² = ∫_0^∞ (a − ā)² P(a) da,   (3.18)
then the expressions for U(t) and C(t), equation 2.23, which determine the probability density of the potential, become (Stein, 1967)

U(t) = N ā λ ∫_0^t dt′ u(t′),   C(t) = N (ā² + σ_a²) λ ∫_0^t dt′ u²(t′).   (3.19)
3.5 Including Inhibition. IPSPs may be included as a special case of the previous section in which the inhibitory terms are negative, a_k < 0. In the simplest situation, there are N_E afferent fibers contributing EPSPs of amplitude a_E and N_I afferent fibers contributing IPSPs of amplitude a_I:

a_k = a_E > 0 for k = 1, ..., N_E,
a_k = −a_I < 0 for k = N_E + 1, ..., (N_E + N_I).   (3.20)
The functions U(t) and C(t), equation 2.10, then take the form

U(t) = N_E a_E D_E(t) − N_I a_I D_I(t),
C(t) = N_E a_E² (E_E(t) − D_E²(t)) + N_I a_I² (E_I(t) − D_I²(t)),   (3.21)
where D_{E,I} and E_{E,I} are given by equation 2.8 for the excitatory and inhibitory inputs, respectively, which may have differing synaptic response functions u_E(t) and u_I(t). The effect of including inhibitory inputs is illustrated in Figure 6 for the Stein model, where the lines join the results for particular numbers of inhibitory inputs. The first plot shows the relationship between the input and output rates at a threshold ratio of R = 0.5 for the case of N_E = 64 excitatory inputs and N_I = 0, 16, 32, and 48 inhibitory inputs, both of the same magnitude, a_E = a_I. The threshold ratio is defined, as before, just in terms of the excitatory inputs, R = θ/(N_E a_E) (equation 2.2). The plot indicates that the inhibitory inputs lower the rate of outputs. The second plot shows the coefficient of variation, equation 3.15, as a function of the input rate. The error bars joined by the dotted lines for the N_I = 0 and 16 cases represent the results and associated statistical errors of numerical simulations of Poisson-distributed inputs involving 10,000 trials. It is observed that introducing inhibitory inputs increases the coefficient of variation of the output spikes and that this effect is most pronounced near the critical threshold ratio (the leftmost points of the plots in Figure 6). It has been postulated (Shadlen & Newsome, 1994) that the inclusion of inhibition is crucial in explaining the apparent discrepancy between the observed irregularity of ISIs in the cortex and the regularity that is evident in models of integrate-and-fire neurons (Softky & Koch, 1992, 1993). The experimentally observed values of the coefficient of variation take values in the range 1.0–1.5 (Shadlen & Newsome, 1998), as discussed in section 3.2 (although substantially lower values, in the range 0.1–0.5, have been reported when eye movement has been controlled for; Gur et al., 1997).
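The qualitative trend reported for Figure 6 (inhibition raises the CV, most strongly near the critical threshold) can be reproduced with a small Monte Carlo sketch of the Stein model with equal-amplitude EPSPs and IPSPs. The discrete-time scheme and the parameters below are illustrative assumptions, not the 10,000-trial simulations of the text:

```python
import numpy as np

def stein_cv(n_exc=64, n_inh=0, rate=2.0, R=0.5, tau=1.0,
             dt=0.01, t_max=500.0, seed=0):
    """Sketch of the Figure 6 setup: Stein model with n_exc excitatory and
    n_inh inhibitory Poisson fibers of equal amplitude.  The threshold
    ratio R = theta / (n_exc * a) is defined, as in the text, from the
    excitatory inputs alone.  Returns the CV of the output ISIs."""
    rng = np.random.default_rng(seed)
    theta = 1.0
    a = theta / (n_exc * R)
    v, t_last, isis = 0.0, 0.0, []
    for step in range(1, int(t_max / dt) + 1):
        ke = rng.poisson(n_exc * rate * dt)   # excitatory arrivals
        ki = rng.poisson(n_inh * rate * dt)   # inhibitory arrivals
        v += -v / tau * dt + (ke - ki) * a    # leak plus signed jumps
        if v >= theta:
            isis.append(step * dt - t_last)
            t_last = step * dt
            v = 0.0
    isis = np.asarray(isis)
    return float(isis.std() / isis.mean())

cv_no_inh = stein_cv(n_inh=0)   # strongly driven: regular output, small CV
cv_inh = stein_cv(n_inh=48)     # near-balanced input: irregular, larger CV
```

With 48 inhibitory fibers the mean drift barely reaches threshold, so firing is fluctuation-driven and the ISIs become much more irregular.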
The integrate-and-fire model with only excitatory inputs gives a train of output spikes that is very regular, as illustrated in Figure 5 by the small values of the coefficient of variation in the region substantially above the critical threshold. However, in the parameter region near the critical threshold, the coefficient of variation for the integrate-and-fire model takes values much closer to one, as discussed in section 3.2 and plotted in Figure 5. Including inhibitory inputs, the results of which are plotted in Figure 6, causes an increase in the coefficient of variation, and this increase is largest near the critical threshold. The results plotted here show values of the coefficient of variation and trends that are in agreement with those in the study of a conductance-based integrate-and-fire model (Troyer & Miller, 1997): a decrease in the coefficient of variation as the input rate increases and an increase in the coefficient of variation with more inhibitory inputs (the parameters used in their study
were λ_I/λ_E = 0.75, N_E λ_E ≈ 140 spikes/τ, and N_I λ_I ≈ 65 spikes/τ; their model includes a reversal potential at the reset voltage, which is not included in this analysis). Of particular interest is the model in which the excitation and inhibition are balanced, which gives rise to values of CV close to one (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996; Shadlen & Newsome, 1998). In this balanced situation, the amplitude of the EPSPs and IPSPs may be much larger, of order 1/√N, so that the output spike rate is of the same order of magnitude as the input rates and the fluctuations in the voltage remain finite in the large-N limit (Troyer & Miller, 1997; van Vreeswijk & Sompolinsky, 1998). To model inhibitory inputs accurately, it is necessary to take into account the effect of the reversal potential, which gives the intracellular potential a lower bound and causes the amplitude of the inputs to depend on the current value of the potential (i.e., the PSP amplitude decreases the nearer the membrane potential is to the reversal level). In the simulation of Shadlen and Newsome (1998; their case 3, as discussed here in section 3.2), the effect of the inhibitory reversal potential is approximated by introducing a lower barrier below which the potential is not permitted to fall. A more complete investigation of the effect of inhibition, which addresses these questions, is beyond the scope of this article, which deals only with synaptic inputs that are summed linearly. Such questions are addressed elsewhere (Burkitt & Clark, 1999b), where it is shown how the technique presented here for analyzing integrate-and-fire neurons can be extended to include the effect of reversal potentials.

4 Conclusions
We have presented a new technique for analyzing the behavior of integrate-and-fire neurons, which allows both the probability density of the summed potential and the density of output spikes (i.e., the ISI distribution) to be calculated. The technique for calculating the probability density, which involves integrating over the arrival times of the input distribution, provides an alternative to the conventional diffusion approximation methods involving random walks or stochastic differential equations, and this new method allows the analysis of both arbitrary input distributions and synaptic response functions (i.e., including rise time and decay of the membrane potential). The output distribution of spikes is given by the first-passage time to threshold, which is defined by an integral relationship involving the first-passage time and a conditional probability density. This relationship is a generalization of the renewal equation that holds both for leaky neurons and for situations in which there is no time-translational invariance. The technique has been formulated in terms of an arbitrary probability distribution of inputs, and results are derived for inputs that have a Poisson distribution.
Figure 6: Results for the Stein model with N_E = 64 excitatory inputs, threshold ratio R = 0.5, and number of inhibitory inputs N_I = 0, 16, 32, and 48, plotted as a function of the input rate, λ_in. The amplitudes of the EPSPs and IPSPs are equal. (a) Mean output rate, λ_out. (b) Coefficient of variation of the output spikes, equation 3.15. The error bars joined by dotted lines represent the results of numerical simulations involving 10,000 trials.
The technique is a formal calculation that gives results that coincide with those obtained using the diffusion method for the two cases studied here: the perfect integrator and the Stein model. For the perfect integrator model, the method reproduces the well-known result (Gerstein & Mandelbrot, 1964). In the case of the Stein model (Stein, 1965), which incorporates decay of the potential, the method reproduces the known solution for the moments of the probability density of output spikes, which are expressed in terms of Laplace transforms of the probability density of the potential and the conditional probability density (Gluss, 1967). Synaptic response functions with finite rise time have also been analyzed, and the moments of the spike distribution, which give, respectively, the probability of an output spike, the average time of spike generation, and the output jitter, have been obtained numerically. The results are expected to be most accurate when the amplitude of the individual synaptic inputs, a, is small in comparison to the threshold, θ (which is the difference between the firing and reset potentials), and the number of inputs, N, is large (we assume that the amplitude, a, scales with 1/N), that is, where a large number of small-amplitude inputs are required to summate in order for the membrane potential to reach the firing threshold, as is the case in many neural systems. The linear summation of the individual contributions from each synapse in equation 2.1 makes it possible to analyze situations in which a single neuron has a variety of synaptic response functions. Such freedom could be used to model the effect of synapses at different parts of the dendritic tree, whereby synapses nearer the soma would have a synaptic response function with larger amplitude and shorter rise time than synapses farther away.
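As a concrete illustration of the finite-rise-time response functions recapped here, the alpha model of equation 3.17 can be evaluated directly; the parameter defaults below are assumptions for illustration:

```python
import math

def u_alpha(t, k=1.0, alpha=5.0, tau=1.0, C=1.0):
    """The alpha-model synaptic response of equation 3.17: the somatic
    voltage following one input current I(t) = k * t * exp(-alpha * t),
    with B = 1/tau - alpha (B = 0 is the removable singularity)."""
    if t < 0.0:
        return 0.0
    B = 1.0 / tau - alpha
    if abs(B) < 1e-12:
        return k * t * t * math.exp(-t / tau) / (2.0 * C)
    return (k * math.exp(-t / tau) / (B * C)) * (
        t * math.exp(B * t) - (math.exp(B * t) - 1.0) / B)
```

Rescaling the charge as k = a C α² and letting α grow large recovers the instantaneous Stein jump response a e^{−t/τ}, the limit α → ∞ noted in section 3.3.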
The technique also enables a distribution of PSP amplitudes to be analyzed, including amplitudes that fluctuate randomly about a mean value, which could be used to model the effect of quantal fluctuations. The relationships among the rate of incoming EPSPs, λ_in, the resultant rate of firing, λ_out, the width of the ISI distribution, σ_out, and the coefficient of variation, CV (equation 3.15), have been analyzed here for the Stein model and the alpha model (which includes both rise time and decay of the membrane potential) for a range of both thresholds and numbers of inputs. The results indicate that the probability of firing, r, falls from one to zero as the threshold ratio, R, increases, and that this decline in the firing probability is sharper as the number of inputs, N, increases. This rapid transition from a firing to a nonfiring domain can be characterized by a value of the threshold ratio, R_crit, which depends principally on the input rate. In the cases investigated here, the small value of the coefficient of variation, CV, above the critical threshold indicates that the output distribution is not Poissonian in this range of parameters, but rather is much more regular than typically observed in cerebral cortex (Softky & Koch, 1992, 1993). However, the results here indicate that near the critical threshold, the coefficient of variation, CV, has values much closer to those experimentally observed. An alternative
explanation involves the effect of inhibitory inputs (Shadlen & Newsome, 1994), and studies using this technique do indeed show that the coefficient of variation increases when IPSPs are included. A more comprehensive examination of this issue, incorporating reversal potentials (Tuckwell, 1979; Hanson & Tuckwell, 1983; Musila & Lánský, 1994) for the inhibitory PSPs, is under way to examine this possibility (Burkitt & Clark, 1999b).

Appendix A: Evaluation of D_k(t) and E_k(t)
The functions D_k and E_k, equation 2.8, may be evaluated using Laplace transforms, which are indicated by the subscript L, p_{k,L}(s) = L{p_k(t)}, as follows:

D_{k,L}(s) = u_{k,L}(s) Σ_{m=1}^∞ p_{k,L}(s)^m = [p_{k,L}(s) / (1 − p_{k,L}(s))] u_{k,L}(s),   (A.1)
where the convolution property of Laplace transforms is repeatedly applied. The expression for E_k is separated into a diagonal part (m = m′) and an off-diagonal part (m ≠ m′) that are evaluated independently. The diagonal term is evaluated as above for D_{k,L}(s) (except now with u_k²(t)). The lower and upper off-diagonal terms are equal and consist of two infinite sums, each of which may be successively treated in the same way as above:

E_{k,L}(s) = L{u_k²(t)} Σ_{m=1}^∞ p_{k,L}(s)^m + 2 Σ_{m=1}^∞ p_{k,L}(s)^m L{ u_k(t) L⁻¹{ u_{k,L}(s) Σ_{m′=1}^∞ p_{k,L}(s)^{m′} } }
  = [p_{k,L}(s) / (1 − p_{k,L}(s))] L{u_k²(t)} + 2 [p_{k,L}(s) / (1 − p_{k,L}(s))] L{u_k(t) D_k(t)}.   (A.2)
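The geometric resummation used in equations A.1 and A.2 is easy to check numerically. Assuming Poisson interarrivals p_k(t) = λe^{−λt} (so that p_{k,L}(s) = λ/(s + λ), a standard transform pair stated here as an assumption about the content of section 2.3), the resummed factor collapses to λ/s:

```python
def geometric_resummation(p, terms=500):
    """Partial sum of sum_{m=1}^{terms} p**m; for |p| < 1 this converges
    to the closed form p / (1 - p) used in equations A.1 and A.2."""
    return sum(p ** m for m in range(1, terms + 1))

# Assuming Poisson interarrivals p_k(t) = lam * exp(-lam * t), the Laplace
# transform is p_L(s) = lam / (s + lam), and p_L / (1 - p_L) = lam / s.
lam, s = 2.0, 0.5
p_L = lam / (s + lam)              # = 0.8 for these values
resummed = geometric_resummation(p_L)
```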
For a Poisson process, these expressions take particularly simple forms, as shown in section 2.3. Appendix B: Evaluation of G (t2 , t1 )
As in the case of the above expression for Ek,L , equation A.2, we separate 6 G(t2 , t1 ) into diagonal (m D m0 ) and off-diagonal (m D m0 ) parts that are evaluated independently. The diagonal term is evaluated in a similar fashion to the above, except now the Laplace transforms are carried out with respect to t1 , and the diagonal term is u(t1 )u (t1 C D ), where D D t2 ¡ t1 ¸ 0 is treated as a parameter. The lower and upper off-diagonal terms are equal and consist of two innite sums, each of which may be successively treated
in the same way as in appendix A:

G_L(s; Δ) = L{u(t_1) u(t_1 + Δ)} Σ_{m=1}^∞ p_L(s)^m + 2 Σ_{m=1}^∞ p_L(s)^m L{ u(t_1) L⁻¹{ u_L(s′) Σ_{m′=1}^∞ p_L(s′)^{m′} } }
  = [p_L(s) / (1 − p_L(s))] L{u(t_1) u(t_1 + Δ)} + 2 [p_L(s) / (1 − p_L(s))] L{u(t_1) D(t_1 + Δ)},   (B.1)
where s and s′ refer to the Laplace transforms with respect to t_1 and t_2, respectively. For a Poisson process, these expressions take particularly simple forms (see section 2.3).

Acknowledgments
This work was funded by the Cooperative Research Centre for Cochlear Implant, Speech and Hearing Research. We thank the anonymous referees for their useful comments on the manuscript.

References

Abeles, M. (1982a). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1982b). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Adrian, E. (1928). The basis of sensation: The action of sense organs. London: Christophers.
Bernander, O., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Comput., 6, 622–641.
Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends Neurosci., 15, 428–434.
Burkitt, A. N., & Clark, G. M. (1999a). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output in neural systems. Neural Comput., 11, 871–901.
Burkitt, A. N., & Clark, G. M. (1999b). Balanced networks: Analysis of leaky integrate-and-fire neurons with reversal potentials. Submitted.
Carr, C. E., & Friedman, M. (1999). Evolution of time coding systems. Neural Comput., 11, 1–20.
Cope, D. K., & Tuckwell, H. C. (1979). Firing rates of neurons with random excitation and inhibition. J. Theor. Biol., 80, 1–14.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1996). Characterization of synfire activity by propagating "pulse packets". In J. Bower (Ed.), Computational neuroscience: Trends in research (pp. 59–64). San Diego: Academic Press.
Douglas, R., & Martin, K. (1991). Opening the grey box. Trends Neurosci., 14, 286–293.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci., 15, 218–226.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comput., 8, 44–66.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gluss, B. (1967). A model for neuron firing with exponential decay of potential resulting in diffusion equations for probability density. Bull. Math. Biophysics, 29, 233–243.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17, 2914–2920.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximations for neuronal activity including synaptic reversal potentials. J. Theoret. Neurobiol., 2, 127–153.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Iyengar, S., & Liao, Q. (1997). Modeling neural activity using the generalized inverse Gaussian distribution. Biol. Cybern., 77, 289–295.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1985). Electric current flow in excitable cells. Oxford: Clarendon Press.
Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. Oxford: Oxford University Press.
Lánský, P. (1984). On approximations of Stein's neuronal model. J. Theor. Biol., 107, 631–647.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. (Paris), 9, 620–635.
Maass, W. (1996a). Lower bounds for the computational power of networks of spiking neurons. Neural Comput., 8, 1–40.
Maass, W. (1996b). Networks of spiking neurons. In P. Bartlett, A. Burkitt, & R. C. Williamson (Eds.), Proceedings of the Seventh Australian Conference on Neural Networks. Canberra: Australian National University.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci.
USA, 94, 735–740.
Musila, M., & Lánský, P. (1994). On the interspike intervals calculated from diffusion approximations of Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Paolini, A. G., Clark, G. M., & Burkitt, A. N. (1997). Intracellular responses of rat cochlear nucleus to sound and its role in temporal coding. NeuroReport, 8(15), 3415–3422.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). Singapore: McGraw-Hill International Editions.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys. Lett. A, 225, 228–234.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in Fortran: The art of scientific computing. Cambridge: Cambridge University Press.
Rhode, W. S., & Smith, P. H. (1986). Encoding timing and intensity in the ventral cochlear nucleus of the cat. J. Neurophysiol., 56, 261–286.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neurosci., 58, 13–41.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Comput., 4, 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Stein, R. B. (1967). Some models of neuronal variability. Biophys. J., 7, 37–68.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2, 149–162.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comput., 9, 971–983.
Tsodyks, M. V., & Sejnowski, T. J. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Tuckwell, H. C. (1977). On stochastic models of the activity of single neurons. J. Theor. Biol., 65, 783–785.
Tuckwell, H. C. (1978). Neuronal interspike time histograms for a random input model. Biophys. J., 21, 289–290.
Tuckwell, H. C. (1979). Synaptic transmission in a model for stochastic neural activity. J. Theor. Biol., 77, 65–81.
Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology: Vol. 1, Linear cable theory and dendritic structure. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1988b). Introduction to theoretical neurobiology: Vol. 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Tuckwell, H. C., & Cope, D. K. (1980). Accuracy of neuronal interspike times calculated from a diffusion approximation. J. Theor. Biol., 83, 377–387.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wilbur, W. J., & Rinzel, J. (1982). An analysis of Stein's model for stochastic neuronal excitation. Biol. Cybern., 45, 107–114.

Received April 28, 1998; accepted June 7, 1999.
LETTER
Communicated by Misha Tsodyks
Statistical Procedures for Spatiotemporal Neuronal Data with Applications to Optical Recording of the Auditory Cortex

O. François
L. Mohamed Abdallahi
Department of Statistics, LMC-IMAG, BP 53, 38041 Grenoble cedex 9, France

J. Horikawa
I. Taniguchi
Department of Neurophysiology, Tokyo Medical and Dental University, Kanda-surugadai 2-3-10, Tokyo 101, Japan

T. Hervé
TIMC-IMAG, Faculté de Médecine, IAB, 38700 La Tronche, France

This article presents new procedures for multisite spatiotemporal neuronal data analysis. A new statistical model—the diffusion model—is considered, whose parameters can be estimated from experimental data thanks to mean-field approximations. This work has been applied to optical recording of the guinea pig’s auditory cortex (layers II–III). The rates of innovation and internal diffusion inside the stimulated area have been estimated. The results suggest that the activity of the layer balances between the alternate predominance of its innovation process and its internal process.

1 Introduction
During the past few years, optical recording has become a prominent technique in analyzing neural network activity (Grinvald, Frostig, Lieke, & Hildesheim, 1988; Cohen et al., 1989; Orback, Cohen, & Grinvald, 1996). However, such processing supplies a huge amount of spatiotemporal data from which the information is sometimes difficult to extract. According to Hebb (1949), the implementation of representations of sensory events in the brain would rely on high-order statistics computed from the temporal neuronal discharge patterns within groups of cells. Perception would be dynamically encoded in the temporal relationship between coinciding active cells. Estimating interaction parameters and their temporal evolution is thus a major issue in multisite data interpretation. Methodologies inspired by the statistical mechanics formalism have been introduced several times. For instance, the concept of random field (RF) has been used to analyze spatial data (Hervé, Dolmazon, & Demongeot, 1990). When the RF parameters vary, a large number of scenarios may be statistically recovered (François, Demongeot, & Hervé, 1992). Martignon, Von Hasseln, Gruen, Aertsen, and Palm (1995) and Makarenko, Welsh, Lang, and Llinas (1997) have also proposed statistical procedures based on Markov random fields (MRF). Nevertheless, the statistics obtained by fitting RF or MRF on neural data are not intuitive. Although these methods have given useful insights on the structure of the data, there is an obvious need for different procedures. This article presents a new statistical method for multidimensional neural signal processing. The method relies on a parametric spatiotemporal model, the diffusion model (DM), with three parameters: the inverse duration of activity, the frequency of external arrivals, and the rate of diffusion of the activity from a site toward its neighboring sites. Section 2 presents the MRF and DM. The estimation of the diffusion parameter is performed on the basis of standard statistics such as the mean activity and the spatial covariance between neighboring sites. An identification with an MRF model is used to estimate the frequency of external arrivals. The statistical procedures needed to estimate the parameters of the two models are described in section 3. The method has been used on data gathered from the auditory cortex of the guinea pig, for which spatiotemporal patterns have been observed (Horikawa, Hosokawa, Kubota, Nasu, & Taniguchi, 1996). The neural activity of layers II–III has been recorded via an optical method in response to pure tone bursts. The continuous signal of each photodiode has been converted into binary code. The neural activity is then analyzed in the light of the MRF and DM approaches (section 4). These approaches enable an estimate of an input process arriving on layers II–III and an internal process. From these results, the excitatory and inhibitory processes involved in this experiment are discussed.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1821–1838 (2000)
It is suggested that the activity of layers II–III balances between the alternate predominance of its innovation process and its internal process (section 4.3). 2 Model Description
In what follows, it is assumed that binary data have been obtained from a grid S of optical diodes (two dimensions). The geometrical structure induced by the measures involves lattice models, the neighbors of a site being the nearest sites on the grid. The neighborhood of site i ∈ S is denoted by N_i and has size m = 4. The number of sites is n (typically n = 144). A site with the value 1 is considered active, whereas the value 0 means that the site is passive. At each instant t, an n-dimensional binary vector x = (x_1, . . . , x_n) is observed and represents the activity of the sites. This section presents two statistical models that account for the spatial nature of the data. They do not model the brain activity directly but rather the data collected by the grid. The first model is the Ising MRF model, which has been used several times in this context. The second model is the DM. In each model, the global activity of the sites is separated into two components—
an internal one and an external one—for which distinct parameters are introduced.

2.1 Ising Model. MRF on the lattice are known to provide pertinent models for interacting binary systems (Besag, 1974; Ripley, 1988; Guyon, 1995). Among MRF models, the most famous is the Ising model, which has been used by Hervé et al. (1990), Martignon et al. (1995), and Makarenko et al. (1997) to analyze neural signals. The Ising model has two parameters, α and β, and the likelihood function is defined as

    L(α, β; x) = exp(H(x)) / Z(α, β),    x ∈ {0, 1}^S,    (2.1)

where H(x) is equal to

    H(x) = Σ_{i∈S} α x_i + (1/2) Σ_{i∈S} Σ_{j∈N_i} β x_i x_j,    (2.2)

and Z(α, β) is the normalization constant. The parameter α is called the singleton potential and can be viewed as the intensity of an external field applied to the network. The parameter β is called the pair potential and can be interpreted as the intensity of the interaction between pairs of neighboring sites. A positive value of β translates the coactivity of neighboring sites, whereas a negative one inhibits links between sites. Historically, such a model has been introduced to explain the phenomenon of magnetism. Though the model is popular, using it for biological interpretation may be difficult because the physical meaning of its parameters in the context of neural data is not always clear. So alternative models are relevant, and new statistical methods are needed to estimate the parameters of the new models.

2.2 Diffusion Model. The DM is based on the principle of diffusion instead of interaction. In contrast to MRF, the description of the DM is dynamical. Data are assumed to be sampled from a multidimensional continuous-time stochastic process. The DM is a Markovian model defined by the rates of transition λ(x, x^i) between the vector x and the vector x^i obtained by modifying x at site i: x^i = (x_1, . . . , x_{i−1}, 1 − x_i, x_{i+1}, . . . , x_n). The probability of modifying a site value during a small time interval dt is given by

    Prob(x(t + dt) = x^i | x(t) = x) = λ(x, x^i) dt + o(dt),    (2.3)

with

    λ(x, x^i) = { δ                          if x_i = 1,
                { λ + (μ/m) Σ_{j∈N_i} x_j    if x_i = 0,    (2.4)
where δ is a positive constant, and the following condition is assumed to hold,

    λ + (μ/m) Σ_{j∈N_i} x_j > 0,    (2.5)

in order to warrant that the rates (see equation 2.4) are always positive. In this approach, signals are modeled as point processes, as many references did before (Perkel, Gerstein, & Moore, 1967). The parameters may be interpreted as follows. Processes of arrivals and departures are associated with each site. Innovation on a passive site, which may be viewed as the arrival of spikes, activates this site at rate λ. A departure on an active site stops its activity. The constant 1/δ can be interpreted as the mean duration of activity of a site. An active site returns to its resting state independent of the other sites. An active site may also propagate (or diffuse) its activity to a neighboring site, chosen with uniform probability 1/m, at a rate equal to μ. The parameter μ is called the diffusion rate. If μ = 0, the sites behave independently, and no coherent activity can be observed. If μ ≠ 0, an interaction between the sites exists, and a coherent activity may be observed. During the activity of a site, the influence on the neighbors may be either excitatory or inhibitory, depending on the sign of μ. With large, positive values of μ, a large number of simultaneously active neighbors are observed. At the opposite, with negative values, the active sites are isolated. Provided that they are nonnegative, all the parameters have the dimension of a frequency and can be measured from the real data in kHz. When they are negative, the absolute value of the parameters may also be interpreted as a frequency. In this case, the occurrence of an event (arrival or diffusion) on a given site may be viewed as having an inhibitory effect. Indeed, the probability of firing is then decreased instead of being increased.

2.3 Mean-Field Analysis of the Diffusion Model. This section presents the analysis of the DM. The Markov process defined by the rates (see equation 2.4) can be considered an interacting particle system. (A reference textbook for such systems is Liggett, 1985.)
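The interacting particle system defined by the rates in equation 2.4 can be simulated directly with a standard Gillespie-style algorithm. The sketch below is ours, not from the article: it assumes periodic boundary conditions for simplicity (the article's grid has free boundaries) and uses parameter values near those quoted later in section 3.2.

```python
import numpy as np

def active_neighbors(x):
    """Number of active neighbors of each site (m = 4, periodic boundaries)."""
    return (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
            np.roll(x, 1, 1) + np.roll(x, -1, 1))

def simulate_dm(n_side=12, delta=0.2, lam=0.02, mu=0.2, t_max=200.0, seed=0):
    """Gillespie-style simulation of the diffusion model (equation 2.4)."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n_side, n_side), dtype=int)
    t = 0.0
    while t < t_max:
        s = active_neighbors(x)
        # flip rate of each site: delta if active, lam + (mu/m) * s if passive
        rates = np.where(x == 1, delta, lam + (mu / 4.0) * s)
        total = rates.sum()
        t += rng.exponential(1.0 / total)                 # time to next event
        i = rng.choice(x.size, p=rates.ravel() / total)   # which site flips
        x.ravel()[i] ^= 1
    return x

grid = simulate_dm()
print("fraction of active sites:", grid.mean())
```

With these (assumed) parameter values the equilibrium fraction of active sites predicted by the mean-field equation 2.14 is roughly 0.1, which the simulation should approach.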
Very few estimation procedures exist to identify the parameters of these systems. When λ is positive, the model is additive (Griffeath, 1979). The convergence to equilibrium is exponentially fast (equilibrium is unique). Moreover, according to Griffeath (1979), the exponential rate of convergence is bounded above by λ uniformly in n. This property allows considering that the system is under equilibrium. However, the equilibrium distribution of the process has no convenient analytical representation. The analysis will then rely on low-order statistics and mean-field approximations. For each instant t ≥ 0, consider the probability that a site is active,

    u(t) = P(x_i(t) = 1),    (2.6)

which is independent of i by spatial invariance, and

    u = lim_{t→∞} u(t).    (2.7)

The Kolmogorov equations of the DM lead to, for all i in S and j in N_i,

    du(t)/dt = λ − (δ + λ − μ) u(t) − μ E[x_i(t) x_j(t)],    (2.8)

and, under equilibrium,

    λ − (δ + λ − μ) u − μ E[x_i x_j] = 0.    (2.9)

The probability u that a site is active can be estimated by the empirical mean,

    û = (1/n) Σ_{i=1}^{n} x_i,    (2.10)

while the expectation E[x_i x_j] can be estimated as

    r̂ = (1/(nm)) Σ_{i∈S, j∈N_i} x_i x_j.    (2.11)

The mean-field approximation assumes that the spatial covariance between two neighboring sites is null, that is,

    Cov(x_i, x_j) = E[x_i x_j] − E[x_i] E[x_j] ≈ 0.    (2.12)

Since

    E[x_i] = u,    (2.13)

equation 2.9 implies that

    λ − (δ + λ − μ) u − μ u² ≈ 0.    (2.14)

According to the sign of the covariance, an excitatory or an inhibitory effect can be directly (or indirectly through MRF) detected. However, such a conclusion fails when the covariance is small. In contrast, mean-field approximations become relevant. Thus, mean-field statistics are complementary to standard ones. The mean-field statistics for the DM are obtained from equation 2.14. An expression of the diffusion rate μ is given by

    μ = ((δ + λ) / (u(1 − u))) (u − λ/(λ + δ)).    (2.15)
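The statistics û and r̂, and the mean-field expression 2.15, are direct to compute on a single binary frame. A minimal sketch (our code, not the authors'; periodic boundaries are assumed for the neighbor sums, and equation 2.15 is undefined for an empty or saturated frame):

```python
import numpy as np

def mean_field_stats(x, delta, lam):
    """Empirical mean u-hat (eq. 2.10), neighbor statistic r-hat (eq. 2.11),
    and the mean-field diffusion rate mu (eq. 2.15) for a binary frame x."""
    m = 4.0
    u = x.mean()                                        # eq. 2.10
    s = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
         np.roll(x, 1, 1) + np.roll(x, -1, 1))
    r = (x * s).mean() / m                              # eq. 2.11
    # eq. 2.15; requires 0 < u < 1 (empty or saturated frames are excluded)
    mu = (delta + lam) / (u * (1.0 - u)) * (u - lam / (lam + delta))
    return u, r, mu

x = (np.random.default_rng(1).random((12, 12)) < 0.3).astype(int)
u, r, mu = mean_field_stats(x, delta=0.2, lam=0.02)
```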
3 Statistical Procedures
The statistical procedures needed to estimate the parameters of the MRF and DM are presented in this section. For the DM, the procedures are new. The data x = (x_1, . . . , x_n) consist of n binary variables observed on a lattice. This section also discusses the advantage of each model over the other.

3.1 Markov Random Field. Usual estimation procedures are the maximum likelihood (ML) method, which is computationally heavy, and the pseudo-likelihood (PL) method, due to Besag (1974). In the pseudo-likelihood method, the quantity to maximize is

    PL(α, β) = Π_{i=1}^{n} P(x_i | x_j, j ∈ N_i).    (3.1)

Computing the conditional probability leads to

    P(x_i = 1 | x_j; j ∈ N_i) = 1 / (1 + e^{−y_i}),    (3.2)

with

    y_i = α + β Σ_{j∈N_i} x_j.    (3.3)

Hence, the maximization of the PL can be achieved by the gradient algorithm

    α̂_{k+1} = α̂_k + η (∂ log PL / ∂α)(α̂_k, β̂_k),
    β̂_{k+1} = β̂_k + η (∂ log PL / ∂β)(α̂_k, β̂_k),    (3.4)

with η a small constant, α̂_0, β̂_0 arbitrary initial values, and k ≥ 0. This yields the following algorithm:

    α̂_{k+1} = α̂_k + η Σ_{i=1}^{n} (x_i − 1/(1 + e^{−y_i^k})),
    β̂_{k+1} = β̂_k + η Σ_{i=1}^{n} ỹ_i (x_i − 1/(1 + e^{−y_i^k})),    (3.5)

with

    ỹ_i = Σ_{j∈N_i} x_j,    (3.6)

and

    y_i^k = α̂_k + β̂_k Σ_{j∈N_i} x_j.    (3.7)
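The update rule 3.5 amounts to gradient ascent on a logistic-regression-like objective and can be sketched in a few lines. The code below is our illustration, not the authors' implementation; periodic boundaries and a fixed step η are assumed:

```python
import numpy as np

def fit_ising_pl(x, eta=1e-3, n_iter=5000):
    """Pseudo-likelihood gradient ascent for (alpha, beta), equation 3.5."""
    a, b = 0.0, 0.0
    # y-tilde_i = sum of the neighbor values (eq. 3.6)
    y_tilde = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
               np.roll(x, 1, 1) + np.roll(x, -1, 1))
    for _ in range(n_iter):
        y = a + b * y_tilde                    # eq. 3.7
        resid = x - 1.0 / (1.0 + np.exp(-y))   # x_i minus eq. 3.2
        a += eta * resid.sum()                 # eq. 3.5, alpha update
        b += eta * (y_tilde * resid).sum()     # eq. 3.5, beta update
    return a, b

x = (np.random.default_rng(2).random((12, 12)) < 0.3).astype(int)
alpha, beta = fit_ising_pl(x)
```

Because the log pseudo-likelihood is concave in (α, β), a small enough step size makes the iteration converge to the PL estimate.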
The PL approximates the likelihood by assuming that the sites behave independently given the values of the neighboring sites. This approximation may also be viewed as a mean-field one. The PL method is a consistent method and usually yields good results (Guyon, 1995).

3.2 Diffusion Model. The DM has three parameters: the mean inverse duration of activity δ, the innovation rate λ, and the diffusion rate μ. Unfortunately, the analytical expression of the likelihood is unavailable, and ML or PL methods must be precluded. The emphasis here is on the estimation of λ and μ. It is assumed that the parameter δ has been estimated from the temporal observations. There is no difficulty in doing so. The estimation of λ and μ will rely on the mean-field approximation (see equation 2.14) on one hand and on the identification between MRF and DM on the other hand. A property of MRF is that the ratio of the “birth” rate and the “death” rate at a site must be equal to the ratio of conditional probabilities at this site. For the parameters, this yields
    δ / (λ + δ + (μ/m) Σ_{j∈N_i} x_j) ≈ 1 / (1 + e^{α + β Σ_{j∈N_i} x_j}).    (3.8)

According to equations 2.14 and 3.8, an expression for λ can be deduced:

    λ ≈ δu (e^{α + β Σ_{j∈N_i} x_j} − Σ_{j∈N_i} x_j / (m(1 − u))) / (u − (1/m) Σ_{j∈N_i} x_j).    (3.9)

Recall that u is estimated by the empirical mean û. Therefore, the statistic λ̂ is obtained by plugging the values of α̂, β̂, and û in equation 3.9 and then averaging over all sites. Finally, using the mean-field equation again, μ can be estimated as

    μ̂ = ((δ + λ̂) / (û(1 − û))) (û − λ̂/(λ̂ + δ)).    (3.10)
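Putting equations 3.9 and 3.10 together, the plug-in estimator can be sketched as follows (our code; the function name and the toy inputs are illustrative, and periodic boundaries are assumed for the neighbor sums):

```python
import numpy as np

def estimate_dm_rates(x, delta, alpha, beta):
    """Plug-in estimators: lambda-hat by averaging eq. 3.9 over the sites,
    then mu-hat from the mean-field relation, eq. 3.10."""
    m = 4.0
    u = x.mean()                                        # u-hat, eq. 2.10
    s = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
         np.roll(x, 1, 1) + np.roll(x, -1, 1))
    # eq. 3.9, evaluated at every site (requires u != s/m at each site)
    lam_site = delta * u * (np.exp(alpha + beta * s) - s / (m * (1.0 - u))) \
               / (u - s / m)
    lam = lam_site.mean()
    # eq. 3.10
    mu = (delta + lam) / (u * (1.0 - u)) * (u - lam / (lam + delta))
    return lam, mu

x = np.zeros((12, 12), dtype=int)
x.ravel()[:50] = 1                      # toy frame with 50 active sites
lam_hat, mu_hat = estimate_dm_rates(x, delta=0.2, alpha=-2.0, beta=0.5)
```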
Since u is estimated by û, μ̂ is actually a mean-field statistic. Mean-field statistics do not take into account the topology imposed on the model by the structure of the measures. The accuracy of the mean-field approximation has been checked numerically for a range of parameters corresponding to plausible values (to be obtained in section 4). The system was numerically simulated with parameters δ, λ, and μ around 0.2, 0.02, and 0.2. The biases of λ̂ and μ̂ are of the order of magnitude 10⁻³ and 10⁻², respectively. The standard deviations are of the order of magnitude 10⁻². These results are evidence that the model is actually weakly dependent on the topology. Models with such a property are useful for integrating the ignorance of the real topology of the underlying network. Furthermore, the mean-field approximation is justified when the spatial empirical covariance

    ĉ = (1/(nm)) Σ_{i∈S, j∈N_i} x_i x_j − û²    (3.11)

is small (the empirical covariance theoretically ranges between −0.25 and 0.25). Within the range of tested parameters, the covariance remained small (below 0.05), and the estimation bias was acceptable. Bias increases as the covariance increases and becomes significant above 0.05. This happens, for instance, when the parameter λ takes high negative values (see François, Larota, Horikawa, & Hervé, 1999).

3.3 Comparison Between MRF and DM. Although they are of different nature, the MRF and the DM share common features. In the previous section, we gave a static description of the MRF model. The DM, in contrast, has been defined as a dynamical model. In order to compare the models, we also consider a dynamical version of the MRF. In this version, the probability of modifying a site during a small time interval is given by

    Prob(x(t + dt) = x^i | x(t) = x) = λ_MRF(x, x^i) dt + o(dt),    (3.12)

with

    λ_MRF(x, x^i) = { 1                            if x_i = 1,
                    { exp(α + β Σ_{j∈N_i} x_j)     if x_i = 0.    (3.13)
Therefore, the parameters α and β are likely to be viewed as log rates. In this perspective, α is the log rate at which a passive site is activated by an external signal, and β is the log rate at which a site is activated by an active neighbor (given that the other neighbors are passive). In the DM, the rate of transition from 0 to 1 may be seen as a log-linear version of the corresponding rate in the MRF model (this is less clear concerning the transition 1 → 0). Henceforth, the same kind of relationship exists between MRF and DM as between the log-linear formulation of a Poisson process and its “firing-rate” parameterization. However, these models correspond to different statistical structures. Although a log-linear relation between rates exists, it is difficult to identify a relationship between the parameters (α, β) and (λ, μ) themselves. In the DM, a sum of elementary rates is computed. This summation corresponds to the superposition of independent infinitesimal effects. The advantage of
Figure 1: Mean activity value (mean activity versus time in milliseconds, 20–40 ms).
the DM over the MRF model is that its formulation is more intuitive, and hence easier to interpret. The advantage of MRF over the DM is that its parameters are likely to be better estimated.

4 Data Analysis

4.1 Experiment Description. The animal (guinea pig) preparation and optical recording procedures have been described in detail (Horikawa et al., 1996). Pure tone bursts (duration 50 ms, rise-fall time 10 ms, frequency 7 kHz, intensity 75 dB SPL) were delivered to the ear contralateral to the recording side. The fluorescent signals emitted by the auditory cortex were recorded using a 12 × 12-channel photodiode array at a rate of 0.576 ms per frame. The optical responses were expressed as ΔF/F, where ΔF is the change in intensity due to neural responses and F is the light intensity at rest. The noise due to cortical movements caused by pulsation was cancelled out by synchronizing stimulation and recording with the heartbeat and by subtracting a recording without a stimulus from one with a stimulus. The cortical recording area was 3 mm × 3 mm, covering part of both the anterior (A, corresponding to primary auditory cortex) and dorsocaudal (DC) fields.
Figure 2: Empirical covariance (covariance versus time in milliseconds, 20–40 ms).
The measuring microscope was focused 200 μm below the surface of the auditory cortex, so the optical responses were those from layers II–III. The responses were smoothed by running averages, and the time of occurrence and the duration of the response at each of 144 cortical locations were measured as the time of peak of the response and the period of half-peak, respectively. The time of peak is measured when the signal reaches its maximum value above rest + 3 S.D. (mean and standard deviation are measured during the [0–18 ms] period after the onset of the stimulus). The responses whose peak did not exceed the level of 3 S.D. above the resting potential were discarded. The duration of peak is measured as the half-time of the duration during which the response is above rest + 3 S.D. In order to perform data analysis, the signals were encoded as binary signals in such a way that the signal of a diode is 0 when there is no response and takes the value 1 during the peak duration. The sampling rate of the binary data has been changed to 0.5 ms.

4.2 Results. The results presented in this section correspond to a typical response to a single stimulation. The measures resulting from the array of
Figure 3: Estimation of the external field α (α versus time in milliseconds, 20–40 ms).
diodes are displayed in Table 1. The processed data consist of 40 binary vectors (of size n = 144) corresponding to the activity of the network summarized in Table 1. This activity starts 20 ms after the stimulation and ends 40 ms after the stimulation. Each vector has been treated separately, following the procedures described in section 3. First, the mean duration of activity has been estimated. For the data in Table 1, the value of δ is equal to δ = 0.148 kHz (0.074 inverse half-milliseconds). The mean activity and the empirical covariance have then been computed (see Figures 1 and 2). The mean activity increases from 0 to 40%, reaches a peak after 30 ms, and then decreases to 0. Neighboring sites interact in an excitatory way during activity (the covariance is positive). An oscillation of the interaction is also observed. The period of this oscillation is approximately 10 ms. The MRF parameters have been computed according to the iterative procedure (see equation 3.5). Results are displayed in Figures 3 and 4.

Figure 4: Estimation of the pair potential β (β versus time in milliseconds, 20–40 ms).

Nonnegative pair potentials confirm the excitatory aspect of the interaction. In Figure 5, the plot (α, β) is concentrated near the phase transition line of the Ising model (see Preston, 1976), a phenomenon that has been noticed by Makarenko et al. (1997) for data measured from the olive cerebellar cortex. Clearly, a linear relationship between α and β exists, meaning that the fraction of activity due to the external signal is lowered when the excitatory interaction level increases. The information obtained by fitting MRF is, however, close to that obtained from standard statistics (see Figure 2). Mean-field statistics have been computed for λ and μ according to equations 3.9 and 3.10. The results are displayed in Figures 6, 7, and 8. Overall, the results obtained by the MRF analysis are confirmed. The diffusion rate μ̂ is nonnegative. This means that the interaction is excitatory. Negative values of λ have also been obtained (an arrival associated with a negative rate can be viewed as an inhibitory event). Again a linear dependence between the arrival and diffusion rates is observed. The parameters λ̂ and μ̂ oscillate in opposite phases (but the mean activity does not oscillate). The period of the oscillation is between 8 and 10 ms. The diffusion rate μ̂ ranges between 0.0 and 0.4 kHz. A value equal to 0.2 kHz means that the frequency of excitatory events due to the activity of neighbors is one every 5 ms. When μ̂ increases, the external input is inhibited, and negative values of λ̂ are computed. The analysis supports the presence of oscillations in both the internal and external signals, while the MRF parameters
Figure 5: Plot of α, β, and the phase transition line α + 2β = 0.
stay closer to the standard statistics (mean and covariance). In MRF, the information contained in the data is summarized into two parameters, α and β, which translate first- and second-order interactions. Interactions of higher order are actually neglected. Such interactions may nevertheless exist and may alter the significance of α and β. On the other hand, the mean-field approximation is based on a first-order statistic that attenuates the importance of model geometry. Because all interactions are recorded into the mean activity, the method seems more reliable when the true order of interaction is unknown. The crucial point is that the diffusion parameter (which should be of second order) is well estimated by a first-order statistic.

4.3 Discussion. In layers II–III of the guinea pig auditory cortex, excitatory responses are mediated by both N-methyl-D-aspartate (NMDA) and non-NMDA receptors. Non-NMDA-receptor-mediated responses predominate, while NMDA-receptor-mediated responses are very small. Under normal conditions, the excitatory responses last 40 to 50 ms starting from 17 to 20 ms after the stimulus onset. NMDA-receptor-mediated responses start from 27 to 30 ms—10 ms after the beginning of non-NMDA-receptor-mediated
Table 1: Experimental Data.

Start of activity (in 0.5 ms units):

54 0 55 81 46 77 56 38 35 33 56 59
43 46 52 39 44 44 43 46 36 36 41 51
43 43 51 52 38 62 40 50 37 41 55 59
65 54 45 45 50 37 53 40 36 53 41 40
50 71 61 54 50 41 40 39 38 39 39 43
67 56 59 51 55 47 40 41 36 47 41 56
56 68 61 62 45 56 41 45 39 81 39 47
63 58 54 54 66 47 39 41 40 43 43 41
65 63 62 58 61 46 43 40 48 38 48 45
85 55 58 66 67 55 47 45 41 39 46 48
62 38 60 56 47 54 61 44 40 43 40 46
28 0 63 69 89 56 43 43 46 46 53 65
Duration of activity (in 0.5 ms units):

2 0 5 2 7 7 8 12 14 25 6 3
6 5 17 10 14 14 15 8 9 6 8 6
14 10 24 17 20 5 21 5 15 12 3 3
6 17 39 26 24 33 9 21 22 10 17 6
9 9 10 9 5 28 36 24 18 13 5 12
9 6 15 12 16 15 33 17 14 10 10 10
5 9 18 21 28 17 31 20 14 8 20 8
14 15 25 25 20 29 15 25 15 13 32 18
7 6 22 12 9 16 17 24 16 29 3 7
3 8 8 15 3 21 13 18 7 22 7 3
6 6 8 5 23 14 6 14 8 17 18 21
3 0 13 14 5 7 24 13 10 18 5 3
Notes: The first array gives the instant at which the activity starts, and the second array gives the duration of the activity for each of the 144 sites. The topology is supposed to be a lattice consisting of 144 sites disposed on a 12 × 12 grid. In these two arrays, the unit of time is 0.5 ms (resampled from the original 0.576 ms sampling).
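Under the conventions stated in the notes, the two arrays can be expanded back into the 40 binary frames analyzed in section 4.2. The sketch below is ours; the half-open activity window and the frame origin at 20 ms (40 half-ms units) after stimulation are assumptions consistent with the text, and the toy 2 × 2 excerpt stands in for the full 12 × 12 arrays:

```python
import numpy as np

def frames_from_table(onset, duration, t_start=40, n_frames=40):
    """Expand (start, duration) arrays, given in 0.5 ms units, into binary
    frames. Frame k corresponds to time t_start + k half-ms; t_start = 40
    half-ms = 20 ms after stimulation. Sites with onset 0 (discarded
    responses) stay passive."""
    frames = np.zeros((n_frames,) + onset.shape, dtype=int)
    for k in range(n_frames):
        t = t_start + k
        on = (onset > 0) & (onset <= t) & (t < onset + duration)
        frames[k] = on.astype(int)
    return frames

onset = np.array([[54, 43], [28, 63]])      # toy 2 x 2 excerpt of Table 1
duration = np.array([[2, 6], [3, 13]])
f = frames_from_table(onset, duration)
```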
responses (Horikawa et al., 1996). Some cortical GABAergic inhibition originating from layers II–III occurs, inhibiting the long-lasting NMDA-receptor-mediated responses and shortening the excitatory responses. In layers II–III, 20 to 25% of neurons are inhibitory. The GABAergic inhibition begins about 10 ms after the non-NMDA-receptor-mediated responses. The innovation process is composed of several inputs. Some specific inputs to layer III come from the stimulus propagation (main auditory inputs). They arise from layer IV, where the dorsal and ventral medial geniculate body (MGB) terminate, or directly from the MGB. Other inputs to layer II come from layer I, which receives them from the medial division of the MGB. These inputs are more conditioning related and probably polymodal. Another input is due to cortico-cortical connections. Such input comes from auditory areas of both hemispheres. Innovation is characterized by the parameters α (MRF) and λ (DM). The innovation process starts at t = 20 ms (see Figures 3 and 6), as the area becomes active (see Figure 1). This
Figure 6: Estimation of λ (kHz) (λ versus time in milliseconds, 20–40 ms).
is consistent with usual recordings. A site is considered active when the response is above a threshold. Since λ is alternately positive and negative, it can be concluded that the innovation process consists of excitatory and inhibitory inputs. Excitatory inputs are directly induced by the stimulus propagation and are weakened by adaptation in the auditory nerve. This phenomenon explains why excitation starts decreasing at t = 23 ms. Subcortical inhibition originating from the MGB is also known to occur. The oscillatory behavior of λ may be related to a superposition of excitatory and inhibitory processes. The second negative peak of λ can be attributed to subcortical inhibition. The 26 ms latency of the first negative peak seems correlated with the occurrence of GABAergic inhibition, though this inhibition belongs to the internal process of layers II–III. Although the curves of α and λ are different, these parameters exhibit some analogous qualitative behavior: they increase and decrease at the same latencies. The estimation of the innovation process in the MRF approach provides a global trend (oscillation) of this process that is consistent with the estimation obtained in the DM approach. The internal process corresponds to interactions within layers II–III: MRF and DM approaches show that the internal process is excitatory, although
Figure 7: Estimation of μ (kHz) (diffusion rate μ versus time in milliseconds, 20–40 ms).
it may be modulated by inhibitory events. This result is consistent with a recent study (Kubota, Sugimoto, Horikawa, Nasu, & Taniguchi, 1997) that outlines a propagation of non-NMDA-receptor-mediated excitation within layers II–III of a rat slice preparation. Again, β and μ have a similar oscillating behavior (see Figures 4 and 7), with a first peak at t = 26 ms. Such latency is in accordance with the dynamics of the non-NMDA-receptor-mediated excitations, though this layers II–III excitation process is not yet well known under in vivo conditions. The decrease of β and μ up to t = 30 ms is a consequence of the internal GABAergic inhibition that starts at about t = 27 ms. The diffusion increases several milliseconds after the arrival of inputs. This first peak of diffusion can be interpreted as a consequence of the stimulus, which arrives with a large variance and propagates into the studied area. Intracortical interactions may be responsible for the second peak of diffusion at t = 34 ms. These interactions are delayed due to the intrinsic circuitry within each layer. The shape of the diffusion rate μ is closer than β to the covariance (a surprising fact) and is likely to reflect a better picture of the internal process. In Figure 8, the relation between λ and μ appears to be linear except for the points that correspond to the beginning and the end of the signal. Such
Figure 8: Plot of (λ, μ).
a relation suggests that a competition exists between innovation and the internal process. This may explain how the occurrence of internal GABAergic inhibition (at t = 27 ms) influences the innovation process and reduces its effectiveness on layers II–III. It suggests that when innovation becomes inhibitory, the sites activate the internal diffusion task, and that when innovation becomes excitatory, the sites are less involved in the diffusion task. Conversely, the internal excitatory activity may also mask the innovation process to layers II–III. Our hypothesis is that the innovation process would be composed of a single long inhibition and a single excitation that become effective when the internal process is weak. Such ideas need empirical evidence. An extension of this study will be to obtain a confirmation of timings and identification of excitatory and inhibitory events in both innovation and the internal process of layers II–III. The question of isofrequency band coding in the auditory cortex has not been addressed because spatial differences have not been taken into account. In a further study, it would be useful to distinguish between the activity of the primary auditory cortex and its dorsocaudal field, which exhibit different response latencies and have reciprocal connections.
References

Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. B, 60, 172–236.
Cohen, L. B., Hopp, H. P., Wu, J. Y., Xiao, C., London, J., & Zecevic, D. (1989). Optical measurement of neuron activity in invertebrate ganglia. Ann. Rev. Physiol., 51, 527–541.
François, O., Demongeot, J., & Hervé, T. (1992). Convergence of a self-organizing stochastic neural network. Neural Networks, 5, 277–282.
François, O., Larota, C., Horikawa, J., & Hervé, T. (1999). Diffusion and innovation rates for multidimensional neuronal data with large spatial covariance. Preprint.
Griffeath, D. (1979). Additive and cancellative interacting particle systems. New York: Springer-Verlag.
Grinvald, A., Frostig, R. D., Lieke, E., & Hildesheim, R. (1988). Optical imaging of neuronal activity. Physiological Reviews, 68, 1285–1366.
Guyon, X. (1995). Random fields on a network: Modelling, statistics, and applications. New York: Springer-Verlag.
Hebb, D. (1949). The organization of behaviour. New York: Wiley.
Hervé, T., Dolmazon, J. M., & Demongeot, J. (1990). Random field and neural information. Proc. Natl. Acad. Sci. USA, 87, 806–810.
Horikawa, J., Hosokawa, Y., Kubota, M., Nasu, M., & Taniguchi, I. (1996). Optical imaging of spatiotemporal patterns of glutamatergic excitation and GABAergic inhibition in the guinea-pig auditory cortex. J. Physiol., 497, 629–638.
Kubota, M., Sugimoto, S., Horikawa, J., Nasu, M., & Taniguchi, I. (1997). Optical imaging of dynamic horizontal spread of excitation in rat auditory cortex slices. Neurosci. Lett., 237, 77–80.
Liggett, T. M. (1985). Interacting particle systems. New York: Springer-Verlag.
Makarenko, V. I., Welsh, J. P., Lang, E. J., & Llinas, R. (1997). A new approach to the analysis of multidimensional neuronal activity: Markov random fields. Neural Networks, 10, 785–789.
Martignon, L., Von Hasseln, H., Gruen, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Orback, H. S., Cohen, L. B., & Grinvald, A. (1996). Optical monitoring of optical activity in the mammalian sensory cortex. Journal of Neuroscience, 5, 1886–1895.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Preston, C. J. (1976). Random fields. New York: Springer-Verlag.
Ripley, B. D. (1988). Statistical inference for spatial processes. Cambridge: Cambridge University Press.

Received October 22, 1997; accepted October 5, 1999.
LETTER
Communicated by Steven Nowlan
Probabilistic Motion Estimation Based on Temporal Coherence

Pierre-Yves Burgi
Centre Suisse d'Electronique et Microtechnique, 2007 Neuchâtel, Switzerland

Alan L. Yuille
Norberto M. Grzywacz
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.

Neural Computation 12, 1839–1867 (2000). © 2000 Massachusetts Institute of Technology

We develop a theory for the temporal integration of visual motion motivated by psychophysical experiments. The theory proposes that input data are temporally grouped and used to predict and estimate the motion flows in the image sequence. This temporal grouping can be considered a generalization of the data association techniques that engineers use to study motion sequences. Our temporal grouping theory is expressed in terms of the Bayesian generalization of standard Kalman filtering. To implement the theory, we derive a parallel network that shares some properties of cortical networks. Computer simulations of this network demonstrate that our theory qualitatively accounts for psychophysical experiments on motion occlusion and motion outliers. In deriving our theory, we assumed spatial factorizability of the probability distributions and made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argue that these approximations are suitable for the stimuli we are considering (for which spatial coherence effects are negligible).

1 Introduction

Local motion signals are often ambiguous, and many important motion phenomena can be explained by hypothesizing that the human visual system uses temporal coherence to resolve ambiguous inputs. (Temporal coherence is the assumption that motion in natural images is mostly temporally smooth; that is, it rarely changes direction or speed abruptly.) For example, the perceptual tendency to disambiguate the trajectory of an ambiguous motion path by using the time average of its past motion was first reported by Anstis and Ramachandran (Ramachandran & Anstis, 1983; Anstis & Ramachandran, 1987). They called this phenomenon motion inertia. Other motion phenomena involving temporal coherence include the improvement of velocity estimation over time (McKee, Silverman, & Nakayama, 1986; Ascher, 1998), blur removal (Burr, Ross, & Morrone, 1986), motion-outlier
detection (Watamaniuk, McKee, & Grzywacz, 1994), and motion occlusion (Watamaniuk & McKee, 1995). Another example of the integration of motion signals over time comes from a recent experiment by Nishida and Johnston (1999) on motion aftereffects, where an orientation shift in the direction of the illusory rotation appears to involve dynamic updating by motion neurons. These motion phenomena pose a serious challenge to motion perception models. For example, in motion-outlier detection, the target dot is indistinguishable from the background noise if one observes only a few frames of the motion. Therefore, the only way to detect the target dot is by means of its extended motion over time. Consequently, explanations based solely on single, large local-motion detectors as in the motion energy and elaborated Reichardt models (Hassenstein & Reichardt, 1956; van Santen & Sperling, 1984; Adelson & Bergen, 1985; Watson & Ahumada, 1985; for a review of motion models, see Nakayama, 1985) seem inadequate to explain these phenomena (Verghese, Watamaniuk, McKee, & Grzywacz, 1999). Also, it has been argued (Yuille & Grzywacz, 1998) that all these experiments could be interpreted in terms of temporal grouping of motion signals involving prediction, observation, and estimation. The detection of targets and their temporal grouping could be achieved by verifying that the observations were consistent with the motion predictions. Conversely, failure in these predictions would indicate that the observations were due to noise, or distractors, which could be ignored. The ability of a system to predict the target’s motions thus represents a powerful mechanism to extract targets from background distractors and might be a powerful cue for the interpretation of visual scenes. There has been little theoretical work on temporal coherence. 
Grzywacz, Smith, and Yuille (1989) developed a theoretical model that could account for motion inertia phenomena by requiring that the direction of motion varies slowly over time while speed could vary considerably faster. More recently, Grzywacz, Watamaniuk, and McKee (1995) proposed a biologically plausible model that extends the work on motion inertia and allows for the detection of motion outliers. It was observed (Grzywacz et al., 1989) that for general three-dimensional motion, the direction, but not the speed, of the image motion is more likely to be constant and hence might be more reliable for motion prediction. This is consistent with the psychophysical experiments on motion inertia (Anstis & Ramachandran, 1987) and temporal smoothing of motion signals (Gottsdanker, 1956; Snowden & Braddick, 1989; Werkhoven, Snippe, & Toet, 1992), which demonstrated that the direction of motion is the primary cue in temporal motion coherence. Other reasons that might make motion direction more reliable than speed are the unreliability of local speed measurement due to the poor temporal resolution of local motion units (Kulikowski & Tolhurst, 1973; Holub & Morton-Gibson, 1981) and involuntary eye movements. This article presents a new theory for visual motion estimation and prediction that exploits temporal information. This theory builds on our previous work (Grzywacz et al., 1989, 1995; Yuille & Grzywacz, 1998) and on recent work on tracking of object boundaries (Isard & Blake, 1996). In addition, this article gives a concrete example of the abstract arguments on temporal grouping proposed by Yuille and Grzywacz (1998). To this end, we use a Bayesian formulation of estimation over time (Ho & Lee, 1964), which allows us simultaneously to make predictions over time, update those predictions using new data, and reject data that are inconsistent with the predictions. (We will be updating the marginal distributions of velocity at each image point rather than the full probability distribution of the entire velocity field, which would require spatial coupling.) This rejection of data allows the theory to implement basic temporal grouping (for speculations about more complex temporal grouping, see Yuille & Grzywacz, 1998). This basic form of temporal grouping is related to data association as studied by engineers (Bar-Shalom & Fortmann, 1988) but differs by being probabilistic (following Ho & Lee, 1964). Finally, we show that our temporal grouping theory can be implemented using locally connected networks, which are suggestive of cortical structures. In deriving our theory, we made an assumption of spatial factorizability of the probability distributions and made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argued that these assumptions and approximations (as embodied in our choice of prior probability) are suitable for the stimuli we are considering but would need to be revised to include spatial coherence effects (Yuille & Grzywacz, 1988, 1989). The prior distribution used in this article would need to be modified to incorporate these effects. (For speculations about how this might be achieved, see Yuille & Grzywacz, 1998.)
There is no difficulty in generalizing the Bayesian theory to deal with complex hierarchical motion, but it is unclear whether the specific implementation in this article can be generalized. The article is organized as follows. The theoretical framework for motion prediction and estimation is introduced in section 2. Section 3 gives a description of the specific model chosen to implement the theory in the continuous time domain. Implementation of the model using a locally connected network is addressed in sections 4 and 5. The model's predictions are tested in section 6 by performing computer simulations on the motion phenomena described previously. Finally, we discuss relevant issues, such as the possible relationship of the model to biology, in section 7.

2 The Theoretical Framework

2.1 Background. The theory of stochastic processes gives us a mathematical framework for modeling motion signals that vary over time. These processes can be used to predict the probabilities of future states of the system (e.g., the future velocities) and are called Markov if the probability of future states depends on only their present state (i.e., not on the time history of how they arrived at this state). In computer vision, Markov processes have been used to model temporal coherence for applications such as the optimal fusing of data from multiple frames of measurements (Matthies, Kanade, & Szeliski, 1989; Clark & Yuille, 1990; Chin, Karl, & Willsky, 1994). Such optimal fusing is typically defined in terms of least-squares estimates, which reduces to Kalman filtering theory (Kalman, 1960). Kalman filters have been applied to a range of problems in computer vision (see Blake & Yuille, 1992, for several examples) and to neuronal modeling by Rao and Ballard (1997). Because Kalman filters are (recursive) linear estimators that apply only to gaussian densities, their applicability in complex scenes involving several moving objects is questionable. One solution, which we briefly discuss, is the use of data association techniques (Bar-Shalom & Fortmann, 1988). In this article, however, we follow Isard and Blake (1996) and propose a Bayesian formulation of temporal coherence, which is a generalization of standard Kalman filters and can deal with targets moving in complex backgrounds. We begin by deriving a generalization of Kalman filters that is suitable for describing the temporal coherence of simple motion. (Our development follows the work of Ho & Lee, 1964.) Consider the state vector $\vec{x}_k$ describing the status of a system at time $t_k$ (in our theory, $\vec{x}_k$ denotes the velocity field at every point in the image). Our knowledge about how the system might evolve to time $k+1$ is described by a probability distribution function $p(\vec{x}_{k+1}|\vec{x}_k)$, known as the "prior" (in our theory, this prior will encode the temporal coherence assumption). The measurement process that relates the observation $\vec{z}_k$ of the state vector to its "true" state is described by the likelihood distribution function $p(\vec{z}_k|\vec{x}_k)$ (in our theory, $\vec{z}_k$ will be the responses of basic motion units at every place in the image).
From these two distribution functions, it is straightforward to derive the a posteriori distribution function $p(\vec{x}_{k+1}|Z_{k+1})$, which is the distribution of $\vec{x}_{k+1}$ given the whole past set of measurements $Z_{k+1} = (\vec{z}_0, \ldots, \vec{z}_{k+1})$ (note that although the prediction of the future depends only on the current state, the estimate of the current state does depend on the entire past history of the measurements). Using Bayes' rule and a few algebraic manipulations (Ho & Lee, 1964), we get

$$p(\vec{x}_{k+1}|Z_{k+1}) = \frac{p(\vec{z}_{k+1}|\vec{x}_{k+1})\, p(\vec{x}_{k+1}|Z_k)}{p(\vec{z}_{k+1}|Z_k)}, \quad (2.1)$$
where $p(\vec{x}_{k+1}|Z_k)$ is the prediction for the future state $\vec{x}_{k+1}$ given the current set of past measurements $Z_k$, and $p(\vec{z}_{k+1}|Z_k)$ is a normalizing term denoting the confidence in the measurement $\vec{z}_{k+1}$ given the set of past measurements $Z_k$. The predictive distribution function can be expressed as

$$p(\vec{x}_{k+1}|Z_k) = \int p(\vec{x}_{k+1}|\vec{x}_k)\, p(\vec{x}_k|Z_k)\, d\vec{x}_k. \quad (2.2)$$

Equations 2.1 and 2.2 are generalizations of the Wiener-Kalman solution for linear estimation in the presence of noise. Their evaluation involves two stages. In the prediction stage, given by equation 2.2, the probability distribution $p(\vec{x}_{k+1}|Z_k)$ for the state at time $k+1$ is determined. This function, which involves the present state and the Markov transition describing the dynamics of the system, has a variance larger than that of $p(\vec{x}_k|Z_k)$. This increase in the variance results from the nondeterministic nature of the motion. In the measurement stage, performed by equation 2.1, the new measurements $\vec{z}_{k+1}$ are combined using Bayes' theorem and, if consistent, reinforce the prediction and decrease the uncertainty about the new state. (Inconsistent measurements may increase the uncertainty still further.) It was shown (Ho & Lee, 1964) that if the measurement and prior probabilities are both gaussian, then equations 2.1 and 2.2 reduce to the standard Kalman equations, which update the mean and the covariance of $p(\vec{x}_{k+1}|Z_{k+1})$ and $p(\vec{x}_{k+1}|Z_k)$ over time. Gaussian distributions, however, are nonrobust (Huber, 1981), and an incorrect (outlier) measurement can seriously distort the estimate of the true state. Standard linear Kalman filter models are therefore not able to account for the psychophysical data that demonstrate, for example, that human observers are able to track targets despite the presence of inconsistent (outlier) measurements (Watamaniuk et al., 1994; Watamaniuk & McKee, 1995). Various techniques, known collectively as data association (Bar-Shalom & Fortmann, 1988), can be applied to reduce these distortions by using an additional stage that decides whether to reject or accept the new measurements. From the Bayesian perspective, this extra stage is unnecessary; robustness can be ensured by correct choices of the measurement and prior probability models. More specifically, we specify a measurement model that is robust against outliers.

2.2 Prediction and Estimation Equations for a Flow Field.
We intend to estimate the motion flow field, defined over the two-dimensional image array, at each time step. We describe the flow field as $\{\vec{v}(\vec{x},t)\}$, where the braces $\{\cdot\}$ denote variations over each possible spatial position $\vec{x}$ in the image array. The prior model for the motion field and the likelihood for the velocity measurements are described by the probability distribution functions $P(\{\vec{v}(\vec{x},t)\}|\{\vec{v}(\vec{x},t-\delta)\})$ and $P(\{\vec{w}(\vec{x},t)\}|\{\vec{v}(\vec{x},t)\})$ (see section 3 for details), where $\delta$ is the time increment and $\{\vec{w}(\vec{x},t)\} = \{w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t)\}$ represents the response of the local motion measurement units over the image array. Our theory requires that the local velocity measurements are normalized so that $\sum_{i=1}^{M} w_i(\vec{x},t) = 1$ for all $\vec{x}$ and at all times $t$ (see section 3 for details). We let $W(t) = (\{\vec{w}(\vec{x},t)\}, \{\vec{w}(\vec{x},t-\delta)\}, \ldots)$ be the set of all measurements up to and including time $t$, and $P(\{\vec{v}(\vec{x},t-\delta)\}|W(t-\delta))$ be the system's estimated probability distribution of the velocity field at time $t-\delta$. Using equations 2.1 and 2.2 described above, we get
$$P(\{\vec{v}(\vec{x},t)\}|W(t)) = \frac{P(\{\vec{w}(\vec{x},t)\}|\{\vec{v}(\vec{x},t)\})\, P(\{\vec{v}(\vec{x},t)\}|W(t-\delta))}{P(\{\vec{w}(\vec{x},t)\}|W(t-\delta))}, \quad (2.3)$$
for the estimation, and

$$P(\{\vec{v}(\vec{x},t)\}|W(t-\delta)) = \int P(\{\vec{v}(\vec{x},t)\}|\{\vec{v}(\vec{x},t-\delta)\})\, P(\{\vec{v}(\vec{x},t-\delta)\}|W(t-\delta))\, d[\{\vec{v}(\vec{x},t-\delta)\}], \quad (2.4)$$

for the prediction.

2.3 The Marginalization Approximation. For simplicity and computational convenience, we assume that the prior and the measurement distributions are chosen to be factorizable in spatial position, so that the probabilities at one spatial point are independent of those at another. This restriction prevents us from including spatial coherence effects, which are known to be important aspects of motion perception (see, for example, Yuille & Grzywacz, 1988). For the types of stimuli we are considering, these spatial effects are of little importance and can be ignored. In mathematical terms, this spatial-factorization assumption means that we can factorize the prior $P_p$ and the likelihood $P_l$ so that
$$P_p(\{\vec{v}(\vec{x},t)\}|\{\vec{v}(\vec{x}',t-\delta)\}) = \prod_{\vec{x}} p_p(\vec{v}(\vec{x},t)|\{\vec{v}(\vec{x}',t-\delta)\})$$
$$P_l(\{\vec{w}(\vec{x},t)\}|\{\vec{v}(\vec{x},t)\}) = \prod_{\vec{x}} p_l(\vec{w}(\vec{x},t)|\vec{v}(\vec{x},t)). \quad (2.5)$$
A further restriction is to modify equation 2.4 so that it updates the marginal distributions at each point independently. This again reduces spatial coherence because it decouples the estimates of velocity at each point in space. Once again, we argue that this approximation is valid for the class of stimuli we are considering. This approximation will decrease computation time by allowing us to update the predictions of the velocity distributions at each point independently. The assumption that the measurement model is spatially factorizable means that the estimation stage, equation 2.3, can also be performed independently. This gives update rules for prediction $P_{pred}$:

$$P_{pred}(\vec{v}(\vec{x},t)|W(t-\delta)) = \int [d\vec{v}(\vec{x}',t-\delta)]\, P_p(\vec{v}(\vec{x},t)|\{\vec{v}(\vec{x}',t-\delta)\})\, P_e(\{\vec{v}(\vec{x}',t-\delta)\}|W(t-\delta)), \quad (2.6)$$
and for estimation $P_e$:

$$P_e(\vec{v}(\vec{x},t)|W(t)) = \frac{P_l(\vec{w}(\vec{x},t)|\vec{v}(\vec{x},t))\, P_{pred}(\vec{v}(\vec{x},t)|W(t-\delta))}{P(\vec{w}(\vec{x},t)|W(t-\delta))}. \quad (2.7)$$
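Under the factorization and marginalization approximations, the state reduces to one discrete velocity distribution per pixel, and equations 2.6 and 2.7 become independent local updates. A minimal Python sketch (the array shapes, the velocity-mixing kernel, and the random stand-in likelihood are our own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

H, W, M = 4, 4, 8   # image height, width, number of velocity bins per pixel

def predict_marginals(P, velocity_kernel):
    # Equation 2.6, per pixel: mix each pixel's marginal over velocity bins.
    # (A fuller model would also transport probability spatially along each velocity.)
    return P @ velocity_kernel.T

def estimate_marginals(P_pred, likelihood):
    # Equation 2.7, per pixel: multiply by the local likelihood and renormalize;
    # the per-pixel normalizer plays the role of P(w(x,t) | W(t - delta)).
    joint = P_pred * likelihood
    return joint / joint.sum(axis=-1, keepdims=True)

# Flat marginals everywhere; a kernel that leaks probability to neighboring bins.
P = np.full((H, W, M), 1.0 / M)
kernel = (0.8 * np.eye(M)
          + 0.1 * np.roll(np.eye(M), 1, axis=0)
          + 0.1 * np.roll(np.eye(M), -1, axis=0))
likelihood = np.random.default_rng(0).random((H, W, M))  # stand-in responses

P = estimate_marginals(predict_marginals(P, kernel), likelihood)
```

Because the distributions factorize over pixels, both functions touch each pixel independently, which is what later makes the parallel, locally connected implementation of section 4 possible.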
In what follows, we will consider a specific likelihood and prior function that allow us to simplify these equations and define probability distributions on the velocities at each point. This will lead to a formulation for motion prediction and estimation where computation at each spatial position can be performed in parallel.

3 The Model
We now describe a specific model for the likelihood and prior functions that, as we will show, can account for many psychophysical experiments involving temporal motion coherence. Based on this specific prior function, we then derive a formulation for motion prediction in the continuous time domain.

3.1 Likelihood Function. The likelihood function gives a probabilistic interpretation of the measurements given a specific motion. It is probabilistic because the measurement is always corrupted by noise at the input stage. (In many vision theories, the likelihood function depends on the image formation process and involves physical constraints, such as the geometry and surface reflectance. Because we are considering psychophysical stimuli consisting of moving dots, we can ignore such complications.) To determine our likelihood function, we must first specify the input stage. Ideally, this would involve modeling the behaviors of the cortical cells sensing natural image motion, but this would be too complex to be practical. Instead, we use a simplified model of a bank of receptive fields tuned to various velocities $\vec{v}_i(\vec{x},t)$, $i = 1 \cdots M$, and positioned at each image pixel. These cells have observation activities $\{w_i(\vec{x},t)\}$ that are intended to represent the output of a neuronally plausible motion model such as Grzywacz and Yuille (1990). (In our implementation, these simple model cells receive contributions from the motion of dots falling within a local neighborhood, typically the four nearest neighbors. Intuitively, the closer the dot is to the center of the receptive field and the closer its velocity is to the preferred velocity of the cell, the larger the response. The spatial profile and velocity tuning curve of these receptive fields are described by gaussian functions whose covariance matrices $\Sigma_{m:x}$ and $\Sigma_{m:v}$ depend on the direction of $\vec{v}(\vec{x},t)$, and are specified in terms of their longitudinal and transverse components $\sigma^2_{m:x,L}$, $\sigma^2_{m:x,T}$ and $\sigma^2_{m:v,L}$, $\sigma^2_{m:v,T}$.
For more details, see the appendix.) The likelihood function specifies the probability of the receptive field responses conditioned on a "true" external motion field. Ideally, this should correspond exactly to the way the motion measurements are generated. We make a simplification, however, by assuming that the measurements depend on only the velocity field at that specific position. This simplifies the mathematics at the risk of making the system slightly sensitive to motion blur (i.e., the measurement stage allows for several dots to influence the local measurement, provided the dots lie within the receptive field, but the likelihood function assumes that only a single dot causes the measurement.)
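The measurement stage just described can be sketched with gaussian spatial and velocity tuning. For simplicity this sketch uses isotropic covariances, whereas the paper's $\Sigma_{m:x}$ and $\Sigma_{m:v}$ depend on the preferred direction; the function names and parameter values are illustrative assumptions:

```python
import numpy as np

def cell_response(dot_positions, dot_velocities, center, preferred_v,
                  sigma_x=1.0, sigma_v=0.5):
    """Response of one motion-measurement cell: each dot in the neighborhood
    contributes by how close it is to the receptive-field center and how close
    its velocity is to the cell's preferred velocity (isotropic sketch)."""
    r = 0.0
    for x, v in zip(dot_positions, dot_velocities):
        spatial = np.exp(-np.sum((x - center) ** 2) / (2 * sigma_x ** 2))
        tuning = np.exp(-np.sum((v - preferred_v) ** 2) / (2 * sigma_v ** 2))
        r += spatial * tuning
    return r

# One dot at the center moving right: cells tuned to rightward motion respond most.
dots_x = [np.array([0.0, 0.0])]
dots_v = [np.array([1.0, 0.0])]
right = cell_response(dots_x, dots_v, np.array([0.0, 0.0]), np.array([1.0, 0.0]))
up = cell_response(dots_x, dots_v, np.array([0.0, 0.0]), np.array([0.0, 1.0]))
```

In a full implementation the responses $w_i(\vec{x},t)$ would then be normalized to sum to one at each position, as the theory requires.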
We can therefore write $P_l(\{\vec{w}(\vec{x},t)\}|\{\vec{v}(\vec{x},t)\}) = \prod_{\vec{x}} P_l(\vec{w}(\vec{x},t)|\vec{v}(\vec{x},t))$, where $\{\vec{v}(\vec{x},t)\}$ is the external velocity field. We now specify

$$P_l(\vec{w}(\vec{x},t)|\vec{v}(\vec{x},t)) = \frac{Y_l(w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t), \vec{v}(\vec{x},t))}{\int \cdots \int P_J(\xi_1, \xi_2, \ldots, \xi_M, \vec{v}_\xi(\vec{x},t))\, d\xi_1\, d\xi_2 \cdots d\xi_M}, \quad (3.1)$$

where $P_J$ is the joint probability distribution and we set $Y_l$ to be

$$Y_l(w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t), \vec{v}(\vec{x},t)) = \vec{w}(\vec{x},t) \cdot \vec{f}(\vec{v}(\vec{x},t)), \quad (3.2)$$

which is attractive (see Yuille, Burgi, & Grzywacz, 1998) because it leads to a simple linear update rule, and with tuning curves $\vec{f}$ given by:

$$f_i(\vec{v}(\vec{x},t)) = \frac{e^{-(1/2)(\vec{v}_i - \vec{v})^T \Sigma^{-1} (\vec{v}_i - \vec{v})}}{\sum_{j=1}^{M} e^{-(1/2)(\vec{v}_j - \vec{v})^T \Sigma^{-1} (\vec{v}_j - \vec{v})}}, \quad (3.3)$$
where $\Sigma$ is the covariance matrix that depends on the direction of $\vec{v}(\vec{x},t)$ and is specified in terms of its longitudinal and transverse components $\sigma^2_{l:v,L}$ and $\sigma^2_{l:v,T}$. The experimental data (Anstis & Ramachandran, 1987; Gottsdanker, 1956; Werkhoven et al., 1992) suggest that temporal integration occurs for velocity direction rather than for speed. This is built into our model by choosing the covariances so that the variance is bigger in the direction of motion than in the perpendicular direction (i.e., the velocity component perpendicular to the motion has mean zero and very small variance, so the direction of motion is fairly accurate, but the variance of the velocity component along the direction of motion is bigger, which means that the estimation of the speed is not accurate). We have assumed that the response of the measurement device is instantaneous. It would be possible to adapt the likelihood function to allow for a time lag, but we have not pursued this. Such a model might be needed to account for motion blurring.

3.2 Prior Function. We now turn to the prior function for position and velocity. For a dot moving at approximately constant velocity, the state transitions for position and velocity are given respectively by $\vec{x}(t+\delta) = \vec{x}(t) + \delta\vec{v}(t) + \vec{\omega}_x(t)$ and $\vec{v}(t+\delta) = \vec{v}(t) + \vec{\omega}_v(t)$, where $\vec{\omega}_x(t)$ and $\vec{\omega}_v(t)$ are random variables representing uncertainty in the position and velocity. Extending this model to apply to the flow of dots in an image requires a conditional probability distribution $P_p(\vec{v}(\vec{x},t)|\{\vec{v}_j(\vec{x}_i,t-\delta)\})$, where the set of spatial positions and the set of allowed velocities are discretized, with the spatial positions set to $\{\vec{x}_i : i = 1, \ldots, N\}$ and the velocities at $\vec{x}_i$ to $\{\vec{v}_j : j = 1, \ldots, M\}$. We are assuming that $P_p(\{\vec{v}(\vec{x}_i,t)\}|\cdot) = \prod_i P_p(\vec{v}(\vec{x}_i,t)|\cdot)$, so that the velocities at each point are predicted independent of each other.
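The direction-dependent covariance used here (and again for the prior below) can be built by rotating a diagonal matrix into the direction of motion. A sketch with illustrative variance values, larger along the motion than across it:

```python
import numpy as np

def motion_covariance(v, sigma_L=0.5, sigma_T=0.1):
    """Covariance aligned with velocity v: large variance along the motion
    (speed is uncertain), small variance across it (direction is reliable).
    sigma_L and sigma_T are illustrative, not values from the paper."""
    theta = np.arctan2(v[1], v[0])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    D = np.diag([sigma_L ** 2, sigma_T ** 2])
    return R @ D @ R.T

S = motion_covariance(np.array([1.0, 0.0]))
# For rightward motion the longitudinal axis is x, so S is diagonal here.
```

Swapping the two sigmas would instead make direction uncertain and speed reliable, the opposite of what the psychophysics suggests.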
We also assume that the prior is built out of two components: (1) the probability $p(\vec{x}|\vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta))$ that a dot at position $\vec{x}_i$ with velocity $\vec{v}_j$ at time $t-\delta$ will be present at $\vec{x}$ at time $t$, and (2) the probability $P(\vec{v}(\vec{x},t)|\vec{v}_j(\vec{x}_i,t-\delta))$ that it will have velocity $\vec{v}(\vec{x},t)$. These components are combined to give:

$$P_p(\vec{v}(\vec{x},t)|\{\vec{v}_j(\vec{x}_i,t-\delta)\}) = \frac{1}{K(\vec{x},t)} \sum_i \sum_j P(\vec{v}(\vec{x},t)|\vec{v}_j(\vec{x}_i,t-\delta))\, p(\vec{x}|\vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta)), \quad (3.4)$$
where $K(\vec{x},t)$ is a normalization factor chosen to ensure that $P_p(\vec{v}(\vec{x},t)|\{\vec{v}_j(\vec{x}_i,t-\delta)\})$ is normalized. The two conditional probability distribution functions in equation 3.4 are assumed to be normally distributed, and thus,

$$p(\vec{x}|\vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta)) \sim e^{-(1/2)(\vec{x} - \vec{x}_i - \delta\vec{v}_j(\vec{x}_i,t-\delta))^T \Sigma_x^{-1} (\vec{x} - \vec{x}_i - \delta\vec{v}_j(\vec{x}_i,t-\delta))} \quad (3.5)$$

$$P(\vec{v}(\vec{x},t)|\vec{v}_j(\vec{x}_i,t-\delta)) \sim e^{-(1/2)(\vec{v}(\vec{x},t) - \vec{v}_j(\vec{x}_i,t-\delta))^T \Sigma_v^{-1} (\vec{v}(\vec{x},t) - \vec{v}_j(\vec{x}_i,t-\delta))}. \quad (3.6)$$
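Equations 3.5 and 3.6 can be sketched numerically as two (unnormalized) gaussian factors, here with isotropic covariances standing in for the direction-dependent $\Sigma_x$ and $\Sigma_v$ (the time step and sigmas are illustrative assumptions):

```python
import numpy as np

def gauss(d, cov_inv):
    # Unnormalized gaussian factor exp(-d^T cov_inv d / 2).
    return np.exp(-0.5 * d @ cov_inv @ d)

def prior_terms(x, x_i, v, v_j, delta=0.1, sigma_x=0.2, sigma_v=0.3):
    # Eq. 3.5: a dot at x_i with velocity v_j is expected near x_i + delta*v_j.
    p_pos = gauss(x - x_i - delta * v_j, np.eye(2) / sigma_x ** 2)
    # Eq. 3.6: its velocity is expected to stay near v_j (constant-velocity belief).
    p_vel = gauss(v - v_j, np.eye(2) / sigma_v ** 2)
    return p_pos * p_vel

# A dot moving right is most likely found delta further right, same velocity.
v = np.array([1.0, 0.0])
on_path = prior_terms(np.array([0.1, 0.0]), np.zeros(2), v, v)
off_path = prior_terms(np.array([0.0, 0.1]), np.zeros(2), v, v)
```

Summing such products over all $(\vec{x}_i, \vec{v}_j)$ and dividing by $K(\vec{x},t)$ would give the full prior of equation 3.4.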
These functions express the belief that the dots are, on average, moving along a straight trajectory (see equation 3.5) with constant velocity (see equation 3.6). The covariances $\Sigma_x$ and $\Sigma_v$, which quantify the statistical deviations from the model, are defined in a coordinate system based on the velocity vector $\vec{v}_j$. These matrices are diagonal in this coordinate system and are expressed in terms of their longitudinal and transverse components $\sigma^2_{x,L}$, $\sigma^2_{x,T}$, $\sigma^2_{v,L}$, $\sigma^2_{v,T}$. (This model predicts future positions and velocities independently; it does not take into account the possibility that if the speed increases during the motion, then the dot will travel further. This is an acceptable, and standard, approximation provided that $\delta$ is small enough. It becomes exact in the limit as $\delta \to 0$.)

3.3 Continuous Prediction. Computer simulations of equation 2.6 (even for the simple case of moving dots) require large-kernel convolutions, which require excessive computational time. Instead, we reexpress equation 2.6 in the continuous time domain. (Such a reexpression is not always possible and depends on the specific choices of probability distributions. Note that a similar approach has been applied to the computation of stochastic completion fields by Williams and Jacobs (1997a, 1997b).) As shown in the appendix for gaussian priors, the evolution of $P(\vec{v}(\vec{x},t)|W(t-\delta))$ for $\delta \to 0$ satisfies a variant of a partial differential equation known as Kolmogorov's forward equation or the Fokker-Planck equation. Our equation is a nonstandard variant because, unlike standard Kolmogorov theory, our theory involves probability distributions at every point in space that interact over time. It is given by:

$$\frac{\partial}{\partial t} P(\vec{v}(\vec{x},t)) = \frac{1}{2} \left\{ \frac{\partial^T}{\partial\vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial\vec{v}} \right\} P(\vec{v}(\vec{x},t)) + \frac{1}{2} \left\{ \frac{\partial^T}{\partial\vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial\vec{x}} \right\} P(\vec{v}(\vec{x},t)) - \vec{v} \cdot \frac{\partial}{\partial\vec{x}} P(\vec{v}(\vec{x},t)) + \frac{1}{|V|} \frac{\partial}{\partial\vec{x}} \cdot \int \vec{v}\, P(\vec{v}(\vec{x},t))\, d\vec{v}. \quad (3.7)$$
where $|V|$ is the volume of velocity space $\int d\vec{v}$, and $\{\frac{\partial^T}{\partial\vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial\vec{x}}\}$ and $\{\frac{\partial^T}{\partial\vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial\vec{v}}\}$ are scalar differential operators (for isotropic diffusions, these scalar operators are Laplacian operators). The terms on the right-hand side of this equation stand for velocity diffusion (first term), spatial diffusion (second term) with deterministic drift (third term), and normalization (fourth term). This equation is local and describes the spatiotemporal evolution of the probability distributions $P(\vec{v}(\vec{x},t))$.

4 The Implementation
We now address the implementation of the update rule for motion estimation (see equation 2.7). At each spatial position, two separate cell populations are assumed: one for coding the likelihood function and another for coding motion estimation. If we were to choose the large-kernel convolution for motion prediction (see equation 2.6), then the network implementation would reduce to two positive linear networks multiplying each other followed by a normalization, as illustrated in Figure 1 (see Yuille et al., 1998). But a problem with implementing this network on serial computers is that the range of the connections between motion estimation cells depends on the magnitude of $\vec{v}_j$ (that is, we would need long-range connections for high speed tunings). Such long-range connections occur in biological systems, like the visual cortex, but are not well suited to very large-scale integrated or serial computer implementations. Alternatively, using our differential equation for predicting motion (see equation 3.7) involves only local connections. We now focus on this new implementation (see Figure 2). Kolmogorov's equation can be solved on an arbitrary lattice using a Taylor series expansion. The spatial term, which contains the diffusion and drift terms, can be written so that the partial derivatives at lattice points are functions of the values of the closest neighbor points. For the sake of numerical accuracy, we have chosen a spatial mesh system where the points are arranged hexagonally. Furthermore, we found it convenient to express the velocity vectors in polar coordinates, where their norm $s$ and angle $\theta$ can be represented on a rectangular mesh. The change of variable is given by $\tilde{p}(s,\theta) = s\, p(\vec{v})$. In polar coordinates, the differential operator $\frac{\partial^T}{\partial\vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial\vec{v}}$ becomes

$$L_v[\tilde{p}] = \frac{1}{2}\sigma^2_{v,L} \frac{\partial^2 \tilde{p}}{\partial s^2} - \frac{1}{2}\left(\sigma^2_{v,L} - \sigma^2_{v,T}\right) \frac{\partial}{\partial s}\left(s^{-1}\tilde{p}\right) + \frac{1}{2}\sigma^2_{v,T} \frac{1}{s^2} \frac{\partial^2 \tilde{p}}{\partial\theta^2}, \quad (4.1)$$
where $\sigma^2_{v,L}$ and $\sigma^2_{v,T}$ are the variances in the longitudinal and transverse directions, respectively.
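A finite-difference discretization of the operator in equation 4.1 can be sketched on a rectangular $(s, \theta)$ mesh with central differences and periodic wrapping (the grid sizes and variances are illustrative, and wrapping in $s$ is a simplification of the paper's boundary treatment, not its exact stencil):

```python
import numpy as np

def L_v(p_tilde, s, ds, dtheta, var_L=0.04, var_T=0.01):
    """Velocity operator of eq. 4.1 acting on p_tilde[s_index, theta_index].
    Central differences in s and theta; np.roll gives periodic wrapping."""
    d2_s = (np.roll(p_tilde, -1, 0) - 2 * p_tilde + np.roll(p_tilde, 1, 0)) / ds**2
    q = p_tilde / s[:, None]                         # s^{-1} * p_tilde
    d_s_q = (np.roll(q, -1, 0) - np.roll(q, 1, 0)) / (2 * ds)
    d2_th = (np.roll(p_tilde, -1, 1) - 2 * p_tilde + np.roll(p_tilde, 1, 1)) / dtheta**2
    return (0.5 * var_L * d2_s
            - 0.5 * (var_L - var_T) * d_s_q
            + 0.5 * var_T * d2_th / s[:, None]**2)

ns, nth = 8, 12
s = np.linspace(0.5, 2.0, ns)      # speed grid, kept away from s = 0
p = np.ones((ns, nth))
out = L_v(p, s, ds=s[1] - s[0], dtheta=2 * np.pi / nth)
```

Keeping the speed grid away from $s = 0$ avoids the $1/s$ and $1/s^2$ singularities of the polar-coordinate form.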
Figure 1: Network for motion estimation. The network consists of two interacting layers. The top square shows the observation layer, consisting of cells organized in columns on the (x, y) lattice. At each of the 12 spatial positions, there is a velocity column, which we display by two cells, shown as circles, with the adjacent arrows indicating their velocity tunings (here either horizontal or vertical). The lower square represents the estimation layer, which also consists of cells organized as columns on the (x, y) lattice. In the measurement stage, the observation cells influence the estimation cells by excitatory (indicated by the plus sign) connections and multiplicative interactions (indicated by the cross sign). The excitation is higher between cells with similar velocity tuning, which we indicate by strong (solid lines) or weak (dashed lines) connections. In the prediction stage, cells within the estimation layer excite each other (again with the strength of the excitation being largest for cells with similar velocity tuning). Inhibitory connections within the columns are used to ensure normalization.
Figure 2: Network for motion estimation using continuous prediction. (A) Continuous motion prediction (lower part of the network) is accomplished through nearest-neighbor spatial interactions, in this case on a hexagonal lattice. The velocity distribution is represented at each spatial position by a set of velocity cells (small circles) organized according to a polar representation (small panel). Observation cells in the measurement layer (upper part) influence the estimation cells by multiplicative interactions (indicated by ×) to yield the motion estimation a. (B) Interactions between two populations of velocity cells involve cells tuned to the same velocities; interactions within a population involve cells tuned to different velocities.
We let the activities of the $M$ motion estimation cells at time $t$ be represented by a vector $\vec{a}(\vec{x},t) = (a_1(\vec{x},t), \ldots, a_M(\vec{x},t))$. The iterative method for determining such activities involves a four-step scheme. The first three steps are concerned with solving Kolmogorov's equation. For a cell at position $\vec{x} = (x,y)$ and of velocity tuning $\vec{v}_m = (s,\theta)$, the three steps are:

$$a^{t+1/4}_{x,y,s,\theta} = a^{t}_{x,y,s,\theta} + \frac{\Delta t}{\Delta s} \sum_{\iota,\kappa} A_{\iota,\kappa}(s)\, a^{t}_{x,y,s+\iota\Delta s,\theta+\kappa\Delta\theta}$$

$$a^{t+1/2}_{x,y,s,\theta} = a^{t+1/4}_{x,y,s,\theta} + \sum_{i,\chi} B_{i,\chi}(s,\theta)\, a^{t+1/4}_{x+i,y+\chi,s,\theta}$$

$$a^{t+3/4}_{x,y,s,\theta} = a^{t+1/2}_{x,y,s,\theta} - \frac{\Delta t}{M} \sum_{s',\theta'} \sum_{i,\chi} B_{i,\chi}(s',\theta')\, a^{t+1/2}_{x+i,y+\chi,s',\theta'}, \quad (4.2)$$
where the coefcients Aõ, and Bi, are functions of s, h, and the covariance matrices. The constants D s, Dh represent the quantization factors for s and h. The superscripts t C 1 / 4, t C 1 / 2, t C 3 / 4 indicate the order in which these
Probabilistic Motion Estimation
1851
three steps proceed (a single time step for the complete system is broken down into four substeps for implementational convenience). The first step is to evaluate the velocity differential operator and involves four neighbor cells. The second step calculates the spatial differential operator and involves six neighbor points (on the hexagonal spatial lattice). Periodic boundary conditions are assumed in space and velocity. The third step performs the normalization. To guarantee stability of the whole iterative method, the time step Δt has been determined using von Neumann analysis (see Courant & Hilbert, 1953), which considers the independent solutions, or eigenmodes, of finite-difference equations. Typically, the stability conditions involve ratios between the time step and the quantization scales of space and velocity. If we are performing prediction without measurement, then the fourth step, determining the activity of the motion estimation cells, simply reduces to

$$a^{t+1}_{x,y,s,\theta} = a^{t+3/4}_{x,y,s,\theta}. \qquad (4.3)$$
If there are measurements, then we apply the motion estimation equation, 2.7, which consists of multiplying the motion prediction with the likelihood function (as defined by equations 2.8 and 2.9). Therefore, the complete update rule is

$$a^{t+1}_{x,y,s,\theta} = \frac{1}{N(\vec x, t+1)}\, a^{t+3/4}_{x,y,s,\theta} \cdot \sum_{j} w_j(\vec x, t)\, f_j(\vec v_m(\vec x, t)), \qquad (4.4)$$

where \(N(\vec x, t+1)\) is a normalization constant to impose \(\sum_{m=1}^{M} a_m(\vec x, t+1) = 1, \forall \vec x, t\), so that we can interpret these activities as the probabilities of the true velocities.

5 Methods
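Taken together, equations 4.1 through 4.4 define one full update cycle. The following is a minimal Python/NumPy sketch of that cycle; the callables `A_op` and `B_op` are hypothetical stand-ins for the coefficient sums involving A and B (which in the article are derived from the covariance matrices), and the array shapes are illustrative only.

```python
import numpy as np

def update_step(a, likelihood, A_op, B_op, dt):
    """One full time step of the four-substep scheme (cf. eqs. 4.1-4.4).

    a          : array (X, Y, V) of estimation-cell activities
    likelihood : array (X, Y, V) from the observation cells, or None
                 when no measurement is available (prediction only, eq. 4.3)
    A_op, B_op : callables applying the velocity and spatial differential
                 operators (stand-ins for the A and B coefficient sums)
    """
    a1 = a + dt * A_op(a)                    # substep 1: velocity operator
    a2 = a1 + dt * B_op(a1)                  # substep 2: spatial operator
    # substep 3: normalization correction averaged over velocity channels
    a3 = a2 - dt * B_op(a2).mean(axis=2, keepdims=True)
    # substep 4: multiply by the likelihood if a measurement exists (eq. 4.4)
    a_new = a3 if likelihood is None else a3 * likelihood
    # divide by N(x, t+1) so activities remain probabilities at each position
    return a_new / a_new.sum(axis=2, keepdims=True)
```

With `A_op` and `B_op` chosen as simple discrete Laplacians (pure diffusion in velocity and in space), repeated calls implement only the drift-free part of the prediction; the article's actual coefficients also encode the drift toward the predicted positions.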
Our model has two extreme forms of behavior depending on how well the input data agree with the model predictions. If the data are generally in agreement with the model's predictions, then it behaves like a standard Kalman filter (i.e., with gaussian distributions) and integrates out the noise. At the other extreme, if the data are extremely different from the prediction, then the model treats the data as outliers or distractors and ignores them. We are mainly concerned here with situations where the model rejects data. The precise situation where noise and distractors start getting confused is very interesting, but we do not address it here. We first checked the ability of our model to integrate out weak noisy stimuli for tracking a single dot moving in an empty background. For this situation, temporal grouping (data association) is straightforward, and so
Figure 3: Speed discrimination for two moving dots. This graph shows the improved ability of our theory to estimate the relative speed of two dots (one of which is moving at 2 jumps per second) as a function of the number of jumps. The level of 80% correct discrimination is reached for diminishing speed differences (measured by dV / V) when the number of jumps increases, consistent with psychophysics (McKee et al., 1986).
standard Kalman filters can be used. To evaluate the model, we ran a set of trials that plotted the relative estimation of two velocities as a function of the number of jumps. The graph we obtained (see Figure 3) is very similar to those reported in the psychophysical literature (McKee et al., 1986). Henceforth, we assume that the model is able to integrate out noisy input. For the remaining experiments, we assumed that the local velocity measurements were "noisy" in the sense that the measurement units specified a probability of velocity measurements rather than a single velocity. However, the "probability of velocities" itself was not noisy. We next tested our model on three psychophysical motion phenomena. First, we considered the improved accuracy of velocity estimation of a single dot over time (Snowden & Braddick, 1989; Watamaniuk, Sekuler, & Williams, 1989; Welch, Macleod, & McKee, 1997; Ascher, 1998). Then we examined the predictions for the motion occlusion experiments (Watamaniuk & McKee, 1995). Finally, we investigated the motion outlier experiments (Watamaniuk et al., 1994). For all simulations, we set the parameters as follows. All dots were moving at a speed of 6 jumps per second (jps), where one jump corresponds to the distance separating two neighboring pixels in the horizontal direction. The longitudinal and transverse components of the covariance matrices were
σ_{m:x,L} = 0.8 jump, σ_{m:x,T} = 0.4 jump, σ_{m:v,L} = 3.2 jps, and σ_{m:v,T} = 2.6 jps for the measurement; σ_{l:v,L} = 2.2 jps and σ_{l:v,T} = 1.1 jps for the likelihood function; and σ_{x,L} = 0.6 jump, σ_{x,T} = 0.3 jump, σ_{v,L} = 0.8 jps, and σ_{v,T} = 0.4 jps for the prior function. These parameters were chosen by experimenting with the computer implementation, but the results are relatively insensitive to their precise values; detailed fine tuning was not required. Except for one of the occluder experiments (the occluder defined by distractor motion), where we used 18 velocity channels (six equidistant directions and three speeds), we used 30 velocity channels positioned at each spatial position, that is, six equidistant directions (thus Δθ = 60 degrees, starting at the origin) and five speeds (Δs = 2 jps, with the slowest speed channel tuned to 2 jps). To guarantee stability of the numerical method, we set the time increment to Δt = 0.6 ms. Initial conditions were uniformly distributed density functions. There were 32 × 32 spatial positions.

6 Results

6.1 Velocity Estimation in Time. For this experiment, the stimulus was a single dot moving in an empty background. To evaluate how the velocity estimation evolves in time, we measured two numbers. The first number was the sharpness of the velocity probability at each position, which we define to be the negative of the entropy of the distribution plus the maximum of the entropy (to ensure that the sharpness is nonnegative); the entropy of the distribution is given by \(-\sum_{j=1}^{M} a_j(\vec x, t) \log a_j(\vec x, t)\). Observe that the sharpness is the Kullback-Leibler divergence \(D(a \| U)\) between the velocity distribution \(a_j(\vec x, t)\) and the uniform distribution \(U_j = 1/M, \forall j\) (more precisely, \(D(a \| U) = \sum_{j=1}^{M} a_j(\vec x, t) \log\{a_j(\vec x, t) / U_j\}\)). As is well known (Cover & Thomas, 1991), the Kullback-Leibler divergence is positive semidefinite, taking its minimum value of 0 when \(a_j = U_j, \forall j\), and increases the more \(a_j\) differs from the uniform distribution \(U_j\) (i.e., the sharper the distribution \(a_j\) becomes). Hence, the higher this sharpness is, the more precise the velocity estimate is. Moreover, the sharpness will be lowest in positions where there are no dots moving (i.e., for which \(w_j(\vec x, t) = 1/M, j = 1, \ldots, M\)). The target, a coherently moving dot, should have a relatively high sharpness response surrounded by low sharpness responses in neighboring positions. The second number is the confidence factor \(P(\vec w(\vec x, t) \mid W(t - \delta))\) in equation 2.7. This measures how probable the model judges the new input data. The higher this number is, the more the new measurement is consistent with the prediction (and hence the more likely that the measurement is due to coherent motion). The results of motion estimation for this experimental stimulus are shown in Figure 4. This figure shows an initial fast decrease in the entropy of the velocity distribution followed by a slow decrease (i.e., an increase in sharpness of the density function). Also shown is the increase in the confidence
Figure 4: Single dot experiment. A single dot is moving with constant velocity. (A) Increasing accuracy of the velocity estimate with time is visible by an increase in the confidence factor and the sharpness of the density function. Observe how both the confidence and sharpness increase rapidly at first but then quickly start to flatten out. (Our model outputs a data point at each time frame, and our plotting package joins them together by straight-line segments.) (B) Sharpness of the density function shown for each spatial position at successive time frames. The system converges to a single peak in the velocity estimation (though we do not show this here).
factor of the target velocity estimation at each new iteration. These two effects indicate the increased accuracy of the velocity estimation with each new iteration, consistent with the psychophysical literature (Snowden & Braddick, 1989; Watamaniuk et al., 1989; Welch et al., 1997; Ascher, 1998). Note that for coherent motion, the sharpness of the model increases rapidly at first and then appears to slow down. Our results suggest that the model reaches an asymptotic limit, although they do not completely rule out a continual gradual increase. We argue that asymptotic behavior is more likely, as both the predictions and the measurements have inherent noise that can never be eliminated. Moreover, it can be proven that the standard (gaussian) Kalman model will converge to an asymptotic limit depending on the variances of the prior and likelihood functions (we verified this by computer simulations). It seems also that human observers reach a similar asymptotic limit, and their accuracy does not become infinitely good with an increasing number of input stimuli. In addition, we plot the sharpness of the velocity estimates as a function of spatial position in Figure 4B (each column of velocity-tuned cells has a sharpness associated with its velocity estimate). We also tested the estimation of velocity for a dot moving on a circular trajectory, and our model can successfully estimate the dot's speed, although the sharpness and confidence are not as high as they are for straight motion (recall that the prior prefers straight trajectories). This again seems consistent with the psychophysics (Watamaniuk et al., 1994; Verghese et al., 1999).

6.2 Single Dot with Occluders. Next, we explored what would happen if the target dot went behind an occluding region where no input measurements could be made (we call this a black occluder). The results were the same as in the previous case until the target dot reached the occluder.
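Behind the occluder the update reduces to the prediction step alone. A toy one-dimensional sketch (with a simple periodic diffusion kernel standing in for the full prediction operator, an assumption for illustration only) shows how the entropy of the velocity distribution rises while no measurements arrive:

```python
import numpy as np

def predict_only(a, n_steps, dt=0.1):
    """Repeated prediction without measurement: a diffusion stand-in
    for iterating eq. 4.3, defined on a periodic velocity axis."""
    for _ in range(n_steps):
        lap = np.roll(a, 1) + np.roll(a, -1) - 2 * a   # discrete Laplacian
        a = a + dt * lap
        a = a / a.sum()          # keep it a probability distribution
    return a

def entropy(a):
    """Shannon entropy; its rise corresponds to a loss of sharpness."""
    return -float(np.sum(a * np.log(a + 1e-12)))
```

Starting from a sharply peaked distribution, the entropy grows toward its maximum log M as the occlusion lengthens, which is the flattening of the density function visible in Figure 5.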
In the occluded region, the motion model continued to propagate the target dot, but, lacking supporting observations, the probability distribution started to diffuse in space and in velocity. This dual diffusion can be seen in Figure 5, where the sharpness of velocity estimation is shown after the dot entered the occluding region. Observe the decrease in sharpness of the density function, indicating a degradation of velocity estimation, when the target is behind the occluder. However, the model still has enough "momentum" to propagate its estimate of the target dot's position even though no measurements are available (this will break down if the occluder gets too large). This is consistent with the findings of Watamaniuk and McKee (1995), who showed that observers had a higher-than-average chance of detecting the target when it emerges from the occluder into a noisy background. We then tested our model with motion occluders (occluders defined by motion flows) as described in Watamaniuk and McKee (1995) (see Figure 6A). We use the same measures as for the single isolated dot. We also plot the cells' activities as a function of the direction tuning for the cells tuned to the optimal speed (see Figure 6B). The plot indicates that the cells can signal nonzero probabilities for several motions at the same time. After entering the
Figure 5: Single dot experiment with black occluder. (A) This representation shows how the spatial blur and sharpness of the velocity distribution change with time (measured in time frames) when a single dot moving along the y-axis gets occluded and reappears. The occluding area is from y = 15 up to y = 22. (B) The y-location of the peak of the velocity distribution as a function of time. Motion inertia (momentum) keeps the peak moving during the occlusion, albeit with a tendency to slow down (as visible by the wider plateau around time frame 20).
Figure 6: Single dot experiment with occluders defined by distractor motion. (A) The occluder is defined by the vertical motion of distractor dots, visible as circles in the top-left frame with arrows indicating velocity (the dots' speed is 6 jumps per second and is represented by the length of the lines). The target motion, shown at time frame 15 and visible as a diamond, is moving horizontally from left to right. The height and width of the rectangles at each point indicate the confidence and sharpness, respectively (a small rectangle indicates low confidence and sharpness). The lattice, marked with small dots, is hexagonal for the measurement, estimation, and prediction stages. (B) The probability distribution of the velocity cells tuned to 6 jumps per second plotted as a function of directional tuning at different time steps. The motion-inertia effect of the target motion on the distractors is visible at time frame 18 as the target dot enters the occluder and two peaks start developing in the probability distribution. The bigger the occluder, the more the peak induced by the motion of the distractor dots starts dominating. But as the dot reemerges from the occluder, it rapidly becomes sharp again, as visible at time frame 20.
occluding region, the peak corresponding to the target motion gets smaller than the peak due to the occluders. However, the peak of the target remains, and so the target peak can increase rapidly as the target exits from the occluders. Our model predicts that it is easier to deal with black occluders than with motion occluders, which is not consistent with the psychophysics, which shows a rough similarity of effects for both types of occluders. We offer two possible explanations for this inconsistency. First, the model's directionally selective units and/or prediction-propagation mechanism may have too broad a directional tuning; narrowing it may lead to weaker interactions between the motion occluder and the target. Second, in contrast with the black occluders, the motion occluders may group together as a surface, which would help explain why the visual system may be more tolerant of motion occluders. (Such a heightened tolerance might compensate for the initial putative advantage of the black occluders.) Kanizsa (1979) demonstrates several perceptual phenomena that appear to require this type of explanation. In motion, similar effects have been modeled by Wang and Adelson (1993) in their work on layered motion. A limitation of our current model is that it does not include such effects.

6.3 Outlier Detection. In the outlier detection experiments (Watamaniuk et al., 1994; Verghese et al., 1999), the target dot is undergoing regular motion, but it is surrounded by distractor dots, which are undergoing random Brownian motion. To show the velocity estimation at each position, we plot the response of our network using a rectangle to display the two properties of sharpness and confidence. The width of the rectangle gives the sharpness of the set of cells at that point, and the height gives the confidence factor. The arrow shows the mean of the estimated velocity.
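The sharpness measure used in these displays, defined in section 6.1 as the Kullback-Leibler divergence of the velocity distribution from the uniform distribution, is straightforward to compute. A minimal sketch:

```python
import numpy as np

def sharpness(a):
    """D(a || U) for a discrete velocity distribution a summing to 1.

    Zero for the uniform distribution U_j = 1/M, and equal to log M
    for a distribution concentrated on a single velocity channel.
    """
    a = np.asarray(a, dtype=float)
    U = 1.0 / a.size
    nz = a > 0                      # use the 0 * log 0 = 0 convention
    return float(np.sum(a[nz] * np.log(a[nz] / U)))
```

Equivalently, the sharpness is log M minus the entropy of a, the form used in the text.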
It can be seen in Figure 7 that the target dot's signal rapidly reaches large sharpness and confidence by comparison to the distractor dots, which are not moving coherently enough to gain confidence or sharpness. The sharpness of the target dot's signal does not grow monotonically because the distractor dots sometimes interfere with the target's motion by causing distracting activity in the measurement cells.

7 Discussion
This work provides a theory for temporal coherence. The theory was formulated in terms of Bayesian estimation using motion measurements and motion prediction. By incorporating robustness in the measurement model, the system could perform temporal grouping (or data association), which enabled it to choose what data should be used to update the state estimation and what data could be ignored. We also derived a continuous form for the prediction equation, and showed that it corresponds to a variant of Kolmogorov’s equations at each node. In deriving our theory, we made an assumption of spatial factorizability of the probability distributions and
Figure 7: Outlier detection. A target dot (shown as a diamond with a line pointing rightward, indicating its velocity) is halfway down the stimulus frame, moving from left to right with noise-free motion. It is surrounded by distractor dots undergoing Brownian motion. After a few time steps, the model gives high confidence and sharpness for the target dot. Observe that motion estimates of the distractor dots can sometimes become sharp if their motion direction is not changing too radically between two time frames (see the large rectangles at the bottom of the estimation and prediction panels). Graphic conventions as in Figure 6.
made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argued that these assumptions and approximations (as embodied in our choice of prior probability) are suitable for the stimuli we are considering, but they would need to be revised to include spatial coherence effects (Yuille & Grzywacz, 1988, 1989). The prior distribution used in this article would need to be modified to incorporate these effects. In addition, recent results by McKee and Verghese (personal communication) suggest that a more sophisticated prior may be needed. Their results seem to show that observers are better at predicting where there is likely to be motion rather than what the velocity will be. Recall that our prior probability model has two predictive components based on the current estimation of velocity at each image position. First, the velocity estimation is used to predict the positions in the image where the velocity cells would be excited. Second, we predict that the new velocity is close to the original velocity. By contrast, McKee and Verghese's results seem to show that human observers can cope with reversals of motion direction (such as a ball bouncing off a wall). This is a topic of current research, and we note that priors that allow for this reversal have already been developed by the computer vision tracking community (Isard & Blake, 1998). Simulations of the model seem in qualitative agreement with a number of existing psychophysical experiments on motion estimation over time, motion occluders, and motion outliers (McKee et al., 1986; Watamaniuk & McKee, 1995; Watamaniuk et al., 1994). Such qualitative agreement is characterized by three main features: (1) the final covariance of the motion estimate, (2) the number of jumps needed to reach such a final covariance, and (3) the extent of motion inertia during an occlusion, or between two measurements. These three features are controlled as follows: the likelihood's covariance matrix determines the initial distribution, and thus the number of jumps to reach the desired covariance, while the prior's covariance matrix affects all three features by imposing an upper bound on the covariance and setting the time constants of the diffusion process between measurements. Implementation of the model's equations led to a network that shares interesting similarities with known properties of the visual cortex. It involves a columnar structure where the columns, at each spatial point, are composed of a set of cells tuned to a range of velocities (columnar structures exist in the cortex, but none has yet been found for velocities). The observation cell layer involves local connections within the columns to compute the likelihood function. The estimated velocity flow is computed in a second layer. If these layers exist, it would be natural for them to lie in cortical areas involved with motion, such as MT and MST (Maunsell & Newsome, 1987; Merigan & Maunsell, 1993). The estimation layer carries out the calculation according to Kolmogorov's equation, or the discrete long-range version described in Yuille et al. (1998), and consists of spatial columns of cells that locally excite each other to predict the expected velocities in the future.
The calculation requires excitatory and inhibitory synapses (i.e., no "negative synapses"). The inputs from the observation layer multiply the predictions in the estimation layer. Finally, we postulate that inhibition occurs in each column to ensure normalization of the velocity estimates at each column. This normalization could be implemented by lateral inhibition, as proposed by Heeger (1992). A difficult aspect of this network, as a biological model, is its need for multiplications (such multiplications arise naturally from our probabilistic formulation, using the rules for combining probabilities). Neuronal multiplication mechanisms were argued for by Reichardt (1961) and Poggio and Reichardt (1973, 1976) on theoretical grounds. A specific biophysical model to approximate multiplication was proposed by Torre and Poggio (1978). Detailed investigations of this model, however, showed that it provided at best a rough approximation to multiplication (Grzywacz & Koch, 1987). Moreover, experiments showed that motion computations in the rabbit retina, for which the model was originally proposed, were not well approximated by multiplications (Grzywacz, Amthor, & Mistler, 1990). Recent related work (Mel, Ruderman, & Archie, 1998; Mel, 1993), though arguing
for multiplication-like processing, has not attempted to attain good approximations to multiplication. On the other hand, the complexities of neurons make it hard to rule out the existence of domains in which they perform multiplications. Overall, it seems best to be agnostic about neuronal multiplication and trust that more detailed experiments will resolve this issue. More pragmatically, it may be best to favor neuronal models that correspond to known biophysical mechanisms. The biophysics of neurons may prevent them from performing "perfect" Bayesian computations, and perhaps one should instead seek optimal models that respect these biophysical limitations. Finally, we emphasize that simple networks of the type we propose can implement Bayesian generalizations of Kalman filters for estimation and prediction. The popularity of standard linear Kalman filters is often due to pragmatic reasons of computational efficiency. Thus, linear Kalman filters are often used even when they are inappropriate as models of the underlying processes. In contrast, the main drawback of such Bayesian generalizations is the high computational cost, although statistical sampling methods can be used successfully in some cases (see Isard & Blake, 1996). Our study shows that these computational costs may be better dealt with by using networks with local connections, and we are investigating the possibility of implementing our model on parallel architectures in the expectation that we will then be able to do Bayesian estimation of motion in real time.

Appendix

A.1 Receptive Field Tunings. More precisely, we set:

$$w_i(\vec x, t) = \frac{y_i(\vec x, t)}{\sum_{j=1}^{N} y_j(\vec x, t)},$$

where

$$\begin{aligned} y_i(\vec x, t) = A &+ \lambda_1 \sum_{\vec x' \in \mathrm{Nbh}(\vec x)} G\bigl(\vec x' - \vec x : \hat v_T(\vec x', t), \sigma^{o}_{x,l}, \sigma^{o}_{x,t}\bigr)\, G\bigl(\vec v_T(\vec x', t) - \vec v_i : \vec v_T(\vec x', t), \sigma^{o}_{v,l}, \sigma^{o}_{v,t}\bigr) \\ &+ (1 - \lambda_1) \sum_{\vec x' \in \mathrm{Nbh}(\vec x)} G\bigl(\vec x' - \vec x : \hat v_T(\vec x', t - \delta), \sigma^{o}_{x,l}, \sigma^{o}_{x,t}\bigr)\, G\bigl(\vec v_T(\vec x', t - \delta) - \vec v_i : \vec v_T(\vec x', t - \delta), \sigma^{o}_{v,l}, \sigma^{o}_{v,t}\bigr), \end{aligned}$$

and the three terms of the right-hand side of the equation are explained as follows. The first term, A, is the rest activity of the cells. It ensures that we will have observations \(w_i(\vec x, t) = 1/N, \forall i\) if no velocity is present. In other words, if no dot moves across the cell, then all velocities are considered to be equally likely.
The second term contains summations over true velocities (i.e., dots) within the receptive field of the cell. It contains a spatial decay term \(G(\vec x' - \vec x : \cdot, \cdot, \cdot)\) and a velocity tuning curve \(G(\vec v_T(\vec x') - \vec v_i : \cdot, \cdot, \cdot)\). G is a gaussian function with longitudinal and transverse variances \(\sigma_{\cdot,l}, \sigma_{\cdot,t}\) defined in terms of the direction \(\hat v_T(\vec x', t)\). In addition, the velocity tuning can depend on the speed \(\vec v_T(\vec x', t)\). The observation model therefore assumes that the cell can measure the components of position and velocity most accurately in the direction perpendicular to the true motion of the stimulus. Intuitively, the closer the dot is to the center of the receptive field and the closer its velocity is to the preferred velocity of the cell, the larger the response is. In realistic motions (i.e., not dots!) we can typically assume that the velocity will be constant within each cell except at motion boundaries and transparencies (though there are receptive fields at multiple scales, and so the velocity may not be constant within the large cells). The third term is similar to the second, except that it contains memory of the true velocity at time t − δ. This "memory" is due to the delayed response of the cell. Since 0 ≤ λ₁ ≤ 1, the third term is always positive. The response of the observation cells at a point \(\vec x\) at time t can be thought of as a local estimate of the probability of velocity. If there is no true velocity to be observed, then the response of all these cells would be uniform (i.e., \(w_i(\vec x, t) = 1/N, \forall i\)). If the cells receive input from a single true velocity, then the observation cell best tuned to this velocity will have the biggest response. The closer the true velocity stimulus (i.e., the dot) is to the center of the cell, the stronger the peak. By contrast, multiple peaks can occur if there are several motions in the receptive field. This can occur for motion transparency, near-motion boundaries, and occluding motion.
It can also occur in our model if there is one true motion in the receptive field at time t − δ and a different motion at time t.

A.2 Kolmogorov's Equation. In this appendix, we derive Kolmogorov's equation for motion prediction. Let us consider equation 2.6 for δ → 0. Taking this limit is tricky and requires Itô calculus (Jazwinski, 1970). Our case is also nonstandard because our theory involves probability distributions at every point in space, which interact over time. This means that we cannot simply use the standard Kolmogorov's equations, which apply to a single distribution only. Here we give a simple derivation that uses delta functions and is appropriate when the prior model is based on a gaussian G(.). The prior specified by equations 3.4 and 3.5 is written as:

$$P(\vec v(\vec x, t + \delta) \mid \{\vec v\,'(\vec x', t)\}) = \int d\vec x' \int d\vec v\,'(\vec x')\, G\bigl(\vec v(\vec x, t + \delta) - \vec v\,'(\vec x', t) : \delta \Sigma_{\vec v}\bigr) \times \frac{G\bigl(\vec x - \vec x' - \delta \vec v\,'(\vec x', t) : \delta \Sigma_{\vec x}\bigr)}{\int d\vec x'' \int d\vec v\,''(\vec x'')\, G\bigl(\vec x - \vec x'' - \delta \vec v\,''(\vec x'', t) : \delta \Sigma_{\vec x}\bigr)}. \qquad \text{(A.1)}$$
The normalization term in the denominator is used to ensure that the probabilities of the velocities integrate to one at each spatial point \(\vec x\). In this equation, we have scaled the covariances Σ by δ. This is necessary for taking the limit as δ → 0 (Jazwinski, 1970). Between observations, we can express the evolution of the prior density \(P(\vec v(\vec x, t))\) as

$$\frac{\partial P(\vec v(\vec x, t))}{\partial t} = \lim_{\delta \to 0} \frac{P(\vec v(\vec x, t + \delta) \mid \{\vec v\,'(\vec x', t)\}) - P(\vec v(\vec x, t))}{\delta} = \lim_{\delta \to 0} \frac{1}{\delta} \left\{ \int d\vec x' \int d\vec v\,'(\vec x')\, P(\vec v(\vec x, t + \delta) \mid \{\vec v\,'(\vec x', t)\})\, P(\{\vec v\,'(\vec x', t)\}) - P(\vec v(\vec x, t)) \right\}. \qquad \text{(A.2)}$$

We now perform a Taylor series expansion of \(P(\vec v(\vec x, t + \delta) \mid \{\vec v\,'(\vec x', t)\})\) in powers of δ, keeping the zeroth- and first-order terms (higher-order terms will vanish when we take the limit as δ → 0). To perform this expansion, we use the assumption that this distribution is expressed in terms of gaussians. As δ → 0, these gaussians will tend toward delta functions, and the derivatives of gaussians will tend toward derivatives of delta functions (thereby simplifying the expansion). This derivation can be justified rigorously, paying detailed attention to convergence issues, by the use of distribution theory or the application of Itô calculus. If we expand \(G(\vec v(\vec x, t + \delta) - \vec v\,'(\vec x', t) : \delta \Sigma_{\vec v})\) about δ = 0, we find that the zeroth-order term is a Dirac delta function with argument \(\vec v(\vec x, t + \delta) - \vec v\,'(\vec x', t)\). This term can therefore be integrated out. The derivative with respect to δ will effectively correspond to differentiating the gaussian with respect to the covariance. By standard properties of the gaussian, this will be equivalent to a second-order derivative. We will get similar terms if we expand \(G(\vec x - \vec x' - \delta \vec v\,'(\vec x', t) : \delta \Sigma_{\vec x})\), but we will also get an additional drift term from differentiating the \(\delta \vec v\,'(\vec x', t)\) argument. In addition, we will get other terms from the denominator of equation A.1.
(These are required to normalize the distributions and are nonstandard. They are needed because we have a differential equation for a set of interacting probability distributions, while the standard Kolmogorov's equation is for a single probability distribution.) We collect all these zeroth- and first-order terms in the expansion and substitute into equation A.2. We can then evaluate the integrals using known properties of the delta functions. After some algebra, these integrals yield our variant of Kolmogorov's equation:

$$\frac{\partial}{\partial t} P(\vec v(\vec x, t)) = \frac{1}{2} \left\{ \frac{\partial^{T}}{\partial \vec x} \Sigma_{\vec x} \frac{\partial}{\partial \vec x} \right\} P(\vec v(\vec x, t)) + \frac{1}{2} \left\{ \frac{\partial^{T}}{\partial \vec v} \Sigma_{\vec v} \frac{\partial}{\partial \vec v} \right\} P(\vec v(\vec x, t)) - \vec v \cdot \frac{\partial}{\partial \vec x} P(\vec v(\vec x, t)) + \frac{1}{|V|} \frac{\partial}{\partial \vec x} \cdot \int \vec v\, P(\vec v(\vec x, t))\, d\vec v, \qquad \text{(A.3)}$$

where \(\{\frac{\partial^{T}}{\partial \vec x} \Sigma_{\vec x} \frac{\partial}{\partial \vec x}\}\) and \(\{\frac{\partial^{T}}{\partial \vec v} \Sigma_{\vec v} \frac{\partial}{\partial \vec v}\}\) are scalar differential operators, and |V| is the volume of velocity space \(\int d\vec v\).
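For intuition (this special case is our illustrative addition, not part of the original derivation), consider one spatial dimension, a single deterministic velocity v, scalar covariances, and drop the normalization term. Equation A.3 then reduces to the familiar drift-diffusion (Fokker-Planck) form:

```latex
\frac{\partial P(x,t)}{\partial t}
  \;=\; \frac{\Sigma_{x}}{2}\,\frac{\partial^{2} P(x,t)}{\partial x^{2}}
  \;-\; v\,\frac{\partial P(x,t)}{\partial x} .
```

The first term spreads the prediction at a rate set by the prior's positional covariance, and the second transports it at the current velocity estimate; together they account for the motion-inertia and flattening behavior seen behind occluders in section 6.2.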
Acknowledgments
This work was supported by AFOSR grant F49620-95-1-0265 to A. L. Y. and N. M. G. Further support to A. L. Y. came from ARPA and the Office of Naval Research, grant number N00014-95-1-1022. In addition, N. M. G. received support from grants by the National Eye Institute (EY08921 and EY11170) and the William A. Kettlewell chair. Finally, we thank the National Eye Institute for a core grant to Smith-Kettlewell (EY06883). We appreciate useful comments from anonymous referees.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatio-temporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Anstis, S. M., & Ramachandran, V. S. (1987). Visual inertia in apparent motion. Vision Res., 27, 755–764.
Ascher, D. (1998). Human visual speed perception: Psychophysics and modeling. Unpublished doctoral dissertation, Brown University.
Bar-Shalom, Y., & Fortmann, T. E. (1988). Tracking and data association. Orlando, FL: Academic Press.
Blake, A., & Yuille, A. L. (Eds.). (1992). Active vision. Cambridge, MA: MIT Press.
Burr, D. C., Ross, J., & Morrone, M. C. (1986). Seeing objects in motion. Proc. Royal Soc. Lond. B, 227, 249–265.
Chin, T. M., Karl, W. C., & Willsky, A. S. (1994). Probabilistic and sequential computation of optical flow using temporal coherence. IEEE Trans. Image Proc., 3, 773–788.
Clark, J. J., & Yuille, A. L. (1990). Data fusion for sensory information processing systems. Norwell, MA: Kluwer.
Courant, R., & Hilbert, D. (1953). Methods of mathematical physics (Vol. 1). New York: Interscience.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Gottsdanker, R. M. (1956). The ability of human operators to detect acceleration of target motion. Psych. Bull., 53, 477–487.
Grzywacz, N. M., Amthor, F. R., & Mistler, L. A. (1990). Applicability of quadratic and threshold models to motion discrimination in the rabbit retina. Biol. Cybern., 64, 41–49.
Grzywacz, N. M., & Koch, C.
(1987). Functional properties of models for direction selectivity in the retina. Synapse, 1, 417–434.
Grzywacz, N. M., Smith, J. A., & Yuille, A. L. (1989). A computational framework for visual motion's spatial and temporal coherence. In Proc. IEEE Workshop on Visual Motion. Irvine, CA.
Probabilistic Motion Estimation
1865
Grzywacz, N. M., Watamaniuk, S. N. J., & McKee, S. P. (1995). Temporal coherence theory for the detection and measurement of visual motion. Vision Res., 35, 3183–3203. Grzywacz, N. M., & Yuille, A. L. (1990). A model for the estimate of local image velocity by cells in the visual cortex. Proceedings of the Royal Society of London B, 239, 129–161. Hassenstein, B., & Reichardt, W. E. (1956). Systemtheoretische analyse der zeit-, reihenfolgen- und vorzeichenauswertung bei der bewegungsperzeption des russelkafers. Chlorophanus Z. Naturforsch., 11b, 513–524. Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neurosci., 9, 181–197. Ho, Y-C., & Lee, R. C. K. (1964). A Bayesian approach to problems in stochastic estimation and control. IEEE Trans. on Automatic Control, 9, 333–339. Holub, R. A. & Morton-Gibson, M. (1981). Response of visual cortical neurons of the cat to moving sinusoidal gratings: Response-contrast functions and spatiotemporal integration J. Neurophysiol., 46, 1244–1259. Huber, P. J. (1981). Robust statistics. New York: Wiley. Isard, M., & Blake A. (1996).Contour tracking by stochastic propagation of conditional density. In Proc. Europ. Conf. Comput. Vision (pp. 343–356). Cambridge, UK. Isard, M., & Blake, A. (1998).A mixed-state condensation tracker with automatic model-switching. Proc. Int. Conf. Comp. Vis. (pp. 107–112). Mumbai, India. Jazwinski, A. H. (1970). Stochastic processes and ltering theory. Orlando, FL: Academic Press. Kalman, R. E. (1960).A new approach to linear ltering and prediction problems. Trans. ASME, D: J. Basic Engineering, 82, 35–45. Kanizsa, G. (1979). Organization in vision: Essays on gestalt perception. New York: Praeger. Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232, 519–548. Matthies, L., Kanade, T., & Szeliski, R. (1989). 
Kalman lter-based algorithms for estimating depth from image sequences. Int. J. Comput. Vision, 3, 209–236. Maunsell, J. H. R., & Newsome, W. T. (1987).Visual processing in monkey extrastriate cortex. In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson, (Eds.), Annual reviews of neuroscience (Vol. 10, pp. 363–401). Palo Alto, CA: Annual Reviews. McKee, S. P., Silverman, G. H., & Nakayama, K. (1986). Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Res., 26, 609–619. Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101. Mel, B. W., Ruderman, D. L., & Archie, K. A. (1988). Translation-invariant orientation tuning in visual complex cells could derive from intradendritic computations. J. Neurosci., 1, 4325–34. Merigan, W. H., & Maunsell, J.H. (1993). How parallel are the primate visual pathways? In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson
1866
P.-Y. Burgi, A. L. Yuille, and N. M. Grzywacz
(Eds.), Annual reviews of neuroscience (Vol. 16, pp. 369–4021). Palo Alto, CA: Annual Reviews. Nakayama, K. (1985). Biological image motion processing: A review. Vision Res., 25, 625–660. Nishida, S., & Johnston, A. (1999). Inuence of motion signals on the perceived position of spatial pattern. Nature, 397, 610–612. Poggio, T. & Reichardt, W. E. (1973). Considerations on models of movement detection. Kybernetik, 13, 223–227. Poggio, T., & Reichardt, W. E. (1976). Visual control of orientation behaviour in the y: Part II: Towards the underlying neural interactions. Q. Rev. Biophys., 9, 377–438. Ramachandran, V. S., & Anstis, S. M. (1983). Extrapolation of motion path in human visual perception. Vision Res., 23, 83–85. Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties of the visual cortex. Neural Computation, 9, 721–734. Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the nervous system. In Rosenblith (Ed.), Sensory communication. New York: Wiley. Snowden, R. J., & Braddick, O. J. (1989). Extension of displacement limits in multiple-exposure sequences of apparent motion. Vision Res., 29, 1777–1787. Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. Lond. B, 202, 409–416. van Santen, J. P. H., & Sperling, G. (1984). A temporal covariance model of motion perception J. Opt. Soc. A., A 1, 451–473. Verghese, P., Watamaniuk, S. N. J., McKee, S. P., & Grzywacz, N. M. (1999). Local motion detectors cannot account for the detectability of an extended motion trajectory in noise. Vision Res., 39, 19–30. Wang, J. Y. A., & Adelson, E. H. (1993). Layered representation for motion analysis. In Proc. IEEE Computer Sociey Conference on Computer Vision and Patern Recognition 1993 (pp. 361–366). New York. Watamaniuk, S. N. J., & McKee, S. P. (1995). Seeing motion behind occluders. Nature, 377, 729–730. 
Watamaniuk, S. N. J., McKee, S. P., & Grzywacz, N. M. (1994). Detecting a trajectory embedded in random-direction motion noise. Vision Res., 35, 65–77. Watamaniuk, S. N. J., Sekuler, R., & Williams, D. W. (1989). Direction perception in complex dynamic displays: The integration of direction information. Vision Research, 29, 47–59. Watson, A. B., & Ahumada, A. J. (1985). Model of human visual motion sensing. J. Opt. Soc. A, A 2, 322–342. Welch, L., Macleod, D.I., & McKee, S. P. (1997). Motion interference: Perturbing perceived direction. Vision Res, 37, 2725–2736. Werkhoven, P., Snippe, H. P., & Toet, A. (1992). Visual processing of optic acceleration. Vision Res., 32, 2313–2329. Williams, L. R., & Jacobs, D. W. (1997a).Local parallel computation of stochastic completion elds. Neural Comp., 9, 859–881. Williams, L. R., & Jacobs, D. W. (1997b). Stochastic completion elds: A neural
Probabilistic Motion Estimation
1867
model of illusory contour shape and salience. Neural Comp., 9, 837–858. Yuille, A. L., Burgi, P.-Y., & Grzywacz, N. M. (1998). Visual motion estimation and prediction: A probabilistic network model for temporal coherence. In Proceedings of the Sixth International Conference on Computer Vision (pp. 973– 978). New York: IEEE Computer Society. Yuille, A. L., & Grzywacz, N. M. (1988). A computational theory for the perception of coherent visual motion. Nature, 333, 71–74. Yuille, A. L. & Grzywacz, N. M. (1989). A mathematical analysis of the motion coherence theory. International Journal of Computer Vision, 3, 155–175. Yuille, A. L., & Grzywacz, N. M. (1998). A theoretical framework for visual motion. In T. Watanabe (Ed.), High-level motion processing: Computational, neurobiological, and psychophysical perspectives. Cambridge, MA: MIT Press. Received November 2, 1998; accepted August 27, 1999.
LETTER
Communicated by Leo Breiman
Boosting Neural Networks

Holger Schwenk
LIMSI-CNRS, 91403 Orsay cedex, France

Yoshua Bengio
DIRO, University of Montréal, Montréal, Quebec, H3C 3J7, Canada

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm, AdaBoost, has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this article we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the improvements brought by AdaBoost. This is in contrast to bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of on-line handwritten digits from more than 200 writers. A boosted multilayer network achieved 1.5% error on the UCI letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.
1 Introduction
Boosting, a general method for improving the performance of a learning algorithm, finds a highly accurate classifier on the training set by combining "weak hypotheses" (Schapire, 1990), each of which needs only to be moderately accurate on the training set. (Perrone, 1994, is an earlier overview of different ways to combine neural networks.) A recently proposed boosting algorithm is AdaBoost (Freund, 1995), which stands for "adaptive boosting." During the past two years, many empirical studies have been published that use decision trees as base classifiers for AdaBoost (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996; Maclin & Opitz, 1997; Grove & Schuurmans, 1998; Bauer & Kohavi, 1999; Dietterich, forthcoming). All of these experiments have shown impressive improvements in the generalization behavior and

Neural Computation 12, 1869–1887 (2000). © 2000 Massachusetts Institute of Technology
suggest that AdaBoost tends to be robust to overfitting. In fact, in many experiments it has been observed that the generalization error continues to decrease toward an apparent asymptote after the training error has reached zero. Schapire, Freund, Bartlett, and Lee (1997) suggest a possible explanation for this unusual behavior based on the definition of the margin of classification. Other attempts to understand boosting theoretically can be found in Schapire et al. (1997); Breiman (1997a, 1998); Friedman, Hastie, and Tibshirani (1998); and Schapire (1999). AdaBoost has also been linked with game theory (Breiman, 1997b; Grove & Schuurmans, 1998; Freund & Schapire, 1996b, 1999) in order to understand its behavior and to propose alternative algorithms. Mason and Baxter (1999) propose a new variant of boosting based on the direct optimization of margins. There is recent evidence that AdaBoost may very well overfit if we combine several hundred thousand classifiers (Grove & Schuurmans, 1998). It also seems that the performance of AdaBoost degrades considerably in the presence of significant amounts of noise (Dietterich, forthcoming; Rätsch, Onoda, & Müller, 1998).

Although much useful theoretical and experimental work has been done, there is still a lot that is not well understood about AdaBoost's impressive generalization behavior. To the best of our knowledge, all applications of AdaBoost have been to decision trees, and no applications to multilayer artificial neural networks have been reported in the literature. This article extends and provides a deeper experimental analysis of our first experiments with the application of AdaBoost to neural networks (Schwenk & Bengio, 1997, 1998). In this article we consider the following questions: Does AdaBoost work as well for neural networks as for decision trees? The short answer is yes, and sometimes even better. Does it behave in a similar way (as was observed previously in the literature)? The short answer is yes. Furthermore, are there particulars in the way neural networks are trained with gradient backpropagation that should be taken into account when choosing a particular version of AdaBoost? The short answer is yes, because it is possible to weight the cost function of neural networks directly. Is overfitting of the individual neural networks a concern? The short answer is, not as much as when not using boosting. Is the random resampling used in previous implementations of AdaBoost critical, or can we get similar performance by weighting the training criterion (which can easily be done with neural networks)? The short answer is that it is not critical for generalization but helps to obtain faster convergence of the individual networks when coupled with stochastic gradient descent.

In the next section, we describe the AdaBoost algorithm and discuss several implementation issues when using neural networks as base classifiers. In section 3, we present results that we have obtained on three medium-sized tasks: a data set of handwritten on-line digits and the letter and satellite data sets of the UCI repository. We finish with a conclusion and perspectives for future research.
2 AdaBoost
It is well known that it is often possible to increase the accuracy of a classifier by averaging the decisions of an ensemble of classifiers (Perrone, 1993; Krogh & Vedelsby, 1995). In general, more improvement can be expected when the individual classifiers are diverse and yet accurate. One can try to obtain this result by taking a base learning algorithm and invoking it several times on different training sets. Two popular techniques exist that differ in the way they construct these training sets: bagging (Breiman, 1994) and boosting (Schapire, 1990; Freund, 1995). In bagging, each classifier is trained on a bootstrap replicate of the original training set. Given a training set S of N examples, the new training set is created by resampling N examples uniformly with replacement. Some examples may occur several times, and others may not occur in the sample at all. One can show that, on average, only about two-thirds of the examples occur in each bootstrap replicate. The individual training sets are independent, and the classifiers could be trained in parallel. Bagging is known to be particularly effective when the classifiers are unstable, that is, when perturbing the learning set can cause significant changes in the classification behavior. Formulated in the context of the bias-variance decomposition (Geman, Bienenstock, & Doursat, 1992), bagging improves generalization performance through a reduction in variance while maintaining or only slightly increasing bias. Note, however, that there is no unique bias-variance decomposition for classification tasks (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; Tibshirani, 1996).

AdaBoost, on the other hand, constructs a composite classifier by sequentially training classifiers while putting more and more emphasis on certain patterns. To this end, AdaBoost maintains a probability distribution $D_t(i)$ over the original training set. In each round t, the classifier is trained with respect to this distribution.
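The bootstrap resampling that underlies bagging, described above, is easy to illustrate. The following sketch (function and variable names are ours, not from the article) draws a bootstrap replicate and checks that only about two-thirds of the original examples, $1 - 1/e \approx 0.632$, appear in it:

```python
import random

def bootstrap_replicate(examples, rng):
    """Sample len(examples) items uniformly with replacement (bagging)."""
    n = len(examples)
    return [examples[rng.randrange(n)] for _ in range(n)]

# The fraction of distinct originals in a replicate approaches 1 - 1/e.
n = 10000
replicate = bootstrap_replicate(list(range(n)), random.Random(0))
frac_unique = len(set(replicate)) / n
print(round(frac_unique, 2))  # close to 0.63
```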
Some learning algorithms do not allow training with respect to a weighted cost function. In this case, sampling with replacement (using the probability distribution $D_t$) can be used to approximate a weighted cost function. Examples with high probability would then occur more often than those with low probability, and some examples may not occur in the sample at all, although their probability is not zero. After each AdaBoost round, the probability of incorrectly labeled examples is increased, and the probability of correctly labeled examples is decreased.

The result of training the tth classifier is a hypothesis $h_t: X \to Y$, where $Y = \{1, \ldots, k\}$ is the space of labels and $X$ is the space of input features. After the tth round, the weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ of the resulting classifier is calculated, and the distribution $D_{t+1}$ is computed from $D_t$ by increasing the probability of incorrectly labeled examples. The probabilities are changed so that the error of the tth classifier using these new "weights" $D_{t+1}$ would be 0.5. In this way, the classifiers are optimally decoupled. The global decision f is obtained by weighted voting. This basic AdaBoost algorithm converges (learns the training set) if each classifier yields a weighted error that is less than 50%, that is, better than chance in the two-class case.

In general, neural network classifiers provide more information than just a class label. It can be shown that the network outputs approximate the a posteriori probabilities of classes, and it might be useful to use this information rather than to perform a hard decision for one recognized class. This issue is addressed by another version of AdaBoost, called AdaBoost.M2 (Freund & Schapire, 1997). It can be used when the classifier computes confidence scores for each class (the scores do not need to sum to one). The result of training the tth classifier is now a hypothesis $h_t: X \times Y \to [0, 1]$. Furthermore, we use a distribution $D_t(i, y)$ over the set of all mislabels, $B = \{(i, y): i \in \{1, \ldots, N\},\ y \neq y_i\}$, where N is the number of training examples; therefore $|B| = N(k - 1)$. AdaBoost modifies this distribution so that the next learner focuses not only on the examples that are hard to classify, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it. Note that the mislabel distribution $D_t$ induces a distribution over the examples, $P_t(i) = W_i^t / \sum_i W_i^t$, where $W_i^t = \sum_{y \neq y_i} D_t(i, y)$. $P_t(i)$ may be used for resampling the training set. Freund and Schapire (1997) define the pseudo-loss of a learning machine as

$$\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y) \left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right). \qquad (2.1)$$

It is minimized if the confidence scores of the correct labels are 1.0 and the confidence scores of all the wrong labels are 0.0. The final decision f is obtained by adding together the weighted confidence scores of all the machines (all the hypotheses $h_1, h_2, \ldots$). Table 1 summarizes the AdaBoost.M2 algorithm. This multiclass boosting algorithm converges if each classifier yields a pseudo-loss that is less than 50%, in other words, better than any constant hypothesis. AdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero as the number of combined classifiers is increased (Freund & Schapire, 1997). Many empirical evaluations of AdaBoost also provide an analysis of the so-called margin distribution. The margin is defined as the difference between the ensemble score of the correct class and the strongest ensemble score of a wrong class. In the case in which there are just two possible labels, $\{-1, +1\}$, this is $y f(x)$, where f is the output of the composite classifier and y the correct label. The classification is correct if the margin is positive. Discussions about the relevance of
Table 1: Pseudo-Loss AdaBoost (AdaBoost.M2).

Input:  Sequence of N examples $(x_1, y_1), \ldots, (x_N, y_N)$ with labels $y_i \in Y = \{1, \ldots, k\}$.
Init:   Let $B = \{(i, y): i \in \{1, \ldots, N\},\ y \neq y_i\}$;
        $D_1(i, y) = 1/|B|$ for all $(i, y) \in B$.
Repeat:
  1. Train neural network with respect to distribution $D_t$ and obtain hypothesis $h_t: X \times Y \to [0, 1]$.
  2. Calculate the pseudo-loss of $h_t$:
     $\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y) \left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right)$.
  3. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  4. Update distribution $D_t$:
     $D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$,
     where $Z_t$ is a normalization constant.
Output: Final hypothesis:
  $f(x) = \arg\max_{y \in Y} \sum_t \left(\log \frac{1}{\beta_t}\right) h_t(x, y)$.
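The bookkeeping in Table 1 can be turned into running code. The sketch below implements the AdaBoost.M2 updates with a toy weighted base learner (a nearest-class-mean scorer standing in for the neural networks used in this article; all identifiers are ours):

```python
import math
import random

def adaboost_m2(X, y, k, rounds, train_base):
    """AdaBoost.M2 updates of Table 1 (Freund & Schapire, 1997).

    train_base(X, y, P) must return a hypothesis h, where h(x) is a
    list of k confidence scores in (0, 1).
    """
    n = len(X)
    B = [(i, c) for i in range(n) for c in range(k) if c != y[i]]
    D = {pair: 1.0 / len(B) for pair in B}
    hyps, betas = [], []
    for _ in range(rounds):
        # Distribution over examples induced by the mislabel distribution.
        W = [sum(D[(i, c)] for c in range(k) if c != y[i]) for i in range(n)]
        total = sum(W)
        P = [w / total for w in W]
        h = train_base(X, y, P)
        # Pseudo-loss (equation 2.1).
        eps = 0.5 * sum(D[(i, c)] * (1.0 - h(X[i])[y[i]] + h(X[i])[c])
                        for (i, c) in B)
        beta = eps / (1.0 - eps)
        # Update and renormalize the mislabel distribution (step 4).
        for (i, c) in B:
            D[(i, c)] *= beta ** (0.5 * (1.0 + h(X[i])[y[i]] - h(X[i])[c]))
        Z = sum(D.values())
        for pair in B:
            D[pair] /= Z
        hyps.append(h)
        betas.append(beta)

    def final(x):
        # Final hypothesis: weighted vote with weights log(1 / beta_t).
        scores = [sum(math.log(1.0 / b) * h(x)[c]
                      for h, b in zip(hyps, betas)) for c in range(k)]
        return max(range(k), key=scores.__getitem__)
    return final

def weighted_mean_learner(X, y, P):
    """Toy base learner: P-weighted class means with soft distance scores."""
    k = max(y) + 1
    dim = len(X[0])
    means = []
    for c in range(k):
        idx = [i for i in range(len(X)) if y[i] == c]
        wsum = sum(P[i] for i in idx) or 1.0
        means.append([sum(P[i] * X[i][d] for i in idx) / wsum
                      for d in range(dim)])
    def h(x):
        raw = [math.exp(-math.dist(x, m)) for m in means]
        s = sum(raw)
        return [r / s for r in raw]
    return h

# Toy usage: two well-separated 2-D blobs.
rng = random.Random(0)
X = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
     + [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(50)])
y = [0] * 50 + [1] * 50
f = adaboost_m2(X, y, k=2, rounds=5, train_base=weighted_mean_learner)
train_errors = sum(f(X[i]) != y[i] for i in range(len(X)))
```

The soft scores keep the pseudo-loss strictly positive, so the update $\beta_t = \epsilon_t / (1 - \epsilon_t)$ is always well defined in this sketch.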
the margin distribution for the generalization behavior of ensemble techniques can be found in Freund and Schapire (1996b), Schapire et al. (1997), Breiman (1997a, 1997b), Grove and Schuurmans (1998), and Rätsch et al. (1998). In this article, an important focus is on whether the good generalization performance of AdaBoost is partially explained by the random resampling of the training sets generally used in its implementation. This issue will be addressed by comparing three versions of AdaBoost in which randomization is used (or not used) in three different ways.

2.1 Applying AdaBoost to Neural Networks. In this article we investigate different techniques for using neural networks as base classifiers for AdaBoost. In all cases, we have trained the neural networks by minimizing a quadratic criterion that is a weighted sum of the squared differences $(z_{ij} - \hat z_{ij})^2$, where $z_i = (z_{i1}, z_{i2}, \ldots, z_{ik})$ is the desired output vector (with a low target value everywhere except at the position corresponding to the target class) and $\hat z_i$ is the output vector of the network. A score for class j for pattern i can be obtained directly from the jth element $\hat z_{ij}$ of the output vector $\hat z_i$. When a class must be chosen, the one with the highest score is selected. Let $V_t(i, j) = D_t(i, j) / \max_{k \neq y_i} D_t(i, k)$ for $j \neq y_i$ and $V_t(i, y_i) = 1$. These weights are used to give more emphasis to certain incorrect labels according to the pseudo-loss AdaBoost.
What we call an epoch is a pass of the training algorithm through all the examples in a training set. We compare three different versions of AdaBoost:

R: Training the tth classifier with a fixed training set obtained by resampling with replacement once from the original training set. Before starting to train the tth network, we sample N patterns from the original training set, each with probability $P_t(i)$ of picking pattern i. Training is performed for a fixed number of iterations, always using this same resampled training set. This is basically the scheme that has been used in the past when applying AdaBoost to decision trees, except that we used the pseudo-loss AdaBoost. To approximate the pseudo-loss, the training cost minimized for a pattern that is the ith one from the original training set is $\frac{1}{2} \sum_j V_t(i, j)(z_{ij} - \hat z_{ij})^2$.
E: Training the tth classifier using a different training set at each epoch, obtained by resampling with replacement after each training epoch. After each epoch, a new training set is drawn from the original training set with probabilities $P_t(i)$. Since we used an on-line (stochastic) gradient in this case, this is equivalent to sampling a new pattern from the original training set with probability $P_t(i)$ before each forward and backward pass through the neural network. Training continues until a fixed number of pattern presentations has been performed. As for version R, the training cost minimized for a pattern that is the ith one from the original training set is $\frac{1}{2} \sum_j V_t(i, j)(z_{ij} - \hat z_{ij})^2$.
W: Training the tth classifier by directly weighting the cost function (here the squared error) of the tth neural network; that is, all the original training patterns are in the training set, but the cost is weighted by the probability of each example: $\frac{1}{2} \sum_j D_t(i, j)(z_{ij} - \hat z_{ij})^2$. If we used this formula directly, the gradients would be very small, even when all probabilities $D_t(i, j)$ are identical. To avoid having to scale learning rates differently depending on the number of examples, the following "normalized" error function was used:

$$\frac{1}{2}\, \frac{P_t(i)}{\max_k P_t(k)} \sum_j V_t(i, j)(z_{ij} - \hat z_{ij})^2. \qquad (2.2)$$
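The normalized per-pattern cost of equation 2.2 is cheap to compute. A minimal sketch of the W-version criterion for one pattern (identifiers are ours; the target and output vectors follow the article's notation):

```python
def weighted_sq_error(z, z_hat, V_i, P_i, P_max):
    """Normalized weighted squared error (equation 2.2) for one pattern i.

    z:     desired output vector (low everywhere except the true class)
    z_hat: network output vector
    V_i:   per-label emphasis weights V_t(i, j)
    P_i:   resampling probability P_t(i); P_max = max_k P_t(k)
    """
    scale = 0.5 * P_i / P_max
    return scale * sum(v * (zj - zhj) ** 2
                       for v, zj, zhj in zip(V_i, z, z_hat))

# Illustrative values: 3 classes, true class 0, emphasis on labels 0 and 2.
loss = weighted_sq_error(z=[1.0, 0.0, 0.0], z_hat=[0.8, 0.1, 0.3],
                         V_i=[1.0, 0.4, 1.0], P_i=0.02, P_max=0.05)
# loss is approximately 0.0268
```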
In versions E and W, what makes the combined networks essentially different from each other is the fact that they are trained with respect to different weightings $D_t$ of the original training set. In R, an additional element of diversity is built in because the criterion used for the tth network is not exactly the errors weighted by $P_t(i)$. Instead, more emphasis is put on certain patterns while others are completely ignored (because of the initial random sampling of the training set). The E version can be seen as a stochastic version of the W version: as the number of iterations through the data increases and the learning rate decreases, E becomes a
very good approximation of W. W itself is closest to the recipe mandated by the AdaBoost algorithm (but, as we will see below, it suffers from numerical problems). Note that E is a better approximation of the weighted cost function than R, in particular when many epochs are performed. If random resampling of the training data explained a good part of the generalization performance of AdaBoost, then the weighted training version (W) should perform worse than the resampling versions, and the fixed-sample version (R) should perform better than the continuously resampled version (E). Note that for bagging, which directly aims at reducing variance, random resampling is essential to obtain the reduction in generalization error.

3 Results
Experiments have been performed on three data sets: a data set of on-line handwritten digits, the UCI letters data set of off-line machine-printed alphabetical characters, and the UCI satellite data set generated from Landsat multispectral scanner image data. All data sets have a predefined training and test set. All the p-values given in this section concern a pair $(\hat p_1, \hat p_2)$ of test performance results (on n test points) for two classification systems with unknown true error rates $p_1$ and $p_2$. The null hypothesis is that the true expected performance of the two systems is not different, that is, $p_1 = p_2$. Let $\hat p = 0.5(\hat p_1 + \hat p_2)$ be the estimator of the common error rate under the null hypothesis. The alternative hypothesis is that $p_1 < p_2$, so the p-value is obtained as the probability of observing such a large difference under the null hypothesis, that is, $P(Z > z)$ for a standard normal Z, with

$$z = \frac{\sqrt{n}\,(\hat p_2 - \hat p_1)}{\sqrt{2 \hat p (1 - \hat p)}}.$$

This is based on the normal approximation of the binomial, which is appropriate for large n (see, however, Dietterich, 1998, for a discussion of this and other tests to compare algorithms).
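The significance test just described can be written in a few lines (a sketch under the stated normal approximation; the function name `p_value` is ours):

```python
import math

def p_value(p1_hat, p2_hat, n):
    """One-sided p-value for H0: p1 = p2 against H1: p1 < p2, using the
    normal approximation of the binomial described in the text."""
    p_hat = 0.5 * (p1_hat + p2_hat)
    z = math.sqrt(n) * (p2_hat - p1_hat) / math.sqrt(2 * p_hat * (1 - p_hat))
    # P(Z > z) for a standard normal Z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# The same observed gap becomes more significant with more test points.
p_small_n = p_value(0.01, 0.03, 100)
p_large_n = p_value(0.01, 0.03, 1000)
```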
3.1 Results on the On-line Data Set. The on-line data set was collected at Paris 6 University (Schwenk & Milgram, 1996). A WACOM A5 tablet with a cordless pen was used in order to allow natural writing. Since we wanted to build a writer-independent recognition system, we tried to use many writers and to impose as few constraints as possible on the writing style. In total, 203 students wrote down isolated numbers, which have been divided into a learning set (1200 examples) and a test set (830 examples). The writers of the two sets are completely distinct. A particular property of this data set is the notable variety of writing styles, which are not all equally frequent: there are, for instance, 100 zeros written counterclockwise but only 3 written clockwise. Figure 1 gives an idea of the great variety of writing styles in this data set. We applied only a simple preprocessing: the characters were resampled to 11 points, centered, and size normalized to an (x, y)-coordinate sequence in $[-1, 1]^{22}$.
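The preprocessing just described can be sketched as follows. Equidistant arc-length resampling is one plausible reading of "resampled to 11 points"; the authors' exact scheme may differ, and all names below are ours:

```python
import math

def preprocess(stroke, n_points=11):
    """Resample a pen trajectory (>= 2 points) to n_points equidistant in
    arc length, then center and size-normalize coordinates into [-1, 1]."""
    # Cumulative arc length along the trajectory.
    d = [0.0]
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        d.append(d[-1] + math.hypot(x1 - x0, y1 - y0))
    total = d[-1] or 1.0
    # Linear interpolation at equidistant arc-length positions.
    pts, j = [], 0
    for k in range(n_points):
        target = total * k / (n_points - 1)
        while j < len(d) - 2 and d[j + 1] < target:
            j += 1
        seg = (d[j + 1] - d[j]) or 1.0
        t = (target - d[j]) / seg
        (x0, y0), (x1, y1) = stroke[j], stroke[j + 1]
        pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    # Center and size-normalize into [-1, 1]^2 (22 coordinates in total).
    cx = sum(x for x, _ in pts) / n_points
    cy = sum(y for _, y in pts) / n_points
    scale = max(max(abs(x - cx), abs(y - cy)) for x, y in pts) or 1.0
    return [((x - cx) / scale, (y - cy) / scale) for x, y in pts]

# Toy stroke: an L-shaped trajectory.
normalized = preprocess([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)])
```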
Figure 1: Some examples of the on-line handwritten digits data set (test set).
Table 2 summarizes the results on the test set before using AdaBoost. Note that the differences among the test results of the last three networks are not statistically significant (p-value > 23%), whereas the difference with the first network is significant (p-value < $10^{-5}$). Tenfold cross-validation within the training set was used to find the optimal number of training epochs (typically about 200). Note that if training is continued until 1000 epochs, the test error increases by up to 1%. Table 3 shows the results of bagged and boosted multilayer perceptrons with 10, 30, or 50 hidden units, trained for either 100, 200, 500, or 1000 epochs, and using the ordinary resampling scheme (R), resampling with different random selections at each epoch (E), or training with weights $D_t$ on the squared error criterion for each pattern (W). In all cases, 100 neural networks were combined.

Table 2: On-line Digits Data Set Error Rates for Fully Connected MLPs (Not Boosted).

Architecture*   22-10-10   22-30-10   22-50-10   22-80-10
Train:          5.7%       0.8%       0.4%       0.08%
Test:           8.8%       3.3%       2.8%       2.7%

*The notation 22-h-10 designates a fully connected neural network with 22 input nodes, one hidden layer with h neurons, and a 10-dimensional output layer.
Table 3: On-line Digits Test Error Rates for Boosted MLPs.

                          22-10-10           22-30-10           22-50-10
Version                 R     E     W      R     E     W      R     E     W
Bagging (500 epochs)   5.4%               2.8%               1.8%
AdaBoost
  100 epochs           2.9%  3.2%  6.0%   1.7%  1.8%  5.1%   2.1%  1.8%  4.9%
  200 epochs           3.0   2.8   5.6    1.8   1.8   4.2    1.8   1.7   3.5
  500 epochs           2.5   2.7   3.3    1.7   1.5   3.0    1.7   1.7   2.8
  1000 epochs          2.8   2.7   3.2    1.8   1.6   2.6    1.6   1.5   2.2
  5000 epochs          —     —     2.9    —     —     1.6    —     —     1.6
In all cases, AdaBoost improved the generalization error of the MLPs, for instance from 8.8% to about 2.7% for the 22-10-10 architecture. The improvement with 50 hidden units from 2.8% (without AdaBoost) to 1.6% (with AdaBoost) is significant (p-value of 0.38%), despite the small number of examples. Boosting was also always superior to bagging, although the differences are not always very significant because of the small number of examples. Furthermore, it seems that the number of training epochs of each individual classifier has no significant impact on the results of the combined classifier, at least on this data set. AdaBoost with weighted training of MLPs (the W version), however, does not work as well if the learning of each individual MLP is stopped too early (1000 epochs). The networks did not learn the weighted examples well enough, and $\epsilon_t$ rapidly approached 0.5. When training each MLP for 5000 epochs, however, the weighted training (W) version achieved the same low test error rate. AdaBoost is less useful for very big networks (50 or more hidden units for these data) since an individual classifier can achieve zero error on the original training set (using the E or W method). Such large networks probably have very low bias but high variance. This may explain why bagging, a pure variance-reduction method, can do as well as AdaBoost, which is believed to reduce both bias and variance. However, AdaBoost can achieve the same low error rates with the smaller 22-30-10 networks. Figure 2 shows the error rates of some of the boosted classifiers as the number of networks is increased. AdaBoost brings the training error to zero after only a few steps, even with an MLP with only 10 hidden units. The generalization error is also considerably improved, and it continues to decrease toward an apparent asymptote after zero training error has been reached.

The surprising effect of a continuously decreasing generalization error even after the training error reaches zero has already been observed by others (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996). This seems to contradict Occam's razor, but a recent theorem (Schapire et al., 1997) suggests that the margin distribution may be relevant
Figure 2: Error rates of the boosted classifiers for an increasing number of networks (bagging and AdaBoost R, E, and W, for MLPs 22-10-10, 22-30-10, and 22-50-10). For clarity, the training error of bagging is not shown (it overlaps with the test error rates of AdaBoost). The dotted constant horizontal line corresponds to the test error of the unboosted classifier. Small oscillations are not significant since they correspond to few examples.
to the generalization error. Although previous empirical results (Schapire et al., 1997) indicate that pushing the margin cumulative distribution to the right may improve generalization, other recent results (Breiman, 1997a, 1997b; Grove & Schuurmans, 1998) show that "improving" the whole margin distribution can also yield worse generalization. Figures 3 and 4 show several margin cumulative distributions, that is, the fraction of examples whose margin is at most x as a function of $x \in [-1, 1]$. The networks have been trained for 1000 epochs (5000 for the W version). It is clear in Figures 3 and 4 that the number of examples with a high margin increases when more classifiers are combined by boosting. When boosting neural networks with 10 hidden units, for instance, there are some examples with a margin smaller than −0.5 when only two networks are combined. However, all examples have a positive margin when 10 nets are combined, and all examples have a margin higher than 0.5 for 100 networks. Bagging, on the other hand, has no significant influence on the margin distributions. There is almost no difference in the margin distributions of the R, E, or W versions of AdaBoost either.² However, both the margin distributions and the test set errors differ when the complexity of the neural networks (hidden layer size) is varied. Finally, it seems that sometimes AdaBoost must allow some examples with very high margins in order to improve the minimal margin. This can best be seen for the 22-10-10 architecture. One should keep in mind that this data set contains only small amounts of noise. In application domains with high amounts of noise, it may be less advantageous to improve the minimal margin at any price (Grove & Schuurmans, 1998; Rätsch et al., 1998), since this would mean putting too much weight on noisy or wrongly labeled examples.

3.2 Results on the UCI Letters and Satellite Data Sets.
Similar experiments were performed with MLPs on the letters data set from the UCI machine learning repository. It has 16,000 training and 4000 test patterns, 16 input features, and 26 classes (A–Z) of distorted machine-printed characters from 20 fonts. A few preliminary experiments on the training set only were used to choose a 16-70-50-26 architecture. Each input feature was normalized according to its mean and variance on the training set. Two types of experiments were performed: (1) doing resampling after each epoch (E) and using stochastic gradient descent, and (2) without resampling but using reweighting of the squared error (W) and conjugate gradient descent. In both cases, a fixed number of training epochs (500) was used. The plain, bagged, and boosted networks are compared to decision trees in Table 4. In both cases (E and W) the same final generalization error results were obtained (1.50% for E and 1.47% for W), but the training time using the
² The W and E versions achieve slightly higher margins than R.
Holger Schwenk and Yoshua Bengio
Figure 3: Margin distributions using 2, 5, 10, 50, and 100 networks, respectively. [Plots omitted; the six panels show AdaBoost (R) and AdaBoost (E) for MLPs 22-10-10, 22-30-10, and 22-50-10, with the margin (from −1 to 1.0) on the x-axis and the cumulative fraction of examples (from 0 to 1) on the y-axis.]
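The quantity plotted in these figures, the margin cumulative distribution, can be computed directly from the classifier votes. A minimal numpy sketch with synthetic inputs follows; `margins` and `margin_cdf` are hypothetical helper names for illustration, not code from the paper:

```python
import numpy as np

def margins(votes, alphas, y_true):
    """Voting margins as in Schapire et al. (1997): the total vote weight
    assigned to the correct class minus the largest weight assigned to any
    wrong class, normalized so that margins lie in [-1, 1].

    votes:  (n_classifiers, n_examples) array of predicted class labels
    alphas: (n_classifiers,) array of classifier voting weights
    y_true: (n_examples,) array of true labels
    """
    n_classes = int(max(votes.max(), y_true.max())) + 1
    n_examples = votes.shape[1]
    w = np.zeros((n_examples, n_classes))
    for a, preds in zip(alphas, votes):
        w[np.arange(n_examples), preds] += a
    w /= alphas.sum()
    correct = w[np.arange(n_examples), y_true]
    w[np.arange(n_examples), y_true] = -np.inf   # mask the true class
    return correct - w.max(axis=1)

def margin_cdf(m, xs):
    """Fraction of examples whose margin is at most x, for each x in xs."""
    return np.array([(m <= x).mean() for x in xs])
```

For example, two equally weighted voters on three examples give a margin of 1 where both agree with the true label and 0 where they split evenly.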
Figure 4: Margin distributions using 2, 5, 10, 50, and 100 networks, respectively. [Plots omitted; the six panels show AdaBoost (W) and bagging for MLPs 22-10-10, 22-30-10, and 22-50-10, with the margin (from −1 to 1.0) on the x-axis and the cumulative fraction of examples (from 0 to 1) on the y-axis.]
Table 4: Test Error Rates on the UCI Data Sets.

Data Set     CART(a)                      C4.5(b)                      MLP
             Alone   Bagged  Boosted      Alone   Bagged  Boosted      Alone   Bagged  Boosted
Letter       12.4%   6.4%    3.4%         13.8%   6.8%    3.3%         6.1%    4.3%    1.5%
Satellite    14.8%   10.3%   8.8%         14.8%   10.6%   8.9%         12.8%   8.7%    8.1%

a. Results from Breiman (1996).
b. Results from Freund & Schapire (1996a).
weighted squared error (W) was about 10 times greater. This shows that using random resampling (as in E or R) is not necessary to obtain good generalization (whereas it is clearly necessary for bagging). However, the experiments show that it is still preferable to use a random sampling method such as R or E for numerical reasons: convergence of each network is faster. For this reason, for the E experiments with stochastic gradient descent, 100 networks were boosted, whereas we stopped training on the W network after 25 networks (when the generalization error seemed to have flattened out), which took more than a week on a fast processor (SGI Origin-2000). We believe that the main reason for this difference in training time is that the conjugate gradient method is a batch method and is therefore slower than stochastic gradient descent on redundant data sets with many thousands of examples, such as this one. (See comparisons between batch and on-line methods in Bourrely, 1989, and conjugate gradients for classification tasks in particular in Møller, 1992, 1993.) For the W version with stochastic gradient descent, the weighted training error of individual networks does not decrease as much as when using conjugate gradient descent, so AdaBoost itself did not work as well. We believe that this is due to the fact that it is difficult for stochastic gradient descent to approach a minimum when the output error is weighted with very different weights for different patterns (the patterns with small weights make almost no progress). On the other hand, the conjugate gradient descent method can approach a minimum of the weighted cost function more precisely, but inefficiently, when there are thousands of training examples. The results obtained with the boosted network are extremely good (1.5% error, whether using the W version with conjugate gradients or the E version with stochastic gradient) and are the best ever published to date, as far as we know, for this data set.
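The difference between the two weighting schemes compared above can be sketched as follows. This is an illustrative fragment only, not the authors' implementation: the AdaBoost example weights `d`, the network, and the training loop are assumed to come from elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def epoch_resampling_batches(X, y, d, n_epochs):
    """Variant E: before each training epoch, draw a fresh training set by
    sampling examples with replacement according to the AdaBoost weights d,
    then train with (unweighted) stochastic gradient on that sample."""
    n = len(y)
    for _ in range(n_epochs):
        idx = rng.choice(n, size=n, p=d)      # resample after each epoch
        yield X[idx], y[idx]

def weighted_squared_error(y_pred, y_target, d):
    """Variant W: no resampling; instead, the squared error of each pattern
    is multiplied by its AdaBoost weight d_i in the training criterion."""
    per_example = ((y_pred - y_target) ** 2).sum(axis=1)
    return (d * per_example).sum()
```

With very unequal weights `d`, variant W makes the low-weight patterns contribute almost nothing to the gradient, which is the conditioning problem discussed in the text.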
In a comparison with the boosted trees (3.3% error), the p-value of the null hypothesis is less than 10^{-7}. The best performance reported in STATLOG (Feng, Sutherland, King, Muggleton, & Henery, 1993) is 6.4%. Note also that we need to combine only a few neural networks to get immediate important improvements: with the E version, 20 neural networks suffice for the error to fall under 2%, whereas boosted decision trees typically "converge" later. The W version of AdaBoost actually converged faster in terms of number of networks—after about 7 networks,
Figure 5: Error rates of the bagged and boosted neural networks for the UCI letter data set (log-scale). SG+E denotes stochastic gradient descent and resampling after each epoch. CG+W means conjugate gradient descent and weighting of the squared error. For clarity, the training error of bagging is not shown (it flattens out to about 1.8%). The dotted constant horizontal line corresponds to the test error of the unboosted classifier. [Plot omitted; error in % (0 to 10) on the y-axis versus number of networks (1 to 100) on the x-axis.]
the 2% mark was reached, and after 14 networks the 1.5% apparent asymptote was reached (see Figure 5)—but converged much more slowly in terms of training time. Figure 6 shows the margin distributions for bagging and AdaBoost applied to this data set. Again, bagging has no effect on the margin distribution, whereas AdaBoost clearly increases the number of examples with large margins.

Figure 6: Margin distributions for the UCI letter data set. [Plots omitted; the two panels show bagging and AdaBoost (SG+E) with 2, 5, 10, 50, and 100 networks; margin on the x-axis, cumulative fraction of examples on the y-axis.]

Similar conclusions hold for the UCI satellite data set (see Table 4), although the improvements are not as dramatic as in the case of the letter data set. The improvement due to AdaBoost is statistically significant (p-value < 10^{-6}), but the difference in performance between boosted MLPs and boosted decision trees is not (p-value > 21%). This data set has 6435 examples, with the first 4435 used for training and the last 2000 used for testing generalization. There are 36 inputs and 6 classes, and a 36-30-15-6 network was used. The two best training methods are the epoch resampling (E) with stochastic gradient and the weighted squared error (W) with conjugate gradient descent.

4 Conclusion
As demonstrated here in three real-world applications, AdaBoost can significantly improve neural classifiers. In particular, the results obtained on the UCI letters data set (1.5% test error) are significantly better than the best published results to date, as far as we know. The behavior of AdaBoost for neural networks confirms previous observations on other learning algorithms (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996; Schapire et al., 1997), such as the continued generalization improvement after zero training error has been reached and the associated improvement in the margin distribution. It also seems that AdaBoost is not very sensitive to overtraining of the individual classifiers, so the neural networks can be trained for a fixed (preferably high) number of training epochs. A similar observation was recently made with decision trees (Breiman, 1997b). This apparent insensitivity to overtraining of individual classifiers simplifies the choice of neural network design parameters. Another interesting finding is that the weighted training version (W) of AdaBoost gives good generalization results for MLPs but requires many more training epochs or the use of a second-order (and, unfortunately, batch) method, such as conjugate gradients. We conjecture that this happens because of the weights on the cost function terms (especially when the weights are small), which could worsen the conditioning of the Hessian matrix. So in terms of generalization error, all three methods (R, E, W) gave similar results, but training time was lowest with the E method (with stochastic gradient descent), which samples each new training pattern from the original data with the AdaBoost weights. Although our experiments are insufficient to conclude, it is possible that the weighted training method (W) with conjugate gradients might be faster than the others for small training sets (a few hundred examples). There are various ways to define variance for classifiers (e.g.,
Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; Tibshirani, 1996). It basically represents how the resulting classifier will vary when a different training set is sampled from the true generating distribution of the data. Our comparative results on the R, E, and W versions add credence to the view that randomness induced by resampling the training data is not the main reason for AdaBoost's reduction of the generalization error. This is in contrast to bagging, which is a pure variance-reduction method. For
bagging, random resampling is essential to obtain the observed variance reduction. Another interesting issue is whether the boosted neural networks could be trained with a criterion other than the mean squared error, one that would better approximate the goal of the AdaBoost criterion: minimizing a weighted classification error. Schapire & Singer (1998) address this issue.

Acknowledgments
Most of the work reported here was done while H. S. was doing a postdoctorate at the University of Montreal. We thank the Natural Sciences and Engineering Research Council of Canada and the government of Quebec for financial support.

References

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–139.
Bourrely, J. (1989). Parallelization of a neural learning algorithm on a hypercube. In Hypercube and distributed computers (pp. 219–229). Amsterdam: Elsevier Science, North-Holland.
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996). Bias, variance, and arcing classifiers (Tech. Rep. No. 460). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1997a). Arcing the edge (Tech. Rep. No. 486). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1997b). Prediction games and arcing classifiers (Tech. Rep. No. 504). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26(3), 801–849.
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
Dietterich, T. G. (forthcoming). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. Available online at: ftp://ftp.cs.orst.edu/pub/tgd/papers/tr-randomized-c4.ps.gz.
Drucker, H., & Cortes, C. (1996). Boosting decision trees. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (pp. 479–485). Cambridge, MA: MIT Press.
Feng, C., Sutherland, A., King, R., Muggleton, S., & Henery, R. (1993). Comparison of machine learning classifiers to statistics and neural networks.
In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics (pp. 41–52).
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Freund, Y., & Schapire, R. E. (1996a). Experiments with a new boosting algorithm.
In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156).
Freund, Y., & Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (pp. 325–332).
Freund, Y., & Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29, 79–103.
Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting (Tech. Rep.). Stanford: Department of Statistics, Stanford University.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Grove, A. J., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Kohavi, R., & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 275–283).
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the Twelfth International Conference (pp. 313–321).
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation and active learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 231–238). Cambridge, MA: MIT Press.
Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (pp. 546–551).
Mason, L., Bartlett, P., & Baxter, J. (1999). Direct optimization of margins improves generalization in combined classifiers. In M. S. Kearns, S.
Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Møller, M. (1992). Supervised learning on large redundant training sets. In Neural networks for signal processing 2. New York: IEEE Press.
Møller, M. (1993). Efficient training of feed-forward neural networks. Unpublished doctoral dissertation, Aarhus University, Aarhus, Denmark.
Perrone, M. P. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Unpublished doctoral dissertation, Brown University, Providence, RI.
Perrone, M. P. (1994). Putting it all together: Methods for combining neural networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 1188–1189). San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1996). Bagging, boosting and C4.5. In Machine Learning: Proceedings of the Fourteenth International Conference (pp. 725–730).
Rätsch, G., Onoda, T., & Müller, K.-R. (1998). Soft margins for AdaBoost (Tech. Rep. No. NC-TR-1998-021). Egham, Surrey: Royal Holloway College, University of London.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schapire, R. E. (1999). Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, EuroCOLT (pp. 1–10).
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference (pp. 322–330).
Schapire, R. E., & Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the 11th Annual Conference on Computational Learning Theory.
Schwenk, H., & Bengio, Y. (1997). Adaboosting neural networks: Application to on-line character recognition. In International Conference on Artificial Neural Networks (pp. 967–972). Berlin: Springer-Verlag.
Schwenk, H., & Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 647–653). Cambridge, MA: MIT Press.
Schwenk, H., & Milgram, M. (1996). Constraint tangent distance for on-line character recognition. In International Conference on Pattern Recognition (pp. D:520–524). New York: IEEE Computer Society Press.
Tibshirani, R. (1996). Bias, variance and prediction error for classification rules (Tech. Rep.). Toronto: Department of Statistics, University of Toronto.

Received May 13, 1998; accepted March 24, 1999.
LETTER
Communicated by Léon Bottou
Gradient-Based Optimization of Hyperparameters

Yoshua Bengio
Département d'informatique et recherche opérationnelle, Université de Montréal, Montréal, Québec, Canada, H3C 3J7

Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.

1 Introduction
Machine learning algorithms pick a function from a set of functions F in order to minimize something that cannot be measured, only estimated: the expected generalization performance of the chosen function. Many machine learning algorithms can be formulated as the minimization of a training criterion, which involves a hyperparameter, kept fixed during this minimization. For example, in the regularization framework (Tikhonov & Arsenin, 1977; Poggio, Torre, & Koch, 1985), one hyperparameter controls the strength of the penalty term: a larger penalty term reduces the "complexity" of the resulting function (it forces the solution to lie in a "smaller" subset of F). A common example is weight decay (Hinton, 1987), used with neural networks and linear regression (also known in that case as ridge regression; Hoerl & Kennard, 1970), to penalize the L2-norm of the parameters. Increasing the penalty term (increasing the weight decay hyperparameter) corresponds to reducing the effective capacity (Guyon, Vapnik, Boser, Bottou, & Solla, 1992) by forcing the solution to be in a zero-centered hypersphere of smaller radius, which may improve generalization. A regularization term can also be interpreted as an a priori probability distribution on F: in that case the weight decay is a scale parameter (e.g., inverse variance) of that distribution. A model selection criterion can be used to select hyperparameters, or more generally to compare and choose among models that may have different capacities. Many model selection criteria have been proposed in the past

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1889–1900 (2000)
(Vapnik, 1982; Akaike, 1974; Craven & Wahba, 1979). When there is only a single hyperparameter, one can easily explore how its value affects the model selection criterion: typically one tries a finite number of values of the hyperparameter and picks the one that gives the lowest value of the model selection criterion. In this article, we present a methodology to select simultaneously many hyperparameters using the gradient of the model selection criterion with respect to the hyperparameters. This methodology can be applied when some differentiability and continuity conditions of the training criterion are satisfied. The use of multiple hyperparameters has already been proposed in the Bayesian literature. One hyperparameter per input feature was used to control the prior on the parameters associated with that input feature (MacKay, 1995; Neal, 1998). In this case, the hyperparameters can be interpreted as scale parameters for the prior distribution on the parameters, for different directions in parameter space. Note that independently of this work, Larsen et al. (1998) have proposed a procedure similar to the one presented here, for optimizing regularization parameters of a neural network (different weight decays for different layers of the network). In sections 2, 3, and 4, we explain how the gradient with respect to the hyperparameters can be computed. In the conclusion, we briefly describe the results of preliminary experiments performed with the proposed methodology (described in more detail in Latendresse, 1999, and Bengio & Dugas, 1999), and we raise some important open questions concerning the kind of "overfitting" that can occur with the proposed methodology.

2 Objective Functions for Hyperparameters
We are given a set of independent data points, D = {z_1, . . . , z_T}, all generated by the same unknown distribution P(Z). We are given a set of functions F indexed by a parameter θ ∈ Ω (i.e., each value of the parameter θ corresponds to a function in F). In our applications we will have Ω ⊆ R^s. We would like to choose a value of θ that minimizes the expectation E_Z(Q(θ, Z)) of a given loss functional Q(θ, Z). In supervised learning problems, we have input-output pairs Z = (X, Y), with X ∈ 𝒳, Y ∈ 𝒴, and θ is associated with a function f_θ from 𝒳 to 𝒴. For example, we will consider the case of the quadratic loss, with real-valued vectors 𝒴 ⊆ R^m and Q(θ, (X, Y)) = ½ (f_θ(X) − Y)′(f_θ(X) − Y). Note that we use the letter θ to represent parameters and the letter λ to represent hyperparameters. In the next section, we will provide a formulation for the cases in which Q is quadratic in θ (e.g., quadratic loss with constant or affine function sets). In section 4, we will consider more general classes of functions and loss, which may be applied to the case of multilayer neural networks, for example.

2.1 Training Criteria. In its most general form, a training criterion C is any real-valued function of the set of empirical losses Q(θ, z_i) and of some
hyperparameters λ: C(θ, λ, D) = c(λ, Q(θ, z_1), Q(θ, z_2), . . . , Q(θ, z_T)). Here λ is assumed to be a real-valued vector λ = (λ_1, . . . , λ_q). The proposed method relies on the assumption that C is continuous and differentiable almost everywhere with respect to θ and λ. When the hyperparameters are fixed, the learning algorithm attempts to perform the following minimization:

    θ(λ, D) = argmin_θ C(θ, λ, D).    (2.1)

An example of a training criterion with hyperparameters is

    C = Σ_{(x_i,y_i)∈D} w_i(λ) (f_θ(x_i) − y_i)^2 + θ′A(λ)θ,    (2.2)
where the hyperparameters provide different quadratic penalties to different parameters (with the matrix A(λ)) and different weights to different training patterns (with w_i(λ)), as in Bengio & Dugas, 1999, and Latendresse, 1999.

2.2 Model Selection Criteria. The model selection criterion E is used to select hyperparameters or, more generally, to choose one model among several. Ideally, it should be the expected generalization error (for a fixed λ), but P(Z) is unknown, so many alternatives (approximations, bounds, or empirical estimates) have been proposed. Most model selection criteria have been proposed for selecting a single hyperparameter that controls the "complexity" of the class of functions in which the learning algorithm finds a solution, such as the minimum description length principle (Rissanen, 1990), structural risk minimization (Vapnik, 1982), the Akaike Information Criterion (Akaike, 1974), or the generalized cross-validation criterion (Craven & Wahba, 1979). Other criteria are those based on held-out data, such as the cross-validation estimates of generalization error. These are almost unbiased estimates of generalization error (Vapnik, 1982) obtained by testing f_θ on data not used to choose θ. For example, the K-fold cross-validation estimate uses K partitions of D, S_1^1 ∪ S_1^2, S_2^1 ∪ S_2^2, . . . , S_K^1 ∪ S_K^2:

    E_cv(λ, D) = (1/K) Σ_i (1/|S_i^2|) Σ_{z_t ∈ S_i^2} Q(θ(λ, S_i^1), z_t).
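The estimate E_cv can be sketched generically as follows; `fit` and `loss` are placeholder names standing in for the learning algorithm θ(λ, ·) and the loss Q, and are not from the article:

```python
import numpy as np

def cross_validation_criterion(fit, loss, Z, K):
    """K-fold estimate E_cv: train on each S_i^1, then average the loss
    Q(theta(lambda, S_i^1), z_t) over the held-out part S_i^2.

    fit:  maps a training array to parameters theta (placeholder)
    loss: maps (theta, z) to a scalar (placeholder)
    Z:    array of data points, K: number of folds
    """
    folds = np.array_split(np.arange(len(Z)), K)
    total = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(Z)), held_out)
        theta = fit(Z[train])                        # theta(lambda, S_i^1)
        total += np.mean([loss(theta, z) for z in Z[held_out]])
    return total / K
```

For instance, with `fit` the training mean and `loss` the squared error, the criterion is the average held-out squared error across folds.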
When θ is fixed, the empirical risk (1/T) Σ_t Q(θ, z_t) is an unbiased estimate of the generalization error of f_θ (but it becomes an optimistic estimate when θ is chosen to minimize the empirical risk). Similarly, when λ is fixed, the cross-validation criterion is an almost unbiased estimate (when K approaches |D|) of the generalization error of θ(λ, D). When λ is chosen to minimize the cross-validation criterion, this minimum value also becomes an optimistic estimate. And when there is a greater diversity of values Q(f_{θ(λ,D)}, z) that can be obtained for different values of λ, there is more risk of overfitting the hyperparameters. In this sense, the use of hyperparameters proposed in this article can be very different from the common use in which a hyperparameter helps to control overfitting. Instead, a blind use of the extra freedom brought by many hyperparameters could deteriorate generalization.

3 Optimizing Hyperparameters for a Quadratic Training Criterion
In this section we analyze the simpler case in which the training criterion C is a quadratic polynomial of the parameters θ. The dependence on the hyperparameters λ can be of higher order, as long as it is continuous and differentiable almost everywhere (see, for example, Bottou, 1998, for more detailed technical conditions sufficient for stochastic gradient descent):

    C = a(λ) + b(λ)′θ + ½ θ′H(λ)θ,    (3.1)

where θ, b ∈ R^s, a ∈ R, and H ∈ R^{s×s}. For a minimum of equation 3.1 to exist, H must be positive definite. The minimum can be obtained by solving the linear system

    ∂C/∂θ = b + Hθ = 0,    (3.2)

which yields the solution

    θ(λ) = −H^{-1}(λ) b(λ).    (3.3)
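Equations 3.1 to 3.3 can be checked numerically. The following sketch builds a random positive definite H (toy values, not from the article) and verifies that the solution of the linear system makes the gradient of equation 3.2 vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
s = 4
A = rng.standard_normal((s, s))
H = A @ A.T + s * np.eye(s)         # symmetric positive definite Hessian
b = rng.standard_normal(s)

def criterion(theta):
    # C = a + b' theta + 1/2 theta' H theta (a omitted: it does not move the minimum)
    return b @ theta + 0.5 * theta @ H @ theta

theta_star = -np.linalg.solve(H, b)  # theta(lambda) = -H^{-1} b, eq. 3.3

# The gradient b + H theta vanishes at the solution (eq. 3.2):
assert np.allclose(b + H @ theta_star, 0.0)
```

Because H is positive definite, any perturbation of `theta_star` strictly increases the criterion, confirming that the stationary point is the minimum.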
Assuming that E depends only on λ through θ, the gradient of the model selection criterion E with respect to λ is

    ∂E/∂λ = (∂E/∂θ)(∂θ/∂λ).

If there were a direct dependency (not through θ), an extra partial derivative would have to be added. For example, in the case of the cross-validation criteria,

    ∂E_cv/∂θ = (1/K) Σ_i (1/|S_i^2|) Σ_{z_t ∈ S_i^2} ∂Q(θ, z_t)/∂θ.
In the quadratic case, the influence of λ on θ is spelled out by equation 3.3, yielding

    ∂θ_i/∂λ = − Σ_j (∂H^{-1}_{i,j}/∂λ) b_j − Σ_j H^{-1}_{i,j} (∂b_j/∂λ).    (3.4)
Although the second sum can be readily computed, ∂H^{-1}_{i,j}/∂λ in the first sum is more challenging; we consider several methods below. One solution is based on the computation of gradients through the inverse of a matrix. This general but inefficient solution is the following:

    ∂H^{-1}_{i,j}/∂λ = Σ_{k,l} (∂H^{-1}_{i,j}/∂H_{k,l}) (∂H_{k,l}/∂λ),

where

    ∂H^{-1}_{i,j}/∂H_{k,l} = −H^{-1}_{i,j} H^{-1}_{l,k} + 1_{i≠l, j≠k} H^{-1}_{i,j} minor(H, j, i)^{-1}_{l′,k′},    (3.5)
where minor(H, j, i) denotes the "minor matrix," obtained by removing the jth row and the ith column from H, and the indices (l′, k′) in equation 3.5 refer to the position within the minor matrix that corresponds to the position (l, k) in H (note l ≠ i and k ≠ j). Unfortunately, the computation of this gradient requires O(s^5) multiply-add operations for an s×s matrix H, which is much more than is required by the inversion of H (O(s^3) operations). A better solution is based on the following equality: HH^{-1} = I, where I is the s×s identity matrix. Differentiating with respect to λ gives (∂H/∂λ)H^{-1} + H(∂H^{-1}/∂λ) = 0. Isolating ∂H^{-1}/∂λ, we get

    ∂H^{-1}/∂λ = −H^{-1} (∂H/∂λ) H^{-1},    (3.6)
which requires only about 2s^3 multiply-add operations. An even better solution (which was suggested by Léon Bottou) is to return to equation 3.2, which can be solved using about s^3/3 multiply-add operations (when θ ∈ R^s). The idea is to backpropagate gradients through each of the operations performed to solve the linear system. The objective is to compute the gradient of E with respect to H and b through the effect of H and b on θ, in order to compute ∂E/∂λ, as illustrated in Figure 1. The backpropagation costs the same as the linear system solution (about s^3/3 multiply-add operations), so this is the approach that we have kept for our implementation. Since H is the Hessian matrix, it is positive definite and symmetric, and equation 3.2 can be solved through the Cholesky decomposition of H (assuming H is full rank, which is likely if the hyperparameters provide some sort of weight decay). The Cholesky decomposition of a symmetric positive definite matrix H gives H = LL′, where L is a lower triangular matrix (with zeros above the diagonal). It is computed in time O(s^3) as follows:

    for i = 1, . . . , s
        L_{i,i} = sqrt( H_{i,i} − Σ_{k=1}^{i−1} L_{i,k}^2 )
        for j = i + 1, . . . , s
            L_{j,i} = ( H_{i,j} − Σ_{k=1}^{i−1} L_{i,k} L_{j,k} ) / L_{i,i}
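The recursion above translates directly into code; a straightforward (unoptimized) numpy version, written only to mirror the two-loop structure of the text:

```python
import numpy as np

def cholesky_lower(H):
    """Cholesky factor L (zeros above the diagonal) with H = L L',
    following the two-loop recursion given in the text."""
    s = H.shape[0]
    L = np.zeros_like(H, dtype=float)
    for i in range(s):
        # diagonal entry: sqrt of H_ii minus the squared entries already placed in row i
        L[i, i] = np.sqrt(H[i, i] - np.dot(L[i, :i], L[i, :i]))
        for j in range(i + 1, s):
            # below-diagonal entries of column i
            L[j, i] = (H[i, j] - np.dot(L[i, :i], L[j, :i])) / L[i, i]
    return L
```

On any symmetric positive definite matrix this agrees with `np.linalg.cholesky`.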
Figure 1: Illustration of forward paths (full lines) and gradient paths (dashed) for the computation of the model selection criterion E and its derivative with respect to the hyperparameters (λ), when using the method based on the Cholesky decomposition and backsubstitution to solve for the parameters (θ).
Once the Cholesky decomposition is achieved, the linear system LL′θ = −b can easily be solved in two backsubstitution steps: (1) first solve Lu = −b for u; (2) then solve L′θ = u for θ:

1. Iterate once forward through the rows of L:
       for i = 1, . . . , s,
           u_i = ( −b_i − Σ_{k=1}^{i−1} L_{i,k} u_k ) / L_{i,i}.

2. Iterate once backward through the rows of L:
       for i = s, . . . , 1,
           θ_i = ( u_i − Σ_{k=i+1}^{s} L_{k,i} θ_k ) / L_{i,i}.
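The two backsubstitution passes can be sketched as follows (assuming L is the Cholesky factor of H; the function name is illustrative):

```python
import numpy as np

def solve_via_cholesky(L, b):
    """Solve L L' theta = -b with the two backsubstitution passes:
    a forward pass for L u = -b, then a backward pass for L' theta = u."""
    s = len(b)
    u = np.zeros(s)
    for i in range(s):                       # forward: row i of L
        u[i] = (-b[i] - np.dot(L[i, :i], u[:i])) / L[i, i]
    theta = np.zeros(s)
    for i in range(s - 1, -1, -1):           # backward: column i of L (row i of L')
        theta[i] = (u[i] - np.dot(L[i + 1:, i], theta[i + 1:])) / L[i, i]
    return theta, u
```

The result satisfies Hθ = −b, i.e., equation 3.3.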
The computation of the gradient of θ with respect to the elements of H and b proceeds in exactly the reverse order. We start by backpropagating through the backsubstitution steps and then through the Cholesky decomposition. Together, the three algorithms that follow allow computing ∂E/∂b and ∂E/∂H, starting from the "partial" parameter gradient ∂E/∂θ_i |_{θ_1,...,θ_s} (not taking into account the dependencies of θ_i on θ_j for j > i). As intermediate results, the algorithm computes the partial derivatives with respect to u and L, as well as the "full gradients" ∂E/∂θ_i with respect to θ, taking into account all the dependencies between the θ_i's brought by the recursive computation of the θ_i's. First backpropagate through the solution of L′θ = u:

    initialize dEdtheta ← ∂E/∂θ |_{θ_1,...,θ_s}
    initialize dEdL ← 0
    for i = 1, . . . , s
        dEdu_i ← dEdtheta_i / L_{i,i}
        dEdL_{i,i} ← dEdL_{i,i} − dEdtheta_i θ_i / L_{i,i}
        for k = i + 1, . . . , s
            dEdtheta_k ← dEdtheta_k − dEdtheta_i L_{k,i} / L_{i,i}
            dEdL_{k,i} ← dEdL_{k,i} − dEdtheta_i θ_k / L_{i,i}

Then backpropagate through the solution of Lu = −b:

    for i = s, . . . , 1
        ∂E/∂b_i ← −dEdu_i / L_{i,i}
        dEdL_{i,i} ← dEdL_{i,i} − dEdu_i u_i / L_{i,i}
        for k = 1, . . . , i − 1
            dEdu_k ← dEdu_k − dEdu_i L_{i,k} / L_{i,i}
            dEdL_{i,k} ← dEdL_{i,k} − dEdu_i u_k / L_{i,i}

The above algorithm gives the gradient of the model selection criterion E with respect to the coefficient b(λ) of the training criterion, as well as with respect to the lower triangular matrix L, dEdL_{i,j} = ∂E/∂L_{i,j}. Finally, we backpropagate through the Cholesky decomposition, to convert the gradients with respect to L into gradients with respect to the Hessian H(λ):

    for i = s, . . . , 1
        for j = s, . . . , i + 1
            dEdL_{i,i} ← dEdL_{i,i} − dEdL_{j,i} L_{j,i} / L_{i,i}
            ∂E/∂H_{i,j} ← dEdL_{j,i} / L_{i,i}
            for k = 1, . . . , i − 1
                dEdL_{i,k} ← dEdL_{i,k} − dEdL_{j,i} L_{j,k} / L_{i,i}
                dEdL_{j,k} ← dEdL_{j,k} − dEdL_{j,i} L_{i,k} / L_{i,i}
        ∂E/∂H_{i,i} ← ½ dEdL_{i,i} / L_{i,i}
        for k = 1, . . . , i − 1
            dEdL_{i,k} ← dEdL_{i,k} − dEdL_{i,i} L_{i,k} / L_{i,i}
Note that we have computed gradients only with respect to the diagonal and upper diagonal of H because H is symmetric. Once we have the gradients of E with respect to b and H, we use the functional forms of b(λ) and H(λ) to compute the gradient of E with respect to λ:

∂E/∂λ = Σ_i (∂E/∂b_i)(∂b_i/∂λ) + Σ_{i,j} (∂E/∂H_{i,j})(∂H_{i,j}/∂λ)
(again, we have assumed that there is no direct dependency of E on λ; otherwise, an extra term must be added). Using this approach rather than the one described in the previous subsection, the overall computation of gradients therefore takes about s³/3 multiply-add operations rather than O(s⁵). The most expensive step is the backpropagation through the Cholesky decomposition itself (three nested s-iteration loops). This step may be shared if there are several linear systems to solve with the same Hessian matrix. For example, this occurs in linear regression with multiple outputs, because H is block-diagonal, with one block for each set of parameters associated with one output, and all blocks are the same (equal to the input "design matrix" Σ_t x_t x_t′, where x_t denotes the input training vectors). Only a single Cholesky computation needs to be done, shared across all blocks.

3.1 Weight Decays for Linear Regression. In this section, we illustrate the method in the particular case of multiple weight decays for linear regression, with K-fold cross-validation as the model selection criterion. The hyperparameter λ_j is a weight decay associated with the jth input variable. The training criterion for the kth partition is
C_k = (1/|S₁ᵏ|) Σ_{(x_t,y_t)∈S₁ᵏ} ½ (θx_t − y_t)′(θx_t − y_t) + ½ Σ_j λ_j Σ_i θ²_{i,j}.   (3.7)
The objective is to penalize each of the input variables separately (as in MacKay, 1995; Neal, 1998), a kind of "soft variable selection" (see Latendresse, 1999, for more discussion and experiments with this setup). The training criterion is quadratic, as in equation 3.1, with coefficients

a = ½ Σ_t y_t′ y_t,

b_{(ij)} = −Σ_t y_{t,i} x_{t,j},

H_{(ij),(i′j′)} = δ_{i,i′} Σ_t x_{t,j} x_{t,j′} + δ_{i,i′} δ_{j,j′} λ_j,
where δ_{i,j} = 1 when i = j and 0 otherwise, and (ij) is an index corresponding to indices (i, j) in the weight matrix θ, for example, (ij) = (i − 1) × s + j. From the above definition of the coefficients of C, we obtain their partial derivatives with respect to λ:

∂b/∂λ = 0,    ∂H_{(ij),(i′j′)}/∂λ_k = δ_{i,i′} δ_{j,j′} δ_{j,k}.
By plugging the above definitions of the coefficients and their derivatives into the equations and algorithms of the previous section, we therefore obtain an algorithm for computing the gradient of the model selection criterion with respect to the input weight decays of a linear regression. Note that here H is block-diagonal, with m identical blocks of size (n + 1), so the Cholesky decomposition (and similarly backpropagating through it) can be performed in about (s/m)³/3 multiply-add operations rather than s³/3 operations, where m is the number of outputs (the dimension of the output variable).
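To make the weight-decay setup concrete, the following sketch (single output, made-up data; NumPy in place of the Cholesky machinery) builds the coefficients H(λ) and b above, fits θ, and checks the hyperparameter gradient ∂θ/∂λ_k = −H⁻¹ (∂H/∂λ_k) θ, which follows from the quadratic-case formula of the next section together with ∂b/∂λ = 0, against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_obs = 4, 30
X = rng.standard_normal((n_obs, n_in))
y = X @ rng.standard_normal(n_in) + 0.1 * rng.standard_normal(n_obs)
lam = np.full(n_in, 0.5)                 # one weight decay per input

def fit(lam):
    H = X.T @ X + np.diag(lam)           # Hessian of the quadratic criterion
    b = -X.T @ y                         # b_(ij) = -sum_t y_t x_{t,j}
    return -np.linalg.solve(H, b)        # theta = -H^{-1} b

theta = fit(lam)
H = X.T @ X + np.diag(lam)

# dtheta/dlam_k = -H^{-1} (dH/dlam_k) theta = -theta_k * H^{-1} e_k
k = 1
analytic = -theta[k] * np.linalg.solve(H, np.eye(n_in)[:, k])

eps = 1e-6
lam_pert = lam.copy()
lam_pert[k] += eps
fd = (fit(lam_pert) - theta) / eps
assert np.allclose(fd, analytic, atol=1e-4)
```

The normalization constants of equation 3.7 are dropped here for brevity; they rescale the coefficients but not the structure of the check.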
Gradient-Based Optimization of Hyperparameters
1897
4 Nonquadratic Criterion: Hyperparameters Gradient
If the training criterion C is not quadratic in the parameters θ, it will in general be necessary to apply an iterative numerical optimization algorithm to minimize it. In this section we consider what happens after this minimization is performed, at a value of θ where ∂C/∂θ is approximately zero and ∂²C/∂θ² is positive definite (otherwise we would not be at a minimum of C). The minimization of C(θ, λ, D) defines a function θ(λ, D) (see equation 2.1). With our assumption of smoothness of C, the implicit function theorem tells us that this function exists locally and is differentiable. To obtain this function, we write

F(θ, λ) = ∂C/∂θ = 0,

evaluated at θ = θ(λ, D) (at a minimum of C). Differentiating the above equation with respect to λ, we obtain

(∂F/∂θ)(∂θ/∂λ) + ∂F/∂λ = 0,

so we obtain a general formula for the gradient of the fitted parameters with respect to the hyperparameters:

∂θ(λ, D)/∂λ = −(∂²C/∂θ²)⁻¹ ∂²C/∂λ∂θ = −H⁻¹ ∂²C/∂λ∂θ.   (4.1)
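Equation 4.1 can be exercised on a one-dimensional toy criterion (our own hypothetical example, not from the text): C(θ, λ) = cosh(θ) − θ + λθ², which is nonquadratic and strictly convex in θ. Newton's method finds the minimizer, and the implicit-function formula matches a finite difference over λ:

```python
import math

def C_grad(theta, lam):
    # dC/dtheta for C(theta, lam) = cosh(theta) - theta + lam * theta^2
    return math.sinh(theta) - 1.0 + 2.0 * lam * theta

def minimize(lam, theta=0.0):
    # Newton's method on dC/dtheta = 0; d2C/dtheta2 = cosh(theta) + 2*lam > 0
    for _ in range(50):
        theta -= C_grad(theta, lam) / (math.cosh(theta) + 2.0 * lam)
    return theta

lam = 0.3
th = minimize(lam)

# equation 4.1: dtheta/dlam = -(d2C/dtheta2)^{-1} * d2C/(dlam dtheta)
hyper_grad = -(2.0 * th) / (math.cosh(th) + 2.0 * lam)

eps = 1e-6
fd = (minimize(lam + eps) - th) / eps
assert abs(fd - hyper_grad) < 1e-4
```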
Let us see how this result relates to the special case of a quadratic training criterion, C = a + b′θ + ½ θ′Hθ:

∂θ/∂λ = −H⁻¹ (∂b/∂λ + (∂H/∂λ) θ) = −H⁻¹ ∂b/∂λ + H⁻¹ (∂H/∂λ) H⁻¹ b,
where we have substituted θ = −H⁻¹b. Using the equality 3.6, we obtain the same formula as in equation 3.4. Let us consider more closely the case of a neural network with one layer of hidden units with hyperbolic tangent activations, a linear output layer, squared loss, and hidden-layer weights W_{i,j}. For example, if we want to use hyperparameters for penalizing the use of inputs, we have a criterion similar to equation 3.7:

C_k = (1/|S₁ᵏ|) Σ_{(x_t,y_t)∈S₁ᵏ} ½ (f_θ(x_t) − y_t)′(f_θ(x_t) − y_t) + ½ Σ_j λ_j Σ_i W²_{i,j},

with C = Σ_k C_k, and the cross-derivatives are easy to compute:

∂²C/∂W_{i,j}∂λ_k = δ_{k,j} W_{i,j}.
1898
Yoshua Bengio
The Hessian and its inverse require more work, but can be computed in at most O(s²) and O(s³) operations, respectively. See, for example, Bishop (1992) for the exact computation of the Hessian for multilayer neural networks. See Becker and LeCun (1989) and LeCun, Denker, and Solla (1990) for a diagonal approximation, which can be computed and inverted in O(s) operations.

5 Summary of Experiments and Conclusions
We have presented a new methodology for simultaneously optimizing several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters, taking into account the influence of the hyperparameters on the parameters. We have considered the simpler case of a training criterion that is quadratic with respect to the parameters (θ ∈ Rˢ) and the more general nonquadratic case. We have shown a particularly efficient procedure in the quadratic case that is based on backpropagating gradients through the Cholesky decomposition and backsubstitutions. This was an improvement: we arrived at this s³/3-operations procedure after studying first an O(s⁵) procedure and then a 2s³-operations procedure. In the case of input weight decays for linear regression, the computation can even be reduced to about (s/m)³/3 operations when there are m outputs.

We have performed preliminary experiments with the proposed methodology in several simple cases, using conjugate gradients to optimize the hyperparameters. The application to linear regression with weight decays for each input is described in Latendresse (1999). The hyperparameter optimization algorithm is used to perform a soft selection of the input variables: a large weight decay on one of the inputs effectively forces the corresponding weights to very small values. Comparisons on simulated data sets are made in Latendresse (1999) with ordinary regression as well as with stepwise regression methods and the adaptive ridge (Grandvalet, 1998) or LASSO (Tibshirani, 1995). Another type of application of the proposed method has been explored in the context of a real-world problem of nonstationary time-series prediction (Bengio & Dugas, 1999). In this case, an extension of the cross-validation criterion to sequential data that may be nonstationary is used. Because of this nonstationarity, recent data may sometimes be more relevant to current predictions than older data.
The training criterion is a sum of weighted errors over the past examples, and these weights are given by a parameterized function of time (like the w_i(λ) in equation 2.2). The parameters of that function are two hyperparameters that control when a transition in the unknown generating process is presumed to have occurred and how strong that change was, or how much it should be trusted. In these experiments, the weight given to past data points is a sigmoid function of time: the threshold and the slope of the sigmoid are the hyperparameters, representing, respectively, the time of a strong transition and the strength of that transition. Optimizing
these hyperparameters, we obtained statistically significant improvements in predicting the one-month-ahead future volatility of Canadian stocks. The comparisons were made against several linear, constant, and ARMA models of the volatility. The experiments were performed on monthly return data from 473 Canadian stocks from 1976 to 1996. The measure of performance is the average out-of-sample squared error in predicting the squared returns. Single-sided significance tests were performed, taking into account the autocovariance in the temporal series of errors and the covariance of the errors between the compared models. When comparing predictions of the first moment (expected return), no model significantly improved on the historical average of stock returns (the constant model). When comparing predictions of the second moment (expected squared returns), the method based on hyperparameter optimization beat all the other methods, with a p-value of 1% or less.

What remains to be done? First, we need more experiments, in particular with the nonquadratic case (e.g., MLPs) and with model selection criteria other than cross-validation (which has large variance; Breiman, 1996). Second, important theoretical questions remain unanswered concerning the amount of overfitting that can occur when too many hyperparameters are optimized. As we outlined in the introduction, the situation with hyperparameters may be compared with the situation with parameters. However, whereas the form of the training criterion as a sum of independent errors allows defining the capacity of a class of functions and relating it to the difference between generalization error and training error, more work needs to be done to obtain a similar analysis for hyperparameters.

Acknowledgments
The author would like to thank Léon Bottou, Pascal Vincent, François Blanchette, and François Gingras, as well as the NSERC Canadian funding agency.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–728.
Becker, S., & LeCun, Y. (1989). Improving the convergence of back-propagation learning with second order methods. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Bengio, Y., & Dugas, C. (1999). Learning simple non-stationarities with hyperparameters. Submitted.
Bishop, C. (1992). Exact calculation of the Hessian matrix for the multi-layer perceptron. Neural Computation, 4, 494–501.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning in neural networks. Cambridge: Cambridge University Press.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions. Numerical Mathematics, 31, 377–403.
Grandvalet, Y. (1998). Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, & T. Ziemske (Eds.), ICANN'98 (pp. 201–206). Berlin: Springer-Verlag.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., & Solla, S. A. (1992). Structural risk minimization for character recognition. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 471–479). San Mateo, CA: Morgan Kaufmann.
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. de Bakker, A. Nijman, & P. Treleaven (Eds.), Proceedings of PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Larsen, J., Svarer, C., Andersen, L. N., & Hansen, L. K. (1998). Adaptive regularization in neural networks modeling. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 113–132). Berlin: Springer-Verlag.
Latendresse, S. (1999). Utilisation d'hyper-parametres pour la selection de variable. M.Sc. thesis, University of Montreal.
LeCun, Y., Denker, J., & Solla, S. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
MacKay, D. (1995). Probable networks and plausible predictions: A review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505. Available online at: http://www.iro.umontreal.ca/~lisa.
Neal, R. (1998). Assessing relevance determination methods using DELVE. In C. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Berlin: Springer-Verlag.
Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Rissanen, J. (1990). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Tibshirani, R. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.
Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. Washington, DC: W. H. Winston.
Vapnik, V. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag.

Received June 1, 1999; accepted August 30, 1999.
LETTER
Communicated by Andrew Back
A Signal-Flow-Graph Approach to On-line Gradient Calculation

Paolo Campolucci, Aurelio Uncini, Francesco Piazza
Dipartimento di Elettronica ed Automatica, Università di Ancona, 60121 Ancona, Italy

A large class of nonlinear dynamic adaptive systems, such as dynamic recurrent neural networks, can be effectively represented by signal flow graphs (SFGs). By this method, complex systems are described as a general connection of many simple components, each of them implementing a simple one-input, one-output transformation, as in an electrical circuit. Even if graph representations are popular in the neural network community, they are often used for qualitative description rather than for rigorous representation and computational purposes. In this article, a method for both on-line and batch-backward gradient computation of a system output or cost function with respect to system parameters is derived by the SFG representation theory and its known properties. The system can be any causal, in general nonlinear and time-variant, dynamic system represented by an SFG, in particular any feedforward, time-delay, or recurrent neural network. In this work, we use discrete-time notation, but the same theory holds for the continuous-time case. The gradient is obtained in a straightforward way by the analysis of two SFGs, the original one and its adjoint (obtained from the first by simple transformations), without the complex chain rule expansions of derivatives usually employed. This method can be used for sensitivity analysis and for learning both off-line and on-line. On-line learning is particularly important since it is required by many real applications, such as digital signal processing, system identification and control, channel equalization, and predistortion.

1 Introduction
For many practical problems, such as time-series forecasting and the classification, identification, and control of nonlinear dynamic systems, feedforward neural networks (NNs) are not adequate; therefore several recurrent neural network (RNN) architectures have been proposed (Tsoi & Back, 1994;
*Address reprint requests or comments to Paolo Campolucci at campoluc@tiscalinet.it.

Neural Computation 12, 1901–1927 (2000)   © 2000 Massachusetts Institute of Technology
Nerrand, Roussel-Ragot, Personnaz, Dreyfus, & Marcos, 1993; Narendra & Parthasarathy, 1991). RNNs can be essentially divided into three classes: fully recurrent NNs (FRNNs) (Williams & Peng, 1990; Williams & Zipser, 1989; Haykin, 1994), locally recurrent globally feedforward NNs (Back & Tsoi, 1991; Tsoi & Back, 1994; Campolucci, Piazza, & Uncini, 1995; Campolucci, Uncini, Piazza, & Rao, 1999), and NARX architectures (Haykin, 1994; Narendra & Parthasarathy, 1991; Horne & Giles, 1995).

In order to train an RNN adaptively, an on-line learning algorithm must be derived for each specific architecture. In many learning algorithms, the gradient of a cost function must be estimated and then used in a parameter updating scheme (e.g., steepest descent or conjugate gradient). By gradient we mean the vector of derivatives of a given variable (here, a system output or a function of it, e.g., a cost function) with respect to a set of parameters that influence that variable (here, the parameters of a subset of network components). When RNNs (or, in general, systems with feedback) are involved, the calculation of the gradient is more complex and difficult than in the case of feedforward neural networks (or systems without feedback). Due to feedback, the current output of an RNN depends on its past outputs, so that the present output depends not only on the present parameter values but also on the past ones, and this dependence has to be considered in calculating the gradient. Quite often in the neural network literature, this dependence has been totally or partially neglected to simplify the derivation of the gradient.

There are two different schemes to compute the gradient: the forward computation approach and the backward computation approach. These two names stem from the order in time in which the computations take place.
Forward means that both time and the propagation of gradient information across the network flow as in the forward pass (i.e., the computation of the network outputs from its inputs); backward indicates the opposite flow. Real-time recurrent learning (RTRL) (Williams & Zipser, 1989) and truncated backpropagation through time (truncated BPTT) (Williams & Peng, 1990) implement the forward and backward on-line computation, respectively. The major drawback of the forward computation over the backward one is its high computational complexity (Williams & Peng, 1990; Williams & Zipser, 1994; Srinivasan, Prasad, & Rao, 1994; Nerrand et al., 1993). On the other hand, the use of the backward technique for RNNs with arbitrary architectures is not straightforward; the mathematics of each particular architecture has to be worked out. For systems with feedback or with internal time delays, the chain rule derivation can easily become complex and sometimes intractable (Werbos, 1990; Williams & Zipser, 1994; Back & Tsoi, 1991; Pearlmutter, 1995; Campolucci et al., 1999).

An interesting approach to the computation of gradient information in RNNs with arbitrary architectures was proposed by Wan and Beaufays (1996), who used a diagrammatic derivation to obtain the batch BPTT algorithm. The idea underlying their method is to describe the neural network by a graph model composed of some basic blocks, including summing junctions, branching points, nonlinear functions, weights, and delay operators. Then another graph is introduced to compute the gradient, substituting summing junctions with branching points, and vice versa; nonlinear functions with gains; and delay operators with "advance" operators, while the weights are simply replicated in the new graph (named the reciprocal network). This method considers the derivative system obtained by linearizing the original network. By unfolding the network, multiplying the network output by the error at the proper time step, and applying the interreciprocity property, it is possible to obtain the desired gradient information from the transposed graph corresponding to the unraveled derivative network. The reciprocal network is this transposed network raveled back in time. The derived method can be computed only in batch mode, since the reciprocal network is not causal due to the presence of the noncausal advance operators. The advance operators arise from the unfolding and transposing operations on which the method is based.

Following the graph approach, but now using signal-flow-graph (SFG) representation theory and its known properties (Mason, 1953, 1956; Lee, 1974; Oppenheim & Schafer, 1975), in this article we propose a backward computation (BC) method that implements the BPTT algorithm for both on-line and batch-backward gradient computation of a function (e.g., a cost function) of the system output with respect to some parameters. The system can be any causal, nonlinear, and time-variant dynamic system represented by an SFG, in particular, any feedforward (static), time-delay, or recurrent neural network. In the following, we shall usually consider the NN framework, but the method is fully general. In this work, we use discrete-time notation; however, the theory holds for the continuous-time case.
The new method has been developed mainly for on-line learning. On-line training of adaptive systems is very important in real applications, such as digital signal processing, system identification and control, channel equalization, and predistortion, since it allows the model to follow time-varying features of the real system without interrupting the system's operation (e.g., a communication channel). It can also be used for off-line learning; in this case, when a mean-squared-error (MSE) cost function is considered, this approach gives the batch BPTT algorithm and is equivalent to the method of Wan and Beaufays (1996). While we provide a proof based on a theorem (Lee, 1974) similar to Tellegen's theorem for analog networks, the proof by Wan and Beaufays (1996) for their diagrammatic derivation is based on the interreciprocity property of transposed graphs, which is a consequence of the theorem used here. They also propose a proof of the equivalence of RTRL and BPTT in batch mode, showing that the computed weight variations are the same, though the complexities of the two algorithms are different. Beaufays and Wan (1994) do not
address the problem of the graphical derivation of RTRL, as in Wan and Beaufays (1996) for BPTT; nevertheless, the two approaches have been combined for that purpose in Wan and Beaufays (1998), where a diagrammatic derivation of an on-line learning method is reported. However, the resulting RTRL algorithm has a very high computational complexity: O(n²) compared to O(n) for the on-line backward method presented here. In this comparison, it should be considered that any on-line backward method needs a truncation of the past history; forward methods such as RTRL do not. However, it is known (Williams & Peng, 1990) that RTRL also involves a different kind of approximation, due to the fact that the weights change in time during learning. To mitigate this problem, an exponential decay on the contributions from past times can be used (Gherrity, 1989); nevertheless, no reduction of the computational complexity can be achieved as provided by the BPTT truncation strategy.

The work presented here also generalizes previous work by Osowski (1994), which proposed an SFG approach to neural network learning. That work was basically developed for feedforward networks, and the proposed extension to recurrent networks is valid only for networks that relax to a fixed point, as in the work of Almeida (1987); moreover, it is not possible to accommodate delay branches in that framework. General recurrent networks able to process temporal sequences cannot be trained by the proposed adjoint equations. Another related work is that of Martinelli and Perfetti (1991), which applies an adjoint transformation to an analog circuit implementing a multilayer perceptron (MLP), obtaining a circuit that implements backpropagation. Since this work is more related to hardware and the network is feedforward and static, it is less general and can be considered a particular case of the method presented here.
Finally, our formulation allows dealing with both the learning problem and the sensitivity calculation problem, that is, the problem of computing the derivatives of the system output with respect to some internal parameters. These derivatives can be used for designing robust electrical or numerical circuits. The concepts detailed in this article were developed as part of Paolo Campolucci's doctoral research; they are presented in detail in Campolucci (1998) and were briefly introduced in Campolucci, Marchegiani, Uncini, and Piazza (1997) and Campolucci, Uncini, and Piazza (1998).
2 Signal Flow Graphs and Lee's Theorem

A large class of complex systems, such as RNNs, can be represented by an SFG. Therefore, in this section, SFG notation and properties will be outlined following Lee's approach (1974). A different notation and properties of SFGs were also introduced by Oppenheim and Schafer (1975); SFG theory was first developed by Mason (1953, 1956).
Figure 1: Signal-flow-graph definitions. If I_A is the set of indexes of branches that come into node A and Z_A is the set of indexes of branches that come out of node A, then the following holds: x_j = Σ_{i∈I_A} v_i, ∀ j ∈ Z_A.
An SFG is defined as a set of nodes and oriented branches. A branch is oriented from node a to node b if node a is the initial node and node b is the final node of the branch. An input node is one that has only one outgoing branch, the input branch, and no incoming branch. Associated with the input branch is an input variable u_j. An output node is one that has only one incoming branch, the output branch. Associated with the output branch is an output variable y_j. All other nodes besides the input and output nodes are called n-nodes. All other branches besides the input and output branches are called f-branches. There are two variables, related by a function, associated with each of the f-branches: the initial variable x_j, at the tail of the branch, and the final variable v_j, at the head of the branch. The relationship between the initial and final variables for branch j is v_j = f_j[x_j], where f_j[.] is a general function describing the operator (or circuit component) of the jth branch. By definition, for each n-node, the value of the x_j variables associated with its outgoing branches is the sum of all the v_j variables associated with its incoming branches (see Figure 1). Let the SFG consist of p + m + r nodes (m input nodes, r output nodes, and p n-nodes) and q + m + r branches (m input branches, r output branches, and q f-branches). Note that only the branches (and not the nodes) are indexed; therefore, all the reported indexes are to be considered branch indexes. In the case of discrete-time systems, the functional relationship between the variables x_j and v_j of the jth f-branch (expressed as v_j = f_j[x_j]) is usually assumed to be

v_j(t) = g_j[x_j(t), a_j(t), t]   for a static branch (without memory),   (2.1)

v_j(t) = q⁻¹ x_j(t) = x_j(t − 1)   for a delay branch (one unit memory),   (2.2)
where q⁻¹ is the delay operator and g_j is a general differentiable function, which can depend on an adaptable parameter (or vector of parameters) a_j and also on time t, to allow shape adaptation or variation in time of the nonlinearity. For a static branch, the relationship between x_j and v_j often particularizes as

v_j(t) = w_j(t) x_j(t)   for a weight branch,
v_j(t) = f_j(x_j(t))   for a fixed nonlinear branch,   (2.3)

Figure 2: SFG representation of a generic system.
where w_j(t) is the jth (i.e., belonging to the jth branch) weight parameter of the system at time t, and f_j is a differentiable function, again of the jth branch. In the neural network context, w_j(t) is a synaptic weight and f_j a sigmoidal activation function. However, equation 2.1 can represent more general nonlinear adaptable functions controlled by a parameter (or vector of parameters) a_j, such as the recently introduced spline-based activation functions (Uncini, Vecci, Campolucci, & Piazza, 1999). We are now able to state the following definitions:

Definition 1 (SFG). The SFG G_N that represents the system is given by the connection (with or without feedback) of the branches described by equations 2.1, 2.2, and possibly 2.3 (see Figure 2).
The branches (components) are conceptually equivalent to the components of an electrical circuit, and the relationship between the SFG and the implemented system is the same as that between an electrical circuit and the system it implements. Not all SFGs can represent real systems: it is known, in fact, that in order to be computable, an SFG has to satisfy the constraint that each loop must include at least one delay branch.

Definition 2 (reversed SFG). A reversed SFG Ĝ_N of a given SFG G_N is a member of the set of SFGs obtained by reversing the orientation of all the branches in G_N, that is, by replacing the summing junctions with branching points, and vice versa.
Let û_j, ŷ_j, x̂_j, and v̂_j be the variables associated with the branches in Ĝ_N corresponding to u_j, y_j, x_j, and v_j in G_N. In Ĝ_N the input variables are ŷ_j, j = 1, …, r; the output variables are û_j, j = 1, …, m; and v̂_j and x̂_j are the initial and the final variable of the jth f-branch, respectively. Let the indexes assigned to the branches of Ĝ_N be the same as the indexes assigned to the corresponding branches of G_N. Node indexes are not required by this development. The functional relationships between the variables x̂_j and v̂_j in Ĝ_N are generic; therefore, the graph Ĝ_N effectively belongs to a set of SFGs. For an example of an SFG, see Figures 3a and 3b.

Definition 3 (transposed SFG). The transposed SFG is a particular reversed SFG such that the relationships of the corresponding branches of G_N and Ĝ_N remain the same.
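As a minimal sketch of the branch types above (the system and its values are our own, not from the article), a first-order recursive section, consisting of one weight branch, one delay branch closing the loop, and a fixed tanh branch, can be simulated directly from equations 2.1 to 2.3:

```python
import math

def run_sfg(u_seq, w=0.5, a=0.3):
    """Simulate a tiny SFG: x(t) = w*u(t) + a*x(t-1) (a summing node fed by a
    weight branch and a delayed feedback branch), with output y(t) = tanh(x(t))."""
    delay_state = 0.0                     # the delay branch's one-unit memory (eq. 2.2)
    y = []
    for u in u_seq:
        x = w * u + a * delay_state       # summing node: sum of incoming v_j's
        delay_state = x                   # stored for the next time step
        y.append(math.tanh(x))            # fixed nonlinear branch (eq. 2.3)
    return y

y = run_sfg([1.0, 0.0, 0.0])
# internal impulse response decays through the loop: x = 0.5, 0.15, 0.045, ...
```

The loop contains the delay branch, so the graph satisfies the computability constraint that every loop include at least one delay.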
For a generic SFG, the following theorem, similar to Tellegen's theorem for electrical networks (Tellegen, 1952; Penfield, Spence, & Duiker, 1970), was derived by Lee (1974). It relies on the topological properties of the original and reversed graphs, not on the functional relationships between the branch variables, which therefore can be described by any mathematical relation.

Theorem 2.1. Consider a causal dynamic system (Lee, 1974). Let G_N be the corresponding SFG and Ĝ_N a generic reversed SFG. If û_j, ŷ_j, x̂_j, and v̂_j are the variables in Ĝ_N corresponding to u_j, y_j, x_j, and v_j in G_N, respectively, then

Σ_{j=1}^r ŷ_j(t) ∗ y_j(t) + Σ_{j=1}^q x̂_j(t) ∗ x_j(t) = Σ_{j=1}^m û_j(t) ∗ u_j(t) + Σ_{j=1}^q v̂_j(t) ∗ v_j(t)   (2.4)

holds true, where "∗" denotes the convolution operator, that is,

y(t) ∗ x(t) = Σ_{s=t₀}^t y(t − s) x(s),   (2.5)

where t₀ is the initial time. For a static system, the following equation holds:

Σ_{j=1}^r ŷ_j y_j + Σ_{j=1}^q x̂_j x_j = Σ_{j=1}^m û_j u_j + Σ_{j=1}^q v̂_j v_j.   (2.6)
3 Sensitivity Computation
In network theory, a sensitivity is usually defined as the derivative of a certain output with respect to a given parameter. A general method to compute sensitivities of a dynamic continuous-time system represented by an SFG was derived by Lee (1974) for constant parameters. An analogous result can be found for a discrete-time system; however, such a derivation can be used only for sensitivity computation in batch mode. Here we present a method to compute the sensitivity with respect to time-varying parameters, that is, the derivative of a system output with respect to past or present parameters (the weights w_i and the nonlinearity control parameters a_i). For an adaptive system, they depend on time; therefore, the derivatives with respect to such parameters also depend on time. This is a generalization of the work by Lee (1974) and Wan and Beaufays (1996).

Figure 3: (a) A simple IIR-MLP neural network with two layers, two inputs, one output, and one neuron in the first layer, with two IIR filter synapses (one-zero, one-pole transfer function), and one neuron in the second layer, with one IIR filter synapse (two-zeros, two-poles transfer function). (b) A section of its SFG G_N. (c) Its adjoint SFG Ĝ_N^(a).

The derivative of an output y_k(t) of the system with respect to the parameter w_i(t − τ), where t is the time of the original SFG and τ a nonnegative integer (0 ≤ τ ≤ t), by the first equation in equation 2.3 is

∂y_k(t)/∂w_i(t − τ) = [∂y_k(t)/∂v_i(t − τ)] [∂v_i(t − τ)/∂w_i(t − τ)] = [∂y_k(t)/∂v_i(t − τ)] x_i(t − τ),   (3.1)
where $k = 1, \ldots, r$ and $i$ spans the set of indexes of the weight branches. For the parameters of the nonlinear function, by equation 2.1, the following more general expression holds:

$$\frac{\partial y_k(t)}{\partial a_i(t-\tau)} = \frac{\partial y_k(t)}{\partial v_i(t-\tau)} \frac{\partial v_i(t-\tau)}{\partial a_i(t-\tau)} = \frac{\partial y_k(t)}{\partial v_i(t-\tau)} \left.\frac{\partial g_i}{\partial a_i}\right|_{x_i(t-\tau),\, a_i(t-\tau),\, t-\tau} = \frac{\partial y_k(t)}{\partial v_i(t-\tau)}\, g_i'(t-\tau), \quad (3.2)$$
where $k = 1, \ldots, r$ and $i$ spans the set of indexes of the nonlinear parametric branches, and a simplified notation for the derivative of $g_i$ is implicitly defined. In both cases we need to compute the term $\partial y_k(t) / \partial v_i(t-\tau)$, which is the derivative of the $k$th output variable at time $t$ with respect to a signal in the system, that is, the final variable of the $i$th $f$-branch at time $t-\tau$. As for the computation of the sensitivity in an electrical circuit, an adjoint network can be introduced to perform this task. The main idea is to apply Lee's theorem (equation 2.4) to the SFG $G_N$ of the original system and a particular reversed SFG $\hat{G}_N$, here called the adjoint SFG, $\hat{G}_N^{(a)}$. It can easily be seen (see the proof in the appendix) that expression 2.4 also holds when small variations of the variables are considered:

$$\sum_{j=1}^{r} \hat{y}_j(t) * \Delta y_j(t) + \sum_{j=1}^{q} \hat{x}_j(t) * \Delta x_j(t) = \sum_{j=1}^{m+1} \hat{u}_j(t) * \Delta u_j(t) + \sum_{j=1}^{q} \hat{v}_j(t) * \Delta v_j(t).$$
Since the functional relationships of the branches of $\hat{G}_N^{(a)}$ can be freely chosen, because equation 2.4 does not depend on them, these relationships are
Table 1: Adjoint SFG Construction Rules.
Note: For batch adaptation, the adjoint branches must be defined remembering that $t = T$ in this case, where $T$ is the final instant of the epoch.
chosen in order to obtain an estimate of $\partial y_k(t) / \partial v_i(t-\tau)$ from this equation. Thus we can now state the following:

Definition 4 (adjoint SFG). The adjoint SFG $\hat{G}_N^{(a)}$ is the particular reversed graph whose $f$-branch relationships are related to the $f$-branch equations of the original graph $G_N$ by the correspondence reported in Table 1.
Table 1 shows that for each delay branch in the original SFG, the corresponding branch of the adjoint SFG is a delay branch (with zero initial condition). The nonlinearity without parameters $f_j$ in the original SFG corresponds, in the adjoint SFG, to a gain equal to the derivative of $f_j$ evaluated at the value of the initial variable of the branch at time $t-\tau$, where $t$ is the time of the original SFG and $\tau$ that of the adjoint SFG. The weight at time $t$ corresponds to the weight at time $t-\tau$ in the adjoint SFG. A similar
transformation should be performed for the general nonlinearity with control parameters. The outputs of the original SFG correspond to the inputs of the adjoint SFG. For an example of adjoint SFG construction for a simple MLP with infinite impulse response (IIR) filter synapses, see Figure 3. Moreover, the inputs of the adjoint SFG must be set to an impulse in correspondence with the output of $G_N$ whose sensitivity has to be computed, and to a constant null signal in correspondence with all the other outputs. In other words, to compute the value of $\partial y_k(t) / \partial v_i(t-\tau)$, the input of the adjoint SFG should be
$$\hat{y}_j(\tau) = \begin{cases} 0, & \forall \tau, & \text{if } j \ne k \\ 1, & \tau = 0, & \text{if } j = k \\ 0, & \tau > 0, & \text{if } j = k \end{cases} \qquad j = 1, \ldots, r. \quad (3.3)$$
Let $\hat{u}_j$, $\hat{y}_j$, $\hat{x}_j$, and $\hat{v}_j$ be the variables associated with the branches in $\hat{G}_N^{(a)}$ corresponding to $u_j$, $y_j$, $x_j$, and $v_j$ in $G_N$; it follows (see the proof in the appendix) that

$$\frac{\partial y_k(t)}{\partial v_i(t-\tau)} = \hat{v}_i(\tau), \quad (3.4)$$
where $k = 1, \ldots, r$ and $i = 1, \ldots, q$. Using this result in equation 3.1, we get

$$\frac{\partial y_k(t)}{\partial w_i(t-\tau)} = \hat{v}_i(\tau)\, x_i(t-\tau). \quad (3.5)$$
This result states that the derivative of an output variable of the SFG at time $t$ with respect to its weight parameter $w_i(t-\tau)$ at time $t-\tau$ is the product of the initial variable of the $i$th branch in the original SFG at time $t-\tau$ and the initial variable of the corresponding branch in the adjoint graph at time $\tau$. This result is easily generalized to the nonlinear parametric function:

$$\frac{\partial y_k(t)}{\partial a_i(t-\tau)} = \hat{v}_i(\tau)\, g_i'(t-\tau). \quad (3.6)$$
4 SFG Approach to Learning in Nonlinear Dynamic Systems
In the previous section, we showed how the SFG of the original network $G_N$ and the SFG of the adjoint network $\hat{G}_N^{(a)}$ can be used to compute the sensitivity of an output with respect to a past or present system parameter (equations 3.5 and 3.6). Now we show how to use this technique to adapt a dynamic network, minimizing a supervised cost function with respect to the network parameters. The idea is to consider the cost function itself as a network connected
in cascade with the dynamic network to be adapted. The entire system (named $G_S$ in the following) therefore has as inputs the inputs of the original network and the desired outputs, while the value of the cost function is its unique output. The gradient of this output (i.e., of the cost function) with respect to the system parameters (e.g., weights) now corresponds to the sensitivity of this output, which can be computed by using equations 3.5 and 3.6 again, applied now to the cascade system $G_S$.

4.1 Cost Functions and Parameter Updating Rules. Let us consider a discrete-time nonlinear dynamic system with inputs $u_k$, $k = 1, \ldots, m$, outputs $y_k$, $k = 1, \ldots, r$, and parameters $w_i$ and $a_i$, which have to be adapted with respect to an output error. Using gradient-based learning and following the steepest-descent updating rule, it holds, for example for the weight $w_i$,

$$\Delta w_i = -\mu \frac{\partial J}{\partial w_i}, \qquad \mu > 0, \quad (4.1)$$
where $J$ is the cost function, $\Delta w_i$ is the variation of the parameter $w_i$, and $\mu$ is the learning rate. The major problem is the calculation of the derivative $\partial J / \partial w_i$. Since for on-line learning the parameters can change at each time instant,

$$\frac{\partial J}{\partial w_i} = \sum_{\tau=0}^{t} \frac{\partial J}{\partial w_i(\tau)} = \sum_{\tau=0}^{t} \frac{\partial J}{\partial w_i(t-\tau)}. \quad (4.2)$$
As the length of the summation increases linearly with the current time step $t$, the summation in equation 4.2 must be truncated in order to implement the algorithm:

$$\Delta w_i(t) = -\mu \sum_{\tau=t-h+1}^{t} \frac{\partial J}{\partial w_i(\tau)} = -\mu \sum_{\tau=0}^{h-1} \frac{\partial J}{\partial w_i(t-\tau)}, \quad (4.3)$$
where $h$ is a fixed positive integer. In this way, not all the history of the system is considered, but only the most recent part, in the interval $[t-h+1, t]$. It is easy to show that a real truncation is necessary only for circuits with feedback, such as RNNs, while for feedforward networks with delays (e.g., time-delay NNs, or TDNNs) a finite $h$ can be chosen so that all the memory of the system is taken into account. An optimal selection of $h$ requires an appropriate choice for each parameter to be adapted. For layered TDNNs, $h$ should depend on the layer and should be increased moving from the last to the first layer, since more memory is involved. The updating rule (see equations 4.2 and 4.3) holds theoretically under the hypothesis of constant weights; in practice it is only a good approximation. Equations 3.5 and 3.6 do not require that hypothesis and can be used
in a more general context. Equations equivalent to 4.1, 4.2, and 4.3 can also be written for the parameters $a_i$. The most common choice for the cost function $J$ is the MSE. The instantaneous squared error at time $t$ is defined as

$$e^2(t) = \sum_{k=1}^{r} e_k^2(t) \qquad \text{with } e_k(t) = d_k(t) - y_k(t), \quad (4.4)$$
where $d_k(t)$, $k = 1, \ldots, r$, are the desired outputs. The MSE over a time interval $[t_0, t_1]$ is then given by

$$E(t_0, t_1) = \sum_{t=t_0}^{t_1} e^2(t). \quad (4.5)$$
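Equations 4.4 and 4.5 translate directly into code; a minimal sketch in plain Python (the helper names are ours), where `d_seq[t]` and `y_seq[t]` hold the desired and actual output vectors at time `t`:

```python
def squared_error(d, y):
    # Instantaneous squared error e^2(t) of equation 4.4:
    # the sum over the outputs k of (d_k(t) - y_k(t))^2.
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y))

def mse_cost(d_seq, y_seq, t0, t1):
    # E(t0, t1) of equation 4.5: squared error accumulated over [t0, t1].
    return sum(squared_error(d_seq[t], y_seq[t]) for t in range(t0, t1 + 1))
```

The on-line cost $E(t - N_c + 1, t)$ and the batch cost $E(0, T)$ discussed next are just two choices of the window $[t_0, t_1]$.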
In the case of batch training, the cost function can be chosen as the MSE over the entire learning epoch, $E(0, T)$, where $T$ is the final time of the epoch, whereas for on-line training only the most recent errors $e^2(t)$ must be considered, for example, using $E(t - N_c + 1, t)$ with the constant $N_c$ properly chosen (Nerrand et al., 1993; Narendra & Parthasarathy, 1991). Therefore it holds:

$$\begin{cases} E(t - N_c + 1, t) = \displaystyle\sum_{s=t-N_c+1}^{t} e^2(s) & \text{on-line training} \\[8pt] E(0, T) = \displaystyle\sum_{s=0}^{T} e^2(s) & \text{batch training.} \end{cases}$$
Assuming $J = E(t_0, t_1)$, it follows that

$$\Delta w_i(t) = -\mu \sum_{\tau=0}^{h-1} \frac{\partial E(t - N_c + 1, t)}{\partial w_i(t-\tau)} \qquad \text{for on-line training} \quad (4.6a)$$

and

$$\Delta w_i = -\mu \sum_{\tau=0}^{T} \frac{\partial E(0, T)}{\partial w_i(T-\tau)} \qquad \text{for batch training.} \quad (4.6b)$$
4.2 Cost Function Gradient Computation by SFGs. Let us consider the SFG $G_S$ obtained by a cascade connection of the SFG $G_N$ of the system to be adapted and the SFG that implements the cost function $J$ (named $G_J$ in the following) (see Figure 4 for an example). The SFG $G_J$ has as inputs the outputs of the adaptable system $y_k$ and the targets $d_k$, while the value of $J$ is its unique output. Obviously, $G_J$ can contain the same kinds of operators as $G_N$: delays, nonlinearities, parameters to be adapted (e.g., for regularization purposes), and feedback. The class of cost functions allowed by this approach is therefore enlarged with respect to other approaches, since the cost expression can have memory and be recursive. Even applications that require very complex cost functions can be handled easily by the SFG approach.

Figure 4: (a) The SFG $G_S$: the system SFG $G_N$ in cascade with the error-calculation SFG $G_J$, with $J = E(t-3, t)$; $\mathrm{sqr}(x)$ means $x^2$, and $d_j$, $j = 1, \ldots, r$, are the desired outputs. (b) Its adjoint $\hat{G}_S^{(a)}$, that is, $\hat{G}_N^{(a)}$ in cascade with $\hat{G}_J^{(a)}$.
Using the adjoint network $\hat{G}_S^{(a)}$ of the new SFG $G_S$, it is easy to calculate the sensitivities of the output with respect to the system parameters (see equations 3.5 and 3.6). Since the output $y_k$ ($k = 1$) of $G_S$ is the cost function $J$, combining equation 4.2 with 3.5 or 3.6 and truncating the past history to $h$ steps (on-line learning), it holds:
$$\frac{\partial J(t)}{\partial w_i} = \sum_{\tau=0}^{h-1} \hat{v}_i(\tau)\, x_i(t-\tau) \qquad \text{on-line} \quad (4.7a)$$

$$\frac{\partial J}{\partial w_i} = \sum_{\tau=0}^{T} \hat{v}_i(\tau)\, x_i(T-\tau) \qquad \text{batch} \quad (4.7b)$$

$$\frac{\partial J(t)}{\partial a_i} = \sum_{\tau=0}^{h-1} \hat{v}_i(\tau)\, g_i'(t-\tau) \qquad \text{on-line} \quad (4.8a)$$

$$\frac{\partial J}{\partial a_i} = \sum_{\tau=0}^{T} \hat{v}_i(\tau)\, g_i'(T-\tau) \qquad \text{batch.} \quad (4.8b)$$
Equations 4.7 and 4.8 hold for any cost function $J$ that can be described by an SFG. In the particular case when $J$ is chosen to be a standard MSE cost function, we show that the explicit implementation of $G_J$ can be avoided, since the involved derivatives can be derived analytically. For on-line learning, using the instantaneous squared error $J = e^2(t)$, the input-output relationship of $G_J$ is given by equation 4.4; thus,

$$\frac{\partial e^2(t)}{\partial y_k(t-\tau)} = \begin{cases} -2e_k(t), & \tau = 0 \\ 0, & \tau > 0 \end{cases} \qquad k = 1, \ldots, r. \quad (4.9)$$

Using equations 3.4 and 4.9 and considering the output $y_k$ of $G_N$ as an internal signal $v_i$, we can explicitly compute the outputs of the adjoint network $\hat{G}_J^{(a)}$:

$$\hat{y}_k(\tau) = \frac{\partial e^2(t)}{\partial y_k(t-\tau)} = \begin{cases} -2e_k(t), & \tau = 0 \\ 0, & \tau > 0 \end{cases} \qquad k = 1, \ldots, r. \quad (4.10)$$
Such outputs can be used as the inputs of $\hat{G}_N^{(a)}$. Therefore, if the signal of equation 4.10 feeds the network $\hat{G}_N^{(a)}$ instead of equation 3.3, the explicit use of $\hat{G}_J^{(a)}$ can be avoided. If $J = E(t - N_c + 1, t)$ is used, then equation 4.10 generalizes to the following (see Figure 4):

$$\hat{y}_k(\tau) = \frac{\partial E(t - N_c + 1, t)}{\partial y_k(t-\tau)} = \begin{cases} -2e_k(t-\tau), & \tau \le N_c - 1 \\ 0, & \tau \ge N_c \end{cases} \qquad k = 1, \ldots, r. \quad (4.11)$$
Figure 5: (a) The SFG $G_S$, that is, the system SFG $G_N$ in cascade with the error-calculation SFG $G_J$ with $J = E(0, t)$; $\mathrm{sqr}(x)$ means $x^2$, and $d_j$, $j = 1, \ldots, r$, are the desired outputs. (b) Its adjoint $\hat{G}_S^{(a)}$, that is, $\hat{G}_N^{(a)}$ in cascade with $\hat{G}_J^{(a)}$. The feedback in (a) sums the error terms from the initial instant. The feedback in (b) injects into the SFG's left side a constant value equal to 1 for all $\tau$.
When batch learning is involved, that is, when $J = E(0, T)$, the following equation has to be used instead of 4.10 or 4.11:

$$\hat{y}_k(\tau) = \frac{\partial E(0, T)}{\partial y_k(T-\tau)} = -2e_k(T-\tau), \qquad k = 1, \ldots, r, \quad \tau = 0, 1, \ldots, T. \quad (4.12)$$
Note that the parameter variations computed by the summation of all gradient terms in $[0, T]$ are the same as those obtained by BPTT (Wan & Beaufays, 1996) (see Figure 5). Thus, for standard MSE cost functions, the derivative calculation can be performed considering $G_N$ and $\hat{G}_N^{(a)}$ only, but using equation 4.10, 4.11, or 4.12 as input for $\hat{G}_N^{(a)}$ instead of equation 3.3.
Srinivasan et al. (1994) derive a theorem that substantially states what is expressed in equation 4.7 when $J = e^2(t)$; however, that theorem was proved only for a neural network with a particular structure and by a completely different and very specific approach, whereas the method presented here is very general, including any NN as a special case. In the following, we call the resulting algorithm BC($h$) (backward computation), where $h$ is the truncation parameter, or simply BC for short or for the batch case.

4.3 Detailed Steps of the BC Procedure. To derive the BC algorithm for training an arbitrary SFG $G_N$ using an MSE cost function, the adjoint SFG $\hat{G}_N^{(a)}$ must be drawn by reversing the graph and applying the transformation of Table 1.
On-Line Learning. Choosing $J = E(t - N_c + 1, t)$, the following steps have to be performed at each time instant $t$:

1. The system SFG $G_N$ is computed one time step forward, storing in memory the internal states for the last $h$ time steps.

2. The adjoint SFG $\hat{G}_N^{(a)}$ is reset (setting null initial conditions for the delays).

3. $\hat{G}_N^{(a)}$ is computed for $\tau = 0, 1, \ldots, h-1$ with the input given by equation 4.11, computing the terms of summations 4.7a and 4.8a.

4. The parameters $w_j$ and $a_j$ are adapted by equation 4.1.

Batch Learning. Choosing $J = \sum_{t=0}^{T} e^2(t)$, the procedure is simpler. For each epoch:

1. The system SFG $G_N$ is computed, storing its internal states from time $t = 0$ up to time $T$.

2. The adjoint SFG $\hat{G}_N^{(a)}$ is reset.

3. $\hat{G}_N^{(a)}$ is evaluated for $\tau = 0$ up to $T$, with $t = T$ (see the appendix), accumulating the terms needed by equations 4.7b and 4.8b.

4. The parameters $w_j$ and $a_j$ are adapted by equation 4.1.

However, not all the variables of the adjoint SFG have to be computed; only the initial variables of the branches containing adaptable parameters are needed, since they must be used in equations 4.7 and 4.8. It must be stressed that the delay operator of the adjoint SFG delays the $\tau$ index and not the $t$ index. When the MSE is considered, the batch-mode BC learning method corresponds to batch BPTT and to Wan and Beaufays' method (1996), while the
Figure 6: (a) SFG representing a recurrent neuron. (b) Its adjoint.
on-line mode BC procedure corresponds to truncated BPTT (Williams & Peng, 1990). If the desired cost function $J$ is not a conventional MSE, then the cost function SFG $G_J$ must be designed and connected in cascade with the network SFG, giving the overall SFG $G_S$, whose adjoint SFG $\hat{G}_S^{(a)}$ must be drawn by reversing the graph and applying the transformation of Table 1. In the previous steps of the BC procedure, $G_S$ must now be considered instead of $G_N$, and $\hat{G}_S^{(a)}$ instead of $\hat{G}_N^{(a)}$, while the input of the adjoint SFG $\hat{G}_S^{(a)}$ is given by equation 3.3 instead of 4.10, 4.11, or 4.12.

4.4 Example of Application of the BC Algorithm. As an example, a very simple SFG, a recurrent neuron with two adaptable parameters, is considered (see Figure 6a). The equations of the BC algorithm will be detailed to make the method clear and to show the relationship with truncated BPTT($h$) (Williams & Peng, 1990) in on-line mode and with BPTT in batch mode. The equations of the forward phase of the system can be written by looking at Figure 6a. Let us assume the following ordering of the branches: for the two weights and the nonlinearity operator, the branch index is the subindex shown in the figure, and for the delay operator it is equal to three. Since the system is single-input, single-output (SISO), the $u$ and $y$ variables do not
need any index:

$$s(t) \triangleq x_2(t) = w_1(t)\, u(t) + w_4(t)\, y(t-1) \quad (4.13)$$
$$y(t) = f_2(s(t)). \quad (4.14)$$
4.4.1 On-Line Case. For the simplest case, $J = e^2(t) = (d(t) - y(t))^2$, according to equation 4.10 the input of the corresponding adjoint SFG (see Figure 6b) must be

$$\hat{y}(\tau) = \begin{cases} -2e(t), & \tau = 0 \\ 0, & \tau > 0. \end{cases} \quad (4.15)$$

From the adjoint SFG, it results that

$$\hat{v}_1(\tau) = \hat{v}_4(\tau) = f_2'(s(t-\tau)) \left( \hat{y}(\tau) + \begin{cases} \hat{w}_4(\tau-1)\, \hat{v}_4(\tau-1), & \tau > 0 \\ 0, & \tau = 0 \end{cases} \right), \quad (4.16)$$

where (see Table 1)

$$\hat{w}_4(\tau-1) = w_4(t - \tau + 1). \quad (4.17)$$
According to equations 3.5 and 4.3, the weights can be updated using the following two equations:

$$\Delta w_1(t) = \mu \sum_{\tau=0}^{h-1} \delta(t-\tau)\, u(t-\tau) \quad (4.18)$$
$$\Delta w_4(t) = \mu \sum_{\tau=0}^{h-1} \delta(t-\tau)\, y(t-\tau-1), \quad (4.19)$$

where

$$\delta(t-\tau) \triangleq -\frac{\partial e^2(t)}{\partial s(t-\tau)} = -\frac{\partial e^2(t)}{\partial v_1(t-\tau)} = -\hat{v}_1(\tau). \quad (4.20)$$
It can easily be seen that these equations do indeed correspond to truncated BPTT($h$), since the quantity $\delta$ is simply the usual $\delta$ of truncated BPTT. Note that here a weight buffer has to be used, as stated by the theory, while in the literature sometimes only the current weights are used, to obtain a simpler, approximated implementation (Williams & Peng, 1990). It is worth noting that, in spite of the simplicity of the system (a single recurrent neuron), the backward equations are not easy to derive by the chain rule; in fact, forward (recursive) equations are usually proposed for adaptation.
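To make the relationship with truncated BPTT($h$) concrete, here is a minimal Python sketch of the forward phase (equations 4.13 and 4.14) and the on-line backward computation (equations 4.15 through 4.20). It is illustrative only: $f_2 = \tanh$ and $y(-1) = 0$ are our assumptions, the weights are held constant (so the weight buffer of equation 4.17 reduces to the current weights), and the function names are ours.

```python
import math

def forward(u, w1, w4):
    # Forward phase, equations 4.13-4.14, with f2 = tanh (assumed) and y(-1) = 0:
    #   s(t) = w1 u(t) + w4 y(t-1),  y(t) = f2(s(t)).
    s_hist, y_hist, y_prev = [], [], 0.0
    for ut in u:
        s = w1 * ut + w4 * y_prev
        y_prev = math.tanh(s)
        s_hist.append(s)
        y_hist.append(y_prev)
    return s_hist, y_hist

def bptt_gradients(u, d, w1, w4, h):
    # Truncated-BPTT(h) derivatives of e^2(t) = (d - y(t))^2 at the last step t,
    # following equations 4.15-4.20 with constant weights.
    s_hist, y_hist = forward(u, w1, w4)
    t = len(u) - 1
    e = d - y_hist[t]
    g_w1 = g_w4 = 0.0
    delta = 0.0                                          # delta(t - tau), equation 4.20
    for tau in range(min(h, t + 1)):
        fprime = 1.0 - math.tanh(s_hist[t - tau]) ** 2   # f2'(s(t - tau))
        inject = 2.0 * e if tau == 0 else 0.0            # -yhat(tau), equation 4.15
        delta = fprime * (inject + w4 * delta)           # recursion of equation 4.16
        g_w1 += -delta * u[t - tau]                      # gradient term of equation 4.18
        y_past = y_hist[t - tau - 1] if t - tau - 1 >= 0 else 0.0
        g_w4 += -delta * y_past                          # gradient term of equation 4.19
    return g_w1, g_w4
```

With $h = t + 1$ the truncation covers the whole history, so the returned values are the exact derivatives of $e^2(t)$ with respect to $w_1$ and $w_4$ and can be checked against central finite differences; smaller $h$ gives the truncated-BPTT($h$) approximation, and the update of equations 4.18 and 4.19 is $\Delta w = -\mu$ times the gradient.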
4.4.2 Batch Case. The graph is computed for $t = 0, \ldots, T$, saving the internal states, without any evaluation of the adjoint SFG. The weight time index can be neglected, since the weights are now constant. In this case, the MSE over the entire epoch is taken, $J = \sum_{t=0}^{T} e^2(t)$; therefore, according to equation 4.12, the input of the adjoint SFG (see Figure 6b) must be

$$\hat{y}(\tau) = -2e(T-\tau), \qquad \tau = 0, 1, \ldots, T. \quad (4.21)$$

From the adjoint SFG, remembering that $t = T$, it results that

$$\hat{v}_1(\tau) = \hat{v}_4(\tau) = f_2'(s(T-\tau)) \left( \hat{y}(\tau) + \begin{cases} \hat{w}_4\, \hat{v}_4(\tau-1), & \tau > 0 \\ 0, & \tau = 0 \end{cases} \right), \quad (4.22)$$
where $\hat{w}_4 = w_4$. Therefore, the weights can be updated according to the following two equations:

$$\Delta w_1 = \mu \sum_{\tau=0}^{T} \delta(T-\tau)\, u(T-\tau) \quad (4.23)$$
$$\Delta w_4 = \mu \sum_{\tau=0}^{T} \delta(T-\tau)\, y(T-\tau-1), \quad (4.24)$$

where

$$\delta(T-\tau) \triangleq -\frac{\partial J}{\partial s(T-\tau)} = -\frac{\partial J}{\partial v_1(T-\tau)} = -\hat{v}_1(\tau). \quad (4.25)$$
It is easy to see that these equations correspond to BPTT, since the quantity $\delta$ is simply the usual $\delta$ of BPTT.

4.5 Complexity Analysis. The complexity of the proposed gradient computation is very low, since it increases linearly with the number of adaptable parameters. In particular, since the adjoint graph has the same topology as the original one, the complexity of its computation is about the same; in practice it is lower, since not all the variables of the adjoint graph have to be computed, as stated in section 4.3. For on-line learning, the adjoint graph must be evaluated $h$ times for each time step; therefore, the number of operations for the computation of the gradient terms with respect to all the parameters is roughly (in practice, less than) $h$ times the number of operations of the forward phase, plus the computation needed by equations 4.7 and 4.8 for each time step (whose complexity increases linearly with $h$ and the number of parameters).
4.5.1 Fully Recurrent Neural Networks. For the simple case of a fully recurrent, single-layer, single-delay neural network composed of $n$ neurons, the computational complexity is $O(n^2)$ operations per time step for batch BC, or $O(n^2 h)$ for on-line BC, compared with $O(n^4)$ for RTRL (Williams & Zipser, 1989). The memory requirement is $O(nT)$ for batch BC or $O(nh)$ for on-line BC, and $O(n^3)$ for RTRL. Therefore, as far as computational complexity is concerned, in batch mode BC is significantly simpler than RTRL, whereas in on-line mode the complexity and memory-requirement ratios depend on $n^2 / h$. However, in many practical cases $n^2$ is large compared to $h$, and therefore RTRL will be more complex than on-line BC.

4.5.2 Locally Recurrent Neural Networks. For a complex architecture such as a locally recurrent layered network, a mathematical evaluation of complexity can be carried out by computing the number of multiplications and additions for one iteration of the on-line learning (i.e., for each input sample). Results for on-line BC are reported for the significant special case of a two-layer MLP with IIR temporal filter synapses (IIR-MLP) (Tsoi & Back, 1994; Campolucci et al., 1999) with bias, and with moving-average (MA) and autoregressive (AR) orders ($L^{(l)}$ and $I^{(l)}$, respectively) depending only on the layer index $l$. The numbers of additions and multiplications are, respectively,

$$N_2 + h \left[ N_1 N_0 (2I^{(1)} + L^{(1)}) + 2 N_2 N_1 (I^{(2)} + L^{(2)}) - N_2 (N_1 - 1) + N_1 \right],$$
$$N_2 + h \left[ N_1 N_0 (2I^{(1)} + L^{(1)} + 1) + 2 N_2 N_1 (I^{(2)} + L^{(2)}) + N_1 (N_2 + 1) \right],$$

where $N_i$ is the number of neurons of the $i$th layer ($i = 0$ corresponds to the input layer). These numbers must be added to the number of operations of the forward phase, which should always be performed before the backward phase: $N_1 N_0 (L^{(1)} + I^{(1)}) + N_2 N_1 (L^{(2)} + I^{(2)})$, the same for both additions and multiplications.

5 Conclusion
We have presented a signal-flow-graph approach that allows an easy on-line computation of the gradient terms needed in sensitivity analysis and learning of nonlinear dynamic adaptive systems represented by SFGs, even if their structure includes feedback (of any kind, even nested) or time delays. This method bypasses the complex chain-rule derivative expansion traditionally needed to derive the gradient equations. The gradient information obtained in this way can be useful for circuit optimization by output-sensitivity minimization or for gradient-based training algorithms, such as conjugate gradient or other techniques. This method should allow the development of computer-aided design software by which
Figure 7: (a) The perturbation introduced in the original SFG. (b) The corresponding node in the adjoint SFG.
an operator could define the architecture of a system to be adapted, or a circuit whose sensitivity must be computed, leaving to the software the hard task of finding and implementing the needed gradient-calculation algorithm. We have easily developed this software in a very general version. This work is related to previous results in different fields and provides an elegant analogy with the sensitivity analysis of analog networks by the adjoint network technique, obtained by applying Tellegen's theorem, which in that case relates voltages and currents of the network (Director & Rohrer, 1969).

Appendix: Proof of the Adjoint SFG Method
Let us consider a nonlinear dynamical system described by an SFG, with the notation defined in section 2. Starting at time 0 and letting the system run up to time $t$, we obtain the signals $u_j(t)$, $y_j(t)$, $x_j(t)$, and $v_j(t)$. Now let us repeat the same process, but introducing at time $t - \tau$ a perturbation $\varepsilon$ at a specific node of the graph (see Figure 7a). This is equivalent to considering a system with $m + 1$ inputs, one more than the original system, where the $(m+1)$th input is $u_{m+1}(t) = u'(t)$, with $u'(t)$ defined such that

$$u'(t - \omega) = \begin{cases} 0, & \omega \ne \tau \\ \varepsilon, & \omega = \tau. \end{cases}$$
Therefore, at time $t$, all the variables $u_j(t)$, $y_j(t)$, $x_j(t)$, and $v_j(t)$ will be changed to $u_j(t) + \Delta u_j(t)$, $y_j(t) + \Delta y_j(t)$, $x_j(t) + \Delta x_j(t)$, and $v_j(t) + \Delta v_j(t)$. Obviously, it holds that $\Delta u'(t) = u'(t)$, $\forall t$. Let $\hat{u}_j$, $\hat{y}_j$, $\hat{x}_j$, $\hat{v}_j$, and $\hat{u}_{m+1}(t) = \hat{u}'(t)$ be the corresponding variables associated with the branches of a generic reversed SFG $\hat{G}_N$ of the original graph
$G_N$; applying Theorem 2.1 with $\varepsilon = 0$ (no perturbation), it follows that

$$\sum_{j=1}^{r} \hat{y}_j(t) * y_j(t) + \sum_{j=1}^{q} \hat{x}_j(t) * x_j(t) = \sum_{j=1}^{m+1} \hat{u}_j(t) * u_j(t) + \sum_{j=1}^{q} \hat{v}_j(t) * v_j(t), \quad (A.1)$$
while when $\varepsilon \ne 0$ (with perturbation), the same equation becomes

$$\sum_{j=1}^{r} \hat{y}_j(t) * (y_j(t) + \Delta y_j(t)) + \sum_{j=1}^{q} \hat{x}_j(t) * (x_j(t) + \Delta x_j(t)) = \sum_{j=1}^{m+1} \hat{u}_j(t) * (u_j(t) + \Delta u_j(t)) + \sum_{j=1}^{q} \hat{v}_j(t) * (v_j(t) + \Delta v_j(t)). \quad (A.2)$$
Since the convolution is a linear operator, subtracting equation A.1 from A.2 results in

$$\sum_{j=1}^{r} \hat{y}_j(t) * \Delta y_j(t) + \sum_{j=1}^{q} \hat{x}_j(t) * \Delta x_j(t) = \sum_{j=1}^{m+1} \hat{u}_j(t) * \Delta u_j(t) + \sum_{j=1}^{q} \hat{v}_j(t) * \Delta v_j(t). \quad (A.3)$$
With regard to the input branches $\hat{y}_j$ of the SFG $\hat{G}_N$, it is possible to choose

$$\hat{y}_j(\tau) = \begin{cases} 0, & \forall \tau, & \text{if } j \ne k \\ 1, & \tau = 0, & \text{if } j = k \\ 0, & \tau > 0, & \text{if } j = k \end{cases} \qquad j = 1, \ldots, r \quad (A.4)$$
in order to obtain

$$\sum_{j=1}^{r} \hat{y}_j(t) * \Delta y_j(t) = \Delta y_k(t), \quad (A.5)$$
where $y_k$ is the output whose gradient we need to compute. Since the inputs $u_j$, $j = 1, \ldots, m$, of $G_N$ obviously do not depend on the perturbation $\varepsilon$, it holds that

$$\Delta u_j(t) = 0 \quad \forall t, \; j = 1, \ldots, m; \quad (A.6)$$
therefore,

$$\sum_{j=1}^{m+1} \hat{u}_j(t) * \Delta u_j(t) = \hat{u}_{m+1}(t) * \Delta u_{m+1}(t) = \hat{u}'(t) * \Delta u'(t) = \sum_{\omega=0}^{t} \hat{u}'(\omega)\, \Delta u'(t-\omega) = \hat{u}'(\tau)\, \Delta u'(t-\tau). \quad (A.7)$$
Now we use the degrees of freedom given by the definition of the reversed graph to choose the $f$-branch operators in $\hat{G}_N$ such that

$$\hat{v}_j(t) * \Delta v_j(t) = \hat{x}_j(t) * \Delta x_j(t), \qquad j = 1, \ldots, q, \quad (A.8)$$
so that equation A.3 is greatly simplified. In this way, the adjoint graph $\hat{G}_N^{(a)}$ will be obtained. For delay branches,

$$v_j(t) = q^{-1} x_j(t) \;\Rightarrow\; \Delta v_j(t) = \Delta x_j(t-1) = q^{-1} \Delta x_j(t). \quad (A.9)$$
Let us choose the relationship between the branch variables in $\hat{G}_N$ as

$$\hat{x}_j(\tau) = q^{-1} \hat{v}_j(\tau), \quad (A.10)$$
where $\tau$ is the time index of the adjoint SFG and $t$ the time index of the original SFG; the same notation will be used in the following. Since $\Delta x_j(t-\omega)$ and $\Delta v_j(t-\omega)$ are zero if $\omega > \tau$, the system being causal, and since the initial conditions of $\hat{G}_N$ are set to zero, that is,

$$\hat{x}_j(0) = 0, \quad (A.11)$$
it follows that

$$\hat{v}_j(t) * \Delta v_j(t) = \sum_{\omega=0}^{t} \hat{v}_j(\omega)\, \Delta v_j(t-\omega) = \sum_{\omega=0}^{\tau} \hat{v}_j(\omega)\, \Delta v_j(t-\omega) = \sum_{\omega=0}^{\tau} \hat{x}_j(\omega+1)\, \Delta x_j(t-\omega-1) = \sum_{s=1}^{\tau+1} \hat{x}_j(s)\, \Delta x_j(t-s) = \sum_{s=0}^{\tau+1} \hat{x}_j(s)\, \Delta x_j(t-s) = \sum_{s=0}^{t} \hat{x}_j(s)\, \Delta x_j(t-s) = \hat{x}_j(t) * \Delta x_j(t). \quad (A.12)$$
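The delay-branch bookkeeping of equations A.9 through A.12 can be checked numerically; here is a small sketch (our own helper, arbitrary test values) verifying that the adjoint choice $\hat{x}_j(s) = \hat{v}_j(s-1)$ with $\hat{x}_j(0) = 0$ preserves the convolution term when $\Delta x_j(-1) = 0$:

```python
def delay_branch_terms(vhat, dx, t):
    # Delay branch: dv(s) = dx(s - 1) with dx(-1) = 0 (equation A.9).
    dv = [0.0] + dx[:-1]
    # Adjoint delay branch: xhat(s) = vhat(s - 1) with xhat(0) = 0
    # (equations A.10-A.11).
    xhat = [0.0] + vhat[:-1]
    lhs = sum(vhat[w] * dv[t - w] for w in range(t + 1))   # (vhat * dv)(t)
    rhs = sum(xhat[s] * dx[t - s] for s in range(t + 1))   # (xhat * dx)(t)
    return lhs, rhs
```

For any signals satisfying the causality condition, the two terms coincide, which is exactly the index substitution carried out in equation A.12.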
For static branches,

$$v_j(t) = g_j(x_j(t), a_j(t), t). \quad (A.13)$$
Let us choose the relationship between the branch variables $\hat{x}_j$ and $\hat{v}_j$ in $\hat{G}_N$ as

$$\hat{x}_j(\tau) = \hat{v}_j(\tau) \left.\frac{\partial g_j}{\partial x_j}\right|_{x_j(t-\tau),\, a_j(t-\tau),\, t-\tau} = \hat{v}_j(\tau)\, g_j'(t-\tau), \quad (A.14)$$
where $g_j'(\cdot)$ is implicitly defined as in equation 3.2. By differentiating,

$$\Delta v_j(s) = \begin{cases} \left.\dfrac{\partial g_j}{\partial x_j}\right|_{x_j(s),\, a_j(s),\, s} \Delta x_j(s) = g_j'(s)\, \Delta x_j(s), & s \ge t - \tau \\[6pt] 0, & s < t - \tau, \end{cases} \quad (A.15)$$

so that

$$\hat{v}_j(t) * \Delta v_j(t) = \sum_{\omega=0}^{t} \hat{v}_j(\omega)\, \Delta v_j(t-\omega) = \sum_{\omega=0}^{t} \hat{v}_j(\omega)\, g_j'(t-\omega)\, \Delta x_j(t-\omega) = \hat{x}_j(t) * \Delta x_j(t). \quad (A.16)$$
Very interesting particular cases of equation A.14 for nonlinear adaptive filters or neural networks are

$$\hat{x}_j(\tau) = w_j(t-\tau)\, \hat{v}_j(\tau) \qquad \text{for a weight branch,} \quad (A.17)$$
$$\hat{x}_j(\tau) = f_j'(x_j(t-\tau))\, \hat{v}_j(\tau) \qquad \text{for a nonlinear branch.} \quad (A.18)$$

Equations A.17 and A.18 correspond to the branches defined in equation 2.3, respectively. We have now completely defined the reversed SFG $\hat{G}_N$, which is called the adjoint signal-flow-graph $\hat{G}_N^{(a)}$. Therefore, the adjoint SFG $\hat{G}_N^{(a)}$ is defined as a reversed SFG $\hat{G}_N$ with the additional conditions given by equations A.10 and A.11 for delay branches, A.17 and A.18 (or A.14) for static branches, and A.4 for the $\hat{y}$-branches. These definitions are summarized in Table 1. Combining equations A.3, A.5, A.7, and A.8 results in
$$\Delta y_k(t) = \hat{u}'(\tau)\, \Delta u'(t-\tau). \quad (A.19)$$
Since $\hat{u}'(\tau)$ obviously does not depend on $\varepsilon$, it holds that

$$\frac{\partial y_k(t)}{\partial u'(t-\tau)} = \lim_{\varepsilon \to 0} \frac{\Delta y_k(t)}{\Delta u'(t-\tau)} = \hat{u}'(\tau). \quad (A.20)$$
From Figure 7a, we observe that $u'$ and the $i$th branch come into the same node; remembering that a node performs a summation, it holds that

$$\frac{\partial y_k(t)}{\partial u'(t-\tau)} = \frac{\partial y_k(t)}{\partial v_i(t-\tau)}, \quad (A.21)$$
and from Figure 7b that $\hat{u}'(\tau) = \hat{v}_i(\tau)$; therefore,

$$\frac{\partial y_k(t)}{\partial v_i(t-\tau)} = \hat{v}_i(\tau). \quad (A.22)$$
It is important to note that for batch learning, equation A.22 is evaluated only for $t = T$; therefore, the adjoint branches defined in Table 1 are considered with $t = T$, where $T$ is the final instant of the epoch.

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proc. Int. Conf. on Neural Networks (Vol. 2, pp. 609–618).
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Computation, 3, 375–385.
Beaufays, F., & Wan, E. (1994). Relating real-time backpropagation and backpropagation-through-time: An application of flow graph interreciprocity. Neural Computation, 6, 296–306.
Campolucci, P. (1998). A circuit theory approach to recurrent neural network architectures and learning methods. Doctoral dissertation in English, University of Bologna, Italy. PDF available online at http://nnsp.eealab.unian.it/ campolucci P or requested from [email protected].
Campolucci, P., Marchegiani, A., Uncini, A., & Piazza, F. (1997). Signal-flow-graph derivation of on-line gradient learning algorithms. In Proc. ICNN-97, IEEE Int. Conference on Neural Networks (Houston, TX).
Campolucci, P., Piazza, F., & Uncini, A. (1995). On-line learning algorithms for neural networks with IIR synapses. In Proc. IEEE International Conference on Neural Networks (Perth).
Campolucci, P., Uncini, A., & Piazza, F. (1998). Dynamical systems learning by a circuit theoretic approach. In Proc. ISCAS-98, IEEE Int. Symposium on Circuits and Systems.
Campolucci, P., Uncini, A., Piazza, F., & Rao, B. D. (1999). On-line learning algorithms for locally recurrent neural networks. IEEE Trans. on Neural Networks, 10, 253–271.
Director, S. W., & Rohrer, R. A. (1969). The generalized adjoint network and network sensitivities. IEEE Trans. on Circuit Theory, CT-16, 318–323.
Gherrity, M. (1989). A learning algorithm for analog, fully recurrent neural networks. In Proc. Int.
Joint Conference on Neural Networks (Vol. 1, pp. 643–644).
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: IEEE Press–Macmillan.
Horne, B. G., & Giles, C. L. (1995). An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Lee, A. Y. (1974). Signal flow graphs—Computer-aided system analysis and sensitivity calculations. IEEE Transactions on Circuits and Systems, CAS-21, 209–216.
Martinelli, G., & Perfetti, R. (1991). Circuit theoretic approach to the backpropagation learning algorithm. In Proc. IEEE Int. Symposium on Circuits and Systems.
Mason, S. J. (1953). Feedback theory—Some properties of signal-flow graphs. Proc. Institute of Radio Engineers, 41, 1144–1156.
Mason, S. J. (1956). Feedback theory—Further properties of signal-flow graphs. Proc. Institute of Radio Engineers, 44, 920–926.
Narendra, K. S., & Parthasarathy, K. (1991). Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. on Neural Networks, 2, 252–262.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., & Marcos, S. (1993). Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Computation, 5, 165–199.
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. Englewood Cliffs, NJ: Prentice Hall.
Osowski, S. (1994). Signal flow graphs and neural networks. Biological Cybernetics, 70, 387–395.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Trans. on Neural Networks, 6, 1212–1228.
Penfield, P., Spence, R., & Duiker, S. (1970). Tellegen's theorem and electrical networks. Cambridge, MA: MIT Press.
Srinivasan, B., Prasad, U. R., & Rao, N. J. (1994). Backpropagation through adjoints for the identification of nonlinear dynamic systems using recurrent neural models. IEEE Trans. on Neural Networks, 5, 213–228.
Tellegen, B. D. H. (1952). A general network theorem, with applications. Philips Res. Rep., 7, 259–269.
Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5, 229–239.
Uncini, A., Vecci, L., Campolucci, P., & Piazza, F. (1999). Complex-valued neural networks with adaptive spline activation function for digital radio links nonlinear equalization. IEEE Transactions on Signal Processing, 47, 505–514.
Wan, E. A., & Beaufays, F. (1996). Diagrammatic derivation of gradient algorithms for neural networks. Neural Computation, 8, 182–201.
Wan, E. A., & Beaufays, F. (1998).
Diagrammatic methods for deriving and relating temporal neural networks algorithms. In M. Gori & C. L. Giles (Eds.), Adaptive processing of sequences and data structures. Berlin: Springer-Verlag. Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proc. of IEEE, 78, 1550–1560. Williams, R. J., & Peng, J. (1990). An efcient gradient-based algorithm for on line training of recurrent network trajectories. Neural Computation, 2, 490–501. Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. Williams, R. J., & Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications (pp. 433–486). Hillsdale, NJ: Erlbaum. Received July 7, 1997; accepted July 22, 1999.
LETTER
Communicated by Malik Magdon-Ismail
Bootstrapping Neural Networks

Jürgen Franke
Department of Mathematics, University of Kaiserslautern, 67653 Kaiserslautern, Germany

Michael H. Neumann
Department of Economics, Humboldt University of Berlin, 13591 Berlin, Germany

Knowledge about the distribution of a statistical estimator is important for various purposes, such as the construction of confidence intervals for model parameters or the determination of critical values of tests. A widely used method to estimate this distribution is the so-called bootstrap, which is based on an imitation of the probabilistic structure of the data-generating process on the basis of the information provided by a given set of random observations. In this article, we investigate this classical method in the context of artificial neural networks used for estimating a mapping from input to output space. We establish consistency results for bootstrap estimates of the distribution of parameter estimates.

1 Introduction
Neural networks provide a tool for learning an unknown mapping, say m, from input space to output space. In the presence of noise, they provide function estimators that are nonparametric in spirit, as discussed in detail by White (1990). A feedforward neural network with independent inputs and noisy outputs is a particular nonlinear regression model. The reliability of estimates for the weights is important for the ability of the trained network to generalize. Their asymptotic normal distribution has been given by White (1989a). If one is interested in quantifying the performance of statistical estimates, Efron's (1979) bootstrap is an alternative to asymptotic considerations, which often provides an adequate and in many cases better approximation to the actual distribution than standard asymptotics, as discussed, for example, for nonlinear nonparametric regression by Härdle and Bowman (1988) and for nonlinear nonparametric time series models by Franke, Kreiss, and Mammen (1997) and Neumann and Kreiss (1998). The bootstrap may be used for calculating confidence intervals for predictions and critical values for statistical tests, and it is also helpful for model selection and automatic choice of the amount of smoothing in semi- and nonparametric situations (Efron & Tibshirani, 1993; Hall, 1992; Shao & Tu, 1995).

Neural Computation 12, 1929–1949 (2000) © 2000 Massachusetts Institute of Technology

For
the particular case of a parametric nonlinear regression model, the validity and second-order efficiency of the bootstrap have been described and investigated in practice by Huet and Jolivet (1989), Huet, Jolivet, and Messean (1990), and Bunke, Droge, and Polzehl (1995). These results can be applied directly to feedforward neural networks in the correctly specified case, where the mapping m can be represented exactly by a network of the form considered. Typically, however, finite-dimensional neural networks used for learning a particular mapping are misspecified. In this article, we describe bootstrap procedures for feedforward neural networks that also cover this misspecified case. For simplicity, we restrict ourselves to networks with only one hidden layer and with one linear output unit, which, for real-valued mappings, already have the universal approximation property (Hornik, Stinchcombe, & White, 1989). However, our arguments can be generalized in a straightforward manner to multilayer, multioutput networks, where the output nodes do not have to be linear. We admit an arbitrary activation function for the neurons of the hidden layer, satisfying only certain smoothness conditions, so that our results cover multilayer perceptrons with sigmoid activation function as well as radial basis function networks with kernel-type activation function. Practitioners like Refenes, Zapranis, and Utans (1996) and Baxt and White (1995) have already used the standard bootstrap, for example, for estimating the sampling variability of neural networks. In section 2, we provide the theoretical basis for these applications. We make the difference from correctly specified nonlinear regression models transparent and discuss some pitfalls related to the identifiability of network parameters. This residual-based bootstrap is compared with the asymptotic normal approximation in a short simulation study in section 4.
In section 3, we present a different "wild" bootstrap procedure, which is able to cope with situations where the noise in the data, and in particular its variance, depends on the input.

2 The Bootstrap Procedure
We consider a training set of independent and identically distributed random row vectors $(X_t', Y_t)$, $t = 1, \ldots, N$, where $X_t$ is of dimension d with marginal density p and $Y_t$ is real valued. Suppose we are interested in the relationship between $Y_t$ and $X_t$, and we want to estimate the conditional expectation of $Y_t$ given $X_t = x$ (see White, 1989b):

$$m(x) = E\{Y_t \mid X_t = x\}. \tag{2.1}$$

We assume that the residuals $e_t = Y_t - m(X_t)$ are independent random variables with finite variance $E\{e_t^2 \mid X_t = x\} = \sigma_e^2(x)$. We want to approximate m by a single hidden-layer feedforward network with H hidden units, $H \ge 1$.
We write its output given input x as

$$f_H(x, \vartheta) = v_0 + \sum_{h=1}^{H} v_h \, \psi(\tilde{x}' w_h), \tag{2.2}$$

where $\vartheta = (w_1', \ldots, w_H', v_0, \ldots, v_H)'$ is the vector of network weights with $w_h' = (w_{0h}, \ldots, w_{ph})$, $h = 1, \ldots, H$, $\psi$ is the (fixed) hidden unit activation function, and $\tilde{x}' = (1, x')$ is the input vector augmented by a bias component 1. The parameterization of the network function is not unique; certain simple symmetry operations applied to the weight vector obviously do not change the value of $f_H(x, \vartheta)$. For a sigmoid activation function $\psi$ centered around 0, that is, $\psi(-x) = -\psi(x)$, these symmetry operations correspond to an exchange of hidden units and to multiplying all weights of connections going into and out of a particular hidden unit by $-1$. To avoid this ambiguity, we consider only weight vectors $\vartheta$ lying in a fundamental domain in the sense of Rüger and Ossen (1997). For the case of sigmoid activation functions with $\psi(-x) = -\psi(x)$, this means that we restrict our attention to parameter vectors $\vartheta$ with $v_1 \ge v_2 \ge \cdots \ge v_H \ge 0$. To simplify the proofs, we consider only a compact subset $\Theta_H$ of such a fundamental domain. Now, we train the network to get the nonlinear least-squares estimate $\hat{\vartheta}_N$ of the weight vector, that is, $\hat{\vartheta}_N = \arg\min_{\vartheta \in \Theta_H} \hat{D}_N(\vartheta)$, where

$$\hat{D}_N(\vartheta) = \frac{1}{N} \sum_{t=1}^{N} (Y_t - f_H(X_t, \vartheta))^2. \tag{2.3}$$
In the correctly specified case where $m(x) = f_H(x, \vartheta_0)$ for some $\vartheta_0 \in \Theta_H$, it is well known from nonlinear regression that $\sqrt{N}(\hat{\vartheta}_N - \vartheta_0)$ is, for $N \to \infty$, asymptotically normally distributed with mean 0 and covariance matrix $\sigma_e^2 [E\{\nabla f_H(X_t, \vartheta_0)\nabla f_H(X_t, \vartheta_0)'\}]^{-1}$. Here and in the following, $\nabla f_H(x, \vartheta)$ denotes the gradient of $f_H$ with respect to $\vartheta$. The analogous result for the misspecified situation is given by White (1989a). Here, there is no true $\vartheta_0$, but $\hat{\vartheta}_N$ converges to the parameter of the best network function approximation of $m(x)$, that is, to $\vartheta_0 = \arg\min_{\vartheta \in \Theta_H} D_0(\vartheta)$, where

$$D_0(\vartheta) = E(Y_t - f_H(X_t, \vartheta))^2 = \int \left[(m(x) - f_H(x, \vartheta))^2 + \sigma_e^2(x)\right] p(x)\,dx, \tag{2.4}$$

where we assume that the random vector $X_t$ has a density $p(x)$. If the size of the training set grows ($N \to \infty$), then $\sqrt{N}(\hat{\vartheta}_N - \vartheta_0)$ is again asymptotically normally distributed with mean 0 and covariance matrix $A(\vartheta_0)^{-1} B(\vartheta_0)$
$A(\vartheta_0)^{-1}$, where

$$A(\vartheta) = E\{\nabla^2 d(\vartheta)\}, \quad B(\vartheta) = E\{\nabla d(\vartheta)\nabla d(\vartheta)'\}, \quad d(\vartheta) \equiv d(X_t, Y_t, \vartheta) = (Y_t - f_H(X_t, \vartheta))^2.$$

Here, $\nabla^2$ denotes the Hessian with respect to $\vartheta$. This result matches the above result in the correctly specified case, since then $A(\vartheta_0)$ reduces to $2E\{\nabla f_H(X_t, \vartheta_0)\nabla f_H(X_t, \vartheta_0)'\}$ and $B(\vartheta_0) = 2\sigma_e^2 A(\vartheta_0)$, which implies that $A(\vartheta_0)^{-1} B(\vartheta_0) A(\vartheta_0)^{-1}$ equals the covariance matrix $\sigma_e^2 [E\{\nabla f_H(X_t, \vartheta_0)\nabla f_H(X_t, \vartheta_0)'\}]^{-1}$. For these results to hold, some assumptions have to be satisfied:

A1. The activation function $\psi$ is bounded and twice continuously differentiable with bounded derivatives, and m is bounded.

A2. $D_0(\vartheta)$ has a unique global minimum at $\vartheta_0$ lying in the interior of $\Theta_H$, and $\nabla^2 D_0(\vartheta_0) = A(\vartheta_0)$ is positive definite.
We think that all results stated below remain valid if the boundedness of m is relaxed to a growth condition such as $|m(x)| \le C\|x\|^M$ for $M < \infty$, in conjunction with the condition $P(\|X_t\| > t) \le C\rho^{-t}$ for some $\rho > 1$. We decided to state the results under the stronger condition, A1, in order to keep the proofs as simple as possible. Besides the asymptotic normal distribution, the bootstrap provides an alternative approximation for the distribution of $\hat{\vartheta}_N - \vartheta_0$. Following Huet and Jolivet (1989), we first consider the following approach, which is adequate for identically distributed noise $e_t$. For initializing the bootstrap procedure, we need a uniformly consistent estimator $\hat{m}_N$ for m, that is, an estimator for which $\sup_{x \in \mathrm{supp}(p)}\{|\hat{m}_N(x) - m(x)|\} \to 0$ in probability for $N \to \infty$. $\hat{m}_N$ could be chosen, for example, as the type of connectionist sieve estimator considered by White (1990), where the complexity of the network is allowed to grow with N, as a spline smoother with roughness penalty converging slowly to 0, or as a kernel-type smoother with bandwidth decreasing with N. Then the noise variables $e_t$ can be approximated by

$$\hat{e}_t = Y_t - \hat{m}_N(X_t), \quad t = 1, \ldots, N.$$
We know from our assumptions that $E e_t = 0$. To avoid a systematic error in the bootstrap, we follow Freedman (1981) and center the $\hat{e}_t$:

$$\tilde{e}_t = \hat{e}_t - \frac{1}{N}\sum_{k=1}^{N} \hat{e}_k, \quad t = 1, \ldots, N.$$

Let $\tilde{F}_N$ denote the sample distribution given by $\tilde{e}_1, \ldots, \tilde{e}_N$. We draw independent bootstrap errors $e_1^*, \ldots, e_N^*$ from $\tilde{F}_N$, that is, for all t:

$$e_t^* = \tilde{e}_k \text{ with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$
In the same manner, we draw bootstrap input vectors $X_1^*, \ldots, X_N^*$ randomly with replacement from $X_1, \ldots, X_N$, that is, for all t:

$$X_t^* = X_k \text{ with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$

Finally, we form bootstrap outputs as

$$Y_t^* = \hat{m}_N(X_t^*) + e_t^*, \quad t = 1, \ldots, N,$$

to get a bootstrap training set $(X_1^*, Y_1^*), \ldots, (X_N^*, Y_N^*)$. The basic idea of the bootstrap is the following. Because the $e_t$ and $X_t$ are independent and identically distributed, their sample distributions approximate the true distributions for N large enough. Therefore, the $e_t^*$ and $X_t^*$ behave similarly to the $e_t$ and $X_t$. As, by construction, $\hat{m}_N$ is close to m, the bootstrap outputs show random variations similar to those of the true outputs. Therefore, estimates like the weights of a network after training should behave similarly for the original training set and the bootstrap training set. The random mechanism generating the bootstrap training set is, however, known to us, and we can repeat the above procedure as often as we like to get a whole family of independent bootstrap training sets $(X_1^*(i), Y_1^*(i)), \ldots, (X_N^*(i), Y_N^*(i))$, $i = 1, \ldots, B$. Using standard Monte Carlo techniques, we can mimic the behavior of any quantity of interest calculated from the training set. For illustration, let us assume that we are interested in the mean squared error

$$\mathrm{mse}(x) = E\left(m(x) - f_H\left(x, \hat{\vartheta}_N\right)\right)^2,$$

which we get if we train a network with H hidden units from the original random training set $(X_t, Y_t)$, $t = 1, \ldots, N$, and use it to estimate the value $m(x)$ of the function of interest for a given input x. We get a bootstrap approximation for mse(x) as

$$\mathrm{mse}^*(x) = \frac{1}{B}\sum_{i=1}^{B} \left(\hat{m}_N(x) - f_H\left(x, \hat{\vartheta}_{N,i}^*\right)\right)^2,$$

where $\hat{\vartheta}_{N,i}^*$ is the weight vector after training the network using the i-th bootstrap training set $(X_t^*(i), Y_t^*(i))$, $t = 1, \ldots, N$. Before we illustrate the procedure with some examples, we first state the results guaranteeing that our heuristic arguments can be made rigorous and that the bootstrap really works for neural network function estimators. We first state a slight extension of White's (1989a) results on the asymptotic distribution of $\hat{\vartheta}_N$. In particular, we admit dependence of the noise variance on the input, which we discuss in section 3 in more detail.
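The resampling scheme just described is straightforward to implement. The following Python sketch generates B residual-based bootstrap training sets from a pilot estimate of m; the pilot estimator `m_hat` and the fixed seed are illustrative assumptions, and in practice one would retrain the network on each returned set to obtain the $\hat{\vartheta}_{N,i}^*$ entering $\mathrm{mse}^*(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bootstrap_sets(X, Y, m_hat, B):
    """Residual-based bootstrap training sets (section 2 procedure).

    m_hat: uniformly consistent pilot estimate of m, evaluated pointwise.
    Returns a list of B bootstrap training sets (X_star, Y_star).
    """
    N = len(Y)
    e_hat = Y - m_hat(X)              # approximate residuals e_hat_t
    e_tilde = e_hat - e_hat.mean()    # Freedman centering
    sets = []
    for _ in range(B):
        idx_e = rng.integers(0, N, N)  # e_t* drawn from F_tilde_N
        idx_x = rng.integers(0, N, N)  # X_t* resampled from the inputs
        X_star = X[idx_x]
        Y_star = m_hat(X_star) + e_tilde[idx_e]
        sets.append((X_star, Y_star))
    return sets
```

Note that inputs and residuals are drawn independently of each other, which is exactly why this scheme requires the distribution of $e_t$ not to depend on $X_t$.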
A3. $X_1, X_2, \ldots$ are independent and identically distributed random vectors with density $p(x)$. $e_1, e_2, \ldots$ are independent random variables with

$$E\{e_t \mid X_t = x\} = 0, \quad E\{e_t^2 \mid X_t = x\} = \sigma_e^2(x) < \infty.$$

A4. We also impose some further technical assumptions to simplify our proofs. They could be relaxed considerably without changing the validity of our results. (i) $\sigma_e^2(x)$ is continuous and $0 < d \le \sigma_e^2(x) \le D < \infty$ for all x. (ii) There exist constants $C_n$ such that $E\{|e_t|^n \mid X_t = x\} \le C_n < \infty$ is satisfied for all x, n. (iii) m is continuous.

If we assume in addition that $P(\|X_t\| > t) \le C\rho^{-t}$ for some $\rho > 1$, then we could replace the boundedness of $\sigma_e^2(x)$ by a growth condition such as $|\sigma_e^2(x)| \le C\|x\|^M$ for $M < \infty$. In the definition of $D_0(\vartheta)$, the expectation is taken over $Y_t$ and $X_t$. However, there is a second natural candidate for an optimal parameter: that value of $\vartheta$ that provides the best approximation for a given realization of the input variables $X_1, \ldots, X_N$. We consider

$$D_N(\vartheta) = \frac{1}{N}\sum_{t=1}^{N} \left[\left(m(X_t) - f_H(X_t, \vartheta)\right)^2 + \sigma_e^2(X_t)\right], \tag{2.5}$$

and we define

$$\vartheta_N = \arg\min_{\vartheta \in \Theta_H} D_N(\vartheta).$$

($D_N(\vartheta)$ and $\hat{D}_N(\vartheta)$ are connected by the relation $D_N(\vartheta) = E\{\hat{D}_N(\vartheta) \mid X_1, \ldots, X_N\}$.) Instead of $\hat{\vartheta}_N - \vartheta_0$, we consider its two components $\hat{\vartheta}_N - \vartheta_N$ and $\vartheta_N - \vartheta_0$ separately, which asymptotically are independent.
Theorem 1. Suppose that assumptions A1–A4 are satisfied. Let $m(x)$ denote the conditional expectation of $Y_t$ given $X_t = x$. Then, for $N \to \infty$,

$$\sqrt{N}\begin{pmatrix} \hat{\vartheta}_N - \vartheta_N \\ \vartheta_N - \vartheta_0 \end{pmatrix} \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$

that is, $\sqrt{N}(\hat{\vartheta}_N - \vartheta_N)$ and $\sqrt{N}(\vartheta_N - \vartheta_0)$ are asymptotically independent normal random vectors with covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, where $\Sigma_1 = A(\vartheta_0)^{-1} B_1(\vartheta_0) A(\vartheta_0)^{-1}$ and $\Sigma_2 = A(\vartheta_0)^{-1} B_2(\vartheta_0) A(\vartheta_0)^{-1}$
with

$$B_1(\vartheta) = 4\int \sigma_e^2(x)\,\nabla f_H(x, \vartheta)\,\nabla f_H(x, \vartheta)'\,p(x)\,dx,$$

$$B_2(\vartheta) = 4\int (m(x) - f_H(x, \vartheta))^2\,\nabla f_H(x, \vartheta)\,\nabla f_H(x, \vartheta)'\,p(x)\,dx,$$

and $A(\vartheta) = \nabla^2 D_0(\vartheta)$ as above. (See the appendix for the proof.)
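Given plug-in estimates of the bread and meat matrices, the sandwich form of $\Sigma_1$ and $\Sigma_2$ is a one-liner to evaluate; a minimal numerical sketch (the plug-in matrices here are hypothetical):

```python
import numpy as np

def sandwich_cov(A, B):
    """Sandwich covariance A^{-1} B A^{-1}, as in Sigma_1 and Sigma_2."""
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```

In the correctly specified homoscedastic case, where $B = 2\sigma_e^2 A$, the sandwich collapses to $2\sigma_e^2 A^{-1}$, recovering the classical nonlinear least-squares covariance.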
As an immediate consequence, $\sqrt{N}(\hat{\vartheta}_N - \vartheta_0)$ is asymptotically normal with mean 0 and covariance matrix $\Sigma_1 + \Sigma_2$. In the correctly specified case, $\Sigma_2$ is equal to the zero matrix, as there is essentially no effect due to the randomness of the $X_t$'s; that is, the difference between $\vartheta_N$ and $\vartheta_0$ is asymptotically of smaller order than $N^{-1/2}$. In contrast, in the misspecified case, the randomness of the inputs causes a difference of order $N^{-1/2}$ between the two optimal parameters $\vartheta_N$ and $\vartheta_0$. One way to prove that the bootstrap works asymptotically in a particular situation, where the limit distribution of the quantity of interest is known, is to show that the corresponding bootstrap quantity has the same asymptotic behavior. Let $\hat{\vartheta}_N^*$ be the weight vector after training the network from the bootstrap training set, that is, $\hat{\vartheta}_N^* = \arg\min_{\vartheta \in \Theta_H} \hat{D}_N^*(\vartheta)$, where

$$\hat{D}_N^*(\vartheta) = \frac{1}{N}\sum_{t=1}^{N} (Y_t^* - f_H(X_t^*, \vartheta))^2.$$
The bootstrap procedure of this section is based on the assumption that the distribution of $e_t$ does not depend on $X_t$. In this case, the analogon of $\vartheta_N$ is given by $\vartheta_N^* = \arg\min_{\vartheta \in \Theta_H} D_N^*(\vartheta)$, where

$$D_N^*(\vartheta) = \frac{1}{N}\sum_{t=1}^{N} \left[\left(\hat{m}_N(X_t^*) - f_H(X_t^*, \vartheta)\right)^2 + \hat{\sigma}_e^2\right],$$

and $\hat{\sigma}_e^2 = \frac{1}{N}\sum_{t=1}^{N} \tilde{e}_t^2$ estimates $\sigma_e^2$. In contrast to $\vartheta_N$, $\vartheta_N^*$ can be calculated, as $\hat{m}_N$ and $\hat{\sigma}_e^2$ are known. To get the bootstrap version of $\vartheta_0$, we have to replace expectation with respect to the joint distribution of $X_t$ and $e_t$ by expectation with respect to the joint distribution of $X_t^*$ and $e_t^*$. Then we set $\vartheta_0^* = \arg\min_{\vartheta \in \Theta_H} D_0^*(\vartheta)$, where

$$D_0^*(\vartheta) = E^*\left(Y_t^* - f_H(X_t^*, \vartheta)\right)^2 = \frac{1}{N}\sum_{t=1}^{N} \left(\hat{m}_N(X_t) - f_H(X_t, \vartheta)\right)^2 + \hat{\sigma}_e^2.$$

Note that $e_t^*$ and $X_t^*$ are independent and $E^* e_t^* = \frac{1}{N}\sum_t \tilde{e}_t = 0$. We have to impose an assumption on $\hat{m}_N$:
A5. $\hat{m}_N$ is uniformly consistent on the support of p, that is,

$$\sup_{x \in \mathrm{supp}(p)}\{|\hat{m}_N(x) - m(x)|\} = o_P(1).$$

This assumption is obviously realistic if p is compactly supported on a sufficiently regular set (e.g., on a rectangle) and bounded away from 0 on this set. In the case of unbounded support, one could modify assumption A5 to a condition such as

$$\sup_{x \in \mathcal{X}_N}\{|\hat{m}_N(x) - m(x)|\} = o_P(1), \tag{2.6}$$

where $\mathcal{X}_N$ is an appropriate subset of $\mathbb{R}^d$ with the property $P(X_t \notin \mathcal{X}_N) = o(N^{-1})$. This is possible under the assumption that $P(\|X_t\| > N^{\delta}) = o(N^{-1})$ for some sufficiently small $\delta > 0$ (for details, see Franke, Kreiss, Mammen, & Neumann, 1998). Then the bootstrap described above works by the following theorem, which shows that the conditional distribution of $\hat{\vartheta}_N^* - \vartheta_N^*$ and $\vartheta_N^* - \vartheta_0^*$ given the original data $(X_1, Y_1), \ldots, (X_N, Y_N)$, from which the bootstrap training sets are generated, coincides asymptotically with the distribution of $\hat{\vartheta}_N - \vartheta_N$ and $\vartheta_N - \vartheta_0$ as given in theorem 1.
Theorem 2. Suppose that $e_t$ is independent of $X_t$, $t = 1, \ldots, N$ (so that, in particular, $\sigma_e^2(x) \equiv \sigma_e^2$), and assume assumptions A1–A5. Then,

$$\mathcal{L}\left(\sqrt{N}\begin{pmatrix} \hat{\vartheta}_N^* - \vartheta_N^* \\ \vartheta_N^* - \vartheta_0^* \end{pmatrix} \,\Big|\, (X_1, Y_1), \ldots, (X_N, Y_N)\right) \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$

where this convergence holds in a uniform manner for $((X_1, Y_1), \ldots, (X_N, Y_N)) \in \mathcal{X}_N$ for a suitable sequence of sets $\mathcal{X}_N$ with $P(\mathcal{X}_N) \to 1$. (See the appendix for the proof.)

Remark. The bootstrap also works in the case of deterministic inputs $x_1, \ldots, x_N$ of the training set that are systematically selected by the experimenter. We have only to require that the inputs $x_1, \ldots, x_N$ behave for increasing N similarly to a random sample, up to a certain degree. In particular, we need
$$\frac{1}{N}\sum_{t=1}^{N} (m(x_t) - f_H(x_t, \vartheta))^2 \xrightarrow[N \to \infty]{} \int (m(x) - f_H(x, \vartheta))^2\,p(x)\,dx$$

for some probability density $p(x)$. If the $x_t$ are, for example, equispaced over a fixed finite d-dimensional cube C, then the above condition is satisfied for a constant function $p(x) \equiv p_C$.
Remark. At the beginning of this section, we restricted the weight vector to a fundamental domain by requiring $v_1 \ge v_2 \ge \cdots \ge v_H \ge 0$ for sigmoid activation functions with $\psi(-x) = -\psi(x)$. This avoids the identifiability problems caused by the common symmetry properties of a feedforward neural network, independent of the function that we want to approximate. However, it may happen that for the optimal parameter vector $\vartheta$, certain weights of outgoing connections coincide, say, $v_h = v_{h+1}$. This situation may occur in practice if the function m itself has certain symmetries related to the symmetries of $\psi$. In such a situation, the bootstrap breaks down if it is used for approximating the random fluctuations of the weights of connections going into the hidden units numbers h and $h+1$, provided one does not take additional precautions to guarantee identifiability. To illustrate the problem, let us consider $m(x) = \psi(1 + x) + \psi(x - 1)$ for the centered logistic function $\psi(u) = (1 + e^{-u})^{-1} - \frac{1}{2}$. m is itself a network function for $H = 2$ hidden units with $w_{11}^{\circ} = w_{12}^{\circ} = 1$, $w_{01}^{\circ} = 1$, $w_{02}^{\circ} = -1$, $v_0 = 0$, $v_1 = v_2 = 1$, and m is symmetric around 0. The estimate $\hat{m}_N(x)$ will have similar properties as in the case of identifiable parameters of the network. If we train the network repeatedly using independent bootstrap samples, we randomly get $\hat{w}_{01}^* \approx 1$, $\hat{w}_{02}^* \approx -1$ and $\hat{w}_{01}^* \approx -1$, $\hat{w}_{02}^* \approx 1$ with approximately equal probabilities. Therefore, the bootstrap estimate of the variance of those weights is too large, due to the nonidentifiability of the parameters of m and the corresponding approximate nonidentifiability of the parameters of the network that provides the best approximation of $\hat{m}_N$. In practice, it is easy to detect such situations, because they are characterized by almost identical bootstrap estimates of outgoing weights, say, $\hat{v}_h^*(i) \approx \hat{v}_{h+1}^*(i)$, $i = 1, \ldots, B$. To be more precise, one can check whether the differences $\hat{v}_h^*(i) - \hat{v}_{h+1}^*(i)$, $i = 1, \ldots, B$, are small compared with the sample standard deviation of $\hat{v}_h^*(i)$ and $\hat{v}_{h+1}^*(i)$, $i = 1, \ldots, B$, and in such a case take additional precautions to make the parameterization of the network function by its weights unique.
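The diagnostic suggested in the preceding remark — comparing the spread of the differences $\hat{v}_h^*(i) - \hat{v}_{h+1}^*(i)$ with the spread of the weights themselves — can be sketched as follows; the relative tolerance `tol` is an arbitrary illustrative choice, not prescribed by the paper.

```python
import numpy as np

def near_nonidentifiable(v_h, v_h1, tol=0.1):
    """Flag approximate nonidentifiability across B bootstrap replications.

    v_h, v_h1: arrays of bootstrap estimates v_h*(i) and v_{h+1}*(i).
    Returns True if the differences vary little relative to the
    sample standard deviation of the weights themselves.
    """
    diff_sd = np.std(v_h - v_h1)
    weight_sd = max(np.std(v_h), np.std(v_h1))
    return bool(diff_sd < tol * weight_sd)
```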
3 A Bootstrap Procedure for Input-Dependent Noise
The bootstrap procedure of section 2 can be applied only to situations where the noise $e_t$ does not depend on the input, that is, if it is additive in the sense of Murata, Yoshizawa, and Amari (1994). In their general model, $e_t = e_t(X_t)$ and, in particular, its variance $\sigma_e^2(X_t) = E\{e_t^2 \mid X_t\}$ depend on the input $X_t$, a common situation in many practical problems. Under that circumstance, one way to bootstrap function estimates is the "wild bootstrap" or "external bootstrap." Härdle (1990) and Härdle and Marron (1991) have discussed this in the context of the nonparametric nonlinear regression setting that we are considering here. For the connectionist regression estimate, it has the following form. We generate independent and identically distributed random variables $\eta_1, \ldots, \eta_N$ with mean 0 and variance 1. Then, with $\hat{e}_t = Y_t - \hat{m}_N(X_t)$ as in section 2, we draw pairs $(X_t^*, \hat{e}_t^*)$ randomly from the set $\{(X_1, \hat{e}_1), \ldots, (X_N, \hat{e}_N)\}$,
that is, for all $t = 1, \ldots, N$:

$$(X_t^*, \hat{e}_t^*) = (X_k, \hat{e}_k) \text{ with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$

We transform $\hat{e}_t^*$ randomly by multiplying it with $\eta_t$, which does not change the mean and the variance. Then we define bootstrap outputs as

$$Y_t^* = \hat{m}_N(X_t^*) + \eta_t \hat{e}_t^*, \quad t = 1, \ldots, N.$$
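One wild-bootstrap resample can be sketched in Python as follows; the standard normal choice for $\eta_t$ follows the example in section 4, while the pilot estimator `m_hat` and the fixed seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def wild_bootstrap_outputs(X, Y, m_hat):
    """One wild-bootstrap training set for input-dependent noise.

    Pairs (X_t*, e_hat_t*) are resampled jointly, and the resampled
    residual is multiplied by an independent eta_t with mean 0 and
    variance 1 (here standard normal).
    """
    N = len(Y)
    e_hat = Y - m_hat(X)
    idx = rng.integers(0, N, N)       # joint draw keeps the (X, e) pairing
    X_star, e_star = X[idx], e_hat[idx]
    eta = rng.standard_normal(N)      # E eta = 0, E eta^2 = 1
    Y_star = m_hat(X_star) + eta * e_star
    return X_star, Y_star
```

The joint draw of input and residual is the key difference from the residual-based scheme of section 2: it is what lets the bootstrap noise inherit the input-dependence of the original residuals.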
In contrast to the standard bootstrap, the bootstrap noise $e_t^* = \eta_t \cdot \hat{e}_t^*$ is generated in a manner depending on the bootstrap input $X_t^*$, to reflect the dependence of $e_t$ on the input in the original training set. Let $\hat{\vartheta}_N^{WB}$ denote the wild bootstrap version of $\hat{\vartheta}_N$. Because now the distribution of the $e_t$ depends on $X_t$, the wild bootstrap analogon of $D_N(\vartheta)$ is given by

$$D_N^*(\vartheta) = \frac{1}{N}\sum_{t=1}^{N} \left[\left(\hat{m}_N(X_t^*) - f_H(X_t^*, \vartheta)\right)^2 + \hat{e}_t^2\right],$$

and, correspondingly, the analogon of $D_0(\vartheta)$ is

$$D_0^*(\vartheta) = E^*\left(Y_t^* - f_H(X_t^*, \vartheta)\right)^2 = \frac{1}{N}\sum_{t=1}^{N} \left\{\left(\hat{m}_N(X_t) - f_H(X_t, \vartheta)\right)^2 + 2\left(\hat{m}_N(X_t) - f_H(X_t, \vartheta)\right)\hat{e}_t\,E\eta_t + \hat{e}_t^2\,E\eta_t^2\right\} = \frac{1}{N}\sum_{t=1}^{N} \left[\left(\hat{m}_N(X_t) - f_H(X_t, \vartheta)\right)^2 + \hat{e}_t^2\right].$$
They differ from the $D_N^*(\vartheta)$, $D_0^*(\vartheta)$ of section 2 only by terms not depending on $\vartheta$. Therefore, the wild bootstrap versions $\vartheta_N^{WB}$, $\vartheta_0^{WB}$ of $\vartheta_N$, $\vartheta_0$ coincide with the $\vartheta_N^*$, $\vartheta_0^*$ of section 2. Analogously to theorem 2, we obtain consistency of the bootstrap.

Theorem 3. Suppose that assumptions A1–A5 are fulfilled. Then,

$$\mathcal{L}\left(\sqrt{N}\begin{pmatrix} \hat{\vartheta}_N^{WB} - \vartheta_N^{WB} \\ \vartheta_N^{WB} - \vartheta_0^{WB} \end{pmatrix} \,\Big|\, (X_1, Y_1), \ldots, (X_N, Y_N)\right) \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$

where this convergence holds in a uniform manner for $((X_1, Y_1), \ldots, (X_N, Y_N)) \in \mathcal{X}_N$ for a suitable sequence of sets $\mathcal{X}_N$ with $P(\mathcal{X}_N) \to 1$. (See the appendix for the proof.)
Remark. Another popular resampling method for regression models, already discussed by Freedman (1981) for linear regression, is the pairwise bootstrap. It is not based on the sample residuals but generates the bootstrap training set directly from the original training set: $(X_1^{*\prime}, Y_1^*), \ldots, (X_N^{*\prime}, Y_N^*)$ are independently drawn from $(X_1', Y_1), \ldots, (X_N', Y_N)$ and, for all t:

$$(X_t^{*\prime}, Y_t^*) = (X_k', Y_k) \text{ with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$

For the pairwise bootstrap, theorem 2 holds, too (see Sarishvili, 1998). The proof of this result follows more or less the same lines as for the residual-based bootstrap, and therefore we do not discuss it in detail. Like the wild bootstrap, it may also be applied to input-dependent noise. However, in a small simulation study, the pairwise bootstrap performed worse than the wild bootstrap. That may be a feature of the particular examples studied there (see Sarishvili, 1998). By replacing $\hat{e}_k$ by its definition, the wild bootstrap may be considered a version of the pairwise bootstrap in which some additional noise is introduced to $Y_t^*$ to resolve the ties, which, with large probability, are present in the original pairwise bootstrap resample. Here, $Y_t^*$ is modified in such a manner that the conditional expectation of the output given the input remains approximately unchanged.

4 Two Numerical Examples
In this section, we present the results of a small simulation study to illustrate the performance of the common residual-based bootstrap and the wild bootstrap. In both cases, we consider a feedforward neural network with one hidden layer and complete connections. As the sigmoid activation function, we choose the centered logistic function,

$$\psi(x) = \frac{1}{1 + e^{-x}} - \frac{1}{2}.$$
The generation of the data and the training of the network have been done using GAUSS 3.1, where we used the BFGS method as a batch mode algorithm for determining the network weights. As a first example, we consider as the true function to be estimated

$$m(x_1, x_2) = (1 - x_1^2)_+ + x_1/2,$$

which is shown in Figure 1. We approximate m by a network with two hidden units, that is,

$$f(x, \vartheta) = v_0 + \sum_{h=1}^{2} v_h \, \psi(w_{0h} + w_{1h}x_1 + w_{2h}x_2).$$
Figure 1: True regression function m(x).
Figure 2a shows the best approximation of m by $f(x, \vartheta_0)$, where

$$\vartheta_0 = \arg\min_{\vartheta} \int_{-1}^{1}\int_{-1}^{1} |m(x) - f(x, \vartheta)|^2\,dx,$$

corresponding to inputs $X_t = (X_{t1}, X_{t2})$ that are uniformly distributed over the square $[-1, 1] \times [-1, 1]$. The difference between m and $f(\cdot, \vartheta_0)$ is displayed in Figure 2b. We have to estimate the nine-dimensional parameter vector $\vartheta_0 = (w_{01}, w_{11}, w_{21}, w_{02}, w_{12}, w_{22}, v_0, v_1, v_2)'$. We use a training set $(X_1, Y_1), \ldots, (X_N, Y_N)$ with sample size $N = 100$, which obeys the regression model $Y_t = m(X_t) + e_t$, where $X_t \sim \mathrm{Unif}([-1, 1]^2)$ and $e_t \sim \mathcal{N}(0, (0.1)^2)$ are all independent. To approximate the true distribution of $\hat{\vartheta}_N - \vartheta_0$, we carried out 500 Monte Carlo runs. To find a typical bootstrap distribution $\mathcal{L}(\hat{\vartheta}_N^* - \vartheta_0^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$, we selected the sample, out of 500 randomly generated samples, for which the difference between the loss $\|\hat{\vartheta}_N - \vartheta_0\|^2$ and the mean of these losses, taken over all 500 samples, was minimized. For this sample, the estimation error is neither particularly large nor particularly small, and, therefore, it appears to exhibit a typical amount of randomness. Based on this one typical sample, we constructed 500 bootstrap samples $((X_1^*(i), Y_1^*(i)), \ldots,$
Figure 2: (a) Best approximation of m(x) by a network with two hidden units. (b) Approximation error.

$(X_{100}^*(i), Y_{100}^*(i)))$, $i = 1, \ldots, 500$, according to the description in section 2. As a preliminary estimate $\hat{m}_N$ for m, we used a Nadaraya-Watson kernel estimator with bivariate gaussian kernel and bandwidths $h_1 = 0.2$ and $h_2 = 1.0$. Other preliminary estimates could be considered as well. For each of the bootstrap training sets, we calculated the bootstrap estimate $\hat{\vartheta}_{N,i}^*$,
Figure 3: Estimation error of input weight w21 for sample size 100: true probability density (solid line), its bootstrap approximation (dashed line) and its asymptotic normal approximation (dashed-dotted line).
which leads to an estimate of $\mathcal{L}(\hat{\vartheta}_N^* - \vartheta_0^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$. Finally, we also consider the asymptotic limit distribution of $\sqrt{N}(\hat{\vartheta}_N - \vartheta_0)$ as given by White (1989a). Figure 3 shows estimates of the densities of $\mathcal{L}(\hat{w}_{21} - w_{21})$ (solid line) and of $\mathcal{L}(\hat{w}_{21}^* - w_{21}^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$, based on the most typical sample (dashed line), as well as the density of $\mathcal{N}(0, (\Sigma_1 + \Sigma_2)_{33}/N)$ (dashed-dotted line). Generally, we observed a surprisingly good approximation of the true distributions by the bootstrap and the normal approximations. For the other network weights, we got similar results, which seems to indicate that the bootstrap of section 2 works reasonably well even for moderate sample sizes. Of course, larger sample sizes may be necessary for more irregular target functions or higher noise levels. As the second example, we consider heteroscedastic residuals $e_t$. In this case, neither the standard asymptotics of White (1989a) nor the residual-based bootstrap of section 2 is applicable. To facilitate the graphical representation of the results, we consider a one-dimensional input x. As the true function to be estimated, we choose the bump function
$$m(x) = \frac{1}{2}x^2 + \frac{2}{3}\,\varphi(8x), \quad -1 \le x \le 1,$$

where $\varphi$ denotes the density of the standard normal distribution. As a training set, we use independent and identically distributed $(X_1, Y_1), \ldots, (X_N, Y_N)$ with $N = 100$, where $X_1, \ldots, X_N$ are uniformly distributed over $[-1, 1]$
Figure 4: Data, true regression function m(x), and the initial kernel estimate of m(x).
and $Y_t = m(X_t) + e_t$ with independent zero-mean gaussian residuals $e_1, \ldots, e_N$ with standard deviation

$$\left(E\{e_t^2 \mid X_t\}\right)^{1/2} = \sigma(X_t) = \frac{1}{5}\sqrt{0.01 + \frac{1}{2}\,m(X_t)},$$

that is, the variance of a residual is large if the function value to be estimated is large, which is a typical heteroscedastic situation. As the basis for the wild bootstrap, we consider a typical sample selected in a similar manner as in the first example. Figure 4 shows these data, the true function m, and the Nadaraya-Watson kernel estimate $\hat{m}_N$ with bandwidth 0.07. We approximate m(x) by the network function

$$f(x, \vartheta) = v_0 + \sum_{h=1}^{3} v_h \, \psi(w_{0h} + w_{1h}x),$$
which provides quite a good fit for appropriately chosen weights. We train the corresponding network with three neurons in its hidden layer and consider the estimates $f(x_i, \hat{\vartheta}_N)$ at the points $x_i = -1 + i/50$, $i = 0, \ldots, 100$. We are interested in confidence intervals for $m(x_i)$, $i = 0, \ldots, 100$.

Figure 5: 90% confidence intervals for the regression function m(x). (a) True confidence bands (solid lines) together with the true function m (dashed lines). (b) Wild bootstrap confidence bands (solid lines) together with the initial estimate of m (dashed lines). (c) Wild bootstrap confidence bands (solid lines) together with the true function m (dashed lines). (d) True (solid lines) and bootstrap (dashed lines) confidence bands for m.

Figure 5a shows the true 90% confidence intervals, joined to form a band, together with the true function m, based on a Monte Carlo simulation with 200 independent runs. Based on the one typical sample, we approximated the distribution of $f(x_i, \hat{\vartheta}_N)$ by means of the wild bootstrap with standard normally distributed $\eta_t$. Using 500 bootstrap replications, we determined 90% confidence intervals as above. Figures 5b and 5c show these bootstrap confidence intervals together with the kernel estimate $\hat{m}_N$ and the true function m, respectively. For better comparability, Figure 5d shows the true confidence band (solid lines) and the bootstrap confidence bands (dashed lines) in one plot.
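Pointwise confidence bands like those in Figure 5 can be assembled from the bootstrap replications as empirical quantiles. A sketch follows; the simple percentile construction is an illustrative assumption, since the paper does not spell out which interval method produced the figure.

```python
import numpy as np

def percentile_band(boot_preds, level=0.90):
    """Pointwise confidence band from bootstrap predictions.

    boot_preds: array of shape (B, n_grid), row i holding the predictions
    f(x_j, theta_i*) of the i-th retrained bootstrap network on the grid.
    Returns the lower and upper band as arrays of length n_grid.
    """
    alpha = 1.0 - level
    lo = np.quantile(boot_preds, alpha / 2, axis=0)
    hi = np.quantile(boot_preds, 1 - alpha / 2, axis=0)
    return lo, hi
```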
The wild bootstrap captures the heteroscedasticity of the data quite nicely, and it provides a good approximation to the true confidence intervals. There are only two problematic areas, as can be seen from Figure 5d: at the right end and around the peak. The former defect is easily explained; we have not corrected the initial kernel estimate $\hat{m}_N$ for the well-known boundary effects, for ease of calculation. Using boundary kernels would immediately improve the estimate $\hat{m}_N$ and the bootstrap confidence intervals around the boundary. The fact that the upper limits of the bootstrap confidence intervals around the peak are too large is due to pure chance. Looking at Figure 4, it can be seen that all the residuals (with one exception) around the peak happen to be positive and rather large, and they draw the peak of $\hat{m}_N$ and the corresponding upper bootstrap confidence limit upward. Apart from these explainable effects, the wild bootstrap provides a good method to quantify the reliability of neural network function estimates in the presence of heteroscedasticity. Also, the wild bootstrap does not assume any knowledge about the particular form of the dependence of the variability of $e_t$ on $X_t$, as, for example, a classical parametric asymptotic approach similar to White (1989a) but using nonlinear weighted least squares would have to. We stress that the bootstrap error bars of Figure 5 are based on an asymptotic approximation for the distribution of the estimation error $\hat{\vartheta}_N - \vartheta_0$. Therefore, they are approximate confidence bands for the optimal network function $f_H(x, \vartheta_0)$, not for the true conditional expectation function m(x). If, however, H is large enough that $f_H(x, \vartheta_0)$ approximates m(x), the bootstrap error bars may also be regarded as confidence bands for m(x). In practice, the number H of hidden neurons should be selected rather larger for the purpose of calculating bootstrap error bars than for the purpose of just getting an estimate of the function m(x).
A precise theoretical formulation backing this guideline, which would have to include asymptotics for $H \to \infty$, is still lacking, but one could appeal to the analogous case of estimating $m(x)$ nonparametrically by kernel smoothing. There, for calculating confidence bands, it is advantageous to undersmooth (i.e., to overfit) compared with the optimal degree of smoothing for estimating the function (Hall, 1992).

Appendix: Proofs

A.1 Proof of Theorem 1. (i) Using Bernstein's inequality in conjunction with a simple truncation argument (which is possible by assumption A4.ii), it is easy to see that for all $\delta > 0$ and $\lambda < \infty$
$$P\left(|\hat D_N(\vartheta) - D_N(\vartheta)| + |D_N(\vartheta) - D_0(\vartheta)| > N^{\delta - \frac{1}{2}}\right) = O(N^{-\lambda}) \tag{A.1}$$
uniformly in $\vartheta \in \Theta_H$, using assumptions A1, A3, and A4. Since $\hat D_N$, $D_N$, and $D_0$ are continuous in $\vartheta$, we obtain, by showing that the above result
1946
Jürgen Franke and Michael H. Neumann
holds simultaneously on a sequence of increasingly fine grids $\Theta_N \subseteq \Theta_H$,
$$\sup_{\vartheta \in \Theta_H} \left\{ |\hat D_N(\vartheta) - D_N(\vartheta)| + |D_N(\vartheta) - D_0(\vartheta)| \right\} = o_p(1). \tag{A.2}$$
Since $\vartheta_0$ is the unique minimum point of $D_0(\vartheta)$, we obtain
$$|\hat\vartheta_N - \vartheta_N| + |\vartheta_N - \vartheta_0| = o_p(1). \tag{A.3}$$
Hence, by assumptions A1 and A2, $\hat\vartheta_N$ and $\vartheta_N$ are, with increasing probability, interior points of $\Theta_H$; in particular,
$$\nabla \hat D_N(\hat\vartheta_N) = \nabla D_N(\vartheta_N) = 0$$
with probability converging to 1 as $N \to \infty$.

(ii) Hence, with probability tending to 1,
$$0 = \nabla D_N(\vartheta_N) - \nabla D_N(\vartheta_0) + \nabla D_N(\vartheta_0) = \nabla^2 D_0(\vartheta_0)(\vartheta_N - \vartheta_0) + \frac{2}{N}\sum_{t=1}^{N} \nabla f_H(X_t, \vartheta_0)\left(f_H(X_t, \vartheta_0) - m(X_t)\right) + R_1, \tag{A.4}$$
where
$$R_1 = \nabla D_N(\vartheta_N) - \nabla D_N(\vartheta_0) - \nabla^2 D_N(\vartheta_0)(\vartheta_N - \vartheta_0) + \left[\nabla^2 D_N(\vartheta_0) - \nabla^2 D_0(\vartheta_0)\right](\vartheta_N - \vartheta_0) = o_p(\|\vartheta_N - \vartheta_0\|)$$
by assumptions A1 and A3. As
$$2E\left\{\nabla f_H(X_t, \vartheta_0)\left(f_H(X_t, \vartheta_0) - m(X_t)\right)\right\} = \nabla D_0(\vartheta_0) = 0,$$
the middle term of equation A.4 is $O_P(N^{-1/2})$. Thus, we get
$$\nabla^2 D_0(\vartheta_0)(\vartheta_N - \vartheta_0) + o_p(\|\vartheta_N - \vartheta_0\|) = O_p(N^{-\frac{1}{2}}),$$
which implies, since $\nabla^2 D_0(\vartheta_0)$ is positive definite, that
$$\|\vartheta_N - \vartheta_0\| = O_p(N^{-\frac{1}{2}}).$$
Inserting this into equation A.4, we obtain
$$\vartheta_N - \vartheta_0 = -\left(\nabla^2 D_0(\vartheta_0)\right)^{-1} \frac{2}{N}\sum_{t=1}^{N} \nabla f_H(X_t, \vartheta_0)\left(f_H(X_t, \vartheta_0) - m(X_t)\right) + o_p(N^{-\frac{1}{2}}). \tag{A.5}$$
(iii) Analogously to the above calculations, we have with probability tending to 1 that
$$0 = \nabla \hat D_N(\hat\vartheta_N) = \nabla D_N(\hat\vartheta_N) - \frac{2}{N}\sum_{t=1}^{N} e_t \nabla f_H(X_t, \hat\vartheta_N) = \nabla^2 D_0(\vartheta_0)(\hat\vartheta_N - \vartheta_N) - \frac{2}{N}\sum_{t=1}^{N} \nabla f_H(X_t, \vartheta_0)\, e_t + R_2, \tag{A.6}$$
where, as $\nabla D_N(\vartheta_N) = 0$ with probability tending to 1,
$$R_2 = \nabla D_N(\hat\vartheta_N) - \nabla D_N(\vartheta_N) - \nabla^2 D_0(\vartheta_0)(\hat\vartheta_N - \vartheta_N) + \frac{2}{N}\sum_{t=1}^{N} e_t \left\{\nabla f_H(X_t, \vartheta_0) - \nabla f_H(X_t, \hat\vartheta_N)\right\} = o_p(\|\hat\vartheta_N - \vartheta_N\|) + O_p(\|\hat\vartheta_N - \vartheta_0\|\, N^{-\frac{1}{2}}) = o_p(\|\hat\vartheta_N - \vartheta_N\|).$$
That means
$$\nabla^2 D_0(\vartheta_0)(\hat\vartheta_N - \vartheta_N) + o_p(\|\hat\vartheta_N - \vartheta_N\|) = O_p(N^{-\frac{1}{2}}),$$
which implies
$$\|\hat\vartheta_N - \vartheta_N\| = O_p(N^{-\frac{1}{2}}),$$
and therefore,
$$\hat\vartheta_N - \vartheta_N = \left(\nabla^2 D_0(\vartheta_0)\right)^{-1} \frac{2}{N}\sum_{t=1}^{N} \nabla f_H(X_t, \vartheta_0)\, e_t + o_p(N^{-\frac{1}{2}}). \tag{A.7}$$
The assertion now follows from equations A.5 and A.7 by a multivariate central limit theorem for functions of independent and identically distributed random vectors. In particular, the asymptotic means of $\hat\vartheta_N - \vartheta_N$ and $\vartheta_N - \vartheta_0$ vanish because $E\{e_t \mid X_t = x\} = 0$.

A.2 Proof of Theorem 2. This proof is very similar to that of theorem 1. Since $\hat D_N^*(\vartheta)$, $D_N^*(\vartheta)$, and $D_0^*(\vartheta)$ converge uniformly to $D_0(\vartheta)$, one can easily show that
$$|\hat\vartheta_N^* - \vartheta_0| + |\vartheta_N^* - \vartheta_0| + |\vartheta_0^* - \vartheta_0| = o_p(1). \tag{A.8}$$
Then we obtain, in complete analogy to the previous proof, that
$$\mathcal{L}\left(\sqrt{N} \begin{pmatrix} \hat\vartheta_N^* - \vartheta_N^* \\ \vartheta_N^* - \vartheta_0^* \end{pmatrix} \,\middle|\, (X_1, Y_1), \ldots, (X_N, Y_N) \right) \approx \mathcal{N}_d\left(0, \begin{pmatrix} \Sigma_1^* & 0 \\ 0 & \Sigma_2^* \end{pmatrix}\right),$$
where $\Sigma_1^*$ and $\Sigma_2^*$ are analogous to $\Sigma_1$ and $\Sigma_2$, respectively; that is,
$$\Sigma_1^* = A^*(\vartheta_0^*)^{-1} B_1^*(\vartheta_0^*)\, A^*(\vartheta_0^*)^{-1}, \qquad \Sigma_2^* = A^*(\vartheta_0^*)^{-1} B_2^*(\vartheta_0^*)\, A^*(\vartheta_0^*)^{-1},$$
with $A^*(\vartheta) = \nabla^2 D_0^*(\vartheta)$ and
$$B_1^*(\vartheta) = \frac{4}{N}\sum_{t=1}^{N} \nabla f_H(X_t, \vartheta)\, \nabla f_H(X_t, \vartheta)'\, \hat\sigma_e^2,$$
$$B_2^*(\vartheta) = \frac{4}{N}\sum_{t=1}^{N} \left(\hat m_N(X_t) - f_H(X_t, \vartheta)\right)^2 \nabla f_H(X_t, \vartheta)\, \nabla f_H(X_t, \vartheta)'.$$
Because of equation A.8 and assumption A5, we obtain $\Sigma_i^* = \Sigma_i + o_P(1)$, which finishes the proof.

A.3 Proof of Theorem 3. This proof is completely analogous to that of theorem 2 and requires only simple modifications. For example, one has to apply a multivariate central limit theorem for independent but not necessarily identically distributed observations. The fact that $\mathcal{L}^*(e_t^*)$ is not a consistent approximation to $\mathcal{L}(e_t)$ does not matter, since the covariance of a term as in equation A.7 depends on a weighted sum of the individual variances of the $e_t$.

References

Baxt, W. G., & White, H. (1995). Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction. Neural Computation, 7, 624–638.
Bunke, O., Droge, B., & Polzehl, J. (1995). Model selection, transformations and variance estimation in nonlinear regression (Discussion Paper No. 52). Berlin: Humboldt University.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist., 7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Franke, J., Kreiss, J.-P., & Mammen, E. (1997). Bootstrap of kernel smoothing in nonlinear time series. Unpublished report, University of Kaiserslautern.
Franke, J., Kreiss, J.-P., Mammen, E., & Neumann, M. H. (1998). Properties of the nonparametric autoregressive bootstrap (Discussion Paper No. 54/98). Berlin: Humboldt University.
Freedman, D. A. (1981). Bootstrapping regression models. Ann. Statist., 9, 1218–1228.
Hall, P. (1992). The bootstrap and Edgeworth expansion. Berlin: Springer-Verlag.
Härdle, W. (1990). Applied nonparametric regression. Cambridge: Cambridge University Press.
Härdle, W., & Bowman, A. (1988). Bootstrapping in nonparametric regression: Local adaptive smoothing and confidence bands. J. Amer. Statist. Assoc., 83, 102–110.
Härdle, W., & Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Ann. Statist., 13, 1465–1481.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Huet, S., & Jolivet, E. (1989). Exactitude au second ordre des intervalles de confiance bootstrap pour les paramètres d'un modèle de régression non linéaire. C.R. Acad. Sci. Paris Sér. I Math., 308, 429–432.
Huet, S., Jolivet, E., & Messean, A. (1990). Some simulation results about confidence intervals and bootstrap methods in nonlinear regression. Statistics, 21, 369–432.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Neumann, M. H., & Kreiss, J.-P. (1998). Regression-type inference in nonparametric autoregression. Ann. Statist., 26, 1570–1613.
Refenes, A.-P. N., Zapranis, A. D., & Utans, J. (1996). Neural model identification, variable selection and model adequacy. In A. Weigend et al. (Eds.), Neural networks in financial engineering. Singapore: World Scientific.
Rüger, S., & Ossen, A. (1997). The metric structure of weightspace. Unpublished manuscript. Berlin: Technical University of Berlin.
Sarishvili, A. (1998). Das paarweise Bootstrap für neuronale Netze. Diploma thesis, University of Kaiserslautern.
Shao, J., & Tu, D. (1995). The jackknife and bootstrap. Berlin: Springer-Verlag.
White, H. (1989a). Some asymptotic results for learning in single hidden-layer feedforward network models. J. Amer. Statist. Assoc., 84, 1008–1013.
White, H. (1989b). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3, 535–550.

Received February 16, 1999; accepted September 18, 1999.
LETTER
Communicated by Shun-ichi Amari
Information Geometry of Mean-Field Approximation

Toshiyuki Tanaka
Department of Electronics and Information Engineering, Tokyo Metropolitan University, Tokyo 192-0397, Japan

I present a general theory of mean-field approximation based on information geometry and applicable not only to Boltzmann machines but also to wider classes of statistical models. Using perturbation expansion of the Kullback divergence (or Plefka expansion in statistical physics), a formulation of mean-field approximation of general orders is derived. It includes in a natural way the "naive" mean-field approximation and is consistent with the Thouless-Anderson-Palmer (TAP) approach and the linear response theorem in statistical physics.

1 Introduction
Estimating model parameters from data is a central issue in various inference and learning problems. One usually considers a parametric model, which is assumed to have generated a given data set, and then casts the problem into a statistical estimation problem. This problem can be regarded as a "backward" problem, because "causes," or the parameters of the model, must be estimated from their "effects," or data. The "backward" problem is often hard to solve directly, and in some cases the corresponding "forward" problem is used to help solve the "backward" problem via an error feedback scheme. Boltzmann machine learning by Ackley, Hinton, and Sejnowski (1985) is one such example. The forward problem, evaluating statistical properties of data from model parameters, can be solved in principle by simulating the data-generating process. In the case of Boltzmann machines, this can be done by the Gibbs sampler (Geman & Geman, 1984). However, in practical situations, solving the forward problem may still be computationally expensive, as in the Gibbs sampler, so approximation schemes are needed in such cases. Mean-field approximation for Boltzmann machines is one such approximation. Mean-field theory, on which mean-field approximation is based, has been developed in the field of statistical mechanics, and some refinements, such as the so-called TAP approach (Thouless, Anderson, & Palmer, 1977) and the linear response theorem (Parisi, 1988), have been provided there. In an information processing context, mean-field approximation has been applied to Boltzmann machines (Peterson & Anderson, 1987) as well as other models. Also, it is now a common understanding that its information-theoretic interpretation is given as a minimization of Kullback-Leibler information divergence (Tanaka, 1996; Saul & Jordan, 1996; Hofmann & Buhmann, 1997). However, most of these information-theoretic considerations concern the "naive" mean-field approximation. There have been attempts to apply advanced mean-field theory from statistical physics to statistical models, such as Boltzmann machines (Galland, 1993; Kappen & Rodríguez, 1998a, 1998b), Potts spin systems (Hofmann & Buhmann, 1997), the expectation-maximization algorithm (Dunmur & Titterington, 1997), and so on. I believe this article, employing information geometry (Amari, 1985; Amari & Nagaoka, 1998), gives the first unified and fully consistent treatment applicable not only to Boltzmann machines but also to wider classes of statistical models.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1951–1968 (2000).

2 Problem Formulation
To avoid mathematical subtleties I present a formulation in the case of binary, or Ising, variables. But to keep the discussion as general as possible, I will not use the fact that the variables are indeed binary unless I explicitly state so. Thus, the following arguments can be extended to more general cases in a straightforward way. Let us consider $N$ random variables $s_i \in \{-1, 1\}$, $i = 1, \ldots, N$, and let $s = (s_1, \ldots, s_N) \in S = \{-1, 1\}^N$. I consider the following set $M$ of probabilities, which I will call a model:
$$\begin{cases} a_1^{n+1}(x) = a_1^n(x) \\ \quad\cdots \\ a_{j-1}^{n+1}(x) = a_{j-1}^n(x) \\ a_j^{n+1}(x) = a_j^n(x)\, x_p \\ a_{j+1}^{n+1}(x) = a_j^n(x)\,(1 - x_p) \\ a_{j+2}^{n+1}(x) = a_{j+1}^n(x) \\ \quad\cdots \\ a_{n+3}^{n+1}(x) = a_{n+2}^n(x). \end{cases}$$
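The split operation in the display above can be sketched in code. This is a hedged illustration, not the authors' implementation: activations are represented here as functions of a binary parameter vector, and the list-of-functions representation is an assumption made for the example.

```python
def split_unit(activations, j, p):
    """Split unit j of a net on binary parameter x_p.

    `activations` is a list of functions a_k(x) -> 0/1; unit j is
    replaced by two sons, active when a_j(x) = 1 and x_p = 1,
    respectively when a_j(x) = 1 and x_p = 0; all other units kept.
    """
    a_j = activations[j]
    son_pos = lambda x, a=a_j: a(x) * x[p]
    son_neg = lambda x, a=a_j: a(x) * (1 - x[p])
    return activations[:j] + [son_pos, son_neg] + activations[j + 1:]

# two-unit initial net: a "goal" unit and its complement
# (parameter x_0 is taken to be the goal flag, as in section 3.4.3)
net = [lambda x: x[0], lambda x: 1 - x[0]]
net = split_unit(net, 1, 1)      # split the non-goal unit on x_1

print([a((0, 1)) for a in net])  # → [0, 1, 0]
```

For every input x exactly one unit of the net is active, so each split refines the partition of the perception states, as the equations require.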
1996
François Fleuret and Eric Brunet
The unit $j$ of $R^n$ is replaced by two units in $R^{n+1}$; all the others remain the same. We will say that $R^n$ is the father of $R^{n+1}$ and that $R^{n+1}$ is a son of $R^n$.

3.4.3 Split Selection. When the program is initialized, the net has only two units. The first one is associated with the goal (cf. section 3.3.1), the other with its complement. At each step of the construction, one needs to determine which unit to split and with respect to which parameter. We use a brute-force method consisting of examining all the possible splits, computing a value function for each of them, and keeping the one that maximizes the value. We try to maximize the probability that the net reaches the goal situation, starting from any initial situation. A natural value function would be the estimate of the probability of reaching the goal situation using the policy selected by the net. Under the Markovian hypothesis and the stationarity hypothesis, and denoting by $\pi$ the policy chosen by the net $R$, we obtain
$$Y(R) = \hat P_\pi(\exists t < \infty,\ a_1(t) = 1) = \sum_i \hat P_\pi(\exists t < \infty,\ a_1(t) = 1 \mid a_i(0) = 1)\, \hat P(a_i(0) = 1) = \sum_i d_i\, \hat P(a_i(0) = 1).$$

As we saw in section 3.3.3, such an expression using just the Markovian model could provoke dramatic errors by leading to the choice of a net that overestimates its probability of reaching the goal. To handle this, we propose a value function that compares the quality of the models of the sons with that of their father.

3.4.4 Differential Efficiency. The $Y$ function above estimates the probability of reaching the goal, starting from any initial situation and using the optimum policy chosen by the net. One can estimate this probability for any other policy by using a propagation rule similar to equation DP with a fixed $\pi$. Let $R$ be a net and $R'$ one of its sons. By definition, any policy $\pi$ of $R$ associates an action with each unit. Hence a policy of $R$ is also a policy of $R'$: consider the policy that associates with every unit common to both $R$ and $R'$ the same action as $\pi$ does, and with the two new units of $R'$ the action that $\pi$ associates with the unit they replace. Then, in view of the preceding discussion, we can estimate with $R'$ the probability of reaching the goal using the policy $\pi$ chosen by $R$. Thus, if $R'$ is a son of $R$, if $\pi$ is the optimum policy chosen by $R$, and $\pi'$ the one chosen by $R'$, we compute with $R'$ the following value function, called differential efficiency (DE), to estimate its quality:
$$W(R', R) = \hat P_{\pi'}(\exists t < \infty,\ a_1(t) = 1) - \hat P_\pi(\exists t < \infty,\ a_1(t) = 1). \tag{DE}$$
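A minimal numerical sketch of $Y(R)$ and of the differential efficiency: the reachability probabilities $d_i$ are computed by a fixed-point iteration over an estimated transition matrix (a stand-in for the paper's propagation rule DP, which is not reproduced in this excerpt); the three-state chain, the initial distribution, and the two policies are invented for illustration.

```python
import numpy as np

def reach_probs(P, goal, n_iter=2000):
    """d_i = probability of ever reaching `goal` from state i under a
    fixed policy, where P is the estimated transition matrix of the
    induced Markov chain (value-iteration-style fixed point)."""
    d = np.zeros(P.shape[0])
    d[goal] = 1.0
    for _ in range(n_iter):
        d = P @ d
        d[goal] = 1.0
    return d

def efficiency(P, goal, init):
    """Y(R) = sum_i d_i * P(a_i(0) = 1)."""
    return reach_probs(P, goal) @ init

# hypothetical 3-state chain: under the father's policy, state 2 never
# reaches the goal (state 0); the son's policy fixes that transition
init = np.array([0.2, 0.4, 0.4])
P_father = np.array([[1.0, 0.0, 0.0],
                     [0.5, 0.5, 0.0],
                     [0.0, 0.0, 1.0]])
P_son = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]])

# differential efficiency W(R', R)
W = efficiency(P_son, 0, init) - efficiency(P_father, 0, init)
```

Here the father's policy leaves state 2 absorbing away from the goal, so the son's corrected transition raises the estimated success probability by exactly the mass that starts in state 2.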
DEA: An Architecture
1997
This value function is large if $R'$ estimates a high probability of reaching the goal, but also if it estimates that its father has a low probability of doing so. In particular, if a net estimates that its policy is no better than its father's, it has zero value. Most of the time, a large value of this function is related to an error in the Markovian model associated with $R$ that is solved by $R'$. Thus, this split rule takes into account both the quality of the model and the efficiency in reaching the goal. The Kolmogorov-Smirnov test used in U-tree (McCallum, 1996a, 1996b) gives large values to splits that substantially change the expectation of reward, and it does not select splits that permit reaching the goal efficiently, as DEA does. In most of the tests we have made, U-tree requires more states than DEA does. Comparing the cost of these two algorithms, it appears that both consider all possible splits and compute a value that requires the estimation of the $d_i$ in the candidate networks. For the U-tree algorithm, those $d_i$ are used to estimate the distribution of the rewards (i.e., the distribution of the desired activation of the next activated unit, given the selected action), whereas DEA uses the $d_i$ to compare the estimates of the probability of reaching the goal. The costs of those two operations are similar, and we can, like McCallum (1996b), apply a few techniques in order to reduce the number of considered splits. Finally, the cost of one step is the same for DEA and U-tree. The latter reduces the total cost by applying several splits at each step. We have not tried this strategy with DEA, and as seen in section 4, in some cases U-tree performs better if splits are applied sequentially.

4 Results

4.1 Implementation and Evaluation Protocol. The experiments were done using a program written in C++ with GNU tools on GNU/Linux computers. The kernel simulating the planning unit net is the same whatever universe it is tested on.
On a standard PC computer, for the examples we present here, the maximum computation time required to build the net is less than 15 minutes. The DEA is tested on four different universes, addressing various categories of problems. For a given universe, during an initial step, the algorithm records the values of the universe's parameters and actions while the agent acts randomly, following reflex rules. The duration of this period and the random choice of actions depend on the universe.² Then the algorithm builds the net by successively adding units, one at a time. The test protocol consists of estimating the performance of the net after each split. To compare the duration of the training period with the equivalent for discrete-time

² The duration of the reflex period is empirically fixed so that the $\lambda_{a_i,j}$ coefficients are estimated with enough accuracy.
Figure 4: Block game. The blocks can move in four directions. The upper-left and upper-right squares are obstacles.
problems, we have to keep in mind the duration of continuous-time actions; a cycle in a continuous-time universe is not equivalent to one action trial. This repeated evaluation period consists of measuring, in 1000 test cases, the capability of the agent to reach the goal situation starting from a random initial situation. The failure rate is the proportion of attempts in which the agent did not reach the goal situation in less time than a fixed value. This maximum time before failure will be indicated for each universe and is of the order of the minimum time required to reach the goal starting from the worst situation. This failure rate is a demanding measure. It quantifies the ability of the net to generate a complete chain of actions from any situation to the goal. Because the choice of action is deterministic, a net that generates all but one of the actions necessary to reach the goal fails. Thus, the net can have an arbitrarily high failure rate even if it determines almost all the necessary actions along the trajectory to the goal. We compare DEA to the U-tree algorithm developed by McCallum (1996b). The DEA algorithm applies splits one after another, whereas U-tree simultaneously applies all splits considered useful. Thus, we also tested a modified version of U-tree, which we call U-tree*, that applies splits sequentially.

4.2 Block Game. The block game is a 3 × 2 grid on which two blocks (X and Y, as shown in Figure 4) can move. At each time step, one of the two blocks can move a fraction of the size of a square in one of four directions (up, right, down, left). Therefore, there are eight actions. The moving blocks cannot penetrate each other, and the upper-left and upper-right squares of the grid constitute obstacles that the blocks cannot penetrate either. The goal situation is reached when block X is in the lower-right square and block Y is in the lower-left square. There are eight parameters, indicating whether one of the two blocks covers, partially or totally, one of the four grid squares.
Each block can be on two squares, and so can activate two parameters at the same time.
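The failure-rate measurement of section 4.1 can be sketched generically. The universe interface (`env_reset`, `env_step`, `goal_reached`) and the one-dimensional toy universe below are hypothetical stand-ins, not part of the authors' code:

```python
import random

def failure_rate(env_reset, env_step, policy, goal_reached,
                 n_trials=1000, max_steps=100):
    """Fraction of trials in which the deterministic policy fails to
    reach the goal within `max_steps` (the protocol of section 4.1)."""
    failures = 0
    for _ in range(n_trials):
        state = env_reset()                 # random initial situation
        for _ in range(max_steps):
            if goal_reached(state):
                break
            state = env_step(state, policy(state))
        else:                               # goal never reached in time
            failures += 1
    return failures / n_trials

# toy stand-in universe: a 1-D walk whose goal is position 0
rate = failure_rate(
    env_reset=lambda: random.randint(0, 50),
    env_step=lambda s, a: s + a,
    policy=lambda s: -1,
    goal_reached=lambda s: s == 0,
)
print(rate)  # every start <= 50 reaches 0 within 100 steps → 0.0
```

Since the policy is deterministic, a single missing action along the chain to the goal makes a whole trial fail, which is what makes this measure demanding.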
Figure 5: Results for the block game: failure rate versus the number of units.
The learning period has a total duration of 250,000 time steps. At each step, the action performed has probability 0.9 of being the same as the one in the previous step. If it changes, a new action is chosen at random. Each block can move one-eighth of a square size at each time step, so the goal can always be reached in fewer than 48 time steps. The maximum time before failure has been fixed at 100 steps. Figure 5 shows the failure rate versus the number of units. A zero failure rate is obtained with a 15-unit net. At this point, the growth process stops because the $W$ function is zero on all new nets. The total number of possible configurations of parameter values is 24. Hence, this experiment demonstrates the generalization capabilities of DEA, which is able to solve the task without building one unit for each configuration. The U-tree (respectively, U-tree*) algorithm requires 21 (respectively, 23) units to solve the same task with no error.
Figure 6: Hanoi towers problem.
Figure 7: Results for the Hanoi towers problem: failure rate versus the number of units.
The learning period lasts for 30,000 steps. At each step, an action is chosen randomly. All the actions last one cycle. Because the problem requires at worst nine moves, the maximum time before failure has been fixed at 20 time steps. The zero failure rate is reached with a 20-unit net (see Figure 7). The average time required to reach the goal decreases (almost linearly from 5.3 to 4.6 cycles) as the number of units increases, even after the failure rate is already zero. By making the model finer, DEA is able to find shorter paths to the goal. The error rates for both versions of U-tree show that this algorithm requires an almost exhaustive model to solve the problem with a null error rate (it builds 26 units, and the universe has 27 states).

4.4 Beaver World. The beaver world consists of a square playing field containing three trees, two houses, and an animat (see Figure 8). The actions allow the animat to move, to grasp a tree, or to release it. The goal situation
Figure 8: Beaver world. The animat can turn, go forward, and grasp and release trees.
is reached when the animat is close to a house and a tree at the same time. We define "close to the animat" as closer than a fixed distance corresponding to the circle delimiting the field of view around the animat in the figure. There are five actions: GF makes the animat move forward one-hundredth of the playing field border length; TR and TL make the animat turn to the right and to the left by one-hundredth of a complete turn; Tk makes the animat grasp a tree if there is one close to it; Pt makes the animat release the tree if it was holding one. The animat perceives the universe with two eyes (its field of view is composed of 5 × 2 cells, with limits that go from −60° to +60° relative to the direction of the animat) and by touching (it knows whether there is a house or a tree close to it, and whether it is holding a tree). Each object (houses, trees) has a color randomly selected from a set of 8 colors. There are 32 parameters: Gl represents the goal; Hl indicates that the animat is holding a tree; at indicates that the animat is touching a tree; a0,0 to a4,1 indicate that the animat has a tree in 1 of the 10 cells of its field of view; bt indicates that the animat is touching a house.
Figure 9: Results for the beaver world: failure rate versus the number of units.
b0,0 to b4,1 indicate that the animat has a house in 1 of the 10 cells of its field of view; c0 to c7 indicate which colors are visible in the field of view. The training period lasts for 2.5 million steps. During this period, the action performed at each step depends on the value of the parameters. If the animat touches a tree (respectively, a house), it has a high probability of performing the "grasp tree" action (respectively, the "release tree" action). Otherwise, if there is an object (tree or house) in one of its fields of view, it has a high probability of going forward; otherwise, it turns left or right with equal probability. This rough strategy enables the animat to explore the square and to perform the "grasp tree" and "release tree" actions in appropriate conditions and frequently enough to estimate the statistical parameters. To solve the problem, the animat must first go to a tree and grasp it, then go to a house and release the tree near the house. The maximum time required to reach the goal has been empirically measured and is under 200 time steps. The maximum time before failure has been fixed at 1000 time steps. The measured total number of parameter value configurations is 62,000, and DEA builds a net with 8 units, which solves the problem with a zero failure rate (see Figure 9). In order to control its trajectory to a tree or a house, the animat goes forward when the target is in front of it and turns to bring the target back in front as soon as it drifts away. The net built to perform this strategy has two substructures dedicated, respectively, to guidance to a tree and to guidance to a house.
Figure 10: The detectors are associated with four kinds of local geometric configurations (extremity, corner, curve, T-point).
Both versions of U-tree perform very poorly on this task. Comparing the units built by DEA with those built by U-tree, it appears that whereas the former creates only the required units for the retina, both versions of the latter go on splitting units related to the retina. Since one can improve the model of the universe in terms of prediction (but not in terms of efficiency in reaching the goal) by creating one unit for each possible configuration of the retina's parameters, this behavior of U-tree is to be expected. The reason for such a high failure rate is that, when put in a random state, the beaver has a very high probability of being far from both trees and houses. Thus, a network whose policy is not complete (a complete policy makes the animat go to the tree, grasp it, and then go to a house and release the tree) has an error rate close to 100%.

4.5 Eye Universe. This last universe is an elementary character recognition problem and demonstrates to some extent the classification capabilities of DEA and its relation to active vision (Rimey & Brown, 1994; Swain & Stricker, 1993). In this universe, the net can move a retina or name the object it has in its field of view. Therefore, we unify goal planning and classification by adding "answer actions." The universe consists of a square retina that can move onto a symbol to be recognized, which is chosen from a set of four possible characters. This retina is divided into four quarters, each holding four geometric feature detectors, which detect, respectively, extremities, angles, curves, and T-junctions (see Figure 10) and can be considered combinations of elementary border detectors, such as those that exist in the cortex (Hubel & Wiesel, 1962). Depending on the character shown to the retina and its location, some of the detectors are activated (see Figure 11). There are eight different actions: four related to the displacement of the retina in each of the four possible directions (one-twentieth of the retina border length at each step) and four others related to the naming of the character. There are 16 parameters, each associated with a quarter of the retina and a detector. The goal is reached when the net correctly gives the name of the character. We used two different sets of four characters: {1, 2, 3, 4} and {A, B, C, D}.
Figure 11: Two examples of characters in the eye universe. Each quarter of the retina possesses four different feature detectors.
The characters are piecewise linear, with labels "extremity," "corner," "curve," and "T-point" assigned by hand at each vertex. We used the same sets of symbols during learning and testing; therefore, we do not prove any form of generalization capability. The learning period lasts 1 million time steps, during which the agent alternates between movements and answers. The movements are chosen in such a way that the character is roughly centered: if all the detectors in half of the retina are inactive, the movements that bring the character into this part of the retina are chosen with a higher probability. The functioning of the net after the learning period demonstrates the utility of an approach that allows the recognition system to reduce ambiguities by moving the retina over the character. For example, the symbols 2 and 3 activate the same detectors when they lie in one quarter of the retina, since both have extremities, corners, and curves, and hence cannot be separated. To solve this problem, the net moves them so that they are on two quarters of the retina (the 2 symbol cannot activate two "corner" detectors, whereas the 3 can). The numbers of possible parameter value configurations are, respectively, 247 and 221 for the two sets of characters, and the nets required to solve the problems have 23 and 12 units. On the first set of characters, U-tree and DEA have a similar failure rate, but DEA is more efficient on the second set (see Figure 12).

4.5.1 Comparison with Classification Trees. We can compare DEA to algorithms for binary trees by considering a universe like the eye universe, without actions to modify the data. Let us call $Y$ the class of the object to be recognized. There is one action associated with each class, and the goal is reached if and only if the agent performs the right action. If $a$ is a function of the parameters, put
$$I(a) = \max_{y} \hat P(Y = y \mid a(0) = 1)\, \hat P(a(0) = 1).$$
Figure 12: Results for the eye universe for the sets of characters {1, 2, 3, 4} (top) and {A, B, C, D} (bottom): failure rate versus number of units.
It is easy to show that, denoting by $a$ the activation of the unit to split and by $a'$ and $a''$ the activations of the two created units, we obtain
$$W(R', R) = I(a') + I(a'') - I(a).$$
This corresponds to a splitting rule that tries to minimize the error rate at each step. This rule has been described and criticized by Breiman, Friedman, Olshen, and Stone (1984). Its main defect is that reducing the misclassification rate at each step does not produce a good global strategy.
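The defect can be made concrete with a small invented example: two candidate splits of the same parent remove exactly the same number of errors, but only the second produces a pure child. The entropy criterion discussed next distinguishes them (the counts are hypothetical):

```python
import math

def misclass(counts):
    """Misclassification impurity of a node given its class counts:
    errors made when predicting the majority class."""
    return sum(counts) - max(counts)

def entropy(counts):
    """Count-weighted Shannon entropy (in nats) of a node."""
    n = sum(counts)
    return -sum(c * math.log(c / n) for c in counts if c > 0)

# parent node: 400 of class A, 400 of class B
split1 = [[300, 100], [100, 300]]
split2 = [[200, 400], [200, 0]]   # second child is pure

for split in (split1, split2):
    err = sum(misclass(c) for c in split)
    ent = sum(entropy(c) for c in split)
    print(err, round(ent, 1))
# prints: 200 449.9   (split 1)
#         200 381.9   (split 2: same error reduction, lower entropy)
```

Both splits leave 200 errors, so the error-rate rule cannot prefer either; the entropy criterion is lower for the split with the pure child, which is the better configuration since a pure unit can be ignored afterward.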
For example, two splits that lead to the same error rate are equivalent for this splitting rule, even if one of the splits produces a unit that is pure with respect to $Y$, which is a better configuration since such a unit is optimal and can be ignored afterward. The standard remedy is to use criteria such as Shannon entropy, which yield better estimates of the "purity" of units because they depend on all the class probabilities, not only on the probability of the most probable class. We believe that a good criterion for the construction of the net should reduce to the Shannon entropy–based rule, or an equivalent one, in this particular universe.

5 Discussion
Starting from analogies with cortical structures, we have developed a planning network with a formalization based on MDP theory. We have also proposed a procedure to construct the net by adding units, equivalent to iteratively splitting the perception state. We have shown that this algorithm can solve both goal-planning and classification problems, demonstrating that those functions can be combined in one structure. The comparison with U-tree, which addresses the same tasks, underscores the interest of DEA, which attempts to maximize both the quality of the Markovian model and the probability of reaching the goal. With such a strategy, DEA is able to generate more compact models (with a smaller number of states) and needs a smaller number of transition-probability estimates to create a fully efficient Markovian model dedicated to one goal. In some critical cases (see the beaver example), increasing the quality of the Markovian model without taking into account the efficiency relative to the goal leads to an excessive and useless creation of states. This algorithm is a simple example of a large class of algorithms using the general idea of efficiency maximization. It could be improved by adding a rule for unit fusion, allowing backtracking of the splitting process and canceling a split when it appears useless.

Acknowledgments
We thank Yves Burnod for his advice and participation, and Donald Geman for his help during the preparation of this article.

References

Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1989). Learning and sequential decision making (Tech. Rep. COINS 89-95). Amherst, MA: University of Massachusetts.
Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
DEA: An Architecture
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation. Englewood Cliffs, NJ: Prentice Hall.
Booker, L. B., Goldberg, D. E., & Holland, J. H. (1989). Classifier systems and genetic algorithms. Artificial Intelligence, 40, 235–282.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Burnod, Y. (1989). An adaptive neural network: The cerebral cortex. Paris: Masson, Collection Biologie Théorique.
Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence (Sydney, Australia).
Dehaene, S., & Changeux, J.-P. (1997). A hierarchical neuronal network for planning behavior. Proc. Natl. Acad. Sci. USA, 94, 13293–13298.
Guigon, E., Grandguillaume, P., Otto, I., Boutkhil, L., & Burnod, Y. (1994). Neural network models of cortical functions based on the computational properties of the cerebral cortex. Journal of Physiology, Paris, 88, 291–308.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge, MA: MIT Press.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Koenig, S., & Simmons, R. G. (1996). The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22, 227–250.
Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33, 1–64.
McCallum, A. K. (1996a). Learning to use selective attention and short-term memory in sequential tasks.
In From animals to animats: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
McCallum, A. K. (1996b). Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester, Rochester, NY.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
Moore, A. W., & Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21.
Mozer, M. C., & Bachrach, J. (1991). SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learning, 7, 139–160.
Piaget, J. (1964). Six études de psychologie. Paris: Éditions Denoël.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
François Fleuret and Eric Brunet
Rimey, R. R., & Brown, C. M. (1994). Control of selective perception using Bayes nets and decision theory. International Journal of Computer Vision, 12, 173–207.
Ripley, B. D. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56(3), 409–456.
Sutton, R. S. (1990). Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh Int. Conf. on Machine Learning (pp. 216–224). San Mateo, CA: Morgan Kaufmann.
Swain, M. J., & Stricker, M. A. (1993). Promising directions in active vision. International Journal of Computer Vision, 11, 109–126.

Received September 18, 1998; accepted September 9, 1999.
NOTE
Communicated by Nicol Schraudolph
On a Fast, Compact Approximation of the Exponential Function

Gavin C. Cawley
University of East Anglia, Norwich, Norfolk NR4 7TJ, England

Recently Schraudolph (1999) described an ingenious, fast, and compact approximation of the exponential function through manipulation of the components of a standard (IEEE-754; IEEE, 1985) floating-point representation. This brief note communicates a recoding of this procedure that overcomes some of the limitations of the original macro at little or no additional computational expense.

1 Motivation
Schraudolph quite rightly describes exponentiation as the quintessential nonlinear function of neural computation and provides C code implementing a fast approximation of this function, providing impressive speed improvements both in benchmark tests and in a real-world application (JPEG quality transcoding; Lazzaro & Wawrzynek, 1999). The use of a static global variable is, however, problematic in a multithreaded environment. This note describes a reimplementation of the algorithm that eliminates the global variable without incurring significant additional computational expense.

2 The Algorithm
IEEE standard 754 stipulates that double-precision floating-point numbers be represented in the form (-1)^s (1 + m) 2^(x - x_0), where s is the sign bit, m is the 52-bit normalized mantissa (a binary fraction in the range [0, 1)), and x is the 11-bit exponent with a bias x_0 = 1023. Schraudolph's fast approximation of the exponential function is based on the identity e^y = 2^(y / ln 2). Setting the exponent, x, equal to the integer part of y / ln 2 + x_0 provides a simple approximation to e^y. This approach can be implemented using a C/C++ union construct, allowing a double-precision IEEE-754 floating-point number, d, to be stored in the same locations in memory as a structure comprising two 32-bit signed integers, i and j, as shown in Figure 1. Setting

    i = 2^20 y / ln 2 + 2^20 x_0 - C,
    j = 0
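The field layout described above can be inspected directly in any language that exposes the raw bits of a double. The following Python sketch is an illustration added here (the note itself works in C/C++); the helper name `ieee754_fields` is hypothetical:

```python
import struct

def ieee754_fields(d):
    """Decompose a double into (sign bit, biased exponent, mantissa fraction bits)."""
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    sign = bits >> 63                    # 1-bit sign s
    exponent = (bits >> 52) & 0x7FF      # 11-bit biased exponent x
    mantissa = bits & ((1 << 52) - 1)    # 52-bit fraction m
    return sign, exponent, mantissa

# 1.0 is stored as (-1)^0 (1 + 0) 2^(1023 - 1023), so the fields are (0, 1023, 0).
fields = ieee754_fields(1.0)
```

The bias x_0 = 1023 is visible in the exponent field: 0.5 yields a biased exponent of 1022, and -2.0 yields sign 1 with biased exponent 1024.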
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2009–2012 (2000)
Figure 1: Bit representation of the union data structure used by the exponential function (Schraudolph, 1999).
results in the exponent bits of d containing the integer part of y / ln 2 + x_0, the factor of 2^20 providing the necessary shift left of 20 places. The fractional part overflows into the most significant bits of the mantissa, providing a degree of linear interpolation between integral values. The parameter C provides some control over the error properties of the approximation; a value of 60801 minimizes the root-mean-square (RMS) relative error.

3 The Implementation
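The same bit manipulation is easy to prototype in a higher-level language before committing to the C version. The following Python reimplementation is an illustration only (the constant names mirror the macros of Figure 2, and `fast_exp` is a hypothetical helper):

```python
import math
import struct

EXP_A = 1048576 / math.log(2)   # 2^20 / ln 2
EXP_C = 60801                   # value minimizing the RMS relative error

def fast_exp(y):
    """Approximate e**y by writing the integer part of y/ln2 + x0 into the exponent bits."""
    # 1072693248 = 1023 * 2^20, i.e. the bias x0 shifted left by 20 places.
    i = int(EXP_A * y) + (1072693248 - EXP_C)
    # i occupies the most significant 32 bits of the double; the lower word j is zero.
    return struct.unpack("<d", struct.pack("<q", i << 32))[0]
```

With C = 60801, the relative error of the approximation stays within a few percent; for example, `fast_exp(0.0)` returns roughly 0.971 rather than 1.0, which is the trade the original note makes in exchange for speed.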
The revised C++ implementation is shown in Figure 2. The EXP macro has been replaced by an inline function, exponential, and the global static variable _eco is replaced by a local variable of exponential. This approach brings two benefits. First, this implementation is more suitable for use in multithreaded programs because the static global variable has been eliminated. Second, the use of a function, instead of a macro, more cleanly encapsulates the algorithm and permits better use of type-checking rules. Note that this code can also be compiled by a standard ANSI C compiler if the inline keyword is omitted.

4 Benchmark Results
Table 1 displays benchmark results for the original and revised implementations of Schraudolph's approximation. The benchmark used is similar to Schraudolph's, except that, for accuracy, the sum of 10^9 exponentials of random arguments is computed. Note that the exponential function is slightly faster than the EXP macro, except in the case of the Sun workstation, where the EXP macro is marginally faster. The ISO/ANSI C++ standard (ISO/IEC, 1998) states that the inline keyword suggests, but does not require, that the statements comprising the body of the function be expanded into the body of the calling function, eliminating the computational expense of the function call. However, most modern C and C++ compilers are capable of performing this optimization, especially in the case of small leaf functions such as this. One might still expect the revised code to be slower because a local variable must be allocated and initialized each time the function is called, whereas the original macro is able to reuse a global variable that is statically allocated and initialized. The function, though, does not have the side effect of modifying a global variable. Because it is generally infeasible for the compiler to determine statically whether this value is ever used, the original macro must contain code to update the global variable, whereas the function can be optimized so that the temporary variable exists only within registers.

#define EXP_A (1048576/M_LN2)
#define EXP_C 60801

inline double exponential(double y)
{
    union
    {
        double d;
#ifdef LITTLE_ENDIAN
        struct { int j, i; } n;
#else
        struct { int i, j; } n;
#endif
    } _eco;
    _eco.n.i = (int)(EXP_A*(y)) + (1072693248 - EXP_C);
    _eco.n.j = 0;
    return _eco.d;
}

Figure 2: C++ code fragment implementing a fast approximation of the exponential function.

Table 1: Seconds Required for 10^9 Exponentiations.

Manufacturer, Processor Model/Speed | LITTLE_ENDIAN | Operating System   | Compiler       | Source | Optimization | exp (libm.a) | EXP macro | exponential function
Intel Pentium II/300                | Yes           | WinNT              | egcs-2.91.57   | C      | -O3          | 918          | 415       | 391
Intel Pentium Xeon/450              | Yes           | Linux              | egcs-2.90.27   | C++    | -O1          | 431          | 281       | 261
Sun UltraSparc Ultra 5/360          | No            | Sun OS 5.7         | CC 5.0         | C++    | -O4          | 677          | 125       | 129
DEC Alpha 500au/500                 | Yes           | Digital UNIX V4.0E | DEC C V5.8-009 | C      | -O2          | 130          | 41        | 40

5 Summary
A recoding of the splendid approximation of the exponential function described by Schraudolph is presented. Without incurring a significant additional computational expense, it provides the following additional benefits: The reimplementation is better suited to a multithreaded environment because the static global variable has been eliminated. Type-checking rules can be applied more strongly, and the algorithm is encapsulated more cleanly, for improved reliability.

Acknowledgments
I thank Mark Fisher, Danilo Mandic, and the anonymous reviewers for their helpful comments and Shaun McCullagh for his assistance in collecting the benchmark results.

References

IEEE. (1985). Standard for binary floating-point arithmetic (IEEE/ANSI Std. 754-1985). New York: American National Standards Institute/Institute of Electrical and Electronic Engineers.
ISO/IEC. (1998). Programming languages—C++ (ISO/IEC Std. 14882-1998(E)). New York: American National Standards Institute.
Lazzaro, J., & Wawrzynek, J. (1999). JPEG quality transcoding using neural networks trained with a perceptual error measure. Neural Computation, 11(1), 267–298.
Schraudolph, N. N. (1999). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853–862.

Received August 11, 1999; accepted October 8, 1999.
LETTER
Communicated by Peter Bartlett
Bounds on Error Expectation for Support Vector Machines

V. Vapnik and O. Chapelle
AT&T Labs-Research; École Normale Supérieure de Lyon, Lyon, France

We introduce the concept of span of support vectors (SV) and show that the generalization ability of support vector machines (SVM) depends on this new geometrical concept. We prove that the value of the span is always smaller (and can be much smaller) than the diameter of the smallest sphere containing the support vectors, used in previous bounds (Vapnik, 1998). We also demonstrate experimentally that the prediction of the test error given by the span is very accurate and has direct application in model selection (choice of the optimal parameters of the SVM).
1 Introduction
Recently, a new type of algorithm with a high level of performance, called support vector machines (SVM), has been introduced (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995). Usually, the good generalization ability of SVM is explained by the existence of a large margin. Bounds on the error rate for a hyperplane that separates the data with some margin were obtained by Bartlett and Shawe-Taylor (1999) and Shawe-Taylor, Bartlett, Williamson, and Anthony (1998). In Vapnik (1998), another type of bound was obtained, demonstrating that for the separable case, the expectation of error probability for hyperplanes passing through the origin depends on the expectation of R²/ρ², where R is the maximal norm of support vectors and ρ is the margin. In this article we derive bounds on the expectation of error for SVM from the leave-one-out estimator, which is an unbiased estimate of the probability of test error. These bounds (which are tighter than the one defined in Vapnik, 1998, and valid for hyperplanes not necessarily passing through the origin) depend on a new concept, the span of support vectors. The bounds obtained show that the generalization ability of SVM depends on more complex geometrical constructions than large margin. To introduce the concept of the span of support vectors, we have to describe the basics of SVM.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2013–2036 (2000)
2 SVM for Pattern Recognition
We call the hyperplane

    w_0 · x + b_0 = 0

optimal if it separates the training data

    (x_1, y_1), ..., (x_ℓ, y_ℓ),   x ∈ R^m,  y ∈ {−1, 1},

and if the margin between the hyperplane and the closest training vector is maximal. This means that the optimal hyperplane has to satisfy the inequalities

    y_i (w · x_i + b) ≥ 1,   i = 1, ..., ℓ,

and has to minimize the functional R(w) = w · w. This quadratic optimization problem can be solved in the dual space of Lagrange multipliers. One constructs the Lagrangian

    L(w, b, α) = (1/2) w · w − Σ_{i=1}^ℓ a_i [y_i (w · x_i + b) − 1]   (2.1)

and finds its saddle point: the point that minimizes this function with respect to w and b and maximizes it with respect to

    a_i ≥ 0,   i = 1, ..., ℓ.   (2.2)

Minimization over w defines the equation

    w = Σ_{i=1}^ℓ a_i y_i x_i,   (2.3)

and minimization over b defines the equation

    Σ_{i=1}^ℓ a_i y_i = 0.   (2.4)

Substituting equation 2.3 back into the Lagrangian, equation 2.1, and taking into account equation 2.4, we obtain the functional

    W(α) = Σ_{i=1}^ℓ a_i − (1/2) Σ_{i,j=1}^ℓ a_i a_j y_i y_j x_i · x_j,   (2.5)
which we have to maximize with respect to the parameters α satisfying two constraints: the equality constraint, equation 2.4, and the positivity constraints, equation 2.2. The optimal solution α⁰ = (a_1⁰, ..., a_ℓ⁰) specifies the coefficients for the optimal hyperplane,

    w_0 = Σ_{i=1}^ℓ a_i⁰ y_i x_i.

Therefore the optimal hyperplane is

    Σ_{i=1}^ℓ a_i⁰ y_i x_i · x + b_0 = 0,   (2.6)

where b_0 is chosen to maximize the margin. It is important to note that the optimal solution satisfies the Kuhn-Tucker conditions

    a_i⁰ [y_i (w_0 · x_i + b_0) − 1] = 0.

From these conditions, it follows that if the expansion of vector w_0 uses vector x_i with nonzero weight a_i⁰, then the following equality must hold:

    y_i (w_0 · x_i + b_0) = 1.   (2.7)

Vectors x_i that satisfy this equality are called support vectors. Note that the norm of vector w_0 defines the margin ρ between the optimal separating hyperplane and the support vectors:

    ρ = 1 / ||w_0||.

Therefore, taking into account equations 2.4 and 2.7, we obtain

    1/ρ² = w_0 · w_0 = Σ_{i=1}^ℓ y_i a_i⁰ w_0 · x_i = Σ_{i=1}^ℓ y_i a_i⁰ (w_0 · x_i + b_0) = Σ_{i=1}^ℓ a_i⁰,   (2.8)

where ρ is the margin for the optimal separating hyperplane. In the nonseparable case, we introduce slack variables ξ_i and minimize the functional

    R(w, b, ξ) = (1/2) w · w + C Σ_{i=1}^ℓ ξ_i,

subject to the constraints

    y_i (w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0.
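The separable-case derivation can be checked on a toy problem small enough to solve by hand. The following Python sketch (an illustration added here, not part of the article) maximizes the dual functional 2.5 over a grid for two opposite training points and confirms equality 2.8, 1/ρ² = Σ a_i⁰; by symmetry and constraint 2.4, both multipliers are equal, so the dual is one-dimensional:

```python
# Toy separable problem: x1 = (1, 0) with y1 = +1, x2 = (-1, 0) with y2 = -1.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1.0, -1.0]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def dual_value(a):
    """W(alpha) of equation 2.5; constraint 2.4 forces a1 = a2 = a here."""
    alphas = [a, a]
    linear = sum(alphas)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(2) for j in range(2))
    return linear - 0.5 * quad

# Maximize W over a grid of candidate multiplier values.
best_a = max((i * 0.001 for i in range(2001)), key=dual_value)

# Weights from equation 2.3, and check of equality 2.8: w0.w0 = sum of multipliers.
w0 = [sum(best_a * ys[i] * xs[i][k] for i in range(2)) for k in range(2)]
inv_rho_sq = dot(w0, w0)
sum_alpha = 2 * best_a
```

The grid maximum lands at a_1⁰ = a_2⁰ = 1/2, giving w_0 = (1, 0), margin ρ = 1, and 1/ρ² = Σ a_i⁰ = 1, as equation 2.8 requires.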
When the constant C is sufficiently large and the data are separable, the solution of this optimization problem coincides with the one obtained for the separable case. To solve this quadratic optimization problem for the nonseparable case, we consider the Lagrangian

    L(w, b, ξ, α, ν) = (1/2) w · w − Σ_{i=1}^ℓ a_i [y_i (w · x_i + b) − 1 + ξ_i] + C Σ_{i=1}^ℓ ξ_i − Σ_{i=1}^ℓ ν_i ξ_i,

which we minimize with respect to w, b, and ξ_i and maximize with respect to the Lagrange multipliers a_i ≥ 0 and ν_i ≥ 0. The result of minimization over w and b leads to conditions 2.3 and 2.4, and the result of minimization over ξ_i gives the new condition

    a_i + ν_i = C.   (2.9)

Taking into account that ν_i ≥ 0, we obtain

    0 ≤ a_i ≤ C.   (2.10)

Substituting equation 2.3 into the Lagrangian, we obtain that in order to find the optimal hyperplane, one has to maximize the functional 2.5, subject to constraints 2.4 and 2.10. The box constraints, 2.10 (instead of the positivity constraints, 2.2), entail the difference between the methods for constructing optimal hyperplanes in the nonseparable case and the separable case, respectively. For the nonseparable case, the Kuhn-Tucker conditions,

    a_i⁰ [y_i (w_0 · x_i + b_0) − 1 + ξ_i] = 0,   ν_i ξ_i = 0,   (2.11)

must be satisfied. Vectors x_i that correspond to nonzero a_i⁰ are referred to as support vectors. For support vectors, the equalities

    y_i (w_0 · x_i + b_0) = 1 − ξ_i

hold. From conditions 2.11 and 2.9, it follows that if ξ_i > 0, then ν_i = 0 and therefore a_i = C. We will distinguish two types of support vectors: support vectors for which 0 < a_i⁰ < C and support vectors for which a_i⁰ = C. To simplify notation, we sort the support vectors such that the first n* support vectors belong to the first category (with 0 < a_i < C) and the next m = n − n* support vectors belong to the second category (with a_i = C). When constructing SVMs, one usually maps the input vectors x ∈ X into a high-dimensional (even infinite-dimensional) feature space ψ(x) ∈ F, where one constructs the optimal separating hyperplane. Note that both
the optimal hyperplane, 2.6, and the target functional, 2.5, that have to be maximized to find the optimal hyperplane depend on the inner product between two vectors rather than on the input vectors explicitly. Therefore one can use a general representation of the inner product in order to calculate it. It is known that the inner product between two vectors has the following general representation,

    ψ(x_1) · ψ(x_2) = K(x_1, x_2),

where K(x_1, x_2) is a kernel function that satisfies the Mercer conditions (a symmetric positive definite function). The form of the kernel function K(x_1, x_2) depends on the type of mapping ψ(x) of the input vectors. In order to construct the optimal hyperplane in feature space, it is sufficient to use a kernel function instead of the inner product in expressions 2.5 and 2.6. Further, we consider bounds in the input space X. However, all results are true for any mapping ψ. To obtain the corresponding results in a feature space, one uses the representation of the inner product in feature space, K(x, x_i), instead of the inner product x · x_i.

3 The Leave-One-Out Procedure
The bounds introduced in this article are derived from the leave-one-out cross-validation estimate, a procedure usually used to estimate the probability of test error of a learning algorithm. Suppose that using training data of size ℓ, one tries simultaneously to estimate a decision rule and evaluate the quality of this decision rule. Using the training data, one constructs a decision rule. Then one uses the same training data to evaluate the quality of the obtained rule based on the leave-one-out procedure: one removes from the training data one element (say, (x_p, y_p)), constructs the decision rule on the basis of the remaining training data, and then tests the removed element. In this fashion, one tests all ℓ elements of the training data (using ℓ different decision rules). Let us denote the number of errors in the leave-one-out procedure by L(x_1, y_1, ..., x_ℓ, y_ℓ). Luntz and Brailovsky (1969) proved the following lemma.

Lemma 1. The leave-one-out procedure gives an almost unbiased estimate of the probability of test error:

    E p_error^{ℓ−1} = E ( L(x_1, y_1, ..., x_ℓ, y_ℓ) / ℓ ),

where p_error^{ℓ−1} is the probability of test error for the machine trained on a sample of size ℓ − 1.

"Almost" in lemma 1 refers to the fact that the probability of test error is for samples of size ℓ − 1 instead of ℓ.
Remark. For SVMs, one needs to conduct the leave-one-out procedure only for support vectors; nonsupport vectors will be recognized correctly, since removing a point that is not a support vector does not change the decision function.
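The procedure itself is generic and applies to any learning algorithm. A minimal Python sketch with a nearest-class-mean classifier (chosen here only for brevity; the article applies the procedure to SVMs, and the helper names are hypothetical) makes the error counting L(x_1, y_1, ..., x_ℓ, y_ℓ) explicit:

```python
def nearest_mean_classify(train, x):
    """Classify scalar x by the closer of the per-class means of `train`."""
    groups = {}
    for xi, yi in train:
        groups.setdefault(yi, []).append(xi)
    centroids = {y: sum(v) / len(v) for y, v in groups.items()}
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def leave_one_out_errors(data):
    """L(x1, y1, ..., xl, yl): errors made by the leave-one-out procedure."""
    errors = 0
    for p in range(len(data)):
        held_x, held_y = data[p]
        rest = data[:p] + data[p + 1:]          # train on the remaining l - 1 points
        if nearest_mean_classify(rest, held_x) != held_y:
            errors += 1
    return errors

# Well-separated data: every held-out point is still classified correctly.
data = [(0.0, -1), (1.0, -1), (4.0, 1), (5.0, 1)]
```

Dividing `leave_one_out_errors(data)` by ℓ gives the estimate of the test error probability in lemma 1.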
In section 5, we introduce upper bounds on the number of errors made by the leave-one-out procedure. For this purpose, we need to introduce a new concept, the span of support vectors.

4 Span of the Set of Support Vectors
Let us first consider the separable case. Suppose that

    (x_1, y_1), ..., (x_n, y_n)

is a set of support vectors, and

    α⁰ = (a_1⁰, ..., a_n⁰)

is the vector of Lagrange multipliers for the optimal hyperplane. For any fixed support vector x_p, we define the set Λ_p as a constrained linear combination of the points {x_i}_{i ≠ p}:

    Λ_p = { Σ_{i=1, i≠p}^n λ_i x_i :  Σ_{i=1, i≠p}^n λ_i = 1, and ∀ i ≠ p, a_i⁰ + y_i y_p a_p⁰ λ_i ≥ 0 }.

Note that λ_i can be less than 0. We also define the quantity S_p, which we call the span of the support vector x_p, as the distance between x_p and this set (see Figure 1):

    S_p² = d²(x_p, Λ_p) = min_{x ∈ Λ_p} (x_p − x)².   (4.1)

As shown in Figure 2, it can happen that x_p ∈ Λ_p, which implies S_p = d(x_p, Λ_p) = 0. Intuitively, for smaller S_p = d(x_p, Λ_p), the leave-one-out procedure is less likely to make an error on the vector x_p. Indeed, we will prove (see lemma 3) that if S_p < 1/(D a_p⁰) (D is the diameter of the smallest sphere containing the training points), then the leave-one-out procedure classifies x_p correctly. By setting λ_p = −1, we can rewrite S_p as

    S_p² = min { (Σ_{i=1}^n λ_i x_i)² :  λ_p = −1,  Σ_{i=1}^n λ_i = 0,  a_i⁰ + y_i y_p a_p⁰ λ_i ≥ 0 }.   (4.2)
Figure 1: Consider the two-dimensional example: three support vectors with a_1 = a_2 = a_3/2. The set Λ_1 is the semi-open dashed line: Λ_1 = {λ_2 x_2 + λ_3 x_3 : λ_2 + λ_3 = 1, λ_2 ≥ −1, λ_3 ≤ 2}.
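For a configuration like the one in Figure 1, Λ_1 is one-dimensional, so S_1 = d(x_1, Λ_1) can be computed by a direct scan over λ_3. The coordinates in the following Python sketch are hypothetical (chosen only for illustration); the feasibility constraints λ_2 ≥ −1 and λ_3 ≤ 2 are those of the caption, and here both reduce to λ_3 ≤ 2:

```python
import math

# Hypothetical support-vector coordinates for a Figure 1-like configuration.
x1 = (1.0, 1.0)            # the vector whose span we compute
x2 = (0.0, 0.0)
x3 = (2.0, 0.0)

def point_in_lambda1(l3):
    """Point lam2*x2 + lam3*x3 of the set Lambda_1, with lam2 + lam3 = 1."""
    l2 = 1.0 - l3
    return (l2 * x2[0] + l3 * x3[0], l2 * x2[1] + l3 * x3[1])

# Feasible set: lam2 >= -1 is equivalent to lam3 <= 2, so lam3 ranges over (-inf, 2].
# Scan a grid wide enough to bracket the minimizer of the distance to x1.
candidates = (i * 0.001 - 10.0 for i in range(12001))   # lam3 in [-10, 2]
span1 = min(math.dist(x1, point_in_lambda1(l3)) for l3 in candidates)
```

With these coordinates, Λ_1 is the half-line {(t, 0) : t ≤ 4}, the closest feasible point to x_1 = (1, 1) is (1, 0), and the scan recovers S_1 = 1.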
The maximal value of S_p is called the S-span:

    S = max{d(x_1, Λ_1), ..., d(x_n, Λ_n)} = max_p S_p.

We will prove (cf. lemma 2 below) that S_p ≤ D_SV. Therefore,

    S ≤ D_SV.   (4.3)

Depending on α⁰ = (a_1⁰, ..., a_n⁰), the value of the span S can be much less than the diameter D_SV of the support vectors. Indeed, in Figure 2, d(x_1, Λ_1) = 0 and, by symmetry, d(x_i, Λ_i) = 0 for all i. Therefore, in this example, S = 0. Now we generalize the span concept for the nonseparable case. In the nonseparable case we distinguish between two categories of support vectors: the support vectors for which

    0 < a_i < C,   i = 1, ..., n*,

and the support vectors for which

    a_j = C,   j = n* + 1, ..., n.
Figure 2: In this example, we have x_1 ∈ Λ_1 and therefore d(x_1, Λ_1) = 0. The set Λ_1 has been computed using a_1 = a_2 = a_3 = a_4.
We define the span of support vectors using support vectors of the first category. That means we consider the value S_p = d(x_p, Λ_p), where

    Λ_p = { Σ_{i=1, i≠p}^{n*} λ_i x_i :  Σ_{i=1, i≠p}^{n*} λ_i = 1, and ∀ i ≠ p, 0 ≤ a_i⁰ + y_i y_p a_p⁰ λ_i ≤ C }.

The differences in the definition of the span between the separable and the nonseparable case are that in the nonseparable case, we ignore the support vectors of the second category and add an upper bound C in the constraints on λ_i. Therefore, in the nonseparable case, the value of the span of support vectors depends on the value of C. It is not obvious that the set Λ_p is not empty. This is proved in the following lemma.

Lemma 2. In both the separable and nonseparable cases, the set Λ_p is not empty. Moreover, S_p = d(x_p, Λ_p) ≤ D_SV.

The proof is in the appendix.

Remark. From lemma 2, we conclude (as in the separable case) that

    S ≤ D_SV,   (4.4)
where D_SV is the diameter of the smallest sphere containing the support vectors of the first category.

5 The Bounds
The generalization ability of SVMs can be explained by their capacity control. Indeed, the VC dimension of hyperplanes with margin ρ is less than D²/4ρ², where D is the diameter of the smallest sphere containing the training points (Vapnik, 1995). This is the theoretical idea motivating the maximization of the margin. This section presents new bounds on the generalization ability of SVMs. The major improvement lies in the fact that the bounds will depend on the span of the support vectors, which gives tighter bounds than ones depending on the diameter of the training points. Let us first introduce our fundamental result.

Lemma 3. If in the leave-one-out procedure a support vector x_p corresponding to 0 < a_p < C is recognized incorrectly, then the inequality

    a_p⁰ S_p max(D, 1/√C) ≥ 1

holds true.

See the proof in the appendix. Lemma 3 leads to the following theorem for the separable case.

Theorem 1. Suppose that an SVM separates training data of size ℓ without error. Then the expectation of the probability of error p_error^{ℓ−1} for the SVM trained on the training data of size ℓ − 1 has the bound

    E p_error^{ℓ−1} ≤ E ( SD / (ℓ ρ²) ),

where the values of the span of support vectors S, the diameter of the smallest sphere containing the training points D, and the margin ρ are considered for training sets of size ℓ.

Proof. Let us prove that the number of errors made by the leave-one-out procedure is bounded by SD/ρ². Taking the expectation and using lemma 1 will prove the theorem. Consider a support vector x_p incorrectly classified by the leave-one-out procedure. Then lemma 3 gives a_p⁰ S_p D ≥ 1 (we consider here the separable
case and C = ∞), and

    a_p⁰ ≥ 1 / (SD)

holds true. Now let us sum the left-hand and the right-hand sides of this inequality over all support vectors where the leave-one-out procedure commits an error:

    L(x_1, y_1, ..., x_ℓ, y_ℓ) / (SD) ≤ Σ* a_i⁰.

Here Σ* indicates that the sum is taken only over support vectors where the leave-one-out procedure makes an error. From this inequality we have

    L(x_1, y_1, ..., x_ℓ, y_ℓ) / (ℓ SD) ≤ (1/ℓ) Σ_{i=1}^n a_i⁰.

Therefore we have (using equation 2.8)

    L(x_1, y_1, ..., x_ℓ, y_ℓ) / ℓ ≤ SD Σ_{i=1}^n a_i⁰ / ℓ = SD / (ℓ ρ²).
Taking the expectation over both sides of the inequality and using the Luntz and Brailovsky lemma, we prove the theorem. For the nonseparable case the following theorem is true.

Theorem 2. The expectation of the probability of error p_error^{ℓ−1} for an SVM trained on training data of size ℓ − 1 has the bound

    E p_error^{ℓ−1} ≤ E ( ( S max(D, 1/√C) Σ_{i=1}^{n*} a_i⁰ + m ) / ℓ ),

where the sum is taken only over the a_i corresponding to support vectors of the first category (for which 0 < a_i < C) and m is the number of support vectors of the second category (for which a_i = C). The values of the span of support vectors S, the diameter of the smallest sphere containing the training points D, and the Lagrange multipliers α⁰ = (a_1⁰, ..., a_n⁰) are considered for training sets of size ℓ.

Proof. The proof of this theorem is similar to the proof of theorem 1. We consider all support vectors of the second category (corresponding to a_j = C) as errors. For the first category of support vectors, we estimate the
number L*(x_1, y_1, ..., x_ℓ, y_ℓ) of errors made by the leave-one-out procedure, using lemma 3 as in the proof of theorem 1. We obtain

    L(x_1, y_1, ..., x_ℓ, y_ℓ) / ℓ ≤ ( L*(x_1, y_1, ..., x_ℓ, y_ℓ) + m ) / ℓ ≤ ( S max(D, 1/√C) Σ* a_i + m ) / ℓ.
Taking the expectation over both sides of the inequality and using the Luntz and Brailovsky lemma, we prove the theorem. Note that when m = 0 (separable case), equality 2.8 holds true. In this case (provided that C is large enough), the bounds obtained in these two theorems coincide. In theorems 1 and 2, it is possible, using inequality 4.4, to bound the value of the span S by the diameter of the smallest sphere containing the support vectors, D_SV. But, as pointed out by the experiments (see section 6), this would lead to looser bounds because the span can be much less than the diameter.

5.1 Extension. In the proof of lemma 3, it appears that the diameter of the training points D can be replaced by the span of the support vectors after the leave-one-out procedure. But since the set of support vectors after the leave-one-out procedure is unknown, we bounded this unknown span by D. Nevertheless, this remark motivated us to analyze the case where the set of support vectors remains the same during the leave-one-out procedure. In this situation, we are allowed to replace D by S in lemma 3; more precisely, the following theorem is true:
Theorem 3. If the sets of support vectors of the first and second categories remain the same during the leave-one-out procedure, then for any support vector x_p, the following equality holds:

    y_p ( f⁰(x_p) − f^p(x_p) ) = a_p⁰ S_p²,

where f⁰ and f^p are the decision functions given by the SVM trained, respectively, on the whole training set and after the point x_p has been removed.

The proof is in the appendix. The assumption that the set of support vectors does not change during the leave-one-out procedure can be wrong. Nevertheless, the proportion of points that violate this assumption is usually small compared to the number of support vectors. In this case, theorem 3 provides a good approximation of the result of the leave-one-out procedure, as pointed out by the experiments (see Figure 4).
Note that theorem 3 is stronger than lemma 3 for three reasons: the term S_p max(D, 1/√C) becomes S_p², the inequality turns out to be an equality, and the result is valid for any support vector. The previous theorem enables us to compute the number of errors made by the leave-one-out procedure.

Corollary 1. Under the assumption of theorem 3, the test error prediction given by the leave-one-out procedure is

    t_ℓ = (1/ℓ) L(x_1, y_1, ..., x_ℓ, y_ℓ) = (1/ℓ) Card{ p : a_p⁰ S_p² ≥ y_p f⁰(x_p) }.   (5.1)
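Once a_p⁰, S_p², and y_p f⁰(x_p) are available from a trained machine, the span rule is a simple count. The Python sketch below is an illustration with hypothetical per-support-vector values, not output of an actual SVM:

```python
def span_rule_error_estimate(alpha0, span_sq, y_times_f0, l):
    """Fraction (1/l) Card{p : alpha0_p * S_p^2 >= y_p f0(x_p)} of corollary 1."""
    errors = sum(1 for a, s2, yf in zip(alpha0, span_sq, y_times_f0)
                 if a * s2 >= yf)
    return errors / l

# Hypothetical quantities for three support vectors in a training set of size 10.
alpha0 = [0.5, 1.0, 2.0]       # Lagrange multipliers a_p^0
span_sq = [0.1, 0.5, 1.0]      # spans S_p^2
y_times_f0 = [1.0, 1.0, 1.0]   # margins y_p f0(x_p)
```

Here only the third support vector satisfies a_p⁰ S_p² ≥ y_p f⁰(x_p), so the predicted leave-one-out error is 1/10.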
6 Experiments
The previous bounds on the generalization ability of SVMs involved the diameter of the smallest sphere enclosing the training points (Vapnik, 1995). We have shown (cf. inequality 4.4) that the span S is always smaller than this diameter, but to appreciate the gain, we conducted some experiments. First, we compare the diameter of the smallest sphere enclosing the training points, the one enclosing the support vectors, and the span of the support vectors using the postal database. This data set consists of 7291 handwritten digits of size 16 × 16, with a test set of 2007 examples. Following Schölkopf, Shawe-Taylor, Smola, and Williamson (1999), we split the training set into 23 subsets of 317 training examples. Our task is to separate digits 0 to 4 from 5 to 9. The error bars in Figure 3 are standard deviations over the 23 trials. The diameters and the span in Figure 3 are plotted for different values of σ, the width of the RBF kernel we used:

    K(x, y) = exp( −||x − y||² / (2σ²) ).
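The kernel itself is one line in any language; a Python sketch (the helper name is hypothetical, added here for illustration):

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

As required by the Mercer conditions, the function is symmetric, and K(x, x) = 1 for every x.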
In this example, the span is up to six times smaller than the diameter. We would like to use the span for predicting the test error accurately. This would enable us to perform efficient model selection, that is, choosing the optimal values of the parameters of the SVM (the width σ of the radial basis function (RBF) kernel or the constant C, for instance). Note that the span S is defined as a maximum, S = max_p S_p, and therefore taking into account the different values S_p should provide a more accurate estimation of the generalization error than the span S alone. Therefore, we used the span rule (see equation 5.1) in corollary 1 to predict the test error. Our experiments have been carried out on two databases: a separable one, the postal database, described above, and a noisy one, the breast cancer database.¹ The latter has been split randomly 100 times into a training set containing 200 examples and a test set containing 77 examples.

¹ Available online at http://horn.first.gmd.de/~raetsch/data/breast-cancer.
Bounds on Error Expectation
2025
Figure 3: Comparison of D, D_SV, and S (plotted against log σ).
Figure 4a compares the test error and its prediction given by the span rule, equation 5.1, for different values of the width σ of the RBF kernel on the postal database. Figure 4b plots the same functions for different values of C on the breast cancer database. The prediction is very accurate, and the curves are almost identical. The computation of the span rule, equation 5.1, involves computing the span S_p, equation 3.2, for every support vector. Note, however, that we are interested in the inequality S_p² ≤ y_p f(x_p)/α_p⁰, rather than the exact value of the span S_p. Therefore, if while minimizing S_p = d(x_p, Λ_p) we find a point x* ∈ Λ_p such that d(x_p, x*)² ≤ y_p f(x_p)/α_p⁰, we can stop the minimization, because this point will be correctly classified by the leave-one-out procedure. Figure 5 compares the time required to train the SVM on the postal database, compute the estimate of the leave-one-out procedure given by the span rule, equation 5.1, and compute exactly the leave-one-out procedure. In order to have a fair comparison, we optimized the computation of the leave-one-out procedure in the following way: for every support vector x_p, we take as a starting point for the minimization (see equation 2.5) involved in computing f^p (the decision function after having removed the point x_p) the solution given by f⁰ on the whole training set. The reason is that f⁰ and f^p are usually "close."
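The early-stopping idea above can be sketched as follows (our illustration, with a hypothetical iterative minimizer supplying candidate points): while searching over points x* ∈ Λ_p, we return as soon as d(x_p, x*)² ≤ y_p f(x_p)/α_p⁰ certifies that x_p survives the leave-one-out procedure, so the exact span is never needed.

```python
import numpy as np

def loo_correct_early_stop(x_p, candidates, bound):
    """Return (certified, best_dist_sq).

    `candidates` is any iterable of points x* in Lambda_p produced by an
    iterative minimizer of d(x_p, Lambda_p); `bound` is y_p f(x_p) / alpha_p.
    We stop as soon as d(x_p, x*)^2 <= bound: the exact span is then not
    needed to know that x_p is correctly classified by leave-one-out.
    """
    x_p = np.asarray(x_p, dtype=float)
    best = np.inf
    for x_star in candidates:
        d2 = float(np.sum((x_p - np.asarray(x_star, dtype=float)) ** 2))
        best = min(best, d2)
        if d2 <= bound:
            return True, best
    return False, best
```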
2026
V. Vapnik and O. Chapelle
Figure 4: Test error and its prediction using the span rule, equation 5.1. (A) Error versus log σ on the postal database. (B) Error versus log C on the breast cancer database.
Figure 5: Comparison of time (in seconds) required for SVM training, computation of span, and leave-one-out on the postal database, plotted against log σ.
The results show that the time required to compute the span is not prohibitive and is very attractive compared to the leave-one-out procedure.

7 Conclusion
We have shown that the generalization ability of support vector machines depends on a more complicated geometrical concept than the margin only. A direct application of the concept of span is the selection of the optimal parameters of the SVM, since the span enables an accurate prediction of the test error. Similarly to Weston (1999), the concept of the span can also lead to new learning algorithms involving the minimization of the number of errors made by the leave-one-out procedure.

Appendix: Proofs

A.1 Proof of Lemma 2. We will prove this result for the nonseparable case. The result is also valid for the separable case, since it can be seen as a particular case of the nonseparable one with C large enough. Let us define Λ_p⁺ as the subset of Λ_p with the additional constraints λ_i ≥ 0:

Λ_p⁺ = { Σ_{i=1, i≠p}^n λ_i x_i ∈ Λ_p : λ_i ≥ 0, i ≠ p }.  (A.1)
We shall prove that Λ_p⁺ ≠ ∅ by proving that a vector λ of the following form exists:

λ_j = 0,  j = n* + 1, ..., n  (A.2)

λ_i = μ (C − α_i⁰)/α_p⁰,  y_i = y_p, i ≠ p, i = 1, ..., n*  (A.3)

λ_i = μ α_i⁰/α_p⁰,  y_i ≠ y_p, i = 1, ..., n*  (A.4)

0 ≤ μ ≤ 1  (A.5)

Σ_{i=1}^n λ_i = 1.  (A.6)

It is straightforward to check that if such a vector λ exists, then Σ λ_i x_i ∈ Λ_p⁺ and therefore Λ_p⁺ ≠ ∅, and since Λ_p⁺ ⊂ Λ_p, we will have Λ_p ≠ ∅. Taking into account equations A.3 and A.4, we can rewrite constraint A.6 as follows:

1 = (μ/α_p⁰) ( Σ_{i=1, i≠p, y_i=y_p}^{n*} (C − α_i⁰) + Σ_{i=1, y_i≠y_p}^{n*} α_i⁰ ).  (A.7)
We need to show that the value of μ given by equation A.7 satisfies constraint A.5. For this purpose, let us define D as

D = Σ_{i: y_i=y_p}^{n*} (C − α_i⁰) + Σ_{i: y_i≠y_p}^{n*} α_i⁰  (A.8)
  = −y_p Σ_{i=1}^{n*} y_i α_i⁰ + Σ_{i: y_i=y_p}^{n*} C.  (A.9)

Now, note that

Σ_{i=1}^n y_i α_i⁰ = Σ_{i=1}^{n*} y_i α_i⁰ + C Σ_{i=n*+1}^n y_i = 0.  (A.10)

Combining equations A.9 and A.10, we get

D = C y_p Σ_{i=n*+1}^n y_i + Σ_{i: y_i=y_p}^{n*} C = Ck,

where k is an integer.
Since equation A.8 gives D > 0, we finally have

D ≥ C.  (A.11)

Let us rewrite equation A.7 as

1 = (μ/α_p⁰) (D − (C − α_p⁰)).

We obtain

μ = α_p⁰ / (D − (C − α_p⁰)).  (A.12)

Taking into account inequality A.11, we finally get 0 ≤ μ ≤ 1. Thus, constraint A.5 is fulfilled and Λ_p⁺ is not empty. Now note that the set Λ_p⁺ is included in the convex hull of {x_i}_{i≠p}, and since Λ_p⁺ ≠ ∅, we obtain

d(x_p, Λ_p⁺) ≤ D_SV,

where D_SV is the diameter of the smallest ball containing the support vectors of the first category. Since Λ_p⁺ ⊂ Λ_p, we finally get

S_p = d(x_p, Λ_p) ≤ d(x_p, Λ_p⁺) ≤ D_SV.

A.2 Proof of Lemma 3. Let us first consider the separable case. Suppose that our training set {x_1, ..., x_ℓ} is ordered such that the support vectors are the first n training points. The nonzero Lagrange multipliers associated with these support vectors are

α_1⁰, ..., α_n⁰.  (A.13)
In other words, the vector α⁰ = (α_1⁰, ..., α_n⁰, 0, ..., 0) maximizes the functional

W(α) = Σ_{i=1}^ℓ α_i − (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j x_i · x_j,  (A.14)

subject to the constraints

α ≥ 0,  (A.15)

Σ_{i=1}^ℓ α_i y_i = 0.  (A.16)
Let us consider the result of the leave-one-out procedure on the support vector x_p. This means that we maximized functional A.14 subject to constraints A.15 and A.16 and the additional constraint

α_p = 0,  (A.17)

and obtained the solution

α^p = (α_1^p, ..., α_ℓ^p).

Using this solution, we construct the separating hyperplane

w^p · x + b^p = 0,  where  w^p = Σ_{i=1}^ℓ α_i^p y_i x_i.

We would like to prove that if this hyperplane classifies the vector x_p incorrectly,

y_p (w^p · x_p + b^p) < 0,  (A.18)

then

α_p⁰ ≥ 1/(S_p D).
Since α^p maximizes equation A.14 under constraints A.15–A.17, the following inequality holds true:

W(α^p) ≥ W(α⁰ − δ),  (A.19)

where the vector δ = (δ_1, ..., δ_n) satisfies the following conditions:

δ_p = α_p⁰,   δ_i = 0 for i > n,   α⁰ − δ ≥ 0,   Σ_{i=1}^n δ_i y_i = 0.  (A.20)

From inequality A.19 we obtain

W(α⁰) − W(α^p) ≤ W(α⁰) − W(α⁰ − δ).  (A.21)
Since α⁰ maximizes A.14 under the constraints A.15 and A.16, the following inequality holds true:

W(α⁰) ≥ W(α^p + γ),  (A.22)

where γ = (γ_1, ..., γ_ℓ) is a vector satisfying the constraints

α^p + γ ≥ 0,   Σ_{i=1}^ℓ γ_i y_i = 0,   α_i^p = 0 ⟹ γ_i = 0 for i ≠ p.  (A.23)

From equations A.21 and A.22, we have

W(α^p + γ) − W(α^p) ≤ W(α⁰) − W(α⁰ − δ).  (A.24)
Let us calculate both the left-hand side, I_1, and the right-hand side, I_2, of inequality A.24:

I_1 = W(α^p + γ) − W(α^p)
    = Σ_{i=1}^ℓ (α_i^p + γ_i) − (1/2) Σ_{i,j=1}^ℓ (α_i^p + γ_i)(α_j^p + γ_j) y_i y_j x_i · x_j − Σ_{i=1}^ℓ α_i^p + (1/2) Σ_{i,j=1}^ℓ α_i^p α_j^p y_i y_j x_i · x_j
    = Σ_{i=1}^ℓ γ_i − Σ_{i,j} γ_i α_j^p y_i y_j x_i · x_j − (1/2) Σ_{i,j} y_i y_j γ_i γ_j x_i · x_j
    = Σ_{i=1}^ℓ γ_i (1 − y_i w^p · x_i) − (1/2) Σ_{i,j} y_i y_j γ_i γ_j x_i · x_j.

Taking into account that Σ_{i=1}^ℓ γ_i y_i = 0, we can rewrite the expression as

I_1 = Σ_{i≠p} γ_i [1 − y_i (w^p · x_i + b^p)] + γ_p [1 − y_p (w^p · x_p + b^p)] − (1/2) Σ_{i,j} y_i y_j γ_i γ_j x_i · x_j.
Since for i ≠ p the condition A.23 means that either γ_i = 0 or x_i is a support vector of the hyperplane w^p, the following equality holds:

γ_i [y_i (w^p · x_i + b^p) − 1] = 0.

We obtain

I_1 = γ_p [1 − y_p (w^p · x_p + b^p)] − (1/2) Σ_{i,j} y_i y_j γ_i γ_j x_i · x_j.

Now let us define the vector γ as follows:

γ_p = γ_k = a,   γ_i = 0 for i ∉ {k, p},

where a is some constant and k is such that y_p ≠ y_k and α_k^p > 0. For this vector we obtain

I_1 = a [1 − y_p (w^p · x_p + b^p)] − (a²/2) ‖x_p − x_k‖²
    ≥ a [1 − y_p (w^p · x_p + b^p)] − (a²/2) D².  (A.25)

Let us choose the value a to maximize this expression:

a = (1 − y_p (w^p · x_p + b^p)) / D².

Putting this expression back into equation A.25, we obtain

I_1 ≥ (1/2) (1 − y_p (w^p · x_p + b^p))² / D².

Since, according to our assumption, the leave-one-out procedure commits an error at the point x_p (that is, inequality A.18 is valid), we obtain

I_1 ≥ 1/(2D²).  (A.26)
Now we estimate the right-hand side of inequality A.24:

I_2 = W(α⁰) − W(α⁰ − δ).

We choose δ_i = −y_i y_p α_p⁰ λ_i, where λ is the vector that defines the value of d(x_p, Λ_p) in equation 4.2.
We have

I_2 = W(α⁰) − W(α⁰ − δ)
    = Σ_{i=1}^n α_i⁰ − (1/2) Σ_{i,j=1}^n α_i⁰ α_j⁰ y_i y_j x_i · x_j − Σ_{i=1}^n (α_i⁰ + y_i y_p α_p⁰ λ_i) + (1/2) Σ_{i,j=1}^n (α_i⁰ + y_i y_p α_p⁰ λ_i)(α_j⁰ + y_j y_p α_p⁰ λ_j) y_i y_j x_i · x_j
    = −y_p α_p⁰ Σ_{i=1}^n y_i λ_i + y_p α_p⁰ Σ_{i,j=1}^n α_i⁰ λ_j y_i x_i · x_j + ((α_p⁰)²/2) (Σ_{i=1}^n λ_i x_i)².  (A.27)

Since Σ_{i=1}^n λ_i = 0, and x_i is a support vector, we have

I_2 = y_p α_p⁰ Σ_{i=1}^n λ_i y_i [y_i (w⁰ · x_i + b⁰) − 1] + ((α_p⁰)²/2) (Σ_{i=1}^n λ_i x_i)²  (A.28)
    = ((α_p⁰)²/2) S_p².
Combining equations A.24, A.26, and A.28, we obtain α_p⁰ S_p D ≥ 1.

Consider now the nonseparable case. The sketch of the proof is the same. There are only two differences. First, the vector γ needs to satisfy

α^p + γ ≤ C.

A proof very similar to the one of lemma 2 gives us the existence of γ. The other difference lies in the choice of a in equation A.25. The value of a that maximizes equation A.25 is

a* = (1 − y_p (w^p · x_p + b^p)) / D².

But we need to fulfill the condition a ≤ C. Thus, if a* > C, we replace a by C in equation A.25, and we get

I_1 ≥ C [1 − y_p (w^p · x_p + b^p)] − (C²/2) D²
    = C D² (a* − C/2)
    ≥ C D² (a*/2)
    = (C/2) [1 − y_p (w^p · x_p + b^p)]
    ≥ C/2.

The last inequality comes from equation A.18. Finally, we have

I_1 ≥ (1/2) min(C, 1/D²).
By combining this last inequality with equations A.24 and A.28, we prove the lemma.

A.3 Proof of Theorem 3. The proof follows the proof of lemma 3. Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, we can take

δ = γ = α⁰ − α^p,

as α⁰ − α^p is a vector satisfying simultaneously the sets of constraints A.20 and A.23. Then inequality A.24 becomes an equality:

I_1 = W(α⁰) − W(α^p) = I_2.  (A.29)

From inequality A.21, it follows that

I_2 ≤ I_2* = W(α⁰) − W(α⁰ − δ*),  (A.30)

where δ_i* = −y_i y_p α_p⁰ λ_i and λ is given by the definition of the span S_p (cf. equation 4.2). The computation of I_2 and I_2* is similar to the one involved in the proof of lemma 3 (cf. equation A.28):

I_2* = ((α_p⁰)²/2) S_p² − α_p⁰ [y_p (w⁰ · x_p + b⁰) − 1],
I_2 = ((α_p⁰)²/2) (Σ_i λ_i* x_i)² − α_p⁰ [y_p (w⁰ · x_p + b⁰) − 1],

where λ_i* = y_i y_p (α_i^p − α_i⁰) / α_p⁰. From equation A.30 we get

(Σ_i λ_i* x_i)² ≤ S_p².
Now note that Σ_{i≠p} λ_i* x_i ∈ Λ_p, and by the definition of S_p, (Σ_i λ_i* x_i)² ≥ S_p². Finally, we have

(Σ_i λ_i* x_i)² = S_p².  (A.31)
The computation of I_1 gives (cf. equation A.25)

I_1 = α_p⁰ [1 − y_p (w^p · x_p + b^p)] − ((α_p⁰)²/2) (Σ_i λ_i* x_i)².

Putting the values of I_1 and I_2 back into equation A.29, we get

(α_p⁰)² (Σ_i λ_i* x_i)² = α_p⁰ y_p [f⁰(x_p) − f^p(x_p)],

and the theorem is proved by dividing by α_p⁰ and taking into account A.31:

α_p⁰ S_p² = y_p [f⁰(x_p) − f^p(x_p)].

Acknowledgments
We thank Léon Bottou, Patrick Haffner, Yann LeCun, and Sayan Mukherjee for very helpful discussions.

References

Bartlett, P., & Shawe-Taylor, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 43–54). Cambridge, MA: MIT Press.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3 (in Russian).

Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Generalization bounds via eigenvalues of the Gram matrix (Tech. Rep. No. 99-035). NeuroCOLT.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Weston, J. (1999). Leave-one-out machines. In Advances in large margin classifiers. Cambridge, MA: MIT Press.

Received March 24, 1999; accepted September 24, 1999.
LETTER

Communicated by Geoffrey Goodhill

A Computational Model of Lateralization and Asymmetries in Cortical Maps

Svetlana Levitan
James A. Reggia
Departments of Computer Science and Neurology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, U.S.A.

While recent experimental work has defined asymmetries and lateralization in left and right cortical maps, the mechanisms underlying these phenomena are currently not established. In order to explore some possible mechanisms in theory, we studied a neural model consisting of paired cerebral hemispheric regions interacting via a simulated corpus callosum. Starting with random synaptic strengths, unsupervised (Hebbian) synaptic modifications led to the emergence of a topographic map in one or both hemispheric regions. Because of uncertainties concerning the nature of hemispheric interactions, both excitatory and inhibitory callosal influences were examined independently. A sharp transition in model behavior was observed depending on callosal strength. For excitatory or weakly inhibitory callosal interactions, complete and symmetric mirror-image maps generally appeared in both hemispheric regions. In contrast, with stronger inhibitory callosal interactions, partial to complete map lateralization tended to occur, and the maps in each hemispheric region often became complementary. Lateralization occurred readily toward the side having a larger cortical region or higher excitability. Asymmetric synaptic plasticity, however, had only a transitory effect on lateralization. These results support the hypotheses that interhemispheric competition occurs, that multiple underlying asymmetries may lead to function lateralization, and that the effects of asymmetric synaptic plasticity may vary depending on whether supervised or unsupervised learning is involved. To our knowledge, this is the first computational model to demonstrate the emergence of topographic map lateralization and asymmetries.

1 Introduction
Neural Computation 12, 2037–2062 (2000) © 2000 Massachusetts Institute of Technology

Topographic cortical maps are regions of the cerebral cortex that represent some aspect of the environment (visual space, body surface, etc.) in a topology-preserving fashion (Kaas, 1991; Udin & Fawcett, 1988). It has recently been demonstrated electrophysiologically that corresponding regions of topographic maps in primary sensory and motor cortex exhibit a rich range of patterns of individual lateralization and asymmetry, even
under synchronous bilateral stimulation. For example, in one recent study with more than 100 animals, it was found that somatosensory, auditory, and visual map asymmetries are almost always present in an individual animal's primary sensory cortex in the context of coarse bilateral stimuli (Bianki, 1993). These map asymmetries were classified as qualitative (complete lateralization), quantitative (partial lateralization), topographic (no lateralization but asymmetric spatial distributions or mosaic patterns), and quantitative-plus-topographic (lateralization and asymmetric spatial distribution). The dependence of asymmetries on callosal connectivity was shown by the fact that callosal sectioning generally caused a partial or complete loss of map asymmetries (Bianki, 1993). In another series of experiments with six adult squirrel monkeys, detailed primary motor cortex forelimb maps were found to be both larger and more complex in the hemisphere opposite the preferred hand (Nudo, Jenkins, Merzenich, Prejean, & Grenda, 1992). The causes of these map asymmetries are not known. In this article we describe a study with a recurrently connected neural model consisting of corresponding left and right hemispheric regions interacting via a simulated corpus callosum. Although the model hemispheric regions are not intended as realistic representations of any specific neurobiological system, they do have a spatial organization motivated by neocortical architecture, their interconnections are roughly homotopic as occurs biologically, and they self-organize using unsupervised (Hebbian) learning. We used this model to examine in theory how map formation could be affected by underlying asymmetries in cortical region size, excitability, plasticity, and so forth, and by variations in the assumed excitatory and inhibitory nature of interhemispheric callosal interactions.
The specific hemispheric region asymmetries we studied were motivated by experimental evidence that corresponding cortical regions can differ in their relative size (Geschwind & Levitsky, 1968), dendritic branching (Scheibel et al., 1985), neurotransmitter levels (Tucker & Williamson, 1984), and excitability to external stimuli (Macdonell et al., 1991). The variations in callosal influences were motivated by disagreements in the literature about whether interhemispheric interactions are predominantly excitatory or inhibitory (Berlucchi, 1983; Cook, 1986; Denenberg, 1983; Ferbert, Priori, & Rothwell, 1992; Kinsbourne, 1978; Meyer, Röricht, von Einsiedel, Kruggel, & Weindl, 1995). Biological callosal connections are presumably excitatory, as they arise primarily from pyramidal cells and synapse on contralateral pyramidal and stellate cells (Innocenti, 1986). However, the resultant transcallosal influences of one hemisphere on the other have often been argued to be inhibitory or competitive (Cook, 1986; Denenberg, 1983; Kinsbourne, 1978), based in part on the strong, prolonged inhibition that follows initial excitatory postsynaptic potentials (Toyama, Tokashiki, & Matsunami, 1969), and supported by recent transcranial magnetic stimulation studies indicating that activation of one cortical region can functionally inhibit the corresponding contralateral region (Ferbert et al., 1992; Meyer et al., 1995).
Our initial hypotheses were that map lateralization and asymmetries would arise from all of the underlying hemispheric asymmetries that we examined (for example, cortical region size, excitability, and plasticity) and that map lateralization and asymmetries would gradually occur as callosal connections became progressively more inhibitory. These expectations derive from results with an earlier computational study of lateralization involving supervised learning that did not examine map formation (Reggia, Goodall, & Shkuro, 1998). The actual situation proved to be more interesting, with only some factors causing persistent lateralization, and the existence of a sharp transition between callosal strengths leading to map asymmetries that can be demonstrated both analytically and by numerical simulations. These results show that lateralization can occur in the absence of an error signal and that differences in left and right activation levels may suffice to drive lateralization and asymmetries in the case of self-organizing maps.

2 Methods
We adapted the methods used in a previous model of a single-hemisphere self-organizing map (Sutton, Reggia, Armentrout, & D'Autrechy, 1994) to a two-hemisphere model.

2.1 The Model. The model consists of two cortical regions interconnected by a corpus callosum and receiving input connections from two-dimensional sensory surfaces, as shown in Figure 1. Each hemispheric region or cortical layer represents a small patch of cerebral cortex. The model cortices are two-dimensional, with individual elements representing cortical columns. These elements hexagonally tessellate the cortex, with each element having excitatory connections to its six nearest neighbors. Each cortical element connects by the corpus callosum to those elements lying within a certain radius R_cc of the element homotopic to it in the opposite hemisphere. As is often done in simulations of this sort to avoid edge effects, the opposite edges of a hemispheric region are connected (forming a torus). Two versions of the model were studied. In one version, two sensory areas connect by divergent connections to opposite cortical regions (see Figure 1a), so independent input stimuli can reach the model cortical regions. This corresponds to the typical afferent connectivity in important biological cortical regions (e.g., most primary somatosensory and visual cortex). In the second version, a single shared sensory surface connects to both hemispheric regions (see Figure 1b). This architecture minimizes the likelihood that asymmetries in the input stimuli themselves would lead to lateralization and asymmetries in maps (i.e., as opposed to intrinsic hemispheric asymmetries or callosal influences). This second version of the model can be related only to cortical regions receiving bilateral inputs, such as visual cortex for midline retinal areas (Stone, Leicester, & Sherman, 1973; Orban, 1984; Fendrich, Wessinger, & Gazzaniga, 1996), or to special experimental
Figure 1: The model of two interacting cortical regions. (a) Two sensory surfaces sending divergent inputs to the contralateral cortices. (b) The same sensory surface for both cortices.
situations where matched bilateral input stimuli are used (Bianki, 1993). In most of our simulations, the sensory surface(s) and both cortical sets have the same number of elements. Each sensory element sends divergent input to the corresponding cortical element in the contralateral cortex and its neighbors within a certain radius (R_L or R_R). We used activation and learning rules from previous studies of single-hemisphere map formation (Armentrout, Reggia, & Weinrich, 1994; Sutton
et al., 1994), modifying them for two hemispheres connected by a corpus callosum. A real-valued activation level a_i^L(t) is associated with the ith element of the left cortical region, a_i^R(t) with the right homotopic element, and a_i^S(t) with the corresponding element of the sensory surface. We give just the equations for the left hemisphere; those for the right are analogous. Activation a_i^L(t) is governed by

da_i^L/dt = in_i^{L+} (M − a_i^L) + (c_s + in_i^{L−}) a_i^L,  (2.1)

where c_s < 0 is a self-inhibition constant, M > 0 is the maximal activation level, and initial activation levels are all zero. With an inhibitory corpus callosum, the inputs are

in_i^{L+} = Σ_j c_ij^{LL} a_j^L + Σ_k c_ik^{LS} a_k^S,  (2.2)

in_i^{L−} = Σ_m c_im^{LR} a_m^R.  (2.3)

The sums in the excitatory input in_i^{L+} range over the six immediate neighbors j in the same cortical hemisphere and over the elements k in the sensory surface that send input to i. Index m in the inhibitory input in_i^{L−} ranges over elements in the opposite hemisphere within radius R_cc of the element homotopic to i. When the corpus callosum is excitatory, the callosal term Σ_m c_im^{LR} a_m^R is instead added into in_i^{L+}, and self-inhibition is increased to prevent excessive hemispheric activation using in_i^{L−} = −2.6K, where K is the callosal strength. Note that for the reasons given in section 1, K represents the resultant transcallosal effect and not the synaptic weight on the callosal connections themselves, which are excitatory. A "Mexican hat" pattern of lateral interactions occurs in biological cortex and is widely used in neural models of self-organizing maps (Kohonen, 1995). As in our earlier models, we achieve this by having each sensory element competitively distribute its output among the receiving cortical elements, and each cortical element among its nearest neighbors, using
c_ik^{LS} = c_p^L · w_ik^L (a_i^L + q) / Σ_n w_nk^L (a_n^L + q),   c_ij^{LL} = c_lf^L · (a_i^L + q) / Σ_n (a_n^L + q).  (2.4)

Here q is a small constant (0.00001), c_p^L and c_lf^L are, respectively, the input gain and lateral feedback constants for the left hemisphere, and the sums in the denominators are over all elements of one hemisphere connected to the given source of activation. Such competitive distribution of activation has repeatedly been shown in the past to produce Mexican hat activation patterns and topographic map formation similar to those produced by inhibitory connections (Reggia, D'Autrechy, Sutton, & Weinrich, 1992; Armentrout et al., 1994; Sutton et al., 1994).
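As an illustration only (our sketch, not the authors' code), the discrete-time update implied by equation 2.1 and the competitive distribution of equation 2.4 can be written as follows; one property of the competitive rule is that each sensory element's total output across the receiving cortical elements is fixed at c_p, regardless of the activation pattern:

```python
import numpy as np

def activation_step(a, in_plus, in_minus, cs=-2.0, M=3.0, dt=0.1):
    """One Euler step of equation 2.1:
    da/dt = in_plus * (M - a) + (cs + in_minus) * a."""
    return a + dt * (in_plus * (M - a) + (cs + in_minus) * a)

def competitive_weights(w_col, a, c_p=1.0, q=1e-5):
    """Competitive distribution (equation 2.4): sensory element k
    distributes its output among cortical elements i in proportion
    to w_ik * (a_i + q).  `w_col` holds the weights w_ik for one k."""
    num = w_col * (a + q)
    return c_p * num / num.sum()
```

With constant excitatory input in⁺ = 1 and the baseline c_s = −2, M = 3, the activation settles at the fixed point a* = in⁺M/(in⁺ − c_s) = 1, well below M; and the competitive weights for one sensory element always sum to c_p, with the most active cortical element receiving the largest share.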
Table 1: Baseline Parameter Values.

  Symbol            Meaning                                    Baseline Value
  size              Number of elements in a cortical set       16 × 16
  K_LR, K_RL        Callosal strengths from R to L and back    Varies
  K                 Callosal strength when K_LR = K_RL = K     Varies
  c_p^L, c_p^R      Input gain in L and R                      1.0
  c_lf^L, c_lf^R    Lateral feedback in L and R                0.6
  c_s               Self-inhibition constant                   -2.0
  M                 Maximal activation constant                3.0
  R_cc              Radius of corpus callosum connections      5
  R_L, R_R          Radii from S to L and R                    3
  ε_L, ε_R          Learning rates for L and R                 0.01

Note: L and R denote left and right cortex; S denotes sensory surface.
Intercortical connections work similarly. Let the callosal strength to the left from the right hemisphere be K_LR, and to the right from the left K_RL, using just K when these are the same, as is usually the case. For inhibitory callosal influences (K < 0), we use

c_im^{LR} = K_LR (Σ_n a_n^L + q)⁻¹

to normalize the total callosal signal. For an excitatory corpus callosum (K > 0), the influence of the opposite hemisphere is distributed competitively, with K_LR playing the role of c_p^L in equation 2.4:

c_im^{LR} = K_LR (a_i^L + q) / Σ_n (a_n^L + q).  (2.5)

Self-organization is based on an unsupervised learning rule

Δw_ik^L = ε_L (a_k^S − w_ik^L) a_i^L,

where ε_L is the left hemisphere learning rate. Learning occurs on the connections from the sensory surfaces to the cortical regions. Weights on these connections are initialized with random values between 0.00001 and 1.0. Sensory stimuli in the form of random seven-element hexagonal patches of activation are applied to the two sensory surfaces either independently or symmetrically to drive the self-organization of the model.

2.2 Experimental Methods. Using the two-hemispheric-region models described above, we performed a series of simulations, systematically varying parameters and studying the effects on map lateralization and map symmetry as a function of callosal strength K. For a baseline parameter set, we used the values shown in Table 1. We altered one parameter at a time and studied the resulting lateralization for callosal strengths K ranging from −4
(strongly inhibitory) to +3 (strongly excitatory). For maximal comparability, the same seed for random weight generation was used in most runs. This seed was chosen as causing the least lateralization in the symmetric case in numerous simulations with different seeds. However, all major findings in this article were verified by additional simulations with several different seeds. We ran simulations using both versions of the model (see Figures 1a and 1b) and found the qualitative results in most cases to be similar, indicating that the independence and symmetry of the input stimuli had only a secondary effect on results. For this reason and for brevity, we primarily describe the results obtained with the single-input-layer model of Figure 1b, noting a few differences with the dual-input-layer version (see Figure 1a) where appropriate. (For the effects of systematically varying input stimuli symmetry and independence with the dual-input version, see Levitan, Stoica, & Reggia, 1999.) In addition to visual inspection of the resulting maps, we measured the degree of cortical map organization, lateralization, and mirror symmetry using metrics described in Alvarez, Levitan, and Reggia (1998). Although objective in nature, these measures have been shown to correlate fairly well with corresponding subjective estimates. Briefly, the measures are:

Organization measure. Indicates the degree of topographic map formation in a single hemispheric region on a 0 (no map forms) to 1 (nearly perfect map) scale.

Lateralization measure. Ranges from −1, indicating complete map formation on the left and no map on the right, to 1, indicating the opposite; a nonzero value indicates that one map represents a larger total sensory area than the other (i.e., there is "quantitative asymmetry"; Bianki, 1993).

Mirror symmetry measure. Ranges from 1 for a pair of cortical maps covering the same regions in the sensory surface, to −1 for complementary maps covering different regions. In other words, this measure indicates the extent to which the two maps are mirror images of one another, irrespective of their overall lateralization, with a negative value implying that there is "topographic asymmetry" (Bianki, 1993).

These three measures are defined in appendix A and illustrated below.

3 Results
We first describe the general nature of the maps observed during simulations and then the results of simulations in which callosal strengths were systematically varied.

3.1 General Nature of Maps Observed. Figure 2 shows representative examples of the maps observed during simulations. Each part of Figure 2 presents a pair of images indicating the topographic maps for a pair of
Figure 2: Pairs of cortical maps a–i are equal sized and have one sensory region: (a, b, c) Unorganized, pretraining. (d) Organized and symmetric. (e, f) Complementary mosaics. (g, h, i) Asymmetric: left more organized than right. (j) The left hemispheric region has more elements and is better organized. (k, l) Similar maps produced by the model with two sensory regions when the training inputs were independent (k) or symmetric (l).
corresponding cortical regions like those illustrated in Figure 1. A standard, widely used approach to representing map organization is employed in Figure 2 (Kohonen, 1995). Specifically, each of the 24 pictures in Figure 2 plots
the centers of the receptive fields of a cortical region in the space of the sensory surface (i.e., it is not a picture of the cortical regions involved). For example, the vertices (nodes) in the left picture in Figure 2h represent the centers of the receptive fields of the left cortical region plotted in the space of the sensory surface in Figure 1b. Each pair of receptive field centers (vertices) of nearest-neighbor cortical elements is connected by a line segment, so there are six line segments for each vertex. The entire grid in Figure 2h shows that a fairly organized map is present in the left cortical region (i.e., the sensory surface projects in a smooth fashion onto the two-dimensional cortex surface). In contrast, the right hemispheric region is fairly disorganized in this case (see the right part of Figure 2h). Also, in each picture in Figure 2, each vertex is encircled by a small ellipse centered on the vertex whose axes are proportional to the relative x and y diameters of the receptive field. These ellipses do not show the absolute size of the receptive fields (the latter being larger and overlapping), but instead are simplified and uniformly reduced-size representations that indicate only the relative sizes of the receptive fields (Sutton et al., 1994). Along with regular receptive field locations, small receptive fields (e.g., Figure 2d) generally indicate highly organized map regions, while larger ones (e.g., Figure 2a) indicate poor map formation. The maps in Figure 2 involve equal-size cortical regions (both 16 × 16) except for Figure 2j, where the right cortical region is smaller (12 × 12). The smaller right region in Figure 2j is evident in that more vertices are plotted in the left picture. Inspection of these pairs of maps shows that in some cases, maps are disorganized bilaterally, as in Figures 2a to 2c. In these cases, the mirror symmetry measure is taken to be undefined because there are no organized map regions whose overlap can be measured (Alvarez et al., 1998).
In other cases, the maps are well organized bilaterally (see Figure 2d) or on just one side (see Figures 2g and 2h). Finally, in some cases, maps may form mosaic patterns that divide up the sensory surface in complementary ways, either in roughly equal (see Figures 2e and 2l) or quite unequal shares (see Figures 2f, 2i, and 2j). Table 2 gives the values of the quantitative measures described earlier that we use for the maps shown in Figure 2. For example, maps in Figures 2a through 2d show increasing individual hemispheric region organization, while in each case the lateralization measure does not exceed 0.03, indicating that significant lateralization has not occurred. Significant lateralization is measured with many of the other map pairs shown and is of greatest magnitude for Figure 2g (lateralization = −0.72). The two maps in Figure 2d are mirror images of each other (mirror symmetry = 1.0), and those in Figures 2e, 2f, 2i, 2j, and 2l are complementary, representing largely disjoint portions of the sensory surface (mirror symmetry = −1.0). Figure 2e illustrates that although the lateralization measure is roughly zero (both maps being organized overall the same amounts), there can still be "lateralization of functionality" in the sense that each hemispheric region is organized for a different portion of the sensory surface (mirror symmetry = −1.0).
Svetlana Levitan and James A. Reggia

Table 2: Values of Quantitative Measures for the Maps in Figure 2.

Map   Left Organization   Right Organization   Lateralization   Mirror Symmetry
a     0.01                0.01                 0.00             Undefined
b     0.25                0.28                 0.03             Undefined
c     0.59                0.61                 0.02             Undefined
d     0.98                0.99                 0.01             1.0
e     0.62                0.64                 0.02             −1.0
f     0.47                0.78                 0.31             −1.0
g     0.81                0.09                 −0.72            −0.81
h     0.96                0.32                 −0.64            −1.0
i     0.82                0.29                 −0.53            −1.0
j     0.83                0.37                 −0.46            −0.99
k     0.39                0.57                 0.18             −0.94
l     0.52                0.58                 0.06             −1.0
3.2 Symmetric Hemispheric Regions. We first undertook simulations using a symmetric version of the model where the two hemispheric regions were identical except for the initial random weights. Complete topographic maps similar to those in Figure 2d form in both hemispheres when corpus callosum influences are excitatory, absent, or weakly inhibitory. However, as callosal inhibition becomes stronger, there is a sharp transition at roughly K = −1.4 from two complete maps to two complementary mosaic pattern maps similar to those in Figure 2e. Figure 3 shows how the organization, lateralization, and symmetry measures vary as the callosal strength K is systematically altered. Since the organization values on the left and right are essentially the same, lateralization is very close to zero for all values of K. As seen in Figure 3a, initial organization values are close to 0 for K < −1, but grow noticeably for positive K. This happens because inhibitory callosal connections "push" the receptive fields of corresponding cortical elements away from each other and hence from their ideal location in a perfectly organized map, while excitatory callosal connections "pull" the receptive fields of corresponding cortical elements closer to each other and to their ideal locations, thus making the pretraining maps look more organized. Figures 2a through 2c show pretraining maps for K = −2, K = 0, and K = 2 in the symmetric case.

3.3 Asymmetric Cortical Excitability. Asymmetric cortical excitability has been associated experimentally with functional lateralization (Macdonell et al., 1991) and regionally may be implied by asymmetries in various neurotransmitter levels (Tucker & Williamson, 1984). In our model, the learning rule is such that the higher the activation is in one hemisphere, the faster it should self-organize. Conversely, a hemisphere that gets very little
Figure 3: Symmetric hemispheric regions (except for random initial weights). (a) Organization of left and right maps as a function of callosal strength K. Black lines, after training; gray lines, before training; solid lines, right hemisphere; dashed lines, left hemisphere. (b) Activation in the two hemispheres. (c) Lateralization and mirror symmetry after training. Solid line, lateralization; dashed line, mirror symmetry.
or no activation cannot effectively learn a good topographic map. Several parameters of our model can cause different excitability in the two hemispheres. Figures 2h and 2i show sample maps resulting from a small difference in input gain or lateral feedback (in both cases, the left hemisphere is more active). Figure 4 summarizes the results of simulations when the two hemispheres had only a slightly different input gain favoring the left: cpL = 1.05 and cpR = 1.0. For approximately K > −1.2, symmetric highly organized maps always formed without lateralization (similar to Figure 2d). For K < −1.2, however, strong lateralization to the left always occurred, usually with mosaic patterns like those in Figure 2i. Note that activation on the left is much higher than on the right, which is the main reason for the lateralization. Similar results were obtained with mildly asymmetric maximal activation constants ML and MR, and with small asymmetries in lateral feedback strengths cLlf and cRlf. Small differences in the strength of corpus callosum inhibition also caused significant lateralization. With fixed KLR = −2, for example, when KRL < KLR complete lateralization to the left occurred, and for KRL > KLR, the opposite occurred. Results for independent, asymmetric training input stimuli with the dual input version (see Figure 1a) showed somewhat less lateralization than other cases.

3.4 Explaining the Sharp Transition in Model Behavior. In the simulations, a sharp transition occurs in model behavior at a specific callosal strength (roughly K = −1.4 in the symmetric case, K = −1.2 in the asymmetric excitability case). For K above this strength, symmetric, nonlateralized, and highly organized maps occur, while for K below it, lateralized and asymmetric mosaic pattern maps occur. What causes this transition or bifurcation in model behavior? We show that a change in the dynamics of total hemispheric activations occurs near K = −1.4 and then explain how it leads to asymmetry in map formation.
First, consider the difference in total activation of the left hemispheric region L versus the right R in the model. The activation dynamics described by equations 2.1 through 2.5 are highly nonlinear and difficult to analyze. However, as in Reggia and Edwards (1990), a "linearized" version of these equations gives insight into the model's dynamics. Consider the equations

daLi/dt = cs aLi + in+Li − in−Li,   (3.1)

where aLi is the activation of element i in L, and in+Li and in−Li are the excitatory and inhibitory inputs to that element given by equations 2.2 to 2.5. By algebraic manipulation, for j ∈ L, k ∈ S, m ∈ R, we have

Σi∈L cijLL = cLlf,   Σi∈L cikLS = cpL,   Σi∈L cimLR = KLR.   (3.2)

Analogous equations hold for the right hemispheric region R.
Figure 4: Results with asymmetric input gain (cpL = 1.05, cpR = 1.0). Same notation as in Figure 3. The left hemisphere clearly dominates for roughly K < −1.2; its posttraining organization is higher than the organization on the right.
Consider a fixed input pattern. Denote the total activation in the left hemisphere by AL, that in the right by AR, and that in the sensory surface by AS. Adding together the equations 3.1 for all elements i in the left hemisphere and using equations 2.2, 2.3, and 3.2, we obtain

dAL/dt = cs AL + cLlf AL + cpL AS + KLR AR   (3.3)

and a similar equation for the right hemisphere. In the symmetric case, where cLlf = cRlf = clf, cpL = cpR = cp, and KLR = KRL = K, subtracting dAR/dt from dAL/dt we have

d(AL − AR)/dt = (cs + clf − K)(AL − AR).   (3.4)
This indicates that when K < cs + clf, an initial difference in the activation levels of the two hemispheres will grow exponentially with time, while for K > cs + clf, this difference will decay exponentially. With our baseline parameters, this change is expected at K = cs + clf = −1.4, precisely where the transition occurs in the symmetric case simulations (see Figure 3). A more involved analysis in appendix B shows that similar changes in the asymptotic behavior of total hemispheric activations occur in a more general case for slightly larger K (e.g., when the two hemispheres have different excitation). How does this change in activation growth affect symmetry and lateralization? With random initial weights, an input pattern quickly causes slightly different initial levels of activation in the two hemispheres. With time, this initial difference will either disappear (for K > −1.4) or grow (for K < −1.4). In the former case, both hemispheres will have nearly equal activation levels by the time a learning step occurs, so they will "learn" the input at about the same speed (assuming equal learning rates), hence eventually developing complete and symmetric maps. In the latter case, by the time of learning, one hemisphere may be largely inactive while the other is highly active, so the input is more effectively learned by the more active hemisphere, resulting in local lateralization of map formation. Recalling that the hemispheric regions are homotopically connected, this competition occurs locally between mirror-image sections of the cortex. Thus, a small part of the map can become better organized on one side than the other, and this may spread by a "percolation process." After a sufficiently long training period, a mosaic pattern is therefore likely to occur for K < −1.4.
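The prediction of equation 3.4 can be checked numerically. The sketch below (a minimal illustration, not the authors' simulation code; the full model adds nonlinearities and self-inhibition omitted here) integrates the linearized symmetric system of equation 3.3 by forward Euler with the baseline lumped value cs + clf = −1.4 and shows the activation difference decaying for K above −1.4 and growing for K below it:

```python
def simulate_linearized(K, cs_plus_clf=-1.4, cp=1.0, AS=7.0,
                        A0=(1.0, 0.9), dt=0.001, T=2.0):
    """Forward-Euler integration of the linearized symmetric dynamics
    (equation 3.3): dAL/dt = (cs + clf) AL + cp AS + K AR, and
    symmetrically for AR. cp, AS, A0, dt, T are illustrative choices."""
    AL, AR = A0
    for _ in range(int(T / dt)):
        dAL = cs_plus_clf * AL + cp * AS + K * AR
        dAR = cs_plus_clf * AR + cp * AS + K * AL
        AL, AR = AL + dt * dAL, AR + dt * dAR
    return AL, AR

# Per equation 3.4, the difference AL - AR evolves with rate (cs + clf - K):
# it decays for K > -1.4 and grows for K < -1.4.
diff_stable = abs(simulate_linearized(0.0)[0] - simulate_linearized(0.0)[1])
diff_unstable = abs(simulate_linearized(-2.0)[0] - simulate_linearized(-2.0)[1])
```

Starting from an initial difference of 0.1, the run with K = 0 shrinks the difference by roughly e^(−2.8), while K = −2 amplifies it by roughly e^(+1.2) over the same interval, mirroring the sharp transition seen in the simulations.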
For symmetric hemisphere parameters, by chance each hemisphere would be expected to be dominant on about the same number of input patterns as the other, so complementary maps, but almost no lateralization, are expected. With asymmetric cortical excitability, we would expect larger total activation more often in the more active hemisphere, so that hemisphere will develop a map covering a larger part of the sensory surface, which leads to lateralization.
Figure 5: Results for the two hemispheric regions with different numbers of elements (left larger), when the initial activations are not equal (a–c) and when they are equal due to adjustment of callosal connections (d–f). Same notation as Figure 3.
3.5 Asymmetric Hemispheric Sizes. Experimentally measured differences in hemispheric region sizes have been reported (e.g., Geschwind & Levitsky, 1968), although their nature and significance have been questioned (Loftus et al., 1993). To examine how asymmetric hemispheric region size influences map organization, we ran simulations with the two hemispheric regions having different numbers of elements: the left having 16 × 16 elements and the right 12 × 12, for a total of 256 versus 144 elements. Figures 5a to 5c show the results as callosal strength is varied. For excitatory, absent, or
mildly inhibitory callosal strengths (roughly K > −1.2), both hemispheres formed highly organized and symmetric maps. For strongly inhibitory callosal strengths (K < −1.2), marked map lateralization occurred, with a much better organized map in the larger left region and with complementary maps. Figure 2j shows a typical example of the maps found under these latter conditions. Figure 5b shows quite clearly that when inhibition is strong (roughly K < −1.2), the initial activation on the left was much higher than on the right. To remove this potentially biasing factor, callosal inhibition strength was adjusted so that the inhibition from left to right was slightly weaker than the inhibition from right to left. A roughly 5% difference produced nearly equal initial activations on both sides for K < −1. The effect of this adjustment is demonstrated in Figures 5d through 5f. Lateralization is smaller, but is still present for sufficiently strong inhibition (K < −1.5), and the maps formed under these conditions remain complementary.
3.6 Asymmetric Learning Rates. Another possible cause of lateralization is a difference in synaptic plasticity in the two hemispheres. The biological existence of asymmetric plasticity is suggested by asymmetric hemispheric neurotransmitters (Tucker & Williamson, 1984) and directly indicated by asymmetric synaptogenesis during development and early life (Bates, Thal, & Janowsky, 1992). We ran simulations where all parameters of the two hemispheres were the same except the initial random weights and the learning rates (left 0.01, right 0.001). Figures 6 and 7 display simulation results for brief (16,000 inputs) and long-term (123,000 inputs) training, showing that initially the hemisphere with the higher learning rate organizes faster. However, after sufficiently long training, the difference in organization values becomes much smaller. For this type of simulation, much more transient lateralization occurs with excitatory and weak inhibitory callosal connections than with strongly inhibitory ones. For example, Figure 8 shows that for K = −3, both hemispheres learn very slowly, while for K = 0, the one with the higher learning rate learns much faster than the other. These results differ from those with the asymmetries examined above in that the lateralization that occurs is more pronounced for K > −1.4. Given the asymmetry in learning rates, transient lateralization to the left would generally be expected for all values of K: the larger left learning rate would cause its map to organize more quickly, but the slower right side would eventually catch up. Lateralization does not occur with strongly inhibitory callosal connections here because the higher learning rate on the left is largely cancelled by simultaneously lower mean activation levels on the left that slow learning (see Figure 7).
Figure 6: Lateralization as a function of K with asymmetric learning rates, shown before training, after 16,000 inputs, and after 123,000 inputs. Lateralization is significant for K > −1 after brief training, but subsequently essentially vanishes.
4 Discussion
To our knowledge, this study is the first use of a neural model to demonstrate emergent lateralization and asymmetries in topographic maps. Although there have been previous neural models of hemispheric interactions and information transfer (Anninos, Argyrakis, & Skouras, 1984; Anninos & Cook, 1988; Cook & Beech, 1990; Ringo, Doty, Demeter, & Simard, 1994), they did not examine lateralization or map self-organization (summarized in Reggia et al., 1998). We recently studied lateralization in two different neural models consisting of paired cerebral hemispheric regions interacting via the corpus callosum. In one study, a model was trained to generate the correct sequence of phonemes for 50 monosyllabic words (Reggia et al., 1998), and in the other it was trained to classify a small set of letters presented in the left, central, or right visual fields (Shevtsova & Reggia, 1999). Both models used supervised learning. Lateralization occurred readily in these two models toward the larger, more excitable, or more plastic side and appeared most readily with strongly inhibitory callosal influences. A previous mixture-of-experts model using supervised learning has also demonstrated lateralization in processing spatial relations with asymmetric receptive field sizes (Jacobs &
Figure 7: Results for the two hemispheres with different learning rates after about 16,000 (a–c) and 123,000 (d–f) training inputs. Solid lines in (a, b, d, e) correspond to the right hemisphere with learning rate 0.001, the dashed lines to the left hemisphere with learning rate 0.01. After 16,000 inputs the left hemisphere is dominant for K > −1.4. After very long training, the two hemispheres achieve about the same organization levels. Pretraining organization and activation are as in Figure 3.
Figure 8: Organization with different learning rates during training (learning curves). One frame corresponds to 4096 inputs. Solid black line, right hemisphere; dashed line, left hemisphere (with higher learning rate). The thick gray line shows the difference between organizations in the two hemispheres. For K = −3 the difference is small. For larger values of K, it first becomes large, but vanishes after longer training.
Kosslyn, 1994). The map formation model we describe in this article differs substantially from these earlier models. Our simulations here examine self-organization of maps using solely unsupervised learning rather than supervised learning to acquire a skill (phoneme generation, letter classification). We have thus examined whether lateralization can arise in the absence of error signals. Although our current model is intended only as an abstract representation of interacting cortical regions, and not of any specific biological system, the results provide some insight into how underlying cortical asymmetries and callosal influences can interact to cause hemispheric map asymmetries. Our results can be summarized as follows. First, when callosal connections were absent (K = 0), in every case complete, mirror-symmetric maps ultimately formed in both hemispheric regions without significant lateralization. This finding is consistent with experimental data suggesting that lateralization and complementary mosaic maps arise due to hemispheric interactions (Bianki, 1993). Second, a sharp transition in model behavior was observed depending on callosal strength. For excitatory, absent, or weakly inhibitory callosal strengths, complete and symmetric mirror-image maps typically appeared in both hemispheric regions. In contrast, with stronger inhibitory callosal influences, partial to complete map lateralization tended to occur, and the maps in each hemispheric region often became complementary, resembling mosaic patterns observed experimentally (Bianki, 1993). These results, along with those of the earlier phoneme-sequencing model, provide support for the long-standing hypothesis that the corpus callosum plays a predominantly functionally inhibitory role (Kinsbourne, 1978; Cook, 1986). A third result of this work is the finding that lateralization occurred readily toward the side having a larger cortical region or higher excitability.
These factors produced lateralization more or less independently. This suggests that biological lateralization may be a multifactorial process, consistent with our earlier lateralization models noted above and with arguments that current experimental data do not support the concept of a single underlying factor causing human behavioral lateralization (Hellige, 1993). These results are also consistent with the general concept that initially small quantitative differences between hemispheric regions can ultimately give rise to qualitative differences. They demonstrate that lateralization and map asymmetries can arise from activation dynamics and unsupervised learning in the absence of error signals. Finally, we observed that asymmetric synaptic plasticity had only a transitory effect on lateralization of map formation. In contrast, asymmetric plasticity in our earlier phoneme-sequencing model led to marked, persistent functional lateralization, while in the letter classification model, mild but persistent lateralization occurred. The difference occurs because map formation here is based solely on unsupervised learning, whereas it was controlled by an error-correction process in the earlier models (e.g., in the phoneme-
sequencing model, once one hemisphere learned to control phoneme sequencing, the error dropped to zero and the other hemisphere stopped learning). The intriguing possibility suggested by these results is that for biological lateralization, the effects of asymmetric synaptic plasticity may vary in part depending on whether supervised or unsupervised learning is involved.

Appendix A: Quantitative Map Measures
The measures for map organization, lateralization, and mirror symmetry used here are from Alvarez et al. (1998) and are computed as follows. For each cortical element n, a map M is described by the offsets of its receptive field center (cx(n), cy(n)) from the ideal position in a perfectly organized map and by the receptive field radii (rx(n), ry(n)). The "differential square distance" between two immediate neighbors n and n′ is computed as

|(n, n′)|² = (cx(n′) − cx(n))² + (cy(n′) − cy(n))² + (rx(n′) − rx(n))² + (ry(n′) − ry(n))².

An organization measure ‖M‖ for map M is then computed as

‖M‖ = (1/N) Σ over all M-nodes n of σt,s( (1/2) Σ over all neighbors n′ of n of |(n, n′)|² ),

where σt,s(x) = (1 + e^(−2s(x−t)))^(−1). Parameters t and s were chosen to approximate human estimates of organization values for various maps (Alvarez et al., 1998). A simple difference of the two organization values is a good lateralization measure: Lat(L, R) = ‖R‖ − ‖L‖. When ‖R‖ = 1 and ‖L‖ = 0, Lat(L, R) = +1; for ‖R‖ = 0 and ‖L‖ = 1, Lat(L, R) = −1; and for ‖R‖ = ‖L‖, the lateralization is 0. The degree of mirror symmetry is estimated using a measure of the overlap of the organized regions oreg(L) in the left map L and oreg(R′) in the mirror image R′ of the right map R. The areas of the intersection oreg(L) ∩ oreg(R′), union, and symmetric difference oreg(L) Δ oreg(R′) of these regions are used to define

map overlap(L, R) = [area(oreg(L) ∩ oreg(R′)) − area(oreg(L) Δ oreg(R′))] / area(oreg(L) ∪ oreg(R′)).

The intersection term measures the overlap of the organized regions, while the symmetric difference term measures the discrepancy between these regions. When no organized regions exist, map overlap is undefined.
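As a concrete reading of these definitions, the sketch below computes the organization and lateralization measures for a toy map. The map representation (a dict of per-node offsets and radii) is our own illustrative choice, and the threshold t and slope s are placeholders rather than the values Alvarez et al. (1998) fitted to human judgments; a negative s makes the sigmoid decrease with distance so that a perfectly organized map scores near 1:

```python
import math

def diff_sq_distance(m, n1, n2):
    """Differential square distance |(n, n')|^2 between neighbor nodes,
    from center offsets (cx, cy) and radii (rx, ry)."""
    cx1, cy1, rx1, ry1 = m[n1]
    cx2, cy2, rx2, ry2 = m[n2]
    return ((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
            + (rx2 - rx1) ** 2 + (ry2 - ry1) ** 2)

def organization(m, neighbors, t=0.5, s=-4.0):
    """||M||: sigmoid of each node's halved summed neighbor distances,
    averaged over all nodes. t and s are illustrative placeholders."""
    sigma = lambda x: 1.0 / (1.0 + math.exp(-2.0 * s * (x - t)))
    return sum(
        sigma(0.5 * sum(diff_sq_distance(m, n, n2) for n2 in neighbors[n]))
        for n in m
    ) / len(m)

def lateralization(org_left, org_right):
    """Lat(L, R) = ||R|| - ||L||, ranging from -1 to +1."""
    return org_right - org_left
```

With zero offsets and equal radii everywhere (a perfect map), organization approaches 1; with large offsets it approaches 0, so the lateralization of a perfect right map against a disorganized left map approaches +1.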
Appendix B: Model Dynamics
Consider the general linearized case (see equation 3.1) where the hemispheric regions may be asymmetric. Denote cs + cLlf = cL, cs + cRlf = cR, cpL AS = BL, and cpR AS = BR, giving

dAL/dt = cL AL + KLR AR + BL,
dAR/dt = KRL AL + cR AR + BR.   (B.1)

This is a system of two linear differential equations with coefficient matrix

( cL   KLR )
( KRL  cR  ),

whose trace is cL + cR (negative in our model) and whose determinant is cL cR − KRL KLR; it thus has a unique fixed point unless KRL KLR = cL cR. The fixed point, which is asymptotically stable when KRL KLR < cL cR and asymptotically unstable when KRL KLR > cL cR, is

A*L = (−cR BL + KLR BR) / (cL cR − KLR KRL),
A*R = (−cL BR + KRL BL) / (cL cR − KLR KRL).

In the symmetric case with baseline parameter values cL = cR = −1.4, BL = BR = 7.0, and KLR = KRL = K, we have A*L = A*R = 7/(1.4 − K), so for K = 1.4 activation grows without bound in the linearized model. However, the additional self-inhibition that we use for K > 0 in simulations prevents this problem. As stated earlier, for positive K, instead of cL = cR = −1.4 we use cL = cR = −1.4 − 2.6K, which makes the behavior of the fixed point smooth for positive K (although it is not differentiable at K = 0): A*L = A*R = 7/(1.4 + 1.6K). This fixed point remains asymptotically stable for all K > 0, facilitating symmetric map formation in both hemispheres. Thus, in the symmetric case, the fixed point remains bounded and is asymptotically stable for all K > −1.4 and asymptotically unstable for K < −1.4. In asymmetric cases, the coordinates of the fixed point behave more interestingly. For different input gain constants (cpL = 1.05 makes BL = 7.35, but cL = cR), the difference A*L − A*R = (BL − BR)/(K − cL) remains quite small for K > −1 but gets very large as K approaches cL = −1.4. Different lateral feedback has a similar effect, except that the bifurcation point changes slightly: when cL = −1.3, the fixed point becomes unstable for a larger K (K ≈ −1.349). The difference of the total activations at the fixed point is then given by A*L − A*R = BL(cL − cR)/(cL cR − K²), which is small for K > −1 and also goes up sharply near the bifurcation point. Figure 9 shows the dependence of the fixed point on K for various cases. Here the additional self-inhibition for positive K is taken into account. Finally, asymmetric excitability actually causes two transitions in the model's dynamics. While the coordinates of the asymptotically stable fixed point remain close to each other (for K ≥ −1), we have a situation similar to the symmetric case for K > −1.4, so complete symmetric maps form.
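The fixed point and its stability can be computed directly from (B.1). The small helper below is our own sketch (not the authors' code); with the trace cL + cR negative, as in the model, stability reduces to the sign of the determinant:

```python
def fixed_point_B1(cL, cR, BL, BR, KLR, KRL):
    """Fixed point of system (B.1), dAL/dt = cL*AL + KLR*AR + BL and
    dAR/dt = KRL*AL + cR*AR + BR, plus its stability. Assuming trace
    cL + cR < 0, the point is asymptotically stable iff det > 0."""
    det = cL * cR - KLR * KRL
    if det == 0:
        raise ValueError("degenerate system: KRL*KLR == cL*cR")
    AL = (-cR * BL + KLR * BR) / det
    AR = (-cL * BR + KRL * BL) / det
    return AL, AR, det > 0
```

For the symmetric baseline (cL = cR = −1.4, BL = BR = 7.0), K = 0 gives A*L = A*R = 7/1.4 = 5.0 and a stable fixed point, while K = −2 flips the sign of the determinant (1.96 − 4 < 0), matching the instability for K < −1.4 described above.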
Figure 9: Fixed point of system (B.1) in (a) the symmetric case and (b) with asymmetric input sensitivities or lateral feedback coefficients. Dashed line, total activation for the left hemisphere; solid line, for the right. In the symmetric case, the values are equal; in the asymmetric case, they are close except near K = −1.4.
But when the asymptotically stable fixed point coordinates have different orders of magnitude (i.e., K between −1.2 and −1.4), the activation levels in the two hemispheres will differ significantly (with the left hemisphere always much more active), and most learning will occur in the left hemisphere. Thus, the posttraining organization level on the left stays close to 1, while on the right it drops sharply. This accounts for the "bumps" in the
posttraining activation levels in Figure 4b and the sharp changes in lateralization and mirror symmetry in Figures 4c and 4d. Finally, at K < −1.4, the fixed point becomes asymptotically unstable, and any initial difference in the activations of the two hemispheres will grow exponentially (as in the symmetric case for K < −1.4). However, due to the asymmetry in excitability, the left hemisphere is more likely to have higher initial activation. This facilitates complementary map formation favoring the more active hemisphere.

Acknowledgments
This work was supported by NINDS award NS35460. We thank Ioana Stoica for help with the implementation of some simulations.

References

Alvarez, S., Levitan, S., & Reggia, J. (1998). Metrics for cortical map organization and lateralization. Bull. Math. Biol., 60, 27–47.
Anninos, P. A., Argyrakis, P., & Skouras, A. (1984). A computer model for learning processes and the role of the cerebral commissures. Biol. Cybern., 50, 329–336.
Anninos, P., & Cook, N. (1988). Neural net simulation of the corpus callosum. Intl. J. Neurosci., 38, 381–391.
Armentrout, S., Reggia, J., & Weinrich, M. (1994). A neural model of cortical map reorganization following a focal lesion. Artificial Intelligence in Medicine, 6, 383–400.
Bates, E., Thal, D., & Janowsky, J. (1992). Early language development and its neural correlates. In S. Segalowitz & I. Rapin (Eds.), Handbook of neuropsychology, 7 (pp. 69–110). Amsterdam: Elsevier.
Berlucchi, G. (1983). Two hemispheres but one brain. Behav. Brain Sci., 6, 171–173.
Bianki, V. (1993). The mechanism of brain lateralization. Philadelphia: Gordon & Breach.
Cook, N. (1986). The brain code. London: Methuen.
Cook, N., & Beech, A. (1990). The cerebral hemispheres and bilateral neural nets. Int. J. Neurosci., 52, 201–210.
Denenberg, V. (1983). Micro and macro theories of the brain. Behav. Brain Sci., 6, 174–178.
Fendrich, R., Wessinger, C. M., & Gazzaniga, M. S. (1996). Nasotemporal overlap at the retinal vertical meridian: Investigations with a callosotomy patient. Neuropsychologia, 34, 637–646.
Ferbert, A., Priori, A., & Rothwell, J. C. (1992). Interhemispheric inhibition of the human motor cortex. J. Physiol., 453, 525–546.
Geschwind, N., & Levitsky, W. (1968). Left-right asymmetries in temporal speech region. Science, 161, 186–187.
Hellige, J. (1993). Hemispheric asymmetry. Cambridge, MA: Harvard University Press.
Innocenti, G. (1986). General organization of callosal connections in the cerebral cortex. In E. Jones & A. Peters (Eds.), Cerebral cortex, 5 (pp. 291–353). New York: Plenum.
Jacobs, R., & Kosslyn, S. (1994). Encoding shape and spatial relations. Cognitive Science, 19, 361–386.
Kaas, J. (1991). Plasticity of sensory and motor maps in adult mammals. Ann. Rev. Neurosci., 14, 137–167.
Kinsbourne, M. (Ed.). (1978). Asymmetrical function of the brain. New York: Cambridge University Press.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Levitan, S., Stoica, I., & Reggia, J. A. (1999). A model of lateralization and asymmetries in cortical maps. In Proceedings of IJCNN99. Washington, DC.
Loftus, W., Tramo, M., Thomas, C., Green, R., Nordgren, R., & Gazzaniga, M. (1993). Three-dimensional quantitative analysis of hemispheric asymmetry in the human superior temporal region. Cerebral Cortex, 3, 348–355.
Macdonell, R., Shapiro, B., Chiappa, H., Helmers, S., Day, B., & Shahani, B. (1991). Hemispheric threshold differences for motor evoked potentials produced by magnetic stimulation. Neurol., 41, 1441–1444.
Meyer, B.-U., Röricht, S., von Einsiedel, H., Kruggel, F., & Weindl, A. (1995). Inhibitory and excitatory interhemispheric transfers between motor cortical areas in normal humans and patients with abnormalities of the corpus callosum. Brain, 118, 429–440.
Nudo, R., Jenkins, W., Merzenich, M., Prejean, T., & Grenda, R. (1992). Neurophysiological correlates of hand preference in primary motor cortex of adult monkeys. J. Neurosci., 12, 2918–2947.
Orban, G. (1984). Neuronal operations in the visual cortex. Berlin: Springer-Verlag.
Reggia, J., & Edwards, M. (1990). Phase transitions in connectionist models having rapidly varying connection strengths. Neural Computation, 2, 523–535.
Reggia, J., D'Autrechy, C., Sutton, G., & Weinrich, M. (1992). A competitive distribution theory of neocortical dynamics. Neural Computation, 4, 287–317.
Reggia, J., Goodall, S., & Shkuro, Y. (1998). Computational studies of lateralization of phoneme sequence generation. Neural Computation, 10, 1277–1297.
Ringo, J., Doty, R., Demeter, S., & Simard, P. (1994). Time is of the essence: A conjecture that hemispheric specialization arises from interhemispheric conduction delay. Cerebral Cortex, 4, 331–343.
Scheibel, A., Fried, I., Paul, L., Forsythe, A., Tomiyasu, U., Wechsler, A., Kao, A., & Slotnick, J. (1985). Differentiality characteristics of the human speech cortex. In D. Benson & E. Zaidel (Eds.), The dual brain (pp. 65–74). New York: Guilford.
Shevtsova, N., & Reggia, J. (1999). A neural network model of lateralization during letter identification. Journal of Cognitive Neuroscience, 11, 167–181.
Stone, J., Leicester, J., & Sherman, S. (1973). Naso-temporal division of the monkey's retina. J. Comp. Neur., 150, 333–348.
Sutton, G., Reggia, J., Armentrout, S., & D'Autrechy, C. (1994). Cortical map reorganization as a competitive process. Neural Computation, 6, 1–13.
Toyama, K., Tokashiki, S., & Matsunami, K. (1969). Synaptic action of commissural impulses upon association efferent cells in cat visual cortex. Brain Res., 14, 518–520.
Tucker, D., & Williamson, P. (1984). Asymmetric neural control systems in human self-regulation. Psychol. Review, 91, 185–215.
Udin, S., & Fawcett, J. (1988). Formation of topographic maps. Annual Review of Neuroscience, 11, 289–327.
Van Kleek, M., & Kosslyn, S. (1991). Use of computer models to study cerebral lateralization. In F. Kitterle (Ed.), Cerebral laterality: Theory and research (pp. 155–174). Hillsdale, NJ: Erlbaum.

Received June 18, 1999; accepted September 23, 1999.
LETTER
Communicated by Ad Aertsen
Rate Limitations of Unitary Event Analysis

A. Roy, P. N. Steinmetz, E. Niebur
Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.

Unitary event analysis is a new method for detecting episodes of synchronized neural activity (Riehle, Grün, Diesmann, & Aertsen, 1997). It detects time intervals that contain coincident firing at higher rates than would be expected if the neurons fired as independent inhomogeneous Poisson processes; all coincidences in such intervals are called unitary events (UEs). Changes in the frequency of UEs that are correlated with behavioral states may indicate synchronization of neural firing that mediates or represents the behavioral state. We show that UE analysis is subject to severe limitations due to the underlying discrete statistics of the number of coincident events. These limitations are particularly stringent for low (0–10 spikes/s) firing rates. Under these conditions, the frequency of UEs is a random variable with a large variation relative to its mean. The relative variation decreases with increasing firing rate, and we compute the lowest firing rate at which the 95% confidence interval around the mean frequency of UEs excludes zero. This random variation in UE frequency makes interpretation of changes in UEs problematic for neurons with low firing rates. As a typical example, when analyzing 150 trials of an experiment using an averaging window 100 ms wide and a 5 ms coincidence window, firing rates should be greater than 7 spikes per second.

1 Introduction
Neural Computation 12, 2063–2082 (2000). © 2000 Massachusetts Institute of Technology

Despite intensive research efforts over half a century, fundamental questions regarding the nature of neuronal representations remain unanswered. One long-standing debate (Hubel, 1959; Werner & Mountcastle, 1963; Smith & Smith, 1965; Griffith & Horn, 1966; Bullock, 1970; Softky & Koch, 1993; Shadlen & Newsome, 1995; Softky, 1995) is whether this representation involves only the mean rate of action potentials, averaged over times on the order of a second, or whether finer temporal structures play a role. Of particular interest are those time structures involving multiple neurons. For instance, synchronous or oscillatory firing of neurons in the cortex has been proposed as a mechanism for representing sensory information (Singer,
1993), for labeling different attributes of a stimulus (Decharms & Merzenich, 1996), and as a possible representation of internal behavioral states (Gerstein & Clark, 1964; Abeles, 1982b, 1991; Dayhoff & Gerstein, 1983a, 1983b; Niebur, Koch, & Rosin, 1993; Niebur & Koch, 1994; Steinmetz et al., 2000).

As techniques for recording from multiple neurons become more commonplace, methods for analyzing data from simultaneous recordings become increasingly important. Methods now considered part of the standard repertoire are the cross-correlogram (see Eggermont, 1990, for a review), the joint peristimulus time histogram (Aertsen, Gerstein, Habib, & Palm, 1989), and the gravity method (Gerstein, Bedenbaugh, & Aertsen, 1989). Any addition to the available methods is welcome. There is a particular need for methods that do not require extensive averaging over trials, since the brain has to solve problems as they appear, without repetitions and multiple samples of a stimulus. If the brain is able to work without averaging, then it should be possible to detect the "code" that the brain uses by applying algorithmic methods that do not require averaging either. It was one of the goals leading to the development of unitary event analysis to provide such a method (Grün, 1996).

The basic task of unitary event analysis is the detection of periods of synchronous firing of multiple neurons (Grün, Aertsen, Abeles, & Gerstein, 1993; Grün, 1996; Riehle et al., 1997). The method determines intervals of time during an experiment when the number of coincident firings of two or more simultaneously recorded neurons significantly exceeds the number that would be expected if the neurons fired independently. Action potentials occurring simultaneously during these intervals are labeled unitary events (UEs). Changes in the frequency of UEs correlated with the presumed perceptual state of the animal subject were recently observed in monkey primary motor cortex by Riehle et al. (1997).
In these experiments, UEs occurred just prior to the possible appearance of a visual movement cue. More important, the frequency of UEs was significantly higher when the monkey later performed the instructed movement correctly than when the animal failed.

The purpose of this article is to analyze this new method and point out limitations on its use. We do so using theoretical analysis and numerical simulation and by application of UE analysis to data recorded in primate somatosensory cortex. The central result is that any application of UE analysis requires careful examination of the firing rates of the analyzed sequences of action potentials ("spike trains"). While earlier work (Grün, 1996) highlighted limitations of the method for spike trains with high firing rates, we demonstrate here that use of the method for low-rate spike trains (0–10 spikes/s) can also produce artifactual changes in the occurrence of UEs that may appear correlated with behavioral states (as is the case in the data analyzed in this article). This report illustrates this artifact, examines why it occurs, and determines the lower bounds on the firing rates at which UE analysis can be used. (Parts of this work have been presented in abstract form in Roy, Steinmetz, & Niebur, 1998.)
2 Unitary Events Method Defined
Unitary events are coincident action potentials that occur at a greater frequency than would be expected if the spike trains were independent (Grün, 1996; Riehle et al., 1997). The basic idea is to count the observed number of coincidences (simultaneous firing of two or more neurons during a small analysis interval of width b, typically 5 ms) in an interval of width T_w (typically 100 ms). This count is compared with the number of coincidences that would occur if each neuron fired independently with the average rate over the interval T_w.

To perform this analysis, each spike train is represented as a binary (0, 1) sequence in time. Each of its segments of length T_w is subdivided into N_b = T_w / b exclusive bins of width b. Each bin has either the value 0, representing no spikes, or 1, representing one or more spikes occurring in the bin. This sequence will be denoted s(j), j = 1, ..., N_b. In UE analysis, it is assumed that spike trains obey Poisson statistics of constant rate within any interval T_w (i.e., that a spike train is a homogeneous Poisson process within this interval). Rates are allowed to be different in different segments, but the firing is always assumed to be governed by Poisson statistics. Therefore, neural firing is assumed to be an inhomogeneous Poisson process over intervals longer than T_w.(1) Under these assumptions, the probability of a neuron's firing at least once in a bin is

P(s(j) = 1) ≡ 1 - P(no spikes) = 1 - e^{-n/N_b},   (2.1)
where n is the total number of events (ones) observed in T_w. Multiple repetitions ("trials") of an experiment are treated by assuming stationarity across the N_t trials but not necessarily within a trial beyond a time T_w. This allows the corresponding intervals from each trial to be concatenated into a single process of length N_t × T_w, which is stationary by assumption. In this case, the number of bins per segment is N_b = N_t T_w / b.

(1) Note a conceptual difficulty here that is due to the overlap between neighboring segments. Strictly speaking, the rate in any segment determines the rate in all segments, because of this overlap in conjunction with the requirement of constant rate within any segment. Violation of this assumption may be tolerable as long as the firing rates vary slowly compared to T_w. In practice, observed firing rates do not always change slowly; for an example, see the raster plots in Figure 1 around the times of stimulus onset and offset, or Figures 2 to 4 in Riehle et al. (1997). Grün (1996) developed more sophisticated methods for the choice of T_w, which may alleviate the practical consequences but do not solve the principal problems addressed later in this article. We will ignore this technical difficulty of UE analysis.
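The binning just described can be sketched in a few lines (our own illustration, not code from the paper; the helper name and spike times are hypothetical):

```python
def binarize(spike_times_s, t_start_s, n_bins, bin_s=0.005):
    # Map spike times (seconds) onto N_b exclusive bins of width b = 5 ms.
    # A bin is 1 if one or more spikes fall in it, 0 otherwise (clipping,
    # as in the text); spikes outside the segment are ignored.
    s = [0] * n_bins
    for t in spike_times_s:
        j = int((t - t_start_s) / bin_s)
        if 0 <= j < n_bins:
            s[j] = 1
    return s

# Two spikes in the first bin, one in the third: clipping keeps a single 1.
print(binarize([0.001, 0.002, 0.012], 0.0, 4))  # [1, 0, 1, 0]
```

Concatenating the corresponding segment from each of N_t trials yields the single binary process of N_b = N_t T_w / b bins used below.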
The spiking activity from N simultaneously recorded neurons is described by a joint process composed of N parallel binary processes. The process is represented by a time sequence of N-tuples, V(j), j = 1, ..., N_b. Each component of the N-tuple is either zero (no action potential) or unity (one or more action potentials). Let V_k be a particular N-tuple and let V_k^i denote its ith component. Assuming that the neurons fire independently, the joint probability of observing the N-tuple V_k is then given by the product of the marginal probabilities for each of the N processes:

P(V_k) = ∏_{i=1}^{N} { e^{-n_i/N_b}       if V_k^i = 0
                       1 - e^{-n_i/N_b}   if V_k^i = 1 },   (2.2)
where n_i is the number of bins with unity value in the concatenated process for neuron i.

Given the assumed stationarity, the number o_k of occurrences of a particular N-tuple V_k in N_b bins has a binomial distribution. Its probability is given by

P(o_k | P(V_k), N_b) = C(N_b, o_k) P(V_k)^{o_k} [1 - P(V_k)]^{N_b - o_k},   (2.3)

where C(N_b, o_k) denotes the binomial coefficient.
For each number of occurrences o_k of N-tuple V_k, we can compute the significance level for a critical region that contains just this outcome. The significance level will be the total probability of finding the observed number, o_k, of occurrences of V_k or an even larger and less likely number of occurrences. We define this level as the joint p-value of the outcome:(2)

joint p-value(o_k | P(V_k), N_b) = Σ_{r=o_k}^{N_b} P(r | P(V_k), N_b)
                                 = Σ_{r=o_k}^{N_b} C(N_b, r) P(V_k)^r [1 - P(V_k)]^{N_b - r},   (2.4)
where P(V_k) is given by equation 2.2. In the cases considered in this article, only two simultaneous spike trains are analyzed. Therefore only four firing patterns V_k are possible, and only one of them, the pattern (1, 1) representing coincident firing, is of primary interest.
(2) In the original UE analysis of Grün (1996), the joint p-value is based on the Poisson approximation to the binomial distribution (see her chapter 3). Although this approximation is frequently justified, it fails for the precise considerations introduced in section 3.3, and we therefore use the exact form in equation 2.4.
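To make equations 2.1–2.4 concrete, here is a minimal sketch (ours, not the authors' code) that evaluates the exact binomial joint p-value for the coincidence pattern (1, 1) of two independent neurons; the parameter values (150 trials, T_w = 100 ms, b = 5 ms, 5 spikes/s) match the typical example used later in the article:

```python
import math

def p_fire(n_spikes, n_bins):
    # Eq. 2.1: probability that a bin contains at least one spike, assuming
    # Poisson firing with n_spikes total events spread over n_bins bins.
    return 1.0 - math.exp(-n_spikes / n_bins)

def joint_p_value(o_k, p_k, n_bins):
    # Eq. 2.4: P(o_k or more occurrences), computed as 1 minus the binomial
    # CDF up to o_k - 1 so only small binomial coefficients are needed.
    cdf = sum(math.comb(n_bins, r) * p_k**r * (1.0 - p_k)**(n_bins - r)
              for r in range(o_k))
    return 1.0 - cdf

# Nt = 150 trials, Tw = 100 ms, b = 5 ms  ->  Nb = 150 * 0.1 / 0.005 = 3000.
n_bins = 3000
n_spikes = 75            # 5 spikes/s over 15 s of concatenated data
p1 = p_fire(n_spikes, n_bins)
p_coinc = p1 * p1        # Eq. 2.2 for the pattern (1, 1), independent neurons
print(joint_p_value(5, p_coinc, n_bins))  # below 0.05: window significant
print(joint_p_value(4, p_coinc, n_bins))  # above 0.05: window not significant
```

The expected number of chance coincidences here is roughly N_b p_coinc ≈ 1.8, so only a few coincidences suffice to cross the 0.05 level; this discreteness is the subject of section 3.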
In the method of UE analysis, if the joint p-value is less than the target p-value, then all coincidences inside the averaging window are labeled unitary events.(3) The target p-value (or α level) is the arbitrary significance level chosen to test the null hypothesis that the two simultaneously recorded spike trains are independent. In this article, we consistently use a target p-value of 0.05.

3 Analysis of Unitary Events Method

3.1 Application to Cortical Spike Trains. Unitary events were detected in simultaneously recorded pairs of spike trains originating from individual neurons in somatosensory cortical area II (SII) of macaque monkey. During each trial, a video screen displayed either a blank screen, indicating that the monkey was required to perform a tactile task, or a rectangular box consisting of a white outline and black center, indicating that the monkey was required to perform a visual task. The tactile task consisted of discriminating the orientation of a 2 mm wide nonresilient bar pushed twice perpendicularly against the distal finger pad of a restrained finger; the orientation of the bar could be the same or different between the two presentations in the same trial. The monkey had to decide whether the orientations were identical or different. The visual task consisted of distinguishing whether the left or right box presented on the video screen was dimmed. In both cases, the monkey indicated its response by pressing a foot switch. Details of the experimental procedure were as described by Hsiao, O'Shaughnessy, and Johnson (1993) except that the tactile and visual stimulation were as described above.

Figures 1A and 1B show recorded action potentials of two neurons in SII as well as UEs during 212 trials of this experiment while the monkey

(3) A possible alternative, suggested by Grün (1996), is to label as unitary events only some of the coincidences in the interval.
Their number would be given by the number of coincidences exceeding that expected to occur by chance given the observed firing rates. This is not the case in the standard UE analysis approach as defined by Grün (1996) and Riehle et al. (1997), in which all coincidences are labeled UEs independent of the firing rates. We cannot identify which ones are in excess of the total number of coincidences; we can only state that x% of them are in excess of what is expected. Alternatively, the coincidences that qualify as UEs could be selected randomly at the appropriate frequency. Another approach could be to generate a weighting procedure for the UEs (i.e., the weight of a coincidence increases monotonically, for instance linearly, if it is detected as being significant in multiple overlapping blocks). After the completion of the fixed number of overlapping tests for each UE, the ones with the least weight are removed until the theoretical expected number is reached.
Figure 1: (A, B) Raster plots of action potentials (dots) and UEs (diamonds) in two cells as a function of time within each trial when the monkey performed the tactile task, requiring attention to the task. Each row corresponds to one trial. The stimulus bar is in contact with the skin from about t = 1.7 s to 2.2 s and from t = 3.3 s to 3.8 s; the "on" and "off" responses can be clearly seen for the cell shown in A. See the text for details.
worked on the tactile task. Note that the UEs are identical in parts A and B of Figure 1, since they are defined by the coincident firing of the two neurons whose action potentials are shown in these two figures. Note also that the assumption of stationarity across (but not within) trials appears justified.

Figures 1A and 2A contrast the data that were obtained from the same neuron when the monkey worked in the tactile and the visual task, respectively. The intervals between 1–1.5, 2.5–3.0, and 3.8–4.3 seconds show a change in the frequency of UEs when the intervals are compared between Figures 1A and 2A, thus demonstrating that the frequency of UEs is correlated with the behavioral state of the primate. The stimulus bar contacts the skin during the time intervals between t = 1.7 s and 2.2 s and between t = 3.3 s and 3.8 s. Note that the UE frequency increases before (and not during) the stimulation. The underlying joint p-values are shown in Figure 2B, which plots (for better resolution) log10[(1 - P)/P] for the tactile task. A simple explanation assuming that UEs are suppressed by the application of the stimulus bars does not hold, since UE frequency decreases at the end of the trial, when the stimulus bar is again removed from the skin (after t = 3.8 s). Note, however, that the firing frequency of neuron 2 (raster plots in Figure 1B) is reduced toward the end of the trial (after about 3.2 s). Although UE analysis is designed to compensate for rate effects, we will see that changes in the firing rate may lead to changes in UE frequency.

Figure 2A shows that no such increase in UE frequency is observed when the animal is performing the visual task in the same intervals with identical sensory stimulation (remember that the same tactile stimulus is being delivered to the monkey even when he is doing the visual task). Increased UE frequency therefore could be interpreted as signaling a buildup of an anticipatory state in the animal rather than a response to sensory stimulation.
Of the 30 pairs of neurons examined, 9 (30%) showed changes in UE frequency correlated with the behavioral state of the monkey. Recordings from an additional 7 cell pairs contained many UEs, but UE occurrence did not vary with behavior. The remaining 16 pairs contained an insignificant number of UEs. These results appear comparable to those reported by Riehle et al. (1997) in motor cortex, where the frequency of UEs increases sharply in those intervals in which the monkey might anticipate the occurrence of a visual cue for movement. We will see that this interpretation must be made cautiously and may not be warranted, particularly not with low firing rates.

3.2 Discrete Distribution of the Number of Coincidences and Possible Significance Levels. In order to elucidate the mechanisms underlying the
changes of UE frequency, we reevaluated the statistical test used to identify UEs. As reviewed in section 2, the method of UE analysis detects coincident spikes occurring in time intervals where the null hypothesis of independent neural firing is rejected. This interpretation of changes in UE frequency will
Figure 2: (A) Similar raster plot as in Figure 1A when the monkey performed the visual task, which does not require tactile attention. (B) Joint p-values (P in the graph) plotted as log10[(1 - P)/P] for overlapping 100 ms time segments in the tactile task. The horizontal line at ordinate value log10 19 ≈ 1.3 shows the target significance level (in this case, 0.05). At times (i.e., values on the abscissa) without entries in the figure, bins have no recorded spikes in one or both neurons.
be valid, however, only if the significance level of the test for independent firing is equal for the time intervals being compared. If the significance level of the test varies, then changes in UE frequency are due not only to changing frequencies of coincidences but also to changing significance levels. It is a statistically ill-defined procedure to compare UE frequencies between intervals with different significance levels.

The significance level of the test of independent firing is equal to the probability of a time interval containing a number of coincidences that lies in the critical region of the test. The structure of this critical region is determined by the probability distribution of the number of coincidences for independent processes. As reviewed in section 2, this distribution is binomial, with its shape depending on both the number of time bins examined and the rates of neural firing. Two examples with 2000 bins and two different rates are shown in Figures 3A and 3B. Critical regions for the maximum likelihood test used in UE analysis consist of right-hand tails of this distribution, since the maximum likelihood rule requires that if a specific number of coincidences is included in the critical region, then all less likely numbers of coincidences must also be included (Lindgren, 1976).

It is therefore important to consider the effective significance level, which we define as the total probability of the largest critical region that does not exceed the target significance level p. It is usually obtained by integrating the right tail of the probability density function leftward from large numbers until the area reaches the target p-value. Examples are shown in Figure 3A for a firing rate of 5 spikes per second and in Figure 3B for 20 spikes per second. The figures show the critical regions closest to, but not exceeding, a significance level of p = 0.05.
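The effective significance level can be computed directly from the binomial tail. The sketch below is our own (the function names are ours, not the authors'), using the parameters of Figure 4 (N_b = 3000 bins of b = 5 ms, target p = 0.05):

```python
import math

def binom_tail(k, p, n):
    # P(X >= k) for X ~ Binomial(n, p): the right tail that forms the
    # critical region of the maximum likelihood test.
    return 1.0 - sum(math.comb(n, r) * p**r * (1.0 - p)**(n - r)
                     for r in range(k))

def effective_alpha(rate_hz, bin_s=0.005, n_bins=3000, target=0.05):
    # Per-bin firing probability (Eq. 2.1) and coincidence probability for
    # two independent neurons firing at the same rate (Eq. 2.2).
    p1 = 1.0 - math.exp(-rate_hz * bin_s)
    p_c = p1 * p1
    # Smallest critical region {k, k+1, ...} whose total probability does
    # not exceed the target level; that probability is the effective level.
    k = 0
    while binom_tail(k, p_c, n_bins) > target:
        k += 1
    return k, binom_tail(k, p_c, n_bins)

print(effective_alpha(5))   # coarse critical region at low rates
print(effective_alpha(20))  # achievable levels more finely spaced
```

Sweeping `rate_hz` over a fine grid reproduces the sawtooth dependence of the effective level on firing rate discussed below.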
The crucial observation is that the effective significance levels are discrete, since the number of coincidences in a window of given size T_w is an integer and therefore a discrete variable. At low firing rates, such as 5 spikes per second, this discrete distribution creates effective significance levels that are broadly spaced and usually approximate the target level poorly. The effect is less important for high firing rates, such as 20 spikes per second (for the chosen parameters), where a significance level near the target significance level can be chosen with reasonable accuracy (see below for a more quantitative statement). As the mean of the binomial distribution increases (both with increasing bin size and with increasing average firing rate), the effective significance level approaches p, but for many physiologically realistic spike rates, the difference between the effective significance level and p itself is a substantial fraction of p.

The dependence of the effective significance level on neuronal firing rate is shown in Figure 4, which plots the effective significance level for a pair of independent neurons using a target p = 0.05 for 3000 time bins of length 5 ms each (this corresponds to a typical experiment with N_t = 150 trials and T_w = 100 ms). The most problematic point when performing UE analysis is the presence of large jumps in significance level as a function of rate. These
Figure 3: (A) Probability distribution of the number of coincidences for a firing rate of 5 spikes per second. The effective significance is shown for a significance level of p = 0.05. (B) As in (A) but for 20 spikes per second.
occur when a particular outcome, such as three coincidences per bin, is no longer included in the critical region as the rate increases. It can be seen that a small change in firing rate (e.g., 0.1 spike/s) can lead to a very large change in the size of the critical region (the effective significance level). The
Figure 4: Effective significance levels for the maximum likelihood test of independent firing. N_b = 3000, b = 5 ms.
largest jumps appear at small firing rates. For instance, in the case illustrated in Figure 4, a small increase in firing rate at about 2 spikes per second (first branch shown in the figure) changes the effective significance by a whole order of magnitude, from p = 0.05 to about p = 0.005. It should be noted that the firing rate of the model can change by an arbitrarily small amount, although the smallest change in the observed rate in a simulation will correspond to the addition or removal of a single spike over N_t trials in the averaging window of width T_w.

We have thus determined that the discrete nature of coincident events is the cause of the observed large changes in significance level. Small changes of neural firing rates correlated with behavioral status may therefore lead to large changes in significance level. If this is the case, we cannot conclude that different frequencies of UEs are due to differences in the correlation state of the neurons. In terms of UE analysis, which is based on constant significance levels, the apparent changes in UE frequency might represent artifacts of the method. Although the jumps in effective significance level for a single test are large, we show in the next section that they do not appear clearly in UE analysis, because the method includes testing the coincidences multiple times. The effects of this repeated testing will be explored next.
3.3 Rate Dependence of Unitary Events. As described in section 2, the presence of UEs is determined by the number of coincidences within a window of length b in an interval T_w. If this number is in the critical region of a test for independent firing, then all coincidences within the interval are considered UEs. After each test, the interval slides by the width of the coincidence window, b, and the test is repeated. Therefore, each coincidence is subjected to T_w / b dependent tests, which means 20 tests for the typical window sizes T_w = 100 ms, b = 5 ms. Furthermore, the average rate estimate used to construct a single test is a random variable. Since the sliding window moves one bin at a time, the tests of the null hypothesis are dependent. The randomness of the rate estimate and the interdependence of these tests make it difficult to determine analytically the overall significance level of the test for coincident firing employed in UE analysis. We therefore compute the significance level by Monte Carlo simulation.

The significance level is equal to the probability that coincident spikes are labeled as unitary events even though the null hypothesis of independent firing is true. The average rate estimate for a single test was simulated as being equal to (T_w / b)^{-1} Σ_{a=1}^{T_w/b} Y_a, where the Y_a are independent Poisson-distributed random variables drawn from a fixed firing rate within the time window of width T_w. The Monte Carlo computations consisted of 10,000 simulated trials of 3N_b - 1 bins each for two independent Poisson processes at each value of the firing rate. Unitary events were detected using the methods described in section 2. The frequency of UE occurrence was defined as the number of computed UEs in the central block of N_b columns of bins divided by b N_b N_t (this avoids edge effects). The mean and variance of the UE frequency were then computed over all iterations of the simulation and are shown in Figure 5.
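The sliding-window labeling rule can be sketched as follows. This is our own simplified, self-contained version (a single concatenated stretch of bins rather than the authors' 10,000-trial design, and hypothetical function names), but it applies the same sliding binomial test and can be used to observe how UE frequency behaves when the two trains are in fact independent:

```python
import math
import random

def ue_frequency(rate_hz, n_bins=2000, bin_s=0.005, win=20, alpha=0.05, seed=0):
    """UE frequency (UEs per second) for two independent simulated trains."""
    rng = random.Random(seed)
    p1 = 1.0 - math.exp(-rate_hz * bin_s)   # per-bin spike probability (Eq. 2.1)
    sa = [rng.random() < p1 for _ in range(n_bins)]
    sb = [rng.random() < p1 for _ in range(n_bins)]
    coinc = [x and y for x, y in zip(sa, sb)]
    is_ue = [False] * n_bins
    for start in range(n_bins - win + 1):    # window slides one bin at a time
        na = sum(sa[start:start + win])      # per-window rate estimates
        nb = sum(sb[start:start + win])
        p_c = (1 - math.exp(-na / win)) * (1 - math.exp(-nb / win))
        k = sum(coinc[start:start + win])
        # Exact binomial joint p-value (Eq. 2.4), computed as 1 - CDF(k - 1).
        tail = 1.0 - sum(math.comb(win, r) * p_c**r * (1 - p_c)**(win - r)
                         for r in range(k))
        if k > 0 and tail < alpha:
            for j in range(start, start + win):  # every coincidence in a
                if coinc[j]:                     # significant window is a UE
                    is_ue[j] = True
    return sum(is_ue) / (n_bins * bin_s)
```

Averaging `ue_frequency` over many seeds at each rate reproduces the qualitative behavior of Figure 5: a nonzero mean UE frequency under the null hypothesis that grows with firing rate, with a large standard deviation relative to the mean at low rates.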
Figure 5 shows that the average frequency of UE occurrence is a monotonically increasing function of rate. Intuitively, this can be understood from the fact that the total number of accidental coincidences increases with the product of the firing rates of the two neurons and that therefore the number of accidental coincidences labeled as UEs also increases. The dependence of UE frequency on firing rates has previously been noted by Grün (1996) at higher firing rates, and it can lead to marked increases in UE frequency. For example, in Figure 5, it can be seen that a twofold rate change from 4.5 spikes per second to 9 spikes per second causes more than a fourfold increase in UE frequency.

In addition, it is important to notice that 20 dependent tests are performed for every coincidence, since each bin of width b = 5 ms is a member of 20 windows of width T_w = 100 ms, each of which has, in general, a different firing rate. If any of these tests detects a significantly elevated number of coincidences, all coincidences in the interval are labeled UEs. This causes the transformation of the "sawtooth" function in Figure 4 into the smooth increase in average number of UEs shown in Figure 5. The large jumps in
Figure 5: UE frequency (mean and standard deviation) as a function of firing rate. N_b = 3000, b = 5 ms.
significance levels in Figure 4 are reflected in the large standard deviation of UE frequency relative to its mean shown in Figure 5. Since the size of the jumps decreases with increasing firing rates, the same follows for the standard deviation of UE frequency relative to its mean. Thus, although the target significance level used in performing UE analysis remains constant, the effective significance level of the test for unitary events is dependent on firing rate. In section 3.4 we will develop criteria for deciding when the frequency of UEs is a reliable quantity. Before doing this, however, we applied this analysis to the data recorded in somatosensory cortex.

Figure 6 demonstrates the effect of varying firing rates on UE frequency. We computed the time-varying firing rates for each of the two neurons whose activity is plotted in Figures 1A and 1B using a 100 ms averaging window and generated two (inhomogeneous) Poisson processes using these rate profiles. The time-varying rate profiles were computed consistent with the assumed null hypothesis for UE analysis (which is in this case homogeneous or stationary rates over 100 ms). UE analysis was then applied to the simulated processes. Figure 6A shows the UEs obtained. The pattern of UEs is remarkably similar to that observed in Figure 1A. Similarly, the logarithmic plot of the significance of the events shown in Figure 6B captures the relevant features of the corresponding plot in Figure 2B.
Figure 6: (A) Raster plots of action potentials (dots) and UEs (diamonds) in two cells simulated as realizations of two independent inhomogeneous Poisson processes with a rate profile similar to the cell from Figure 1A. See the text for details. (B) Joint p-values (P in the graph) plotted as log10[(1 - P)/P] for overlapping 100 ms time segments.
Figure 7: Minimum rates at which mean UE frequency = 2σ.
A uniformly distributed (or random) pattern of UEs was obtained when we applied UE analysis to similarly generated inhomogeneous spike train simulations of each of the two neurons whose UE analyses are shown in Figure 2A. The random nature of the distribution of UEs matches the obvious lack of pattern in Figure 2A (data not shown). Therefore, based on these simulations, it can be argued that a large portion of the observed pattern of UEs in those figures is due to rate effects and not to rate-independent coincidences.

3.4 Firing Rates with Reliable Significance Levels. Figure 5 shows that the frequency of UE occurrence for low neural firing rates is a random variable with a large standard deviation relative to its mean under the null hypothesis of independent firing. This random variation of the mean UE frequency makes it difficult to interpret changes in UE frequency. If this variation is high relative to the mean significance level, then changes in UE occurrence may be due to changes unrelated to changes in the correlational state of the neurons. For instance, changes in the firing rate that are orders of magnitude smaller than the standard deviation of the average firing rate (fractions of a Hz; see Figure 4) can change the effective significance level by a factor of 10. As a lower bound on rates where the variation is acceptable, we show in Figure 7 the minimum rate at which a
95% confidence interval excludes zero or, equivalently, at which the mean is equal to twice the standard deviation. Other intervals may be appropriate depending on the degree of confidence desired in a specific investigation. For example, when analyzing simulated data for a typical experiment comprising 3000 time bins (100 ms averaging and 5 ms analysis windows for 150 trials), a 95% confidence interval (2σ) for the significance level includes zero for firing rates below 7.3 spikes per second. As the number of trials increases, lower rates are sufficient to produce a reliable significance level.

4 Discussion
It is well established that the average rate of neuronal firing is used to represent information in parts of the nervous system (Kandel, Schwartz, & Jessell, 1991). It is likely that the nervous system also makes use of some of the temporal variation of the firing rate, for example, of onset and offset responses of some neural populations. It is a matter of active discussion (and has been for several decades; Hubel, 1959; Werner & Mountcastle, 1963; Smith & Smith, 1965; Griffith & Horn, 1966; Bullock, 1970) whether more subtle forms of temporal structure are employed to encode and process neuronal information.

One of the attractive features of temporal codes is that they do not rely on averaging over either large populations or long time intervals. Instead, it is conceivable that very specific patterns are generated in one part of the nervous system and instantaneously decoded and processed in other parts. As long as it is highly unlikely that such patterns are generated by chance, little or no averaging is required to recognize the patterns with a high degree of confidence. Codes that operate without extensive temporal averaging have clear advantages: the brain of a behaving animal needs to solve problems as they appear, without waiting for repetitions and multiple samples of a stimulus. If this is the case, it should be possible to detect such temporal structures by algorithmic methods that do not require extensive averaging over either time or population. This is not the case for many of the methods used to study the temporal structure of neuronal responses. For instance, the autocorrelation function and its Fourier transform, the power spectrum, are global (nonlocal) measures in the temporal domain. The same is true for their generalizations to multiple spike trains, that is, the cross-correlation function and its Fourier transform. Therefore, such methods require data on the firing behavior over a long time period and are less suitable for making quick decisions.
A recently proposed local method that does not rely on averaging over long time periods is unitary event analysis (Grün et al., 1993; Grün, 1996; Riehle et al., 1997). The underlying assumption is the existence of UEs in brain activity, represented by "brief unusual constellations of activity"
Rate Limitations of Unitary Event Analysis
2079
among neuronal assemblies (Grün, 1996). It is assumed that the degree of unusuality is given by the degree of interneuronal correlation and measured by a quantity closely related to the surprise measure used previously in neurophysiology (Légendy, 1975; Légendy & Salcman, 1985; Palm, Aertsen, & Gerstein, 1988), whose expectation value is, in turn, closely related to Shannon's (1948) information and the entropy of statistical physics (Boltzmann, 1887). Since the frequency of accidental correlations increases with the firing rates of the participating neurons, applying appropriate corrections requires knowledge of the mean firing rates, which are computed by averaging over trials. Grün (1996) has pointed out that UE analysis is of limited value if the firing rates of the analyzed neurons are high. The reason is, essentially, that the number of accidental coincidences becomes too large, which makes the detection of possible nonaccidental coincidences more difficult. In this article, we report limitations of this method that are particularly stringent in the case of low neural firing rates. This is of particular relevance since the firing rates in the more central cortical areas are usually low and similar to those found to be problematic in our analysis (a few impulses per second or less; see, e.g., Abeles, Vaadia, & Bergman, 1990). Since it is usually assumed that temporal codes such as those assumed in UE analysis are of more relevance in central than in peripheral cortical areas, the limitations reported here are pertinent. It is also important to note that UE analysis does not provide a measure of significance of the absolute number of excess coincidences. It is therefore unlike the significance of spatiotemporal patterns computed by Abeles (1982a) and Abeles, Bergman, Margalit, and Vaadia (1993).
Rather, UE analysis attaches a label to brief periods of increased coincident firing and examines changes in the pattern of the label that are correlated with different behavioral or response states. In fact, the overlapping nature of the test epochs used in UE analysis makes it difficult to determine the significance of the number of excess coincidences as a whole. At the root of the problem are the underlying discrete statistics and the resulting changes in effective significance levels of the test for coincident firing. We show that for low but realistic firing rates (e.g., 1 Hz), the effective significance level for coincidences labeled as UEs varies considerably. For instance, the significance level of coincidences can be as low as 0.005 instead of the target level 0.05 (see Figure 4)—a whole order of magnitude below the target level—and the average significance level of UEs at low rates is about half of that of the target level. These changes cause a large error in estimation of the mean number of UEs expected under the null hypothesis. Thus, it is difficult to determine whether changes in observed UE frequency are due to random variations or whether they signify changes in actual neural synchrony. In conclusion, we have pointed out limitations in the applicability of UE analysis in the case of low firing rates. These limitations arise due to
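The discreteness effect described here can be illustrated with a short sketch. The window and rate parameters below are illustrative assumptions, not the paper's exact analysis settings: for a binomial coincidence count, a discrete test rejects at the smallest count whose upper-tail probability is at or below the target level, so the effective level of the test can sit far below the nominal 0.05.

```python
from math import comb

def effective_level(n: int, p: float, alpha: float = 0.05) -> float:
    """Actual tail probability of the discrete test that rejects when the
    binomial coincidence count over n bins (per-bin probability p) reaches
    the smallest k whose upper-tail probability is <= alpha."""
    def tail(k: int) -> float:
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
    k = 0
    while tail(k) > alpha:
        k += 1
    return tail(k)

# Two independent 1 Hz neurons, 5 ms bins: per-bin coincidence
# probability (0.005)**2 = 2.5e-5, over an assumed 100-bin test epoch.
print(effective_level(100, 2.5e-5))  # ≈ 0.0025, far below the 0.05 target
```

The jump from level 1 (reject always) to a level orders of magnitude below the target occurs because at low rates the smallest admissible critical count is already very improbable.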
2080
A. Roy, P. N. Steinmetz, and E. Niebur
the discrete character of the underlying point process model. If experimental conditions permit the collection of sufficiently large amounts of data, these limitations can be overcome. We provide practical guidelines for the parameter ranges within which the analysis is applicable.
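As a sketch of the guideline arithmetic, the 7.3 spikes-per-second figure quoted earlier follows from the 2σ criterion if one models the coincidence count of two independent neurons as binomial over the analysis bins; that binomial model is our assumption here.

```python
import math

def min_rate_hz(n_bins: int, bin_s: float) -> float:
    """Smallest firing rate (Hz) at which the mean coincidence count of two
    independent neurons reaches twice its standard deviation.

    With per-bin spike probability p = rate * bin_s, the coincidence count
    over n independent bins has mean n*p**2 and sd sqrt(n*p**2*(1 - p**2)),
    and  mean >= 2*sd  reduces to  p >= 2 / sqrt(n + 4).
    """
    p = 2.0 / math.sqrt(n_bins + 4)
    return p / bin_s

# 150 trials, 100 ms averaging windows, 5 ms analysis bins -> 3000 bins
print(round(min_rate_hz(3000, 0.005), 1))  # → 7.3
```

More trials mean more bins and hence a lower admissible rate, matching the remark that lower rates suffice as the number of trials increases.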
Acknowledgments
We thank Paul Fitzgerald, Steve Hsiao, and Ken Johnson for providing the data presented in Figures 1 and 2. We also thank both anonymous referees for valuable suggestions. This work was supported by the Lucile P. Markey Foundation, the National Science Foundation, and the Alfred P. Sloan Foundation through a Sloan Fellowship to E. N.
References

Abeles, M. (1982a). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1982b). Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics—Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology, 70(4), 1629–1638.
Abeles, M., Vaadia, E., & Bergman, H. (1990). Firing patterns of single units in the prefrontal cortex and neural network models. Network: Computation in Neural Systems, 1, 13–25.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of effective connectivity. J. Neurophysiol., 61(5), 900–917.
Boltzmann, L. (1887). Über die mechanischen Analogien des Zweiten Hauptsatzes der Thermodynamik. J. Reine Angew. Mathematik, 100, 201–212.
Bullock, T. H. (1970). The reliability of neurons. J. Gen. Physiol., 55, 565–584.
Dayhoff, J., & Gerstein, G. (1983a). Favored patterns in spike trains. I. Detection. J. Neurophysiol., 49, 1334–1348.
Dayhoff, J., & Gerstein, G. (1983b). Favored patterns in spike trains. II. Application. J. Neurophysiol., 49, 1349–1363.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613.
Eggermont, J. J. (1990). The correlative brain: Theory and experiment in neural interaction. New York: Springer-Verlag.
Gerstein, G., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Gerstein, G., & Clark, W. (1964). Simultaneous studies of firing patterns in several neurons. Science, 143, 1325–1327.
Griffith, J. S., & Horn, G. (1966). An analysis of spontaneous impulse activity of units in the striate cortex of unrestrained cats. J. Physiology (London), 186, 516–534.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity. Frankfurt am Main: Verlag Harri Deutsch.
Grün, S., Aertsen, A., Abeles, M., & Gerstein, G. (1993). Unitary events in multineuron activity: Exploring the language of cortical cell assemblies. In N. Elsner & M. Heisenberg (Eds.), Gene-brain-behavior (p. 470). New York: Georg Thieme Verlag.
Hsiao, S. S., O'Shaughnessy, D. M., & Johnson, K. O. (1993). Effects of selective attention on spatial form processing in monkey primary and secondary somatosensory cortex. J. Neurophysiology, 70(1), 444–447.
Hubel, D. H. (1959). Single unit activity in striate cortex of unrestrained cat. J. Physiology (London), 147, 226–238.
Kandel, E., Schwartz, J., & Jessel, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier.
Légendy, C. (1975). Three principles of brain function and structure. International Journal of Neuroscience, 6, 237–254.
Légendy, C., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53, 926–939.
Lindgren, B. W. (1976). Statistical theory (3rd ed.). New York: Macmillan.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.
Niebur, E., Koch, C., & Rosin, C. (1993). An oscillation-based model for the neural basis of attention. Vision Research, 33, 2789–2802.
Palm, G., Aertsen, A., & Gerstein, G. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997).
Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roy, A., Steinmetz, P., & Niebur, E. (1998). Rate dependence of unitary event analysis. Soc. Neurosci. Abstr., 28, 1513.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Smith, D. R., & Smith, G. K. (1965). A statistical analysis of the continuous activity of single cortical neurones in the cat unanaesthetized isolated forebrain. Biophys. J., 5, 47–74.
Softky, W. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5, 239–247.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Werner, G., & Mountcastle, V. B. (1963). The variability of central neural activity in a sensory system and its implications for the central reflection of sensory events. J. Neurophysiol., 26, 958–977.

Received December 4, 1998; accepted September 17, 1999.
LETTER
Communicated by Jean-François Cardoso
Estimating Functions of Independent Component Analysis for Temporally Correlated Signals

Shun-ichi Amari
RIKEN Brain Science Institute, Japan

This article studies a general theory of estimating functions of independent component analysis when the independent source signals are temporally correlated. Estimating functions are used for deriving both batch and on-line learning algorithms, and they are applicable to blind cases where the spatial and temporal probability structures of the sources are unknown. Most algorithms proposed so far can be analyzed in the framework of estimating functions. An admissible class of estimating functions is derived, and related efficient on-line learning algorithms are introduced. We analyze the dynamical stability and statistical efficiency of these algorithms. Different from the independently and identically distributed case, the algorithms work even when only the second-order moments are used. The method of simultaneous diagonalization of cross-covariance matrices is also studied from the point of view of estimating functions.

1 Introduction
Independent component analysis, or blind source separation, opens a new paradigm of signal processing applicable to many fields of science and engineering. It decomposes observed correlated signals into a sum of independent components. A number of batch estimation and on-line learning algorithms for blind source separation have been proposed (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Amari, Cichocki, & Yang, 1996; Oja, 1997, and many others; see also Amari & Cichocki, 1998, for a review). They work well even when the source signals have temporal correlations, which is the case in many practical applications. However, one can exploit these correlations to obtain more efficient and reliable algorithms that work well even when the probability structures of the temporal correlations are unknown. We apply the estimating function theory for semiparametric statistical models (Amari & Kawanabe, 1997; Amari & Cardoso, 1997) to this problem, because it can be used in the blind case where temporal correlations are unknown. We propose a general type of algorithm applicable to such cases. The proposed algorithms work well even when we use only the second moments of signals, provided the temporal structures of the sources are different. This is completely different from the independently and identically distributed (i.i.d., that is, white) case, where we need higher-order cumulants. This fact has been known (Tong, Liu, Soon, & Huang, 1991; Molgedey & Schuster, 1994), and batch algorithms of simultaneous diagonalization of cross-covariance matrices have been proposed (Tong et al., 1991; Molgedey & Schuster, 1994; Ikeda & Murata, 1998; Murata, Ikeda, & Ziehe, 1998; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Ziehe & Müller, 1998). The maximum likelihood method has also been studied (Pham & Garat, 1997; Pearlmutter & Parra, 1997; Attias & Schreiner, 1998; Davies, 1999). All of these can be analyzed from the point of view of estimating functions. A class of estimating functions is said to be admissible when, given an arbitrary estimator, one can find an estimator derived from the class that has smaller or at least equal asymptotic estimation error. Information geometry of estimating functions (Amari & Kawanabe, 1997; Amari & Cardoso, 1997) is applied to derive a class of admissible estimating functions. The class includes the score function of the maximum likelihood estimator (MLE). It will also be demonstrated that the method of simultaneous diagonalization of cross-correlation matrices is derived from an estimating function but does not belong to the admissible class. We finally present the analysis of dynamical stability and statistical efficiency of learning algorithms. Given an estimating function, we show that there exist an infinite number of equivalent estimating functions that give the same estimator. They form an equivalence class. However, when we construct on-line learning algorithms from those estimating functions, their dynamical behaviors differ: they have different dynamical stability and different convergence speed.

Neural Computation 12, 2083–2107 (2000). © 2000 Massachusetts Institute of Technology
We define a standardized estimating function (Amari, 1999a) among an equivalence class of estimating functions, and prove that the learning algorithm derived from the standardized one has the best asymptotic performance and is equivalent to the related batch estimator. We give the standardized estimating function explicitly by generalizing Amari (1999a).
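As a concrete point of reference for the second-order methods discussed above, here is a minimal AMUSE-style sketch of the simultaneous-diagonalization family (Tong et al., 1991; Molgedey & Schuster, 1994): whiten the mixtures, then rotate with the eigenvectors of a symmetrized time-lagged covariance. This illustrates that family only, not the estimating-function algorithms derived in this article; the AR(1) sources and mixing matrix below are assumptions chosen for the demo.

```python
import numpy as np

def amuse(x: np.ndarray, tau: int = 1) -> np.ndarray:
    """Second-order blind separation: whiten, then diagonalize one
    symmetrized lag-tau covariance. Returns an unmixing matrix W."""
    T = x.shape[1]
    x = x - x.mean(axis=1, keepdims=True)
    # whitening transform V such that V C0 V^T = I
    d, E = np.linalg.eigh(x @ x.T / T)
    V = E @ np.diag(d ** -0.5) @ E.T
    z = V @ x
    # symmetrized lag-tau covariance; its eigenvectors give the rotation
    C = z[:, :-tau] @ z[:, tau:].T / (T - tau)
    _, U = np.linalg.eigh((C + C.T) / 2)
    return U.T @ V

# demo: two AR(1) sources with distinct temporal structure, mixed linearly
rng = np.random.default_rng(0)
T = 20000
s = np.zeros((2, T))
for i, a in enumerate((0.9, -0.5)):          # distinct A_i(z^-1) required
    e = rng.standard_normal(T)
    for t in range(1, T):
        s[i, t] = a * s[i, t - 1] + e[t]
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s   # x = H s
y = amuse(x) @ x                             # recovered up to scale/permutation
```

Each row of `y` matches one source up to sign and scale; the method fails when two sources share the same lagged autocovariance, which is exactly the condition on distinct temporal structure noted above.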
2 Temporally Correlated Sources

2.1 Source Model. Let us consider n statistically independent information sources that generate time series {s_i(1), s_i(2), …, s_i(t), …}, i = 1, …, n, at discrete times t = 1, 2, …. Different sources are stochastically independent, so that any two signals s_i(t) and s_j(t'), i ≠ j, are independent at any times t and t'. However, signals from the same source may have temporal correlations, that is, s_i(t) and s_i(t') are correlated. We further assume that s_i(t) has zero mean, E[s_i(t)] = 0, where E denotes expectation.
Estimating Functions
2085
Let us consider a regular stationary stochastic model described by a linear model,

$$ s_i(t) = \sum_{k=1}^{p_i} a_{ik}\, s_i(t-k) + e_i(t), \tag{2.1} $$

where p_i may be a finite number or infinite and e_i(t) is a zero-mean i.i.d. (that is, white) time series called innovations. This article treats only such source models. The innovations may be gaussian or nongaussian. We assume that they satisfy

$$ E[e_i(t)] = 0, \qquad E\bigl[e_i(t)\, e_j(t')\bigr] = 0 \quad (i \neq j \ \text{or} \ t \neq t'). \tag{2.2} $$
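A minimal simulation of the source model in equation 2.1, assuming gaussian innovations (the coefficients are illustrative): for a stationary AR(2) source, the Yule–Walker equations give lag-1 autocorrelation a_1/(1 − a_2), which the sample estimate should approach for long series.

```python
import random

def ar_source(a, T, seed=0):
    """Generate one source by equation 2.1:
    s(t) = sum_k a[k-1] * s(t-k) + e(t), with gaussian white
    innovations e(t) of zero mean and unit variance."""
    rng = random.Random(seed)
    s = [0.0] * T
    p = len(a)
    for t in range(T):
        e = rng.gauss(0.0, 1.0)
        s[t] = sum(a[k] * s[t - 1 - k] for k in range(min(p, t))) + e
    return s

# AR(2) with a_1 = 0.5, a_2 = 0.2 (stationary); Yule-Walker predicts
# lag-1 autocorrelation a_1 / (1 - a_2) = 0.625
s = ar_source([0.5, 0.2], 50000)
```

The initial transient (s starts at zero rather than in the stationary distribution) is negligible for long series, mirroring the asymptotic approximation made for equation 2.16 below.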
When p_i is finite, this is an autoregressive model of degree p_i. By introducing the time-shift operator z^{-1} such that z^{-1} s_i(t) = s_i(t − 1), we can rewrite equation 2.1 as

$$ A_i\left(z^{-1}\right) s_i(t) = e_i(t), \tag{2.3} $$

where

$$ A_i\left(z^{-1}\right) = 1 - \sum_{k=1}^{p_i} a_{ik}\, z^{-k}. \tag{2.4} $$
By using the inverse of the polynomial A_i, the source signal is written as

$$ s_i(t) = A_i^{-1}\left(z^{-1}\right) e_i(t), \tag{2.5} $$

where A_i^{-1}(z^{-1}) is a formal infinite power series in z^{-1},

$$ A_i^{-1}\left(z^{-1}\right) = \sum_{k=0}^{\infty} b_k z^{-k}, \tag{2.6} $$

such that

$$ A_i^{-1}\left(z^{-1}\right) A_i\left(z^{-1}\right) = 1. \tag{2.7} $$
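Equation 2.7 determines the coefficients b_k of the expansion 2.6 recursively: matching powers of z^{-1} in the product gives b_0 = 1 and b_m = Σ_{k=1}^{min(m, p_i)} a_{ik} b_{m−k}. A small sketch, checked against the AR(1) closed form b_m = a^m:

```python
def inverse_coeffs(a, n_terms):
    """First n_terms coefficients b_m of A^{-1}(z^{-1}) = sum_m b_m z^{-m}
    for A(z^{-1}) = 1 - sum_k a[k-1] z^{-k}, obtained from equation 2.7
    by the recursion b_0 = 1, b_m = sum_k a_k * b_{m-k}."""
    p = len(a)
    b = [1.0]
    for m in range(1, n_terms):
        b.append(sum(a[k - 1] * b[m - k] for k in range(1, min(m, p) + 1)))
    return b

# AR(1) check: A(z^-1) = 1 - a z^-1 has inverse sum_m a^m z^-m
print(inverse_coeffs([0.5], 5))  # → [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The coefficients b_m are exactly the impulse response mentioned next: feeding a unit impulse through equation 2.5 reproduces them one lag at a time.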
The function A_i^{-1}(z^{-1}) represents the impulse response function of the ith source, by which {s_i(t)} is generated from the white signals {e_i(t)}. Let r_i(e_i) be the probability density function of e_i(t). Then the conditional probability density function of s_i(t) conditioned on the past signals can be written as

$$ p_i\{ s_i(t) \mid s_i(t-1), s_i(t-2), \ldots \} = r_i\left( s_i(t) - \sum_k a_{ik}\, s_i(t-k) \right) = r_i\left\{ A_i\left(z^{-1}\right) s_i(t) \right\}. \tag{2.8} $$
Therefore, for the vector source signal s(t) = [s_1(t), …, s_n(t)]^T at time t, where T denotes transposition, the conditional probability density is

$$ p\{ s(t) \mid s(t-1), s(t-2), \ldots \} = \prod_{i=1}^{n} r_i\left\{ A_i\left(z^{-1}\right) s_i(t) \right\}. \tag{2.9} $$

We introduce the following notations:

$$ \varepsilon = [e_1, \ldots, e_n]^T, \tag{2.10} $$

$$ A\left(z^{-1}\right) = \mathrm{diag}\left[ A_i\left(z^{-1}\right) \right], \tag{2.11} $$

$$ r(\varepsilon) = \prod_{i=1}^{n} r_i(e_i), \tag{2.12} $$
and use the abbreviations

$$ s_t = s(t), \qquad x_t = x(t) \tag{2.13} $$

when there is no confusion. We also denote the past signals by

$$ s_{t,\mathrm{past}} = \{ s_{t-1}, s_{t-2}, \ldots \}. \tag{2.14} $$

Then equation 2.9 is rewritten as

$$ p\{ s_t \mid s_{t,\mathrm{past}} \} = r\left\{ A\left(z^{-1}\right) s_t \right\}. \tag{2.15} $$
The joint probability density function of {s_1, s_2, …, s_t} is

$$ p(s_1, s_2, \ldots, s_t) = \prod_{i=1}^{t} p\{ s_i \mid s_{i,\mathrm{past}} \} = \prod_{i=1}^{t} r\left\{ A\left(z^{-1}\right) s_i \right\}, \tag{2.16} $$
where s_t (t ≤ 0) are set equal to 0. In practice, s_t (t < 0) is not equal to 0, so that equation 2.16 is an approximation that holds asymptotically, that is, for large t. The source models are specified by the n functions r_i(e_i) and the n inverse impulse response functions A_i(z^{-1}). Blind source separation should extract the independent signals from their (instantaneous linear) mixtures without knowing the exact forms of r_i(e_i) and A_i(z^{-1}). In other words, these are treated as unknown parameters.

2.2 Mixing Model. We consider the case when the observed signals are instantaneous mixtures of the source signals. (See Amari, Douglas, & Cichocki, 1998, and Zhang, Amari, & Cichocki, 1999, for the geometry of convolutive mixtures; and Parra, 1998, for the maximum likelihood approach to the convolutive case.) Let x(t) = [x_1(t), …, x_n(t)]^T be the set of linearly mixed signals,

$$ x_i(t) = \sum_{j=1}^{n} H_{ij}\, s_j(t), \tag{2.17} $$

or, in matrix-vector notation,

$$ x(t) = H s(t). \tag{2.18} $$
Here, H is an n × n unknown nonsingular mixing matrix. We assume that the number of observations is exactly the same as that of the source signals. We also assume that H does not include temporal operations, that is, it does not include z^{-1} and does not depend on t. All of these assumptions can be relaxed (see, e.g., Lewicki & Sejnowski, 1998; Cichocki, Thawonmas, & Amari, 1997; Amari, 1999b; and many others), but we stick with them in this article.

2.3 Unmixing Model. Let W be an n × n nonsingular matrix that transforms the observed x(t) to

$$ y(t) = W x(t). \tag{2.19} $$
This provides an estimate of the original source signals s(t). If we have W = H^{-1}, y(t) is exactly equal to s(t). So our problem is to estimate H^{-1} from the observations x(1), …, x(t). However, as is well known, the problem is ill posed, and H or its inverse W is not identifiable. The following facts are known (Comon, 1994):

1. The scales of the source signals are not identifiable.

2. The order in which the n extracted independent signals are arranged is not identifiable, so that there is an indeterminacy of permutations of the sources.
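These two indeterminacies can be checked numerically: if W separates, so does P D W for any permutation P and nonsingular diagonal D. The 2 × 2 mixing matrix, permutation, and scaling below are hypothetical choices for illustration.

```python
import numpy as np

H = np.array([[1.0, 0.6],
              [0.4, 1.0]])       # hypothetical mixing matrix
W = np.linalg.inv(H)             # an exact separating solution
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])       # permutation: swap the two outputs
D = np.diag([2.0, -0.5])         # rescaling, including a sign flip
s = np.array([[1.0], [3.0]])     # one sample of the sources
y = (P @ D @ W) @ (H @ s)        # still a rescaled permutation of s
print(y.ravel())                 # equals P @ D @ s, i.e. ≈ [-1.5, 2.0]
```

Both P D W and W recover the sources; only the labeling and scales differ, which is exactly why a norming constraint such as equation 2.20 below is needed to pin the scales down.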
However, we do not need to assume that no more than one source is gaussian, provided all A_i(z^{-1}) are different (Tong et al., 1991). We assume, for convenience, that the scale of each s_i satisfies some constraint, for example,

$$ E\left[ h_i\{ s_i(t) \} \right] = 0, \tag{2.20} $$

where h_i is a norming function. A simple way is to use h_i(s_i) = s_i^2 − 1. This determines the scales of the recovered signals uniquely. The matrix W is called a separating solution when y(t) = W x(t) becomes a rescaled permutation of the original s(t).

3 Statistical Framework

3.1 Semiparametric Statistical Model. Given t observations {x_1, …, x_t}, their joint probability density function is easily derived from equation 2.16 and s_t = W x_t, where W = H^{-1}. It is written as
$$ p\{ x_1, \ldots, x_t ; W, A, r \} = |\det W|^t \prod_{i=1}^{t} r\left\{ A\left(z^{-1}\right) W x_i \right\}, \tag{3.1} $$
which is specified by the unmixing parameter W = H^{-1} and the parameters A, r of the source models. The statistical model, equation 3.1, includes two kinds of parameters. One is W, which is to be estimated from the observed data. This is called a (matrix-valued) parameter of interest. The others are A and r, called nuisance parameters, which are unknown but which we have no particular interest in estimating. Here, the nuisance parameters include the n functions r_1, …, r_n, which