NOTE

Communicated by Norman Draper

Mixture Models Based on Neural Network Averaging

Walter W. Focke
[email protected]
Institute of Applied Materials, Department of Chemical Engineering, University of Pretoria, Pretoria 0001, South Africa
A modified version of the single hidden-layer perceptron architecture is proposed for modeling mixtures. A particularly flexible mixture model is obtained by implementing the Box-Cox transformation as transfer function. In this case, the network response can be expressed in closed form as a weighted power mean. The quadratic Scheffé K-polynomial and the exponential Wilson equation turn out to be special forms of this general mixture model. Advantages of the proposed network architecture are that binary data sets suffice for "training" and that it is readily extended to incorporate additional mixture components while retaining all previously determined weights.

1 Introduction

Complex mixtures arise in diverse situations. They abound in nature as rocks, ores, coal, seawater, crude oil, natural gas, and the atmosphere. Human activity also leads to the creation of mixtures. This may be deliberate and purposeful, as in the alloying of metals to create special alloys (e.g., stainless steel) or in the formulation of a chemical product (e.g., paint). It may also be unintended (e.g., through side reactions occurring during chemical conversions).

Occasionally it may be necessary to express the properties of a mixture in terms of its composition. Global models—those that are able to correlate mixture behavior over the entire factor space—are desirable. Predictive mechanistic theories are rare; thus, empirical mixture models are the norm. This implies the need for experimental data to fix adjustable parameters and also for statistical tests to validate the model and parameter choices. The selection of the functional form is crucial: statistical tests are valid only under the assumption that the postulated model is also valid (Bates & Watts, 1988). The empirical model should satisfy the following requirements:
- The mathematical form should be sufficiently flexible to correlate the underlying information without unnecessary restrictions.
- It should be consistent with available physical theory, especially in some limits of simplification.
- The parameters should be easy to interpret and estimate with common multivariate estimation techniques.
- It should be parameter parsimonious. Ideally, the coefficients should be obtainable from pure component properties or, at most, from binary mixture information (Prausnitz, Lichtenthaler, & de Azevedo, 1999). The model should then be predictive for the general multivariate case.
Let $\mathbb{R}_+$ denote the set of all positive real numbers and $\mathbb{R}_+^n$ its $n$-fold product. A mixture property is a measured scalar output $y \in \mathbb{R}_+$ that depends solely on the $n$-tuple $x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}_+^n$ that expresses mixture composition in terms of dimensionless fractional units (e.g., mol fraction). This implies the restrictions $x_i \geq 0$ and the simplex constraint:

$$x_1 + x_2 + \cdots + x_n = 1. \tag{1.1}$$
Cornell (2002) provides a comprehensive review of experimental design for mixtures. The recommended model forms for $y = y(x)$ are the Scheffé K-polynomials (Scheffé, 1958; Draper & Pukelsheim, 1998). They offer advantages such as homogeneity of regression terms and reduced ill conditioning of the information matrix (Prescott, Dean, Draper, & Lewis, 2002).

A common conjecture in chemical engineering is that only binary interactions between species need to be considered in mixtures (Hamad, 1998; Prausnitz et al., 1999). This is a strong assumption but is accepted here for the simplicity and convenience it introduces. If it were true, experimental data from the constituent binary subsystems would suffice to predict the multicomponent behavior. This disqualifies the higher-order Scheffé polynomials from consideration: they include ternary and higher-order interaction parameters that cannot necessarily be determined from binary data alone.

The use of classic mixture models is now well established (Cornell, 2002). Application of neural networks to mixture correlation and prediction is a more recent trend (Rowe & Colbourn, 2003). This communication considers simple global mixture models inspired by neural network architecture.

2 A Neural Network Model for Mixture Properties

Figure 1 shows the proposed neural network for modeling mixture properties. It is a modified version of the popular single hidden-layer perceptron (SHLP) architecture (Hagan, Demuth, & Beale, 1996). The following specifications apply:
- Each of the $n$ components in the mixture is associated with a specific input node–hidden layer neuron pair. This feature reflects the notion that the mixture components form a set of orthogonal basis vectors. Adding or removing such a node-neuron pair is equivalent to adding or removing this component from the mixture.
- The hidden neurons apply an arbitrary but strictly monotonic transfer function to the sum of weighted inputs $x_i$.
- The output neuron employs the inputs $x_i$ as variable weights and applies the inverse of the transform used by the hidden neurons. This feature embodies the desire that the output should reflect a weighted average of the pure component and binary mixture properties only. It also ensures the dimensional homogeneity of the output response. Physical properties are dimensional quantities expressed in terms of characteristic units. This requires that arguments to nonlinear functions such as logarithmic, exponential, or trigonometric functions be dimensionless numbers.

Figure 1: Neural network for mixture property estimation.
Conceptually, the output of this modified neural network can be interpreted as a generalized quasi-arithmetic mean of the hidden-layer neuron summations $u_i$ (Bullen, 2003):

$$y = M_n(u, x) := T^{-1}\!\left( \sum_{i=1}^{n} x_i\, T(u_i) \right). \tag{2.1}$$
This result is unaffected when the transform $T$ in equation 2.1 is replaced by an arbitrary linear combination $\alpha T + \beta$, provided $\alpha \neq 0$ (Bullen, 2003).
The $u_i$ in equation 2.1 are distinct positive functions defined on the weights $a_{ij} \in \mathbb{R}_+$ as follows:

$$u_i = \sum_{j=1}^{n} a_{ij} x_j. \tag{2.2}$$
The model has a total of up to $n^2$ adjustable parameters. Of these, the $n$ coefficients $a_{ii}$ are determined by pure component behavior, whereas the $n(n-1)$ coefficients $a_{ij}$ quantify the nonideal behavior of the corresponding binary mixtures. Thus, in theory at least, binary data sets suffice for "training" the network to predict multicomponent behavior. Furthermore, adding more components does not affect the weights of the existing network. In practice, the quality of predictions depends on the choice of transfer function and on whether the postulate that "binary interactions suffice" is well founded. The full mathematical form of equation 2.1, found by substituting equation 2.2 into equation 2.1, is

$$y = T^{-1}\!\left( \sum_{i=1}^{n} x_i\, T\!\left( \sum_{j=1}^{n} a_{ij} x_j \right) \right). \tag{2.3}$$
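To make the structure of equation 2.3 concrete, here is a minimal numerical sketch (not part of the original note; the transfer function, weights, and composition are illustrative placeholders):

```python
import numpy as np

def mixture_network(x, A, T, T_inv):
    """Network output y = T^{-1}( sum_i x_i T( sum_j a_ij x_j ) ), equation 2.3."""
    u = A @ x                      # hidden-layer sums, equation 2.2
    return T_inv(np.dot(x, T(u)))  # quasi-arithmetic mean, equation 2.1

# Hypothetical composition and weight matrix.
x = np.array([0.2, 0.5, 0.3])                  # sums to 1 (equation 1.1)
A = np.array([[1.0, 1.4, 0.8],
              [1.2, 2.0, 1.1],
              [0.9, 1.3, 3.0]])

y = mixture_network(x, A, np.log, np.exp)

# Invariance under T -> alpha*T + beta with alpha != 0, as noted above:
alpha, beta = 2.5, -1.0
y2 = mixture_network(x, A,
                     lambda u: alpha * np.log(u) + beta,
                     lambda v: np.exp((v - beta) / alpha))
assert np.isclose(y, y2)
```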
When $a_{ij} = a \; \forall\, i, j$, the nature of the transfer function is immaterial, as the network output is just $y = a$. With $a_{ij} = a_{ii} \; \forall\, i, j$, it returns the weighted quasi-arithmetic mean of the pure component property values $a_{ii}$:

$$y = T^{-1}\!\left( \sum_{i=1}^{n} x_i\, T(a_{ii}) \right). \tag{2.4}$$
Equation 2.4 states, in effect, that the transform of the dependent variable, $T(y)$, varies linearly with composition.

The simplest possible transfer function is the linear transformation $T(u) = u$. Implementing it in equation 2.1, for the mixture neural network of Figure 1, yields

$$y = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i=1}^{n} \sum_{j>i}^{n} (a_{ij} + a_{ji}) x_i x_j. \tag{2.5}$$
Equation 2.5 corresponds exactly to the second-degree Scheffé K-polynomial (Draper & Pukelsheim, 1998). It is conventional to correct for the overparameterization of this model by setting $a_{ij} = a_{ji}$ (Cornell, 2002; Scheffé, 1958). When $a_{ij} + a_{ji} = a_{ii} + a_{jj}$, the output defined by equation 2.5 is linear in the mole fractions:

$$y = \sum_{i=1}^{n} a_{ii} x_i. \tag{2.6}$$
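The reduction to the quadratic Scheffé form is easy to confirm numerically, continuing the illustrative sketch given after equation 2.3:

```python
# With a linear transfer function the network output equals the
# quadratic form x^T A x of equation 2.5.
identity = lambda u: u
assert np.isclose(mixture_network(x, A, identity, identity), x @ A @ x)
```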
Employing the logarithmic transformation $T(u) = \ln(u)$ instead yields the output

$$\ln y = \sum_{i=1}^{n} x_i \ln\!\left( \sum_{j=1}^{n} a_{ij} x_j \right), \quad\text{i.e.,}\quad y = \prod_{i=1}^{n} \left( \sum_{j=1}^{n} a_{ij} x_j \right)^{x_i}. \tag{2.7}$$
This is the weighted geometric mean mixture model, an exponential form of the semitheoretical Wilson (1964) model used for the excess Gibbs energy of mixtures.

The Box-Cox transformation (Box & Cox, 1964) is usually applied to reduce the heteroskedasticity of the residuals and to bring them closer to a normal distribution. It is defined by

$$T(u) = \frac{u^r - 1}{r} \quad\text{for}\quad r \neq 0, \tag{2.8a}$$
$$T(u) \to \ln(u) \quad\text{for}\quad r \to 0. \tag{2.8b}$$
When equations 2.8a and 2.8b define the transfer function, the neural network output resembles the generalized power mean constructed by Ku, Ku, and Zhang (1999):

$$y = k_n^{[r]}(u, x) := \lim_{s \to r^{+}} \left( \sum_{i=1}^{n} x_i\, [u_i(x)]^{s} \right)^{1/s}, \tag{2.9}$$
where $r \in \mathbb{R}$ and, as before, the $u_i$ are defined by equation 2.2. Equation 2.9 provides a flexible functional framework that includes both models described above: setting the exponent $r = 1$ yields the Scheffé quadratic K-polynomial, whereas $r = 0$ recovers the exponential Wilson model. With $a_{ij} = a_{jj} \; \forall\, i, j$, equation 2.9 also reduces to the linear form

$$y = \sum_{i=1}^{n} a_{ii} x_i. \tag{2.10}$$
When $a_{ij} = a_{ii} \; \forall\, i, j$, equation 2.9 simplifies to a special case of equation 2.4:

$$y^r = \sum_{i=1}^{n} x_i\, a_{ii}^r. \tag{2.11}$$
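The limiting cases of equation 2.9 can be checked with a short numerical sketch (again illustrative; the composition and weights are placeholders, not from the note):

```python
import numpy as np

def power_mean(x, A, r):
    """Weighted power mean k_n^[r](u, x) of equation 2.9, with u = A x."""
    u = A @ x
    if abs(r) < 1e-12:                    # r -> 0: weighted geometric mean
        return np.exp(np.dot(x, np.log(u)))
    return np.dot(x, u**r) ** (1.0 / r)

x = np.array([0.2, 0.5, 0.3])
A = np.array([[1.0, 1.4, 0.8],
              [1.2, 2.0, 1.1],
              [0.9, 1.3, 3.0]])

# r = 1 recovers the quadratic Scheffé K-polynomial (equation 2.5) ...
assert np.isclose(power_mean(x, A, 1.0), x @ A @ x)
# ... and r -> 0 the exponential Wilson (geometric mean) model (equation 2.7).
assert np.isclose(power_mean(x, A, 0.0), np.prod((A @ x) ** x))
```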
The model of equation 2.9 may appear to have too many adjustable parameters. Strategies exist to reduce their number (Darroch & Waller, 1985). For example, when $a_{ij} = b_{ii} \; \forall\, j \neq i$, equation 2.2 simplifies to

$$u_i = a_{ii} x_i + b_{ii} (1 - x_i). \tag{2.12}$$

This leads to a drastic reduction of the number of adjustable parameters, from $n^2$ to a total of just $2n$ (excluding the parameter $r$).

3 Model Consistency

Means, such as the generalized power means, are defined by an infinite sequence of continuous and strictly monotonic real functions

$$k_1^{[r]}(u_1, x_1 = 1) = a_{11};\quad k_2^{[r]}(u_1, u_2, x_1, x_2);\quad \ldots;\quad k_n^{[r]}(u_1, u_2, \ldots, u_n, x_1, x_2, \ldots, x_n);\ \ldots$$

associated with a characteristic set of axioms (Bullen, 2003). Inspection reveals that this model satisfies the following elementary consistency requirements (Hamad, 1998):
- Parameter values do not change when more components are added to the mixture.
- The mixture property $y$ reduces to the pure component value when any mole fraction approaches unity, that is, $k_n^{[r]}(u, e_k) = a_{kk}$, where $e_k = (0, \ldots, 0, x_k = 1, 0, \ldots, 0)^T$ is an element of the orthonormal basis of $\mathbb{R}_+^n$.
- The relation for an $n$-component mixture reduces to the corresponding $(n-1)$-component form in the limit of infinite dilution of one of the components: $\lim_{x_n \to 0} k_n^{[r]}(u, x) = k_{n-1}^{[r]}(u, x)$.
The following axioms are relevant with regard to mixture-model consistency:

Symmetry: $k_n^{[r]}(u, x)$ is not changed when the $u$ and $x$ are permuted simultaneously. This follows from the commutative law of addition. Symmetry implies that predicted property values are independent of the way in which component indices are assigned.

Reflexivity: $k_n^{[r]}(a, a, \ldots, a, x_1, x_2, \ldots, x_n) = a$. Note that $u_i = a$ when $a_{ij} = a \; \forall\, j$.
Decomposability: According to Michelsen and Kistenmacher (1990), a consistent model is also invariant with respect to dividing one component into two or more identical subcomponents. The appendix shows that the current model conforms to this requirement.

Homogeneity: $k_n^{[r]}(u, x)$ is homogeneous of order one, that is, for all $\lambda$:

$$k_n^{[r]}(\lambda u_1, \lambda u_2, \ldots, \lambda u_n, x_1, \ldots, x_n) = \lambda\, k_n^{[r]}(u_1, u_2, \ldots, u_n, x_1, \ldots, x_n). \tag{3.1}$$

The homogeneity property ensures that the model is dimensionally homogeneous. The $u_i$ are linear combinations of the $a_{ij}$, $j = 1, 2, \ldots, n$. Therefore, the model is also homogeneous of degree one in the parameters $a_{ij}$. According to Euler's theorem on homogeneous functions of degree one, it follows that

$$y = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \frac{\partial y}{\partial a_{ij}}. \tag{3.2}$$
The relative condition number of $y$ with respect to the parameter $a_{ij}$ is defined as (Higham, 2002)

$$C_R(a_{ij}) = \frac{a_{ij}}{y} \frac{\partial y}{\partial a_{ij}}. \tag{3.3}$$
This number quantifies the sensitivity of a function with respect to small changes in a parameter value $a_{ij}$ as follows. If $|C_R(a_{ij})| \ll 1$, the function is very well conditioned; when $|C_R(a_{ij})| \approx 1$, it is well conditioned; but if $|C_R(a_{ij})| \gg 1$, it is considered to be ill conditioned. Combining Higham's (2002) definition, equation 3.3, with Euler's result, equation 3.2, reveals that the relative condition numbers sum to unity:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} C_R(a_{ij}) = 1. \tag{3.4}$$
The generalized weighted power mean is a monotonically increasing function of the $a_{ij}$. Therefore, all $C_R(a_{ij}) \geq 0$, and it follows immediately from equation 3.4 that this model is intrinsically well conditioned with respect to all of its adjustable parameters.
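The unit-sum property of equation 3.4 is easy to confirm numerically; the following sketch (continuing the illustrative power_mean example above) estimates the $C_R(a_{ij})$ by central finite differences:

```python
def condition_numbers(x, A, r, h=1e-6):
    """Central-difference estimate of C_R(a_ij) = (a_ij / y) dy/da_ij (eq. 3.3)."""
    y0 = power_mean(x, A, r)
    C = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap, Am = A.copy(), A.copy()
            Ap[i, j] += h
            Am[i, j] -= h
            C[i, j] = A[i, j] / y0 * (power_mean(x, Ap, r) -
                                      power_mean(x, Am, r)) / (2 * h)
    return C

C = condition_numbers(x, A, r=0.5)
assert np.isclose(C.sum(), 1.0)   # equation 3.4
assert np.all(C >= 0)             # monotonicity in the a_ij
```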
4 Conclusion

Neural network computing is often equated with a black-box modeling approach. Critics also point out that the values of the weights in the network per se have no physical meaning. Thus, it is hard to account for, and properly validate, results obtained with a neural network. In this study, quite the opposite was achieved. Neural network architecture analysis led to conceptual insight. It revealed an underlying unity between the empirical quadratic Scheffé polynomials and the semitheoretical Wilson models. Both models are special cases of the more general model defined by equation 2.9.

Appendix: Decomposability of the Generalized Weighted Power Mean

Assume that components $n-1$ and $n$ are in fact identical, with $a_{(n-1),k} = a_{n,k} \; \forall\, k$ and therefore also $u_{n-1}(x) = u_n(x) \; \forall\, x$. This is justified by symmetry. To show that the model is invariant with respect to dividing one component into two or more identical subcomponents, it is sufficient to prove that

$$k_n^{[r]}(u_1, \ldots, u_{n-1}, u_n, x_1, \ldots, x_{n-1}', x_n') = k_{n-1}^{[r]}(u_1, \ldots, u_{n-1}, x_1, \ldots, x_{n-1}).$$

Here, $x_{n-1} = x_{n-1}' + x_n'$, with $x_{n-1}'$ and $x_n'$ denoting the mole fractions associated with the two identical components:

$$u_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{i(n-1)} x_{n-1}' + a_{in} x_n'
     = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{i(n-1)} (x_{n-1}' + x_n')
     = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{i(n-1)} x_{n-1}.$$

From this it also follows that $u_{n-1} = u_n$ and thus that

$$x_{n-1}' u_{n-1}^r + x_n' u_n^r = (x_{n-1}' + x_n') u_{n-1}^r = x_{n-1} u_{n-1}^r.$$

Now

$$k_n^{[r]}(u, x) = (x_1 u_1^r + x_2 u_2^r + \cdots + x_{n-1}' u_{n-1}^r + x_n' u_n^r)^{1/r}
               = (x_1 u_1^r + x_2 u_2^r + \cdots + x_{n-1} u_{n-1}^r)^{1/r}
               = k_{n-1}^{[r]}(u, x).$$

This completes the proof.
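The invariance is also easy to check numerically with the power_mean sketch above (values remain hypothetical; the added row and column of A duplicate those of the component being split):

```python
# Split component 2 (mole fraction 0.5) into two identical halves.
x_split = np.array([0.2, 0.25, 0.3, 0.25])
A_split = np.array([[1.0, 1.4, 0.8, 1.4],
                    [1.2, 2.0, 1.1, 2.0],
                    [0.9, 1.3, 3.0, 1.3],
                    [1.2, 2.0, 1.1, 2.0]])
for r in (1.0, 0.5, 0.0, -1.0):
    assert np.isclose(power_mean(x_split, A_split, r), power_mean(x, A, r))
```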
Acknowledgments

I gratefully acknowledge financial support for this research from the THRIP program of the Department of Trade and Industry and the National Research Foundation of South Africa, as well as Xyris Technology.

References

Bates, D. M., & Watts, D. G. (1988). Nonlinear regression analysis and its applications. New York: Wiley.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. J. Roy. Statist. Soc. B, 26, 211–243; discussion, 244–252.
Bullen, P. S. (2003). Handbook of means and their inequalities. Dordrecht: Kluwer.
Cornell, J. A. (2002). Experiments with mixtures (3rd ed.). New York: Wiley.
Darroch, J. N., & Waller, J. (1985). Additivity and interaction in three-component experiments with mixtures. Biometrika, 72, 153–163.
Draper, N. R., & Pukelsheim, F. (1998). Mixture models based on homogeneous polynomials. J. Statist. Plann. Inference, 71, 303–311.
Hagan, M. T., Demuth, H. B., & Beale, M. (1996). Neural network design. Boston: PWS Publishing.
Hamad, E. Z. (1998). Exact limits of mixture properties and excess thermodynamic functions. Fluid Phase Equilibria, 142, 163–184.
Higham, N. J. (2002). Accuracy and stability of numerical algorithms (2nd ed.). Philadelphia: SIAM.
Ku, H. T., Ku, M. C., & Zhang, X. M. (1999). Generalized power means and interpolating inequalities. Proc. Amer. Math. Soc., 127, 145–154.
Michelsen, M. L., & Kistenmacher, H. (1990). On composition-dependent interaction coefficients. Fluid Phase Equilibria, 58, 229–230.
Prausnitz, J. M., Lichtenthaler, R. N., & de Azevedo, E. G. (1999). Molecular thermodynamics of fluid-phase equilibria. Upper Saddle River, NJ: Prentice Hall.
Prescott, P., Dean, A., Draper, N., & Lewis, S. (2002). Mixture experiments: Ill-conditioning and quadratic model specification. Technometrics, 44, 260–268.
Rowe, R. C., & Colbourn, E. A. (2003). Neural computing in product formulation. Chem. Educator, 8, 1–8.
Scheffé, H. (1958). Experiments with mixtures. J. Roy. Statist. Soc. B, 20, 344–360.
Wilson, G. M. (1964). Vapor-liquid equilibrium. XI. A new expression for the excess free energy of mixing. J. Am. Chem. Soc., 86, 127–130.
Received March 23, 2005; accepted May 20, 2005.
LETTER

Communicated by Maxim Bazhenov

Sensory Memory for Odors Is Encoded in Spontaneous Correlated Activity Between Olfactory Glomeruli

Roberto F. Galán
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, 10115 Berlin, Germany, and Department of Biological Sciences and Center for the Neural Basis of Cognition, Mellon Institute, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Marcel Weidert
[email protected]

Randolf Menzel
[email protected]
Institute for Neurobiology, Freie Universität Berlin, 14195 Berlin, Germany

Andreas V. M. Herz
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, 10115 Berlin, Germany

C. Giovanni Galizia
[email protected]
Department of Entomology, University of California, Riverside, CA 92521, U.S.A.
Sensory memory is a short-lived persistence of a sensory stimulus in the nervous system, such as iconic memory in the visual system. However, little is known about the mechanisms underlying olfactory sensory memory. We have therefore analyzed the effect of odor stimuli on the first odor-processing network in the honeybee brain, the antennal lobe, which corresponds to the vertebrate olfactory bulb. We stained output neurons with a calcium-sensitive dye and measured across-glomerular patterns of spontaneous activity before and after a stimulus. Such a single-odor presentation changed the relative timing of spontaneous activity across glomeruli in accordance with Hebb’s theory of learning. Moreover, during the first few minutes after odor presentation, correlations between the spontaneous activity fluctuations suffice to reconstruct the stimulus. As spontaneous activity is ubiquitous in the brain, modifiable fluctuations could provide an ideal substrate for Hebbian reverberations and sensory memory in other neural systems.
1 Introduction

Animals are able to evaluate a sensory stimulus for a certain time after stimulus offset, as in trace conditioning or delay learning (Clark, Manns, & Squire, 2002; Grossmann, 1971). Such sensory memories have been extensively investigated in the visual system, such as for afterimages (i.e., visual persistence) or iconic memory, and for the acoustic system (Crowder, 2003), and imply that a neural representation of the stimulus remains active after the stimulus, for example, during the time interval between a presented cue and a task to be performed or an association to be made. "Delayed matching and nonmatching to sample" paradigms also rely on a temporary storage of sensory information, generally referred to as working memory (Del Giudice, Fusi, & Mattia, 2003). Such tasks have also been successfully solved by the honeybee, Apis mellifera, and prove that it possesses an exquisite sensory and working memory (Giurfa, Zhang, Jenett, Menzel, & Srinivasan, 2001; Grossmann, 1971; Menzel, 2001). In analyses of the neural basis of working memory in vertebrates, cortical neurons have been found that elevate their discharge rate during a delay period (Fuster & Alexander, 1971). These findings suggest a straightforward realization of Hebbian "reverberations" (Hebb, 1949) in that persistently active delay cells provide the memory trace and thus allow the animal to compare sequentially presented stimuli (Amit & Mongillo, 2003).

In all of these studies, however, the investigation of sensory memory was embedded into a more complex paradigm in order to allow for a behavioral readout. Therefore, it cannot be excluded that the physiological traces of sensory memory contained a context-dependent or task-dependent component that is difficult to isolate. Taking a purely physiological approach and using the honeybee as an experimental animal, we asked whether a stimulus alone could modify brain activity in a way that would suggest a sensory memory.

Odor stimuli are particularly salient for honeybees (Menzel & Müller, 1996). We therefore sought an initial memory trace following a nonreinforced odor stimulus. We find that the relative timing of spontaneous activity events in the antennal lobe is modified after a passive olfactory experience. This change follows the Hebbian covariance rule: pairs of coactivated or coinhibited glomeruli increased their correlation; glomeruli with an opposite response sign decreased their correlation (Sejnowski, 1977). Unlike the implications of Hebb's rules in developmental studies ("fire together, wire together"), the effect observed here was short-lived, with a decay time of a few minutes. We therefore suggest that this form of short-term Hebbian plasticity in the honeybee antennal lobe serves as an olfactory sensory memory.

2 Materials and Methods

2.1 Imaging. Bees were prepared as described elsewhere (Sachse & Galizia, 2002). Briefly, forager bees (Apis mellifera carnica) were collected
from the hive, cooled for anesthesia, and restrained in Plexiglas stages. A window was cut into the cuticle to expose the brain. Then the calcium-sensitive tracer dye FURA-dextran (MW 10,000, Molecular Probes) was injected into the tract of PN axons that leads from the antennal lobe (AL) to the mushroom bodies (lateral antenno-cerebralis tract, lACT). After 3 to 5 hours, the bees were tested for successful staining of PNs using a fluorescence microscope: successful staining was evident when the PN cell bodies were brightly fluorescent under 380 nm light exposure. We then recorded the calcium activity in the antennal lobe for 4 minutes, at a rate of six image pairs (340 nm/380 nm) per second. (Some animals were recorded for longer periods.) After 2 minutes (at half time), we presented an odor for 4 seconds, using a computer-controlled olfactometer (Galizia, Joerges, Küttner, Faber, & Menzel, 1997). Odors differed between animals and were 1-hexanol, 1-octanol, limonene, and a mixture of limonene and linalool (all from Sigma-Aldrich). We excluded the 4 seconds before, during, and after stimulus presentation (a total of 12 seconds) from the analysis to ensure that no direct stimulus response would contaminate the analysis. Great care was taken not to expose the animal to any test odor before the experiment, and no animal was investigated twice. Recordings were made using a Till-Photonics monochromator and CCD-based imaging system (Olympus BX50WI microscope, Olympus 20W dip objective, NA 0.5; dichroic 410 nm, LP 440 nm). Nine animals were examined. The experimental protocol complied with German law governing animal care.

2.2 Data Preanalysis. We calculated the ratio between the 340 nm and the 380 nm images for each measured point in time. In the FURA-dextran staining, individual glomeruli were clearly visible. We could thus identify the glomeruli on the basis of the morphological atlas of the honeybee AL (Galizia, McIlwrath, & Menzel, 1999). For each AL, we identified between 8 and 15 glomeruli (mean = 11.6; SD = 2.1). Time traces for each glomerulus were calculated by spatially averaging the pixel values belonging to that glomerulus. For each AL, this procedure yielded two sets of matrices (before and after stimulation) consisting of between 8 and 15 glomeruli at 696 and 672 points in time, respectively (6 images per second for 2 minutes, minus 4 and 8 seconds, respectively).
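As a rough illustration of this preprocessing step (a minimal sketch; the array names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def glomerular_traces(img340, img380, masks):
    """Ratio imaging followed by spatial averaging per glomerulus.

    img340, img380 : fluorescence stacks of shape (T, H, W)
    masks          : boolean masks of shape (n_glomeruli, H, W)
    returns        : array of shape (n_glomeruli, T), one trace per glomerulus
    """
    ratio = img340 / img380                           # FURA ratio per pixel
    return np.stack([ratio[:, m].mean(axis=1) for m in masks])
```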
2.3 Mathematical Analysis. The measured traces were high-pass filtered (cut-off frequency 0.025 Hz) to remove effects caused by a differential offset in different glomeruli, thus generating zero-mean time series. (Traces in Figures 1B and 1C are not filtered.) Odor responses were described as vectors whose components measure the maximum glomerular activity elicited by the specific odor. Glomeruli were categorized as activated or inhibited by an odor by visual inspection of the response trace. Pair-wise correlations were calculated as the correlation coefficient of the glomeruli's spontaneous activity. Lagged correlations were normalized to the correlation coefficient for zero lag and were calculated for relative shifts at 1 second intervals between −5 seconds and +5 seconds. All lags gave qualitatively the same results; Figure 3B shows the data for a lag of 3 seconds. The similarity between odor response and spontaneous activity was measured as their scalar product at every measured point in time. Correlation matrices across glomeruli were calculated for different time windows, taking the entire stretch before the stimulus, the entire stretch after the stimulus, or four 1 minute intervals. We calculated the difference matrix as the difference between the correlation matrices after and before the stimulus. Significance for each element of the differential correlation matrix was calculated by bootstrapping the original data (2000 replications, α = 0.05). A principal component analysis was performed on the correlation matrices and the difference matrix, and the first principal components (PCs) were extracted. The significance of the correspondence between glomerular activity patterns and the first PC of the spontaneous activity matrix was assessed with Kendall's nonparametric correlation coefficient r (Press, Teukolsky, Vetterling, & Flannery, 1992). Distributions of similarities across time and animals met normality conditions and were tested with an ANOVA.

3 Results

Odors evoke combinatorial patterns of activity across olfactory glomeruli in the first brain area that processes olfactory input, the insect antennal lobe (AL) or the vertebrate olfactory bulb (Hildebrand & Shepherd, 1997; Korsching, 2002). These patterns can be visualized using optical imaging techniques (Galizia & Menzel, 2001; Sachse & Galizia, 2002). In honeybees, olfactory projection neurons (PNs) show a high degree of spontaneous activity, that is, activity in the absence of an olfactory stimulus (Abel, Rybak, & Menzel, 2001; Müller, Abel, Brandt, Zöckler, & Menzel, 2002), which can also be found in other species (Hansson & Christensen, 1999). We have not investigated the driving force of this spontaneous background activity, but it appears likely to be controlled by a recurrent inhibitory network within the antennal lobe, since spontaneous activity is blocked by GABA and increased by the chloride channel blocker picrotoxin; it may also rely on background activity from sensory neurons (Sachse & Galizia, 2005). In many neurons, there is a direct relationship between the intracellular calcium concentration and the neuron's activity (Haag & Borst, 2000; Ikegaya et al., 2004). PNs that show a high occurrence of action-potential bursts in electrophysiological recordings also show a continuous fluctuation of intracellular calcium levels at the same temporal scale (Galizia & Kimmerle, 2004). Therefore, it is possible to use calcium imaging to study bursts of spontaneous activity in olfactory PNs (Sachse & Galizia, 2002). To measure the spatiotemporal extent of such spontaneous activity patterns, we applied the calcium-sensitive dye FURA-dextran to the axon tract leaving the AL and obtained a backfill of PNs within the AL. Optical recording of the PNs confirmed that their intracellular calcium
concentration was constantly changing; the spontaneous activity in the AL did not consist of longer periods of sustained activity but of short, sporadic activity bouts (see Figures 1A to 1D; see also additional data on the Web). These spontaneous events were glomerular in the sense that the spatial extent of individual activity spots always corresponded to a glomerulus in the morphological view. Their amplitude varied both across glomeruli and in time.

Stimulating the antennae with an odor led to characteristic odor-evoked activity patterns. The glomeruli activated in these patterns corresponded to those expected for the tested odors from previous studies (Galizia, Sachse, Rappert, & Menzel, 1999; Sachse & Galizia, 2002). The magnitude of the intracellular calcium increase in an odor response was up to an order of magnitude higher than that of typical spontaneous activity fluctuations (see Figure 1B). After stimulation, odor-evoked calcium levels decayed back to baseline within a few seconds. This is consistent with measurements from olfactory receptor neurons, which in most cases stop firing at the end of an olfactory stimulus (de Bruyne, Foster, & Carlson, 2001; Getz & Akers, 1994), and supports the notion that the phenomena reported below are intrinsic to the AL and do not reflect an odor-specific structure in the input from receptors. This finding suggests that if there is a sensory memory within the AL, it cannot rely on persistent activity, that is, increased mean discharge rates after stimulation, as observed in other systems (Amit & Mongillo, 2003; Fuster & Alexander, 1971).

Figure 1: Unrewarded odor stimuli affect the spontaneous activity in the honeybee antennal lobe (AL), as illustrated by data from one bee. (A, left) Fluorescence image of the AL with projection neurons (PNs) stained with Rhodamin-dextran. (A, right) Sequence of spontaneous activity frames before, during (red frame), and after odor exposure (odor: 1-octanol). Note the similarity of the activity patterns at 80 seconds and 110 seconds with the odor response. Activity is color coded with equal scaling in all images. (For the entire measurement, go to http://galizia.ucr.edu/spontActivity.) (B) Raw traces for three identified glomeruli (red: T1-17; green: T1-33; black: T1-62) over a 240 second stretch. The stimulus lasted for 4 seconds, starting at t = 0, as indicated by the bar. After the stimulus, activity fluctuations repeatedly co-occur. (C) Close-up view of two stretches in B. Open triangles indicate some occurrences where either glomerulus 17 (red) or 33 (green) was independently active. Closed triangles mark occurrences where both were active at the same time. Not all such instances are marked. (D) Projection of the spontaneous activity across all identified glomeruli before, during, and after odor presentation onto the activity during odor presentation itself. Filled triangles above the trace indicate all instances where the similarity measure is greater than 0.75 (dotted line). Such events are more prevalent after stimulus presentation, where the trace fluctuates more strongly rather than staying for longer periods in the high-similarity regime. The activity trace was high-pass filtered.
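The projection of Figure 1D can be sketched roughly as follows (illustrative only; the paper describes the similarity as a scalar product in the methods and as Kendall's correlation below, so the cosine normalization here is an assumption):

```python
import numpy as np

def similarity_trace(traces, response):
    """Frame-by-frame similarity of spontaneous activity to the odor pattern.

    traces   : (n_glomeruli, T) zero-mean, high-pass-filtered activity
    response : (n_glomeruli,) odor-evoked activity pattern
    returns  : (T,) similarity values in [-1, 1]
    """
    r = response / np.linalg.norm(response)
    frames = traces / np.linalg.norm(traces, axis=0, keepdims=True)
    return frames.T @ r

# "Attraction": fraction of frames that closely revisit the odor pattern.
# attraction = np.mean(similarity_trace(traces, response) > 0.75)
```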
However, the relative timing of activity events in some glomeruli appeared to be altered. For instance, in Figure 1B, simultaneous fluctuations of glomeruli 17 and 33 are seen more frequently after the odor stimulus than before. We found that the glomeruli that increased their coactivity were those that had been activated by the odor stimulus. As shown in Figure 1C, coactive events between such glomeruli could also occur before stimulation (see filled triangles), as could independent activity bouts (empty triangles). After stimulation, coactive events were more frequent (filled triangles), and independent bouts were rare (none in the example here). Sensory memory in the AL may thus change the relative timing of spontaneous activity across
AL glomeruli. To extend this observation to entire across-glomeruli patterns, we determined the similarity (Kendall's correlation) between odor response and spontaneous activity at every measured point in time (see Figure 1D). The resulting trace varies between 1 (perfect match, as at stimulus onset) and −1. Because the activity traces were high-pass filtered before the projection, the values right before stimulus onset lead to the projection having value −1. Instances where the similarity exceeds a threshold of 0.75 are marked by filled triangles. Clearly, after stimulation, spontaneous events resembling the odor-evoked response pattern became more prevalent: the spontaneous activity is on average more strongly attracted to the odor-evoked pattern.

We then investigated which specific properties of the spontaneous activity changed after stimulation. While overall activity increased in some individuals and decreased in others, across animals the standard deviation of the spontaneous activity was not affected by odor exposure (p = 0.426, Wilcoxon paired-sample test; see Figure 2A). This implies that the overall amplitude of the fluctuations did not increase; across animals, there was no increase in baseline activity. However, the total (summed) duration of spontaneous events that mimic the odor-evoked pattern (i.e., exceeding the 0.75 correlation threshold; see also Figure 1D) clearly differed after odor exposure (p < 0.05, Wilcoxon paired-sample test) and increased more than twofold (see Figure 2B). There was no change in the amplitude of any frequency component of the spontaneous activity, as demonstrated by approximately equal power spectra before and after stimulation (see Figure 2C), and no change in overall spontaneous activity, as seen when comparing the histograms of activity occurrences before and after stimulation (see Figure 2D). Thus, the short-term odor memory reported here is encoded only in the correlated spontaneous fluctuations, not in their amplitude or their mean value.

Which glomeruli change their relative activity timing? To address this question, we sorted all pairs of glomeruli into three categories based on their response properties to the presented odor: pairs where the tested odor led to increased intracellular calcium concentration in both glomeruli, pairs where one partner increased and the other decreased its calcium level, and pairs where at least one of the two did not respond to the odor. We then analyzed their correlation before and after odor exposure (see Figure 3A). We found that coactive pairs increased their correlation, pairs with opposing sign decreased their correlation, and nonactive pairs remained unchanged. Thus, the correlation changes followed a Hebbian rule: pairs of glomeruli where both were excited or inhibited by the stimulus increased their spontaneous coherence after stimulation; pairs of glomeruli where one was excited and the other inhibited decreased their correlation. To test whether an unspecific change in spontaneous activity might explain the changed correlations, we shifted the activity traces against each other after and before the odor stimulus and recalculated the pair-wise correlations on these traces (see Figure 3B). After relative shifting, a correlation purely caused by increased
Figure 2: Statistical analysis of spontaneous AL activity before and after stimulation. (A) Standard deviations of the total AL activity fluctuations for each animal before odor exposure plotted against the same after odor exposure. No systematic change is observed (p = 0.426, Wilcoxon paired-sample test). (B) Odor-specific attraction, defined by the total (summed) duration of spontaneous events that closely resemble the odor-evoked activity pattern (r > 0.75; see also Figure 1D) and given as a percentage of the total recording time, after odor exposure plotted against the same before odor exposure. After stimulation, attraction increased significantly across animals (p < 0.05, Wilcoxon paired-sample test) and was on average more than twice as large as before stimulation. (C) Power spectrum of the spontaneous activity before (thin line) and after (thick line) odor stimulation, averaged over all glomeruli in all bees. (D) Histogram of spontaneous activity (ΔF/F values in high-pass-filtered traces) before and after stimulation across glomeruli. The distribution does not change after the stimulus, indicating that there is no overall increase in spontaneous activity. Both distributions are slightly supergaussian (black lines represent fits to gaussian distributions).
Figure 3: (A) Pair-wise correlation between olfactory glomeruli. As predicted by the Hebbian learning rule, pairs of glomeruli that were both excited or both inhibited by the odor significantly increased their correlation (p < 0.05, Wilcoxon paired-sample test); pairs of glomeruli where one was excited and the other one inhibited significantly decreased their correlation (p < 0.05); pairs of glomeruli where at least one did not respond to the odor did not significantly change their correlation (p > 0.05). (B) Pair-wise lagged correlation between olfactory glomeruli. By shifting the activity time series of each glomerulus with respect to the others (lag = 3 seconds), the correlations between glomeruli were generally reduced, and the increased correlation of coactive glomeruli was abolished. The decrease of correlation in the other groups is a consequence of the statistical properties of these traces (see the text). A similar picture was obtained for any time lag larger than 1 second.
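The core computations behind Figures 3 and 4 can be sketched as follows (a minimal illustration; the function names and bookkeeping are assumptions, not the authors' code):

```python
import numpy as np

def first_pc(C):
    """First principal component: eigenvector of the correlation matrix C
    whose eigenvalue has the largest magnitude."""
    w, v = np.linalg.eigh(C)
    pc = v[:, np.argmax(np.abs(w))]
    return pc if pc.sum() >= 0 else -pc   # fix an arbitrary sign convention

def lagged_correlation(a, b, lag):
    """Correlation of two traces after shifting one by `lag` samples (lag > 0)."""
    return np.corrcoef(a[:-lag], b[lag:])[0, 1]

# C_before = np.corrcoef(traces_before)   # pair-wise correlation matrices
# C_after  = np.corrcoef(traces_after)
# diff     = C_after - C_before           # compare first_pc(C_after) and
#                                         # first_pc(diff) to the odor response
```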
activity should remain visible, while a correlation caused by specific timing of co-occurring events should decrease or disappear. We found that all correlation increases were lost or reduced, showing that the observed effect is due to precisely timed coactivity rather than to an increased baseline activity. It should be noted that there is a small background correlation within the antennal lobe across all glomeruli, as evident in Figure 3A. Shifting the data reduces this background correlation (compare values between Figure 3A and Figure 3B); this effect also applies to the traces after odor delivery, leading to the significances in Figure 3B for the non-co-excited glomeruli, where the decrease in correlation due to the stimulus and that due to shifting add up to a significant effect (see Figure 3B).

We next calculated the correlation matrix of glomerular activity before and after stimulus presentation (left panels of Figures 4A and 4C) by computing the pair-wise correlations between the activity time courses. We derived the correlation changes by subtracting the two matrices (left panel of Figure 4D). In the example shown, glomeruli 17, 28, 33, 36, and 52 increased their pair-wise correlation; they were coactive more often than before odor exposure. These glomeruli are the ones that responded most strongly to the odor (see Figure 4B). Pairs of glomeruli that were both inhibited during stimulation tended to increase their correlation, too, as shown by the pairs 23-49 and 29-37 in the left panel of Figure 4D. In contrast, most pairs where one glomerulus was excited by the odor and the other was inhibited decreased their correlation. This phenomenon is clearly apparent for pairs 17-37, 17-49, and 23-52. Thus, it seems that the pair-wise correlations of glomerular activity well after stimulus offset resembled the odor-evoked response pattern.

To test this key hypothesis, we performed a principal component analysis (PCA). PCs are the eigenvectors of a correlation matrix. In particular, the first PC corresponds to the eigenvector whose eigenvalue has the largest magnitude. In our case, it represents the dominant pattern of spontaneous activity in the sense that its average projection onto the spontaneous activity is maximal (Beckerman, 1995). We performed a PCA of the correlation matrices before and after stimulation, as well as of the difference of both. We then quantified the correspondence between the first principal component (right panels in Figures 4A, 4C, and 4D) and the odor-evoked pattern (see Figure 4B, right side only). There was no significant relationship before odor stimulation (r = 0.31, p = 0.143). After odor stimulation, the correspondence was highly significant (r = 0.79, p < 0.001). The same holds for the first PC of the difference matrix (r = 0.74, p < 0.001). This finding was confirmed across animals: the correlation matrices derived from the spontaneous activity after stimulation with an odor clearly reflected the pattern that was elicited by the odor, with mean correlation values between first PC and odor-evoked response of 0.52 after 1 minute and 0.39 after 2 minutes, as compared to 0.22 and 0.19 for 2 and 1 minutes before stimulation, respectively. These
differences were highly significant (two-way ANOVA, p < 0.001, with no significant difference between animals, p = 0.11; see Figure 4E). In addition to this across-animal analysis, we asked for each animal whether the first PC corresponds to the odor response. One minute after stimulation, this was the case in six of nine animals; after 2 minutes, their number had dropped to three. This indicates that the memory trace encoded in the correlated activity fluctuations decays on a timescale of a few minutes. As a control, we also tested the prestimulus condition and found no resemblance between the first PC and the odor response in any animal, as expected.

Figure 4: Spontaneous activity after odor stimulation represents an olfactory memory. Data in A–D are from one animal. (A) Matrix of pair-wise correlations between glomerular activity before the stimulus is given (left). The diagonal elements of the matrix equal unity by definition and are depicted in black. Components of the first principal component (PC) of this matrix (right). The lack of significant correlation with the odor response is indicated within the frame. (B) Glomerular activity pattern elicited by the odor (2-octanol). Glomeruli are arranged according to their activity strength. This sequence of glomeruli is kept throughout A–D. The left panel is deliberately left empty to ease comparison with the other panels of the figure. (C) As A, but after odor stimulation. The components of the first PC clearly differ from those before the stimulus (A) and resemble the response pattern (B). The significance of the correlation with the odor response is given within the frame. (D) As in A, but for the differences (after minus before) of pair-wise correlations. Nonsignificant entries of the matrix (by bootstrap analysis; see the text) have been set to zero and are shown in white; the diagonal elements equal zero by definition and are depicted in gray. As in C, there is a statistically significant correlation with the odor response pattern (B). (E) Population data. Box plot of Kendall's correlation between the first eigenvector and the odor response, calculated from correlation matrices 2 minutes and 1 minute before and after stimulus delivery. There is a highly significant increase in the correlation. Numbers above the box plots indicate how many animals had a significant correlation between the first PC and the odor response. In agreement with our other results, this correlation was not significant for those bees for which the attraction did not increase after odor presentation (see Figure 2C).

4 Discussion

As shown by our results, a single odor exposure without any predictive or associative value can lead to transient changes of the correlations between spontaneously active glomeruli that can last for more than 1 minute. Most notably, the pattern of activity that corresponds to the experienced odor is repeatedly "reactivated" during this period and thus constitutes a type of reverberation that is rather distinct from persistent activity (Amit & Mongillo, 2003; Fuster & Alexander, 1971). Hebb may have foreseen this
possibility when he discussed the "persistence or repetition of a reverberatory activity" (Hebb, 1949, p. 62). Statistical properties of spontaneous activity patterns have been investigated in other systems, notably the mammalian cortex (Ikegaya et al., 2004). In this structure, the correlation between units is related to the behavioral state of the animal (Vaadia et al., 1995). A possible mechanism for creating specific firing patterns between neurons is provided by the concept of synfire chains (Abeles, 1991; Durstewitz, Seamans, & Sejnowski, 2000). In this study, we have not investigated the relationship of individual spikes but that of entire activity bursts, which are reflected in the calcium increases at the temporal resolution of our measurements. Therefore, a direct comparison should be approached with caution.

Which neurons mediate the observed correlation changes? Since the changes occur between glomeruli, neurons connecting glomeruli are likely candidates. In the honeybee, there are up to 4000 local neurons (LNs) in the AL (Flanagan & Mercer, 1989; Fonta, Sun, & Masson, 1993; Galizia & Menzel, 2000). These neurons are not a uniform group: they have different transmitters (GABA, histamine) and different morphologies (heterogeneous, homogeneous), and they innervate different groups of glomeruli. Further work should elucidate which subpopulation of LNs accounts for stimulus-specific modifications of the glomerular fluctuations. Clearly, however, the changes follow a Hebbian correlation rule: glomeruli that are coactivated during a stimulus probably increase a (putative) reciprocal excitatory connection and/or decrease an inhibitory connection, so that their co-occurrence in a spontaneous activity event becomes more likely than before stimulation (see Figure 3). However, since the network does not consist of pair-wise connections only, this description is certainly too simplistic. More work is needed to identify the mechanisms that may account for these findings. The stimulus may induce short-term changes of synaptic or membrane properties, which are known to influence spontaneous activity patterns (Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003; Tsodyks, Kenet, Grinvald, & Arieli, 1999). Stimulus-dependent modifications may, however, also be purely dynamic in the sense of "a memory trace that is wholly a function of a pattern of neural activity, independent of any structural change" (Hebb, 1949).

The observed multiglomerular activity fluctuations are readily interpreted if we visualize the AL as a network with odor-specific attractors and a high level of spontaneous activity (Galán, Sachse, Galizia, & Herz, 2004). If an odor is presented, the basin of attraction corresponding to this stimulus is increased and biases the network fluctuations toward the odor-specific pattern. This may lead to the network enhancing the representation of that odor relative to others, as seen in Figure 2C. Such a short-term memory effect has been observed in locusts (Stopfer & Laurent, 1999). In that study, the coherence of PN activity increased when an odor stimulus was iterated. If the sensory memory of the previous (but same) stimulus was
still active, it would cause a reduced threshold for that pattern and thus facilitate a more coherent response, that is, more strongly synchronized PN activity, as reported (Stopfer & Laurent, 1999). In the rat olfactory bulb, exposure to an odor slightly modifies the response profile of mitral cells (Fletcher & Wilson, 2003), a finding that might indicate that changes in the interglomerular neural network similar to those observed here also occur in mammals. It remains unclear, though, under what conditions, if any, these changes are read out by other brain areas, that is, whether spontaneous activity bouts can be coincident and cause spurious "remembrances" of the experienced odor, or whether these bouts are "perceived" as odor whiffs by the animal. By briefly changing the network activity, they might also play a role in classical conditioning of odors with appetitive rewards (Menzel & Müller, 1996). Let us also note that even if an external observer can retrieve the odor from the spontaneous activity, this does not prove that the animal actually uses this information. Nevertheless, the observed correlation changes are a robust and predictable phenomenon that occurs in the AL; by itself, this unexpected finding invites further investigation.

In conclusion, we have revealed traces of sensory memory in vivo and have demonstrated that a single odor stimulus can modify the spontaneous activity of olfactory glomeruli. As traditional paradigms investigating Hebbian reverberations have focused exclusively on persistent activity after stimulation, not on correlated activity fluctuations, it is to be expected that future investigations along the lines of our study may reveal previously overlooked memory traces in many other neural systems.

Acknowledgments

The work of M. W. and R. M. was supported by the Deutsche Forschungsgemeinschaft (SFB 515).
References

Abel, R., Rybak, J., & Menzel, R. (2001). Structure and response patterns of olfactory interneurons in the honeybee, Apis mellifera. J. Comp. Neurol., 437(3), 363–383.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J., & Mongillo, G. (2003). Selective delay activity in the cortex: Phenomena and interpretation. Cereb. Cortex, 13(11), 1139–1150.
Beckerman, M. (1995). Adaptive cooperative systems. New York: Wiley.
Clark, R. E., Manns, J. R., & Squire, L. R. (2002). Classical conditioning, awareness, and brain systems. Trends Cogn. Sci., 6(12), 524–531.
Crowder, R. G. (2003). Sensory memory. In J. H. Byrne (Ed.), Learning and memory (2nd ed., pp. 607–609). New York: Macmillan.
de Bruyne, M., Foster, K., & Carlson, J. R. (2001). Odor coding in the Drosophila antenna. Neuron, 30(2), 537–552.
Del Giudice, P., Fusi, S., & Mattia, M. (2003). Modelling the formation of working memory with networks of integrate-and-fire neurons connected by plastic synapses. J. Physiol. Paris, 97(4–6), 659–681.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Neurocomputational models of working memory. Nat. Neurosci., 3 Suppl., 1184–1191.
Flanagan, D., & Mercer, A. R. (1989). Morphology and response characteristics of neurones in the deutocerebrum of the brain in the honeybee Apis mellifera. J. Comp. Physiol. (A), 164, 483–494.
Fletcher, M. L., & Wilson, D. A. (2003). Olfactory bulb mitral-tufted cell plasticity: Odorant-specific tuning reflects previous odorant exposure. J. Neurosci., 23(17), 6946–6955.
Fonta, C., Sun, X. J., & Masson, C. (1993). Morphology and spatial distribution of bee antennal lobe interneurones responsive to odours. Chemical Senses, 18, 101–119.
Fuster, J. M., & Alexander, G. E. (1971). Neuron activity related to short-term memory. Science, 173(997), 652–654.
Galán, R. F., Sachse, S., Galizia, C. G., & Herz, A. V. (2004). Odor-driven attractor dynamics in the antennal lobe allow for simple and rapid olfactory pattern classification. Neural Comput., 16(5), 999–1012.
Galizia, C. G., Joerges, J., Küttner, A., Faber, T., & Menzel, R. (1997). A semi-in-vivo preparation for optical recording of the insect brain. J. Neurosci. Methods, 76(1), 61–69.
Galizia, C. G., & Kimmerle, B. (2004). Physiological and morphological characterization of honeybee olfactory neurons combining electrophysiology, calcium imaging and confocal microscopy. J. Comp. Physiol. A, 190(1), 21–38.
Galizia, C. G., McIlwrath, S. L., & Menzel, R. (1999). A digital three-dimensional atlas of the honeybee antennal lobe based on optical sections acquired by confocal microscopy. Cell Tissue Res., 295(3), 383–394.
Galizia, C. G., & Menzel, R. (2000). Odour perception in honeybees: Coding information in glomerular patterns. Curr. Opin. Neurobiol., 10(4), 504–510.
Galizia, C. G., & Menzel, R. (2001). The role of glomeruli in the neural representation of odours: Results from optical recording studies. J. Insect. Physiol., 47(2), 115–130.
Galizia, C. G., Sachse, S., Rappert, A., & Menzel, R. (1999). The glomerular code for odor representation is species specific in the honeybee Apis mellifera. Nat. Neurosci., 2(5), 473–478.
Getz, W. M., & Akers, R. P. (1994). Honeybee olfactory sensilla behave as integrated processing units. Behav. Neural. Biol., 61(2), 191–195.
Giurfa, M., Zhang, S., Jenett, A., Menzel, R., & Srinivasan, M. V. (2001). The concepts of "sameness" and "difference" in an insect. Nature, 410(6831), 930–933.
Grossmann, K. E. (1971). Belohnungsverzögerung beim Erlernen einer Farbe an einer künstlichen Futterstelle durch Honigbienen. Z. Tierpsychol., 29, 28–41.
Haag, J., & Borst, A. (2000). Spatial distribution and characteristics of voltage-gated calcium signals within visual interneurons. J. Neurophysiol., 83(2), 1039–1051.
Hansson, B. S., & Christensen, T. A. (1999). Functional characteristics of the antennal lobe. In B. S. Hansson (Ed.), Insect olfaction (pp. 125–161). Heidelberg: Springer.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hildebrand, J. G., & Shepherd, G. M. (1997). Mechanisms of olfactory discrimination: Converging evidence for common principles across phyla. Annu. Rev. Neurosci., 20, 595–631.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304(5670), 559–564.
Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425(6961), 954–956.
Korsching, S. (2002). Olfactory maps and odor images. Curr. Opin. Neurobiol., 12(4), 387–392.
Menzel, R. (2001). Searching for the memory trace in a mini-brain, the honeybee. Learn. Mem., 8(2), 53–62.
Menzel, R., & Müller, U. (1996). Learning and memory in honeybees: From behavior to neural substrates. Annu. Rev. Neurosci., 19, 379–404.
Müller, D., Abel, R., Brandt, R., Zöckler, M., & Menzel, R. (2002). Differential parallel processing of olfactory information in the honeybee, Apis mellifera L. J. Comp. Physiol. A, 188(5), 359–370.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Sachse, S., & Galizia, C. G. (2002). Role of inhibition for temporal and spatial odor representation in olfactory output neurons: A calcium imaging study. J. Neurophysiol., 87(2), 1106–1117.
Sachse, S., & Galizia, C. G. (2005). Topography and dynamics of the olfactory system. In S. Grillner (Ed.), Microcircuits: The interface between neurons and global brain function. Cambridge, MA: MIT Press.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4(4), 303–321.
Stopfer, M., & Laurent, G. (1999). Short-term memory in olfactory network dynamics. Nature, 402(6762), 664–668.
Tsodyks, M., Kenet, T., Grinvald, A., & Arieli, A. (1999). Linking spontaneous activity of single cortical neurons and the underlying functional architecture. Science, 286(5446), 1943–1946.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373(6514), 515–518.
Received January 19, 2005; accepted May 18, 2005.
LETTER
Communicated by Fred Rieke
The Optimal Synapse for Sparse, Binary Signals in the Rod Pathway

Paul T. Clark
[email protected]
Mark C. W. van Rossum
[email protected]
Institute for Adaptive and Neural Computation, School of Informatics, Edinburgh EH1 2QL, U.K.
The sparsity of photons at very low light levels necessitates a nonlinear synaptic transfer function between the rod photoreceptors and the rod-bipolar cells. We examine different ways to characterize the performance of the pathway: the error rate, two variants of the mutual information, and the signal-to-noise ratio. Simulation of the pathway shows that these approaches yield substantially different performance at very low light levels and that maximizing the signal-to-noise ratio yields the best performance when judged from simulated images. The results are compared to recent data.
1 Introduction

In this letter, we study early visual processing at very low light levels. At these so-called scotopic light levels, the photon capture rate per rod photoreceptor is on the order of one per minute. The rod cells can detect single photons (Baylor, Lamb, & Yau, 1979; Baylor, Nunn, & Schnapf, 1984; Schneeweis & Schnapf, 1995). Photon capture by the rod can lead to a response in the ganglion cells (Barlow, Levick, & Yoon, 1971; Mastronarde, 1983b). The rod is the most common cell type in the retina; there are 20 times more rods than cones (Sterling & Demb, 2004). The large number of rods serves to detect as many photons as possible, whereas the spatial resolution of the rod pathway is low. The scotopic rod pathway therefore has a large convergence. The rod-bipolar cell collects the signal from some 10 to 100 rods (Dacheux & Raviola, 1986; Grünert, Martin, & Wässle, 1994; Tsukamoto, Morigiwa, Ueda, & Sterling, 2001), while each rod connects to only two bipolar cells. Even when no photon is absorbed, the rod response is corrupted with continuous noise. This poses a potential problem for the pathway: a sharp thresholding function before summing the rod responses is required to maintain the single photon response. Without the nonlinearity, the single
responses would drown in the noise (Baylor et al., 1984). An earlier biophysical model showed how such a nonlinearity can be implemented and demonstrated that the nonlinearity is indeed necessary to obtain the observed performance (van Rossum & Smith, 1998). Recently, the transfer function of the synapse was measured, providing direct evidence for the existence of such a nonlinearity and confirming its synaptic mechanism (Field & Rieke, 2002b; Sampath & Rieke, 2004; Berntson, Smith, & Taylor, 2004). The performance of the pathway is critically dependent on the synaptic transfer function and its threshold. This raises the question of how the threshold should be set from first principles. Interestingly, this question in general has no straightforward answer (Basseville, 1989). In this study, we investigate ways to set the optimal synaptic transfer function for sparse, binary signals. Counterintuitively, we show that different performance criteria lead to different optimal thresholds. Simulation of the pathway suggests that these different threshold settings greatly influence the signal in the bipolar cell. We first introduce our description of the pathway and then analyze different performance measures in the case of a sharp binary threshold. Next we extend to the more general case of smooth transfer functions, for which we rely on simulations. Finally, we discuss our predictions. We are not aware of any other studies comparing performance measures for sparse, binary detection, either in the bipolar pathway or in a general setting.

2 Rod and Rod-Bipolar Pathway

2.1 Model for the Rods. The layout of the modeled rod-bipolar pathway is shown in Figure 1.
Figure 1: Diagram of our model of the rod-bipolar pathway. A dim light source causes a very sparse flux of photons (modeled in discrete time steps). The photons are detected by the rods. Intrinsic gaussian rod noise corrupts the response. After a nonlinear synapse, the rod-bipolar sums the rod responses. The question is how the synapse’s threshold and slope should best be set to minimize signal loss.
At the lowest light levels, a rod photoreceptor might detect a photon only every few minutes. Not every photon is absorbed and detected, but for simplicity we assume that each photon a rod receives is detected and leads to a response. This effectively yields an extra scale factor in the light level, the quantum efficiency, which is estimated at between 3% and 50% (Baylor et al., 1979; Field, Sampath, & Rieke, 2005). The number of photons absorbed by the rod follows a Poisson distribution. The full dynamics of the response and the noise are not taken into account: we discretize time into bins with the duration of the pathway's integration time. The rod integration time is some 100 to 200 ms (Baylor et al., 1984; Walraven, Enroth-Cugell, Hood, MacLeod, & Schnapf, 1990). With the light level ρ, we denote the probability that a rod receives a photon per time bin. Because the power spectra are similar, it is unlikely that temporal integration by the synapse can strongly reduce the noise in favor of the signal (van Rossum & Smith, 1998). However, it is important to note that, precise data lacking, synaptic filtering could lower the noise somewhat; in addition, bandpass filtering could increase the temporal information (Bialek & Owen, 1990; Armstrong-Gold & Rieke, 2003). At the low light levels we consider here, ρ ≪ 1 and the number of absorbed photons n is small: mostly zero and sometimes one. Thus, at the low light levels considered, a rod can essentially have two responses: it either detects a photon or does not. The probability that a particular rod detects two photons is negligible (ρ²), as is the probability that two out of N rods detect a photon simultaneously, namely, ρ²N(N − 1). The task of the bipolar cell is therefore to discriminate between the case that none of the rods absorbed a photon and the case that one rod absorbed a photon. Importantly, the rod response is noisy, and its voltage distribution can be fitted to a gaussian with a standard deviation that increases with the number of photons absorbed. The probability distribution for a certain response amplitude x from a rod is (Field & Rieke, 2002a, 2002b)
$$P(x) = \sum_{n=0}^{\infty}\frac{\rho^n e^{-\rho}}{n!}\,G\!\left(n\bar{x},\,\sigma_D^2+n\sigma_A^2\right) = \sum_{n=0}^{\infty}\frac{\rho^n e^{-\rho}}{n!}\,\frac{1}{\sqrt{2\pi(\sigma_D^2+n\sigma_A^2)}}\exp\!\left(-\frac{1}{2}\frac{(x-n\bar{x})^2}{\sigma_D^2+n\sigma_A^2}\right), \quad (2.1)$$
where G denotes the gaussian distribution. Without loss of generality, the mean response to a single event, x̄, is normalized to 1. The empirical values for the noise in mouse rods are σ_D = 0.27 and σ_A = 0.33 (Field & Rieke, 2002b).
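As an illustration, the mixture in equation 2.1 is straightforward to sample numerically. The following is a minimal sketch, assuming NumPy; the function name, the seed, and the default noise values (taken from the text above) are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility

def rod_response(rho, n_samples, x_bar=1.0, sigma_d=0.27, sigma_a=0.33):
    """Sample rod response amplitudes from equation 2.1: a Poisson
    photon count n, plus gaussian noise whose variance grows with n."""
    n = rng.poisson(rho, size=n_samples)           # absorbed photons per bin
    sigma = np.sqrt(sigma_d**2 + n * sigma_a**2)   # noise s.d. given n
    return n * x_bar + sigma * rng.normal(size=n_samples)
```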
These values for σ are only approximate values for the noise seen by the bipolar. It should be noted that the signal as seen by the bipolar can be noisier than this, because stochastic vesicle release can corrupt the signal further (Rao, Buchsbaum, & Sterling, 1994; van Rossum & Smith, 1998); there are no precise estimates of this. On the other hand, synaptic filtering might reduce the noise, as stated above. Finally, the rod signal is corrupted by thermally driven spontaneous isomerization of rhodopsin (Baylor et al., 1984). This rate is about 10^{-3} events per rod per integration time. These events introduce extra errors because they are indistinguishable from real photon captures and therefore cannot be filtered out. They are thought to be a major contribution to the so-called dark light (Copenhagen, Donner, & Reuter, 1987; Sterling, Freed, & Smith, 1988).

2.2 The Bipolar Cell. The rod provides input to both OFF and ON bipolar cells. The OFF-bipolar pathway does not seem to be tuned for low scotopic vision (Soucy, Wang, Nirenberg, Nathans, & Meister, 1998; Völgyi, Deans, Paul, & Bloomfield, 2004); hence, we ignore it here. The rod ON-bipolar cell pools the signal from some 10 to 100 rods. However, as the rod signal is noisy, the single photon signal would be lost in the noise if the bipolar were to sum the rod signals linearly. The reason is that the noise is pooled from all rods; thus, the standard deviation of the noise in the bipolar scales as √N, whereas only one rod carries the signal. It has therefore been noted that it is essential to threshold the rod signals before they are summed by the bipolar (Baylor et al., 1984). In a modeling study, it was proposed that this threshold is implemented using a second-messenger cascade synapse; such a threshold mechanism yielded performance consistent with the physiological and psychophysical data (van Rossum & Smith, 1998). For now, we assume that the synaptic transfer function g(x) is a sharp step function with a threshold θ, so that g(x) = 0 if x < θ and g(x) = 1 otherwise. The threshold θ is the adjustable parameter. This simple transfer function is easy to study, and a binary function might seem to fit the binary input signal best. The second statement will turn out not to be fully correct, as is shown below, where other synapse models are discussed. Consider first that there is just one rod connected to the bipolar cell. We introduce the false-positive rate α (no photon, but one erroneously detected) and the false-negative rate β (a photon was received but not detected in the bipolar). For one rod, the n = 0 and n = 1 terms in equation 2.1 yield
$$\alpha = \frac{1}{2}\left[1 - \operatorname{erf}\!\left(\frac{\theta}{\sqrt{2}\,\sigma_D}\right)\right], \qquad \beta = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\theta-1}{\sqrt{2}\sqrt{\sigma_D^2+\sigma_A^2}}\right)\right].$$

In case N rods are connected to the bipolar cell, the bipolar cell is assumed to sum the thresholded rod responses, that is, $y = \sum_{i=1}^{N} g(x_i)$.
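These error rates are easy to evaluate with the standard error function. A minimal sketch, assuming Python's math module; the function names are ours:

```python
from math import erf, sqrt

SIGMA_D, SIGMA_A = 0.27, 0.33  # mouse rod noise (Field & Rieke, 2002b)

def false_positive(theta, sigma_d=SIGMA_D):
    """alpha: no photon was absorbed, but g(x) = 1 (noise crossed theta)."""
    return 0.5 * (1.0 - erf(theta / (sqrt(2.0) * sigma_d)))

def false_negative(theta, sigma_d=SIGMA_D, sigma_a=SIGMA_A):
    """beta: a photon was absorbed, but g(x) = 0 (response below theta)."""
    s = sqrt(sigma_d**2 + sigma_a**2)
    return 0.5 * (1.0 + erf((theta - 1.0) / (sqrt(2.0) * s)))
```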
After some combinatorics, one finds that the probability that j photons are absorbed and the bipolar response equals k is
$$P(j,k) = \binom{N}{j}(1-\rho)^{N-j}\rho^{j}\sum_{l=0}^{j}\binom{j}{l}\binom{N-j}{k-l}\,\alpha^{k-l}(1-\alpha)^{N+l-j-k}\,\beta^{j-l}(1-\beta)^{l},$$
with the convention that $\binom{i}{j} = 0$ if $j < 0$. In the limit of small ρ and small ρN, again only two errors are important. First, none of the rods received a photon, but the output is unequal to zero. This probability is written as $\alpha_N = 1 - (1-\alpha)^N$ and can be interpreted as the generalized false-positive rate. Second, one of the rods received a photon, but the bipolar output is zero. This is written as $\beta_N = \beta(1-\alpha)^{N-1}$. The false-positive and false-negative rates characterize the pathway as a function of the threshold level θ. The threshold can roughly be deduced from ganglion cell data, which showed that the false-negative rate is about 50% (Mastronarde, 1983a; van Rossum & Smith, 1998). This corresponds to a threshold setting of θ ≈ 1. In this letter, we examine the more fundamental question of how the optimal value for the threshold follows from the performance measure imposed.

3 Performance Measures

3.1 Threshold from Minimizing the Detection Errors. The problem of how to set the threshold can be analyzed with signal detection theory (Green & Swets, 1966; Van Trees, 1968). The setting of the threshold determines the trade-off between false positives and false negatives. A natural choice is to weigh both errors equally. In this case, the error equals the mean square error between input and output, or the Hamming distance. With one rod connected to the bipolar, the total error rate, denoted ER, is
$$\mathrm{ER}(\theta) = (1-\rho)\,\alpha(\theta) + \rho\,\beta(\theta),$$
where the first term is the false-positive rate and the second term the false-negative rate. Now the threshold θ can be varied so that the ER is minimal. If multiple rods converge onto the bipolar, a simple counting argument gives for the error rate
$$\mathrm{ER} = (1-\rho N)\,\alpha_N + \rho N\,\beta_N,$$
where we ignored terms of order ρ². In Figure 2A, the total error is plotted as a function of the threshold level when 10 rods converge. For low thresholds, the false-positive rate is very high, and the error rate is close to 1. For very high thresholds, the error rate is quite low: ER = ρN. Here the high threshold eliminates all photon events; the output is completely dark, which is not far from the truth but not very useful.
Figure 2: The behavior of the three performance measures of the bipolar pathway as a function of the threshold level. (A) The total error rate, which counts false positives and false negatives. The minimum in the error rate corresponds to the best performance. The dashed line indicates the much worse performance when the synapses are linear and the thresholding is done after the summation. (B) The signal-to-noise ratio for a contrast discrimination task. The thin dashed line indicates the performance when thresholding is done after the summation. (C) The mutual information between the light level and the bipolar output as a function of the threshold. The dashed line indicates the mutual information between the rod signal and the bipolar signal (y-scale divided by 20 to aid visualization; on the same scale, the dashed line would be much larger than the solid line). Parameters for all panels: 10 rods; light level 0.001 photons/rod/time step; rod noise according to Field and Rieke (2002b).
For intermediate thresholds, the error rate has a minimum at which false-positive and false-negative rates are traded off. As the light level is lowered, the optimal threshold increases and can be larger than 1. This gives the somewhat counterintuitive result that if the signal is very sparse, a high threshold is beneficial, even though this means missing a large fraction of the events. To show the benefit of the thresholding synapse, we also show the error rate when the synapses are linear and the signal is thresholded after
summing (see Figure 2A, dashed line). This error rate has no minimum, and performance is worse in this case.
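Numerically, minimizing the ER amounts to a one-dimensional scan over the threshold. A sketch, reusing the helper functions above and assuming NumPy; the parameter values follow Figure 2 and the names are ours:

```python
import numpy as np

def error_rate(theta, rho, n_rods):
    """ER = (1 - rho*N)*alpha_N + rho*N*beta_N for N converging rods."""
    a = false_positive(theta)
    b = false_negative(theta)
    a_n = 1.0 - (1.0 - a) ** n_rods         # generalized false-positive rate
    b_n = b * (1.0 - a) ** (n_rods - 1)     # generalized false-negative rate
    return (1.0 - rho * n_rods) * a_n + rho * n_rods * b_n

thetas = np.linspace(0.0, 3.0, 601)
er = np.array([error_rate(t, rho=1e-3, n_rods=10) for t in thetas])
theta_opt = thetas[int(np.argmin(er))]      # threshold at the ER minimum
```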
3.2 Threshold from Bayesian Inference. The same threshold value also follows from probabilistic Bayesian inference. The probability that the rod absorbed a photon (k = 1) versus that it did not (k = 0), given a response y, is
$$g(y) = \frac{P(k=1 \mid y)}{P(k=0 \mid y) + P(k=1 \mid y)} = \left[1 + \frac{P(y \mid k=0)\,P(k=0)}{P(y \mid k=1)\,P(k=1)}\right]^{-1} = \left[1 + \frac{(1-\rho_{SP})\,G\!\left(0,\sigma_D^2\right) + \rho_{SP}\,G\!\left(1,\sigma_D^2+\sigma_A^2\right)}{G\!\left(1,\sigma_D^2+\sigma_A^2\right)}\,\frac{1-\rho}{\rho}\right]^{-1}, \quad (3.1)$$
where for completeness, we introduced the spontaneous isomerization rate ρ_SP, measured in events per rod per time step; it mimics a photon event (see below). Under the simplification that σ_A = 0 and ρ_SP = 0, the probability that the rod absorbed a photon given the response is the well-known logistic function (Mackay, 2003),
$$g(y) = \frac{1}{1 + \exp[-(y-\theta)/\kappa]}, \quad (3.2)$$
with parameters $\theta = \frac{1}{2} - \sigma_D^2 \ln\frac{\rho}{1-\rho}$ and $\kappa = \sigma_D^2$. If this probability is 50% or higher, a photon event was most likely, and the output is set to one; otherwise, there was likely no photon, and the output is set to zero. This threshold setting corresponds to the point where the rod probability distributions for the one-photon and no-photon signals intersect. The inference interpretation is equivalent to minimizing the number of errors in the previous section and thus yields exactly the same optimal threshold. When spontaneous events are taken into account, the threshold is approximated by
$$\theta = \frac{1}{2} - \sigma_D^2 \ln(\rho - \rho_{SP}). \quad (3.3)$$
This has no solution for ρ < ρ_SP; the intuition is that any response was more likely a spontaneous event than a real photon. When the assumption σ_A = 0 is dropped, g(y) is no longer a monotonic function. Instead, a transition occurs at a negative value of y, which makes g(y) = 1 also for negative y. However, the probability of these small values of y is negligible. Furthermore, numerically the relevant upper threshold is virtually identical to the case that σ_A = 0 (1.17 versus 1.19 when ρ = 10^{-4}).
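A sketch of the resulting inference-based transfer function, under the same simplifications (σ_A = 0, ρ_SP = 0); the names are ours:

```python
from math import exp, log

def bayes_transfer(y, rho, sigma_d=0.27):
    """Equation 3.2: posterior probability that a photon was absorbed,
    given the rod response y (assumes sigma_A = 0, no spontaneous events)."""
    theta = 0.5 - sigma_d**2 * log(rho / (1.0 - rho))  # midpoint
    kappa = sigma_d**2                                  # inverse slope
    return 1.0 / (1.0 + exp(-(y - theta) / kappa))
```

Thresholding this posterior at 50% reproduces the ER-optimal threshold discussed above.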
3.3 Threshold from Signal-to-Noise Ratio. Another performance measure of the pathway is the following: the discrimination should be clearest when the signal-to-noise ratio in the bipolar is maximal (Field & Rieke, 2002b). Therefore, the synapse should be tuned to maximize the signal-to-noise ratio. Here we maximize the signal-to-noise ratio in a contrast discrimination task in which a dark patch has to be distinguished from a brighter one. For a given light level, the response distribution of the bipolar cell is $Q(y;\rho) = [1-q]\,\delta(y) + q\,\delta(y-1)$, where $q(\rho) = \alpha_N + \rho N(1-\alpha_N-\beta_N)$ is the average bipolar output, consisting of both correct and false responses. The variance of this distribution is $q(1-q)$. The signal-to-noise ratio is
$$\mathrm{SNR}(\rho_1,\rho_2) = \frac{2\,[q(\rho_1) - q(\rho_2)]^2}{q(\rho_1)[1-q(\rho_1)] + q(\rho_2)[1-q(\rho_2)]}.$$
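A sketch of this criterion, reusing the false_positive and false_negative helpers above; the names are ours:

```python
def mean_output(rho, theta, n_rods):
    """q(rho): average bipolar output, correct plus false responses."""
    a = false_positive(theta)
    b = false_negative(theta)
    a_n = 1.0 - (1.0 - a) ** n_rods
    b_n = b * (1.0 - a) ** (n_rods - 1)
    return a_n + rho * n_rods * (1.0 - a_n - b_n)

def snr(rho1, rho2, theta, n_rods):
    """Signal-to-noise ratio for discriminating light levels rho1, rho2."""
    q1 = mean_output(rho1, theta, n_rods)
    q2 = mean_output(rho2, theta, n_rods)
    return 2.0 * (q1 - q2) ** 2 / (q1 * (1.0 - q1) + q2 * (1.0 - q2))
```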
The values of ρ1 and ρ2 are set as follows. When the overall discrimination is hardest, even the discrimination between dark and the highest light level is difficult. Therefore, we will examine the case that ρ1 = 0 and ρ2 = 2ρ (the factor 2 ensures that the mean light level is ρ). However, we also consider the discrimination between two almost equal light levels, SNR(ρ − δρ, ρ + δρ) with δρ ≪ ρ. In practice, we found that this gave almost identical thresholds. In Figure 2B, the SNR is plotted for thresholding synapses and for linear synapses followed by thresholding after summing. As the figure shows, the thresholding clearly improves the SNR.

3.4 Threshold from Information Theory. The detection problem and the need for a threshold in the synapse can also be studied using information theory. In general, the mutual information between an input variable x and an output y is $I_M = \int dx\, P(x) \int dy\, P(y \mid x)\,[\log_2 P(y \mid x) - \log_2 P(y)]$ (Cover & Thomas, 1991). We first calculate the mutual information between the light intensity and the bipolar signal as a function of the threshold. As above, we consider an input distribution with just two light intensities, 0 and 2ρ, each with probability 1/2. P(y, x) has four terms: $P(0,0) = \frac{1}{2}[1-q(0)]$, $P(1,0) = \frac{1}{2}q(0)$, $P(0,\rho) = \frac{1}{2}[1-q(2\rho)]$, and $P(1,\rho) = \frac{1}{2}q(2\rho)$. The mutual information therefore becomes a sum over four terms,
$$\begin{aligned} \mathrm{IMRHO} &= \tfrac{1}{2}[1-q(0)]\left\{\log_2[1-q(0)] - \log_2\!\left[1-\tfrac{1}{2}q(0)-\tfrac{1}{2}q(2\rho)\right]\right\} \\ &\quad + \tfrac{1}{2}q(0)\left\{\log_2 q(0) - \log_2\!\left[\tfrac{1}{2}q(0)+\tfrac{1}{2}q(2\rho)\right]\right\} \\ &\quad + \tfrac{1}{2}[1-q(2\rho)]\left\{\log_2[1-q(2\rho)] - \log_2\!\left[1-\tfrac{1}{2}q(0)-\tfrac{1}{2}q(2\rho)\right]\right\} \\ &\quad + \tfrac{1}{2}q(2\rho)\left\{\log_2 q(2\rho) - \log_2\!\left[\tfrac{1}{2}q(0)+\tfrac{1}{2}q(2\rho)\right]\right\}. \end{aligned}$$

This measure, labeled IMRHO, is our third performance criterion to set the threshold. The mutual information has a maximum as a function of the threshold level (see Figure 2C, solid line). Like the other criteria, the mutual information deteriorates when the signal is thresholded only after the rod signals have been summed (not shown). For sharp thresholds, the IMRHO is very similar to the SNR. In the case that the discrimination is done between ρ and ρ + δρ with small δρ, one can show by expansion in δρ that they are identical. Above, the information between light level and bipolar was used. Alternatively, one can optimize the mutual information between the actual photon signal and the bipolar signal. The photon signal is given by a Poisson process dependent on the light level. After all, one can argue that the threshold should care only about the photons that are absorbed by the rod. We term this criterion IMROD. If just a single rod is connected to the bipolar, x describes the photon signal and y the bipolar output. Both x and y take values zero and one only. This does not mean the noise in the rod is ignored; it is captured in α and β. P(y, x) now has the terms P(0,0) = (1−ρ)(1−α), P(1,0) = (1−ρ)α, P(0,1) = ρβ, P(1,1) = ρ(1−β). This yields
$$\begin{aligned} \mathrm{IMROD} &= (1-\rho)\alpha\left\{\log_2\alpha - \log_2[\alpha+\rho(1-\alpha-\beta)]\right\} \\ &\quad + (1-\rho)(1-\alpha)\left\{\log_2(1-\alpha) - \log_2[1-\alpha-\rho(1-\alpha-\beta)]\right\} \\ &\quad + \rho\beta\left\{\log_2\beta - \log_2[1-\alpha-\rho(1-\alpha-\beta)]\right\} \\ &\quad + \rho(1-\beta)\left\{\log_2(1-\beta) - \log_2[\alpha+\rho(1-\alpha-\beta)]\right\}. \end{aligned}$$
When instead of one rod, N rods converge onto the bipolar cell, α_N and β_N should be used, and ρ should be replaced by ρN. This second mutual information measure reaches much higher values. This is understandable because, unlike IMRHO, it lacks the Poisson process, which links the light level to actual photons. In the Poisson process, a lot of information is lost. Because the threshold setting does not affect the transformation of light level into absorbed photons, one could expect that both information measures have a similar dependence on the threshold. But this variant consistently predicts a lower threshold value (see Figure 2C, dashed curve).
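For the binary output, IMRHO reduces to the mutual information of a two-by-two joint distribution and can be computed directly. A sketch, reusing mean_output from above; the names are ours:

```python
from math import log2

def im_rho(rho, theta, n_rods):
    """Mutual information between a binary light level {0, 2*rho}, each
    with probability 1/2, and the binary bipolar output (IMRHO)."""
    q0 = mean_output(0.0, theta, n_rods)
    q2 = mean_output(2.0 * rho, theta, n_rods)
    py1 = 0.5 * (q0 + q2)                     # P(y = 1)
    total = 0.0
    for q in (q0, q2):                        # two equiprobable light levels
        for p_joint, p_y in ((0.5 * (1.0 - q), 1.0 - py1), (0.5 * q, py1)):
            if p_joint > 0.0:
                # P(y|x) = 2 * P(x, y) because P(x) = 1/2
                total += p_joint * (log2(2.0 * p_joint) - log2(p_y))
    return total
```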
4 Optimal Threshold Levels

We have seen that the different performance criteria can yield different optimal threshold values. To gain a better understanding, we examined the optimal threshold as stimulus parameters are varied. The first observation is that for the binary transfer function, the SNR and IMRHO predict very similar thresholds. Figure 3A shows the optimal threshold for all criteria as the light level is varied. For high light levels, all approaches yield an optimal threshold close to 0.5 (although the approximations are expected to break down when ρN ≈ 1). In practice, the minimal light level is limited by the dark light to some 10^{-3} events/rod/integration time, although behavioral responses can persist at even lower light levels. At these light levels, the different approaches are still quite similar. To expose the differences more clearly, we purposefully neglected the spontaneous events and considered unrealistically low light levels. The threshold according to the SNR and ER is roughly linear in the log of the light level. The threshold value from IMROD is lower than for the SNR or ER. The intuition is that the mutual information approach prefers lower threshold values because a low threshold yields a richer output distribution, although this increases the error rate. Next, we examined the dependence on the number of rods converging on the bipolar cell. The threshold values depend only weakly on the number of rods (see Figure 3B). As the number of rods increases, the thresholds move closer together. Finally, the thresholds depend on the noise in the rods (see Figure 3C). The lower the noise, the smaller the threshold. This is easily understood in the zero-noise limit, where the discrimination is easy: a threshold of 1/2 would be best according to all criteria. For high noise, the ER threshold is proportional to σ² (as shown above), whereas the IMROD threshold increases linearly with σ. Interestingly, for high noise, the optimal SNR threshold decreases after an initial increase. As stated above, rod responses contain spontaneous rhodopsin isomerization events that have not been included so far. Effectively, these introduce additional false positives. The false-positive rate becomes $\alpha_{SP} = (1-\rho_{SP})\alpha + \rho_{SP}(1-\beta)$, where ρ_SP is the spontaneous isomerization rate measured in events per rod per time step. These events affect the various performance criteria differently. The ER predicts a higher threshold when the spontaneous events are included (see Figure 3D). In fact, the optimum threshold diverges when the mean light level approaches the spontaneous event level (see also equation 3.3). For light levels less than the spontaneous rate, there is no optimal threshold; the curve in Figure 2A has no minimum. Indeed, the fewest errors in that case are made when the output is always zero. In contrast, the other measures have a finite optimal threshold for light levels lower than the spontaneous rate. For the SNR, this is easily understood: in the presence of a spontaneous rate, a discrimination task has to discriminate between ρ1 + ρ_SP and ρ2 + ρ_SP.
Figure 3: Dependence of the optimal threshold according to the different performance criteria on various parameters. Dashed line: the number-of-errors criterion (ER); dotted line: the signal-to-noise ratio (SNR) and the mutual information between light level and bipolar response (IMRHO) (overlapping); solid line: mutual information between rod and bipolar response (IMROD). (A) Optimal thresholds as a function of the light level. Ten rods; noise as in Field and Rieke (2002b). (B) Optimal threshold value as a function of the number of rods. Light level of 10^{-4} events per rod; noise as in Field and Rieke (2002b). (C) Dependence of the threshold value on the noise level in the rod. In this simulation, a simplified noise model was used, in which the rod noise was independent of photon absorption (i.e., σ_A = 0). 10 rods, ρ = 10^{-4}. (D) Effect of spontaneous events on the optimal threshold. The spontaneous event rate was 10^{-3} events per rod per time step. Other parameters as in A. Notably, the threshold based on the number of errors diverges when the light level is less than the spontaneous rate.
Hence, the optimal threshold shifts as if the light level were higher, equal to ρ + ρ_SP rather than ρ. We tested whether the precise value of the time bin is important. In particular, the mutual information and its optimal threshold could depend
on the resolution of the sampling. We doubled the duration of the time bin. This manipulation doubles ρ but also leads to different values of α and β. For light levels below 0.01 events per second, there is no noticeable difference in the threshold for either ER or SNR. Only for the mutual information IMROD did we see a slightly higher threshold (1.02, compared to 0.99 for the original time bin; ρ = 10^{-4}, N = 10).

5 Simulated Rod-Bipolar Pathway

A priori it is not obvious which performance criterion should be used to set the synaptic threshold; all presented methods seem valid. To tackle this question, we simulated how the threshold setting would change the output of the bipolar system. It is likely that the different thresholds would lead to different visual percepts. These simulations allow us to examine the bipolar pathway as a function of the threshold level. The simulations consist of the following steps (a code sketch follows below):

1. An image was split into rectangular regions, each corresponding to the receptive field of a single bipolar cell. The gray scale of a given pixel was extracted and multiplied with the mean light level to obtain the pixel's light level.
2. A Poisson process with a rate given by the light level determined whether a rod absorbed a photon.
3. Gaussian noise was added to the rod response.
4. The nonlinear transfer function was applied to the rod response to mimic the synapse.
5. The transformed rod responses were summed in the bipolar.

We repeated this procedure for each bipolar cell, and the final output picture was averaged over many trials. This averaging mimics the pooling by the amacrine cells. Finally, we applied histogram equalization to the output picture using image processing software. This smooth, monotonic transformation improved visibility and gave the images a similar appearance despite very different mean output levels. Without it, the images can easily either saturate or become very dark. In the retina, such transformations can be performed by the circuitry of the amacrine cells and farther downstream. It is important to note that this simulated pathway is just an approximation of the real one, as the number of trials used here is much higher than the number of bipolars connected to the amacrine cell, and the bipolar-to-amacrine synapse might also contain a threshold; as in the bipolar, the signals are still quite sparse. These effects could change the results. Unfortunately, they are hard to study given our limited knowledge of processing by these circuits at the lowest light levels.
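The following is a minimal sketch of steps 1 to 5, assuming NumPy and a gray-scale image given as a 2D array in [0, 1]; histogram equalization is omitted, and all names and the seed are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed

def simulate_bipolar_image(gray, mean_level, n_rods=10, theta=1.33,
                           n_trials=5000, sigma_d=0.27, sigma_a=0.33):
    """One bipolar cell per pixel: Poisson photon capture per rod,
    gaussian rod noise, a hard synaptic threshold, and summation."""
    rho = gray * mean_level                   # per-pixel light level (step 1)
    out = np.zeros_like(gray, dtype=float)
    for _ in range(n_trials):
        n = rng.poisson(rho[..., None], size=(*rho.shape, n_rods))   # step 2
        noise = rng.normal(size=n.shape)
        x = n + np.sqrt(sigma_d**2 + n * sigma_a**2) * noise         # step 3
        out += (x > theta).sum(axis=-1)       # steps 4 and 5
    return out / n_trials                     # trial average (amacrine pooling)
```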
Figure 4: Simulated rod-bipolar image processing. (A) Original image. (B) Simulated images with a synaptic threshold level of 1.03 (optimal for the mutual information, IMROD), 1.33 (SNR and IMRHO), and 1.38 (ER). The threshold according to the SNR gives a better-quality image than the threshold according to IMROD. Average over 50,000 samples; mean light level 10^{-5}; 10 rods; noise according to Field & Rieke (2002b). (C) Same as in B except that the rod noise is higher. Now, minimizing the error rate (right-most figure) does not lead to a clear picture. On the other hand, the IMROD criterion performs decently for these parameters. The threshold settings were θ = 1.12 (IMROD), 1.66 (SNR), and 2.78 (ER). In combination with A, the SNR and IMRHO consistently yield the clearest image. σ_D = 0.5; σ_A = 0; ρ = 10^{-4}; 10 rods; average over 50,000 samples.
We applied the simulation to an input image with a high- and a low-contrast letter and a somewhat natural scene (see Figures 4A and 4B). The light level was deliberately chosen very low to emphasize the differences between the criteria. The low threshold level predicted by IMROD leads to a high false-positive rate. As a result, the image is not very clear, and
low-contrast boundaries are hard to see. Setting the threshold according to the SNR (and the similar IMRHO) yields a clearer picture. Also, the ER yields a good output in this case, which is expected, as the predicted thresholds are very close. In the above situation, the SNR and the ER criteria predict a very similar threshold level (see also Figure 3A). To further distinguish between the thresholds predicted by the SNR and the ER criteria, we simulated a situation with high rod noise, σ_D = 0.5, σ_A = 0. In this case, the threshold according to the SNR is lower (see Figure 3C). Now the ER method performs worse, but the SNR still yields a good image (see Figure 4C). Combined, these results indicate that for the sharp threshold synapse, maximizing the SNR, or the almost identical IMRHO, consistently gives the best images in our simulation.

6 Performance with a Sigmoidal Synaptic Transfer Function

So far, a hard threshold function has been imposed. We wondered whether a smoother threshold function would yield different results. We have not tried to derive the best possible transfer function, but a variable slope parameter κ was added to make the transfer function a logistic function, g(x) = [1 + exp(−(x − θ)/κ)]^{-1}. When κ = 0, the sharp threshold is recovered. When the transfer function is the smooth logistic function, the output of the bipolar becomes continuously valued. Both the SNR and the mutual information measures are easily calculated numerically when the transfer function is soft; this simply requires a discretization of the bipolar output into sufficiently small bins. The total number of errors, ER, is slightly ambiguous. We define it as follows: when no photon was absorbed but the bipolar voltage was larger than 1/2, a false positive was counted, whereas in the opposite case, a false negative was counted. We optimized the synapse by varying the slope of the transfer function in addition to the threshold value. Analytical treatment becomes intractable in this case, so we rely on numerical evaluation. The results are shown in Figure 5. The ER depends only very weakly on the smoothness (not shown). But the SNR and both mutual information measures improve when a smooth rather than a sharp transfer function is used. This can be seen by comparing the values at κ = 0 (the hard threshold) with nonzero κ. On the other hand, the transfer function should not be made too smooth, κ ≳ 1; otherwise, the performance decreases. In this limit of large κ, the synaptic transfer function mimics a linear one, which, as was shown above, has poor performance. The optimal value for κ is close to σ², as expected from equation 3.2. The optimal value for the threshold θ, which now describes the transfer function's midpoint, increases for smoother transfer functions. Both IMRHO and IMROD have a broad plateau with a very shallow maximum.
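A sketch of the smooth synapse used here; with it, the bipolar output becomes continuously valued, and the measures above are evaluated on a discretized output. The names are ours:

```python
import numpy as np

def smooth_synapse(x, theta, kappa):
    """Logistic transfer function; kappa -> 0 recovers the hard threshold."""
    x = np.asarray(x, dtype=float)
    if kappa == 0.0:
        return (x > theta).astype(float)
    return 1.0 / (1.0 + np.exp(-(x - theta) / kappa))
```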
Figure 5: Performance criteria as a function of both the threshold θ and the inverse slope κ of the synaptic transfer function. (A, C) IMROD and IMRHO increase with a smoother synapse and have broad maxima. (B) The SNR has a sharp maximum and decreases as the transfer function is made much smoother (higher inverse slope). Parameters as in Figure 4A.
Figure 6: Simulated images when the synaptic transfer function is smooth. The optimal synapse settings were IMROD: θ = 1.17, κ = 0.14; SNR: θ = 1.37, κ = 0.06; IMRHO: θ = 1.36, κ = 0.11; and ER: θ = 1.38, κ = 0. The other parameters as in Figure 4A.
The SNR has a sharper profile, with a maximum at rather low κ. Its optimal threshold shifts about 0.05 upward. The values of the optima are given in Figure 6. Although the smoother synaptic transfer function increases the performance, inspection of the simulated images shows no obvious improvement (see Figure 6). This is not surprising, as the increase is only some 20%, which is too small to yield significantly improved images. Likewise, the IMROD criterion still produces an image with many false positives. Another effect that occurs with the smooth transfer function is that the SNR and IMRHO no longer predict the same optimal threshold. However, this difference is too small to be visible in the images (compare SNR to IMRHO in Figure 6).

7 Discussion

We have studied the signal transfer in the first synapse of the visual system at low light levels. The presence of noise in the rods necessitates a strong nonlinearity in the synapse, as otherwise the continuously present noise
from the other rods would swamp the signal from a rod receiving a photon. We considered a variety of performance criteria that could be used to tune the synapse. At higher light levels, when the signal is not extremely sparse, the predicted thresholds are similar. But at low light levels, the performance is sensitive to the choice of criterion. There is no principled a priori choice of criterion (Basseville, 1989). Our results show that the predictions for the threshold are quite different at the lowest light levels. Which threshold, then, does the pathway use, and which threshold setting is the best? One possibility is to use the results from the simulated pathway, although these should be interpreted with care given the uncertainties in the circuitry. In the images, the signal-to-noise ratio (SNR) and the mutual information (IMRHO) consistently yield the best performance. The error rate (ER) (equivalent to Bayesian inference) is a reasonable criterion when the rod noise is close to the measurements in Field and Rieke (2002b), but it predicts too high a threshold when the rod noise is larger. Interestingly, for good performance of the synapse, the mutual information needs to be calculated between the light level and the output (IMRHO), not between the photon signal and the output (IMROD). Otherwise, the predicted threshold is too low, and the pathway can have very poor performance. Although this is not in conflict with information theory, it is a somewhat unexpected effect. We have also explicitly included the effects of spontaneous events and of the rod noise on the optimal synapse parameters. Finally, we considered a smooth transfer function and found that it is slightly better than a sharp one, but the difference is small. An interesting question is how biology tunes and adapts the synapse according to the light level; we discussed some candidates earlier (van Rossum & Smith, 1998). This remains an outstanding issue experimentally and theoretically, as it emphasizes that the biology would need to optimize a quite noisy cost function. In the experiments, the threshold did not seem to adapt to the light level of the flash (Field & Rieke, 2002b), but the nonlinearity became weaker at higher background levels, increasing the response to flashes (Sampath & Rieke, 2004).

7.1 Comparison to Earlier Work and Data. We can compare our results to the data. Given the desired performance criterion, the current study predicts which threshold to expect at a given light level and noise level. Experimentally, however, the threshold and the other parameters (noise and convergence ratio) are hard to access. In the experiments, the transfer function and its threshold were not measured directly but were inferred from the dependence of the mean and variance of the bipolar flash responses at higher light levels of about ρ ≈ 1. In Field and Rieke (2002b), the experimental data were described well when the transfer function was assumed to be a linear function with a step, that is, g(x) = x/[1 + exp(−(x − θ)/κ)]. The stimulus was a flash (about 1 Hz)
at a flash intensity of 10^{-4} Rh*; the mean light level was therefore some 10^{-5} events per integration time. (We think it is more likely that the threshold adapts to the mean light level, not to the flash strength.) The experimentally observed threshold level was 1.3. The optimal threshold for these parameters according to the SNR is 1.37, with an inverse slope of 0.06 (see Figure 5). The observed inverse slope in the data, κ = 0.1, was close to the prediction from maximizing the SNR. As was already noted in Field and Rieke (2002b), the SNR gives a good prediction of the threshold. However, caution should be used; the vesicle release noise could effectively increase the rod noise, necessitating a higher threshold (see Figure 3C), whereas temporal filtering might reduce the noise. It is also not clear how the presumably omnipresent spontaneous events are consistent with these findings; including them would reduce the optimal threshold level (see Figure 3D). However, in Berntson et al. (2004), the synaptic transfer was found to saturate when more than one photon was absorbed, as was assumed in this study. The reason for the discrepancy between the two experiments is not clear. For the current study, the difference in the actual shape of the transfer function is negligible, as the chance of multiple photon absorptions in the rod is very small in the considered regime. However, it is likely that the different assumptions about the transfer function change the estimate of the threshold. This second experimental study found a threshold of 0.85 (Berntson et al., 2004). This seems to fit the present study better when realistic spontaneous event rates are included. Although in principle this study gives explicit predictions for the synaptic transfer function and its dependence on light level, the experiments and noise measurements are not sensitive enough to decide which performance criterion the biological synapse follows.
7.2 Relevance to Other Systems. This study seems to deal with a particular circuit and circumstances: the bipolar pathway at very low light levels. For the threshold to have a beneficial effect, the following conditions should hold: (1) the signal should be sparse (i.e., only one or a few inputs out of many are simultaneously active), (2) the signal should be discrete, and (3) all inputs should carry noise in the absence of a signal. However, the problem of detecting a sparse, binary signal amid gaussian noise could be of much more general relevance. Consider, for instance, a view-invariant face cell in the cortex receiving many inputs, each of them active only when the face is seen from a particular angle. Given the ongoing spontaneous activity in neurons, such a system could need thresholding similar to that in the rod-bipolar pathway, in particular when the receptive field of the invariant cell is much wider than the tuning of the cells providing its inputs. It is not clear how far this analogy holds. Nonlinearities in the synaptic transfer, such as synaptic facilitation (Varela et al., 1997), seem too weak to provide the required nonlinearities. Another possibility is that by using
population coding, the system spreads the signal out over many inputs, relieving this problem.

Acknowledgments

We thank Alexander Heimel, Fred Rieke, Chris Williams, and Robert Smith for insightful comments. Efstathios Politis helped in the initial phase of this project. The method of simulating low-light-level images was inspired by work of Andrew Hsu.
References

Armstrong-Gold, C. E., & Rieke, F. (2003). Bandpass filtering at the rod to second-order cell synapse in salamander (Ambystoma tigrinum) retina. J. Neurosci., 23, 2796–2806.
Barlow, H. B., Levick, W. R., & Yoon, M. (1971). Responses to single quanta of light in retinal ganglion cells of the cat. Vision Research Supplement, 3, 87–101.
Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18, 349–369.
Baylor, D. A., Lamb, T., & Yau, K. W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 237–253.
Baylor, D. A., Nunn, B. J., & Schnapf, J. L. (1984). The photocurrent, noise and spectral sensitivity of rods of the monkey Macaca fascicularis. J. Physiol., 357, 575–607.
Berntson, A., Smith, R. G., & Taylor, W. R. (2004). Transmission of single photon signals through a binary synapse in the mammalian retina. Vis. Neurosci., 21, 693–702.
Bialek, W., & Owen, W. G. (1990). Temporal filtering in retinal bipolar cells: Elements of an optimal computation? Biophys. J., 58, 1227–1233.
Copenhagen, D. R., Donner, K., & Reuter, T. (1987). Ganglion cell performance at absolute threshold in toad retina: Effect of dark events in rods. J. Physiol., 393, 667–680.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dacheux, R. F., & Raviola, E. (1986). The rod pathway in the rabbit retina: A depolarizing bipolar and amacrine cell. J. Neurosci., 6, 331–345.
Field, G. D., & Rieke, F. (2002a). Mechanisms regulating variability of the single photon response of mammalian rod photoreceptors. Neuron, 35, 733–747.
Field, G. D., & Rieke, F. (2002b). Nonlinear signal transfer from mouse rods to bipolar cells and implications for visual sensitivity. Neuron, 34, 773–785.
Field, G. D., Sampath, A. P., & Rieke, F. (2005). Retinal processing near absolute threshold: From behavior to mechanism. Ann. Rev. Physiol., 67, 491–514.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Grünert, U., Martin, P. R., & Wässle, H. (1994). Immunocytochemical analysis of bipolar cells in the macaque monkey retina. Journal of Comparative Neurology, 348, 607–627.
Mackay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
Mastronarde, D. N. (1983a). Correlated firing of cat retinal ganglion cells: I. Spontaneously active inputs to X- and Y-cells. J. Neurophysiol., 49, 303–324.
Mastronarde, D. N. (1983b). Correlated firing of cat retinal ganglion cells: II. Responses of X- and Y-cells to single quantal events. J. Neurophysiol., 49, 325–349.
Rao, R., Buchsbaum, G., & Sterling, P. (1994). Rate of quantal transmitter release at the mammalian rod synapse. Biophys. J., 67, 57–64.
Sampath, A. P., & Rieke, F. (2004). Selective transmission of single photon responses by saturation at the rod-to-rod bipolar synapse. Neuron, 41, 431–443.
Schneeweis, D. M., & Schnapf, J. L. (1995). Photovoltage in rods and cones in the macaque retina. Science, 268, 1053–1056.
Soucy, E., Wang, Y., Nirenberg, S., Nathans, J., & Meister, M. (1998). A novel signaling pathway from rod photoreceptors to ganglion cells in mammalian retina. Neuron, 21, 481–493.
Sterling, P., & Demb, J. B. (2004). Retina. In G. M. Shepherd (Ed.), Synaptic organization of the brain. New York: Oxford University Press.
Sterling, P., Freed, M., & Smith, R. G. (1988). Architecture of rod and cone circuits to the on-beta ganglion cell. J. Neurosci., 8, 623–642.
Tsukamoto, Y., Morigiwa, K., Ueda, K., & Sterling, P. (2001). Microcircuits for night vision in the mouse retina. J. Neurosci., 21, 8616–8623.
van Rossum, M. C. W., & Smith, R. G. (1998). Noise removal at the rod synapse of mammalian retina. Vis. Neurosci., 15, 809–821.
Van Trees, H. L. (1968). Detection, estimation, and modulation theory I. New York: Wiley.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Völgyi, B., Deans, M. R., Paul, D. L., & Bloomfield, S. A. (2004). Convergence and segregation of the multiple rod pathways in mammalian retina. J. Neurosci., 24, 11182–11192.
Walraven, J., Enroth-Cugell, C., Hood, D. C., MacLeod, D. I. A., & Schnapf, J. L. (1990). The control of visual sensitivity. In L. Spillmann & S. J. Werner (Eds.), Visual perception: The neurophysiological foundations (pp. 53–101). San Diego, CA: Academic Press.
Received September 20, 2004; accepted April 26, 2005.
LETTER
Communicated by Emilio Salinas
Simultaneous Rate-Synchrony Codes in Populations of Spiking Neurons

Naoki Masuda
[email protected]
Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Japan, and ERATO Aihara Complexity Modelling Project, Japan Science and Technology Agency, Tokyo, Japan
Firing rates and synchronous firing are often simultaneously relevant signals, and they independently or cooperatively represent external sensory inputs, cognitive events, and environmental situations such as body position. However, how rates and synchrony comodulate and which aspects of inputs are effectively encoded, particularly in the presence of dynamical inputs, are unanswered questions. We examine theoretically how mixed information in dynamic mean input and noise input is represented by dynamic population firing rates and synchrony. In a subthreshold regime, amplitudes of spatially uncorrelated noise are encoded up to a fairly high input frequency, but this requires both rate and synchrony output channels. In a suprathreshold regime, means and common noise amplitudes can be simultaneously and separately encoded by rates and synchrony, respectively, but the input frequency for which this is possible has a lower limit.

1 Introduction

Both synchrony and firing rates seem to play important roles in sensory, cognitive, and motor behavior, although the precise role of synchrony is still unclear. Synchrony levels and firing rates can be dynamically and simultaneously modulated by different signals. For example, monkeys during motor tasks are suggested to encode behavioral events and cognitive events with firing rates and synchrony, respectively (Riehle, Grün, Diesmann, & Aertsen, 1997). Rates and synchrony respectively encode task-related signals and expectation during visual tasks (de Oliveira, Thiele, & Hoffmann, 1997). Firing rates and oscillatory synchrony represent the identity (difference) and the category (overlap) of odor stimulus patterns in zebrafish olfactory systems (Friedrich, Habermann, & Laurent, 2004). If a more general temporal code with precise spike timing is taken into account, rates and spike times may simultaneously represent stimulus identity and external time (Berry, Warland, & Meister, 1997) or movement speed and body position (Huxter, Burgess, & O'Keefe, 2003). Theoretically, such simultaneous
codes have not been sufficiently studied because many factors interact with synchrony and firing rates. For example, synchronous inputs raise firing rates (Shadlen & Newsome, 1998; Burkitt & Clark, 2001; Salinas & Sejnowski, 2001; Moreno, de la Rocha, Renart, & Parga, 2002; Tiesinga & Sejnowski, 2004), and increased firing rates can decrease synchrony (Brunel, 2000; Burkitt & Clark, 2001). Furthermore, a theory must link dynamic inputs and outputs to these experimental configurations. Another complication is input modalities. The nature of effective inputs is not trivial. In static regimes, overall levels of balanced excitation and inhibition, which can be considered the input noise amplitude, determine firing rates, particularly in the subthreshold regime (Shadlen & Newsome, 1994, 1998). Experimental (Chance, Abbott, & Reyes, 2002) and theoretical (Tiesinga, José, & Sejnowski, 2000; Burkitt, Meffin, & Grayden, 2003; Kuhn, Aertsen, & Rotter, 2004) studies of gain modulation also support the coding of noise amplitudes as firing rates, although specific input-output relations change considerably if conductance inputs are considered. Other firing properties such as the coefficient of variation have also been examined in detail in this subthreshold noise-driven regime (Shadlen & Newsome, 1998; Rudolph & Destexhe, 2003; Kuhn et al., 2004). When dynamic inputs are considered, firing rates represent noise variance up to a high cutoff frequency (Lindner & Schimansky-Geier, 2001; Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004) and abrupt changes in mean inputs (Herrmann & Gerstner, 2001; Moreno et al., 2002), as well as conventional deterministic slow inputs (Knight, 1972). The cited articles focus on single postsynaptic neurons. Populations were assumed to consist of multiple neurons whose incident noise is independent for each neuron, which is unrealistic (Shadlen & Newsome, 1994, 1998; Salinas & Sejnowski, 2001). Neurons generally share inputs from upstream neurons because of divergent connectivity. Fluctuation of local field potentials is another major source of shared noise inputs (Huxter et al., 2003). Such common inputs usually limit the amount of information in population firing rates (Shadlen & Newsome, 1994, 1998; Litvak, Sompolinsky, Segev, & Abeles, 2003), whereas they induce synchronous firing (Mainen & Sejnowski, 1995; Reyes, 2003). Although these are established results, the interactions of common noise with spatially independent noise and mean inputs and the consequences of dynamic inputs are poorly understood. In this letter, we examine how neural populations encode dynamic inputs in dynamic patterns of firing rates and synchrony. We consider dynamic inputs comprising independently changing biases, noise different for each neuron, and common noise. Relevant codes are shown to vary according to the baseline input bias and the input frequency. In section 2, to illuminate the influence of the bias and the input frequency, we treat a theoretically tractable case in which common noise inputs are absent. Section 3 treats common noise inputs, and the information that dynamical synchrony and firing rates carry about the input is determined numerically.
The relevance of the results to experiments and other theories is discussed in section 4.

2 Coding in the Absence of Common Noise Inputs

We assume an uncoupled population of n leaky integrate-and-fire (LIF) neurons with spatially uniform dynamic inputs. The dynamics of the ith neuron is described by
$$\tau\frac{dV_i(t)}{dt} = -V_i(t) + \mu_i(t), \quad (2.1)$$
where τ, V_i(t), and µ_i(t) are the membrane time constant, the membrane potential, and the external input, respectively. The neuron fires when V_i reaches the threshold θ, and then V_i is instantaneously reset to the resting potential V_r. We use LIF neurons instead of more realistic conductance-based models to facilitate mathematical analysis. Coupling between neurons is neglected, but its effect can be readily understood, as discussed in section 4. We start with the combination of a deterministic common mean input denoted by µ(t) and an independent noise source that is different for each neuron and whose amplitude is modulated according to σ_u(t) (Silberberg et al., 2004). Consequently, we write
$$\mu_i(t) = \mu_0 + \mu(t) + \sigma_u(t)\sqrt{\tau}\,\eta_i(t), \quad (2.2)$$
where µ_0 is the bias, defined so that the temporal average of µ(t) becomes zero. Gaussian white noise, which is independent for each neuron, is denoted by η_i(t). Suprathreshold inputs (µ_0 > θ) allow neurons to fire even without noise, and subthreshold inputs (µ_0 < θ) must occur with noise for neurons to fire. Let us first ask what is encoded in the asynchronous regime by the population firing rate, which is denoted by ν(t) and normalized by the number of neurons. A Fokker-Planck analysis, in which the probability density of the membrane potentials is derived analytically, reveals how the bias affects the input characteristics encoded in firing rates. Analysis of synchrony and common noise requires numerics and is covered in section 3. The Fokker-Planck equation for the probability density of membrane potentials, denoted by P(V, t), is written as
$$\frac{\partial P(V,t)}{\partial t} = \frac{\sigma_u^2(t)}{2\tau}\frac{\partial^2 P(V,t)}{\partial V^2} + \frac{\partial}{\partial V}\left[\frac{V-\mu_0-\mu(t)}{\tau}\,P(V,t)\right] = -\frac{\partial S(V,t)}{\partial V}, \quad (2.3)$$
where
$$S(V,t) = -\frac{\sigma_u^2(t)}{2\tau}\frac{\partial P(V,t)}{\partial V} - \frac{V-\mu_0-\mu(t)}{\tau}\,P(V,t) \quad (2.4)$$
is the probability current. Using the boundary condition P(θ, t) = 0, ν(t) is given by (Brunel, 2000; Lindner & Schimansky-Geier, 2001; Moreno et al., 2002; Silberberg et al., 2004)
$$\nu(t) = S(\theta,t) = -\frac{\sigma_u^2(t)}{2\tau}\frac{\partial P(\theta,t)}{\partial V}. \quad (2.5)$$
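Before turning to the transient analysis, the population rate ν(t) described by this Fokker-Planck picture can also be estimated by direct simulation of equations 2.1 and 2.2. A minimal Euler-scheme sketch, assuming NumPy; the function name, the seed, and the step size are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)  # hypothetical seed

def population_rate(mu0, mu_t, sigma_t, n=300, tau=10.0, theta=1.0,
                    v_r=0.0, dt=0.1):
    """Simulate n uncoupled LIF neurons (equations 2.1-2.2) with common
    bias mu0 + mu(t) and independent noise of amplitude sigma(t).
    Returns the fraction of neurons firing in each step of length dt (ms)."""
    steps = len(mu_t)
    v = np.zeros(n)
    rate = np.zeros(steps)
    for k in range(steps):
        noise = sigma_t[k] * np.sqrt(tau / dt) * rng.normal(size=n)
        v += (dt / tau) * (-v + mu0 + mu_t[k] + noise)
        spiked = v >= theta
        rate[k] = spiked.mean()
        v[spiked] = v_r                      # instantaneous reset
    return rate
```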
Equation 2.5 underlies the claim in Lindner and Schimansky-Geier (2001) and Silberberg et al. (2004) that σ_u(t) but not µ(t) is coded in instantaneous firing rates. However, whether ν(t) or its delayed version represents µ(t) or σ_u(t) depends appreciably on µ_0. We deal with the transient response to examine the µ_0 effect. Let us denote the stationary density and the stationary firing rate for constant inputs (µ(t) = 0 and σ_u(t) = σ̄_u) by P_S(V) ≡ lim_{t→∞} P(V, t) and ν_0, respectively. We combine ∂P_S/∂t = 0, equations 2.3 and 2.5, and an additional boundary condition caused by the resetting of neurons, a discontinuity of S(V, t) at V = V_r by the amount ν(t). Then we obtain (Brunel, 2000)
$$P_S(V) = \frac{2\nu_0\tau}{\bar{\sigma}_u}\exp\!\left(-\frac{(V-\mu_0)^2}{\bar{\sigma}_u^2}\right)\int_{(V-\mu_0)/\bar{\sigma}_u}^{(\theta-\mu_0)/\bar{\sigma}_u}\Theta\!\left(u-\frac{V_r-\mu_0}{\bar{\sigma}_u}\right)e^{u^2}\,du, \quad (2.6)$$
where Θ is the Heaviside function, and the stationary firing rate ν_0 is
$$\nu_0 = \left[2\tau\int_{(V_r-\mu_0)/\bar{\sigma}_u}^{(\theta-\mu_0)/\bar{\sigma}_u}du\,e^{u^2}\int_{-\infty}^{u}e^{-x^2}\,dx\right]^{-1}. \quad (2.7)$$
According to equation 2.3, P(V, t) changes on the timescale of τ when the inputs are nonstationary. For a transient period not much longer than τ, a first-order approximation yields
$$\frac{\partial P(V,t)}{\partial t} \cong \frac{\sigma_u^2(t)}{2\tau}\frac{\partial^2 P_S}{\partial V^2} + \frac{\partial}{\partial V}\left[\frac{V-\mu_0-\mu(t)}{\tau}\,P_S\right] = \frac{\sigma_u^2(t)-\bar{\sigma}_u^2}{\tau\,\bar{\sigma}_u^2}\left[(-V+\mu_0)\frac{\partial P_S}{\partial V}-P_S\right] - \frac{\mu(t)}{\tau}\frac{\partial P_S}{\partial V}, \quad (2.8)$$
Figure 1: The membrane potential distribution of LIF neurons in the stationary state when (A) µ0 = 0.9 and (B) µ0 = 1.1. The plots are of numerical simulations of n = 300 neurons (steps), a theoretical prediction based on equation 2.5 (solid lines) and a theoretical prediction under the continuous boundary conditions (dotted lines).
where we have used ∂P_S/∂t = 0 and equation 2.3. Slow responses in ν(t) can be estimated by
$$\frac{d\nu(t)}{dt} = -\frac{1}{2\tau}\frac{d\sigma_u^2(t)}{dt}\frac{\partial P(\theta,t)}{\partial V} - \frac{\sigma_u^2(t)}{2\tau}\frac{\partial}{\partial t}\frac{\partial P(\theta,t)}{\partial V}. \quad (2.9)$$
The first term of equation 2.9 is a consequence of the instantaneous change in ν(t) induced by the change in σu (t). Regarding the second term, we cannot exchange the derivative with respect to V and that with respect to t. This is because equation 2.8 holds only for V < θ . At V = θ , equation 2.3 is singular because of the boundary conditions. The probability current S(θ, t) is actually determined by adjusting the amount of discontinuity of S(V, t) at V = Vr so that P(θ, t) is always pinned at 0 (Brunel, 2000). Although this technicality stems from the specific assumptions (especially P(θ, t) = 0), different boundary conditions or different spiking neuron models do not qualitatively change the situation. For example, we could permit Vi to jump from V = Vr back to V = θ as a noise effect because the hard thresholding and discontinuous resetting of the LIF neuron are just approximations to real neurons. Then P(θ, t) could be nonzero. In Figures 1A (µ0 = 0.9) and 1B (µ0 = 1.1), we compare the stationary membrane potential distributions obtained numerically (steps), by equation 2.5 (solid lines), and by the modified theory outlined above (dotted lines). Figure 1 indicates that this type of modification has little effect on P(V, t) and ν(t). Consequently, we proceed with the original formalization
to evaluate ∂P(θ − ΔV, t)/∂t with ΔV ≪ θ. Because P(θ, t) = 0, an increase in P(θ − ΔV, t) with respect to t means a more negative slope of P at V = θ, or a larger firing rate. We substitute ∂P_S(θ − ΔV)/∂V ≅ ∂P_S(θ)/∂V into equation 2.8 to obtain

$$\frac{\partial P(\theta-\Delta V,t)}{\partial t} \cong \frac{2\nu_0(\theta-\mu_0)}{\bar\sigma_u^4}\,\bigl(\sigma_u^2(t)-\bar\sigma_u^2\bigr) + \frac{2\nu_0\,\mu(t)}{\bar\sigma_u^2}. \tag{2.10}$$
The second term guarantees that ν(t) encodes µ(t) with some delay caused by the single-neuron dynamics (Knight, 1972; Brunel, 2000; Lindner & Schimansky-Geier, 2001; Gerstner & Kistler, 2002). When µ0 > θ, the quantity represented by equation 2.10 decreases with σu(t) and attenuates the instantaneous response of ν(t) to σu(t) (see equation 2.5), reproducing the high-pass nature of noise-coded signals (Herrmann & Gerstner, 2001; Moreno et al., 2002). A larger bias magnifies the influence of µ(t). However, when µ0 < θ, the delayed response does not counteract the instantaneous response of ν(t) to σu(t). Consequently, σu(t) with either low or high frequency is primarily encoded in ν(t).

This finding is consistent with quasistatic arguments. If the inputs are sufficiently slow relative to τ, ν(t) is given by equation 2.7, with µ0 and σ̄u replaced by µ0 + µ(t) and σu(t), respectively. Since e^{u²}∫_{−∞}^{u} e^{−x²} dx increases monotonically with u, the integral in equation 2.7 decreases with µ(t) for any µ0, and with σu(t) for µ0 + µ(t) < θ. In these cases, ν0 duly represents µ(t) or σu(t). However, when µ0 > θ, the range of integration, [(V_r − µ0 − µ(t))/σu(t), (θ − µ0 − µ(t))/σu(t)], shrinks as σu(t) increases, whereas the magnitude of the integrand increases with σu(t). The trade-off between these two factors determines ν0, which implies that the firing rate depends more weakly on σu(t) than in the subthreshold case.

To be more quantitative, we numerically simulate n = 300 uncoupled LIF neurons with τ = 10 ms, θ = 1, and V_r = 0. We update µ(t) and σu(t) with period T (T = 1 ms was used in Silberberg et al., 2004, to generate fast inputs). The amplitude of µ(t) and that of σu(t) are chosen from uniform distributions on [−0.075, 0.075] and [0.030, 0.055], respectively. We fix these dynamic ranges, and also that of the common noise incorporated in section 3, because stronger input modulation obviously leads to better representation by the outputs. We determine ν(t) as the normalized number of spikes in each bin of width T ms, corresponding to the input switching. The cross-correlation functions between ν(t) and the inputs, corr(µ(t), ν(t)) and corr(σu(t), ν(t)), are then used as performance measures.
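A minimal sketch of this ensemble simulation is given below, assuming an Euler-Maruyama discretization of the diffusion underlying equation 2.3; the step size dt, the seed, the simulated duration, and the value µ0 = 1.1 are choices of this sketch, not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# n = 300 uncoupled LIF neurons; tau = 10 ms, theta = 1, V_r = 0 (from the text).
# mu(t) and sigma_u(t) are piecewise constant, renewed every T ms with the
# amplitude ranges stated above.
n, tau, theta, v_r, mu0 = 300, 10.0, 1.0, 0.0, 1.1   # times in ms
dt, T, t_end = 0.1, 5.0, 20_000.0
per_bin = int(T / dt)
n_bins = int(t_end / dt) // per_bin

mu = rng.uniform(-0.075, 0.075, size=n_bins)
sigma_u = rng.uniform(0.030, 0.055, size=n_bins)

V = np.zeros(n)
nu = np.zeros(n_bins)                       # spike counts per bin -> nu(t)
for k in range(n_bins * per_bin):
    b = k // per_bin
    V += (-(V - mu0 - mu[b]) / tau) * dt    # leak and drift
    V += sigma_u[b] * np.sqrt(dt / tau) * rng.standard_normal(n)  # diffusion
    spiked = V >= theta
    nu[b] += spiked.sum()
    V[spiked] = v_r                         # reset

nu /= n * T * 1e-3                          # normalize to Hz per neuron
print(np.corrcoef(mu, nu)[0, 1], np.corrcoef(sigma_u, nu)[0, 1])
```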
Figure 2: Dependence of (A) corr(µ(t), ν(t)) and (B) corr(σu(t), ν(t)) on µ0 and T in the absence of common noise inputs. The results for T = 0.3 ms (thickest lines), 1 ms, 2 ms, 5 ms, and 30 ms (thinnest lines) are shown.

Figure 2A shows the dependence of corr(µ(t), ν(t)) on µ0 and T. As this theory and those of others (Knight, 1972; Lindner & Schimansky-Geier, 2001; Gerstner & Kistler, 2002; Silberberg et al., 2004) predict, corr(µ(t), ν(t)) increases with T, and the neural ensemble works as a low-pass filter for µ(t). With T fixed, corr(µ(t), ν(t)) increases with µ0, in accordance with equation 2.10.
Figure 2B shows that corr(σu(t), ν(t)) decreases with µ0 except when T is very small. This result is predicted by equations 2.7 and 2.10. It agrees with the prediction of enhanced rate coding through subthreshold stochastic resonance (Lindner & Schimansky-Geier, 2001), and it extends the observation that the subthreshold regime yields higher gains for static noise inputs (Tiesinga et al., 2000; Chance et al., 2002; Moreno et al., 2002; Burkitt et al., 2003; Kuhn et al., 2004). For an even smaller µ0, a σu(t) large enough to drive up V_i does not generally last long enough to make the neurons fire, and rate coding of σu(t) deteriorates. Rate coding is therefore optimized at a certain µ0 for fast σu(t); the optimal bias is at a slightly subthreshold level. This may be related to the fact that membrane potentials often hover around this level (Shadlen & Newsome, 1994, 1998). An optimal T also exists for a given µ0, implying a bandpass property for σu(t). The high-pass side is expected from the theory, whereas very fast σu(t) (T = 0.5 and 1 ms in Figure 2B) cannot be captured by firing rates because n is finite (Brunel, 2000; Lindner & Schimansky-Geier, 2001).

3 Coding in the Presence of Common Noise Inputs

We next apply inputs with common noise, represented by

$$\mu_i(t) = \mu_0 + \mu(t) + \sigma_u(t)\sqrt{\tau}\,\eta_i(t) + \sigma_c(t)\sqrt{\tau}\,\eta(t), \tag{3.1}$$
where σc (t) is the dynamical signal carried by the common noise and η(t) is a gaussian white noise. We renew σc (t) every T ms to a random value
from the uniform distribution on [0, 0.025], as is done for µ(t) and σu(t). The dynamic range of σu(t) and that of σc(t) are assumed to be the same so that their effects on the dynamic outputs can be compared.

In addition to firing rates, transient synchrony, which is typically reinforced by common inputs, might be functionally relevant (de Oliveira et al., 1997; Riehle et al., 1997; Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Salinas & Sejnowski, 2001; Friedrich et al., 2004). Because both common noise and transient synchrony are difficult to treat mathematically, we resort to numerical simulations. We measure synchrony with two spike-based methods. The first statistic uses a prescribed temporal precision T′ (< T) ms. We subdivide each bin of width T ms into T/T′ sub-bins and count the number of spikes in each sub-bin; the counts are denoted by N_1, N_2, …, N_{T/T′}. Since stronger synchrony results in a more rugged distribution of the spike counts, the degree of dynamical synchrony syn(t) (iT ≤ t < (i + 1)T, i ∈ Z) is defined to be the normalized standard deviation of {N_i; 1 ≤ i ≤ T/T′}, namely,

$$\mathrm{syn}(t) = \sqrt{\frac{T'}{T}\sum_{i=1}^{T/T'}\left(N_i - \frac{T'}{T}\sum_{j=1}^{T/T'} N_j\right)^{2}} \;\Bigg/\; \frac{T'}{T}\sum_{i=1}^{T/T'} N_i. \tag{3.2}$$
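In code, equation 3.2 (as reconstructed above) reduces, for one bin, to the ratio of the standard deviation to the mean of the sub-bin spike counts; the sketch below is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def syn_measure(spike_counts):
    """Degree of dynamical synchrony, equation 3.2: normalized standard
    deviation of the sub-bin counts N_1..N_{T/T'} within one bin of width T."""
    counts = np.asarray(spike_counts, dtype=float)
    mean = counts.mean()
    if mean == 0:
        return np.nan          # bins without spikes are discarded in the text
    return np.sqrt(((counts - mean) ** 2).mean()) / mean
```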
In calculating the correlation functions between syn(t) and the inputs, we discard the bins without spikes. We set T′ = 1 ms because synchrony on this timescale is biologically relevant (Mainen & Sejnowski, 1995; Diesmann, Gewaltig, & Aertsen, 1999; Steinmetz et al., 2000).

In experiments, we often do not know when the inputs change. Binning is then impossible because we know neither T nor T′. With this severe condition in mind, a dynamic synchrony measure CV_p(t) based on lumped spike trains (Tiesinga & Sejnowski, 2004) is also calculated. The minimum distance between spike times of different neurons becomes small during synchrony. This idea is quantified by creating a spike train with spike times {t_i : i ∈ Z}, obtained by merging all the spikes from the n neurons. To define CV_p(t) as the instantaneous coefficient of variation of {t_i : i ∈ Z}, we clip a + 1 consecutive spike times {t_i : 0 ≤ i ≤ a} from {t_i : i ∈ Z} so that t is closer to t_{a/2} than to any other t_i. Then

$$CV_p(t) \equiv \frac{1}{\sqrt{n}}\;\sqrt{\frac{1}{a}\sum_{i=1}^{a}\bigl(t_i - t_{i-1}\bigr)^{2} - \left(\frac{1}{a}\sum_{i=1}^{a}\bigl(t_i - t_{i-1}\bigr)\right)^{2}} \;\Bigg/\; \frac{1}{a}\sum_{i=1}^{a}\bigl(t_i - t_{i-1}\bigr), \tag{3.3}$$

where 1/√n is the normalization factor (Tiesinga & Sejnowski, 2004). Perfect synchrony leads to CV_p(t) = 1, whereas asynchrony yields CV_p(t) = 1/√n. The choice of a is arbitrary, and we set a = 40.
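A sketch of CV_p(t) per equation 3.3 as reconstructed above; the rule for centering the window of a + 1 spikes on t is implemented here by nearest-spike indexing, an approximation of the criterion stated in the text, and the function assumes at least a + 1 merged spikes.

```python
import numpy as np

def cv_p(merged_times, t, n, a=40):
    """Instantaneous coefficient of variation of the merged spike train,
    normalized by 1/sqrt(n). merged_times: sorted spike times pooled over
    all n neurons."""
    ts = np.asarray(merged_times)
    c = int(np.argmin(np.abs(ts - t)))            # spike nearest to t
    lo = max(0, min(c - a // 2, len(ts) - (a + 1)))
    isi = np.diff(ts[lo:lo + a + 1])              # a inter-spike intervals
    return np.sqrt(isi.var()) / isi.mean() / np.sqrt(n)
```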
Although ν(t) encodes σc(t) and σu(t) in a similar way, a major difference is that the upper cutoff frequency of the input is lower for σc(t), because the synchrony it induces effectively decreases the number of neurons. With this in mind, we start with T = 5 ms, which prescribes an easy scheme for ν(t) to encode µ(t) and σu(t) under proper µ0 and sufficient asynchrony, as revealed in Figure 2. Figure 3A indicates that, in the presence of considerable interference by common noise, ν(t) favors µ(t) (squares) or σu(t) (circles), depending on µ0. The results are similar to those in Figure 2A. Although corr(σc(t), ν(t)) also decreases with µ0 (triangles), this correlation is weak regardless of µ0. However, σc(t) is represented with high fidelity by dynamical synchrony, as shown in Figure 3B, which displays corr(σc(t), syn(t)) (thick line with triangles) and corr(σc(t), CV_p(t)) (thin line with triangles). Synchrony induced by common noise thus extends to the case of mixed dynamical inputs. Only in far subthreshold regimes are synchronous firing rates too low for syn(t) or CV_p(t) to represent σc(t).

In suprathreshold regimes, simultaneously applied µ(t) and σc(t) are more or less independently encoded in dynamical firing rates and dynamical synchrony. Dynamical synchrony is anticorrelated more strongly with σu(t) than with µ(t) because noise directly desynchronizes neurons. When rate coding of σu(t) is relevant, with small µ0, synchrony carries the same information about σu(t); that is, σu(t) occupies two output channels, firing rates and synchrony. Although µ(t) is also anticorrelated with synchrony (Brunel, 2000; Burkitt & Clark, 2001), this effect is much smaller, especially for suprathreshold µ0, where rate coding of µ(t) is efficient. We note that syn(t) (thick lines in Figure 3B and also in Figures 3D and 3F, as explained below) is correlated with the inputs more often than CV_p(t) (thin lines). This is because the timing of input changes is available only for the calculation of syn(t). However, CV_p(t) behaves consistently like syn(t), indicating that the results are not sensitive to the synchrony measure.

Figure 2 indicates that the timescales of µ(t) and σu(t) affect the efficiency of rate coding. How does this extend to the cases in which σc(t) and dynamical synchrony are involved? As shown in Figures 3C and 3E for T = 2 and 15 ms, σc(t) does not influence ν(t) regardless of T or µ0, and the relevant mode of rate coding as a function of µ0 and T is similar to that shown in Figure 2, in which σc(t) is absent. In the suprathreshold regime, dynamical synchrony represents low-passed σc(t), as shown by the triangles in Figures 3D (T = 2 ms) and 3F (T = 15 ms). Then µ(t) and σc(t) can be separately encoded in the rates and the synchrony, but they are low-pass filtered. In the subthreshold regime, dynamical synchrony and firing rates represent σu(t) up to a relatively high frequency (the circles in Figures 3D and 3F). In this situation, σu(t) can be encoded up to a high frequency,
although it occupies both rate and synchrony channels.

Figure 3: (A, C, E) Cross-correlations between ν(t) and each of the three inputs (squares: µ(t); circles: σu(t); triangles: σc(t)). (B, D, F) Cross-correlations between the degrees of dynamical synchrony (thick lines: syn(t); thin lines: CV_p(t)) and the three inputs. We set (A, B) T = 5 ms, (C, D) T = 2 ms, and (E, F) T = 15 ms.
Figure 4: (A, C) Cross-correlations between ν(t) and the inputs, and (B, D) cross-correlations between the degrees of dynamical synchrony and the inputs. The strength of the uncorrelated noise lies in (A, B) [0.010, 0.035] and (C, D) [0.055, 0.080]. We set T = 5 ms. The corresponding results for the noise strength [0.030, 0.055] are shown in Figures 3A and 3B.
In short, there is a trade-off between the number of manageable inputs and the quality of the conveyed information about each input.

We also examine the effects of background noise. Since its baseline level is biologically difficult to estimate, we try several test values. In Figure 3, σu(t) has an amplitude taken from [0.030, 0.055] and can be regarded as the sum of a dynamic σu(t) with amplitude 0.025 and background noise whose static level equals 0.030. With T = 5 ms, as in Figures 3A and 3B, Figure 4 shows the coding results when the amplitude of σu(t) has the same dynamic range (= 0.025) as in Figure 3 but with different static levels.
The amplitude of σu(t) falls in [0.010, 0.035] ([0.055, 0.080]) for Figures 4A and 4B (4C and 4D). Large background noise depresses synchronous firing and the coding of σc(t) by synchrony, particularly in the suprathreshold regime (see Figure 4D). Figure 4C shows that rate coding does not degrade as much. For small background noise, σc(t) is coded with fidelity in the dynamical synchrony level (see Figure 4B). In the subthreshold regime, rate coding of σu(t) degrades considerably as background noise decreases, and synchrony is preferred. In both subthreshold and suprathreshold regimes, the relative contributions of firing rates and synchrony to input encoding depend on the level of background noise.

4 Discussion

We have examined how firing rates and synchrony can simultaneously encode dynamic inputs of different modalities. With a small bias, the independent noise signal σu(t) is encoded up to a high frequency by occupying both the rate and synchrony channels. With a large bias, firing rates and synchrony represent the mean signal µ(t) and the common noise signal σc(t) separately. Although the use of two channels is efficient in this case, the inputs can be coded only up to lower cutoff frequencies.

Our results for σu(t) extend a variety of work on gain modulation of noise-coded signals (Tiesinga et al., 2000; Chance et al., 2002; Burkitt et al., 2003; Kuhn et al., 2004) and on coding of balanced excitation-inhibition inputs (Shadlen & Newsome, 1998) to the dynamical setup. We also have extended the results of other investigations of dynamical noise inputs (Lindner & Schimansky-Geier, 2001; Silberberg et al., 2004) to network situations. It is well known that σc(t) induces synchrony and limits the precision of rate coding (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2001; Masuda & Aihara, 2003; Litvak et al., 2003; Reyes, 2003). We have shown that dynamical synchrony actually represents σc(t) and that σc(t) does not interfere with rate coding of the other inputs. Tiesinga and Sejnowski (2004) also mentioned that dynamical firing rates and synchrony may measure different entities. They used interneuron networks with spatially heterogeneous inputs to study the transition between the rate regime and the synchrony regime. We have presented another mechanism, with a systematic evaluation of the effect of different input modalities.

Real neural networks abound in feedback and heterogeneity. Stronger recurrent connectivity (Brunel, 2000; Burkitt & Clark, 2001; Gerstner & Kistler, 2002; Masuda & Aihara, 2003) and homogeneity (Brunel, 2000; Gerstner & Kistler, 2002; Masuda & Aihara, 2003) make synchrony more likely. According to Figure 4, more synchrony (asynchrony) means that synchrony (the firing rate) carries more information about dynamic inputs, regardless of the bias level. We expect that feedback and homogeneity also control the balance between the rate code and the synchrony code (Masuda & Aihara, 2003). However, feedback, which could be modeled as part of the common noise, does not exactly correspond to external input.
Rather, feedback spikes may underlie more combinatorial or memory-related codes such as the synfire chain (Abeles, 1991; Diesmann et al., 1999; Litvak et al., 2003; Reyes, 2003).

If synchrony is dynamically modulated by σc(t) in a neural population, the convergent nature of coupling (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2001) makes the dynamical synchronous outputs serve as σc(t) for a downstream population. The repetition of this process in a layered manner defines an extended synfire chain, so that time-dependent σc(t) can propagate through layers. At the same time, ν(t) at the output side of one population can serve as µ(t) at the input side of another population. This cascade produces a chain of rate code (van Rossum, Turrigiano, & Nelson, 2002; Litvak et al., 2003; Masuda & Aihara, 2003). A novel point is that these two types of chains can be multiplexed. With static inputs, simultaneous propagation of a firing rate and a synchrony level through feedforward networks was analyzed in the context of stable propagation of synfire chains (Diesmann et al., 1999; Gerstner & Kistler, 2002). Our results extend that work to a dynamic framework and to input encoding. This scheme contrasts with the situation in which the rate code and the synchrony code propagate alternatively, not simultaneously, in a feedforward manner (Masuda & Aihara, 2002, 2003; van Rossum et al., 2002).

In experimental situations in which both firing rates and spike timing are expected to be simultaneously functional (Riehle et al., 1997; Huxter et al., 2003; Friedrich et al., 2004), firing rates and synchrony may express the mean input and the common noise input, respectively. Our theory predicts that this multiplexing scheme cannot handle fast inputs. Multiplexing may be used to represent sensory, behavioral, or cognitive signals that do not require high temporal resolution, such as static odor information (Friedrich et al., 2004) and physical location and speed (Huxter et al., 2003). We are uncertain whether the behavioral and cognitive events discussed in Riehle et al. (1997) are relatively fast.

Dynamical firing rates and synchrony are negatively correlated in some behavioral tasks. For example, synchrony is high during stimulus expectation periods, whereas asynchrony accompanied by increased firing rates emerges at stimulus onset (de Oliveira et al., 1997). This observation may be understood if independent noise signals, rather than mean inputs or common noise, are raised as the stimulus is turned on. We speculate that even relatively rapid changes in stimuli can be processed in this situation.
Acknowledgments

We thank H. Nakahara, M. Okada, D. Nozaki, B. Doiron, K. Aihara, Y. Tsubo, and S. Amari for helpful discussions. This work is supported by the Special Postdoctoral Researchers Program of RIKEN.
References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Burkitt, A. N., & Clark, G. M. (2001). Synchronization of the neural response to noisy periodic synaptic input. Neural Comput., 13, 2639–2672.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
de Oliveira, S. C., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23), 9248–9260.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Friedrich, R. W., Habermann, C. J., & Laurent, G. (2004). Multiplexing using synchrony in the zebrafish olfactory bulb. Nat. Neurosci., 7(8), 862–871.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Herrmann, A., & Gerstner, W. (2001). Noise and the PSTH response to current transients: I. General theory and application to integrate-and-fire neuron. J. Comput. Neurosci., 11, 135–151.
Huxter, J., Burgess, N., & O'Keefe, J. (2003). Independent rate and temporal coding in hippocampal pyramidal cells. Nature, 425, 828–832.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766.
Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24(10), 2345–2356.
Lindner, B., & Schimansky-Geier, L. (2001). Transmission of noise coded versus additive signals through a neuronal ensemble. Phys. Rev. Lett., 86(14), 2934–2937.
Litvak, V., Sompolinsky, H., Segev, I., & Abeles, M. (2003). On the transmission of rate code in long feedforward networks with excitatory-inhibitory balance. J. Neurosci., 23(7), 3006–3015.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Masuda, N., & Aihara, K. (2002). Bridging rate coding and temporal spike coding by effect of noise. Phys. Rev. Lett., 88(24), 248101.
Masuda, N., & Aihara, K. (2003). Duality of rate coding and temporal spike coding in multilayered feedforward networks. Neural Comput., 15, 103–125.
Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89(28), 288101.
Reyes, A. D. (2003). Synchrony-dependent propagation of firing rate in iteratively constructed networks in vitro. Nat. Neurosci., 6(6), 593–599.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Rudolph, M., & Destexhe, A. (2003). The discharge variability of neocortical neurons during high-conductance states. Neuroscience, 119, 855–873.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. J. Neurophysiol., 91, 704–709.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Tiesinga, P. H. E., José, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tiesinga, P. H. E., & Sejnowski, T. J. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Comput., 16, 251–275.
van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22(5), 1956–1966.
Received October 20, 2004; accepted June 1, 2005.
LETTER
Communicated by Daniel Amit
Spontaneous Dynamics of Asymmetric Random Recurrent Spiking Neural Networks

Hédi Soula
[email protected]
Guillaume Beslon
[email protected]
Artificial Life and Behavior, PRISMA, National Institute of Applied Science, Lyon, France

Olivier Mazet
[email protected]
Mathematic Lab, Camille Jordan Institute, National Institute of Applied Science, Lyon, France
In this letter, we study the effect of a unique initial stimulation on random recurrent networks of leaky integrate-and-fire neurons. Indeed, given a stochastic connectivity, this so-called spontaneous mode exhibits various nontrivial dynamics. This study is based on a mathematical formalism that allows us to examine the variability of the subsequent dynamics as a function of the parameters of the weight distribution. Under the independence hypothesis (e.g., in the case of very large networks), we are able to compute the average number of neurons that fire at a given time—the spiking activity. In accordance with numerical simulations, we prove that this spiking activity reaches a steady state. We characterize this steady state and explore the transients.
1 Introduction

Many neurobiological problems require understanding the behaviors of large, recurrent spiking neural networks. Indeed, it is assumed that these observable behaviors are a result of the collective dynamics of interacting neurons. The question then becomes, given a connectivity of the network and a single neuron property, what are the possible kinds of dynamics? In the case of homogeneous nets (the same connectivity inside the network), some authors have found sufficient conditions for phase synchronization (locking) or stability (Chow, 1998; Gerstner, 2001). Coombes (1999) calculated Lyapunov exponents in a given symmetric connectivity map and showed that some neurons were "chaotic" (the highest exponent was positive). In very general cases (Golomb, 1994; van Vreeswijk & Sompolinsky,
1996; Meyer & van Vreeswijk, 2002), it has been shown that the dynamics can show a broad variety of aspects. In the case of integrate-and-fire (I&F) neurons, Amit and Brunel (1997a, 1997b) used consistency techniques on nets of irregularly firing neurons. This technique allowed them to derive a self-sustaining criterion. Using Fokker-Planck diffusion, the same kind of method was used for linear I&F neurons in Mongillo and Amit (2001), Fusi and Mattia (1999), and Mattia and Del Giudice (2000); for stochastic network dynamics with noisy input current (Del Giudice & Mattia, 2003); and in the case of sparse weight connectivity (Brunel, 2000).

However, stochastic recurrent spiking neural networks are rarely studied in their spontaneous functioning. Indeed, in most cases, the dynamics is driven by an external current—whether meaningful or noisy. Without this external current, the resulting behavior is often considered trivial. However, our experimental results show that large, random recurrent networks do exhibit nontrivial functioning modes. Depending on a coupling parameter between neurons (in our case, the variance of the distribution of weights), the network is able to follow a wide spectrum of spontaneous behaviors, from trivial neural death (the initial stimulation does not produce any further spiking activity) to an extreme locking mode (some neurons fire all the time, while others never do). In the intermediate states, the average spiking activity grows with the variance.

Thus, we basically follow the same ideas as in Amit and Brunel (1997a) and Fusi and Mattia (1999) and try to predict these behaviors for large, random networks. In this case, we need to make an independence hypothesis and use mean field techniques. Note that this so-called mean field hypothesis has been rigorously proven for a different neuronal network model (Moynot & Samuelides, 2002). More precisely, in our case, the connectivity weights follow an independent and identically distributed law, and the neurons' firing activities are supposed to be independent.

After introducing the spiking neural model, we propose a mathematical formalism that allows us to determine (with some approximations) the probability law of the spiking activity. Since no hypothesis other than independence is used, a re-injection of the dynamics is needed. It leads, expectedly, to a massive use of recursive equations. Although nonintuitive, these equations are a solid ground on which many conclusions can rigorously be drawn. Fortunately, the solutions of these equations behave as expected: the average spiking activity (and, as a consequence, the average frequency) reaches a steady state very quickly. Moreover, this steady state depends only on the parameters of the weight distribution. To keep the arguments simple, we detail the process for a weight matrix following a centered normal law. Extensions are proposed afterward for a nonzero mean and a sparse connectivity. All of these results agree accurately with data from simulated neural networks.
2 The Neural Model

The following series of equations describes the discrete I&F model we use throughout this letter (Tuckwell, 1988). Our network consists of N all-to-all coupled neurons. Each time a given neuron fires, a synaptic pulse is transmitted to all the other neurons. This firing occurs whenever the neuron potential V crosses a threshold θ from below. Just after the firing occurs, the potential of the neuron is reset to 0. Between a reset and a spike, the dynamics of the potential is given by the following (discrete) temporal equation:
$$V_i(t+1) = \gamma V_i(t) + \sum_{j=1}^{N} \sum_{n>0} W_{ij}\,\delta\bigl(t - T_j^n\bigr). \tag{2.1}$$
The first part of the right-hand side of the equation describes the leak current—γ is the leak (0 ≤ γ ≤ 1). Obviously, a value of 0 for γ indicates that the neuron has no short-term memory; on the other hand, γ = 1 describes a linear integrator. Since we study only spontaneous dynamics, there is no need to introduce any input current in equation 2.1. The W_ij are the synaptic influences (weights), and δ(x) = 1 whenever x = 0 and 0 otherwise (Kronecker symbol). The T_j^n are the times of firing of neuron j, each a multiple of the sample discretization time. The times of firing are formally defined for all neurons i by V_i(T_i^n) ≥ θ, the nth firing date being given recursively as

$$T_i^n = \inf\bigl\{\, t \mid t > T_i^{n-1},\; V_i(t) \ge \theta \,\bigr\}. \tag{2.2}$$
We set T_i^0 = −∞. Moreover, once it has fired, the neuron's potential is reset to zero: when computing V_i(T_i^n + 1), we set V_i(T_i^n) = 0 in equation 2.1. Finally, in order to simplify, we restrict ourselves to a synaptic weight distribution that follows a centered normal law N(0, σ²), and we let φ = σ√N be the coupling factor.

3 General Study

In this section, we give a very general formulation of the distribution of the spiking activity, defined as X_t—the number of firing neurons at time step t for a given realization. The basic idea consists of partitioning the spiking activity according to the instantaneous period of the neurons. Hence, we write

$$X_t = X_t^{(1)} + \cdots + X_t^{(t-1)},$$
where X_t^{(k)} is the number of neurons that have fired at t and at t − k but not in between. If we suppose that the starting potential of all neurons is 0 and that only X_0 neurons were excited (in order to make them fire), we have V_i(1) = Σ_{j=1}^{X_0} W_ij. Thus, using equation 2.2, we get

$$X_1 = \sum_{i=1}^{N} \chi_{\{V_i(1) > \theta\}},$$
where χ_{{V_i(1)>θ}} = 1 whenever V_i(1) > θ and 0 otherwise. Furthermore, we have

$$X_2 = \sum_{i=1}^{X_1} \chi_{\{V_i(2) > \theta\}} + \sum_{i=1}^{N} \chi_{\{V_i(1) < \theta,\; V_i(2) > \theta\}}.$$

Indeed, the firing neurons at time step 2 are those that have fired twice (at t = 1 and t = 2), that is, $X_2^{(1)} = \sum_{i=1}^{X_1} \chi_{\{V_i(2)>\theta\}}$ (since the reset potential is 0), to which we add those that did not fire at t = 1, $X_2^{(2)} = \sum_{i=1}^{N} \chi_{\{V_i(1)<\theta,\, V_i(2)>\theta\}}$. Thus, for general t, taking into account the initial step, we have recursively
$$X_t = \sum_{u=1}^{t} \hat{X}_{t-u}\; \chi_{\{V_i(t-u+1)<\theta,\;\ldots,\;V_i(t-1)<\theta,\;V_i(t)>\theta\}}$$

P(V_i(t) > θ), where ξ ∼ N(pv_min, ((1 − p)C + x_t)σ²). We can find the P(k, t) recursively, noting that the probability p depends on the previous x_t. To simplify, we suppose that v_min = 0 (i.e., the neuron cannot take negative values). In this case, whatever the charge is, the probability p is always 1/2. So the last equation becomes

$$P(V_i(t) > \theta) = P\bigl(\xi > \theta\bigr), \qquad \xi \sim N\!\left(0,\; \frac{C}{2} + x_t\,\sigma^2\right). \tag{5.12}$$
It acts as if the decay rate was divided by two. Thus, inserting this equation into equation 4.4 leads to
$$x_{t+1} = \sum_{m=0}^{t} \hat{x}_m\; p_\phi\!\left(\sum_{i=0}^{t-m} \frac{\gamma^i}{2}\, x_{t-i}\right) \prod_{j=0}^{t-m-1}\left[\, 1 - p_\phi\!\left(\sum_{k=0}^{j} \frac{\gamma^k}{2}\, x_{j-k+m}\right)\right], \tag{5.13}$$

where x̂_m = x_m for all m > 0, and x̂_0 = 1.
5.4 Variance Evolution. The computation of P(k, t) enables us to compute another moment of the distribution. We recall (using the second Wald identity)

$$\mathrm{Var}(X_t) = \sum_{k=0}^{t-1}\Bigl( E(\hat{X}_k)\,P(k,t)\bigl(1 - P(k,t)\bigr) + \mathrm{Var}(\hat{X}_k)\,P(k,t)^2 \Bigr).$$

This equation can be rewritten as

$$\mathrm{Var}(X_t) - E(X_t) = \sum_{k=0}^{t-1} P(k,t)^2 \bigl( \mathrm{Var}(\hat{X}_k) - E(\hat{X}_k) \bigr). \tag{5.14}$$

Since the expectation converges, using the same reasoning, we can conclude that the variance converges to a stationary state.
Since the expectation converges, using the same reasoning, we can conclude that the variance converges to a stationary state. 5.5 Extending the Class of the Weight Matrix Distribution. For simplicity, we used a centered normal law. We now can extend the class of weight distribution available to compute the P(k, t). It is easy to insert a nonzero mean for the distribution. Let us assume √ that the weights follow a normal law N(µ/N, φ/ N). Provided that all the hypotheses remain valid, we can insert this mean into the definition of pφ , which becomes pφ,µ , defined as 1 pφ,µ (y) = √ 2π
∞ θ −µy √ yφ
x2
e − 2 dx.
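In code, this modified firing probability can be evaluated directly; the helper below is a hypothetical transcription of the definition, with θ = 1 assumed as in the authors' simulations.

```python
from math import erf, sqrt

def p_phi_mu(y, phi, mu, theta=1.0):
    """p_{phi,mu}(y): probability that a N(mu*y, y*phi**2) charge exceeds theta,
    i.e., 1 - Phi((theta - mu*y) / (phi*sqrt(y)))."""
    if y <= 0:
        return 0.0
    z = (theta - mu * y) / (phi * sqrt(y))
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))
```

The sparse-matrix variant introduced at the end of this section then amounts to calling `p_phi_mu((1 - p) * y, phi, mu)`.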
It remains now to replace the previous p_φ with this new one. We can extend this result to a more biological model using two separate populations of neurons: excitatory and inhibitory. Let us suppose that we have N neurons, with N_i inhibitory neurons and N_e excitatory neurons, so that N = N_e + N_i. As before, the whole network is totally connected, and the weight distribution for each population follows a gaussian distribution—N(µ_i, σ_i) for inhibitors and N(µ_e, σ_e) for excitators. In this case, an excitatory neuron projects toward both populations, and the draws are independent; the same is true for an inhibitory neuron. Assuming that µ_i < 0 and µ_e > 0, we can compute X_{t+1} using

$$P(V_i(t+1) > \theta) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta - \mu_i X_t^i - \mu_e X_t^e}{\sqrt{\sigma_i^2 X_t^i + \sigma_e^2 X_t^e}}}^{\infty} e^{-x^2/2}\, dx.$$

We reuse the same equations with these two variables, since we have X_{t+1}^e = (1 − N_i/N)X_{t+1} and X_{t+1}^i = X_{t+1} − X_{t+1}^e. Indeed, the whole charge created by the spiking activity is spread among all neurons.
We can extend this to more than two populations; the results follow the same pattern as in the case of one population.

Finally, we can introduce a sparse weight matrix (a matrix with zero coefficients), computed as follows: a weight w has a probability p of being zero and a probability 1 − p of following a normal law N(µ/N, φ/√N). As above, this leads to a new function p̂_{φ,µ}. When calculating the charge, it previously came from X neurons, leading to a sum of X normal laws; in the case of the sparse matrix, it reduces to (1 − p)X neurons. So our new function becomes p̂_{φ,µ}(y) = p_{φ,µ}((1 − p)y). It remains to insert this new function into equation 5.13.

6 Results and Comparison

In order to compute the P(k, t), we need a seemingly false hypothesis: the independence of the charges. The total charge of a given neuron is calculated as if the weight matrix were redrawn at each time step. In other words, the stochastic input of one neuron (given by the others) is treated as a noisy threshold. As the results will show, this is a rather good approximation. It also supposes that, in the self-sustaining mode, the system does not die. This problem will appear every time: it leads either to a slight overestimation of the spiking activity (when the probability of dying is weak) or to a complete failure (for intermediate coupling factors).

We conducted extensive numerical simulations to confront our formulas. For each set of parameters, 1000 random networks of 1000 I&F neurons were used. We used the same threshold value (θ = 1.0) and, to be in concordance with equation 5.13, we set v_min = 0. We tested various γ, φ, x_0, µ, and p. All results were consistent with the theoretical computations. We obtain a striking accuracy in describing the temporal evolution of the averaged spiking activity, both qualitatively and quantitatively. The steady state is quickly reached (a few time steps), and the transients are strikingly well predicted by our equations. When γ ≠ 0, the prediction was slightly overestimated; when γ = 0.0, the prediction is accurate.

Sample results are displayed in Figures 1 and 2 for µ = 0.0 (centered weight matrix) and p = 0.0 (no sparsity) and various values of φ and γ. The figures also show that when γ = 0.0 and φ = 2.5 (in Figure 1), and also when γ = 1.0 and φ = 1.5, the prediction completely fails. These are not isolated points. In fact, for every γ, there is an interval of φ (once p and µ are chosen) where the spiking activity shows no regularity. On the boundary between death and self-sustaining activity, both independence hypotheses fail.
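The following sketch illustrates this comparison in the simplest setting, γ = 0, where appendix C reduces the recursion to x_{t+1} = p_φ(x_t); the seed, the value of φ, and the single-network simulation are choices of this sketch, and the network keeps a fixed weight matrix while the theory effectively redraws it each step (the independence hypothesis).

```python
import numpy as np
from math import erf, sqrt

def p_phi(y, phi, theta=1.0):
    """p_phi(y) = P(N(0, y*phi**2) > theta) = 0.5*(1 - erf(theta/(phi*sqrt(2y))))."""
    if y <= 0:
        return 0.0
    return 0.5 * (1.0 - erf(theta / (phi * sqrt(2.0 * y))))

phi, x = 3.5, 0.10                          # coupling factor and x0 = 0.10
mean_field = [x]
for _ in range(30):                         # mean-field recursion, gamma = 0
    x = p_phi(x, phi)
    mean_field.append(x)

N, theta = 1000, 1.0                        # one network, parameters from the text
rng = np.random.default_rng(1)
W = rng.normal(0.0, phi / np.sqrt(N), size=(N, N))
fired = np.zeros(N, dtype=bool)
fired[: int(0.10 * N)] = True               # initial stimulation
activity = [fired.mean()]
for _ in range(30):
    V = W[:, fired].sum(axis=1)             # gamma = 0: potential = summed input
    fired = V > theta
    activity.append(fired.mean())

print(mean_field[-1], activity[-1])         # both settle near the same fixed point
```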
Figure 1: Transients: comparison between theoretical computations and experimental data. Results for two leak values (γ ∈ {0.0, 0.5}). The spiking activity increases with the coupling factor (φ ∈ {1.5, 2.5, 3.5, 4.5, 5.5}). Theoretical results are displayed with a plain curve, and experimental data points are circles. The curves are paired and increasing with φ: the lowest pair (experimental and theoretical) corresponds to φ = 1.5, while the highest corresponds to φ = 5.5. Note that for γ = 0.0 and φ = 2.5, the theoretical computation predicts a much higher value than the experimental result. Note also the slight overestimation of the spiking activity as soon as γ ≠ 0. The starting fraction of spiking neurons is x_0 = 0.10 for each simulation. Parameters: N = 1000, θ = 1, p = µ = 0.
For the variance prediction, the results are less accurate. Expectedly, the higher the moment, the harder the prediction; more neurons would probably be needed to obtain an accurate prediction. Nevertheless, when φ is high enough, we are able to describe precisely the evolution of the variance. It converges as quickly as the expectation to a limit, and this limit decreases with the coupling factor. Since the variance computation is based on the expectation, we did not expect it to work around intermediate values of φ, where the prediction completely fails. Figure 3 displays typical variance predictions for two values of φ and a leak equal to 0.9 (the other parameters were θ = 1, p = 0, and µ = 0.0).
Figure 2: Transients (continued): More comparisons between theoretical computations and experimental data. Results for two higher leak values (γ ∈ {0.9, 1.0}). The spiking activity increases with the coupling factor (φ ∈ {1.5, 2.5, 3.5, 4.5, 5.5}). Theoretical results are displayed with a plain curve; experimental data points are circles. The curves are paired and increasing with φ (see Figure 1). Note that the slight overestimation of the spiking activity increases with γ . Note also that when γ = 1.0 and φ = 1.5, theory predicts death, while experimental values do not converge to zero. The starting number of spiking neurons is x0 = 0.10 for each simulation. Parameters: N = 1000, θ = 1, p = µ = 0.
7 Discussion

The independence hypothesis can be a powerful way to approach random networks of spiking neurons. Mean field techniques and local field partitions are commonly used to deal with them. However, to the best of our knowledge, no other study has allowed a theoretical derivation of all the moments of the spiking activity distribution. Using this framework, we are able to describe the behaviors of large spiking neural networks in spontaneous functioning. The initial stimulation corresponds to a synchronous spiking of a fraction (x_0) of the network. This shows that a spontaneous regime can be self-sufficient: networks can afford discontinuous inputs without losing their internal activity.
Figure 3: Variance. The variance of the spiking activity is displayed for two values of the coupling factor (φ ∈ {4.5, 5.5}). For clarity, the variance was scaled by √N and displayed with the corresponding predicted expectation of the spiking activity (dotted curves). As in previous figures, circles correspond to experimental data points and the plain curve to the theoretical computation. Note that the variance converges as quickly as the expectation. Note also that the higher the coupling factor, the better the prediction. Parameters: N = 1000, θ = 1, γ = 0.9, p = µ = 0.
More precisely, we proved that the coupling factor (φ = σ√N) characterizes the average spiking activity. For instance, whatever the initial stimulation (provided it is strong enough), the network's average spiking activity reaches the same steady state. Since neural death is also a possible steady state, in dynamical systems terminology we exhibited a bifurcation depending on the value of φ (Brunel & Hakim, 1999). The spiking activity grows with the coupling factor, while the variability of the spiking activity distribution decreases. Moreover, for a high value of φ, it seems that the self-sustaining activity is maintained by neurons that fire at the maximum rate.

Averaging does not tell us anything about one particular neuron. Experimental data show that a very high value of φ is needed to obtain periodic neurons. However, the number of periodic neurons grows with the coupling factor, leading to an extreme locking. In the limit φ = ∞, all neurons are periodic (with period either 1 or ∞), and those that fire do so synchronously (that is, all the time). However, around the bifurcation, independence is doomed to fail and, consequently, so is the prediction. Indeed, the coupling is too weak to allow regularities.
Since our method allows us to derive all the moments of the distribution, a way is open to obtain a full description of the spiking activity distribution. We are also able to provide the stochastic distribution of the neurons' instantaneous periods. This information is a starting point for studying the effect on spiking activity of a learning algorithm based on spiking delays.

Appendix A: Wald's Identity

Let us define, for s ∈ [0, 1], the probability generating function G_X(s) of a random variable X by G_X(s) = E(s^X). It is easy to see that the nth factorial moment of X is given by the nth derivative of G_X evaluated at 1:

$$G_X'(1) = E(X),\quad G_X''(1) = E\bigl(X(X-1)\bigr),\ \ldots,\quad G_X^{(n)}(1) = E\bigl(X(X-1)\cdots(X-n+1)\bigr).$$

Given (X_i)_{i∈N}, a sequence of independent and identically distributed (i.i.d.) random variables, and N, an integer-valued random variable independent of the X_i, let us define Y = Σ_{i=1}^{N} X_i. We then have

$$G_Y(s) = E\!\left(s^{\sum_{i=1}^{N} X_i}\right) = \sum_{n=0}^{+\infty} P(N=n)\, E\!\left(s^{\sum_{i=1}^{n} X_i}\right) = \sum_{n=0}^{+\infty} P(N=n)\, E\!\left(s^{X_1}\right)^{n} = G_N\bigl(G_{X_1}(s)\bigr).$$

By differentiating the above equality once and then twice, we respectively obtain the first and second Wald identities:

$$E(Y) = E(N)\,E(X_1), \qquad \mathrm{Var}(Y) = E(N)\,\mathrm{Var}(X_1) + \mathrm{Var}(N)\,E(X_1)^2.$$
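A quick Monte Carlo check of both identities; the laws chosen for N and X_i below are hypothetical, picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# N ~ Binomial(20, 0.5), X_i ~ Bernoulli(0.3), Y = sum_{i=1}^{N} X_i
trials = 200_000
N = rng.binomial(20, 0.5, size=trials)
Y = rng.binomial(N, 0.3)                  # sum of N Bernoulli(0.3) draws

p, EN, VarN = 0.3, 20 * 0.5, 20 * 0.25
print(Y.mean(), EN * p)                               # E(Y) = E(N) E(X1)
print(Y.var(), EN * p * (1 - p) + VarN * p**2)        # second Wald identity
```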
Appendix B: Random Sum of Random Variables

Let us first prove a general result, which will have an application for the neuronal potential, written as a random sum of i.i.d. random variables.

Let f be a function such that lim_{k→∞} f(k) = α ∈ R, and let (p_k^N)_{(k,N)∈N²} be a sequence satisfying ∀k ∈ N, lim_{N→∞} p_k^N = 0 and Σ_{k=1}^{N} p_k^N = 1. Now define g(N) = Σ_{k=1}^{N} p_k^N f(k); we prove that

$$\lim_{N\to\infty} g(N) = \alpha. \tag{B.1}$$

∀ε > 0, ∃N_0 ∈ N, ∀k > N_0, |f(k) − α| < ε/2. So we can write

$$|g(N)-\alpha| \le \left|\sum_{k=1}^{N_0} p_k^N\bigl(f(k)-\alpha\bigr)\right| + \left|\sum_{k=N_0}^{N} p_k^N\bigl(f(k)-\alpha\bigr)\right| \le \left|\sum_{k=1}^{N_0} p_k^N\bigl(f(k)-\alpha\bigr)\right| + \frac{\varepsilon}{2}.$$

N_0 being fixed, it remains, to complete the proof, to get a rank N_1 so that ∀N > N_1,

$$\left|\sum_{k=1}^{N_0} p_k^N\bigl(f(k)-\alpha\bigr)\right| < \frac{\varepsilon}{2},$$

which holds since each p_k^N → 0 as N → ∞ and the sum is finite.

Application. Let X^{(N)} be a sequence of random variables on [1, N] so that

$$\lim_{N\to\infty} E\bigl(X^{(N)}\bigr) = \infty \quad\text{and}\quad \forall k,\ \lim_{N\to\infty} P\bigl(X^{(N)} = k\bigr) = 0,$$

which is satisfied, for instance, when X^{(N)} ∼ N(N/2, σ²) or when X^{(N)} is uniform on [1, N]. In this case, if we set p_k^{(N)} = P(X^{(N)} = k), we can write g(N) = E(f(X^{(N)})), and equation B.1 yields

$$\lim_{N\to\infty}\Bigl( E\bigl(f(X^{(N)})\bigr) - f\bigl(E(X^{(N)})\bigr) \Bigr) = 0. \tag{B.2}$$

Equation 4.1 derives from the case

$$f(k) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta}{\sqrt{k}\,\sigma}}^{+\infty} e^{-x^2/2}\, dx.$$
Appendix C: Simple Case

We prove here that substituting γ = 0 in equation 4.4 gives the equation x_t = p_φ(x_{t−1}).
It yields

$$x_{t+1} = \sum_{m=0}^{t} \hat{x}_m\, p_\phi(x_t) \prod_{j=m}^{t-1}\bigl(1 - p_\phi(x_j)\bigr).$$

For t = 0, it gives x_1 = p_φ(x_0). Using the recurrence hypothesis ∀m = 1, …, t, x_m = p_φ(x_{m−1}), we find that

$$x_{t+1} = \sum_{m=0}^{t} p_\phi(x_{m-1})\, p_\phi(x_t) \prod_{j=m}^{t-1}\bigl(1 - p_\phi(x_j)\bigr) = p_\phi(x_t) \sum_{m=0}^{t} p_\phi(x_{m-1}) \prod_{j=m}^{t-1}\bigl(1 - p_\phi(x_j)\bigr) = p_\phi(x_t)\, v_{t-1}.$$

It is now enough to note that

$$v_{t-1} = p_\phi(x_{t-1}) + \bigl(1 - p_\phi(x_{t-1})\bigr) \sum_{m=0}^{t-1} p_\phi(x_{m-1}) \prod_{j=m}^{t-2}\bigl(1 - p_\phi(x_j)\bigr) = p_\phi(x_{t-1}) + \bigl(1 - p_\phi(x_{t-1})\bigr)\, v_{t-2}.$$

Since v_0 = (1 − p_φ(x_0)) + p_φ(x_0) = 1, we have v_t = 1 for all t. We thus obtain equation 4.6 by recurrence.

Appendix D: Sufficient Condition for Neural Death

Let us go back to

$$p_\phi(y) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta}{\sqrt{y}\,\phi}}^{\infty} e^{-x^2/2}\, dx,$$
for y ∈ [0, 1]. Taking the derivative with respect to y, we have

$$p_\phi'(y) = \frac{\theta}{2\sqrt{2\pi}\,\phi\, y^{3/2}}\; e^{-\frac{\theta^2}{2y\phi^2}}.$$
We see immediately that p_φ′(y) ≥ 0, so p_φ is increasing. Stable nonzero fixed points should appear for a value of y where p_φ crosses the line y = x from above. Then a sufficient condition for neural death is that ∀y, p_φ′(y) < 1.
Let z = 1/y, τ = θ/(√2 φ), and g(z) = p_φ′(1/z). Then

$$g(z) = \frac{\tau\, z^{3/2}}{2\sqrt{\pi}}\; e^{-\tau^2 z}.$$

Taking the derivative of g(z) yields

$$g'(z) = \frac{\tau\, z^{1/2}}{2\sqrt{\pi}}\; e^{-\tau^2 z}\left(\frac{3}{2} - \tau^2 z\right).$$

So g′(z) has the same sign as 3/(2τ²) − z, and since g(0) = g(+∞) = 0, the maximum of g(z) is obtained for z = 3/(2τ²). But z = 1/y; thus, z ∈ [1, +∞[. It leads to

$$\max_{y\in[0,1]} p_\phi'(y) = \max_{z\in[1,+\infty[} g(z) = \max\left\{ g\!\left(\frac{3}{2\tau^2}\right),\; g(1) \right\}.$$

So a sufficient condition for zero to be the only fixed point becomes g(3/(2τ²)) < 1. Since τ = θ/(√2 φ),
20 in practice),

$$\hat{x}_k \approx M\hat{x}_{k-1} + Kz_k \approx M^2\hat{x}_{k-2} + MKz_{k-1} + Kz_k \approx \cdots \approx M^{k-1}\hat{x}_1 + \sum_{j=0}^{k-2} M^j K z_{k-j},$$

where M^j is the jth power of the matrix M (i.e., M² = M · M). The above equation shows that in the Kalman framework, the estimate at time step k is a linear function of the firing rates at all time instants from time t_2 to the present. This corresponds to the Wiener filter (Gelb, 1974), but the advantage of the Kalman implementation is that it computes the state estimate recursively, resulting in an efficient algorithm. Note also that the coefficients of the firing rates decay exponentially as one moves back from the current time step. This shows three basic properties of the Kalman filter: (1) it estimates the state at each time step using all the previous and present measurements (firing rates); (2) the weights of the firing rates decay exponentially (those far from the present time have a weak effect on the state estimate); and (3) for k ≫ 1, the state estimate is approximately independent of the initial state. This last point means that the choice of the initial state is relatively unimportant.
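A small numerical illustration of this unrolling is given below; all matrices are placeholders (not the fitted model parameters of this paper), and M is built in the standard steady-state form (I − KH)A assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder system: d_state kinematic dimensions, d_obs recorded cells.
d_state, d_obs = 4, 10
A = 0.9 * np.eye(d_state)                   # assumed state transition
H = rng.normal(size=(d_obs, d_state))       # assumed observation (tuning) matrix
K = 0.05 * H.T                              # stand-in for the converged Kalman gain
M = (np.eye(d_state) - K @ H) @ A

# The coefficient of the firing rates j steps in the past is M^j K;
# its norm shrinks geometrically with j, as the unrolled recursion shows.
for j in range(6):
    coeff = np.linalg.matrix_power(M, j) @ K
    print(j, np.linalg.norm(coeff))
```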
A.4 Effect of Bin Size on Decoding Accuracy. We studied the effect of varying the bin size on decoding accuracy and found that accuracy was improved by increasing the bin size beyond the 70 ms and 50 ms bins used in the analysis above. Table 7 and Figure 10A summarize the results. For the pinball task, we varied the bin size, Δt, by multiples of 70 ms from 0 to 700 ms. For the pursuit tracking task, we considered bins ranging from 0 to 700 ms in 50 ms increments. In all cases, we used nonoverlapping time bins, as the use of overlapping bins results in a severe violation of the assumption of conditional independence underlying the Kalman framework (see equation 2.7). With overlapping bins, data are "reused," and the resulting estimates are no longer statistically valid. For each test condition (bin size), the Kalman model was trained, and hand kinematics were calculated every Δt ms. Table 7 shows that larger bins resulted in better decoding accuracy, up to a limit. One reason for this
Table 7: Decoding Results for Varying Bin Sizes.

Pinball Task:

  Δt (ms)   CC (x, y)        MSE (cm²)
  70        (0.82, 0.93)     5.87
  140       (0.83, 0.93)     5.29
  210       (0.81, 0.92)     5.45
  280       (0.82, 0.93)     5.16
  350       (0.78, 0.92)     5.95
  420       (0.78, 0.92)     5.81
  490       (0.75, 0.89)     7.25

Pursuit Tracking Task:

  Δt (ms)   CC (x, y)        MSE (cm²)
  50        (0.79, 0.68)     5.99
  100       (0.80, 0.68)     5.63
  150       (0.81, 0.68)     5.27
  200       (0.81, 0.69)     5.03
  250       (0.81, 0.69)     4.89
  300       (0.81, 0.70)     4.66
  400       (0.79, 0.70)     4.59
  500       (0.79, 0.71)     4.64

Notes: The other model parameters are as described above. Pinball task: uniform lag = 140 ms, kinematic model = (pos, vel, accel); pursuit tracking task: uniform lag = 150 ms, kinematic model = (pos, vel). The lowest MSE is obtained at Δt = 280 ms for the pinball task and Δt = 400 ms for the pursuit tracking task.
may be that as the bin size grows, the variation in the firing rate is better approximated by the gaussian model used in the Kalman filter. In the case of the slow motions of the pursuit tracking task, larger bin sizes were appropriate. For the fast motions of the pinball task, bin sizes beyond approximately 280 ms resulted in a loss of accuracy. This suggests
that while larger bin sizes can increase accuracy, the ultimate size is limited and is related to the speed of motion.

Figure 10: (A) Decoding accuracy (MSE) as a function of bin size for the pinball task (left) and the pursuit tracking task (right). (B) Example reconstructions with varying bin size. (Left) Pinball task: 70 ms (solid), 280 ms (tightly dashed), and 490 ms (loosely dashed). (Right) Pursuit tracking task: 50 ms (solid), 300 ms (tightly dashed), and 500 ms (loosely dashed).

Additionally, increased bin size had a negative effect on the detail of the recovered trajectories (see Figure 10B). As bin size increases, the frequency of state estimates decreases, resulting in a coarser approximation to the underlying trajectory. In general, larger bin sizes (up to some limit) produce more accurate results, but at the cost of introducing a delay in estimating the system state. The constraints of a particular application will determine the appropriate bin size. Note that if a uniform lag of, for example, 140 ms is used, we can exploit measurement data binned into 140 ms time bins without introducing any delay (or output lag) in the estimate of the system state relative to the natural hand motion. For real-time prosthesis applications, this system delay should be less than 200 ms, which suggests that the overall bin size minus the uniform lag time should be less than 200 ms. For the pinball task, with a 140 ms lag, this would mean a maximum bin size of approximately 280 ms. For the pursuit tracking task, with a 150 ms lag, a maximum bin size of 250 to 300 ms would be appropriate. While this increases accuracy, it comes at the cost of a "jerkier" reconstruction.

Acknowledgments

This work was supported in part by the DARPA BioInfoMicro Program, the NIH NINDS Neural Prosthesis Program and Grant #NS25074, and the NSF ITR Program award #0113679. We thank D. Mumford, E. Brown, M. Serruya, A. Shaikhouni, J. Dushanova, C. Vargas-Irwin, L. Lennox, D. Morris, D. Grollman, and M. Fellows for their assistance. J.P.D. is a cofounder and shareholder in Cyberkinetics, Inc., a neurotechnology company that is developing neural prosthetic devices.

References

Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899–1907.
Brown, E., Frank, L., Tang, D., Quirk, M., & Wilson, M. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.
Carmena, J. M., Lebedev, M. A., Crist, R. E., O'Doherty, J. E., Santucci, D. M., Dimitrov, D. F., Patil, P. G., Henriquez, C. S., & Nicolelis, M. A. L. (2003). Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology, 1, 001–016.
Cisek, P., & Kalaska, J. F. (2002). Simultaneous encoding of multiple potential reach directions in dorsal premotor cortex. Journal of Neurophysiology, 87, 1149–1154.
Donoghue, J. P., Sanes, J. N., Hatsopoulos, N. G., & Gaal, G. (1998). Neural discharge and local field potential oscillations in primate motor cortex during voluntary movements. Journal of Neurophysiology, 79, 159–173.
Eden, U., Frank, L., Barbieri, R., Solo, V., & Brown, E. (2004). Dynamic analysis of neural encoding by point process adaptive filtering. Neural Computation, 16, 971–988.
Fetz, E., Toyama, K., & Smith, W. (1991). Synaptic interaction between cortical neurons. In A. Peters & E. Jones (Eds.), Cerebral cortex (Vol. 9, pp. 1–47). New York: Plenum.
Flament, D., & Hore, J. (1988). Relations of motor cortex neural discharge to kinematics of passive and active elbow movements in the monkey. Journal of Neurophysiology, 60, 1268–1284.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. P. (2002). Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 213–220). Cambridge, MA: MIT Press.
Gao, Y., Black, M. J., Bienenstock, E., Wu, W., & Donoghue, J. P. (2003). A quantitative comparison of linear and non-linear models of motor cortical activity for the encoding and decoding of arm motions. In 1st International IEEE/EMBS Conference on Neural Engineering (pp. 189–192). Capri, Italy.
Gelb, A. (1974). Applied optimal estimation. Cambridge, MA: MIT Press.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Georgopoulos, A., Kalaska, J., Caminiti, R., & Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2, 1527–1537.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neural population coding of movement direction. Science, 233, 1416–1419.
Ghazanfar, A., Stambaugh, C., & Nicolelis, M. (2000). Encoding of tactile stimulus location by somatosensory thalamocortical ensembles. Journal of Neuroscience, 20, 3761–3775.
Gribble, P. L., & Scott, S. H. (2002). Method for assessing directional characteristics of non-uniformly sampled neural activity. Journal of Neuroscience Methods, 113, 187–197.
Hatsopoulos, N., Ojakangas, C., Paninski, L., & Donoghue, J. (1998). Information about movement direction obtained from synchronous activity of motor cortical neurons. Proceedings of the National Academy of Sciences, 95(26), 15706–15711.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME, Journal of Basic Engineering, 82, 35–45.
Kalman, R. E., & Bucy, R. (1961). New results in linear filtering and prediction. Trans. ASME, Journal of Basic Engineering, 83, 95–108.
Kennedy, P. R., & Bakay, R. A. (1998). Restoration of neural output from a paralyzed patient by a direct brain connection. NeuroReport, 9, 1707–1711.
Kettner, R., Schwartz, A., & Georgopoulos, A. (1988). Primary motor cortex and free arm movements to visual targets in three-dimensional space. III. Positional gradients and population coding of movement direction from various movement origins. Journal of Neuroscience, 8, 2938–2947.
Larsen, R. J., & Marx, M. L. (2001). An introduction to mathematical statistics and its applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Maynard, E., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement directions. Journal of Neuroscience, 19, 8083–8093.
Maynard, E., Nordhausen, C., & Normann, R. (1997). The Utah intracortical electrode array: A recording structure for potential brain-computer interfaces. Electroencephalography and Clinical Neurophysiology, 102, 228–239.
Moran, D., & Schwartz, A. (1999a). Motor cortical activity during drawing movements: Population representation during spiral tracing. Journal of Neurophysiology, 82, 2693–2704.
Moran, D., & Schwartz, A. (1999b). Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology, 82, 2676–2692.
Murphy, K. P. (1998). Switching Kalman filter (Tech. Rep. 98-10). Cambridge, MA: Compaq Cambridge Research Laboratory.
Paninski, L., Fellows, M., Hatsopoulos, N., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. Journal of Neurophysiology, 91, 515–532.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Sanchez, J., Erdogmus, D., Principe, J., Wessberg, J., & Nicolelis, M. (2002). Comparison between nonlinear mappings and linear state estimation to model the relation from motor cortical neuronal firing to hand movements. In Proceedings of the SAB Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices (pp. 59–65). Edinburgh, Scotland.
Sanchez, J., Erdogmus, D., Rao, Y., Principe, J., Nicolelis, M., & Wessberg, J. (2003). Learning the contributions of motor, premotor, and posterior parietal cortices for hand trajectory reconstruction in a brain machine interface. In Proceedings of the 1st International IEEE EMBS Conference on Neural Engineering. Capri, Italy.
Schwartz, A. (1993). Motor cortical activity during drawing movements: Population representation during sinusoid tracing. Journal of Neurophysiology, 70, 28–36.
Schwartz, A., & Moran, D. (1999). Motor cortical activity during drawing movements: Population representation during lemniscate tracing. Journal of Neurophysiology, 82, 2705–2718.
Schwartz, A., Taylor, D., & Helms Tillery, S. (2001). Extraction algorithms for cortical control of arm prosthetics. Current Opinion in Neurobiology, 11, 701–707.
Serruya, M., Hatsopoulos, N., Fellows, M., Paninski, L., & Donoghue, J. (2003). Robustness of neuroprosthetic decoding algorithms. Biological Cybernetics, 88, 219–228.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Brain-machine interface: Instant neural control of a movement signal. Nature, 416, 141–142.
Shoham, S. (2001). Advances towards an implantable motor cortical interface. Unpublished doctoral dissertation, University of Utah.
Taylor, D., Helms Tillery, S., & Schwartz, A. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829–1832.
118
W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black
Twum-Danso, N., & Brockett, R. (2001). Trajectory estimation from place cell data. Neural Networks, 14, 835–844. Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350. Welch, G., & Bishop, G. (2001). An introduction to the Kalman filter (Tech. Rep. 95–041). Chapel Hill: University of North Carolina at Chapel Hill. Wessberg, J., Stambaugh, C., Kralik, J., Beck, P., L. M., Chapin, J., Kim, J., Biggs, S., Srinivasan, M., & Nicolelis, M. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408, 361–365. Wood, F., Fellows, M., Donoghue, J. P., & Black, M. J. (2004). Automatic spike sorting for neural decoding. In Proc. the 26th Annual Internaltional Conference of the IEEE EMBS (pp. 4009–4012). San Francisco. Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., & Donoghue, J. P. (2002). Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter. In SAB’02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices (pp. 66–73). Edinburgh, Scotland. Wu, W., Black, M. J., Gao, Y., Bienenstock, E., Serruya, M., Shaikhouni, A., & Donoghue, J. P. (2003). Neural decoding of cursor motion using a Kalman filter. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 133–140). Cambridge, MA: MIT Press. Wu, W., Black, M. J., Mumford, D., Gao, Y., Bienenstock, E., & Donoghue, J. P. (2004a). Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Transactions on Biomedical Engineering, 51, 933–942. Wu, W., Shaikhouni, A., Donoghue, J. P., & Black, M. J. (2004b). Closed-loop neural control of cursor motion using a Kalman filter. Proc. IEEE Engineering in Medicine and Biology Society (pp. 4126–4129). San Francisco. Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.
Received July 8, 2004; accepted May 3, 2005.
LETTER
Communicated by Garrison Cottrell
Facial Attractiveness: Beauty and the Machine

Yael Eisenthal
[email protected]
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel

Gideon Dror
[email protected]
Department of Computer Science, Academic College of Tel-Aviv-Yaffo, Tel-Aviv 64044, Israel

Eytan Ruppin
[email protected]
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
This work presents a novel study of the notion of facial attractiveness in a machine learning context. To this end, we collected human beauty ratings for data sets of facial images and used various techniques for learning the attractiveness of a face. The trained predictor achieves a significant correlation of 0.65 with the average human ratings. The results clearly show that facial beauty is a universal concept that a machine can learn. Analysis of the accuracy of the beauty prediction machine as a function of the size of the training data indicates that a machine producing human-like attractiveness ratings could be obtained given a moderately larger data set.

1 Introduction

In this work, we explore the notion of facial attractiveness through the application of machine learning techniques. We construct a machine that learns from facial images and their respective attractiveness ratings to produce human-like evaluation of facial attractiveness. Our work is based on the underlying theory that there are objective regularities in facial attractiveness to be analyzed and learned. We first briefly describe the psychophysics of facial attractiveness and its evolutionary origins. We then provide a review of previous work done in the computational analysis of beauty, attesting to the novelty of our work.

1.1 The Psychophysics of Beauty

1.1.1 Beauty and the Beholder. The subject of visual processing of human faces has received attention from philosophers and scientists, such as Aristotle and Darwin, for centuries.
Within this framework, the study of human facial attractiveness has had a significant part: "Beauty is a universal part of human experience, it provokes pleasure, rivets attention, and impels actions that help ensure survival of our genes" (Etcoff, 1999). Various experiments have empirically shown the influence of physical attractiveness on our lives, as both individuals and part of a society; its impact is obvious from the amounts of money spent on plastic surgery and cosmetics each year. Yet the face of beauty, something we can recognize in an instant, is still difficult to formulate. This outstanding question regarding the constituents of beauty has led to a large body of ongoing research by scientists in the biological, cognitive, and computational sciences.

Over centuries, the common notion in this research has been that beauty is in the eye of the beholder—that individual attraction is not predictable beyond our knowledge of a person's particular culture, historical era, or personal history. However, more recent work suggests that the constituents of beauty are neither arbitrary nor culture bound. Several rating studies by Perrett et al. and other researchers have demonstrated high cross-cultural agreement in attractiveness rating of faces of different ethnicities (Cunningham, Roberts, Wu, Barbee, & Druen, 1995; Jones, 1996; Perrett, May, & Yoshikawa, 1994; Perrett et al., 1998). This high congruence over ethnicity, social class, age, and sex has led to the belief that perception of facial attractiveness is data driven—that the properties of a particular set of facial features are the same irrespective of the perceiver. If different people can agree on which faces are attractive and which are not when judging faces of varying ethnic background, this suggests that people everywhere are using similar criteria in their judgments.

This belief is strengthened by the consistent relations, demonstrated in experimental studies, between attractiveness and various facial features, with both male and female raters. Cunningham (1986) and Cunningham, Barbee, & Pike (1990) showed a strong correlation between beauty and specific features, which were categorized as neonate (features such as a small nose and high forehead), mature (e.g., prominent cheekbones), and expressive (e.g., arched eyebrows). They concluded that beauty is not an inexplicable quality that lies only in the eye of the beholder.

A second line of evidence in favor of a biological rather than an arbitrary cultural basis of physical attractiveness judgments comes from studies of infant preferences for face types. Langlois et al. (1987) showed pairs of female faces (previously rated for attractiveness by adults) to infants only a few months old. The infants preferred to look at the more attractive face of the pair, indicating that even at 2 months of age, adult-like preferences are demonstrated. Slater et al. (1998) demonstrated the same preference in newborns. The babies looked longer at the attractive faces, regardless of the gender, race, or age of the face.

The owner-versus-observer hypothesis was further studied in various experiments. Zaidel explored the question of whether beauty is in the perceptual space of the observer or a stable characteristic of the face (Chen, German, & Zaidel, 1997). Results showed that facial attractiveness is more dependent on the physiognomy of the face than on a perceptual process in the observer, for both male and female observers.

1.1.2 Evolutionary Origins. Since Darwin, biologists have studied natural beauty's meaning in terms of the evolved signal content of striking phenotypic features. Evolutionary scientists claim that the perception of facial features may be governed by circuits shaped by natural selection in the human brain. Aesthetic judgments of faces are not capricious but instead reflect evolutionary functional assessments and valuations of potential mates (Thornhill & Gangestad, 1993). These "Darwinian" approaches are based on the premise that attractive faces are a biological "ornament" that signals valuable information; attractive faces advertise a "health certificate," indicating a person's "value" as a mate (Thornhill & Gangestad, 1999). Advantageous biological characteristics are probably revealed in certain facial traits, which are unconsciously interpreted as attractive in the observer's brain. Facial attributes like good skin quality, bone structure, and symmetry, for example, are associated with good health and therefore contribute to attractiveness. Thus, human beauty standards reflect our evolutionary distant and recent past and emphasize the role of health assessment in mate choice or, as phrased by anthropologist Donald Symons (1995), "Beauty may be in the adaptations of the beholder."

Research has concentrated on a number of characteristics of faces that may honestly advertise health and viability. Langlois and others have demonstrated a preference for average faces: composite faces, a result of digital blending and averaging of faces, were shown to be more attractive than most of the faces used to create them (Grammer & Thornhill, 1994; Langlois & Roggman, 1990; Langlois, Roggman, & Musselman, 1994; O'Toole, Price, Vetter, Bartlett, & Blanz, 1999). Evolutionary biology holds that in any given population, extreme characteristics tend to fall away in favor of average ones; therefore, the ability to form an average-mate template would have conveyed a singular survival advantage (Symons, 1979; Thornhill & Gangestad, 1993).

The averageness hypothesis, however, has been widely debated. Average composite faces tend to have smooth skin and be symmetric; these factors, rather than averageness per se, may lead to the high attractiveness attributed to average faces (Alley & Cunningham, 1991). Both skin texture (Fink, Grammer, & Thornhill, 2001) and facial bilateral symmetry (Grammer & Thornhill, 1994; Mealey, Bridgstock, & Townsend, 1999; Perrett et al., 1999) have been shown to have a positive effect on facial attractiveness ratings. The averageness hypothesis has also received only mixed empirical support. Later studies found that although averageness is certainly attractive, it can be improved on. Composites of beautiful people were rated more appealing than those made from the larger, random population (Perrett et al., 1994).
Also, exaggeration of the ways in which the prettiest female composite differed from the average female composite resulted in a more attractive face (O'Toole, Deffenbacher, et al., 1998; Perrett et al., 1994, 1998); these turned out to be sexually dimorphic traits, such as a small chin, full lips, high cheekbones, a narrow nose, and a generally small face. These sex-typical, estrogen-dependent characteristics in females may indicate youth and fertility and are thus considered attractive (Perrett et al., 1998; Symons, 1979, 1995; Thornhill & Gangestad, 1999).

1.2 Computational Beauty Analysis. The previous section clearly indicates the existence of an objective basis underlying the notion of facial attractiveness. Yet the relative contribution to facial attractiveness of the aforementioned characteristics and their interactions with other facial beauty determinants are still unknown. Different studies have examined the relationship between subjective judgments of faces and their objective regularity. Morphing software has been used to create average and symmetrized faces (Langlois & Roggman, 1990; Perrett et al., 1994, 1999), as well as attractive and unattractive prototypes (http://www.beautycheck.de), in order to analyze their characteristics. Other approaches have addressed the question within the study of the relation between aesthetics and complexity, which is based on the notion that simplicity lies at the heart of all scientific theories (Occam's razor principle). Schmidhuber (1998) created an attractive female face composed from a fractal geometry based on rotated squares and powers of two. Exploring the question from a different approach, Johnston (Johnston & Franklin, 1993) produced an attractive female face using a genetic algorithm, which evolves a "most beautiful" face according to interactive user selections. This algorithm mimics, in an oversimplified manner, the way humans (consciously or unconsciously) select features they find most attractive. Measuring the features of the resulting face showed it to have "feminized" features. This study and others, which have shown attractiveness and femininity to be nearly equivalent for female faces (O'Toole et al., 1998), have been the basis for a commercial project that uses these sex-dependent features to determine the sex of an image and predict its attractiveness (http://www.intelligent-earth.com).

1.3 This Work. Previous computational studies of human facial attractiveness have mainly involved averaging and morphing of digital images and geometric modeling to construct attractive faces. In general, the computer techniques used include delineation, transformation, prototyping, and other image processing techniques, most requiring fiducial points on the face. In this work, rather than attempt to morph or construct an attractive face, we explore the notion of facial attractiveness through the application of machine learning techniques. Using only the images themselves, we try to learn and analyze the mapping from two-dimensional facial images to their attractiveness scores, as determined by human raters.
The cross-cultural consistency in attractiveness ratings demonstrated in many previous studies has led to the common notion that there is an objective basis to be analyzed and learned.

The remainder of this letter is organized as follows. Section 2 presents the data used in our analyses (both images and ratings), and section 3 describes the representations we chose to work with. Section 4 describes our experiments with learning facial attractiveness, presenting prediction results and analyses. Finally, section 5 consists of a discussion of the work presented and general conclusions. Additional details are provided in the appendix.

2 The Data

2.1 Image Data Sets. To reduce the effects of age, gender, skin color, facial expression, and other irrelevant factors, subject choice was confined to young Caucasian females in frontal view with neutral expression, without accessories or obscuring items (e.g., jewelry). Furthermore, to get a good representation of the notion of beauty, the data set was required to encompass both extremes of facial beauty: very attractive as well as very unattractive faces. We obtained two data sets which met the above criteria, both of the relatively small size of 92 images each.

Data set 1 contains 92 young Caucasian (American) females in frontal view with neutral expressions, face and hair comprising the entirety of the picture. The images all have identical lighting conditions and nearly identical orientation, in excellent resolution, with no obscuring or distracting features, such as jewelry or glasses. The pictures were originally taken by Japanese photographer Akira Gomi. The images were received with attractiveness ratings.

Data set 2 contains 92 Caucasian (Israeli) females, aged approximately 18, in frontal view, face and hair comprising the entirety of the picture. Most of the images have neutral expressions, but in order to keep the data set reasonably large, smiling images in which the mouth was relatively closed were also used. The images all have identical lighting conditions and nearly identical orientation. This data set required some image preprocessing and is of slightly lower quality. The images contain some distracting features, such as jewelry.

The distributions of the raw images in the two data sets were found to be too different for combining the sets, and therefore all our experiments were conducted on each data set separately. Data set 1, which contains high-quality pictures of females in the preferred age range, with no distracting or obscuring items, was the main data set used. Data set 2, which is of slightly lower quality, containing images of younger women with some distracting features (jewelry, smiles), was used for exploring cross-cultural consistency in attractiveness judgment and in its main determinants. Both data sets were converted to grayscale to lower the dimension of the data and simplify the computational task.
2.2 Image Ratings

2.2.1 Rating Collection. Data set 1 was received with ratings, but to check the consistency of ratings across cultures, we collected new ratings for both data sets. To facilitate both the rating procedure and the collection of the ratings, we created an interactive HTML-based application that all our raters used. This provided a simple rating procedure in which all participants received the same instructions and used the same rating process. The raters were asked to first scan through the entire data set (in grayscale) to obtain a general notion of the relative attractiveness of the images, and only then to proceed to the actual rating stage. They were instructed to use the entire attractiveness scale and to consider only facial attractiveness in their evaluation. In the rating stage, the images were shown in random order to eliminate order effects, each on a separate page. A rater could look at a picture for as long as he or she liked and then score it. The raters were free to return to pictures they had already seen and adjust their ratings.

Images in data set 1 were rated by 28 observers—15 male, 13 female—most in their twenties. For data set 2, 18 ratings were collected from 10 male and 8 female raters of similar age. Each facial image was rated on a discrete integer scale between 1 (very unattractive) and 7 (very attractive). The final attractiveness rating of a facial image was the mean of its ratings across all raters.

2.2.2 Rating Analysis. In order to verify the adequacy and consistency of the collected ratings, we examined the following properties:

• Consistency of ratings. The raters were randomly divided into two groups. We calculated the mean ratings of each group and checked the consistency between the two mean ratings. This procedure was repeated numerous times and consistently showed a correlation of 0.9 to 0.95 between the average ratings of the two groups for data set 1 and a correlation of 0.88 to 0.92 for data set 2. The mean ratings of the groups were also very similar for both data sets, and a t-test confirmed that the rating means for the two groups were not statistically different.

• Clustering of raters. The theory underlying the project is that individuals rate facial attractiveness according to similar, universal standards. Therefore, our assumption was that all ratings are from the same distribution. Indeed, clustering of raters produced no apparent grouping. Specifically, a chi-square test that compared the distribution of ratings of male versus female raters showed no statistically significant differences between these two groups. In addition, the correlation between the average female ratings and average male ratings was very high: 0.92 for data set 1 and 0.88 for data set 2. The means of the female and male ratings were also very similar, and a t-test confirmed that the means of the two groups were not statistically different. The results show no effect of observer gender.

An analysis of the original ratings for data set 1 (collected from Austrian raters) versus the new ratings (collected from Israeli raters) shows a high similarity in the images rated as most and least attractive. A correlation of 0.82 was found between the two sets of ratings. These findings strongly reinforce previous reports of high cross-cultural agreement in attractiveness rating.
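The split-half consistency check is straightforward to reproduce. Below is a minimal numpy sketch under our own assumptions (a ratings matrix of shape n_raters × n_images and synthetic example data); it is an illustration, not the authors' code.

import numpy as np

def split_half_consistency(ratings, n_splits=1000, seed=0):
    # ratings: (n_raters, n_images) array of integer scores in 1..7.
    # Returns the correlation between the mean ratings of two random
    # halves of the raters, for n_splits random splits.
    rng = np.random.default_rng(seed)
    n_raters = ratings.shape[0]
    half = n_raters // 2
    corrs = np.empty(n_splits)
    for s in range(n_splits):
        perm = rng.permutation(n_raters)
        mean_a = ratings[perm[:half]].mean(axis=0)
        mean_b = ratings[perm[half:]].mean(axis=0)
        corrs[s] = np.corrcoef(mean_a, mean_b)[0, 1]
    return corrs

# Synthetic example: 28 raters, 92 images (as in data set 1).
ratings = np.random.default_rng(1).integers(1, 8, size=(28, 92))
corrs = split_half_consistency(ratings)
print(corrs.mean(), corrs.min(), corrs.max())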
3 Face Representation

Numerous studies in various face image processing tasks (e.g., face recognition and detection) have experimented with various ways to specify the physical information in human faces. The different approaches tried have demonstrated the importance of a broad range of shape and image intensity facial cues (Bruce & Langton, 1994; Burton, Bruce, & Dench, 1993; Valentine & Bruce, 1986). The most frequently encountered distinction regarding the information in faces is a qualitative one between feature-based and configurational-based information, that is, discrete, local, featural information versus the spatial interrelationship of facial features. Studies suggest that humans perceive faces holistically and not as individual facial features (Baenninger, 1994; Haig, 1984; Young, Hellawell, & Hay, 1989), yet experiments with both representations have demonstrated the importance of features in discriminative tasks (Bruce & Young, 1986; Moghaddam & Pentland, 1994). This is a particularly reasonable assumption for beauty judgment tasks, given the correlation found between features and attractiveness ratings. Our work uses both kinds of representations.

In the configurational representation, a face is represented with the raw grayscale pixel values, in which all relevant factors, such as texture, shading, pigmentation, and shape, are implicitly coded (though difficult to extract). A face is represented by a vector of pixel values created by concatenating the rows or columns of its image. The pixel-based representation of a face will be referred to as its pixel image.

The featural representation is motivated by arguments tying beauty to ideal proportions of facial features such as the distance between eyes, width of lips, size of eyes, and distance between the lower lip and the chin. This representation is based on the manual measurement of 37 facial feature distances and ratios that reflect the geometry of the face (e.g., distance between eyes, mouth length and width). The facial feature points according to which these distances were defined are displayed in Figure 1. (The full list of feature measurements is given in the appendix, along with their calculation method.) All raw distance measurements, which are in units of pixels, were normalized by the distance between pupils, which serves as a robust and accurate length scale. To these purely geometric features we added several nongeometric ones: average hair color, an indicator of facial symmetry, and an estimate of skin smoothness. The feature-based measurement representation of a face will be referred to as its feature vector.
Figure 1: Feature landmarks used for feature-based representation.
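As an illustration of how such a feature vector can be assembled, the sketch below computes a few of the normalized distances from landmark coordinates. The landmark names and the choice of measurements are hypothetical stand-ins for the full 37-measurement list given in the appendix.

import numpy as np

def dist(a, b):
    return float(np.linalg.norm(np.subtract(a, b)))

def partial_feature_vector(p):
    # p: dict mapping landmark names to (x, y) pixel coordinates.
    # All raw distances are normalized by the interpupillary distance.
    scale = dist(p["right_pupil"], p["left_pupil"])
    return np.array([
        dist(p["face_top"], p["face_bottom"]) / scale,            # face length
        dist(p["mouth_left"], p["mouth_right"]) / scale,          # mouth length
        dist(p["eye_inner_left"], p["eye_inner_right"]) / scale,  # inner-eye distance
    ])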
4 Learning Attractiveness

We turn to present our experiments with learning facial attractiveness, using the facial images and their respective human ratings. The learners were trained with the pixel representation and with the feature representation separately.

4.1 Dimension Reduction. The pixel images are of an extremely high dimension, of the order of 100,000 (equal to the image resolution). Given the high dimensionality and redundancy of the visual data, the pixel images underwent dimension reduction with principal component analysis (PCA). PCA has been shown to relate reliably to human performance on various face image processing tasks, such as face recognition (O'Toole, Abdi, Deffenbacher, & Valentin, 1993; Turk & Pentland, 1991) and race and sex classification (O'Toole, Deffenbacher, Abdi, & Bartlett, 1991), and to be semantically relevant. The eigenvectors pertaining to large eigenvalues have been shown to code general information, such as orientation and categorical assessment, which has high variance and is common to all faces in the set (O'Toole et al., 1993; O'Toole, Vetter, Troje, & Bülthoff, 1997; Valentin & Abdi, 1996). Those corresponding to the smaller eigenvalues code smaller, more individual variation (Hancock, Burton, & Bruce, 1996; O'Toole et al., 1998). PCA was also performed on the feature-based measurements in order to decorrelate the variables in this representation. This is important since strong correlations, stemming, for example, from left-right symmetry, were observed in the data.

4.1.1 Image Alignment. For PCA to extract meaningful information from the pixel images, the images need to be aligned, typically by rotating, scaling, and translating, to bring the eyes to the same location in all the images. To produce sharper eigenfaces, we aligned the images according to a second point as well—the vertical location of the center of the mouth, a technique known to work well for facial expression recognition (Padgett & Cottrell, 1997). This nonrigid transformation, however, involved changing the face height-to-width ratio. To take this change into account, the vertical scaling factor of each face was added to its low-dimensional representation.

As the input data in our case are face images and the eigenvectors are of the same dimension as the input, the eigenvectors are interpretable as faces and are often referred to as eigenfaces (Turk & Pentland, 1991). The improvement in sharpness of the eigenfaces from the main data set as a result of the alignment can be seen in Figure 2. Each eigenface deviates from uniform gray where there is variation in the face set. The left column consists of two eigenfaces extracted from PCA on unaligned images; face contour and features are blurry. The middle column shows eigenfaces from images aligned only according to the eyes. The eyes are indeed more sharply defined, but other features are still blurred. The right column shows eigenfaces from PCA on images aligned by both eyes and vertical location of mouth; all salient features are much more sharply defined.

4.1.2 Eigenfaces. PCA was performed on the input vectors from both representations, separately. Examples of eigenvectors extracted from the pixel images from the main data set can be seen in Figure 3. The eigenfaces in the top row are those pertaining to the highest eigenvalues, the middle row shows eigenfaces corresponding to intermediate eigenvalues, and the bottom row presents those pertaining to the smallest eigenvalues. As expected, the eigenfaces in the top row seem to code more global information, such as hair and face shape, while the eigenvectors in the bottom row code much finer, more detailed feature information. Each eigenface is obviously not interpretable as a simple single feature (as is often the case with a smaller data set), yet it is clearly seen in the top-row eigenfaces that the directions of highest variance are hair and face contour. This is not surprising, as the most prominent differences between the images are in hair color and shape, which also cause large differences in face shape (due to partial occlusion by hair). Smaller variance can also be seen in other features, mainly eyebrows and eyes.
Figure 2: (Left) Eigenfaces from unaligned images. (Middle) Eigenfaces from images aligned only by eyes. (Right) Eigenfaces from images aligned by both eyes and mouth.
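A compact way to obtain the eigenfaces is PCA via the singular value decomposition of the mean-centered image matrix. The following numpy sketch assumes the images are already aligned and flattened one per row; it is our reconstruction of the standard procedure, not the authors' implementation.

import numpy as np

def pca_eigenfaces(X, n_components):
    # X: (n_images, n_pixels) matrix of aligned grayscale images.
    mean_face = X.mean(axis=0)
    Xc = X - mean_face
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenfaces = Vt[:n_components]           # principal axes, one per row
    projections = Xc @ eigenfaces.T          # low-dimensional coordinates
    eigenvalues = S**2 / (X.shape[0] - 1)    # variance along each axis
    return eigenfaces, projections, eigenvalues[:n_components]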
4.2 Feature Selection. The eigenfaces are the features representing the face set; they can be combined in a certain weighting to represent a specific face. A low-dimensional representation using only the first eigenvectors minimizes the squared error between the face representation and the original image and is sufficient for accurate face recognition (Turk & Pentland, 1991). However, omitting the dimensions pertaining to the smaller eigenvalues decreases the perceptual quality of the face (O'Toole et al., 1993, 1998). Consequently, we anticipated that using the first m eigenfaces would not produce accurate results in our attractiveness evaluation task. Indeed, these experiments resulted in poor facial attractiveness predictions. We therefore selected the eigenfaces most important to our task by sorting them according to their relevance to the attractiveness ratings. This relevance was estimated by calculating the correlation of the eigenvector projections with the human ratings across the various images. Interestingly, in the pixel representation, the features found most correlated with the attractiveness ratings were those pertaining to intermediate and smaller eigenvalues. Figure 4 shows the eigenfaces; the top row displays those pertaining to the highest eigenvalues, and the bottom row presents the eigenfaces with projections most correlated with human ratings. While the former show mostly general features of hair and face contour, the latter also clearly show the lips, the nose tip, and eye size and shape to be important features.

The same method was used for feature selection in the feature-based representation. The feature measurements were sorted according to their correlation with the attractiveness ratings. It should be noted that despite its success, using correlation as a relevance measure is problematic, as it assumes the relation between the feature and the ratings to be monotonic. Yet experiments with other ranking criteria that do not make this assumption, such as chi-square and mutual information, produced somewhat inferior results.

Figure 3: Eigenfaces from largest to smallest eigenvalues (top to bottom).

Figure 4: Eigenfaces pertaining to highest eigenvalues (top row) and highest correlations with ratings (bottom row).
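The correlation-based ranking of eigenvector projections can be sketched as follows; again, a minimal illustration under our assumptions rather than the authors' code.

import numpy as np

def rank_by_rating_correlation(projections, ratings):
    # projections: (n_images, n_dims) PCA coordinates; ratings: (n_images,).
    # Returns dimension indices sorted by |correlation with the ratings|,
    # most relevant first.
    corr = np.array([np.corrcoef(projections[:, j], ratings)[0, 1]
                     for j in range(projections.shape[1])])
    return np.argsort(-np.abs(corr))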
4.3 Attractiveness Prediction. The original data vectors were projected onto the top m eigenvectors from the feature selection stage (where m is a parameter on which we performed optimization) to produce a low-dimensional representation of the data as input to the learners in the prediction stage.

4.3.1 Classification into Two Attractiveness Classes. Although the ultimate goal of this work was to produce and analyze a facial beauty predictor using regression methods, we begin with a simpler task, on which there is even higher consistency between raters. To this end, we recast the problem of predicting facial attractiveness into a classification problem: discerning "attractive" faces (the class comprising the highest 25% rated images) from "unattractive" faces (the class of the lowest 25% rated images). The main classifiers used were standard K-nearest neighbors (KNN) (Mitchell, 1997) and support vector machines (SVM) (Vapnik, 1995). The best results obtained are shown in Table 1, which displays the percentage of correctly classified images. Classification using the KNN classifier was good; correct classifications of 75% to 85% of the images were achieved. Classification rates with SVM were slightly poorer, though for the most part in the same percentage range. Both classifiers performed better with the feature vectors than with the pixel images; this is particularly true for SVM. Best SVM results were achieved using a linear kernel. In general, classification (particularly with KNN) was good for both data sets and ratings, with success percentages slightly lower for the main data set.
Table 1: Percentage of Correctly Classified Images.

                           Data Set 1    Data Set 2
Pixel Images      KNN          75%           77%
                  SVM          68%           73%
Feature Vectors   KNN          77%           86%
                  SVM          76%           84%
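A sketch of this classification experiment, using scikit-learn estimators as stand-ins for the authors' KNN and SVM implementations and five-fold cross-validation in place of their exact evaluation protocol:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def quartile_classification(X, ratings, k=5):
    # Keep only the top and bottom 25% rated images and classify them.
    lo, hi = np.quantile(ratings, [0.25, 0.75])
    mask = (ratings <= lo) | (ratings >= hi)
    Xq, yq = X[mask], (ratings[mask] >= hi).astype(int)
    knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), Xq, yq, cv=5).mean()
    svm_acc = cross_val_score(SVC(kernel="linear"), Xq, yq, cv=5).mean()
    return knn_acc, svm_acc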
KNN does not use specific features, but rather averages over all dimensions, and therefore does not give any insight into which features are important for attractiveness rating. In order to learn what the important features are, we used a C4.5 decision tree (Quinlan, 1986, 1993) for classification using feature vectors without preprocessing by PCA. In most cases, the results did not surpass those of the KNN classifier, but the decision tree did give some insight into which features are "important" for classification. The features found most informative were those pertaining to the size of the lower part of the face (jaw length, chin length), smoothness of skin, lip fullness, and eye size. These findings are all consistent with previous psychophysics studies.

4.3.2 The Learners for the Regression Task. Following the success of the classification task, we proceeded to the regression task of rating prediction. The predictors used for predicting facial beauty itself were again KNN and SVM. For this task, however, both predictors were used in their regression version, mapping each facial image to a real number that represents its beauty. We also used linear regression, which served as a baseline for the other methods. The targets used were the average human ratings of each image.

The output of the KNN predictor for a test image was computed as the weighted average of the targets of the image's k nearest neighbors, where the weight of a neighbor is the inverse of its Euclidean distance from the test image. That is, let v_1, . . . , v_k be the set of k nearest neighbors of test image v with targets y_1, . . . , y_k, and let d_1, . . . , d_k be their respective Euclidean distances from v. The predicted beauty ŷ for the test image v is then

ŷ = (Σ_i w_i y_i) / (Σ_i w_i),   i = 1, 2, . . . , k,

where w_i = (d_i + δ)^(−1) is the weight of neighbor v_i and δ is a smoothing parameter. In all subsequent uses of KNN, we set δ = 1. KNN was run with k values ranging from 1 to 45.
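The predictor defined by this equation translates directly into code. A minimal sketch (the defaults δ = 1 and k = 16 are taken from the text):

import numpy as np

def knn_rating(x, X_train, y_train, k=16, delta=1.0):
    # Distance-weighted KNN: y_hat = sum(w_i * y_i) / sum(w_i),
    # with w_i = (d_i + delta)^(-1) over the k nearest neighbors.
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + delta)
    return float(np.dot(w, y_train[nn]) / w.sum())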
Figure 5: Prediction results obtained with pixel images (A) and feature-based representation (B). Performance is measured by the correlation between the predicted ratings and the human ratings.
As a predictor for our task, KNN suffers from a couple of drawbacks. First, it performs averaging, and therefore its predicted ratings had very low variance: all extremely high or low ratings were evened out and often not reached. In addition, it uses a Euclidean distance metric, which need not reflect the true metric for evaluating face similarity. Therefore, we also studied an SVM regressor as an attractiveness predictor, a learner that does not use a simple distance metric and does not perform averaging in its prediction. The SVM method, in its regression version, was used with several kernels: linear, polynomials of degree 2 and 3, and gaussian with different values of γ, where log2 γ ∈ {−6, −4, −2, 0}; γ is related to the kernel width parameter σ by γ = 1/(2σ²). We performed a grid search over the values of the slack parameter c and the width of the regression tube w, such that log10 c ∈ {−3, −2, −1, 0, 1} and w ∈ {0.1, 0.4, 0.7, 1.0}. In all runs, we used a soft-margin SVM implemented in SVMlight (Joachims, 1999). Due to the relatively small sample sizes, we evaluated the performance of the predictors using cross validation; predictions were made using leave-n-out, with n = 1 for KNN and linear regression and n = 5 for SVM.
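The grid search and leave-n-out evaluation might look as follows. scikit-learn's SVR stands in for SVMlight here, and the scoring choice (correlation of held-out predictions with the targets) is our assumption:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def svr_grid_search(X, y, n_out=5, seed=0):
    # Leave-n-out is approximated by KFold with folds of size n_out.
    cv = KFold(n_splits=len(y) // n_out, shuffle=True, random_state=seed)
    best_score, best_params = -np.inf, None
    for C in 10.0 ** np.arange(-3, 2):        # log10(c) in {-3, ..., 1}
        for w in (0.1, 0.4, 0.7, 1.0):        # width of the regression tube
            preds = np.empty(len(y))
            for train, test in cv.split(X):
                model = SVR(kernel="linear", C=C, epsilon=w)
                preds[test] = model.fit(X[train], y[train]).predict(X[test])
            score = np.corrcoef(preds, y)[0, 1]
            if score > best_score:
                best_score, best_params = score, (C, w)
    return best_score, best_params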
4.3.3 Results of Facial Attractiveness Prediction. Predicted ratings were evaluated according to their correlation with the human ratings, using the Pearson correlation. Figure 5 depicts the best results of the attractiveness predictors on the main data set. Figure 5A shows the best correlations achieved with the pixel-based representation, and Figure 5B shows the best results for the feature-based representation. Prediction results for the pixel images show a peak near m = 25 features, where the maximum correlation achieved with KNN is approximately 0.45. The feature-based representation shows a maximum value of nearly 0.6 at m = 15 features, where the highest correlation is achieved with both SVM and linear regression. The highest SVM results in both representations were reached with a linear kernel. Results obtained on the second data set were similar. The normalized MSE of the best predicted ratings is 0.6 to 0.65 (versus a normalized MSE of 1 for the "trivial predictor," which constantly predicts the mean rating).

KNN performance was poor—significantly worse than that of the other regressors in the feature-based representation. These results imply that the Euclidean distance metric is probably not a good estimate of face similarity for this task. It is interesting to note that the simple linear regressor performed as well as or better than the KNN predictor. However, this effect may be attributed to our feature selection method, ranking features by the absolute value of their correlations with the target, which is optimal for linear regression.

4.3.4 Significance of Results. All predictors performed better with the feature-based representation than with the pixel images (in accordance with the results of the classification task). Using the feature vectors enabled a maximum correlation of nearly 0.6, versus a correlation of 0.45 with the pixel images. To check the significance of this score, we produced an empirical distribution of feature-based prediction scores with random ratings. The entire preprocessing, feature selection, hyperparameter selection, and prediction process was run 100 times, each time with a different set of randomly generated ratings, sampled from a normal distribution with mean and variance identical to those of the human ratings. For each run, the score taken was the highest correlation of the predicted ratings with the original (random) ratings. The average correlation achieved with random ratings was 0.28, and the maximum correlation was 0.47. Figure 6A depicts the histogram of these correlations. Using a QQ-plot, we verified that the empirical distribution of observed correlations is approximately normal; this is shown in Figure 6B. Under the normal approximation, the correlation obtained by our feature-based predictor is significant at a level of α = 0.001. The numbers and figures presented are for the KNN predictor. Correlations achieved with linear regression have a different mean and standard deviation but a similar z-value. The distribution of these correlations was also verified to be approximately normal, and the correlation achieved by our linear regressor was significant at the same level of α = 0.001. This test was not run for the SVM predictor due to computational limitations.
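This random-ratings significance test can be sketched as below. Here predict_fn is a hypothetical callback that runs the full pipeline (feature selection, hyperparameter search, cross-validated prediction) and returns the correlation of its predictions with the ratings it was given:

import numpy as np

def random_rating_null(predict_fn, X, y, n_runs=100, seed=0):
    rng = np.random.default_rng(seed)
    null = np.empty(n_runs)
    for r in range(n_runs):
        y_rand = rng.normal(y.mean(), y.std(), size=len(y))
        null[r] = predict_fn(X, y_rand)
    observed = predict_fn(X, y)
    z = (observed - null.mean()) / null.std()  # normal approximation
    return null, observed, z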
Figure 6: Correlations achieved with random ratings. (A) Histogram of correlations. (B) QQplot of correlations versus standard normal distributed data. Correlations were obtained with KNN predictor.
Figure 7: Correlation of the weighted machine ratings with human ratings versus the weighting parameter, α.
4.3.5 Hybrid Predictor. Ratings predicted with the two representations were very different; the best ratings achieved using each representation had a correlation of only 0.3 to 0.35 with each other. The relatively low correlation between the feature-based and pixel-based predictions suggests that results might be improved by using the information learned from both representations. Therefore, an optimal weighted average of the best feature-based and pixel-based ratings was calculated. We produced a hybrid machine that generates the target rating

y_hybrid = α y_feature + (1 − α) y_pixel,

where y_feature is the rating of the feature-based predictor, y_pixel is the prediction of the pixel-based machine, and 0 ≤ α ≤ 1. Figure 7 shows the correlation between the hybrid machine ratings and the human ratings as a function of the weights tried (the weights shown are those of the feature-based ratings, α). The hybrid predictor was constructed using the best feature-based and pixel-based ratings obtained with linear regression. As evident from the graph, the best weighted ratings achieve a correlation of 0.65 with the human ratings. The hybrid predictor with the optimal value of α = 0.65 improves prediction results by nearly 10% over those achieved with a single representation. Its normalized MSE is 0.57, lower than that of the individual rating sets. These weighted ratings have the highest correlation and lowest normalized MSE with respect to the human scores. Therefore, in subsequent analysis, we use these weighted ratings as the best machine-predicted ratings, unless stated otherwise.
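The search for the optimal weight α reduces to a one-dimensional sweep, sketched below:

import numpy as np

def best_hybrid(y_feature, y_pixel, y_human):
    # Sweep alpha over [0, 1] and keep the weighting whose hybrid
    # ratings correlate best with the human ratings.
    alphas = np.linspace(0.0, 1.0, 101)
    corrs = [np.corrcoef(a * y_feature + (1 - a) * y_pixel, y_human)[0, 1]
             for a in alphas]
    i = int(np.argmax(corrs))
    return alphas[i], corrs[i]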
Figure 8: Probability of error in the predicted relative order of two images as a function of the absolute difference in their original human ratings.
4.3.6 Evaluation of Predicted Rating Ranking. An additional analysis was performed to evaluate the relative image ranking induced by the best machine predictions. Figure 8 shows the probability of error in the predicted relative ordering of two images as a function of the absolute difference d in their original, human ratings. The differences were binned into 16 bins. The probability of error for each value of d was computed over all pairs of images with an absolute difference of d in their human ratings. As evident from the graph, the probability decreases almost linearly as the absolute difference in the original ratings grows.

4.3.7 The Learning Curve of Facial Attractiveness. For further evaluation of the prediction machine, an additional experiment was run in which we examined the learning curve of our predictor. We produced this curve by iteratively running the predictor for a growing data set size in the following manner. The number of input images, n, was incremented from 5 to the entire 92 images. For every n, the predictor was run 10 times, each time with n different, randomly selected images (for n = 92, all images were used in a single run). Testing was performed on the subsets of n images only, using leave-one-out, and results were evaluated according to the correlation of the predicted ratings with the human ratings of these images. Figure 9 shows the results for the KNN predictor trained using the feature representation with k = 16 and m = 7 features. The correlations shown in the plot are the average over the 10 runs. The figure clearly shows that the performance of the predictor improves with the increase in the number of images. The slope of the graph is positive for every n ≥ 50. Similar behavior was observed with other parameters and learners. This tendency is less distinct in the corresponding graph for the pixel images.
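The learning-curve experiment can be sketched as follows; predict_loo is a hypothetical callback returning leave-one-out predictions for a subset of the data:

import numpy as np

def learning_curve(X, y, predict_loo, sizes, n_reps=10, seed=0):
    # For each subset size n, average the leave-one-out correlation
    # over n_reps random subsets of n images.
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        scores = []
        for _ in range(n_reps):
            idx = rng.choice(len(y), size=n, replace=False)
            preds = predict_loo(X[idx], y[idx])
            scores.append(np.corrcoef(preds, y[idx])[0, 1])
        curve.append(np.mean(scores))
    return np.array(curve)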
Figure 9: Accuracy of prediction as a function of the training set size.
5 Discussion

This letter presents a predictor of facial attractiveness, trained with female facial images and their respective average human ratings. Images were represented as both raw pixel data and measurements of key facial features. Prediction was carried out using KNN, SVM, and linear regression, and the best predicted ratings achieved a correlation of 0.65 with the human ratings. We consistently found that the measured facial features were more informative for attractiveness prediction on all tasks tried.

In addition to learning facial attractiveness, we examined some characteristics found correlated with facial attractiveness in previous experiments. In particular, we ran our predictor on the "average" face, that is, the mathematical average of the faces in the data set. This face received only an average rating, showing no support for the averageness hypothesis in our setting. This strengthens previous experiments that argued against the averageness hypothesis (as described in section 1.1.2). The high attractiveness of composite faces may be attributable to their smooth skin and symmetry and not to the averageness itself, explaining the fact that the mathematical average of the faces was not found to be very attractive.

Given the high dimensionality and redundancy of visual data, the task of learning facial attractiveness is undoubtedly a difficult one. We tried additional preprocessing, feature selection, and learning methods, such as the wrapper approach (Kohavi & John, 1996), Isomap (Tenenbaum, de Silva, & Langford, 2000), and kernel PCA (Schölkopf, Smola, & Müller, 1999), but these all produced poorer results. The nonlinear feature extraction methods probably failed due to an insufficient number of training examples, as they require a dense sampling of the underlying manifold.
Nevertheless, our predictor achieved significant correlations with the human ratings. However, we believe our success was limited by a number of hindering factors. The most meaningful obstacle in our project is likely to be the relatively small size of the data sets available to us. This limitation can be appreciated by examining Figure 9, which presents a plot of prediction performance versus the size of the data set. The figure clearly shows improvement in the predictor's performance as the number of images increases. The slope of the graph is still positive with the 92 images used and does not asymptotically level off, implying that there is considerable room for improvement by using a larger, but still realistically conceivable, data set. Another likely limiting factor is insufficient data representation. While the feature-based representation produced better results than the pixel images, it is nonetheless incomplete; it includes only Euclidean distance-based measurements and lacks fine shape and texture information. The relatively lower results with the pixel images show that this representation is not informative enough.

In conclusion, this work, novel in its application of computational learning methods to the analysis of facial attractiveness, has produced promising results. Significant correlations with human ratings were achieved despite the difficulty of the task and several hindering factors. The results clearly show that facial beauty is a universal concept that a machine can learn. There are sufficient grounds to believe that future work with a moderately larger data set may lead to an "attractiveness machine" producing human-like evaluations of facial attractiveness.

Appendix: Features Used by the Feature-Based Predictors

A.1 Feature-Based Representation. Following is a list of the measurements comprising the feature-based representation:

1. Face length
2. Face width—at eye level
3. Face width—at mouth level
4. Distance between pupils
5. Ratio between 2 and 3
6. Ratio between 1 and 2
7. Ratio between 1 and 3
8. Ratio between 4 and 2
9. Right eyebrow thickness (above pupil)
10. Left eyebrow thickness (above pupil)
11. Right eyebrow arch—height difference between highest point and inner edge
12. Left eyebrow arch—height difference between highest point and inner edge
13. Right eye height
14. Left eye height
15. Right eye width
16. Left eye width
17. Right eye size = height * width
18. Left eye size = height * width
19. Distance between inner edges of eyes
20. Nose width at nostrils
21. Nose length
22. Nose size = width * length
23. Cheekbone width = (2–3)
24. Ratio between 23 and 2
25. Thickness of middle of top lip
26. Thickness of right side of top lip
27. Thickness of left side of top lip
28. Average thickness of top lip
29. Thickness of lower lip
30. Thickness of both lips
31. Length of lips
32. Chin length—from bottom of face to bottom of lower lip
33. Right jaw length—from bottom of face to right bottom face edge
34. Left jaw length—from bottom of face to left bottom face edge
35. Forehead height—from nose top to top of face
36. Ratio of (distance from nostrils to eyebrow top) to (distance from face bottom to nostrils)
37. Ratio of (distance from nostrils to face top) to (distance from face bottom to nostrils)
38. Symmetry indicator (description follows)
39. Skin smoothness indicator (description follows)
40. Hair color indicator (description follows)

A.2 Symmetry Indicator. A vertical symmetry axis was set between the eyes of each image, and two rectangular, identically sized windows, surrounding only mouth and eyes, were extracted from opposite sides of the axis. The symmetry measure of the image was calculated as (1/N) Σ_i (X_i − Y_i)², where N is the total number of pixels in each window, X_i is the value of pixel i in the right window, and Y_i is the value of the corresponding pixel in the left window. The value of the indicator grows with the asymmetry in a face. This indicator is indeed a measure of the symmetry in the facial features, as the images are all consistent in lighting and orientation.

A.3 Skin Smoothness Indicator. The "smoothness" of a face was evaluated by applying a Canny edge detection operator to a window from the cheek/forehead area; a window representative of the skin texture was selected for each image. The skin smoothness indicator was the average value of the output of this operation, and its value monotonically decreases with the smoothness of a face.

A.4 Hair Color Indicator. A window representing the average hair color was extracted from each image. The indicator was calculated as the average value of the window, thus increasing with lighter hair.
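The three nongeometric indicators are simple window statistics. In the sketch below, the smoothness measure uses an average gradient magnitude as a stand-in for the Canny operator described above, and window extraction is assumed to have been done elsewhere:

import numpy as np

def symmetry_indicator(right_win, left_win):
    # Mean squared difference between the two windows; grows with asymmetry.
    # left_win is assumed to be already mirrored so pixels correspond.
    return float(np.mean((right_win.astype(float) - left_win.astype(float)) ** 2))

def smoothness_indicator(skin_win):
    # Average gradient magnitude over a cheek/forehead window; a stand-in
    # for the Canny-based measure, likewise decreasing with smoother skin.
    gy, gx = np.gradient(skin_win.astype(float))
    return float(np.hypot(gx, gy).mean())

def hair_color_indicator(hair_win):
    # Average gray level of a hair window; larger for lighter hair.
    return float(hair_win.mean())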
Acknowledgments

We thank Bernhard Fink and the Ludwig-Boltzmann Institute for Urban Ethology at the Institute for Anthropology, University of Vienna, Austria, for one of the facial data sets used in this research.

References

Alley, T. R., & Cunningham, M. R. (1991). Averaged faces are attractive, but very attractive faces are not average. Psychological Science, 2, 123–125.
Baenninger, M. (1994). The development of face recognition: Featural or configurational processing? Journal of Experimental Child Psychology, 57, 377–396.
Bruce, V., & Langton, S. (1994). The use of pigmentation and shading information in recognizing the sex and identities of faces. Perception, 23, 803–822.
Bruce, V., & Young, A. W. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.
Burton, A. M., Bruce, V., & Dench, N. (1993). What's the difference between men and women? Evidence from facial measurement. Perception, 22(2), 153–176.
Chen, A. C., German, C., & Zaidel, D. W. (1997). Brain asymmetry and facial attractiveness: Facial beauty is not simply in the eye of the beholder. Neuropsychologia, 35(4), 471–476.
Cunningham, M. R. (1986). Measuring the physical in physical attractiveness: Quasi experiments on the sociobiology of female facial beauty. Journal of Personality and Social Psychology, 50(5), 925–935.
Cunningham, M. R., Barbee, A. P., & Pike, C. L. (1990). What do women want? Facial metric assessment of multiple motives in the perception of male physical attractiveness. Journal of Personality and Social Psychology, 59, 61–72.
Cunningham, M. R., Roberts, A. R., Wu, C. H., Barbee, A. P., & Druen, P. B. (1995). Their ideas of beauty are, on the whole, the same as ours: Consistency and variability in the cross-cultural perception of female attractiveness. Journal of Personality and Social Psychology, 68, 261–279.
Etcoff, N. (1999). Survival of the prettiest: The science of beauty. New York: Anchor Books.
Fink, B., Grammer, K., & Thornhill, R. (2001). Human (homo sapien) facial attractiveness in relation to skin texture and color. Journal of Comparative Psychology, 115(1), 92–99.
Grammer, K., & Thornhill, R. (1994). Human facial attractiveness and sexual selection: The role of symmetry and averageness. Journal of Comparative Psychology, 108(3), 233–242.
Haig, N. D. (1984). The effect of feature displacement on face recognition. Perception, 13, 505–512.
Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and PCA. Memory and Cognition, 24, 26–40.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Johnston, V. S., & Franklin, M. (1993). Is beauty in the eye of the beholder? Ethology and Sociobiology, 14, 183–199.
Jones, D. (1996). Physical attractiveness and the theory of sexual selection: Results from five populations. Ann Arbor: University of Michigan Museum.
Kohavi, R., & John, G. H. (1996). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Langlois, J. H., & Roggman, L. A. (1990). Attractive faces are only average. Psychological Science, 1, 115–121.
Langlois, J. H., Roggman, L. A., Casey, R. J., Ritter, J. M., Rieser-Danner, L. A., & Jenkins, V. Y. (1987). Infant preferences for attractive faces: Rudiments of a stereotype? Developmental Psychology, 23, 363–369.
Langlois, J. H., Roggman, L. A., & Musselman, L. (1994). What is average and what is not average about attractive faces? Psychological Science, 5, 214–220.
Mealey, L., Bridgstock, R., & Townsend, G. C. (1999). Symmetry and perceived facial attractiveness: A monozygotic co-twin comparison. Journal of Personality and Social Psychology, 76(1), 151–158.
Mitchell, T. (1997). Machine learning. New York: McGraw-Hill.
Moghaddam, B., & Pentland, A. (1994). Face recognition using view-based and modular eigenspaces. In R. J. Mammone & J. D. Murley, Jr. (Eds.), Automatic systems for the identification and inspection of humans, Proc. SPIE, 2257.
O'Toole, A., Abdi, H., Deffenbacher, K. A., & Valentin, D. (1993). Low-dimensional representation of faces in higher dimensions of the face space. Journal of the Optical Society of America A, 10(3), 405–411.
O'Toole, A. J., Deffenbacher, K. A., Valentin, D., McKee, K., Huff, D., & Abdi, H. (1998). The perception of face gender: The role of stimulus structure in recognition and classification. Memory and Cognition, 26, 146–160.
O'Toole, A. J., Deffenbacher, K. A., Abdi, H., & Bartlett, J. A. (1991). Simulating the "other-race effect" as a problem in perceptual learning. Connection Science: Journal of Neural Computing, Artificial Intelligence, and Cognitive Research, 3, 163–178.
O'Toole, A. J., Price, T., Vetter, T., Bartlett, J. C., & Blanz, V. (1999). 3D shape and 2D surface textures of human faces: The role of "averages" in attractiveness and age. Image and Vision Computing, 18, 9–19.
O'Toole, A. J., Vetter, T., Troje, N. F., & Bülthoff, H. H. (1997). Sex classification is better with three-dimensional head structure than with image intensity information. Perception, 26, 75–84.
Padgett, C., & Cottrell, G. (1997). Representing face images for emotion classification. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Perrett, D. I., Burt, D. M., Penton-Voak, I. S., Lee, K. J., Rowland, D. A., & Edwards, R. (1999). Symmetry and human facial attractiveness. Evolution and Human Behavior, 20, 295–307.
Perrett, D. I., Lee, K. J., Penton-Voak, I., Rowland, D. A., Yoshikawa, S., Burt, D. M., Henzi, S. P., Castles, D. L., & Akamatsu, S. (1998). Effects of sexual dimorphism on facial attractiveness. Nature, 394, 826–827.
Perrett, D. I., May, K. A., & Yoshikawa, S. (1994). Facial shape and judgments of female attractiveness. Nature, 368, 239–242.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Schmidhuber, J. (1998). Facial beauty and fractal geometry. Available online at http://www.idsia.ch/~juergen/locoface/newlocoface.html.
Schölkopf, B., Smola, A., & Müller, K. R. (1999). Kernel principal component analysis. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Slater, A., Von der Schulenberg, C., Brown, E., Badenoch, M., Butterworth, G., Parsons, S., & Samuels, C. (1998). Newborn infants prefer attractive faces. Infant Behavior and Development, 21, 345–354.
Symons, D. (1979). The evolution of human sexuality. New York: Oxford University Press.
Symons, D. (1995). Beauty is in the adaptations of the beholder: The evolutionary psychology of human female sexual attractiveness. In P. R. Abramson & S. D. Pinkerton (Eds.), Sexual nature, sexual culture (pp. 80–118). Chicago: University of Chicago Press.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Thornhill, R., & Gangestad, S. W. (1993). Human facial beauty: Averageness, symmetry and parasite resistance. Human Nature, 4(3), 237–269.
Thornhill, R., & Gangestad, S. W. (1999). Facial attractiveness. Trends in Cognitive Sciences, 3, 452–460.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Valentin, D., & Abdi, H. (1996). Can a linear autoassociator recognize faces from new orientations? Journal of the Optical Society of America A, 13, 717–724.
Valentine, T., & Bruce, V. (1986). The effects of race, inversion and encoding activity upon face recognition. Acta Psychologica, 61, 259–273.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Young, A. W., Hellawell, D., & Hay, D. C. (1989). Configurational information in face perception. Perception, 16, 747–759.
Received July 30, 2004; accepted May 15, 2005.
LETTER
Communicated by Peter Hancock
Classification of Faces in Man and Machine

Arnulf B. A. Graf∗
[email protected]
Felix A. Wichmann
[email protected]
Heinrich H. Bülthoff
[email protected]
Bernhard Schölkopf
[email protected]
Max Planck Institute for Biological Cybernetics, D-72076 Tübingen, Germany
We attempt to shed light on the algorithms humans use to classify images of human faces according to their gender. For this, a novel methodology combining human psychophysics and machine learning is introduced. We proceed as follows. First, we apply principal component analysis (PCA) on the pixel information of the face stimuli. We then obtain a data set composed of these PCA eigenvectors combined with the subjects' gender estimates of the corresponding stimuli. Second, we model the gender classification process on this data set using a separating hyperplane (SH) between both classes. This SH is computed using algorithms from machine learning: the support vector machine (SVM), the relevance vector machine, the prototype classifier, and the K-means classifier. The classification behavior of humans and machines is then analyzed in three steps. First, the classification errors of humans and machines are compared for the various classifiers, and we also assess how well machines can recreate the subjects' internal decision boundary by studying the training errors of the machines. Second, we study the correlations between the rank-order of the subjects' responses to each stimulus—the gender estimate with its reaction time and confidence rating—and the rank-order of the distance of these stimuli to the SH. Finally, we attempt to compare the metric of the representations used by humans and machines for classification by relating the subjects' gender estimate of each stimulus and the distance of this stimulus to the SH. While we show that the classification error alone is not a sufficient selection criterion between the different algorithms humans might use to classify face stimuli, the distance of these stimuli to the SH is shown to capture essentials of the internal decision
∗ Present address: Center for Neural Science, New York University, New York, NY, USA.

Neural Computation 18, 143–165 (2006) © 2005 Massachusetts Institute of Technology
space of humans. Furthermore, algorithms such as the prototype classifier using stimuli in the center of the classes are shown to be less adapted to model human classification behavior than algorithms such as the SVM based on stimuli close to the boundary between the classes.
1 Introduction Bringing together theoretical modeling and behavioral data is arguably one of the main challenges when studying the “computational brain” (Churchland & Sejnowski, 1992). The aim of this letter is to obtain a better understanding of the algorithms responsible for the classification of visual stimuli by humans. For this, we combine machine learning and psychophysical techniques to gain insights into the algorithms human subjects use during visual classification of images of human faces according to their gender. In this “machine-learning-psychophysics” approach, we substitute a complex system that is very hard to analyze—the human brain—with a reasonably complex system—a learning machine (Vapnik, 1998). The latter is complex enough to capture some essentials of the human behavior but is still amenable to close analysis (Poggio, Rifkin, Mukherjee, & Niyogi, 2004). The research presented in this article is focused on a novel methodology that bridges the gap between human psychophysics and machine learning by extracting quantitative information from a (high-level) human behavioral experiment. The past decade has seen important technological advances in neuroscience from a microscopic scale (e.g., multiunit recordings) to a macroscopic scale (e.g., functional magnetic resonance imaging), yielding novel insights into visual processing. However, on an algorithmic level, the methods and understanding of brain processes involved in visual recognition are still limited, although numerous attempts have been made since this problem was pointed out by Marr (1982). Recently various computational models for visual recognition have been proposed. For instance, a network of Gabor wavelet filters was used to describe the processing of visual information (Mel, 1997). Independent component analysis was combined with a nearest-neighbor classifier to model face recognition (Bartlett, Movellan, & Sejnowski, 2002). The computations done by the human visual system for facial expression recognition were described using Gabor wavelets, principal component analysis, and artificial neural networks (Dailey, Cottrell, Padgett, & Adolphs, 2002). Object recognition and classification was also modeled using a hierarchical model composed of a network of nonlinear units combined using a maximum operation (Riesenhuber & Poggio, 1999, 2002). While each of these methods is successful for its own task, they illustrate the divergence of the approaches used to understand human category learning as pointed out, for example, in the overview by Ashby and Ell (2001). In this letter, we propose a novel
method combining machine learning and human psychophysics to shed light on the algorithms humans use to classify visual stimuli. Our framework allows us to compare directly the classification behavior of different algorithms to that of humans. While the results obtained in this letter have no claim to be biologically inspired or to explain a specific function of the visual system (see, e.g., Rolls & Deco, 2002, for an overview of such computational methods), we instead ask the following questions: Can we generate testable hypotheses about the algorithms humans use to classify visual inputs? Can we find a classifier whose behavior reflects human classification behavior significantly better than others? Current high-level vision research, with its intrinsically complex stimuli, is hampered by a lack of methods to answer such questions at the algorithmic level. The method presented here has the potential to contribute to overcoming this obstacle. An initial attempt using machine learning to help understand the algorithms humans use to classify the gender of faces was presented by Graf and Wichmann (2004). This letter extends that work. In section 2 we present a psychophysical gender classification experiment of images of human faces and analyze the subjects’ responses—the gender estimate with its reaction time and confidence rating. Section 3 introduces several algorithms from machine learning that will be used to model the classification behavior of humans. Our analysis of the classification behavior of humans proceeds in three steps. First, the classification performance of humans and machines is compared in section 4, and the findings are related to those described in the literature. Second, we correlate in section 5 the rank-order of the subjects’ responses to each stimulus with the rank-order of the distance of this stimulus to the separating hyperplane (SH) of the machine. The success of these studies encourages us to perform the third step in section 6: a metric comparison of the representations used by humans and machines for classification, using the subjects’ gender estimate of each stimulus and the corresponding distance to the SH of the machine. Section 7 summarizes our results and discusses their implications. 2 Human Classification In a human psychophysical classification experiment, 55 human subjects were asked to classify a random gender-balanced subset of 152 out of 200 realistic human faces according to their gender. The stimuli were presented sequentially once to each subject. The temporal envelope of stimulus presentation was a modified Hanning window (a raised cosine function with a raising time of 500 ms and a plateau time of 1000 ms, for a total presentation time of 2000 ms per face). After the presentation of each stimulus, a blank screen with mean luminance was shown to the subjects for 1000 ms before the presentation of the following stimulus. We recorded the subjects’ estimated gender (female or male) together with the reaction time (RT) and
a confidence rating (CR) on a scale from 1 (unsure) to 3 (sure). No feedback on the correctness of the subjects' answers was provided. Subjects were asked to classify the faces as fast as possible to obtain perceptual, rather than cognitive, judgments. Most of the time they responded well before the presentation of the stimulus had ended (mean reaction time over all stimuli and subjects was approximately 900 ms). A training phase of 8 faces (4 male and 4 female faces) preceded the actual classification experiment in order to acquaint the subjects with the stimuli and the experimental procedure. Subjects viewed the screen binocularly with their head stabilized by a headrest. All subjects had normal or corrected-to-normal vision and were paid for their participation. Most of them were students from the University of Tübingen, and all of them were naive to the purpose of the experiment.

Each stimulus was an 8-bit grayscale frontal view of a Caucasian face with a nominal size of 256 × 256 pixels. All faces were centered on the display, had the same pixel-surface area and the same mean intensity, and they came from a processed version of the MPI face database¹ (Blanz & Vetter, 1999). The details of the image processing are described in Graf and Wichmann (2002). The stimuli were presented against the mean luminance (50 cd/m²) of a linearized Clinton Monoray CRT driven by a Cambridge Research Systems VSG 2/5 display controller. Neither the presentation of a male nor of a female face changed the mean luminance of the screen.

The subjects' gender estimates were analyzed using signal detection theory (Wickens, 2002). We assume that on the decision axis, the internal class representations are corrupted by gaussian distributed noise with the same unit variance but different means. We define correct response probabilities for male (+) and female (−) stimuli as P_+ = P(ŷ = 1 | y = 1) and P_− = P(ŷ = −1 | y = −1), where ŷ is the estimated class and y the true class of the stimulus. The discriminability of both classes can then be computed as d′ = Z(P_+) + Z(P_−), where Z = Φ^{-1}, and Φ is the cumulative normal distribution with zero mean and unit variance. Averaged across all subjects, we obtain a high discriminability, d′ = 2.63 ± 0.57, suggesting that the classification task is comparatively easy for the subjects, albeit not trivial (no ceiling effect). Furthermore, the subjects exhibit a pronounced male bias in the responses, defined as log β = (1/2)(Z²(P_+) − Z²(P_−)) = 1.49 ± 1.15, indicating that more females are classified as males than males as females.

In Figure 1 we show the relation between the average across all subjects of the subjects' responses for each stimulus, each point in these plots representing one stimulus. We can first see that for P(ŷ = +1 | x) ≈ 1, all the stimuli are male and that for P(ŷ = +1 | x) ≈ 0, all the stimuli are female. Second, we can observe the male bias already mentioned: a higher density of responses near P(ŷ = +1 | x) ≈ 1. Furthermore, there are female stimuli for which P(ŷ = +1 | x) > 1/2, but no male stimuli for which P(ŷ = +1 | x) < 1/2.
¹ To be found online at http://faces.kyb.tuebingen.mpg.de.
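These signal detection quantities are straightforward to reproduce from the two hit rates; below is a minimal sketch using SciPy under the stated equal-variance gaussian assumption (the function and variable names are ours, not from the study):

```python
import numpy as np
from scipy.stats import norm

def dprime_and_bias(p_plus, p_minus):
    """Equal-variance signal detection measures from the two hit rates.

    p_plus  = P(yhat = +1 | y = +1), correct responses to male stimuli.
    p_minus = P(yhat = -1 | y = -1), correct responses to female stimuli.
    """
    z_plus = norm.ppf(p_plus)     # Z = Phi^{-1}, inverse cumulative normal
    z_minus = norm.ppf(p_minus)
    d_prime = z_plus + z_minus                   # discriminability d'
    log_beta = 0.5 * (z_plus**2 - z_minus**2)    # > 0 indicates a male bias
    return d_prime, log_beta
```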
Figure 1: Relation between the subjects' responses—the probability P(ŷ = +1 | x) to answer male, the reaction time RT, and the confidence rating CR—on a stimulus-by-stimulus basis (responses averaged across subjects).
Clearly the threshold for male-female discrimination depends on the male bias and is located in [1/2, 1]. Third, we notice that for stimuli with a high probability to belong to either class (P(ŷ = +1 | x) = 0 or 1), the corresponding RTs are short and the CRs are high. In other words, when the subjects make a correct gender estimate, they answer fast, and they are confident of their response. For the stimuli where the subjects have difficulty choosing a class (P(ŷ = +1 | x) ≈ 0.5), they take longer to respond (long RT) and are unsure of their response (low CR). Subjects thus have a rather good knowledge of the correctness of their gender estimate.

3 Machine Classification

To model the subjects' classification behavior using machine learning, we first need to preprocess the stimuli to reduce their "apparent" dimensionality. We use principal component analysis (PCA; Duda, Hart, & Stork, 2001), a widely used linear preprocessor from unsupervised machine learning, to preprocess the data. PCA is an eigenvalue decomposition of the covariance matrix associated with the data matrix D = BE along the directions of largest variance, where the columns of the basis matrix B are constrained to be orthonormal and the rows of the encoding matrix E are orthogonal. The rows of B are termed eigenfaces according to one of the first studies to apply PCA to human faces (Sirovich & Kirby, 1987). PCA has also been successfully applied to model face perception and classification in a large number of studies, from psychophysics (O'Toole, Abdi, Deffenbacher, & Valentin, 1993; Valentin, Abdi, Edelman, & O'Toole, 1997; O'Toole, Deffenbacher, Valentin, McKee, Huff, & Abdi, 1998; O'Toole, Vetter, & Blanz, 1999; Furl,
Phillips, & O'Toole, 2002), to artificial recognition systems (Turk & Pentland, 1991; Golomb, Lawrence, & Sejnowski, 1991; Gray, Lawrence, Golomb, & Sejnowski, 1995; O'Toole, Phillips, Cheng, Ross, & Wild, 2000; Bartlett et al., 2002) and facial expression modeling (Calder, Burton, Miller, Young, & Akamatsu, 2001). Like all previous studies, we apply PCA to the vectors obtained when reshaping the intensity matrix of the pixels of each face into a single 256² × 1 vector. We keep the full space of the data, that is, the 200 nonzero components of the PCA decomposition of the data, and obtain a PCA-encoding data matrix E of size 200 × 200, where each row is the encoding corresponding to a face stimulus. By construction, these encodings are already centered. Subsequently these encodings are also normalized since this has been shown to be quite effective in real-world applications for some classifiers (Graf, Smola, & Borer, 2003). Since we consider the full encoding space of dimension 200, the choice of PCA as a preprocessor is of little consequence, and the face stimuli can be reconstructed perfectly from these encodings.

In this letter, we consider two types of stimulus data sets for each subject: the true and the subject data sets. The patterns in both data sets are represented by their (centered and normalized) PCA encodings. The true data set contains the p = 152 encodings x_i ∈ R^{200}, i = 1, …, p of the stimuli seen by the subject, combined with the true labels y_i = ±1 of these stimuli—their true gender as given by the MPI face database. The subject data set is composed of the same encodings x_i, combined this time with the labels ŷ_i of the stimuli as estimated by the subject in the psychophysical classification experiment. This data set represents what we assume to be the subject's internal representation of the face space. Altogether we thus have 55 true and subject data sets.

We use methods from supervised machine learning to model classification. The classifiers are applied to the true and the subject data sets and thus classify in the PCA space of dimension 200. We consider classifiers that are linear: they classify using a separating hyperplane (SH) defined by its normal vector w and offset b. Furthermore, these classifiers can all be expressed in dual form: the normal vector is a linear combination of the patterns of the data set, w = \sum_i \alpha_i x_i. Since we cannot investigate all such classifiers in an exhaustive manner, we consider the most representative member of each one of four families of classification principles: the support vector machine, the relevance vector machine, the prototype classifier, and the K-means classifier. Figure 2 shows these classifiers applied on a two-dimensional toy data set. These classifiers are presented and discussed in further detail below.

The support vector machine (SVM; Vapnik, 2000; Schölkopf & Smola, 2002) is a state-of-the-art maximum margin classification algorithm rooted in statistical learning theory. SVMs classify by maximizing the margin separating both classes while minimizing the classification errors. This tradeoff between maximum margin and misclassifications is controlled by a
Figure 2: Classification of a two-dimensional toy data set using the classifiers considered in this study. The dark lines indicate the SHs.
parameter C set by cross-validation. The optimal dual space parameter α maximizes the following expression,

\sum_{i} \alpha_i - \frac{1}{2} \sum_{ij} y_i y_j \alpha_i \alpha_j \langle x_i | x_j \rangle, \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C,

where ⟨·|·⟩ stands for the inner (or scalar) product between two vectors. The offset is computed as b = \langle y_i - \langle w | x_i \rangle \rangle_{i \,|\, 0 < \alpha_i < C}.

n and M can be arbitrarily large.⁴

Assumption 5. For n = 2, at most one of the sets {h(t) ∗ s_{i,I}(t), i = 1, …, n} and {h(t) ∗ r_i(t), i = 1, …, M} has gaussian components. If n > 2, at most one of h(t) ∗ s_{i,I}(t), i = 1, …, n, and h(t) ∗ r_i(t), i = 1, …, M, is gaussian.

⁴ If M ≤ n, components of s_D(t) may become independent after linear transform, and assumption 2 is violated. In this case, the SDICA problem is actually a special case of the overcomplete ICA, which is detailed in section 4.

In assumption 4, the matrix B can be represented as [I_n | 0_{n×(M−n)}] · B̄, where I_n denotes the n-dimensional identity matrix, 0_{n×(M−n)} the n × (M − n) zero matrix, and B̄ an M-dimensional square matrix. This means that s_D(t) can be regarded as the projection of the linear transform of the M-dimensional independent random vector r(t) on the n-dimensional space. The larger M is, the more vectors s_D(t) can be represented by B̄r(t). We can now present the following proposition, which is more applicable than proposition 1:

Proposition 2. Let x(t) be the observations in the SDICA model. Under assumptions 1 to 5, the outputs of the SDICA separation system, y_i(t), are instantaneously independent if and only if the filter h(t) filters out the dependent subcomponents s_{i,D}(t) and WA is a generalized permutation matrix.

Proof: Since s_{i,I}(t) and s_{j,D}(t) are assumed to be spatially independent stochastic sequences for all pairs of i and j, y_{i,I}(t) and y_{j,D}(t), defined
in equation 3.2, are always independent for all pairs of i and j according to lemma 1. Suppose y_i(t) are independent. Under the assumptions in this proposition, the components of y_I(t), and those of y_D(t), must be mutually independent according to lemma 3 and its extension in the high-dimensional case. Therefore, y_{i,D}(t) must vanish, as they are assumed to be always dependent. Then equation 3.2 becomes the demixing procedure of the basic ICA model. Obviously this proposition is true.

Assumptions 1 to 5 are generally not very restrictive. And it is important to emphasize that it is possible for proposition 2 to be true even when some of the assumptions are violated. According to proposition 2, we can use the SDICA separation system in Figure 1A to separate the observations generated by the SDICA model, and the filter h(t) and the demixing matrix W in the SDICA system can be obtained by making y_i(t) mutually independent.

3.2 The Learning Rule for BS-ICA. The SDICA system can be considered as a special case of the blind separation system of convolutive mixtures. The approach based on information maximization provides simple and efficient algorithms for blind separation of convolutive mixtures, but it has the side effect of temporally whitening the output (Bell & Sejnowski, 1995; Torkkola, 1996). The temporally whitening effect must be avoided in SDICA. Hence, here the information-maximization principle should not be adopted. Mutual information is a natural and canonical measure of statistical dependence, and the mutual-information-minimization approach has been applied for blind separation of convolutive mixtures in Babaie-Zadeh, Jutten, and Nayebi (2001b). We can use mutual information to measure the dependence between y_i and estimate h(t) and W by minimizing the mutual information between y_i. In information theory, the mutual information between n random variables y_1, …, y_n is defined as

I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y),   (3.4)
where y = (y_1, \ldots, y_n)^T, and H(·) denotes the (differential) entropy. I(y_1, …, y_n) is always nonnegative and is zero if and only if y_i are mutually independent. We can then derive the learning rules for h(t) and W in Figure 1A based on the minimization of mutual information between y_i.

3.2.1 Adjusting W in the Instantaneous Stage. Since in the SDICA separation system, the filtering stage and instantaneous stage are in a cascade structure, h(t) will not be involved explicitly in the learning rule for W, and the instantaneous stage just aims at minimizing the mutual information
given z_i as input. The learning rule for W is then the same as that in the basic ICA model (Bell & Sejnowski, 1995; Cardoso, 1997),

\frac{\partial I(y)}{\partial W} = -[W^T]^{-1} - E\{\psi_y(y)\, z^T\},   (3.5)
where z = [z_1, \ldots, z_n]^T, z_i(t) = h(t) ∗ x_i(t), and \psi_y(y) = [\psi_{y_1}(y_1), \ldots, \psi_{y_n}(y_n)]^T is called the marginal score function (MSF) in Babaie-Zadeh et al. (2001b). \psi_{y_i}(u) is the score function of the random variable y_i, defined as

\psi_{y_i}(u) = \left( \log p_{y_i}(u) \right)' = \frac{p'_{y_i}(u)}{p_{y_i}(u)}.   (3.6)
Multiplying the right-hand side of equation 3.5 with W^T W, we get the natural gradient method (Amari et al., 1996; Cardoso & Laheld, 1996):

\Delta W \propto \left( I + E\{\psi_y(y)\, y^T\} \right) W.   (3.7)
In order to make y_i of unit variance when the algorithm converges, we replace the entries on the diagonal of I + E\{\psi_y(y)\, y^T\} by 1 - E(y_i^2). In this way, the above learning rule is modified as follows:

\Delta W \propto \left( I - \mathrm{diag}\{E(y_1^2), \ldots, E(y_n^2)\} + E\{\psi_y(y)\, y^T\} - \mathrm{diag}\{E[\psi_y(y)\, y^T]\} \right) W,   (3.8)

where \mathrm{diag}\{E(y_1^2), \ldots, E(y_n^2)\} denotes the diagonal matrix with E(y_1^2), \ldots, E(y_n^2) as its diagonal entries, and \mathrm{diag}\{E[\psi_y(y)\, y^T]\} denotes the diagonal matrix whose diagonal entries are those on the diagonal of E\{\psi_y(y)\, y^T\}.
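As an illustration, here is a minimal numpy sketch of one iteration of this update; it is only a sketch, and psi stands for whatever marginal score estimator one plugs in (it is not specified by the rule itself):

```python
import numpy as np

def update_W(W, z, psi, lr=0.15):
    """One step of the modified natural-gradient rule, equation 3.8.

    W   : (n, n) demixing matrix.
    z   : (n, T) filtered observations, z_i(t) = h(t) * x_i(t).
    psi : callable returning the (n, T) marginal scores psi_{y_i}(y_i).
    """
    y = W @ z                                   # current outputs
    T = y.shape[1]
    C = psi(y) @ y.T / T                        # E{psi_y(y) y^T}
    G = (np.eye(len(W)) - np.diag(np.mean(y**2, axis=1))
         + C - np.diag(np.diag(C)))             # bracket of equation 3.8
    return W + lr * G @ W
```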
p yi (yi ) ∂ yi ∂ log p yi (yi ) · = −E ∂h k p yi (yi ) ∂h k n ∂ yi ∂z p = −E ψ yi (yi ) · · ∂z p ∂h k
∂ H(yi ) = −E ∂h k
= −E t
p=1
ψ yi (t) ·
n p=1
wi, p x p (t − k) ,
(3.9)
where w_{i,p} is the (i, p)th entry of the matrix W. And

\frac{\partial H(y)}{\partial h_k} = -E\left\{ \frac{\partial \log p_y(y)}{\partial h_k} \right\} = -E\left\{ \sum_{i=1}^{n} \frac{\partial \log p_y(y)}{\partial y_i} \cdot \frac{\partial y_i}{\partial h_k} \right\} = -\sum_{i=1}^{n} E\left\{ \frac{\partial \log p_y(y(t))}{\partial y_i(t)} \cdot \sum_{p=1}^{n} w_{i,p}\, x_p(t-k) \right\} = -E_t\left\{ \varphi_y^T(t) \cdot W \cdot x(t-k) \right\},   (3.10)
(3.10)
where ϕ y (y) = [ϕ1 (y), . . . , ϕn (y)]T is called the joint score function (JSF) in Babaie-Zadeh et al. (2001b), and its ith element is ∂ log py (y) ϕi (y) = = ∂ yi
∂ ∂ yi
py (y)
py (y)
.
(3.11)
Combining equations 3.9 and 3.10 gives ∂ I (y) ∂ H(yi ) ∂ H(y) = − ∂h k ∂h k ∂h k i=1 T = −E t ψ y (t) · W · x(t − k) + E t ϕ yT (t) · W · x(t − k) = E t β yT (t) · W · x(t − k) , n
(3.12)
where β y (y) = ϕ y (y) − ψ y (y) is defined as the score function difference (SFD) in Babaie-Zadeh et al. (2001b). The SFD is an independence criterion; it vanishes if and only if yi are mutually independent. Now the elements of h(t) can be adjusted according to equation 3.12 with the gradient-descent method. Since the SFD estimation cannot be avoided in updating h k , alternatively we can adopt the SFD-based algorithm for adjusting W (Samadi, BabaieZadeh, Jutten, & Nayebi, 2004): W ∝ −E{β y (y)yT }W.
(3.13)
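To make the filter rule concrete, here is a minimal numpy sketch of one gradient step on h(t) according to equation 3.12; the SFD estimator sfd is assumed to be supplied externally (e.g., by Pham's method mentioned below), and the constraint h_0 = 1 anticipates section 3.3:

```python
import numpy as np

def update_h(h, W, x, sfd, lr=0.1):
    """Gradient-descent step on the FIR taps of h(t) (equation 3.12).

    h   : (L+1,) causal filter taps, with h[0] kept equal to 1.
    W   : (n, n) demixing matrix.
    x   : (n, T) observations.
    sfd : callable returning the (n, T) SFD values beta_y(y(t)).
    """
    L = len(h) - 1
    n, T = x.shape
    z = np.stack([np.convolve(xi, h)[:T] for xi in x])   # z = h * x
    beta = sfd(W @ z)                                    # beta_y(t)
    grad = np.empty(L + 1)
    for k in range(L + 1):
        x_shift = np.zeros_like(x)
        x_shift[:, k:] = x[:, :T - k]                    # x(t - k), causal
        # E_t{ beta_y^T(t) . W . x(t - k) }
        grad[k] = np.mean(np.sum(beta * (W @ x_shift), axis=0))
    h_new = h - lr * grad
    h_new[0] = 1.0          # remove the scaling indeterminacy (section 3.3)
    return h_new
```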
Equation 3.8 (or equation 3.13) and equation 3.12 are the learning rules of BS-ICA. When the instantaneous stage and the filter stage both converge, the matrix W is the demixing matrix associated with the mixing matrix A. The original sources can be estimated as Wx. Moreover, the outputs yi form the estimate of a filtered version of the independent subcomponents si,I . As in separating convolutive mixtures (Babaie-Zadeh et al., 2001b) and separating convolutive postnonlinear mixtures (Babaie-Zadeh, Jutten, &
Nayebi, 2001a), the algorithm (see equations 3.12 and 3.13) involves the SFD (β_y(y)). The SFD estimation problem has been addressed in several articles (e.g., Taleb & Jutten, 1997, and Pham, 2003). In our experiments, we adopt Pham's method, because it is fast and comparatively accurate. The SFD estimation requires a large number of samples and is difficult to perform when the dimension of the output is greater than two. In section 3.4, we propose a scheme to avoid high-dimensional SFD.

3.3 Practical Considerations. In Figure 1A, if we exchange a scalar factor between the filter h(t) and the matrix W, the output y does not change. This scaling indeterminacy will do harm to the convergence of our algorithm. Therefore, we set h_0, the first element in h(t), as 1, to eliminate this indeterminacy. In applying BS-ICA, we should not neglect the effect of the capacity of the filter h(t). If the proportion of the dependent subcomponents is extremely large, it is hard to eliminate the effect of the dependent subcomponents due to the limited capacity of h(t), such that our method may converge to an incorrect target under this condition.

3.3.1 Order of h(t) and Parameter Initialization. In practice, the performance of SDICA is affected by the order of h(t). Note that there is no ideal digital filter. There will always be finite transition bands between the stop-bands and pass-bands. Furthermore, a digital filter can never completely attenuate the amplitude of the signal in the stop-bands, and it will not allow the signal in the pass-bands to pass through unscathed. Instead, the stop-bands will be attenuated by a finite gain factor; hence, our method may fail to filter out the dependent subcomponents when their energy is extremely high. We should also consider the effect of the length of h(t). If the length, L + 1, is too short, the resolution of the filter is poor, and the width of the transition band will increase. But a long length of the filter will result in heavy computational load and may cause the algorithm to be less robust. In our three experiments, the filter length is set as 17, 11, and 21, respectively.

In BS-ICA, there are many parameters to be tuned. Also, due to the limited accuracy in the SFD estimation, there may be some local optimum, especially when the data dimension is high. In practice, when the proportion of the dependent subcomponents is not that large, we found it is very useful to initialize the demixing matrix W with that obtained by traditional ICA algorithms (e.g., the FastICA algorithm). This initialization scheme can improve the convergence speed and may help to avoid some local optimum.

3.3.2 Penalty Term with Prior Knowledge on h(t). Notice that the decomposition of each source s_i(t) into the independent subcomponent s_{i,I}(t) and the dependent subcomponent s_{i,D}(t) (see equation 3.1) is not unique. In other words, if h(t) filters out not only the dependent subcomponent s_{i,D}(t) but also part of the independent subcomponent s_{i,I}(t), the independence between output y_i can also be achieved. This means that we may recover
only part of the independent subcomponent with BS-ICA. This would not affect the solution when we aim only to estimate the true mixing matrix A. However, if we also aim to recover the whole independent subcomponents, we can tackle this problem by incorporating an additional penalty term in the objective function. We can take into account the prior information on the frequency localization of the independent subcomponent. For instance, if we know in advance that the frequency of the independent subcomponent of interest is around the radian frequency Ω_0, we can modify the objective function by incorporating the frequency response magnitude of h(t) at Ω_0:

J_1 = I(y_1, \ldots, y_n) - \lambda\, |H(j\Omega_0)|^2,   (3.14)

where λ is a small enough positive number. Since

H(j\Omega_0) = \sum_{t=0}^{L} h(t)\, e^{-j\Omega_0 t},

we have

|H(j\Omega_0)|^2 = H(j\Omega_0) \cdot H^*(j\Omega_0) = \sum_{t=0}^{L} h(t)\, e^{-j\Omega_0 t} \cdot \sum_{k=0}^{L} h(k)\, e^{j\Omega_0 k} = \sum_{t=0}^{L} \sum_{k=0}^{L} h(t)\, h(k) \cos[(t-k)\Omega_0],

and then

\frac{\partial |H(j\Omega_0)|^2}{\partial h(k)} = 2 \sum_{t=0}^{L} h(t) \cos[(k-t)\Omega_0].

The gradient of J_1 with respect to h_k is

\frac{\partial J_1}{\partial h_k} = E_t\left\{ \beta_y^T(t) \cdot W \cdot x(t-k) \right\} - 2\lambda \sum_{t=0}^{L} h_t \cos[(k-t)\Omega_0].   (3.15)
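The penalty only adds a cosine sum to the gradient of each tap; a minimal sketch of that extra term (omega0 denotes Ω_0; the function name is ours):

```python
import numpy as np

def penalty_gradient(h, omega0, lam=0.01):
    """Gradient of the term -lambda * |H(j omega0)|^2 in equation 3.15."""
    t = np.arange(len(h))
    # d|H(j omega0)|^2 / dh_k = 2 * sum_t h_t cos[(k - t) omega0]
    return np.array([-2.0 * lam * np.sum(h * np.cos((k - t) * omega0))
                     for k in t])
```

This term is simply added to the mutual-information gradient of equation 3.12 before the descent step.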
h_k can then be adjusted with the gradient-descent method. h_0 is still set as 1 during the update process.

3.3.3 To Minimize the Distortion and Improve the Robustness. The outputs y_i are the estimate of (a filtered version of) the independent subcomponents.
In order to make y_i be as close as possible to the original independent subcomponents, the pass-band of h(t) should be as wide as possible provided that h(t) filters out the dependent subcomponents. This can be achieved by using equation 3.15 as the learning rule, with the value of Ω_0 no longer fixed; each time it is randomly chosen between 0 and the Nyquist frequency π. With this scheme, our method becomes more robust. If necessary, after convergence of BS-ICA, we can further modify the learned h(t) such that it has a flat spectrum in the pass-bands, while its stop-bands remain. In this way, the distortion is further reduced.
3.4 To Avoid High-Dimensional SFD. Due to the curse of dimensionality, it is difficult to estimate the probability density function (pdf) in high-dimensional spaces, and hence the SFD estimation is an obstacle when applying our algorithm to the high-dimensional case. So it is useful to find a way to avoid the SFD in the learning rule. In general, pairwise independence is weaker than mutual independence. However, according to the Darmois-Skitovich theorem, for the linear instantaneous ICA, the property that outputs y_i are pairwise independent is equivalent to the mutual independence between y_i (Comon, 1994). In this case, we can use a heuristic way to achieve mutual independence between y_i by minimizing the sum of the pairwise mutual information:
J = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(y_i, y_j).   (3.16)
This function is always nonnegative and is zero if and only if y_i are pairwise independent. Under the assumptions made in proposition 2, the mutual independence between the outputs of the SDICA system, y_i(t), is equivalent to their pairwise independence according to the extension of lemma 3 in the high-dimensional case. Consequently, in SDICA, we can minimize the objective function, equation 3.16, to make y_i mutually independent. In this way, the SFD estimation is always performed in the two-dimensional space, as explained below. By deriving the gradient of equation 3.16 with respect to h_k, we can get the update rule for h(t),
\frac{\partial J}{\partial h_k} = E\left\{ \gamma_y^T(t) \cdot W \cdot x(t-k) \right\},   (3.17)
where γ_y(t) = [γ_1(t), …, γ_n(t)]^T is defined as the pairwise score function difference (PSFD), and

\gamma_i = \sum_{j=1, j \neq i}^{n} \beta_{y_i}(y_i, y_j) = \sum_{j=1, j \neq i}^{n} \varphi_{y_i}(y_i, y_j) - (n-1)\, \psi_{y_i}(y_i).

Note that \varphi_{y_i}(y_i, y_j) is the score function of the joint pdf of (y_i, y_j) with respect to y_i, that is, \varphi_{y_i}(y_i, y_j) = \partial \log p_{y_i, y_j}(y_i, y_j) / \partial y_i, and \psi_{y_i}(y_i) is the score function of y_i. (For a proof of the rule, refer to the appendix.) In order to obtain γ_y, we need to estimate β_{y_i}(y_i, y_j) for all i ≠ j. This means that the SFD estimation is always performed in the two-dimensional space, regardless of the original data dimension. For simplicity, in the instantaneous stage, we still use I(y_1, …, y_n) as the objective function, and the learning rule for W is still equation 3.5. From the definition of the PSFD γ_y, we can see the SFD of each pair of y_i is needed to construct the n-dimensional PSFD γ_y(y). Therefore, for estimating γ_y(y), we need to estimate a set of C_n^2 = n(n-1)/2 two-dimensional SFDs.
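For bookkeeping, a minimal sketch of how γ_y can be assembled from two-dimensional SFD estimates (sfd2d is an assumed external estimator such as Pham's; only the n(n−1)/2 unordered pairs need to be estimated, each pair contributing both rows used below):

```python
import numpy as np

def psfd(y, sfd2d):
    """Pairwise score function difference gamma_y for an (n, T) output y.

    sfd2d(yi, yj) must return the (2, T) SFD of the pair (yi, yj);
    its first row is beta_{y_i}(y_i, y_j), its second beta_{y_j}(y_i, y_j).
    """
    n, T = y.shape
    gamma = np.zeros((n, T))
    for i in range(n):
        for j in range(i + 1, n):
            b = sfd2d(y[i], y[j])       # one two-dimensional SFD estimate
            gamma[i] += b[0]            # beta_{y_i}(y_i, y_j)
            gamma[j] += b[1]            # beta_{y_j}(y_i, y_j)
    return gamma
```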
Clearly the complexity of the PSFD-based algorithm (see equation 3.17) is a quadratic function of n in each iteration. This scheme can also be applied to the algorithms for separating linear instantaneous or convolutive mixtures with the SFD involved (Babaie-Zadeh et al., 2001b). Consequently, these algorithms become more applicable for high-dimensional data.

4 For Overcomplete ICA Problems with Sources Having Specific Frequency Characteristics

For the overcomplete ICA, the mixing system (see equation 2.1) is not invertible. In fact, due to the information lost in the mixing process caused by the reduction of the data dimension, even if we know the mixing matrix A, we cannot recover all the independent sources exactly. Generally, solving the overcomplete ICA problem involves two processes: estimation of the mixing matrix and estimation of the original sources with the help of their assumed prior probability densities and the estimated mixing matrix (Olshausen & Field, 1997; Lewicki & Sejnowski, 2000). The method proposed by Hyvärinen, Cristescu, and Oja (1999) is based on the FastICA algorithm (see Hyvärinen & Oja, 1997), combined with the concept of quasiorthogonality, for producing more independent components than observations. This idea has been extended from a maximum likelihood viewpoint in Hyvärinen and Inki (2002). In addition, the work in Amari (1999) focuses on only the natural gradient learning algorithm to estimate the mixing matrix A and does not treat the problem of source recovery. In these methods, some assumed prior distributions, which are usually sparse,
are imposed on the sources; accordingly, these methods may not work in some scenarios. Here we consider the overcomplete ICA problems from a different point of view: we exploit the information of the frequency spectra of the sources.
4.1 Relation to SDICA. SDICA is closely related to the overcomplete ICA problem. Here we are concerned with one form of the overcomplete ICA problems, in which there exists a subset of sources with the number k (2 ≤ k ≤ m), forming the vector s^{(1)}, such that each source in s^{(1)} has some frequency band outside the frequency bands of the sources not in s^{(1)}. Denote the vector consisting of the sources not in s^{(1)} by s^{(2)}. Partition the mixing matrix A into A^{(1)} and A^{(2)} according to s^{(1)} and s^{(2)}, the two disjoint subsets of sources. Then we have

x = As = \left( A^{(1)} \; A^{(2)} \right) \begin{pmatrix} s^{(1)} \\ s^{(2)} \end{pmatrix} = A^{(1)} s^{(1)} + A^{(2)} s^{(2)}.   (4.1)
If A^{(1)} is square and nonsingular, we can further get x = A^{(1)}(s^{(1)} + A^{(1)-1} A^{(2)} s^{(2)}). Generally A^{(1)-1} A^{(2)} is not a generalized permutation matrix; otherwise, some independent sources will be merged together, and the overcomplete ICA becomes the ordinary ICA problem (Cao & Liu, 1996). Consequently the elements of A^{(1)-1} A^{(2)} s^{(2)} are dependent. The elements of s^{(1)} + A^{(1)-1} A^{(2)} s^{(2)} can then be regarded as sources following the assumption in SDICA (see equation 2.3). If A^{(1)} is not square, we consider the case where the rank of A^{(1)} is k. There exists a k × k submatrix of A^{(1)}, denoted by Ã^{(1)}, which consists of some rows of A^{(1)}, such that Ã^{(1)} is of full rank. Denote the vector of the observations corresponding to Ã^{(1)} by x̃, and the corresponding submatrix of A^{(2)} by Ã^{(2)}. We have x̃ = Ã^{(1)}(s^{(1)} + Ã^{(1)-1} Ã^{(2)} s^{(2)}). The elements of s^{(1)} + Ã^{(1)-1} Ã^{(2)} s^{(2)} can still be considered as sources following the assumption in SDICA. Consequently, the problem of overcomplete ICA becomes a special case of the SDICA problem.
4.2 To Solve One Form of the Overcomplete ICA Problems. If we use the SDICA system to separate the overcomplete mixture x discussed above, we have

y(t) = W \cdot A^{(1)} \cdot \begin{pmatrix} h(t) * s_1^{(1)}(t) \\ \vdots \\ h(t) * s_k^{(1)}(t) \end{pmatrix} + W \cdot A^{(2)} \cdot \begin{pmatrix} h(t) * s_1^{(2)}(t) \\ \vdots \\ h(t) * s_{n-k}^{(2)}(t) \end{pmatrix}.   (4.2)
In the general case, WA^{(1)} and WA^{(2)} will not be generalized permutation matrices at the same time; otherwise the overcomplete ICA problem is degenerated to the ordinary ICA problem (Cao & Liu, 1996). In this case, as a consequence of lemma 3, there are two possibilities to achieve the independence between y_i in equation 4.2: we can either use h(t) to filter out s^{(2)} and set WA^{(1)} as a generalized permutation matrix, or we can filter out s^{(1)} and let WA^{(2)} be a generalized permutation matrix. Therefore, we can use the BS-ICA method to separate the overcomplete mixtures of this kind. For estimating the k sources in s^{(1)}, we can confine the demixing matrix W to be a k × m matrix. The outputs y_i form an estimate of a filtered version of these sources, and the learned W is the associated demixing matrix. When k = m/2 = n and the frequency bands of components of s^{(1)} and those of s^{(2)} do not overlap, theoretically the algorithm has at least two local minima: one is to recover s^{(1)} and the other to recover s^{(2)}. In practice, due to the limited capacity of the filter, one of these two minima may be hard to obtain, especially when the corresponding sources have very little contribution to the observations. But this local minimum can be obtained in another simple way. In this case, when we achieve one local minimum—for example, components of s^{(1)} are recovered and they completely pass through h(t) (if necessary, this can be achieved by incorporating a penalty term; see section 3.3.3)—the filter h^{-1}(t) attenuates s^{(1)} and allows s^{(2)} to pass through.⁵ Applying a linear instantaneous ICA algorithm to h^{-1}(t) ∗ x(t), we can estimate the demixing matrix associated with A^{(2)} and the filtered version of components of s^{(2)}. We will discuss this issue with the help of the experiment.

Noisy ICA is a special case of overcomplete ICA. Usually the noise is assumed to be white; that is, it has a flat frequency spectrum. The independent source signals usually concentrate on a narrower frequency band. Our method may produce two different outcomes for such noisy ICA problems: h(t) suppresses the noise and the independent sources pass, or h(t) attenuates the independent sources and allows the noise to pass through. The former case is what we generally want to achieve. For the latter case, we can further apply h^{-1}(t) to the observations to attenuate the noise and obtain the signals.
⁵ For a causal finite impulse response (FIR) filter, its inverse always exists. If it is a minimum-phase filter, it has a causal inverse. Otherwise, the inverse is a noncausal filter.
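On implementation: because h(t) is FIR with h_0 = 1, applying h^{-1}(t) amounts to an all-pole IIR filter, which, per the footnote, is stable and causal only when h(t) is minimum phase. A minimal SciPy sketch under that assumption (the function name is ours):

```python
import numpy as np
from scipy.signal import lfilter

def apply_inverse_filter(h, x):
    """Filter each row of x with h^{-1}(t), i.e., find y with h * y = x.

    Implemented as the IIR filter with numerator [1] and denominator h;
    this is only stable when h(t) is minimum phase.
    """
    return np.vstack([lfilter([1.0], h, xi) for xi in x])
```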
5 Experiments

To assess the quality of the demixing matrix W for separating observations generated by the mixing matrix A, we use the Amari performance index P_err (Amari et al., 1996; Cichocki & Amari, 2003),

P_{err} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 + \sum_{j=1}^{n} \frac{|p_{ji}|}{\max_k |p_{ki}|} - 1 \right),   (5.1)

where p_{ij} = [WA]_{ij}. The smaller P_err is, the closer the product of W and A is to the generalized permutation matrix. Note that the proposed BS-ICA method can not only estimate the mixing matrix A and the original sources, but also estimate the independent subcomponents. As in blind separation of convolutive mixtures, the BS-ICA method for SDICA produces outputs as an estimate of the filtered version of the independent subcomponents instead of the original ones. Therefore, for measuring the separation quality, we also use the output signal-to-noise ratio (SNR), defined as (assuming there is no permutation indeterminacy)
\mathrm{SNR}_i = 10 \log_{10} \frac{E\{y_i^2\}}{E\{y_i^2 \,|\, s_{i,I} = 0\}},   (5.2)
where y_i | s_{i,I} = 0 stands for what is at the ith output when the corresponding input subcomponent s_{i,I} is zero (Babaie-Zadeh et al., 2004). A high SNR means that the contribution to this output from the other sources (including the dependent subcomponents) is small.

5.1 Experiment 1: SDICA with Artificially Generated Data. In the first experiment, we use some artificially generated signals to test the performance of the BS-ICA method for solving the SDICA problem. The four independent subcomponents are an amplitude-modulated signal, a sign signal, a high-frequency noise signal, and a speech signal. Each signal has 1000 samples. Each original source s_i contains one of the above independent signals, together with a sinusoid wave with a particular frequency but different phases for different sources, which is the dependent subcomponent. Figure 2 shows these signals as well as their frequency characteristics (the magnitude of their Fourier transforms). The experimental setting is similar to the first experiment in Tanaka and Cichocki (2004). Any nonsingular matrix could be used as the mixing matrix. In this experiment, the mixing matrix is
A = \begin{pmatrix} 0.7 & 0.7 & 1 & 1.2 \\ 1 & 1.4 & 1 & 0.3 \\ 1.2 & 0.3 & 0.7 & 1.0 \\ 0.4 & 1.1 & 1.2 & 0.6 \end{pmatrix}.
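Both performance measures above are direct to transcribe; a minimal numpy sketch (function names are ours, not from the authors' code):

```python
import numpy as np

def amari_index(P):
    """Amari performance index, equation 5.1, for P = W @ A."""
    P = np.abs(P)
    n = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (rows.sum() + cols.sum()) / (n * (n - 1))

def output_snr(y, y_without_subcomponent):
    """Output SNR, equation 5.2: y_without_subcomponent is the same output
    recomputed with the corresponding independent subcomponent set to zero."""
    return 10 * np.log10(np.mean(y**2) / np.mean(y_without_subcomponent**2))
```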
Figure 2: The source-independent subcomponents (top four) and the waveform of the dependent subcomponents (bottom one), as well as their frequency characteristics, represented by the magnitude of their Fourier transforms. Only 400 samples are plotted for illustration. The sources s_i = s_{i,I} + s_{i,D}, and s_{i,D} are sinusoid waves with the same frequency but different phases.
Figure 3: The observations (left) and their frequency characteristics (right) in experiment 1.
The observations xi , as well as their frequency characteristics, are shown in Figure 3. Since some source-independent subcomponents in this experiment have disjoint frequency representations, the method proposed in Tanaka and Cichocki (2004) may fail.
Here the dimension of observations is four. Since it is very difficult to estimate the SFD in the four-dimensional space, we use the scheme described in section 3.4 to avoid the high-dimensional SFD. Consequently, we just need to estimate the two-dimensional SFD, which is estimated using Pham's method (Pham, 2003). The length of the causal filter h(t) is 17. h_0 is set as 1, and all the other elements of h(t) are initialized as 0. W is initialized as the identity matrix. The learning rate for the filtering stage is 0.1, and that for the instantaneous stage is 0.15. Equation 3.7 is adopted for adjusting W. At convergence, the product of W and A is

WA = \begin{pmatrix} -0.0003 & 0.0287 & 0.0027 & 0.6611 \\ -0.0049 & 0.9249 & 0.0009 & 0.0016 \\ 1.8876 & -0.0037 & 0.0113 & 0.0108 \\ -0.0545 & -0.0134 & 0.6420 & 0.0113 \end{pmatrix}.

The Amari performance index P_err = 0.0278, from which we can see WA is almost a generalized permutation matrix. It means that W is really the demixing matrix with respect to A. Then the original sources s_i can be estimated as Wx with good performance. We repeat this experiment and incorporate an additional term to achieve little distortion with λ = 0.01, as discussed in section 3.3.3, and the experimental result is almost the same. For comparison, we also apply the FastICA algorithm (Hyvärinen & Oja, 1997) (with the tanh nonlinearity and in the symmetric estimation mode) to directly separate the observations x. The resulting Amari performance index is P_err = 0.139, which indicates poor performance in recovering the mixing procedure A due to the effect of the dependent subcomponents.

Figure 4 shows the output SNRs (with respect to the source-independent subcomponents depicted in Figure 2) versus iterations and the waveforms of the outputs y_i. The SNRs are very high, meaning that the dependent subcomponents have been filtered out by the filter h(t) and the outputs are a good estimate of (the filtered version of) the independent subcomponents. This can be verified by examining the frequency response magnitude of h(t) (see Figure 5). By comparing the frequency representation of z_1(t) (note that z_1(t) = h(t) ∗ x_1(t)) in Figure 5C with that of x_1(t) in Figure 3, top, we can see the effect of the dependent subcomponents around Ω = 0.47 rad is almost eliminated. From Figure 5B, we can see that the frequency response magnitude of h(t) varies greatly in the pass-bands. This results in the distortion in the estimated independent subcomponents. As discussed in section 3.3.3, for less distortion of the recovered independent subcomponents, we design a causal filter h_1(t), which has the same stop-band as h(t) and has a nearly constant magnitude in the pass-bands. In this experiment, the frequency response magnitude of h_1 is indicated by the dotted line in Figure 5B. The result of applying h_1(t) to the recovered sources Wx(t) is shown in Figure 6.
Figure 4: (A) The output SNRs with respect to the source-independent subcomponents. (B) The outputs y_i.
Figure 5: (A) The learned filter h(t). (B) Its frequency response magnitude (solid line). (C) The magnitude of the Fourier transform of z_1(t) = h(t) ∗ x_1(t). The dotted line in B shows the frequency response magnitude of h_1(t), which has the same stop-band as h(t), but the magnitude in the pass-bands is almost a constant.
We can see that now the independent subcomponents are recovered with less distortion compared to Figure 4B. We repeat the above experiment for 20 runs, and in each run the mixing matrix A is randomly chosen and is guaranteed to be nonsingular. In each
Figure 6: The recovered independent subcomponents by applying h_1(t) to the recovered sources Wx(t).
run, we run the FastICA algorithm, BS-ICA with W initialized by FastICA, and BS-ICA with W initialized with the identity matrix. We find that BS-ICA with W initialized by FastICA always converges quickly (in 500 iterations), and generally BS-ICA with W initialized with the identity matrix needs at least 1000 iterations to converge. The resulting Amari performance index of these three methods is shown in Figure 7. In all runs, BS-ICA with W initialized by FastICA converges to the desired target. And in three runs (runs 2, 8, and 14 in Figure 7), BS-ICA with W initialized with the identity matrix converges to a wrong target. There are two main reasons for this phenomenon. First, the second and first independent subcomponent signals have a very narrow band, and their frequency peaks are very close to that of the dependent subcomponent, as seen in Figure 2. Hence, they are very likely to be attenuated and distorted greatly. For the three runs where Perr is very high, we find that all columns of WA, except the second one, have only one dominant element. In other words, the poor performance in these three runs is caused by the second source-independent subcomponent signal. Second, the proportion of the dependent subcomponent is quite large, and it is hard to be filtered out. For these three runs, BS-ICA with W initialized with the identity matrix also converges to the desired target if we reduce the amplitude of the dependent subcomponent by half. 5.2 Experiment 2: Separating Images of Human Faces. Since the face images, represented as one-dimensional signals, are apparently dependent,
Figure 7: The Amari performance index of FastICA, BS-ICA with W initialized by FastICA, and BS-ICA with W initialized with the identity matrix, for 20 runs. In each run, the mixing matrix A is randomly chosen.
it is very hard to separate them with the ICA technique (Hyvärinen, 1998). Hyvärinen (1998) successfully separated four images of human faces by applying traditional ICA on the innovation processes. Here we use BS-ICA to do such a task. The four original images of human faces are the same as in Hyvärinen (1998), as shown in Figure 8A.⁶ We mix them with a random mixing matrix. Figure 8B shows the mixtures of the original images for illustration. The Amari performance index of separating the mixtures with FastICA is 0.592, and that of the natural gradient method (see equation 3.7, with the score function adaptively estimated from data) is 0.289.⁷ Figure 8C shows the separation result by the natural gradient method. Clearly the result produced by traditional ICA is poor.

We repeat BS-ICA with W initialized with FastICA and BS-ICA with W initialized with the identity matrix for 20 runs. In each run, the mixing matrix is randomly chosen. The length of h is 11. The learning rate in the filtering stage is 0.04, and that in the linear demixing stage is 0.08. Since it is very computation demanding and memory demanding to do SFD estimation on a large number of samples, in each iteration we process only 3000 samples.

⁶ Many thanks to Aapo Hyvärinen for providing us the images and granting permission to use them.

⁷ Here, FastICA and the natural gradient method (with the score function estimated from data) give different results. The main reason is that the original images are highly correlated, as seen from the correlation matrix computed in Hyvärinen (1998), while the outputs of FastICA are always uncorrelated.
Figure 8: Separating mixtures of images of human faces. (A) Original images. (B) Some mixtures of the original images. (C) Separation result by traditional ICA (natural gradient method with the score function estimated from data). (D) Separation result by BS-ICA.
We find that no matter which method is used to initialize W, the Amari performance index obtained by BS-ICA is always between 0.0397 and 0.0473. In other words, the original images are successfully separated with good performance. Figure 8D shows the separation result of one run. The learned h(t), as well as its frequency-response magnitude, is given in
Figure 9: (A) The learned h(t) in separating mixtures of images. (B) Its frequency response magnitude (solid line). For comparison, the dotted line shows the frequency-response magnitude of the filter h_AR(t), which is obtained by fitting a tenth-order autoregressive model to the observed images.
Figure 9. We can see that h(t) attenuates not only the low-frequency part but also the high-frequency part of the observations. For comparison, we repeat the experiment in Hyvärinen (1998). We use a tenth-order autoregressive model to estimate the innovation processes from the observations. The frequency response magnitude of the filter producing the innovation processes from the observations, denoted by h_AR(t), is shown by the dotted line in Figure 9B. After that, we estimate the mixing matrix by applying traditional ICA on the innovation processes. FastICA gives the Amari performance index 0.132, and the natural gradient method with the score function estimated from data gives 0.061. So the original images are also recovered successfully by exploiting innovation processes. And by comparing the performance index, one can see that BS-ICA gives a better result.

5.3 Experiment 3: Overcomplete ICA. In this experiment, we test the usefulness of our method for the overcomplete ICA problem. The four independent sources are an amplitude-modulated signal (s_1), a sign signal (s_2), a high-frequency noise signal (s_3), and a sinusoid (s_4), which are the first three independent subcomponents and the dependent subcomponent in the first experiment (see Figure 2). The mixing matrix is

A = \begin{pmatrix} 0.7 & 0.7 & 1.5 & 1 \\ 1 & 1.4 & 1.5 & 0.4 \end{pmatrix}.
Note that each pair of the columns of A is linearly independent. The two observations, together with their frequency characteristics, are shown in Figure 10. Since only two observations are available, each time the BS-ICA method can recover only two sources and the corresponding mixing matrix. From
Figure 10: The two observations (left) and their frequency characteristics (right) in experiment 3.
Figure 11: (A) The filter h(t). (B) Its frequency-response magnitude. (C) The magnitude of the Fourier transform of z_1(t) = h(t) ∗ x_1(t).
Figure 2, we can see that they all have different frequency characteristics, and s_3 has a wide frequency band. In order to recover two among all four sources, the other two sources would be filtered out by h(t). For a good frequency resolution, the length of h(t) should not be too small. Here we set the length of h(t) as 21. h_0 is set as 1, and all the other elements of h(t) are initialized as 0. W is initialized as the identity matrix. The learning rates for the filter stage and instantaneous stage are both 0.15. After about 1500 iterations, the algorithm converges. The filter h(t), its frequency response magnitude, and frequency characteristics of z_1(t) are shown in Figure 11. From Figure 11B and Figure 2, we can see h(t) has
Figure 12: Two outputs produced by the BS-ICA method for the overcomplete ICA problem. They are the estimate of a filtered version of some original sources. The SNR of y_1 with respect to s_4 is 20.1 dB, and that of y_2 with respect to s_2 is 18.2 dB.
significantly attenuated the amplitude-modulated signal (s_1) and the high-frequency noise signal (s_3). The filtered version of s_4 and s_2 is recovered with the SNR values 20.1 dB and 18.2 dB, as shown in Figure 12. The product of W and A is

WA = \begin{pmatrix} 0.0305 & -0.0051 & 0.1225 & 0.1451 \\ 0.1350 & 0.2265 & 0.1803 & -0.0018 \end{pmatrix}.
As s_1 and s_3 are filtered out by h(t), we neglect the first and third columns of this matrix and get the Amari performance index P_err = 0.040. So the mixing matrix associated with s_2 and s_4 is successfully recovered. Note that different sources may be recovered if we use different initialization values for h(t) and W.

Now we have obtained the estimate of two sources with h(t) filtering out the other two. In order to recover the remaining two sources, we could run our algorithm again with h^{-1}(t) as a good initialization value for the filter. If the frequency representations of the recovered sources and the remaining ones do not overlap (or overlap slightly), we can simply apply h^{-1}(t) to allow the remaining sources to pass through and attenuate the recovered sources. By applying a linear instantaneous ICA algorithm to the signals h^{-1}(t) ∗ x_i(t), we can get the estimate of the remaining sources without running BS-ICA again. This simple scheme is adopted in the following experiment. The linear instantaneous ICA algorithm chosen is the FastICA algorithm with the tanh nonlinearity and in the symmetric estimation mode. The product of the demixing matrix obtained
Figure 13: Outputs produced by applying FastICA on h^{-1}(t) ∗ x(t). The SNR of y_1 with respect to s_1 is 12.6 dB, and that of y_2 with respect to s_3 is 18.1 dB.
by FastICA, and the mixing matrix A is

WA = \begin{pmatrix} -0.2167 & -0.5029 & -0.0044 & 0.0060 \\ 0.4263 & 0.2153 & -0.3237 & -0.5298 \end{pmatrix}.

Neglecting its second and fourth columns, the Amari performance index is P_err = 0.040, which indicates that the columns in the mixing matrix associated with s_1 and s_3 are successfully recovered. Figure 13 shows the output signals, with the SNR values 12.6 dB and 18.1 dB, respectively.

6 Conclusion

In this article, we considered the problem of subband decomposition ICA. We investigated the feasibility of adaptively separating mixtures generated by the subband decomposition ICA model. Based on the minimization of the mutual information between outputs, we developed an adaptive algorithm for subband decomposition ICA, called band-selective ICA. The advantage of this algorithm is that it automatically selects the frequency bands in which source subcomponents are most independent and attenuates the dependent subcomponents. Practical issues in implementing our method were considered, and some techniques were suggested to improve the performance of this algorithm. We also discussed the relationship between subband decomposition ICA and overcomplete ICA. By taking into account the information of the frequency bands of sources, our algorithm can be exploited to solve one form of the overcomplete ICA problems in which sources have specific frequency localization characteristics. Experimental results have been given to illustrate the performance of our method for subband decomposition ICA as well as overcomplete ICA.
Appendix: Proof of the Rule, Equation 3.17

According to equation 3.16, we have
$$J = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(y_i, y_j) = \sum_{i=1}^{n} \Big[ (n-1) H(y_i) - \frac{1}{2} \sum_{j=1,\, j\neq i}^{n} H(y_i, y_j) \Big],$$
and
$$\begin{aligned}
\frac{\partial H(y_i, y_j)}{\partial h_k} &= -E\Big[ \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial h_k} \Big] \\
&= -E\Big[ \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial y_i} \cdot \frac{\partial y_i}{\partial h_k} + \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial y_j} \cdot \frac{\partial y_j}{\partial h_k} \Big] \\
&= -E\Big[ \varphi_{y_i}(y_i(t), y_j(t)) \sum_{p=1}^{n} w_{i,p}\, x_p(t-k) + \varphi_{y_j}(y_i(t), y_j(t)) \sum_{p=1}^{n} w_{j,p}\, x_p(t-k) \Big].
\end{aligned}$$
Also taking into account equation 3.9, we have
$$\begin{aligned}
\frac{\partial J}{\partial h_k} &= \sum_{i=1}^{n} \Big[ (n-1) \frac{\partial H(y_i)}{\partial h_k} - \frac{1}{2} \sum_{j=1,\, j \neq i}^{n} \frac{\partial H(y_i, y_j)}{\partial h_k} \Big] \\
&= \sum_{i=1}^{n} E\Big\{ \Big[ \sum_{j=1,\, j\neq i}^{n} \varphi_{y_i}(y_i(t), y_j(t)) - (n-1)\, \psi_{y_i}(y_i(t)) \Big] \cdot \sum_{p=1}^{n} w_{i,p}\, x_p(t-k) \Big\} \\
&= E\big[ \gamma_y^T(t) \cdot W \cdot x(t-k) \big].
\end{aligned}$$
This is exactly equation 3.17.
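To connect this rule to an implementation, the following sketch estimates the expectation in equation 3.17 by a sample average. The score estimators phi and psi are hypothetical callables (score-function estimation, e.g., via Pham, 2003, is a separate step), and the time-alignment convention is an assumption of this sketch:

```python
import numpy as np

def grad_J_wrt_h(x, y, W, k, phi, psi):
    """Sample estimate of dJ/dh_k = E[gamma_y(t)^T W x(t-k)] (equation 3.17).

    x, y : (n, T) arrays of mixtures and outputs; W : (n, n) demixing matrix.
    phi(i, j, y) : hypothetical estimate of the joint score
                   d log p_{y_i,y_j}(y_i(t), y_j(t)) / d y_i, shape (T,).
    psi(i, y)    : hypothetical estimate of the marginal score of y_i, shape (T,).
    """
    n, T = y.shape
    gamma = np.zeros((n, T - k))
    for i in range(n):
        s = np.zeros(T - k)
        for j in range(n):
            if j != i:
                s += phi(i, j, y)[k:]            # joint-score terms, j != i
        gamma[i] = s - (n - 1) * psi(i, y)[k:]   # gamma_{y_i}(t)
    xk = x[:, :T - k]                            # x(t - k) aligned with t >= k
    # average of gamma_y(t)^T W x(t - k) over the available time points
    return float(np.mean(np.einsum('it,ip,pt->t', gamma, W, xk)))
```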
Acknowledgments

This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China. We are very grateful to the anonymous referees for their helpful comments and suggestions. We also thank Deniz Erdogmus for helpful discussions.

References

Amari, S. (1999). Natural gradient for over- and under-complete bases in ICA. Neural Computation, 11, 1875–1883.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2001a). Blind separating convolutive post-nonlinear mixtures. In Proc. ICA2001 (pp. 138–143). San Diego, CA.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2001b). Separating convolutive mixtures by mutual information minimization. In Proc. IWANN (pp. 834–842). New York: Springer.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2004). A minimization-projection (MP) approach for blind separating convolutive mixtures. In ICASSP 2004. Montreal, Canada.
Bach, F. R., & Jordan, M. I. (2003). Beyond independent components: Trees and clusters. Journal of Machine Learning Research, 4, 1205–1233.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., & Amin, M. G. (1998). Blind source separation based on time-frequency signal representations. IEEE Transactions on Signal Processing, 46, 2888–2897.
Cao, X.-R., & Liu, R.-W. (1996). General approach to blind source separation. IEEE Transactions on Signal Processing, 44, 562–571.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.
Cardoso, J.-F. (1998). Multidimensional independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'98). Seattle, WA.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140, 362–370.
Cichocki, A., & Amari, S. (2003). Adaptive blind signal and image processing: Learning algorithms and applications (rev. ed.). New York: Wiley.
Cichocki, A., Amari, S., Siwek, K., Tanaka, T., et al. (2003). ICALAB toolboxes for signal and image processing. Available online at http://www.bsp.brain.riken.jp/ICALAB/.
Cichocki, A., & Belouchrani, A. (2001). Source separation of temporally correlated sources using bank of band pass filters. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA2001) (pp. 173–178). San Diego, CA.
Cichocki, A., & Georgiev, P. (2003). Blind source separation algorithms with matrix constraints. IEICE Transactions on Information and Systems, Special Session on Independent Component Analysis and Blind Source Separation, E86-A(1), 522–531.
Cichocki, A., Rutkowski, T. M., & Siwek, K. (2002). Blind signal extraction of signals with specified frequency band. In Neural Networks for Signal Processing XII: Proceedings of the 2002 IEEE Signal Processing Society Workshop (pp. 515–524). Piscataway, NJ: IEEE.
Cichocki, A., & Zurada, J. M. (2004). Blind signal separation and extraction: Recent trends, future perspectives, and applications. In ICAISC 2004 (pp. 30–37). New York: Springer.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Cramér, H. (1962). Random variables and probability distributions (2nd ed.). Cambridge: Cambridge University Press.
Gharieb, R. R., & Cichocki, A. (2003). Second-order statistics based blind source separation using a bank of subband filters. Digital Signal Processing, 13, 252–274.
Hyvärinen, A. (1998). Independent component analysis for time-dependent stochastic processes. In Proc. Int. Conf. on Artificial Neural Networks (ICANN'98) (pp. 541–546). Skovde, Sweden.
Hyvärinen, A., Cristescu, R., & Oja, E. (1999). A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc. Int. Joint Conf. on Neural Networks (pp. 894–899). Washington, DC.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phases and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12, 1705–1720.
Hyvärinen, A., Hoyer, P. O., & Oja, E. (2001). Image denoising by sparse code shrinkage. In S. Haykin & B. Kosko (Eds.), Intelligent signal processing. Piscataway, NJ: IEEE Press.
Hyvärinen, A., & Inki, M. (2002). Estimating overcomplete independent component bases for image windows. Journal of Mathematical Imaging and Vision, 17, 139–152.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Kagan, A. M., Linnik, J. V., & Rao, C. R. (1973). Characterization problems in mathematical statistics. New York: Wiley.
Kiviluoto, K., & Oja, E. (1998). Independent component analysis for parallel financial time series. In Proc. ICONIP'98 (pp. 895–898). Tokyo, Japan.
Lewicki, M., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.
Liu, R. W., & Luo, H. (1998). Direct blind separation of independent non-gaussian signals with dynamic channels. In Proc. Fifth IEEE Workshop on Cellular Neural Networks and Their Applications (pp. 34–38). London, England.
Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D. T. (2003). Fast algorithm for estimating mutual information, entropies and score functions. In Proceeding of ICA 2003. Nara, Japan.
Rickard, S., Balan, R., & Rosca, J. (2003). Blind source separation based on space-time-frequency diversity. In 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA2003). Nara, Japan.
Ristaniemi, T., & Joutsensalo, J. (1999). On the performance of blind source separation in CDMA downlink. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 437–441). Aussois, France.
Samadi, S., Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2004). Blind source separation by adaptive estimation of score function difference. In Proc. ICA 2004 (pp. 9–17). New York: Springer.
Taleb, A., & Jutten, C. (1997). Entropy optimization—application to blind source separation. In ICANN (pp. 529–534). New York: Springer.
Tanaka, T., & Cichocki, A. (2004). Subband decomposition independent component analysis and new performance criteria. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'04) (pp. 541–544). Piscataway, NJ: IEEE.
Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 423–432). Kyoto, Japan.
Vigário, R. (1997). Extraction of ocular artifacts from EEG using independent component analysis. Electroenceph. Clin. Neurophysiol., 103, 395–404.
Vigário, R., Jousmäki, V., Hämäläinen, M., Hari, R., & Oja, E. (1998). Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 229–235). Cambridge, MA: MIT Press.
Ye, J.-M., Zhu, X.-L., & Zhang, X.-D. (2004). Adaptive blind separation with an unknown number of sources. Neural Computation, 16, 1641–1660.
Yilmaz, O., & Rickard, S. (2004). Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52, 1830–1847.
Received February 7, 2005; accepted June 1, 2005.
LETTER
Communicated by Gabriel Huerta

On Consistency of Bayesian Inference with Mixtures of Logistic Regression

Yang Ge
[email protected]
Wenxin Jiang
[email protected]
Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.

Neural Computation 18, 224–243 (2006)
© 2005 Massachusetts Institute of Technology

This is a theoretical study of the consistency properties of Bayesian inference using mixtures of logistic regression models. When standard logistic regression models are combined in a mixtures-of-experts setup, a flexible model is formed to model the relationship between a binary (yes-no) response y and a vector of predictors x. Bayesian inference conditional on the observed data can then be used for regression and classification. This letter gives conditions on choosing the number of experts (i.e., number of mixing components) k, or choosing a prior distribution for k, so that Bayesian inference is consistent, in the sense of often approximating the underlying true relationship between y and x. The resulting classification rule is also consistent, in the sense of having near-optimal performance in classification. We show these desirable consistency properties with a nonstochastic k growing slowly with the sample size n of the observed data, or with a random k that takes large values with nonzero but small probabilities.

1 Introduction

Mixtures of experts (ME; Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical mixtures of experts (HME; Jordan & Jacobs, 1994) are popular techniques for regression and classification and have attracted attention in the areas of neural networks and statistics. ME and HME are a variety of neural networks that have an interpretation as probabilistic mixtures, in contrast to the usual neural nets, which are based on linear combinations. With mixtures, instead of linear combinations, simple models are combined in ME and HME for improved predictive capability. This structure of the probabilistic mixture allows the use of convenient computing algorithms such as the expectation-maximization (EM) algorithm (Jordan & Xu, 1995) and the Gibbs sampler (Peng, Jacobs, & Tanner, 1996).

The ME and HME are flexible constructions that can allow various models or experts to be mixed. For binary classification, simple and standard classifiers such as logistic regression models can be combined to model the
relationship between a binary response y ∈ {0, 1} and a predictor vector x. Such combined models can approximate arbitrary smooth relationships between y and x in the sense of Jiang and Tanner (1999a). Peng, Jacobs, and Tanner (1996) applied mixtures of binary and multinomial logistic regression to pattern recognition. They found that a Bayesian approach based on simulating the posterior distribution gives better performance than a frequentist approach based on likelihood. Recently, Wood, Kohn, Jiang, and Tanner (2005) studied binary regression where probit-transformed spline models with different smoothing parameters are mixed, and Markov chain Monte Carlo methods are used to simulate the posterior distributions for both the model parameters and the number of mixing components. Their extensive simulations and empirical studies demonstrate excellent performance of the Bayesian approach and local adaptivity of the mixing paradigm.

These empirical successes have motivated us to study the theoretical reasons behind the question: Why does the Bayesian procedure work well in such mixture models of binary regression?

The purpose of this letter is to study the consistency properties of Bayesian inference for mixtures of binary logistic regression. Will inferential results based on the posterior distribution be reliable? In a Bayesian approach, the posterior distribution will propose various relationships between x and y, based on some observed data. We will investigate the conditions under which the proposed relationships are consistent, that is, often close to the underlying true relationship. This will also imply that the resulting plug-in classification rule has near-optimal performance. There are several senses of consistency; the precise formulation of these problems is given in section 2. We assume that the true model possesses some unknown smooth mean function E(y|x), which can be outside the proposed ME family, and that the observed data $(y_i, x_i)_{i=1}^n$ are n independent and identical copies of (y, x).

In section 3 we study the consistency problem for a sequence of ME models, where the number of experts (or mixture components) K = k_n increases with sample size n. Such a construct allows a large number of experts eventually and enables good functional approximation (Jiang & Tanner, 1999a). We will show that the following condition on k_n leads to consistency: k_n increases to infinity at a rate slower than n^a for some a ∈ (0, 1), as the sample size n increases. Later in section 3, we consider the case when the number of experts K is random and follows a prior distribution. We will show that the critical conditions for consistency involve the prior on K: the prior is supported on all large values of K and has a sufficiently thin tail.

Our work parallels that of Lee (2000), who studies similar properties (without classification consistency) for ordinary neural networks (NN) based on linear combinations. Lee's method involves truncating the space of all proposed models into a limited part and an unlimited part, and shows that (1) the unlimited part has very small prior probability, satisfying some
tail condition; (2) the limited part is not too complicated, in the sense of satisfying an entropy condition; and (3) the prior is chosen to have not too small a probability mass around the true model, which is an approximation condition. Condition 3 typically involves some approximation results, since the prior proposed models have to be able to get as near as possible to the true model; otherwise, the prior mass would be zero over some neighborhood of the true model.

We implement these conditions for ME. However, we note that there is a fundamental difficulty resulting from the mechanism of approximation with ME. In the known mechanism (Jiang & Tanner, 1999a), to approximate a true relation arbitrarily well, ME with many experts crowded together and with large parameter values is used. As a result, the parameter values of the ME model (the components that describe the changing of the mixing weights) increase with K. In a Bayesian approach, very small prior probability is typically given to such ME configurations; they have large parameter values lying in the tail of the prior. Yet we would like to show that the resulting posterior of such configurations is not too small, since these configurations are close to the true relation. In order to handle this difficulty, we characterize how large the ME parameter values need to be for good approximation: values of order ln(K) are sufficient, which are not too far in the tail of the prior distribution. Such a result is established by embedding ME with K* (< K) experts as a subset of ME with K experts. (See lemma 5 and its proof.)

When we consider the situation with random K, we face another difficulty: the usual priors on K, such as the geometric or Poisson, cannot satisfy both conditions 1 and 2. If the truncation occurs at a K that is too large, the limited part of the proposed model space may become too complicated to satisfy the entropy condition. If the truncation occurs at a K that is too small, the tails may be too thick to satisfy the tail condition. Such a dilemma was not discussed in Lee (2000), who did not consider the entropy condition for the case of random K. In order to handle this situation, we introduce a contraction sequence for the number of experts: K = k(i), which grows to infinity as i increases but grows more slowly than i. Then the prior probability, for example, λ_i = (0.5)^i for the geometric, is put on the index i. Since k(i) can stay unchanged for several consecutive i, this contraction sequence effectively groups the geometric probabilities together at each value of the number of experts and produces a thinner tail. We show that using suitably contracted geometric or Poisson priors on K, all conditions hold and consistency follows.

2 Notation and Definitions

2.1 Models. We first define the single-layer ME models where logistic regression models are mixed.
The binary response variable is y ∈ {0, 1}, and x is an s-dimensional predictor. As in Jiang and Tanner (1999a, 1999b), we let $\Omega = [0,1]^s = \otimes_{q=1}^{s}[0,1]$ be the space of the predictor x and let x have a uniform distribution on $\Omega$. This is a convenient starting point, and the results can be easily adapted to the case when x has a positive density and is supported on a compact set. This convenient formulation results in several simplified relations. The joint density p(y, x) is the same as the conditional density p(y|x), which is completely determined by the conditional probability of a positive response P(y = 1|x), which is equal to the conditional mean or regression function µ(x) = E(y|x), which is alternatively formulated in a transformed version h(x) = log{µ(x)/(1 − µ(x))}, called the log-odds.

We consider a family of smooth relations between y and x as defined in Jiang and Tanner (1999a, 1999b): $\Gamma$ is the family of joint densities p(y, x) such that the log-odds h(x) has continuous derivatives up to the second order, which are all bounded above by a constant, uniformly over x. Such a nonparametric family can be approximated by mixtures of logistic regressions (Jiang & Tanner, 1999a, 1999b). Define the family $\Gamma_k$ of mixtures of k logistic regression models as follows: $\Gamma_k$ is the set of joint densities f(y, x|θ) such that the conditional densities have the form
$$f(y|x, \theta) = \sum_{j=1}^{k} g_j H_j, \quad \text{where } g_j = \frac{e^{u_j + v_j^T x}}{\sum_{l=1}^{k} e^{u_l + v_l^T x}}; \qquad H_j = \left( \frac{e^{\alpha_j + \beta_j^T x}}{1 + e^{\alpha_j + \beta_j^T x}} \right)^{y} \left( \frac{1}{1 + e^{\alpha_j + \beta_j^T x}} \right)^{1-y}.$$
The α, β, u, v's are parameters of the model. Except that we restrict u_1 = 0 and v_1 = 0 for the sake of identifiability, we allow all components of the parameter vectors to vary in (−∞, ∞). We denote by θ the combined vector of parameters, $\theta = (\alpha_1, \beta_1^T, \ldots, \alpha_k, \beta_k^T, u_2, v_2^T, \ldots, u_k, v_k^T)^T \in \Re^{\dim(\theta)}$, where dim(θ) = (s + 1)(2k − 1).

2.2 Bayesian Inference. The observed data set is (Y_1, X_1), ..., (Y_n, X_n), which we denote simply as (Y_i, X_i)^n. Here n is the sample size. We assume (Y_i, X_i)^n to be an independent and identically distributed (i.i.d.) sample of an unknown density f_0 from the nonparametric family $\Gamma$ of smooth relations. The mixture of logistic regression approach involves estimating the nonparametric f_0 using parametric relations f from $\Gamma_k$, the family of mixtures of k logistic regression models. We now describe Bayesian inference for uncovering f_0 based on (Y_i, X_i)^n. One first puts a prior distribution π_n to propose densities f from the k-mixture family $\Gamma_k$ (through the corresponding parameters θ). This prior will then produce a posterior distribution of f over $\Gamma_k$ (through the corresponding θ), conditional on the observed data:
$$\pi_n(d\theta|(Y_i, X_i)^n) = \prod_{i=1}^{n} f(Y_i, X_i|\theta)\,\pi_n(d\theta) \Big/ \int \prod_{i=1}^{n} f(Y_i, X_i|\theta)\,\pi_n(d\theta).$$
Then the predictive density, which is the Bayes estimate of f_0, is given by $\hat{f}_n(\cdot) = \int f(\cdot|\theta)\,\pi_n(d\theta|(Y_i, X_i)^n)$.
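For concreteness, the k-expert conditional density defined in section 2.1 can be evaluated directly; the following is a minimal sketch (the array-based parameter layout is an illustrative choice, not the paper's notation):

```python
import numpy as np

def me_conditional_density(y, x, alpha, beta, u, v):
    """Conditional density f(y | x, theta) of a k-expert mixture of
    logistic regressions, as in section 2.1 (a sketch).

    alpha: (k,), beta: (k, s), u: (k,), v: (k, s); u[0] = 0, v[0] = 0
    are fixed for identifiability.  y is 0 or 1; x has shape (s,).
    """
    z = u + v @ x                     # gating scores u_j + v_j^T x
    g = np.exp(z - z.max())
    g /= g.sum()                      # softmax gating weights g_j
    p1 = 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))  # expert P(y = 1 | x)
    H = p1 if y == 1 else 1.0 - p1    # expert Bernoulli likelihoods H_j
    return float(g @ H)
```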
Let $\mu_0(x) = E_{f_0}[Y|X = x] = \sum_{y=0,1} y f_0(y|x)$ be the true regression function; then $\hat{\mu}_n(x) = E_{\hat{f}_n}[Y|X = x]$ is the estimated regression function. For now, we will let k = k_n be nonstochastic and possibly dependent on the sample size n, which explains the dependence of the prior on n. Later we will also consider the case when k = K is itself regarded as a random component of the parameter; a prior randomly decides to use an f from $\Gamma_k$ with probability P(K = k), k = 1, 2, 3, .... The prior densities on the θ-components are assumed to be independent normal with zero mean and common positive variance σ². (The results can be easily generalized to cases with different means and variances.)

2.3 Consistency. We first define consistency in regression function estimation, which we will call R-consistency.

Definition 1 (R-Consistency). $\hat{\mu}_n$ is asymptotically consistent for $\mu_0$ if
$$\int (\hat{\mu}_n(x) - \mu_0(x))^2\,dx \xrightarrow{P} 0 \quad \text{as } n \to \infty.$$

Here and below, convergence in probability of the form $q\{(Y_i, X_i)^n\} \xrightarrow{P} q_0$, for any quantity dependent on the observed data, means $\lim_{n\to\infty} P_{(Y_i,X_i)^n}\big[\,|q\{(Y_i, X_i)^n\} - q_0| \le \epsilon\,\big] = 1$ for all ε > 0, where $(Y_i, X_i)^n$ is an i.i.d. random sample from the true density f_0. This definition describes a desirable property: that the estimated regression function $\hat{\mu}_n$ is often (with $P_{(Y_i,X_i)^n}$ tending to one) close (in the $L_2$ sense) to the true $\mu_0$, for large n.

Now we define consistency in terms of the density function, which we will term D-consistency. First, for any ε > 0, define a Hellinger ε-neighborhood by $A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\}$, where $D_H(f, f_0) = \sqrt{\int\int (\sqrt{f} - \sqrt{f_0})^2\,dx\,dy}$ is the Hellinger distance.

Definition 2 (D-Consistency). Suppose $(Y_i, X_i)^n$ is an i.i.d. random sample from density f_0. The posterior is asymptotically consistent for f_0 over Hellinger neighborhoods if, for any ε > 0,
$$\Pr(A_\epsilon|(Y_i, X_i)^n) \xrightarrow{P} 1 \quad \text{as } n \to \infty.$$
That is, the posterior probability of any Hellinger neighborhood of f_0 converges to 1 in probability. This definition describes a desirable property: that the posterior-proposed joint density f is often close to the true f_0, for large n.
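As a small illustration (not from the paper), D_H between two such joint densities can be approximated by Monte Carlo over the uniform x, using the fact that y is binary; the sample size and the example regression function below are arbitrary:

```python
import numpy as np

def hellinger(mu, mu0, xs):
    """Hellinger distance between the joint densities on {0,1} x [0,1]^s
    determined by regression functions mu(x) = P(y=1|x), with x uniform.
    xs is an (N, s) Monte Carlo sample approximating the dx integral."""
    m, m0 = mu(xs), mu0(xs)
    sq = (np.sqrt(m) - np.sqrt(m0))**2 + (np.sqrt(1 - m) - np.sqrt(1 - m0))**2
    return float(np.sqrt(sq.mean()))

# toy check: identical regression functions give distance ~0
xs = np.random.rand(10000, 1)
mu = lambda x: 1.0 / (1.0 + np.exp(-(2 * x[:, 0] - 1)))
print(hellinger(mu, mu, xs))
```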
Now we define consistency in classification, which we will call C-consistency. Here we consider the use of the plug-in classification rule $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$ in predicting Y. We are interested in how the misclassification error $E_{(Y_i,X_i)^n} P\{\hat{C}_n(X) \ne Y|(Y_i, X_i)^n\}$ approaches the minimal error $P\{C_o(X) \ne Y\} = \inf_{C:\,\mathrm{Dom}(X)\to\{0,1\}} P\{C(X) \ne Y\}$, where $C_o(x) = I[\mu_0(x) > 1/2]$ is the ideal Bayes rule based on the (unknown) true mean function $\mu_0$.

Definition 3 (C-Consistency). Let $\hat{B}_n : \mathrm{Dom}(X) \to \{0,1\}$ be a classification rule that is computable based on the observed data $(Y_i, X_i)^n$. If $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat{B}_n(X) \ne Y|(Y_i, X_i)^n\} = P\{C_o(X) \ne Y\}$, then $\hat{B}_n$ is called a consistent classification rule.

It is straightforward to show that the three consistency concepts are related in our situation with binary data, where $\hat{\mu}_n$ and $\mu_0$ are bounded between [0, 1]:

Proposition 1 (Relations among the three consistencies). D-Consistency ⟹ R-Consistency ⟹ C-Consistency.

Proof. The first relation is due to lemma 4. The second relation is due to corollary 6.2 of Devroye, Györfi, and Lugosi (1996).

We will first establish D-consistency; the R- and C-consistencies then naturally follow.

3 Results and Conditions

We first consider the case when the number of experts K = k_n is a nonstochastic sequence depending on the sample size n.

Theorem 1 (Nonstochastic K). Let the prior for the parameters, π_n(dθ), be independent normal distributions with mean zero and fixed variance σ² for each parameter in the model. Let k_n be the number of experts in the model, such that

i. lim_{n→∞} k_n = ∞, and
ii. k_n ≤ n^a for all sufficiently large n, for some 0 < a < 1.

Then we have the following results:

a. The posterior distribution of the densities is D-consistent for f_0; that is, $\Pr(\{f : D_H(f, f_0) \le \epsilon\}|(Y_i, X_i)^n) \xrightarrow{P} 1$ as n → ∞, for all ε > 0.
b. The estimated regression function $\hat{\mu}_n$ is R-consistent for $\mu_0$; that is, $\int (\hat{\mu}_n - \mu_0)^2\,dx \xrightarrow{P} 0$ as n → ∞.
c. The plug-in classification rule $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$ is C-consistent for the Bayes rule $C_o(x) = I[\mu_0(x) > 1/2]$; that is, $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat{C}_n(X) \ne Y|(Y_i, X_i)^n\} = P\{C_o(X) \ne Y\}$.

Now we consider the case when the number of experts K is a random parameter. We will consider the possibility that K = k(I) is constructed out of a more basic random index I, which, for example, can follow the geometric or the Poisson distribution. The function k(·) will be called a contraction function. We will see the reason to introduce the contraction: the sufficient condition we propose on K requires a very thin tail probability; common distributions such as the geometric or Poisson can be used only after a tail-thinning contraction.

The densities f(y, x|k, θ) are now indexed by both the parameter vector θ and the number of experts k. The prior is $\pi(k, d\theta) = \tilde{\lambda}_k\,\pi(d\theta|k)$, where $\tilde{\lambda}_k = P[k(I) = k]$, and π(dθ|k) is again chosen to be the independent N(0, σ²) distribution on all components of θ. The posterior distribution is then
$$\pi(k, d\theta|(Y_i, X_i)^n) = \prod_{i=1}^{n} f(Y_i, X_i|k, \theta)\,\pi(k, d\theta) \Big/ \sum_{j=1}^{\infty} \int \prod_{i=1}^{n} f(Y_i, X_i|j, \theta)\,\pi(j, d\theta).$$
Then the predictive density, which is the Bayes estimate of f_0, is given by $\hat{f}_n(\cdot) = \sum_{k=1}^{\infty} \int f(\cdot|k, \theta)\,\pi(k, d\theta|(Y_i, X_i)^n)$. The corresponding estimated regression function is $\hat{\mu}_n(x) = \sum_{y=0,1} y \hat{f}_n(y|x)$. The plug-in classification rule is $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$.

Theorem 2 (Random K). Suppose the priors π(dθ|k) conditional on the number of experts are independent normal with mean 0 and fixed variance σ². Suppose the prior put on the number of experts k(I) satisfies the following conditions:

iii. P[k(I) = k] > 0 for all sufficiently large k.
iv. The tail probabilities decrease at a faster-than-geometric rate; that is, there exists q > 1 such that, fixing any r > 0, for all sufficiently large k, $P[k(I) \ge k] \le \exp(-k^q r)$.

Then we have the following results:

d. The posterior distribution of the densities is D-consistent for f_0; that is, $\Pr(\{f : D_H(f, f_0) \le \epsilon\}|(Y_i, X_i)^n) \xrightarrow{P} 1$ as n → ∞, for all ε > 0.
e. The estimated regression function $\hat{\mu}_n$ is R-consistent for $\mu_0$; that is, $\int (\hat{\mu}_n - \mu_0)^2\,dx \xrightarrow{P} 0$ as n → ∞.
f. The plug-in classification rule $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$ is C-consistent for the Bayes rule $C_o(x) = I[\mu_0(x) > 1/2]$; that is, $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat{C}_n(X) \ne Y|(Y_i, X_i)^n\} = P\{C_o(X) \ne Y\}$.
The super-geometrically thin tail condition iv cannot be satisfied directly if the number of experts follows some common distribution such as the geometric or Poisson. However, if one applies a contraction k(·) to a geometric or Poisson random variable, where k(·) grows very slowly, the probability of a large contracted k(I) can be made sufficiently small.

Remark 1. For example, consider contractions of the form $k(I) = \lfloor \chi(I) \rfloor + 1$, where $\lfloor u \rfloor$ represents the integer part of u, and χ(I) is a strictly and slowly increasing function. It is easy to confirm that when I is a geometric random variable, taking $\chi(I) = I^{1/q^{1+\delta}}$ (for some δ > 0 and q > 1) makes k(I) satisfy condition iv, after using the equation $P(I \ge B) = P(I > 0)^B$. When I is a Poisson random variable, taking $\chi(I) = \{\ln(I+1)\}^{1/q^{1+\delta}}$ (for some δ > 0 and q > 1) makes k(I) satisfy condition iv, after applying Chebyshev's inequality to obtain $P(I \ge B) \le EI/B$.

Below we first give the proofs of the main theorems. The lemmas used will be stated and proved later.

3.1 Proof of Theorem 1. The proof involves splitting the space $\Gamma_{k_n}$ of all k_n-expert densities into a limited part $F_n$ and an unlimited part $F_n^c$ and applying proposition 2 below. Let $F_n$ be the set of ME models with each parameter bounded by $C_n$ in absolute value; that is, $|u_j| \le C_n$, $|v_{jh}| \le C_n$, $|\alpha_j| \le C_n$, $|\beta_{jh}| \le C_n$, for j = 1, ..., k_n and h = 1, ..., s, where $C_n$ grows with n such that $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$ for some η > 0 and 0 < a < b < 1 (a is the same a as in the bound on k_n).
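As an aside, the tail-thinning effect of the contraction in remark 1 is easy to check by simulation. A small sketch (not from the paper; q, δ, and the geometric success probability are arbitrary, and condition iv is only claimed for all sufficiently large k):

```python
import numpy as np

q, delta = 2.0, 0.5
rng = np.random.default_rng(0)
I = rng.geometric(p=0.5, size=1_000_000)   # basic geometric index, I >= 1
K = np.floor(I ** (1.0 / q ** (1 + delta))).astype(int) + 1  # contracted k(I)

for k in range(2, 7):
    # empirical tail of k(I) vs. the super-geometric target exp(-k^q r), r = 1;
    # condition iv is asymptotic, so the comparison matters only for large k
    print(k, (K >= k).mean(), np.exp(-k ** q))
```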
Proposition 2 (Lee, 2000, theorem 2). Suppose the following conditions hold:

Tail condition 1. There exist an r > 0 and an N_1 such that $\pi_n(F_n^c) < \exp(-nr)$, ∀n ≥ N_1.

Entropy condition 2. There exists some constant c > 0 such that, ∀ε > 0, $\int_0^\epsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\epsilon^2$ for all sufficiently large n.

Approximation condition 3. For all γ, ν > 0, there exists an N_2 such that $\pi_n(KL_\gamma) \ge \exp(-n\nu)$, ∀n ≥ N_2.

Then the posterior is asymptotically consistent for f_0 over Hellinger neighborhoods,
that is, for any ε > 0,
$$\Pr(\{f : D_H(f, f_0) \le \epsilon\}|(Y_i, X_i)^n) \xrightarrow{P} 1 \quad \text{as } n \to \infty.$$
Here, for any γ > 0, define a Kullback-Leibler γ-neighborhood by $KL_\gamma = \{f : D_K(f, f_0) \le \gamma\}$, where $D_K(f, f_0) = \int\int f_0 \ln(f_0/f)\,dx\,dy$ is the Kullback-Leibler divergence.

This proposition was proved in Lee (2000, theorem 2), where the entropy condition was used but not stated explicitly. Here $H_{[\,]}(\cdot)$ is the Hellinger bracketing entropy defined in the following steps, where the family of functions is taken to be $F^* = \{\sqrt{f} : f \in F_n\}$, the set of square roots of densities from $F_n$, and the metric is the $L_2$-norm, so that $\|\sqrt{f} - \sqrt{g}\|_2 = D_H(f, g)$, the Hellinger distance, for any two densities f and g.

Definition 4 (Brackets and bracketing entropy).

i. For any two functions l and u, define the bracket [l, u] as the set of all functions f such that l ≤ f ≤ u.
ii. Let ‖·‖ be a metric. Define an ε-bracket as a bracket with ‖u − l‖ ≤ ε.
iii. Define the bracketing number of a set of functions F* as the minimum number of ε-brackets needed to cover F*, and denote it by $N_{[\,]}(\epsilon, F^*, \|\cdot\|)$.
iv. The bracketing entropy, denoted by $H_{[\,]}(\cdot) = \ln N_{[\,]}(\cdot, F^*, \|\cdot\|)$, is the natural logarithm of the bracketing number.

Now we prove theorem 1. Lemma 1 guarantees the tail condition. Lemma 2 guarantees the entropy condition. Lemma 3 guarantees the approximation condition. Therefore, we have D-consistency due to proposition 2. The R- and C-consistencies follow from proposition 1.

3.2 Proof of Theorem 2. Let $G_m$ be a restricted set of mixtures-of-m-experts models whose parameter components are all bounded by $C_n m$ in absolute value, with $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$ for some η > 0 and 0 < a < b < 1, where a = 1/q. Such restricted sets $G_m$ are nested due to proposition 3. We let $F_n = \cup_{k=1}^{k_n} G_k$, where $k_n = \lfloor (cn)^a \rfloor$, a = 1/q ∈ (0, 1), and c ∈ (0, 1].

Tail condition 1. Note that $\pi(F_n^c) = 1 - \pi(F_n) \le \sum_{k=k_n+1}^{\infty} \tilde{\lambda}_k + \sum_{k=1}^{k_n} \tilde{\lambda}_k\,\pi(\|\theta\|_\infty > C_n k \,|\, k)$, where $\|\theta\|_\infty$ is the maximum absolute value of all the θ-components. For all sufficiently large n and all r > 0, the tail probability $\sum_{k=k_n+1}^{\infty} \tilde{\lambda}_k$ is less than $e^{-nr}/2$ due to condition iv. All tail probabilities $\pi(\|\theta\|_\infty > C_n k \,|\, k)$ are also less than $e^{-nr}/2$, due to the choice of $C_n$ and the normality of π(dθ|k) (using Mill's ratio for normal tail probabilities).
Therefore, $\pi(F_n^c) \le e^{-nr}$ for all sufficiently large n and all r > 0, showing tail condition 1.

Entropy condition 2. Note that $F_n = \cup_{k=1}^{k_n} G_k = G_{k_n}$, since the sets of density functions represented by $G_k$ are increasing with k (see proposition 3). Then the entropy condition can be computed for $G_{k_n}$, where the bounds on the parameter values are now $C_n k_n$ instead of the previous bound $C_n$. Repeating the proof of the entropy condition as before shows that the condition still holds.

Approximation condition 3. Fix any γ > 0. Then $\pi(KL_\gamma) = \sum_{k=1}^{\infty} P(K = k)\,\pi(KL_\gamma|k) \ge P(K = k_n)\,\pi(KL_\gamma|k_n) > 0$, due to the positive $P(K = k_n)$ (guaranteed by condition iii of theorem 2) and the fact that $\pi(KL_\gamma|k_n) > e^{-n\nu}$, fixing any ν > 0, for all large enough n, which was proved for nonstochastic k_n before. (Here $k_n = \lfloor (cn)^a \rfloor$ is less than $n^a$ and increases to ∞.) Therefore, $\pi(KL_\gamma) \ge e^{-n\nu}$ for all sufficiently large n, fixing any ν > 0, since the left-hand side is positive and not dependent on n. This shows approximation condition 3.

All conditions of proposition 2 hold, and the D-consistency holds, which further implies the R- and C-consistency.

4 Lemmas Used for Proving the Theorems

In the first three lemmas, the number of experts k_n satisfies conditions i and ii of theorem 1. The prior π_n for the parameters is such that each parameter is an independent normal with mean 0 and fixed variance σ². The dimension of the parameters dim(θ) will be denoted by $d_n = (s+1)(2k_n - 1)$ both here and later in the proofs.

Lemma 1 (for the tail condition). There exists a constant r > 0 such that $\pi_n(F_n^c) < \exp(-nr)$ for all sufficiently large n. Here $F_n$ is the limited part of the k_n-experts family defined in section 3.1.

Lemma 2 (for the entropy condition). Consider the family of square-root densities $F^* = \{\sqrt{f} : f \in F_n\}$ defined in section 3.1. Then the following relations hold for the Hellinger bracketing entropy $H_{[\,]}(\cdot)$ of F*:

a. $H_{[\,]}(u) \le \ln\big[\big(4C_n^2 d_n/u\big)^{d_n}\big]$.
b. There exists a constant c > 0 such that, ∀ε > 0, $\int_0^\epsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\epsilon^2$ for all sufficiently large n.
Lemma 3 (for the approximation condition). For all γ, ν > 0, there exists an N_2 such that $\pi_n(KL_\gamma) \ge \exp(-n\nu)$, ∀n ≥ N_2. Here $KL_\gamma$ is the Kullback-Leibler neighborhood defined in section 3.1.

The following lemma holds whether or not the number of experts is random. Using the notation of sections 2.2 and 2.3, we have:

Lemma 4 (regression function versus density function).

a. $\int (\hat{\mu}_n - \mu_0)^2\,dx \le 4 D_H^2(\hat{f}_n, f_0)$.
b. $D_H^2(\hat{f}_n, f_0) \le \epsilon^2 + 4\pi_n[\{f : D_H(f, f_0) > \epsilon\}|(Y_i, X_i)^n]$, ∀ε > 0.

The next proposition is used to form the nested sequence of restricted models in section 3.2 for proving consistency with a random number of experts.

Proposition 3 (Nesting). Let $G_m = \Gamma_m \cap \{f : |\theta_l| < Cm,\ \forall 1 \le l \le \dim(\theta)\}$ for some C ≥ 1 not dependent on m, which is a restricted set of m-expert models with parameters bounded by Cm. If $m' \ge m$, then $G_m \subseteq G_{m'}$. Here $\Gamma_m$ is the m-expert family defined in section 2.1.

The proofs of these results are contained in the appendix.

5 Conclusions

Our work shows that Bayesian inference based on mixtures of logistic regressions can be a reliable tool for estimating the regression function and the joint density, as well as for binary classification. We expect that analogous properties may be studied in multiway classification, where multinomial logistic regression models are mixed. This, as well as Bayesian inference based on mixtures of generalized linear models (such as mixtures of Poisson and gamma regressions), forms a natural topic for future research. So far, we have focused on classification rules of the form $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$. However, as a referee points out, the concept of C-consistency can also be extended to rules of the form $\hat{C}_n(x) = I[\hat{\mu}_n(x) > r]$ for some r ∈ (0, 1), which may be useful in situations with asymmetric costs, as when misclassifying Y = 1 as 0 costs more than misclassifying Y = 0 as 1.

A long-standing question in ME theory is the selection of the number of experts (or mixing components) K. The current work provides insight from the view of Bayesian inference, from either choosing a nonstochastic sequence K = k_n dependent on sample size n or treating K as random and placing a suitable prior on K. The latter approach is especially interesting, since it can generate a posterior distribution on K conditional on the observed data: $\pi(K|data) = \int_\theta \pi(K, d\theta|data)$. This method of inference on K
is in some sense robust and protective against model misspecification; it does not need to assume a true model with a true number of experts k_0. The true model is a nonparametric one with an arbitrary smooth relation in the family $\Gamma$. In general, there is no "true number of experts" K. What are proposed by π(K|data) are "good K's" rather than a "true K"; they are the K's for some good approximating models from the ME family.

It may also be interesting to consider random K with a finite prior distribution, with support increasing with n. This in some sense combines the approaches of the two theorems. The motivation is that we would like the number of experts K to be random in order to search over a range of values. On the other hand, we would like K to be not too large, in order to reduce computation. (A large K would correspond to a high-dimensional parametric model.) There are several possibilities leading to consistent Bayesian inference. One can use a truncated prior $P[K = k] = P[k(I) = k]\,I[k \le B_n]/P[k(I) \le B_n]$, k = 1, 2, .... Here k(I) satisfies conditions iii and iv of theorem 2 and can be, for example, one of the contracted Poisson or contracted geometric random variables described in remark 1. The truncation bound can be taken to be, for example, $B_n = 2(cn)^{1/q} + 1$, where q > 1 is the same as in condition iv and c ∈ (0, 1]. One can also use a uniform prior, such as $P[K = k] = \lfloor (cn)^a \rfloor^{-1} I[k \le (cn)^a]$, k = 1, 2, ..., for some a ∈ (0, 1), c ∈ (0, 1]. Both can easily be shown to lead to consistent Bayesian inference by adapting the proof of theorem 2.

Appendix: Secondary Propositions and Proofs

Denote $f = f(y|x; k, \theta)$ for a mixture-of-k-experts (conditional) density. Then the following two propositions hold for any (k, θ) and for any (y, x) ∈ {0,1} × [0,1]^s, which will be useful later.

Proposition 4. f ≤ 1.

Proposition 5. $|\partial \ln f / \partial \theta_l| \le 1$, where $\theta_l$ is the lth element of θ, for each l = 1, ..., dim(θ).
The following lemma will be used for proving lemma 3:

Lemma 5. Let f be the mixture-of-experts model with parameters $(\theta_1, \ldots, \theta_{d_n})$ and $\tilde{f}$ be another mixture-of-experts model with parameters $(\tilde{\theta}_1, \ldots, \tilde{\theta}_{d_n})$. Suppose the numbers of experts of f and $\tilde{f}$ are both k_n, where k_n grows to infinity with n and k_n ≤ n^a for some 0 < a < 1, for all large enough n. Define a δ-neighborhood of f:
$$M_\delta^n(f) = \{\tilde{f} : |\theta_i - \tilde{\theta}_i| \le \delta,\ i = 1, 2, \ldots, d_n\}.$$
Then the following holds for any γ > 0: Given any $f_0 \in \Gamma$ (the "smooth nonparametric" family defined in section 2.1), for all sufficiently large n, there exist δ and f such that $M_\delta^n(f) \subseteq KL_\gamma$ (i.e., for any $\tilde{f} \in M_\delta^n(f)$, we have $D_K(\tilde{f}, f_0) \le \gamma$), where $\delta = \frac{\gamma}{4(s+1)n^a}$ and f is a k_n-expert density with parameter components satisfying $\max_{k=1}^{d_n} |\theta_k| \le c(\gamma) + \ln k_n$, for some constant c(γ) depending on γ but not on n.

Proof of Proposition 4. $f = \sum_{j=1}^{k} g_j H_j \le \sup_j H_j \big(\sum_j g_j\big) = \sup_j H_j \le 1$, since $H_j = \big(\frac{e^{\alpha_j+\beta_j^T x}}{1+e^{\alpha_j+\beta_j^T x}}\big)^y \big(\frac{1}{1+e^{\alpha_j+\beta_j^T x}}\big)^{1-y} \le 1$.
Proof of Proposition 5. Note that for each l = 1, ..., dim(θ),
$$\left|\frac{\partial \ln f}{\partial \theta_l}\right| = \left|\frac{\partial}{\partial \theta_l} \ln \sum_{j=1}^{k} g_j H_j\right| = \left|\frac{\sum_j \big[\frac{\partial}{\partial \theta_l} \ln(g_j H_j)\big]\,(g_j H_j)}{\sum_j g_j H_j}\right| \le \sup_j \left|\frac{\partial \ln(g_j H_j)}{\partial \theta_l}\right|.$$
Since for each j = 1, ..., k,
$$\ln(g_j H_j) = (u_j + v_j^T x) - \ln \sum_{l=1}^{k} e^{u_l + v_l^T x} + y(\alpha_j + \beta_j^T x) - \ln\big(1 + e^{\alpha_j + \beta_j^T x}\big),$$
it is easy to show that $|\partial \ln(g_j H_j)/\partial \theta_l| \le \max\{|x_1|, \ldots, |x_s|, 1\} = 1$. So $|\partial \ln f/\partial \theta_l| \le 1$.
Proof of Lemma 1.
$$\begin{aligned}
\pi_n(F_n^c) &= \Pr(\text{at least one } |\theta_l| > C_n,\ l = 1, \ldots, d_n) \le \sum_{l=1}^{d_n} \Pr(|\theta_l| > C_n) = \sum_{l=1}^{d_n} 2 \Pr\Big(\frac{\theta_l}{\sigma} > \frac{C_n}{\sigma}\Big) \\
&\le \frac{2 d_n \sigma}{C_n} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{C_n^2}{2\sigma^2}\Big) \qquad \text{(by Mill's ratio)} \\
&\le \exp\Big(-\frac{n^{1+2\eta}}{2\sigma^2} + \ln\big[(s+1)(2n^a - 1)(2\sigma/\sqrt{2\pi})\big]\Big) \qquad \text{(noting } k_n \le n^a\text{)} \\
&\le \exp(-nr) \text{ for any } r > 0, \text{ for all sufficiently large } n, \text{ since } \eta > 0.
\end{aligned}$$
Proof of Lemma 2a. Write $f_t = f(y, x|k, t)$ to show the dependence on a parameter vector valued at t, and let $\|t\|_\infty = \sup_{j=1}^{\dim(t)} |t_j|$ denote the $L_\infty$-norm of a vector t. Then
$$\big|\sqrt{f_t} - \sqrt{f_s}\big| = \Big|\sum_{l=1}^{d_n} \frac{\partial \sqrt{f_\theta}}{\partial \theta_l}(t_l - s_l)\Big| = \Big|\sum_{l=1}^{d_n} (t_l - s_l)\, \frac{1}{2}\, \frac{\partial \ln f_\theta}{\partial \theta_l}\, \sqrt{f_\theta}\Big| \le \frac{1}{2}\, d_n \|t - s\|_\infty$$
by propositions 4 and 5, where θ is an intermediate point between t and s. Since $C_n$ grows with n such that $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$, we have $C_n \ge 1$, so $|\sqrt{f_t} - \sqrt{f_s}| \le \|t - s\|_\infty \cdot C_n d_n/2$. Let $F(x, y) = C_n d_n/2$. By theorem 3 and equation 15 of Lee (2000), we have
$$N_{[\,]}(2\epsilon \|F\|_2, F^*, \|\cdot\|_2) \le N(\epsilon, F_n, L_\infty) \le \Big(\frac{C_n + 1}{\epsilon}\Big)^{d_n}.$$
Here, $N(\epsilon, F_n, \|\cdot\|)$ is the covering number, that is, the minimal number of balls of radius ε required to cover the set $F_n$ under a specified metric. Now, $2\epsilon\|F\|_2 = 2\epsilon \sqrt{\sum_{y=0,1} \int (C_n d_n/2)^2\,dx} = \sqrt{2}\,\epsilon\, C_n d_n$; replacing $2\epsilon\|F\|_2$ with ε gives
$$N_{[\,]}(\epsilon, F^*, \|\cdot\|_2) \le \Big(\frac{\sqrt{2}\,C_n(C_n+1)\, d_n}{\epsilon}\Big)^{d_n} \le \Big(\frac{4C_n^2 d_n}{\epsilon}\Big)^{d_n}.$$
Therefore, $H_{[\,]}(u) = \ln N_{[\,]}(u, F^*, \|\cdot\|_2) \le \ln\big[(4C_n^2 d_n/u)^{d_n}\big]$.

Proof of Lemma 2b. By the result of lemma 2a,
$$\begin{aligned}
\int_0^\epsilon \sqrt{H_{[\,]}(u)}\,du &\le \int_0^\epsilon \sqrt{d_n \ln\frac{4C_n^2 d_n}{u}}\,du \\
&= \sqrt{\frac{d_n}{2}} \int_\infty^{v(\epsilon)} v\,\big(-4C_n^2 d_n\, v\, e^{-v^2/2}\big)\,dv, \qquad \text{where } v(u) = \sqrt{2\ln\frac{4C_n^2 d_n}{u}}, \\
&= 4C_n^2 d_n \sqrt{\frac{d_n}{2}} \Big[\frac{\epsilon\, v(\epsilon)}{4C_n^2 d_n} + \int_{v(\epsilon)}^{\infty} e^{-v^2/2}\,dv\Big] \\
&\le 4C_n^2 d_n \sqrt{\frac{d_n}{2}} \Big[\frac{\epsilon}{4C_n^2 d_n}\sqrt{2\ln(4C_n^2 d_n/\epsilon)} + \frac{\sqrt{2\pi}\,\phi\big(\sqrt{2\ln(4C_n^2 d_n/\epsilon)}\big)}{\sqrt{2\ln(4C_n^2 d_n/\epsilon)}}\Big] \\
&= \epsilon \sqrt{\frac{d_n}{2}}\; \frac{1 + 2\ln(4C_n^2 d_n/\epsilon)}{\sqrt{2\ln(4C_n^2 d_n/\epsilon)}} \\
&\le 2\epsilon \sqrt{d_n\big[\ln C_n^2 + \ln(4d_n) - \ln \epsilon\big]}
\end{aligned}$$
for all large enough n. Noting that $d_n = (s+1)(2k_n - 1) \le 2(s+1)n^a$ and $C_n \le \exp(n^{b-a})$, we have
$$\text{l.h.s.} \le 2\epsilon \sqrt{2(s+1)n^a\big[2n^{b-a} + \ln 8(s+1) + \ln n^a - \ln \epsilon\big]}.$$
Since 0 < a < b < 1, there exists t such that 0 < a < t < b < 1 and b − a < 1 − t, so
$$\frac{1}{\sqrt{n}} \int_0^\epsilon \sqrt{H_{[\,]}(u)}\,du \le 2\epsilon \sqrt{n^{-t}\,2(s+1)n^a}\,\sqrt{n^{-(1-t)}\big[2n^{b-a} + \ln 8(s+1) + \ln n^a - \ln \epsilon\big]} \to 0 \quad \text{as } n \to \infty.$$
So there exists c > 0 such that, ∀ε > 0, $\int_0^\epsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\epsilon^2$ for all sufficiently large n.
Proof of Lemma 3. We use the neighborhood $M_\delta = M_\delta^n(f)$ of lemma 5 to prove the result. By lemma 5, we have $M_\delta \subseteq KL_\gamma$ for all sufficiently large n. Then,
$$\begin{aligned}
\pi_n(KL_\gamma) \ge \pi_n(M_\delta) &= \Pr\Big(\bigcap_{l=1}^{d_n} \{|\theta_l - \tilde{\theta}_l| \le \delta\}\Big) = \prod_{l=1}^{d_n} \int_{\theta_l - \delta}^{\theta_l + \delta} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{u^2}{2\sigma^2}\Big)\,du \\
&\ge \prod_{l=1}^{d_n} \frac{2\delta}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(|\theta_l| + \delta)^2}{2\sigma^2}\Big) \ge \prod_{l=1}^{d_n} \frac{2\delta}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(c(\gamma) + \ln k_n + \delta)^2}{2\sigma^2}\Big) \\
&= \exp\Big(-d_n \Big[\frac{(c(\gamma) + \ln k_n + \delta)^2}{2\sigma^2} + \ln \frac{\sqrt{2\pi\sigma^2}}{2\delta}\Big]\Big) \\
&\ge \exp\Big(-n^a\, 2(s+1)\, \frac{(2\ln n^a)^2}{2\sigma^2}\Big) \qquad \text{for all large enough } n, \text{ using } k_n \le n^a,\ \delta = \tfrac{\gamma}{4(s+1)n^a}, \\
&\ge \exp(-n\nu) \qquad \text{for all large enough } n, \text{ fixing any } \nu > 0, \text{ since } a \in (0, 1).
\end{aligned}$$
Proof of Lemma 4a.
$$\begin{aligned}
\int (\hat{\mu}_n - \mu_0)^2\,dx &= \int \Big(\int y(\hat{f}_n - f_0)\,dy\Big)^2 dx = \int \Big(\int y\big(\sqrt{\hat{f}_n} + \sqrt{f_0}\big)\big(\sqrt{\hat{f}_n} - \sqrt{f_0}\big)\,dy\Big)^2 dx \\
&\le \int \Big[\int y^2 \big(\sqrt{\hat{f}_n} + \sqrt{f_0}\big)^2\,dy\Big]\Big[\int \big(\sqrt{\hat{f}_n} - \sqrt{f_0}\big)^2\,dy\Big]\,dx,
\end{aligned}$$
since
$$\int y^2 \big(\sqrt{\hat{f}_n} + \sqrt{f_0}\big)^2\,dy = \int y^2 \big(\hat{f}_n + f_0 + 2\sqrt{\hat{f}_n f_0}\big)\,dy \le 2 \int y^2 (\hat{f}_n + f_0)\,dy = 2 \sum_{y=0}^{1} y^2 (\hat{f}_n + f_0) \le 4$$
by proposition 4. Then $\int (\hat{\mu}_n - \mu_0)^2\,dx \le 4 \int\int \big(\sqrt{\hat{f}_n} - \sqrt{f_0}\big)^2\,dy\,dx = 4 D_H^2(\hat{f}_n, f_0)$.

Proof of Lemma 4b. Denote $\hat{f}_n = \int f\,\pi_n(d\theta|(Y_i, X_i)^n) = E_{\theta|\cdot}\, f$, and denote $A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\}$ as in section 2.3. Then,
$$\begin{aligned}
D_H^2(\hat{f}_n, f_0) &= \int\int \big(\sqrt{E_{\theta|\cdot} f} - \sqrt{f_0}\big)^2\,dy\,dx = \int\int \big(f_0 + E_{\theta|\cdot} f - 2\sqrt{f_0\,E_{\theta|\cdot} f}\big)\,dy\,dx \\
&= 2 - 2\int\int \sqrt{E_{\theta|\cdot}(f f_0)}\,dy\,dx \\
&\le 2 - 2\int\int E_{\theta|\cdot} \sqrt{f f_0}\,dy\,dx \qquad \text{(by Jensen's inequality)} \\
&= E_{\theta|\cdot}\Big[2 - 2\int\int \sqrt{f f_0}\,dy\,dx\Big] \qquad \text{(by Fubini's theorem)} \\
&= E_{\theta|\cdot} \int\int \big(f + f_0 - 2\sqrt{f f_0}\big)\,dy\,dx = E_{\theta|\cdot} \int\int \big(\sqrt{f} - \sqrt{f_0}\big)^2\,dy\,dx = E_{\theta|\cdot}\, D_H^2(f, f_0) \\
&= \int_{A_\epsilon} D_H^2(f, f_0)\,\pi_n(d\theta|(Y_i, X_i)^n) + \int_{A_\epsilon^c} D_H^2(f, f_0)\,\pi_n(d\theta|(Y_i, X_i)^n) \\
&\le \epsilon^2 + \int_{A_\epsilon^c} \Big[\int\int \big(f + f_0 - 2\sqrt{f f_0}\big)\,dy\,dx\Big]\,\pi_n(d\theta|(Y_i, X_i)^n) \\
&\le \epsilon^2 + 2\int_{A_\epsilon^c} \Big[\int\int (f + f_0)\,dy\,dx\Big]\,\pi_n(d\theta|(Y_i, X_i)^n) = \epsilon^2 + 4\int_{A_\epsilon^c} \pi_n(d\theta|(Y_i, X_i)^n) \\
&= \epsilon^2 + 4\pi_n(\{f : D_H(f, f_0) > \epsilon\}|(Y_i, X_i)^n).
\end{aligned}$$
It is easy to see that lemmas 4a and 4b also hold for the case with a random number of experts k, by augmenting the integration over dθ with a sum over k.

Proof of Lemma 5. Jiang and Tanner (1999b, theorem 2) state that $\sup_{f_0 \in \Gamma} \inf_{f \in \Gamma_k} D_K(f, f_0) \le \frac{c}{k^{4/s}}$ for some c > 0 independent of k, for each k = 1, 2, 3, .... Here, s = dim(x). Therefore, given any γ > 0, for any true model $f_0 \in \Gamma$, there exists a k*-experts model $f^* \in \Gamma_{k^*}$, with k* large enough, such that
$$D_K(f^*, f_0) \le \frac{c}{(k^*)^{4/s}} + \frac{\gamma}{4} < \gamma/2.$$
This k*-experts density f* can be written as a k_n-experts density f (k_n > k* for all large enough n) if we let
$$u_j = u_j^*, \quad j = 1, \ldots, k^* - 1; \qquad u_j = u_{k^*}^* - \ln(k_n - k^* + 1), \quad j = k^*, \ldots, k_n,$$
and
$$(\alpha_j, \beta_j, v_j) = (\alpha_j^*, \beta_j^*, v_j^*), \quad j = 1, \ldots, k^* - 1; \qquad (\alpha_j, \beta_j, v_j) = (\alpha_{k^*}^*, \beta_{k^*}^*, v_{k^*}^*), \quad j = k^*, \ldots, k_n.$$
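This embedding is easy to check numerically: replicating the last expert m = k_n − k* + 1 times while subtracting ln(m) from its gating bias leaves the mixture density unchanged, since the m copies of the gating weight sum back to the original weight. A minimal sketch (the helper function, random parameters, and dimensions are illustrative; the identifiability constraint u_1 = v_1 = 0 is ignored because it does not affect the identity):

```python
import numpy as np

def me_density(y, x, alpha, beta, u, v):
    # k-expert conditional density of section 2.1 (sketch)
    g = np.exp(u + v @ x); g /= g.sum()
    p1 = 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))
    return float(g @ (p1 if y == 1 else 1.0 - p1))

rng = np.random.default_rng(1)
k_star, k_n, s = 3, 7, 2
alpha, beta = rng.normal(size=k_star), rng.normal(size=(k_star, s))
u, v = rng.normal(size=k_star), rng.normal(size=(k_star, s))
m = k_n - k_star + 1                       # number of copies of expert k*
# embed: replicate the last expert, shifting its gating bias by -ln(m)
alpha2 = np.concatenate([alpha[:-1], np.repeat(alpha[-1], m)])
beta2  = np.vstack([beta[:-1], np.tile(beta[-1], (m, 1))])
u2     = np.concatenate([u[:-1], np.repeat(u[-1] - np.log(m), m)])
v2     = np.vstack([v[:-1], np.tile(v[-1], (m, 1))])

x = rng.random(s)
print(me_density(1, x, alpha, beta, u, v),
      me_density(1, x, alpha2, beta2, u2, v2))  # equal up to rounding
```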
In these equations, the u's, v's, α's, and β's are components of the parameter θ for density f; the u*'s, v*'s, α*'s, and β*'s are components of the parameter θ* for density f*. This parameterization for embedding is explained in the proof of proposition 3. This implies that there also exists a k_n-experts model f such that $D_K(f, f_0) = D_K(f^*, f_0) < \gamma/2$ for all sufficiently large n. Let θ* and θ denote the vectors of parameters in the k*-experts and k_n-experts models, respectively. From the above parameter settings, we have
$$\|\theta\|_\infty \le \max\{\|\theta^*\|_\infty,\ |u_{k^*}^*| + \ln(k_n - k^* + 1)\} \le c(\gamma) + \ln k_n$$
for some constant c(γ) possibly dependent on γ. Now consider any k_n-expert model $\tilde{f} \in M_\delta^n(f)$. Note that
$$D_K(\tilde{f}, f_0) = \int\int f_0 \ln\frac{f_0}{\tilde{f}}\,dx\,dy = \int\int f_0 \ln\frac{f_0}{f}\,dx\,dy + \int\int f_0 \ln\frac{f}{\tilde{f}}\,dx\,dy = D_K(f, f_0) + \int\int f_0 \ln\frac{f}{\tilde{f}}\,dx\,dy,$$
where the first term is less than γ/2 by the choice of f.
Figure 8: Characteristics of polychronous groups (total number is 5269 in a network of 1000 neurons). (Top left) An example of a polychronous group. (Top right) Distribution of group sizes, that is, the number of neurons that form each group. The example group has size 15. (Bottom left) Distribution of groups’ time spans—the time difference between the firing of the first and the last neuron in the group. The example group has time span 56 ms. (Bottom right) Distribution of the longest paths in the groups. The example group has a path of length 5.
131 different groups. Because different groups activate at different times, a neuron can fire with one group at one time and with another group at another time.

3.3 Groups Share Neurons. Quite often, different polychronous groups share more than one neuron. Two such cases are illustrated in Figure 9. Neurons (8, 103, 351, 609, 711, 883) belong to two different groups in the upper half of the figure. However, there is no ambiguity, because their firing order is different: the neurons fire with one spike-timing pattern at one time (when the first group is activated), with the other pattern at some other time (when the second group is activated), and with no pattern most of the time. The lower half of the figure depicts two groups having eight neurons in common and firing with different spike-timing patterns. In addition, neurons 838 and 983 fire twice during activation of the second group. Again, there is no ambiguity here, because each polychronous group is defined not only by its constituent neurons but also by their precise spike times.

As an extreme illustration of this property, consider a toy fully connected network of 5 neurons in Figure 10A. In principle, such a network can exhibit 5! = 120 different firing patterns if we require that each neuron fires only
[Figure 9 content. Upper pair of groups: neurons (103, 711, 609, 8, 883, 351) with firing pattern (ms) (0, 1, 12, 20, 25, 34); neurons (609, 711, 103, 8, 883, 351) with firing pattern (ms) (0, 1, 2, 20, 25, 60). Lower pair of groups: neurons (77, 779, 416, 474, 785, 838, 983, 877) with firing pattern (ms) (0, 0, 2, 14, 15, 16, 20, 30); neurons (77, 416, 779, 785, 474, 838, 983, 877, 838, 983) with firing pattern (ms) (0, 4, 5, 15, 18, 20, 21, 31, 43, 66).]
Figure 9: Two examples of pairs of groups consisting of essentially the same neurons but firing with different spike-timing patterns; see also Figure 10. Neurons of interest are marked with numbers. The lists of neurons and their firing times are provided for each group in the upper right corners.
once and we distinguish the patterns only on the basis of the order of firing of the neurons. If we allow for multiple firings, then the number of patterns explodes combinatorially. However, the connectivity among the neurons imposes a severe restriction on the possible sustained and reproducible firing patterns, essentially excluding most of them. Random delays in the network would result in one, and sometimes two, polychronous groups.
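The counting claims are easy to verify; a small sketch (the once-or-twice cap on firings below is an arbitrary illustration of the combinatorial explosion):

```python
from itertools import product
from math import factorial, prod

# Firing-order patterns of 5 neurons, each firing exactly once:
print(factorial(5))  # 120

# If each neuron may fire once or twice, the number of distinct firing
# orders (multiset permutations over all firing-count assignments) explodes:
total = sum(factorial(sum(c)) // prod(factorial(ci) for ci in c)
            for c in product([1, 2], repeat=5))
print(total)  # 291720
```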
Figure 10: (A) A toy network of five neurons with axonal conduction delays and strong synapses (two simultaneous spikes are enough to fire any neuron). (B) The delays in the network are optimized so that it has 14 polychronous groups, including two cyclic groups that exhibit reverberating activity with a time (phase) shift.
The delays in Figure 10A are not random; they were constructed to maximize the number of polychronous groups. Although there are only five neurons, the network has 14 polychronous groups, shown in Figure 10B. Adding a sixth neuron triples the number of groups, so that there are more groups than synapses in the network. Considering toy examples like this one, it would not be surprising if a network of 10^11 neurons (the size of the human brain) had more groups than there are particles in the universe.
Figure 11: Distribution of frequencies of activation of groups in the simulated and surrogate (inverted time) spike trains. Each group is found using anatomical data (connectivity and delays) and then used as a template to scan through the spike train. The group is said to activate when more than 50% of its neurons polychronize, that is, fire with the prescribed spike-timing pattern with ±1 ms jitter, as in Figure 6. Surrogate data emphasize the statistical significance of these events.
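The scanning criterion described in the caption can be made concrete. Below is a minimal sketch of such template matching (a plausible reading of the procedure, not the author's code: the data structures, the anchoring on the group's first spike, and the uniform treatment of all neurons are assumptions):

```python
import numpy as np

def half_activated(template, spikes, jitter=1.0, frac=0.5):
    """Scan a spike train for one polychronous group (a sketch).

    template: list of (neuron_id, t_offset_ms) for the group's spikes;
    spikes:   dict neuron_id -> sorted array of firing times (ms).
    Returns onset times at which at least `frac` of the template spikes
    occur within +/- jitter of their prescribed offsets.
    """
    anchor_id, anchor_dt = template[0]
    onsets = []
    for t0 in spikes.get(anchor_id, []):          # candidate anchor spikes
        hits = 0
        for nid, dt in template:
            times = spikes.get(nid, np.array([]))
            want = t0 - anchor_dt + dt            # prescribed firing time
            if times.size and np.min(np.abs(times - want)) <= jitter:
                hits += 1
        if hits >= frac * len(template):
            onsets.append(t0 - anchor_dt)
    return onsets
```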
3.4 Activation of Groups. Our definition of a polychronous group relies on the anatomy of the network, not on its dynamics. (Of course, the former and the latter are dependent via STDP.) We say that a group is half-activated when at least 50% of its constituent excitatory neurons polychronize (i.e., fire according to the prescribed spike-timing pattern with ±1 ms spike jitter). For example, the group in Figure 6 is 63% activated because 12 of its 19 excitatory neurons polychronized. Once all the groups are found using the anatomical data (connectivity and delays), we use each group as a template, scan the spiking data recorded during a 24-hour period, and count how many times the group is half-activated. We apply this procedure only to those groups (471 in total) that persist during the 24-hour period.

In Figure 11 we plot the distribution histogram of the average frequency of half-activation of polychronous groups. The mean activation frequency is 7 times per hour, that is, once every 8 minutes, implying that there is a spontaneous activation of some group roughly every second (8 × 60/471 ≈ 1 sec). Since an average neuron is a member of 131 different groups, 131 × 7 = 917 of its spikes per hour are part of the activation of a group, which is less than 4% of the total number of spikes (up to 25,000) fired during the hour. Thus, the majority of the spikes are noise, and only a tiny fraction is involved in polychrony. The only way to tell which is which is to consider these spikes in relation to the spikes of the other neurons.

To test the significance of our finding, we use surrogate data obtained from the spike data by inverting the time. Such a procedure does not change the mean firing rates, interspike histograms, magnitude of
cross-correlations, and other meaningful statistics of the spike trains. In particular, this approach is free from the criticism (Oram, Wiener, Lestienne, & Richmond, 1999; Baker & Lemon, 2000) that precise firing sequences appear exclusively by chance in spike rasters with covarying firing rates. The activation frequency of (noninverted) groups in the surrogate (inverted) spike raster, depicted as black bars in Figure 11, is much lower, indicating that group activations are statistically significant events.

We emphasize that our method of analysis of spike trains is drastically different from the one used to search for synfire chains in in vivo data. We do not search for patterns in the spike data: we know what the patterns are (from the connectivity matrix and delays); we just scan the spikes and count the occurrences of each pattern. Apparently, such an approach is feasible only in models.

3.5 Representations. What is the significance of polychronous groups? We hypothesize that polychronous groups could represent memories and experience. In the simulation above, no coherent external input to the system was present. As a result, random groups emerge; that is, the network generates random memories not related to any previous experience. However, coherent external stimulation builds certain groups that represent this stimulation, in the sense that the groups are activated when the stimulation is present. Different stimuli build different groups even when the same neurons are stimulated, as we illustrate in Figure 12.

Every second during a 20-minute period, we stimulate 40 neurons, 1, 21, 41, 61, ..., 781, either with the pattern (1, 2, ..., 40) ms or with the inverse pattern (40, ..., 2, 1) ms, as shown at the top of Figure 12. Initially, no groups starting with the stimulated neurons existed (we did not explore whether the stimulation activated any of the existing groups consisting of other neurons). However, after 20 minutes of simulation, 25 new groups emerged. Fifteen of them correspond to the first stimulus: they can be activated when the network is stimulated with the first pattern. The other 10 correspond to the second stimulus; that is, they can be activated when the network is stimulated with the second pattern. Thus, the groups represent the memory of the two input patterns, and their activation occurs when the network "sees" the corresponding patterns.

In Figure 13 we depict the time evolution of the largest group corresponding to the first pattern in Figure 12. Notice how the group annexes neurons, probably at the expense of other groups in the network. Further simulation shows that the initial portion of the group is relatively stable, but its tail expands and shrinks in an unpredictable manner.

Finally, not all groups corresponding to a pattern activate when the network is stimulated. Because the groups share some neurons and have excitatory and inhibitory interconnections, they are in a constant state of competition and cooperation. As a result, each presentation of a stimulus activates only two to three groups (15%), in a random manner.
Figure 12: Persistent stimulation of the network with two spatiotemporal patterns (top) results in the emergence of polychronous groups that represent the patterns; the first few neurons in each group are the ones being stimulated, and the rest of the group activates (polychronizes) whenever the patterns are present.
Figure 13: Time evolution (growth) of the last (largest) polychronous group in Figure 12 corresponding to stimulation pattern 1 (snapshots at 2, 3, 4, 5, 6, 7, 9, 10, and 20 minutes).
3.6 Rate to Spike-Timing Conversion. Neurons in the model use a spike-timing code to interact and form groups. However, the external input from sensory organs, such as retinal cells and hair cells in the cochlea, arrives as a rate code, that is, encoded in the mean firing frequency of spiking. How can the network convert rates to precise spike timings?

It is easy to see how rate-to-spike-timing conversion could occur at the onset of stimulation. As the input volley arrives, the neurons receiving stronger excitation fire first, and the neurons receiving weaker excitation fire later or not at all. This mechanism relies on the fact that there is a clear onset
Figure 14: Rate-code to spike-timing-code conversion by a spiking network with fast inhibition. (A) The rate-code input induces phasic inhibition in the network. As the excitatory neurons recover from the inhibition, the ones that receive the strongest input fire first, and the ones receiving the weakest input fire last (or do not fire at all). (B) The strength of the rate-code input determines the degree of hyperpolarization of the excitatory neurons. Those inhibited less fire sooner. This mechanism has the desired logarithmic scaling that makes the spike-timing code insensitive to the strength of the input (see the text for details). Open circles: excitatory neurons; filled circles: populations of inhibitory neurons.
This mechanism relies on the fact that there is a clear onset of stimulation, for example, after a visual saccade. What if the stimulation is tonic, without a clear beginning or end? We hypothesize that intrinsic rhythmicity generates internal "saccades" that can parse the rate input into spike timings, and we discuss three possible mechanisms by which this could be accomplished.

First, the intrinsic rhythmic activity can chop tonic input into "mean-firing-rate" packets. Since each packet has a well-defined onset, it can be converted into spike timings according to the mechanism described above: the neuron receiving the strongest input fires first, and so forth. This mechanism is similar, but not equivalent, to the mechanism proposed by Hopfield (1995).

The other two mechanisms rely on the action of inhibitory interneurons. In one case, depicted in Figure 14A, inhibitory neurons, being faster, fire first and inhibit the excitatory neurons, resulting in a long inhibitory postsynaptic potential (IPSP). The rate with which the excitatory neurons recover from the IPSP depends on their intrinsic properties and the strength of the overall external input. The stronger the input, the sooner a neuron fires after the IPSP. Again, the neuron receiving the strongest input fires first, and the neuron receiving the weakest input fires last.
In the third mechanism, depicted in Figure 14B, inputs with different firing rates produce IPSPs of different amplitudes in the excitatory neurons downstream. As the excitatory neurons recover from the inhibition, they become ready to fire spikes (due to some other tonic stimulation) at different moments: the neuron inhibited less fires first, and the neuron inhibited more fires last or not at all. Since the recovery from inhibition is nearly exponential, the system is relatively insensitive to the input scaling. That is, a stronger input results in firing with the same spike-timing pattern but with an earlier onset. Similarly, a weaker input does not change the spike-timing pattern but only delays its onset.

Let us illustrate this point using two linear dimensionless equations, v'_i = -v_i, i = 1, 2, that model the recovery of the membrane potentials from the inhibition v_i(0) = -I_i, where each I_i denotes the amplitude (peak) of the IPSP. The recovery is exponential, v_i(t) = -I_i e^{-t}, so the moment of time at which each membrane reaches a certain threshold value, say v = -1, is t_i = log I_i. If we scale the input by any factor k (i.e., k I_i), we translate each threshold moment by a constant, because log k I_i = log k + log I_i, and log k is the same for both neurons. Thus, regardless of the scaling of the input, the time difference log k I_1 - log k I_2 = log I_1 - log I_2 is invariant. In contrast to Hopfield (1995), we therefore do not need to postulate that the input is already somehow converted to the logarithmic scale. Synchronized inhibitory spiking implements the logarithmic conversion and makes the spike-timing response relatively insensitive to the input scaling.
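The invariance is easy to check numerically; the following sketch uses hypothetical IPSP amplitudes and an arbitrary scaling factor:

% Numerical check of the scaling invariance (a sketch): the
% threshold-crossing times t_i = log(I_i) differ by log(I1/I2)
% regardless of a common scaling factor k.
I = [8 2];                 % hypothetical IPSP peak amplitudes I1, I2
k = 5;                     % arbitrary common input scaling
t  = log(I);               % times to reach v = -1 from v(0) = -I
tk = log(k*I);             % the same with the scaled input
disp(t(1)  - t(2))         % log(8/2)   = 1.3863
disp(tk(1) - tk(2))        % log(40/10) = 1.3863: invariant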
3.7 Stimulus-Triggered Averages

Notice that synchronized inhibitory activity occurs during gamma frequency oscillations (see Figure 5). Thus, the network constantly converts rate code to spike-timing code (and back) via the gamma rhythm. Each presentation of a rate-code stimulus activates an appropriate polychronous group or groups that represent the stimulus. This activation is locked to the phase of the gamma cycle but not to the onset of the stimulus. We explain this point in Figure 15, which illustrates the results of a typical in vivo experiment in which a visual, auditory, or tactile stimulus is presented to an animal (we cannot simulate this with the present network because, among many other things, we do not model the thalamus, the structure that gates inputs to the cortex). Suppose that we record from neuron A belonging to a polychronous group representing the stimulus. Since the input comes at a random phase of the internal gamma cycle, the activation of the group occurs at random moments, resulting in a typical smeared stimulus-triggered histogram.
Figure 15: Noisy, unreliable response of neuron A is due to the unreliable activation of the group representing the stimulus. The activation is locked to the intrinsic gamma cycle, not to the onset of the stimulation, resulting in the smeared stimulus-triggered histogram. One needs to record from two or more neurons belonging to the same group to see the precise spike-timing patterns in the network (the group and the associated gamma rhythm are drawn by hand).
Such histograms have been interpreted by many biologists as "the evidence" of the absence of precise spike-timing patterns in the brain, since the only reliable effect that the stimulus evokes is an increased probability of firing of neuron A (i.e., an increase in its mean firing rate). Even recording from two or more neurons belonging to different groups would result in broad histograms and weak correlations among the neurons, because the groups rarely activate together, and when they do, they may activate at different cycles of the gamma rhythm. We see that "noisy," unreliable responses of individual neurons to stimuli are the result of noisy and unreliable activations of polychronous groups. Recordings of two or more neurons belonging to the same group are needed to see the precise patterns (relative to the gamma rhythm).

4 Discussion

Simulating a simple spiking network (the MATLAB code is in the appendix), we discovered a number of interesting phenomena. The most striking one is the emergence of polychronous groups—strongly interconnected groups of neurons having matching conduction delays and capable of firing stereotypical time-locked spikes with millisecond precision. Thus, such groups can be seen not only in anatomically detailed cortical models (Izhikevich et al., 2004) but also in simple spiking networks. Changing some of the parameters of the model twofold changes the number of groups that can be supported by the network but does not eliminate them completely. The self-organization of neurons into polychronous groups is a robust phenomenon that occurs despite the experimentalist's efforts to prevent it.
Our model is minimal; it consists of spiking neurons, axonal conduction delays, and STDP. All are well-established properties of the real brain. We hypothesize that unless the brain has an unknown mechanism that specifically prevents polychronization, real neurons in the mammalian cortex must also self-organize into such groups. In fact, all the evidence of reproducible spike-timing patterns (Abeles, 1991, 2002; Lindsey et al., 1997; Prut et al., 1998; Villa et al., 1999; Chang et al., 2000; Tetko & Villa, 2001; Mao et al., 2001; Ikegaya et al., 2004; Riehle et al., 1997; Beggs & Plenz, 2003, 2004; Volman, Baruchi, & Ben-Jacob, 2005) can be taken as evidence of the existence and activation of polychronous groups.

4.1 How Is It Different from Synfire Chains?

The notion of a synfire chain (Abeles, 1991; Bienenstock, 1995; Diesmann et al., 1999; Ikegaya et al., 2004) is probably the most beautiful theoretical idea in neuroscience. Synfire chains describe pools of neurons firing synchronously, not polychronously. Synfire activity relies on synaptic connections having equal delays or no delays at all. Though easy to implement, networks without delays are finite-dimensional and do not have dynamics rich enough to support persistent polychronous spiking. Indeed, in the context of synfire activity, the groups in Figure 9 could not be distinguished, and the network of five neurons in Figure 10 would have only one synfire chain showing reverberating activity (provided that all the delays are equal and sufficiently long). Bienenstock (1995) referred to polychronous activity as a synfire braid. Synfire chain research concerns the stability of synfire activity. Instead, we employ here population thinking (Edelman, 1987). Although many polychronous groups are short-lived, a huge number of them are constantly appearing. And although their activation is not reliable, there is a spontaneous activation every second in a network of 1000 neurons. Thus, the system is robust not in terms of individual groups but in terms of populations of groups.

4.2 How Is It Different from Hopfield-Grossberg Attractor Networks?

Polychronous groups are not attractors from the dynamical systems point of view (Hopfield, 1982; Grossberg, 1988). When activated, they result in stereotypical but transient activity that typically lasts three to four gamma cycles (less than 100 ms; see Figure 8). Once the stimulation is removed, the network does not return to a "default" state but continues to be spontaneously active.

4.3 How Is It Different from Feedforward Networks?

The anatomy of the spiking networks that we consider is not feedforward but reentrant (Edelman, 1987). Thus, the network does not "wait" for a stimulus to come but exhibits autonomous activity. Stimulation perturbs only the intrinsic activity, as happens in the mammalian brain. As a result, the network does not have a rigid stimulus-response function. The same stimulus can elicit
quite different responses because it activates a different (random) subset of polychronous groups representing the stimulus. Thus, the network operates in a highly degenerate regime (Edelman & Gally, 2001).

4.4 How Is It Different from Liquid-State Machines?

The dynamics of a network implementing the liquid-state-machine paradigm (Maass, Natschlaeger, & Markram, 2002) is purely stimulus driven. Such a network does not have short-term memory, and it cannot place an input in the context of previous inputs. The simple model presented here implements some aspects of liquid-state computing (e.g., it could be the liquid), but its response is not purely stimulus driven; it depends on the current state of the network, which in turn depends on the short-term and long-term experience and previous stimuli. This could be an advantage or a drawback, depending on the task that needs to be solved.

Let us discuss some interesting open problems and implementation issues that are worth exploring further:
- Finding groups: Our algorithm for finding polychronous groups considers various triplets firing with various spiking patterns and determines the groups that are initiated by the patterns. Because of the combinatorial explosion, it is extremely inefficient. In addition, we probably miss many groups that do not start with three neurons.
- Training: Our training strategy is the simplest and probably the least effective one: choose a set of "sensory" neurons, stimulate them with different spike-timing sequences, and let STDP form or select and reinforce appropriate groups. It is not clear whether this strategy is effective when many stimuli need to be learned.
- Incomplete activation: When a group is activated, whether in response to a particular stimulation or spontaneously, it rarely activates entirely. Typically, neurons at the beginning of the group polychronize, that is, fire with the precise spike-timing pattern imposed by the group connectivity, but the precision fades as activation propagates along the group. As a result, the connectivity in the tail of the group does not stabilize, so the group as a whole changes.
- Stability: Because of continuous plasticity, groups appear, grow (see Figure 13), live for a certain period of time, and then may suddenly disappear (Izhikevich et al., 2004). Thus, spontaneous activity of the network leads to a slow degradation of memory, and it is not clear how to prevent this.
- Sleep states: The network can switch randomly between different states. Some of them correspond to "vigilance" with gamma oscillations, and others resemble "sleep" states, similar to the one in Figure 5
(top). It is not clear whether such switching should be prevented or whether it provides certain advantages for the homeostasis of connections.
- Optimizing performance: Exploring the model, we encountered a regime in which the number of polychronous groups was greater than the number of synapses in the network. However, the network was prone to epileptic seizures, which eventually led to uncontrolled, completely synchronized activity. More effort is required to fine-tune the parameters of the model to optimize the performance of the network without inducing paroxysmal behavior.
- Context dependence: Propagation delays are assumed to be constant in the present simulation. In vivo studies have shown that axonal conduction velocity has submillisecond precision, but it also depends on the prior activity of the neuron during the last 100 ms; hence, it can change with time in a context-dependent manner (Swadlow, 1974; Swadlow & Waxman, 1975; Swadlow, Kocsis, & Waxman, 1980). Thus, a polychronous group may exist and be activated at one time but temporarily disappear at another because of the prior activity of its constituent neurons.
- Synaptic scaling: We assumed here that the maximal (cutoff) synaptic value is 10 mV, which is slightly more than half of the threshold value of the pyramidal neuron in the model. Since the average neuron in the network has 100 presynaptic sources, this implies that spikes from 2% of its presynaptic sources are enough to make it fire. It would be interesting, but computationally impossible at present, to estimate the number of different polychronous groups when each neuron has, say, 400 presynaptic sources, each with a maximal value of 2.5 mV. In this case, each group would be "wider," since at least four neurons (the same 2%) would be needed to fire any given postsynaptic cell.
- Network scaling: We simulated a network of 10^3 neurons and found 10^4 polychronous groups. How does the number of groups scale with the number of neurons? In particular, how many polychronous groups are there in a network of 10^11 neurons, each having 10^4 synapses? This is a fundamental question related to the information capacity of the human brain.
- Closing the loop: An important biological observation is that organisms are part of the environment. The philosophy at The Neurosciences Institute (the author's host institute) is "the brain is embodied, the body is embedded." Thus, to understand and simulate the brain, we need to give the neural network a body and put it into a real or virtual environment (Krichmar & Edelman, 2002). In this case, the network becomes part of a closed loop: the environment stimulates "sensory" neurons via sensory organs. Firings of these neurons combined with
the current state of the network (i.e., the context) activate appropriate polychronous groups, which excite "motor" neurons and produce appropriate movements in the environment (i.e., the response). The movements change the sensory input and close the causality loop.
- Reward and reinforcement learning: Some stimuli bring reward (not modeled here) and activate the value system (Krichmar & Edelman, 2002). This strengthens recently active polychronous groups (the groups that resulted in the reward) and increases the probability that the same stimuli will activate the same groups in the future and thereby bring more reward. Thus, in addition to passively learning input stimuli, the system can actively explore those stimuli that bring reward.
5 Cognitive Computations

Let us discuss possible directions of this research and its connection to studies of neural computation, attention, and consciousness. This section is highly speculative; it is motivated by, but not entirely based on, the simulations described above.

5.1 Synchrony: Good or Bad?

Much research on the dynamics of spiking and oscillatory networks is devoted to determining the conditions that ensure synchronization of the network activity. Many researchers (including this author until a few years ago) are under the erroneous assumption that synchronization is something good and desirable. But what kind of information processing could possibly go on in a network of synchronously firing neurons? Probably none, since the entire network acts as a single neuron. Here we treat synchronization (or polychronization) of all neurons in the network as an undesirable property that should be avoided. In fact, synchronization (or polychronization) should be so rare and difficult to occur by chance that when it happens, even transiently in a small subset of the network, it signifies something important, something meaningful: a stimulus is recognized, two or more features are bound, attention is paid. All these cognitive events are related to the activation of polychronous groups, as we discuss in this section.

5.2 Novel Model of Neural Computation

Most artificial neural network research concerns supervised or unsupervised training of neural nets, which consists of building a mapping from a given set of inputs to a given set of outputs. For example, the connection weights of the Hopfield-Grossberg model (Hopfield, 1982; Grossberg, 1988) are modified so that the given input patterns become attractors of the network. In these approaches, the network is "instructed" what to do.
In contrast, we take a different approach in this article. Instead of the instructionist approach, we employ a selectionist approach, known as the theory of neuronal group selection (TNGS) and neural Darwinism (Edelman, 1987). There are two types of selection constantly going on in the spiking network:
- Selection on the neuronal level: STDP selects subgraphs with matching conduction delays in an initially unstructured network, resulting in the formation of a large number of groups, each capable of polychronizing, that is, of generating a reproducible spike-timing pattern with millisecond precision. The number of coexisting polychronous groups, called the repertoire of the network, is potentially unlimited.
- Selection on the group level: Polychronous groups are representations of possible inputs to the network, so each input selects groups from the repertoire. That is, every time an input is presented to the network, a polychronous group (or groups) whose spike-timing pattern resonates with the input is activated (i.e., the neurons constituting the group polychronize).
Using the analogy with the immune system, where we have antibodies for practically all possible antigens, even those that do not exist on earth, we can take our point of view to the extreme and say that the network "has memories of all past and future events," with the past events corresponding to certain groups with assigned representations and the future events corresponding to the large, amorphous, potentially unlimited cloud of available groups with no representation. Learning a new input consists of selecting and reinforcing an appropriate group (or groups) that resonates with the input. Assigning a representation (meaning) to the group consists of potentiating the weak connections that link this group with other groups coactive at the same time, that is, putting the group in the context of the other groups that already have representations (see Figure 16). In this sense, each polychronous group represents its stimulus and the context. In addition, persistent stimuli may create new groups, as we show in section 3. In any case, the input constantly shapes the landscape of the groups present in the network, selecting and amplifying some groups and suppressing and destroying others.

The major result of this article is that spiking networks with delays have more groups than neurons. Thus, the system has a potentially enormous memory capacity and will never run out of groups, which could explain how networks of a mere 10^11 neurons (the size of the human neocortex) could have such a diversity of behavior. Of course, we need to learn how to use this extraordinary property in models.

5.3 Binding and Gamma Rhythm

Binding is discussed in detail by Bienenstock (1995) in the context of synfire activity
Figure 16: Due to the potentially unlimited number of coexisting polychronous groups, the system “has memories of all past and future events,” denoted by shaded and empty figures, respectively. (A) Past events are represented by the groups with assigned representations; they activate in response to specific inputs. Connections between the groups provide the context. (B) Experiencing a new event consists of selecting a group out of the amorphous set of “available” groups that resonates with the input. The context is provided by the potentiated connections between the group and recently active groups.
(see also the special issue of Neuron, September 1999, on the binding problem). Bienenstock's major idea is that dynamic binding of various features of a stimulus corresponds to the synchronization of synfire waves propagating along distinct chains. The synchronization is induced by weak reentrant synaptic coupling between these chains (see also Seth et al., 2004b). This idea is equally applicable to polychronous activity. In Figure 17 we illustrate what could happen when different groups representing different features of a stimulus are activated asynchronously (left) or time locked (right). In the former case, no specific temporal relationship would exist between the firings of neurons belonging to different groups, except that the firings would be correlated (they are all triggered by the common input). The dotted lines in Figure 17 (right) are the reentrant connections between groups that establish the context for each group. These connections would coordinate activations of the groups and would be responsible for the time locking (polychronization) in Figure 17 (right).
Figure 17: Time-locked activation of groups representing various features of a stimulus results in binding of the features and an increased gamma rhythm. Each group contributes a small gamma oscillation to the network gamma. (Left) The oscillations average out during asynchronous activation. (Right) The oscillations add up during time-locked activation. Dotted lines: weak reentrant connections between the groups that synchronize (or polychronize) their activation (the groups and the associated gamma rhythm are drawn by hand).
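The cancellation-versus-addition effect stated in the caption (and elaborated below) can be checked with a toy computation; this is a sketch, not derived from the network itself, and the choice of four contributions at 40 Hz is illustrative:

% Toy illustration: four gamma 'contributions' with random phases
% largely cancel, whereas phase-locked contributions add up.
t = 0:0.001:0.5;                                   % 0.5 s, 1 ms steps
f = 40;                                            % gamma frequency (Hz)
ph = 2*pi*rand(4,1);                               % random group phases
lfp_async  = sum(sin(2*pi*f*repmat(t,4,1) + repmat(ph,1,length(t))), 1);
lfp_locked = sum(sin(2*pi*f*repmat(t,4,1)), 1);    % identical phases
disp([std(lfp_async) std(lfp_locked)])             % locked sum is typically much larger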
In essence, the four groups in the figure would act as a single meta-group whose reproducible spike-timing pattern represents all features of the stimulus bound together into a whole.

Each group has a gamma signature indicated by the dashed boxes in Figure 17 (top left) and discussed in section 3 (see Figure 6). Activation of such a group produces a small oscillation of the local field potential (LFP) at the gamma frequency. When groups activate asynchronously, their LFPs have random phases and cancel each other. When groups activate polychronously during binding, their LFPs add up, resulting in a noticeable network gamma rhythm and increased synchrony (Singer & Gray, 1995).

5.4 Modeling Attention

The small size of the system does not allow us to explore other cognitive functions of spiking networks.
Figure 18: Stimuli A and B are both represented by pairs of polychronous groups with overlapping neurons. Selective attention to representation A (both groups representing A are active) does not inhibit neurons involved in representation B. Because the neurons are shared, they just fire with the spike-timing pattern corresponding to A.
In September 2005, the author simulated a detailed thalamocortical system having 10^11 spiking neurons (i.e., the size of the human brain), six-layer cortical microcircuitry, and specific, nonspecific, and reticular thalamic nuclei. One second of simulation took more than one month on a cluster of 27 3-GHz processors.

In a large-scale network, there could be many groups (more than the 15 depicted in Figure 12) that represent any particular input stimulus. The stimulus alone could activate only a small subset of the groups. However, weak reentrant connections among the groups may trigger a regenerative process leading to explosive activation of many other groups representing the stimulus, resulting in its perception (and possibly an increased gamma rhythm). These groups take up most of the neurons in the network, so that only relatively few neurons are available for the activation of any other group not related to the stimulus. We might say that the stimulus is the focus of attention. If two or more stimuli are present, then activation of the groups representing one stimulus essentially precludes the other stimuli from being attended. Remarkably, the groups corresponding to the unattended stimuli are not inhibited. The neurons constituting the groups fire, but with a different spike-timing pattern (see Figure 18). We hypothesize that this mutual exclusion is related to the phenomenon of selective attention.
In our view, attention is not a command that comes from a "higher" or "executive" center and tells the network which input to attend to. Instead, we view attention as an emergent property of the simultaneous and regenerative activation (via positive feedback) of a large subset of groups representing a stimulus, thereby impeding the activation of other groups corresponding to other stimuli. Multiple stimuli compete for the focus of attention, and the winner is determined by many factors, mostly the context.

5.5 Consciousness as Attention to Memory

When no stimulation is present, there is spontaneous activation of polychronous groups, as in Figure 11. We hypothesize that if the size of the network exceeds a certain threshold, a random activation of a few groups representing a previously seen stimulus may activate other groups representing the same stimulus, so that the total number of activated groups is comparable to the number activated when the stimulus is actually present. Not only would such an event exclude all the other groups not related to the stimulus from being activated, but from the network's point of view, it would be similar to the event in which the stimulus is actually present and is the focus of attention. One can say that the network "thinks" of the stimulus—that is, it pays attention to the memory of the stimulus. Such "thinking" resembles "experiencing" the stimulus. A sequence of spontaneous activations corresponding to one stimulus, then another, and so on may be related to the stream of primary (perceptual or sensory) consciousness (Edelman, 2004), which can be found in many nonhuman animals. Of course, it does not explain the higher-order (conceptual) consciousness of humans.

Appendix: The Model

The MATLAB code simulating the network activity is in Figure 19. The upper half of the program initializes the network, which takes approximately 30 seconds on a 1 GHz Pentium PC. The lower half of the program executes the model, taking 5 seconds to simulate 1 second of network activity. The actual time may vary depending on the firing rate of the neurons. The MATLAB code and an equivalent, 20-times-faster C++ code are also available on the author's Web page.

Let us describe the details of the model.

A.1 Anatomy

The network consists of N = 1000 neurons, with the first Ne = 800 being excitatory neurons of RS type and the remaining Ni = 200 inhibitory neurons of FS type (Izhikevich, 2003). The ratio of excitatory to inhibitory cells is 4 to 1, as in the mammalian neocortex. Each excitatory neuron is connected to M = 100 random neurons, so that the probability of connection is M/N = 0.1, again as in the neocortex. Each inhibitory neuron is connected to M = 100 excitatory neurons only. The indices of postsynaptic targets are in the N×M matrix post. The corresponding synaptic weights are in the N×M matrix s. Inhibitory weights are not plastic, whereas excitatory weights evolve according to the STDP rule discussed in the next section.
% spnet.m: Spiking network with axonal conduction delays and STDP
% Created by Eugene M. Izhikevich. February 3, 2004
M=100;                 % number of synapses per neuron
D=20;                  % maximal conduction delay
% excitatory neurons    % inhibitory neurons    % total number
Ne=800;                 Ni=200;                 N=Ne+Ni;
a=[0.02*ones(Ne,1);  0.1*ones(Ni,1)];
d=[   8*ones(Ne,1);    2*ones(Ni,1)];
sm=10;                 % maximal synaptic strength
post=ceil([N*rand(Ne,M);Ne*rand(Ni,M)]);
s=[6*ones(Ne,M);-5*ones(Ni,M)];       % synaptic weights
sd=zeros(N,M);                        % their derivatives
for i=1:N
  if i<=Ne
    for j=1:D
      delays{i,j}=M/D*(j-1)+(1:M/D);  % excitatory delays: 1 to D ms
    end;
  else
    delays{i,1}=1:M;                  % all inhibitory delays: 1 ms
  end;
  pre{i}=find(post==i&s>0);           % pre excitatory neurons
  aux{i}=N*(D-1-ceil(ceil(pre{i}/N)/(M/D)))+1+mod(pre{i}-1,N);
end;
STDP = zeros(N,1001+D);
v = -65*ones(N,1);                    % initial values
u = 0.2.*v;                           % initial values
firings=[-D 0];                       % spike timings
for sec=1:60*60*24                    % simulation of 1 day
  for t=1:1000                        % simulation of 1 sec
    I=zeros(N,1);
    I(ceil(N*rand))=20;               % random thalamic input
    fired = find(v>=30);              % indices of fired neurons
    v(fired)=-65;
    u(fired)=u(fired)+d(fired);
    STDP(fired,t+D)=0.1;
    for k=1:length(fired)
      sd(pre{fired(k)})=sd(pre{fired(k)})+STDP(N*t+aux{fired(k)});
    end;
    firings=[firings;t*ones(length(fired),1),fired];
    k=size(firings,1);
    while firings(k,1)>t-D
      del=delays{firings(k,2),t-firings(k,1)+1};
      ind = post(firings(k,2),del);
      I(ind)=I(ind)+s(firings(k,2), del)';
      sd(firings(k,2),del)=sd(firings(k,2),del)-1.2*STDP(ind,t+D)';
      k=k-1;
    end;
    v=v+0.5*((0.04*v+5).*v+140-u+I);  % for numerical
    v=v+0.5*((0.04*v+5).*v+140-u+I);  % stability, time
    u=u+a.*(0.2*v-u);                 % step is 0.5 ms
    STDP(:,t+D+1)=0.95*STDP(:,t+D);   % tau = 20 ms
  end;
  plot(firings(:,1),firings(:,2),'.');
  axis([0 1000 0 N]); drawnow;
  STDP(:,1:D+1)=STDP(:,1001:1001+D);
  ind = find(firings(:,1) > 1001-D);
  firings=[-D 0;firings(ind,1)-1000,firings(ind,2)];
  s(1:Ne,:)=max(0,min(sm,0.01+s(1:Ne,:)+sd(1:Ne,:)));
  sd=0.9*sd;
end;
Figure 19: MATLAB code of the spiking network with axonal conduction delays and spike-timing-dependent plasticity (STDP). It is available on the author’s Web page: www.izhikevich.com.
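As a quick sanity check of the listing, the neuron update at its core (described in section A.2 below) can be run for a single cell in isolation. This is a stand-alone sketch; the constant input of 10 is an arbitrary suprathreshold choice, not a value from the paper:

% Single RS neuron driven by constant input, using the same two
% 0.5 ms half-steps as the network code in Figure 19 (a sketch).
a=0.02; b=0.2; c=-65; d=8;            % RS parameters (section A.2)
v=-65; u=b*v; nspikes=0;
for t=1:1000                          % 1 s at 1 ms resolution
    I=10;                             % arbitrary suprathreshold input
    if v>=30                          % spike apex reached
        nspikes=nspikes+1; v=c; u=u+d;
    end
    v=v+0.5*((0.04*v+5)*v+140-u+I);   % two half-steps for
    v=v+0.5*((0.04*v+5)*v+140-u+I);   % numerical stability
    u=u+a*(b*v-u);
end
disp(nspikes)                         % spike count over 1 s of tonic input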
The derivatives of the synaptic weights are in the N×M matrix sd, though only the Ne×M block of the matrix is used. Each synaptic connection has a fixed integer conduction delay between 1 ms and D = 20 ms, where D is a parameter (M/D must be an integer in the model). We do not model modifiable delays (Huning, Glunder, & Palm, 1998; Eurich, Pawelzik, Ernst, Cowan, & Milton, 1999) or transmission failures (Senn, Schneider, & Ruf, 2002). The list of all synaptic connections from neuron i having delay j is in the cell array delays{i,j}. Our MATLAB implementation assigns a 1 ms delay to all inhibitory connections and 1 to D ms delays to the excitatory connections. Although the anatomy of the model is random, reflecting the connectivity within a cortical minicolumn, one can implement an arbitrarily sophisticated anatomy by specifying the matrices post and delays. The details of the anatomy do not matter in the rest of the MATLAB code and do not slow the simulation. Once the matrices post and delays are specified, the program initializes the cell arrays pre and aux. The former contains the indices of all excitatory neurons presynaptic to a given neuron, and the latter is an auxiliary table of indices needed to speed up the STDP implementation.

A.2 Spiking Neurons

Each neuron in the network is described by the simple spiking model (Izhikevich, 2003)

v' = 0.04v^2 + 5v + 140 − u + I,    (A.1)
u' = a(bv − u),    (A.2)

with the auxiliary after-spike resetting

if v ≥ +30 mV, then v ← c and u ← u + d.    (A.3)
Here the variable v represents the membrane potential of the neuron, and u represents a membrane recovery variable, which accounts for the activation of K+ ionic currents and the inactivation of Na+ ionic currents and provides negative feedback to v. After the spike reaches its apex at +30 mV (which is not to be confused with the firing threshold), the membrane voltage and the recovery variable are reset according to equation A.3. Depending on the values of the parameters, the model can exhibit the firing patterns of all known types of cortical neurons (Izhikevich, 2003). It can also reproduce all of the 20 most fundamental neurocomputational properties of biological neurons summarized in Figure 3 (see Izhikevich, 2004). We use (b, c) = (0.2, −65) for all neurons in the network. For excitatory neurons, we use the values (a, d) = (0.02, 8), corresponding to cortical pyramidal neurons exhibiting regular spiking (RS) firing patterns. For inhibitory neurons, we use the values (a, d) = (0.1, 2), corresponding to cortical
interneurons exhibiting fast spiking (FS) firing patterns. Better values of the parameters corresponding to different types of cortical neurons, as well as an explanation of the model, can be found in Izhikevich (2006).

The variable I in the model combines two kinds of input to the neuron: (1) random thalamic input and (2) spiking input from the other neurons. This is implemented via the N-dimensional vector I.

A.3 Spike-Timing-Dependent Plasticity

The synaptic connections in the model are modified according to the spike-timing-dependent plasticity (STDP) rule (Song et al., 2000). We use the simplest and the most effective implementation of this rule, depicted in Figure 4. If a spike from an excitatory presynaptic neuron arrives at a postsynaptic neuron (possibly making the postsynaptic neuron fire), then the synapse is potentiated (strengthened). In contrast, if the spike arrives right after the postsynaptic neuron fired, the synapse is depressed (weakened).

If pre- and postsynaptic neurons fire uncorrelated Poissonian spike trains, there are moments when the weight of the synaptic connection is potentiated and moments when it is depressed. We chose the parameters of the STDP curve so that depression is stronger than potentiation, and such a synaptic weight goes slowly to zero. Indeed, such a connection is not needed and should be eliminated. In contrast, if the presynaptic neuron often fires before the postsynaptic one, then the synaptic connection slowly potentiates. Indeed, such a connection causes the postsynaptic spikes and should be strengthened. In this way, STDP strengthens causal interactions in the network.

The magnitude of potentiation or depression depends on the time interval between the spikes. Each time a neuron fires, its variable STDP is reset to 0.1. Every millisecond, STDP is multiplied by 0.95, so that it decays to zero as 0.1e^{−t/20 ms}, according to the parameters in Figure 4. This function determines the magnitude of potentiation or depression.

For each fired neuron, we consider all its presynaptic neurons and determine the timings of the last excitatory spikes that arrived from these neurons. Since these spikes made the neuron fire, the synaptic weights are potentiated according to the value of STDP at the presynaptic neuron, adjusted for the conduction delay. This corresponds to the positive part of the STDP curve in Figure 4. Notice that the largest increase occurs for the spikes that arrived right before the neuron fired, that is, the spikes that actually caused the postsynaptic spike.

In addition, when an excitatory spike arrives at a postsynaptic neuron, we depress the synapse according to the value of STDP at the postsynaptic neuron. This corresponds to the negative part of the STDP curve in Figure 4. Indeed, such a spike arrived after the postsynaptic neuron fired, and hence the synapse between the neurons should be weakened. (The same synapse will be potentiated when the postsynaptic neuron fires.)
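The stated time constant is easy to verify; the following sketch compares the per-millisecond multiplicative decay used in the code with the exponential form given above:

% Quick check (a sketch): the update STDP <- 0.95*STDP is, to good
% accuracy, the stated exponential trace 0.1*exp(-t/20 ms).
t = 0:100;                      % ms since the last spike
trace  = 0.1 * 0.95.^t;         % what the code in Figure 19 computes
approx = 0.1 * exp(-t/20);      % the stated form, tau = 20 ms
disp(max(abs(trace - approx)))  % about 1e-3: the forms nearly coincide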
Instead of changing the synaptic weights directly, we change their derivatives sd and then update the weights once a second according to the rule s ← s + 0.01 + sd and sd ← 0.9sd, where 0.01 describes an activity-independent increase of synaptic weight needed to potentiate synapses coming to silent neurons (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Desai, Cudmore, Nelson, & Turrigiano, 2002). Thus, the synaptic change is not instantaneous but slow, taking many seconds to develop. We manually keep the weights in the range between 0 and sm, where sm is a parameter of the model, typically less than 10 mV.

Acknowledgments

Anil K. Seth and Bruno van Swinderen read the manuscript and made a number of useful suggestions. Gerald M. Edelman, Bernie J. Baars, Anil K. Seth, and Bruno van Swinderen motivated my interest in the studies of consciousness. The concept of consciousness as attention to memories was developed in conversations with Bruno van Swinderen. This research was supported by the Neurosciences Research Foundation.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M. (2002). Synfire chains. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1143–1146). Cambridge, MA: MIT Press.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. J. Neurophysiol., 84, 1770–1780.
Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. J. Neuroscience, 23, 11167–11177.
Beggs, J. M., & Plenz, D. (2004). Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. J. Neuroscience, 24, 5216–5229.
Bellen, A., & Zennaro, M. (2003). Numerical methods for delay differential equations. Oxford: Clarendon Press.
Bienenstock, E. (1995). A model of neocortex. Network: Comput. Neural Syst., 6, 179–224.
Braitenberg, V., & Schuz, A. (1991). Anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.
Bryant, H., & Segundo, J. (1976). Spike initiation by transmembrane current: A white-noise analysis. J. Physiol. (Lond.), 260, 279–314.
Buzsaki, G., Llinas, R., Singer, W., Berthoz, A., & Christen, Y. (Eds.). (1994). Temporal coding in the brain. New York: Springer-Verlag.
Chang, E. Y., Morris, K. F., Shannon, R., & Lindsey, B. G. (2000). Repeated sequences of interspike intervals in baroresponsive respiratory related neuronal assemblies of the cat brain stem. J. Neurophysiol., 84, 1136–1148.
Changeux, J. P., & Danchin, A. (1976). Selective stabilization of developing synapses as a mechanism for the recall and recognition. Cognition, 33, 25–62.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Desai, N. S., Cudmore, R. H., Nelson, S. B., & Turrigiano, G. G. (2002). Critical periods for experience-dependent synaptic scaling in visual cortex. Nature Neuroscience, 5, 783–789.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. New York: Basic Books.
Edelman, G. M. (1993). Neural Darwinism: Selection and reentrant signaling in higher brain function. Neuron, 10, 115–125.
Edelman, G. M. (2004). Wider than the sky: The phenomenal gift of consciousness. New Haven, CT: Yale University Press.
Edelman, G. M., & Gally, J. (2001). Degeneracy and complexity in biological systems. PNAS, 98, 13763–13768.
Eurich, C., Pawelzik, K., Ernst, U., Cowan, J., & Milton, J. (1999). Dynamics of self-organized delay adaptation. Phys. Rev. Lett., 82, 1594–1597.
Ferster, D., & Lindstrom, S. (1983). An intracellular analysis of geniculocortical connectivity in area 17 of the cat. Journal of Physiology, 342, 181–215.
Foss, J., & Milton, J. (2000). Multistability in recurrent neural loops arising from delay. J. Neurophysiol., 84, 975–985.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79, 2554–2558.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Huning, H., Glunder, H., & Palm, G. (1998). Synaptic delay learning in pulse-coupled neurons. Neural Computation, 10, 555–565.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304, 559–564.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14, 1569–1572.
Izhikevich, E. M. (2004). Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15, 1063–1070.
Izhikevich, E. M. (2006). Dynamical systems in neuroscience: The geometry of excitability and bursting. Cambridge, MA: MIT Press.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944.
Krichmar, J. L., & Edelman, G. M. (2002). Machine psychology: Autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebral Cortex, 12, 818–830.
Lindsey, B. G., Morris, K. F., Shannon, R., & Gerstein, G. L. (1997). Repeated patterns of distributed synchrony in neuronal assemblies. J. Neurophysiol., 78, 1714–1719.
Litvak, V., Sompolinsky, H., Segev, I., & Abeles, M. (2003). On the transmission of rate code in long feed-forward networks with excitatory-inhibitory balance. J. Neurosci., 23, 3006–3015.
Maass, W., Natschlaeger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14, 2531–2560.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mao, B.-Q., Hamzei-Sichani, F., Aronov, D., Froemke, R. C., & Yuste, R. (2001). Dynamics of spontaneous activity in neocortical slices. Neuron, 32, 883–898.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nat. Neurosci., 5, 463–471.
Miller, R. (1996a). Neural assemblies and laminar interactions in the cerebral cortex. Biol. Cybern., 75(3), 253–261.
Miller, R. (1996b). Cortico-thalamic interplay and the security of operation of neural assemblies and temporal chains in the cerebral cortex. Biol. Cybern., 75(3), 263–275.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79, 2857–2874.
Reinagel, P., & Reid, R. C. (2002). Precise firing events are conserved across neurons. J. Neurosci., 22, 6837–6841.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Salami, M., Itami, C., Tsumoto, T., & Kimura, F. (2003). Change of conduction velocity by regional myelination yields constant latency irrespective of distance between thalamus and cortex. PNAS, 100, 6174–6179.
Senn, W., Schneider, M., & Ruf, B. (2002). Activity-dependent development of axonal and dendritic delays, or, why synaptic transmission should be unreliable. Neural Computation, 14, 583–619.
Seth, A. K., McKinstry, J. L., Edelman, G. M., & Krichmar, J. L. (2004a). Spatiotemporal processing of whisker input supports texture discrimination by a brain-based device. In S. Schaal, A. Billard, S. Vijayakumar, J. Hallam, & J.-A. Meyer (Eds.), From animals to animats 8: Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
Seth, A. K., McKinstry, J. L., Edelman, G. M., & Krichmar, J. L. (2004b). Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device. Cerebral Cortex, 14, 1185–1199.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Stewart, I., Golubitsky, M., & Pivato, M. (2003). Symmetry groupoids and patterns of synchrony in coupled cell networks. SIAM J. Appl. Dynam. Sys., 2, 606–646.
Strehler, B. L., & Lestienne, R. (1986). Evidence on precise time-coded symbols and memory of patterns in monkey cortical neuronal spike trains. PNAS, 83, 9812–9816.
Swadlow, H. A. (1974). Systematic variations in the conduction velocity of slowly conducting axons in the rabbit corpus callosum. Experimental Neurology, 43, 445–451.
Swadlow, H. A. (1985). Physiological properties of individual cerebral axons studied in vivo for as long as one year. J. Neurophysiology, 54, 1346–1362.
Swadlow, H. A. (1988). Efferent neurons and suspected interneurons in binocular visual cortex of the awake rabbit: Receptive fields and binocular properties. J. Neurophysiol., 88, 1162–1187.
Swadlow, H. A. (1992). Monitoring the excitability of neocortical efferent neurons to direct activation by extracellular current pulses. J. Neurophysiol., 68, 605–619.
Swadlow, H. A. (1994). Efferent neurons and suspected interneurons in motor cortex of the awake rabbit: Axonal properties, sensory receptive fields, and subthreshold synaptic inputs. J. Neurophysiology, 71, 437–453.
Swadlow, H. A., Kocsis, J. D., & Waxman, S. G. (1980). Modulation of impulse conduction along the axonal tree. Ann. Rev. Biophys. Bioeng., 9, 143–179.
Swadlow, H. A., & Waxman, S. G. (1975). Observations on impulse conduction along central axons. PNAS, 72, 5156–5159.
Tetko, I. V., & Villa, A. E. P. (2001). A pattern grouping algorithm for analysis of spatiotemporal patterns in neuronal spike trains. 2: Application to simultaneous single unit recordings. Journal of Neuroscience Methods, 105, 15–24.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Villa, A. E., Tetko, I. V., Hyland, B., & Najem, A. (1999). Spatiotemporal activity patterns of rat cortical neurons predict responses in a conditioned task. Proc. Natl. Acad. Sci. USA, 96, 1106–1111.
Volman, V., Baruchi, I., & Ben-Jacob, E. (2005). Manifestation of function-follow-form in cultured neuronal networks. Physical Biology, 2, 98–110.
Whittington, M. A., Traub, R. D., Kopell, N., Ermentrout, B., & Buhl, E. H. (2000). Inhibition-based rhythms: Experimental and mathematical observations on network dynamics. Int. J. Psychophysiol., 38, 315–336.
Wiener, J., & Hale, J. K. (1992). Ordinary and delay differential equations. New York: Wiley.
Received January 31, 2005; accepted June 14, 2005.
LETTER
Communicated by Peter Dayan
Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia Randall C. O’Reilly
[email protected] Michael J. Frank
[email protected] Department of Psychology, University of Colorado Boulder, Boulder, CO 80309, U.S.A.
The prefrontal cortex has long been thought to subserve both working memory (the holding of information online for processing) and executive functions (deciding how to manipulate working memory and perform processing). Although many computational models of working memory have been developed, the mechanistic basis of executive function remains elusive, often amounting to a homunculus. This article presents an attempt to deconstruct this homunculus through powerful learning mechanisms that allow a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia, and amygdala, which together form an actor-critic architecture. The critic system learns which prefrontal representations are task relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems. The model's performance compares favorably with standard backpropagation-based temporal learning mechanisms on the challenging 1-2-AX working memory task and other benchmark working memory tasks.

Neural Computation 18, 283–328 (2006)    © 2005 Massachusetts Institute of Technology

1 Introduction

This letter presents a computational model of working memory based on the prefrontal cortex and basal ganglia (the PBWM model). The model represents a convergence of two logically separable but synergistic goals: understanding the complex interactions between the basal ganglia (BG) and prefrontal cortex (PFC) in working memory function, and developing a computationally powerful model of working memory that can learn to perform complex, temporally extended tasks. Such tasks require learning which information to maintain over time (and what to forget) and how to
assign credit or blame to events based on their temporally delayed consequences. The model shows how the prefrontal cortex and basal ganglia can interact to solve these problems by implementing a flexible working memory system with an adaptive gating mechanism. This mechanism can switch between rapid updating of new information into working memory and robust maintenance of information already being maintained (Hochreiter & Schmidhuber, 1997; O'Reilly, Braver, & Cohen, 1999; Braver & Cohen, 2000; Cohen, Braver, & O'Reilly, 1996; O'Reilly & Munakata, 2000). In the model, it is trained using a version of the reinforcement learning mechanisms that are widely thought to be supported by the basal ganglia (e.g., Sutton, 1988; Sutton & Barto, 1998; Schultz et al., 1995; Houk, Adams, & Barto, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001; Contreras-Vidal & Schultz, 1999; Joel, Niv, & Ruppin, 2002).

At the biological level of analysis, the PBWM model builds on existing work describing the division of labor between the prefrontal cortex and basal ganglia (Frank, Loughry, & O'Reilly, 2001; Frank, 2005). In this prior work, we demonstrated that the basal ganglia can perform dynamic gating via the modulatory mechanism of disinhibition, allowing only task-relevant information to be maintained in PFC and preventing distracting information from interfering with task demands. The mechanisms supporting such functions are analogous to the basal ganglia's role in modulating more primitive frontal systems (e.g., facilitating adaptive motor responses while suppressing others; Mink, 1996). However, to date, no model has attempted to address the more difficult question of how the BG "knows" what information is task relevant (which was hard-wired in prior models). The present model learns this dynamic gating functionality in an adaptive manner via reinforcement learning mechanisms thought to depend on the dopaminergic system and associated areas (e.g., nucleus accumbens, basal-lateral amygdala, midbrain dopamine nuclei). In addition, the prefrontal cortex representations themselves learn using both Hebbian and error-driven learning mechanisms, as incorporated into the Leabra model of cortical learning, which combines a number of well-accepted mechanisms into one coherent framework (O'Reilly, 1998; O'Reilly & Munakata, 2000).

At the computational level, the model is most closely related to the long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997; Gers, Schmidhuber, & Cummins, 2000), which uses error backpropagation to train dynamic gating signals. The impressive learning ability of the LSTM model compared with other approaches to temporal learning that lack dynamic gating argues for the importance of this kind of mechanism. However, it is somewhat difficult to see how LSTM itself could actually be implemented in the brain. The PBWM model shows how similarly powerful levels of computational learning performance can be achieved using more biologically based mechanisms. This model has direct implications for understanding executive dysfunction in neurological disorders such as attention deficit–hyperactivity disorder (ADHD) and Parkinson's disease, which involve the
interaction between dopamine, the basal ganglia, and the prefrontal cortex (Frank, Seeberger, & O'Reilly, 2004; Frank, 2005).

After presenting the PBWM model and its computational, biological, and cognitive bases, we compare its performance with that of several other standard temporal learning models, including LSTM, a simple recurrent network (SRN; Elman, 1990; Jordan, 1986), and real-time recurrent backpropagation learning (RBP; Robinson & Fallside, 1987; Schmidhuber, 1992; Williams & Zipser, 1992).

2 Working Memory Functional Demands and Adaptive Gating

The need for an adaptive gating mechanism can be motivated by the 1-2-AX task (see Figure 1; Frank et al., 2001), which is a complex working memory task involving both goals and subgoals and is used as a test case later in the article. Number and letter stimuli (1, 2, A, X, B, Y) appear one at a time in sequence, and the participant is asked to detect one of two target sequences, depending on whether he or she last saw a 1 or a 2 (which thus serve as "task" stimuli). In the 1 task, the target is A followed by X, and for 2, it is B followed by Y. Thus, the task demand stimuli define an outer loop of active maintenance (maintenance of task demands) within which there can be a number of inner loops of active maintenance for the A-X level sequences.

This task imposes three critical functional demands on the working memory system:
- Rapid updating: As each stimulus comes in, it must be rapidly encoded in working memory.
Figure 1: The 1-2-AX task. Stimuli are presented one at a time in a sequence. The participant responds by pressing the right key (R) to the target sequence; otherwise, a left key (L) is pressed. If the subject last saw a 1, then the target sequence is an A followed by an X. If a 2 was last seen, then the target is a B followed by a Y. Distractor stimuli (e.g., 3, C, Z) may be presented at any point and are to be ignored. The maintenance of the task stimuli (1 or 2) constitutes a temporal outer loop around multiple inner-loop memory updates required to detect the target sequence.
Figure 2: Illustration of active gating. When the gate is open, sensory input can rapidly update working memory (e.g., encoding the cue item A in the 1-2-AX task), but when it is closed, it cannot, thereby preventing other distracting information (e.g., distractor C) from interfering with the maintenance of previously stored information.
- Robust maintenance: The task demand stimuli (1 or 2) in the outer loop must be maintained in the face of interference from the ongoing processing of inner-loop stimuli and irrelevant distractors.
- Selective updating: Only some elements of working memory should be updated at any given time, while others are maintained. For example, in the inner loop, A's and X's should be updated while the task demand stimulus (1 or 2) is maintained.

The first two of these functional demands (rapid updating and robust maintenance) are directly in conflict with each other when viewed in terms of standard neural processing mechanisms, and they thus motivate the need for a dynamic gating mechanism to switch between these two modes of operation (see Figure 2; Cohen et al., 1996; Braver & Cohen, 2000; O'Reilly et al., 1999; O'Reilly & Munakata, 2000; Frank et al., 2001). When the gate is open, working memory is updated by incoming stimulus information; when it is closed, currently active working memory representations are robustly maintained.

2.1 Dynamic Gating via Basal Ganglia Disinhibition

One of the central postulates of the PBWM model is that the basal ganglia provide a selective dynamic gating mechanism for information maintained via sustained activation in the PFC (see Figure 3). As reviewed in Frank et al. (2001), this idea is consistent with a wide range of data and other computational models that have been developed largely in the domain of motor control, but also in working memory (Wickens, 1993; Houk & Wise, 1995; Wickens, Kotter, & Alexander, 1995; Dominey, Arbib, & Joseph, 1995; Berns & Sejnowski, 1995, 1998; Jackson & Houghton, 1995; Beiser & Houk, 1998;
Figure 3: The basal ganglia are interconnected with frontal cortex through a series of parallel loops, each of the form shown. Working backward from the thalamus, which is bidirectionally excitatory with frontal cortex, the SNr (substantia nigra pars reticulata) is tonically active and inhibiting this excitatory circuit. When direct pathway "Go" neurons in dorsal striatum fire, they inhibit the SNr and thus disinhibit frontal cortex, producing a gating-like modulation that we argue triggers the update of working memory representations in prefrontal cortex. The indirect pathway "NoGo" neurons of dorsal striatum counteract this effect by inhibiting the inhibitory GPe (globus pallidus, external segment).
Specifically, in the motor domain, various authors suggest that the BG are specialized to selectively facilitate adaptive motor actions, while suppressing others (Mink, 1996). This same functionality may hold for more advanced tasks, in which the "action" to facilitate is the updating of prefrontal working memory representations (Frank et al., 2001; Frank, 2005).

To support robust active maintenance in PFC, our model takes advantage of intrinsic bistability of PFC neurons, in addition to recurrent excitatory connections (Fellous, Wang, & Lisman, 1998; Wang, 1999; Durstewitz, Kelc, & Gunturkun, 1999; Durstewitz, Seamans, & Sejnowski, 2000a). Here we present a summary of our previously developed framework (Frank et al., 2001) for how the BG achieves gating:
• Rapid updating occurs when direct pathway spiny "Go" neurons in the dorsal striatum fire. Go firing directly inhibits the substantia nigra pars reticulata (SNr) and releases its tonic inhibition of the thalamus. This thalamic disinhibition enables, but does not directly cause (i.e., gates), a loop of excitation into the PFC. The effect of this excitation in the model is to toggle the state of bistable currents in the PFC neurons. Striatal Go neurons in the direct pathway are in competition (in the SNr, if not the striatum; Mink, 1996; Wickens, 1993) with "NoGo" neurons in the indirect pathway that effectively produce more inhibition of thalamic neurons and therefore prevent gating.

• Robust maintenance occurs via intrinsic PFC mechanisms (bistability, recurrence) in the absence of Go updating signals. This is supported by the NoGo indirect pathway firing to prevent updating of extraneous information during maintenance.

• Selective updating occurs because there are parallel loops of connectivity through different areas of the basal ganglia and frontal cortex (Alexander, DeLong, & Strick, 1986; Graybiel & Kimura, 1995; Middleton & Strick, 2000). We refer to the separately updatable components of the PFC/BG system as stripes, in reference to relatively isolated groups of interconnected neurons in PFC (Levitt, Lewis, Yoshioka, & Lund, 1993; Pucak, Levitt, Lund, & Lewis, 1996). We previously estimated that the human frontal cortex could support roughly 20,000 such stripes (Frank et al., 2001).
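To make the update-versus-maintain logic concrete, here is a minimal Python sketch; it is our own illustration, not code from the PBWM implementation, and the name go_fires simply stands in for striatal Go firing that disinhibits the thalamus:

```python
# Minimal sketch of BG-mediated gating of a single PFC stripe (ours, not
# the authors' code). When Go fires, the gate is open and the stripe is
# toggled to encode the current stimulus; otherwise it maintains.

def step(pfc_stripe, stimulus, go_fires):
    """Return the stripe's contents after one trial."""
    return stimulus if go_fires else pfc_stripe

stripe = None
stripe = step(stripe, "1", go_fires=True)   # gate open: rapid updating
stripe = step(stripe, "C", go_fires=False)  # gate closed: robust maintenance
assert stripe == "1"                        # distractor C was kept out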
3 Learning When to Gate in the Basal Ganglia

Figure 4 provides a summary of how basal ganglia gating can solve the 1-2-AX task. This figure also illustrates that the learning problem in the basal ganglia amounts to learning when to fire a Go versus NoGo signal in a given stripe, based on the current sensory input and maintained PFC activations. Without such a learning mechanism, our model would require some kind of intelligent homunculus to control gating. Thus, the development of this learning mechanism is a key step in banishing the homunculus from the domain of working memory models (cf. the "central executive" of Baddeley's, 1986, model). There are two fundamental problems that must be solved by the learning mechanism:

Temporal credit assignment: The benefits of having encoded a given piece of information into prefrontal working memory are typically available only later in time (e.g., encoding the 1 task demand helps later only when confronted with an A-X sequence). Thus, the problem is to know which prior events were critical for subsequent good (or bad) performance.

Structural credit assignment: The network must decide which PFC stripes should encode which different pieces of information at a given time. When successful performance occurs, it must reinforce those stripes that actually contributed to this success.
Figure 4: Illustration of how the basal ganglia gating of different PFC stripes can solve the 1-2-AX task (light color = active; dark = not active). (a) The 1 task is gated into an anterior PFC stripe because a corresponding striatal stripe fired Go. (b) The distractor C fails to fire striatal Go neurons, so it will not be maintained; however, it does elicit transient PFC activity. Note that the 1 persists because of gating-induced robust maintenance. (c) The A is gated in. (d) A right key press motor action is activated (using the same BG-mediated disinhibition mechanism) based on X input plus maintained PFC context.
This form of credit assignment is what neural network models are typically very good at doing, but here it interacts with the temporal credit assignment problem, making it more complex.

The PBWM model uses a reinforcement-learning algorithm called PVLV (in reference to its Pavlovian learning mechanisms; O'Reilly, Frank, Hazy, & Watz, 2005) to solve the temporal credit assignment problem. The simulated dopaminergic (DA) output of this PVLV system modulates Go versus NoGo firing activity in a stripe-wise manner in BG-PFC circuits to facilitate structural credit assignment. Each of these is described in detail below.

The model (see Figure 5) has an actor-critic structure (Sutton & Barto, 1998), where the critic is the PVLV system that controls the firing of simulated midbrain DA neurons and trains both itself and the actor. The actor is the basal ganglia gating system, composed of the Go and NoGo pathways in the dorsal striatum and their associated projections through BG output structures to the thalamus, and then back up to the PFC.
Figure 5: Overall architecture of the PBWM model. Sensory inputs are mapped to motor outputs via posterior cortical (“hidden”) layers, as in a standard neural network model. The PFC contextualizes this mapping by representing relevant prior information and goals. The basal ganglia (BG) update the PFC representations via dynamic gating, and the PVLV system drives dopaminergic (DA) modulation of the BG so it can learn when to update. The BG/PVLV system constitutes an actor-critic architecture, where the BG performs updating actions and the PVLV system “critiques” the potential reward value of these actions, with the resulting modulation shaping future actions to be more rewarding.
The DA signals computed by PVLV drive both performance and learning effects via opposite effects on Go and NoGo neurons (Frank, 2005). Specifically, DA is excitatory onto the Go neurons via D1 receptors and inhibitory onto NoGo neurons via D2 receptors (Gerfen, 2000; Hernandez-Lopez et al., 2000). Thus, positive DA bursts (above tonic-level firing) tend to increase Go firing and decrease NoGo firing, while dips in DA firing (below tonic levels) have the opposite effect. The change in activation state as a result of this DA modulation can then drive learning in an appropriate way, as detailed below and in Frank (2005).

3.1 Temporal Credit Assignment: The PVLV Algorithm. The firing patterns of midbrain dopamine (DA) neurons (ventral tegmental area, VTA, and substantia nigra pars compacta, SNc; both strongly innervated by the basal ganglia) exhibit the properties necessary to solve the temporal credit assignment problem because they appear to learn to fire for stimuli that predict subsequent rewards (e.g., Schultz, Apicella, & Ljungberg, 1993; Schultz, 1998). This property is illustrated in schematic form in Figure 6a for a simple Pavlovian conditioning paradigm, where a conditioned stimulus (CS, e.g., a tone) predicts a subsequent unconditioned stimulus (US, i.e., a reward). Figure 6b shows how this predictive DA firing can reinforce BG Go firing to maintain a stimulus, when such maintenance leads to subsequent reward.
Figure 6: (a) Schematic of dopamine (DA) neural firing for a conditioned stimulus (CS, e.g., a tone) that reliably predicts a subsequent unconditioned stimulus (US, i.e., a reward, r). Initially, DA fires at the point of reward, but over repeated trials it learns to fire at the onset of the stimulus. (b) This DA firing pattern can solve the temporal credit assignment problem for PFC active maintenance. Here, the PFC maintains the transient input stimulus (initially by chance), leading to reward. As the DA system learns, it begins to fire DA bursts at stimulus onset, by virtue of PFC "bridging the gap" (in place of a sustained input). DA firing at stimulus onset reinforces the firing of basal ganglia Go neurons, which drive updating in PFC.
Specifically, the DA firing can move from the time of a reward to the onset of a stimulus that, if maintained in the PFC, leads to this subsequent reward. Because this DA firing occurs when the stimulus comes on, it is well timed to facilitate the storage of this stimulus in PFC. In the model, this occurs by reinforcing the connections between the stimulus and the Go gating neurons in the striatum, which then cause updating of PFC to maintain the stimulus. Note that other models have leveraged this same logic, but have the DA firing itself cause updating of working memory via direct DA projections to PFC (O'Reilly et al., 1999; Braver & Cohen, 2000; Cohen et al., 1996; O'Reilly & Munakata, 2000; Rougier & O'Reilly, 2002; O'Reilly, Noelle, Braver, & Cohen, 2002). The disadvantage of this global DA signal is that it
would update the entire PFC every time, making it difficult to perform tasks like the 1-2-AX task, which require maintenance of some representations while updating others.

The apparently predictive nature of the DA firing has almost universally been explained in terms of the temporal differences (TD) reinforcement learning mechanism (Sutton, 1988; Sutton & Barto, 1998; Schultz et al., 1995; Houk et al., 1995; Montague, Dayan, & Sejnowski, 1996; Suri et al., 2001; Contreras-Vidal & Schultz, 1999; Joel et al., 2002). The earlier DA gating models cited above and an earlier version of the PBWM model (O'Reilly & Frank, 2003) also used this TD mechanism to capture the essential properties of DA firing in the BG. However, considerable subsequent exploration and analysis of these models has led us to develop a non-TD account of these DA firing patterns, one that abandons the prediction framework on which TD is based (O'Reilly et al., 2005). In brief, TD learning depends on sequential chaining of predictions from one time step to the next, and any weak link (i.e., an unpredictable event) can break this chain. In many of the tasks faced by our models (e.g., the 1-2-AX task), the sequence of stimulus states is almost completely unpredictable, and this significantly disrupts the TD chaining mechanism, as shown in O'Reilly et al. (2005).

Instead of relying on prediction as the engine of learning, we have developed a fundamentally associative "Pavlovian" learning mechanism called PVLV, which consists of two systems: primary value (PV) and learned value (LV) (O'Reilly et al., 2005; see Figure 7).

Figure 7: PVLV learning mechanism. (a) Structure of PVLV. The PV (primary value) system learns about primary rewards and contains two subsystems: the excitatory (PVe) drives excitatory DA bursts from primary rewards (US = unconditioned stimulus), and the inhibitory (PVi) learns to cancel these bursts (using timing or other reliable signals). Anatomically, the PVe corresponds to the lateral hypothalamus (LHA), which has excitatory projections to the midbrain DA nuclei and responds to primary rewards. The PVi corresponds to the striosome-patch neurons in the ventral striatum (V. Str.), which have direct inhibitory projections onto the DA system, and learn to fire at the time of expected rewards. The LV (learned value) system learns to fire for conditioned stimuli (CS) that are reliably associated with reward. The excitatory component (LVe) drives DA bursting and corresponds to the central nucleus of the amygdala (CNA), which has excitatory DA projections and learns to respond to CS's. The inhibitory component (LVi) is just like the PVi, except it inhibits CS-associated bursts. (b) Application to the simple conditioning paradigm depicted in the previous figure, where the PVi learns (based on the PVe reward value at each time step) to cancel the DA burst at the time of reward, while the LVe learns a positive CS association (only at the time of reward) and drives DA bursts at CS onset. The phasic nature of CS firing, despite a sustained CS input, requires a novelty detection mechanism of some form; we suggest a synaptic depression mechanism as having beneficial computational properties.
The PV system is just the Rescorla-Wagner/delta-rule learning algorithm (Rescorla & Wagner, 1972; Widrow & Hoff, 1960), trained by the primary reward value $r^t$ (i.e., the US) at each time step $t$ (where time steps correspond to discrete events in the environment, such as the presentation of a CS or US). For simplicity, consider a single linear unit that computes an expected reward value $\hat{V}^t_{pv}$ based on weights $w^t_i$ coming from sensory and other inputs $x^t_i$ (e.g., including timing signals from the cerebellum):

    \hat{V}^t_{pv} = \sum_i x^t_i w^t_i    (3.1)
(our actual value representation uses a distributed representation, as described in the appendix). The error in this expected reward value relative to the actual reward present at time $t$ represents the PV system's contribution to the overall DA signal:

    \delta^t_{pv} = r^t - \hat{V}^t_{pv}    (3.2)
Note that all of these terms are in the current time step, whereas the similar equation in TD involves terms across different adjacent time steps. This delta value then trains the weights into the PV reward expectation,

    \Delta w^t_i = \epsilon \, \delta^t_{pv} x^t_i,    (3.3)
where $\Delta w^t_i$ is the change in weight value and $0 < \epsilon < 1$ is a learning rate. As the system learns to expect primary rewards based on sensory and other inputs, the delta value decreases. This can account for the cancellation of the dopamine burst at the time of reward, as observed in the neural recording data (see Figure 7b).

When a conditioned stimulus is activated in advance of a primary reward, the PV system is actually trained not to expect reward at this time, because it is always trained by the current primary reward value, which is zero in this case. Therefore, we need an additional mechanism to account for the anticipatory DA bursting at CS onset, which in turn is critical for training up the BG gating system (see Figure 6). This is the learned value (LV) system, which is trained only when primary rewards are either present or expected by the PV and is free to fire at other times without adapting its weights. Therefore, the LV is protected from having to learn that no primary reward is actually present at CS onset, because it is not trained at that time. In other words, the LV system is free to signal reward associations for stimuli even at times when no primary reward is actually expected. This results in the anticipatory dopamine spiking at CS onset (see Figure 7b), without requiring an unbroken chain of predictive events between stimulus onset and subsequent reward, as in TD. Thus, this anticipatory dopamine spiking by the LV system is really just signaling a reward association, not a reward prediction.
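The following Python sketch illustrates equations 3.1 to 3.3 together with the conditional LV training rule just described. It is a single-unit simplification under our own assumptions (e.g., the scalar threshold theta standing in for "expected by the PV"), not the actual PVLV implementation, which uses distributed value representations (see the appendix):

```python
import numpy as np

def pvlv_step(w_pv, w_lv, x, r, lrate=0.1, theta=0.5):
    """One PVLV update; theta and lrate are illustrative values."""
    V_pv = w_pv @ x                 # expected primary reward (eq. 3.1)
    delta_pv = r - V_pv             # PV contribution to DA signal (eq. 3.2)
    w_pv += lrate * delta_pv * x    # delta-rule weight update (eq. 3.3)

    # LV trains only when primary reward is present or expected by the PV,
    # so it is free to signal CS-reward associations at other times.
    if r > 0 or V_pv > theta:
        V_lv = w_lv @ x
        w_lv += lrate * (r - V_lv) * x
    return delta_pv

# Example: x is a binary vector of currently active stimuli.
w_pv, w_lv = np.zeros(3), np.zeros(3)
x_cs = np.array([1.0, 0.0, 0.0])    # a CS input paired with reward
delta = pvlv_step(w_pv, w_lv, x_cs, r=1.0)
```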
As detailed in O'Reilly et al. (2005), this PV/LV division provides a good mapping onto the biology of the DA system (see Figure 7a). Excitatory projections from the lateral hypothalamus (LHA) and central nucleus of the amygdala (CNA) are known to drive DA bursts in response to primary rewards (LHA) and conditioned stimuli (CNA) (e.g., Cardinal, Parkinson, Hall, & Everitt, 2002). Thus, we consider LHA to represent $r$, which we also label as PVe to denote the excitatory component of the primary value system. The CNA corresponds to the excitatory component of the LV system described above (LVe), which learns to drive DA bursts in response to conditioned stimuli. The primary reward system $\hat{V}_{pv}$ that cancels DA firing at reward delivery is associated with the striosome/patch neurons in the ventral striatum, which have direct inhibitory projections into the DA system (e.g., Joel & Weiner, 2000) and learn to fire at the time of expected primary rewards (e.g., Schultz, Apicella, Scarnati, & Ljungberg, 1992). We refer to this as the inhibitory part of the primary value system, PVi. For symmetry and for important functional reasons described later, we also include a similar inhibitory component to the LV system, LVi, which is also associated with the same ventral striatum neurons, but slowly learns to cancel DA bursts associated with CS onset. (For full details on PVLV, see O'Reilly et al., 2005, and the equations in the appendix.)

3.2 Structural Credit Assignment. The PVLV mechanism just described provides a solution to the temporal credit assignment problem, and we use the overall PVLV $\delta$ value to simulate midbrain (VTA, SNc) dopamine neuron firing rates (deviations from baseline). To provide a solution to the structural credit assignment problem, the global PVLV DA signal can be modulated by the Go versus NoGo firing of the different PFC/BG stripes, so that each stripe gets a differentiated DA signal reflecting its contribution to the overall reward signal. Specifically, we hypothesize that the SNc provides a more stripe-specific DA signal by virtue of inhibitory projections from the SNr to the SNc (e.g., Joel & Weiner, 2000). As noted above, these SNr neurons are tonically active and are inhibited by the firing of Go neurons in the striatum. Thus, to the extent that a stripe fires a strong Go signal, it will disinhibit the SNc DA projection to itself, while stripes that are firing NoGo will remain inhibited and not receive DA signals. We suggest that this inhibitory projection from SNr to SNc produces a shunting property that negates the synaptic inputs producing bursts and dips, while preserving the intrinsically generated tonic DA firing levels. Mathematically, this results in a multiplicative relationship, such that the degree of Go firing multiplies the magnitude of the DA signal a stripe receives (see the appendix for details).
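As a rough illustration of this multiplicative, stripe-wise modulation, the following sketch (our own simplification, not the appendix equations) scales the global PVLV DA signal by each stripe's Go firing strength:

```python
import numpy as np

# Each stripe's phasic DA = global PVLV delta, scaled multiplicatively by
# that stripe's Go firing strength (assumed here to lie in [0, 1]).

def stripe_da(delta_global, go_strength):
    return go_strength * delta_global

go = np.array([0.9, 0.1, 0.0, 0.5])   # Go firing in four stripes
print(stripe_da(0.8, go))             # stripes firing Go get the DA burst
```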
It remains to be determined whether the SNc projections support stripe-specific topography (see Haber, Fudge, & McFarland, 2000, for data suggestive of some level of topography), but it is important to emphasize that the proposed mechanism involves only a modulation in the amplitude of phasic DA changes in a given stripe, not qualitatively different firing patterns from different SNc neurons. Thus, very careful quantitative parallel DA recording studies across multiple stripes would be required to test this idea. Furthermore, it is possible that this modulation could be achieved through other mechanisms operating in the synaptic terminals regulating DA release (Joel & Weiner, 2000), in addition to or instead of overall firing rates of SNc neurons. What is clear from the results presented below is that the networks are significantly impaired at learning without this credit assignment mechanism, so we believe it is likely to be implemented in the brain in some manner.

3.3 Dynamics of Updating and Learning. In addition to solving the temporal and structural credit assignment problems, the PBWM model depends critically on the temporal dynamics of activation updating to solve the following functional demands:
• Within one stimulus-response time step, the PFC must provide a stable context representation reflecting ongoing goals or prior stimulus context, and it must also be able to update to reflect appropriate changes in context for subsequent processing. Therefore, the system must be able to process the current input and make an appropriate response before the PFC is allowed to update. This offset updating of context representations is also critical for the SRN network, as discussed later. In standard Leabra, there are two phases of activation updating: a minus phase, where a stimulus is processed to produce a response, followed by a plus phase, where any feedback (when available) is presented, allowing the network to correct its response next time. Both of these phases must occur with a stable PFC context representation for the feedback to be able to drive learning appropriately. Furthermore, the BG Go/NoGo firing that decides whether to update the current PFC representations must also be appropriately contextualized by these stable PFC context representations. Therefore, in PBWM, we add a third update phase in which PFC representations update, based on BG Go/NoGo firing computed in the plus phase (with the prior PFC context active). Biologically, this would occur in a more continuous fashion, but with appropriate delays such that PFC updating occurs after motor responding.

• The PVLV system must learn about the value of maintaining a given PFC representation at the time an output response is made and rewarded (or not). This reward learning is based on adapting synaptic weights from PFC representations active at the time of reward, not on any transient sensory inputs that initially activated those
PFC representations, which could have been many time steps earlier (and long since gone).

• After BG Go firing updates PFC representations (during the third phase of settling), the PVLV critic can then evaluate the value of the new PFC state to provide a training signal to Go/NoGo units in the striatum. This training signal is directly contingent on striatal actions: Did the update result in a "good" PFC state (as determined by PVLV associations)? If good (DA burst), then the likelihood of Go firing next time increases. If bad (DA dip), then the Go firing likelihood decreases and NoGo firing increases. This occurs via direct DA modulation of the Go/NoGo neurons in the third phase, where bursts increase Go and decrease NoGo activations and dips have the opposite effect (Frank, 2005). Thus, the Go/NoGo units learn using the delta rule over their states in the second and third phases of settling, where the third phase reflects the DA modulation from the PVLV evaluation of the new PFC state (a toy sketch of this phase structure follows this list).
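The following toy, runnable sketch is our own construction (not the released PBWM code); it only illustrates the ordering constraints above: the response and feedback phases complete with a stable PFC, gating then updates PFC in a third phase, and a stand-in DA signal modulates Go firing for delta-rule learning over the plus- and third-phase states:

```python
import numpy as np

rng = np.random.default_rng(0)
w_go = rng.normal(0.0, 0.1, size=4)   # input -> striatal Go weights
pfc = np.zeros(4)                     # maintained PFC context

def trial(x, da_of_new_pfc_state, lrate=0.1):
    """One trial; da_of_new_pfc_state stands in for the PVLV evaluation."""
    global pfc, w_go
    # Minus/plus phases: response produced and feedback received while
    # the PFC context is held stable.
    go_plus = 1.0 / (1.0 + np.exp(-(w_go @ x)))
    # Third (update) phase: plus-phase Go firing gates the PFC update.
    if go_plus > 0.5:
        pfc = x.copy()
    # DA bursts increase Go activation; dips decrease it.
    go_third = np.clip(go_plus + da_of_new_pfc_state, 0.0, 1.0)
    # Delta rule over the plus- and third-phase Go states.
    w_go += lrate * (go_third - go_plus) * x

trial(np.array([1.0, 0.0, 0.0, 0.0]), da_of_new_pfc_state=0.5)
```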
To summarize, the temporal credit assignment "time travel" of perceived value, from the point of reward back to the critical stimuli that must be maintained, must be based strictly on PFC states and not sensory inputs. But this creates a catch-22 because these PFC states reflect inputs only after updating has occurred (O'Reilly & Munakata, 2000), so the system cannot know that it would be good to update PFC to represent current inputs until it has already done so. This is solved in PBWM by having one system (PVLV) for solving the temporal credit assignment problem (based on PFC states) and a different one (striatum) for deciding when to update PFC (based on current sensory inputs and prior PFC context). The PVLV system then evaluates the striatal updating actions after updating has occurred. This amounts to trial-and-error learning, with the PVLV system providing immediate feedback for striatal gating actions (and this feedback is in turn based on prior learning by the PVLV system, taking place at the time of primary rewards). The system, like most reinforcement learning systems, requires sufficient exploration of different gating actions to find those that are useful.

The essential logic of these dynamics in the PBWM model is illustrated in Figure 8 in the context of a simple "store ignore recall" (SIR) working memory task (which is also simulated, as described later).

There are two additional functional features of the PBWM model: (1) a mechanism to ensure that striatal units are not stuck in NoGo mode (which would prevent them from ever learning) and to introduce some random exploratory firing, and (2) a contrast-enhancement effect of dopamine modulation on the Go/NoGo units that selectively modulates those units that were actually active relative to those that were not. The details of these mechanisms are described in the appendix, and their overall contributions to learning, along with the contributions of all the separable components of the system, are evaluated after the basic simulation results are presented.
Figure 8: Phase-based sequence of operations in the PBWM model for three input states of a simple Store, Ignore, Recall task. The task is to store the S stimulus, maintain it over a sequence of I (ignore) stimuli, and then recall the S when an R is input. Four key layers in the model are represented in simple form: PFC, sensory Input, Striatum (with Go = g and NoGo = n units), and overall DA firing (as controlled by PVLV). The three phases per trial (−, +, ++ = PFC update) are shown as a sequence of states for the same layer (i.e., there is only one PFC layer that represents one thing at a time). ∆w indicates key weight changes, and the font size for striatal g and n units indicates effects of DA modulation. Syndep indicates synaptic depression into the DA system (LV) that prevents sustained firing to the PFC S representation. In state 1, the network had previously stored the S (through random Go firing) and is now correctly recalling it on an R trial. The unexpected reward delivered in the plus phase produces a DA burst, and the LV part of PVLV (not shown) learns to associate the state of the PFC with reward. State 2 shows the consequence of this learning, where, some trials later, an S input is active and the PFC is maintaining some other information (X). Based on existing weights, the S input triggers the striatal Go neurons to fire in the plus phase, causing PFC to update to represent the S. During this update phase, the LV system recognizes this S (in the PFC) as rewarding, causing a DA burst, which increases firing of Go units and results in increased weights from S inputs to striatal Go units. In state 3, the Go units (by existing weights) do not fire for the subsequent ignore (I) input, so the S continues to be maintained. The maintained S in PFC does not continue to drive a DA burst, due to synaptic depression, so there is no DA-driven learning. If a Go were to fire for the I input, the resulting I representation in PFC would likely trigger a small negative DA burst, discouraging such firing again. The same logic holds for negative feedback, by causing nonreward associations for maintenance of useless information.
3.4 Model Implementation Details. The implemented PBWM model, shown in Figure 9 (with four stripes), uses the Leabra framework, described in detail in the appendix (O'Reilly, 1998, 2001; O'Reilly & Munakata, 2000).
Figure 9: Implemented model as applied to the 1-2-AX task. There are four stripes in this model as indicated by the groups of units within the PFC and Striatum (and the four units in the SNc and SNrThal layers). PVe represents primary reward (r or US), which drives learning of the primary value inhibition (PVi) part of PVLV, which cancels primary reward DA bursts. The learned value (LV) part of PVLV has two opposing excitatory and inhibitory components, which also differ in learning rate (LVe = fast learning rate, excitatory on DA bursts; LVi = slow learning rate, inhibitory on DA bursts). All of these reward-value layers encode their values as coarse-coded distributed representations. VTA and SNc compute the DA values from these PVLV layers, and SNc projects this modulation to the Striatum. Go and NoGo units alternate (from bottom left to upper right) in the Striatum. The SNrThal layer computes Go-NoGo in the corresponding stripe and mediates competition using kWTA dynamics. The resulting activity drives updating of PFC maintenance currents. PFC provides context for Input/Hidden/Output mapping areas, which represent posterior cortex.
Leabra uses point neurons with excitatory, inhibitory, and leak conductances contributing to an integrated membrane potential, which is then thresholded and transformed via an x/(x + 1) sigmoidal function to produce a rate-code output communicated to other units (discrete spiking can also be used, but produces noisier results). Each layer uses a k-winners-take-all (kWTA) function that computes an inhibitory conductance keeping roughly the k most active units above firing threshold and the rest below threshold. Units learn according to a combination of Hebbian and error-driven learning, with the latter computed using the generalized recirculation algorithm (GeneRec; O'Reilly, 1996), which computes backpropagation derivatives using two phases of activation settling, as mentioned earlier.

The cortical layers in the model use standard Leabra parameters and functionality, while the basal ganglia systems require some additional mechanisms to implement the DA modulation of Go/NoGo units and the toggling of PFC maintenance currents from Go firing, as detailed in the appendix. In some of the models, we have simplified the PFC representations so that they directly reflect the input stimuli in a one-to-one fashion, which allows us to transparently interpret the contents of PFC at any given point. However, these PFC representations can also be trained from random initial weights, as explored below. The ability of the PFC to develop its own representations is a critical advance over the SRN model, for example, as explored in other related work (Rougier, Noelle, Braver, Cohen, & O'Reilly, 2005).
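To give a concrete sense of these mechanisms, the following minimal sketch implements a thresholded x/(x + 1) rate-code function and a crude kWTA selection. It is our own simplification with illustrative parameter values, not the actual Leabra code (which, among other things, computes kWTA via a layer-wide inhibitory conductance rather than a hard cutoff):

```python
import numpy as np

def rate_code(net_input, threshold=0.25, gain=100.0):
    """Thresholded x/(x+1) activation (parameter values illustrative)."""
    x = gain * np.maximum(net_input - threshold, 0.0)
    return x / (x + 1.0)

def kwta(activations, k):
    """Crude k-winners-take-all: keep the k most active units, zero the rest."""
    out = np.zeros_like(activations)
    winners = np.argsort(activations)[-k:]
    out[winners] = activations[winners]
    return out

acts = rate_code(np.array([0.2, 0.4, 0.6, 0.3, 0.9]))
print(kwta(acts, k=2))
```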
4 Simulation Tests

We conducted simulation comparisons between the PBWM model and a set of backpropagation-based networks on three different working memory tasks: (1) the 1-2-AX task as described earlier, (2) a two-store version of the Store-Ignore-Recall (SIR) task (O'Reilly & Munakata, 2000), where two different items need to be separately maintained, and (3) a sequence memory task modeled after the phonological loop (O'Reilly & Soto, 2002). These tasks provide a diverse basis for evaluating these models. The backpropagation-based comparison networks were:
• A simple recurrent network (SRN; Elman, 1990; Jordan, 1986) with cross-entropy output error, no momentum, an error tolerance of .1 (output error < .1 counts as 0), and a hysteresis term of .5 in updating the context layers ($c_j(t) = .5 h_j(t-1) + .5 c_j(t-1)$, where $c_j$ is the context unit for hidden unit activation $h_j$; a sketch of this context update follows this list). Learning rate (lrate), hysteresis, and hidden-unit size were searched for optimal values across this and the RBP networks (within plausible ranges, using round numbers, e.g., lrates of .05, .1, .2, and .5; hysteresis of 0, .1, .2, .3, .5, and .7; hidden units of 25, 36, 49, and 100). For the 1-2-AX task, optimal performance was with 100 hidden units, hysteresis of .5, and lrate of .1. For the SIR-2 task, 49 hidden units were used due to the extreme length of training required, and a lrate of .01 was required to learn at all. For the phonological loop task, 196 hidden units and a lrate of .005 performed best.

• A real-time recurrent backpropagation learning network (RBP; Robinson & Fallside, 1987; Schmidhuber, 1992; Williams & Zipser, 1992), with the same basic parameters as the SRN, a time constant of 1 for integrating activations and backpropagated errors, and the gap between backpropagations and the backprop time window searched over 6, 8, 10, and 16 time steps. Two time steps were required for activation to propagate from the input to the output, so the effective backpropagation time window across discrete input events in the sequence is half of the actual time window (e.g., 16 = 8 events, which represents two or more outer-loop sequences). Best performance was achieved with the longest time window (16).

• A long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) with forget gates as specified in Gers (2000), with the same basic backpropagation parameters as the other networks, and four memory cells.
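As promised above, here is a minimal sketch of the SRN context-layer hysteresis update; the function name and example values are our own:

```python
import numpy as np

# c_j(t) = hysteresis * c_j(t-1) + (1 - hysteresis) * h_j(t-1)
def update_context(context, prev_hidden, hysteresis=0.5):
    return hysteresis * context + (1.0 - hysteresis) * prev_hidden

context = np.zeros(4)
prev_hidden = np.array([0.2, 0.8, 0.1, 0.5])
context = update_context(context, prev_hidden)  # partial copy of hidden state
```

With hysteresis 0 this reduces to the standard SRN copy; larger values widen the window of temporal integration of the context units.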
4.1 The 1-2-AX Task. The task was trained as in Figure 1, with the length of the inner-loop sequences randomly varied from one to four (i.e., one to four pairs of stimuli such as A-X or B-Y). Specifically, each sequence of stimuli was generated by first randomly picking a 1 or 2, and then looping one to four times over the following inner-loop generation routine. Half of the time (randomly selected), a possible target sequence (if 1, then A-X; if 2, then B-Y) was generated. The other half of the time, a random sequence composed of an A, B, or C, followed by an X, Y, or Z, was generated. Thus, possible targets (A-X, B-Y) represent at least 50% of trials, but actual targets (A-X in the 1 task, B-Y in the 2 task) appear only 25% of the time on average. The correct output was the L unit, except on the target sequences (1-A-X or 2-B-Y), where it was an R. The PBWM network received a reward if it produced the correct output (and received the correct output on the output layer in the plus phase of each trial), while the backpropagation networks learned from the error signal computed relative to this correct output.

One epoch of training consisted of 25 outer-loop sequences, and the training criterion was 0 errors across two epochs in a row (one epoch can sometimes contain only a few targets, making a lucky 0 possible). For parameter-search results, training was stopped after 10,000 epochs for the backpropagation models if the network had failed to learn by this point, and the run was scored as a failure to learn. For statistics, 20 different networks of each type were run.
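For concreteness, the following Python sketch reconstructs the sequence-generation routine described above; it is our own reading of the text, not the authors' code:

```python
import random

def gen_sequence(rng=random):
    """Generate one outer-loop 1-2-AX sequence as described in the text."""
    task = rng.choice(["1", "2"])
    seq = [task]
    for _ in range(rng.randint(1, 4)):          # one to four inner loops
        if rng.random() < 0.5:                  # possible target sequence
            pair = ["A", "X"] if task == "1" else ["B", "Y"]
        else:                                   # random non-specific pair
            pair = [rng.choice("ABC"), rng.choice("XYZ")]
        seq += pair
    return seq

def correct_output(task, prev, cur):
    """R only on the target sequence (1-A-X or 2-B-Y), else L."""
    target = ("A", "X") if task == "1" else ("B", "Y")
    return "R" if (prev, cur) == target else "L"

seq = gen_sequence()
```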
Figure 10: Training time to reach criterion (0 errors in two successive epochs of 25 outer-loop sequences) on the 1-2-AX task for the PBWM model and three backpropagation-based comparison algorithms. LSTM = long short-term memory model. RBP = recurrent backpropagation (real-time recurrent learning). SRN = simple recurrent network.
The basic results for the number of epochs required to reach the criterion training level are shown in Figure 10. These results show that the PBWM model learns the task at roughly the same speed as the comparison backpropagation networks, with the SRN taking significantly longer. However, the main point is not to compare the quantitative rates of learning (it is possible that, despite a systematic search for the best parameters, other parameters could be found to make the comparison networks perform better). Rather, these results simply demonstrate that the biologically based PBWM model is in the same league as existing powerful computational learning mechanisms.

Furthermore, the exploration of parameters for the backpropagation networks demonstrates that the 1-2-AX task represents a challenging working memory task, requiring large numbers of hidden units and long temporal-integration parameters for successful learning. For example, the SRN network required 100 hidden units and a .5 hysteresis parameter to learn reliably (hysteresis determines the window of temporal integration of the context units; see Table 1). For the RBP network, the number of hidden units and the time window for backpropagation exhibited similar results (see Table 2). Specifically, time windows of fewer than eight time steps resulted in failures to learn, and the best results (in terms of average learning time) were achieved with the most hidden units and the longest backpropagation time window.
Table 1: Effects of Various Parameters on Learning Performance in the SRN.

Hidden-layer sizes for SRN (lrate = .1, hysteresis = .5)
  Hiddens          25     36     49     100
  Success rate     4%     26%    86%    100%
  Average epochs   5367   6350   5079   2994

Hysteresis for SRN (100 hiddens, lrate = .1)
  Hysteresis       .1     .2     .3     .5     .7
  Success rate     0%     0%     38%    100%   98%
  Average epochs   NA     NA     6913   2994   3044

Learning rates for SRN (100 hiddens, hysteresis = .5)
  lrate            .05    .1     .2
  Success rate     100%   100%   96%
  Average epochs   3390   2994   3308

Notes: Success rate = percentage of networks (out of 50) that learned to criterion (0 errors for two epochs in a row) within 10,000 epochs. Average epochs = average number of epochs to reach criterion for successful networks. The optimal performance is with 100 hidden units, learning rate .1, and hysteresis .5. Sufficiently large values for the hidden units and hysteresis parameters are critical for successful learning, indicating the strong working memory demands of this task.
Table 2: Effects of Various Parameters on Learning Performance in the RBP Network.

Time window for RBP (lrate = .1, 100 hiddens)
  Window           6      8      10     16
  Success rate     6%     96%    96%    96%
  Average epochs   1389   625    424    353

Hidden-layer size for RBP (lrate = .1, window = 16)
  Hiddens          25     36     49     100
  Success rate     96%    100%   96%    96%
  Average epochs   831    650    687    353

Notes: The optimal performance is with 100 hidden units and time window = 16. As with the SRN, the relatively large network size and long time windows required indicate the strong working memory demands of the task.
4.2 The SIR-2 Task. The PBWM and comparison backpropagation algorithms were also tested on a somewhat more abstract task (not tested in humans) that represents perhaps the simplest, most direct form of working memory demands. In this store ignore recall (SIR) task (see Table 3), the network must store an arbitrary input pattern for a recall test that occurs after a variable number of intervening ignore trials (O'Reilly & Munakata, 2000). Stimuli are presented during the ignore trials and must be identified (output) by the network but do not need to be maintained. Tasks with this same basic structure were the focus of the original Hochreiter and Schmidhuber (1997) work on the LSTM algorithm, where they demonstrated that the dynamic gating mechanism was able to gate in the to-be-stored stimulus, maintain it in the face of an essentially arbitrary number of intervening trials by having the gate turned off, and then recall the maintained stimulus.
Table 3: Example Sequence of Trials in the SIR-2 Task, Showing What Is Input, What Should Be Maintained in Each of Two "Stores," and the Target Output.

  Trial   Input   Maint-1   Maint-2   Output
  1       I-D     –         –         D
  2       S1-A    A         –         A
  3       I-B     A         –         B
  4       S2-C    A         C         C
  5       I-A     A         C         A
  6       I-E     A         C         E
  7       R1      A         C         A
  8       I-A     –         C         A
  9       I-C     –         C         C
  10      S1-D    D         C         D
  11      I-E     D         C         E
  12      R1      D         C         D
  13      I-B     –         C         B
  14      R2      –         C         C
Notes: I = Ignore unit active; S1/2 = Store 1/2 unit active; R1/2 = Recall unit 1/2 active. The functional meaning of these "task control" inputs must be discovered by the network. Two versions were run. In the shared representations version, one set of five stimulus inputs was used to encode A–E, regardless of which control input was present. In the dedicated representations version, there were different stimulus representations for each of the three categories of stimulus inputs (S1, S2, and I), for a total of 15 stimulus input units. The shared representations version proved impossible for nongated networks to learn.
The SIR-2 version of this task adds the need to independently update and maintain two different stimulus memories, instead of just one, which should provide a better test of selective updating. We explored two versions of this task—one with a single set of shared stimulus representations (A–E) and another with dedicated stimulus representations for each of the three different types of task control inputs (S1, S2, I). In the dedicated representations version, the stimulus inputs directly conveyed their functional role and made the control inputs somewhat redundant (e.g., the I-A stimulus unit should always be ignored, while the S1-A stimulus should always be stored in the first stimulus store). In contrast, a stimulus in the shared representations version is ambiguous: sometimes an A should be ignored, sometimes stored in S1, and other times stored in S2, depending on the concomitant control input. This difference in stimulus ambiguity made a big difference for the nongating networks, as discussed below.
The networks had 20 input units: separate A–E stimuli for each of the three types of control inputs (S1, S2, I), for a total of 15 stimulus units, plus the 5 control units (S1, S2, I, R1, R2). On each trial, a control input and corresponding stimulus were randomly selected with uniform probability, which means that S1 and S2 maintenance ended up being randomly interleaved with each other. Thus, the network was required to develop a truly independent form of updating and maintenance for these two items.

As Figure 11a shows, three out of four algorithms succeeded in learning the dedicated stimulus items version of the task within roughly comparable numbers of epochs, while the SRN model had a very difficult time, taking on average 40,090 epochs. We suspect that this difficulty may reflect the limitation of the single time step of error backpropagation available to this network, making it difficult to span the longer delays that often occurred (Hochreiter & Schmidhuber, 1997).

Interestingly, the shared stimulus representations version of the task (see Figure 11b) clearly divided the gating networks from the nongated ones (indeed, the nongated networks—RBP and SRN—were completely unable to achieve a more stringent criterion of four zero-error epochs in a row, whereas both PBWM and LSTM reliably reached this level). This may be because there is no way to establish a fixed set of weights between an input stimulus and a working memory representation in this task version. The appropriate memory representation for maintaining a given stimulus must be determined entirely by the control input. In other words, the control input must act as a gate on the fate of the stimulus input, much as the gate input on a transistor determines the processing of the other input.

More generally, dynamic gating enables a form of dynamic variable binding, as illustrated in Figure 12 for this SIR-2 task. The two PFC stripes in this example act as variable "slots" that can hold any of the stimulus inputs; which slot a given input gets "bound" to is determined by the gating system as driven by the control input (S1 or S2). This ability to dynamically route a stimulus to different memory locations is very difficult to achieve without a dynamic gating system, as our results indicate. Nevertheless, it is essential to emphasize that despite this additional flexibility provided by the adaptive gating mechanism, the PBWM network is by no means a fully general-purpose variable binding system. The PFC representations must still learn to encode the stimulus inputs, and other parts of the network must learn to respond appropriately to these PFC representations. Therefore, unlike a traditional symbolic computer, it is not possible to store any arbitrary piece of information in a given PFC stripe.

Figure 13 provides important confirmation that the PVLV learning mechanism is doing what we expect it to in this task, as represented in the earlier discussion of the SIR task (see Figure 8). Specifically, we expect that the system will generate large positive DA bursts for Store events and not for Ignore events. This is because the Store signal should be positively associated with correct performance (and thus reward), while the Ignore signal should not be. This is exactly what is observed.
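The following sketch generates SIR-2 trials and their target outputs; it is our own reconstruction from Table 3 and the text, and for simplicity it does not guard against a recall cue arriving before anything has been stored:

```python
import random

def sir2_trial(stores, stimuli="ABCDE", rng=random):
    """One SIR-2 trial: (control input, stimulus, target output)."""
    ctrl = rng.choice(["I", "S1", "S2", "R1", "R2"])
    if ctrl.startswith("R"):
        target = stores[ctrl[1]]            # recall the stored item
        stores[ctrl[1]] = None
        return (ctrl, None, target)
    stim = rng.choice(stimuli)
    if ctrl.startswith("S"):
        stores[ctrl[1]] = stim              # store into the named slot
    return (ctrl, stim, stim)               # identify the current stimulus

stores = {"1": None, "2": None}
for _ in range(5):
    print(sir2_trial(stores))
```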
Figure 11: Training time to reach criterion (0 errors in 2 consecutive epochs of 100 trials each) on the SIR-2 task for the PBWM model and three backpropagation-based comparison algorithms, for (a) dedicated stimulus items (stimulus set = 5 items, A–E) and (b) shared stimulus items (stimulus set = 2 items, A–B). LSTM = long short-term memory model. RBP = recurrent backpropagation (real-time recurrent learning). SRN = simple recurrent network. The SRN does significantly worse in both cases (note the logarithmic scale), and with shared items, the nongated networks suffer considerably relative to the gated ones, most likely because of the variable binding functionality that a gating mechanism provides, as illustrated in Figure 12.
4.3 The Phonological Loop Sequential Recall Task. The final simulation test involves a simplified model of the phonological loop, based on earlier work (O'Reilly & Soto, 2002).
Figure 12: Gating can achieve a form of dynamic variable binding, as illustrated in the SIR-2 task. The store command (S1 or S2) can drive gating signals in different stripes in the BG, causing the input stimulus item (A, B, ...) to be stored in the associated PFC stripe. Thus, the same input item can be encoded in a different neural "variable slot" depending on other inputs. Nevertheless, these neural stripes are not fully general like traditional symbolic variables; they must learn to encode the input items, and other areas must learn to decode these representations.
The phonological loop is a working memory system that can actively maintain a short chunk of phonological (verbal) information (e.g., Baddeley, 1986; Baddeley, Gathercole, & Papagno, 1998; Burgess & Hitch, 1999; Emerson & Miyake, 2003). In essence, the task of this model is to encode and replay a sequence of "phoneme" inputs, much as in the classic psychological task of short-term serial recall. Thus, it provides a simple example of sequencing, which has often been linked with basal ganglia and prefrontal cortex function (e.g., Berns & Sejnowski, 1998; Dominey et al., 1995; Nakahara et al., 2001).

As we demonstrated in our earlier model (O'Reilly & Soto, 2002), an adaptively gated working memory architecture provides a particularly efficient and systematic way of encoding phonological sequences. Because phonemes are a small closed class of items, each independently updatable PFC stripe can learn to encode this basic vocabulary. The gating mechanism can then dynamically gate incoming phonemes into stripes that implicitly represent the serial order information. For example, a given stripe might always encode the fifth phoneme in a sequence, regardless of which phoneme it was. The virtue of this system is that it provides a particularly efficient basis for generalization to novel phoneme sequences: as long as each stripe can encode any of the possible phonemes and gating is based on serial position and not phoneme identity, the system will generalize perfectly to novel sequences (O'Reilly & Soto, 2002). As noted above, this is an example of variable binding, where the stripes are variable-like slots for a given position, and the gating "binds" a given input to its associated slot.
Figure 13: (a) Average simulated DA values in the PBWM model for different event types over training. Within the first 50 epochs, the model learns strong, positive DA values for both types of storage events (Store), which reinforces gating for these events. In contrast, low DA values are generated for Ignore and Recall events. (b) Average LVe values, representing the learned value (i.e., CS associations with reward value) of various event types. As the model learns to perform well, it accurately perceives the reward at Recall events. This generalizes to the Store events, but the Ignore events are not reliably associated with reward, and thus remain at low levels.
Our earlier model was developed in advance of the PBWM learning mechanisms and used a hand-coded gating mechanism to demonstrate the power of the underlying representational scheme. In contrast, we trained the present networks from random initial weights to learn this task. Each
training sequence consisted of an encoding phase, where the current sequence of phonemes was presented in order, followed by a retrieval phase, where the network had to output the phonemes in the order they were encoded. Sequences were of length 3, and only 10 simulated phonemes were used, represented locally as one out of 10 units active. Sequence order information was provided to the network in the form of an explicit "time" input, which counted up from 1 to 3 during both encoding and retrieval. Also, encoding versus retrieval phase was explicitly signaled by two units in the input. An example input sequence would be: E-1-'B,' E-2-'A,' E-3-'G,' R-1, R-2, R-3, where E/R is the encoding/recall flag, the next digit specifies the sequence position ("time"), and the third element is the phoneme (not present in the input during retrieval). There are 1000 possible sequences ($10^3$), and the networks were trained on a randomly selected subset of 300 of these, with another nonoverlapping sample of 300 used for generalization testing at the end of training.

Both of the gated networks (PBWM and LSTM) had six stripes or memory cells instead of four, given that three items had to be maintained at a time and the networks benefit from having extra stripes to explore different gating strategies in parallel. The PFC representations in the PBWM model were subject to learning (unlike in previous simulations, where they were simply a copy of the input for analytical simplicity) and had 42 units per stripe, as in the O'Reilly and Soto (2002) model, and there were 100 hidden units. There were 24 units per memory cell in the LSTM model (note that computation increases as a power of 2 per memory-cell unit in LSTM, setting a relatively low upper limit on the number of such cells).

Figure 14 shows the training and testing results. Both gated models (PBWM, LSTM) learned more rapidly than the nongated backpropagation-based networks (RBP, SRN). Furthermore, the RBP network was unable to learn unless we presented the entire set of training sequences in a fixed order (the other networks had randomly ordered presentation of training sequences). This was true regardless of the RBP window size (even when it was exactly the length of a sequence). Also, the SRN could not learn with only 100 hidden units, so 196 were used. For both the RBP and SRN networks, a lower learning rate of .005 was required to achieve stable convergence. In short, this was a difficult task for these networks to learn.

Perhaps the most interesting results are the generalization test results shown in Figure 14b. As was demonstrated in the O'Reilly and Soto (2002) model, gating affords considerable advantages in generalization to novel sequences compared to the RBP and SRN networks. It is clear that the SRN network in particular simply "memorizes" the training sequences, whereas the gated networks (PBWM, LSTM) develop a very systematic solution in which each working memory stripe or slot learns to encode a different element in the sequence. This is a good example of the advantages of the variable-binding kind of behavior supported by adaptive gating, as discussed earlier.
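A minimal sketch of the trial structure just described, under our own naming conventions:

```python
import random

# One phonological-loop training sequence: an encoding phase followed by
# a retrieval phase (our reconstruction from the text, not the authors' code).

PHONEMES = list("ABCDEFGHIJ")   # 10 simulated phonemes

def loop_sequence(length=3, rng=random):
    items = [rng.choice(PHONEMES) for _ in range(length)]
    encode = [("E", t + 1, items[t]) for t in range(length)]
    recall = [("R", t + 1, None) for t in range(length)]   # phoneme absent
    return encode + recall, items   # items are the targets during recall

trials, targets = loop_sequence()
# e.g., [('E', 1, 'B'), ('E', 2, 'A'), ('E', 3, 'G'), ('R', 1, None), ...]
```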
Figure 14: (a) Learning rates for the different algorithms on the phonological loop task, replicating previous general patterns (criterion is one epoch of 0 error). (b) Generalization performance (testing on 300 novel, untrained sequences), showing that the gating networks (PBWM and LSTM) exhibit substantially better generalization, due to their ability to dynamically gate items into active memory “slots” based on their order of presentation.
4.4 Tests of Algorithm Components. Having demonstrated that the PBWM model can successfully learn a range of different challenging working memory tasks, we now test the role of specific subcomponents of the algorithm to demonstrate their contribution to overall performance. Table 4 shows the results of eliminating various portions of the model in terms of the percentage of networks successfully learning to criterion.
Table 4: Results of Tests for the Importance of Various Separable Parts of the PBWM Algorithm, Shown as the Percentage of Trained Networks That Met Criterion (Success Rate).

  Manipulation                     12ax   SIR-2   Loop
  No Hebbian                       95     100     100
  No DA contrast enhancement       80     95      90
  No Random Go exploration         0      95      100
  No LVi (slow LV baseline)        15     90      30
  No SNrThal DA Mod, DA = 1.0      15     5       0
  No SNrThal DA Mod, DA = 0.5      70     20      0
  No SNrThal DA Mod, DA = 0.2      80     30      20
  No SNrThal DA Mod, DA = 0.1      55     40      20
  No DA modulation at all          0      0       0

Notes: With the possible exception of Hebbian learning, all of the components clearly play an important role in overall learning, for the reasons described in the text as the algorithm was introduced. The "No SNrThal DA Mod" cases eliminate stripe-wise structural credit assignment; controls for overall levels of DA modulation are shown. The final "No DA modulation at all" condition completely eliminates the influence of the PVLV DA system on striatum Go/NoGo units, clearly indicating that PVLV (i.e., learning) is key.
This shows that each separable component of the algorithm plays an important role, with the possible exception of Hebbian learning (which was present only in the "posterior cortical" (Hidden/Output) portion of the network). Different models appear to be differentially sensitive to these manipulations, but all are affected relative to the 100% performance of the full model.

For the "No SNrThal DA Mod" manipulation, which eliminates structural credit assignment via the stripe-wise modulation of DA by the SNrThal layer, we also tried reducing the overall strength of the DA modulation of the striatum Go/NoGo units, on the grounds that the SNrThal modulation also tends to reduce DA levels overall. We wanted to make sure any impairment was not just a result of a change in overall DA levels; a significant impairment remains even with this control manipulation.

5 Discussion

The PBWM model presented here demonstrates powerful learning abilities on demonstrably complex and difficult working memory tasks. We have also tested it informally on a wider range of tasks, with similarly good results. This may be the first time that a biologically based mechanism for controlling working memory has been demonstrated to compare favorably with the learning abilities of more abstract and biologically implausible
backpropagation-based temporal learning mechanisms. Other existing simulations of learning in the basal ganglia tend to focus on relatively simple sequencing tasks that do not require complex working memory maintenance and updating, and do not require learning of when information should and should not be stored in working memory. Nevertheless, the central ideas behind the PBWM model are consistent with a number of these existing models (Schultz et al., 1995; Houk et al., 1995; Schultz et al., 1997; Suri et al., 2001; Contreras-Vidal & Schultz, 1999; Joel et al., 2002), thereby demonstrating that an emerging consensus view of basal ganglia learning mechanisms can be applied to more complex cognitive functions.

The central functional properties of the PBWM model can be summarized by comparison with the widely used SRN backpropagation network, which is arguably the simplest form of a gated working memory model. The gating aspect of the SRN becomes more obvious when the network has to settle over multiple update cycles for each input event (as in an interactive network, or to measure reaction times from a feedforward network). In this case, it is clear that the context layer must be held constant and protected from updating during these cycles of updating (settling), and then it must be rapidly updated at the end of settling (see Figure 15). Although the SRN achieves this alternating maintenance and updating by fiat, in a biological network it would almost certainly require some kind of gating mechanism.

Once one recognizes the gating mechanism hidden in the SRN, it is natural to consider generalizing such a mechanism to achieve a more powerful, flexible type of gating. This is exactly what the PBWM model provides, by adding the following degrees of freedom to the gating signal: (1) gating is dynamic, such that information can be maintained over a variable number of trials instead of automatically gating every trial; (2) the context representations are learned, instead of simply being copies of the hidden layer, allowing them to develop in ways that reflect the unique demands of working memory representations (e.g., Rougier & O'Reilly, 2002; Rougier et al., 2005); and (3) there can be multiple context layers (i.e., stripes), each with its own set of representations and gating signals. Although some researchers have used a spectrum of hysteresis variables to achieve some of this additional flexibility within the SRN, it should be clear that the PBWM model affords considerably more flexibility in the maintenance and updating of working memory information. Moreover, the similar good performance of the PBWM and LSTM models across a range of complex tasks clearly demonstrates the advantages of dynamic gating systems for working memory function.

Furthermore, the PBWM model is biologically plausible. Indeed, the general functions of each of its components were motivated by a large base of literature spanning multiple levels of analysis, including cellular, systems, and psychological data. As such, the PBWM model can be used to explore possible roles of the individual neural systems involved by perturbing parameters to simulate development, aging, pharmacological manipulations, and neurological dysfunction.
[Figure 15 appears here: three snapshots of a network with input, hidden, context, and output layers: (a) process event 1 (context gate closed); (b) update context (context gate open); (c) process event 2 (context gate closed).]
Figure 15: The simple recurrent network (SRN) as a gating network. When processing of each input event requires multiple cycles of settling, the context layer must be held constant over these cycles (i.e., its gate is closed, panel a). After processing an event, the gate is opened to allow updating of the context (copying of hidden activities to the context, panel b). This new context is then protected from updating during the processing of the next event (panel c). In comparison, the PBWM model allows more flexible, dynamic control of the gating signal (instead of automatic gating each time step), with multiple context layers (stripes) that can each learn their own representations (instead of being a simple copy).
to simulate development, aging, pharmacological manipulations, and neurological dysfunction. For example, we think the model can explicitly test the implications of striatal dopamine dysfunction in producing cognitive deficits in conditions such as Parkinson’s disease and ADHD (e.g., Frank et al., 2004; Frank, 2005). Further, recent extensions to the framework have yielded insights into possible divisions of labor between the basal ganglia and orbitofrontal cortex in reinforcement learning and decision making (Frank & Claus, 2005). Although the PBWM model was designed to include many central aspects of the biology of the PFC/BG system, it also goes beyond what is currently known and omits many biological details of the real system. Therefore, considerable further experimental work is necessary to test the specific predictions and neural hypotheses behind the model, and further elaboration and revision of the model will undoubtedly be necessary. Because the PBWM model represents a level of modeling intermediate between detailed biological models and powerful, abstract cognitive and computational models, it has the potential to build important bridges between these disparate levels of analysis. For example, the abstract ACT-R cognitive architecture has recently been mapped onto biological substrates including the BG and PFC (Anderson et al., 2004; Anderson & Lebiere, 1998), with the specific role ascribed to the BG sharing some central aspects of its role in the PBWM model. On the other end of the spectrum,
biologically based models have traditionally been incapable of simulating complex cognitive functions such as problem solving and abstract reasoning, which make extensive use of dynamic working memory updating and maintenance mechanisms to exhibit controlled processing over a time scale from seconds to minutes. The PBWM model should in principle allow models of these phenomena to be developed and their behavior compared with more abstract models, such as those developed in ACT-R. One of the major challenges to this model is accounting for the extreme flexibility of the human cognitive apparatus. Instead of requiring hundreds of trials of training on problems like the 1-2-AX task, people can perform this task almost immediately based on verbal task instructions. Our current model is more appropriate for understanding how agents can learn which information to hold in mind via trial and error, as would be required if monkeys were to perform the task.1 Understanding the human capacity for generativity may be the greatest challenge facing our field, so we certainly do not claim to have solved it. Nevertheless, we do think that the mechanisms of the PBWM model, and in particular its ability to exhibit limited variable-binding functionality, are critical steps along the way. It may be that over the 13 or so years it takes to fully develop a functional PFC, people have developed a systematic and flexible set of representations that support dynamic reconfiguration of input-output mappings according to maintained PFC representations. Thus, these PFC “variables” can be activated by task instructions and support novel task performance without extensive training. This and many other important problems, including questions about the biological substrates of the PBWM model, remain to be addressed in future research. Appendix: Implementational Details The model was implemented using the Leabra framework, which is described in detail in O’Reilly and Munakata (2000) and O’Reilly (2001), and summarized here. See Table 5 for a listing of parameter values, nearly all of which are at their default settings. These same parameters and equations have been used to simulate over 40 different models in O’Reilly and Munakata (2000) and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, instead of constructing new mechanisms for each model. (The model can be obtained by e-mailing the first author at
[email protected].)

1. In practice, monkeys would likely require an extensive shaping procedure to learn the relatively complex 1-2-AX hierarchical structure piece by piece. However, we argue that much of the advantage of shaping may have to do with the motivational state of the organism: it enables substantial levels of success early on, to keep the animal motivated. The model currently has no such motivational constraints and thus does not need shaping.
Table 5: Parameters for the Simulation.

Parameter          Value       Parameter          Value
E_l                0.15        g_l                0.10
E_i                0.15        g_i                1.0
E_e                1.00        g_e                1.0
V_rest             0.15        Θ                  0.25
τ                  .02         γ                  600
k (input/output)   1           k (hidden)         7
k (PFC)            4           k (striatum)       7
k (PVLV)           1           ε                  .01
k_hebb             .01         ε (to PFC)         .001*
k_hebb (to PFC)    .001*

Notes: See the equations in the text for explanations of parameters. All are standard default parameter values except for those with an *. The slower learning rate of PFC connections produced better results and is consistent with a variety of converging evidence suggesting that the PFC learns more slowly than the rest of cortex (Morton & Munakata, 2002).
A.1 Pseudocode. The pseudocode for Leabra is given here, showing exactly how the pieces of the algorithm described in more detail in the subsequent sections fit together.

Outer loop: Iterate over events (trials) within an epoch. For each event:

1. Iterate over minus (−), plus (+), and update (++) phases of settling for each event.
   (a) At start of settling:
       i. For non-PFC/BG units, initialize state variables (e.g., activation, V_m).
       ii. Apply external patterns (clamp input in minus, input and output, external reward based on minus-phase outputs).
   (b) During each cycle of settling, for all nonclamped units:
       i. Compute excitatory netinput (g_e(t) or η_j; equation A.2; equation A.21 for SNr/Thal units).
       ii. For Striatum Go/NoGo units in the ++ phase, compute additional excitatory and inhibitory currents based on DA inputs from SNc (equation A.20).
       iii. Compute kWTA inhibition for each layer, based on g_i^Θ (equation A.6):
           A. Sort units into two groups based on g_i^Θ: top k and remaining k + 1 to n.
           B. If basic, find the kth and (k + 1)th highest; if average based, compute the average of units 1 → k and k + 1 → n.
           C. Set inhibitory conductance g_i from g_k^Θ and g_{k+1}^Θ (equation A.5).
       iv. Compute point-neuron activation combining excitatory input and inhibition (equation A.1).
   (c) After settling, for all units:
       i. Record final settling activations by phase (y_j^−, y_j^+, y_j^{++}).
       ii. At the end of the + and ++ phases, toggle PFC maintenance currents for stripes with SNr/Thal act > threshold (.1).

2. After these phases, update the weights (based on linear current weight values):
   (a) For all non-BG connections, compute error-driven weight changes (equation A.8) with soft weight bounding (equation A.9), Hebbian weight changes from plus-phase activations (equation A.7), and the overall net weight change as a weighted sum of error-driven and Hebbian (equation A.10).
   (b) For PV units, weight changes are given by the delta rule, computed as the difference between the plus-phase external reward value and the minus-phase expected rewards (equation A.11).
   (c) For LV units, only change weights (using equation A.13) if the PV expectation > θ_pv or an external reward/punishment is actually delivered.
   (d) For Striatum units, the weight change is the delta rule on DA-modulated second-plus-phase activations minus unmodulated plus-phase activations (equation A.19).
   (e) Increment the weights according to the net weight change.

A.2 Point Neuron Activation Function. Leabra uses a point neuron activation function that models the electrophysiological properties of real neurons, while simplifying their geometry to a single point. The membrane potential V_m is updated as a function of ionic conductances g with reversal (driving) potentials E as follows:

\[
\Delta V_m(t) = \tau \sum_c g_c(t)\, \overline{g}_c\, (E_c - V_m(t)), \tag{A.1}
\]
with three channels (c) corresponding to e excitatory input, l leak current, and i inhibitory input. Following electrophysiological convention, the overall conductance is decomposed into a time-varying component g_c(t) computed as a function of the dynamic state of the network, and a constant \(\overline{g}_c\) that controls the relative influence of the different conductances. The excitatory net input/conductance g_e(t) or η_j is computed as the proportion of open excitatory channels as a function of sending activations times the weight values:

\[
\eta_j = g_e(t) = \langle x_i w_{ij} \rangle = \frac{1}{n} \sum_i x_i w_{ij}. \tag{A.2}
\]
The inhibitory conductance is computed via the kWTA function described in the next section, and leak is a constant. Activation communicated to other cells (y_j) is a thresholded (Θ) sigmoidal function of the membrane potential with gain parameter γ:

\[
y_j(t) = \frac{1}{1 + \dfrac{1}{\gamma\, [V_m(t) - \Theta]_+}}, \tag{A.3}
\]
where [x]_+ is a threshold function that returns 0 if x < 0 and x if x > 0. Note that if it returns 0, we assume y_j(t) = 0, to avoid dividing by 0. To produce a less discontinuous deterministic function with a softer threshold, the function is convolved with a gaussian noise kernel (µ = 0, σ = .005), which reflects the intrinsic processing noise of biological neurons:

\[
y_j^*(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-z^2/(2\sigma^2)}\, y_j(z - x)\, dz, \tag{A.4}
\]
where x represents the [V_m(t) − Θ]_+ value, and y_j^*(x) is the noise-convolved activation for that value. In the simulation, this function is implemented using a numerical lookup table.

A.3 k-Winners-Take-All Inhibition. Leabra uses a kWTA (k-Winners-Take-All) function to achieve inhibitory competition among units within a layer (area). The kWTA function computes a uniform level of inhibitory current g_i for all units in the layer, such that the (k + 1)th most excited unit within a layer is generally below its firing threshold, while the kth is typically above threshold:

\[
g_i = g_{k+1}^{\Theta} + q\,(g_k^{\Theta} - g_{k+1}^{\Theta}), \tag{A.5}
\]
where 0 < q < 1 (.25 default used here) is a parameter for setting the inhibition between the upper bound of g_k^Θ and the lower bound of g_{k+1}^Θ. These boundary inhibition values are computed as a function of the level of inhibition necessary to keep a unit right at threshold:

\[
g_i^{\Theta} = \frac{g_e^*\, \overline{g}_e\, (E_e - \Theta) + g_l\, \overline{g}_l\, (E_l - \Theta)}{\Theta - E_i}, \tag{A.6}
\]
where g_e^* is the excitatory net input without the bias weight contribution. This allows the bias weights to override the kWTA constraint. In the basic version of the kWTA function, which is relatively rigid about the kWTA constraint and is therefore used for output layers, g_k^Θ and g_{k+1}^Θ are set to the threshold inhibition value for the kth and (k + 1)th most excited units, respectively. In the average-based kWTA version, g_k^Θ is the average
g_i^Θ value for the top k most excited units, and g_{k+1}^Θ is the average of g_i^Θ for the remaining n − k units. This version allows more flexibility in the actual number of units active depending on the nature of the activation distribution in the layer.
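To make the preceding two sections concrete, the sketch below strings equations A.2, A.1, A.3, A.5, and A.6 together for one settling cycle of a single layer. It is a minimal NumPy illustration under our own simplifications: the noise convolution of equation A.4 (implemented as a lookup table in the actual simulation) is omitted, and all function and variable names are ours, not part of the published model code.

```python
import numpy as np

# Constants from Table 5: reversal potentials E, fixed conductances g-bar,
# threshold, time constant, and gain.
E_E, E_L, E_I = 1.00, 0.15, 0.15
G_E, G_L, G_I = 1.0, 0.10, 1.0
THETA, TAU, GAMMA = 0.25, 0.02, 600.0

def net_input(x, w):
    """Equation A.2: average of sending activations times weights."""
    return (x @ w) / len(x)

def kwta(g_e, k, q=0.25):
    """Equations A.5-A.6 (basic kWTA): one inhibitory conductance per layer."""
    # Inhibition that would hold each unit exactly at threshold (equation A.6;
    # the leak term is constant, so g_l * g-bar_l reduces to G_L here).
    g_theta = (g_e * G_E * (E_E - THETA) + G_L * (E_L - THETA)) / (THETA - E_I)
    top = np.sort(g_theta)[::-1]
    # Place g_i between the kth and (k+1)th most excited units (equation A.5).
    return top[k] + q * (top[k - 1] - top[k])

def settle_step(v_m, g_e, g_i):
    """Equation A.1: one Euler step of the point-neuron membrane update."""
    dv = TAU * (g_e * G_E * (E_E - v_m) + G_L * (E_L - v_m) + g_i * G_I * (E_I - v_m))
    return v_m + dv

def activation(v_m):
    """Equation A.3: thresholded sigmoid (noise convolution of A.4 omitted)."""
    x = np.maximum(v_m - THETA, 0.0)
    y = np.zeros_like(x)
    y[x > 0] = 1.0 / (1.0 + 1.0 / (GAMMA * x[x > 0]))
    return y
```

Iterating settle_step and activation over cycles, with g_i recomputed by kwta on each cycle, approximates one settling phase for a layer.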
A.4 Hebbian and Error-Driven Learning. For learning, Leabra uses a combination of error-driven and Hebbian learning. The error-driven component is the symmetric midpoint version of the GeneRec algorithm (O’Reilly, 1996), which is functionally equivalent to the deterministic Boltzmann machine and contrastive Hebbian learning (CHL). The network settles in two phases—an expectation (minus) phase, where the network’s actual output is produced, and an outcome (plus) phase, where the target output is experienced—and then computes a simple difference of a pre- and postsynaptic activation product across these two phases. For Hebbian learning, Leabra uses essentially the same learning rule used in competitive learning or mixtures-of-gaussians, which can be seen as a variant of the Oja normalization (Oja, 1982). The error-driven and Hebbian learning components are combined additively at each connection to produce a net weight change. The equation for the Hebbian weight change is

\[
\Delta_{hebb} w_{ij} = x_i^+ y_j^+ - y_j^+ w_{ij} = y_j^+ (x_i^+ - w_{ij}), \tag{A.7}
\]
and for error-driven learning using CHL,

\[
\Delta_{err} w_{ij} = (x_i^+ y_j^+) - (x_i^- y_j^-), \tag{A.8}
\]
which is subject to a soft weight bounding to keep within the 0–1 range:

\[
\Delta_{sberr} w_{ij} = [\Delta_{err}]_+ (1 - w_{ij}) + [\Delta_{err}]_-\, w_{ij}. \tag{A.9}
\]
The two terms are then combined additively with a normalized mixing constant k_hebb:

\[
\Delta w_{ij} = \epsilon \left[ k_{hebb}\, (\Delta_{hebb}) + (1 - k_{hebb})\, (\Delta_{sberr}) \right]. \tag{A.10}
\]
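In code, the full weight update of equations A.7 to A.10 is a few lines. The sketch below is our own illustration (names and array layout are assumed, with senders indexing rows and receivers indexing columns), not the model source:

```python
import numpy as np

def leabra_dwt(x_minus, y_minus, x_plus, y_plus, w, k_hebb=0.01, lrate=0.01):
    """Combined Leabra weight change, equations A.7-A.10 (illustrative sketch).

    x_*: sending activations; y_*: receiving activations; w: weight matrix
    (senders x receivers); lrate plays the role of epsilon.
    """
    # Hebbian term (equation A.7): moves weights toward sender activity,
    # gated by receiver plus-phase activity.
    d_hebb = np.outer(x_plus, y_plus) - y_plus[np.newaxis, :] * w
    # CHL error-driven term (equation A.8): plus-phase minus minus-phase coproducts.
    d_err = np.outer(x_plus, y_plus) - np.outer(x_minus, y_minus)
    # Soft weight bounding (equation A.9): positive changes scaled by (1 - w),
    # negative changes scaled by w, keeping weights in [0, 1].
    d_sberr = np.maximum(d_err, 0) * (1.0 - w) + np.minimum(d_err, 0) * w
    # Additive combination (equation A.10).
    return lrate * (k_hebb * d_hebb + (1.0 - k_hebb) * d_sberr)
```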
A.5 PVLV Equations. See O’Reilly et al. (2005) for further details on the PVLV system. We assume that time is discretized into steps that correspond to environmental events (e.g., the presentation of a CS or US). All of the following equations operate on variables that are a function of the current time step t. We omit the t in the notation because it would be redundant. PVLV is composed of two systems, PV (primary value) and LV (learned value), each of which in turn is composed of two subsystems (excitatory and inhibitory). Thus, there are four main value representation layers in
PVLV (PVe, PVi, LVe, LVi), which then drive the dopamine (DA) layers (VTA/SNc).

A.5.1 Value Representations. The PVLV value layers use standard Leabra activation and kWTA dynamics as described above, with the following modifications. They have a three-unit distributed representation of the scalar values they encode, where the units have preferred values of (0, .5, 1). The overall value represented by the layer is the weighted average of the unit’s activation times its preferred value, and this decoded average is displayed visually in the first unit in the layer. The activation function of these units is a “noisy” linear function (i.e., without the x/(x + 1) nonlinearity, to produce a linear value representation, but still convolved with gaussian noise to soften the threshold, as for the standard units, equation A.4), with gain γ = 220, noise variance σ = .01, and a lower threshold Θ = .17. The k for kWTA (average based) is 1, and the q value is .9 (instead of the default of .6). These values were obtained by optimizing the match for value represented with varying frequencies of 0-1 reinforcement (e.g., the value should be close to .4 when the layer is trained with 40% 1 values and 60% 0 values). Note that having different units for different values, instead of the typical use of a single unit with linear activations, allows much more complex mappings to be learned. For example, units representing high values can have completely different patterns of weights than those encoding low values, whereas a single unit is constrained by virtue of having one set of weights to have a monotonic mapping onto scalar values.

A.5.2 Learning Rules. The PVe layer does not learn and is always just clamped to reflect any received reward value (r). By default, we use a value of 0 to reflect negative feedback, .5 for no feedback, and 1 for positive feedback (the scale is arbitrary). The PVi layer units (y_j) are trained at every point in time to produce an expectation for the amount of reward that will be received at that time. In the minus phase of a given trial, the units settle to a distributed value representation based on sensory inputs. This results in unit activations y_j^− and an overall weighted average value across these units denoted PV_i. In the plus phase, the unit activations (y_j^+) are clamped to represent the actual reward r (a.k.a. PV_e). The weights (w_ij) into each PVi unit from sending units with plus-phase activations x_i^+ are updated using the delta rule between the two phases of PVi unit activation states:

\[
\Delta w_{ij} = \epsilon\, (y_j^+ - y_j^-)\, x_i^+. \tag{A.11}
\]
This is equivalent to saying that the US/reward drives a pattern of activation over the PVi units, which then learn to activate this pattern based on sensory inputs.
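The three-unit value code of section A.5.1 can be illustrated briefly. The decoding below is the weighted average described above; the encoding scheme (linear interpolation between the two nearest preferred values) is our own assumption for the sketch, not a detail given in the text:

```python
import numpy as np

PREFERRED = np.array([0.0, 0.5, 1.0])  # preferred values of the three units

def decode_value(acts):
    """Normalized weighted average of unit activations times preferred values."""
    return float(acts @ PREFERRED / acts.sum())

def encode_value(v):
    """Distributed activation pattern for scalar v in [0, 1] (assumed:
    linear interpolation between the two nearest preferred values)."""
    return np.maximum(0.0, 1.0 - np.abs(PREFERRED - v) / 0.5)

# Example: a value of .4 (e.g., 40% reinforcement) activates the 0 and .5 units.
acts = encode_value(0.4)   # -> [0.2, 0.8, 0.0]
print(decode_value(acts))  # -> 0.4
```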
The LVe and LVi layers learn in much the same way as the PVi layer (see equation A.11), except that the PV system filters the training of the LV values, such that they learn only from actual reward outcomes (or when reward is expected by the PV system but is not delivered), and not when no rewards are present or expected. This condition is

\[
PV_{filter} = PV_i < \theta_{min} \,\lor\, PV_e < \theta_{min} \,\lor\, PV_i > \theta_{max} \,\lor\, PV_e > \theta_{max} \tag{A.12}
\]

\[
\Delta w_i = \begin{cases} \epsilon\,(y_j^+ - y_j^-)\, x_i^+ & \text{if } PV_{filter} \\ 0 & \text{otherwise,} \end{cases} \tag{A.13}
\]
where θ_min is a lower threshold (.2 by default), below which negative feedback is indicated, and θ_max is an upper threshold (.8), above which positive feedback is indicated (otherwise, no feedback is indicated). Biologically, this filtering requires that the LV systems be driven directly by primary rewards (which is reasonable and required by the basic learning rule anyway) and that they learn from DA dips driven by high PVi expectations of reward that are not met. The only difference between the LVe and LVi systems is the learning rate ε, which is .05 for LVe and .001 for LVi. Thus, the inhibitory LVi system serves as a slowly integrating inhibitory cancellation mechanism for the rapidly adapting excitatory LVe system.

The four PV and LV distributed value representations drive the dopamine layer (VTA/SNc) activations in terms of the difference between the excitatory and inhibitory terms for each. Thus, there is a PV delta and an LV delta:

\[
\delta_{pv} = PV_e - PV_i \tag{A.14}
\]
\[
\delta_{lv} = LV_e - LV_i. \tag{A.15}
\]
With the differences in learning rate between LVe (fast) and LVi (slow), the LV delta signal reflects recent deviations from expectations and not the raw expectations themselves, just as the PV delta reflects deviations from expectations about primary reward values. This is essential for learning to converge and stabilize when the network has mastered the task (as the results presented in the article show). We also impose a minimum value on the LVi term of .1, so that there is always some expectation. This ensures that low LVe learned values result in negative deltas. These two delta signals need to be combined to provide an overall DA delta value, as reflected in the firing of the VTA and SNc units. One sensible way of doing so is to have the PV system dominate at the time of primary
rewards, while the LV system dominates otherwise, using the same PV-based filtering as holds in the LV learning rule (see equation A.13):

\[
\delta = \begin{cases} \delta_{pv} & \text{if } PV_{filter} \\ \delta_{lv} & \text{otherwise.} \end{cases} \tag{A.16}
\]
It turns out that a slight variation of this, where the LV always contributes, works slightly better and is what is used in this article:

\[
\delta = \delta_{lv} + \begin{cases} \delta_{pv} & \text{if } PV_{filter} \\ 0 & \text{otherwise.} \end{cases} \tag{A.17}
\]
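Equations A.12, A.14, A.15, and A.17 combine into a short routine. The sketch below (our own naming, with the LVi floor mentioned in the text) shows the logic:

```python
def pv_filter(pv_i, pv_e, theta_min=0.2, theta_max=0.8):
    """Equation A.12: true when reward is delivered or strongly expected."""
    return (pv_i < theta_min or pv_e < theta_min or
            pv_i > theta_max or pv_e > theta_max)

def dopamine_delta(pv_e, pv_i, lv_e, lv_i):
    """Equations A.14, A.15, and A.17: combined DA signal."""
    lv_i = max(lv_i, 0.1)          # minimum LVi expectation (see text)
    d_pv = pv_e - pv_i             # equation A.14
    d_lv = lv_e - lv_i             # equation A.15
    # Equation A.17: the LV delta always contributes; the PV delta
    # contributes only when the PV filter condition holds.
    return d_lv + (d_pv if pv_filter(pv_i, pv_e) else 0.0)
```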
A.5.3 Synaptic Depression of LV Weights. The weights into the LV units are subject to synaptic depression, which makes them sensitive to changes in stimulus inputs, and not to static, persistent activations (Abbott, Varela, Sen, & Nelson, 1997). Each incoming weight has an effective weight value w* that is subject to depression and recovery changes as follows:

\[
\Delta w_i^* = R\,(w_i - w_i^*) - D\, x_i w_i, \tag{A.18}
\]
where R is the recovery parameter, D is the depression parameter, and w_i is the asymptotic weight value. For simplicity, we compute these changes at the end of every trial instead of in an online manner, using R = 1 and D = 1, which produces discrete one-trial depression and recovery.

A.6 Special Basal Ganglia Mechanisms

A.6.1 Striatal Learning Function. Each stripe (group of units) in the Striatum layer is divided into Go versus NoGo in an alternating fashion. The DA input from the SNc modulates these unit activations in the update phase by providing extra excitatory current to Go and extra inhibitory current to NoGo units in proportion to the positive magnitude of the DA signal, and vice versa for negative DA magnitude. This reflects the opposing influences of DA on these neurons (Frank, 2005; Gerfen, 2000). This update-phase DA signal reflects the PVLV system’s evaluation of the PFC updates produced by gating signals in the plus phase (see Figure 8). Learning on weights into the Go/NoGo units is based on the activation delta between the update (++) and plus phases:

\[
\Delta w_i = \epsilon\, x_i\, (y^{++} - y^+). \tag{A.19}
\]
To reflect the finding that DA modulation has a contrast-enhancing function in the striatum (Frank, 2005; Nicola, Surmeier, & Malenka, 2000;
Hernandez-Lopez, Bargas, Surmeier, Reyes, & Galarraga, 1997) and to produce more of a credit assignment effect in learning, the DA modulation is partially a function of the previous plus-phase activation state:

\[
g_e = \gamma\,[da]_+\, y^+ + (1 - \gamma)\,[da]_+, \tag{A.20}
\]
where 0 < γ < 1 controls the degree of contrast enhancement (.5 is used in all simulations), [da]+ is the positive magnitude of the DA signal (0 if negative), y+ is the plus-phase unit activation, and ge is the extra excitatory current produced by the da (for Go units). A similar equation is used for extra inhibition (gi ) from negative da ([da]− ) for Go units, and vice versa for NoGo units. A.6.2 SNrThal Units. The SNrThal units provide a simplified version of the SNr/GPe/Thalamus layers. They receive a net input that reflects the normalized Go–NoGo activations in the corresponding Striatum stripe:
\[
\eta_j = \left[ \frac{\text{Go} - \text{NoGo}}{\text{Go} + \text{NoGo}} \right]_+ \tag{A.21}
\]
(where [·]_+ indicates that only the positive part is taken; when there is more NoGo than Go, the net input is 0). This net input then drives standard Leabra point neuron activation dynamics, with kWTA inhibitory competition dynamics that cause stripes to compete to update the PFC. This dynamic is consistent with the notion that competition and selection take place primarily in the smaller GP/SNr areas, and not much in the much larger striatum (e.g., Mink, 1996; Jaeger, Kita, & Wilson, 1994). The resulting SNrThal activation then provides the gating update signal to the PFC: if the corresponding SNrThal unit is active (above a minimum threshold; .1), then active maintenance currents in the PFC are toggled. This SNrThal activation also multiplies the per stripe DA signal from the SNc:

\[
\delta_j = snr_j\, \delta, \tag{A.22}
\]
where snrj is the snr unit’s activation for stripe j, and δ is the global DA signal, equation A.16. A.6.3 Random Go Firing. The PBWM system learns only after Go firing, so if it never fires Go, it can never learn to improve performance. One simple solution is to induce Go firing if a Go has not fired after some threshold number of trials. However, this threshold would have to be either task specific or set very high, because it would effectively limit the maximum maintenance duration of the PFC (because by updating
PFC, the Go firing results in loss of currently maintained information). Therefore, we have adopted a somewhat more sophisticated mechanism that keeps track of the average DA value present when each stripe fires a Go:

\[
\overline{da}_k = \overline{da}_k + \epsilon\,(da_k - \overline{da}_k). \tag{A.23}
\]
If this value is < 0 and a stripe has not fired Go within 10 trials, a random Go firing is triggered with some probability (.1). We also compare the relative per stripe DA averages; if the per stripe DA average is low but above zero, and one stripe’s \(\overline{da}_k\) is .05 below the average of the other stripes’,

\[
\text{if } (\overline{da}_k < .1) \text{ and } (\overline{da}_k - \langle \overline{da} \rangle < -.05): \text{Go}, \tag{A.24}
\]
a random Go is triggered, again with some probability (.1). Finally, we also fire random Go in all stripes with some very low baseline probability (.0001) to encourage exploration. When a random Go fires, we set the SNrThal unit activation to be above the Go threshold, and we apply a positive DA signal to the corresponding striatal stripe, so that it has an opportunity to learn to fire for this input pattern on its own in the future.

A.6.4 PFC Maintenance. PFC active maintenance is supported in part by excitatory ionic conductances that are toggled by Go firing from the SNrThal layers. This is implemented with an extra excitatory ion channel in the basic V_m update equation A.1. This channel has a conductance value of .5 when active. (See Frank et al., 2001, for further discussion of this kind of maintenance mechanism, which has been proposed by several researchers—e.g., Lewis & O’Donnell, 2000; Fellous et al., 1998; Wang, 1999; Dilmore, Gutkin, & Ermentrout, 1999; Gorelova & Yang, 2000; Durstewitz, Seamans, & Sejnowski, 2000b.) The first opportunity to toggle PFC maintenance occurs at the end of the first plus phase and then again at the end of the second plus phase (third phase of settling). Thus, a complete update can be triggered by two Go’s in a row, and it is almost always the case that if a Go fires the first time, it will fire the next, because Striatum firing is primarily driven by sensory inputs, which remain constant.
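The basal ganglia mechanisms of section A.6 can be summarized in a few lines. The sketch below implements the DA modulation of equation A.20, the SNrThal net input of equation A.21, and the stripe-wise DA signal of equation A.22; array shapes and names are our own, and the full model adds kWTA competition and the random-Go mechanism on top of this:

```python
import numpy as np

def striatal_da_current(da, y_plus, gamma=0.5):
    """Equation A.20: extra excitatory current to Go units from positive DA,
    partially modulated by the previous plus-phase activation y_plus."""
    da_pos = max(da, 0.0)
    return gamma * da_pos * y_plus + (1.0 - gamma) * da_pos

def snrthal_netinput(go, nogo):
    """Equation A.21: normalized Go - NoGo activation, rectified at zero."""
    return max((go - nogo) / (go + nogo), 0.0) if (go + nogo) > 0 else 0.0

def per_stripe_da(snr_acts, delta):
    """Equation A.22: the global DA delta, gated per stripe by SNrThal activity."""
    return np.asarray(snr_acts) * delta

# A stripe whose SNrThal unit is above the Go threshold (.1) toggles PFC
# maintenance; its learning signal is the stripe-wise DA value.
gating = per_stripe_da([0.8, 0.0, 0.3], delta=0.5)  # -> [0.4, 0.0, 0.15]
```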
Acknowledgments This work was supported by ONR grants N00014-00-1-0246 and N00014-031-0428 and NIH grants MH64445 and MH069597. Thanks to Todd Braver, Jon Cohen, Peter Dayan, David Jilk, David Noelle, Nicolas Rougier, Tom
Hazy, Daniel Cer, and members of the CCN Lab for feedback and discussion on this work.
References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220.
Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9, 357–381.
Amos, A. (2000). A computational model of information processing in the frontal cortex and basal ganglia. Journal of Cognitive Neuroscience, 12, 505–519.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036–1060.
Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Erlbaum.
Baddeley, A. D. (1986). Working memory. New York: Oxford University Press.
Baddeley, A., Gathercole, S., & Papagno, C. (1998). The phonological loop as a language learning device. Psychological Review, 105, 158.
Beiser, D. G., & Houk, J. C. (1998). Model of cortical-basal ganglionic processing: Encoding the serial order of sensory events. Journal of Neurophysiology, 79, 3168–3188.
Berns, G. S., & Sejnowski, T. J. (1995). How the basal ganglia make decisions. In A. Damasio, H. Damasio, & Y. Christen (Eds.), Neurobiology of decision-making (pp. 101–113). Berlin: Springer-Verlag.
Berns, G. S., & Sejnowski, T. J. (1998). A computational model of how the basal ganglia produces sequences. Journal of Cognitive Neuroscience, 10, 108–121.
Braver, T. S., & Cohen, J. D. (2000). On the control of control: The role of dopamine in regulating prefrontal function and working memory. In S. Monsell & J. Driver (Eds.), Control of cognitive processes: Attention and performance XVIII (pp. 713–737). Cambridge, MA: MIT Press.
Burgess, N., & Hitch, G. J. (1999). Memory for serial order: A network model of the phonological loop and its timing. Psychological Review, 106, 551–581.
Cardinal, R. N., Parkinson, J. A., Hall, J., & Everitt, B. J. (2002). Emotion and motivation: The role of the amygdala, ventral striatum, and prefrontal cortex. Neuroscience and Biobehavioral Reviews, 26, 321–352.
Cohen, J. D., Braver, T. S., & O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control, and schizophrenia: Recent developments and current challenges. Philosophical Transactions of the Royal Society (London) B, 351, 1515–1527.
Contreras-Vidal, J. L., & Schultz, W. (1999). A predictive reinforcement model of dopamine neurons for learning approach behavior. Journal of Computational Neuroscience, 6, 191–214.
Dilmore, J. G., Gutkin, B. G., & Ermentrout, G. B. (1999). Effects of dopaminergic modulation of persistent sodium currents on the excitability of prefrontal cortical neurons: A computational study. Neurocomputing, 26, 104–116.
Dominey, P., Arbib, M., & Joseph, J.-P. (1995). A model of corticostriatal plasticity for learning oculomotor associations and sequences. Journal of Cognitive Neuroscience, 7, 311–336.
Durstewitz, D., Kelc, M., & Gunturkun, O. (1999). A neurocomputational theory of the dopaminergic modulation of working memory functions. Journal of Neuroscience, 19, 2807.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000a). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. Journal of Neurophysiology, 83, 1733–1750.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000b). Neurocomputational models of working memory. Nature Neuroscience, 3 (Suppl.), 1184–1191.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Emerson, M. J., & Miyake, A. (2003). The role of inner speech in task switching: A dual-task investigation. Journal of Memory and Language, 48, 148–168.
Fellous, J. M., Wang, X. J., & Lisman, J. E. (1998). A role for NMDA-receptor channels in working memory. Nature Neuroscience, 1, 273–275.
Frank, M. J. (2005). Dynamic dopamine modulation in the basal ganglia: A neurocomputational account of cognitive deficits in medicated and non-medicated Parkinsonism. Journal of Cognitive Neuroscience, 17, 51–72.
Frank, M. J., & Claus, E. D. (2005). Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making and reversal. Manuscript submitted for publication.
Frank, M. J., Loughry, B., & O’Reilly, R. C. (2001). Interactions between the frontal cortex and basal ganglia in working memory: A computational model. Cognitive, Affective, and Behavioral Neuroscience, 1, 137–160.
Frank, M. J., Seeberger, L., & O’Reilly, R. C. (2004). By carrot or by stick: Cognitive reinforcement learning in Parkinsonism. Science, 306, 1940–1943.
Gerfen, C. R. (2000). Molecular effects of dopamine on striatal projection pathways. Trends in Neurosciences, 23, S64–S70.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12, 2451–2471.
Gorelova, N. A., & Yang, C. R. (2000). Dopamine D1/D5 receptor activation modulates a persistent sodium current in rat’s prefrontal cortical neurons in vitro. Journal of Neurophysiology, 84, 75.
Graybiel, A. M., & Kimura, M. (1995). Adaptive neural networks in the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 103–116). Cambridge, MA: MIT Press.
Haber, S. N., Fudge, J. L., & McFarland, N. R. (2000). Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. Journal of Neuroscience, 20, 2369–2382.
Hernandez-Lopez, S., Bargas, J., Surmeier, D. J., Reyes, A., & Galarraga, E. (1997). D1 receptor activation enhances evoked discharge in neostriatal medium spiny neurons by modulating an L-type Ca2+ conductance. Journal of Neuroscience, 17, 3334–3342.
Hernandez-Lopez, S., Tkatch, T., Perez-Garci, E., Galarraga, E., Bargas, J., Hamm, H., & Surmeier, D. J. (2000). D2 dopamine receptors in striatal medium spiny neurons reduce L-type Ca2+ currents and excitability via a novel PLCβ1-IP3-calcineurin-signaling cascade. Journal of Neuroscience, 20, 8987–8995.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 233–248). Cambridge, MA: MIT Press.
Houk, J. C., & Wise, S. P. (1995). Distributed modular architectures linking basal ganglia, cerebellum, and cerebral cortex: Their role in planning and controlling action. Cerebral Cortex, 5, 95–110.
Jackson, S., & Houghton, G. (1995). Sensorimotor selection and the basal ganglia: A neural network model. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 337–368). Cambridge, MA: MIT Press.
Jaeger, D., Kita, H., & Wilson, C. J. (1994). Surround inhibition among projection neurons is weak or nonexistent in the rat neostriatum. Journal of Neurophysiology, 72, 2555–2558.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547.
Joel, D., & Weiner, I. (2000). The connections of the dopaminergic system with the striatum in rats and primates: An analysis with respect to the functional and compartmental organization of the striatum. Neuroscience, 96, 451.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the 8th Conference of the Cognitive Science Society (pp. 531–546). Hillsdale, NJ: Erlbaum.
Kropotov, J. D., & Etlinger, S. C. (1999). Selection of actions in the basal ganglia-thalamocortical circuits: Review and model. International Journal of Psychophysiology, 31, 197–217.
Levitt, J. B., Lewis, D. A., Yoshioka, T., & Lund, J. S. (1993). Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 and 46). Journal of Comparative Neurology, 338, 360–376.
Lewis, B. L., & O’Donnell, P. (2000). Ventral tegmental area afferents to the prefrontal cortex maintain membrane potential “up” states in pyramidal neurons via D1 dopamine receptors. Cerebral Cortex, 10, 1168–1175.
Middleton, F. A., & Strick, P. L. (2000). Basal ganglia and cerebellar loops: Motor and cognitive circuits. Brain Research Reviews, 31, 236–250.
Mink, J. W. (1996). The basal ganglia: Focused selection and inhibition of competing motor programs. Progress in Neurobiology, 50, 381–425.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947.
Morton, J. B., & Munakata, Y. (2002). Active versus latent representations: A neural network model of perseveration and dissociation in early childhood. Developmental Psychobiology, 40, 255–265.
Nakahara, H., Doya, K., & Hikosaka, O. (2001). Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences—a computational approach. Journal of Cognitive Neuroscience, 13, 626–647.
Nicola, S. M., Surmeier, J., & Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23, 185–215.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
O’Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5), 895–938.
O’Reilly, R. C. (1998). Six principles for biologically-based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11), 455–462.
O’Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1242.
O’Reilly, R. C., Braver, T. S., & Cohen, J. D. (1999). A biologically based computational model of working memory. In A. Miyake & P. Shah (Eds.), Models of working memory: Mechanisms of active maintenance and executive control (pp. 375–411). Cambridge: Cambridge University Press.
O’Reilly, R. C., & Frank, M. J. (2003). Making working memory work: A computational model of learning in the frontal cortex and basal ganglia (ICS Tech. Rep. 03-03, revised 8/04). University of Colorado Institute of Cognitive Science.
O’Reilly, R. C., Frank, M. J., Hazy, T. E., & Watz, B. (2005). Rewards are timeless: The primary value and learned value (PVLV) Pavlovian learning algorithm. Manuscript submitted for publication.
O’Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
O’Reilly, R. C., Noelle, D., Braver, T. S., & Cohen, J. D. (2002). Prefrontal cortex and dynamic categorization tasks: Representational organization and neuromodulatory control. Cerebral Cortex, 12, 246–257.
O’Reilly, R. C., & Soto, R. (2002). A model of the phonological loop: Generalization and binding. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Pucak, M. L., Levitt, J. B., Lund, J. S., & Lewis, D. A. (1996). Patterns of intrinsic and associational circuitry in monkey prefrontal cortex. Journal of Comparative Neurology, 376, 614–630.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variation in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Theory and research (pp. 64–99). New York: Appleton-Century-Crofts.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department.
Rougier, N. P., Noelle, D., Braver, T. S., Cohen, J. D., & O’Reilly, R. C. (2005). Prefrontal cortex and the flexibility of cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20), 7338–7343.
Rougier, N. P., & O’Reilly, R. C. (2002). Learning representations in a gated prefrontal cortex model of dynamic task switching. Cognitive Science, 26, 503–520.
Schmidhuber, J. (1992). Learning unambiguous reduced sequence descriptions. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 291–298). San Mateo, CA: Morgan Kaufmann.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1.
Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900–913.
Schultz, W., Apicella, P., Scarnati, D., & Ljungberg, T. (1992). Neuronal activity in monkey ventral striatum related to the expectation of reward. Journal of Neuroscience, 12, 4595–4610.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593.
Schultz, W., Romo, R., Ljungberg, T., Mirenowicz, J., Hollerman, J. R., & Dickinson, A. (1995). Reward-related signals carried by dopamine neurons. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 233–248). Cambridge, MA: MIT Press.
Suri, R. E., Bargas, J., & Arbib, M. A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience, 103, 65–85.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19, 9587.
Wickens, J. (1993). A theory of the striatum. Oxford: Pergamon Press.
Wickens, J. R., Kotter, R., & Alexander, M. E. (1995). Effects of local connectivity on striatal function: Simulation and analysis of a model. Synapse, 20, 281–298.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4 (pp. 96–104). New York: Institute of Radio Engineers.
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.
Received August 9, 2004; accepted June 14, 2005.
LETTER
Communicated by Mikhail Lebedev
Identification of Multiple-Input Systems with Highly Coupled Inputs: Application to EMG Prediction from Multiple Intracortical Electrodes David T. Westwick
[email protected] Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, T2N 1N4, Canada.
Eric A. Pohlmeyer
[email protected] Department of Biomedical Engineering, Northwestern University, Evanston, IL 60208, U.S.A.
Sara A. Solla
[email protected] Department of Physiology, Northwestern Medical School, Chicago, IL, 60611, and Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, U.S.A.
Lee E. Miller
[email protected] Department of Physiology, Northwestern Medical School, Chicago, IL 60611, U.S.A.
Eric J. Perreault
[email protected] Department of Physical Medicine and Rehabilitation, Northwestern University Medical School, Chicago, IL 60611, U.S.A.
A robust identification algorithm has been developed for linear, timeinvariant, multiple-input single-output systems, with an emphasis on how this algorithm can be used to estimate the dynamic relationship between a set of neural recordings and related physiological signals. The identification algorithm provides a decomposition of the system output such that each component is uniquely attributable to a specific input signal, and then reduces the complexity of the estimation problem by discarding those input signals that are deemed to be insignificant. Numerical difficulties due to limited input bandwidth and correlations among the inputs are addressed using a robust estimation technique based on singular value decomposition. The algorithm has been evaluated on both simulated and experimental data. The latter involved estimating the Neural Computation 18, 329–355 (2006)
C 2005 Massachusetts Institute of Technology
relationship between up to 40 simultaneously recorded motor cortical signals and peripheral electromyograms (EMGs) from four upper limb muscles in a freely moving primate. The algorithm performed well in both cases: it provided reliable estimates of the system output and significantly reduced the number of inputs needed for output prediction. For example, although physiological recordings from up to 40 different neuronal signals were available, the input selection algorithm reduced this to 10 neuronal signals that made significant contributions to the recorded EMGs.

1 Introduction

Recent advances in microelectrode array technology have made it possible to record from multiple neurons simultaneously (Maynard, Nordhausen, & Normann, 1997; Williams, Rennaker, & Kipke, 1999; Nicolelis et al., 2003). This capability may enhance our understanding of communication within the central nervous system (CNS) and may allow the development of brain-machine interfaces (BMIs) that provide enhanced communication and control for individuals with significant neurological disorders (Donoghue, 2002; Mussa-Ivaldi & Miller, 2003). However, the potential of these recording devices is limited by the current methods available for processing multichannel recordings. The purpose of this study is to develop robust and efficient algorithms for determining the dynamic relationship between a set of neural recordings and a continuous time signal related to those recordings.

Much of the research incorporating multielectrode arrays has focused on using intracortical recordings to predict kinematic and dynamic features of hand motion in freely moving primates and on using these predictions as a basis for developing cortical BMIs. To date, a number of linear and nonlinear algorithms have been used to generate the map between cortical activity and specific movement variables (Serruya, Hatsopoulos, Paninski, Fellows, & Donoghue, 2002; Taylor, Tillery, & Schwartz, 2002; Carmena et al., 2003). In general, linear models have been found to perform nearly as well as nonlinear models for the prediction of continuous movement variables (Wessberg et al., 2000; Gao, Black, Bienenstock, Wu, & Donoghue, 2003). However, nonlinear models can have advantages for predicting the rest periods between movements or the peak velocities during the fastest movements in a continuous movement sequence (Kim et al., 2003). As might be expected, performance of either type of model generally improves with increasing numbers of neurons. Prediction accuracy for various movement-related signals has been shown to increase with increasing numbers of neurons for as many as 250 simultaneously recorded cortical neurons (Carmena et al., 2003). However, Sanchez et al. (2004) recently demonstrated that prediction accuracy can be improved further by an appropriate selection of inputs.

Two main sources of error can arise when a large number of neurons are used as inputs to the system identification process. The first is that
correlations among neurons can result in a numerically ill-conditioned estimation problem. The second is that using too many inputs can lead to accurate fits of the data employed in the estimation process but poor generalization to new data sets. Although additional neural recordings can provide novel information, not all of this information may be relevant to the task or process under study. Therefore, techniques that reduce the dimensionality of the input signals could reduce the computational complexity of the system identification problem and possibly increase prediction accuracy. One approach to this problem has been to use principal component analysis (PCA) to generate a set of orthogonal inputs that span the space defined by the original data (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Isaacs, Weber, & Schwartz, 2000; Wu et al., 2003). Such techniques can reduce the dimensionality of the input signal space when there are correlations between the input signals. This reduction can enhance the robustness of the identification process, but it does not reduce the number of neural signals that need to be recorded, since each principal component is a linear combination of all available input signals. An alternative approach is to select the “most relevant” set of inputs (Sanchez et al., 2004). Once this selection is complete, spike sorting and subsequent signal processing can be restricted to the retained inputs. Input signals with restricted bandwidths also can lead to a numerically unstable identification problem and poor generalization. This problem manifests itself as the need to invert an ill-conditioned input correlation matrix, a problem that can be alleviated by using standard singular value decomposition (SVD) techniques to check for numerical instabilities (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004) during the inversion process. When such instabilities exist, robust estimates of the impulse response functions (IRFs) between the inputs and the output still may be obtained by using an SVD-based matrix pseudo-inverse (Westwick & Kearney, 1997b). The goal of this study is to develop robust tools for processing information obtained from large numbers of neural recordings and for determining the linear relationship between these recordings and a related physiological signal of interest. Specifically, we have developed an algorithm for selecting an optimal set of input signals, based on their unique contributions to the system output, and for developing robust predictors of the output from this subset of inputs. The performance of these novel methods is demonstrated on both simulated and experimental data; the latter consist of intracortical recordings from the primary motor cortex of a freely moving primate together with EMG data taken from several arm muscles. 2 Analytical Methods Consider a multiple-input single-output (MISO) system, represented by a bank of N linear finite impulse response (FIR) filters with memory length
M. Let x_k(t), for k = 1, 2, …, N, be measurements of the N input signals at time t, and let z(t) be the measured output. Then,

\[
z(t) = \sum_{k=1}^{N} \sum_{\tau=0}^{M-1} h_k(\tau)\, x_k(t - \tau) + w(t), \tag{2.1}
\]
where w(t) accounts for both noise in the output measurements and the effects of any additional, unmeasured inputs to the system. It is assumed to be zero mean and uncorrelated with all the measured inputs x_k(t), k = 1, …, N. The objective is to estimate the NM filter weights, h_k(τ) for τ = 0, …, M − 1 and k = 1, …, N, from input-output measurements x_k(t), z(t), for t = 1, …, T. Given sufficient data, ideally T ≫ NM, this can be accomplished by rewriting equation 2.1 as a matrix equation:

\[
z = Xh + w, \tag{2.2}
\]
where z and w are T-element vectors containing z(t) and w(t), respectively. The IRFs h_k(τ) are placed, in order, in the NM-element vector h:

\[
h = [h_1(0)\;\, h_1(1) \cdots h_1(M-1)\;\, h_2(0)\;\, h_2(1) \cdots h_N(M-1)]^T. \tag{2.3}
\]
Thus, X is the block-structured matrix

\[
X = [X_1 \;\; X_2 \;\; \cdots \;\; X_N], \tag{2.4}
\]
where the X_k are T × M matrices:

\[
X_k = \begin{bmatrix}
x_k(1) & 0 & \cdots & 0 \\
x_k(2) & x_k(1) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
x_k(T) & x_k(T-1) & \cdots & x_k(T-M+1)
\end{bmatrix}.
\]

Since the noise is uncorrelated with the inputs, the minimum mean squared error estimate of h can be obtained using the normal equation (Golub & Van Loan, 1989):

\[
\hat{h} = (X^T X)^{-1} X^T z. \tag{2.5}
\]
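As a concrete illustration of equations 2.2 to 2.5, the following NumPy sketch builds the block-structured regression matrix and solves the normal equation (via a least-squares solver, which is numerically preferable to an explicit inverse). The function names and toy dimensions are ours, not part of the paper:

```python
import numpy as np

def regression_matrix(inputs, M):
    """Stack the Toeplitz blocks X_k of delayed inputs into X = [X1 ... XN].

    inputs: T x N array of input signals; M: filter memory length.
    """
    T, N = inputs.shape
    X = np.zeros((T, N * M))
    for k in range(N):
        for tau in range(M):
            # Column tau of block k holds x_k(t - tau), zero-padded at the start.
            X[tau:, k * M + tau] = inputs[:T - tau, k]
    return X

def fir_estimate(inputs, z, M):
    """Least-squares estimate of the NM filter weights (equation 2.5)."""
    X = regression_matrix(inputs, M)
    h_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
    return h_hat.reshape(inputs.shape[1], M)  # one length-M IRF per input
```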
One disadvantage of the direct solution, equation 2.5, of the normal equation is that the matrices can become unacceptably large. For example, in
the neural processing experiment described in section 4.2, the system had 40 inputs, each filter was represented using a 52-tap FIR filter, and the identification was performed using up to 18,000 data points. Thus, direct application of equation 2.5 would require multiplying an 18,000 × 2,080 matrix with its transpose, and then computing the inverse of the resulting 2,080 × 2,080 matrix.

2.1 Auto- and Cross-Correlations. Perreault, Kirsch, and Acosta (1999) developed an efficient solution to the MISO system identification problem based on auto- and cross-correlation functions instead of direct use of the data. They have shown that the input-output relationship can be rewritten in terms of auto- and cross-correlation matrices:

\[
\begin{bmatrix} \phi_{x_1 z} \\ \vdots \\ \phi_{x_N z} \end{bmatrix}
=
\begin{bmatrix}
\Phi_{x_1 x_1} & \cdots & \Phi_{x_1 x_N} \\
\vdots & \ddots & \vdots \\
\Phi_{x_N x_1} & \cdots & \Phi_{x_N x_N}
\end{bmatrix}
\begin{bmatrix} h_1 \\ \vdots \\ h_N \end{bmatrix}, \tag{2.6}
\]

where h_k is a vector whose elements are the samples of the IRF h_k(τ), φ_{x_k z} is an M-element vector containing the cross-correlation φ_{x_k z}(τ), and Φ_{x_k x_l} is an M × M Toeplitz-structured matrix whose elements are the correlations Φ_{x_k x_l}(i, j) = φ_{x_k x_l}(i − j). In compact notation, equation 2.6 is written as

\[
\phi_{Xz} = \Phi_{XX}\, h. \tag{2.7}
\]
Perreault et al. (1999) estimated the IRFs of the linear filters by solving equation 2.7 exactly through an inversion of the matrix Φ_XX,

\[
\hat{h} = \Phi_{XX}^{-1}\, \phi_{Xz}, \tag{2.8}
\]

and noticed that the input correlation matrix Φ_XX could become ill-conditioned if either the input signals are strongly coupled with one another or they are severely band-limited. The equivalence of the two algorithms becomes evident when noting that

\[
\frac{1}{T}\, X^T X = \Phi_{XX} + O\!\left(\frac{M}{T}\right) \tag{2.9}
\]
\[
\frac{1}{T}\, X^T z = \phi_{Xz}, \tag{2.10}
\]

where O(M/T) indicates an error with magnitude of the same order as M/T. Thus, if T ≫ M, the solutions ĥ obtained from equations 2.5 and 2.8
will be virtually identical. Note that the error term of order M/T appears in the quadratic factor, equation 2.9, but not in the linear factor, equation 2.10. Furthermore, the corrections needed to make equation 2.9 exact can be implemented using the method suggested by Korenberg (1988), although the effect of these corrections is negligible unless the model’s memory length M is significant compared to the data length T.

2.2 Input Contributions to the Output. The neural input data may contain signals that are either unrelated to the output variable of interest or are highly correlated with one another. Both scenarios lead to increases in the variability of the estimated model and decreases in its ability to generalize. To address these problems, an algorithm was developed that locates and eliminates redundant or irrelevant inputs. The algorithm is a variation on the orthogonal least-squares technique (Chen, Cowan, & Grant, 1991), but it is based on a backward elimination approach rather than on forward regression (Miller, 1990). Thus, at each stage, this iterative algorithm computes the unique contribution that each input makes to the output and then eliminates the input that makes the smallest such contribution.

To find the component in the output that can be attributed only to the input x_k(t), construct the matrices M_1 = [X_1, …, X_{k−1}, X_{k+1}, …, X_N] and M_2 = X_k, orthogonalize M_2, which contains delayed copies of the input x_k(t), against the remaining inputs, stored in M_1, and then project the output z(t) onto these orthogonal columns using the following QR factorization (Golub & Van Loan, 1989):

\[
[M_1 \;\; M_2 \;\; z] = QR = [Q_1 \;\; Q_2 \;\; q_z]
\begin{bmatrix}
R_{11} & R_{12} & r_{1z} \\
0 & R_{22} & r_{2z} \\
0 & 0 & r_{3z}
\end{bmatrix}, \tag{2.11}
\]

where Q^T Q is the NM × NM identity matrix and R is upper triangular by construction. These matrices are partitioned such that Q_1 and Q_2 have the same dimensions as M_1 and M_2, respectively. The dimensions of the blocks in R can be inferred from

\[
M_1 = Q_1 R_{11}
\]
\[
M_2 = Q_1 R_{12} + Q_2 R_{22}
\]
\[
z = Q_1 r_{1z} + Q_2 r_{2z} + r_{3z}\, q_z. \tag{2.12}
\]
Since the columns of Q are orthogonal, the three terms on the right-hand side of equation 2.12 are orthogonal to each other. Thus, Q_2 r_{2z}, which is orthogonal to Q_1 and hence M_1, is the component in the output that can be attributed only to the input x_k(t). The mean squared value of this unique contribution, ŷ_k(t), is given by

\[
\frac{1}{T} \sum_{t=1}^{T} \hat{y}_k^2(t) = \frac{1}{T}\, (Q_2 r_{2z})^T (Q_2 r_{2z}) = \frac{r_{2z}^T r_{2z}}{T}. \tag{2.13}
\]
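The unique output contribution of equations 2.11 to 2.13 can be computed directly, as in the sketch below. For clarity it uses an explicit QR factorization; as shown next in the text, the same quantity can be obtained far more cheaply from a Cholesky factorization of the correlation matrix. Names are ours:

```python
import numpy as np

def unique_contribution(X_others, X_k, z):
    """Mean squared output component attributable only to input k (equation 2.13).

    X_others: T x (N-1)M matrix M1 of the remaining inputs' delayed copies;
    X_k: T x M matrix M2 for the input of interest; z: output vector.
    """
    T = len(z)
    A = np.column_stack([X_others, X_k, z])
    R = np.linalg.qr(A, mode='r')          # upper-triangular factor only
    m1 = X_others.shape[1]
    r2z = R[m1:m1 + X_k.shape[1], -1]      # block r_2z of equation 2.11
    return float(r2z @ r2z) / T

# Backward elimination: repeatedly drop the input with the smallest unique
# contribution and recompute, since the remaining inputs' significance
# changes after each deletion.
```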
The procedure described above can be used to identify the input that makes the smallest unique contribution to the output. The least significant input may then be removed from the pool of inputs and the process repeated. Note that if there are correlations between the inputs, the significance of the remaining inputs will change as a result of the deletion. Hence, it is necessary to repeat this process (N − 1) times to determine an optimal set of inputs to use in the identification process. Since the computational cost associated with the QR factorization in equation 2.11 is approximately 4T(NM)^2 flops (Golub & Van Loan, 1989) and since this factorization would have to be repeated once for each input, this scheme is clearly not practical. However, direct computation of the QR factorization is not necessary. To simplify the notation, consider estimating the contribution due to the last input x_N(t), so that the QR factorization in equation 2.11 involves the matrix [M_1 M_2 z] = [X z]. Squaring the right-hand side of equation 2.11 yields
(2.14)
while squaring the left-hand side gives
\[
[X \;\; z]^T [X \;\; z] = \begin{bmatrix} X^T X & X^T z \\ z^T X & z^T z \end{bmatrix} \tag{2.15}
\]
\[
= T \begin{bmatrix} \Phi_{XX} & \phi_{Xz} \\ \phi_{Xz}^T & \sigma_z^2 \end{bmatrix}, \tag{2.16}
\]
where the last equality follows from equations 2.9 and 2.10. Thus, the matrix R, and hence the mean-squared value of ŷ_N(t), can be obtained by computing the Cholesky factorization (Golub & Van Loan, 1989) of the matrix in
equation 2.16, which is constructed from the auto- and cross-correlation matrices. Note that the computational cost of the Cholesky factorization is approximately (1/3)(NM)3 flops, independent of T. Clearly, the contribution due to any one of the inputs can be obtained by rearranging the blocks Xz so that the input of interest appears in the bottom and in XX and right-most block of rows and columns in XX and in the bottom-most block Xz . of rows in 2.3 Singular Value Decomposition. Although the input selection algorithm will reduce correlations between inputs, the linear regression may still be poorly conditioned. For example, input properties such as limited bandwidth, which will produce an ill-conditioned autocorrelation matrix, are not altered by the selection algorithm. This ill-conditioned regression problem can be solved robustly using the singular value decomposition (SVD) (Golub & Van Loan, 1989) of the regression matrix X, X = USVT ,
2.3 Singular Value Decomposition. Although the input selection algorithm will reduce correlations between inputs, the linear regression may still be poorly conditioned. For example, input properties such as limited bandwidth, which will produce an ill-conditioned autocorrelation matrix, are not altered by the selection algorithm. This ill-conditioned regression problem can be solved robustly using the singular value decomposition (SVD) (Golub & Van Loan, 1989) of the regression matrix X,

$$X = U S V^T, \tag{2.17}$$

where U^T U = I, V^T V = I, and S = diag(σ_1, σ_2, ..., σ_{NM}), with σ_1 ≥ σ_2 ≥ ··· ≥ σ_{NM} ≥ 0. Consider the estimate

$$\hat{h} = V S^{-1} U^T z \tag{2.18}$$
$$= h + V S^{-1} U^T w \tag{2.19}$$
$$= V(V^T h + S^{-1} U^T w). \tag{2.20}$$
Aside from finite-precision errors, the solutions in equations 2.5 and 2.18 are identical. However, rewriting equation 2.18 as 2.19 provides insight into the effect that measurement noise will have on the final estimate. Thus, let

$$\eta = U^T w \tag{2.21}$$
$$\zeta = V^T h, \tag{2.22}$$

and let η_k and ζ_k be the kth elements of the vectors η and ζ, respectively. The estimate ĥ can then be written as

$$\hat{h} = \sum_{k=1}^{NM} \left( \zeta_k + \frac{\eta_k}{\sigma_k} \right) v_k, \tag{2.23}$$
where v_k is the kth column of the matrix V. The decomposition in equation 2.23 expands the vector of IRF estimates as a linear combination of an orthogonal basis formed by the right singular vectors. Each expansion
coefficient consists of two terms: ζ_k, the projection of the true system onto the kth right singular vector, and η_k/σ_k, the projection of the measurement noise onto the kth left singular vector. Note, however, that the kth noise term is scaled by 1/σ_k. Thus, small singular values can be expected to produce relatively large noise terms, and hence large errors in ĥ. The goal is to retain only those terms in equation 2.23 that are dominated by the signal component and to discard the rest (Westwick & Kearney, 1997b). One approach to the selection of significant terms is to reorder the singular values according to their contributions to the output. The model can then be built up term by term, including the most significant remaining term at each step. Once an acceptable level of model accuracy has been reached, the expansion can be halted. The model output is given by

$$\hat{y} = X\hat{h} = U S V^T \hat{h}. \tag{2.24}$$
Define the coefficient vector

$$\gamma = S V^T \hat{h}, \tag{2.25}$$
which contains the projection of the IRF estimate onto the right singular vectors, scaled by their associated singular values. The mean squared value of the model output is then

$$\frac{1}{T}\hat{y}^T \hat{y} = \frac{1}{T}\gamma^T \gamma. \tag{2.26}$$
Thus, γ_k², the square of the kth element of the vector γ, represents the contribution made by the kth term in equation 2.23 to the mean square of the model output (Westwick & Kearney, 1997a). To sort the terms in order of decreasing significance, we need only sort them in decreasing absolute value of the γ_k. Finally, we note that the SVD may be used in conjunction with the efficient correlation-based technique proposed by Perreault et al. (1999). Calculate the SVD of the input correlation matrix,

$$\Phi_{XX} = V \Lambda V^T, \quad \text{where} \quad \Lambda = \frac{1}{T} S^2. \tag{2.27}$$
The initial estimate of the IRFs, based on equation 2.8, is then

$$\hat{h} = \sum_{k=1}^{NM} \frac{1}{\lambda_k}\, v_k v_k^T\, \phi_{Xz}. \tag{2.28}$$
The terms in equation 2.28 can be sorted in decreasing order of contribution to the output; the mean squared value of the output contribution of each of these terms can be calculated using

$$\gamma_k^2 = \lambda_k \left( v_k^T \hat{h} \right)^2. \tag{2.29}$$
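A minimal sketch of this robust estimator follows: the expansion terms of equation 2.23 are sorted by their output contributions γ_k² and truncated greedily, and a squared correlation coefficient is included for assessing accuracy. The interface and names are our assumptions, and the code presumes X has no exactly zero singular values.

```python
# Sketch of the robust estimate of equations 2.18-2.26: expansion terms are
# sorted by their contribution gamma_k^2 to the mean squared model output
# and retained greedily.
import numpy as np

def truncated_svd_fit(X, z, n_terms):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    coef = (U.T @ z) / s              # zeta_k + eta_k / sigma_k (equation 2.23)
    gamma = s * coef                  # equation 2.25: gamma = S V^T h_hat
    keep = np.argsort(-gamma**2)[:n_terms]   # most significant terms first
    return Vt[keep].T @ coef[keep]    # truncated expansion of h_hat

def r_squared(z_true, z_pred):
    """Squared correlation coefficient between system and model outputs."""
    c = np.corrcoef(z_true, z_pred)[0, 1]
    return c * c
```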
The mean squared output is then plotted versus the number of singular vectors retained in the model. The point where the plot starts to level off determines how many singular vectors to include in the final model.

3 Experimental Methods

The algorithms described above have been evaluated using data from both computer simulations and physiological recordings. The rationale behind the use of artificial data is to test our algorithms on a system with known and modifiable properties; the application to physiological recordings then demonstrates the performance of these algorithms under more realistic conditions, where the system under study may not be well characterized a priori.

3.1 Generation of Artificial Data. The input selection and system identification algorithms were evaluated on a simulated linear MISO system with highly correlated and band-limited inputs. Both of these characteristics significantly complicate the identification process and are likely to be encountered when recording a large number of physiological signals. A schematic representation of the simulation process used to generate the artificial input-output data is shown in Figure 1A. The inputs were generated using K normally distributed, independent white noise sources. Each of these sources was band-limited using a digital Butterworth filter with a randomly generated order, ranging from first to fourth, and a randomly generated cutoff frequency, ranging from 10% to 90% of the Nyquist rate. Unique filters were used for each input. The resulting set of K independent, band-limited sources was multiplied by a randomly generated K × N mixing matrix, resulting in N correlated signals. In all cases N was greater than K, to mimic the recording of a large number of physiological signals driven by a small number of independent sources. Our goal was to evaluate the system identification procedure both with and without the optimal selection algorithm. Since this identification algorithm will not work if the inputs are fully statistically dependent, N independent gaussian white noise sources were added to the N dependent signals, to ensure some degree of independence. A 10 dB signal-to-noise ratio was used to emphasize the coupling among the N generated inputs over their small degree of independence.
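A compact sketch of this input-generation pipeline is given below, using scipy's Butterworth filter design. The random-parameter ranges follow the text; the interface and random-seed handling are illustrative assumptions.

```python
# Sketch of the artificial-data pipeline of Figure 1A: K white sources are
# band-limited by random Butterworth filters, mixed into N correlated
# inputs, and perturbed by independent noise at 10 dB SNR.
import numpy as np
from scipy import signal

def make_inputs(K, N, T, rng):
    sources = rng.standard_normal((K, T))
    for i in range(K):
        order = rng.integers(1, 5)            # first- to fourth-order
        wn = rng.uniform(0.1, 0.9)            # 10-90% of the Nyquist rate
        b, a = signal.butter(order, wn)
        sources[i] = signal.lfilter(b, a, sources[i])
    mixing = rng.standard_normal((N, K))
    inputs = mixing @ sources                 # N correlated signals
    noise = rng.standard_normal((N, T))       # independent noise, 10 dB SNR
    scale = np.sqrt(inputs.var(axis=1) / (10 * noise.var(axis=1)))
    return inputs + scale[:, None] * noise
```

The output stage of Figure 1A (per-input Butterworth filters of order one to five and an N × 1 mixing vector, plus 10 dB measurement noise) follows the same pattern.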
Figure 1: Schematic representation of the computer simulations. (A) Diagram of the process used to generate artificial data. The K independent sources are gaussian white noise. The SISO filters are digital Butterworth filters with randomly generated orders and cutoff frequencies. Specific details are provided in the text. (B) Typical input and output signals generated by this process; 250 sample points are shown.
The system output was generated from these inputs using a similar process. Each one of the N inputs was filtered by a unique digital Butterworth filter with a randomly generated order, ranging from first to fifth, and a randomly generated cutoff frequency, ranging from 10% to 80% of the Nyquist rate. These parameters were chosen so that, on average, the bandwidth of the system inputs was greater than that of the system to be identified. The resulting signals were combined using a randomly generated N × 1 mixing matrix to produce a single output. Measurement noise was simulated by adding normally distributed white noise at a 10 dB signal-to-noise ratio. Figure 1B shows typical input and output signals generated by this process. Monte Carlo simulations were used to evaluate algorithm performance. Each set of simulations used a fixed number of independent sources (K)
and corresponding inputs (N). In contrast, different noise sources, filter parameters, and mixing matrices were used for each trial in the set. Specific values for these stochastic model parameters were selected as described above. Unless specified, each Monte Carlo simulation set consisted of 100 trials, to obtain robust estimates of the mean and standard deviation for each quantity of interest.

3.2 Physiological Recordings. The algorithms also were evaluated using a set of physiological recordings obtained from a single behaving primate (Macaca mulatta) during execution of a button-pressing task. A total of 29 three-minute data files were collected during eight separate recording sessions. The recording sessions spanned a period of three months; it is assumed that the neural signals recorded from each electrode varied over the course of this period. Each trial began with the left hand on a touch pad at waist level. Following a 0.3 second touch pad hold time, one of four buttons in front of the subject was illuminated, instructing the subject to reach toward and press this randomly chosen illuminated button. The buttons were arranged on the top and bottom surfaces of a plastic box (see Figure 2A), thus requiring the subject to approach the button with the forearm either pronated or supinated, respectively. After a brief hold (∼200 ms), a tone indicated success; the subject was given a juice reward and could return its hand to the touch pad to initiate the next trial after a random intertrial interval. Figure 2B shows data, including associated neuronal discharge signals, electromyograms (EMGs) from four muscles, and a binary trace indicating the button press times, for a series of five such trials. There were an average of 60 ± 8 button presses in each three-minute trial, equally distributed across the four targets.

The neuronal discharge signals were recorded from an array of 100 electrodes (Cyberkinetics, Inc.). The array was chronically implanted under the dura in the arm area of the primary motor cortex. Leads from the array were routed to a connector implanted on the skull. Signals from the best 32 of these electrodes were sent to a DSP-based multineuron acquisition system (Plexon, Inc., Dallas, TX) for later analysis with the Plexon Off-line Sorter software. For the data presented here, the sorting algorithm was able to distinguish between 35 and 40 independent neural signals from the 32 electrode recordings. It was possible to classify approximately 15% of these signals as single neurons, based on stringent shape and minimum interspike interval criteria. The remaining signals were those for which action potentials probably were due to more than one neuron, but from which background noise and occasional artifacts were removed. The spike occurrences for these discriminated signals were then converted into a rate code, sampled at 100 Hz, for subsequent processing.

EMG signals were recorded from surface electrodes placed above the anterior deltoid (AD), biceps (BI), triceps (TRI), and combined wrist and digit flexors (FF). The signals were sampled at 2000 Hz and subsequently
Figure 2: Experimental setup for physiological recordings. (A) In this reaching task, the monkey is required to press one of four lighted buttons located on the top and bottom surfaces of the target platform. (B) Typical data recorded during this task. The top traces correspond to the neuronal firing patterns recorded from the intracortical microelectrode array. The bottom traces show the simultaneously recorded electromyograms from four of the arm muscles involved in the task: anterior deltoid (AD), biceps (BI), triceps (TRI), and combined wrist and digit flexors (FF). The target trace identifies periods of button pressing.
rectified, low-pass-filtered (10 Hz), and resampled at 100 Hz. All animal-related procedures were approved by the Institutional Animal Care and Use Committee at Northwestern University.

3.3 Statistical Analysis. To quantify model accuracy, we use r², the square of the correlation coefficient between the system and model outputs. Typically, experimental data are divided into two sets: estimation data and validation data. The model is identified from the estimation data and tested on the validation data. If the model is validated using the estimation data, the value of r² can be biased upward if the model fits some of the measurement noise (overfitting). Thus, the difference between the values of r² obtained from the estimation and validation data can be used to assess the degree of overfitting.

The accuracy of the proposed input selection algorithm was compared to that of PCA and of two of the methods recently proposed by Sanchez et al. (2004). All of these alternative algorithms reduce the dimensionality of the input signal space. PCA was chosen for comparison, since it is a commonly used approach for improving the numerical stability of the identification process. PCA differs from the Sanchez algorithms in that it uses all inputs in the prediction process. In contrast, the methods proposed by Sanchez are designed to remove unnecessary neural signals. The first method (fSISO) ranks each input according to the amount of variance it can predict in the system output. Prediction is performed using a SISO linear filter and ignoring all other available inputs; correlations among inputs are not considered. The second method (fMISO) ranks the inputs according to the sum of the estimated linear filter magnitudes; the estimated filters are obtained using a nonparametric linear MISO identification. This was the most effective of the three algorithms evaluated by Sanchez et al. (2004). As with the optimal selection algorithm, this ranking process must be repeated (N − 1) times for a system with N inputs, with each iteration eliminating the least significant input.

4 Results

4.1 System Identification from Artificial Data. In the case of artificial data, the optimal selection algorithm greatly reduced the number of inputs needed for an accurate prediction of the system output. The performance of our input selection algorithm was compared to that obtained when a subset of the input signals was randomly chosen for use in the system identification process. The value of r² calculated from the cross-validation data was used to compare the performance of these two selection processes. Figure 3 shows the average results from a set of 100 Monte Carlo simulations using 10 independent sources (K = 10) to produce 20 correlated inputs (N = 20). For these simulation parameters, more than 90% of the maximally obtainable output variance could be predicted using the three most significant inputs.
Figure 3: Model accuracy as a function of the number of inputs used in the identification process. Thin traces correspond to the value of r² for randomly selected inputs, and thick traces correspond to that for optimally selected inputs. Error bars indicate the standard deviation based on the results of 100 simulated trials. Simulation parameters: K = 10, N = 20, T = 2000 data points for each of the simulated data vectors. The estimated linear filters had lengths of M = 32 points.
In contrast, more than twice as many inputs were required to reach the same level of fitting accuracy when a random selection was used. Similar results were obtained for a wide range of simulation parameters, as examined by varying the number K of independent sources from 7 to 15 and the number N of input signals generated by these sources from 7 to 40.

Although it reduced the fitting accuracy, the robust identification algorithm improved output predictions for cross-validation data. The prediction accuracy of the estimated linear system was evaluated both for the fitted data and for cross-validation data not used in the fitting process. These results, as functions of the number of singular values used in the pseudo-inverse, are shown in Figure 4. Simulation parameters were identical to those used for Figure 3, although half of the resulting data set was used for identification and the remaining half for cross-validation. As expected, r² for the fitted data increases monotonically with the number of singular values used in the identification process. However, there is a clear peak in the r² value for the cross-validation data, indicating the advantage of restricting the number of singular values when computing the pseudo-inverse for this
Figure 4: Prediction accuracy for the fitted (solid black trace) and cross-validation (dashed black trace) data as a function of the number of singular values used in the pseudo-inverse. Results are the average and standard deviation (gray bands) from 100 simulated data sets. Simulation parameters: K = 10, N = 20, 2000 data points for each of the simulated data vectors (1000 points were used for the estimation and 1000 for the cross-validation). The estimated linear filters had a length of M = 32 points.
data set. Such a result is a signature of overfitting, and the simulation parameters were chosen to illustrate this point. For the same set of simulation parameters, the discrepancy between the curves for the fitting and validation data diminishes as the number of data points used in the fitting process is increased.

The combined use of the optimal input selection algorithm and the robust MISO identification yields accurate output predictions using only a small subset of the measured inputs. Figure 5 shows typical model predictions for the same system used to generate the data for Figures 3 and 4. Only 3 of the 20 available inputs were used to predict the output response. Figure 5A shows the results when optimal input selection was combined with the robust identification algorithm. Based on the results shown in Figure 4, the number of singular values needed to predict 90% of the output variance was used to compute the pseudo-inverse. Figure 5B shows typical results using three randomly chosen inputs and all singular values in the identification process. For this data set, r² increased by 0.1 when the optimal input selection algorithm was used in conjunction with the robust MISO
Figure 5: Actual (gray traces) and predicted (black traces) model outputs for simulated data using 3 of 20 available input signals. (A) Results when the optimal inputs are chosen and the robust identification algorithm is used. (B) Results when the inputs are randomly chosen and a pseudo-inverse is not used in the identification process. Simulation parameters: K = 10, N = 20, 4000 data points for each of the simulated data vectors (2000 points were used for the estimation and 2000 for the cross-validation). The estimated linear filters had a length of M = 32 points.
identification, as compared to the result obtained using the original identification algorithm. This increase corresponds to a nearly twofold reduction in the mean squared error. In a series of 100 Monte Carlo simulations, r² increased by 0.17 ± 0.11.

4.2 System Identification from Physiological Data. The algorithms presented in this article were designed to work with correlated multiple-input data of restricted bandwidth. The experimental data collected from the microelectrode arrays exhibited both of these features. Figure 6A shows
Figure 6: Input data characteristics. (A) Power spectra for the cortical recordings for a typical 180 second reaching trial. Individual signals are shown in gray and the mean of all signals in black. (B) Interdependence of the cortical recordings. The gray trace shows the instantaneous spike rate recorded from a single electrode; the black trace shows the estimate of that recording based on the recordings from all other electrodes (39 signals). Both signals have been low-pass-filtered at 10 Hz using a fourth-order Butterworth filter. The prediction is based on a linear MISO system estimated from 180 seconds of data. All estimated filters had a length of 510 ms.
the power spectra for all cortical signals recorded during a typical trial; the average spectrum is shown in black. The spectrum for each signal was broad but had a dominant peak at approximately 0.2 Hz to 0.5 Hz, due to the time between subsequent reaching movements. The power outside this range was lower but remained significant because the signals contained relatively narrow bursts of activity.

Correlation between channels was assessed by examining the accuracy with which a given neural signal could be predicted using all other available neural signals. A causal linear MISO system was used for this prediction. Across all 29 data sets, the squared correlation coefficient between any given input and its prediction from all of the other inputs was 0.33 ± 0.10 (mean ± SD). Furthermore, the best input prediction in each trial resulted in an r² of 0.55 ± 0.03. Figure 6B shows a typical example of the dependence between recorded neural signals. The gray trace shows the spike rate of the neural signal recorded from a single electrode, and the black trace shows the prediction of that signal based on the measurements from all other electrodes. Both signals were low-pass-filtered at 10 Hz to accentuate the average spike rates. The r² value for these two signals prior to filtering was 0.32. These results demonstrate that the neural recordings obtained from the microelectrode array were not statistically independent and that there is significant overlap in the information contained in these recordings, especially at the lower frequencies more relevant to movement control.

The optimal selection algorithm was effective at reducing the number of cortical signals necessary to predict the EMG activity for each of the four recorded muscles. Figure 7 shows the average r² value across all 29 data sets as a function of the number of inputs used to predict each of the recorded EMGs. Filter lengths of 510 ms were used for prediction; longer filters did not improve prediction accuracy significantly. The average r² obtained when using the optimally selected inputs is compared to that obtained when using the three alternative strategies: PCA, fSISO, and fMISO. The thick straight lines above each set of curves indicate the regions where the selection algorithm proposed in this letter performed significantly better than the alternatives (p < 0.05). The optimal algorithm produced the maximal cross-validation prediction accuracy for each of the four muscles tested and performed significantly better than the alternative methodologies. These results were most dramatic in comparison to the fSISO and PCA algorithms. There also was a statistically significant improvement in comparison to the fMISO algorithm, but the magnitude of the difference between the performance of that approach and that of the optimal algorithm was small.

Reducing the number of inputs also increased the accuracy of the EMG predictions for cross-validation data for 113 of the 116 available data sets (4 muscles × 29 trials). The improvement in r² across all trials, relative to when all inputs were used, ranged from 0 to 0.28, with an average of 0.05 ± 0.07. Because the number of optimal inputs varied between
Figure 7: Performance of the optimal selection algorithm on electrophysiological data. Each panel corresponds to a different muscle and shows the average (29 trials) accuracy of EMG predictions as a function of the number of input signals used in the identification process. All reported r² values are for cross-validation data not used in the selection or identification processes. Solid traces correspond to r² obtained using the optimal selection algorithm. The gray lines show the results of using the fMISO selection algorithm, the coarse dashed lines show the results of using the fSISO selection algorithm, and the fine dashed lines show the results of using PCA. In the estimation process, 120 seconds of data were used, and 60 seconds were used for cross-validation. All estimated filters had a length of 510 ms. The thick lines above the traces indicate regions where the optimal selection algorithm was significantly better (p < 0.05) than the alternative approaches.
trials, this improvement is not evident in the average responses shown in Figure 7.

Once the optimal set of inputs was selected, the robust identification algorithm provided little performance enhancement with respect to techniques not based on a pseudo-inverse, as long as sufficient data were used in the identification process. Figure 8 provides results from a typical trial and shows r² for the fitted and cross-validation data as a function of
Figure 8: EMG prediction accuracy for the fitted (solid traces) and cross-validation (dashed traces) data as a function of the number of singular values used in the pseudo-inverse. For estimation, 120 seconds of data were used, and 60 seconds for cross-validation. The estimated filters had a length of 510 ms.
the number of singular values used in the pseudo-inverse. This identification was performed on the 10 optimal inputs. Even when using these, fewer than 10% of the available singular values were needed to obtain 90% of the maximum achievable prediction accuracy. In contrast to the simulations presented in Figure 4, there is no clear peak in the cross-validation curve. This is due to the use of sufficiently long data records (2 minutes). The SVD algorithm had a more dramatic effect when fewer data were used in the identification process, although the peak value of r² was maximized by using at least 2 minutes of data for system identification.

The optimal algorithm for the selection of inputs has made it possible to predict upper limb EMGs reliably. Figure 9 compares the recorded and predicted EMGs from a set of cross-validation data for each muscle. The 10 best neurons were used for prediction; a different set of neurons was used for each muscle. There is a close correspondence between the actual and predicted EMGs. For the cross-validation data in these examples, r² was between 0.60 and 0.73, typical values for trials without long rest periods between subsequent movements.
Figure 9: Actual (gray traces) and predicted (black traces) EMGs using 10 of 41 available cortical recordings. All plots are for cross-validation data not used in the estimation process. Models were estimated from 120 seconds of data, leaving 60 seconds available for cross-validation. The estimated filters had a length of 510 ms.
5 Discussion

In this letter, we have presented two novel algorithms to address numerical problems associated with the identification of MISO systems and have demonstrated the applicability of these algorithms to the processing of neural recordings from intracortical microelectrode arrays. These algorithms address the problems associated with identifying MISO systems with highly correlated inputs and inputs with restricted information content. The algorithms provide tools for selecting the optimal inputs to be used in the identification process and for generating robust estimates of the MISO system that relates these inputs to the observed output. Both algorithms were found to perform well for simulated and experimental data.

5.1 Selection of Inputs. With rapid advances in microelectrode technology, it becomes increasingly possible to obtain large numbers of simultaneous neural recordings. However, the computational burden associated with processing such large numbers of available inputs can be a challenge
in applications where efficient processing is essential. Furthermore, the likelihood of correlations among inputs increases as the number of recorded signals increases, as demonstrated in section 4. Such correlations can lead to numerical instabilities during the identification process (Perreault et al., 1999). Both issues can be addressed by eliminating inputs that do not provide unique information about the system output. Nonessential inputs may be uncorrelated with the output or may be highly correlated with other inputs. We have developed an algorithm for detecting such inputs and have demonstrated that it is effective. In addition to decreasing the computational time required for the remaining steps in the system identification process, such a pruning of neural inputs also has the potential to reduce the computational costs associated with preprocessing algorithms such as spike sorting and artifact removal.

The reported advantages associated with reducing the number of inputs used in the identification process are not in contradiction to findings that prediction accuracy increases with increasing numbers of recorded signals (Carmena et al., 2003; Wessberg et al., 2000). Recently, Paninski et al. (2004) and Sanchez et al. (2004) demonstrated that the accuracy with which movement variables can be predicted from neural recordings depends strongly on which neurons are selected as model inputs. Our results demonstrate that this selection can be optimized by choosing neural signals based on the uniqueness of their contribution to the system output. Increasing the number of neural recordings increases the sample of neurons from which to draw the optimal set. Therefore, the potential of experimental techniques that allow such large-scale recordings is likely to be enhanced by optimally selecting a subset of the available recorded signals for use in the identification and prediction process. Similar selection algorithms recently were explored by Sanchez et al. (2004), who also demonstrated that a subset of the available neural recordings could be used to predict kinematic variables associated with reaching. We were able to compare two of their algorithms with the one proposed in this article. Although our selection algorithm produced the best results, the fMISO algorithm proposed by Sanchez et al. (2004) performed nearly as well. The results of both studies emphasize the need to consider all neural inputs and their contributions to the system output during the selection process.

5.2 Robust SVD Estimation. Most system identification algorithms rely on the use of white or at least broadband stationary inputs to produce reliable, robust estimates of the system dynamics. However, it can be difficult to obtain broadband inputs during functional behaviors. Under realistic conditions, the input bandwidth may be limited, and the assumption of stationarity may be violated. Hence, it is necessary to develop and use system identification algorithms that produce robust estimates of the system dynamics under such conditions. Westwick and Kearney
(1997b) have developed a robust algorithm for identifying SISO systems using a pseudo-inversion of the input autocorrelation matrix (see equation 2.8). Here we have extended this algorithm to the identification of MISO systems. Our results (see Figure 4) demonstrate that the use of this algorithm can improve prediction accuracy for cross-validation data not used in the identification process. These improvements are greatest when the amount of data used in the identification process is relatively small, indicating that the robust algorithm helps reduce the problem of overfitting. Smaller improvements are observed when the amount of data used to identify the MISO system is increased (see Figure 8). For neural data similar to those used in this study, this algorithm is likely to be most beneficial in situations where only small data records can be collected or when it is necessary to characterize system behavior over short periods of time.

5.3 Linear Systems Identification. This study has been restricted to the use of linear system identification techniques. Although the transformation from neural activity to motor output presumably contains many significant nonlinearities, linear models of the net transformation from neural activity to EMGs work surprisingly well when populations of neurons are considered. Similar phenomena have been demonstrated by a number of groups that have compared the prediction accuracies of linear filters and nonlinear networks for decoding neural information from motor and visual systems (Warland, Reinagel, & Meister, 1997; Wessberg et al., 2000; Gao et al., 2003). Given the similarities in prediction accuracy, there are a number of advantages to using linear identification techniques, including the computational and conceptual simplicity of these approaches as well as the potential for meaningful interpretation of the estimated linear filters. Linear IRFs can provide useful characterizations of the transfer of information from the cortex to the motor system (e.g., bandwidth, delays). In contrast, it can be more difficult to obtain similar insights if the system is modeled as a nonlinear network. However, the potential advantages of nonlinear models may become more apparent when their ability to generalize is tested under a wider range of conditions. For example, in situations where there is a significant pause between movements, it may be advantageous to incorporate a static output nonlinearity to approximate the threshold response of the motoneuron pool. Additionally, the techniques presented here could be applied to nonlinear systems consisting of static nonlinearities connected in series with finite-memory linear systems (Bussgang, 1952). However, the applicability of these techniques to more general nonlinear systems remains to be demonstrated.

5.4 Potential Applications. The algorithms presented in this letter could be useful in a range of applications where it is necessary to predict the output of a MISO system and the inputs to that system are highly
correlated, have limited information content, or can be recorded for only short periods of time. With respect to the processing of neural information, the algorithms could be used in the analysis of multidimensional signals from a variety of sources, including electromyograms, electroencephalograms, and intracortical recordings.

One application for the input selection algorithm is as a mapping tool for determining which neural recordings are most relevant to any time-dependent process of interest. Examples include assessing the neural signals or anatomical substrates contributing to movement control, cognitive tasks, or visual processing. The selection algorithm also could be used in conjunction with the robust identification algorithm when it is necessary to predict the system output or generate a control signal from a set of recorded inputs. One such application that has received much recent attention is the development of BMIs, including those for the restoration of motor function via neuromuscular stimulation of paralyzed muscles (Lauer, Peckham, Kilgore, & Heetderks, 2000), the control of augmented communication aids for individuals with severe communication disorders (Kennedy, Bakay, Moore, Adams, & Goldwaithe, 2000), and the control of assistive devices for improved mobility and function (Carmena et al., 2003; Taylor, Tillery, & Schwartz, 2003). Success in each of these applications hinges on the availability of a multidimensional natural control signal, such as would result from intracortical microelectrode array recordings. The input selection algorithm could be used to identify the neural signals most relevant to each degree of control, and the robust identification algorithm could be used to estimate the system describing the dynamic relationship between those neural signals and the desired control signal. By using only the optimal inputs, it would be possible to decrease the computational time needed for identification and prediction. This could offer significant improvements in real-time applications, especially those using adaptive algorithms. It should be noted, though, that changes in the neural signals available over time will require a reevaluation of which of the available signals are optimal for a given task. This evaluation could operate as a background process, updating the set of optimal signals as necessary.
Acknowledgments This research was supported by NSERC grant RGP-238939 (D.T.W.), NSF grant IBN-0432171 (E.J.P.) and NIH grants NS36976 (L.E.M.) and 1 K25 HD044720-01 (E.J.P.). E.A.P. was supported by NSF through an IGERT fellowship in Dynamics of Complex Systems in Science and Engineering, grant DGE-9987577. S.A.S. acknowledges the hospitality of the Kavli Institute for Theoretical Physics at the University of California, Santa Barbara, and partial NSF support under grant PHY99-07949.
References

Bussgang, J. J. (1952). Crosscorrelation functions of amplitude distorted gaussian signals. MIT Res. Lab. Elec. Tech. Rep., 216, 1–14.
Carmena, J. M., Lebedev, M. A., Crist, R. E., O'Doherty, J. E., Santucci, D. M., Dimitrov, D., Patil, P. G., Henriquez, C. S., & Nicolelis, M. A. (2003). Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biol., 1(2), E42.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci., 2(7), 664–670.
Chen, S., Cowan, C., & Grant, P. (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Netw., 2, 302–309.
Donoghue, J. P. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nat. Neurosci., 5(Suppl.), 1085–1088.
Gao, Y., Black, M. J., Bienenstock, E., Wu, W., & Donoghue, J. P. (2003). A quantitative comparison of linear and non-linear models of motor cortical activity for the encoding and decoding of arm motions. In 1st International IEEE/EMBS Conference on Neural Engineering (pp. 189–192). Los Alamitos, CA: IEEE.
Golub, G., & Van Loan, C. (1989). Matrix computations (2nd ed.). Baltimore, MD: Johns Hopkins University Press.
Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8(2), 196–198.
Kennedy, P. R., Bakay, R. A., Moore, M. M., Adams, K., & Goldwaithe, J. (2000). Direct control of a computer from the human central nervous system. IEEE Trans. Rehabil. Eng., 8(2), 198–202.
Kim, S. P., Sanchez, J. C., Erdogmus, D., Rao, Y. N., Wessberg, J., Principe, J. C., & Nicolelis, M. (2003). Divide-and-conquer approach for brain machine interfaces: Nonlinear mixture of competitive linear models. Neural Netw., 16(5–6), 865–871.
Korenberg, M. (1988). Identifying nonlinear difference equation and functional expansion representations: The fast orthogonal algorithm. Ann. Biomed. Eng., 16, 123–142.
Lauer, R. T., Peckham, P. H., Kilgore, K. L., & Heetderks, W. J. (2000). Applications of cortical signals to neuroprosthetic control: A critical review. IEEE Trans. Rehabil. Eng., 8(2), 205–208.
Maynard, E. M., Nordhausen, C. T., & Normann, R. A. (1997). The Utah Intracortical Electrode Array: A recording structure for potential brain-computer interfaces. Electroencephalogr. Clin. Neurophysiol., 102(3), 228–239.
Miller, A. (1990). Subset selection in regression. London: Chapman and Hall.
Mussa-Ivaldi, F. A., & Miller, L. E. (2003). Brain-machine interfaces: Computational demands and clinical needs meet basic neuroscience. Trends Neurosci., 26(6), 329–334.
Nicolelis, M. A., Dimitrov, D., Carmena, J. M., Crist, R., Lehew, G., Kralik, J. D., & Wise, S. P. (2003). Chronic, multisite, multielectrode recordings in macaque monkeys. Proc. Natl. Acad. Sci. USA, 100(19), 11041–11046.
Paninski, L., Fellows, M. R., Hatsopoulos, N. G., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. J. Neurophysiol., 91(1), 515–532.
Perreault, E., Kirsch, R., & Acosta, A. (1999). Multiple-input, multiple-output system identification for characterization of limb stiffness dynamics. Biol. Cybern., 80, 327–337.
Sanchez, J., Carmena, J., Lebedev, M., Nicolelis, M., Harris, J., & Principe, J. (2004). Ascertaining the importance of neurons to develop better brain-machine interfaces. IEEE Trans. Biomed. Eng., 51(6), 943–953.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416(6877), 141–142.
Taylor, D. M., Tillery, S. I., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296(5574), 1829–1832.
Taylor, D. M., Tillery, S. I., & Schwartz, A. B. (2003). Information conveyed through brain-control: Cursor versus robot. IEEE Trans. Neural Syst. Rehabil. Eng., 11(2), 195–199.
Warland, D. K., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. J. Neurophysiol., 78(5), 2336–2350.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810), 361–365.
Westwick, D. T., & Kearney, R. E. (1997a). Generalized eigenvector algorithm for nonlinear system identification with non-white inputs. Ann. Biomed. Eng., 25(5), 802–814.
Westwick, D. T., & Kearney, R. E. (1997b). Identification of physiological systems: A robust method for non-parametric impulse response estimation. Med. Biol. Eng. Comput., 35(2), 83–90.
Williams, J. C., Rennaker, R. L., & Kipke, D. R. (1999). Long-term neural recording characteristics of wire microelectrode arrays implanted in cerebral cortex. Brain Res. Protoc., 4(3), 303–313.
Wu, W., Black, M. J., Mumford, D., Gao, Y., Bienenstock, E., & Donoghue, J. (2003). A switching Kalman filter model for the motor cortical coding of hand motion. In 25th International IEEE/EMBS Conference. Los Alamitos, CA: IEEE.
Received August 4, 2004; accepted June 30, 2005.
LETTER
Communicated by Bard Ermentrout
Oscillatory Networks: Pattern Recognition Without a Superposition Catastrophe Thomas Burwick
[email protected] ¨ Neuroinformatik, Ruhr-Universit¨at Bochum, 44306 Bochum, Germany Institut fur
Using an oscillatory network model that combines classical network models with phase dynamics, we demonstrate how the superposition catastrophe of pattern recognition may be avoided in the context of phase models. The model is designed to meet two requirements: on and off states should correspond, respectively, to high and low phase velocities, and patterns should be retrieved in coherent mode. Nonoverlapping patterns can be simultaneously active with mutually different phases. For overlapping patterns, competition can be used to reduce coherence to a subset of patterns. The model thereby solves the superposition problem.

1 Introduction

For a network with two or more active patterns, where activity is solely defined by on and off states, a subsequent stage of information processing may not be able to read out single patterns. This so-called superposition catastrophe has long been recognized as a major challenge for neural network modeling (Rosenblatt, 1961). It is related to the binding problem (see Roskies, 1999; Müller, Elliott, Herrmann, & Mecklinger, 2001). The superposition and binding problems motivated von der Malsburg to propose grouping of neural units based on temporal correlation. This allows several nonoverlapping patterns to be active at the same time and still be separable due to different temporal properties (von der Malsburg, 1981, 1985). For example, they may synchronize with different phases for different patterns (von der Malsburg & Schneider, 1986). Subsequently, associative memory based on temporal coding has been implemented in oscillatory neural networks. Recognized patterns may then correspond to limit cycles. The first models that implemented temporal coding were based on Wilson-Cowan-like dynamics, where the oscillators were defined in terms of coupled excitatory and inhibitory units (von der Malsburg & Schneider, 1986; Baird, 1986; Freeman, Yao, & Burke, 1988; Li & Hopfield, 1989). These approaches have been extended in Wang, Buhmann, and von der Malsburg (1990) and von der Malsburg and Buhmann (1992). Segmentation is a particular example of the superposition problem. Correspondingly, many applications are concerned with this task (e.g., Terman & Wang, 1995; Wang &
Neural Computation 18, 356–380 (2006)
C 2005 Massachusetts Institute of Technology
Terman, 1995, 1997), with applications such as medical image segmentation (Shareef, Wang, & Yagel, 1999). Oscillatory networks with associative memory were also studied by using a phase description for the oscillators. Whenever such a parameterization is possible, the analysis of oscillatory systems may be simplified significantly (see Kuramoto, 1984; Winfree, 2001). In this article, we study the avoidance of the superposition catastrophe in the context of phase models.

Our discussion is based on a network model, where the real-valued activity u_k of each unit k is complemented with a phase θ_k, k = 1, ..., N. The phases θ_k are supposed to parameterize a temporal structure of the signals. We consider a generalization of classical neural network dynamics:

$$\tau_u \frac{du_k}{dt} = -u_k + \frac{1}{N}\sum_{l=1}^{N} w_{kl}(\theta)\, g(u_l) + I_k, \tag{1.1a}$$

$$\tau_\theta \frac{d\theta_k}{dt} = 2\pi\, g(u_k)\,(1 + S_k(u, \theta)), \tag{1.1b}$$

where

$$w_{kl}(\theta) = g_{kl}\,(\alpha + \beta \cos(\theta_k - \theta_l)), \tag{1.2}$$

$$S_k(u, \theta) = \frac{\beta}{N}\sum_{l=1}^{N} g_{kl} \sin(\theta_l - \theta_k)\, g(u_l). \tag{1.3}$$
The activation function is g(x) = (1 + tanh(x))/2, the τ_u, τ_θ > 0 are timescales, and the I_k are external inputs. The w_kl(θ) are phase-dependent weights, specified by the g_kl and real-valued parameters α, β ≥ 0. In our examples, we use α > β. In accordance with the mentioned interpretation of the phases, the model is designed so that the limit 1 (0) of g(u_k) describes an on (off) state, leading to high (low) frequencies of θ_k. This motivates the factor g(u_k) on the right-hand side of equation 1.1b. The S_k(u, θ) will correct the phase velocity and imply synchronization. For on-states with g(u_k) → 1, the phase velocity will approach (2π/τ_θ)(1 + S_k), while for off-states, g(u_k) → 0, the phase velocity will vanish. (For a more detailed discussion and motivation of w_kl(θ) and S_k(u, θ), see below.)

Understanding a possible relevance of equations 1.1 to biology would be of interest. An interpretation of the units in terms of single biological neurons is not intended. Possibly an interpretation in terms of populations of neurons may be found. An exact interpretation of variables in biological terms, however, is outside the scope of this letter. For the purpose of this letter, it is sufficient to see the model as a formal framework that allows the implementation of the benefits that temporal coding should hold for pattern recognition.
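As an illustration, equations 1.1 to 1.3 can be integrated by forward Euler, as in the following sketch. The step size and all parameter values are illustrative assumptions, not values taken from the text.

```python
# Minimal forward-Euler sketch of equations 1.1 to 1.3. All parameter values
# (alpha, beta, tau_u, tau_theta, dt) are illustrative assumptions.
import numpy as np

def g(x):
    return 0.5 * (1.0 + np.tanh(x))

def euler_step(u, theta, gkl, I, alpha=1.0, beta=0.5,
               tau_u=1.0, tau_theta=1.0, dt=0.01):
    N = u.size
    dth = theta[:, None] - theta[None, :]          # theta_k - theta_l
    w = gkl * (alpha + beta * np.cos(dth))         # equation 1.2
    S = (beta / N) * (gkl * np.sin(-dth)) @ g(u)   # equation 1.3
    du = (-u + (w @ g(u)) / N + I) / tau_u         # equation 1.1a
    dtheta = 2.0 * np.pi * g(u) * (1.0 + S) / tau_theta   # equation 1.1b
    return u + dt * du, theta + dt * dtheta
```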
Specifying an oscillatory network model requires two choices to be made. First, a model for the single oscillators without coupling has to be chosen. Second, the coupling between the oscillators has to be specified. A common choice for the oscillator dynamics is the Stuart-Landau model. This assigns a complex-valued dynamics to the oscillator and may be derived from a general ordinary differential equation near the Hopf bifurcation (see Kuramoto, 1984). A natural choice for the couplings is linear in the complex-valued normal form coordinate (Kuramoto, 1975). The complete system may then be interpreted as a generalized version of the discrete complex-valued Ginzburg-Landau reaction-diffusion system. In the following, we refer to it as the generalized Ginzburg-Landau (GGL) model. Most approaches toward associative memory in the context of models with phases and amplitudes use the GGL model with vanishing shear (see Hoppensteadt & Izhikevitch, 2003, for a short review). Using pure phase models as an adiabatic approximation, associative memory has been implemented by identifying the coupling strengths with Hebbian weights (Abbott, 1990; Sompolinsky, Golomb, & Kleinfeld, 1990, 1991; Baldi & Meir, 1990; Kuramoto, Aoyagi, Nishikawa, Chawanya, & Okuda, 1992; Sompolinsky & Tsodyks, 1992; Kuzmina & Surina, 1994; Kuzmina, Manykin, & Surina, 1995). The corresponding step in the context of phase and amplitude dynamics was taken by using a complexified analog of the Cohen-Grossberg-Hopfield function (Takeda & Kishigami, 1992; Chakravarthy & Ghosh, 1994; Aoyagi, 1995; Hoppensteadt & Izhikevitch, 1996). This approach imitated the method that was used for discrete dynamics (Noest, 1988a, 1988b). Reviews of complex-valued neural networks and their application to pattern recognition may also be found in Hirose (2003).

The GGL model is the simplest model for coupling Stuart-Landau oscillators, but not the only possible one. Alternative models have been studied with and without regard to associative memory. For example, Tass and Haken proposed three different models and used one to model synchronization of neural activity in the visual cortex (Tass, 1993; Tass & Haken, 1996a, 1996b). In the context of associative memory, the Stuart-Landau oscillator of the GGL model was replaced with an oscillator model that has on and off states (Aoyagi, 1995). A phase-locked loop model was introduced to allow implementations with well-developed circuit technology (Hoppensteadt & Izhikevitch, 2000). Another model was proposed that allowed a smooth transition between fixed-point mode and oscillatory mode by appropriately changing a parameter (Chakravarthy & Ghosh, 1996). We will also find that the model of equations 1.1 is not of the GGL type.

In section 2, we review associative memory based on the GGL model with vanishing shear and identical natural frequencies. We will argue that the phase dynamics of this model is not suited to solve the superposition problem by giving different phases to different patterns and the background. In the context of weakly coupled GGL models, a solution has been proposed that relates the pooling of neural oscillators to frequency
gaps resulting from nonidentical natural frequencies (frequency modulation, FM; see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). In section 3, we take a different approach of modifying the underlying system and starting from equations 1.1 instead. This will allow the use of phase gaps to prevent the superposition catastrophe. We give a detailed comparison of equations 1.1 and the GGL model and also relate our approach to the FM proposal. In section 4, we present simple examples. In section 5, we conclude with a summary and outlook.

2 The Generalized Ginzburg-Landau Model

We now review and discuss associative memory based on the GGL model. Notice, however, that we do not discuss the storage of complex-valued patterns. Such changes may be included without difficulty by going to complex-valued Hermitean weights (Noest, 1988a, 1988b; Takeda & Kishigami, 1992; Chakravarthy & Ghosh, 1994; Aoyagi, 1995). We present the models in a form that will allow a convenient comparison with equations 1.1 in section 3.

2.1 The GGL Model as Gradient System. The discrete complex Ginzburg-Landau (GGL) model may be expressed in terms of complex coordinates (Kuramoto, 1975; see also Hoppensteadt & Izhikevitch, 1997, sec. 10.1–3):

$$\text{GGL:} \qquad \frac{dz_k}{dt} = (\tilde{I}_k + i\omega_k)\, z_k - (\sigma_k + i\eta_k)\, z_k |z_k|^2 + \frac{\beta}{N}\sum_{l=1}^{N} g_{kl}\, z_l, \tag{2.1}$$
where Ĩ_k, ω_k, σ_k, η_k, and the weights g_kl are real-valued parameters, k = 1, ..., N. Associative memory has been introduced for the parameter choices η_k = 0, corresponding to vanishing shear, and ω_k = Ω (see Hoppensteadt & Izhikevitch, 1997, sec. 10.4, and the short review in Hoppensteadt & Izhikevitch, 2003). The latter choice allows us to set ω_k = 0 by going to the comoving frame, θ_k → θ_k + Ωt. We may also set σ_k = 1. With z_k = V_k exp(iθ_k), z̄_k = V_k exp(−iθ_k), where V, θ are real, equation 2.1 is then equivalent to

$$\text{GGL:} \qquad \frac{dV_k}{dt} = \tilde{I}_k V_k - V_k^3 + \frac{1}{N}\sum_{l=1}^{N} \tilde{w}_{kl}(\theta)\, V_l, \tag{2.2a}$$

$$\text{GGL:} \qquad \frac{d\theta_k}{dt} = \frac{1}{V_k}\, \tilde{S}_k(V, \theta), \tag{2.2b}$$
where w̃_kl is w_kl with α = 0, as given by equation 1.2, and S̃_k is obtained from S_k by replacing g(u_k) → V_k. The g_kl sin(θ_l − θ_k) V_l terms in S̃_k(V, θ) tend to synchronize (desynchronize) unit k with units l whenever g_kl > 0 (g_kl < 0). Without couplings, g_kl = 0, it is easily seen that for Ĩ_k < 0, the origin V_k = 0 is a stable fixed point; at Ĩ_k = 0, the unit k experiences a Hopf bifurcation, generating a stable limit cycle with radius √(Ĩ_k) for Ĩ_k > 0. Introducing the Lyapunov function

$$\tilde{L}(V, \theta) = -\frac{1}{2N}\sum_{k,l} \tilde{w}_{kl}(\theta)\, V_k V_l + P(V) \tag{2.3}$$

with

$$P(V) = \sum_k \left( -\frac{\tilde{I}_k}{2} V_k^2 + \frac{1}{4} V_k^4 \right) \tag{2.4}$$
allows us to express the dynamics of equations 2.2 as a gradient system (Noest, 1988a, 1988b; Takeda & Kishigami, 1992; Chakravarthy & Ghosh, 1994; Aoyagi, 1995; Hoppensteadt & Izhikevitch, 1996). Assuming that the g_kl are symmetric, one obtains

$$\text{GGL:} \qquad \frac{dV_k}{dt} = -\frac{\partial}{\partial V_k} \tilde{L}(V, \theta), \tag{2.5a}$$

$$\text{GGL:} \qquad \frac{d\theta_k}{dt} = -\frac{1}{V_k^2}\, \frac{\partial}{\partial \theta_k} \tilde{L}(V, \theta). \tag{2.5b}$$
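The gradient property in equation 2.5a can be checked numerically. The following sketch compares the right-hand side of equation 2.2a with a central-difference approximation of −∂L̃/∂V_k for symmetric g_kl; it is a verification aid under assumed test values, not part of the original analysis.

```python
# Numerical check (finite differences) that equation 2.2a is the gradient
# flow of the Lyapunov function of equations 2.3-2.4, for symmetric gkl.
import numpy as np

def lyap(V, theta, gkl, I_t, beta):
    """L~ of equations 2.3-2.4; w~ is w_kl of equation 1.2 with alpha = 0."""
    N = V.size
    w = gkl * beta * np.cos(theta[:, None] - theta[None, :])
    return -0.5 / N * V @ (w @ V) + np.sum(-0.5 * I_t * V**2 + 0.25 * V**4)

def check_gradient(V, theta, gkl, I_t, beta, eps=1e-6):
    """Verify dV_k/dt of equation 2.2a equals -dL~/dV_k (equation 2.5a)."""
    N = V.size
    w = gkl * beta * np.cos(theta[:, None] - theta[None, :])
    dV = I_t * V - V**3 + (w @ V) / N
    for k in range(N):
        e = np.zeros(N); e[k] = eps
        num = (lyap(V + e, theta, gkl, I_t, beta) -
               lyap(V - e, theta, gkl, I_t, beta)) / (2 * eps)
        assert abs(dV[k] + num) < 1e-4
```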
The minima of L̃(V, θ) should then correspond to storage of patterns. In section 2.2, we specify the storage of patterns and relate L̃(V, θ) to a coherence measure for the patterns.

2.2 Memory, Competition, and Coherence. We consider P patterns with components ξ_k^p ∈ {0, 1}, k = 1, ..., N. Using these patterns, we specify the weights g_kl of equations 1.2 and 1.3 as

$$g_{kl} = d_{kl} + \underbrace{\frac{1}{P}\sum_{p} \xi_k^p \xi_l^p}_{\equiv\, h_{kl}(\xi)} + \lambda\, \underbrace{\left( \frac{-1}{P^2} \right)\sum_{p \neq q} \xi_k^p \xi_l^q}_{\equiv\, c_{kl}(\xi)}. \tag{2.6}$$

The couplings d correspond to a background geometry, the excitatory part h corresponds to Hebbian memory, and the inhibitory part c establishes competition between the patterns, with λ ≥ 0. Analogously, any inhibitory part of d establishes competition between units.
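A short sketch of this weight construction follows, with the P binary patterns stored as rows of an array ξ; the identity Σ_{p≠q} ξ_k^p ξ_l^q = (Σ_p ξ_k^p)(Σ_q ξ_l^q) − Σ_p ξ_k^p ξ_l^p avoids an explicit double loop. Interface and names are our assumptions.

```python
# Sketch of the weight construction of equation 2.6. xi is a P x N array of
# binary patterns; d is an N x N background geometry; lam is the competition
# strength lambda.
import numpy as np

def mixture_weights(xi, d, lam):
    P = xi.shape[0]
    h = xi.T @ xi / P                          # Hebbian part h_kl(xi)
    s = xi.sum(axis=0)                         # sum_p xi_k^p
    c = -(np.outer(s, s) - xi.T @ xi) / P**2   # competition part c_kl(xi)
    return d + h + lam * c
```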
The complex coordinates of equation 2.1 may be used to define a coherence measure C_p for each pattern p:

$$Z_p(V, \theta) = \frac{1}{N_p}\sum_{k} \xi_k^p z_k = C_p \exp(i\Phi_p), \tag{2.7}$$
where Z_p, C_p, Φ_p depend on V, θ and |Z_p| = C_p, 0 ≤ C_p ≤ 1. These correspond to the Kuramoto coherence measure restricted to the active parts of the stored patterns (Kuramoto, 1984; see Strogatz, 2000, for a review of the Kuramoto model and references to recent results). We call Φ_p the phase of pattern p and C_p its coherence. N_p denotes the number of nonzero components of pattern p. Using equation 2.6 with vanishing background geometry, that is, d = 0, equation 2.3 may then be written as

$$\tilde{L} = -\frac{\beta}{2N}\sum_{k,l} g_{kl} \cos(\theta_k - \theta_l)\, V_k V_l + P$$
$$= -\frac{\beta}{2N}\, \mathrm{Re}\!\left[ \sum_{k,l} g_{kl} \exp(i(\theta_k - \theta_l))\, V_k V_l \right] + P$$
$$= -\frac{\beta N}{2}\left[ \frac{1}{P}\sum_{p} \rho_p^2 C_p^2 - \frac{\lambda}{P^2}\sum_{p \neq q} \rho_p \rho_q C_p C_q \cos(\Phi_p - \Phi_q) \right] + P, \tag{2.8}$$

where ρ_p = N_p/N denotes the density of pattern p. Since L̃ could be given this form, without competition, that is, λ = 0, equations 2.5 imply that the β-dependent terms of equations 2.2 introduce a tendency to maximize the coherence of all patterns. With λ > 0, this tendency is accompanied by a competition between the patterns, so that any pattern p with significant coherence C_p will try to suppress the coherence C_q of any other pattern q or will arrange for a phase difference between p and q so that L̃ is minimized. For example, assume P = 2 and both patterns are nonoverlapping. The first sum in the last line of equation 2.8 will imply C_1, C_2 → 1, while the second sum introduces a phase difference (Φ_2 − Φ_1) → π/2 mod π in order to minimize L̃. In fact, such a dynamics is exactly what we observe in example 3 in section 4.4.1.
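The coherence measure of equation 2.7 and the coherence form of the Lyapunov function in equation 2.8 can be evaluated directly, as in the following sketch, which assumes d = 0 and omits the amplitude potential P(V) of equation 2.4; all names are illustrative.

```python
# Sketch of the pattern coherence of equation 2.7 and the coherence-dependent
# part of equation 2.8. xi holds the P binary patterns as rows.
import numpy as np

def pattern_coherence(V, theta, xi):
    z = V * np.exp(1j * theta)
    Zp = (xi @ z) / xi.sum(axis=1)                    # equation 2.7
    return np.abs(Zp), np.angle(Zp)                   # C_p, Phi_p

def coherence_energy(V, theta, xi, lam, beta):
    P, N = xi.shape
    Cp, Php = pattern_coherence(V, theta, xi)
    rho = xi.sum(axis=1) / N                          # pattern densities
    a = rho * Cp
    cross = np.outer(a, a) * np.cos(Php[:, None] - Php[None, :])
    s_pq = np.sum(cross) - np.sum(a**2)               # sum over p != q
    return -0.5 * beta * N * (np.sum(a**2) / P - lam * s_pq / P**2)
```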
This tendency toward coherence of a pattern and decoherence between nonoverlapping patterns bears a resemblance to the results obtained for the Wang-Terman model (Terman & Wang, 1995). The Wang-Terman model, however, is based on coupled relaxation oscillators, and patterns were not stored via Hebbian connections. Instead, a specific two-dimensional background geometry was assumed for the purpose of image segmentation. Using the two-dimensional background geometry makes the Wang-Terman model rather comparable to lattice models such as the ones studied in Sakaguchi, Shinomoto, and Kuramoto (1987, 1988). The latter, however, were not applied to image segmentation, do not use inhibition, and do not have on and off states. Moreover, differences between the synchronization properties of relaxation and nonrelaxation oscillators have been described in Somers and Kopell (1993, 1995) and Izhikevitch (2000).

2.3 Phase Dynamics with Vanishing Amplitudes. The coupling in equation 2.2b has a remarkable effect on the synchronization properties (see also Hoppensteadt & Izhikevitch, 1997). It contains the terms
$$\text{GGL:}\qquad \frac{1}{V_k}\,\frac{\beta}{N}\sum_{l=1}^{N} g_{kl}\,\sin(\theta_l-\theta_k)\,V_l \tag{2.9}$$
on the right-hand side of equation 2.2b. Due to the factor $1/V_k$, the units get more susceptible to phase differences as $V_k \to 0$. Moreover, a state $V_k$ approaching the off-state is synchronized more strongly with on-states $V_l$, not with other off-states (we assume $g_{kl} > 0$). Thus, there is a strong tendency toward global synchrony (infinitely strong for off-states, since $1/V_k \to \infty$ as $V_k \to 0$), where on- and off-states all acquire the same phase.

In section 3.2 we will assume $V_k = g(u_k)$ in order to compare equations 1.1 with the GGL model. The tendency toward global coherence, resulting from equation 2.9, is then not compatible with our interpretation of the $V_k$ (see section 1). We intend to identify on and off states of units, respectively, with high and low frequencies $d\theta_k/dt$. Therefore, the phase of a unit that is close to an off-state should not be strongly driven toward synchronization with on-state phases. When returning to equations 1.1a and 1.1b in the next section, we will therefore find that our model differs from the associative memory GGL model not only by introducing the frequency-driving activity $d\theta_k/dt = (2\pi/\tau_\theta)\,g(u_k) + \ldots$, but also by accompanying this change with an alternative phase coupling that allows off-state units to have phases that differ from on-state phases. This feature is necessary for using phase differences to avoid the superposition problem.

2.4 Weak Couplings and Temporal Coding. The central topic of this article is the superposition problem and its possible solution based on grouping of neural oscillators due to temporal properties. Therefore, it should be compared to an earlier approach that will now be reviewed in the context of weakly coupled systems.
For the case of weak couplings, the amplitude dynamics in equations 2.2, with $\tilde I_k > 0$, $k = 1,\ldots,N$, may be adiabatically eliminated (see Kuramoto, 1984). This will leave the phase dynamics

$$\frac{d\theta_k}{dt} = \omega_k + \frac{\beta}{N}\sum_{l=1}^{N}\tilde g_{kl}\,\sin(\theta_l-\theta_k), \tag{2.10}$$
where $0 < \beta \ll 1$ and $\tilde g_{kl}$ is obtained from the adiabatic values $V_k \to V_k^\infty$, $V_l \to V_l^\infty$, leading to $\tilde g_{kl} = (V_l^\infty/V_k^\infty)\, g_{kl}$. Weak couplings will also imply $V_k^\infty > 0$, so that $\tilde g_{kl} < \infty$. In contrast, oscillator death with $V_k^\infty = 0$ (as well as self-ignition, $\tilde I_k < 0$) may occur for stronger couplings (see Aronson, Ermentrout, & Kopell, 1990). In equation 2.10, we also reestablished nonidentical frequencies $\omega_k$. In equation 2.2b, these were set to $\Omega$ and eliminated by going to the comoving frame, $\theta_k \to \theta_k + \Omega t$. Nonidentical frequencies have been used to propose neural grouping based on different frequency (see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). This frequency modulation (FM) proposal uses a number of basic frequencies $\Omega_\gamma$, $\gamma = 1,\ldots,G$, and assumes that

$$\omega_k \in \{\Omega_1, \Omega_2, \ldots, \Omega_G\}, \tag{2.11}$$
with $G \le N$. Two units $k$ and $l$ will then belong to the same group (ensemble, pool) if their natural frequencies are the same, $\Omega_{\gamma(k)} = \Omega_{\gamma(l)}$, up to terms of order β. This approach gets particularly interesting when higher-order phase couplings are included in equation 2.10. The equality condition is then generalized to a resonance condition (Hoppensteadt & Izhikevitch, 1997, sec. 9.4.3). In section 3.4, we relate the FM proposal to our approach.

3 Phase Extension of Classical Networks

In this section we return to the model of equations 1.1. The system is compared with the GGL model, pattern recognition capabilities are discussed, and temporal coding is compared with the FM proposal mentioned at the end of the previous section.

3.1 The Model. The system of equations 1.1 reads

$$\tau_u\,\frac{du_k}{dt} = -u_k + \frac{1}{N}\sum_{l=1}^{N} w_{kl}(\theta)\,g(u_l) + I_k, \tag{1.1a}$$
$$\tau_\theta\,\frac{d\theta_k}{dt} = 2\pi\,g(u_k)\,\bigl(1 + S_k(u,\theta)\bigr), \tag{1.1b}$$
where

$$w_{kl}(\theta) = g_{kl}\,\bigl(\alpha + \beta\cos(\theta_k-\theta_l)\bigr), \tag{1.2}$$
$$S_k(u,\theta) = \frac{\beta}{N}\sum_{l=1}^{N} g_{kl}\,\sin(\theta_l-\theta_k)\,g(u_l). \tag{1.3}$$
The weights $w_{kl}$ agree with the $\tilde w_{kl}$ of the GGL model, except for the parameter α that was introduced to give the weights also a classical part. The term classical refers to phase independence (see the discussion in von der Malsburg, 1999). This difference was motivated by introducing in equations 1.1 a phase coupling that does not replace but complements classical network dynamics. In fact, for the examples of section 4, we always use $\alpha > \beta \ge 0$. The $S_k(u,\theta)$ are obtained from $S_k(V,\theta)$ by replacing $V_l \to g(u_l)$.

With $\beta > 0$, the $\sin(\theta_l-\theta_k)$ in $S_k$ introduce a tendency to synchronize (desynchronize) the units $k$ and $l$, assuming $g_{kl} > 0$ ($g_{kl} < 0$), and the $\cos(\theta_k-\theta_l)$ in $w_{kl}$ strengthens (weakens) the couplings if the units are in-phase (out-of-phase). These features agree with those of equations 2.2. Notice, however, the different strengths of phase couplings, given by $g(u_k)$ in equations 1.1 and by $1/V_k$ in equations 2.2, which lead to the different behavior of units that are close to off-states. We will return to a comparison between equations 1.1 and equations 2.2 in section 3.2.

With $\beta = 0$, the weights are constant, $w_{kl}(\theta) = \alpha g_{kl}$, and equation 1.1a decouples from the phase dynamics, thereby reducing to a classical model, while the phase dynamics reduces to uncoupled oscillations, $d\theta_k/dt = (2\pi/\tau_\theta)\,g(u_k)$. In this sense, the complete dynamics of equations 1.1a and 1.1b is an extension of the classical dynamics.

3.2 Gradient Dynamics, Frequency-Driving Activity, and Comparison with GGL Models. In the following, we identify $V_k \equiv |z_k| = g(u_k)$. Interpreting the activity $g(u_l)$ as an amplitude allows a comparison of equations 1.1 with 2.2. Notice that we do not restrict the $V_k$ to be small; the range of the $V_k$ is $0 < V_k < 1$, due to the definition of $g$. As $|u_k| \to \infty$, the $V_k$ run into saturation.

The comparison of equations 1.1 with 2.2 will get most transparent by relating equations 1.1 to gradient dynamics, in analogy to section 2.1. Using the weights $w_{kl}(\theta)$ of equation 1.2, a natural generalization of the Cohen-Grossberg-Hopfield function (Cohen & Grossberg, 1983; Hopfield, 1984) for amplitudes and phases reads
$$L(V,\theta) = -\frac{1}{2N}\sum_{k,l} w_{kl}(\theta)\,V_k V_l + P(V), \tag{3.1}$$
where the potential is given by

$$P(V) = \sum_k \int^{V_k}\bigl(g^{-1}(x) - I_k\bigr)\,dx \tag{3.2}$$
$$\phantom{P(V)} = \sum_k\left[\frac{1}{2}\bigl(V_k\ln V_k + (1-V_k)\ln(1-V_k)\bigr) - I_k V_k\right]. \tag{3.3}$$
For deriving equation 3.3, the $u_k$ may be obtained from $V_k$ through

$$u_k = g^{-1}(V_k) = \frac{1}{2}\ln\frac{V_k}{1-V_k}. \tag{3.4}$$
We applied the integration formula $\int\ln x\,dx = x\ln x - x$ and set the integration constant in equation 3.3 to zero.

We may now relate equations 1.1a and 1.1b to $L$, just as equations 2.2 were related to $L$ in section 2.1. Using $dg(u_k)/du_k = 2V_k(1-V_k)$, equations 1.1 may be written as

$$\tau_u\,\frac{dV_k}{dt} = 2V_k(1-V_k)\left(-\frac{1}{2}\ln\frac{V_k}{1-V_k} + \frac{1}{N}\sum_{l=1}^{N} w_{kl}(\theta)\,V_l + I_k\right) \tag{3.5a}$$
$$\phantom{\tau_u\,\frac{dV_k}{dt}} = 2V_k(1-V_k)\left(-\frac{\partial}{\partial V_k}L(V,\theta)\right), \tag{3.5b}$$
$$\tau_\theta\,\frac{d\theta_k}{dt} = 2\pi V_k\bigl(1 + S_k(V,\theta)\bigr) \tag{3.5c}$$
$$\phantom{\tau_\theta\,\frac{d\theta_k}{dt}} = 2\pi V_k + 2\pi\left(-\frac{\partial}{\partial\theta_k}L(V,\theta)\right). \tag{3.5d}$$
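As a check on the chain-rule step, the following symbolic sketch (ours, not from the paper) assumes the logistic activation $g(u) = 1/(1+e^{-2u})$ implied by equation 3.4.

```python
# Symbolic verification (a sketch, assuming g(u) = 1/(1 + exp(-2u)),
# the activation implied by equation 3.4) of the factor 2V(1-V)
# appearing in equations 3.5a and 3.5b.
import sympy as sp

u, V = sp.symbols('u V', positive=True)
g = 1 / (1 + sp.exp(-2 * u))

# dg/du = 2 g (1 - g): the factor 2 V_k (1 - V_k)
assert sp.simplify(sp.diff(g, u) - 2 * g * (1 - g)) == 0

# g^{-1}(V) = (1/2) ln(V / (1 - V)): equation 3.4
u_of_V = sp.log(V / (1 - V)) / 2
assert sp.simplify(g.subs(u, u_of_V) - V) == 0

print("dg/du = 2g(1-g) and equation 3.4 verified")
```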
The system splits into a gradient part and the frequency-driving activity term $2\pi V_k$. For the remainder of this section, we compare the system of equations 3.5 with the GGL model of equations 2.2 and 2.5. The amplitude dynamics of equation 3.5a is similar to equation 2.2a. The ln term in equation 3.5a is a higher-order analog of the $\tilde I_k V_k - V_k^3$ terms in equation 2.2a. The additional factors $V_k(1-V_k)$ correspond to the saturation as $V_k \to 0, 1$. The main difference is in the phase dynamics. Notice first that the GGL model could be formulated as the pure gradient system of equations 2.5a and 2.5b only by assuming vanishing shear, that is, $\eta_k = 0$ in equation 2.1. Therefore, the term $2\pi V_k$ in equation 3.5d may be compared to the shear term of the GGL model that is usually set to zero when pattern recognition is implemented (see Hoppensteadt & Izhikevitch, 1997, sec. 10.4).
In contrast, we consider the nonvanishing shear-like term to be essential for our interpretation of the oscillatory units. Setting the $2\pi g(u_k) + \ldots$ term on the right-hand side of equation 1.1b to zero would destroy the relation between the activity $u_k$ and the frequency $d\theta_k/dt$ of the signals. This realizes the requirement that on and off states should correspond, respectively, to high and low frequencies of the phases. Moreover, in order to establish a consistent picture, the effect of the $S_k(u,\theta)$ should not be growing as $V_k \to 0$. In section 2.3, we mentioned that the coupling $S_k/V_k$ in equation 2.2b leads to global synchrony, establishing a strong tendency toward equal phases of on- and off-states. Such a coupling would be in conflict with low frequencies as $V_k \to 0$. Therefore, an obvious difference between equations 2.2 and equations 3.5 is that the latter uses the coupling $V_k S_k$ instead. Thereby, synchronization between units $k$ and $l$, assuming $g_{kl} > 0$, is enforced only for on-states; that is, the unit $k$ gets less susceptible to phase differences as $V_k \to 0$. In terms of the gradient system, the difference arises due to the factor $1/V_k^2$ in equation 2.5b that is absent in equation 3.5d. Therefore, with regard to the $V_k$, the coupling in equations 3.5a to 3.5d is of higher order than the couplings of the GGL model.

3.3 Pattern Recognition Capabilities. Patterns stored according to equation 2.6 are related to minima of $L$. Correspondingly, one could expect that the gradient dynamics in equations 3.5a to 3.5d tends to retrieve stored patterns by approaching the minima of $L$. However, due to the term $2\pi V_k$ in equation 3.5d, the system of equations 1.1 is not a pure gradient system. We find

$$\tau_u\,\frac{d}{dt}L(V,\theta) = -\sum_k\left[2V_k(1-V_k)\left(\frac{\partial L}{\partial V_k}\right)^{\!2} + \frac{\tau_u}{\tau_\theta}\,\frac{\partial L}{\partial\theta_k}\left(\frac{\partial L}{\partial\theta_k} - V_k\right)\right]. \tag{3.6}$$
In contrast to GGL models with vanishing shear, the right-hand side includes terms that may be nonnegative. These may imply an increase in the $L(V,\theta)$ values. Let us write the corresponding terms in the form

$$\sum_k V_k\,\frac{\partial L}{\partial\theta_k} = \frac{\beta}{N}\sum_{k,l}\frac{V_l - V_k}{2}\,\sin(\theta_l-\theta_k)\,V_k\, g_{kl}\, V_l. \tag{3.7}$$
In the following, we will discuss the effect of these terms on the pattern recognition capability of the system.
The terms of equation 3.7 result from the first term $2\pi V_k + \ldots$ on the right-hand side of equation 1.1b. Why could this term cause an increase of $L(V,\theta)$? Say, $\theta_k = \theta_l + \delta\theta \pmod{2\pi}$ with $0 < \delta\theta < \pi$. On one hand, the sign of $\sin(\theta_l-\theta_k)$ is such that the $\theta_l$ tend to be sped up and the $\theta_k$ tend to be slowed down in order to reach synchrony (assuming $g_{kl} > 0$ and $\alpha > \beta \ge 0$). With $V_k - V_l < 0$, this dynamics is supported by the frequency-driving activity terms. The corresponding terms in equation 3.7 will be negative and will imply a tendency of $L(V,\theta)$ to decrease. On the other hand, for $V_k - V_l > 0$, the activity term may outvote the $\sin(\theta_l-\theta_k)$ tendency: $\theta_k$ could remain faster than $\theta_l$. This corresponds to the positive terms of equation 3.7, which may cause $L(V,\theta)$ to increase. As a result, however, $\theta_k$ will approach $\theta_l$ from "the other side"; that is, the situation $\theta_k = \theta_l - \delta\theta \pmod{2\pi}$ is reached. Should $V_k - V_l > 0$ still be the case, the terms of equation 3.7 will then become negative and will imply a tendency of $L(V,\theta)$ to decrease again.

Obviously, without additional inquiry, we may not judge whether the foregoing increase will be cancelled by the following decrease of $L(V,\theta)$. Therefore, we now look more closely at the $V$ dynamics. Assuming an adiabatic approximation, $\tau_u \ll \tau_\theta$, we find that the only term that may become negative inside the bracket of equation 3.6 is suppressed by $(\tau_u/\tau_\theta)$. Moreover, due to their (fast) dynamics, the $V_k$ will have approached their on or off states (then $\partial L/\partial V_k \to 0$) when the terms of order $\tau_u/\tau_\theta$ get relevant. Assuming sufficiently large values of α in equation 1.2, we may arrange the set of indices $k$ so that $V_k = 1 - \epsilon_k$, $k = 1,\ldots,M$, and $V_k = \epsilon_k$, $k = M+1,\ldots,N$, where $0 < \epsilon_k \ll 1$, for some $M \le N$. Then

$$\sum_k V_k\,\frac{\partial L}{\partial\theta_k} \;\to\; \frac{\beta}{N}\sum_{k,l=1}^{M}\sin(\theta_k-\theta_l)\,g_{kl} + O(\epsilon) = O(\epsilon), \tag{3.8}$$
due to the antisymmetry of $\sin(\theta_k-\theta_l)\,g_{kl}$. Combining these aspects, we find that the frequency-driving activity terms only imply a $(\tau_u/\tau_\theta)\,O(\epsilon)$ correction to $dL/dt$, and we may expect that these terms will not significantly affect the pattern recognition capabilities. This expectation will be confirmed by the examples in section 4.

3.4 Comparison with the Frequency Modulation Approach to Temporal Coding. Before giving examples for the pattern recognition behavior, we want to relate equations 1.1 to the frequency modulation (FM) approach described in section 2.4. Remember the weak coupling limit in equation 2.10:

$$\frac{d\theta_k}{dt} = \Omega_{\gamma(k)} + \frac{\beta}{N}\sum_{l=1}^{N}\tilde g_{kl}\,\sin(\theta_l-\theta_k). \tag{3.9}$$

(In equation 1.1b, the corresponding terms are $(2\pi/\tau_\theta)\,g(u_k)$ in place of $\Omega_{\gamma(k)}$ and $g(u_k)\,g_{kl}\,g(u_l)$ in place of $\tilde g_{kl}$.)
Here, we have added a comparison with equation 1.1b. We may relate the FM proposal to equations 1.1a and 1.1b by assuming two basic frequencies: $\Omega_1 = 0$ (off-states) and $\Omega_2 = 2\pi/\tau_\theta$ (on-states). The difference is that for the FM approach, the $\Omega_\gamma$ have been external parameters, while in equations 1.1a and 1.1b, they are subject to the dynamics of equation 1.1a. Whether the system approaches $\Omega_1$ or $\Omega_2$ depends on external inputs and initial values. Notice also that in the context of the complete dynamics of equations 1.1a and 1.1b, the couplings are also subject to a dynamics. Due to the term $g(u_k)\,g_{kl}\,g(u_l)$, a synchronization is not enforced if one or both of the units $k, l$ approach an off-state.

In the context of brain dynamics, it has been speculated that the external character of natural frequencies would actually turn dynamical in a more complete setting: "It is reasonable to speculate that the brain has mechanisms to regulate the natural frequencies of its neurons so that the neurons can be entrained into different pools at different times simply by adjusting the $\Omega$'s" (Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). In section 1, we mentioned that here we do not aim at a biological interpretation. Nevertheless, it is obvious that equations 1.1 provide an extension of equation 3.9, where the analog of the natural frequencies may result from a dynamical process.

Notice, however, a difference between the FM approach and the direction we take. The FM approach uses only frequency gaps for separating the neural pools, labeled by $\Omega_1, \ldots, \Omega_G$. The analogy with equations 1.1 goes only to separating on-states ($\Omega_2$) and off-states ($\Omega_1$). Among the on-states, we continue with a separation based on phase gaps. In particular, for overlapping patterns, competitive couplings are used to separate their phases. Using competition as the separating mechanism has been proposed already in von der Malsburg (1981). With regard to biological interpretation, it has been suggested that temporal coding based on frequency gaps versus temporal coding based on phase gaps may be complementary (Hoppensteadt & Izhikevitch, 1999), the former being valid in the weak-coupling regime, while the latter may be more suitable for strong couplings (Izhikevitch, 1999). Notice in this respect that the model of equations 1.1 is indeed applicable with strong couplings, in particular due to the saturation properties of the activation function.

4 Examples

In this section, we illustrate the dynamics of equations 1.1 in the context of simple examples. We continue to use the short form $V_k = g(u_k)$. In section 4.1, we comment on input choices. In section 4.2, we specify the parameters and present the patterns that we use for the examples. In section 4.3, we consider the weights without competition, that is, $\lambda = 0$ in equation 2.6. We apply external input to the network and demonstrate how
coherent pattern recognition retrieves the stored patterns. As expected, the frequency-driving activity $2\pi V_k$ separates the phases of active units from the background. We compare coherent pattern recognition to the classical dynamics. This will help in understanding the limitations of the classical approach and valuing the additional features that arise due to the phase couplings, most notably the avoidance of the superposition problem. In section 4.4, we include competition between the patterns, that is, $\lambda > 0$, and study the resulting effects.

4.1 Pattern Retrieval and Input Choices. Choosing the activation function $g$ of equations 1.1 so that on and off states of a unit $k$ correspond, respectively, to the limits 1 and 0 of $V_k$ introduces a subtlety regarding the inputs $I$. This may be understood when writing equations 1.1 in terms of $h(x) = \tanh(x) = 2g(x) - 1$ instead. For this discussion, it is sufficient to consider $\beta = 0$; then

$$\tau_u\,\frac{du_k}{dt} = -u_k + \frac{\alpha}{2N}\sum_{l=1}^{N} g_{kl}\,h(u_l) + J_k, \tag{4.1}$$
where $J$ is related to $I$ through

$$I_k = J_k - \frac{\alpha}{2N}\sum_{l=1}^{N} g_{kl}. \tag{4.2}$$
The subtlety is related to the interpretation of vanishing inputs. Vanishing inputs should be identified not with vanishing $I$ but with $J = 0$. The related $I$ value is then given by the second term in equation 4.2. Only for such a value of $I$ will $u = 0$ imply $du/dt = 0$, so that vanishing inputs correspond to a neutral or undecided state at the origin of $u$-space. In this section, we specify the inputs in terms of $J$.

4.2 A Simple Network. We now want to illustrate the dynamics of equations 1.1 by discussing simple examples. We choose $N = 6$, $P = 3$, and store the patterns

$$\xi^1 = \begin{pmatrix}1\\1\\0\\0\\0\\0\end{pmatrix},\qquad \xi^2 = \begin{pmatrix}0\\0\\1\\1\\0\\0\end{pmatrix},\qquad \xi^3 = \begin{pmatrix}0\\0\\0\\1\\1\\1\end{pmatrix}, \tag{4.3}$$
Figure 1: The stored patterns according to equation 4.3, drawn together with the resulting Hebbian connections $h_{kl}$ of equation 4.4 and numbering of the units. The filled and empty circles correspond, respectively, to on and off states.
with $N_1 = N_2 = 2$, $N_3 = 3$. This form allows us to distinguish two cases: overlapping and nonoverlapping patterns. Pattern 1 does not overlap with patterns 2 and 3, while patterns 2 and 3 overlap at unit 4. The patterns are illustrated in Figure 1. The Hebbian weights $h$ and the inhibitory competition terms $c$ are given by equation 2.6:
$$h_{kl} = \frac{1}{P}\begin{pmatrix}
1&1&0&0&0&0\\
1&1&0&0&0&0\\
0&0&1&1&0&0\\
0&0&1&2&1&1\\
0&0&0&1&1&1\\
0&0&0&1&1&1
\end{pmatrix},\qquad
c_{kl} = -\frac{1}{P^2}\begin{pmatrix}
0&0&1&2&1&1\\
0&0&1&2&1&1\\
1&1&0&1&1&1\\
2&2&1&2&1&1\\
1&1&1&1&0&0\\
1&1&1&1&0&0
\end{pmatrix}. \tag{4.4}$$
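As a quick sanity check, the matrices of equation 4.4 can be recomputed from equation 2.6 and the patterns of equation 4.3; the following sketch is ours, not part of the paper.

```python
# Recompute the matrices of equation 4.4 from equation 2.6
# (a sketch for checking; not the authors' code).
import numpy as np

xi = np.array([[1, 1, 0, 0, 0, 0],    # pattern 1, eq. 4.3
               [0, 0, 1, 1, 0, 0],    # pattern 2
               [0, 0, 0, 1, 1, 1]],   # pattern 3
              dtype=float)
P, N = xi.shape

# Hebbian part: h_kl = (1/P) sum_p xi_k^p xi_l^p
h = xi.T @ xi / P

# Competition part: c_kl = -(1/P^2) sum_{p != q} xi_k^p xi_l^q
s = xi.sum(axis=0)                     # s_k = sum_p xi_k^p
c = -(np.outer(s, s) - P * h) / P**2

print((P * h).astype(int))             # integer matrix printed in eq. 4.4
print((-P**2 * c).astype(int))
```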
We set $d = 0$ in equation 2.6 since we are not interested in background effects. In the following, the global Hebbian parameter α is set to $\alpha = 2PN$. This is large enough to establish attractors of the system that realize storage of the patterns. On-states correspond to $V_k \to 1$, while off-states correspond to $V_k \to 0$. The timescales are set to $\tau_u = \tau_\theta/4 = \tau$. This is a mild form of the $\tau_u \ll \tau_\theta$ scenario underlying the adiabatic approximation in section 3.3. Simulations are performed using Euler's method with time step $dt = 0.02\tau$. The initial values, random values for the θ and small random values for $u_k$ (so that $V_k \approx 1/2$), are the same for all examples in this section. The examples are distinguished by different values for β, λ, and inputs $I_k$ (expressed in terms of $J_k$; see section 4.1). A minimal simulation sketch follows.
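The code below is our reconstruction of this simulation, not the authors' code: Euler integration of equations 1.1a and 1.1b with the weights of equation 2.6, inputs specified through $J_k$ via equation 4.2 (derived for $\beta = 0$ but used here throughout, as in the text), and the pattern coherences $C^p$ of equation 2.7 evaluated at the end.

```python
# A minimal Euler-integration sketch of equations 1.1a/1.1b
# (our reconstruction, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
xi = np.array([[1, 1, 0, 0, 0, 0],
               [0, 0, 1, 1, 0, 0],
               [0, 0, 0, 1, 1, 1]], dtype=float)   # eq. 4.3
P, N = xi.shape
h = xi.T @ xi / P                                  # eq. 2.6, Hebbian part
s = xi.sum(axis=0)
c = -(np.outer(s, s) - P * h) / P**2               # eq. 2.6, competition

def g(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))          # so that eq. 3.4 holds

def simulate(beta_frac, lam, J, steps=750, tau=1.0):
    alpha = 2 * P * N                              # as chosen in the text
    beta = beta_frac * alpha
    gkl = h + lam * c                              # eq. 2.6 with d = 0
    tau_u, tau_th, dt = tau, 4.0 * tau, 0.02 * tau # tau_u = tau_theta / 4
    I = J - alpha / (2 * N) * gkl.sum(axis=1)      # eq. 4.2
    u = 0.05 * rng.standard_normal(N)              # V_k ~ 1/2 initially
    th = 2 * np.pi * rng.random(N)
    for _ in range(steps):
        V = g(u)
        w = gkl * (alpha + beta * np.cos(th[:, None] - th[None, :]))  # eq. 1.2
        S = beta / N * (gkl * np.sin(th[None, :] - th[:, None])) @ V  # eq. 1.3
        u = u + dt * (-u + w @ V / N + I) / tau_u                     # eq. 1.1a
        th = th + dt * 2 * np.pi * V * (1 + S) / tau_th               # eq. 1.1b
    V = g(u)
    Z = (xi * (V * np.exp(1j * th))).sum(axis=1) / xi.sum(axis=1)     # eq. 2.7
    return V, th, np.abs(Z)                        # activities, phases, C^p

# Example 1 below: inhibitory input at unit 1, excitatory at unit 6 (E = 10)
J = np.zeros(N); J[0], J[5] = -10.0, 10.0
V, th, C = simulate(beta_frac=0.1, lam=0.0, J=J)
print(np.round(V, 2), np.round(C, 2))
```

The settings of examples 2 and 3 correspond to different choices of `beta_frac`, `lam`, and `J`.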
4.3 Coherent Pattern Recognition. We begin with two examples for the response to external inputs. In section 4.1 we argued that vanishing inputs should be identified with $J_k = 0$, where $J$ is related to $I$ by equation 4.2. Therefore, excitatory input should be identified with $J_k > 0$ and inhibitory input with $J_k < 0$. Both examples compare coherent pattern recognition with the classical case. The classical case corresponds to $\beta = 0$. For the phase coding scenario, we choose $\beta = 0.1\alpha$. In this section, we do not include competition—$\lambda = 0$.

4.3.1 Example 1. First, we set $J_1 = -E$, $J_6 = +E$, where $E = 10$, while all other inputs vanish. The resulting dynamics is displayed in Figure 2. We find that the inhibitory input at unit 1 suppresses pattern 1, while the excitatory input at unit 6 excites pattern 3. Moreover, due to its overlap with pattern 3, pattern 2 is also excited. This is true for the classical and the phase coding cases. It is the behavior that should be expected. If phase coding is included, $\beta = 0.1\alpha$, we find that the active units synchronize after a few multiples of the timescale.

The example confirms that we succeeded in constructing a pattern recognition mechanism that retrieves patterns in a coherent mode. The frequency-driving activity term in equation 1.1b is essential in separating the active from nonactive units. The frequencies of active units approach $2\pi/\tau_\theta$, while the frequencies of nonactive units are frozen toward zero. The implications for the superposition problem are illustrated with the next example.

4.3.2 Example 2. The superposition problem may now be demonstrated by choosing both inputs to be positive, $J_1 = J_6 = +E$. Then pattern 1 is also excited (see Figure 3). In the classical case, $\beta = 0$, the superposition of active patterns no longer allows us to distinguish the single patterns. For example, it is no longer possible to distinguish pattern 1 since all units are active. Consequently, there is a problem for information processing along the succeeding stages. This is the superposition catastrophe that plagues the classical approach (Rosenblatt, 1961; von der Malsburg, 1981).

Comparing the classical situation to the case $\beta = 0.1\alpha$ makes it obvious why the inclusion of phase coding helps to avoid the superposition problem. Now the units carry not only information about being on or off, related to high or low frequencies, but also information about the phases. We find that units 1 and 2 of pattern 1 synchronize separately from units 3 to 6 of patterns 2 and 3. As a result, in the output, pattern 1 may be distinguished from the rest due to its different phase. Whenever the succeeding stage of information processing is sensitive to this phase difference, it will be able to identify pattern 1 despite the fact that the other units are also on.
Figure 2: Example 1. The lines connecting the units indicate the Hebbian connections $h_{kl}$ of equation 4.4 (see Figure 1). The filled and empty circles correspond, respectively, to on and off states. We compare the classical case ($\beta = 0$) with phase coding ($\beta > 0$). (A) $\beta = 0$, the classical case. The inhibitory input $J_1$ suppresses pattern 1. The excitatory input $J_6$ activates patterns 3 and 2 through unit 4. (B) $\beta = 0.1\alpha$, phase coding. Coherent activity is indicated by circles with parallel stripes. The connected active units synchronize within a few multiples of the timescale. The corresponding patterns reach maximal coherence. (Whenever mod 2π is applied to θ, the values are connected for the sake of better visualization.)
Figure 3: Example 2. Both external inputs are now excitatory. As a consequence, all units are activated. Again, we compare the classical case (β = 0) with that of phase coding (β > 0). (A) β = 0, the classical case. The superposition problem arises. The activation carries no information about single patterns. (B) β = 0.1α, phase coding. Now the units of the two connected components synchronize separately. Different orientations of the stripes correspond to different phases. The superposition problem is avoided. Pattern 1 is distinguishable from the rest. Separating patterns 2 and 3 also requires competition (see example 3).
4.4 Competition Between Patterns. While the superposition problem is a severe drawback in the application of classical networks, another problem is the mixed states that correspond to a common activation of overlapping patterns. This arises when using the Hebbian weights without additional coupling of the units. It is only a problem, of course, if we aim at getting a unique pattern as the output of the network. Then the situation of examples 1 and 2 is not satisfactory since patterns 2 and 3 are activated simultaneously due to being connected through Hebbian connections. This is why we now include additional inhibitory weights $c_{kl}$ that introduce competition between the patterns—that is, we use nonvanishing λ in equation 2.6.

4.4.1 Example 3. We set $\lambda = 1.2P$. No external inputs are applied. To make the influence of phase coding obvious, we now use a larger value for β by choosing $\beta = 0.8\alpha$ for the nonclassical case. The resulting dynamics is displayed in Figure 4. It may be understood as follows.

Let us first discuss the classical situation. When analyzing our random initial values for $u_k$, they are found to be slightly positive, so that the $V_k$ are slightly above 1/2 and the activities tend to run toward the on-states. This, however, invokes the competition process, and for the classical scenario, $\beta = 0$, the units of pattern 3 succeed in suppressing all the other activities, so that finally units 4 to 6 of pattern 3 approach the on-state attractor, while the other units move toward the off-state attractor. The flat lines in the phase diagram correspond to switched-off units 1 to 3. The dominance of pattern 3 is mainly due to $N_3 > N_2, N_1$.

With phase coding, the situation is different and allows an interesting observation. We find that although again pattern 3 dominates pattern 2, we now end up with a still active pattern 1. The reason for this may be understood when realizing that the two winning patterns are out of phase. In fact, the phase difference approaches π. Thus, pattern 1 escaped the competition by desynchronizing with pattern 3, so that the coupling $w(\theta)$ and thus the competition is weakened. As a result, the system ends up in a situation similar to Figure 3, but now the superposition problem between the overlapping patterns has been solved due to the competition: unit 3 and thereby pattern 2 is suppressed. Two patterns may therefore be simultaneously active and still be separable. This is another example of avoiding the superposition problem.

Notice that an analogous behavior occurs when global inhibition is used—that is, competition between units instead of competition between patterns. A possible example is obtained from $\lambda = 0$ and $d_{kl} = -1$ for all $k, l$ in equation 2.6. Again, in the classical case, pattern 3 is winning, while in the presence of phase coding, patterns 1 and 3 are active with different phases, as illustrated for the case of Figure 4.
Figure 4: Example 3. The network now also includes inhibitory weights $\lambda c_{kl}$, $\lambda = 1.2P$. These introduce competition between patterns. The network is now completely connected, as illustrated by the additional broken lines. No external input is applied. (A) $\beta = 0$, the classical case. Pattern 3 is the winning pattern (it is larger than the others). (B) $\beta = 0.8\alpha$, phase coding. Again pattern 3 wins over pattern 2. Pattern 1, however, is now able to survive. Phase coding allows pattern 1 to escape the competition by desynchronizing with pattern 3. This is illustrated by circles with stripes of orthogonal orientation.
5 Summary and Outlook

In this letter, we used the oscillatory network model of equations 1.1 to study a solution to the superposition catastrophe in the context of phase models. The system of equations 1.1 was obtained by extending classical neural network models to include phase dynamics. It was designed to meet two basic requirements: on and off states should correspond, respectively, to high and low frequencies of the phases, and patterns should be retrieved in a coherent (synchronized) mode.

Identifying the activity $g(u_k)$ of equations 1.1 with an oscillation amplitude $V_k$ allows us to compare the model with equations 2.2, a generalized version of the discrete complex Ginzburg-Landau (GGL) model with vanishing shear and identical natural frequencies. This model describes Stuart-Landau oscillators (i.e., oscillators close to a Hopf bifurcation) coupled to lowest order, a model that is frequently used for implementing associative memory (see Hoppensteadt & Izhikevitch, 2003). With $V_k = g(u_k)$, the GGL model does not obey the first of the two requirements. Its phase dynamics leads to different results as oscillatory units $k$ approach their off-state, $V_k \to 0$.

We also compared our approach with an alternative proposal for grouping oscillatory units based on temporal properties (see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). This frequency modulation (FM) proposal is founded on choosing different natural frequencies for oscillatory units, resulting in the grouping of units with nearly identical natural frequencies. With regard to a separation of on- and off-states, equations 1.1 may be seen as a dynamical extension of this approach. However, with regard to a grouping among on-states, the FM approach would be based on frequency gaps, while the model of equations 1.1 uses phase gaps. Correspondingly, we mentioned that (regarding biological systems) a complementarity has been suggested (Izhikevitch, 1999), according to which temporal coding with frequency gaps arises for weakly coupled systems, while temporal coding based on phase gaps may be more suitable for strong couplings.

In the context of equations 1.1, the solution of the superposition problem is straightforward. It is in accordance with original proposals of temporal coding (von der Malsburg, 1981, 1985; see also von der Malsburg, 2003). The frequency-driving activity term $2\pi g(u_k)$ in equation 1.1b separates the active units from the background so that active units, $V_k \to 1$, approach the frequency $(2\pi/\tau_\theta)(1 + S_k)$, while nonactive units, $V_k \to 0$, are frozen to zero frequency. Among the active units, the phases provide an additional labeling of the activities. In case of coherent pattern recognition, due to synchronizing phase couplings, each component may be identified with a single phase. Different patterns may then have the same frequency and may still be distinguished by the phase differences between the patterns. In order to avoid mixed states of overlapping patterns, a competition between the patterns may be introduced. This was achieved by introducing appropriate inhibitory weights in addition to the excitatory Hebbian weights. The coherence of the patterns may then be reduced to the winning subset.

In this letter, we did not specify mechanisms by which successive stages of information processing can read out the phases of different components. Future work should approach this issue so that a more complete picture arises. Moreover, taking the limit $N \to \infty$ is essential for obtaining phase
transitions between ordered and disordered states (see Kuramoto, 1984; Strogatz, 2000). Studying the proposed and related models in this limit should be of particular interest. Higher-order phase couplings, and possibly related phenomena such as clustering, should also be of interest (see Tass, 1999). In the context of the FM proposal, these led to resonance conditions for the natural frequencies.

An early version of the Wang-Terman model was studied for patterns that were overlapping at one unit, resembling our example in section 4 (Wang et al., 1990). Presentation of overlapping patterns as input led to common activation of these patterns: the units at the overlap participated in each of the overlapping patterns (see Wang et al., Figure 3). This differs from our example 3, where competition led to a winner among the overlapping patterns. If a common activation of overlapping patterns should be desirable, different approaches might work for the phase models. For example, the phase model could be extended to the coupling of higher-order modes. Overlapping units could then participate in a higher-frequency mode that may simultaneously synchronize with the lower-frequency nonoverlapping parts of the patterns via resonances. This would still allow the patterns to desynchronize in the nonoverlapping parts. Modifying the phase model to establish such a feature, however, is beyond the scope of this article.

We presented simple examples to illuminate the dynamical content of equations 1.1. Evidently, to gain importance as a pattern recognition method, it should be demonstrated that the method stands the test of more advanced applications. In the case of locally coupled relaxation oscillators, this step was taken by Wang and Terman, who studied segmentation of real images. They observed that a straightforward application of relaxation oscillator networks led to an unsuitable segmentation of the real image, producing many tiny fragments, a problem they called fragmentation. This problem was solved by introducing a lateral potential for each oscillator (Wang & Terman, 1997; see also Shareef et al., 1999). It will be of interest to study how the phase model should be applied to real images. In case fragmentation arises, it would be particularly interesting to study whether some hierarchical processing could help to combine the fragments, possibly in analogy to hierarchical processing supported by higher-level cortical areas. Given the relative simplicity of the phase models in comparison with coupled relaxation oscillator models, one may hope that the necessary steps to include high-level processing could be more easily analyzed and implemented.
Acknowledgment

It is a pleasure to thank Christoph von der Malsburg for inspiring and helpful discussions.
References

Abbott, L. F. (1990). A network of oscillators. J. Phys. A: Math. Gen., 23, 3835–3859.
Aoyagi, T. (1995). Networks of neural oscillators for retrieving phase information. Physical Review Letters, 74(20), 4075–4078.
Aronson, D., Ermentrout, G., & Kopell, N. (1990). Amplitude response of coupled oscillators. Physica D, 41, 403–449.
Baird, B. (1986). Nonlinear dynamics of pattern formation and pattern recognition in the rabbit olfactory bulb. Physica D, 22, 242–252.
Baldi, P., & Meir, R. (1990). Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Computation, 2, 458–471.
Chakravarthy, S. V., & Ghosh, J. (1994). A neural network–based associative memory for storing complex-valued patterns. In Proc. IEEE Int. Conf. Syst. Man Cybern. (pp. 2213–2218). Piscataway, NJ: IEEE.
Chakravarthy, S. V., & Ghosh, J. (1996). A complex-valued associative memory for storing patterns as oscillatory states. Biological Cybernetics, 75, 229–238.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 815–826.
Freeman, W. J., Yao, Y., & Burke, B. (1988). Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks, 1, 277–288.
Hirose, A. (Ed.). (2003). Complex-valued neural networks. Singapore: World Scientific.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the U.S.A., 81, 3088–3092.
Hoppensteadt, F. C., & Izhikevitch, E. M. (1996). Synaptic organization and dynamical properties of weakly connected neural oscillators: II. Learning phase information. Biological Cybernetics, 75, 129–135.
Hoppensteadt, F. C., & Izhikevitch, E. M. (1997). Weakly connected neural networks. Berlin: Springer-Verlag.
Hoppensteadt, F. C., & Izhikevitch, E. M. (1999). Thalamo-cortical interactions modeled by weakly connected oscillators: Could the brain use FM radio principles? BioSystems, 48, 85–94.
Hoppensteadt, F. C., & Izhikevitch, E. M. (2000). Pattern recognition via synchronization in phase-locked loop neural networks. IEEE Transactions on Neural Networks, 11, 734–738.
Hoppensteadt, F. C., & Izhikevitch, E. M. (2003). Canonical neural models. In M. Arbib (Ed.), Brain theory and neural networks (2nd ed., pp. 181–186). Cambridge, MA: MIT Press.
Izhikevitch, E. M. (1999). Weakly connected quasi-periodic oscillators, FM interactions, and multiplexing in the brain. SIAM J. Appl. Math., 59, 2193–2223.
Izhikevitch, E. M. (2000). Phase equations for relaxation oscillators. SIAM J. Appl. Math., 60, 1789–1805.
Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In H. Araki (Ed.), International Symposium on Mathematical Problems in Theoretical Physics (pp. 420–422). Berlin: Springer-Verlag.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Kuramoto, Y., Aoyagi, T., Nishikawa, I., Chawanya, T., & Okuda, K. (1992). Neural network model carrying phase information with application to collective dynamics. Progr. Theor. Phys., 87, 1119–1126.
Kuzmina, M., Manykin, E., & Surina, I. (1995). Associative memory oscillatory networks with Hebbian and pseudo-inverse matrices of connections. In Proceedings of the Third European Congress on Intelligent Techniques and Soft Computing (EUFIT'95), Aachen, Germany, August 28–31 (pp. 392–395). Bellingham, WA: SPIE International Society for Optical Engineering.
Kuzmina, M., & Surina, I. (1994). Macrodynamical approach for oscillatory networks. In Proceedings of SPIE: Optical Neural Networks (Vol. 2430, pp. 229–235). Aachen: ELITE Foundation.
Li, Z., & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processings. Biological Cybernetics, 61, 379–392.
Müller, H. J., Elliott, M. A., Herrmann, C. S., & Mecklinger, A. (Eds.). (2001). Visual Cognition, 8 [Special issue].
Noest, A. J. (1988a). Associative memory in sparse phasor neural networks. Europhysics Letters, 6, 469–474.
Noest, A. J. (1988b). Discrete-state phasor neural networks. Physical Review A, 38, 2196–2199.
Rosenblatt, F. (1961). Principles of neurodynamics: Perception and the theory of brain mechanism. Washington, DC: Spartan Books.
Roskies, A. L. (Ed.). (1999). Neuron, 24 [Special topic].
Sakagushi, H., Shinomoto, S., & Kuramoto, Y. (1987). Local and global self-entrainments in oscillator lattices. Progr. Theor. Phys., 77, 1005–1010.
Sakagushi, H., Shinomoto, S., & Kuramoto, Y. (1988). Mutual entrainment in oscillator lattices with nonvariational type interaction. Progr. Theor. Phys., 79, 1069–1079.
Shareef, N., Wang, D., & Yagel, R. (1999). Segmentation of medical images using LEGION. IEEE Transactions on Medical Imaging, 18, 74–91.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biological Cybernetics, 68, 393–407.
Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 88, 1–14.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1990). Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. USA, 87, 7200–7204.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Physical Review A, 43, 6990–7011.
Sompolinsky, H., & Tsodyks, M. (1992). Processing of sensory information by a network of coupled oscillators. International Journal of Neural Systems, 3 (Suppl.), 51–56.
Strogatz, S. H. (2000). From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D, 143, 1–20.
Takeda, M., & Kishigami, T. (1992). Complex neural fields with a Hopfield-like energy function and an analogy to optical fields generated in phase-conjugate resonators. J. Opt. Soc. Am. A, 9, 2182–2192.
Tass, P. A. (1993). Synchronisierte Oszillationen im visuellen Cortex—ein synergetisches Modell. Unpublished doctoral dissertation, Institut für Theoretische Physik und Synergetik der Universität Stuttgart.
Tass, P. A. (1999). Phase resetting in medicine and biology. Berlin: Springer-Verlag.
Tass, P., & Haken, H. (1996a). Synchronization in networks of limit cycle oscillators. Z. Phys. B, 100, 303–320.
Tass, P., & Haken, H. (1996b). Synchronized oscillations in the visual cortex—a synergetic model. Biological Cybernetics, 74, 31–39.
Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Max Planck Institute for Biophysical Chemistry.
von der Malsburg, C. (1985). Nervous structures with dynamical links. Ber. Bunsenges. Phys. Chem., 89, 703–710.
von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron, 24, 95–104.
von der Malsburg, C. (2003). Dynamic link architecture. In M. Arbib (Ed.), Brain theory and neural networks (2nd ed., pp. 365–368). Cambridge, MA: MIT Press.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29–40.
Wang, D., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Computation, 2, 94–106.
Wang, D., & Terman, D. (1995). Locally excitatory globally inhibitory oscillator networks. IEEE Transactions on Neural Networks, 6, 283–286.
Wang, D., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Winfree, A. T. (2001). The geometry of biological time (2nd ed.). Berlin: Springer-Verlag.
Received November 10, 2004; accepted June 30, 2005.
LETTER
Communicated by Bruno Olshausen
Topographic Product Models Applied to Natural Scene Statistics

Simon Osindero
[email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
Max Welling
[email protected] Department of Computer Science, University of California Irvine, Irvine, CA 92697-3425, U.S.A.
Geoffrey E. Hinton
[email protected] Canadian Institute for Advanced Research and Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
We present an energy-based model that uses a product of generalized Student-t distributions to capture the statistical structure in data sets. This model is inspired by and particularly applicable to "natural" data sets such as images. We begin by providing the mathematical framework, where we discuss complete and overcomplete models and provide algorithms for training these models from data. Using patches of natural scenes, we demonstrate that our approach represents a viable alternative to independent component analysis as an interpretive model of biological visual systems. Although the two approaches are similar in flavor, there are also important differences, particularly when the representations are overcomplete. By constraining the interactions within our model, we are also able to study the topographic organization of Gabor-like receptive fields that our model learns. Finally, we discuss the relation of our new approach to previous work—in particular, gaussian scale mixture models and variants of independent components analysis.

1 Introduction

This letter presents a general family of energy-based models that we refer to as product of Student-t (PoT) models. They are particularly well suited to modeling statistical structure in data for which linear projections are expected to result in sparse marginal distributions. Many kinds of data might be expected to have such structure, and in particular natural data

Neural Computation 18, 381–414 (2006)
© 2005 Massachusetts Institute of Technology
sets such as digitized images or sounds seem to be well described in this way.

The goals of this letter are twofold. First, we present the general mathematical formulation of PoT models and describe learning algorithms for them. We hope that this part of the article will be useful in introducing a new method to the community's tool kit for machine learning and density estimation. Second, we focus on applying PoTs to capturing the statistical structure of natural scenes. This is motivated from both a density estimation perspective and from the perspective of providing insight into information processing within the visual pathways of the brain.

PoT models were touched on briefly in Teh, Welling, Osindero, and Hinton (2003), and in this letter, we present the basic formulation in more detail, provide hierarchical and topographic extensions, and give an efficient learning algorithm employing auxiliary variables and Gibbs sampling. We also provide a discussion of the PoT model in relation to similar existing techniques.

We suggest that the PoT model could be considered a viable alternative to the more familiar technique of independent component analysis (ICA) when constructing density models, performing feature extraction, or building interpretive computational models of biological visual systems. As we shall demonstrate, we are able to reproduce many of the successes of ICA, yielding results that are comparable but with some interesting and significant differences. Similarly, extensions of our basic model can be related to some of the hierarchical forms of ICA that have been proposed, as well as to gaussian scale mixtures. Again there are interesting differences in formulation. An example of a potential advantage in our approach is that the learned representations can be inferred directly from the input, without the need for any iterative settling, even in hierarchical or highly overcomplete models.

The letter is organized as follows. Section 2 describes the mathematical form of the basic PoT model along with extensions to hierarchical and topographic versions. Section 3 describes how to learn within the PoE framework using the contrastive divergence (CD) algorithm (Hinton, 2002) (with the appendix providing the background material for running the necessary Markov chain Monte Carlo sampling). In section 4 we present results of our model when applied to natural images. We are able to recreate the success of such ICA-based models as, for example, Bell and Sejnowski (1995, 1997), Olshausen and Field (1996, 1997), Hoyer and Hyvarinen (2000), Hyvarinen, Hoyer, and Inki (2001), and Hyvarinen and Hoyer (2001). Our model provides computationally motivated accounts for the form of simple cell and complex cell receptive fields, as well as for the basic layout of cortical topographic maps for location, orientation, spatial frequency, and spatial phase. Additionally, we are easily able to produce such results in an overcomplete setting.
In section 5 we analyze in more detail the relationships between our PoT model, ICA models and their extensions, and gaussian scale mixtures. Finally, in section 6, we summarize our work.

2 Products of Student-t Models

We will begin with a brief overview of product of expert models (Hinton, 2002) before presenting the basic product of Student-t model (Welling, Hinton, & Osindero, 2002). Then we move on to discuss hierarchical topographic extensions.

2.1 Product of Expert Models. Product of expert models (PoEs) were introduced in Hinton (2002) as an alternative method of combining expert models into one joint model. In contrast to mixture of expert models, where individual models are combined additively, PoEs combine expert opinions multiplicatively as follows (see also Heskes, 1998):

$$P_{\mathrm{PoE}}(\mathbf{x}\,|\,\theta) = \frac{1}{Z(\theta)}\prod_{i=1}^{M} p_i(\mathbf{x}\,|\,\theta_i), \tag{2.1}$$
where $Z(\theta)$ is the global normalization constant and $p_i(\cdot)$ are the individual expert models. Mixture models employ a divide-and-conquer strategy, with different experts being used to model different subsets of the training data. In product models, many experts cooperate to explain each input vector, and different experts specialize in different parts of the input vector or in different types of latent structure. If a scene contains $n$ different objects that are processed in parallel, a mixture model needs a number of components exponential in $n$ because each component of the mixture must model a combination of objects. A product model, by contrast, requires only a number of components linear in $n$ because many different experts can be used at the same time.

Another benefit of product models is their ability to model sharp boundaries. In mixture models, the distribution represented by the whole mixture must be vaguer than the distribution represented by a typical component of the mixture. In product models, the product distribution is typically much sharper than the distributions of the individual experts,¹ which is a major advantage for high-dimensional data (Hinton, 2002; Welling, Zemel, & Hinton, 2002).

¹ When multiplying together $n$ equal-variance gaussians, for example, the variance is reduced by a factor of $n$. It is also possible to make the entropy of the product distribution higher than the entropy of the individual experts by multiplying together two very heavy-tailed distributions whose modes are in very different places.
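A toy numerical illustration of footnote 1 (our sketch, not from the paper): multiplying $n$ equal-variance gaussian experts yields a gaussian whose variance is smaller by a factor of $n$.

```python
# Product of n equal-variance gaussian experts: variance shrinks by n.
import numpy as np

sigma, n = 2.0, 5
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]

expert = np.exp(-0.5 * (xs / sigma) ** 2)   # unnormalized N(0, sigma^2)
product = expert ** n                        # multiply n identical experts
product /= product.sum() * dx                # normalize numerically

var = (xs ** 2 * product).sum() * dx
print(var, sigma ** 2 / n)                   # both ~= 0.8
```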
Learning PoE models has been difficult in the past, mainly due to the presence of the partition function $Z(\theta)$. However, contrastive divergence learning (Hinton, 2002; see section 3.2) has opened the way to apply these models to large-scale applications.

PoE models are related to many other models that have been proposed in the past. In particular, log-linear models² have a similar flavor but are more limited in their parameterization:

$$P_{\mathrm{LogLin}}(\mathbf{x}\,|\,\lambda) = \frac{1}{Z(\lambda)}\prod_{i=1}^{M}\exp\bigl(\lambda_i f_i(\mathbf{x})\bigr), \tag{2.2}$$
where $\exp[\lambda_i f_i(\cdot)]$ takes the role of an unnormalized expert. A binary product of experts model was introduced under the name harmonium in Smolensky (1986). A learning algorithm based on projection pursuit was proposed in Freund and Haussler (1992). In addition to binary models (Hinton, 2002), the gaussian case has been studied (Williams, Agakov, & Felderhof, 2001; Marks & Movellan, 2001; Williams & Agakov, 2002; Welling, Agakov, & Williams, 2003).

2.2 Product of Student-t Models. The basic model we study here is a form of PoE suggested by Hinton and Teh (2001), where the experts are given by generalized Student-t distributions:

$$\mathbf{y} = J\mathbf{x}, \tag{2.3}$$
$$p_i(y_i\,|\,\alpha_i) \propto \frac{1}{\bigl(1 + \frac{1}{2}y_i^2\bigr)^{\alpha_i}}. \tag{2.4}$$
The variables $y_i$ are the responses to linearly filtered input vectors and can be thought of as latent variables that are deterministically related to the observables, $\mathbf{x}$. Through this deterministic relationship, equation 2.4 defines a probability density on the observables. The filters, $\{J_i\}$, are learned from the training data (typically images) by maximizing or approximately maximizing the log likelihood. Note that due to the presence of the $J$ parameters, this product of Student-t (PoT) model is not log linear. However, it is possible to introduce auxiliary variables, $\mathbf{u}$, such that the joint distribution $P(\mathbf{x},\mathbf{u})$ is log linear³ and the marginal distribution $P(\mathbf{x})$ reduces to that of the original PoT distribution:

² Otherwise known as exponential family models, maximum entropy models, and additive models—for example, see Zhu, Wu, and Mumford (1998).
³ Note that it is log linear in the parameters $\theta_{ijk} = J_{ij}J_{ik}$ and $\alpha_i$, with features $u_i x_j x_k$ and $\log u_i$.
$$P_{\mathrm{PoT}}(\mathbf{x}) = \int_0^\infty d\mathbf{u}\; P(\mathbf{x},\mathbf{u}), \tag{2.5}$$
$$P(\mathbf{x},\mathbf{u}) \propto \exp\left[-\sum_{i=1}^{M}\left(u_i\Bigl(1 + \frac{1}{2}(J_i\mathbf{x})^2\Bigr) + (1-\alpha_i)\log u_i\right)\right], \tag{2.6}$$
where $J_i$ denotes the row vector corresponding to the $i$th row of the filter matrix $J$. An intuition for this form of reparameterization with auxiliary variables can be gained by considering that a one-dimensional t-distribution can be written as a continuous mixture of gaussians, with a gamma distribution controlling mixing proportions on components with different precisions, that is,

$$\frac{\Gamma\!\left(\alpha+\frac{1}{2}\right)}{\Gamma(\alpha)\sqrt{2\pi}}\left(1+\frac{1}{2}\tau^2\right)^{-(\alpha+\frac{1}{2})} = \int d\lambda\;\underbrace{\frac{1}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\lambda}}_{\text{Gamma}}\;\underbrace{\sqrt{\frac{\lambda}{2\pi}}\;e^{-\frac{1}{2}\tau^2\lambda}}_{\text{Gaussian}}. \tag{2.7}$$
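Equation 2.7 is easy to check numerically; the following Monte Carlo sketch (ours, with arbitrary values for α and τ) averages the gaussian factor over $\lambda \sim \text{Gamma}(\alpha, 1)$ and compares the result with the closed form on the left-hand side.

```python
# Monte Carlo check of equation 2.7: a generalized Student-t density
# as a gamma mixture of gaussians (a sketch; values are arbitrary).
import numpy as np
from scipy.special import gamma as Gamma

alpha, tau = 1.5, 0.8
rng = np.random.default_rng(0)

# left-hand side: closed-form generalized Student-t density at tau
lhs = (Gamma(alpha + 0.5) / (Gamma(alpha) * np.sqrt(2 * np.pi))
       * (1 + 0.5 * tau ** 2) ** (-(alpha + 0.5)))

# right-hand side: E_lambda[ sqrt(lambda/2pi) * exp(-tau^2 lambda / 2) ]
lam = rng.gamma(alpha, 1.0, size=1_000_000)
rhs = np.mean(np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * tau ** 2 * lam))

print(lhs, rhs)   # the two values agree up to Monte Carlo error
```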
The advantage of this reformulation using auxiliary variables is that it supports an efficient, fast-mixing Gibbs sampler, which is in turn beneficial for contrastive divergence learning. The Gibbs chain samples alternately from $P(\mathbf{u}|\mathbf{x})$ and $P(\mathbf{x}|\mathbf{u})$, given by

$$P(\mathbf{u}|\mathbf{x}) = \prod_{i=1}^{M}\mathcal{G}_{u_i}\!\left[\alpha_i;\; 1+\frac{1}{2}(J_i\mathbf{x})^2\right], \tag{2.8}$$
$$P(\mathbf{x}|\mathbf{u}) = \mathcal{N}_{\mathbf{x}}\!\left[0;\,(J^{T}VJ)^{-1}\right], \qquad V = \mathrm{Diag}[\mathbf{u}], \tag{2.9}$$
where $\mathcal G$ denotes a gamma distribution and $\mathcal N$ a normal distribution. From equation 2.9, we see that the variables $\mathbf{u}$ can be interpreted as precision variables in the transformed space $\mathbf{y} = J\mathbf{x}$.

In terms of graphical models, the representation that best fits the PoT model with auxiliary variables is that of a two-layer bipartite undirected graphical model. Figure 1A schematically illustrates the Markov random field (MRF) over $\mathbf{u}$ and $\mathbf{x}$; Figure 1B shows the role of the deterministic filter outputs in this scheme.

A natural way to interpret the differences between directed models (and in particular ICA models) and PoE models was provided in Hinton and Teh (2001) and Teh et al. (2003). Whereas directed models intuitively have a top-down interpretation (e.g., samples can be obtained by ancestral sampling starting at the top layer units), PoE models (or more generally energy-based models) have a more natural bottom-up interpretation. The probability of an input vector is proportional to $\exp(-E(\mathbf{x}))$, where the energy $E(\mathbf{x})$ is
Figure 1: (A) Standard PoT model as an undirected graph or Markov random field (MRF) involving observables, $\mathbf{x}$, and auxiliary variables, $\mathbf{u}$. (B) Standard PoT MRF redrawn to show the role of deterministic filter outputs $\mathbf{y} = J\mathbf{x}$. (C) Hierarchical PoT MRF drawn to show both sets of deterministic variables, $\mathbf{y}$ and $\mathbf{z} = W(\mathbf{y})^2$, as well as auxiliary variables $\mathbf{u}$.
computed bottom-up starting at the input layer (e.g., $E(\mathbf{y}) = E(J\mathbf{x})$). We may thus interpret the PoE model as modeling a collection of soft constraints, parameterized through deterministic mappings from the input layer to the top layer (possibly parameterized as a neural network), and where the energy serves to penalize inputs that do not satisfy these constraints (e.g., are different from zero). The costs contributed by the violated constraints are added to compute the global energy, which is equivalent to multiplying the distributions of the individual experts to compute the product distribution (since $P(\mathbf{x}) \propto \prod_i p_i(\mathbf{x}) \propto \exp(-\sum_i E_i(\mathbf{x}))$). For a PoT, we have a two-layer model where the constraint violations are penalized using the energy function (see equation 2.6)

$$E(\mathbf{x}) = \sum_{i=1}^{M}\alpha_i\log\Bigl(1+\frac{1}{2}(J_i\mathbf{x})^2\Bigr). \tag{2.10}$$
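In code, this energy is a one-liner; the sketch below (ours, with an arbitrary toy filter matrix) also compares the log penalty with a quadratic one, anticipating the discussion that follows.

```python
# PoT energy of equation 2.10 and its heavy-tailed penalty
# (a sketch; J and alpha here are arbitrary toy values).
import numpy as np

def pot_energy(x, J, alpha):
    """E(x) = sum_i alpha_i * log(1 + 0.5 * (J_i x)^2)."""
    y = J @ x
    return np.sum(alpha * np.log1p(0.5 * y ** 2))

rng = np.random.default_rng(0)
J = rng.standard_normal((8, 8))     # toy square (complete) filter matrix
alpha = np.ones(8)
print(pot_energy(rng.standard_normal(8), J, alpha))

# Relative to a quadratic penalty, large violations cost much less:
for y in (0.1, 1.0, 10.0):
    print(y, np.log1p(0.5 * y ** 2), 0.5 * y ** 2)
```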
We note that the shape of this energy function implies that relative to a quadratic penalty, small violations are penalized more strongly while large violations are penalized less strongly. This results in “sparse” distributions of violations (y-values) with many very small violations and occasional large ones. In the case of an equal number of observables, {xi }, and latent variables, {yi } (the so-called complete representation), the PoT model is formally equivalent to square, noiseless ICA (Bell & Sejnowski, 1995) with Student-t priors. However, in the overcomplete setting (more latent variables than
observables), product of experts models are essentially different from overcomplete ICA models (Lewicki & Sejnowski, 2000). The main difference is that the PoT maintains a deterministic relationship between latent variables and observables through $\mathbf{y} = J\mathbf{x}$, and consequently not all values of $\mathbf{y}$ are allowed. This results in important marginal dependencies between the $y$-variables. In contrast, in overcomplete ICA, the hidden $y$-variables are marginally independent by assumption and have a stochastic relationship with the $x$-variables. (For details, we refer to Teh et al., 2003.) For undercomplete models (fewer latent variables than observables), there is again a discrepancy between PoT models and ICA models. In this case, the reason can be traced back to the way noise is added to the models in order to force them to assign nonzero probability everywhere in input space. In contrast to undercomplete ICA models, where noise is added in all directions of input space, undercomplete PoT models have noise added only in the directions orthogonal to the subspace spanned by the filter matrix $J$. (More details can be found in Welling, Zemel, & Hinton, 2003, 2004.)

2.3 Hierarchical PoT (HPoT) Models. We now consider modifications to the basic PoT by introducing extra interactions between the activities of filter outputs, $y_i$, and altering the energy function for the model. These modifications were motivated by observations of the behavior of independent components of natural data and inspired by similarities between our model and (hierarchical) ICA. Since the new model essentially involves adding a new layer to the standard PoT, we refer to it as a hierarchical PoT (HPoT).

As we will show in section 4, when trained on a large collection of natural image patches, the linear components $\{J_i\}$ behave similarly to the learned basis functions in ICA and grow to resemble the well-known Gabor-like receptive fields of simple cells found in the visual cortex (Bell & Sejnowski, 1997). These filters, like wavelet transforms, are known to decorrelate input images very effectively. However, it has been observed that higher-order dependencies remain between the filter outputs $\{y_i\}$. In particular, there are important dependencies between the activities or energies $y_i^2$ (or more generally $|y_i|^\beta$, $\beta > 0$) of the filter outputs. This phenomenon can be neatly demonstrated through the use of bow-tie plots, in which the conditional histogram of one filter output is plotted given the output value of a different filter (e.g., see Simoncelli, 1997). The bow-tie shape of the plots implies that the first-order dependencies have been removed by the linear filters $\{J_i\}$ (since the conditional mean vanishes everywhere), but that higher-order dependencies still remain; specifically, the variance of one filter output can be predicted from the activity of neighboring filter outputs.

In our modified PoT, the interactions between filter outputs will be implemented by first squaring the filter outputs and subsequently introducing an extra layer of units, denoted by $\mathbf{z}$. These units will be used to capture the dependencies between these squared filter outputs, $\mathbf{z} = W(\mathbf{y})^2 = W(J\mathbf{x})^2$,
and this is illustrated in Figure 1C. (Note that in the previous expression and in what follows, the use of (·)^2 with a vector argument implies a component-wise squaring operation.) The modified energy function is

E(x) = \sum_{i=1}^{M} \alpha_i \log\left(1 + \frac{1}{2} \sum_{j=1}^{K} W_{ij} (J_j x)^2\right), \qquad W_{ij} \ge 0,  (2.11)
where the nonnegative parameters W_{ij} model the dependencies between the activities {y_i^2}.4 Note that the forward mapping from x through y to z is completely deterministic and can be interpreted as a bottom-up neural network. We can also view the modified PoT as modeling constraint violations, but this time in terms of z, with violations now penalized according to the energy in equation 2.11. As with the standard PoT model, there is a reformulation of the hierarchical PoT model in terms of auxiliary variables, u:

P(x, u) \propto \exp\left(-\sum_{i=1}^{M} \left[ u_i \left(1 + \frac{1}{2} \sum_{j=1}^{K} W_{ij} (J_j x)^2\right) + (1 - \alpha_i) \log u_i \right]\right).  (2.12)
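Since the mapping from x to the first- and second-layer activities is deterministic, both the features and the energy of equation 2.11 can be computed with a single feedforward pass. The sketch below (our illustrative NumPy code, not the authors' implementation; all dimensions and parameter values are arbitrary assumptions) makes this concrete:

```python
# A minimal sketch of the deterministic forward mapping and the HPoT
# energy of equation 2.11. J, W, and alpha would normally be learned;
# here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
D, K, M = 16, 20, 20              # input dim, no. of filters, no. of top units
J = rng.standard_normal((K, D))   # filter matrix (simple cells), one row per filter
W = rng.random((M, K))            # nonnegative top-level weights
alpha = np.full(M, 1.5)           # sparseness parameters

def hpot_energy(x, J, W, alpha):
    """E(x) = sum_i alpha_i * log(1 + 0.5 * sum_j W_ij (J_j x)^2)."""
    y = J @ x          # first-layer (simple cell) outputs
    z = W @ y**2       # second-layer (complex cell) inputs
    return np.sum(alpha * np.log1p(0.5 * z))

x = rng.standard_normal(D)
print(hpot_energy(x, J, W, alpha))
```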
The corresponding conditional distributions are

P(u|x) = \prod_{i=1}^{M} \mathcal{G}_{u_i}\left[\alpha_i;\ 1 + \frac{1}{2} \sum_{j=1}^{K} W_{ij} (J_j x)^2\right],  (2.13)

P(x|u) = \mathcal{N}_x\left[0;\ (J V J^T)^{-1}\right], \qquad V = \mathrm{Diag}[W^T u].  (2.14)
Again, we note that this auxiliary variable representation supports an efficient Gibbs sampling procedure in which all auxiliary variables u are sampled in parallel given the inputs x using equation 2.13, and all input variables x are sampled jointly from a multivariate gaussian distribution according to equation 2.14. As we discuss in section 3.2, this is an important ingredient in training HPoT models from data using contrastive divergence. Finally, in a somewhat speculative link to computational neuroscience, in the following discussions we will refer to the units, y, in the first hidden layer as simple cells and to the units, z, in the second hidden layer as complex cells.
4 For now, we implicitly assume that the number of first hidden-layer units (i.e., filters) is greater than or equal to the number of input dimensions. Models with fewer filters than input dimensions need some extra care, as noted in section 2.3.1. The number of top-layer units can be arbitrary, but for concreteness, we will work with an equal number of first-layer and top-layer units.
For simplicity, we will assume the number of simple and complex cells to be equal. There are no obstacles to using unequal numbers, but this does not appear to lead to any qualitatively different behavior.

2.3.1 Undercomplete HPoT Models. The HPoT models, as defined in section 2.3, were implicitly assumed to be complete or overcomplete. We may also wish to consider undercomplete models. These models can be interesting in a variety of applications where one seeks to represent the data in a lower-dimensional yet informative space. Undercomplete models need a little extra care in their definition, since in the absence of a proper noise model, they are unnormalizable over input space. In Welling, Agakov, and Williams (2003) and Welling, Zemel, and Hinton (2003, 2004), a natural solution to this dilemma was proposed where a noise model is added in directions orthogonal to all of the filters {J_i}. We note that it is possible to generalize this procedure to HPoT models, but in the interests of parsimony, we omit detailed discussion of undercomplete models in this article.

2.4 Topographic PoT Models. The modifications described next were inspired by a similar proposal in Hyvarinen et al. (2001) named topographic ICA. By restricting the interactions between the first and second layers of an HPoT model, we are able to induce a topographic ordering on the learned features. Such an ordering can be useful for a number of reasons; for example, it may help with data visualization by concentrating feature activities in local regions. The restriction can also act as a regularizer for the density models that we learn. Furthermore, it makes it possible to compare the topographic organization of features in our model (which derives from the statistical properties of the data) to the organization found within cortical topographic maps.

We begin by choosing a topology on the space of filters. This is most conveniently done by simply considering the filters to be laid out on a grid and considering local neighborhoods with respect to this layout. In our experiments, we use a regular square grid and apply toroidal boundary conditions to avoid edge effects. The complex cells receive input from the simple cells in precisely the same way as in our HPoT model, z_i = \sum_j W_{ij} (J_j x)^2, but now W is fixed and we assume it is chosen such that it computes a local average from the grid of filter activities. The free parameters that remain to be learned using contrastive divergence are {α_i, J}.

In the following, we explain why the filters {J_i} should be expected to organize themselves topographically when learned from data. As noted previously, there are important dependencies between the activities of wavelet coefficients of filtered images. In particular, the variance (but not the mean) of one coefficient can be predicted from the value of a neighboring coefficient.
The topographic PoT model can be interpreted as an attempt to model these dependencies through a Markov random field on the activities of the simple cells. However, we have predefined the connectivity pattern and have left the filters to be determined through learning. This is the opposite of the strategy used in, for instance, Portilla, Strela, Wainwright, and Simoncelli (2003), where the wavelet transform is fixed and the interactions between wavelet coefficients are modeled. One possible explanation for the emergent topography is that the model will make optimal use of these predefined interactions if it organizes its simple cells such that dependent cells are nearby in filter space and independent ones are distant.5

A complementary explanation is based on the interpretation of the model as capturing complex constraints in the data. The penalty function for violations is designed such that (relative to a squared penalty) large violations are penalized relatively mildly. However, since the complex cells represent the average input from simple cells, their values would be well described by a gaussian distribution if the corresponding simple cells were approximately independent. (This is a consequence of the central limit theorem for sums of independent random variables.) In order to avoid a mismatch between the distribution of complex cell outputs and the way they are penalized, the model ought to position simple cells that have correlated activities near each other. In doing so, the model can escape the central limit theorem because the simple cell outputs that are being pooled are no longer independent. Consequently, the pattern of violations that arises is a better match to the pattern of violations one would expect from the penalizing energy function.

Another way to understand the pressure toward topography is to ask how an individual simple cell should be connected to the complex cells in order to minimize the total cost caused by the simple cell's outputs on real data. If the simple cell is connected to complex cells that already receive inputs from the simple cell's neighbors in position and spatial frequency, the images that cause the simple cell to make a big contribution will typically be those in which the complex cells that it excites are already active, so its additional contribution to the energy will be small because of the gentle slope in the heavy tails of the cost function. Hence, since complex cells locally pool simple cells, local similarity of filters is expected to emerge.

5 This argument assumes that the shape of the filters remains essentially unchanged (i.e., Gabor-like) by the introduction of the complex cells in the model. Empirically we see that this is indeed the case.

2.5 Further Extensions to the Basic PoT Model. The parameters {α_i} in the definition of the PoT model control the sparseness of the activities of the complex and simple cells. For large values of α, the PoT model resembles more and more a gaussian distribution, while for small values, there is a very sharp peak at zero in the distribution that decays very quickly into fat tails.
Figure 2: Functions f(x) = 1/(1 + |x|^β) for different values of β (curves shown for β = 5, 2, and 1/2).
In the HPoT model, the complex cell activities, z, are the result of linearly combining the (squared) outputs of the simple cells, y = Jx. The squaring operation is a somewhat arbitrary choice (albeit a computationally convenient and empirically effective one), and we may wish to process the first-layer activities in other ways before combining them in the second layer. In particular, we might consider modifications of the form activity = |Jx|^β, with |·| denoting absolute values and β > 0. Such a model defines a density in y-space of the form

p_y(y) = \frac{1}{Z(W, \alpha)} \exp\left(-\sum_{i=1}^{M} \alpha_i \log\left(1 + \frac{1}{2} \sum_{j=1}^{K} W_{ij} |y_j|^\beta\right)\right).  (2.15)
A plot of the unnormalized distribution f(x) = 1/(1 + |x|^β) is shown in Figure 2 for three settings of the parameter β. One can observe that for smaller values of β, the peak at zero becomes sharper and the tails become fatter. In section 3, we show that sampling, and hence learning, with contrastive divergence can be performed efficiently for any setting of β.

3 Learning in HPoT Models

In this section, we explain how to perform maximum likelihood learning of the parameters for the models introduced in the previous section.
In the case of complete and undercomplete PoT models, we are able to analytically compute gradients; however, in the general case of overcomplete or hierarchical PoTs, we are required to employ an approximation scheme, and the preferred method in this article is contrastive divergence (CD) (Hinton, 2002). Since CD learning is based on Markov chain Monte Carlo (MCMC) sampling, the appendix provides a discussion of sampling procedures for the various models we have introduced.

3.1 Maximum Likelihood Learning in HPoT Models. To learn the parameters θ = (J, W, α) (and β for the extended models), we maximize the log likelihood of the model,

θ_{ML} = \arg\max_θ L = \arg\max_θ \frac{1}{N} \sum_{n=1}^{N} \log p_x(x_n; θ).  (3.1)
For models that have the Boltzmann form, p(x) = \frac{1}{Z} \exp[-E(x; θ)], we can compute the following gradient:

\frac{\partial L}{\partial θ} = E\left[\frac{\partial E(x; θ)}{\partial θ}\right]_p - \frac{1}{N} \sum_{n=1}^{N} \frac{\partial E(x_n; θ)}{\partial θ},  (3.2)
where E[·]_p denotes expectation with respect to the model's distribution over x (this term comes from the derivatives of the log partition function, Z). For the parameters (J, W, α) in the PoT, we obtain the following derivative functions:

\frac{\partial E(x; θ)}{\partial J_{jk}} = \sum_i \frac{\alpha_i W_{ij} (Jx)_j x_k}{1 + \frac{1}{2} \sum_{j'} W_{ij'} (Jx)_{j'}^2},  (3.3)

\frac{\partial E(x; θ)}{\partial W_{ij}} = \frac{\frac{1}{2} \alpha_i (Jx)_j^2}{1 + \frac{1}{2} \sum_{k} W_{ik} (Jx)_{k}^2},  (3.4)

\frac{\partial E(x; θ)}{\partial \alpha_i} = \log\left(1 + \frac{1}{2} \sum_j W_{ij} (Jx)_j^2\right).  (3.5)
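For concreteness, the derivative functions 3.3 to 3.5 can be written compactly with array operations. The following sketch (illustrative code under the same conventions as the earlier snippet, not the authors' implementation) evaluates them for a single data vector:

```python
# A sketch of equations 3.3-3.5 using NumPy broadcasting; J has one
# filter per row, W is nonnegative, and alpha holds the sparseness
# parameters.
import numpy as np

def hpot_energy_grads(x, J, W, alpha):
    y = J @ x                                   # (K,) simple cell outputs
    s = 1.0 + 0.5 * W @ y**2                    # (M,) per-expert denominators
    r = alpha / s                               # (M,)
    # dE/dJ_jk = sum_i r_i W_ij y_j x_k          (eq. 3.3)
    dJ = ((W.T @ r) * y)[:, None] * x[None, :]  # (K, D)
    # dE/dW_ij = 0.5 r_i y_j^2                   (eq. 3.4)
    dW = 0.5 * r[:, None] * (y**2)[None, :]     # (M, K)
    # dE/dalpha_i = log(1 + 0.5 sum_j W_ij y_j^2)  (eq. 3.5)
    dalpha = np.log(s)                          # (M,)
    return dJ, dW, dalpha
```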
Once we have computed the gradients of the log likelihood, we can maximize it using any gradient-based optimization algorithm. Elegant as the gradients in equation 3.2 may seem, in the general case they are intractable to compute. The reason is the expectation in the first term of equation 3.2 over the model distribution. One may choose to approximate this average by running an MCMC chain to equilibrium with p(x; θ ) as its invariant distribution. However, there are (at least) two reasons that this might not be a good idea: (1) the Markov chain has to be run to equilibrium
for every gradient step of learning, and (2) we need a lot of samples to reduce the variance in the estimates. Hence, for the general case, we propose using the contrastive divergence learning paradigm, which is discussed next.

3.2 Training HPoT Models with Contrastive Divergence. For complete and undercomplete HPoT models, we can derive the exact gradient of the log likelihood with respect to the parameters J. In the complete case, these gradients turn out to be of the same form as the update rules proposed in Bell and Sejnowski (1995). However, the gradients for the parameters W and α are much harder to compute.6 Furthermore, in overcomplete settings, the exact gradients with respect to all parameters are computationally intractable.

We now describe an approximate learning paradigm to train the parameters in cases where evaluation of the exact gradients is intractable. Recall that the bottleneck in computing these gradients is the first term in equation 3.2. An approximation to these expectations can be obtained by running an MCMC sampler with p(x; J, W, α) as its invariant distribution and computing Monte Carlo estimates of the averages. As mentioned in section 3.1, this is a very inefficient procedure because it needs to be repeated for every step of learning, and a fairly large number of samples may be needed to reduce the variance in the estimates.7

Contrastive divergence (Hinton, 2002) replaces the MCMC samples in these Monte Carlo estimates with samples from brief MCMC runs that are initialized at the data cases. The intuition is that if the current model is not a good fit for the data, the MCMC particles will swiftly and consistently move away from the data cases. But if the data population represents a fair sample from the model distribution, then the average energy will not change when we initialize our Markov chains at the data cases and run them forward. In general, initializing the Markov chains at the data and running them only briefly introduces bias but greatly reduces both variance and computational cost. Algorithm 1 summarizes the steps in this learning procedure.

Algorithm 1: Contrastive Divergence Learning

1. Compute the gradient of the energy with respect to the parameters, θ, and average over the data cases x_n.

2. Run MCMC samplers for k steps, starting at every data vector x_n, keeping only the last sample s_{n,k} of each chain.

6 Although we can obtain exact derivatives for α in the special case where W is restricted to be the identity matrix.

7 An additional complication is that it is hard to assess when the Markov chain has converged to the equilibrium distribution.
3. Compute the gradient of the energy with respect to the parameters, θ, and average over the samples s_{n,k}.

4. Update the parameters using

\delta θ = \frac{\eta}{N} \left( \sum_{\text{samples } s_{n,k}} \frac{\partial E(s_{n,k})}{\partial θ} - \sum_{\text{data } x_n} \frac{\partial E(x_n)}{\partial θ} \right),  (3.6)

where η is the learning rate and N is the number of samples in each minibatch.
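A schematic version of Algorithm 1 is given below. The code is an illustration of the update in equation 3.6 rather than the authors' implementation: `gibbs_step` stands in for one sweep of the sampler of equations 2.13 and 2.14, `grads` for the derivative functions 3.3 to 3.5, and the minibatch organization is our assumption:

```python
# A schematic CD-k training loop following Algorithm 1.
import numpy as np

def cd_update(X, params, gibbs_step, grads, eta=0.01, k=1):
    """One CD-k parameter update over a minibatch X of shape (N, D)."""
    N = X.shape[0]
    g_data = [np.zeros_like(p) for p in params]
    g_model = [np.zeros_like(p) for p in params]
    for x in X:                      # step 1: energy gradients at the data
        for acc, g in zip(g_data, grads(x, *params)):
            acc += g
    S = X.copy()
    for _ in range(k):               # step 2: k MCMC steps from each data case
        S = np.array([gibbs_step(s, *params) for s in S])
    for s in S:                      # step 3: energy gradients at the samples
        for acc, g in zip(g_model, grads(s, *params)):
            acc += g
    # step 4: lower the energy of the data, raise it on the samples (eq. 3.6)
    # (the text additionally renormalizes the rows of J and W after updates)
    return [p + (eta / N) * (gm - gd)
            for p, gd, gm in zip(params, g_data, g_model)]
```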
For further details on contrastive divergence learning, we refer to the literature (Hinton, 2002; Teh et al., 2003; Yuille, 2004; Carreira-Perpinan & Hinton, 2005).

For highly overcomplete models, it often happens that some of the filters J_i (rows of J) decay to zero. To prevent this from happening, we constrain the L2-norm of these filters to be one: \sum_j J_{ij}^2 = 1 for all i. Constraining the norm of the rows of the W matrix was also helpful during learning. We choose to constrain their L1-norm to unity, \sum_j W_{ij} = 1 for all i, which makes sense because W_{ij} ≥ 0.

We note that the objective function is not convex, and so the existence of poor local minima could be a concern. The stochastic nature of our gradient descent procedure may provide some protection against being trapped in shallow minima, although it has the concomitant price of being slower than noise-free gradient descent. We also note that the intractability of the partition function makes it difficult to obtain straightforward objective measures of model performance, since log probabilities can be computed only up to an unknown additive constant. This is not so much of a problem when one is using a trained model for, say, feature extraction, statistical image processing, or classification, but it does make explicit comparison with other models rather hard. (For example, there is no straightforward way to compare the densities provided by our overcomplete HPoT models with those from overcomplete ICA-style models.)

4 Experiments on Natural Images

There are several reasons to believe that the HPoT should be an effective model for capturing and representing the statistical structure in natural images; indeed, much of its form was inspired by the dependencies that have been observed in natural images. We have applied our model to small patches taken from digitized natural images. The motivation for this is severalfold. First, it provides a useful test of the behavior of our model on a data set that we believe to contain sparse structure (and therefore to be well suited to our framework). Second, it allows us to compare our work with that of other authors on similar models, namely ICA. Third, it allows us to use our model framework
as a tool for interpreting results from neurobiology. Our method can complement existing approaches and also allows one to suggest alternative interpretations and descriptions of neural information processing.

Section 4.2 presents results from complete and overcomplete single-layer PoTs trained on natural images. Our results are qualitatively similar to those obtained using ICA. In section 4.3 we demonstrate the higher-order features learned in our hierarchical PoT model, and in section 4.4 we present results from topographically constrained hierarchical PoTs. The findings in these two sections are qualitatively similar to the work by Hyvarinen et al. (2001); however, our underlying statistical model is different and allows us to deal more easily with overcomplete, hierarchical, topographic representations.

4.1 Data Sets and Preprocessing. We performed experiments using standard sets of digitized natural images available on the World Wide Web from Aapo Hyvarinen8 and Hans van Hateren.9 The results obtained from the two different data sets were not significantly different, and for simplicity, all results reported here are from the van Hateren data set.

8 http://www.cis.hut.fi/projects/ica/data/images/.
9 http://hlab.phys.rug.nl/imlib/index.html.

To produce training data of a manageable size, small, square patches were extracted from randomly chosen locations in the images. As is common for unsupervised learning, these patches were filtered according to computationally well-justified versions of the sort of whitening transformations performed by the retina and lateral geniculate nucleus (LGN) (Atick & Redlich, 1992). First, we applied a log transformation to the raw pixel intensities. This procedure somewhat captures the contrast transfer function of the retina. It is not critical, but for consistency with past work, we incorporated it for the results presented here. The extracted patches were subsequently normalized such that the mean intensity of each pixel across the data set was zero, and also so that the mean intensity within each patch was zero, effectively removing the DC component from each input. The patches were then whitened, usually in conjunction with dimensionality reduction. This is a standard technique in many ICA approaches; it speeds up learning without having much impact on the final results obtained.
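The preprocessing pipeline of section 4.1 can be summarized in a short sketch. This is illustrative code consistent with the description above (log transform, per-pixel and per-patch mean removal, PCA whitening with dimensionality reduction); the exact transformations used in the experiments may differ in detail:

```python
# A sketch of the preprocessing described in section 4.1.
import numpy as np

def preprocess(patches, n_keep=256, eps=1e-9):
    """patches: (N, P) array of raw pixel intensities, one patch per row."""
    X = np.log(patches + 1.0)              # crude retinal contrast transform
    X = X - X.mean(axis=0)                 # zero mean for every pixel
    X = X - X.mean(axis=1, keepdims=True)  # remove each patch's DC component
    C = (X.T @ X) / X.shape[0]             # data covariance
    evals, evecs = np.linalg.eigh(C)       # ascending eigenvalues
    top = np.argsort(evals)[::-1][:n_keep]
    U, lam = evecs[:, top], evals[top]
    Z = (X @ U) / np.sqrt(lam + eps)       # whitened, reduced representation
    return Z, U, lam

# Example use, matching section 4.2: 150,000 patches of 18 x 18 pixels,
# reduced to 256 dimensions and whitened to unit variance per axis.
```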
Figure 3: Learned filters shown in the raw data space. Each small square represents a filter vector, plotted as an image. The gray scale of each filter display has been (symmetrically) scaled to saturate at the maximum absolute weight value. (A) Random subset of filters learned in a complete PoT model. (B) Random subset of filters learned in a complete ICA model. (C) Random subset of filters learned in a 1.7× overcomplete PoT model. (D) Random subset of filters learned in a 2.4× overcomplete PoT model.
4.2 Single-Layer PoT Models. Figure 3 illustrates results from our basic approach and shows for comparison results obtained using ICA. The data consisted of 150,000 patches of size 18 × 18 that were reduced to vectors of dimension 256 by projection onto the leading 256 eigenvectors of the data covariance matrix and then whitened to give unit variance along each axis.

4.2.1 Complete Models. We first present the results of our basic approach in a complete setting and display a comparison of the filters learned using our method with a set obtained from an equivalent ICA model learned using direct gradient ascent in the likelihood. We trained both models (learning just J and keeping α fixed10 at 1.5) for 200 passes through the entire data set of 150,000 patches. The PoT was trained using one-step contrastive divergence as outlined in section 3.2, and the ICA model was trained using the exact gradient of the log likelihood (as in Bell & Sejnowski, 1995, for instance).

10 This is the minimum value of α that allows us to have a well-behaved density model in the complete case. As α gets smaller than this, the tails of the distribution get heavier and heavier, and the variance and, eventually, the mean are no longer well defined.
As expected, at the end of learning, the two procedures delivered very similar results, exemplars of which are given in Figures 3A and 3B. Furthermore, both sets of filters bear a strong resemblance to the types of simple cell receptive fields found in V1.

4.2.2 Overcomplete Models. We next consider our model in an overcomplete setting; this is no longer equivalent to any ICA model. In the PoT, overcomplete representations are simple generalizations of the complete case, and unlike causal generative approaches, the features are conditionally independent since they are given just by a deterministic mapping.

To facilitate learning in the overcomplete setting, we have found it beneficial to make two modifications to the basic setup. First, we set α_i = α for all i and make α a free parameter to be learned from the data. The learned value of α is typically less than 1.5 and gets smaller as we increase the degree of overcompleteness.11 One intuitive way of understanding why this might be expected is the following. Decreasing α reduces the energy cost for violating the constraints specified by each individual feature; however, this is counterbalanced by the fact that in the overcomplete setting, we expect an input to violate more of the constraints at any given time. If α remains constant as more features are added, the mass in the tails may no longer be sufficient to model the distribution well.

The second modification that we make is to constrain the L2-norm of the filters to l, making l another free parameter to be learned. If this modification is not made, there is a tendency for some of the filters to become very small during learning. Once this has happened, it is difficult for them to grow again, since the magnitude of the gradient depends on the filter output, which in turn depends on the filter length. The first manipulation simply extends the power of the model, but one could argue that the second manipulation is something of a fudge: if we have sufficient data, a good model, and a good algorithm, it should be unnecessary to restrict ourselves in this way. There are several counterarguments to this, the principal ones being: (1) we might be interested, from a biological point of view, in representational schemes in which the representational units all receive comparable amounts of input; (2) we can view it as approximate posterior inference under a prior belief that in an effective model, all the units should play a roughly equal part in defining the density and forming the representation. We note that a similar manipulation is also applied by most practitioners dealing with overcomplete ICA models (e.g., Olshausen & Field, 1996).
11 Note that in an overcomplete setting, depending on the direction of the filters, α may be less than 1.5 and still yield a normalizable distribution overall.
In Figures 3C and 3D, we show example filters typical of those learned in overcomplete simulations. As in the complete case, we note that the majority of learned filters qualitatively match the linear receptive fields of simple cells found in V1. Like V1 spatial receptive fields, most (although not all) of the learned filters are well fit by Gabor functions. We analyzed in more detail the properties of filter sets produced by different models by fitting a Gabor function to each filter (using a least-squares procedure) and then looking at the population properties in terms of Gabor parameters.12

12 Approximately 5 to 10% of the filters failed to localize well in orientation or location (usually appearing somewhat like noise or checkerboard patterns) and were not well described by a Gabor function. These were detected during the parametric fitting process and were eliminated from our subsequent population analyses. It is unclear exactly what role these filters play in defining densities within our model.

Figure 4 shows the distribution of parameters obtained by fitting Gabor functions to complete and overcomplete filters. For reference, similar plots for linear spatial receptive fields measured in vivo are given in Ringach (2002) and van Hateren and van der Schaaf (1998). The plots are all reasonable qualitative matches to those shown for the "real" V1 receptive fields, as shown, for instance, in Ringach (2002). They also help to indicate the effects of representational overcompleteness. With increasing overcompleteness, the coverage of the spaces of location, spatial frequency, and orientation becomes denser and more uniform, while at the same time, the distribution of receptive field shapes remains unchanged. Further, the more overcomplete models give better coverage of lower spatial frequencies that are not directly represented in complete models.

Ringach (2002) reports that the distribution of shapes from ICA or sparse coding can be a poor fit to the data from real cells, the main problem being that there are too few cells near the origin of the plot, which corresponds roughly to cells with smaller aspect ratios and small numbers of cycles in their receptive fields. The results that we present here appear to be a slightly better fit. (One source of the differences might be Ringach's choice of ICA prior.) A large proportion of our fitted receptive fields are in the vicinity of the macaque results, although as we become more overcomplete, we see a spread farther away from the origin. In summary, our results from these single-layer PoT models can account for many of the properties of simple cell linear spatial receptive fields in V1.

4.3 Hierarchical PoT Models. We now present results from the hierarchical extension of the basic PoT model. In principle, we are able to learn both sets of weights, the top-level connections W and the lower-level connections J, simultaneously. However, effective learning in this full system has proved difficult when starting from random initial conditions.
Figure 4: A summary of the distribution of some parameters derived by fitting Gabor functions to receptive fields of three models with different degrees of overcompleteness in the representation size. The left-most column (panels A–E) is a complete representation, the middle column is 1.7× overcomplete, and the right-most column is 2.4× overcomplete. (A) Each dot represents the center location, in pixel coordinates within a patch, of a fitted Gabor. (B) Scatter plots showing the joint distribution of orientation (azimuthally) and spatial frequency in cycles per pixel (radially). (C) Histograms of Gabor fit phase (mapped to the range 0°–90°, since we ignore the envelope sign). (D) Histograms of the aspect ratio of the Gabor envelope (length/width). (E) A plot of "normalized width" versus "normalized length" (cf. Ringach, 2002). (F) For comparison, we include data from real macaque experiments (Ringach, 2002).
Figure 5: Each panel in this figure illustrates the theme represented by a different top-level unit. The filters in each row are arranged in descending order, from left to right, of the strength W_ij with which they connect to the particular top-layer unit.
The results we present in this section were obtained by initializing W to the identity matrix and first learning J, before subsequently releasing the W weights and then letting the system learn freely. This is therefore equivalent to initially training a single-layer PoT and then subsequently introducing a second layer. When models are trained in this way, the form of the first-layer filters remains essentially unchanged from the Gabor receptive fields shown previously. Moreover, we see interesting structure being learned in the W weights, as illustrated by Figure 5. The figure is organized to display the filters connected most strongly to a top-layer unit. There is a strong organization by what might be termed themes based on location, orientation, and spatial frequency.

An intuition for this grouping behavior is as follows. There will be correlations between the squared outputs of some pairs of filters, and by having them feed into the same top-level unit, the model is able to capture this regularity. For most input images, all members of the group will have small combined activity, but for a few images, they will have significant combined activity. This is exactly what the energy function favors, as opposed to a grouping of very different filters, which would lead to a rather gaussian distribution of activity in the top layer.

Interestingly, these themes lead to responses in the top layer (if we examine the outputs z_i = W_i (Jx)^2) that resemble complex cell receptive fields. It can be difficult to accurately describe the response of nonlinear units in a network, so we choose a simplification in which we consider the response of the top-layer units to test stimuli that are gratings or Gabor patches. The test stimuli were created by finding the grating or Gabor stimulus that was most effective at driving a unit and then perturbing various parameters about this maximum. Representative results from such a characterization are shown in Figure 6. In comparison to the first-layer units, the top-layer units are considerably more invariant to phase and somewhat more invariant to position. However, both the sharpness of tuning to orientation and that to spatial frequency remain roughly unchanged. These results typify the properties that we see when we consider the responses of the second layer in our hierarchical model and are a striking match to the response properties of complex cells.
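The grouping into themes and the complex cell responses described above are easy to read off a trained model. The sketch below (illustrative code, with J and W assumed to come from a trained HPoT) ranks the filters feeding each top-layer unit, which is essentially how a display like Figure 5 is assembled, and evaluates the top-layer activities:

```python
# A sketch of theme extraction and complex cell responses from a
# trained HPoT with filter matrix J (K, D) and top-level weights W (M, K).
import numpy as np

def theme_for_unit(i, J, W, n_show=8):
    """Return the n_show filters most strongly connected to top unit i."""
    order = np.argsort(W[i])[::-1][:n_show]   # descending W_ij
    return J[order]                           # rows are filter vectors

def complex_cell_responses(x, J, W):
    """Top-layer activities z_i = sum_j W_ij (J_j x)^2."""
    return W @ (J @ x) ** 2
```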
Figure 6: (A) Tuning curves for simple cells (i.e., first-layer units). (B) Tuning curves for complex cells (i.e., second-layer units). The tuning curves for phase, orientation, and spatial frequency were obtained by probing responses using grating stimuli; the curve for location was obtained by probing using a localized Gabor patch stimulus. The optimal stimulus was estimated for each unit, and then one parameter (phase, location, orientation, or spatial frequency) was varied and the changes in responses were recorded. The response for each unit was normalized such that the maximum output was 1 before combining the data over the population. The solid line shows the population average (median of 441 units in a 1.7× overcomplete model), and the lower and upper dotted lines show the 10% and 90% centiles, respectively. We use the style of display used in Hyvarinen et al. (2001).
4.4 Topographic Hierarchical PoT Models. We next consider the topographically constrained form of the hierarchical PoT that we proposed in an attempt to induce spatial organization on the representations learned. The W weights are fixed and define local, overlapping neighborhoods on a square grid with toroidal boundary conditions. The J weights are free to learn, and the model is trained as usual. Representative results from such a simulation are given in Figure 7. The inputs were patches of size 25 × 25, whitened and dimensionality reduced to vectors of size 256; the representation is 1.7× overcomplete. By simple inspection of the filters in Figure 7A, we see that there is strong local continuity in the receptive field properties of orientation, spatial frequency, and location, with little continuity of spatial phase.

With notable similarity to experimentally observed cortical topography, we see pinwheel singularities in the orientation map and a low-frequency cluster in the spatial frequency map, which seems to be somewhat aligned with one of the pinwheels. While the map of location (retinotopy) shows good local structure, there is poor global structure. We suggest that this may be due to the relatively small scale of the model and to the use of toroidal boundary conditions (which eliminated the need to deal with edge effects).
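The fixed topographic W described above is simply a local-averaging matrix on a grid with toroidal wrap-around. A minimal sketch (our illustrative code; the 3 × 3 neighborhood follows the Figure 7 caption, while the grid side length is an assumption) is:

```python
# A sketch of the fixed topographic pooling matrix: each top-layer unit
# averages the squared outputs of the simple cells in a 3x3 neighborhood
# on a square grid with toroidal boundary conditions.
import numpy as np

def toroidal_pooling_matrix(side):
    """W of shape (side*side, side*side); row i pools the 3x3 patch at i."""
    K = side * side
    W = np.zeros((K, K))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    j = ((r + dr) % side) * side + ((c + dc) % side)
                    W[i, j] = 1.0
    return W / W.sum(axis=1, keepdims=True)   # rows sum to one (L1 norm)

W = toroidal_pooling_matrix(21)   # e.g., 441 units on a 21 x 21 grid
```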
Figure 7: An example of a filter map. (The gray scale is saturating in each cell independently.) This model was trained on 25 × 25 patches that had been whitened and dimensionality reduced to 256 dimensions, and the representation layer is 1.7 × overcomplete in terms of the inputs. The neighborhood size was a 3 × 3 square (i.e., eight nearest neighbors). We see a topographically ordered array of learned filters with local continuity of orientation, spatial frequency, and location. The local variations in phase seem to be random. Considering the map for orientation, we see evidence for pinwheels. In the map for spatial frequency, there is a distinct low-frequency cluster.
5 Relation to Earlier Work

5.1 Gaussian Scale Mixtures. We can consider the complete version of our model as a gaussian scale mixture
(GSM; Andrews & Mallows, 1974; Wainwright & Simoncelli, 2000; Wainwright, Simoncelli, & Willsky, 2000) with a particular (complicated) form of scaling function.13 The basic form for a GSM density on a variable, g, can be given as follows (Wainwright & Simoncelli, 2000):

p_{GSM}(g) = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{N/2} |cQ|^{1/2}} \exp\left(-\frac{g^T (cQ)^{-1} g}{2}\right) \phi_c(c)\, dc,  (5.1)
where c is a nonnegative scalar variate and Q is a positive definite covariance matrix. This is the distribution that results if we draw c from φ_c(c) and a variable v from a multidimensional gaussian N_v(0, Q) and then take g = √c · v. Wainwright et al. (2000) discuss a more sophisticated model in which the distributions of coefficients in a wavelet decomposition for images are described by a GSM that has a separate scaling variable, c_i, for each coefficient. The c_i have a Markov dependency structure based on the multiresolution tree that underlies the wavelet decomposition.

In the complete setting, where the y variables are in linear one-to-one correspondence with the input variables, x, we can interpret the distribution p(y) as a gaussian scale mixture. To see this, we first rewrite p(y, u) = p(y|u) p(u), where the conditional p(y|u) = \prod_j N_{y_j}[0, (\sum_i W_{ij} u_i)^{-1}] is gaussian (see equation 2.14). The distribution p(u) needs to be computed by marginalizing p(x, u) in equation 2.12 over x, resulting in

p(u) = \frac{1}{Z_u} \prod_i e^{-u_i} u_i^{\alpha_i - 1} \prod_k \left(\sum_j W_{jk} u_j\right)^{-1/2},  (5.2)
where the partition function Z_u ensures normalization. We see that the marginal distribution of each y_i is a gaussian scale mixture in which the scaling variate for y_i is given by c_i(u) = (\sum_j W_{ji} u_j)^{-1}. The neighborhoods defined by W in our model play a role analogous to that of the tree-structured cascade process in Wainwright et al. (2000) and determine the correlations between the different scaling coefficients. However, a notable difference in this respect is that the GSM model assumes a fixed tree structure for the dependencies, whereas our model is more flexible in that the interactions through the W parameters can be learned.

The overcomplete version of our PoT is not so easily interpreted as a GSM, because the {y_i} are no longer independent given u, nor is the distribution over x a simple GSM, owing to the way in which u is incorporated into the covariance matrix (see equation 2.9). However, much of the flavor of a GSM remains.

13 In simple terms, a GSM density is one that can be written as a (possibly infinite) mixture of gaussians that differ only in the scale of their covariance structure. A wide range of distributions can be expressed in this manner.
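The generative recipe described above translates directly into code. In the sketch below (illustrative, not from the original work), the scale density φ_c is taken to be a gamma density purely for concreteness; equation 5.1 places no such restriction on it:

```python
# A sketch of sampling from a gaussian scale mixture: draw a scale c
# from phi_c, draw v ~ N(0, Q), and set g = sqrt(c) * v. The gamma
# parameters are arbitrary illustrative choices.
import numpy as np

def sample_gsm(n, Q, rng=np.random.default_rng(0)):
    c = rng.gamma(shape=1.5, scale=1.0, size=n)        # scales from phi_c
    v = rng.multivariate_normal(np.zeros(Q.shape[0]), Q, size=n)
    return np.sqrt(c)[:, None] * v                     # g = sqrt(c) v

Q = np.eye(2)
g = sample_gsm(10000, Q)   # heavy-tailed, elliptically symmetric samples
```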
5.2 Relationship to tICA. In this section we show that in the complete case, the topographic PoT model is isomorphic to the model optimized (but not the one initially proposed) by Hyvarinen et al. (2001) in their work on topographic ICA (tICA). These authors define an ICA generative model in which the components or sources are not completely independent but have a dependency that is defined with relation to some topology, such as a toroidal grid: components close to one another in this topology have greater codependence than those that are distantly separated. Their generative model is shown schematically in Figure 8. The first layer takes a linear combination of variance-generating variables, t, and then passes them through some nonlinearity, φ(·), to give positive scaling variates, σ. These are then used to set the variance of the sources, s, and conditioned on these scaling variates, the components in the second layer are independent. These sources are then linearly mixed to give the observables, x.
Figure 8: Graphical model for topographic ICA (Hyvarinen et al., 2001). First, the variance-generating variables, t_i, are generated independently from their prior. They are then linearly mixed through the matrix H, before being nonlinearly transformed using the function φ(·) to give the variances, σ_i = φ(H_i^T t), for each of the sources i. Values for these sources, s_i, are then generated from independent zero-mean distributions with variances σ_i, before being linearly mixed through the matrix A to give the observables x_i.
The joint density for (s, t) is given by

p(s, t) = \prod_i p_{s_i}\left(\frac{s_i}{\phi(H_i^T t)}\right) \frac{p_{t_i}(t_i)}{\phi(H_i^T t)},  (5.3)
and the log likelihood of the data given the parameters is

L(B) = \sum_{\text{data } x} \log \int \prod_i p_{s_i}\left(\frac{B_i^T x}{\phi(H_i^T t)}\right) \frac{p_{t_i}(t_i)}{\phi(H_i^T t)}\, |\det B|\; dt,  (5.4)
where B = A^{-1}. As noted in their article, this likelihood is intractable to compute because of the integral over the possible states of t. This prompts the authors to derive an approach that makes various simplifications and approximations to give a lower bound on the likelihood. First, they restrict the form of the base density for s to be gaussian,14 both t and H are constrained to be nonnegative, and φ(·) is taken to be (·)^{-1/2}. This yields the following expression for the marginal density of s:

p(s) = \int \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2} \sum_k t_k \sum_i H_{ik} s_i^2\right) \prod_i p_{t_i}(t_i)\, \sqrt{H_i^T t}\; dt.  (5.5)
This expression is then simplified by the approximation

\sqrt{H_i^T t} \approx \sqrt{H_{ii} t_i}.  (5.6)
While this approximation may not always be a good one, it is a strict lower bound on the true quantity and thus allows a lower bound on the likelihood as well. Their final approximate likelihood objective, \tilde{L}(B), is then given by

\tilde{L}(B) = \sum_{\text{data}} \left[\sum_{j=1}^{d} G\left(\sum_{i=1}^{d} H_{ij} (B_i^T x)^2\right) + \log|\det(B)|\right],  (5.7)
14 Their model can therefore be considered a type of GSM, although the authors do not comment on this.
where the form of the scalar function G is given by

G(τ) = \log \int \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} t\tau\right) p_t(t)\, \sqrt{H_{ii}\, t}\; dt.  (5.8)
The results obtained by Hyvarinen and Hoyer (2001) and Hyvarinen et al. (2001) are very similar to those presented here in section 4. These authors also noted the similarity between elements of their model and the response properties of simple and complex cells in V1. Interestingly, the optimization problem that they actually solve (i.e., maximization of equation 5.7), rather than the one they originally propose, can be mapped directly onto the optimization problem for a square, topographic PoT model if we take B ≡ J_PoT, H ≡ W_PoT, and G(τ) = log(1 + ½τ). More generally, we can construct an equivalent, square energy-based model whose likelihood optimization corresponds exactly to the optimization of their approximate objective function.

In this sense, we feel that our perspective has some advantages. First, we have a more accurate picture of what model we actually (try to) optimize. Second, we are able to move more easily to overcomplete representations. If Hyvarinen et al. (2001) were to make their model overcomplete, there would no longer be a deterministic relationship between their sources s and x. This additional complication would make the already difficult problems of inference and learning significantly harder. Third, in the HPoT framework, we are able to learn the top-level weights W in a principled way using the techniques discussed in section 3.2, whereas current tICA approaches have treated only fixed local connectivity.

5.3 Relationship to Other ICA Extensions. Karklin and Lewicki (2003, 2005) also propose a hierarchical extension to ICA that involves a second hidden layer of marginally independent, sparsely active units. Their model is of the general form proposed in Hyvarinen et al. (2001) but uses a functional dependency between the first and second hidden layers different from that employed in the topographic ICA model that Hyvarinen et al. (2001) fully develop. In the generative pass of Karklin and Lewicki's model, linear combinations of the second-layer activities are fed through an exponential function to specify scaling or variance parameters for the first hidden layer. Conditioned on these variances, the units in the first hidden layer are independent and behave like the hidden variables in a standard ICA model. This model can be described by the graph in Figure 8, where the transfer function φ(·) is given by an exponential. Using the notation of this figure, the relevant distributions are

p(t_i) = \frac{q_i \exp(-|t_i|^{q_i})}{2\, \Gamma(q_i^{-1})},  (5.9)
σ_j = c\, e^{[Ht]_j},  (5.10)

p(s_j | σ_j) = \frac{q_j}{2 σ_j \Gamma(q_j^{-1})} \exp\left(-\left|\frac{s_j}{σ_j}\right|^{q_j}\right),  (5.11)

x_k = [As]_k.  (5.12)
The authors have so far considered only complete models, and in this case, as with tICA, the first layer of hidden variables is deterministically related to the observables.15 To link this model to our energy-based PoT framework, we first consider the following change of variables:

B = A^{-1},  (5.13)

K = H^{-1},  (5.14)

ν_j = σ_j^{-1}.  (5.15)
Then, considering the q variables to be fixed, we can write the energy function of their model as

E(x, ν) = \sum_i \left|\sum_k K_{ik} \log(c ν_k)\right|^{q_i} + \sum_j \left(\log ν_j + |ν_j|^{q_j}\, |[Bx]_j|^{q_j}\right).  (5.16)
When we take the Boltzmann distribution with the energies defined in equation 5.16, we recover the joints and marginals specified by Karklin and Lewicki (2003, 2005). While the overall models are different, there are some similarities between this formulation and the auxiliary variable formulation of extended HPoT models (i.e., equation 2.12 with the generalized exponent β from section 2.5). Viewed from an energy-based perspective, they both have the property that an energy "penalty" is applied to (a magnitude function of) a linear filtering of the data. The scale of this energy penalty is set by a supplementary set of random variables that are themselves subject to an additional energy function. As with standard ICA, in overcomplete extensions of this model, the similarities to an energy-based perspective would be further reduced. We note as an aside that it might be interesting to consider the energy-based overcomplete extension of Karklin and Lewicki's model, in addition to the standard causal overcomplete extension.
15 Furthermore, as well as focusing their attention on the complete case, the authors assume the first-level weights are fixed to a set of filters obtained using regular ICA.
In the overcomplete version of the causal model, inference would likely be much more difficult because of posterior dependencies both within and between the two hidden layers. For the overcomplete energy-based model, the necessary energy function appears not to be amenable to efficient Gibbs sampling, but parameters could still be learned using contrastive divergence and Monte Carlo methods such as hybrid Monte Carlo.

5.4 Representational Differences Between Causal Models and Energy-Based Models. As well as specifying different probabilistic models, overcomplete energy-based models (EBMs) such as the PoT differ from overcomplete causal models in the types of representation they (implicitly) entail. This has interesting consequences when we consider the population codes suggested by the two types of model. We focus on the representation in the first layer (simple cells), although similar arguments might be made for deeper layers as well.

In an overcomplete causal model, many configurations of the sources are compatible with a configuration of the input.16 For a given input, a posterior distribution is induced over the sources in which the inferred values for different sources are conditionally dependent. As a result, even for models that are linear in the generative direction, the formation of a posterior representation in overcomplete causal models is essentially nonlinear, and, moreover, it is nonlocal because of the lack of conditional independence. This implies that unlike in EBMs, inference in overcomplete causal models is typically iterative, often intractable, and therefore time-consuming. Also, although we can specify the basis functions associated with a unit, it is much harder to specify any kind of feedforward receptive field in causal models. The issue of how such a posterior distribution could be encoded in a representation remains open; a common postulate (made on the grounds of efficient coding) is that a maximum a posteriori (MAP) representation should be used, but we note that even computing the MAP value is usually iterative and slow. Conversely, in overcomplete EBMs with deterministic hidden units such as we have presented in this article, the mapping from inputs to representations remains simple and noniterative and requires only local information.

In Figure 9 we use a somewhat absurd example to schematically illustrate a salient consequence of this difference between EBMs and causal models that have sparse priors. The array of vectors in Figure 9A should be understood to be either a subset of the basis functions in an overcomplete causal model or a subset of the filters in an overcomplete PoT model. In Figure 9B, we show four example input images. These have been chosen to be identical to four of the vectors shown in Figure 9A. The left-hand column of Figure 9C shows the representational responses of the units in an EBM-style model for these four inputs; the right-hand column shows the MAP responses from an overcomplete causal model with a sparse source prior.
16 In fact, strictly speaking, there is a subspace of compatible source configurations.
Figure 9: Representational differences between overcomplete causal models and overcomplete deterministic EBMs. (A) The 11 vectors in this panel should be considered the vectors associated with a subset of representational units in either an overcomplete EBM or an overcomplete causal model. In the EBM, they would be the feedforward filter vectors; in the causal model, they would be basis functions. (B) Probe stimuli. These images exactly match the vectors associated with units 4, 5, 6, and 1. (C) The left-hand column shows the normalized responses in an EBM model of the 11 units, assuming they are filters. The right-hand column shows the normalized responses from the units, assuming that they are basis functions in a causal model with a sparse prior and that we have formed a representation by taking the MAP configuration of the source units.
This is admittedly an extreme case, but it provides a good illustration of the point we wish to make. More generally, although representations in an overcomplete PoT are sparse, there is also some redundancy; the PoT population response is typically less sparse (Willmore & Tolhurst, 2001) than that of a causal model with an equivalent prior.
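The contrast drawn in Figure 9 can be reproduced schematically. In the sketch below (our illustration, not a procedure from the text), the EBM response is a single feedforward filtering, while the causal-model response is approximated by MAP inference under an L1 sparse prior, computed here with a few ISTA iterations as a stand-in for whatever optimizer one prefers:

```python
# A sketch of the representational contrast of Figure 9.
import numpy as np

def ebm_response(x, J):
    """Deterministic, local, noniterative: y = Jx."""
    return J @ x

def causal_map_response(x, A, lam=0.1, n_iter=200):
    """Approximate argmin_s 0.5*||x - As||^2 + lam*||s||_1 via ISTA."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ s - x)          # gradient of the quadratic term
        s = s - g / L
        s = np.sign(s) * np.maximum(np.abs(s) - lam / L, 0.0)  # shrinkage
    return s
```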
Interpreting the two models as descriptions of neural coding, one might expect the EBM representation to be more robust to the influence of neural noise than the representation suggested by a causal approach. Furthermore, the EBM-style representation is shiftable: it has the property that for small changes in the input, there are small changes in the representation. This property would not necessarily hold for a highly overcomplete causal model. Such a discontinuous representation might make subsequent computations difficult and nonrobust, and it also seems somewhat at odds with the neurobiological data; however, proper comparison is difficult since there is no real account of dynamic stimuli or spiking in either model. At present, it remains unclear which type of model, causal or energy based, provides the more appropriate description of coding in the visual system, especially since there are many aspects that neither approach captures.
6 Summary

We have presented a hierarchical energy-based density model that we suggest is generally applicable to data sets that have a sparse structure or that can be well characterized by constraints that are often well satisfied but occasionally violated by a large amount. By applying our model to natural scene images, we are able to provide an interpretational account for many aspects of receptive field and topographic map structure within primary visual cortex, while also developing sensible high-dimensional population codes.

Deterministic features (i.e., the first- and second-layer filter outputs) within our model play a key role in defining the density of a given image patch, and we are able to draw a close relationship between these features and the responses of simple cells and complex cells in V1. Furthermore, by constraining our model to interact locally, we are able to provide some computational motivation for the forms of the cortical maps for retinotopy, phase, spatial frequency, and orientation.

While our model is closely related to some previous work, most prominently Hyvarinen et al. (2001), it bestows a different interpretation on the learned features, is different in its formulation, and describes rather different statistical relations in the overcomplete case. We present our model both as a general alternative tool to ICA for describing sparse data distributions and as an alternative interpretive account for some of the neurobiological observations from the mammalian visual system. Finally, we suggest that the models outlined here could serve as a starting point for image processing applications such as denoising or deblurring and that they might also be adapted to time-series data such as natural audio sequences.
Appendix: Sampling in HPoT Models

A.1 Complete Models. We start our discussion with sampling in complete HPoT models. In this case, there is a simple invertible relationship between x and y, implying that we may focus on sampling y and subsequently transforming these samples back to x-space through x = J^{-1} y. Unfortunately, unless W is diagonal, all y variables are coupled through W, which makes it difficult to devise an exact sampling procedure. Hence, we resort to Gibbs sampling using equation 2.13, where we substitute y_j = J_j x to acquire a sample u|y. To obtain a sample y|u, we convert equation 2.9 into

P(y|u) = N_y\left[y;\ 0,\ \mathrm{Diag}[W^T u]^{-1}\right].  (A.1)
We iterate this process (alternately sampling u ∼ P(u|y) and y ∼ P(y|u)) until the Gibbs sampler has converged. Note that both P(u|y) and P(y|u) are factorized distributions, implying that both the u and the y variables can be sampled in parallel.

A.2 Overcomplete Models. In the overcomplete case, we are no longer allowed first to sample the y variables and subsequently transform them into x-space. The reason is that the deterministic relation y = Jx means that when there are more y variables than x variables, some y configurations are not allowed (i.e., they are not in the range of the mapping x → Jx). If we sample y, all these samples will (with probability one) have some components in these forbidden dimensions, and it is unclear how to transform them correctly into x-space. An approximation is obtained by projecting the y-samples into x-space using the pseudoinverse, x̃ = J^{#} y. We have often used this approximation in our experiments and have obtained good results, but we note that its accuracy is expected to decrease as we increase the degree of overcompleteness.

A more expensive but correct sampling procedure for the overcomplete case is to use a Gibbs chain in the variables u and x (instead of u and y) by using equations 2.13 and 2.14 directly. In order to sample x|u, we need to compute a Cholesky factorization of the inverse covariance matrix of the gaussian distribution P(x|u),

R^T R = J V J^T, \qquad V = \mathrm{Diag}[W^T u].  (A.2)
The samples x|u are now obtained by first sampling from a multivariate standard normal distribution, n ∼ N_n[n; 0, I], and subsequently setting x = R^{-1} n. The reason this procedure is expensive is that R depends on u, which changes at each iteration. Hence, the expensive Cholesky factorization and inverse have to be recomputed at each iteration of Gibbs sampling.
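Putting the two conditionals together gives the Gibbs sweep used for training and sampling. The sketch below is illustrative code rather than the authors' implementation; gamma shape/rate conventions follow NumPy (scale = 1/rate), and J is stored with one filter per row, so the precision written J V J^T in the text appears as J.T @ V @ J here:

```python
# A sketch of the u|x and x|u Gibbs updates of equations 2.13-2.14,
# including the Cholesky step described above.
import numpy as np

def gibbs_sweep(x, J, W, alpha, rng=np.random.default_rng(0)):
    # u | x : independent gamma variables, eq. 2.13
    rate = 1.0 + 0.5 * W @ (J @ x) ** 2
    u = rng.gamma(shape=alpha, scale=1.0 / rate)
    # x | u : zero-mean gaussian, eq. 2.14
    V = np.diag(W.T @ u)
    P = J.T @ V @ J                 # precision matrix (R^T R above)
    R = np.linalg.cholesky(P).T     # upper-triangular Cholesky factor
    n = rng.standard_normal(x.shape[0])
    return np.linalg.solve(R, n), u # x = R^{-1} n has covariance P^{-1}
```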
A.3 Extended PoT Models. The sampling procedures for the complete and undercomplete extended models discussed in section 2.5 are very similar, apart from the fact that the conditional distribution P(y|u) is now given by

P_{ext}(y|u) \propto \prod_{i=1}^{M} \exp\left(-\frac{1}{2} V_{ii}\, |y_i|^{\beta}\right), \qquad V = \mathrm{Diag}[W^T u].  (A.3)
Efficient sampling procedures exist for this generalized gaussian-Laplace probability distribution. In the overcomplete case, it has proved more difficult to devise an efficient Gibbs chain (the Cholesky factorization is no longer applicable), but the approximate projection method using the pseudoinverse, J^{#}, still seems to work well.

Acknowledgments

We thank Peter Dayan and Yee Whye Teh for important intellectual contributions to this work and many other researchers for helpful discussions. The work was funded by the Gatsby Charitable Foundation, the Wellcome Trust, NSERC, CFI, and OIT. G.E.H. holds a Canada Research Chair.

References

Andrews, D., & Mallows, C. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, 36, 99–102.

Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Computation, 4(2), 196–210.

Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.

Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.

Carreira-Perpinan, M., & Hinton, G. (2005). On contrastive divergence learning. In R. Cowell & Z. Ghahramani (Eds.), Proceedings of the Society for Artificial Intelligence and Statistics. Barbados. Available online at http://www.gatsby.ucl.ac.uk/aistats.

Freund, Y., & Haussler, D. (1992). Unsupervised learning of distributions of binary vectors using 2-layer networks. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 4 (pp. 912–919). San Francisco: Morgan Kaufmann.

Heskes, T. (1998). Selecting weighting factors in logarithmic opinion pools. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Received December 20, 2004; accepted August 1, 2005.
LETTER
Communicated by Peter Földiák
A Simple Hebbian/Anti-Hebbian Network Learns the Sparse, Independent Components of Natural Images Michael S. Falconbridge
[email protected] School of Psychology, University of Western Australia, Nedlands WA 6009, Australia
Robert L. Stamps
[email protected] School of Physics, University of Western Australia, Nedlands WA 6009, Australia
David R. Badcock
[email protected] School of Psychology, University of Western Australia, Nedlands WA 6009, Australia
Slightly modified versions of an early Hebbian/anti-Hebbian neural network are shown to be capable of extracting the sparse, independent linear components of a prefiltered natural image set. An explanation for this capability in terms of a coupling between two hypothetical networks is presented. The simple networks presented here provide alternative, biologically plausible mechanisms for sparse, factorial coding in early primate vision.

1 Introduction

Retinal photoreceptor activity is typically high in awake, behaving primates. Yet based on data describing average energy consumption by human cortex and the energetic cost of neural activity, Lennie (2003) recently calculated that the fraction of active cortical neurons is very low. The upper bound proposed by Lennie is 1%, even in so-called active cortical areas (those areas that light up during fMRI, for example). Single cell recordings in V1—the first visual cortical area—reveal that these neurons tend to have very low activity most of the time and high activity only rarely (although more often than a gaussian activity distribution would predict) (Baddeley et al., 1997). Thus, very early in the cortical stages of processing, a sparse representation of the dense sensory input is achieved. How is sparse coding implemented in the primate brain? In a landmark paper, Olshausen and Field (1996) produced a successful model of V1 simple cells based on the assumptions of independence of simple cell responses, sparseness, and low information loss. Other modelers showed that the Gabor-like receptive fields produced by implementing
the Olshausen model also result from applying independent component analysis (ICA) with the assumption of leptokurtic (highly peaked at zero) data sources (Hurri, Hyvärinen, & Oja, 1996; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998; van Hateren & Ruderman, 1998).¹ It was shown that the two methods of ICA and what might be termed SCA (sparse component analysis) are actually equivalent when it comes to modeling perception of natural scenes (e.g., Bell & Sejnowski, 1997). In this article, we show that two slightly modified versions of an early Hebbian/anti-Hebbian sparse coding algorithm (Földiák, 1990) are also capable of extracting the sparse, independent components of natural images. The first version is, in all respects, the same as the original Foldiak algorithm except that the binary constraint on the outputs is removed; the second is a more biologically plausible version. In the discussion, a mathematical explanation for why the algorithms achieve ICA and SCA is presented.

2 Foldiak's Network

Foldiak's original SCA algorithm combines three simple, biologically inspired learning rules. Hebbian learning occurs on the feedforward connections denoted w_ij. Anti-Hebbian learning occurs on lateral connections u_ij between the output units. A sensitivity term s_i assigned to each output unit automatically adjusts to encourage a certain low level of activity among the outputs. Although no overall objective function has been presented for the algorithm, an intuitive explanation for the three respective learning rules is that they (1) ensure that the feedforward connections capture the interesting (high-variance-producing) features in the data set, (2) discourage correlations between outputs, which forces them to represent different features, and (3) encourage sparse, distributed activity among the output units. The first Foldiak algorithm presented here may be described in a step-wise fashion as follows. The architecture of the network used to implement the algorithm is depicted in Figure 1.

1. The prefiltered image x^q is presented to the network. For both versions of the network, prefiltering was performed according to the process described in Olshausen and Field (1997), which approximates the effect of processing by retinal and lateral geniculate nucleus (LGN) neural circuits (see Olshausen & Field, 1997, and Atick & Redlich, 1992, where a similar function is compared with the effect of retina/LGN processing). Natural images were taken from http://www.cis.hut.fi/projects/ica/data/images.
¹ The Gabor function referred to is a two-dimensional version—an oriented, two-dimensional gaussian envelope modulated by a sinusoidal function along one (typically the short) axis.
Figure 1: Representation of the architecture of the Foldiak network (diagram: inputs x_1 ... x_n feed weighted sums σ_1 ... σ_m, which feed outputs y_1 ... y_m). The network is actually single layered but has been separated into two parts to demonstrate that it consists of a linear, feedforward subnetwork that feeds into a recurrent inhibition layer. Hebbian learning occurs on the feedforward connections and anti-Hebbian learning on the lateral connections. Each σ is a weighted sum of its inputs. y_i is calculated as in equation 2.2.
2. The responses of filters (rows of a matrix W denoted w_i) are calculated:

σ^q = W x^q,  or  σ_i^q = Σ_j w_ij x_j^q.    (2.1)
3. The outputs are calculated:

y_i^q = f(σ_i^q + Σ_j u_ij y_j^q + s_i).    (2.2)

f is the activation function. u_ij = u_ji represents the connection strength between output units i and j, which is never greater than zero for any (i, j) and is zero when i = j. s_i is a sensitivity term associated with output unit i. Note that the solution to equation 2.2 cannot be calculated in one step. The output values may be acquired by finding the equilibrium solution to the differential (with respect to iteration number) equation:

ẏ_i^q = f(σ_i^q + Σ_j u_ij y_j^q + s_i) − y_i^q.    (2.3)
4. The weights and sensitivity terms are updated according to the following rules. The Hebbian rule for feedforward weights is

Δw_ij = η1 y_i (x_j − y_i w_ij).    (2.4)

The anti-Hebbian rule for lateral weights is

Δu_ij = −η2 (y_i y_j − α^2)  (if i = j or u_ij > 0, then u_ij := 0).    (2.5)
And the sensitivity update rule is

Δs_i = η3 (α − y_i),    (2.6)

where the η_m are small, positive numbers representing the learning rates for the various parameters, and α is the desired average activity for all output units. Alternatively, the variables may be updated using average values after the presentation of a batch of, say, 100 images. This is, in practice, an advantageous approach, as it leads to smoother changes in the net.

5. q is incremented, and steps 1 to 5 are repeated.

Note that all equations shown here are exactly the same as those of the original Foldiak (1990) algorithm except in two ways: (1) the last term of the Δw_ij rule has been changed from −η1 y_i w_ij, which effectively fixes |w_i| when y_i is binary, to −η1 y_i y_i w_ij, which is more effective for doing the same thing when y_i is continuous; and (2) the equation forcing outputs greater than 0.5 to go to one and those less than 0.5 to go to zero has been removed.
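To make the procedure concrete, the sketch below runs steps 1 through 5. It is a minimal numpy illustration under our own simplifying assumptions (random data standing in for prefiltered images, a fixed number of settling iterations for equation 2.3, and illustrative parameter values), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 16                  # input and output dimensions
eta1, eta2, eta3 = 0.1, 0.1, 0.1      # learning rates (illustrative values)
alpha, beta = 0.05, 20.0              # target activity and sigmoid steepness

W = rng.normal(scale=0.1, size=(n_out, n_in))   # feedforward weights w_ij
U = np.zeros((n_out, n_out))                    # lateral weights u_ij (<= 0)
s = np.zeros(n_out)                             # sensitivity terms s_i

def f(x):
    return 1.0 / (1.0 + np.exp(-beta * x))      # sigmoid of equation 3.1

for q in range(10000):
    x = rng.random(n_in)              # stand-in for a prefiltered image x^q
    sigma = W @ x                     # equation 2.1
    y = f(sigma)                      # initial guess for the outputs
    for _ in range(50):               # settle equation 2.3 toward equilibrium
        y += 0.1 * (f(sigma + U @ y + s) - y)
    # Equation 2.4: Hebbian rule with the -eta1 * y_i^2 * w_ij decay term.
    W += eta1 * (np.outer(y, x) - (y**2)[:, None] * W)
    # Equation 2.5: anti-Hebbian rule, clipped so u_ij <= 0 and u_ii = 0.
    U -= eta2 * (np.outer(y, y) - alpha**2)
    U = np.minimum(U, 0.0)
    np.fill_diagonal(U, 0.0)
    s += eta3 * (alpha - y)           # equation 2.6
```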
3 Simulations and Results

3.1 The First Network. In line with the original Foldiak network, this network uses a sigmoidal activation function,

f(x) = 1 / (1 + e^{−βx}),    (3.1)
where β is a "steepness" parameter. The function constrains outputs to lie between zero and one. In keeping with the original Foldiak network, inputs were also constrained to be positive. This was achieved by simultaneously presenting (within one vector) a rectified version of the original prefiltered image and a rectified negative version of the image. This represents filtering through on and off channels in the retina/LGN system (Schiller, 1992). Having positive inputs and positive outputs produces positive feedforward connections via the Hebbian learning rule. (See Hoyer, 2003, for additional arguments supporting positive nets.)

Figure 2 shows a set of 144 image components learned by the first network. They were calculated by taking the first half of each row² of W (representing connections from on units) and subtracting the second half (representing connections from off units). This reverses the rectification step used to produce positive inputs, which projects the connection strengths back into the original prefiltered image space.

² Note that the rows of W were referred to earlier as filters through which images are passed to get values for σ. This is so, but the learning rule for w_ij ensures that these rows come to represent components of the image data set. This is typical for ICA and SCA algorithms. Filter and component are used interchangeably to refer to a row of W.

Figure 2: An example set of 144 components produced by the continuous Foldiak network.

The set of components in the figure was produced using the parameters α = .005, β = 200, η2 = 2.4, and η3 = 1.5. η1 was varied from 3 (10^6 iterations) to 1.5 (400,000 iterations) to .725 (10^6 iterations). This represents a gradual decrease in the plasticity of the feedforward connections, although a similar result can be obtained by maintaining η1 = 3 for 10^6 iterations. There is a tendency toward Gabor-like functions. This is consistent with the general form of the components found by existing ICA and SCA algorithms. There are obvious differences, though, between some of the components in Figure 2 and other published components. There is a subset of components that appear to contain a small number of Gabor-like features (e.g., the component on row 3, column 10). Note, though, that during learning of natural scenes, the multiple Gabor components tend to fluctuate, whereas the single Gabor components remain stable, even when η1 is large. This suggests that the single Gabor-like functions represent optimal solutions. In order to analyze the properties of the components, 2D Gabor functions were fit to each component. Fits that produced relatively high chi-square values were not included in the analysis (5 multiple Gabor components
out of the total 144 components were lost). The distribution of peak spatial frequencies and orientations for the fits is depicted in Figure 3. In the same figure, the corresponding distribution for the Olshausen and Field (1997) components is shown for comparison.

Figure 3: The distribution of the 139 Gabor components in frequency space (.) compared with that for Olshausen and Field's (1997) components (o).

The distribution for their components is slightly more spread, and the peak is at a higher frequency. This is more obvious in Figure 4, where orientation information in the data has been removed. The histograms for the two component sets are directly comparable, as each bin represents the same frequency in terms of cycles per image. They are compared in the diagram to a histogram of peak spatial frequency sensitivities for a set of 53 macaque simple cells (De Valois, Albrecht, & Thorell, 1982). In making the comparison, we chose to align the upper cutoff frequency of the cell data with the theoretical (Nyquist) upper limit for the component data. This approximately equates one pixel with a single photoreceptor. Using this point of reference, the peak of the Foldiak network's distribution lies closer to the simple cell peak than does the peak for the Olshausen data.

3.2 A More Biologically Plausible Network. For the more plausible network, the activation function was chosen to more closely reflect biological data. The function of choice was a one-sided hyperbolic ratio function:
f(c) = c^β / (c^β + γ^β)  if c ≥ 0;  f(c) = 0  if c < 0.    (3.2)
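As a quick illustration of equation 3.2, the following minimal sketch is our own; the defaults β = 2.0 and γ = 0.3 are the values quoted below for the simulations:

```python
import numpy as np

def hyperbolic_ratio(c, beta=2.0, gamma=0.3):
    # One-sided hyperbolic ratio function of equation 3.2: a saturating
    # contrast-response curve c^beta / (c^beta + gamma^beta) for c >= 0,
    # and zero for negative arguments.
    c = np.asarray(c, dtype=float)
    return np.where(c >= 0, c**beta / (c**beta + gamma**beta), 0.0)

contrasts = np.linspace(0, 1, 11)
print(hyperbolic_ratio(contrasts))   # rises steeply, then saturates near 1
```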
Figure 4: Distribution of peak spatial frequencies for the modified Foldiak (MF) network components, the Olshausen and Field (1997) components (O&F), and a sample of 53 simple cells (De Valois et al., 1982) (DeV). (Histogram; x-axis: log of peak spatial frequency, normalized to the upper cutoff; y-axis: normalized "cell" count.)
This function was shown by Albrecht and Hamilton (1982) to describe well the response of simple cells to the increasing contrast of stimuli and has been used extensively to fit cell response data since (Gottschalk, 2002). A common value for β for simple cell data is 2.0 (Gottschalk, 2002). This was used in our simulations. γ was varied for each simulation to allow exponential-like y histograms in keeping with the response distributions of V1 cells to natural images (Baddeley, et al., 1997; Gallant & Vinje, 2000). The positivity constraint on the inputs was removed so that the filter response distributions were approximately symmetrical around zero. With this constraint removed, negative inputs come to represent positive inputs from “off” LGN cells and positive inputs represent inputs from “on” cells. Essentially, the model is equivalent to a positive one. In either case, the conditions for an output unit response, and thus the occurrence of learning, include the matching of both input pattern and sign to the pattern and sign—or, equivalently, the on-off specificity—of the feedforward connections. The removal of the positivity constraint was necessary to allow the model to produce exponential output distributions. If inputs were constrained to be positive, the output distributions tended to be highly peaked at zero and slightly peaked at one, and to represent little to no activity in between.
Figure 5: The receptive fields of a set of output units. These are calculated by mapping the responses of units to a point source applied to each point in the input array.
Figure 5 depicts an example set of receptive field profiles for the output units. The receptive fields were mapped out by recording the responses of output units to a single white pixel (the point source) placed at each point in the input array. The receptive fields depicted were a result of a simulation where α = 0.01 and γ = 0.3. The learning rate for the feedforward connections varied from 1.25 to 0.63 (representing a decrease in plasticity over time), for the lateral connections it remained at 0.1, and for the sensitivity term it varied from 0.2 to 0.1 (representing increased stability of units in the face of fluctuating input levels) over a total of 10^6 image presentations. Note that a value of .01 for α means that the approximate percentage of active outputs at any one time is equal to the upper limit of 1% predicted by Lennie (2003) for active cortical areas. The distribution of receptive fields in frequency space is similar to the distribution for the sigmoidal model's components depicted in Figure 3 in that there is an even spread across orientations and a similar concentration in the midfrequency range. Thus, the relation of the distribution to that of De Valois et al.'s cells is similar.
In an attempt to define the contrast response function of our output units, the responses of some randomly selected units to images (including images of the units' best-fitting Gabors) of varying contrast were measured. In line with physiological data, the response to increasing contrast is very similar to the response functions imposed on these units and is thus well modeled by the hyperbolic ratio function. In summary, compared with the first network, the second model has a more plausible choice of response function for output units and more plausible output activity distributions (i.e., resembling more closely distributions of activity in real visual systems). The level of activity among output units is similar to that determined by Lennie, and the contrast response functions of output units resemble those of real simple cells. The receptive fields as measured using a point source quantitatively resemble simple-cell receptive fields measured in the same way.

4 Discussion

4.1 How Do the Algorithms Work? For the purpose of a comparison with the Olshausen generative model (see Olshausen & Field, 1997), assume there is a set of variables for the Foldiak net that can be compared to the vector a, which represents a set of coefficients for the components of an image x. For the Foldiak generative model, these coefficients are assumed to be positive (no subtraction of components allowed when generating an image). The reason for this assumption is explained later. As the pattern of feedforward connections to any particular output unit for both the Foldiak and Olshausen net represents an input component (rather than an optimal filter for that component), the generative model uses these connection strengths to generate images. The generative equation is thus

x = W^T a,  or  x_j = Σ_k w_kj a_k.    (4.1)

The Foldiak net produces a variable σ that is calculated thus:

σ_i = Σ_j w_ij x_j.    (4.2)

Combining equations 4.1 and 4.2 produces

σ_i = Σ_j w_ij Σ_k w_kj a_k = a_i Σ_j w_ij^2 + Σ_{k≠i} Σ_j w_ij w_kj a_k.

Now, σ_i is used as input to a fully interconnected network that produces outputs y. Incorporating the previous relationship into equation 2.2 produces

y_i = f(a_i |w_i|^2 + Σ_{k≠i} Σ_j w_ij w_kj a_k + Σ_{k≠i} u_ik y_k + s_i),    (4.3)
where |w_i|^2 = Σ_j w_ij^2. We can use this equation to determine the statistical relationship between the hypothetical coefficient a_i and the network output y_i. This is done using the fact that the probability distribution for the variables a is related to the distribution for y by

P(a_i) = P(y_i) |dy_i/da_i|.    (4.4)

Because of the steep sigmoidal-like shape of f (large steepness parameters were used for both the sigmoidal and hyperbolic ratio functions), the derivative dy_i/da_i is sharply peaked when the argument of f is equal to zero. Assuming P(y_i) is nonzero in this vicinity (substantiated later), P(a_i) is a narrow gaussian-like distribution that peaks where

a_i |w_i|^2 + Σ_{k≠i} Σ_j w_ij w_kj a_k = −s_i − Σ_{k≠i} u_ik y_k.    (4.5)
The activation function thus effectively produces a low-noise mapping between the outputs of two hypothetical networks—the first represented by the left-hand side and the second by the right-hand side of equation 4.5. The properties of the second net are examined next. The learning rule for s_i (see equation 2.6) attempts to make ȳ_i = α. If it can achieve this, s_i itself goes to zero (fluctuations about zero may occur in order to keep ȳ_i at the desired level). If not zero, s_i tends to be negative when α is small because, in general, y_i tends to be larger than α. The learning rule for u_ik (see equation 2.5) tries to push correlations between y_i and y_k to the value of α^2. By achieving this, u_ik itself goes to zero. If not zero, u_ik is always negative. If α is appropriately chosen (i.e., if the level of correlations and the level of average activity it represents is attainable by the second net), then the right-hand side of equation 4.5 will tend to be positive but will approach zero in time. The implications for the first, generative net are as follows. As |w_i| is essentially fixed by the Foldiak w_ij rule, it means that the average of a_i—assumed positive—is pushed toward zero. ā_i cannot in general be zero if the generative model is going to produce actual images,
so the right-hand-side net going to zero also means that the rows of W are forced to become orthogonal, as this minimizes the sum over j in the second left-hand-side term of equation 4.5. An effective cost function may be produced for the linear generative net based on these observations—that is,

C = Σ_q [ Σ_i (x_i − x̂_i)^2 + λ1 Σ_i S1(a_i) + λ2 Σ_ij S2(|w_i · w_j|) ],  with |w_i|^2 fixed,

where λ1 and λ2 are constants and S1 and S2 are "sparseness" functions like that used in Olshausen's net to penalize large values of the argument. The first term encourages accurate image reconstruction, the second encourages small values for the component coefficients, and the third encourages the rows of W to point in orthogonal directions. This cost function represents a summary of all of the forces present in the hypothetical Foldiak generative network. The first term embodies the generative model assumption that reconstructed images are to be as close to the originals as possible. The second and third terms represent the effect of coupling the generative model to the network represented by the right-hand side of equation 4.5. The actual forms of the functions S1 and S2 are not obvious from the coupling. Any function that is smooth, increases monotonically when the argument is greater than zero, and is flat close to zero should approximate the effect of the coupling in equation 4.5. Flattening near zero reflects the fact that the right-hand side of equation 4.5 approaches zero quickly when it is large and more slowly as it gets closer to zero. It is also in line with the notion that when either a_i or |w_i · w_j| is zero, there should be no forces on them—to make them either larger or negative—as the aim is to make them as small as possible and neither a_i nor |w_i · w_j| is allowed to be negative. An appropriate choice for S1 and S2 might be S(x) = x^2. The cost function is the same as the Olshausen cost function except that it has an extra term that encourages the image components to be orthogonal. This term does not oppose the actions of the others in any way; rather, it helps ensure that the components represent unique aspects of the image data set. This explains why our learned components and those of the Olshausen net have similar properties. Note that the above cost function does not by itself lead to all of the learning rules present in the Foldiak algorithm, as the lateral connections and sensitivity terms do not appear in the function. They have to be introduced into the model via the coupling we have described in order to get the right type of rules.
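For concreteness, the cost function can be evaluated as in the following sketch. This is our own illustration, taking S1(x) = S2(x) = x^2 as the text suggests:

```python
import numpy as np

def generative_cost(W, A, X, lam1=1.0, lam2=1.0):
    # W: (m, n) rows are components w_i (with |w_i|^2 held fixed);
    # A: (Q, m) nonnegative coefficients a^q; X: (Q, n) the images x^q.
    X_hat = A @ W                        # reconstructions x̂^q = W^T a^q
    recon = np.sum((X - X_hat) ** 2)     # first term, summed over images
    sparse = np.sum(A ** 2)              # sum of S1(a_i) over images
    G = W @ W.T                          # Gram matrix, G_ij = w_i · w_j
    # Sum of S2(|w_i · w_j|); the diagonal entries are constant because
    # |w_i|^2 is fixed, and the term is counted once rather than per image,
    # which only rescales it.
    ortho = np.sum(G ** 2)
    return recon + lam1 * sparse + lam2 * ortho
```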
4.1.1 Choosing α. The parameter α must be chosen not only to allow the right-hand side of equation 4.5 to approach zero; it must also allow the Foldiak network to produce appropriate outputs y. The update rule for w_ij is the Oja (1982) update rule, so the final weight matrix minimizes the Oja cost function,³ which, for multiple Oja component analyzers, is

C = Σ_q Σ_{i,j} (x_i − x̂_ij)^2,  where  x̂_ij = w_ji y_j.    (4.6)
According to this equation, each output unit produces its own "reconstructed" image x̂_i = w_i y_i, which is compared to the original x. Averaged over all input images and all reconstructed images produced for each input image, the aim is for the reconstructed images to look as much like the corresponding input images as possible in a least-squares sense. The rows of W thus learn to represent significant components of the image set. The aim of SCA is to use each component as rarely as possible while keeping an average like the one in the Oja cost function low. The aim of ICA is to keep component coefficients as independent from one another as possible. Both of these constraints translate to making α as small as possible, as doing so makes ȳ_i small for all i and minimizes correlations between the outputs. Competing with this is the fact that α must also be large enough to allow the right-hand side of equation 4.5 to approach zero.

³ The cost function is not minimized with respect to the outputs y, as these are constrained in certain ways. If the constraints (including the nonlinear activation function) were removed, then each row of W would point to the principal component of the data set (assuming there was one after prefiltering).

4.1.2 Supplementary Notes. Note that only correlations between the outputs are removed by the u_ij rule, but higher-order dependencies than this are discouraged among the linear component coefficients a. This is because Hebbian learning on the feedforward connections between the inputs x and the nonlinear outputs y constitutes nonlinear Hebbian learning, where statistical relationships of higher order than correlations are absorbed into the rows of W (Karhunen & Joutsensalo, 1994; Sudjianto & Hassoun, 1995). It was stated earlier that a is assumed to be positive. This holds because y is always positive, which means that by the Hebbian learning rule, the components that emerge are those that are added to make the images—not subtracted. It follows then that the coefficients a_i are positive. Also, it was required that P(y_i) is nonzero around the points where the relation in equation 4.5 holds. The left-hand side is simply σ_i, so this equates to
requiring

P(Σ_j u_ij y_j = −s_i − σ_i) > 0.    (4.7)
Note the following: y_j is always positive or zero, u_ij is always negative or zero, s_i tends to be negative or zero, and the distribution P(σ_i) sits in the region σ > 0 for a positive network and is centered on zero and relatively spread for a mixed-sign network. It is very likely, then, that relationship 4.7 holds.

4.2 Possible Biological Mechanism. The attraction of a nonlinear Hebbian/anti-Hebbian network like that presented here is its biological plausibility. The occurrence of Hebbian-like mechanisms leading to long-term changes (long-term potentiation, LTP) in the brains of a range of animals is fairly well substantiated (see Brown, Kairiss, & Keenan, 1990, for a review). Its occurrence in perceptual areas of the brain is often assumed (e.g., Rauschecker, 1991). Katz and Shatz (1996) cite evidence for the occurrence of LTP in visual cortex of a range of animals at various ages. A biologically plausible mechanism for the anti-Hebbian learning is as follows. Neighboring pyramidal cells responsible for passing information from and to other brain areas or layers are connected via inhibitory interneurons. Such neurons constitute 20% of the neural population in cortex (Thomson & Deuchars, 1997) and are known to connect cells whose receptive fields lie near one another in visual space (Blakemore, Carpenter, & Georgeson, 1970; Budd & Kisvarday, 2001). Anti-Hebbian learning translates in this scheme to Hebbian learning on the inhibitory interneuron/pyramidal synapses. Evidence for a cellular origin to adaptation such as that seen in our models via the use of an adapting sensitivity term can be found in Carandini (2000). This is in line with psychophysical and brain imaging results that show that adaptation is specific to the cells activated by the repeated stimuli (e.g., Boynton & Finney, 2003).

5 Conclusion

Two slightly modified versions of Foldiak's (1990) Hebbian/anti-Hebbian network have been used to extract the sparse, independent linear components of a prefiltered natural image set. The second version in particular has a number of aspects in common with the primate visual system, including receptive fields qualitatively similar to those of simple cells, response functions and activity distributions qualitatively similar to those of visual cells, output units that adapt to the prevailing magnitude of the input signal,
and lateral inhibitory connections like those known to exist in the primate visual system. The algorithms work because the nonlinear activation function links a hypothetical linear generative model to certain self-adjusting components of the network. Changes in the network produce changes in the generative model. An effective cost function for the generative network was suggested. It is essentially the same as the Olshausen and Field (1997) cost function with an additional term to encourage orthogonal components. A possible biological mechanism based on the models uses inhibitory interneurons and Hebbian learning to achieve the proposed anti-Hebbian learning.

Acknowledgments

We thank the reviewers for their helpful comments on earlier drafts of the article. This work was supported by ARC grants A00000836 and DP0346084 (D.R.B.) and an Australian Postgraduate Award (M.F.).

References

Albrecht, D. G., & Hamilton, D. B. (1982). Striate cortex of monkey and cat: Contrast response function. Journal of Neurophysiology, 41, 217–237.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–210.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society of London B, 264, 1775–1783.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Blakemore, C., Carpenter, R. H. S., & Georgeson, M. A. (1970). Lateral inhibition between orientation detectors in the human visual system. Nature, 228, 37–39.
Boynton, G. M., & Finney, E. M. (2003). Orientation-specific adaptation in human visual cortex. Journal of Neuroscience, 23(25), 8781–8787.
Brown, T. H., Kairiss, E. W., & Keenan, C. L. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci., 13, 475–511.
Budd, J. M. L., & Kisvarday, Z. F. (2001). Local lateral connectivity of inhibitory clutch cells in layer 4 of cat visual cortex (area 17). Experimental Brain Research, 140, 245–250.
Carandini, M. (2000). Visual cortex: Fatigue and adaptation. Current Biology, 10(16), R1–R3.
De Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22, 545–559.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170.
Gallant, J. L., & Vinje, W. E. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287(5456), 1273.
Gottschalk, A. (2002). Derivation of the visual contrast response function by maximizing information rate. Neural Computation, 14, 527–542.
Hoyer, P. O. (2003). Modelling receptive fields with non-negative sparse coding. In E. D. Schutter (Ed.), Computational neuroscience: Trends in research 2003. Amsterdam: Elsevier.
Hurri, J., Hyvärinen, A., & Oja, E. (1996). Image feature extraction using independent component analysis. In Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG) '96. Piscataway, NJ: IEEE.
Karhunen, J., & Joutsensalo, J. (1994). Representation and separation of signals using nonlinear PCA type learning. Neural Networks, 7(1), 113–127.
Katz, L. C., & Shatz, C. J. (1996). Synaptic activity and the construction of cortical circuits. Science, 274, 1133–1138.
Lennie, P. (2003). The cost of cortical computation. Current Biology, 13, 493–497.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Rauschecker, J. P. (1991). Mechanisms of visual plasticity. Physiological Reviews, 71(2), 587–615.
Schiller, P. H. (1992). The ON and OFF channels of the visual system. Trends in Neurosciences, 15(3), 86–92.
Sudjianto, A., & Hassoun, M. H. (1995). Statistical basis of nonlinear Hebbian learning and application to clustering. Neural Networks, 8(5), 707–715.
Thomson, A. M., & Deuchars, J. (1997). Synaptic interactions in neocortical local circuits: Dual intracellular recordings in vivo. Cerebral Cortex, 7, 510–522.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 2315–2320.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 359–366.
Received January 30, 2004; accepted June 9, 2005.
LETTER
Communicated by Steven Nowlan
Differential Log Likelihood for Evaluating and Learning Gaussian Mixtures Marc M. Van Hulle
[email protected] K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, B-3000 Leuven, Belgium
We introduce a new unbiased metric for assessing the quality of density estimation based on gaussian mixtures, called differential log likelihood. As an application, we determine the optimal smoothness and the optimal number of kernels in gaussian mixtures. Furthermore, we suggest a learning strategy for gaussian mixture density estimation and compare its performance with log likelihood maximization for a wide range of real-world data sets.

1 Introduction

A standard procedure in density estimation is to minimize the negative log likelihood given the sample S = {v_i | i = 1, ..., N}, v_i = [v_i1, ..., v_id] ∈ V ⊆ R^d, taken from the density p(v) (Redner & Walker, 1984; Bishop, 1995), often in combination with the expectation-maximization (EM) approach (Dempster, Laird, & Rubin, 1977):

F = −log L = −(1/N) Σ_{i=1}^{N} log p̃(v_i|Θ),    (1.1)
with p̃(.) the density estimate and Θ the parameter vector of the model estimate. When N → ∞, we obtain the expected log likelihood:

E[−log L] = −∫_V p(v) log p̃(v|Θ) dv.    (1.2)
For a given sample S, the (sample) log likelihood cannot be used for judging the goodness of fit of a model or for model selection, since it is biased as an estimator of the expected log likelihood. The bias appears in practice since the same sample is used for estimating both the model parameters and the expected log likelihood. Akaike (1973) was the first to suggest an approximation of this bias, which he included in his information metric to estimate the expected log likelihood. This has led to several improved
information metrics (for review, see Stoica & Selén, 2004), including a bootstrap procedure (Ishiguro, Sakamoto, & Kitagawa, 1997). In this letter, we consider the log likelihood to be biased as an estimator of the theoretically optimal log likelihood (cf. the Kullback-Leibler divergence). We estimate this bias by the difference between the negative log likelihood and the entropy of the density estimate. We call it the differential log likelihood, in analogy with the differential entropy (thus, also with respect to some reference; see Cover & Thomas, 1991). For the entropy, we will use kernel density estimates ("plug-in" estimates; Ahmad & Lin, 1976), but other estimates could be used as well. The differential log likelihood should fluctuate around zero when a good model is achieved, and diverge away from it otherwise. We show under what condition differential log likelihood is an unbiased estimator for finite data sets and when this is approximately the case. In terms of applications, we determine the optimal kernel smoothness and the number of kernels in kernel-based density estimation of both synthetic and real-world examples and derive a strategy for differential log likelihood learning, which we apply to density estimation of a wide selection of real-world data sets.

2 Differential Log Likelihood

The expected log likelihood is biased by the differential ("joint") entropy (Shannon, 1948) of the function to be estimated. When subtracting this bias, we obtain the Kullback-Leibler distance or divergence (KL) (Kullback, 1959; for review, see Soofi, 2000):

KL(p‖p̃) = −∫_V p(v) log [p̃(v|Θ)/p(v)] dv
         = −∫_V p(v) log p̃(v|Θ) dv + ∫_V p(v) log p(v) dv,    (2.1)
that is, a nonnegative quantity that will equal zero only when the true and estimated densities are equal, so that it can be regarded as an (average) error function, weighting the discrepancy between the true and estimated densities logarithmically. Given a sample S, the entropy can be estimated from the density estimate p̃ (plug-in estimates; Ahmad and Lin, 1976), or from sample spacings in the one-dimensional case (Hall, 1984), or from the nearest-neighbor method in the d-dimensional case (Kozachenko & Leonenko, 1987) (for review, see Beirlant, Dudewicz, Györfi, & van der Meulen, 1997). An alternative approach to estimate entropy or KL is by means of the Edgeworth expansion (Comon, 1994; Van Hulle, 2005a), albeit that some reservations should be made about the accuracy of this polynomial density expansion (Friedman, 1987; Huber, 1985).
Consider the following reasoning. We start with the negative log likelihood and subtract from it a bias term, namely, the model estimate's differential entropy:

LL(p‖p̃) = −∫_V p(v) log p̃(v|Θ) dv + ∫_V p̃(v|Θ) log p̃(v|Θ) dv,    (2.2)
and call it the differential log likelihood. It is clear that when p(.) = p̃(.) everywhere, LL = 0. The advantage with respect to KL is that as many samples as desired can be generated from p̃ to estimate its differential entropy. There is an intimate connection with the information criteria used for model selection (Akaike, 1973; Ishiguro et al., 1997; Stoica & Selén, 2004). Consider the following expression:

E_{v∈p} [ (1/N) Σ_i log p̃(v_i|Θ(v)) − E_{v′∈p̃} [log p̃(v′|Θ(v))] ].    (2.3)
The terms between the square brackets correspond to LL(p‖p̃), given the sample S, and form an unbiased estimate of the previous expression; if the expectation in the second term between the square brackets were taken for v ∈ p, we would obtain the classical bias correction term for the estimated log likelihood used in information criteria. The difference between the two formulations is that LL will equal zero in the optimal case, whereas the bias correction term is a nonnegative quantity. In fact, since LL = 0 in the optimal case, the estimated entropy term it contains could be regarded as a bias correction of the estimated log likelihood. In the general case, we have the following connection with KL:

LL(p‖p̃) = KL(p‖p̃) − H(p̃) + H(p).    (2.4)

For the connection between distortion error (vector quantization), log likelihood, and Kullback-Leibler divergence in the case of gaussian mixtures, we refer to Van Hulle (2005b). It can be verified that LL is nonsymmetric: LL(p‖p̃) ≠ LL(p̃‖p). It is, however, in general different from KL since for LL(p‖φ_p), with φ_p the d-dimensional multivariate gaussian density with the same mean and covariance matrix, we can write that

LL(p‖φ_p) = KL(p‖φ_p) − H(φ_p) + H(p) = 0,    (2.5)

since KL(p‖φ_p) = H(φ_p) − H(p) (i.e., the negentropy). Hence, LL(p‖φ_p) will be zero when the maximum likelihood gaussian approximation is achieved.
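The definition in equation 2.2 suggests a direct Monte Carlo estimator: estimate the first integral with the sample S and the entropy term with samples drawn from p̃ itself. Below is a minimal sketch for a gaussian mixture p̃, using scipy; the mixture representation and parameter names are our own convention:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_logpdf(V, means, covs, weights):
    # log p̃(v|Θ) for a gaussian mixture with the given parameters.
    dens = sum(w * multivariate_normal.pdf(V, mean=m, cov=c)
               for m, c, w in zip(means, covs, weights))
    return np.log(dens)

def differential_log_likelihood(S, means, covs, weights,
                                n_entropy_samples=10000, rng=None):
    rng = rng or np.random.default_rng()
    # First term of equation 2.2: -(1/N) sum_i log p̃(v_i|Θ) over S.
    neg_ll = -np.mean(mixture_logpdf(S, means, covs, weights))
    # Second term: -H(p̃), estimated from samples generated by p̃ itself
    # (as many as desired, per the text).
    comp = rng.choice(len(weights), size=n_entropy_samples, p=weights)
    draws = np.stack([rng.multivariate_normal(means[k], covs[k])
                      for k in comp])
    neg_entropy = np.mean(mixture_logpdf(draws, means, covs, weights))
    return neg_ll + neg_entropy   # fluctuates around zero for a good model
```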
2.1 Gaussian Input, Gaussian Kernel. Assume that the true distribution is a d-dimensional gaussian, φ_p, with parameters the mean vector µ_p and covariance matrix Σ_p, and that we use a single gaussian kernel, φ_p̃, with parameters µ_p̃ and Σ_p̃, for modeling the true distribution. It can then be shown, after some algebraic manipulations, that

LL(φ_p‖φ_p̃) = (1/2) Tr[Σ_p̃^{-1} Σ_p] + (1/2) (µ_p − µ_p̃)^T Σ_p̃^{-1} (µ_p − µ_p̃) − d/2.    (2.6)

Evidently, when we let the parameters µ_p̃ and Σ_p̃ of our gaussian kernel match those of the distribution, µ_p and Σ_p, we obtain LL(φ_p‖φ_p̃) = 0 (the validity is discussed in appendix A). When estimating LL from samples taken from the distributions φ_p and φ_p̃, we can show that LL is asymptotically χ^2 distributed (see appendix A) and that it is an asymptotically unbiased estimator when we take for µ_p̃ and Σ_p̃ the maximum likelihood estimates of µ_p and Σ_p.

2.2 General Input, Gaussian Kernel. Consider now the case of a general input density p(v) and a gaussian kernel φ(v). We can approximately write that (see appendix B)

LL(p‖φ) ≈ LL(φ_p‖φ),    (2.7)

and we know that when φ = φ_p, thus when the mean vectors and covariance matrices match, LL(p‖φ) = 0 (see equation 2.5). Hence, also for this case, the metric is unbiased. Similarly, we can compute the inverse relation (see appendix C):

LL(φ‖p) ≈ LL(φ‖φ_p).    (2.8)
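The closed form in equation 2.6 is easy to check numerically; the sketch below (our own illustration) evaluates it and confirms that it vanishes when the kernel parameters match the distribution:

```python
import numpy as np

def ll_gauss_gauss(mu_p, S_p, mu_q, S_q):
    # Closed-form LL(φ_p || φ_p̃) of equation 2.6.
    d = len(mu_p)
    Sq_inv = np.linalg.inv(S_q)
    diff = mu_p - mu_q
    return (0.5 * np.trace(Sq_inv @ S_p)
            + 0.5 * diff @ Sq_inv @ diff - d / 2.0)

mu_p, S_p = np.zeros(2), np.eye(2)
mu_q, S_q = np.array([0.5, 0.0]), 1.5 * np.eye(2)
print(ll_gauss_gauss(mu_p, S_p, mu_q, S_q))   # > 0 for mismatched parameters
print(ll_gauss_gauss(mu_p, S_p, mu_p, S_p))   # exactly 0 when they match
```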
2.3 General Input, General Model. There is no generally applicable approximation for this case, but when the mean vectors and covariance matrices of p and p̃ are equal, we have (see appendix D)

LL(p‖p̃) ≈ (1/12) Σ_{i=1}^{d} (κ^{i,i,i} − κ̃^{i,i,i})^2 + (3/12) Σ_{i,j=1, i≠j}^{d} (κ^{i,i,j} − κ̃^{i,i,j})^2 + (1/6) Σ_{i,j,k=1, i<j<k}^{d} (κ^{i,j,k} − κ̃^{i,j,k})^2.    (2.9)

... > 0. Among these patterns, identify the one with min(max_{r_j∈R} p_outer(x|r_j)), and put it
into R. p_outer(x|r_j) is defined by

p_outer(x|r_j) = 1 − d(x, r_j)/Rad_outer(r_j)  if d(x, r_j) ≤ Rad_outer(r_j);  0 otherwise,

where Rad_outer(r_j) is the distance between r_j and the 2L-th pattern nearest to it.

Step 3. When R consists of K + 1 representatives, delete the worst representative—the one having the largest RE(r_j, X) or WRE(r_j, X).

Step 4. Calculate RE(R, X) or WRE(R, X) for the newly constructed R.

Step 5. Repeat steps 2 to 4 until RE(R, X) or WRE(R, X) cannot be reduced for five consecutive iterations.

Below, an example is given to demonstrate the procedure of WREDR. In this example, WREDR identifies 6 representatives from 60 original patterns. In Figure 2, certain intermediate results are illustrated. The original patterns and representatives are marked by "." and "*", respectively. The disc around a representative roughly illustrates the region covered by that representative. Figure 2a shows that 6 patterns are selected into R0 during initialization. The representative ability of these 6 patterns is poor because the area at the bottom right is uncovered, and the regions covered by r1 and r5 overlap each other. The forward searching process tackles the problem of overlapping. As a result, r1 is marked as representative, and r5 is eliminated. Also, r6 is eliminated because it is redundant to r2 and r3. Obviously, in this forward searching stage, qualified representatives are selected from R0, and redundant ones are deleted. However, the area uncovered by R0—for instance, the bottom right in this example—has not been explored yet. This task will be fulfilled in the following WRE-based stepwise searching process, in which new representatives r5 and r6, illustrated in Figure 2b, are determined consecutively in the order of r5 to r6. Apparently, in this course, the bottom right of the given data space is gradually explored. After r5 and r6 are added, the size of R is 6. This is the desired value. Thus, in the following, the WRE-based process substitutes the "worst" representative with a new one. In this example, r2 is determined as the "worst" and replaced by r_new. As suggested in Figure 2c, the region covered by r2 has much overlap with the regions covered by other representatives. In this sense, deleting r2 is reasonable. Figure 2c also shows that substituting r2 with r_new improves the representative ability of R because more data patterns are covered, and the overlap between representatives is further reduced.

4.2.2 Remarks. In the first stage of REDR or WREDR, the density of the patterns in R0 is analyzed, whereas the density of only one data pattern is required at each iteration in the second stage.
Figure 2: Demonstration of the proposed data reduction procedure. The original data patterns are marked with "." and the representatives with "*". (Three panels: (a) the initial representatives r1–r6; (b) the representative set after the forward and stepwise searching, with new representatives r5 and r6; (c) the set after the worst representative, r2, is replaced by r_new.)
Assume that there are k_0 patterns selected into R0 and that the second stage runs n_i iterations. The computational complexity of REDR is then O(N(n_i + k_0)), where N is the size of X. Recall that the computational complexity of the multiscale method is O(N^2). Generally, k_0 is much less than N, and the proposed method converges rapidly, as suggested by our experimental results. Thus, we have N(n_i + k_0) ≪ N^2. That is, the proposed methods have a substantially lower computational requirement compared to the multiscale method. Also, the proposed methods have a significantly lower memory requirement compared with the multiscale method because they do not need to remember a large number of distance values.

In the above process, the condition max_{r_j∈R} p_outer(x|r_j) > 0 in step 2 guarantees a stable data reduction process. Apparently, the patterns with max_{r_j∈R} p_outer(x|r_j) = 0 must still be uncovered by R. Without a priori knowledge of the density distribution of these patterns, determining a representative from them is not recommended. With the constraint max_{r_j∈R} p_outer(x|r_j) > 0, the newly determined representative must be around the boundary of the area that has already been covered by R. In this way, the proposed method can gradually and reliably explore the whole data space. This can be seen in the above example, in which r5 and r6 are marked as representatives in turn.

In a supervised task, the proposed methods are conducted in a stratified way. Given a value of L in step 0, which determines the ratio between the sizes of the reduced data and the original data, the pattern set of each class is reduced separately. The collection of these reduced data sets is the final data reduction result. Apart from this stratified approach, a new strategy is developed to filter out outliers. An outlier is an object that behaves in a different way from the others. For instance, in a classification data set, an outlier generally exhibits a different class label from those having similar attributes. Theoretically, outliers add noise to supervised learning algorithms and degrade their performance (Han & Kamber, 2001). It is always desirable to eliminate outliers. In our proposed outlier-filtering strategy, a representative candidate is checked before being added to R, as described in step 1 or 2. Only those that are not outliers can be placed into R. To determine if an object is an outlier, the area around that object is considered. Assume that A_o is the area around a representative candidate r_o. For any class (say, c_k), the conditional probability p(A_o|c_k) can be calculated using

p(A_o|c_k) = (1/N) Σ_{x∈X} p(x|r_o) p(x|c_k) = (1/N) Σ_{x ∈ class c_k} p(x|r_o),
Table 2: Comparisons Between WREDR and WREDR-FO.

Number of Outliers    WREDR          WREDR-FO    No Data Reduction
10                    0.98 ± 0.02    1.00 ± 0    0.97
50                    0.89 ± 0.05    1.00 ± 0    0.90
80                    0.81 ± 0.05    0.99 ± 0    0.83
120                   0.79 ± 0.07    0.99 ± 0    0.78

Note: These results highlight the contributions of the outlier-filtering strategy.
where N is the number of patterns in X. The class having the maximum p(A_o|c_k) is the dominant class of A_o. Apparently, if the class of r_o is consistent with the dominant class of A_o, r_o is not an outlier; otherwise, r_o is an outlier and cannot be included in R. In the proposed method, the computational burden of p(A_o|c_k) is very small because p(x|r_o) has already been estimated during the calculation of RE/WRE. Below, the proposed algorithm, together with this outlier-filtering strategy, is called REDR-FO or WREDR-FO. To highlight the benefits of this outlier-filtering strategy, WREDR-FO is compared to WREDR. A classification data set was generated from two normal distributions comprising two classes {0, 1}:

Class 1 (class = 0, 500 data points): X ∼ N([1, 1]^T, diag(0.3, 0.3)),
Class 2 (class = 1, 500 data points): X ∼ N([−1, −1]^T, diag(0.3, 0.3)).
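A compact sketch of this setup and the dominant-class test follows. It is our own illustration: the kernel p(x|r_o) is taken to be the p_outer form defined earlier, and the neighborhood size (standing in for 2L) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([1, 1], 0.3 * np.eye(2), 500),
               rng.multivariate_normal([-1, -1], 0.3 * np.eye(2), 500)])
labels = np.repeat([0, 1], 500)

def p_outer(X, r, n_neighbors=20):
    # Triangular kernel around candidate r: 1 - d/Rad inside the radius set
    # by the n_neighbors-th nearest pattern (2L in the text), zero outside.
    d = np.linalg.norm(X - r, axis=1)
    rad = np.sort(d)[n_neighbors]
    return np.where(d <= rad, 1.0 - d / rad, 0.0)

def is_outlier(X, labels, r_o, label_o):
    # p(A_o|c_k): weight each pattern of class c_k by p(x|r_o); r_o is an
    # outlier if its own class is not the dominant class of A_o.
    w = p_outer(X, r_o)
    scores = [w[labels == k].sum() for k in np.unique(labels)]
    return np.argmax(scores) != label_o

print(is_outlier(X, labels, X[0], 1))  # class-0 point mislabeled as 1 -> True
```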
A thousand patterns of this data set were evenly split into two groups of equal size for training and testing. Certain patterns were randomly chosen in the training data and were mislabeled to generate outliers. Classification accuracy is used to evaluate the quality of a reduced data set. Obviously, high classification accuracy indicates a good reduced data set. In this section, the kNN rule with k = 1 is used as the evaluation classifier. WREDR and WREDR-FO are required to reduce the original 500 training patterns to 50. As the performance of these data reduction methods may be affected by different initializations, statistical results of 10 independent trials are presented. Table 2 lists the comparative results. It is noted that as the number of outliers increases, the contribution of the outlier-filtering strategy becomes more significant. It is worth noting that because of the proposed outlier-filtering strategy, WREDR-FO is able to improve the final classification performance.
5 Experimental Results

Here, reduction ratio denotes the ratio between the sizes of the reduced data and the given data. For a given training data set, testing data set, and reduction ratio, a data reduction method is first used to reduce the training data set. Then, based on the reduced data set, certain density estimation and classification models are built. The employed data reduction method is evaluated according to the performance of these models on the testing data set. Throughout our investigation, the following strategies are adopted unless stated otherwise:

1. Each input variable is preprocessed to have zero mean and unit variance through a linear transformation.
2. In WREDR or REDR, the size of R0 is the same as that of R, the final data reduction result.
3. In the sampling-type schemes, SOM, and the proposed methods, performance is affected by initialization. Thus, in each case, these methods are independently run 10 times. The statistical results over the 10 trials are presented here.
4. Unlike other methods, Mitra's multiscale method (Mitra et al., 2002) delivers only one set of results in each case. The results delivered by this method are by no means statistical ones. Also, an exact desired reduction ratio, 0.1 or 0.3, may not be obtained; trials with different values of k may deliver only a close but not the exact reduction ratio. Thus, the reduction ratios provided later in this article are simply the closest ones. For a given reduction ratio, we choose the trial in which the actual reduction ratio is the closest to that given value and present the results of that trial here.
5. Our investigations are conducted using Matlab 6.1 on a Sun Ultra Enterprise 10000 workstation with 100 MHz clock frequency and 4 GB memory.

5.1 Density Estimation. The study presented in this section was conducted on synthetic data, in which the real density functions are known. More important, a large number of testing data patterns can be generated to guarantee the accuracy of the evaluation results. Five data reduction methods—the random sampling scheme, SOM, the density-based multiscale method, REDR, and WREDR—are compared from the perspectives of efficiency and effectiveness. The running time is recorded for efficiency evaluation. The effectiveness is measured using the difference between the real density function g(x) and a density estimation function f(x) obtained based on the reduced data. A small density difference indicates good reduced data.
In this study, g(x) is known, whereas f(x) is modeled with a Parzen window (Parzen, 1962). Given an M-dimensional pattern set $Q = \{q_1, q_2, q_3, \ldots, q_{N_q}\}$, the Parzen window estimator is

$$p(x) = \frac{1}{N_q} \sum_{i=1}^{N_q} p(x \mid q_i) = \frac{1}{N_q} \sum_{i=1}^{N_q} \kappa(x - q_i, h_i), \qquad (5.1)$$

where $\kappa(\cdot)$ is the kernel function and $h_i$ is the parameter that determines the width of the window. With proper selection of $\kappa(\cdot)$ and $h_i$, the Parzen window estimator can converge to the real probability density (Parzen, 1962). In this study, $\kappa$ is a gaussian function. The M-dimensional gaussian function is

$$\kappa(x - q_i, h_i) = G(x - q_i, h_i) = \frac{1}{(2\pi h_i^2)^{M/2}} \exp\left(-\frac{1}{2h_i^2}(x - q_i)(x - q_i)^T\right).$$

For the pattern $q_i$, the window width $h_i$ is determined as $h_i = d(q_i, q_j)$, where $d(q_i, q_j)$ is the Euclidean distance, that is, $d(q_i, q_j) = \sqrt{(q_i - q_j)(q_i - q_j)^T}$, and $q_j$ is the pattern that is the kth nearest to $q_i$. We tried two settings, k = 2 and k = 3; the latter performed better, so we set k to 3.

The difference between g(x), the real density function, and f(x), the density estimate built on reduced data using equation 5.1, is measured with two indices: the absolute distance $D_{ab}$ and the Kullback-Leibler distance (divergence) $D_{KL}$, defined respectively as

$$D_{ab}(f(x), g(x)) = \int_x |f(x) - g(x)| \, dx$$

and

$$D_{KL}(f(x), g(x)) = \int_x f(x) \log \frac{f(x)}{g(x)} \, dx.$$

The integrals in these equations are calculated numerically. After a large set of patterns, TX, is evenly sampled in the given data space, $D_{ab}$ and $D_{KL}$ are approximated on TX by

$$D_{ab}(f(x), g(x)) \approx \sum_{tx_i \in TX} |f(tx_i) - g(tx_i)| \, \Delta tx_i,$$

$$D_{KL}(f(x), g(x)) \approx \sum_{tx_i \in TX} f(tx_i) \log \frac{f(tx_i)}{g(tx_i)} \, \Delta tx_i.$$
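Both the estimator of equation 5.1 and the two grid approximations can be sketched directly; the function names below are ours:

```python
import numpy as np

def parzen_density(Q, X, k=3):
    """Parzen window estimate (eq. 5.1) with gaussian kernels, where the width
    h_i is the Euclidean distance from q_i to its kth nearest pattern."""
    M = Q.shape[1]
    # Pairwise distances among the patterns in Q, used to set the window widths.
    D = np.sqrt(((Q[:, None, :] - Q[None, :, :]) ** 2).sum(-1))
    h = np.sort(D, axis=1)[:, k]              # column 0 is the point itself
    # Evaluate the mixture of gaussians at every evaluation point in X.
    d2 = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * h ** 2)) / ((2 * np.pi * h ** 2) ** (M / 2))
    return G.mean(axis=1)

def density_differences(f, g, cell_volume):
    """Grid approximations of D_ab and D_KL from densities sampled on TX."""
    d_ab = np.sum(np.abs(f - g)) * cell_volume
    eps = 1e-12                               # guard against log(0) on the grid
    d_kl = np.sum(f * np.log((f + eps) / (g + eps))) * cell_volume
    return d_ab, d_kl
```

For Data1 through Data3 below, for example, TX is a 41 × 41 grid (1681 patterns) on [−3, 3]², so cell_volume = (6/40)².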
Table 3: Data Sets Used in Density Estimation Application.

Name of Data | Distribution of Data | TX
Data1 | 600 from N([0, 0], I₂). | 1681 patterns in [3, 3] ∼ [−3, −3]
Data2 | 800 from N([0, 0], 0.5I₂), 800 from N([1, 1], 0.3I₂), 800 from N([−1, −1], 0.3I₂). | 1681 patterns in [3, 3] ∼ [−3, −3]
Data3 | 500 from N([0, 0], 0.5I₂), 500 from N([1, 1], 0.3I₂), 500 from N([−1, −1], 0.3I₂), 500 from N([−1, 1], 0.3I₂), 500 from N([1, −1], 0.3I₂). | 1681 patterns in [3, 3] ∼ [−3, −3]
Data4 | 800 from N(0, 0.5), 800 from N(−0.5, 1), 800 from N(1, 0.3). | 1201 patterns in [3] ∼ [−3]
In order to guarantee a precise approximation, the sampling range for TX is chosen to cover virtually the whole region where the probability density is greater than zero; that is, for a given probability function g(x), TX covers almost all areas where g(x) is nonzero. In this case, we have $1 = \int_x g(x)\,dx \approx \sum_{tx_i \in TX} g(tx_i)\,\Delta tx_i$. The data sets used in this section are shown in Table 3. These data sets were all generated in low-dimensional domains because high dimensionality is well known to reduce the reliability of the Parzen window estimator.

First, the methods are compared in terms of $D_{ab}$ and $D_{KL}$. The comparative results are presented in Figure 3, with a reduction ratio of 0.05, and Figure 4, with a reduction ratio of 0.1. The results on different examples and with different reduction ratios lead to similar conclusions. From the perspective of the quality of data reduction results, the density-based methods (the multiscale method, REDR, and WREDR) deliver similar performance, and they outperform SOM and the random sampling scheme. In Table 4, the data reduction methods are compared in terms of computational efficiency. Both REDR and WREDR are more efficient than the multiscale method because they avoid the exhaustive computation of pattern-pair distances. To sum up, among the compared methods, REDR and WREDR are clearly the best data reduction methods: they deliver the best or nearly the best data reduction results with greater computational efficiency. Besides, the very small deviations illustrated in Figures 3 and 4 indicate that initialization has little effect on the performance of either REDR or WREDR. Also, WREDR outperforms REDR in most cases, which is due to the weighting strategy of WRE. Furthermore, REDR and WREDR are compared through t-tests in which the p-values reflect the significance of the difference between the results of REDR and
Figure 3: Comparisons on effectiveness in terms of D_ab and D_KL for a reduction ratio of 0.05. (a) Results on Data1. (b) Results on Data2. (c) Results on Data3. (d) Results on Data4. In each panel, from left to right, the bars represent the results of the random sampling scheme, SOM, the multiscale method, REDR, and WREDR, respectively.
Figure 4: Comparisons on effectiveness in terms of D_ab and D_KL for a reduction ratio of 0.1. (a) Results on Data1. (b) Results on Data2. (c) Results on Data3. (d) Results on Data4. In each panel, from left to right, the bars represent the results of the random sampling scheme, SOM, the multiscale method, REDR, and WREDR, respectively.
Table 4: Comparisons in Terms of Running Time (in seconds).

Name of Data Set | SOM | Multiscale Method | REDR | WREDR
Reduction ratio = 0.05:
Data1 | 1 | 2 | 1 | 1
Data2 | 15 | 271 | 59 | 64
Data3 | 27 | 452 | 79 | 82
Data4 | 7 | 177 | 21 | 24
Reduction ratio = 0.1:
Data1 | 2 | 5 | 3 | 3
Data2 | 19 | 364 | 125 | 109
Data3 | 23 | 520 | 147 | 152
Data4 | 13 | 209 | 26 | 30
Table 5: Comparisons Between WREDR and REDR in Terms of D_ab.

Name of Data Set | REDR Average (ratio 0.05) | WREDR Average (ratio 0.05) | p-Value (ratio 0.05) | REDR Average (ratio 0.1) | WREDR Average (ratio 0.1) | p-Value (ratio 0.1)
Data1 | 0.235 | 0.235 | 0.61 | 0.161 | 0.161 | 0.89
Data2 | 0.193 | 0.184 | 0.35 | 0.280 | 0.273 | 0.34
Data3 | 0.142 | 0.138 | 0.16 | 0.128 | 0.130 | 0.76
Data4 | 0.195 | 0.184 | 0.34 | 0.305 | 0.299 | 0.45
WREDR. A small p-value indicates a statistically significant difference. In Table 5, the comparative results of REDR and WREDR are presented. These results show that the advantage of WRE over RE becomes more pronounced as the reduction ratio decreases. This advantage also generally increases as the data distribution grows more complex, from the simple Data1 to the relatively complex Data3.

D_KL and D_ab are straightforward and accurate measures of the representation ability of a reduced data set. Using them as references, we evaluate the reliability of the proposed criteria RE and WRE. The values of RE (WRE), D_KL, and D_ab in each iteration of the second stage of REDR (WREDR) are recorded, and the variations of D_KL and D_ab are compared with those of RE (WRE). Figures 5 and 6 illustrate typical results on two data sets. RE and WRE vary in a similar fashion with D_KL and D_ab; we can thus assert that RE and WRE are reliable enough to measure the representation ability of a reduced data set.
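The t-test comparison between REDR and WREDR can be reproduced as follows; the article does not state the exact test variant, and the per-trial values below are hypothetical placeholders:

```python
import numpy as np
from scipy import stats

# Hypothetical per-trial D_ab values over 10 independent trials of each method.
redr_dab  = np.array([0.20, 0.19, 0.19, 0.20, 0.19, 0.20, 0.19, 0.19, 0.20, 0.19])
wredr_dab = np.array([0.19, 0.18, 0.18, 0.19, 0.18, 0.19, 0.18, 0.18, 0.19, 0.18])

# Two-sample t-test; a small p-value indicates a significant difference.
t_stat, p_value = stats.ttest_ind(redr_dab, wredr_dab)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```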
Figure 5: Typical comparison of the variation of RE/WRE with the density errors (D_ab and D_KL). These values are obtained in the second stage of REDR/WREDR on Data3 with a reduction ratio of 0.05. For clear illustration, the values of WRE are multiplied by 100. Both RE and WRE vary in a similar fashion with the density errors, verifying that RE and WRE are reliable for evaluating data reduction effectiveness.
5.2 Classification. In this section, five data reduction methods are compared: the stratified sampling scheme, the supervised SOM, the multiscale method, WREDR, and WREDR-FO. The results of the RE-based method are not presented here because they are similar to those of the WRE-based one. The stratified sampling scheme and the supervised SOM treat a classification data set in the same way as WREDR and WREDR-FO: with a predetermined reduction ratio, the pattern subsets of the different classes are reduced separately, and the final data reduction result is the collection of the results on all classes (a sketch of this class-wise wrapper follows below). The six data sets used in this section are described in Table 6. The synthetic data, detailed in section 4.2.2, contain 80 outliers. To evaluate a reduced data set, several popular classifiers are built on it; the reduced set is then judged by the results of these classifiers on the testing data. A high classification accuracy indicates a good reduced data set.
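The class-wise wrapper mentioned above can be sketched as follows; reduce_fn is again a placeholder for any single-class reduction routine:

```python
import numpy as np

def reduce_per_class(X, y, ratio, reduce_fn):
    """Reduce the pattern subset of each class separately and pool the results."""
    keep_X, keep_y = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        n_keep = max(1, int(round(ratio * len(Xc))))   # per-class budget
        Rc = reduce_fn(Xc, n_keep)
        keep_X.append(Rc)
        keep_y.append(np.full(len(Rc), c))
    return np.vstack(keep_X), np.concatenate(keep_y)
```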
Figure 6: Typical comparison of the variation of RE/WRE with the density errors (D_ab and D_KL). These results are obtained in the second stage of REDR/WREDR on Data3 with a reduction ratio of 0.05.
The classifiers used are the kNN rule with k = 1 and the multilayer perceptron (MLP) (Haykin, 1999). The MLP is provided in the Netlab toolbox (http://www.ncrg.aston.ac.uk/netlab). Throughout this investigation, six hidden neurons are used, and the number of output neurons is set to the number of classes so that each class corresponds to one output neuron. Also, in each example, reference classification models are constructed on the entire training data set; the results of those models on the testing data are presented in Table 7. Two reduction ratios, 0.1 and 0.3, are investigated in this study. Table 8 lists the comparative results.
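As a rough modern equivalent of this setup (the original used the Netlab MLP), the two evaluation classifiers might be instantiated with scikit-learn; all hyperparameters beyond those stated above are library defaults:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# kNN rule with k = 1, as used throughout the classification experiments.
knn = KNeighborsClassifier(n_neighbors=1)

# MLP with six hidden neurons; scikit-learn sizes the output layer from the
# labels, mirroring one output neuron per class.
mlp = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000)
```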
Table 6: Data Sets Used in Classification Application.

Name of Data Set | Training Samples | Testing Samples | Features | Classes
Synthetic Data | 500 | 500 | 2 | 2
MUSK | 3000 | 3598 | 166 | 2
Pima Indian Diabetes | 500 | 268 | 8 | 2
Spambase | 2000 | 2601 | 58 | 2
Statlog Image Segmentation | 4435 | 2000 | 36 | 6
Forest Covertype* | 50,000 | 35,871 | 54 | 5

*The original Forest Covertype data set has seven classes and more than 580,000 patterns. Under our computer environment, it is very hard to tackle the whole data set; thus, the patterns belonging to class 1 and class 2 are omitted in our study.
Table 7: Classification Accuracy of the Models Built on the Training Data Sets.

Name of Data Set | kNN | MLP
Synthetic Data | 0.83 | 0.99
MUSK | 0.94 | 0.97
Pima | 0.69 | 0.73
Spambase | 0.90 | 0.92
Image Segmentation | 0.89 | 0.85
Forest Covertype | 0.95 | 0.85
In the example of image segmentation, the maximal reduction ratio of the multiscale method is about 0.17, much less than 0.3; thus, in this example, there is no result for the multiscale method at a reduction ratio of 0.3. The presented results clearly indicate the advantage of WREDR-FO, which is due to the contributions of the criterion WRE and the outlier-filtering strategy. Also, referring to the results presented in Table 7, WREDR-FO causes little loss in classification accuracy. In the examples of the synthetic classification and the Pima Indian Diabetes classification, WREDR-FO can even enhance the final classification performance, in contrast to the general expectation that a reduced data set yields degraded classification results (Provost & Kolluri, 1999). Clearly, the proposed outlier-filtering strategy can compensate, to a certain degree, for the classification loss caused by data reduction. In addition, WREDR-FO provides more classification enhancement to kNN than to MLP, mainly because kNN may be more sensitive to noise than MLP (Khotanzad & Lu, 1990). In Table 9, the different methods are compared in terms of running time. WREDR and WREDR-FO are shown to be much more efficient than the multiscale method. Also, the computational effort required by the proposed outlier-filtering strategy is insignificant, since WREDR-FO is almost as efficient as WREDR.
Table 8: Comparisons in Terms of Classification Accuracy.

Reduction ratio = 0.1:

Name of Data Set | Stratified Sampling (kNN / MLP) | Supervised SOM (kNN / MLP) | Multiscale Method (kNN / MLP) | WREDR (kNN / MLP) | WREDR-FO (kNN / MLP)
Synthetic Data | 0.83 ± 0.07 / 0.93 ± 0.06 | 0.92 ± 0.07 / 0.89 ± 0.05 | 0.86 / 0.93 | 0.91 ± 0.00 / 0.98 ± 0.00 | 0.99 ± 0.02 / 1.00 ± 0.00
Musk | 0.89 ± 0.02 / 0.90 ± 0.02 | 0.94 ± 0.01 / 0.85 ± 0.00 | 0.92 / 0.93 | 0.91 ± 0.01 / 0.92 ± 0.01 | 0.94 ± 0.01 / 0.93 ± 0.01
Pima | 0.66 ± 0.03 / 0.68 ± 0.03 | 0.70 ± 0.02 / 0.71 ± 0.03 | 0.68 / 0.73 | 0.68 ± 0.02 / 0.71 ± 0.03 | 0.72 ± 0.02 / 0.75 ± 0.02
Spambase | 0.82 ± 0.01 / 0.88 ± 0.02 | 0.83 ± 0.01 / 0.84 ± 0.01 | 0.83 / 0.88 | 0.84 ± 0.01 / 0.89 ± 0.01 | 0.86 ± 0.00 / 0.90 ± 0.01
Image Segmentation | 0.86 ± 0.01 / 0.81 ± 0.02 | 0.86 ± 0.01 / 0.81 ± 0.01 | 0.87 / 0.82 | 0.86 ± 0.01 / 0.81 ± 0.01 | 0.87 ± 0.01 / 0.84 ± 0.01
Forest Covertype | 0.80 ± 0.02 / 0.54 ± 0.02 | 0.82 ± 0.02 / 0.66 ± 0.02 | 0.88 / 0.78 | 0.90 ± 0.00 / 0.77 ± 0.00 | 0.91 ± 0.01 / 0.80 ± 0.00

Reduction ratio = 0.3:

Name of Data Set | Stratified Sampling (kNN / MLP) | Supervised SOM (kNN / MLP) | Multiscale Method (kNN / MLP) | WREDR (kNN / MLP) | WREDR-FO (kNN / MLP)
Synthetic Data | 0.82 ± 0.04 / 0.98 ± 0.03 | 0.91 ± 0.02 / 0.93 ± 0.01 | 0.80 / 0.98 | 0.91 ± 0.02 / 0.99 ± 0.00 | 0.98 ± 0.01 / 1.00 ± 0.00
Musk | 0.92 ± 0.01 / 0.94 ± 0.02 | 0.94 ± 0.00 / 0.86 ± 0.01 | 0.93 / 0.94 | 0.93 ± 0.00 / 0.94 ± 0.02 | 0.94 ± 0.00 / 0.95 ± 0.01
Pima | 0.67 ± 0.04 / 0.67 ± 0.04 | 0.70 ± 0.03 / 0.65 ± 0.00 | 0.67 / 0.68 | 0.69 ± 0.02 / 0.69 ± 0.02 | 0.72 ± 0.01 / 0.73 ± 0.02
Spambase | 0.84 ± 0.01 / 0.89 ± 0.01 | 0.85 ± 0.01 / 0.84 ± 0.00 | 0.87 / 0.90 | 0.85 ± 0.01 / 0.90 ± 0.00 | 0.88 ± 0.01 / 0.91 ± 0.00
Image Segmentation | 0.86 ± 0.01 / 0.82 ± 0.01 | 0.87 ± 0.01 / 0.83 ± 0.01 | — / — | 0.88 ± 0.01 / 0.83 | 0.88 ± 0.01 / 0.84 ± 0.01
Forest Covertype | 0.84 ± 0.01 / 0.70 ± 0.00 | 0.85 ± 0.00 / 0.71 ± 0.00 | 0.92 / 0.79 | 0.92 ± 0.00 / 0.84 ± 0.00 | 0.93 ± 0.00 / 0.84 ± 0.00

Notes: Each cell lists kNN and MLP accuracy as the mean and standard deviation over the 10 trials; the multiscale method delivers a single deterministic result. The multiscale method cannot reach a reduction ratio of 0.3 on the image segmentation data.
Table 9: Comparisons in Terms of Running Time (in seconds).

Name of Data Set | Supervised SOM | Multiscale Method | WREDR | WREDR-FO
Reduction ratio = 0.1:
Synthetic Data | 0.8 | 2.7 | 0.8 | 0.7
Musk | 153 | 1.1 × 10³ | 410 | 479
Pima | 1.2 | 3.3 | 1.4 | 1.6
Spambase | 15 | 285 | 35 | 35
Image Segmentation | 43 | 1.7 × 10³ | 99 | 107
Forest Covertype | 8.2 × 10³ | 1.2 × 10⁵ | 6.3 × 10³ | 7.8 × 10³
Reduction ratio = 0.3:
Synthetic Data | 1.8 | 7.3 | 4.0 | 3.4
Musk | 651 | 3.1 × 10³ | 1.5 × 10³ | 1.6 × 10³
Pima | 1.9 | 10.0 | 4.1 | 4.2
Spambase | 29 | 760 | 131 | 133
Image Segmentation | 73 | — | 414 | 459
Forest Covertype | 4.9 × 10⁴ | 1.9 × 10⁵ | 1.3 × 10⁴ | 1.9 × 10⁴
6 Conclusions

This article focuses on density-based data reduction schemes because this type of data reduction technique can be widely used for tackling data analysis tasks and building data analysis models. In the conventional density-based methods, the probability density of each data point has to be estimated or analyzed, which makes these methods computationally expensive on huge data sets. To address this shortcoming, we propose novel entropy-based data reduction criteria and a data reduction process based on them. Compared with the existing density-based methods, our proposed methods exhibit higher efficiency and similar effectiveness. We also design a strategy for outlier filtering; this simple and efficient strategy is immensely useful for classification tasks. Finally, the experimental results indicate that the proposed methods are robust to initialization.

Acknowledgments

We thank the anonymous reviewers for their useful comments. The work described in this article is fully supported by a grant from City University of Hong Kong (project no. 7001701-570).

References

Astrahan, M. M. (1970). Speech analysis by clustering, or the hyperphoneme method (Stanford AI Project Memo). Palo Alto, CA: Stanford University.
Bezdek, J. C., & Kuncheva, L. I. (2001). Nearest prototype classifier designs: An experimental study. Int. J. Intell. Syst., 16(12), 1445–1473.
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2), 245–271.
Catlett, J. (1991). Megainduction: Machine learning on very large databases. Unpublished doctoral dissertation, University of Sydney, Australia.
Chang, C. L. (1974). Finding prototypes for nearest neighbor classifiers. IEEE Trans. Computers, 23(11), 1179–1184.
Chow, T. W. S., & Wu, S. (2004). An online cellular probabilistic self-organizing map for static and dynamic data sets. IEEE Trans. on Circuits and Systems I, 51(4), 732–747.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dasarathy, B. V. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Computer Society Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Friedman, J. H. (1997). Data mining and statistics: What's the connection? Available online at http://www.salford-systems.com/doc/dm-stat.pdf.
Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Trans. on Inform. Theory, IT-18, 431–433.
Gersho, A., & Gray, R. M. (1992). Vector quantization and signal compression. Norwell, MA: Kluwer.
Gray, R. M. (1984). Vector quantization. IEEE ASSP Magazine, 1, 4–29.
Han, J. W., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Trans. on Information Theory, 14, 515–516.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Khotanzad, A., & Lu, J.-H. (1990). Classification of invariant image representations using a neural network. IEEE Transactions on Signal Processing, 38(6), 1028–1038.
Kohonen, T. (2001). Self-organizing maps. London: Springer.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Density-based multiscale data condensation. IEEE Trans. on PAMI, 24(6), 734–747.
Parzen, E. (1962). On the estimation of a probability density function and mode. Ann. Math. Statist., 33, 1064–1076.
Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Trans. Neural Networks, 4(2), 305–318.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 2, 131–169.
Quinlan, R. (1983). Learning efficient classification procedures and their application to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 463–482). Palo Alto, CA: Tioga.
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proc. 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann. Available online at www.cs.umass.edu/∼mccallum/papers/active-icm/101.ps.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197–227. Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York: Wiley. Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38, 257–286. Yang, Z. P., & Zwolinski, M. (2001). Mutual information theory for adaptive mixture models. IEEE Trans. on PAMI, 23(4), 396–403.
Received December 29, 2004; accepted June 27, 2005.
LETTER

Communicated by Heinrich Buelthoff

Receptive Field Structures for Recognition

Benjamin J. Balas
[email protected]
Pawan Sinha
[email protected]
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02142, U.S.A.
Localized operators, like Gabor wavelets and difference-of-gaussian filters, are considered useful tools for image representation. This is due to their ability to form a sparse code that can serve as a basis set for high-fidelity reconstruction of natural images. However, for many visual tasks, the more appropriate criterion of representational efficacy is recognition rather than reconstruction. It is unclear whether simple local features provide the stability necessary to subserve robust recognition of complex objects. In this article, we search the space of two-lobed differential operators for those that constitute a good representational code under recognition and discrimination criteria. We find that a novel operator, which we call the dissociated dipole, displays useful properties in this regard. We describe simple computational experiments to assess the merits of such dipoles relative to the more traditional local operators. The results suggest that nonlocal operators constitute a vocabulary that is stable across a range of image transformations.

Neural Computation 18, 497–520 (2006)   © 2006 Massachusetts Institute of Technology

1 Introduction

Information theory has become a valuable tool for understanding the functional significance of neural response properties. In particular, the idea that a goal of early sensory processing may be to efficiently encode natural stimuli has generated a large body of work describing the function of the human visual system in terms of redundancy reduction and maximum-entropy responses (Attneave, 1954; Barlow, 1961; Atick, 1992; Field, 1994). In the compound eye of the fly, for example, the contrast response function of a particular class of interneuron approximates the distribution of contrast levels found in natural scenes (Laughlin, 1981). This is the most efficient encoding of contrast fluctuations, meaning that from the point of view of information theory, these cells are optimally tuned to the statistics of their environment. In the context of the primate visual system, it has been proposed that the receptive fields of various cells may have the form they do for similar reasons. Olshausen and Field (1996, 1997) and Bell
and Sejnowski (1997) have demonstrated that the oriented edge-finding receptive fields found in early visual cortex (Hubel & Wiesel, 1959) may exist because they provide an encoding of natural scenes that maximizes information. Olshausen and Field were able to produce such filters by enforcing sparseness constraints on their encoding while ensuring that the representation allowed high-fidelity reconstruction of the original scene. Bell and Sejnowski enforced the statistical independence of the filters rather than working with an explicit sparseness criterion. These two approaches are actually equivalent, as demonstrated by Olshausen and Field. An aspect of Bell and Sejnowski's work that sets it apart, however, is their progression through constraints of different strength, such as principal component analysis (orthogonal basis), ZCA (zero-phase whitening filters), and finally independent component analysis (statistical independence). These different constraints lead to qualitatively different filters, such as checkerboard-like structures and center-surround functions, resembling the preferred stimuli of cells found in some parts of the visual pathway (V4 and the lateral geniculate nucleus (LGN), respectively).

The search for efficient codes has helped direct the efforts of researchers interested in explaining neural response properties in the visual system and fostered the study of ecological constraints in natural scenes (Simoncelli & Olshausen, 2001). However, there are many other tasks that the visual system must accomplish, for which the goal may be quite different from high-fidelity input reconstruction. The task of recognizing complex objects is an important case in point. A priori, we cannot assume that the same computations that result in sparse coding would also support robust recognition. Indeed, the resilience of human recognition performance to image degradations suggests that image measurements underlying recognition can survive significant reductions in reconstruction quality. Extracting measurements that are stable against ecologically relevant transformations of an object (lighting and pose, for example) is a constraint that might result in qualitatively different receptive field structures from the ones that support high-fidelity reconstruction.

In this letter, we examine the nature of receptive fields that emerge under a recognition- rather than reconstruction-based criterion. We develop and illustrate our ideas primarily in the context of human faces, although we expect that similar analyses can be conducted with other object classes as well. In this analysis, we note the emergence of a novel receptive field structure that we call the dissociated dipole. These dipoles (or "sticks") perform simple nonlocal luminance comparisons, allowing a region-based representation of image structure. We also compare the stability characteristics of various kinds of filters. These include model neurons with receptive field structures like those found by sparse coding constraints and sticks operators. Our goal is to eventually gain an understanding of how object representations that are
useful for recognition might be constructed from simple image measurements.
2 Experiment 1: Searching for Simple Features in the Domain of Faces

We begin by investigating what kinds of simple features can be used to discriminate among frontally viewed faces. The choice of a specific example class is primarily for ease of exposition. The ideas we develop are intended to be more generally applicable. (We substantiate this claim in experiment 2 when we describe computational experiments with arbitrary object classes.) Computationally, there are many methods for performing the face discrimination task with relatively high accuracy, especially if the faces are already well normalized for position, pose, and scale. Using nothing more than the Euclidean distance between faces to do nearest-neighbor classification in pixel space, one can obtain reasonably good results (∼65% with a 40-person classification task using the ORL database, compiled by AT&T Laboratories, Cambridge, UK). Using eigenfaces, one can improve this score somewhat by removing the contribution of higher-order eigenvectors, effectively "denoising" the face space. Further adjustments can be made as well, including the explicit modeling of intra- and interpersonal differences (Moghaddam, Jebara, & Pentland, 2000) and the use of more complex classifiers. On the other side of the spectrum from these global techniques are methods for rating facial similarity that rely on Gabor jets placed at fiducial points on a face (Wiskott, Fellous, Kruger, & von der Malsburg, 1997). These techniques use information at multiple spatial scales to produce a representation built up from local analyses; they are also quite successful. The overall performance of these systems depends on both the choice of representation and the back-end classification strategy. Since we focus exclusively on the former, our goal is not to produce a system for recognition that is superior to these approaches, but rather to explore the space of front-end feature choices. In other words, we look within a specific set of image measurements, bilobed differential operators, to see what spatial analyses lead to the best invariance across images of the same person. For our purposes, a bilobed differential operator is a feature type in which weighted luminance is first calculated over two image regions, and the final output of the operator is the signed difference between those two average values. In general, these two image regions need not be connected. Some examples of these filters are shown in Figure 1.

Conceptually, the design of our experiment is as follows. We exhaustively consider all possible bilobed differential operators (with the individual lobes modeled as rectangles for simplicity). We evaluate the discrimination performance of the corresponding measurements over a face database (discriminability refers to maximizing separation between individuals and minimizing distances within instances of the same person).
Figure 1: Examples of bilobed differential operators of the sort employed in experiment 1.
By sorting the large space of all operators using the criterion of discriminability, we can determine which are likely to constitute a good vocabulary for recognition.

We note that this approach differs substantially from efforts to find reliable features for face and object detection in cluttered backgrounds. For example, Ullman's work on features of intermediate complexity (IC) (Ullman, Vidal-Naquet, & Sali, 2002) demonstrates a method for learning class-diagnostic image fragments using mutual information. These IC features are both very likely to be present in an image when the object is present and unlikely to appear in the image background by chance. Other feature learning studies have concentrated on developing generative models for object recognition (Fei-Fei, Fergus, & Perona, 2003; Fergus, Perona, & Zisserman, 2003; Fei-Fei, Fergus, & Perona, 2004) in which various appearance densities are estimated for diagnostic image fragments. This allows recognition of an object in a cluttered scene to proceed in a Bayesian manner. These studies are unquestionably valuable to our understanding of object recognition. Our goals in this study are slightly different, however. First, we are interested in discovering what features support invariance to a particular object rather than a particular object class. It is for this reason that we do not attempt to segment the objects under consideration from a cluttered background. We envision segmentation proceeding via parts-based representations such as those described above. Indeed, it has recently been shown that simple contrast relationships can be used to detect objects in cluttered backgrounds with good accuracy (Sinha, 2002) and that good segmentation results can be obtained once one has recognized an object at the class level (Borenstein & Ullman, 2002, 2004). While it may be possible to learn diagnostic features of an individual that could be used for segmentation purposes, we believe that it is also plausible to consider segmentation as a process that proceeds prior to individuation (subordinate-level classification). Second, rather than looking for complex object parts that support invariance, we commence by considering very simple features. This means that we are not likely to find globally optimal features for individuation. Instead, we aim to determine what structural properties of potentially
low-level RFs contribute to recognition. In a sense, we are trying to understand what computations between the lowest and highest levels of visual processing lead to the impressive invariances for object transformations displayed by our visual system. Given that we are attempting to understand how recognition abilities are built up from low-level features, one might ask why we do not explicitly assume preprocessing by center-surround or wavelet filters. Indeed, others have pursued this line of thought (Edelman, 1993; Schneiderman & Kanade, 1998; Riesenhuber & Poggio, 1999), and such an analysis could help us understand how the outputs of early visual areas (such as the LGN and V1) serve as the basis for further computations that might support recognition. That said, we have chosen not to adopt this strategy, so that we can remain completely agnostic as to what basic computations are necessary first steps toward solving high-level problems. However, it is straightforward to extend this work to incorporate a front end comprising simple filters.

2.1 Stimuli. We use faces drawn from the ORL database (Samaria & Harter, 1994) for this initial experiment. The images are all 112 × 92 pixels in size, and there are 10 unique images of each of the 40 individuals included in the database. We chose to work with 21 randomly chosen individuals in the database, using the first 5 images of each person. The faces are imaged against uniform backdrops. Therefore, the task in our experiment is not to segregate faces from a cluttered background, but rather to individuate them.

2.2 Preprocessing

2.2.1 Block Averaging. Relaxing locality constraints results in a very large number of allowable square differential operators in a particular image. To reduce the size of our search space, we first down-sample all of the images in our database to a much smaller size of 11 × 9 pixels. Much of the information necessary for successful classification is present at this small size, as evidenced by the fact that the recognition performance of a simple nearest-neighbor classifier actually increases slightly (from 65% correct at full resolution to 70% using 8 × 8 pixel blocks) if we use these smaller images as input.

2.2.2 Constructing Difference Vectors. Our next step involves changing our recognition problem from a 21-class categorization task into a binary one. We do this by constructing difference vectors, which comprise two classes of intra- and interpersonal variation (Moghaddam et al., 2000). Briefly, we subtract one image from another; if the two images depicted the same individual, the difference vector captures intrapersonal variation, and if they depicted different individuals, it captures interpersonal variation. Given these two
sets, we look for spatial features that can distinguish between these two types of variation in facial appearance rather than attempting to find features that are always stable within each of the 21 categories. To assemble the difference vectors used in this experiment, we took all unique pair-wise differences between images that depicted the same person (intrapersonal set) and used the first image of each individual to construct a set of pair-wise differences that matched our first set in size (interpersonal set). The faces used to construct these difference vectors were not precisely registered; we attempted to find features robust to the variations in facial position and view that arise in this data set.

2.2.3 Constructing Integral Images. Now that we have two sets of low-resolution difference vectors, we introduce one last preprocessing step designed to speed up the execution of our search. Since the differential operators we are analyzing have rectangular lobes, we construct integral images (Viola & Jones, 2001) from each of our difference vectors. Integral images allow the fast computation of rectangular image features, reducing the process to a series of look-ups. The value of each pixel in the integral image created from a given stimulus represents the sum of all pixels above and to the left of that pixel in the original picture.
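A minimal sketch of this construction and of the resulting constant-time box sum, following Viola and Jones (2001); the zero-padded first row and column are a standard implementation convenience:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; the padding removes edge cases in look-ups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum over a rectangle via four look-ups into the integral image."""
    r0, c0, r1, c1 = top, left, top + height, left + width
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

The response of a bilobed operator is then just two box sums, each divided by its rectangle's area, followed by a subtraction.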
2.3 Feature Ranking. In our 11 × 9 images, there are a total (n) of 2970 unique box features. Given that we are interested in all possible differential operators, there are approximately 4.5 million spatial features (n²/2) for us to consider. To decide which of these features were best for recognition, we used A′ as our measure of discriminability (Green & Swets, 1966). A′ is a nonparametric measure of discriminability calculated by finding the area underneath an observer's ROC (receiver operating characteristic) curve. This curve is determined by plotting the number of "hits" and "false alarms" a given observer obtains when using a particular numerical threshold to judge the presence or absence of a signal. In this experiment, we treat each differential operator as one observer; the signals we wish to detect are the intrapersonal difference vectors. The response of each operator (mean value of pixels under the white rectangle minus mean value of pixels under the black rectangle) was calculated on each difference vector, and the labels associated with those vectors (intra- versus interpersonal variation) were sorted according to that numerical output. With the distribution of labeled difference vectors in hand for a particular feature, we could proceed to calculate the value of A′. We determined how many hits and false alarms there would be for a threshold placed at each possible location along the continuum of observed feature values, which allowed us to plot a discretized ROC curve for each feature. Calculating the area underneath this curve is straightforward, yielding the discriminability for that operator. A′ scores range from 0.5 to 1: a perfect separation of intra- and interpersonal difference vectors would lead to an A′ score of 1, while a complete enmeshing of the two classes would lead to a score of 0.5. In one simulation, the absolute value of each feature was taken (rectified results), and in another the original responses were unaltered (unrectified results). In this way, we could establish how instances of each class were distributed with respect to each spatial feature, both with and without information concerning the direction of brightness differences.

It is important to note at this stage that there is no reason to expect that any of the values we recover from our analysis of these spatial features will be particularly high. In boosting procedures, it is customary to use a cascade of relatively poor filters to construct a classifier capable of robust performance, meaning that even with a collection of individually weak features, one can obtain worthwhile results. In this experiment, we are interested only in the relative ranking of features, though it is possible that the set of features we obtain could be useful for recognition despite their poor abilities in isolation. We explicitly consider the utility of the features discovered here in a recognition paradigm in experiment 2.
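The area under the ROC curve can be computed equivalently via the rank-sum formulation; a sketch follows (the function name is ours, ties are not averaged, and the folding in the last line is our assumption that an operator separating the classes in either direction counts as discriminative):

```python
import numpy as np

def a_prime(responses, labels):
    """Area under the ROC curve for one operator.
    responses: operator output on each difference vector.
    labels: 1 for intrapersonal ("signal") vectors, 0 for interpersonal ones."""
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    ranks = np.argsort(np.argsort(responses)) + 1.0   # 1-based ranks
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    rank_sum = ranks[labels == 1].sum()
    auc = (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return max(auc, 1 - auc)   # fold so that scores lie in [0.5, 1]
```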
2.4 Results

2.4.1 Differential Operators. The top-ranked differential operators recovered from our analysis of the space of possible two-lobed box filters are displayed in Figure 2. As we expected, the A′ measured for each individual feature is not particularly high, with the best operator in these two sets scoring approximately 0.71. Four main classes of features dominate the top 100 differential operators. First, features resembling center-surround structures appear in several top slots, in both the rectified and unrectified data. This is somewhat surprising, given that cells with this structure are most commonly associated with very early visual processing implicated in low-level tasks such as contrast enhancement, rather than higher-level tasks like recognition. Of course, the features we have recovered here are far larger in terms of their receptive field than typical center-surround filters used for early image processing, so perhaps these structures are useful for recognition if scaled up to larger sizes. The second type of feature that is very prevalent in the results is what we will call a dissociated dipole, or stick, operator, which appears primarily in the unrectified results. These features have a spatially disjoint structure, meaning that they execute brightness comparisons across widely separated parts of an image. Admittedly, the connection between these operators and the known physiology of the primate visual system is weak. To date, no cells with this sort of dissociated receptive field structure have been found in the human visual pathway, although they may exist in the auditory and somatosensory processing streams (Young, 1984; Chapin, 1986).
Figure 2: The top 100 ranked features for discriminating between intra- and interpersonal difference vectors ("Top 100 features for ORL recognition task"). Beneath the 10 × 10 array are representatives of the most common features discovered.
The final two features are elongated edge and line detectors, which dominate the results of the rectified operators. An elongated edge detector appears in the unrectified rankings as well, but other structurally similar features are found only in the next 100 ranked features. These structures resemble some of the receptive fields known to exist in striate cortex, as well as the wavelet-like operators that support sparse coding of natural scenes. We point out that multiple copies of these features appear throughout our rankings, which is to be expected: small structural changes to these filters only slightly alter their A′ score, meaning that many of the top features have very similar forms. We do not attribute any particular importance to the fact that the nonlocal operators that perform best appear to be comparing values on the right edge of the image to values in the center, or to the tendency for elongated edge detectors to appear in the center of the image. It is only the generic structure of each operator that is important to us here.

2.4.2 Single Rectangle Features. We chose to examine differential operators in our initial analysis for several reasons. First, cells with both excitatory and inhibitory regions are found throughout the visual system.
Figure 3: Plots of A′ scores across the best features from each family of operators (single versus double rectangle features, as well as rectified versus unrectified operator values). The y-axis shows A′ and the x-axis shows feature rank.
Second, by taking the difference in luminance between one region and another, one is far less sensitive to uniform changes in illumination, brought on by haze or bright lighting, for example. However, given that we are using a database of faces that is already relatively well controlled in terms of lighting and pose, it may be the case that even simpler features can support recognition. To examine this possibility, we conduct the same analysis described above for differential operators on the set of all single-rectangle box features in our images. We find that single-rectangle features are not as useful for discriminating between our two classes as differential operators. The range of A′ values for the top 100 features from each category is plotted in Figure 3, where it is clear that both sets of differential operators provide better recognition performance than single box filters. Even in circumstances where many of the reasons to employ differential operators have been removed through clever database construction (say, by disallowing fluctuations in ambient illumination), we find that they still outperform simpler measurements.
2.5 Discussion. In our analysis of the best differential operators for face recognition, we have observed a new type of operator (the dissociated dipole) that offers an alternative form of processing by which within-class stability might be achieved for images of faces. An important question is how this operator fits within the framework of previous computational models of recognition, as well as whether it has any relevance to human vision.

The dissociated dipole is an instance of a higher-order image statistic, a binary measurement. The notion that such statistics might be useful for pattern recognition is not new; indeed, Julesz (1975) suggested that needle statistics could be useful for characterizing random dot textures. In the computer vision community, nonlocal comparisons are employed in integral geometry to characterize shapes (Novikoff, 1962). The possibility that nonlocal luminance comparisons may be useful for object and face recognition has not been thoroughly explored, however. Such an approach differs from traditional shape-based approaches to object recognition in that it implicitly considers relationships between regions to be of paramount importance. Our recent results (Balas & Sinha, 2003) have demonstrated that such a nonlocal representation of faces provides better recognition performance than a strictly local one. Furthermore, Kouh and Riesenhuber (2003) have found that to model the responses of V4 neurons to various gratings using a hierarchical model of recognition (Riesenhuber & Poggio, 1999), it is necessary to pool responses from spatially disjoint low-level neurons.

Before proceeding, we wish to specify more precisely the relationship between local, nonlocal, and global image analysis. We consider local analyses to be those in which a contiguous set of pixels (either 4- or 8-connected) is represented in terms of a single output value. A global analysis is similar, save for the amount of the image under consideration: in the limit, a global analysis uses all pixels in the image to construct the output value, whereas a local analysis might use only a small percentage of the image area. This distinction is not truly categorical; rather, there is a spectrum between local and global image analysis. Likewise, a similar spectrum exists between local and nonlocal analysis. While a local analysis considers only a set of contiguous pixels, a nonlocal analysis breaks this condition of contiguity. In the extreme, one can imagine a highly nonlocal feature composed of two pixels located at opposite corners of an image; at the other extreme would be a highly local feature consisting of two neighboring pixels. Of course, there are many operators spanning these two possibilities that are neither purely local nor nonlocal. Moreover, if one measures local features (like Gabor filter outputs) at several nonoverlapping positions, is this a local or a nonlocal analysis? If one is merely concatenating the values of each local analysis into one feature vector, then this is not a truly nonlocal computation by our definition. If, however, the values of those local features are explicitly combined to produce one output value, then we would have arrived at a nonlocal analysis of the image.
Figure 4: A dipole measurement is parameterized in terms of the space constant σ of each lobe, the distance δ between the centers of each lobe, and the angle of orientation, θ .
Nonlocal analysis of this type has traditionally received less attention than local or global strategies of image processing. One reason nonlocal representations of brightness have not been studied in great detail may be the sheer number of generic binary statistics. In general, the trouble with appeals to higher-order statistics for recognition is that there is a vast space of possible measurements allowable with the introduction of new parameters (in our case, the distance between operator lobes). This combinatorial explosion makes it hard to determine which particular measurements are actually useful within the large range of possibilities. This is a serious problem, in that the utility of any set of proposed measurements depends on the ability to separate helpful features from useless ones.

We also note that there are several computational oddities associated with nonlocal operators. Suppose that we formulate a dissociated dipole as a difference-of-offset-gaussians operator (a model we present in full in the next experiment), allowing the distance between the two gaussians to be manipulated independently of either one's spatial constant (see Figure 4). In so doing, we lose the ability to create steerable filters (Freeman & Adelson, 1991), meaning that to obtain dipoles at a range of orientations, we have no option other than to use a large number of operators. This is not impossible, but it lacks the elegance and efficiency of more traditional approaches by which multiscale representations can be created at any orientation through the use of a small number of basis functions. Another important difference between local and nonlocal computations is the distribution of operator outputs. Natural images are spatially redundant, meaning that the output of most local operators is near zero (Kersten, 1987).
The result is a highly kurtotic distribution of filter outputs, indicating that a sparse representation of the image using those filters is expected. In many cases, this is highly desirable from both metabolic and computational viewpoints. As we increase the distance between the offset gaussians we use to model dissociated dipoles, the kurtosis of the distribution decreases significantly. This means that using these operators yields a coarse (or distributed) encoding of the image under consideration. This may not be unreasonable, especially given that distributed representations of complex objects may help increase robustness to image degradation. However, it is important to note that nonlocal computations depart from some conventional ideas about image representation in significant ways.

Finally, given that we have discussed our findings in the context of discovering receptive field structures that are good for recognition rather than encoding, it is important to describe what differences we see between those two processes. The initial stages of any visual system have to perform transduction: transforming the input into a format amenable to further processing. Encoding is the process by which this re-representation of the visual input is accomplished. Recognition is the process by which labels that reflect aspects of image content are assigned to images. The constraints on encoding processes are twofold: the input signal should be represented both accurately and efficiently. Given the variety of visual tasks that must be accomplished with the same initial input, it makes sense that early visual stages would not be committed to optimizing any one of them. For that reason, we suggest that recognition operates on a signal that is initially encoded via localized edge-like operators, but may rely on different measurements extracted from that signal that prove more useful.

In our next experiment, we directly address the question of whether the structures we have discovered in this analysis are useful for face and object classification. In this next analysis, we remove many of the simplifications necessary for an exhaustive search to be tractable in experiment 1. We also move beyond the domain of face recognition to include multiple object classes in our recognition task.

3 Experiment 2: Face and Object Recognition Using Local and Nonlocal Features

In our first experiment, we noted the emergence of center-surround operators and nonlocal operators under a recognition criterion for frontally viewed faces. However, many compromises were made in order to conduct an exhaustive search through the space of possible operators. First, our images were reduced to an extremely small size in order to limit the number of features we needed to consider. Though faces can be recognized at very low resolutions, it is also clear that there is interesting and useful structure at finer spatial scales. Second, we chose to work with difference images rather than the original faces.
Figure 5: Examples of stimuli used in experiment 2. (A) Training images of several individuals depicted in the ORL database. (B) Training images of objects depicted in the COIL database. Note that the COIL database contains multiple exemplars of some object classes (such as the cars in this figure), making within-class discrimination a necessary part of performing recognition well using this database.
This allowed us to transform a multicategory classification task into a binary task, but embodied the implicit assumption that a differencing operation occurs as part of the recognition process. Third, we point out that in any consideration of all possible bilobed features in an image, the number of nonlocal features will far exceed the number of local features. Greater numbers need not imply better performance, yet it is still possible that the abundance of useful nonlocal operators may be a function of set size. Finally, we note that in considering only face images, it is unclear whether the features we discovered are useful for general recognition purposes or specific to face matching. In this second experiment, we attempt to address these concerns through a recognition task that eliminates many of these difficulties. We employ high-resolution images of both faces and various complex objects in a classification task designed to test the efficacy of center-surround, local-oriented, and nonlocal features in an unbiased fashion.

3.1 Stimuli. For our face recognition experiment, we once again make use of the ORL database. In this case, all 40 individuals were used, with one image of each person serving as a training image. The images were not preprocessed in any way and remained at full resolution (112 × 92 pixels). To help determine whether our findings hold up across a range of object categories, we also conduct this recognition experiment with images taken from the COIL database (see Figure 5; Nayar, Nene, & Murase, 1996; Nene, Nayar, & Murase, 1996). These images are 128 × 128 pixel images of 100 different objects, including toy cars, foods, pharmaceutical products, and many other diverse items. We selected these images for the wide range of surface and structural properties represented by the objects.
Also, repeated exemplars of a few object categories (such as cars) make both across-class and within-class recognition necessary. Each object is depicted rotated in depth from its original position in increments of 5 degrees. We chose the 0 degree images of each object as training images and used the following 9 images as test images. The only preprocessing performed on these images was reducing them from full color to grayscale.

3.2 Procedure. To determine the relative performance of center-surround, local-oriented, and nonlocal features in an unbiased way, we model all of our features as generalized difference-of-gaussian operators. A generic bilobed operator in two-dimensional space can be modeled as follows:

$$\frac{1}{2\pi|\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1)\right) - \frac{1}{2\pi|\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1}(x-\mu_2)\right). \qquad (3.1)$$

For all of our remaining experiments, we consider only operators with diagonal covariance matrices $\Sigma_1$ and $\Sigma_2$. Further, the diagonal elements of each matrix shall be equal, yielding isotropic gaussian lobes. For this simplified case, equation 3.1 can be expressed as

$$\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) - \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right). \qquad (3.2)$$

We also introduce a parameter $\delta$ to represent the separation between the two lobes. This is simply the Euclidean norm of the difference between the two means:

$$\delta = \|\mu_2 - \mu_1\|. \qquad (3.3)$$
In order to build a center-surround operator, δ must be set to zero, and the spatial constants of the center and surround should be in a ratio of 1 to 1.6 to match the dimensions of receptive fields found in the human visual system (Marr, 1982). To create a local-oriented operator, we set σ₁ = σ₂ and set the distance δ equal to three times the spatial constant. Finally, nonlocal operators can be created by allowing the distance δ to exceed 3σ (once again assuming equal spatial constants for the two lobes). Examples of all of these operators are displayed in Figure 6.

Given this simple parameterization of our three feature types, we choose in this experiment to sample equal numbers of each kind of operator from the full set of possible features. In this way, we may represent each of our training images in terms of some small number of features drawn from a specific operator family and evaluate subsequent classification performance.
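Under this parameterization, each family is just a choice of δ and of the σ ratio. The following helper (ours, with normalization following equation 3.2) renders an operator as an image-sized kernel:

```python
import numpy as np

def isotropic_gaussian(shape, mu, sigma):
    r, c = np.indices(shape)
    d2 = (r - mu[0]) ** 2 + (c - mu[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def bilobed_operator(shape, mu1, mu2, sigma1, sigma2):
    """Difference-of-gaussians kernel (eq. 3.2); the lobe separation is
    delta = ||mu2 - mu1|| (eq. 3.3)."""
    return (isotropic_gaussian(shape, mu1, sigma1)
            - isotropic_gaussian(shape, mu2, sigma2))

# Center-surround: delta = 0, spatial constants in a 1:1.6 ratio.
cs = bilobed_operator((112, 92), (56, 46), (56, 46), 2.0, 3.2)
# Local oriented: equal sigmas, delta = 3 * sigma.
lo = bilobed_operator((112, 92), (56, 43), (56, 49), 2.0, 2.0)
# Nonlocal (dissociated dipole): equal sigmas, delta = 6 * sigma (or 9 * sigma).
nl = bilobed_operator((112, 92), (56, 40), (56, 52), 2.0, 2.0)
```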
Figure 6: Representative operators drawn from the four operator families considered in experiment 2. From top to bottom, we display examples of center-surround features, local-oriented features, and two kinds of nonlocal features (δ = 6σ, δ = 9σ).
Four operator families were considered: center-surround features (δ = 0), local-oriented features (δ = 3σ), and two kinds of nonlocal features (δ = 6σ and 9σ). For each operator family, we constructed 40 banks of 50 randomly positioned and oriented operators each. Twenty of these feature banks contained operators with a spatial constant of 2 pixels, and the other 20 contained operators with a 4-pixel spatial constant. Each bank of operators was applied to the training images to generate a feature vector consisting of 50 values. The same operators were then applied to all test images, and the resulting feature vectors were classified using a nearest-neighbor metric (L2 norm). This procedure was carried out on both the ORL and the COIL databases (a sketch of the procedure appears below).

3.3 Results. The number of images correctly identified for a given filter bank was calculated for each recognition trial, allowing us to compute an average level of classification performance from the 20 runs within each operator family and spatial scale (see Figure 7). We find in this task that once again, center-surround and nonlocal features offer the best recognition performance. This result holds at both spatial scales used in this task, as well as for both face recognition and multiclass object recognition. We also note the small variability in recognition performance around each operator's mean value: despite the random sampling of features used to constitute our operator banks, the resulting recognition performance remained very consistent. In both cases, center-surround performance slightly exceeds that obtained using nonlocal operators. It is interesting to note, however, that a larger separation between the lobes of a nonlocal feature results in better recognition performance. This cannot continue indefinitely, of course, as longer and longer separations place more limitations on where operators can be placed within the image.
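The bank-and-classify procedure of section 3.2 can be sketched as follows, reusing the bilobed_operator helper from the previous sketch (the random placement, orientation handling, and in-image clamp are our simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_bank(shape, n_ops, sigma, delta):
    """Bank of randomly positioned and oriented operators with fixed sigma and delta."""
    bank = []
    for _ in range(n_ops):
        c1 = rng.uniform((0, 0), shape)                   # first lobe center
        theta = rng.uniform(0, 2 * np.pi)                 # operator orientation
        c2 = c1 + delta * np.array([np.cos(theta), np.sin(theta)])
        c2 = np.clip(c2, 0, np.array(shape) - 1)          # keep second lobe in-image
        bank.append(bilobed_operator(shape, c1, c2, sigma, sigma))
    return np.stack(bank)

def feature_vector(bank, img):
    # One response per operator: inner product of each kernel with the image.
    return np.tensordot(bank, img, axes=([1, 2], [0, 1]))

def nearest_neighbor(train_feats, train_labels, feat):
    # Nearest-neighbor classification under the L2 norm.
    return train_labels[np.argmin(np.linalg.norm(train_feats - feat, axis=1))]
```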
Figure 7: Recognition performance for both faces (left) and objects (right) as a function of both the distance between operator lobes and the spatial constant of the lobes (y-axis: proportion correct; x-axis: lobe separation in multiples of the spatial constant, for sigma = 2 pixels and sigma = 4 pixels).
nonlocality does suggest that larger distances between lobes are more useful, however, and that it is not enough simply to deviate from locality. We note that the distinct dip in performance for local-oriented features is both consistent and puzzling. Why should it be the case that unoriented local features are good at recognition while oriented local features are poor? Center-surround operators analyze almost the same pixels as a local-oriented operator placed at the same location, so why should they be so different in terms of their recognition performance? Moreover, how is it that radically different operators like the dissociated dipole and the center-surround operator should perform so similarly? In our third and final experiment, we attempt to address these questions by breaking down the recognition problem into distinct parts so we can learn how these operator families function in classification tasks. Specifically, we point out that good recognition performance is made possible when an operator possesses two distinct properties. First, an operator must provide a stable response to images of objects with the same identity. Second, the operator must respond differently to images of objects with different identities. Neither condition is sufficient for recognition to proceed, but both are necessary. We hypothesize that though both center-surround operators and nonlocal operators provide useful information for recognition, they do so in different ways. In our last experiment, we assess both the stability and variability of each operator type to determine how good recognition results are achieved with different receptive field structures.
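For reference, here is a hypothetical sketch of the banked-operator recognition procedure described above (random banks of 50 operators, nearest-neighbor classification under the L2 norm). Images are assumed to be 2D numpy arrays, and `make_operator` is the assumed helper sketched earlier; none of these names come from the original implementation.

    # Hypothetical sketch of the experiment 2 pipeline: a random operator
    # bank maps each image to a 50-dimensional feature vector, and each test
    # image receives the label of the nearest training vector (L2 norm).
    import numpy as np

    def random_bank(kind, sigma, shape, n_ops=50, seed=0):
        rng = np.random.default_rng(seed)
        ops = []
        for _ in range(n_ops):
            center = rng.uniform(low=10, high=min(shape) - 10, size=2)
            ops.append(make_operator(kind, center, sigma, shape,
                                     theta=rng.uniform(0, np.pi)))
        return np.stack(ops)                      # shape (n_ops, H, W)

    def bank_features(bank, image):
        # One feature per operator: its inner product with the image.
        return np.tensordot(bank, image, axes=([1, 2], [0, 1]))

    def nearest_neighbor(bank, train_images, train_labels, test_image):
        train = np.stack([bank_features(bank, im) for im in train_images])
        dists = np.linalg.norm(train - bank_features(bank, test_image), axis=1)
        return train_labels[int(np.argmin(dists))]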
4 Experiment 3: Feature Stability and Variability

In experiment 2, we determined that both center-surround and nonlocal operators outperform local oriented features at recognition of faces and objects. In many ways, this is quite surprising. Center-surround features appear to share little with nonlocal operators as we have defined them, yet their recognition performance is quite similar. In this task, we break down the recognition process into components of stability and variability. To perform well at recognition, a particular operator must first be able to respond in much the same way to many different images of the same face. This is how we define stability, and one can think of it in terms of various identity-preserving transformations. Whether a face is smiling or not, lit from the side or not, a useful operator for recognition must not vary its response too widely. If this proves true, we may say that the feature is stable with respect to the transformation being considered. We use this notion to formulate an operational definition of stability in terms of a set of image measurements and a particular face transformation. Let us first assume that we possess a set of image measurements in a filter bank, just as we did in experiment 2. This filter bank is applied to some initial image, which shall always depict a person in frontal view with a neutral expression. The value of each operator in our collection can be determined and stored in a one-dimensional vector, x. This same set of operators is then applied to a second image, depicting the same person as the original image but with some change of expression or pose. The values resulting from applying all operators to this new image are then stored in a second vector, y. The two vectors x and y may then be compared to see how drastic the changes in operator response were across the transformation from the first image to the second. If by some luck our operators are perfectly invariant to the current transformation, plotting x versus y would produce a scatter plot in which all points lie on the line y = x. Poor invariance would be reflected in a plot in which points are distributed randomly. For two vectors x and y (each of length n), we may use the value of the correlation coefficient between them (see equation 4.1) as our quantitative measure of feature stability:

\[ r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[ n \sum x^2 - \left(\sum x\right)^2 \right] \left[ n \sum y^2 - \left(\sum y\right)^2 \right]}}. \quad (4.1) \]
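In code, the stability measure is simply Pearson's r between the two response vectors. The short sketch below implements it, together with the variability measure defined in the next paragraph; the function names are ours, not the authors'.

    # Sketch of the two operational measures: stability (equation 4.1) and
    # variability (variance across a population of faces). Names are ours.
    import numpy as np

    def stability(x, y):
        """Pearson's r between operator responses to an image pair."""
        n = len(x)
        num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
        den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                      (n * np.sum(y ** 2) - np.sum(y) ** 2))
        return num / den

    def variability(responses):
        """Per-operator variance; rows index faces, columns index operators."""
        return np.var(responses, axis=0)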
The second component of recognition is variability. It is not enough to be stable to transformations; one must also be diagnostic of identity. Imagine, for example, that one finds an image measurement that is perfectly stable across lighting, expression, and pose transformations. It may seem that this measurement is ideal for recognition, but let us also imagine that it turns out to be of the same value for every face considered. This provides no
means of distinguishing one face from another, despite the measurement's remarkable invariance to transformations of a single face. What is needed is an ability to be stable within images of a single face, but to vary broadly across images of many different faces. This last attribute we shall call variability, and we may quantify it for a particular measurement as the variance of its response across a population of faces. In this third experiment, we use these operational definitions of stability and variability to determine what properties center-surround and nonlocal operators possess that make them useful for recognition. We shall return once again to the domain of faces, as they provide a rich set of transformations to consider, both rigid and nonrigid alterations of the face in varying degree.

4.1 Stimuli. We use 16 faces (8 men, 8 women) from the Stirling face database for this experiment. The faces are grayscale images of individuals in a neutral, frontal pose, accompanied by pictures of the same models smiling and speaking while facing forward, and also in a three-quarter pose with neutral expression. We call these transformations the SMILE, SPEECH, and VIEW transforms, respectively. The original images were 284 × 365 pixels, and the only preprocessing step applied was to crop out a 256 × 256 pixel region centered in the original image rectangle.

4.2 Procedure. All operators in these sets were built as difference-of-gaussian features, exactly as described in experiment 2. Also as before, center-surround, local oriented, and two kinds of nonlocal features were evaluated. Because we would like to understand how both the separation of lobes and their individual spatial extent affect performance, two scales were employed for each kind of feature. Space constants of 4 pixels (fine scale) and 8 pixels (coarse scale) were used. In the case of center-surround features, the value of the space constant always refers to the size of the surround. For each pair of images to be analyzed, we construct 120 collections of 50 operators each. These feature banks were split into 10 center-surround, 10 local, and 20 nonlocal banks (10 banks each for separations of six and nine times the spatial constant of the lobes) at both scales mentioned above. Once a set of operators was constructed, we applied it to each neutral, frontal image in our data set to assemble the feature vector for the starting image. The same operators were then applied to each of the three transformed images so that a value for Pearson's R could be calculated for that set of operators relative to each transformation. The average value of Pearson's R could then be taken across all 16 faces in our set. This process was repeated for all families and scales of operator banks to assess stability. To assess variability, operator banks were once again applied to the neutral, frontal images. This time, the variance in each operator's output was calculated across the population of 16 faces. The results were
Figure 8: The stability of each feature type as a function of both the spatial scale of the gaussian lobes (sigma = 4 pixels and sigma = 8 pixels) and various facial transformations (panels: smile, speech, and view transforms; y-axis: correlation coefficient; x-axis: lobe separation in multiples of the spatial constant, sigma).
combined and expressed in terms of the mean variance of response and its standard deviation.

4.3 Results

4.3.1 Difference-of-Gaussian Features. Plots depicting the average values of the correlation coefficients (averaged again over all individuals) are presented in Figure 8. We present the measured stability of each kind of operator across three ecologically relevant transformations: SMILE (second image of individuals smiling), SPEECH (second image of individuals speaking), and VIEW (second image of individuals in three-quarters pose). These plots highlight several interesting characteristics of our operators. First, center-surround filters at both scales appear once again to perform quite well compared to the other features. As soon as we move the two gaussians apart to form oriented local operators, however, a sharp dip in stability occurs. This indicates that the two-lobed oriented edge detectors used here provide comparatively poor stability across all three of the transformations examined. That said, as the distance between the lobes of our operators increases further, stability of response also increases. Nonlocality seems to increase stability across all three transformations, nearly reaching the level of center-surround stability at a coarse scale. Stability, however, is not the only attribute required to perform recognition tasks well. As discussed earlier, a feature that is stable across face transformations is useful only if it is not also stable across images of different individuals. That is, a universal feature is not of any use for recognition
Table 1: Mean ± S.E. of Operator Variance Across Individuals.

                          σ = 4           σ = 8            σ = 16
    Center-surround       122.5 ± 3.7     206.6 ± 6.2      311.3 ± 8.5
    Local (δ = 3σ)        242.0 ± 9.6     527.0 ± 15.0     986.9 ± 26.7
    Nonlocal (δ = 6σ)     378.8 ± 11.4    718.5 ± 17.7     1204.1 ± 29.9
    Nonlocal (δ = 9σ)     430.2 ± 11.0    795.4 ± 19.7     1271.7 ± 32.6
because it has no discriminative power. We present next the amount of variability in response for each family of operators (see Table 1). Center-surround operators appear to be the least variable across images of different individuals, while nonlocal operators appear to vary most. All feature types except for the center-surround filters increase in variability as their scale increases, which seems somewhat surprising, as one might expect more dramatic differences in individual appearance to be expressed at a finer scale. Nonetheless, we can see from the combination of these results and the stability results that center-surround and nonlocal operators achieve good recognition performance through different means. Center-surround operators are not so variable from person to person, but make up for it with an extremely stable response to individual faces despite significant transformations. In contrast, nonlocal operators lack the full stability of center-surround operators, but appear to make up for it by being much more variable in response across the population of faces. The local-oriented features rank poorly in terms of both their stability and variability characteristics, thus limiting their usefulness for recognition tasks.

4.4 Discussion. The results of our stability analysis of differential operators reveal two main findings. First, the same features that were discovered to perform the best discrimination between intra- and interpersonal difference vectors in experiment 1 (large center-surround filters and nonlocal operators) and to perform best in a simple recognition system for both faces and objects (experiment 2) also display the greatest combination of stability and variability when confronted with ecologically relevant face transforms. Second, the limited stability of local oriented operators suggests that they may not provide the most useful features for handling these image transforms.

5 Conclusion

We have noted the emergence of large center-surround and nonlocal operators as tools for performing object recognition using simple features and found that both of these operators provide good stability of response across a range of different transforms. These structures differ from receptive field forms known to support sparse encoding of natural scenes, yet
seem to provide a better means of discriminating between individual objects and providing stable responses to image transforms. This suggests that the constraints that govern information-theoretic approaches to image representation may not necessarily be useful for developing representations that can support the recognition of objects in images. In the specific context of faces, do large center-surround fields or nonlocal comparators, on their own, present a viable alternative for performing efficient face recognition? At present, the answer to this question is no. Complex (and truly global) features such as eigenface (Turk & Pentland, 1991) bases provide for higher levels of recognition performance than we expect to achieve using these far simpler features. We note, however, that the discovery of a useful vocabulary of low-level features may aid global recognition techniques like eigenface-based systems. One could easily compute PCA bases on nonlocal and center-surround measurements rather than pixels. The added stability of these operators may help significantly increase recognition performance. The larger question at stake, however, does not only concern face recognition, despite its being our domain of choice for this study. Of greater interest than building a face recognition engine is learning how one might obtain stability to relevant image transforms given some set of simple measures. Little is known about how one moves from highly selective, small receptive fields in V1 to the large receptive fields in inferotemporal cortex that demonstrate impressive invariance to stimulus manipulations within a particular class. We have introduced here a particular measurement, the dissociated dipole, which represents one example of a very broad space of alternative computations by which limited amounts of invariance might be achieved. Our proposal of nonlocal operators draws support from several studies of human perception. Indeed, past psychophysical studies of the long-range processing of pairs of lines suggest the existence of similarly structured "coincidence detectors," which enact nonlocal comparisons of simple stimuli (Morgan & Regan, 1987; Kohly & Regan, 2000). Further work exploring nonlocal processing of orientation and contrast has more recently given rise to the idea of a "cerebral bus" shuttling information between distant points (Danilova & Mollon, 2003). These detectors could contribute to shape representation, as demonstrated by Burbeck's idea of encoding shapes via medial "cores" built by integrating information across disparate "boundariness" detectors (Burbeck & Pizer, 1995). Our overarching goal in this work is to redirect the study of nonclassical receptive field structures toward examining the possibility that object recognition may be governed by computations outside the realm of traditional multiscale pyramids, and subject to different constraints from those that guide formulations of image representation based on information theory. The road from V1 to IT (and, computationally speaking, from Gabors and gaussian derivatives to eigenfaces) may contain many surprising image processing tools.
Even within the realm of dissociated dipoles, there are many parameters to explore. For example, the two lobes need not be isotropic or of equal size and orientation. The lobes could easily take the form of gaussian derivatives rather than gaussians. Given that there are many more parameters that could be introduced to the simple DOG framework, it is possible that even better invariance could be achieved by introducing more degrees of structural freedom. The point is that expanding our consideration to nonlocal operators opens up a large space of possible filters, and systematic exploration of this space, while difficult, may be very rewarding.

Acknowledgments

This research was funded in part by the DARPA HumanID Program and the National Science Foundation ITR Program. B.J.B. is supported by an NDSEG fellowship. P.S. is supported by an Alfred P. Sloan Fellowship in neuroscience and a Merck Foundation Fellowship. We also thank Ted Adelson, Gadi Geiger, Mriganka Sur, Chris Moore, Erin Conwell, and David Cox for many helpful discussions.
References

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61, 183–193.
Balas, B. J., & Sinha, P. (2003). Dissociated dipoles: Image representation via nonlocal operators. Cambridge, MA: MIT Press.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In Proceedings of the European Conference on Computer Vision (pp. 109–124). Berlin: Springer-Verlag.
Borenstein, E., & Ullman, S. (2004). Learning to segment. In Proceedings of the European Conference on Computer Vision (pp. 315–328). Berlin: Springer-Verlag.
Burbeck, C. A., & Pizer, S. M. (1995). Object representation by cores: Identifying and representing primitive spatial regions. Vision Research, 35(13), 1917–1930.
Chapin, J. K. (1986). Laminar differences in sizes, shapes, and response profiles of cutaneous receptive fields in the rat SI cortex. Exp. Brain Research, 62(3), 549–559.
Danilova, M. V., & Mollon, J. D. (2003). Comparison at a distance. Perception, 32(4), 395–414.
Edelman, S. (1993). Representing 3-D objects by sets of activities of receptive fields. Biological Cybernetics, 70, 37–45.
Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. Paper presented at the International Conference on Computer Vision, Nice, France.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148, 574–591.
Julesz, B. (1975). Experiments in the visual perception of texture. Scientific American, 232(4), 34–43.
Kersten, D. (1987). Predictability and redundancy of natural images. J. Opt. Soc. Am. A, 4(12), 2395–2400.
Kohly, R. P., & Regan, D. (2000). Coincidence detectors: Visual processing of a pair of lines and implications for shape discrimination. Vision Research, 40(17), 2291–2306.
Kouh, M., & Riesenhuber, M. (2003). Investigating shape representation in area V4 with HMAX: Orientation and grating selectivities (Rep. AIM-2003-021, CBCL-231). Cambridge, MA: MIT.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch, 36, 910–912.
Marr, D. (1982). Vision. New York: Freeman.
Moghaddam, B., Jebara, T., & Pentland, A. (2000). Bayesian face recognition. Pattern Recognition, 33(11), 1771–1782.
Morgan, M. J., & Regan, D. (1987). Opponent model for line interval discrimination: Interval and Vernier performance compared. Vision Research, 27(1), 107–118.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. Paper presented at the ARPA Image Understanding Workshop, Palm Springs, FL.
Nene, S. A., Nayar, S. K., & Murase, H. (1996). Columbia Object Image Library (COIL-100). New York: Columbia University.
Novikoff, A. (1962). Integral geometry as a tool in pattern perception. In H. Foerster & G. Zopf (Eds.), Principles of self-organization. New York: Pergamon.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Samaria, F., & Harter, A. (1994). Parametrisation of a stochastic model for human face identification. Paper presented at the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL.
Schneiderman, H., & Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216.
Sinha, P. (2002). Qualitative representations for recognition. Lecture Notes in Computer Science, 2525, 249–262.
Turk, M. A., & Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7), 682–687.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, HI.
Wiskott, L., Fellous, J.-M., Kruger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775–779.
Young, E. D. (1984). Response characteristics of neurons of the cochlear nucleus. In C. I. Berlin (Ed.), Hearing science: Recent advances. San Diego, CA: College Hill Press.
Received July 28, 2004; accepted August 9, 2005.
LETTER
Communicated by Sidney Lehky
A Neural Model of the Scintillating Grid Illusion: Disinhibition and Self-Inhibition in Early Vision Yingwei Yu
[email protected] Yoonsuck Choe
[email protected] Department of Computer Science, Texas A&M University, College Station, Texas 77843-3112, U.S.A.
A stationary display of white discs positioned on intersecting gray bars on a dark background gives rise to a striking scintillating effect—the scintillating grid illusion. The spatial and temporal properties of the illusion are well known, but a neuronal-level explanation of the mechanism has not been fully investigated. Motivated by the neurophysiology of the Limulus retina, we propose disinhibition and self-inhibition as possible neural mechanisms that may give rise to the illusion. In this letter, a spatiotemporal model of the early visual pathway is derived that explicitly accounts for these two mechanisms. The model successfully predicted the change of strength in the illusion under various stimulus conditions, indicating that low-level mechanisms may well explain the scintillating effect in the illusion.

1 Introduction

The scintillating grid illusion consists of bright discs superimposed on intersections of orthogonal gray bars on a dark background (see Figure 1A; Schrauf, Lingelbach, & Wist, 1997). In this illusion, illusory dark spots are perceived as scintillating within the white discs. Several important properties of the illusion have been discovered and reported in recent years. (1) The discs that are closer to a fixation show less scintillation (Schrauf et al., 1997), which might be due to the fact that receptive fields in the periphery are larger than those in the fovea. As shown in Figure 2, if the periphery of the scintillating grid is correspondingly scaled up, the scintillation effect is diminished. Note that the diminishing effect is not due to the polar arrangement alone, as can be seen in Figure 1B. (2) The illusion is greatly reduced or even abolished both with steady fixation and by reducing the contrast between the constituent grid elements (Schrauf et al., 1997). (3) As the speed of motion is increased (either efferent eye movement or afferent grid movement), the strength of scintillation decreases (Schrauf, Wist, & Ehrenstein, 2000). (4) The presentation duration of the grid also plays a role
Figure 1: The scintillating grid illusion and its polar variation. (A) The original scintillating grid illusion is shown (redrawn from Schrauf et al., 1997). (B) A polar variation of the illusion. The scintillating effect is still strong in the polar arrangement (cf. Kitaoka, 2003).
in determining the strength of illusion. The strength first increases when the presentation time is less than about 220 ms, but it slowly decreases once the presentation duration is extended beyond that (Schrauf et al., 2000). What kind of neural process may be responsible for such a dynamic illusion? The scintillating grid can be seen as a variation of the Hermann grid illusion where the brightness level of the intersecting bars is reduced (a representative example of simultaneous contrast; Gerrits & Vendrik, 1970). The illusory dark spots in the Hermann grid can be explained by a feedforward lateral inhibition mechanism, commonly modeled with difference of gaussian (DOG) filters (Spillmann, 1994). Thus, DOG filters may seem like a plausible mechanism contributing to the scintillating grid illusion. However, they are not exactly fit to explain the complexities of the scintillating grid illusion, for the following reasons. (1) The DOG model cannot account for the change in the strength of scintillation over different brightness and contrast conditions, as shown in the experiments by Schrauf et al. (1997). (2) Furthermore, DOG cannot explain the basic scintillation effect, which has a temporal dimension to it. Thus, the feedforward lateral mechanism represented by DOG fails to fully explain the scintillating effect. Anatomical and physiological observations show that the center-surround property in early visual processing may not be strictly feedforward: the process involves a recurrent inhibition, which leads to disinhibition. Moreover, the process also includes self-inhibition—the inhibition of the cell itself. For example, Hartline et al. used Limulus (horseshoe crab) optical cells to demonstrate disinhibition and self-inhibition in the retina
Figure 2: A variation without the scintillating effect. The grids toward the periphery are significantly scaled up, which results in the abolishment of the scintillating effect when stared at in the middle (see Raninen & Rovamo, 1987, for a similar input scaling approach to alter perceptual effects). This is because the scintillating grid illusion highly depends on the size of the receptive fields. In the fovea, the receptive field size is small, and in the periphery, the receptive field size is relatively larger. (Note that Kitaoka, 2003, presented a similar plot, but there, the periphery was not significantly scaled up such that the scintillating effect was preserved.)
(Hartline & Ratliff, 1957). Disinhibition and self-inhibition have been discovered in mammals and other vertebrates as well. As for disinhibition, it has been found in the retina of cats (Li et al., 1992; Kolb & Nelson, 1993), tiger salamanders (Roska, Nemeth, & Werblin, 1998), and mice (Frech, PerezLeon, Wassle, & Backus, 2001). For example, Kolb and Nelson (1993) have shown that the A2 amacrine cells in the cat retina contribute to lateral inhibition among ganglion cells, and they can play a role in disinhibition. With regard to self-inhibition, Hartveit (1999) found that depolarization of a rod bipolar cell in rat retina evokes a feedback response to the same cell, thus indicating that a mechanism similar to those in the Limulus may exist in mammalian vision. Also, Nirenberg and Meister (1997) have shown that transient and motion-sensitive responses in ganglion cells may be produced by self-inhibitory feedback of the amacrine cells in mouse retina (for similar results, see Berry, Brivanlou, Jordan, & Meister, 1999; Victor, 1987; Crevier & Meister, 1998). Computational models also suggested that self-inhibition
may exist in cells sensitive to light-dark contrast (Neumann, Pessoa, & Hansen, 1999). Disinhibition can effectively reduce the amount of inhibition where there is a large area of bright input, and self-inhibition can give rise to oscillations in the response over time. Thus, the combination of those two mechanisms, disinhibition and self-inhibition, may provide an explanation for the intriguing scintillating grid illusion. In this letter, we present a model based on disinhibition and self-inhibition in the early visual pathway to explain the scintillating grid and its various spatiotemporal properties reported by Schrauf et al. (1997, 2000). Our model is, to our knowledge, the first attempt at computationally modeling the scintillating grid illusion at a neuronal level. In the next section, we begin with a brief review of disinhibition and self-inhibition and introduce our model in detail. Next we present the main results, followed by discussion and conclusion.

2 Disinhibition and Self-Inhibition in Early Vision

In this section, we review the basic mechanisms of disinhibition and self-inhibition in the Limulus optical cells. The study by Hartline and Ratliff (1957) was the first to show that lateral inhibition exists at the very first stage in visual processing—between optical cells in the Limulus. Furthermore, they showed that the lateral interaction has a nontrivial ripple effect. That is, the final response of a specific cell can be considered as the overall effect of the response from itself and from all other cells directly or indirectly connected to that cell. As a result, the net response of a cell can be enhanced or reduced by such inhibitory interaction, depending on the surrounding context. For example, inhibition of an inhibitory neuron X will release the target of X from inhibition, thus allowing the latter to fire (or increase its firing rate). This process is known as disinhibition, and it has been shown that such a mechanism may be more accurate than lateral inhibition alone in explaining subtle brightness-contrast effects (see, e.g., Yu, Yamauchi, & Choe, 2004). Self-inhibition is also found in Limulus optical cells (Ratliff, Hartline, & Miller, 1963). When a depolarizing step input is applied to the cell, a transient peak in firing rate can be observed, which is followed by a rapid decay to a steady rate. The self-inhibition effect is due to synaptic interactions, which produce a negative feedback into the cell itself. The self-inhibition mechanism is illustrated by the process in which cells go from an initial transient peak to a stable equilibrium plateau, which is a form of neural adaptation: each impulse discharge acts recurrently to delay the discharge of the next impulse within the same neuron (Hartline, Wagner, & Ratliff, 1956; Hartline & Ratliff, 1957). Together with the feedback to neighboring cells, the feedback to oneself may be essential for explaining the evolution of the temporal dynamics observed in the scintillating grid illusion.
3 A Spatiotemporal Model of Disinhibition and Self-Inhibition

Hartline and his colleagues developed an early computational model of the response in the Limulus retina. The Hartline-Ratliff equation describing disinhibition in Limulus can be summarized as follows (Hartline & Ratliff, 1957, 1958; Stevens, 1964):

\[ r_m = e_m - K_s r_m - \sum_{n \neq m} v_{mn} (r_n - t_{mn}), \quad (3.1) \]
where r_m is the final response of the mth ommatidium, K_s its self-inhibition rate, e_m its excitation level, v_{mn} the inhibitory weight from another ommatidium n, and t_{mn} its threshold. When equation 3.1 is used to calculate the evolution of responses in a network of cells, the effect of disinhibition arises. Brodie, Knight, and Ratliff (1978) extended this equation to derive a spatiotemporal filter, where the input was assumed to be a sinusoidal grating. The input was limited in this way to make the explicit calculation of the responses tractable. As a result, the derived model was applicable only to sinusoidal grating inputs. To additionally model self-inhibition, which gives the temporal property, they replaced the constant self-inhibition rate in equation 3.1 with a time-dependent term (cf. K_s(t) in section 3.1). Their model accounts well for the responses in the Limulus retina, but it is limited to a single spatial frequency channel input. Because of this, their extended model cannot be applied to complex images containing a broader band of spatial frequencies, which is typically the case for visual illusions such as the scintillating grid illusion. In the following section, we extend the Hartline-Ratliff equation using a different strategy to derive a filter that can address these issues while remaining tractable.

3.1 A Simplified Model in One Dimension. Rearranging equation 3.1 by omitting the threshold, we have

\[ (1 + K_s)\, r_m - \sum_{n \neq m} w_{mn} r_n = e_m, \quad (3.2) \]

where w_{mn} is the strength of interaction (either excitatory or inhibitory) from cell m to n. Note that w_{mn} extends the definition of v_{mn} in equation 3.1 to allow excitatory as well as inhibitory contributions. To generalize the model to n inputs, the responses of n cells can be expressed in matrix form as (Yu et al., 2004)

\[ \mathbf{K} \mathbf{r} = \mathbf{e}, \quad (3.3) \]
where r is the output vector of size n, e the input vector of size n, and K the n × n weight matrix:

\[ \mathbf{K} = \begin{pmatrix} 1 + K_s(t) & -w(1) & \cdots & -w(n-1) \\ -w(1) & 1 + K_s(t) & \cdots & -w(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ -w(n-1) & -w(n-2) & \cdots & 1 + K_s(t) \end{pmatrix}, \quad (3.4) \]
where K_s(t) is the self-inhibition rate at time t, and w(·) is the connection weight, which is a function of the distance between the cells. (Note that unlike in our previous models (Yu et al., 2004; Yu & Choe, 2004a), the introduction of the time-varying term K_s(t) allows the model to have a temporal behavior.) For the convenience of calculation, we assume that K_s(t) here approximately equals the self-inhibition rate of a single cell. The exact derivation of K_s(t) is as follows (Brodie et al., 1978):

\[ K_s(t) = \frac{y(t)}{r(t)}, \quad (3.5) \]
where y(t) is the amount of self-inhibition at time t, and r(t) is the response at time t for this cell. We know that the Laplace transform y(s) of y(t) has the following property:

\[ y(s) = r(s)\, T_s(s), \quad (3.6) \]

\[ T_s(s) = \frac{k}{1 + s\tau}, \quad (3.7) \]

where k is the maximum value K_s(t) can reach and τ the time constant. By assuming that the input e(t) is a step input for this cell, the Laplace transform of e(t) can be written as

\[ e(s) = \frac{I_0}{s}, \quad (3.8) \]
where I_0 is a constant representing the strength of the light stimulus. From the definition of y(t), we know that

\[ \frac{dr}{dt} = e(t) - y(t). \quad (3.9) \]
To solve for the response r(t), we can apply the Laplace transform and plug in e(s) and y(s):

\[ r(s) = \left( \frac{I_0}{s} - r(s)\, \frac{k}{1 + s\tau} \right) \frac{1}{s}. \quad (3.10) \]

Solving this equation, we get

\[ r(s) = \frac{I_0 (s\tau + 1)}{s \, (\tau s^2 + s + k)}. \quad (3.11) \]

By substituting r(s) and T_s(s) in equation 3.6 with equations 3.7 and 3.11, we get

\[ y(s) = \frac{k I_0 (s\tau + 1)}{s \, (\tau s^2 + s + k)(1 + s\tau)}. \quad (3.12) \]
Then, by inverse Laplace transform, we can get y(t) and r(t). Finally, the exact expression for K_s(t) can be obtained by evaluating equation 3.5:

\[ K_s(t) = \frac{4k\tau - 1 + (1 - 4k)\, h(t) \cos(\omega t) - 2k\, h(t)\, \omega \sin(\omega t)}{4k\tau - 1 + (1 - 4k)\, h(t) \cos(\omega t) + (4k\tau - 2)\, h(t)\, \omega \sin(\omega t)}, \quad (3.13) \]

where

\[ h(t) = \exp\left( -\frac{t}{2\tau} \right), \quad (3.14) \]

and

\[ \omega = \frac{\sqrt{4k\tau - 1}}{2\tau}. \quad (3.15) \]
An intuitive way of understanding the above expression is to see it as a division of two convolutions,

\[ K_s(t) = \frac{e(t) * g(t)}{e(t) * f(t)}, \quad (3.16) \]
where g(t) and f(t) are impulse response functions derived from the above and ∗ denotes the convolution operator (see the appendix for details). Figure 3 shows several curves plotting the self-inhibition rate under different parameter conditions.
3.5 3.0
Ks(t)
2.5 2.0 1.5
k=3, τ=0.3 k=3, τ=0.1 k=2, τ=1.0 k=2, τ=0.5
1.0 0.5 0.0 0.0
0.2
0.6 0.4 time: t
0.8
1.0
Figure 3: Self-inhibition rate. The evolution of the self-inhibition rate K s (t) (y-axis) over time (x-axis) is shown for various parameter configurations (see equations 3.5 to 3.7). The parameter k defines the peak value of the curve, and τ defines how quickly the curve converges to a steady state. For all computational simulations in this article, the values k = 3 and τ = 0.3 were used.
As discovered in Limulus by Hartline and Ratliff (1957, 1958), self-inhibition is strong (k = 3), while the lateral contribution is weak (0.1 or less). Hartline and Ratliff (1957, 1958) experimentally determined these values, whereas they left τ as a free parameter. Now we have a model for the response of cells arranged in one dimension, but for realistic visual inputs, we need a 2D model. In the following section, we provide details about extending the 1D model above to 2D.

3.2 Extending the Model to Two Dimensions. The 1D model can be easily extended to two dimensions. We simply serialize the input and output matrices to vectors to fit the 1D model we have. The weight matrix K can then be defined to represent the weight K_{ij} from cell j to cell i based on their distance in 2D, following the DOG distribution (Marr & Hildreth, 1980):
\[ K_{ij} = \begin{cases} -w(|i, j|) & \text{when } i \neq j \\ 1 + K_s(t) & \text{when } i = j \end{cases}, \quad (3.17) \]

\[ w(x) = k_c \exp\left( -\frac{x^2}{\sigma_c^2} \right) - k_s \exp\left( -\frac{x^2}{\sigma_s^2} \right), \quad (3.18) \]
where |i, j| is the Euclidean distance between cells i and j; k_c and k_s are the scaling constants that determine the relative scale of the excitatory and inhibitory distributions, set to 1/(\sqrt{2\pi}\,\sigma_c) and 1/(\sqrt{2\pi}\,\sigma_s); and σ_c and σ_s their
widths. The σ values were indirectly specified as a fraction of the receptive field size ρ: σ_c = ρ/24 and σ_s = ρ/6. Finally, the response vector r can be derived from equation 3.3 as follows (Yu et al., 2004):

\[ \mathbf{r} = \mathbf{K}^{-1} \mathbf{e}, \quad (3.19) \]
and we can apply reverse serialization to get the vector r back into a 2D matrix form. Figure 4 shows a single row of the weight matrix K, corresponding to a weight matrix (when reverse serialized) of a cell in the center of the 2D retina, at various time points. The plot shows that the cell in the center can be influenced by inputs from locations far away, outside its classical receptive field area. The initial state shown in Figure 4A looks bumpy (and somewhat random), but on closer inspection we can observe concentric rings of ridges as in Figure 4B. (The apparent bumpiness along the ridges is due to the aliasing effect caused by the square boundary of the receptive field.) Another noticeable feature is that the spatial extent of excitation and inhibition shrinks over time (from Figure 4A to 4F). This may seem inconsistent with the notion of a persisting inhibitory surround, but in fact the spatiotemporal property of on-off receptive fields shows such a diminishing lateral influence over time (in retinal ganglion cells, Jacobs & Werblin, 1998; and also in the lateral geniculate nucleus, Cai, DeAngelis, & Freeman, 1997).

Figure 4: Disinhibition filter at various time points. The filter (i.e., the connection weights) of the central optical cell shown at different time steps (k = 3, τ = 0.3, ρ = 20); panels A to F correspond to t = 0.007, 0.031, 0.062, 0.124, 0.2452, and 0.895. The self-inhibition rate evolved over time as follows: (A) K_s(t) = 0.03, (B) K_s(t) = 0.15, (C) K_s(t) = 0.29, (D) K_s(t) = 0.54, (E) K_s(t) = 0.99, and (F) K_s(t) = 2.59. Initially, a large ripple extends over a long distance from the center (beyond the classical receptive field), but as time goes on, the long-range influence diminishes. In other words, the effective receptive field size is reduced over time due to the change in self-inhibition rate. (Note that the visual field size shown above is 41 × 41 for better visualization, compared to 30 × 30 in the actual experiments.)

4 Methods

To match the behavior of the model to psychophysical data, we need to measure the degree of the illusory effect in the scintillating grid. Measuring the strength of the overall scintillation effect is difficult because it involves many potential factors, such as the change in the perceived brightness of the discs over time and the perceived number of scintillating dark spots. (In fact, in Schrauf et al., 1997, 2000, subjects were simply asked to report the strength of illusion on a scale of 1 to 5 without any reference to these various factors.) For our computational experiments, one simple yet meaningful measure of the strength of scintillation can be derived based on the changes in the perceived contrast. More specifically, we are interested in the change over time in the relative contrast of the disc versus the gray bars:

\[ S(t) = C(t) - C(0), \quad (4.1) \]

where S(t) is the perceived strength t time units from the last eye movement or the time of initial presentation of the scintillating grid stimulus (time t in our model is on an arbitrary scale) and C(t) is the contrast between the disc and the gray bars in the center row of the response matrix:

\[ C(t) = \frac{R_{\text{disc}}(t) - R_{\min}(t)}{R_{\text{bar}}(t) - R_{\min}(t)}, \quad (4.2) \]
where R_disc(t) is the response at the center of the disc region, R_bar(t) the response at the center of either of the gray bar regions, and R_min(t) the minimum response in the output at time t. In other words, the perceived strength of illusion S(t) is defined as the relative disc-to-bar contrast at time t as compared to its initial value at time 0. Using this measure, in the experiments below, we tested our model under various experimental conditions, mirroring those in Schrauf et al. (1997, 2000). In all calculations, the effect of illusion was measured on an image consisting of a single isolated grid element of size 30 × 30 pixels. The disc at the center had a diameter of 8, and the bars had a width of 6. The model parameters k = 3 and τ = 0.3 were fixed throughout all experiments, and so was the pattern where the background luminance was set to 10, the gray bar to 50, and the white disc to 100, unless stated otherwise. Depending on the experimental condition under consideration, the model parameters (receptive field size ρ) and/or the stimulus conditions (such as the duration of exposure to the stimulus and/or the brightness of different components of the grid) were varied. The units of the receptive field size, the width of the bar, and the diameter of the disc were all equivalent in pixels on the receptor surface, where each pixel corresponds to one photoreceptor. The details of the variations are provided in section 5.

5 Experiments and Results

5.1 Experiment 1: Perceived Brightness as a Function of Receptive Field Size. In the scintillating grid illusion, the scintillating effect is most strongly present in the periphery of the visual field. As we stated earlier, this may be due to the fact that the receptive field size is larger in the periphery than in the fovea, thus matching the scale of the grid. If there is a mismatch between the scale of the grid and the receptive field size, the illusory dark spot will not appear. For example, in Figure 2, the input is scaled up in the periphery, thus creating a mismatch between the peripheral receptive field size and the scale of the grid. As a result, the scintillating effect is abolished. Conversely, if the receptive field size is reduced with no change to the input, the perceived scintillation would diminish (as happens in the center of gaze in the original scintillating grid; see Figure 1). To verify this point, we tested our model with different receptive field sizes while the input grid size was fixed. As shown in Figure 5A, smaller receptive fields result in almost no darkening effect in the white disc.
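Putting the pieces together, the following is a minimal end-to-end sketch of the simulation loop under our assumptions: build the weight matrix of equations 3.17 and 3.18 for a serialized 30 × 30 grid element, solve equation 3.19, and evaluate the contrast measure of equation 4.2. The stimulus construction and all names are ours; `self_inhibition_rate` is the earlier sketch, and this is a reconstruction, not the authors' code.

    # End-to-end sketch (our reconstruction): weight matrix K (equations
    # 3.17-3.18), response r = K^{-1} e (equation 3.19), and the
    # disc-to-bar contrast C(t) (equation 4.2).
    import numpy as np

    def weight_matrix(h, w, t, rho, k=3.0, tau=0.3):
        sc, ss = rho / 24.0, rho / 6.0                 # sigma_c, sigma_s
        kc = 1 / (np.sqrt(2 * np.pi) * sc)
        ks = 1 / (np.sqrt(2 * np.pi) * ss)
        ys, xs = np.divmod(np.arange(h * w), w)        # serialized 2D grid
        d2 = (ys[:, None] - ys) ** 2 + (xs[:, None] - xs) ** 2
        wgt = kc * np.exp(-d2 / sc**2) - ks * np.exp(-d2 / ss**2)  # eq. 3.18
        K = -wgt                                       # off-diagonal: -w(|i,j|)
        np.fill_diagonal(K, 1.0 + self_inhibition_rate(t, k, tau))  # eq. 3.17
        return K

    def grid_element(h=30, w=30, bg=10.0, bar=50.0, disc=100.0):
        e = np.full((h, w), bg)
        e[h//2 - 3:h//2 + 3, :] = bar                  # gray bar, width 6
        e[:, w//2 - 3:w//2 + 3] = bar
        yy, xx = np.mgrid[0:h, 0:w]
        e[(yy - h//2)**2 + (xx - w//2)**2 <= 16] = disc  # disc, diameter 8
        return e

    def contrast(t, rho=6):
        e = grid_element()
        h, w = e.shape
        r = np.linalg.solve(weight_matrix(h, w, t, rho), e.ravel()).reshape(h, w)
        r_disc, r_bar = r[h//2, w//2], r[h//2, 2]      # disc and bar centers
        return (r_disc - r.min()) / (r_bar - r.min())  # equation 4.2

    strength = contrast(t=1.0) - contrast(t=0.0)       # S(t), equation 4.1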
Figure 5: Response under various receptive field sizes. The response of our model to a single grid element in the scintillating grid is shown under various receptive field sizes at t = 0.01 (k = 3, τ = 0.3). (A–E) The responses of the model are shown when the receptive field size was increased from ρ = 3 to 6, 9, 12, and 15. Initially, the disc in the center is bright (simulating the fovea), but as ρ increases, it becomes darker (simulating the periphery). (F) The relative brightness level of the central disc compared to the gray bar, C(t), is shown (see equation 4.2). The contrast decreases as ρ increases, indicating that the disc in the center becomes relatively darker than the gray bar region. The contrast drops abruptly until around ρ = 6 and then gradually decreases. (G) The normalized responses of the horizontal cross sections of A to E are shown. For normalization, the darkest part and the gray bar region of the horizontal cross section were scaled between 0.0 and 1.0. When ρ is small (= 3), the disc in the center is very bright (the plateau in the middle of the black ribbon), but it becomes dark relative to the gray bars as ρ increases (white ribbon).
As the receptive field size grows, the dark spot becomes more prominent (see Figures 5B to 5E). The cross sections of Figures 5A to 5E are shown in Figure 5G. Figure 5F shows the bar-disc contrast C(t) over different receptive field sizes (ρ), where a sudden drop in contrast can be observed around ρ = 6. Note that at this point, C(t) is still above 1.0, suggesting that the disc in the center is brighter than the gray bars. However, C(t) is not an absolute measure of perceived brightness, since it does not take into account the vividly bright halo around the disc (already visible in Figure 5B). Thus, our interpretation that a low C(t) (close to 1.0 or below) means a perceived dark spot may be reasonable.
In sum, these results could be an explanation as to why there is no scintillating effect in Figure 2. In the original configuration, the peripheral receptive fields were large enough to give rise to the dark spot; however, in the new configuration, they are not large enough, and thus no dark spot can be perceived.

5.2 Experiment 2: Perceived Brightness as a Function of Time. In this experiment, the response of the model at different time steps was measured.
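In terms of the hypothetical pipeline sketched earlier, this experiment amounts to sweeping the time parameter and tracking the contrast:

    # Time course of the perceived strength, reusing the hypothetical
    # contrast() sketch above (rho = 6 simulates the periphery).
    import numpy as np
    ts = np.linspace(0.01, 4.0, 40)
    S = [contrast(t, rho=6) - contrast(0.0, rho=6) for t in ts]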
Figure 6: Response at various time points. The response of the model to an isolated scintillating grid element is shown over time (panels A to E: t = 0.01, 0.1, 0.8, 1.6, and 10.0). The parameters used for this simulation were receptive field size ρ = 6 (representing the periphery), k = 3, and τ = 0.3. The plots demonstrate a single blinking effect of the white disc. (A) In the beginning, when the self-inhibition rate is small, the illusory dark spot can be seen in the central disc (K_s(t) = 0.0495). (B–E) As time goes on, the illusory dark spot disappears as the self-inhibition rate increases: K_s(t) = 0.4521, K_s(t) = 2.4107, K_s(t) = 3.2140, and K_s(t) = 3, respectively. (F) The relative brightness level of the central disc compared to the gray bar, C(t), is shown (see equation 4.2). The results demonstrate an increase in the relative perceived brightness of the center disc as time progresses. (G) The normalized responses of the horizontal cross sections of A to E are shown. Normalization was done as described in Figure 5G. In the beginning (t = 0.01), the disc region in the middle is almost level with the flanking gray bar region (white ribbon near the bottom). However, as time goes on, the plateau in the middle rises, signifying that the disc in the center is becoming perceived as brighter.
In Figures 6A to 6E, five snapshots are shown. In the beginning, the dark spot can clearly be observed in the center of the disc, but as time goes on, it gradually becomes brighter. Figure 6F plots the relative brightness of the disc compared to the bars as a function of time, which shows a rapid increase to a steady state. Such a transition from dark to bright corresponds to a single scintillation. (Note that the opposite effect, bright to dark, is achieved by the refreshing of the neurons via saccades.) Figure 6G shows the actual response level in a horizontal cross section of the response matrix shown in Figures 6A to 6E. Initially, the response to the disc area, shown as the sunken plateau in the middle, is relatively low compared to that to the gray bars, represented by the flanking areas (bottom trace, white ribbon). However, as time passes, the difference in response between the two areas dramatically increases (top trace, black ribbon). Again, the results show a clear transition from the perception of a dark spot to that of a bright disc.

5.3 Experiment 3: Strength of Scintillation as a Function of Luminance. The strength of perceived illusion can be affected by changes in the luminance of the constituent parts of the scintillating grid, such as the gray bar, the disc, and the dark background (see Figures 7A and 7C; Schrauf et al., 1997). Figures 7B and 7D show the variation in response in our model under such stimulus conditions. Our results show a close similarity to the experimental results by Schrauf et al. (1997). As the luminance of the gray bar increases, the strength of illusion increases, but after reaching about 40% of the disc brightness, the strength gradually declines (see Figure 7B), consistent with experimental results (see Figure 7A). Such a decrease is due to disinhibition, which cannot be explained by DOG (Yu et al., 2004). When the luminance of the disc was increased, the model (see Figure 7D, right) demonstrated a similar increase in the scintillating effect as in the human experiment (see Figure 7C, right). When the disc has a luminance lower than the bar, a Hermann grid illusion occurs (Schrauf et al., 1997). Both the human data (see Figure 7C, left) and the model results (see Figure 7D, left) showed an increase in the Hermann grid effect when the disc became darker. Note that disinhibition plays an important role here, especially for the bar luminance experiments (see Figures 7A and 7B). In standard DOG, which lacks the recurrent inhibitory interaction, the illusory effect would monotonically increase with the increase in the luminance of the gray bars. However, with disinhibition, the increasing illusory effect will reach a critical point followed by a decline. (See section 6 for details.)

5.4 Experiment 4: Strength of Scintillation as a Function of Motion Speed and Presentation Duration. As we have seen above, the scintillating effect has both a spatial and a temporal component. Combining these two may give rise to a more complex effect. Schrauf et al. (2000) demonstrated that such an effect in fact exists. They conducted experiments under
Figure 7: Strength of scintillation under various luminance conditions. (A) Mean rated strength of scintillation in human experiments is shown as a function of bar luminance (cd/m²; Schrauf et al., 1997). (B) Scintillation effects in the model shown as a function of bar brightness in the input. (C) Mean rated strength of scintillation in human experiments is plotted as a function of disc luminance (cd/m²; Schrauf et al., 1997). The plot shows results from two separate experiments: the Hermann grid on the left and the scintillating grid on the right. (D) The Hermann grid and scintillation effects in the model are plotted as functions of disc brightness in the input. Under both conditions, the model results closely resemble those in human experiments. For B and D, the strength of the scintillation effect in the model was calculated as S = C(∞) − C(0), where C(∞) is the steady-state value of C(t) (see equation 4.1). The illusion strength in the Hermann grid portion of D was calculated as S = 1/C(∞) − 1. The reciprocal was used because in the Hermann grid, the intersection is darker than the bars, whereas in the scintillating grid, it is the other way around (the disc is brighter than the bars).
three conditions: (1) smooth pursuit movements executed across a stationary grid (efferent condition), (2) grid motion at an equivalent speed while the eyes are held stationary (afferent condition), and (3) brief exposure of a stationary grid while the eyes remained stationary. For conditions 1 and 2,
Figure 8: Strength of scintillation under varying speed and presentation duration. (A) Mean rated strength of the illusion as a function of the speed of stimulus movement (deg/s; Schrauf et al., 2000). (B) Scintillation effect as a function of the speed of motion v (pixels/s) in the model. The receptive field size was 6, and the strength of scintillation was calculated as S(v⁻¹) = C(v⁻¹) − C(0). (C) Mean rated strength of the illusion as a function of the duration of exposure (s; Schrauf et al., 2000). (D) Scintillation effect as a function of presentation duration t in the model. The receptive field size was 6, and the strength of scintillation was computed as S(t) = C(t) − C(0). In both cases (A–B and C–D), the curves show a very similar trend.
both afferent and efferent motion produced very similar results: the strength of scintillation gradually decreased as the speed of motion increased (see Figure 8A). For condition 3, the strength of illusion abruptly increases, coming to a peak at around 200 ms, and then slowly decreases (see Figure 8C). We tested our model under these conditions to verify whether the temporal dynamics induced by self-inhibition can accurately account for the experimental results. First, we tested the model when either the input or the eye was moving (assuming that conditions 1 and 2 above are equivalent). In our experiments,
instead of directly moving the stimulus, we estimated the effect of motion in the following manner. Let v be the corresponding speed of motion, either afferent or efferent. From this, we can calculate the amount of time elapsed before the stimulus (or the eye) moves on to a new location. For a unit distance, the elapsed time t is simply an inverse function of the motion speed v; thus, the effect of illusion can be calculated as S(v⁻¹). Figure 8B shows the results from our model, which closely reflect the experimental results in Figure 8A. Note that we used a single flash because we modeled the input as a single step function (see equation 3.8). So the experiment may have been slightly different from the real one, where grids come in and go out of view, corresponding to multiple step inputs. However, most of the perceived effect could be accounted for under this simplifying assumption, as shown in the results, indicating that the perceived dynamics within a single grid element can play an important role in determining the overall effect under moving conditions. Next, we tested the effect of stimulus flash duration on our model behavior. Figure 8D shows our model's prediction of the brightness as a function of the presentation duration. In this case, given a duration of d, the strength of illusion can be calculated as S(d). The perceived strength initially increases abruptly up to around t = 1.5; then it slowly decreases until it reaches a steady level. Again, the computational results closely reflect those in human experiments (see Figure 8C). The initial increase might be due to the fact that the presentation time is within the time period required for one scintillation, and the slow decline may be due to no new scintillation being produced after the first cycle as the eyes were fixated, so that the overall perception of the scintillating strength declines. Again, this experiment is an approximation of the real condition, but we can make a reasonable assumption that the effect at the end of the input flash duration is what is perceived. For example, consider turning off the stimulus at a different point of time in Figure 6F (which in fact shows a close similarity to Figure 8D). Finally, note that because of the relationship between speed of motion and elapsed time mentioned above, the data presented in Figures 8B and 8D are identical (i.e., from the same simulation), except for the appropriate transformation of the x-axis. This is why we see the small wiggle in the beginning (on the left) in Figure 8B, which corresponds to the peak near t = 1.5 and the slight decrease to a steady state in Figure 8D (on the right). In summary, our model based on disinhibition and self-inhibition was able to accurately replicate experimental data under various temporal conditions.
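Since, for a unit distance, the elapsed time is t = 1/v, the motion conditions can be read off the same simulation. A hypothetical snippet, reusing the `contrast()` sketch above:

    # Motion conditions reduce to elapsed time: for unit distance, speed v
    # leaves t = 1/v at each location, so S(v) = C(1/v) - C(0).
    speeds = [0.5, 1.0, 2.0, 4.0, 8.0]
    S_motion = [contrast(1.0 / v, rho=6) - contrast(0.0, rho=6) for v in speeds]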
self-inhibition are sufficient mechanisms to explain a broad range of spatiotemporal phenomena observed in psychophysical experiments with the scintillating grid. The DOG filter failed to account for the change in the strength of scintillation because it incorporates neither the disinhibition mechanism nor the dynamics of self-inhibition. Disinhibition can effectively reduce the amount of inhibition in the case where there is a large area of bright input (Yu et al., 2004). Therefore, a DOG filter without a disinhibition mechanism cannot explain why the dark illusory spots in the scintillating grid are perceived to be much darker than those in the Hermann grid. The reason is that the DOG filter predicts that the white bars in the Hermann grid should exert stronger inhibition on its intersection than the gray bars in the scintillating grid do on its disc. Thus, according to DOG, the intersection in the Hermann grid should appear darker than that in the scintillating grid, which is contrary to fact. However, with a disinhibition mechanism, since disinhibition is stronger in the Hermann grid than in the scintillating grid (because the bars are brighter in the Hermann grid, there is more disinhibition), the inhibition at the center of the Hermann grid is weaker than that in the scintillating grid. Thus, due to disinhibition, the center appears brighter (because of weaker inhibition) in the Hermann grid than in the scintillating grid. Regarding the issue of dynamics, the lack of a self-inhibition mechanism in the DOG filter causes it to fail to explain the temporal properties of the scintillation. There are certain issues with our model that may require further discussion. In our simulations, we used a step input with an abrupt stimulus onset. Under usual viewing conditions, the scintillating grid as a whole is presented, and the scintillating effect is generated when the gaze moves around. All the while, the input is continuously present, without any discontinuous stimulus onset. Thus, the difference in the mode of stimulus presentation could be a potential issue. However, as Schrauf et al. (2000) observed, what causes the scintillation effect is not the saccadic eye movement per se but the transient stimulation that the movement brings about. Thus, such transient stimulation can be modeled as a step input, and the results of our model may well be an accurate reflection of the real phenomena. Another concern is the way we measured the strength of the scintillation effect in the model. In our model, we were mostly concerned with the change in the perceived brightness of the disc over time (see equation 4.1), whereas in psychophysical experiments, other measures of the effect have been used, such as the perceived number of scintillating dark spots (Schrauf et al., 2000). However, one observation is that the refresh rate of the stimulus depends on the number of saccades in a given amount of time. Considering that a single saccade triggers an abrupt stimulus onset, we can model multiple saccades as a series of step inputs in our simulations. Since our model perceives one scintillation per stimulus onset, the frequency of flickering reported in the model can be modulated exactly by changing the number of stimulus onsets in our simulations. A related issue is the use
of a single grid element (instead of a whole array) in our experiments. It may seem that the scintillation effect would require at least a small array (say, 2 × 2) of grid elements. However, as McAnany and Levine (2004) have shown, even a single grid element can elicit the scintillating effect quite robustly; thus, the stimulus condition in our simulations may be sufficient to model the target phenomenon. The model also makes a couple of interesting predictions (both brought to our attention by Rufin VanRullen). The first prediction is that a scintillating effect will occur only in an annular region of the visual field surrounding the fixation point, where the size of the receptive field matches the grid element size. However, this does not seem to be the case under usual viewing conditions. Our explanation for this apparent shortcoming of the model is that the usual scintillating grid image is not large enough to extend beyond the outer boundary of the annular region. Our explanation can be tested in two ways: measure the strength of the illusion with (1) a very large scintillating grid image in which the grid element size remains the same, or (2) a usual-sized image with a reduced grid element size (similar in manner to the input scaling approach of Raninen & Rovamo, 1987). We expect that the annular region will become visible in both cases, with no scintillating effect observed beyond its outer boundary. The second prediction is that the scintillation would be synchronous, because the neurons responding to each scintillating grid element follow the same time course. Again, this is quite different from our perceived experience, which is more asynchronous. In our observation, the asynchronicity is largely due to the random nature of eye movement. If that is true, the scintillating effect should become synchronous when eye movement is suppressed. That is, if we fixate on one location of the scintillating grid while the stimulus is turned on and off periodically (or, alternatively, we can blink our eyes to simulate this), all illusory dark spots should seem to appear at the same time, in a synchronous manner. (In fact, this seems to be the case in our preliminary experiments.) Then why is our experience asynchronous? The reason that we perceive the scintillation to be asynchronous may be that when we move our gaze from point X to point Y in a long saccade, first the region surrounding X and then, later, the region surrounding Y scintillates. This gives the impression that the scintillating effect is asynchronous. In sum, the predictions of the model are expected to be consistent with experiments under conditions similar to those in the computational simulations. Further psychophysical experiments may have to be conducted to test the model predictions more rigorously. Besides the technical issues discussed above, more fundamental questions need to be addressed. Our model was largely motivated by the pioneering work of Hartline et al. in the late 1950s. However, the animal model they used was Limulus, an invertebrate with compound eyes; thus, the applicability of our extended model to human visual phenomena may
be questionable. However, disinhibition and self-inhibition, the two main mechanisms in Limulus, have also been found in mammals and other vertebrates. Mathematically, the recurrent inhibitory influence in the disinhibition mechanism and the self-inhibitory feedback are the same in both Limulus and mammals. Therefore, our model based on Limulus may generalize to human vision. Finally, an important question is whether our bottom-up model accounts for the full range of phenomena in the scintillating grid illusion. Why should the scintillating effect originate only from such a low level in the visual pathway? In fact, recent experiments have shown that part of the scintillating effect can arise from top-down, covert attention (VanRullen & Dong, 2003). The implication of VanRullen and Dong's study is that although the scintillation effect can originate in the retina, it can be modulated by later stages in the visual hierarchy. This is somewhat expected, because researchers have found that receptive field properties (which may include size) can be effectively modulated by attention (for a review, see Gilbert, Ito, Kapadia, & Westheimer, 2000). It is unclear how exactly such a mechanism can affect brightness-contrast phenomena that depend on receptive field size at such a low level; this may require further investigation. Schrauf and Spillmann (2000) also pointed out a possible involvement of a later stage by studying the illusion in stereo depth. But as they admitted, the major component of the illusion may be retinal in origin. Regardless of these issues, modeling spatiotemporal properties at the retinal level may be worthwhile, serving as a firm initial stepping-stone on which a more complete theory can be constructed.
7 Conclusion

In this letter, we presented a neural model of the scintillating grid illusion based on disinhibition and self-inhibition in early vision. The two mechanisms, inspired by neurophysiology, were found to be sufficient to explain the multifaceted spatiotemporal properties of the modeled phenomena. We expect that our model can be extended to account for the latest results that indicate a higher-level involvement in the illusion, such as that of attention.
Appendix: Derivation of Self-Inhibition Rate K_s(t)

The exact formula for K_s(t) can be derived as follows:

K_s(t) = y(t) / r(t),    (A.1)
where y(t) is the amount of self-inhibition at time t and r(t) the response at time t for this cell. We know that the Laplace transform r(s) of r(t) has the following property:

r(s) s = e(s) − r(s) T_s(s),    (A.2)

T_s(s) = k / (1 + τs),    (A.3)

where T_s(s) is a transfer function, k the maximum value K_s(t) can reach, τ the time constant, and e(s) the Laplace transform of the step input to this cell:

e(s) = 1/s.    (A.4)

By rearranging equation A.2, we can solve for r(s) to obtain

r(s) = e(s) · 1/(s + T_s).    (A.5)
Therefore, r(t) can be treated as the step input function e(t) convolved with an impulse response function,

r(t) = e(t) ∗ f(t),    (A.6)

where ∗ is the convolution operator, and

f(t) = L^{−1}[1/(s + T_s)],    (A.7)

where L^{−1} is the inverse Laplace transform operator. Solving equation A.7, we get f(t) as a superposition of two exponential functions:

f(t) = (1/C) [C_1 e^{C_2 t/τ} + C_2 e^{−C_1 t/τ}],    (A.8)

where C = √(1 − 4τk), C_1 = (C + 1)/2, and C_2 = (C − 1)/2. The function y(t) can be obtained in a similar manner, as shown above:

y(s) = r(s) T_s(s).    (A.9)
Figure 9: Impulse response functions f(t) and g(t).
By substituting r(s) with the right-hand side of equation A.5, we have

y(s) = e(s) · T_s/(s + T_s).    (A.10)

Therefore, y(t) can also be treated as the step input function e(t) convolved with an impulse response function g(t) in the time domain:

y(t) = e(t) ∗ g(t),    (A.11)

where g(t) is a sine-modulated, exponentially decaying function:

g(t) = L^{−1}[T_s/(s + T_s)] = 6√5 e^{−5t} sin(√5 t).    (A.12)

Hence, the final form of K_s(t) can be calculated as a quotient of two convolutions:

K_s(t) = [e(t) ∗ g(t)] / [e(t) ∗ f(t)].    (A.13)
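For concreteness, equation A.13 can be evaluated numerically with discrete convolutions. The minimal sketch below assumes τ = 0.1 and k = 3, which are our inference from the constants in equation A.12 (they reproduce the 6√5 e^{−5t} sin(√5 t) time course), not values stated explicitly in the text:

```python
# Numerical sketch of equation A.13, K_s(t) = (e*g)/(e*f).
import numpy as np

dt = 1e-3
t = np.arange(0.0, 3.0, dt)
tau, k = 0.1, 3.0                              # assumed, inferred from eq. A.12

C = np.sqrt(complex(1.0 - 4.0 * tau * k))      # imaginary here, so f oscillates
C1, C2 = (C + 1.0) / 2.0, (C - 1.0) / 2.0

e = np.ones_like(t)                                                # step input e(t)
f = ((C1 * np.exp(C2 * t / tau) + C2 * np.exp(-C1 * t / tau)) / C).real  # eq. A.8
g = 6 * np.sqrt(5) * np.exp(-5 * t) * np.sin(np.sqrt(5) * t)             # eq. A.12

conv = lambda a, b: np.convolve(a, b)[: t.size] * dt
K_s = conv(e, g) / conv(e, f)      # equation A.13; K_s(t) approaches k for large t
```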
Figure 9 shows the impulse response functions f(t) and g(t). The exact formula in equation 3.13 was derived based on the above derivation.

Acknowledgments

We thank Takashi Yamauchi, Rufin VanRullen, and an anonymous reviewer for helpful discussions and Jyh-Charn Liu for his support. This research
was funded in part by the Texas Higher Education Coordinating Board ATP grant 000512-0217-2001 and by the National Institute of Mental Health Human Brain Project, grant 1R01-MH66991. A preliminary version of the material presented here appeared in Yu and Choe (2004b) as an abstract.

References

Berry II, M. J., Brivanlou, I. H., Jordan, T. A., & Meister, M. (1999). Anticipation of moving stimuli by the retina. Nature, 398, 334–338.
Brodie, S., Knight, B. W., & Ratliff, F. (1978). The spatiotemporal transfer function of the Limulus lateral eye. Journal of General Physiology, 72, 167–202.
Cai, D., DeAngelis, G. C., & Freeman, R. D. (1997). Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78, 1045–1061.
Crevier, D. W., & Meister, M. (1998). Synchronous period-doubling in flicker vision of salamander and man. Journal of Neurophysiology, 79, 1869–1878.
Frech, M. J., Perez-Leon, J., Wassle, H., & Backus, K. H. (2001). Characterization of the spontaneous synaptic activity of amacrine cells in the mouse retina. Journal of Neurophysiology, 86, 1632–1643.
Gerrits, H. J., & Vendrik, A. J. (1970). Simultaneous contrast, filling-in process and information processing in man's visual system. Experimental Brain Research, 26, 411–430.
Gilbert, C., Ito, M., Kapadia, M., & Westheimer, G. (2000). Interactions between attention, context and learning in primary visual cortex. Vision Research, 40, 1217–1226.
Hartline, H. K., & Ratliff, F. (1957). Inhibitory interaction of receptor units in the eye of Limulus. Journal of General Physiology, 40, 357–376.
Hartline, H. K., & Ratliff, F. (1958). Spatial summation of inhibitory influences in the eye of Limulus, and the mutual interaction of receptor units. Journal of General Physiology, 41, 1049–1066.
Hartline, H. K., Wagner, H., & Ratliff, F. (1956). Inhibition in the eye of Limulus. Journal of General Physiology, 39, 651–673.
Hartveit, E. (1999). Reciprocal synaptic interactions between rod bipolar cells and amacrine cells in the rat retina. Journal of Neurophysiology, 81, 2932–2936.
Jacobs, A. L., & Werblin, F. S. (1998). Spatiotemporal patterns at the retinal output. Journal of Neurophysiology, 80, 447–451.
Kitaoka, A. (2003). Trick eyes 2. Tokyo: Kanzen.
Kolb, H., & Nelson, R. (1993). Off-alpha and off-beta ganglion cells in the cat retina. Journal of Comparative Neurology, 329, 85–110.
Li, C. Y., Zhou, Y. X., Pei, X., Qiu, F. T., Tang, C. Q., & Xu, X. Z. (1992). Extensive disinhibitory region beyond the classical receptive field of cat retinal ganglion cells. Vision Research, 32, 219–228.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London B, 207, 187–217.
McAnany, J. J., & Levine, M. W. (2004). The blanking phenomenon: A novel form of visual disappearance. Vision Research, 44, 993–1001.
Neumann, H., Pessoa, L., & Hansen, T. (1999). Interaction of ON and OFF pathways for visual contrast measurement. Biological Cybernetics, 81, 515–532.
Nirenberg, S., & Meister, M. (1997). The light response of retinal ganglion cells is truncated by a displaced amacrine circuit. Neuron, 18, 637–650.
Raninen, A., & Rovamo, J. (1987). Retinal ganglion-cell density and receptive-field size as determinants of photopic flicker sensitivity across the human visual field. Journal of the Optical Society of America A, 4, 1620–1626.
Ratliff, F., Hartline, H. K., & Miller, W. H. (1963). Spatial and temporal aspects of retinal inhibitory interaction. Journal of the Optical Society of America, 53, 110–120.
Roska, B., Nemeth, E., & Werblin, F. (1998). Response to change is facilitated by a three-neuron disinhibitory pathway in the tiger salamander retina. Journal of Neuroscience, 18, 3451–3459.
Schrauf, M., Lingelbach, B., & Wist, E. R. (1997). The scintillating grid illusion. Vision Research, 37, 1033–1038.
Schrauf, M., & Spillmann, L. (2000). The scintillating grid illusion in stereo depth. Vision Research, 40, 717–721.
Schrauf, M., Wist, E. R., & Ehrenstein, W. H. (2000). The scintillating grid illusion during smooth pursuit, stimulus motion, and brief exposure in humans. Neuroscience Letters, 284, 126–128.
Spillmann, L. (1994). The Hermann grid illusion: A tool for studying human perceptive field organization. Perception, 23, 691–708.
Stevens, C. F. (1964). A quantitative theory of neural interactions: Theoretical and experimental investigations. Unpublished doctoral dissertation, Rockefeller Institute.
VanRullen, R., & Dong, T. (2003). Attention and scintillation. Vision Research, 43, 2191–2196.
Victor, J. (1987). The dynamics of the cat retina X cell centre. Journal of Physiology, 386, 219–246.
Yu, Y., & Choe, Y. (2004a). Angular disinhibition effect in a modified Poggendorff illusion. In K. D. Forbus, D. Gentner, & T. Regier (Eds.), Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 1500–1505). Mahwah, NJ: Erlbaum.
Yu, Y., & Choe, Y. (2004b). Explaining the scintillating grid illusion using disinhibition and self-inhibition in the early visual pathway. In Society for Neuroscience Abstracts. Program No. 301.10. Washington, DC: Society for Neuroscience.
Yu, Y., Yamauchi, T., & Choe, Y. (2004). Explaining low-level brightness-contrast illusions using disinhibition. In A. J. Ijspeert, M. Murata, & N. Wakamiya (Eds.), Biologically inspired approaches to advanced information technology (BioADIT 2004) (pp. 166–175). Berlin: Springer.
Received February 18, 2005; accepted September 8, 2005.
LETTER
Communicated by Jonathan Victor
A Comparison of Descriptive Models of a Single Spike Train by Information-Geometric Measure

Hiroyuki Nakahara
[email protected].
Shun-ichi Amari
[email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama, 351-0198 Japan
Barry J. Richmond
[email protected] Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20892, U.S.A.
In examining spike trains, different models are used to describe their structure. The different models often seem quite similar, but because they are cast in different formalisms, it is often difficult to compare their predictions. Here we use the information-geometric measure, an orthogonal coordinate representation of point processes, to express different models of stochastic point processes in a common coordinate system. Within such a framework, it becomes straightforward to visualize higher-order correlations of different models and thereby assess the differences between models. We apply the information-geometric measure to compare two similar but not identical models of neuronal spike trains: the inhomogeneous Markov and the mixture of Poisson models. It is shown that they differ in the second- and higher-order interaction terms. In the mixture of Poisson model, the second- and higher-order interactions are of comparable magnitude within each order, whereas in the inhomogeneous Markov model, they have alternating signs over different orders. This provides guidance about what measurements would effectively separate the two models. As newer models are proposed, they also can be compared to these models using information geometry.

1 Introduction

Over the past two decades, studies of the information-carrying properties of neuronal spike trains have intensified and become more sophisticated. Many earlier studies of neuronal spike trains concentrated mainly on using general methods to reduce the dimensionality of the description. Recently, however, specific models have been developed to incorporate findings from

Neural Computation 18, 545–568 (2006)
© 2006 Massachusetts Institute of Technology
both experimental and theoretical biophysical data (Dean, 1981; Richmond & Optican, 1987; Abeles, 1991; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reid, Victor, & Shapley, 1992; Softky & Koch, 1993; Shadlen & Newsome, 1994; Victor & Purpura, 1997; Stevens & Zador, 1998; Oram, Wiener, Lestienne, & Richmond, 1999; Shinomoto, Sakai, & Funahashi, 1999; Meister & Berry, 1999; Baker & Lemon, 2000; Kass & Ventura, 2001; Reich, Mechler, & Victor, 2001; Brown, Barbieri, Ventura, Kass, & Frank, 2002; Wiener & Richmond, 2003; Beggs & Plenz, 2003; Fellous, Tiesinga, Thomas, & Sejnowski, 2004). These newer approaches posit specific structures that give rise to the spike trains in experimental data. Because these models have specific structures, fitting them translates into estimating the parameters of the model rather than using general approaches to dimensionality reduction. There are several benefits to these more descriptive models. First, all of the approaches describe data succinctly. Second, the more principled models make their assumptions explicit; they declare which properties of the data are considered important. Third, parametric models have practical value for data analysis because the parameter values of a model can often be estimated reasonably well even with the limited number of samples that can be obtained in experiments. When considering different models, it is natural to ask which model is really good or, perhaps more to the point, what we learn about the system from different models. In what ways are the models equivalent or different? If the differences can be seen explicitly, experiments can be designed to evaluate features that distinguish the models. A powerful approach for distinguishing them is to project them into a single coordinate frame, especially an orthogonal one. The information-geometric measure (IG) (Amari, 2001; Nakahara & Amari, 2002b) provides an orthogonal coordinate system in which to make such projections for models of point processes. Using the IG measure, we consider a probability space, where each point in the space corresponds to a probability distribution. Estimating the probability distribution of the spike train from experimental data corresponds to identifying the location and spread of a point in this space. In this context, different assumptions underlying different spike train models correspond to different search constraints (see Figure 1). Once different models are re-represented in a common coordinate system, one can compare the regions of the probability space that can be occupied by the different models, with the overlapping regions representing properties that the models have in common and nonoverlapping regions representing properties that are unique to a particular model. Here we use the IG measure to compare two stochastic models of spike trains: the inhomogeneous Markov (IM) (Kass & Ventura, 2001) and the mixture of Poisson (MP) models (Wiener & Richmond, 2003). Experimentally recorded spike trains generally depart from Poisson statistics (see section 2). The variance-to-mean relation is seldom one, the interval distribution is often not exponential, and the count distribution in a period is
Figure 1: Schematic drawing to show that each parametric model for spike train description is embedded as a subspace in a full probability space. Given a parametric model, estimating parameter values (or a probability distribution) from experimental data corresponds to identifying a point in the subspace.
often not Poisson. Recently, both IM and MP models have been proposed to deal with these deviations. Both treat the spike trains as point processes, but they emphasize different properties. The intersection of the two models is an inhomogeneous Poisson process. The IM model emphasizes the non-Poisson nature of interval distributions, whereas the MP model emphasizes the non-Poisson nature of the spike count distribution (see Figure 2 and section 3). To compare the two models, we represent them in a common coordinate system, according to the methods of the IG measure (see section 4). The second- and higher-order statistics of the models can thus be characterized (see section 5) in a form suitable for comparison and then used to discriminate the models.

2 Preliminaries

Consider an inhomogeneous Poisson process. For a spike train of a single neuron, consider a time period of N bins, where each bin is so short that it can
Figure 2: Schematic drawing to indicate how raw experimental data are converted to estimates of the parameter values of the two different models. The raw data, or raster data (top), can be converted to different formats: K_{i,j}, PSTH, and spike count histogram (middle). For the MIM model, K_{i,j} = a_{j−i} and PSTH data are used, whereas for the MP model, the PSTH and the spike count histogram are used. Only the MIM model case (not the IM model) is shown here, for simplicity of presentation.
have at most a single spike. Each neuronal spike train is then represented by a binary random vector variable. Let X^N = (X_1, . . . , X_N) be N binary random variables, and let p(x^N) = P[X^N = x^N], x^N = (x_1, . . . , x_N), x_i = 0, 1, be its probability, where p(x^N) > 0 is assumed for all x^N. Each X_i indicates
a spike in the ith bin by X_i = 1, or no spike by X_i = 0. With this notation, the inhomogeneous Poisson process is given by

p(x^N) = ∏_{i}^{N} η_i^{x_i} (1 − η_i)^{1−x_i},

where η_i = E[x_i]. The probability of a spike occurrence in a bin is independent of those in the other bins:

p(x^N) = ∏_{i}^{N} p(x_i),  where p(x_i) = η_i^{x_i} (1 − η_i)^{1−x_i}.
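As a quick illustration, the binned process above is easy to simulate: each bin is an independent Bernoulli draw with probability η_i. The rate profile below is arbitrary, chosen purely for demonstration:

```python
# Sketch of the binned independent-Bernoulli (inhomogeneous Poisson) model.
import numpy as np

rng = np.random.default_rng(0)
N = 200                                                   # number of bins
eta = 0.05 * (1 + np.sin(np.linspace(0, 2 * np.pi, N)))   # eta_i = E[x_i]

trials = rng.random((1000, N)) < eta                      # x^N for 1000 trials
counts = trials.sum(axis=1)                               # spike count Y per trial
print(counts.mean(), eta.sum())                           # both estimate E[Y] = sum_i eta_i
```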
Then (η_1, . . . , η_N), or the peristimulus time histogram (PSTH), obtained from experimental data is sufficient to estimate the parameter values of this model. In this condition, the experimental data analysis is simple, and spike generation from this model is easy. This independence property leads to well-known facts: count statistics obey the Poisson distribution, and interval statistics obey the exponential distribution (the two facts are exact for the homogeneous Poisson process and asymptotically exact for the inhomogeneous one). Its simplicity makes the Poisson model a popular choice as a descriptive model. Experimental findings often suggest that the empirical probability distributions of spike counts and intervals are close to the Poisson and exponential distributions, respectively, but frequently they depart from these as well (Dean, 1981; Tolhurst, Movshon, & Dean, 1983; Gawne & Richmond, 1993; Gershon, Wiener, Latham, & Richmond, 1998; Lee, Port, Kruse, & Georgopoulos, 1998; Stevens & Zador, 1998; Maynard et al., 1999; Oram et al., 1999). These findings led to studies that considered a larger class of models, including the IM and MP models (see section 3). The Poisson process occupies only a small subspace of the original space of all probability distributions p(x^N). The number of all possible spike patterns is 2^N, since X^N ∈ {0, 1}^N. Therefore, each p(x^N) is given by the 2^N probabilities p_{i_1...i_N} = Prob{X_1 = i_1, . . . , X_N = i_N}, i_k = 0, 1, subject to ∑_{i_1,...,i_N} p_{i_1...i_N} = 1. The set of all possible probability distributions {p(x)} forms a (2^N − 1)-dimensional manifold S_N. To represent a point in S_N, that is, a probability distribution p(x^N), one simple coordinate system, called the P-coordinate system, is given by {p_{i_1...i_N}} above, where {p_{i_1...i_N}} corresponds to the 2^N probabilities. Every set of values has 2^N probabilities, and each corresponds to a specific probability distribution p(x^N). Since {p_{i_1...i_N}} sums to one, the effective coordinate dimension is 2^N − 1 (instead of 2^N). For the IG measure, we use two other coordinate systems. The first is the θ-coordinate system explained in section 4, and the second is the
η-coordinate system, given by the expectation parameters

η_i = E[x_i] = Prob{x_i = 1},  i = 1, . . . , N,
η_{ij} = E[x_i x_j] = Prob{x_i = x_j = 1},  i < j,
η_{i_1 i_2 ... i_l} = E[x_{i_1} ⋯ x_{i_l}] = Prob{x_{i_1} = x_{i_2} = · · · = x_{i_l} = 1},  i_1 < i_2 < · · · < i_l.

All η_i, η_{ij}, η_{ijk}, and so on, together have 2^N − 1 components; that is, η = (η_1, η_2, . . . , η_N) = (η_i, η_{ij}, η_{ijk}, . . . , η_{12...N}) has 2^N − 1 components, where we write η_1 = (η_i), η_2 = (η_{ij}), and so on, forming the η-coordinate system in S_N. Any probability distribution of X^N can be completely represented by P- or η-coordinates if and only if all of the coordinates are used. If a model of the probability distribution of X^N has fewer than 2^N − 1 parameters (this is usually the case), the probability space in which the model lies is restricted. Since the Poisson process uses η_1 as its coordinates, it lies in a much smaller subspace than the full space S_N. The components of the η-coordinates, η_1 = (η_i) = (η_1, η_2, . . . , η_N) (i = 1, . . . , N), are familiar, since they correspond to the PSTH in experimental data analysis. They are taken to represent the time course expectation of the neural firing, expressed as the probability of a spike at each time or as the firing frequency. Below, we freely exchange the PSTH and η_1 for simplicity. However, we note the difference between η_1 and the probability density of firing, since the PSTH is often regarded as the latter as well. The probability density is recalculated if the bin size changes, so that it is invariant under changes of the bin size, and has units of spike count per unit (infinitesimal) time. The firing frequency would become the probability density as the bin size approaches zero (or the time resolution approaches infinity). In data analysis, the firing frequency is then taken to be the empirically measured density. In contrast, each element of η_1, that is, η_i, is the probability of a bin, P[X_i = 1], not a probability density. For η_i, it is assumed that each bin contains at most one spike. In practical data analysis, it can thus be regarded as the density, as long as the bin size is small enough. In general, though, we must be aware of this difference and carefully translate between the two. How large to make the bins is an important question but beyond the scope of this study. Others may express concern about events that are not found to occur in the data, that is, if zero probability must be assigned to an event. In such cases, we can, in principle, reconstruct a probability model from which the events of zero probability are omitted or impose some assumptions on those zero probabilities, which seems more useful in practice (see Nakahara & Amari, 2002b).
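A minimal sketch of how the low-order η-coordinates can be estimated from repeated trials as empirical averages (the array shapes and the synthetic data below are illustrative only):

```python
# Empirical eta-coordinates of first and second order from binary spike trains.
import numpy as np

def eta_coordinates(x):
    """x: (trials, N) binary array; returns estimates of eta_i and eta_ij."""
    eta1 = x.mean(axis=0)                                 # eta_i = P[x_i = 1]
    eta2 = (x[:, :, None] * x[:, None, :]).mean(axis=0)   # eta_ij (use i < j)
    return eta1, eta2

rng = np.random.default_rng(1)
x = (rng.random((5000, 10)) < 0.2).astype(float)
eta1, eta2 = eta_coordinates(x)
# For independent bins, eta_ij is approximately eta_i * eta_j for i != j.
```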
To simplify the notation, we sometimes write X_i = 0 and X_i = 1 as x_i^0 and x_i^1, respectively—for example, x^N = (X_1 = 0, X_2 = 0, X_3 = 1, . . . , X_N = 1) = (x_1^0, x_2^0, x_3^1, . . . , x_N^1). The notation p_{(ijk)}, and so on, is used to define

p_{(i_1 i_2 ... i_k)} = P[x_1^0, . . . , x_{i_1}^1, . . . , x_{i_2}^1, . . . , x_{i_k}^1, . . . , x_N^0].

We also use p_{(0)} = P[x_1^0, . . . , x_N^0]. The cardinality of a spike train, that is, the number of spikes or the spike count, is

Y ≡ |X^N| = #{X_i = 1}.
Given a specific spike train x^N, n is reserved to indicate n = |x^N|, and s(1), s(2), . . . , s(n) denote the specific timings of the spikes, that is, the set of indices having x_i = 1. For example, with this notation, we write

x^N = (x_1^0, . . . , x_{s(1)−1}^0, x_{s(1)}^1, x_{s(1)+1}^0, . . . , x_{s(2)}^1, . . . , x_{s(3)}^1, x_{s(3)+1}^0, . . . , x_{s(n)}^1, x_{s(n)+1}^0, . . . , x_N^0).

3 Two Parametric Models for Single Spike Trains

Here we present the original formulations of the inhomogeneous Markov (IM) (Kass & Ventura, 2001) and the mixture of Poisson (MP) models (Wiener & Richmond, 2003).

3.1 Inhomogeneous Markov Model. The IM model was developed as a tractable class of spike interval probability distributions to account for observations that spike firing over bins is not completely independent and thus departs from the Poisson process (Kass & Ventura, 2001; Ventura, Carta, Kass, Gettner, & Olson, 2002). The inhomogeneous Markov assumption is the key to the IM model and assumes the following equality: for any spike event x_{s(l)}^1 (l ≤ n),

P[x_{s(l)}^1 | x_1, . . . , x_{s(l)−1}] = P[x_{s(l)}^1 | x_{s(l−1)}^1, x_{s(l−1)+1}^0, . . . , x_{s(l)−1}^0].

The probability of firing at a time t given its past, which is the left side of the equation, depends only on t and the time of the last spike. Denote the right-hand side by K̃_{s(l−1),s(l)}, as

K̃_{s(l−1),s(l)} = P[x_{s(l)}^1 | x_{s(l−1)}^1, x_{s(l−1)+1}^0, . . . , x_{s(l)−1}^0],

where l = 2, . . . , n (potentially n is up to N).
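To make the conditional probability concrete, the following is a hedged sketch of how K̃_{i,j} could be estimated empirically from repeated trials (0-based indices; this simple counting estimator is our illustration, not the fitting procedure of Kass & Ventura, 2001):

```python
# Empirical estimate of Ktilde_{i,j}: the probability of a spike in bin j,
# given that the last spike occurred in bin i (silence in between).
import numpy as np

def estimate_ktilde(x):
    """x: (trials, N) binary array; returns Kt[i, j] for i < j (NaN if unseen)."""
    T, N = x.shape
    Kt = np.full((N, N), np.nan)
    for i in range(N):
        for j in range(i + 1, N):
            silent = x[:, i + 1 : j].sum(axis=1) == 0   # no spikes strictly between i and j
            cond = (x[:, i] == 1) & silent               # last spike was at bin i
            if cond.sum() > 0:
                Kt[i, j] = x[cond, j].mean()
    return Kt
```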
If K̃_{s(l−1),s(l)} = η_{s(l)} = P[x_{s(l)}^1], this is an inhomogeneous Poisson process. By explicitly including the parameters {K̃_{s(l−1),s(l)}}, the IM model enlarges the class of probabilities beyond the Poisson process. The original parameters of the IM model are given by {η_i, K̃_{i,j}} (i, j = 1, . . . , N, i < j). After some calculations, we obtain the following:
Proposition 1. Given the original parameters {η_i, K̃_{i,j}} (i, j = 1, . . . , N, i < j), the probability of any spike train under the IM model is given by

P_IM(x^N) = [∏_{l=1}^{s(1)−1} (1 − η_l)] η_{s(1)} [∏_{l=2}^{n} K̃_{s(l−1),s(l)}] ∏_{l=1}^{n} ∏_{k=1}^{s(l+1)−s(l)−1} (1 − K̃_{s(l),s(l)+k}),    (3.1)

where (and hereafter) P_IM is used to denote a probability distribution of the IM model, and the convention s(n + 1) = N + 1 is introduced without loss of generality. (See appendix A for the proof.)
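Equation 3.1 translates directly into code. The sketch below evaluates P_IM for a given binary train; it uses 0-based indices and a dense matrix Kt purely for illustration:

```python
# Probability of a spike train under the IM model (equation 3.1).
import numpy as np

def p_im(x, eta, Kt):
    """x: binary vector of length N; eta: length N; Kt[i, j] with i < j (0-based)."""
    N = len(x)
    s = np.flatnonzero(x)                        # spike positions
    if s.size == 0:
        return np.prod(1.0 - eta)                # all-silent train
    p = np.prod(1.0 - eta[: s[0]]) * eta[s[0]]   # silence before the first spike
    for a, b in zip(s[:-1], s[1:]):              # consecutive spikes s(l-1), s(l)
        p *= Kt[a, b]
    bounds = np.append(s, N)                     # convention s(n+1) = N+1
    for a, b in zip(s, bounds[1:]):              # silent bins after each spike
        p *= np.prod([1.0 - Kt[a, a + m] for m in range(1, b - a)])
    return p
```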
Equation 3.1 states that the probability of any spike train p(x^N) depends only on η_i and K̃_{i,j} under the IM model. Thus, {η_i, K̃_{i,j}} (i, j = 1, . . . , N, i < j) is one coordinate system for the IM model. Another coordinate system,

{η_i, K_{i,j}}  (i, j = 1, . . . , N, i < j),

can be introduced by defining K_{s(l−1),s(l)} such that K̃_{s(l−1),s(l)} = η_{s(l)} K_{s(l−1),s(l)} (provided that η_i > 0). The number of parameters, or the IM model dimensionality, is N + N(N − 1)/2. A subclass of the IM model, called the multiplicative inhomogeneous Markov (MIM) model, has also been proposed (Kass & Ventura, 2001). In addition to the IM assumption, they assumed another constraint on K_{i,j}, given by

a_{j−i} ≡ K_{i,j}.    (3.2)

Note that this assumption is not equivalent to assuming K̃_{i,j} = K̃_{i′,j′}, where j − i = j′ − i′. The assumption further constrains the probability subspace available to the model (see Figure 2). The dimensionality of the MIM model is 2N − 1. The MIM model is easily expressed by substituting K̃_{i,j} = η_j a_{j−i} in equation 3.1.
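Continuing the sketch after proposition 1, the MIM constraint can be imposed by constructing K̃ from the 2N − 1 parameters (η_i and a_r) via K̃_{i,j} = η_j a_{j−i} and reusing the p_im function above; the numerical values here are again illustrative:

```python
# MIM special case: Ktilde_{i,j} = eta_j * a_{j-i}.
import numpy as np

N = 50
eta = np.full(N, 0.1)
a = np.ones(N)                  # a_r = 1 recovers the inhomogeneous Poisson case
Kt = np.empty((N, N))
i_idx, j_idx = np.indices((N, N))
mask = j_idx > i_idx
Kt[mask] = eta[j_idx[mask]] * a[(j_idx - i_idx)[mask]]
# p_x = p_im(x, eta, Kt)        # with p_im from the previous sketch
```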
3.2 Mixture of Poisson Model. The MP model was introduced to better account for the spike count statistics of neural spike trains over an interval of interest (Wiener & Richmond, 2003). In many neurophysiological recordings of single neurons, the spike count and its probability distribution are easy to obtain and arguably the most robust measure to estimate. The MP model fits the spike count distribution with a mixture of Poisson distributions instead of a single Poisson distribution. However, the mixture of Poisson distributions itself cannot determine spike train generation without further assumptions. In the original work (Wiener & Richmond, 2003), each trial was drawn from one of the Poisson distributions; the kth component of the Poisson process is chosen with probability π_k in each trial of the experiment, generating the spike train of that trial. Thus, the MP model enjoys the simplicity of the Poisson process for generating spike trains in each trial. Let us write the kth (inhomogeneous) Poisson process P_k as

P_k[X^N = x^N] = ∏_{i}^{N} η_{i,k}^{x_i} (1 − η_{i,k})^{1−x_i},

where we define η_{i,k} = E_k[x_i], and E_k denotes expectation with respect to the probability P_k. The trial-by-trial mixture of the Poisson processes is given by

P[X^N = x^N] = ∑_{k}^{K} π_k P_k[X^N = x^N],

where {π_k} are mixing probabilities with ∑_{k=1}^{K} π_k = 1. The corresponding spike count distribution is the mixture of Poisson distributions, given by P[Y = y] = ∑_{k}^{K} π_k P_k[Y = y]. Here the Poisson distribution of the kth component is given by

P_k[Y = y] = (λ_k^y / y!) e^{−λ_k},  where λ_k = ∑_{i}^{N} η_{i,k}.
An important issue is how to estimate the spike generation of each kth component, that is, {η_{i,k}} (i = 1, . . . , N). Consider first a single Poisson process, for which we pretend to have only a single component in the above formulation. Can we recover {η_{i,1}} from λ_1? We can get η = λ_1/N if the process is homogeneous. If it is inhomogeneous, the solution is not unique: various sets of {η_{i,1}} may match a value of λ_1. In practice, we get {η_{i,1}} by looking at the PSTH from the same experimental data. If there is more than one
component, the PSTH tells us only the left-hand side of the equation below:

η_i = E[X_i] = ∑_{k}^{K} π_k η_{i,k}  (i = 1, . . . , N).

In this general case, to obtain {η_{i,k}} for each kth component, the approach taken by the MP model assumes that the overall shape of the PSTH is the same among all components (Wiener & Richmond, 2003). This assumption implies that for each k, there exists a constant α_k such that η_{i,k} = α_k η_i for any i = 1, . . . , N. By taking the sum with respect to i, the value of α_k is given by

α_k = λ_k / c_1,  where we define c_1 ≡ ∑_{i}^{N} η_i.
The MP model, as a generative model of spike trains, is the trial-by-trial mixture of Poisson processes with this assumption, summarized as follows:

Proposition 2. Given original parameters {π_k, λ_k, η_i} (k = 1, . . . , K; i = 1, . . . , N), the probability distribution of any spike pattern of the MP model is given by

P_MP(x^N) = ∑_{k}^{K} π_k P_k(x^N),

where (and hereafter) P_MP denotes the probability distribution of the MP model and P_k denotes the probability distribution of the kth component,

P_k(x^N) = ∏_{i}^{N} η_{i,k}^{x_i} (1 − η_{i,k})^{1−x_i},    (3.3)

where η_{i,k} is defined by

η_{i,k} = (λ_k / c_1) η_i  (i = 1, . . . , N).

Here, c_1 is a constraint on the model parameters, given by

c_1 = ∑_{i}^{N} η_i = ∑_{i}^{N} ∑_{k}^{K} π_k η_{i,k} = ∑_{k}^{K} π_k λ_k.

Another constraint on the model parameters is ∑_{k}^{K} π_k = 1.
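As a sanity check on proposition 2, the generative process is easy to simulate: draw a component k per trial with probability π_k, then generate an independent-Bernoulli train with η_{i,k} = (λ_k/c_1) η_i. All numerical values below are hypothetical, and λ is rescaled so that the intrinsic constraint c_1 = ∑_k π_k λ_k holds:

```python
# Trial-by-trial sampling from the MP generative model (proposition 2).
import numpy as np

rng = np.random.default_rng(2)
N = 100
eta = np.full(N, 0.05)                    # PSTH-shaped profile (flat here)
c1 = eta.sum()
pi = np.array([0.3, 0.7])                 # mixing probabilities, sum to 1
lam = np.array([2.0, 8.0])                # provisional component means
lam = lam * c1 / (pi @ lam)               # enforce c_1 = sum_k pi_k * lambda_k

def sample_trial():
    k = rng.choice(len(pi), p=pi)                 # pick one component per trial
    eta_k = np.clip(lam[k] / c1 * eta, 0.0, 1.0)  # eta_{i,k} = (lambda_k/c_1) eta_i
    return rng.random(N) < eta_k

counts = np.array([sample_trial().sum() for _ in range(2000)])  # non-Poisson counts
```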
Thus, the dimensionality of the MP model is 2K + N − 2. We note that the constraint on c_1 is intrinsic, that is, a part of, the MP model (see Nakahara, Amari, & Richmond, 2005).

4 Representation of the Two Models by Information-Geometric Measure

4.1 Information-Geometric Measure. Having established the two models, we rewrite them in the IG measure (Amari, 2001; Nakahara & Amari, 2002b). The usefulness of the IG measure was studied earlier for spike data analysis (Nakahara & Amari, 2002a, 2002b; Nakahara, Amari, Tatsuno, Kang, & Kobayashi, 2002; Amari, Nakahara, Wu, & Sakai, 2003) and for DNA microarray data (Nakahara, Nishimura, Inoue, Hori, & Amari, 2003). Although these studies emphasized neural population firing and interactions among neurons (or gene expressions), almost all the earlier results can be directly applied in analyzing single-neuron spike trains because the mathematical formulation is general in the sense that it can be applied to any binary random vector. Here we give a brief description of the IG measure. Let us first introduce the θ-coordinate system, defined by

log P[X^N = x^N] = ∑_i θ_i x_i + ∑_{i<j} θ_{ij} x_i x_j + ∑_{i<j<k} θ_{ijk} x_i x_j x_k + ⋯

, and α(t) = 0 otherwise.
Spatial receptive fields of simple cells were modeled by means of Gabor filters,

s_η(x) = A_η cos([ω_η, 0] · x + φ) exp(−x^T R(ρ)^T Σ_η R(ρ) x / 2),

where A_η is the amplitude, Σ_η = diag(σ_{ηx}^{−2}, σ_{ηy}^{−2}) is the covariance matrix of the gaussian, ω_η and φ are the spatial frequency and phase of the plane wave, and R is a rotation matrix that introduces the angle ρ between the major axis of the gaussian and the plane wave. Parameters were adjusted to model 10 simple cells following neurophysiological data from Jones and Palmer (1987a, Table 1). Spatial kernels of geniculate units were modeled as differences of gaussians,

s_α(x) = A_cnt exp(−x^T x / (2σ_cnt^2)) − A_srn exp(−x^T x / (2σ_srn^2)),

where the subscripts indicate contributions from the receptive field center (cnt) and surround (srn). Kernel parameters followed neurophysiological data from Linsenmeier, Frishman, Jakiela, and Enroth-Cugell (1982) to model ON-center cells with receptive fields located between 5 and 15 degrees of visual eccentricity. At each angle of visual eccentricity, the spatial receptive fields of modeled OFF-center cells were equal in magnitude and opposite in sign to those of ON-center units, that is, s_α^ON = −s_α^OFF. Since many neurons in the LGN and V1 possess similar temporal dynamics (Alonso, Usrey, & Reid, 2001), for both cortical and geniculate units the temporal element h(t) was modeled as a difference of two gamma functions (DeAngelis, Ohzawa, & Freeman, 1993a; Cai, DeAngelis, & Freeman, 1997),

h_α(t) = h_η(t) = k_1 Γ(t, t_1, c_1, n_1) − k_2 Γ(t, t_2, c_2, n_2),

where Γ(t, t_0, c, n) = [c(t − t_0)]^n e^{−c(t−t_0)} / (n^n e^{−n}). Following data from Cai et al. (1997), the temporal parameters were t_1 = t_2 = 0, n_1 = n_2 = 2, k_1 = 1, k_2 = 0.6, c_1 = 60 s^{−1}, c_2 = 40 s^{−1}. Previous studies in which cell responses were simulated during free viewing of natural images have shown that the second-order statistics of thalamocortical activity produced by this model are insensitive to the level of rectification (Rucci et al., 2000; Rucci & Casile, 2004). To probe the origins of our previous simulation results, in this study we focused on the specific case of no rectification for simple cells and rectification with zero threshold for geniculate units. This assumption enables correlation
difference maps to be expressed as the product of linear geniculate and cortical responses,

R_η(x) = ⟨η_{x_η}(t) [α_{x_α}^{ON}(t) − α_{x_α}^{OFF}(t)]⟩_{I,t} = ⟨η_{x_η}(t) α_{x_α}(t)⟩_{I,t},    (2.1)

where α(t) = α^{ON}(t) − α^{OFF}(t) = k_α^{ON}(x, t) ∗ I(x, t). While this choice of rectification parameters simplified the mathematical analysis of this letter, our previous simulation data ensure that the results remain valid for a wide range of thresholds. In this letter, correlation difference maps were estimated on the basis of equation 2.1, without explicitly simulating cell responses. Examples of traces of neuronal activity can be found in Figure 6 of our previous study (Rucci & Casile, 2004).

3 Thalamocortical Activity Before Eye Opening

To establish a reference baseline, we first examined the structure of thalamocortical activity immediately before eye opening. Experimental evidence indicates that many of the response features of V1 cells are already present at the time of eye opening (Hubel & Wiesel, 1963; Blakemore & van Sluyters, 1975). Computational studies have shown that correlation-based mechanisms of synaptic plasticity are compatible with the emergence of simple cell receptive fields in the presence of endogenous spontaneous activity (Linsker, 1986; Miller, 1994; Miyashita & Tanaka, 1992). For simplicity, we restricted our analysis to the two-dimensional case of one spatial and one temporal dimension by considering sections of the spatial receptive fields. The receptive fields of simple cells were sectioned along the axis orthogonal to the cell's preferred orientation. For LGN cells, we considered a section along a generic axis crossing the center of the receptive field. Results are, however, general and can be directly extended to the full 3D space-time case. In the presence of spontaneous retinal activity, levels of correlation between the responses of thalamic and cortical units can be estimated by means of linear system theory (Papoulis, 1984),

R_η(x) = F^{−1}{K_η(ω_x, ω_t) K_α(ω_x, ω_t) C_SA(ω_x, ω_t)}|_{t=0},    (3.1)

where F^{−1} indicates the operation of inverse Fourier transform, C_SA(ω_x, ω_t) is the power spectrum of spontaneous activity in the retina, and K_α(ω_x, ω_t) and K_η(ω_x, ω_t) are the Fourier transforms of the LGN and V1 kernels. Under the model assumption of space-time separability of cell kernels, equation 3.1 gives

R_η(x) = T F^{−1}{S_η(ω_x) S_α(ω_x) S_SA(ω_x)},    (3.2)
where we also assumed space-time separability of the power spectrum of spontaneous retinal activity. T is a multiplicative factor equal to ∫_{−∞}^{∞} H_SA(ω_t) H_η(ω_t) H_α(ω_t) dω_t, and S_SA(ω_x), H_SA(ω_t), S_α(ω_x), H_α(ω_t), S_η(ω_x), and H_η(ω_t) are, respectively, the spatial and temporal components of the power spectrum of spontaneous retinal activity and of the Fourier transforms of the LGN and V1 kernels. Data from Mastronarde (1983) show that retinal spontaneous activity is characterized by narrow spatial correlations. These data are accurately interpolated by gaussian functions. Least-squares interpolations of levels of correlation between ganglion cells at different separations produced gaussians with amplitude A_SA = 13.9, independent of cell eccentricity, and standard deviation σ_SA that ranged from 0.18 degree at 5 degrees of eccentricity to 0.35 degree at 25 degrees. Use in equation 3.2 of a gaussian approximation for retinal spontaneous activity gives, after some algebraic manipulations, an analytical expression for the structure of correlated activity,

R_η(x) ∝ Â e^{−x²/(2σ̂²)} cos(ω̂x + φ) + Ã e^{−x²/(2σ̃²)} cos(ω̃x + φ) = R̂_ηα(x) + R̃_ηα(x),    (3.3)

where the parameters are given by

σ̂² = σ_η² + σ_cnt² + σ_SA²,    σ̃² = σ_η² + σ_srn² + σ_SA²,
ω̂ = (σ_η²/σ̂²) ω_η,    ω̃ = (σ_η²/σ̃²) ω_η,
Â = (2π A_η A_cnt A_SA σ_cnt σ_η σ_SA / σ̂) e^{−σ_η² ω_η (ω_η − ω̂)/2},
Ã = (2π A_η A_srn A_SA σ_srn σ_η σ_SA / σ̃) e^{−σ_η² ω_η (ω_η − ω̃)/2}.    (3.4)

Substitution of cell receptive field parameters in equation 3.4 yields Â ≫ Ã at all considered angles of visual eccentricity. Thus, the second term of equation 3.3 can be neglected, and correlation difference maps are described by Gabor functions:

R_η(x) ≈ R̂_η(x) = Â e^{−x²/(2σ̂²)} cos(ω̂x + φ).    (3.5)
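For readers who want to reproduce this prediction, equations 3.4 and 3.5 are straightforward to evaluate numerically; the sketch below uses placeholder parameter values, not the fitted Jones and Palmer parameters used in the article:

```python
# Predicted Gabor-shaped correlation map (equations 3.4-3.5), amplitude-normalized.
import numpy as np

sigma_eta, sigma_cnt, sigma_sa = 0.3, 0.1, 0.2     # widths (deg), placeholders
omega_eta, phi = 4.0, 0.0                          # plane-wave frequency and phase

sigma_hat2 = sigma_eta**2 + sigma_cnt**2 + sigma_sa**2     # equation 3.4
omega_hat = (sigma_eta**2 / sigma_hat2) * omega_eta

x = np.linspace(-2.0, 2.0, 401)
R_hat = np.exp(-x**2 / (2 * sigma_hat2)) * np.cos(omega_hat * x + phi)  # eq. 3.5

# Similarity indices of the kind reported in Figure 1b:
r_sigma = np.sqrt(sigma_hat2) / sigma_eta
r_omega = omega_hat / omega_eta
```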
Since the spatial receptive fields of modeled V1 units are also represented by Gabor functions, the similarity between correlation difference maps and cortical receptive fields can be quantified by directly comparing the two parameters of the Gabor maps: σ̂, the width of the gaussian, and ω̂, the spatial frequency of the plane wave. Figure 1 compares the correlation difference maps given by equation 3.5 to the receptive fields of modeled V1 units. Since the precise locations of the receptive fields of recorded cells were not reported by Jones and Palmer
Figure 1: Comparison between the spatial organization of simple cell receptive fields and the structure of thalamocortical activity for retinal inputs with various levels of spatial correlation. (a) Results for one of the 10 modeled simple cells in the case of spontaneous activity. The correlation difference maps (R_η) measured between the considered simple cell and arrays of geniculate units located around 5 and 15 degrees of visual eccentricity are compared to the profile of the receptive field (RF). (b) Comparison between the parameters of the Gabor functions that represented receptive fields and patterns of correlated activity. Dashed and solid curves show, respectively, the ratios r_ω = ω̂/ω_η and r_σ = σ̂/σ_η evaluated in the presence of retinal spontaneous activity (◦), white noise, and broad spatial correlations (σ_SA = 1°) at the level of the retina. The closer these two ratios are to 1, the higher the similarity between patterns of correlation and the spatial structure of simple cell receptive fields. Each curve represents average values over 10 modeled simple cells. Error bars represent standard deviations. The x-axis marks the angle of visual eccentricity of the receptive fields of geniculate units.
(1987a, 1987b), we estimated the patterns of correlation that each modeled V1 unit would establish with LGN cells located at various angles of visual eccentricity. Figure 1a shows an example for one of the modeled V1 units. The patterns of correlated activity measured at both 5 and 15 degrees of visual eccentricity closely resembled the receptive field profile of the cortical cell. The curves marked by solid triangles in Figure 1b show, respectively, the mean values of the two ratios r_ω = ω̂/ω_η and r_σ = σ̂/σ_η evaluated over all 10 modeled V1 cells as a function of the visual eccentricity of geniculate units. Both ratios were close to 1 at all eccentricities, indicating a close match between the patterns of correlated activity and the receptive fields of all simulated cells. The average values of the two indices of similarity were r̄_σ = 1.08 ± 0.05 and r̄_ω = 0.86 ± 0.07, respectively. Thus, in the model, the structure of thalamocortical activity present immediately before eye opening matched the spatial organization of simple cell receptive fields. It is important to notice that the similarity between receptive fields and correlation difference maps shown in Figure 1 originated from the narrow
spatial correlations of spontaneous activity. Indeed, when no spatial correlation was present at the level of the retina, that is, when spontaneous activity was modeled as white noise (C_SA(ω_x, ω_t) = 1), correlation difference maps calculated from equation 3.1 were virtually identical to the simulated receptive fields. Mean ratios over all simulated cells and angles of visual eccentricity were r̄_σ = 1.03 ± 0.02 and r̄_ω = 0.94 ± 0.03, indicating that correlation difference maps and cortical receptive fields were highly similar. This similarity did not occur in the presence of large input spatial correlations. For example, in the case of σ_SA = 1°, the mean matching ratios were r̄_σ = 1.77 ± 0.31 and r̄_ω = 0.35 ± 0.14. This analysis shows that the narrow correlations of spontaneous retinal activity were responsible for the compatibility between the structure of thalamocortical activity and the Hebbian maturation of cortical receptive fields observed in our previous modeling studies (Rucci et al., 2000; Rucci & Casile, 2004).

4 Thalamocortical Activity After Eye Opening

After eye opening, the assumption of narrow spatial correlations in the visual input is no longer valid. Luminance values in natural scenes are correlated over relatively large distances, as revealed by the power-law profile of the power spectrum of natural images (Field, 1987; Burton & Moorhead, 1987; Ruderman, 1994). Figure 2 examines the impact of these broad input correlations on the structure of thalamocortical activity. Following the approach of section 3, correlation difference maps were given by

R_η(x) = C F^{−1}{S_η(ω) S_α(ω) N(ω)},    (4.1)

where N(ω) is the power spectrum of natural images and C is a multiplicative factor equal to H_α(0) H_η(0). The power spectrum N(ω) was estimated from a set of 15 natural images (van Hateren & van der Schaaf, 1998). Its radial mean was best interpolated by N̄(ω) ∝ ω^{−2.02}, which is in agreement with previous measurements (Field, 1987; Ruderman, 1994). Similar to the results of our previous study (Rucci & Casile, 2004), patterns of correlated activity did not match the receptive fields of simple cells during static presentation of natural scenes. An example for one of the 10 modeled simple cells is shown in Figure 2a, which compares the profile of the cell's receptive field to sections of the correlation difference maps measured at 5 and 15 degrees of visual eccentricity. The mismatch is particularly evident at the side lobes of the receptive field, where levels of correlation predicted stabilization of afferents from geniculate cells of the wrong polarity (ON- instead of OFF-center). Figure 2b shows average results obtained over the entire population of simulated simple cells. Since, in this case, an analytical expression for R_η(x) was not available, correlation difference maps obtained by solving
Figure 2: Comparison between the spatial organization of simple cell receptive fields and the structure of thalamocortical activity in the case of static presentation of natural images. (a) Results for one of the 10 modeled simple cells. The two correlation difference maps (R_η) measured between the considered simple cell and arrays of geniculate units located around 5 and 15 degrees of visual eccentricity are compared to the profile of the simple cell receptive field (RF). (b) Average matching across the 10 modeled V1 units. Bars indicate the matching between correlation difference maps and cortical receptive fields evaluated both over the entire receptive field (r_RF) and only over the secondary lobes (r_SL) (see the text for details). The x-axis represents the angle of visual eccentricity of simulated geniculate units. Vertical lines represent the standard deviation.
numerically equation 4.1 were compared to receptive fields by means of the correlation coefficient r_RF. This index measures the similarity of two patterns. It varies between −1 and +1, with +1 indicating perfect matching and −1 perfect mirror symmetry. In addition to the mean correlation coefficient r_RF, a second, more specific index, r_SL, quantified the similarity between receptive fields and correlation difference maps only over the secondary lobes of cell receptive fields. At all considered eccentricities, a clear mismatch was present between correlation maps and receptive fields. Average correlation coefficients were r̄_RF = 0.65 ± 0.09 over the entire receptive fields and r̄_SL = −0.45 ± 0.3 over the secondary lobes. That is, contrary to the case of retinal spontaneous activity, the structure of correlated activity measured in the presence of the broad correlations of natural images was not compatible with a Hebbian refinement of the receptive fields of simple cells. The results of Figure 2 were obtained in the absence of eye movements. Under natural viewing conditions, however, the retinal image is always in motion, as small movements of the eyes, head, and body prevent maintenance of a steady direction of gaze. Results from previous computational studies have shown a strong influence of fixational instability on
the structure of correlated activity in models of the LGN and V1 (Rucci et al., 2000; Rucci & Casile, 2004). To examine the origins of this influence, in this article we model fixational instability by means of a two-dimensional ergodic process ξ(t) = [ξ_x(t), ξ_y(t)]^T. For simplicity, we assumed zero first-order moments (⟨ξ_x(t)⟩ = 0 and ⟨ξ_y(t)⟩ = 0) and uncorrelated movements along the two axes (R_{ξ_x ξ_y}(t) = 0). By means of a Taylor expansion, the luminance profile I(x̃) of a natural image in the neighborhood of a generic point x can be approximated as I(x̃) ≈ I(x) + ∇I(x)^T · [x̃ − x]. Thus, if the average area covered by fixational instability is sufficiently small, the input to the retina during visual fixation can be approximated by

I_r(x, t) ≈ I(x) + ∇I(x)^T · ξ(t) = I(x) + (∂I(x)/∂x) ξ_x(t) + (∂I(x)/∂y) ξ_y(t).
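The quality of this first-order model of the jittered input can be checked by comparing it with an exact sub-pixel shift. The sketch below does this on a synthetic image with a natural-scene-like 1/|ω|² power spectrum; the image, displacement, and sizes are all illustrative:

```python
# First-order (gradient) approximation of a jittered image vs. an exact shift.
import numpy as np
from scipy.ndimage import shift as ndshift

rng = np.random.default_rng(3)
n = 256
f = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
spec = 1.0 / np.maximum(f, 1.0 / n)                 # ~1/|w| amplitude => 1/|w|^2 power
I = np.fft.ifft2(spec * np.fft.fft2(rng.standard_normal((n, n)))).real

xi = np.array([0.4, -0.3])                          # small fixational displacement (pixels)
g0, g1 = np.gradient(I)                             # partial derivatives of I
approx = I + g0 * xi[0] + g1 * xi[1]                # I(x) + grad I(x)^T xi
exact = ndshift(I, -xi, order=3)                    # I evaluated at x + xi
err = np.abs(exact - approx)[8:-8, 8:-8].max()      # small away from the borders
```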
Using this approximation, we can estimate the responses of cortical and geniculate cells during visual fixation:

η_{x_0}(t) = k_η(µ, τ) ∗ I_r(µ, τ)|_{(x_0,t)}
  ≈ k_η(µ, τ) ∗ I(µ)|_{(x_0,t)} + k_η(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_0,t)} + k_η(µ, τ) ∗ (∂I(µ)/∂µ_y) ξ_y(τ)|_{(x_0,t)}
  = η^S_{x_0}(t) + η^D_{x_0}(t),

α_{x_1}(t) = k_α(µ, τ) ∗ I_r(µ, τ)|_{(x_1,t)}
  ≈ k_α(µ, τ) ∗ I(µ)|_{(x_1,t)} + k_α(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_1,t)} + k_α(µ, τ) ∗ (∂I(µ)/∂µ_y) ξ_y(τ)|_{(x_1,t)}
  = α^S_{x_1}(t) + α^D_{x_1}(t),

where x_0 and x_1 are the locations of the receptive field centers and

η^S_{x_0}(t) = k_η(µ, τ) ∗ I(µ)|_{(x_0,t)},
η^D_{x_0}(t) = k_η(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_0,t)} + k_η(µ, τ) ∗ (∂I(µ)/∂µ_y) ξ_y(τ)|_{(x_0,t)} = η^{D_x}_{x_0}(t) + η^{D_y}_{x_0}(t),    (4.2)
α^S_{x_1}(t) = k_α(µ, τ) ∗ I(µ)|_{(x_1,t)},
α^D_{x_1}(t) = k_α(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_1,t)} + k_α(µ, τ) ∗ (∂I(µ)/∂µ_y) ξ_y(τ)|_{(x_1,t)} = α^{D_x}_{x_1}(t) + α^{D_y}_{x_1}(t).
That is, cell responses can be decomposed into a static component with nonzero mean (η^S and α^S) and a zero-mean dynamic component introduced by fixational instability (η^D and α^D). η^{D_x}, α^{D_x}, η^{D_y}, and α^{D_y} are the contributions to cell responses generated by the instability of visual fixation along the x- and y-axes. Given this decomposition, correlation difference maps can also be expressed as the sum of a static and a dynamic term:

R_η(x) = R^S_η(x) + R^D_η(x).    (4.3)
Indeed, from our assumptions on the statistical moments of fixational instability, it follows that only three of the nine terms obtained by direct multiplication of the responses η_{x_0}(t) and α_{x_1}(t) have nonzero means. The first of these terms is given by

⟨η^S_{x_0}(t) α^S_{x_1}(t)⟩_{ξ,I,t} = ⟨(k_η(µ, τ) ∗ I(µ)|_{(x_0,t)}) (k_α(µ, τ) ∗ I(µ)|_{(x_1,t)})⟩_{ξ,I,t}
 = (s_η(µ) ∗ s_α(−µ) ∗ N(µ)|_{(x_1−x_0)}) ∫_{−∞}^{∞} h_α(τ) dτ ∫_{−∞}^{∞} h_η(τ) dτ
 = C s_η(µ) ∗ s_α(−µ) ∗ N(µ)|_x,
where N(x) is the autocorrelation function of natural images. Since this term depends only on the static components of cell responses, it represents the correlation difference map that would be obtained in the absence of fixational instability (see equation 4.1). The second term is given by
⟨η^{D_x}_{x_0}(t) α^{D_x}_{x_1}(t)⟩_{ξ,I,t} = ⟨(k_η(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_0,t)}) (k_α(µ, τ) ∗ (∂I(µ)/∂µ_x) ξ_x(τ)|_{(x_1,t)})⟩_{ξ,I,t}
 = (s_η(µ) ∗ s_α(−µ) ∗ N_x(µ)|_{x_1−x_0}) h_η(τ) ∗ h_α(−τ) ∗ R_{ξ_x ξ_x}(τ)|_{τ=0}
 = D s_η(µ) ∗ s_α(−µ) ∗ N_x(µ)|_x,    (4.4)
Figure 3: Comparison between the power spectra of natural images, $N(\omega)$, and the dynamic power spectrum $N'(\omega) = N_x(\omega) + N_y(\omega)$ given by the sum of the power spectra of the x and y components of the gradient of natural images. The two curves are radial averages over a set of 15 natural images.
where $N_x(\mu)$ is the autocorrelation function of the first component of the gradient of natural images (the derivative along the x-axis). $D$ is a constant equal to $\int_{-\infty}^{\infty} H_\eta(\omega_t) H_\alpha(\omega_t) \mathcal{R}_{\xi_x \xi_x}(\omega_t)\, d\omega_t$, where $\mathcal{R}_{\xi_x \xi_x}(\omega_t)$ indicates the Fourier transform of $R_{\xi_x \xi_x}(t)$. By using a similar procedure, we obtain the third term,
$$\big\langle \eta^{D_y}_{x_0}(t)\,\alpha^{D_y}_{x_1}(t) \big\rangle_{\xi,I,t} = D\, s_\eta(\mu) * s_\alpha(-\mu) * N_y(\mu)\big|_{x},$$
where $N_y(\mu)$ is the autocorrelation function of the second component of the gradient of natural images (the derivative along the y-axis). By adding these three terms and defining $N'(\mu) = N_x(\mu) + N_y(\mu)$, we obtain

$$R_\eta(x) = C\, s_\eta(\mu) * s_\alpha(-\mu) * N(\mu)\big|_x + D\, s_\eta(\mu) * s_\alpha(-\mu) * N'(\mu)\big|_x = R^S_\eta(x) + R^D_\eta(x), \tag{4.5}$$
which proves equation 4.3. Equation 4.5 shows that fixational instability adds a contribution $R^D_\eta(x)$ to the correlation map $R^S_\eta(x)$ obtained with presentation of the same stimuli in the absence of retinal image motion. Whereas in the absence of fixational instability levels of correlation depend on the autocorrelation function of the stimulus $N(x)$ (or, equivalently, its power spectrum $N(\omega)$), the term $R^D_\eta(x)$ introduced by the jittering of the eye depends on the autocorrelation function of the gradient of the stimulus, $N'(x)$ (or, equivalently, its power spectrum $N'(\omega)$, the dynamic power spectrum). Figure 3 compares $N(\omega)$ and $N'(\omega)$ for the case of images of natural scenes. Whereas $N(\omega)$ followed, as expected, a power law with exponent
Figure 4: Comparison between the spatial organization of simple cell receptive fields and patterns of correlated activity measured when images of natural scenes were examined in the presence of fixational instability (the term $R^D_\eta(x)$ in equation 4.5). The layout of the panels and the graphic notation are the same as in Figure 2.
approximately equal to −2, the dynamic power spectrum $N'(\omega)$ was almost flat up to a cutoff frequency of about 10 cycles/deg; that is, it was uncorrelated. Thus, in the presence of natural images, fixational instability adds an input signal that discards spatial correlations. It should be observed that the whitening of the dynamic power spectrum is a direct consequence of the scale invariance of natural images and has a simple explanation in the frequency domain. Since the Fourier transforms of the two partial derivatives $\partial I(x)/\partial x$ and $\partial I(x)/\partial y$ are, respectively, proportional to $\omega_x \hat{I}(\omega)$ and $\omega_y \hat{I}(\omega)$ (where $\hat{I}$ denotes the Fourier transform of $I$), the two power spectra $N_x(\omega)$ and $N_y(\omega)$ are proportional to $\omega_x^2 N(\omega)$ and $\omega_y^2 N(\omega)$. Thus, $N'(\omega) = N_x(\omega) + N_y(\omega) \propto |\omega|^2 N(\omega)$. For images of natural scenes, $N(\omega) \propto |\omega|^{-2}$ (Field, 1987), and the product $|\omega|^2 N(\omega)$ produces a dynamic power spectrum $N'(\omega)$ with uniform spectral density. In other words, our analysis shows that whereas the intensity values of natural images tend to be correlated over large distances, local changes in intensity around pairs of pixels are uncorrelated. Therefore, fixational instability represents an optimal decorrelation strategy for visual input with a power spectrum that declines as $|\omega|^{-2}$. We have already shown in Figure 2 that the patterns of correlated activity $R^S_\eta(x)$ measured with static presentation of images of natural scenes did not match the receptive fields of modeled simple cells. Figure 4 analyzes the contribution of fixational instability, the term $R^D_\eta(x)$ in equation 4.5, to the structure of correlated activity. In this case, correlation difference maps closely resembled the spatial organization of cortical receptive fields irrespective of the eccentricity of simulated geniculate units.
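The whitening argument above is easy to verify numerically. The short sketch below is an illustrative one-dimensional analog (not the authors' code; the signal length and frequency bands are arbitrary choices): it synthesizes a signal with a $1/\omega^2$ power spectrum and shows that the power spectrum of its gradient is approximately flat at low frequencies.

```python
import numpy as np

# A numerical check of the whitening argument (an illustrative 1D analog):
# a signal with power spectrum ~ 1/w^2 has a gradient whose power spectrum
# |w|^2 * N(w) is approximately flat at low frequencies.

rng = np.random.default_rng(0)
n = 2 ** 14
freqs = np.fft.rfftfreq(n)[1:]                    # skip the DC component

# Synthesize I with amplitude ~ 1/w (power ~ 1/w^2) and random phases.
phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
spectrum = np.concatenate(([0.0], (1.0 / freqs) * np.exp(1j * phases)))
signal = np.fft.irfft(spectrum, n)

gradient = np.gradient(signal)                    # local intensity changes

N = np.abs(np.fft.rfft(signal))[1:] ** 2          # "static" power spectrum
N_prime = np.abs(np.fft.rfft(gradient))[1:] ** 2  # "dynamic" power spectrum

# The static spectrum spans many orders of magnitude between the two bands;
# the dynamic spectrum stays within a small factor of itself.
lo, hi = slice(1, 50), slice(2000, 2050)
print("static  low/high power ratio:", N[lo].mean() / N[hi].mean())
print("dynamic low/high power ratio:", N_prime[lo].mean() / N_prime[hi].mean())
```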
The mean matching index was $\bar{r}_{RF} = 0.98 \pm 0.006$ over the entire receptive fields and $\bar{r}_{SL} = 0.92 \pm 0.06$ over the secondary lobes. That is, each simple cell established strong correlations with either ON- or OFF-center geniculate units only when the receptive fields of these units overlapped an ON or an OFF subregion. Similar to the case of spontaneous retinal activity, this pattern of correlated activity is compatible with a Hebbian refinement of simple cell receptive fields. To summarize, equation 4.5 shows that in the presence of the self-motion of the retinal image that occurs during natural viewing, the second-order statistics of thalamocortical activity depend on both the spatial configuration of the stimulus and on how its retinal projection changes during visual fixation. The first component is represented in equation 4.5 by the term $R^S_\eta$, which depends on the power spectrum of the stimulus $N(\omega)$. The latter component is represented by $R^D_\eta$, which is determined by the dynamic power spectrum $N'(\omega)$, a spectrum that discards the broad spatial correlations of natural images. Of these two terms, only $R^D_\eta$ matches the spatial organization of simple cell receptive fields (compare Figure 4 with Figure 2). The overall structure of correlated activity is given by the weighted sum of the results of Figures 2 and 4. During fixational instability, the relative influence of $R^S_\eta$ and $R^D_\eta$ depends on two elements: (1) the powers of the two inputs $N(\omega)$ and $N'(\omega)$ and (2) neuronal sensitivity to the two input signals. In natural images, most energy is concentrated at low spatial harmonics. Since $N'(\omega)$ attenuates the low spatial frequencies of the stimulus, it tends to possess less power than $N(\omega)$. For example, for the two power spectra in Figure 3, the ratio of dynamic to static power in the range 0 to 10 cycles/deg was only 0.08. That is, $N(\omega)$ provided over 10 times more power than $N'(\omega)$ within the main spatial range of sensitivity of geniculate cells. However, in equation 4.5, $N(\omega)$ and $N'(\omega)$ are modulated by the multiplicative terms C and D, which depend on the temporal characteristics of cell responses (both C and D) and of fixational instability (D only). Since geniculate neurons respond more strongly to changing stimuli than to stationary ones, D tends to be higher than C. For example, retinal image motion with gaussian temporal correlation (the term $R_{\xi_x \xi_x}$ in equation 4.4) characterized by a standard deviation of 30 ms and a mean amplitude of 10′, values that are consistent with the instability of fixation of several species, produced a ratio D/C ≈ 950. Thus, although $N'(\omega)$ provided less power than $N(\omega)$, the weighted ratio of the total power $(D\, N'(\omega))/(C\, N(\omega))$ in the range 0 to 10 cycles/deg was approximately 76. Since the term $R^D_\eta$ dominated the sum of equation 4.5, the matching between correlation difference maps and receptive fields increased from $\bar{r}_{RF} = 0.65 \pm 0.09$ and $\bar{r}_{SL} = -0.45 \pm 0.3$ (the values obtained with static presentation of natural images) to $\bar{r}_{RF} = 0.90 \pm 0.05$ and $\bar{r}_{SL} = 0.12 \pm 0.55$. That is, in the presence of fixational instability, the responses of simulated cortical units tended to be correlated with those of geniculate units with the correct polarity.
It is important to observe that several mechanisms might further enhance the impact of fixational instability on the refinement of thalamocortical connectivity. A first possibility is a rule of synaptic plasticity that depends on the covariance (and not the correlation) between the responses of pre- and postsynaptic elements (Sejnowski, 1977): $R_\eta(x) = \langle(\eta(t) - \bar{\eta})(\alpha(t) - \bar{\alpha})\rangle$. In the case in which mean activity levels are estimated over periods of fixation, $\bar{\eta} = \eta^S$ and $\bar{\alpha} = \alpha^S$, yielding $R_\eta(x) = R^D_\eta(x)$. Thus, the term $R^S_\eta(x)$ does not affect synaptic plasticity, and the structure of thalamocortical activity is compatible with the spatial organization of the receptive fields of simple cells. This is consistent with the results of our previous simulations in which we analyzed the statistics of geniculate activity during natural viewing (Rucci et al., 2000). A second mechanism that might enhance the influence of fixational instability is a nonlinear attenuation of the responses of simple cells to unchanging stimuli. Systematic deviations from linearity have been observed in the responses of simple cells. In particular, it has been shown that responses to stationary stimuli tend to decline faster and give lower steady-state levels of activity than would be expected from linear predictions (Tolhurst, Walker, Thompson, & Dean, 1980; DeAngelis et al., 1993a). This attenuation can be incorporated into our model by assuming that after an initial transitory period following the onset of visual fixation, a simple cell responds as $\eta(t) = (1 - \beta)\,\eta^S(t) + \eta^D(t)$, where the constant $\beta \in [0, 1]$ defines the degree of attenuation. With this modification, correlation difference maps are given by

$$R_\eta(x) = (1 - \beta) R^S_\eta(x) + R^D_\eta(x). \tag{4.6}$$
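To make the effect of the covariance rule concrete, the following toy sketch (illustrative values only, not the authors' simulation) shows how subtracting mean activity levels removes the static product from the measured second-order statistics while leaving the dynamic term untouched.

```python
import numpy as np

# Toy illustration of why the covariance rule suppresses the static term:
# with means estimated over a fixation, subtracting them removes the
# eta^S * alpha^S product entirely.

rng = np.random.default_rng(0)
T = 1000
eta_S, alpha_S = 3.0, 2.0                      # static components (means)
eta_D, alpha_D = rng.standard_normal((2, T))   # zero-mean dynamic components
eta, alpha = eta_S + eta_D, alpha_S + alpha_D

correlation = np.mean(eta * alpha)             # ~ eta_S * alpha_S + noise
covariance = np.mean((eta - eta.mean()) * (alpha - alpha.mean()))

print(correlation)  # dominated by the static product (about 6.0 here)
print(covariance)   # static product removed (about 0.0 here)
```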
Figure 5 compares the receptive fields of simulated simple cells with the correlation difference maps estimated with various degrees of attenuation. It is clear by comparing these data to those of Figure 2 that even a partial attenuation of cortical responses to unchanging stimuli resulted in a substantial improvement in the degree of similarity between patterns of correlation and receptive fields. A 60% attenuation was sufficient to produce an almost perfect matching ($\bar{r}_{RF} = 0.97 \pm 0.02$ and $\bar{r}_{SL} = 0.50 \pm 0.44$). Thus, consistent with our previous simulations of thalamocortical activity (Rucci & Casile, 2004), in the presence of fixational instability, a nonlinear attenuation of simple cell responses leads to a regime of correlated activity that is compatible with a Hebbian refinement of the spatial organization of simple cell receptive fields.
Figure 5: Effect of nonlinear attenuation of simple cell responses to unchanging stimuli. (a) Results for one of the 10 modeled simple cells. The correlation difference maps ($R_\eta$) estimated from equation 4.6 for three values of the attenuation factor $\beta$ are compared to the profile of the simple cell receptive field (RF). (b) Mean matching indices over all modeled V1 units as a function of the attenuation factor. Both correlation coefficients evaluated over the entire receptive field ($r_{RF}$) and over the secondary subregions ($r_{SL}$) are shown. Parameters of the LGN units corresponded to an eccentricity of 10 degrees.
5 Conclusions

Many of the response characteristics of V1 neurons develop before eye opening and refine with exposure to pattern vision (Hubel & Wiesel, 1963; Blakemore & van Sluyters, 1975; Buisseret & Imbert, 1976; Pettigrew, 1974). After eye opening, small eye and body movements keep the retinal image in constant motion. The statistical analysis of this article, together with the results of our previous simulations (Rucci et al., 2000; Rucci & Casile, 2004), indicates that the physiological instability of visual fixation contributes to decorrelating cell responses to natural stimuli and to establishing a regime of neural activity similar to that present before eye opening. Thus, at the time of eye opening, no sudden change occurs in the second-order statistics of thalamocortical activity, and the same correlation-based mechanism of synaptic plasticity can account for both the initial emergence and the later refinement of simple cell receptive fields. In this study, we have used independent models of LGN and V1 neurons to examine whether the structure of thalamocortical activity is compatible with a Hebbian maturation of the spatial organization of simple cell receptive fields. The results of our analysis are consistent with a substantial body of previous modeling work. Before eye opening, in the presence of spontaneous retinal activity, a modeled simple cell established strong correlations with ON- and OFF-center geniculate units only when the receptive fields of these units overlapped, respectively, the ON and OFF subregions
within its receptive fields. This pattern of correlated activity is in agreement with the results of previous studies that modeled the activity-dependent development of cortical orientation selectivity (Linsker, 1986; Miller, 1994; Miyashita & Tanaka, 1992). After eye opening, the visual system is exposed to the broad spatial correlations of natural scenes. In the absence of retinal image motion, these input correlations would coactivate geniculate units with the same polarity (ON- or OFF-center) and with receptive fields at relatively large separations, a pattern of neural activity that is not compatible with a Hebbian refinement of simple cell receptive fields. During natural fixation, however, neurons receive input signals that vary in time as their receptive fields move with the eye (Gur & Snodderly, 1997). This study shows that in the presence of images of natural scenes, these input fluctuations lack spatial correlations. In the model, this spatially uncorrelated input signal strongly influenced neuronal responses and produced patterns of thalamocortical activity that were similar to those measured immediately before eye opening. Thus, our analysis shows that a direct scheme of Hebbian plasticity can be added to the category of activity-dependent mechanisms compatible with the maturation of cortical receptive fields in the presence of decorrelated natural visual input (Law & Cooper, 1994; Olshausen & Field, 1996).

The fact that fixational instability might have such a strong effect on the development of cortical receptive fields should not come as a surprise. Consistent with the results of our analysis, several experimental studies have shown that prevention and manipulation of eye movements during the critical period disrupt the maturation of the response properties of cortical neurons (for a review, see Buisseret, 1995). For example, no restoration of cortical orientation selectivity (Gary-Bobo et al., 1986; Buisseret et al., 1978) and ocular dominance (Freeman & Bonds, 1979; Singer & Raushecker, 1982) is observed in dark-reared kittens exposed to visual stimulation with their eye movements prevented. In addition, neurophysiological results have shown that fixational eye movements strongly influence the responses of geniculate and cortical neurons (Gur et al., 1997; Leopold & Logothetis, 1998; Martinez-Conde et al., 2002). In the primary visual cortex of the monkey, bursts of spikes have been recorded following fixational saccades (Martinez-Conde, Macknik, & Hubel, 2000), and distinct neuronal populations have been found that selectively respond to the two main components of fixational eye movements, saccades and drifts (Snodderly et al., 2001).

This study relied on two important assumptions. A first assumption was the use of linear models to predict cell responses to visual stimuli. Linear spatiotemporal models enabled the derivation of analytical expressions of levels of correlation in thalamocortical activity. A substantial body of evidence shows that LGN X cells act predominantly as linear filters. Responses to drifting gratings contain most power in the first harmonic (So & Shapley, 1981), and responses to both flashed and complex naturalistic stimuli are well captured by linear predictors (Stanley, Li, & Dan, 1999; Cai et al., 1997).
Also, the responses of V1 simple cells contain a strong linear component (Jones & Palmer, 1987b; DeAngelis, Ohzawa, & Freeman, 1993b). However, for these neurons, important deviations from linearity have been reported. In particular, it has been observed that responses to stationary stimuli decline faster and settle on lower steady-state levels than would be expected from linear predictions (Tolhurst et al., 1980; DeAngelis et al., 1993a). We have shown that a nonlinear attenuation of cortical responses to unchanging stimuli enhances the influence of fixational instability on the structure of correlated activity. In the model, the broad correlations of natural scenes had little impact on the second-order statistics of thalamocortical activity in the presence of strong nonlinear attenuation.

A second assumption concerned the way we modeled the self-motion of the retinal image. In this study, the physiological instability of visual fixation was modeled as a zero-mean stochastic process with uncorrelated components along the two Cartesian axes. These assumptions simplified our analysis and led to the elimination of several terms in equation 4.5. However, the results presented in this article do not critically depend on them. Simulations in which retinal image motion replicated the cat's oculomotor behavior have produced patterns of correlated activity that are very similar to the theoretical predictions of this study (Rucci et al., 2000; Rucci & Casile, 2004). Furthermore, although a statistical analysis of the instability of visual fixation under natural viewing conditions has not been performed, the motion of the retinal image, as subjectively experienced in the jitter after-effect (Murakami & Cavanagh, 1998), appears to be compatible, at least qualitatively, with our modeling assumptions.

It is worth emphasizing that during natural viewing, other elements, in addition to eye movements, contribute to the instability of visual fixation. In particular, small movements of the head and body and imperfections in the vestibulo-ocular reflex (Skavenski, Hansen, Steinman, & Winterson, 1979) are known to amplify the self-motion of the retinal image. Our analysis aims to address the joint effect of all these movements. It can be shown analytically that the factor D, which in equation 4.5 modulates the impact of a moving retinal stimulus (the term $R^D_\eta(x)$), depends quadratically on the spatial extent of fixational instability. Therefore, within the limits of validity of the Taylor approximation of equation 4.2, the larger the amplitude of fixational instability, the stronger its influence on the structure of correlated activity. It should also be noted that while this article focuses on the examination of static images of natural scenes, our analysis applies to any jittering stimulus on the retina, regardless of the origin of motion, whether self-generated or external. For example, the trembling of leaves on a tree exposed to the wind might produce a decorrelation of neural activity similar to that of fixational instability.

Our results appear to contrast with a previous proposal according to which the spatial response characteristics of retinal and geniculate neurons are sufficient to decorrelate the spatial signals provided by images of natural
scenes (Atick & Redlich, 1992). According to this hypothesis, a neuronal sensitivity function that increases linearly with spatial frequency would counterbalance the power spectrum of natural images and produce a decorrelated pattern of neural activity. However, neurophysiological recordings have shown that in both the cat and the monkey, the frequency responses of cells in the retina and the LGN deviate significantly from linearity in the low spatial frequency range (So & Shapley, 1981; Linsenmeier et al., 1982; Derrington & Lennie, 1984; Croner & Kaplan, 1995). Such deviation is not compatible with Atick and Redlich's proposal and, in the absence of fixational instability, would lead to a regime of thalamocortical activity strongly influenced by the broad spatial correlations of natural images (see Figure 2). In contrast to this static decorrelation mechanism, the decorrelation of visual input produced by fixational instability does not depend on the spatial response properties of geniculate and cortical units. Thus, the proposed mechanism is highly robust with respect to individual neuronal differences in spatial contrast sensitivity functions.

While in this study we have focused on the developmental consequences of a chronic exposure to fixational instability, our results also have important implications concerning the way visual information is represented in the early visual system. A number of recent studies have suggested an important role for fixational instability in the neural encoding of visual stimuli (Ahissar & Arieli, 2001; Greschner, Bongard, Rujan, & Ammermüller, 2002; Snodderly et al., 2001). The results presented here suggest that fixational instability, by decreasing statistical dependencies between neural responses, might contribute to discarding broad input correlations, thus establishing efficient visual representations of natural visual scenes (Barlow, 1961; Attneave, 1954). Further theoretical and experimental studies are needed to characterize and test this hypothesis.

Acknowledgments

We thank Matthias Franz, Alessandro Treves, and Martin Giese for many helpful comments on a preliminary version of this article. This work was supported by the Volkswagen Stiftung, National Institutes of Health grant EY015732-01, and National Science Foundation grants CCF-0432104 and CCF-0130851. Correspondence and requests for materials should be addressed to A.C.
References

Ahissar, E., & Arieli, A. (2001). Figuring space by time. Neuron, 32, 185–201.
Alonso, J., Usrey, M., & Reid, R. C. (2001). Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. J. Neurosci., 21(11), 4002–4015.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comput., 4, 196–210.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61(3), 183–193.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Blakemore, C., & van Sluyters, R. C. (1975). Innate and environmental factors in the development of the kitten's visual cortex. J. Physiol., 248, 663–716.
Buisseret, P. (1995). Influence of extraocular muscle proprioception on vision. Physiol. Rev., 75(2), 323–338.
Buisseret, P., Gary-Bobo, E., & Imbert, M. (1978). Ocular motility and recovery of orientational properties of visual cortical neurons in dark-reared kittens. Nature, 272, 816–817.
Buisseret, P., & Imbert, M. (1976). Visual cortical cells: Their developmental properties in normal and dark-reared kittens. J. Physiol., 255, 511–525.
Burton, G. J., & Moorhead, I. R. (1987). Color and spatial structure in natural scenes. Appl. Opt., 26, 157–170.
Cai, D., DeAngelis, G. C., & Freeman, R. D. (1997). Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. J. Neurophysiol., 78, 1045–1061.
Changeux, J. P., & Danchin, A. (1976). Selective stabilization of developing synapses as a mechanism for the specification of neuronal networks. Nature, 264, 705–712.
Croner, L. J., & Kaplan, E. (1995). Receptive fields of P and M ganglion cells across the primate retina. Vision Res., 35(1), 7–24.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993a). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. I. General characteristics and postnatal development. J. Neurophysiol., 69(4), 1091–1117.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993b). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. J. Neurophysiol., 69(4), 1118–1135.
Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. J. Physiol., 357, 219–240.
Ditchburn, R. (1980). The function of small saccades. Vision Res., 20, 271–272.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4, 2379–2394.
Fiorentini, A., Maffei, L., & Bisti, S. (1979). Change of binocular properties of cortical cells in the central and paracentral visual field projections of monocularly paralyzed cats. Brain. Res., 171, 541–544.
Freeman, R. D., & Bonds, A. B. (1979). Cortical plasticity in monocularly deprived immobilized kittens depends on eye movement. Science, 206, 1093–1095.
Gary-Bobo, E., Milleret, C., & Buisseret, P. (1986). Role of eye movements in developmental processes of orientation selectivity in the kitten visual cortex. Vision Res., 26(4), 557–567.
Greschner, M., Bongard, M., Rujan, P., & Ammermüller, J. (2002). Retinal ganglion cell synchronization by fixational eye movements improves feature estimation. Nat. Neurosci., 5(4), 341–347.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17(8), 2914–2920.
Gur, M., & Snodderly, D. M. (1997). Visual receptive fields of neurons in primary visual cortex (V1) move in space with the eye movements of fixation. Vision Res., 37(3), 257–265.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture of the cat's visual cortex. J. Physiol., 160, 106–154.
Hubel, D. H., & Wiesel, T. N. (1963). Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. J. Neurophysiol., 26, 994–1002.
Jones, J. P., & Palmer, L. A. (1987a). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1233–1258.
Jones, J. P., & Palmer, L. A. (1987b). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1187–1211.
Law, C. C., & Cooper, L. N. (1994). Formation of receptive fields in realistic visual environments according to the Bienenstock, Cooper and Munro (BCM) theory. Proc. Natl. Acad. Sci. USA, 91, 7797–7801.
Leopold, D. A., & Logothetis, N. K. (1998). Microsaccades differentially modulate neural activity in the striate and extrastriate visual cortex. Exp. Brain. Res., 123, 341–345.
Linsenmeier, R. A., Frishman, L. J., Jakiela, H. G., & Enroth-Cugell, C. (1982). Receptive field properties of X and Y cells in the cat retina derived from contrast sensitivity measurements. Vision Res., 22, 1173–1183.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, 83, 8390–8394.
Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2000). Microsaccadic eye movements and firing of single cells in the macaque striate cortex. Nat. Neurosci., 3(3), 251–258.
Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2002). The function of bursts of spikes during visual fixation in the awake primate lateral geniculate nucleus and primary visual cortex. Proc. Natl. Acad. Sci. USA, 99(21), 13920–13925.
Mastronarde, D. N. (1983). Correlated firing of cat retinal ganglion cells. I. Spontaneously active inputs to X and Y cells. J. Neurophysiol., 49, 303–323.
Miller, K. D. (1994). A model of the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14(1), 409–441.
Miller, K. D., Erwin, E., & Kayser, A. (1999). Is the development of orientation selectivity instructed by activity? J. Neurobiol., 41, 55–57.
Miyashita, M., & Tanaka, S. (1992). A mathematical model for the self-organization of orientation columns in visual cortex. Neuroreport, 3, 69–72.
Murakami, I., & Cavanagh, P. (1998). A jitter after-effect reveals motion-based stabilization of vision. Nature, 395, 798–801.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Papoulis, A. (1984). Probability, random variables and stochastic processes. New York: McGraw-Hill.
Pettigrew, J. D. (1974). The effect of visual experience on the development of stimulus specificity by kitten cortical neurons. J. Physiol., 237, 49–74.
Ratliff, F., & Riggs, L. A. (1950). Involuntary motions of the eye during monocular fixation. J. Exp. Psychol., 40, 687–701.
Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284.
Rucci, M., & Casile, A. (2004). Decorrelation of neural activity during fixational instability: Possible implications for the refinement of V1 receptive fields. Vis. Neurosci., 21(5), 725–738.
Rucci, M., Edelman, G. M., & Wray, J. (2000). Modeling LGN responses during free-viewing: A possible role of microscopic eye movements in the refinement of cortical orientation selectivity. J. Neurosci., 20(12), 4708–4720.
Ruderman, D. L. (1994). The statistics of natural images. Network, 5, 517–548.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4(4), 303–321.
Singer, W., & Raushecker, J. (1982). Central-core control of developmental plasticity in the kitten visual cortex: II. Electrical activation of mesencephalic and diencephalic projections. Exp. Brain. Res., 47, 223–233.
Skavenski, A. A., Hansen, R., Steinman, R. M., & Winterson, B. J. (1979). Quality of retinal image stabilization during small natural and artificial body rotations in man. Vision Res., 19, 365–375.
Snodderly, D. M., Kagan, I., & Gur, M. (2001). Selective activation of visual cortex neurons by fixational eye movements: Implications for neural coding. Vis. Neurosci., 18, 259–277.
So, Y. T., & Shapley, R. (1981). Spatial tuning of cells in and around lateral geniculate nucleus of cat: X and Y relay cells and perigeniculate interneurons. J. Neurophysiol., 45(1), 107–120.
Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19(18), 8036–8042.
Steinman, R. M., Haddad, G. M., Skavenski, A. A., & Wyman, D. (1973). Miniature eye movement. Science, 181(102), 810–819.
Stent, G. S. (1973). A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. USA, 70, 997–1001.
Tolhurst, D., Walker, N., Thompson, I., & Dean, A. F. (1980). Non-linearities of temporal summation in neurones in area 17 of the cat. Exp. Brain. Res., 38, 431–435.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B. Biol. Sci., 265, 359–366.
Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press.
Received January 5, 2005; accepted July 19, 2005.
LETTER
Communicated by C. Lee Giles
Learning Beyond Finite Memory in Recurrent Networks of Spiking Neurons

Peter Tiňo
[email protected]
Ashely J. S. Mills
[email protected]
School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.

Neural Computation 18, 591–613 (2006)
© 2006 Massachusetts Institute of Technology
We investigate possibilities of inducing temporal structures without fading memory in recurrent networks of spiking neurons strictly operating in the pulse-coding regime. We extend the existing gradient-based algorithm for training feedforward spiking neuron networks, SpikeProp (Bohte, Kok, & La Poutré, 2002), to recurrent network topologies, so that temporal dependencies in the input stream are taken into account. It is shown that temporal structures with unbounded input memory specified by simple Moore machines (MM) can be induced by recurrent spiking neuron networks (RSNN). The networks are able to discover pulse-coded representations of abstract information processing states coding potentially unbounded histories of processed inputs. We show that it is often possible to extract from trained RSNN the target MM by grouping together similar spike trains appearing in the recurrent layer. Even when the target MM was not perfectly induced in a RSNN, the extraction procedure was able to reveal weaknesses of the induced mechanism and the extent to which the target machine had been learned.

1 Introduction

A considerable amount of work has been devoted to studying computations on time series in a variety of connectionist models, most prominently in models with feedback delay connections between the neural units. Such models are commonly referred to as recurrent neural networks (RNNs). Feedback connections endow RNNs with a form of neural memory that makes them (theoretically) capable of processing time structures over arbitrarily long time spans. However, although RNNs are capable of simulating Turing machines (Siegelmann & Sontag, 1995), induction of nontrivial temporal structures beyond finite memory can be problematic (Bengio, Frasconi, & Simard, 1994). Finite state machines (FSMs) and automata constitute a simple yet well-established and easy-to-analyze framework for describing temporal structures that go beyond finite memory relationships. In general, for a finite description of the string mapping realized by an FSM, one needs
a notion of an abstract information processing state that can encapsulate histories of processed strings of arbitrary finite length. Indeed, FSMs have been a popular benchmark in the recurrent network community, and there is a huge amount of literature dealing with empirical and theoretical aspects of learning FSMs and automata in RNNs (e.g., Cleeremans, Servan-Schreiber, & McClelland, 1989; Giles et al., 1992; Tiňo & Šajda, 1995; Frasconi, Gori, Maggini, & Soda, 1996; Omlin & Giles, 1996; Casey, 1996; Tiňo, Horne, Giles, & Collingwood, 1998). However, the RNNs under consideration have been based on traditional artificial neural network models that describe neural activity in terms of rates of spikes¹ produced by individual neurons (rate coding). It remains controversial whether, when describing computations performed by a real biological system, one can abstract from the individual spikes and consider only macroscopic quantities, such as the number of spikes emitted by a single neuron (or a population of neurons) per time interval. Several models of spiking neurons, where the input and output information is coded in terms of exact timings of individual spikes (pulse coding), have been proposed (see, e.g., Gerstner, 1999). Learning algorithms for acyclic networks of such (biologically more plausible) artificial neurons have been developed and tested (Bohte, Kok, & La Poutré, 2002; Moore, 2002). Maass (1996) proved that networks of spiking neurons with feedback connections (recurrent spiking neuron networks, RSNNs) can simulate Turing machines. Yet virtually no systematic work has been reported on inducing deeper temporal structures in such networks. There are recent developments along this direction, however; for example, Natschläger and Maass (2002) investigated induction of finite memory machines (of depth 3) in feedforward spiking neuron networks. A memory mechanism was implemented in a biologically realistic model of dynamic synapses (Maass & Markram, 2002) feeding a pool P of spiking neurons. The output was given by a spiking neuron converting space-rate coding in P into an output spike train. In robotics, Floreano and collaborators evolved controllers containing spiking neuron networks for vision-based mobile robots and adaptive indoor micro-flyers (Floreano & Mattiussi, 2001; Floreano, Zufferey, & Nicoud, 2005). In such studies, there is usually a leap in the coding strategy from emphasis on spike timings in individual neurons (pulse coding) to more space-rate-based population codings. Although most experimental research focuses on characterizations of potential information processing states using temporal statistics of rate properties in spike trains (e.g., Abeles et al., 1995; Martignon et al., 2000), there is some experimental evidence that in certain situations, the temporal information may be pulse-coded (Nadasdy, Hirase, Czurk, Csicsvari, & Buzski, 1999; DeWeese & Zador, 2003).
1. Identical electrical pulses, also known as action potentials.
A related development was marked by the introduction of so-called liquid state machines (LSM) by Maass, Natschläger, and Markram (2002). LSM is a metaphor for a new way of viewing real-time computation in recurrent neural circuits. Recurrent networks (possibly with spike-driven dynamics) serve as fixed, nonadaptable general-purpose temporal integrators. The only adaptive part (conforming to the specification of the temporal task at hand) consists of a relatively simple and trainable readout unit operating on the recurrent circuit. The circuit needs to be sufficiently complex that subtle information in the input stream is encoded in the high-dimensional transient states of the circuit that are intelligible to the readout (output) unit. In other words, in the recurrent circuit, there is no learning to find temporal features useful for solving the task at hand. Instead, it is assumed that the hard-wired subsystem responsible for computing the features (information processing states) from the input stream is complex enough to serve as a general-purpose filter for a wide range of temporal tasks. The concept of LSM represents an exciting and fresh outlook on computations performed by neural circuits. It argues that there is an alternative to the more traditional attractor-based computations in neural networks. Stable internal states are not required for stable output. Provided certain conditions on the dynamics of the neural circuit and the readout unit are satisfied, LSMs have universal power for computations with fading memory.²

However, a recent study of the decoding properties of LSM on sequences of visual stimuli encoded in a temporal population code by Knüsel, Wyss, König, and Verschure (2004) shows that the LSM mechanism can lead to undesirable mixing of information across stimuli. When the stimuli are not strongly temporally correlated, a special stimulus-onset reset signal may be needed to improve decoding of temporal population codes for individual stimuli.

In this study, we are concerned with possibilities of inducing deep temporal structures without fading memory in recurrent networks of spiking neurons. We strictly adhere to pulse coding; that is, all the input, output, and state information is coded in terms of spike trains on subsets of neurons. Using the words of Natschläger and Ruf (1998), "This paper is not about biology but about possibilities of computing with spiking neurons which are inspired by biology. . . . There is still not much known about possible learning in such systems. . . . A thorough understanding of such networks, which are rather simplified in comparison to real biological networks, is necessary for understanding possible mechanisms in biological systems." This letter is organized as follows. We introduce the recurrent spiking neuron network used in this study in section 2 and develop a training
2. Any time-invariant map with fading memory from functions of time u(t) to functions of time y(t) can be approximated by an LSM to any degree of precision (Maass et al., 2002).
algorithm for such a network in section 3. The experiments in section 4 are followed by a discussion in section 5. Section 6 concludes by summarizing the key messages of this study.

2 Model

First, we briefly describe the formal model of spiking neurons, the spike response model (Gerstner, 1995), employed in this study (see also Bohte, 2003; Maass & Bishop, 2001). Spikes emitted by neuron i are propagated to neuron j through several synaptic channels $k = 1, 2, \ldots, m$, each of which has an associated synaptic efficacy (weight) $w^k_{ij}$ and an axonal delay $d^k_{ij}$. In each synaptic channel k, input spikes get delayed by $d^k_{ij}$ and transformed by a response function $\varepsilon^k_{ij}$, which models the rate of neurotransmitter diffusion across the synaptic cleft. The response function can be either excitatory (contributing to excitatory postsynaptic potential, EPSP) or inhibitory (contributing to inhibitory postsynaptic potential, IPSP). Formally, denote the set of all (presynaptic) neurons emitting spikes to neuron j by $\Gamma_j$. Let the last spike time of a presynaptic neuron $i \in \Gamma_j$ be $t^a_i$. The accumulated potential at time t on the soma of unit j is
$$x_j(t) = \sum_{i \in \Gamma_j} \sum_{k=1}^{m} w^k_{ij}\, \varepsilon^k_{ij}\big(t - t^a_i - d^k_{ij}\big), \tag{2.1}$$
where the response function $\varepsilon^k_{ij}$ is modeled as

$$\varepsilon^k_{ij}(t) = \sigma^k_{ij} \cdot (t/\tau) \cdot \exp\big(1 - (t/\tau)\big) \cdot H(t). \tag{2.2}$$
Here, $\sigma^k_{ij}$ is 1 if the synapse k between neurons i, j is concerned with transmitting EPSP and −1 if it transmits IPSP; $\tau$ represents the membrane potential decay time constant (which describes the rate at which current leaks out of the postsynaptic neuron); H(t) is the Heaviside step function, which is 1 for $t > 0$ and 0 otherwise. Neuron j fires a spike (and depolarizes) when the accumulated potential $x_j(t)$ reaches a threshold $\vartheta$. In a feedforward spiking neuron network (FFSNN), the first neurons to fire a spike are the input units. Spatial spike patterns across input neurons code the information to be processed by the FFSNN; the spikes propagate to subsequent layers, finally resulting in a pattern of spike times across neurons in the output layer. The output spike times represent the response of the FFSNN to the current input. The input-to-output propagation of spikes through the FFSNN is confined to a simulation interval of length $\Upsilon$. All neurons
can fire at most once within the simulation interval.³ After the simulation interval has expired, the output spike pattern is read out and interpreted, and a new simulation interval is initialized by presenting a new input spike pattern in the input layer. Given a mechanism for temporal encoding and decoding of the input and output information, respectively, Bohte et al. (2002) have recently formulated a backpropagation-like supervised learning rule for training FFSNN, called SpikeProp. Synaptic efficacies on connections to the output unit j are updated as follows:

$$\Delta w^k_{ij} = -\eta \cdot \varepsilon^k_{ij}\big(t^a_j - t^a_i - d^k_{ij}\big) \cdot \delta_j, \tag{2.3}$$
where

$$\delta_j = \frac{t^d_j - t^a_j}{\displaystyle\sum_{i \in \Gamma_j} \sum_{k=1}^{m} w^k_{ij}\, \varepsilon^k_{ij}\big(t^a_j - t^a_i - d^k_{ij}\big)\Big(1/\big(t^a_j - t^a_i - d^k_{ij}\big) - 1/\tau\Big)} \tag{2.4}$$
and $\eta > 0$ is the learning rate. The numerator is the difference between the desired $t^d_j$ and actual $t^a_j$ firing times of the output neuron j within the simulation interval. Synaptic efficacies on connections to the hidden unit i are updated analogously:

$$\Delta w^k_{hi} = -\eta \cdot \varepsilon^k_{hi}\big(t^a_i - t^a_h - d^k_{hi}\big) \cdot \delta_i, \tag{2.5}$$
where

$$\delta_i = \frac{\displaystyle\sum_{j \in \Gamma^i} \delta_j \sum_{k=1}^{m} w^k_{ij}\, \varepsilon^k_{ij}\big(t^a_j - t^a_i - d^k_{ij}\big)\Big(1/\big(t^a_j - t^a_i - d^k_{ij}\big) - 1/\tau\Big)}{\displaystyle\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w^k_{hi}\, \varepsilon^k_{hi}\big(t^a_i - t^a_h - d^k_{hi}\big)\Big(1/\big(t^a_i - t^a_h - d^k_{hi}\big) - 1/\tau\Big)} \tag{2.6}$$

and $\Gamma^i$ denotes the set of all (postsynaptic) neurons to which neuron i emits spikes. The numerator pulls in contributions from the layer succeeding that for which the δ's are being calculated.⁴

Obviously, an FFSNN cannot properly deal with temporal structures in the input stream that go beyond finite memory. One possible solution is to turn the FFSNN into a recurrent spiking neuron network (RSNN) by extending the feedforward architecture with feedback connections.
3. The period of neuron refractoriness (a neuron is unlikely to fire shortly after producing a spike) is not modeled; thus, to maintain biological plausibility, a neuron may fire only once within the simulation interval (see, e.g., Bohte et al., 2002).
4. When a neuron does not fire, its contributions are not incorporated into the calculation of the δ's for other neurons; neither is a δ calculated for it.
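For concreteness, the following minimal sketch reimplements the response function of equation 2.2 and the output-layer update of equations 2.3 and 2.4. It is an illustration under assumed parameter values (membrane constant, weights, firing times), not the authors' code, and it models excitatory synapses ($\sigma = 1$) only.

```python
import numpy as np

TAU = 7.0  # membrane potential decay time constant, in ms (assumed value)

def eps(t, tau=TAU):
    """eps(t) = (t / tau) * exp(1 - t / tau) * H(t), elementwise."""
    t_safe = np.where(t > 0, t, tau)  # dummy argument off-support
    return np.where(t > 0, (t_safe / tau) * np.exp(1.0 - t_safe / tau), 0.0)

def output_delta(t_j, t_des, t_pre, w, d, tau=TAU):
    """delta_j of equation 2.4 for a single output neuron j.

    t_pre: (n_pre,) last presynaptic firing times; w, d: (n_pre, m) synaptic
    weights and axonal delays of the m channels per connection.
    """
    s = t_j - t_pre[:, None] - d            # arguments of eps, one per channel
    s_safe = np.where(s > 0, s, tau)
    denom = np.sum(np.where(s > 0, w * eps(s) * (1.0 / s_safe - 1.0 / tau), 0.0))
    return (t_des - t_j) / denom

def output_weight_update(t_j, t_des, t_pre, w, d, eta=0.01):
    """Equation 2.3: dw_ij^k = -eta * eps(t_j - t_i - d_ij^k) * delta_j."""
    delta_j = output_delta(t_j, t_des, t_pre, w, d)
    return -eta * eps(t_j - t_pre[:, None] - d) * delta_j

# Toy example: one output neuron, three presynaptic neurons, m = 4 channels.
rng = np.random.default_rng(1)
w = rng.uniform(0.0, 10.0, (3, 4))
d = np.tile(np.arange(1.0, 5.0), (3, 1))
print(output_weight_update(22.0, 20.0, np.array([7.0, 8.0, 12.0]), w, d))
```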
In analogy with RNNs, we select a hidden layer in the FFSNN as the layer responsible for coding (through spike patterns) important information about the history of inputs seen so far (recurrent layer) and feed back its spiking patterns through delay synaptic channels to an auxiliary layer at the input level, called the context layer. The input and context layers now collectively form a new extended input layer of the RSNN. The delay feedback connections temporally translate spike patterns in the recurrent layer by the delay constant $\Delta$,

$$\alpha(t) = t + \Delta. \tag{2.7}$$
Such temporal translation can be achieved using networks of spiking neurons. Experimentation has shown that it is trivial to train an FFSNN with one input and one output to implement an arbitrary delay to high precision, so long as the desired delay does not exceed the temporal resolution at which the FFSNN operates. Thus, multiple copies of these trained networks can be used to delay the firing times of recurrent units in parallel. Figure 1 shows a typical RSNN architecture used in our experiments. As in the general TIS (transformed input and state) memory model (Mozer, 1994), the network consists of five layers. Each input item is processed within a single simulation interval. The extended input layer (input and context layers, denoted by I and C, respectively) feeds the first auxiliary hidden layer H1, which in turn feeds the recurrent layer Q. Within the nth simulation interval, the spike timings of neurons in the input and context layers I and C are stored in the spatial spike train vectors i(n) and c(n), respectively. The spatial spike trains of the first hidden and recurrent layers are stored in vectors h1(n) and q(n), respectively. The role of the recurrent layer Q is twofold (see the sketch after this list):

1. The spike train q(n) codes information about the history of n inputs seen so far. This information is passed to the next simulation interval through the delay FFSNN network⁵ α, c(n + 1) = α(q(n)). The delayed spatial spike train c(n + 1) appears in the context layer. The spike train (i(n + 1), c(n + 1)) of the extended input in simulation interval n + 1 consists of the history-coding spike train c(n + 1) (representing the previously seen n inputs) and a spatial spike train i(n + 1) coding the current, (n + 1)st, external input (input symbol).

2. The recurrent layer feeds the second auxiliary hidden layer H2, which finally feeds the output layer O. Within the nth simulation interval, the spatial spike trains in the second hidden and output layers are stored in vectors h2(n) and o(n), respectively.
5. The delay function α(t) is applied to each component of q(n).
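The data flow just described can be summarized in a few lines. The sketch below is schematic: the toy stand-in for the forward pass merely mimics the per-layer lag of Figure 2 and substitutes for a real simulation of the five spiking layers.

```python
# Schematic sketch of the recurrent data flow (equation 2.7 plus the twofold
# role of the recurrent layer). `forward_interval` and the toy stand-in are
# illustrative assumptions, not the authors' implementation.

DELTA = 30.0  # feedback delay, in ms (the value used in Figure 2)

def alpha(q, delta=DELTA):
    """Equation 2.7: delay every recurrent firing time by Delta."""
    return [t + delta for t in q]

def run_rsnn(forward_interval, input_trains, c_start):
    """Drive the network over an input string; forward_interval maps
    (i(n), c(n)) to (q(n), o(n)) for one simulation interval."""
    context, outputs = c_start, []
    for i_n in input_trains:
        q_n, o_n = forward_interval(i_n, context)
        outputs.append(o_n)
        context = alpha(q_n)  # c(n + 1) = alpha(q(n))
    return outputs

# Toy stand-in mimicking the roughly 5 ms per-layer lag quoted in the text;
# with the Figure 2 inputs it reproduces c(2) = alpha(q(1)) = [41, 42].
toy = lambda i_n, c_n: ([min(i_n + c_n) + 11, min(i_n + c_n) + 12],
                        [min(i_n + c_n) + 22])
print(run_rsnn(toy, [[0, 6, 6, 0, 0], [40, 46, 40, 46, 40]], [0.0, 0.0]))
```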
Figure 1: Typical RSNN architecture used in our experiments. When processing the nth input item, the external input is encoded through spike train i(n) in layer I, the state information is computed as spike train q(n) in layer Q, and the output is represented by the spike train o(n) in layer O. Information about the inputs processed before seeing the nth input item is contained in context spike train c(n) in layer C. c(n) = α(q(n − 1)) is a state-encoding spike train from the previous time step, delayed by Δ. Hidden layers H1 and H2 (with spike trains h1(n) and h2(n), respectively) are auxiliary layers enhancing the capabilities of the network to represent the state information and calculate the output.
Parameters, such as the length $\Upsilon$ of the simulation interval, the feedback delay $\Delta$, and the spike time encodings of input-output symbols, have to be carefully coordinated. We illustrate this by unfolding an RSNN through two simulation periods (corresponding to the processing of two input items) in Figure 2. The input string is 10, and the corresponding desired output string is 01. The input layer I has five neurons. There is a single neuron in the output layer O. All the other layers have two neurons each. Index n indexes the input items (and simulation intervals). Input symbols 0 and 1 are encoded in the five input units through spike patterns $i_0 = [0, 6, 0, 6, 0]$ and $i_1 = [0, 6, 6, 0, 0]$, respectively (all times are shown in ms). Spike patterns in the single output neuron for output symbols 0 and 1 are $o_0 = [20]$ and $o_1 = [26]$, respectively. Approximately 5 ms elapses between the firing of
Figure 2: The first two steps in the operation of an RSNN. The input string is 10, and the corresponding desired output is the string 01. The input layer I has five neurons. There is a single neuron in the output layer O. All the other layers have two neurons each. Spike train vectors are shown in ms. Processing of the nth input symbol starts at $t_{start}(n) = (n - 1) \cdot 40$ ms. The target (desired) output patterns t(n) are shown above the output layers. Context spike train c(2) is the state spike train q(1) from the previous time step, delayed by Δ = 30 ms. The initial context spike train, $c_{start}$, is imposed externally at the beginning of training.
neurons in each subsequent layer. Processing of the nth input symbol starts at

$$t_{start}(n) = (n - 1) \cdot \Upsilon, \tag{2.8}$$
where $\Upsilon = 40$ ms is the length of the simulation interval. Spike train i(n) representing the nth input symbol $s_n \in \{0, 1\}$ is calculated by shifting the corresponding input spike pattern $i_{s_n}$ by $t_{start}(n)$,

$$i(n) = i_{s_n} + t_{start}(n). \tag{2.9}$$
The same principle applies to the calculation of the target (desired) spike patterns t(n) that would be observed at the output if the network functioned correctly. If, after presentation of the nth input symbol $s_n$, the desired output symbol is $\sigma_n \in \{0, 1\}$, then t(n) is calculated as

$$t(n) = o_{\sigma_n} + t_{start}(n). \tag{2.10}$$
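A compact sketch of this encoding scheme follows; it uses the spike-time codes quoted above, with helper names that are illustrative rather than taken from the paper, and reproduces the spike trains of Figure 2.

```python
# Temporal encoding of equations 2.8-2.10, with Upsilon = 40 ms as in the text.

UPSILON = 40.0  # length of one simulation interval, in ms

INPUT_CODES = {0: [0, 6, 0, 6, 0], 1: [0, 6, 6, 0, 0]}  # i_0 and i_1
OUTPUT_CODES = {0: [20], 1: [26]}                        # o_0 and o_1

def t_start(n):
    """Equation 2.8: processing of the n-th symbol starts at (n - 1) * Upsilon."""
    return (n - 1) * UPSILON

def encode_input(symbol, n):
    """Equation 2.9: i(n) = i_{s_n} + t_start(n)."""
    return [t + t_start(n) for t in INPUT_CODES[symbol]]

def encode_target(symbol, n):
    """Equation 2.10: t(n) = o_{sigma_n} + t_start(n)."""
    return [t + t_start(n) for t in OUTPUT_CODES[symbol]]

# Reproduces the spike trains of Figure 2 (input string 10, target string 01).
assert encode_input(1, 1) == [0, 6, 6, 0, 0]
assert encode_input(0, 2) == [40, 46, 40, 46, 40]
assert encode_target(0, 1) == [20]
assert encode_target(1, 2) == [66]
```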
The network is trained by minimizing the difference between the target and observed output spike patterns, t(n) and o(n), respectively. The training procedure is outlined in the next section. Context spike train c(2) is the state spike train q(1) from the previous time step, delayed by Δ = 30 ms. The initial context spike train, $c_{start}$, is imposed externally at the beginning of training.

3 Training: SpikePropThroughTime

We extended the SpikeProp algorithm (Bohte et al., 2002) for training FFSNN to recurrent models in the spirit of backpropagation through time for rate-based RNNs (Werbos, 1989), that is, using the unfolding-in-time methodology. We call this learning algorithm SpikePropThroughTime. Given an input string of length n, n copies of the base RSNN are made, stacked on top of each other, and sequentially simulated, incrementing $t_{start}$ by $\Upsilon$ after each simulation interval. Expanding the base network through time via multiple copies simulates processing of the input stream by the base RSNN. Adaptation δ's (see equations 2.4 and 2.6) are calculated for each of the network copies. The synaptic efficacies (weights) in the base network are then updated using the δ's calculated in each of the copies by adding up, for every weight, the n corresponding weight-update contributions of equations 2.3 and 2.5. Figure 3 shows the expansion through time of a five-layer base RSNN on an input of length 2. The first copy is simulated with $t_{start}(1) = \Upsilon \cdot 0 = 0$. As explained in the previous section, all firing times of the first copy are relative to 0. For copies n > 1, the external inputs and desired outputs are made relative to $t_{start}(n) = (n - 1) \cdot \Upsilon$. In an FFSNN, when calculating the δ's for a hidden layer, the firing times from the preceding and succeeding layers are used. Special attention must be paid when calculating the δ's of neurons in the recurrent layer Q. Context spike train c(n + 1) in copy n + 1 is the delayed recurrent spike train q(n) from the nth copy. The relationship of firing times in c(n + 1) and h1(n + 1)
Figure 3: Expansion through time of a five-layer base RSNN on an input of length 2.
contains the information that should be incorporated into the calculation of the δ's for recurrent units in copy n. The delay constant $\Delta$ is subtracted from the firing times h1(n + 1) of H1, and then, when calculating the δ's for recurrent units in copy n, these temporally translated firing times are used as if they were simply another hidden layer succeeding Q in copy n. Denoting by $\Gamma^{2,n}$ the set of neurons in the second auxiliary hidden layer H2 of the nth copy and by $\Gamma^{1,n+1}$ the set of neurons in the first auxiliary hidden layer H1 of copy n + 1, the δ of the ith recurrent unit in the nth copy is calculated as
$$\delta_i = \frac{\displaystyle\sum_{j \in \Gamma^{1,n+1}} \delta_j \sum_{k=1}^{m} w^k_{ij}\, \varepsilon^k_{ij}\big(t^a_j - \Delta - t^a_i - d^k_{ij}\big)\Big(1/\big(t^a_j - \Delta - t^a_i - d^k_{ij}\big) - 1/\tau\Big)}{\displaystyle\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w^k_{hi}\, \varepsilon^k_{hi}\big(t^a_i - t^a_h - d^k_{hi}\big)\Big(1/\big(t^a_i - t^a_h - d^k_{hi}\big) - 1/\tau\Big)} + \frac{\displaystyle\sum_{j \in \Gamma^{2,n}} \delta_j \sum_{k=1}^{m} w^k_{ij}\, \varepsilon^k_{ij}\big(t^a_j - t^a_i - d^k_{ij}\big)\Big(1/\big(t^a_j - t^a_i - d^k_{ij}\big) - 1/\tau\Big)}{\displaystyle\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w^k_{hi}\, \varepsilon^k_{hi}\big(t^a_i - t^a_h - d^k_{hi}\big)\Big(1/\big(t^a_i - t^a_h - d^k_{hi}\big) - 1/\tau\Big)}. \tag{3.1}$$
4 Learning Beyond Finite Memory in RSNN: Inducing Moore Machines

4.1 Moore Machines. One of the simplest computational models that encapsulates the concept of unbounded input memory is the Moore machine (Hopcroft & Ullman, 1979). Formally, an (initial) Moore machine (MM) M is a six-tuple $M = (U, V, S, \beta, \gamma, s_0)$, where U and V are finite input and output alphabets, respectively; S is a finite set of states; $s_0 \in S$ is the initial state; $\beta : S \times U \to S$ is the state transition function; and $\gamma : S \to V$ is the output function. Given an input string $u = u_1 u_2 \ldots u_n$ of symbols from U ($u_i \in U$, $i = 1, 2, \ldots, n$), the machine M acts as a transducer by responding with the output string $v = M(u) = v_1 v_2 \ldots v_n$, $v_i \in V$, computed as follows. First, the machine is initialized with the initial state $s_0$. Then for all $i = 1, 2, \ldots, n$, the new state is recursively determined, $s_i = \beta(s_{i-1}, u_i)$, and the machine emits the output symbol $v_i = \gamma(s_i)$. Moore machines are conveniently represented as directed labeled graphs: nodes represent states, and arcs, labeled by symbols from U, represent state transitions initiated by the input symbols. An example of a simple MM is shown in Figure 4.

4.2 Encoding of Input and Output. In this study, we consider input alphabets of one or two symbols. Moreover, there is a special end-of-string symbol 2 initiating transitions to the initial state. In the experiments, the input layer I had five neurons. The input symbols 0, 1, and 2 are encoded in the five input units through spike patterns $i_0 = [0, 6, 0, 6, 0]$, $i_1 = [0, 6, 6, 0, 0]$, and $i_2 = [6, 0, 0, 6, 0]$, respectively. The firing times are given in ms.
Figure 4: An example of a simple two-state Moore machine. Circles denote states, and arcs denote state transitions. The number in the upper left of each state labels it. The number in the lower right of each state specifies its output value. Processing of each input string starts in the initial state 0. Each arc is labeled with the input symbol initiating state transition specified by the arc. The special end-of-string reset symbol 2 initiates a transition from every state to the initial state (dashed arcs). As an example, the input sequence 00111001 is mapped by the machine to the output sequence 00101110.
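The machine of Figure 4 is small enough to write down directly. The sketch below encodes the six-tuple of section 4.1 as dictionaries (an illustrative representation, not code from the paper) and reproduces the caption's example mapping.

```python
# Dictionary-based encoding of (U, V, S, beta, gamma, s0) for Figure 4.

def run_moore_machine(beta, gamma, s0, u, reset_symbol=2):
    """Transduce input string u into M(u); symbol 2 resets the machine to s0."""
    s, out = s0, []
    for symbol in u:
        s = s0 if symbol == reset_symbol else beta[(s, symbol)]
        out.append(gamma[s])
    return out

# Figure 4 machine: two states with gamma(s) = s and beta(s, u) = s XOR u.
beta = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
gamma = {0: 0, 1: 1}

# Reproduces the caption example: 00111001 is mapped to 00101110.
assert run_moore_machine(beta, gamma, 0, [0, 0, 1, 1, 1, 0, 0, 1]) == [0, 0, 1, 0, 1, 1, 1, 0]
```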
The last input neuron acts like a reference neuron, always firing at the beginning of any simulation interval. In all our experiments, we used the binary output alphabet V = {0, 1}, and the output layer O of the RSNN consisted of a single neuron. Spike patterns (in ms) in the output neuron for output symbols 0 and 1 are $o_0 = [20]$ and $o_1 = [26]$, respectively.

4.3 Generation of Training Examples and Learning. Given a target Moore machine M, a set of training examples is constructed by explicitly constructing input strings⁶ u over U and then determining the corresponding output string M(u) over V (by traversing edges of the graph of M, starting in the initial state, as prescribed by the input string u). The training set D consists of N couples of input-output strings,

$$D = \{(u_1, M(u_1)), (u_2, M(u_2)), \ldots, (u_N, M(u_N))\}.$$
6 It is important to choose strings that exercise structures prominent in the target Moore machine; for example, if a Moore machine contains several discrete loops linked by transitions, then the training string should exercise these loops and transitions. Random walks over the target machine, even those that exercise every transition, were found to be less effective at inducing structures than an intuitive exercising of the main structural components.
Learning Beyond Finite Memory with Spiking Neurons
603
We adopt the strategy of inducing the initial state in the recurrent network ˇ (as opposed to externally imposing it; see Forcada & Carrasco, 1995; Tino ˇ & Sajda, 1995). The context layer of the network is initialized with the fixed predefined context spike train c(1) = cstart only at the beginning of training. From the network’s point of view, the training set is a couple (u, ˜ M(u)) ˜ of the long concatenated input sequence, u˜ = u1 2u2 2u3 2, . . . , 2u N−1 2u N 2, and the corresponding output sequence is M(u) ˜ = M(u1 )γ (s0 )M(u2 )γ (s0 )M(u3 )γ (s0 ) . . . γ (s0 )M(u N−1 )γ (s0 )M(u N )γ (s0 ). Input symbol 2 is instrumental in inducing the start state by acting as an end-of-string reset symbol initiating transition from every state of M to the initial state s0 . The network is trained using SpikePropThroughTime (section 3) to minimize the squared error between the desired output spike trains derived from M(u) ˜ when the RSNN is driven by the input u. ˜ The RSNN is unfolded, and SpikePropThroughTime is applied for each training pattern (ui , M(ui )), i = 1, 2, . . . , N. In rate-based neural networks, the weights are conventionally initialized to random elements of from the interval [0, 1]. For general feedforward networks of spiking neurons, it appears that this is not applicable when using the SpikeProp learning rule; learning of simple nonlinear training sets appears to fail if the threshold is not sufficiently high and the weights are not sufficiently scaled. The problem was first identified in Moore (2002).7 The firing threshold is 50, and the weights are initialized to random elements of from the interval (0, 10).8 We use a dynamic learning rate strategy that detects oscillatory behavior and plateaus within the error space. The action to take upon detecting oscillation or plateau is, respectively, to decrease the learning rate by multiplying by an oscillation-counter-coefficient (< 1), or increase the learning rate by multiplying by a plateau-counter-coefficient (> 1) (see, e.g., Lawrence, Giles, & Fong, 2000). 4.4 Extraction of Induced Structure. It has been extensively verified in the context of traditional rate-based RNN that it is often possible to
7 We used the initial weight setting, neuron threshold parameters, and the learning rate as suggested by this thesis. 8 Whatever weight setting is chosen, the weights must be initialized so that the neurons in subsequent layers are sufficiently excited by those in their previous layer that they fire; otherwise, the network would be unusable. There is no equivalent in traditional rate-based neural networks to the nonfiring of a neuron in this sense.
ˇ and A. Mills P. Tino
604
extract from RNN the target finite state machine M that the network was successfully trained and tested on. The extraction procedure operates on activation patterns appearing in the recurrent layer while processing9 M ˇ ˇ & Sajda, (e.g., Giles et al., 1992; Tino 1995; Frasconi et al., 1996; Omlin & Giles, 1996; for an extensive review and criticism, see Jacobsson, 2005). One possibility10 is to drive the network with sufficiently long input strings and record the recurrent layer activations in a set B. The set B is then clustered using a vector-quantization tool. The cluster indexes will become states of the extracted finite state machine. Next, the network is reset with repeated presentation of the the end-of-string symbol 2 and then driven once more with sufficiently long input strings. At each time step n, one records:
r
r
The index q˜ (n) of the cluster containing the recurrent activation vector. The cluster index q˜ (n) represents context cluster c˜ (n + 1) = q˜ (n) at time step n + 1. Given the current input symbol and context cluster c˜ (n) = q˜ (n − 1), the network transits (on the cluster level) to the next context cluster c˜ (n + 1) = q˜ (n). The network output associated with the next state c˜ (n + 1) = q˜ (n).11
Finally, the cluster transitions and output actions are summarized as a graph of a finite state machine that is minimized to its equivalent canonical form (Hopcroft & Ullman, 1979). The state information in RSNN is coded as spike trains in the recurrent layer Q. We studied whether, in analogy with rate-based RNN, the abstract information processing states can be discovered by detecting natural groupings of normalized spike trains12 (q(n) − tstart (n)) using a vector quantization tool (in our experiments, k-means clustering).13 We also applied the extraction procedure to RSNN that managed to induce the target Moore machine only partially, so that the extent to which the target machine has been induced and the weaknesses of the induced mechanism can be exposed and understood. 4.5 Experimental Results. The network had five neurons in layers I , C, H1 , Q, and H2 . Within each of those layers, one neuron was inhibitory; all the others were excitatory. Each connection between neurons had
9
With frozen weights. Given the Moore machine setting and RNN architecture used in this letter. 11 If there are several output symbols associated with recurrent layer activations in the cluster q˜ (n), the cluster q˜ (n) is split so that the inconsistency is removed. 12 The spike times q(n) entering the quantization phase are made relative to the start time tstart (n) of the simulation interval. 13 The goal here is to assign identical cluster indexes to similar firing times and different indexes to dissimilar firing times. Although the spiking neuron clustering method of Natschl¨ager and Ruf (1998) worked, k-means clustering is simpler (it requires fewer hyperparameters) and faster, so it was used in practice. 10
Learning Beyond Finite Memory with Spiking Neurons
605
m = 16 synaptic channels, with delays dijk = k, k = 1, 2, . . . , m, realizing axonal delays between 1 ms and 16 ms. The decay constant τ in response functions ij was set to τ = 3. The length ϒ of the simulation interval was set to 40 ms. The delay was 30 ms. The inputs and desired outputs were coded into the spike trains as described in equations 2.8 to 2.10 and section 4.2. We used SpikePropThroughTime to train RSNN. The training was error monitored, and training was stopped when it was clear that the network had learned the target (zero thresholded output error) with sufficient stability. In some cases, training was carried on until zero absolute error was achieved. The maximum number of training epochs (sweeps through the training set) was 10,000. First, we experimented with cyclic machines C p of period p ≥ 2: U = {0}; V = {0, 1}; S = {0, 1, 2, . . . , p − 1}; s0 = 0; for 0 ≤ i < p − 1, β(i, 0) = i + 1 and β( p − 1, 0) = 0; γ (0) = 0 and for 0 < i ≤ p − 1, γ (i) = 1. The RSNN perfectly learned machines C p , 2 ≤ p < 5. After training, the respective RSNN emulated the operation of these MM perfectly and apparently indefinitely (no deviations from expected behavior were observed over test sets having length of the order 104 ). The training set had to be incrementally constructed by iteratively training with one presentation of the cycle, then two presentations, and so on. Note that given that the network can only observe the inputs, these MMs would require an unbounded input memory buffer. So no mechanism with vanishing (input) memory can implement string mappings represented by such MMs. Using the successful networks, we extracted unambiguously all the machines C p of period 1 ≤ p < 5. The number of clusters in k-means clustering was set to 10. Second, we trained RSNN on a two-state machine M2 shown in Figure 4. Again, the RSNN perfectly learned the machine.14 As in the previous experiment, no mechanism with vanishing input memory can implement string mappings defined by this Moore machine. Using the successful networks, we extracted unambiguously the machine M2 (the number of clusters in k-means clustering was 10). In the third experiment, we investigate the information available in the context firing times in case of partial induction. Consider the three-state machine M3 in Figure 5. The machine has two main fixed-input cycles in opposite directions. Training led to an error rate of ≈ 0.3 over test strings of ˜ 3 is shown in Figure 6. length 10,000. The extracted induced machine M The cycle on input 1 in M3 has been successfully induced, but the cycle on input 0 has not. The oscillation between states 4 and 1 on strings {01}+ in ˜ 3 corresponds to the oscillation between states 1 and 2 in M3 . Transitions M
14
Repeated presentation of only five carefully selected training patterns of length 4 was sufficient for induction of this machine. No deviations from expected behavior were observed over test sets of length of the order 104 .
ˇ and A. Mills P. Tino
606
1 0 2
0
0
0 2
1
1
1
0
0
2 1
2 Figure 5: A three-state target Moore machine M3 that was partially induced.
˜ 3, a on the reset symbol 2 have also been induced correctly. Curiously, in M cycle of length 4 (over the states 1,3,5,4) has been induced on fixed input 0. 5 Discussion We were able to train RSNN to mimic target MMs requiring unbounded input memory on only a relatively simple set of MMs. Compared with traditional rate-based RNN, two major problems are apparent when inducing structures like MM in RSNN: 1. There are two timescales the network operates on: (1) the shorter timescale of spike trains coding the input, output, and state information within a single simulation interval and (2) the longer timescale of sequences of simulation intervals, each representing a single inputto-output processing step. This timescale can be synchronized using spike oscillations as in Natschl¨ager and Maass (2002). Long-term dynamics have to be induced based on the behavior of the target MM, but these dynamics are driven ultimately by the shortterm dynamics of individual spikes. So in order to exhibit the desired
Learning Beyond Finite Memory with Spiking Neurons
2
2
1
4
0 2
0
1
607
0 1
1 0
1
0
3
0
0
0 2
1
0
2
1
0
1
1
5 1
2 2 ˜ 3 from a partial induction of the Moore machine Figure 6: Extracted machine M M3 shown in Figure 5.
long-term behavior, the network has to induce the appropriate shortterm dynamics. In contrast, only the long-term dynamics need to be induced in the rate-base RNN. 2. Spiking neurons used in this letter produce a spike only when the accumulated potential x j (t) reaches a threshold . This leads to discontinuities in the error surface. Gradient-based methods for training feedforward networks of spiking neurons alleviate this problem by resorting to simplifying assumptions on spike patterns within a single simulation interval (see Bohte et al., 2002; Bohte, 2003). The situation is much more complicated in the case of RSNN. A small weight perturbation can prevent a recurrent neuron from firing in the shorter timescale of a simulation interval. That can have serious consequences for further long-timescale processing, especially if such a change in short-term behavior appears at the beginning of presentation of a long input string. The error surface becomes erratic, as evidenced in Figure 7. We took an RSNN trained to perfectly mimic the cycle 4 machine C4 (see section 4.5). We studied the influence of perturbing weights w∗ in the recurrent part of the RSNN (e.g., between layers I , C, H1 , and Q) on the test error calculated on a long test string of length 1000. For each weight perturbation extent ρ, we randomly sampled 100 weight vectors w from the hyperball of radius ρ centered at w∗ . Shown are the mean and standard deviation values of the
ˇ and A. Mills P. Tino
608
Weight Perturbation vs Error for a Simple Four–State Automata 5 Mean Over 100 Trials Standard Deviation
4.5
4
Per Symbol Error
3.5
3
2.5
2
1.5
1
0.5
0
0
0.5
1
1.5 2 Absolute Perturbation Extent
2.5
3
Figure 7: Maximum radius of weight perturbation versus test error of RSNN trained to mimic the cycle 4 machine C4 . For each setting of weight perturbation extent ρ, we randomly sampled 100 weight vectors w from the hyperball of radius ρ centered at induced RSNN weights w∗ . Shown are the mean (solid line) and standard deviation (dashed line) values of the absolute output error per symbol.
absolute output error per symbol for 0 < ρ ≤ 3. Clearly, small perturbations of weights lead to large, abrupt changes in the test error. Obviously gradient-based methods, like our SpikePropThroughTime, have problems in locating good minima on such error surfaces. We tried the following to find RSNN weights in the experiments in section 4.4, but without much success. The abrupt and erratic nature of the error surface makes it hard, even for evolutionary techniques, to locate a good minimum.
r
Fast evolutionary strategies (FES) with (recommended) configuration (30,200)-ES,15 employing the Cauchy mutation function (see Yao & Liu, 1997; Yao, 1999).
15 In each generation, 30 parents generate 200 offspring through recombination and mutation.
Learning Beyond Finite Memory with Spiking Neurons
r r
609
Extended Kalman filtering in the parameter space (Puskorius & Feldkamp, 1994). A recent powerful evolutionary method for optimization on realvalued domains (Rowe & Hidovic, 2004).
We tried RSNNs with varying numbers of neurons in the hidden and recurrent layers. In general, the increased representational capacity of RSNNs with more neural units could not be used because of the problems with finding good weight settings due to the erratic nature of the error surface. Our SpikePropThroughTime algorithm can be extended by allowing for adjustable axonal delays dijk , individual synaptic decay constants τijk , and firing thresholds j (Schrauwen & Van Campenhout, 2004). While this may lead to models of reduced size, the inherent problems with learning instabilities will not be eliminated. We note that the finite memory machines induced in feedforward spiking neuron networks (Natschl¨ager & Maass, 2002, with dynamic synapses, Maass & Markram, 2002) were quite simple (of depth 3). The input memory depth is limited by the feedforward nature of such networks. As soon as one tries to increase the processing capabilities of spiking networks by introducing feedback connections while insisting on pulse coding, the induction process becomes complicated. Theoretically, it is possible to emulate in a stable manner any MM in a sufficiently rich RSNN. For example, given an MM M, one needs to first fix appropriate spike patterns representing abstract states of the machine M. Then an FFSNN NS (playing the role of subnetwork of RSNN with layers C, I, H1 , and Q) can be trained on the input-driven state transition structure of M. The target spike trains on the top layer of NS are calculated by shifting the spike patterns representing states of M by an appropriate time delay . Note that NS can be trained with traditional SpikeProp algorithm. While training, the target spike trains can be slightly perturbed to yield stable representations in NS of state transitions in M. Trained NS with added delay lines α (making spike trains at the top layer of NS appear in the context layer of NS with delay ) forms a recurrent part of the RSNN being constructed. We need to stack another FFSNN NO on top of NS to associate states with outputs. Again, NO can be trained using SpikeProp to associate spike trains in the top layer of NS (representing abstract states of M) with the corresponding outputs. The target output spike trains need to be shifted by an appropriate simulation interval length ϒ. Because processing in the spiking neuron networks is driven purely by relative differences between spike times in individual neurons, the RSNN consisting of NS (endowed with delay lines α) and NO will emulate M. Indeed, following this procedure, we were able to construct an RSNN perfectly emulating, for example, machine M2 in Figure 4 in a stable manner. Moreover, one can envisage a procedure for RSNN, analogous to that developed for
610
ˇ and A. Mills P. Tino
rate-based RNN by Giles and Omlin (1993), that would enable direct insertion of (the whole or part of) a given finite state machine (FSM) into an RSNN. Such RSNN-based emulators of FSMs can be used as submodules in larger computational devices operating on spike trains. However, this article aims to study induction of deeper-than-finite-inputmemory temporal structures in RSNN. FSMs offer a convenient platform for our study, as they represent a simple, well-established, and easy-to-analyze framework for describing temporal structures that go beyond finite memory relationships. Many previous studies of inducing deep input memory structures in rate-based RNNs were performed in this framework (Cleereˇ ˇ & Sajda, mans et al., 1989; Giles et al., 1992; Tino 1995; Frasconi et al., 1996; ˇ et al., 1998). When training an RSNN Omlin & Giles, 1996; Casey, 1996; Tino on example string mappings drawn from some target MM, the structure of the target MM is assumed unknown for the learner. Hence, we cannot split training of RSNN into two separate phases (training two FFSNNs NS and NO on state transitions and state-output associations, respectively). RSNN is a dynamical system, and weight changes in RSNN lead to complex bifurcation mechanisms, making it hard to induce more complex MM through a guided search in the weight space. It may be that in biological systems, long-term dependencies are represented using rate-based codings and/or a liquid state machine mechanism (Maass et al., 2002) with a complex but nonadaptable recurrent pulse-coded part. 6 Conclusion We have investigated the possibilities of inducing temporal structures without fading memory in recurrent networks of spiking neurons operating strictly in the pulse-coding regime. All the input, output, and state information is coded in terms of spike trains on subsets of neurons. We briefly summarize key results of the letter: 1. A pulse-coding strategy for processing temporal information in recurrent spiking neuron networks (RSNN) was introduced. We have extended, in the sense of backpropagation through time (Werbos, 1989), the gradient-based Spike-Prop algorithm (Bohte et al., 2002) for training feedforward spiking neuron networks (FFSNN). The algorithm SpikePropThroughTime is able to account for temporal dependencies in the input stream when training RSNN. 2. We have shown that temporal structures with unbounded input memory specified by simple Moore machines can be induced by RSNN. The networks were able to discover pulse-coded representations of abstract information processing states coding potentially unbounded histories of processed inputs. However, the nature of pulse coding, in
Learning Beyond Finite Memory with Spiking Neurons
611
the context of the training strategies tried here, appears not to allow induction beyond simple Moore machines. 3. In analogy with traditional rate-based RNN trained on finite state machines, it is often possible to extract from RSNN the target machines by grouping together similar spike trains appearing in the recurrent layer. Furthermore, extraction of finite state machines from RSNN that managed to induce the target Moore machine can only partially reveal weaknesses of the induced mechanism and the extent to which the target machine has been learned. Although, theoretically, RSNN operating on pulse coding can process any (computationally feasible) temporal structure of unbounded input memory, the induction of such structures through a guided search in the weight space is another matter. Weight changes in RSNN lead to complex bifurcation mechanisms, enormously complicating the training process.
References Abeles, M., Bergman, H., Gat I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi stationary states. Proc. Natl. Acad. Sci. USA, 92, 8616–8620. Bengio, Y., Frasconi, P., & Simard, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. Bohte, S. M. (2003). Spiking neural networks. Unpublished doctoral dissertation, Centre for Mathematics and Computer Science, Amsterdam. Bohte, S., Kok, J., & La Poutr´e, H. (2002). Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1–4), 17–37. Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178. Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381. DeWeese, M. R., & Zador, A. M. (2003). Binary coding in auditory cortex. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 101–108). Cambridge, MA: MIT Press. Floreano, D., & Mattiussi, C. (2001). Evolution of spiking neural controllers for autonomous vision-based robots. In T. Gomi (Ed.), Evolutionary robotics IV (pp. 38–61). Berlin: Springer–Verlag. Floreano, D., Zufferey, J., & Nicoud, J. (2005). From wheels to wings with evolutionary spiking neurons. Artificial Life, 11(1–2), 121–138. Forcada, M. L. & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.
612
ˇ and A. Mills P. Tino
Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1996). Insertion of finite state automata in recurrent radial basis function networks. Machine Learning, 23, 5–32. Gerstner, W. (1995). Time structure of activity in neural network models. Phys. Rev. E, 51, 738–758. Gerstner, W. (1999). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed coupled neural networks (pp. 3–54). Cambridge, MA: MIT Press. Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second–order recurrent neural networks. Neural Computation, 4(3), 393–405. Giles, C. L., & Omlin, C. W. (1993). Insertion and refinement of production rules in recurrent neural networks. Connection Science, 5(3), 307–377. Hopcroft, J., & Ullman, J. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison–Wesley. Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation, 17(6), 1223–1263. ¨ ¨ Knusel, P., Wyss, R., Konig, P., & Verschure, P. F. M. J. (2004). Decoding a temporal population code. Neural Computation, 16(10), 2079–2100. Lawrence, S., Giles, C. L., & Fong, S. (2000). Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12(1), 126–140. Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural Computation, 8(1), 1–40. Maass, W., & Bishop, C. (Eds.). (2001). Pulsed neural networks. Cambridge, MA: MIT Press. Maass, W., & Markram, H. (2002). Synapses as dynamic memory buffers. Neural Networks, 15(2), 155–161. Maass, W., Natschl¨ager, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560. Martignon, L., Deco, G., Laskey, K. B., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Computation, 12(11), 2621–2653. Moore, S. (2002). Back propagation in spiking neural networks. Unpublised master’s thesis, University of Bath. Mozer, M. C. (1994). Neural net architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Predicting the future and understanding the past (pp. 243–264). Reading, MA: Addison-Wesley. Nadasdy, Z., Hirase, H., Czurk, A., Csicsvari, J., & Buzski, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience, 19(21), 9497–9507. Natschl¨ager, T., & Maass, W. (2002). Spiking neurons and the induction of finite state machines. Theoretical Computer Science: Special Issue on Natural Computing, 287(1), 251–265. Natschl¨ager, T., & Ruf, B. (1998). Spatial and temporal pattern analysis via spiking neurons. Network: Computation in Neural Systems, 9(3), 319–332. Omlin, C., & Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41–51.
Learning Beyond Finite Memory with Spiking Neurons
613
Puskorius, G., & Feldkamp, L. (1994). Neural control of nonlinear dynamical systems with kalman filter trained recurrent networks. IEEE Trans. on Neural Networks, 5(2), 279–297. Rowe, J., & Hidovic, D. (2004). An evolution strategy using a continuous version of the gray-code neighbourhood distribution. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2004) (pp. 725–736). Berlin: SpringerVerlag. Schrauwen, B., & Van Campenhout, J. (2004). Extending SpikeProp. In Proceedings of the International Joint Conference on Neural Networks (pp. 471–476). Piscataway, NJ: IEEE. Siegelmann, H., & Sontag, E. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50(1), 132–150. ˇ P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1998). Finite state machines Tino, and recurrent neural networks—automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition (pp. 171–220). Orlando, FL: Academic Press. ˇ ˇ P., & Sajda, Tino, J. (1995). Learning and extracting initial mealy machines with a modular neural network model. Neural Computation, 7(4), 822–844. Werbos, P. (1989). Generalization of backpropagation with applications to a recurrent gas market model. Neural Networks, 1(4), 339–356. Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423–1447. Yao, X., & Liu, Y. (1997). Fast evolution strategies. Control and Cybernetics, 26(3), 467–496.
Received October 22, 2004; accepted July 12, 2005.
LETTER
Communicated by Daniel Amit
Effects of Fast Presynaptic Noise in Attractor Neural Networks J. M. Cortes
[email protected] J. J. Torres
[email protected] J. Marro
[email protected] P. L. Garrido
[email protected] Institute Carlos I for Theoretical and Computational Physics and Department of Electromagnetism and Physics of Matter, University of Granada, 18071 Granada, Spain
H. J. Kappen
[email protected] Department of Biophysics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
We study both analytically and numerically the effect of presynaptic noise on the transmission of information in attractor neural networks. The noise occurs on a very short timescale compared to that for the neuron dynamics and it produces short-time synaptic depression. This is inspired in recent neurobiological findings that show that synaptic strength may either increase or decrease on a short timescale depending on presynaptic activity. We thus describe a mechanism by which fast presynaptic noise enhances the neural network sensitivity to an external stimulus. The reason is that, in general, presynaptic noise induces nonequilibrium behavior and, consequently, the space of fixed points is qualitatively modified in such a way that the system can easily escape from the attractor. As a result, the model shows, in addition to pattern recognition, class identification and categorization, which may be relevant to the understanding of some of the brain complex tasks. 1 Introduction There is multiple converging evidence (Abbott & Regehr, 2004) that synapses determine the complex processing of information in the brain. An aspect of this statement is illustrated by attractor neural networks. These show that synapses can efficiently store patterns that are retrieved later with only partial information on them. In addition to this time effect, Neural Computation 18, 614–633 (2006)
C 2006 Massachusetts Institute of Technology
Effects of Fast Presynaptic Noise in Attractor Neural Networks
615
artificial neural networks should contain some synaptic noise. That is, actual synapses exhibit short-time fluctuations, which seem to compete with other mechanisms during the transmission of information—not to cause unreliability but ultimately to determine a variety of computations (Allen & Stevens, 1994; Zador, 1998). In spite of some recent efforts, a full understanding of how brain complex processes depend on such fast synaptic variations is lacking (see Abbott & Regehr, 2004, for instance). A specific matter under discussion concerns the influence of short-time noise on the fixed points and other details of the retrieval processes in attractor neural networks (Bibitchkov, Herrmann, & Geisel, 2002). The observation that actual synapses endure short-time depression or facilitation is likely to be relevant in this context. That is, one may understand some observations by assuming that periods of elevated presynaptic activity may cause either a decrease or an increase in neurotransmitter release and, consequently, the postsynaptic response will be either depressed or facilitated depending on presynaptic neural activity (Tsodyks, Pawelzik, & Markram, 1998; Thomson, Bannister, Mercer, & Morris, 2002; Abbott & Regehr, 2004). Motivated by neurobiological findings, we report in this article on the effects of presynaptic depressing noise on the functionality of a neural circuit. We study in detail a network in which the neural activity evolves at random in time regulated by a “temperature” parameter. In addition, the values assigned to the synaptic intensities by a learning (e.g., Hebb’s) rule are constantly perturbed with microscopic fast noise. A new parameter is involved by this perturbation that allows a continuum transition from depression to normal operation. As a main result, this letter illustrates that in general, the addition of fast synaptic noise induces a nonequilibrium condition. That is, our systems cannot asymptotically reach equilibrium but tend to nonequilibrium steady states whose features depend, even qualitatively, on dynamics (Marro & Dickman, 1999). This is interesting because in practice, thermodynamic equilibrium is rare in nature. Instead, the simplest conditions one observes are characterized by a steady flux of energy or information, for instance. This makes the model mathematically involved, for example, there is no general framework such as the powerful (equilibrium) Gibbs theory, which applies only to systems with a single Kelvin temperature and a unique Hamiltonian. However, our system still admits analytical treatment for some choices of its parameters, and in other cases, we discovered the more intricate model behavior by a series of computer simulations. We thus show that fast presynaptic depressing noise during external stimulation may induce the system to scape from the attractor: the stability of fixed-point solutions is dramatically modified. More specifically, we show that for certain versions of the system, the solution destabilizes in such a way that computational tasks such as class identification and categorization are favored. It is likely this is the first time such behavior is reported in an artificial neural network as a consequence of biologically motivated stochastic behavior of synapses.
616
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
Similar instabilities have been reported to occur in monkeys (Abeles et al., 1995) and other animals (Miller & Schreiner, 2000), and they are believed to be a main feature in odor encoding (Laurent et al., 2001), for instance. 2 Definition of Model Our interest is in a neural network in which a local stochastic dynamics is constantly influenced by (pre)synaptic noise. Consider a set of N binary neurons with configurations S ≡ {si = ±1; i = 1, . . . , N}.1 Any two neurons are connected by synapses of intensity:2 wij = wij x j ∀i, j.
(2.1)
Here, wij is fixed, namely, determined in a previous learning process, and x j is a stochastic variable. This generalizes the hypothesis in previous studies of attractor neural networks with noisy synapses (see, e.g., Sompolinsky, 1986; Garrido & Marro, 1991; Marro, Torres, & Garrido, 1999). Once W ≡{w ij } is given, the state of the system at time t is defined by setting S and X ≡ {xi }. These evolve with time—after the learning process that fixes W—via the familiar master equation: ∂ Pt (S, X) = −Pt (S, X) c[(S, X) → (S , X )] ∂t X S + c[( S , X ) → (S, X)]Pt (S , X ). X S
(2.2)
We further assume that the transition rate or probability per unit time of evolving from (S, X) to (S , X ) is c[(S, X) → (S , X )] = p c X [S → S ]δ(X − X ) + (1 − p) c S [X → X ]δS,S . (2.3) This choice (Garrido & Marro, 1994; Torres, Garrido, & Marro, 1997) amounts to considering competing mechanisms. That is, neurons (S) evolve stochastically in time under a noisy dynamics of synapses (X), the latter
1 Note that such binary neurons, although a crude simplification of nature, are known to capture the essentials of cooperative phenomena, which is the focus here. See, for instance, Abbott and Kepler (1990) and Pantic, Torres, Kappen, and Gielen (2002). 2 For simplicity, we are neglecting here postsynaptic dependence of the stochastic perturbation. There is some claim that plasticity might operate on rapid timescales on postsynaptic activity (see Pitler & Alger, 1992). However, including xij in equation 2.1 instead of x j would impede some of the algebra in sections 3 and 4.
Effects of Fast Presynaptic Noise in Attractor Neural Networks
617
evolving (1 − p)/ p times faster than the former. Depending on the value of p, three main classes may be defined (Marro & Dickman, 1999): 1. For p ∈ (0, 1), both the synaptic fluctuation and the neuron activity occur on the same temporal scale. This case has already been preliminarily explored (Pantic et al., 2002; Cortes, Garrido, Marro, & Torres, 2004). 2. The limiting case p → 1. This corresponds to neurons evolving in the presence of a quenched synaptic configuration, that is, xi is constant and independent of i. The Hopfield model (Amari, 1972; Hopfield, 1982) belongs to this class in the simple case that xi = 1, ∀i. 3. The limiting case p → 0. The rest of this article is devoted to this class of systems. Our interest for the latter case is a consequence of the following facts. First, there is adiabatic elimination of fast variables for p → 0, which decouples the two dynamics (Garrido & Marro, 1994; Gardiner, 2004). Therefore, an exact analytical treatment—though not the complete solution—is then feasible. To be more specific, for p → 0, the neurons evolve as in the presence of a steady distribution for X. If we write P(S, X) = P(X|S) P(S), where P(X|S) stands for the conditional probability of X given S, one obtains from equations 2.2 and 2.3, after rescaling time tp → t (technical details are worked out in Marro & Dickman, 1999, for instance) that ∂ Pt (S) = −Pt (S) c¯ [S → S ] + c¯ [S → S]Pt (S ). ∂t S S
(2.4)
Here, c¯ [S → S ] ≡
dX P st (X|S) c X [S → S ],
(2.5)
and P st (X|S) is the stationary solution that satisfies P st (X|S) =
d X c S [X → X] P st (X |S) . dX c S [X → X ]
(2.6)
This formalism allows modeling fast synaptic noise, which, within the appropriate context, will induce a sort of synaptic depression, as explained in detail in section 4. The superposition, equation 2.5, reflects the fact that activity is the result of competition among different elementary mechanisms. That is, different underlying dynamics, each associated with a different realization of the stochasticity X, compete and, in the limit p → 0, an effective rate results
618
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
from combining c X [S → S ] with probability P st (X|S) for varying X. Each of the elementary dynamics tends to drive the system to a well-defined equilibrium state. The competition will, however, impede equilibrium, and in general, the system will asymptotically go toward a nonequilibrium steady state (Marro & Dickman, 1999). The question is whether such a competition between synaptic noise and neural activity, which induces nonequilibrium, is at the origin of some of the computational strategies in neurobiological systems. Our study seems to indicate that this is a sensible issue. In matter of fact, we shall argue below that p → 0 may be realistic a priori for appropriate choices of P st (X|S). For simplicity, we shall be concerned in this article with sequential updating by means of single neuron, or “spin-flip,” dynamics. That is, the elementary dynamic step will simply consist of local inversions si → −si induced by a bath at temperature T. The elementary rate c X [S → S ] then reduces to a single site rate that onemay write as [u X (S, i)]. Here, uX (S, i) ≡ 2T −1 si h iX (S), where h iX (S) = j=i w ij x j s j is the net (pre)synaptic current arriving at (or local field acting on) the (postsynaptic) neuron i. The function (u) is arbitrary except that for simplicity, we shall assume (u) = exp(−u)(−u), (0) = 1 and (∞) = 0 (Marro & Dickman, 1999). We report on the consequences of more complex dynamics in Cortes, et al. (2005). 3 Effective Local Fields Let us define a function H eff (S) through the condition of detailed balance, namely, c¯ [S → Si ] eff i eff −1 . (S ) − H (S) T = exp − H c¯ [Si → S]
(3.1)
Here, Si stands for S after flipping at i, si → −si . We further define the effective local fields h ieff (S) by means of H eff (S) = −
1 eff h (S) si . 2 i i
(3.2)
Nothing guarantees that H eff (S) and h ieff (S) have a simple expression and are therefore analytically useful. This is because the superposition 2.5, unlike its elements (u X ), does not satisfy detailed balance in general. In other words, our system has an essential nonequilibrium character that prevents one from using Gibbs’s statistical mechanics, which requires a unique Hamiltonian. Instead, here there is one energy associated with each realization of X ={xi }. This is in addition to the fact that the synaptic weights wij in equation 2.1 may not be symmetric.
Effects of Fast Presynaptic Noise in Attractor Neural Networks
619
For some choices of both the rate and the noise distribution P st (X|S), the function H eff (S) may be considered a true effective Hamiltonian (Garrido & Marro, 1989; Marro & Dickman, 1999). This means that H eff (S) then generates the same nonequilibrium steady state as the stochastic time evolution equation, which defines the system (see equation 2.4), and that its coefficients have the proper symmetry of interactions. To be more explicit, assume that P st (X|S) factorizes according to Pst (X|S) =
P x j |s j
(3.3)
j
and that one also has the factorization c¯ [S → S ] = i
dx j P(x j |s j ) (2T −1 si w ij x j s j ).
(3.4)
j=i
The former amounts to neglecting some global dependence of the factors on S = {si } (see below), and the latter restricts the possible choices for the rate function. Some familiar choices for this function that satisfy detailed balance are the one corresponding to the Metropolis algorithm, that is, (u) = min[1, exp(−u)]; the Glauber case (u) = [1 + exp(u)]−1 ; and (u) = exp(−u/2) (Marro & Dickman, 1999). The last fulfills (u + v) = (u)(v), which is required by equation 3.4.3 It then ensues after some algebra that h ieff = −T
j=i
αij+ s j + αij− ,
(3.5)
with αij± ≡
c¯ (βij ; +) c¯ (±βij ; −) 1 ln , 4 c¯ (−βij ; ∓) c¯ (∓βij ; ±)
(3.6)
where βij ≡ 2T −1 wij , and c¯ (βij ; s j ) =
d x j P(x j |s j ) (βij x j ).
(3.7)
3 In any case, the rate needs to be properly normalized. In computer simulations, it is customary to divide (u) by its maximum value. Therefore, the normalization happens to depend on temperature and the number of stored patterns. It follows that this normalization is irrelevant for the properties of the steady state; it just rescales the timescale.
620
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
This generalizes a case in the literature for random S-independent fluctuations (Garrido & Munoz, 1993; Lacomba & Marro, 1994; Marro & Dickman, 1999). In this case, one has c¯ (±κ; +) = c¯ (±κ; −) and, consequently, αij− = 0 ∀i, j. However, we are concerned here with the case of S-dependent disorder, which results in a nonzero threshold, θi ≡ j=i αij− = 0. In order to obtain a true effective Hamiltonian, the coefficients αij± in equation 3.5 need to be symmetric. Once (u) is fixed, this depends on the choice for P(x j |s j ), that is, on the fast noise details. This is studied in the next section. Meanwhile, we remark that the effective local fields h ieff defined above are very useful in practice. That is, they may be computed, at least numerically, for any rate and noise distribution. As far as (u + v) = (u)(v) and Pst (X|S) factorizes,4 it follows an effective transition rate as
c¯ [S → Si ] = exp − si h ieff /T .
(3.8)
This effective rate may then be used in computer simulation, and it may also serve to be substituted in the relevant equations. Consider, for instance, the overlaps defined as the product of the current state with one of the stored patterns: mν (S) ≡
1 ν si ξi . N i
(3.9)
Here, ξ ν = {ξiν = ±1, i = 1, . . . , N} are M random patterns previously stored in the system, ν = 1, . . . , M. After using standard techniques (Hertz, Krogh, & Palmer, 1991; Marro & Dickman, 1999; see also Amit, Gutfreund, & Sompolinsky, 1987), it follows from equation 2.4 that ∂t mν = 2N−1
ξiν sinh h ieff /T − si cosh h ieff /T ,
(3.10)
i
which is to be averaged over both thermal noise and pattern realizations. Alternatively, one might perhaps obtain dynamic equations of type 3.10 by using Fokker-Planck-like formalisms as, for instance, in Brunel and Hakim (1999). 4 Types of Synaptic Noise The above discussion and, in particular, equations 3.5 and 3.6, suggest that the system emergent properties will importantly depend on the details The factorization here does not need to be in products P(x j |s j ) as in equation 3.3. The same result (see equation 3.8) holds for the choice that we introduce in the next section, for instance. 4
Effects of Fast Presynaptic Noise in Attractor Neural Networks
621
of the synaptic noise X. We now work out the equations in section 3 for different hypotheses concerning the stationary distribution, equation 2.6. Consider first equation 3.3 with the following specific choice:
P(x j |s j ) =
1 + s j Fj 1 − s j Fj δ(x j + ) + δ(x j − 1). 2 2
(4.1)
This corresponds to a simplification of the stochastic variable x j . That is, for Fj = 1 ∀ j, the noise modifies w ij by a factor − when the presynaptic neuron is firing, s j = 1, while the learned synaptic intensity remains unchanged when the neuron is silent. In general, wij = −w ij with probability 12 (1 + s j Fj ). Here, Fj stands for some information concerning the presynaptic site j such as, for instance, a local threshold or Fj = M−1 ν ξ νj . Our interest for case 4.1 is twofold: it corresponds to an exceptionally simple situation and reduces our model to two known cases. This becomes evident by looking at the resulting local fields:
h ieff =
1 [(1 − ) s j − (1 + )Fj ]w ij . 2 j=i
(4.2)
That is, exceptionally, symmetries here are such that the system is described by a true effective Hamiltonian. Furthermore, this corresponds to the Hopfield model, except for a rescaling of temperature and the emergence of a threshold θi ≡ j w ij Fj (Hertz et al., 1991). It also follows that, concerning stationary properties, the resulting effective Hamiltonian, equation 3.2, reproduces the model as in Bibitchkov et al. (2002). In fact,∞this would correspond in our notation to h ieff = 12 j=i wij s j x ∞ j , where x j stands for the stationary solution of certain dynamic equation for x j . The conclusion is that (except perhaps concerning dynamics, which is something worth investigating) the fast noise according to equation 3.3 with equation 4.1 does not imply any surprising behavior. In any case, this choice of noise illustrates the utility of the effective field concept as defined above. Our interest here is in modeling the noise consistent with the observation of short-time synaptic depression (Tsodyks et al., 1998; Pantic et al., 2002). In fact, equation 4.1 in some way mimics that increasing the mean firing rate results in decreasing the synaptic weight. With the same motivation, a more intriguing behavior ensues by assuming, instead of equation 3.3, the factorization P st (X|S) =
j
P(x j |S)
(4.3)
622
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
with
δ(x j + ) + [1 − ζ (m)]
δ(x j − 1). P(x j |S) = ζ (m)
(4.4)
= m(S)
Here, m ≡ (m1 (S), . . . , m M (S)) is the M-dimensional overlap vector,
stands for a function of m
to be determined. The depression and ζ (m) effect here depends on the overlap vector, which measures the net current arriving to postsynaptic neurons. The nonlocal choice, equations 4.3 and 4.4, thus introduces nontrivial correlations between synaptic noise and neural activity, which is not considered in equation 4.1. Note that therefore we are not modeling here the synaptic depression dynamics in an explicit way as, for instance, in Tsodyks et al. (1998). Instead, equation 4.4 amounts to considering fast synaptic noise, which naturally depresses the strength of
the synapses after repeated activity, namely, for a high value of ζ (m). Several further comments on the significance of equations 4.3 and 4.4, which is here a main hypothesis together with p → 0, are in order. We first mention that the system time relaxation is typically orders of magnitude larger than the timescale for the various synaptic fluctuations reported to account for the observed high variability in the postsynaptic response of central neurons (Zador, 1998). On the other hand, these fluctuations seem to have different sources, such as, for instance, the stochasticity of the opening and closing of the vesicles (S. Hilfiker, private communication, April 2005), the stochasticity of the postsynaptic receptor, which has its own causes, variations of the glutamate concentration in the synaptic cleft, and differences in the potency released from different locations on the active zone of the synapses (Franks, Stevens, & Sejnowski, 2003). This is the complex situation that we try to capture by introducing the stochastic variable x in equation 2.1 and subsequent equations. It may be further noticed that the nature of this variable, which is microscopic here, differs from the one in the case of familiar phenomenological models. These often involve a mesoscopic variable, such as the mean fraction of neurotransmitter, which results in a deterministic situation, as in Tsodyks et al. (1998). The depression in our model rather naturally follows from the coupling between the synaptic noise and the neurons’ dynamics via the overlap functions. The final result is also deterministic for p → 0 but only, as one should perhaps expect, on the timescale for the neurons. Finally, concerning also the reality of the model, it should be clear that we are restricting ourselves here to fully connected networks for simplicity. However, we have studied similar systems with more realistic topologies such as scale-free, small-world, and diluted networks (Torres, Munoz, Marro, & Garrido, 2004), which suggests one can generalize the present study in this sense. Our case (see equations 4.3 and 4.4) also reduces to the Hopfield model
Otherwise, the competition but only in the limit → −1 for any ζ (m). results in rather complex behavior. In particular, the noise distribution
Effects of Fast Presynaptic Noise in Attractor Neural Networks
623
P st (X|S) lacks with equation 4.4 the factorization property, which is required to have an effective Hamiltonian with proper symmetry. Nevertheless, we may still write d x j P(x j |S) (si x j s j βij ) c¯ [S → Si ] = . c¯ [Si → S] d x j P(x j |Si ) (−si x j s j βij ) j=i
(4.5)
Then, using equation 4.4, we linearize around w ij = 0, that is, βij = 0 for T > 0. This is a good approximation for the Hebbian learning rule (Hebb, 1949), wij = N−1 ν ξiν ξ νj , which is the one we use hereafter, as far as this rule stores only completely uncorrelated, random patterns. In√fact, fluctuations √ in this case are of order M/N for finite M—of order 1/ N for finite α— which tends to vanish for a sufficiently large system, for example, in the macroscopic (thermodynamic) limit N → ∞. It then follows the effective weights,
1+
+ ζ (m
i )] wij , wijeff = 1 − [ζ (m) 2
(4.6)
= m(S),
i ≡ m(S
i) = m
− 2si ξ i /N, and ξ i = (ξi1 , ξi2 , ..., ξiM ) is the where m m binary M–dimensional stored pattern. This shows how the noise modifies synaptic intensities. The associated effective local fields are h ieff =
wijeff s j .
(4.7)
j=i
The condition to obtain a true effective Hamiltonian, that is, proper symme i =m
− 2si ξ i /N m.
This is a good try of equation 4.6 from this, is that m approximation in the thermodynamic limit, N → ∞. Otherwise, one may proceed with the dynamic equation 3.10 after substituting equation 4.7, even though this is not then a true effective Hamiltonian. One may follow the same procedure for the Hopfield case with asymmetric synapses (Hertz et al., 1991), for instance. Further interest in the concept of local effective fields as defined in section 3 follows from the fact that one may use quantities such as equation 4.7 to importantly simplify a computer simulation, as we do below. To proceed further, we need to determine the probability ζ in equation 4.4. In order to model activity-dependent mechanisms acting on the synapses, ζ should be an increasing function of the net presynaptic current
624
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
simply needs to depend on the overlaps, besides or field. In fact, ζ (m) preserving the ±1 symmetry. A simple choice with these requirements is
= ζ (m)
1 ν [m (S)]2 , 1+α ν
(4.8)
where α = M/N. We describe next the behavior that ensues from equations 4.6 to 4.8 as implied by the noise distribution, equation 4.4. 5 Noise-Induced Phase Transitions Let us first study the retrieval process in a system with a single stored pattern, M = 1, when the neurons are acted on by the local fields, equation 4.7. One obtains from equations 3.8 to 3.10, after using the simplifying (mean field) assumption si ≈ si , that the steady solution corresponds to the overlap, m = tanh{T −1 m[1 − (m)2 (1 + )]},
(5.1)
m ≡ mν=1 , which preserves the symmetry ±1. Local stability of the solutions of this equation requires that 1 |m| > mc (T) = √ 3
Tc − T
− c
12
.
(5.2)
The behavior of equation 5.1 is illustrated in Figure 1 for several values of
. This indicates a transition from a ferromagnetic-like phase (i.e., solutions m = 0 with associative memory) to a paramagnetic-like phase, m = 0. The transition is continuous or second order only for > c = −4/3, and it then follows a critical temperature Tc = 1. Figure 2 shows the tricritical point at (Tc , c ) and the general dependence of the transition temperature with . A discontinuous phase transition allows much better performance of the retrieval process than a continuous one. This is because the behavior is sharp just below the transition temperature in the former case. Consequently, the above indicates that our model performs better for large negative ,
< −4/3. We also performed Monte Carlo simulations. These concern a network of N = 1600 neurons acted on by the local fields, equation 4.7, and evolving by sequential updating via the effective rate, equation 3.8. Except for some finite-size effects, Figure 1 shows good agreement between our simulations and the equations here; in fact, the computer simulations also correspond to a mean field description given that the fields 4.7 assume fully connected neurons.
Effects of Fast Presynaptic Noise in Attractor Neural Networks
625
m
1
0.5
0 0.4
0.8
1.2
T Figure 1: The steady overlap m(T), as predicted by equation 5.1, for different values of the noise parameter, namely, = −2.0, −1.5, −1.0, −0.5, 0, 0.5, 1.0, 1.5, 2.0, from top to bottom, respectively. ( = −1 corresponds to the Hopfield case, as explained in the text.) The graphs depict second-order phase transitions (solid curves) and, for the most negative values of , first-order phase transitions (the discontinuities in these cases are indicated by dashed lines). The symbols stand for Monte Carlo data corresponding to a network with N = 1600 neurons for = −0.5 (filled squares) and −2.0 (filled circles).
6 Sensitivity to the Stimulus As shown above, a noise distribution such as equation 4.4 may model activity-dependent processes reminiscent of short-time synaptic depression. In this section, we study the consequences of this type of fast noise on the retrieval dynamics under external stimulation. More specifically, our aim is to check the resulting sensitivity of the network to external inputs. A high degree of sensibility will facilitate the response to changing stimuli. This is an important feature of neurobiological systems, which continuously adapt and quickly respond to varying stimuli from the environment. Consider first the case of one stored pattern, M = 1. A simple external input may be simulated by adding to each local field a driving term −δξi , ∀i, with 0 < δ 1 (Bibitchkov et al., 2002). A negative drive in this case of
626
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
-1
Φ
-1.5
P
-2
F
-2.5 -3 0.8
1.2
1.6
2
T Figure 2: Phase diagram depicting the transition temperature Tc as a function of T and . The solid (dashed) curve corresponds to a second- (first-) order phase transition. The tricritical point is at (Tc , c ) = (1, −4/3). F and P stand for the ferromagnetic-like and paramagnetic-like phases, respectively. The best retrieval properties of our model system occur close to the lower-left quarter of the graph.
a single pattern ensures that the network activity may go from the attractor, ξ , to the “antipattern,” −ξ . It then follows the stationary overlap, m = tanh[T −1 F (m, , δ)],
(6.1)
F (m, , δ) ≡ m[1 − (m)2 (1 + ) − δ].
(6.2)
with
Figure 3 shows this function for δ = 0 and varying . This illustrates two types of behavior: (local) stability (F > 0) and instability (F < 0) of the attractor, which corresponds to m = 1. That is, the noise induces instability, resulting in this case in switching between the pattern and the antipattern. This is confirmed in Figure 4 by Monte Carlo simulations. The simulations correspond to a network of N = 3600 neurons with one stored pattern, M = 1. This evolves from different initial states, corresponding to different distances to the attractor, under an external stimulus −δξ 1
Effects of Fast Presynaptic Noise in Attractor Neural Networks
627
3
F(m,Φ,δ)
2 1 0 -1 -2 -3 0
0.5
1
m Figure 3: The function F as defined in equation 6.2 for δ = 0 and, from top to bottom, = −2, −1, 0, 1, and 2. The solution of equation 6.1 becomes unstable so that the activity will escape the attractor (m = 1) for F < 0, which occurs for
> 0 in this case.
for different values of δ. The two left graphs in Figure 4 show several independent time evolutions for the model with fast noise, namely, for
= 1; the two graphs to the right are for the Hopfield case lacking the noise ( = −1). These, and similar graphs one may obtain for other parameter values, clearly demonstrate how the network sensitivity to a simple external stimulus is qualitatively enhanced by adding presynaptic noise to the system. Figures 5 and 6 illustrate similar behavior in Monte Carlo simulations with several stored patterns. Figure 5 is for M = 3 correlated patterns with mutual overlaps |mν,µ | ≡ |1/N i ξiν ξiµ | = 1/3 and | ξiν | = 1/3. More specifically, each pattern consists of three equal initially white (silent neurons) horizontal stripes, with one of them colored black (firing neurons) located in a different position for each pattern. The system in this case begins with the first pattern as initial condition, and, to avoid dependence on this choice, it is let to relax for 3 × 104 Monte Carlo steps (MCS). It is then perturbed by a drive −δξ ν , where the stimulus ν changes (ν = 1, 2, 3, 1, . . .) every 6 × 103 MCS. The top graph shows the network response in the Hopfield case. There is no visible structure of this signal in the absence of fast noise as far as δ 1. In fact, the depth of the basins of attraction is large enough in the Hopfield model to prevent any move for small δ, except
Figure 4: Time evolution of the overlap, as defined in equation 3.9, between the current state and the stored pattern in Monte Carlo simulations with 3600 neurons at T = 0.1. Each graph, for a given set of values for (δ, Φ), shows different curves corresponding to evolutions starting with different initial states. The two top graphs are for δ = 0.3, with Φ = 1 (graphs A and C) and Φ = −1 (graphs B and D), the latter corresponding to the Hopfield case lacking the fast noise. This shows the important effect noise has on the network sensitivity to external stimuli. The two bottom graphs illustrate the same for a fixed initial distance from the attractor as one varies the external stimulation, namely, for δ = 0.1, 0.2, 0.3, 0.4, and 0.5 from top to bottom.
when approaching a critical point (T_c = 1), where fluctuations diverge. The bottom graph depicts a qualitatively different situation for Φ = 1. That is, adding fast noise in general destabilizes the fixed point for the interesting case of small δ far from criticality. Figure 6 confirms the above for uncorrelated patterns, that is, m^{ν,μ} ≈ δ^{ν,μ} and (1/N) Σ_i ξ_i^ν ≈ 0. This shows the response of the network in a similar simulation with 400 neurons at T = 0.1 for M = 3 random, orthogonal patterns. The initial condition is again ν = 1, and the stimulus is here +δξ^ν with ν changing every 1.5 × 10^5 MCS. Thus, we conclude that the switching phenomenon is robust with respect to the type of pattern stored.

7 Conclusion

The set of equations 2.4 to 2.6 provides a general framework to model activity-dependent processes. Motivated by the behavior of neurobiological
Figure 5: Time evolution during a Monte Carlo simulation with N = 400 neurons, M = 3 correlated patterns (as defined in the text), and T = 0.1. The system in this case was allowed to relax to the steady state and then perturbed by the stimulus −δξ^ν, δ = 0.3, with ν = 1 for a short time interval, then with ν = 2, and so on. After suppressing the stimulus, the system is again allowed to relax. The graphs show, as a function of time, from top to bottom: the number of the pattern used as the stimulus at each time interval; the resulting response of the network, measured as the overlap of the current state with pattern ν = 1, in the absence of noise, that is, the Hopfield case Φ = −1; and the same for the relevant noisy case Φ = 1.
systems, we adapted this to study the consequences of fast noise acting on the synapses of an attractor neural network with a finite number of stored patterns. We presented in this letter two different scenarios corresponding to noise distributions fulfilling equations 3.3 and 4.3, respectively. In particular, assuming a local dependence on activity as in equation 4.1, one obtains the local fields, equation 4.2, while a global dependence as in equation 4.4 leads to equation 4.7. Under certain assumptions, the system in the first of these cases is described by the effective Hamiltonian, equation 3.2. This reduces to a Hopfield system—the familiar attractor neural network without any synaptic noise—with rescaled temperature and a threshold. This was already studied for a gaussian distribution of thresholds (Hertz et al., 1991; Horn & Usher, 1989; Litinskii, 2002). Concerning stationary properties, this case is also similar to the one in Bibitchkov et al. (2002). A more intriguing
Figure 6: The same as in Figure 5 but for three stored patterns that are orthogonal (instead of correlated). The stimulus is +δξ^ν, δ = 0.1, with ν = ν(t), as indicated at the top. The time evolution of the overlap m^ν is drawn in a different color (black, dark gray, and light gray, respectively) for each value of ν to illustrate that the system keeps jumping between the patterns in this case.
behavior ensues when the noise depends on the total presynaptic current arriving at the postsynaptic neuron. We studied this case both analytically, by using a mean field hypothesis, and numerically, by a series of Monte Carlo simulations using single-neuron dynamics. The two approaches are fully consistent with and complement each other. Our model involves two main parameters. One is the temperature T, which controls the stochastic evolution of the network activity. The other parameter, Φ, controls the intensity of the depressing noise. As Φ is varied, the system describes the route from normal operation to depression phenomena. A main result is that the presynaptic noise induces the occurrence of a tricritical point for certain values of these parameters, (T_c, Φ_c) = (1, −4/3). This separates (in the limit α → 0) first- from second-order phase transitions between a retrieval phase and a nonretrieval phase. The principal conclusion in this letter is that fast presynaptic noise may induce a nonequilibrium condition that results in an important intensification of the network sensitivity to external stimulation. We explicitly show that the noise may render the attractor, or fixed-point, solution of the retrieval process unstable, so that the system then seeks another attractor. In particular, one observes switching from the stored pattern to the corresponding antipattern for M = 1 and switching between patterns for a larger number of
stored patterns, M. This behavior is most interesting because it improves the network's ability to detect changing stimuli from the environment. We observe the switching to be very sensitive to the forcing stimulus but rather independent of the network's initial state or the thermal noise. It seems sensible to argue that besides recognition, the processes of class identification and categorization in nature might follow a similar strategy. That is, different attractors may correspond to different objects, and a dynamics conveniently perturbed by fast noise may keep visiting the attractors belonging to a class that is characterized by a certain degree of correlation among its elements (Cortes et al., 2005). In fact, a similar mechanism seems to be at the basis of early olfactory processing of insects (Laurent et al., 2001), and instabilities of the same sort have been described in the cortical activity of monkeys (Abeles et al., 1995) and other cases (Miller & Schreiner, 2000). Finally, we mention that the above complex behavior seems confirmed by preliminary Monte Carlo simulations for a macroscopic number of stored patterns, that is, a finite loading parameter α = M/N ≠ 0. On the other hand, a mean field approximation (see below) shows that the storage capacity of the network is α_c = 0.138, as in the Hopfield case (Amit et al., 1987), for any Φ < 0, while it is always smaller for Φ > 0. This is in agreement with previous results concerning the effect of synaptic depression in Hopfield-like systems (Torres, Pantic, & Kappen, 2002; Bibitchkov et al., 2002). The fact that a positive value of Φ tends to make the basins of attraction shallower, thus destabilizing the attractors, may be understood by a simple (mean field) argument, which is confirmed by Monte Carlo simulations (Cortes et al., 2005). Assume that the stationary activity shows just one overlap of order unity. This corresponds to the condensed pattern; the overlaps with the remaining M − 1 stored patterns are of order 1/√N (noncondensed patterns) (Hertz et al., 1991). The resulting probability of change of the synaptic intensity, namely, [1/(1 + α)] Σ_{ν=1}^{M} (m^ν)², is of order unity, and the local fields, equation 4.7, follow as h_i^{eff} ∼ −h_i^{Hopfield}. Therefore, the storage capacity, which is computed at T = 0, is the same as in the Hopfield case for any Φ < 0, and always lower otherwise.

Acknowledgments

We acknowledge financial support from MCyT–FEDER (project No. BFM2001-2841) and a Ramón y Cajal contract.
References

Abbott, L. F., & Kepler, T. B. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. Lecture Notes in Physics, 368, 5–18.
Abbott, L. F., & Regehr, W. G. (2004). Synaptic computation. Nature, 431, 796–803.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidelman, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci. USA, 92, 8616–8620.
Allen, C., & Stevens, C. F. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. USA, 91, 10380–10383.
Amari, S. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Syst. Man. Cybern., 2, 643–657.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67.
Bibitchkov, D., Herrmann, J. M., & Geisel, T. (2002). Pattern storage and processing in attractor networks with short-time synaptic dynamics. Network: Comput. Neural Syst., 13, 115–129.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11, 1621–1671.
Cortes, J. M., Garrido, P. L., Kappen, H. J., Marro, J., Morillas, C., Navidad, D., & Torres, J. J. (2005). Algorithms for identification and categorization. In AIP Conf. Proc. 779 (pp. 178–184).
Cortes, J. M., Garrido, P. L., Marro, J., & Torres, J. J. (2004). Switching between memories in neural automata with synaptic noise. Neurocomputing, 58–60, 67–71.
Franks, K. M., Stevens, C. F., & Sejnowski, T. J. (2003). Independent sources of quantal variability at single glutamatergic synapses. J. Neurosci., 23(8), 3186–3195.
Gardiner, C. W. (2004). Handbook of stochastic methods: For physics, chemistry and the natural sciences. Berlin: Springer-Verlag.
Garrido, P. L., & Marro, J. (1989). Effective Hamiltonian description of nonequilibrium spin systems. Phys. Rev. Lett., 62, 1929–1932.
Garrido, P. L., & Marro, J. (1991). Nonequilibrium neural networks. Lecture Notes in Computer Science, 540, 25–32.
Garrido, P. L., & Marro, J. (1994). Kinetic lattice models of disorder. J. Stat. Phys., 74, 663–686.
Garrido, P. L., & Munoz, M. A. (1993). Nonequilibrium lattice models: A case with effective Hamiltonian in d dimensions. Phys. Rev. E, 48, R4153–R4155.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Horn, D., & Usher, M. (1989). Neural networks with dynamical thresholds. Phys. Rev. A, 40, 1036–1044.
Lacomba, A. I. L., & Marro, J. (1994). Ising systems with conflicting dynamics: Exact results for random interactions and fields. Europhys. Lett., 25, 169–174.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active, dynamical process: Experiments, computation and theory. Annu. Rev. Neurosci., 24, 263–297.
Litinskii, L. B. (2002). Hopfield model with a dynamic threshold. Theoretical and Mathematical Physics, 130, 136–151.
Marro, J., & Dickman, R. (1999). Nonequilibrium phase transitions in lattice models. Cambridge: Cambridge University Press.
Marro, J., Torres, J. J., & Garrido, P. L. (1999). Neural network in which synaptic patterns fluctuate with time. J. Stat. Phys., 94(1–6), 837–858.
Miller, L. M., & Schreiner, C. E. (2000). Stimulus-based state control in the thalamocortical system. J. Neurosci., 20, 7011–7016.
Pantic, L., Torres, J. J., Kappen, H. J., & Gielen, S. C. A. M. (2002). Associative memory with dynamic synapses. Neural Comp., 14, 2903–2923.
Pitler, T., & Alger, B. E. (1992). Postsynaptic spike firing reduces synaptic GABA(A) responses in hippocampal pyramidal cells. J. Neurosci., 12, 4122–4132.
Sompolinsky, H. (1986). Neural networks with nonlinear synapses and a static noise. Phys. Rev. A, 34, 2571–2574.
Thomson, A. M., Bannister, A. P., Mercer, A., & Morris, O. T. (2002). Target and temporal pattern selection at neocortical synapses. Philos. Trans. R. Soc. Lond. B Biol. Sci., 357, 1781–1791.
Torres, J. J., Garrido, P. L., & Marro, J. (1997). Neural networks with fast time-variation of synapses. J. Phys. A: Math. Gen., 30, 7801–7816.
Torres, J. J., Munoz, M. A., Marro, J., & Garrido, P. L. (2004). Influence of topology on the performance of a neural network. Neurocomputing, 58–60, 229–234.
Torres, J. J., Pantic, L., & Kappen, H. J. (2002). Storage capacity of attractor neural networks with depressing synapses. Phys. Rev. E, 66, 061910.
Tsodyks, M. V., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229.
Received February 9, 2005; accepted July 29, 2005.
LETTER
Communicated by Paul Bressloff
Response Variability in Balanced Cortical Networks

Alexander Lerchner∗
[email protected] Technical University of Denmark, 2800 Lyngby, Denmark
Cristina Ursta
[email protected] Niels Bohr Institut, 2100 Copenhagen Ø, Denmark
John Hertz
[email protected] Nordita, 2100 Copenhagen Ø, Denmark
Mandana Ahmadi
[email protected] Nordita, 2100 Copenhagen Ø, Denmark
Pauline Ruffiot
[email protected] Université Joseph Fourier, Grenoble, France
Søren Enemark
[email protected] Niels Bohr Institut, 2100 Copenhagen Ø, Denmark
We study the spike statistics of neurons in a network with dynamically balanced excitation and inhibition. Our model, intended to represent a generic cortical column, comprises randomly connected excitatory and inhibitory leaky integrate-and-fire neurons, driven by excitatory input from an external population. The high connectivity permits a mean field description in which synaptic currents can be treated as gaussian noise, the mean and autocorrelation function of which are calculated self-consistently from the firing statistics of single model neurons. Within this description, a wide range of Fano factors is possible. We find that the irregularity of spike trains is controlled mainly by the strength of the synapses relative to the difference between the firing threshold and the postfiring reset level of the membrane potential. For moderately strong synapses, we find spike statistics very similar to those observed in primary visual cortex.

∗Current address: Laboratory of Neuropsychology, NIMH, NIH, Bethesda, MD 20893, USA.

Neural Computation 18, 634–659 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

The observed irregularity and relatively low rates of the firing of neocortical neurons suggest strongly that excitatory and inhibitory input are nearly balanced. Such a balance, in turn, finds an attractive explanation in the approximate, heuristic mean field description of Amit and Brunel (1997a, 1997b) and Brunel (2000). In this treatment, the balance does not have to be put in "by hand"; rather, it emerges self-consistently from the network dynamics. This success encourages us to study firing correlations and irregularity in models like theirs in greater detail. In particular, we would like to quantify the irregularity and identify the parameters of the network that control it. This is important because one cannot extract the signal in neuronal spike trains correctly without a good characterization of the noise. Indeed, an incorrect noise model can lead to spurious conclusions about the nature of the signal, as demonstrated by Oram, Wiener, Lestienne, and Richmond (1999). Response variability has been studied for a long time in primary visual cortex (Heggelund & Albus, 1978; Dean, 1981; Tolhurst, Movshon, & Thompson, 1981; Tolhurst, Movshon, & Dean, 1983; Vogels, Spileers, & Orban, 1989; Snowden, Treue, & Andersen, 1992; Gur, Beylin, & Snodderly, 1997; Shadlen & Newsome, 1998; Gershon, Wiener, Latham, & Richmond, 1998; Kara, Reinagel, & Reid, 2000; Buracas, Zador, DeWeese, & Albright, 1998) and elsewhere (Lee, Port, Kruse, & Georgopoulos, 1998; Gershon et al., 1998; Kara et al., 2000; DeWeese, Wehr, & Zador, 2003). Most, though not all, of these studies found rather strong irregularity. As an example, we consider the findings of Gershon et al. (1998). In their experiments, monkeys were presented with flashed, stationary visual patterns for several hundred ms. Repeated presentations of a given stimulus evoked varying numbers of spikes in different trials, though the mean number (as well as the peristimulus time histogram) varied systematically from stimulus to stimulus. The statistical objects of interest to us here are the distributions of single-trial spike counts for given fixed stimuli. Often one compares the data with a Poisson model of the spike trains, for which the count distribution is P(n) = m^n e^{−m}/n!. This distribution has the property that its mean ⟨n⟩ = m is equal to its variance ⟨δn²⟩ = ⟨(n − ⟨n⟩)²⟩. However, the experimental finding was that the measured distributions were quite generally wider than this: ⟨δn²⟩ > m. Furthermore, when data were collected for many stimuli, the variance of the spike count was fit well by a power law function of the mean count: ⟨δn²⟩ ∝ m^y, with y typically in the range 1.2 to 1.4, broadly consistent with the results of many of the other studies cited above.
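As a concrete illustration of these count statistics (an illustration only, not part of the original study), the following Python sketch generates synthetic spike counts whose underlying rate fluctuates from trial to trial, which is one simple, hypothetical way to obtain ⟨δn²⟩ > m, and then fits the exponent y on a log-log scale. The gamma-modulated rates are an assumption chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def count_stats(counts):
        # Mean, variance, and Fano factor (variance-to-mean ratio) of counts.
        m = counts.mean()
        v = counts.var()
        return m, v, v / m

    means, variances = [], []
    for base_rate in (2.0, 5.0, 10.0, 20.0, 40.0):
        # Rate fluctuates across trials -> variance exceeds the Poisson value.
        rates = base_rate * rng.gamma(shape=10.0, scale=0.1, size=2000)
        counts = rng.poisson(rates)
        m, v, fano = count_stats(counts)
        means.append(m); variances.append(v)
        print(f"mean {m:6.2f}  var {v:7.2f}  Fano {fano:4.2f}")

    # Exponent y of <dn^2> proportional to m^y, from a straight-line log-log fit:
    y, _ = np.polyfit(np.log(means), np.log(variances), 1)
    print("fitted exponent y =", round(y, 2))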
Some of this observed variance could have a simple explanation: the condition of the animal might have changed between trials, so the intrinsic rate at which the neuron fires might differ from trial to trial, as suggested by Tolhurst et al. (1981). But it is far from clear whether all the variance can be accounted for in this way. Moreover, there is no special reason to take a Poisson process as the null hypothesis, so we do not even really know how much variance we are trying to explain. In this article, we try to address the question of how much variability, or more generally, what firing correlations can be expected as a consequence of the intrinsic dynamics of cortical neuronal networks. The theories of Amit and Brunel (1997a, 1997b) and of van Vreeswijk and Sompolinsky (1996, 1998) do not permit a consistent study of firing correlations. The Amit-Brunel equations ignore firing correlations and variations in firing rate within neuronal populations; thus, they do not constitute a complete mean field theory. Although one can calculate the variability of the firing (Brunel, 2000), the calculation is not self-consistent. Van Vreeswijk and Sompolinsky use a binary neuron model with stochastic dynamics, which makes it difficult, if not impossible, to study temporal correlations that might occur in networks of spiking neurons. Therefore, in this article, we do a complete mean field theory for a network of leaky integrate-and-fire neurons, including, as self-consistently determined order parameters, both firing rates and autocorrelation functions. This kind of theory is needed whenever the connections in the network are random. A general formalism for doing this was introduced by Fulvi Mari (2000) and used for an all-excitatory network; here we employ it for a network with both excitatory and inhibitory neurons. A preliminary study of this approach for an all-inhibitory network was presented previously (Hertz, Richmond, & Nilsen, 2003).

2 Model and Methods

The model network, indicated schematically in Figure 1, consists of N_1 excitatory neurons and N_2 inhibitory ones. In this work, we use leaky integrate-and-fire neurons, though the methods could be carried over directly to networks of other kinds of model neurons, such as conductance-based ones. They are randomly interconnected by synapses, both within and between populations, with the mean number of connections from population b to population a equal to K_b, independent of a. In specific calculations, we have used K_1 from 400 to 6400, and we take K_2 = K_1/4. We scale the synaptic strengths in the way van Vreeswijk and Sompolinsky (1996, 1998) did, with each nonzero synapse from population b to population a having the value J_ab/√K_b. Thus, the mean value of a synapse J_ij^{ab} is

  \overline{J_ij^{ab}} = √K_b J_ab / N_b,  (2.1)
Figure 1: Structure of the Model Network.
and its variance is

  \overline{(δJ_ij^{ab})²} = (J_ab²/N_b)(1 − K_b/N_b).  (2.2)
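A quick numerical check of these two moments can be done by sampling one row of the connectivity; the sketch below is illustrative, with made-up values of N_b, K_b, and J_ab (not the paper's):

    import numpy as np

    # Each of the N_b potential synapses is present with probability K_b/N_b
    # and carries the value J_ab/sqrt(K_b) when present.
    rng = np.random.default_rng(1)
    N_b, K_b, J_ab = 20000, 400, 1.5

    present = rng.random(N_b) < K_b / N_b
    J_row = np.where(present, J_ab / np.sqrt(K_b), 0.0)   # one row of J_ij^{ab}

    print("empirical mean :", J_row.mean())
    print("equation 2.1   :", np.sqrt(K_b) * J_ab / N_b)
    print("empirical var  :", J_row.var())
    print("equation 2.2   :", (J_ab**2 / N_b) * (1 - K_b / N_b))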
The parameters J_ab are taken to be of order 1, so the net input current to a neuron from the K_b neurons in population b connected to it is of order √K_b. With this scaling, the fluctuations in this current are of order 1. Similarly, we assume that the external input to any neuron is the sum of K_0 ≫ 1 contributions from individual neurons (in the lateral geniculate nucleus, if we are thinking about modeling V1), each of order 1/√K_0, so the net input is of order √K_0. In our calculations, we have used K_0 = K_1. We point out that this scaling is just for convenience in thinking about the problem. In the balanced asynchronous firing state, the large excitatory and inhibitory input currents nearly cancel, leaving a net input current of order 1. Thus, for this choice, both the net mean current and its typical fluctuations are of order 1, which is convenient for analysis. The physiologically relevant assumptions are only that excitatory and inhibitory inputs are separately much larger than their sum and that the latter is of the same order as its fluctuations.
Our synapses are not modeled as conductances. Our synaptic strength simply defines the amplitude of the postsynaptic current pulse produced by a single presynaptic spike. The model is formally specified by the subthreshold equations of motion for the membrane potentials u_i^a (a = 1, 2; i = 1, . . . , N_a),

  du_i^a/dt = −u_i^a/τ + Σ_{b=0}^{2} Σ_{j=1}^{N_b} J_ij^{ab} S_j^b(t),  (2.3)
together with the condition that when u_i^a reaches the threshold θ_a, the neuron spikes and the membrane potential is reset to a value u_r^a. The indices a or b = 0, 1, or 2 label populations: b = 0 refers to the (excitatory) population providing the external input, b = 1 refers to the excitatory population, and b = 2 to the inhibitory population. In equation 2.3, τ is the membrane time constant (taken the same for all neurons, for convenience), and S_j^b(t) = Σ_s δ(t − t_j^{b,s}) is the spike train of neuron j in population b. We have ignored transmission delays, and we take the reset levels u_r^a equal to the rest value of the membrane potential, 0. In our calculations, the thresholds are given a gaussian distribution with a standard deviation equal to 10% of the mean. We fix the mean threshold θ_a = 1. Analogous variability in other single-cell parameters (such as membrane time constants) could also be included in the model, but for simplicity, we do not do so here. We assume that the neurons in the external input population (b = 0) fire as independent Poisson processes. However, the neurons in the network (b = 1, 2) are not in general Poissonian; it is their correlations that we want to find in this investigation.

2.1 Mean Field Theory: Stationary States. We describe the mean field theory and its computational implementation first for the case of stationary rates. We will assume N_a ≫ K_a ≫ 1 (large but extremely dilute connectivity). Any mean field theory has to start with an ansatz for the structure of its order parameters. In words, our ansatz is that neurons fire noisily: thus, they are characterized by their rates (which can vary across a neuronal population) and autocorrelation functions. We assume the latter to contain a delta function spike at equal times of strength equal to the rate (because they are spiking neurons) plus a continuous part at unequal times. In our dilute limit, the theory is simplified by the fact that there are no cross-correlations between neurons. (Therefore, we generally drop the "auto" from "autocorrelation.") Under these assumptions, each of the three terms in the sum on b on the right-hand side of equation 2.3 can be treated as a gaussian random function with time-independent mean. (This can be proved formally using the generating-functional formalism of Fulvi Mari, 2000; it is a consequence of the fact that K_b ≫ 1 and the independence of the J_ij^{ab}. Furthermore,
experiments (Destexhe, Rudolph, & Paré, 2003) show that a gaussian approximation is very good for real synaptic noise.) We write the contribution from population b as

  I_i^{ab}(t) = Σ_{j=1}^{N_b} [\overline{J_ij^{ab}} + δJ_ij^{ab}][r_b + δr_j^b + δS_j^b(t)],  (2.4)
where r_b is the mean rate in population b, δr_j^b is the deviation of the rate of neuron j from its population mean, and δS_j^b(t) = S_j^b(t) − r_j^b describes the fluctuations of the activity of neuron j from its temporal mean r_j^b. The mean (over both time and neurons in the receiving population a) comes from the product of the first terms:

  ⟨I_i^{ab}(t)⟩ = √K_b J_ab r_b.  (2.5)
By ⟨· · ·⟩ we mean a time average or, equivalently, an average over "trials" (independent repetitions of the Poisson processes defining the input population neurons). We will generally use a bar over a quantity to indicate an average over the neuronal population or over the distribution of the J_ij^{ab}. (Note that these two kinds of averages are very different things.) The fluctuations around this mean are of two kinds. One is the neuron-to-neuron rate variations in population a, obtained from the time-independent terms in equation 2.4:

  δI_i^{ab} = Σ_j [\overline{J_ij^{ab}} δr_j^b + δJ_ij^{ab} r_j^b].  (2.6)
Using equations 2.1 and 2.2, their variance reduces, for K_b ≪ N_b, to

  \overline{(δI_i^{ab})²} = (J_ab²/N_b) Σ_j (r_j^b)² = J_ab² \overline{(r_j^b)²}.  (2.7)
The second kind is the temporal fluctuations for single neurons, obtained from the terms in equation 2.4 involving δS_j^b(t). Their population-averaged correlation function is proportional to the average correlation function in population b:

  \overline{⟨δI_i^{ab}(t) δI_i^{ab}(t′)⟩} = (J_ab²/N_b) Σ_j ⟨δS_j^b(t) δS_j^b(t′)⟩ ≡ J_ab² C_b(t − t′).  (2.8)
Thus, we can write this contribution to the input current for a single neuron as

  I_i^{ab}(t) = J_ab [ √K_b r_b + √(\overline{(r_j^b)²}) x_i^{ab} + ξ_i^{ab}(t) ],  (2.9)

where x_i^{ab} is a unit-variance gaussian random number and

  ⟨ξ_i^{ab}(t) ξ_i^{ab}(t′)⟩ = C_b(t − t′).  (2.10)
The x_i^{ab} are time and trial independent, while the noise ξ_i^{ab}(t) varies both in time within a trial and randomly from trial to trial. Note that for this model, a correct and complete mean field theory has to include rate variations, through \overline{(r_j^b)²}, and the temporal firing correlations, given by C_b(t − t′), as well as the mean rates. In our treatment here, we will assume that the neurons in the external input population fire like Poisson processes, so I_i^{a0}(t) is white noise. However, the neurons providing the source of the recurrent currents are not generally Poissonian, so their correlations appear in the statistics of the noise term. The self-consistency equations of mean field theory are simply the conditions that the average output statistics of the neurons (r_a, \overline{(r_j^a)²}, and C_a(t − t′)) are the same as those used to generate the inputs for single neurons using integrate-and-fire neurons with synaptic input currents given by equation 2.9. In an equivalent formulation, the second term in equation 2.9 can be omitted if the noise terms ξ_i^{ab}(t) have correlations equal to the unsubtracted correlation function,

  C_b^{tot}(t − t′) = (1/N_b) Σ_j ⟨S_j^b(t) S_j^b(t′)⟩.  (2.11)
For |t − t′| → ∞, C_b^{tot}(t − t′) → \overline{(r_j^b)²}, so ξ_i^{ab}(t) acquires a random static component of mean square value \overline{(r_j^b)²}. In still another way to do it, one can use the average rate r_b in place of its root mean square value in the second term on the right-hand side of equation 2.9 and employ noise with the correlation function

  C̃_b(t − t′) = (1/N_b) Σ_j ⟨[S_j^b(t) − r_b][S_j^b(t′) − r_b]⟩.  (2.12)
For |t − t′| → ∞,

  C̃_b(t − t′) → \overline{(r_j^b)²} − r_b² ≡ \overline{(δr_j^b)²}.  (2.13)
There are now two static random parts of I_i^{ab}(t) in equation 2.9: one from the second term and one from the static component of the noise. Their sum is a gaussian random number with variance equal to \overline{(r_j^b)²}. Thus, these three ways of generating the input currents are equivalent.

2.1.1 The Balance Condition. In a stationary, low-rate state, the mean membrane potential described by equation 2.3 has to be approximately stationary. If excitation dominates, we have du_i^a/dt ∝ √K_0, implying a firing rate of order √K_0 (or one limited only by the refractory period of the neuron). If inhibition dominates, the neuron will never fire. The only way to have a stationary state at a low rate (less than one spike per membrane time constant) is to have the excitation and inhibition nearly cancel. Then the mean membrane potential can lie a little below threshold, and the neuron can fire occasionally due to the input current fluctuations. This suggests the following heuristic theory, based on this approximate balance. Using equations 2.3 and 2.5, we have

  Σ_{b=0}^{2} J_ab √K_b r_b = O(1),  (2.14)

or, up to corrections of O(1/√K_0),

  Σ_{b=0}^{2} Ĵ_ab r_b = 0,  (2.15)

with Ĵ_ab = J_ab √(K_b/K_0). These are two linear equations in the two unknowns r_a, a = 1, 2, with the solution

  r_a = −Σ_{b=1}^{2} [Ĵ^{−1}]_ab J_b0 r_0,  (2.16)

where Ĵ^{−1} is the inverse of the 2 × 2 matrix with elements Ĵ_ab, a, b = 1, 2. If there is a stationary balanced state, the average rates of the excitatory and inhibitory populations are given by the solutions of equation 2.16. This argument, given by Amit and Brunel and by Sompolinsky and van Vreeswijk, depends only on the rates, not on the correlations.
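In code, the balance solution of equation 2.16 is a single linear solve; the following sketch uses illustrative values of J_ab, K_b, and r_0 (chosen so that the resulting rates come out positive), not the paper's parameters:

    import numpy as np

    K = np.array([1111.0, 4444.0, 1111.0])      # K_0, K_1, K_2 (illustrative)
    J = np.array([[2.0, 1.0, -4.0],             # J_10, J_11, J_12 (row a = 1)
                  [1.0, 1.0, -3.0]])            # J_20, J_21, J_22 (row a = 2)
    Jhat = J * np.sqrt(K / K[0])                # Jhat_ab = J_ab sqrt(K_b / K_0)

    r0 = 0.05                                   # external rate (spikes/ms)
    # Equation 2.15, sum_b Jhat_ab r_b = 0, rearranged as equation 2.16:
    r = np.linalg.solve(Jhat[:, 1:], -Jhat[:, 0] * r0)
    print("balanced rates (r_1, r_2):", r)      # both positive for this choice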
2.1.2 Numerical Procedure. For integrate-and-fire neurons in a stationary state, the mean field theory can be reduced to a set of analytic equations if neuron-to-neuron rate variations are neglected and a white-noise (Poisson firing) approximation is made (Amit & Brunel, 1997a, 1997b; Brunel, 2000). However, in a complete mean field theory for our model, the randomness in the connectivity forces these features to be taken into account, and it is necessary to resort to numerical methods. Thus, we simulate single neurons driven by gaussian synaptic currents; collect their firing statistics to compute the rates r_a, rate fluctuations \overline{(δr_j^a)²}, and correlations C_a(t − t′); and then use these to generate improved input current statistics. The cycle is repeated until the input and output statistics are consistent. This algorithm was first used by Eisfeller and Opper (1992) to calculate the remanent magnetization of a mean field model for spin glasses. (A simplified sketch of this loop is given at the end of this subsection.) Explicitly, we proceed as follows. We simulate single excitatory and inhibitory neurons over "trials" 100 integration time steps long. (We will call each time step a "millisecond." We have explored using smaller time steps and verified that there are no qualitative changes in the results.) We start from estimates of the rates given by the balance condition, which makes the net mean input current vanish. Then the sum over presynaptic populations of the O(√K_b) terms in equation 2.9 vanishes, leaving only the rate variation and noise terms. We then run 10,000 trials of single excitatory and inhibitory neurons, selecting on each trial random values of x_i^{ab} and ξ_i^{ab}(t). Since at this point we do not have any estimates of either the rate fluctuations \overline{(δr_j^b)²} or the correlations C_b(t − t′), we use r_b² in place of \overline{(r_j^b)²} in equation 2.9 and use white noise for ξ_i^{ab}(t): C_b(t − t′) → r_b δ(t − t′). The random choice of x_i from trial to trial effectively samples across the neuronal populations, so we can then collect the statistics r_a, \overline{(r_j^a)²} (or, equivalently, \overline{(δr_j^a)²}), and C_a(t − t′) from these trials. These can be used to generate an improved estimate of the input noise statistics to be used in equation 2.9 in a second set of trials, which yields new spike statistics again. This procedure is iterated until the input and output statistics agree. This may take up to several hundred iterations, depending on network parameters and how the computation is organized. If one tries this procedure in its naive form, that is, using the output statistics directly to generate the input noise at the next step, it will lead to big oscillations and not converge. It is necessary to make small corrections (of relative order 1/√K_0) to the previous input noise statistics to guarantee convergence. When one computes statistics from the trials in any iteration, the simplest procedure involves calculating not the average correlation function C_b(t − t′) defined in equation 2.8 but, rather, C̃_b(t − t′) (see equation 2.12). From it, we can proceed in two ways. In one (the first of the three schemes described above for organizing the noise), from its |t − t′| → ∞ limit we can obtain \overline{(δr_j^b)²}, and thereby \overline{(r_j^b)²} = r_b² + \overline{(δr_j^b)²} for use in equation 2.9. Subtracting this limiting value from C̃_b(t − t′) gives us C_b(t − t′) (which vanishes for
large |t − t′|) for use in generating the noise ξ_i^{ab}(t). We will call this the subtracted correlation method. Alternatively, as in the third of the schemes above, we can, at each step of our iterative procedure, generate noise directly with the correlations C̃_b(t − t′) (which have a long-range time dependence) and use r_b² in place of \overline{(r_j^b)²} in equation 2.9. We call this the unsubtracted correlation method. We have verified that the two methods give the same results when carried out numerically, though the second one converges more slowly. While the true rates in the stationary case are time independent and C_a(t, t′) is a function only of t − t′, the statistics collected over a finite set of noise-driven trials will not exactly have these stationarity properties. Therefore, we improve the statistics and impose time-translational invariance by averaging the measured r_a(t) and \overline{(δr_j^a(t))²} over t and averaging over the measured values C_a(t, t′) with a fixed t − t′. After the iterative procedure converges, so that we have a good estimate of the statistics of the input, we want to run many trials on a single neuron and compute its firing statistics. This means that the numbers x_i^{ab} (b = 0, 1, 2) should be held constant over these trials. In this case, it is necessary to subtract out the large t − t′ limit of C̃_a(t − t′) and use fixed x_i^{ab} (constant in time and across trials) to generate the input noise. (If we did it the other way, without the subtraction, we would effectively be assuming that x_i^{ab} changed randomly from trial to trial, which is not correct.) In our calculations we have used 10,000 trials to calculate these single-neuron firing statistics. We perform the subtraction of the long-time limit of C̃_a(t − t′) at |t − t′| = 50, and we have checked that equation 2.12 is flat beyond this point in all the cases we have done. If we perform this kind of measurement separately for many values of the x_i^{ab}, we will be able to see how the firing statistics vary across the population. Here, however, we will confine most of our attention to what we call the "average neuron": the one with the average value (0) of all three x_i^{ab}. In particular, we calculate the mean spike count in the 100 ms trials and its variance across trials. From this we can get the Fano factor F (the variance-to-mean ratio). We also compute the autocorrelation function, which offers a consistency check, since the Fano factor can also be obtained from

  F = (1/r) ∫_{−∞}^{∞} C(τ) dτ.  (2.17)
(This formula is valid when the measurement period is much larger than the time over which C(τ) falls to zero.) We will study how these firing statistics vary as we change various parameters of the model: the input rate r_0, the parameters that control the balance of excitation and inhibition, and the overall strength of the synapses. This will give us some generic understanding of what controls the degree of irregularity of the neuronal firing.
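The following Python sketch illustrates, in deliberately simplified form, the self-consistency loop described in this subsection: single leaky integrate-and-fire neurons are driven by gaussian currents whose mean follows equation 2.5 and whose fluctuations use the white-noise approximation C_b(t − t′) → r_b δ(t − t′). The rate-heterogeneity term x_i^{ab}, the feedback of the measured correlation functions, and the Fano-factor bookkeeping are all omitted for brevity, and every parameter value is an illustrative assumption, not the paper's.

    import numpy as np

    rng = np.random.default_rng(2)
    tau, theta, dt, T_trial = 10.0, 1.0, 1.0, 100   # membrane constant, threshold,
                                                    # time step, trial length (ms)
    K = np.array([1111.0, 4444.0, 1111.0])          # K_0, K_1, K_2
    J = np.array([[2.0, 1.0, -4.0],                 # J_10, J_11, J_12
                  [1.0, 1.0, -3.0]])                # J_20, J_21, J_22
    r0 = 0.05                                       # external rate (spikes/ms)
    r = np.array([r0, 0.05, 0.05])                  # balance-condition estimates

    def lif_rate(mu, sigma2, n_trials=2000):
        # Mean rate of LIF neurons driven by gaussian current with mean mu and
        # white-noise variance density sigma2 (equation 2.9, simplified).
        u = np.zeros(n_trials)
        n_spikes = 0
        for _ in range(T_trial):
            u += dt * (-u / tau + mu) \
                 + np.sqrt(sigma2 * dt) * rng.standard_normal(n_trials)
            spiked = u >= theta
            n_spikes += spiked.sum()
            u[spiked] = 0.0                         # reset to 0 after a spike
        return n_spikes / (n_trials * T_trial)

    for _ in range(30):                             # self-consistency loop
        r_out = r.copy()
        for a in (1, 2):                            # excitatory, inhibitory
            mu = np.sum(J[a - 1] * np.sqrt(K) * r)  # mean current, equation 2.5
            sigma2 = np.sum(J[a - 1] ** 2 * r)      # C_b(t - t') -> r_b delta
            r_out[a] = lif_rate(mu, sigma2)
        r[1:] += 0.1 * (r_out[1:] - r[1:])          # small corrections aid convergence
    print("self-consistent rates (r_1, r_2):", r[1:])

In the full procedure described above, the measured correlation functions C_a and the rate fluctuations would be fed back into the noise generation as well, and the Fano factor would then follow either directly from the trial-to-trial spike counts or from equation 2.17.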
2.2 Nonstationary Case. When the input population is not firing at a constant rate, almost the same calculational procedure can be followed, except that one does not average measured rates, their fluctuations, or correlation functions over time. To start, we get initial instantaneous rate estimates from the balance condition, assuming that the time-dependent average input currents do not vary too quickly. (This condition is not very stringent; van Vreeswijk and Sompolinsky showed that the stability eigenvalues are proportional to √K_0, so if they have the right sign, the convergence to the balanced state is very rapid.) To do the iterative procedure to satisfy the self-consistency conditions of the theory, it is simplest to use the second of the two ways described above (the unsubtracted correlation method). In this case, we get equations for the noise input currents just like equation 2.9, except that the second term is omitted, the r_b are t-dependent, and the correlation functions C_b^{tot} depend on both t and t′, not just their difference. The only tricky part is the subtraction of the long-time limit of the correlation function, which is not simply defined. We treat this problem in the following way. We examine the rate-normalized quantity

  Ĉ_a(t, t′) = C_a^{tot}(t, t′) / [r_a(t) r_a(t′)].  (2.18)
We find that this quantity is time-translation invariant (i.e., a function only of t − t′) to a very good approximation, so we perform the subtraction of the long-time limit on it. Then multiplying the subtracted Ĉ by r_a(t) r_a(t′) gives a good approximation to the true (subtracted) correlation function C_a(t, t′). The meaning of this finding is, loosely speaking, that when the rates vary (slowly enough) in time, the correlation functions just inherit these rates as overall factors without changing anything else about the problem. We will use this time-dependent formulation below to simulate experiments like those of Gershon et al. (1998), where the LGN input r_0(t) to visual cortical cells is time dependent because of the flashing on and off of the stimulus.

3 Results

The results presented in this article were obtained from simulations with parameters K_1 = 4444 excitatory inputs and K_2 = 1111 inhibitory inputs to each neuron. The average number of external (excitatory) inputs K_0
was chosen to be equal to K_2. All neurons have the same membrane time constant τ of 10 ms. To study the effect of various combinations of synaptic strength, we use the following generic form to define the intracortical weights J_ab:

  ( J_11  J_12 )   ( 1    −2g  )
  ( J_21  J_22 ) = ( ε    −2gε ).  (3.1)

For the synaptic strengths from the external population, we use J_10 = 1 and J_20 = ε. With this notation, g determines the strength of inhibition relative to excitation within the network and ε the strength of intracortical excitation. Additionally, we scale the overall strength of the synapses with a multiplicative scaling factor denoted J_s, so that each synapse has an actual weight of J_s · J_ab, regardless of a and b. Figure 2 summarizes how the firing statistics depend on all of the parameters g, ε, and J_s. The irregularity of spiking, as measured by the Fano factor, depends most sensitively on the overall scaling of the synaptic strength, J_s. The Fano factor increases systematically as J_s increases, and higher values of the intracortical excitation ε also result in higher values of F. The same pattern holds for stronger intracortical inhibition, parameterized by g. For all of these cases, the mean firing rate remains virtually unchanged due to the dynamic balance of excitation and inhibition in the network, whereas the fluctuations increase with the increase of any of the synaptic weights. Interspike interval (ISI) distributions are shown in Figure 3 for three different values of J_s, keeping ε and g fixed at 0.5 and 1, respectively. For a Poisson spike train, the Fano factor F = 1, while F > 1 (which we term super-Poissonian) indicates a tendency of spikes to occur in clusters separated by accordingly longer empty intervals, and F < 1 (sub-Poissonian) indicates more regularity, reflected by a narrower distribution. We have adjusted the input rate r_0 so that the output rate is the same in all three cases. The top panel of Figure 3 shows the ISI distribution of a super-Poissonian spike train, obtained for J_s = 1.42. Overlaid on the histogram of ISI counts is an exponential curve indicating a Poisson distribution with the same mean ISI length. Compared with the Poisson distribution, the super-Poissonian spike train contains more short intervals, as seen by the peak at short lengths, and also more long intervals, causing a long tail. Necessarily, the interval count around the average ISI length is lower than that for a Poisson spike train. The ISI distribution in the middle panel of Figure 3 belongs to a spike train with a Fano factor close to one, obtained for J_s = 0.714. The overlaid exponential reveals a deviation from the ISI count: while intervals of diminishing length are the most likely ones for a real Poisson process, our neuronal spike trains always show some refractoriness, reflected by a dip at the shortest intervals.
Figure 2: Fano factors as a function of overall synaptic strength J_s and intracortical excitation strength ε for three different inhibition factors: g = 1, 1.5, and 2, respectively. The increase of any of these parameters results in more irregular firing statistics as measured by the Fano factor.
Figure 3: Interspike interval distributions for fixed ε = 0.5 and g = 1 and three different values of overall synaptic strength J_s: 1.42 (super-Poissonian), 0.714 (Poissonian), and 0.357 (sub-Poissonian). Overlaid on each panel is the exponential fall-off of a true Poisson distribution with the same average rate, which is the same in all three cases.
(We have not used an explicit refractory period in our model. The dip seen here simply reflects the fact that it takes a little time for the membrane potential distribution to return to its steady-state form after reset.) Apart from this deviation, however, there is a close resemblance between the observed distribution and the "predicted" one. Finally, the lower panel of Figure 3 depicts a case with F < 1: weaker synapses lead to a stronger refractory effect and (since the rate is fixed) an accordingly narrower distribution around the average ISI length, as compared to the overlaid Poisson distribution. This distribution was obtained with weak synapses produced by a small scaling factor of J_s = 0.357. As mentioned in the previous section, the Fano factor can also be obtained by integrating over the spike train autocorrelation divided by the spike rate, equation 2.17. For a Poisson process, the autocorrelation vanishes for all lags different from zero. In contrast, F > 1 (super-Poissonian case) implies a positive integral over nonzero lags, whereas in the sub-Poissonian case, there must be a negative area under the curve. Figure 4 shows examples of autocorrelations for all of the three cases. For the super-Poissonian case (dashed line), there is a "hill" of positive correlations for short intervals, reflecting the tendency toward spike clustering.
Figure 4: Three different spike train autocorrelations illustrating the relationship between the Fano factor F and the area under the curve. For F = 1 (Poissonian, solid line), the autocorrelation is an almost perfect delta function. F > 1 (super-Poissonian, dashed line) is reflected by a hill generating a positive area, and F < 1 (sub-Poissonian, dotted line) is accompanied by a valley of negative correlations. (See the text for more details.)
Figure 5: Spike count log(variance) versus log(mean) for three different values of overall synaptic strength J_s, varying the external input rate r_0. For J_s = 1.19 (super-Poissonian, triangles), the data look qualitatively like those from experiments. The other values for J_s are 0.714 (Poissonian, stars) and 0.357 (sub-Poissonian, crosses).
The sub-Poissonian autocorrelation (dotted line) shows a valley of negative correlations for short intervals, indicating well-separated spikes in a more regular spike train. The curve labeled as Poisson (solid line) does have a small valley around zero lag, which reflects once more the refractoriness of neurons to fire at extremely short intervals, unlike a completely random Poisson process. (Actually, the measured F in this case is slightly greater than 1, implying that in this case the integral of the very small positive tail for t > 2 ms is slightly larger than that of the (more obvious) negative short-time dip.) Measurements on V1 neurons in awake monkeys (see, e.g., Gershon et al., 1998) suggest a linear relationship between the log variance and the log mean of stimulus-elicited spike counts. We find a similar dependence for neurons within our model network. Figure 5 shows results for three different values of J_s. In each case, five different values of the external input rate r_0 were tried, yielding various mean spike counts and variances. The logarithm of the spike count variance is plotted as a function of the logarithm of the spike count mean, and a solid diagonal line indicates the identity, that is, a Fano factor of exactly 1. We see that for the largest value of J_s used here, the data look qualitatively like those from experiments, with Fano factors in the range around 1.5 to 2.
Figure 6: Parameterization of the time-dependent input rate r_0(t). The input is modeled as the sum of three functions: (1) a stationary background rate (which is zero in this case); (2) a tonic part, which rises within the first 20 ms to a constant level of A_0, where it stays for 60 ms, falling back to zero within the last 20 ms; and (3) an initial phasic part, which is nonzero only in the first 50 ms, rising to a maximum value of B_0.
3.1 Nonstationary Case. The results presented in the previous section were obtained with stationary inputs, while experimental data like those from Gershon et al. (1998) were collected from visual neurons subject to time-dependent inputs. Therefore, we performed calculations of the spike statistics in which the input population rate r_0 was time dependent. The modeled temporal shape of r_0(t) is depicted in Figure 6. It is the sum of three terms:

  r_0(t) = R_0 + A(t) + B(t).  (3.2)
The first, R_0, is a constant, as in the preceding section. The second term, A(t), rises to a maximum over a 25 ms interval, remains constant for 50 ms, and then falls off to zero over the final 25 ms:

  A(t) = 0.5 A_0 [1 − cos(4πt/T)]         for 0 < t ≤ T/4,
  A(t) = A_0                              for T/4 < t ≤ 3T/4,
  A(t) = 0.5 A_0 [1 − cos(4π(T − t)/T)]   for 3T/4 < t ≤ T,  (3.3)
Figure 7: Spike count log(variance) versus log(mean) for time-varying external inputs with varying overall strength. The neuron in the simulated network (triangles) fires in a super-Poissonian regime, with an almost linear relationship for low spike rates between the log variance and the log mean, resembling closely data obtained from in vivo experiments. The diagonal solid line indicates the identity of variance and mean (Fano factor F = 1).
where T is the total simulation interval of 100 ms. The third term, B(t), rises to a maximum B_0 in the first 25 ms and then falls back to zero in the next 25 ms, remaining zero thereafter:

  B(t) = 0.5 B_0 [1 − cos(4πt/T)]           for 0 < t ≤ T/4,
  B(t) = 0.5 B_0 [1 − cos(4π(T/2 − t)/T)]   for T/4 < t ≤ T/2,
  B(t) = 0                                  for T/2 < t ≤ T.  (3.4)
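A direct transcription of equations 3.2 to 3.4 in Python follows (a sketch; the (A_0, B_0) pair in the usage example is one entry of Table 1, and T = 100 ms as in the text):

    import numpy as np

    def r0_of_t(t, T, R0, A0, B0):
        # Time-dependent input rate r0(t) = R0 + A(t) + B(t), equation 3.2.
        # Tonic part A(t), equation 3.3: cosine ramp up, plateau, ramp down.
        if t <= T / 4:
            A = 0.5 * A0 * (1 - np.cos(4 * np.pi * t / T))
        elif t <= 3 * T / 4:
            A = A0
        else:
            A = 0.5 * A0 * (1 - np.cos(4 * np.pi * (T - t) / T))
        # Phasic part B(t), equation 3.4: transient bump in the first half.
        if t <= T / 4:
            B = 0.5 * B0 * (1 - np.cos(4 * np.pi * t / T))
        elif t <= T / 2:
            B = 0.5 * B0 * (1 - np.cos(4 * np.pi * (T / 2 - t) / T))
        else:
            B = 0.0
        return R0 + A + B

    # Example with R0 = 0.1 (as used for Figure 7) and one (A0, B0) pair
    # from Table 1:
    T = 100.0
    rates = [r0_of_t(t, T, R0=0.1, A0=0.75, B0=0.25) for t in np.arange(0.0, T, 1.0)]
    print("peak input rate:", max(rates))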
Figure 7 shows the logarithm of the spike count variance plotted against the logarithm of the spike count mean for various nonstationary inputs characterized by different values of A_0 and B_0. The graph shows results for J_s = 0.95, ε = 0.5, g = 1, and a background rate of R_0 = 0.1. Table 1 shows the choice of the 16 combinations of the stimulus parameters A_0 and B_0, together with the resulting Fano factors F for the simulated neuron. The data look qualitatively like those obtained from in vivo experiments by Gershon et al. (1998) and are similar to the super-Poissonian case in Figure 5. The neuron fires consistently in a super-Poissonian regime with Fano factors slightly higher than 1 and an almost linear relationship between the
Table 1: Stimulus Parameters A_0 and B_0 for the Results Depicted in Figure 7 and the Resulting Fano Factors F.

  A_0:  0.375  0.375  0.500  0.500  0.750  0.750  1.000  1.000
  B_0:  0.125  0.375  0.125  0.375  0.250  0.750  0.250  0.750
  F:    1.14   1.2    1.22   1.23   1.29   1.36   1.37   1.4

  A_0:  1.500  1.500  2.000  2.000  3.000  3.000  4.000  4.000
  B_0:  0.500  1.500  0.500  1.500  1.000  3.000  1.000  3.000
  F:    1.48   1.5    1.55   1.53   1.57   1.41   1.43   1.34
log variance and the log mean for low spike counts. For higher spike counts, the curve bends toward values of lower Fano factors, just as for stationary inputs (see Figure 5). In both cases, this bend reflects the decrease in irregularity of firing caused by an increasingly prominent role of refractoriness for shorter interspike intervals.

3.2 Comparison with Network Simulations. An extensive exploration of the validity of mean field theory is beyond the scope of this letter. However, we have performed simulations of networks constructed according to the model of section 2 and compared their firing irregularity with that obtained in mean field theory. Specifically, we tested two main results from our mean field analysis, stating that Fano factors increase systematically with synaptic strength and that there is an approximate power law between the mean spike count and the spike count variance, similar to experimental findings (see Figures 2 and 5, respectively). Figure 8 shows measured Fano factors for a typical neuron in a network with K_0 = K_1 = 400, K_2 = 100, and N = 10000, where we varied J_s in the same range as in Figure 2 (other parameters were g = 1 and ε = 0.5). The Fano factor increases systematically as J_s increases, lying in the quantitative range predicted by mean field theory. In addition (results not shown), we explored the lower and upper limits of Fano factors in our network. For J_s = 0.1, the average Fano factor of all neurons in the network was 0.034. Notwithstanding the very regular firing of all neurons in this network with very weak synapses, the overall activity remained asynchronous, as required in our mean field analysis. At the other extreme, for very strong synapses with J_s = 32, we observed an average Fano factor of 16.05, and individual neurons exhibited Fano factors of up to 30 and higher. These results show that networks of integrate-and-fire neurons exhibit a wide range of Fano factors in their balanced state, depending on synaptic strength. In Figure 9, we show plots of the logarithm of the spike count variance as a function of the logarithm of the mean spike count for six individual
Figure 8: Fano factors as a function of overall synaptic strength J_s obtained from network simulations for a randomly chosen neuron (g = 1, ε = 0.5). A comparison with Figure 2 reveals that the mean field calculations correctly predict both the qualitative relationship between Fano factors and synaptic strength and the quantitative range of Fano factors for this range of J_s values.
neurons in the network. Analogous to Figure 5, results for three different values of J_s are shown (1.5, 1.1, and 0.714, indicated by triangles, stars, and crosses, respectively), each probed with five different strengths of external inputs. The neurons were chosen randomly from all 8000 excitatory neurons with nonzero firing rates. With the exception of neuron 4638 (in the lower middle panel), J_s = 1.5 resulted in super-Poissonian firing statistics, J_s = 0.714 in sub-Poissonian firing, and J_s = 1.1 in approximately Poisson statistics. There is a strong qualitative resemblance between the network simulation results in Figure 9 and the mean field results in Figure 5, with the latter showing spike count statistics of the hypothetical "average neuron" defined above. Taken together, these results suggest that mean field theory provides a reliable way to estimate firing variability in balanced networks.

4 Discussion

Cortical neurons receive thousands of excitatory and inhibitory inputs, and despite the high number of inputs from nearby neurons with similar firing statistics and similar connectivity, their observed firing is very irregular (Heggelund & Albus, 1978; Dean, 1981; Tolhurst et al., 1981, 1983; Vogels, Spileers, & Orban, 1989; Snowden et al., 1992; Gur et al., 1997; Shadlen &
Figure 9: Spike count statistics obtained from network simulations for six randomly chosen neurons. Spike count log(variance) versus log(mean) is plotted as in Figure 5, for various external input rates at three different values of synaptic strength (J_s = 1.5, 1.1, and 0.714, represented by triangles, stars, and crosses, respectively). There is a close qualitative resemblance to the results from mean field calculations shown in Figure 5, where the spike statistics of a hypothetical neuron with average overall input are shown.
Newsome, 1998; Gershon et al., 1998; Kara et al., 2000; Buracas et al., 1998; Lee et al., 1998; DeWeese et al., 2003). Dynamically balanced excitation and inhibition through a simple feedback mechanism provide an explanation that naturally accounts for this phenomenon without requiring fine-tuning of the parameters (Amit & Brunel, 1997a, 1997b; Brunel, 2000; van Vreeswijk and Sompolinsky, 1996, 1998). Moreover, neurons in such model networks show an almost linear input-output relationship (input current versus firing frequency), as do neurons in the neocortex. Whenever one wants to understand a complex dynamical system, one asks whether there is an approximate theory, possibly exact in some interesting limit, that captures and affords insight into the observed properties. Here, the high connectivity of the cortical networks of interest suggests trying to obtain this insight from mean field theory, which becomes exact for an infinite, extensively connected system. In this article, we have formulated a
complete mean field description of the dynamically balanced asynchronous firing state in the dilute, high-connectivity limit N_a ≫ K_a ≫ 1. Because of the assumed random connection structure in the network, the mean field theory has to include autocorrelation functions and rate variations as well as population-mean rates as order parameters, as in spin glasses (Sompolinsky & Zippelius, 1982). We used this mean field theory to analyze firing correlations. We found that the relationship between the observed irregularity of firing (spike count variance) and the firing rate (spike count mean) of the neurons closely resembles data collected from in vivo experiments (see Figures 5 and 7). To do this, we developed a complete mean field theory for a network of leaky integrate-and-fire neurons, in which both firing rates and correlation functions are determined self-consistently. Using an algorithm that allows us to find the solutions to the mean field equations numerically, we could elucidate how the strength of synapses within the network influences the expected firing statistics of cortical neurons in a systematic manner (see Figure 2). We have shown that the irregularity of firing, as measured by the Fano factor, increases with increasing synaptic strengths (see Figure 2). Nearly Poisson statistics (with F ≈ 1) are observed for moderately strong synapses, but the transition from sub-Poissonian to super-Poissonian statistics is smooth, without a special role for F = 1. The higher irregularity in the spike counts is always accompanied by a tendency toward more "bursty" firing. (These bursts are a network effect because the model contains only leaky integrate-and-fire neurons, which do not burst on their own.) This burstiness can best be seen in the spike train autocorrelation function (see Figure 4), which acquires a hill of growing size and width around zero lag for increasing Fano factors. The interdependence between firing irregularity and bursting can be understood with the help of the ISI distributions depicted in Figure 3: when the rate, and thus the average ISI, is kept constant, then any higher count for shorter-than-average ISIs must be accompanied by an accordingly higher count for longer ISIs (indicating bursts), and vice versa. Thus, higher irregularity always goes hand in hand with a higher tendency toward temporal clustering of spikes. Why do stronger synapses lead to higher irregularity in firing? The size of the input current fluctuations in equation 2.9 is controlled by the J_ab, and so, therefore, are the corresponding membrane potential fluctuations. Thus, for example, the width of the steady-state membrane potential distribution is proportional to J_s. We next have to consider where this distribution is centered. Remembering that, according to the balance condition, the firing rate is independent of J_s, the center of the distribution has to move farther away from threshold as J_s is increased in order to keep the rate fixed. Therefore, for very small J_s almost the entire equilibrium membrane potential distribution will lie well above the postspike reset value, while for large J_s, it will be mostly below reset.
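A toy simulation can illustrate this argument (an illustration only, with ad hoc parameter values): for a weak and a strong synaptic scale, the mean input is tuned by bisection so that both neurons fire at the same target rate, and one then inspects where the membrane potential distribution sits relative to the reset (0) and the threshold (1).

    import numpy as np

    rng = np.random.default_rng(4)
    tau, theta, dt, steps = 10.0, 1.0, 0.1, 50000

    def simulate(Js, mu):
        # LIF neuron with noise amplitude proportional to the synaptic scale Js;
        # returns firing rate and the mean and spread of the membrane potential.
        u, us, n_spikes = 0.0, [], 0
        for _ in range(steps):
            u += dt * (-u / tau + mu) + Js * np.sqrt(dt) * rng.standard_normal()
            if u >= theta:
                u, n_spikes = 0.0, n_spikes + 1
            us.append(u)
        return n_spikes / (steps * dt), np.mean(us), np.std(us)

    def tune_mu(Js, target=0.02, lo=-2.0, hi=2.0):
        # Bisection on mu; the firing rate is monotone in the mean input.
        for _ in range(16):
            mid = 0.5 * (lo + hi)
            r, _, _ = simulate(Js, mid)
            lo, hi = (mid, hi) if r < target else (lo, mid)
        return 0.5 * (lo + hi)

    for Js in (0.2, 1.5):
        mu = tune_mu(Js)
        rate, mean_u, std_u = simulate(Js, mu)
        print(f"J_s={Js}: rate={rate:.3f}/ms, <u>={mean_u:+.2f}, std(u)={std_u:.2f}")

At matched rates, the weak-synapse case should show a mean membrane potential well above the reset, and the strong-synapse case a mean well below it, in line with the argument above.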
Immediately after a spike, the membrane potential distribution is a delta function centered at the reset (here 0). It then spreads, and its mean moves up or down toward its equilibrium value. This equilibration will take about a membrane time constant. If the equilibrium value is well above zero (the small-$J_s$ case), the probability of reaching threshold will be suppressed during this time, implying a refractory dip in the ISI distribution and the correlation function and a tendency toward a Fano factor less than 1. In the large-$J_s$ case, where the membrane potential is reset much closer to the threshold than to its eventual equilibrium value, the initial rapid spread (with the width growing proportional to $\sqrt{J_s t}$) leads to an enhanced probability of early spikes. At short times, this diffusive spread dominates the downward drift of the mean (which is only linear in $t$). Thus, there is extra weight in the ISI distribution and a positive correlation function at these short times, leading to a Fano factor greater than 1. Empirically, an approximate power-law relationship between the mean and variance of the spike count has frequently been observed for cortical neurons (see, e.g., Tolhurst et al., 1981; Vogels, Spileers, & Orban, 1989; Gershon et al., 1998; Lee et al., 1998). Our model shows the same qualitative feature (see Figures 5 and 6), though we have no argument that the relation should be an exact power law. However, this agreement suggests that the model captures at least part of the physics underlying the firing statistics. As already observed, not all of the variability in measured neuron responses has to be explained in the manner outlined above. Changing conditions during the run of a single experiment may introduce extra irregularity, caused by collecting statistics over trials with different mean firing rates. Our analysis shows why, and how much, irregularity can be expected due to intrinsic cortical dynamics. Other authors have also studied firing irregularity, in phase-oscillator models (Bressloff & Coombes, 2000; Bressloff, Bressloff, & Cowan, 2000) and in a ring model with inhomogeneous excitation and inhibition (Lin, Pawelzik, Ernst, & Sejnowski, 1998). Both groups found that their models could produce highly irregular firing. In our work, we have made a systematic study of how the irregularity (quantified by the Fano factor) depends on system parameters for a fairly simple model appropriate for describing local (intracolumn) neocortical networks. We have used instantaneous synapses; that is, we have not included synaptic filtering of input spike trains in the calculations we have reported here. However, we have incorporated such filtering, with a simple exponential kernel, into our code and explored the effects of a nonzero synaptic current decay time $\tau_{\mathrm{syn}}$. We find that for small $\tau_{\mathrm{syn}}/\tau$, Fano factors grow proportionally to this ratio,

$$F(\tau_{\mathrm{syn}}) = F(0)\left(1 + a\,\frac{\tau_{\mathrm{syn}}}{\tau}\right), \qquad (4.1)$$
with $a = O(1) > 0$. Since we are most interested in the limit $\tau_{\mathrm{syn}} \ll \tau$, we have not studied these corrections in detail. However, it should be noted that in a model where the synapses are modeled by conductances instead of current pulses (Lerchner, Ahmadi, & Hertz, 2004), the effective membrane time constant can become very small, so $\tau_{\mathrm{syn}}$ can be considerably larger than the effective membrane time constant (Destexhe et al., 2003). In this case, the dynamics become rather different. Our formulation of the mean field theory is general enough to allow other straightforward extensions toward greater biological realism and more complicated network architectures. We have extended the model to include systematic structure in the connections, modeling an orientation hypercolumn in the primary visual cortex (Hertz & Sterner, 2003). Moreover, our algorithm for finding the mean field solutions is not restricted to networks of integrate-and-fire neurons. It can be applied to any kind of neuronal model. Furthermore, synaptic depression and facilitation can be incorporated by using synaptically filtered spike trains to compute the self-consistent solutions. As we remarked earlier, if one ignores correlations in the synaptic input and neuron-to-neuron rate variations (Amit & Brunel, 1997a, 1997b), analytic self-consistent equations for the population rates can be derived. From these, one can calculate the steady-state Fano factor analytically in closed form (Brunel, 2000). Such a calculation is obviously not self-consistent, although it can give qualitative information about firing irregularity. We have done some calculations, using our single-neuron simulation methods, but imposing the Amit-Brunel approximations by hand when generating the input noise. As one could anticipate, we find that this procedure systematically underestimates Fano factors in the super-Poissonian regime, by factors of up to 2 or so at the largest values of $J_s$ studied here. One might ask: If it is necessary to resort to numerical solution of the full mean field theory anyway, why not just simulate the network directly? Our answer is that beyond the advantage of having to simulate only one neuron at a time, it is interesting to know what the predictions of mean field theory are. To the extent that they agree with network simulations, we can understand our findings in terms of single-neuron properties (albeit with self-consistent synaptic current statistics). Discrepancies would point to either finite-size or finite-concentration effects or more subtle correlation effects not included in the mean field ansatz. Identifying such effects, if they exist, would point the way toward future theoretical investigations, which could shed potentially useful light on the dynamics of these networks.
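As a minimal sketch of this single-neuron reduction (all parameter values and the balanced-network scaling below are assumed for illustration; the full theory also matches input autocorrelations, which this rate-only sketch omits), one can drive a single leaky integrate-and-fire neuron with gaussian input whose statistics depend on the population rate and iterate the rate to a fixed point:

```python
import numpy as np

def lif_firing_rate(mu, sigma, tau=0.02, theta=1.0, v_reset=0.0,
                    dt=1e-4, t_sim=10.0, seed=0):
    """Simulate one leaky integrate-and-fire neuron driven by white gaussian
    current with mean mu and amplitude sigma; return its firing rate (Hz)."""
    rng = np.random.default_rng(seed)
    v, n_spikes = 0.0, 0
    for xi in rng.standard_normal(int(t_sim / dt)):
        v += dt * (-v / tau + mu) + sigma * np.sqrt(dt) * xi
        if v >= theta:              # threshold crossing: spike and reset
            v = v_reset
            n_spikes += 1
    return n_spikes / t_sim

# Self-consistency loop: the population rate r sets the mean and variance of
# the synaptic input, and that input must in turn reproduce r (schematic
# scaling; J, K, and r_ext are illustrative values, not fitted parameters).
J, K, r_ext = 0.012, 1000, 10.0
r = 5.0                                       # initial guess (Hz)
for _ in range(40):
    mu = J * K * (r_ext - r)                  # net excitatory minus inhibitory drive
    sigma = J * np.sqrt(K * (r_ext + r))      # input fluctuation amplitude
    r += 0.1 * (lif_firing_rate(mu, sigma) - r)   # damped update for stability
print(f"self-consistent rate: {r:.1f} Hz")
```

The damped update is needed because the rate-versus-input map is steep near the balanced fixed point; a full calculation would additionally generate colored noise matching the self-consistent autocorrelation function.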
References
Amit, D., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D., & Brunel, N. (1997b). Model of spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (2000). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Neural Comp., 12, 2473–2511.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled neurons. Neural Comp., 12, 91–129.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Buracas, G. T., Zador, A. M., DeWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nature Rev. Neurosci., 4, 739–761.
DeWeese, M. R., Wehr, M., & Zador, A. M. (2003). Binary spiking in auditory cortex. J. Neurosci., 23, 7940–7949.
Eisfeller, H., & Opper, M. (1992). New method for studying the dynamics of disordered spin systems without finite-size effects. Phys. Rev. Lett., 68, 2094–2097.
Fulvi Mari, C. (2000). Random networks of spiking neurons: Instability in the Xenopus tadpole moto-neuron pattern. Phys. Rev. Lett., 85, 210–213.
Gershon, E., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortex. J. Neurophysiol., 79, 1135–1144.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17, 2914–2920.
Heggelund, P., & Albus, K. (1978). Response variability and orientation discrimination of single cells in striate cortex of cat. Exp. Brain Res., 32, 197–211.
Hertz, J., Richmond, B., & Nilsen, K. (2003). Anomalous response variability in a balanced cortical network model. Neurocomputing, 52–54, 787–792.
Hertz, J., & Sterner, G. (2003). Mean field model of an orientation hypercolumn. Soc. for Neurosci. Abstract, no. 911.19.
Kara, P., Reinagel, P., & Reid, R. C. (2000). Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 27, 635–646.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of primate cortex. J. Neurosci., 18, 1161–1170.
Lerchner, A., Ahmadi, M., & Hertz, J. (2004). High conductance states in a mean field cortical network model. Neurocomputing, 58–60, 935–940.
Lin, J. K., Pawelzik, K., Ernst, U., & Sejnowski, T. J. (1998). Irregular synchronous activity in stochastically-coupled networks of integrate-and-fire neurons. Network, 9, 333–344.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely-timed spike patterns in visual system neural responses. J. Neurophysiol., 81, 3021–3033.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Snowden, R. J., Treue, S., & Andersen, R. A. (1992). The response of neurons in areas V1 and MT of the alert rhesus monkey to moving random dot patterns. Exp. Brain Res., 88, 389–400.
Sompolinsky, H., & Zippelius, A. (1982). Relaxational dynamics of the Edwards-Anderson model and the mean-field theory of spin glasses. Phys. Rev. B, 25, 6860–6875.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Exp. Brain Res., 41, 414–419.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1371.
Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res., 77, 432–436.
Received February 18, 2004; accepted August 23, 2005.
LETTER
Communicated by Richard Zemel
The Costs of Ignoring High-Order Correlations in Populations of Model Neurons
Melchi M. Michel
[email protected]
Robert A. Jacobs
[email protected] Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
Investigators debate the extent to which neural populations use pairwise and higher-order statistical dependencies among neural responses to represent information about a visual stimulus. To study this issue, three statistical decoders were used to extract the information in the responses of model neurons about the binocular disparities present in simulated pairs of left-eye and right-eye images: (1) the full joint probability decoder considered all possible statistical relations among neural responses as potentially important; (2) the dependence tree decoder also considered all possible relations as potentially important, but it approximated high-order statistical correlations using a computationally tractable procedure; and (3) the independent response decoder assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Simulation results indicate that high-order correlations among model neuron responses contain significant information about binocular disparities and that the amount of this high-order information increases rapidly as a function of neural population size. Furthermore, the results highlight the potential importance of the dependence tree decoder to neuroscientists as a powerful but still practical way of approximating high-order correlations among neural responses.
1 Introduction
The left and right eyes of human observers are offset from each other, and, thus, the visual images received by these eyes differ. For example, an object in the visual environment may project to one location in the left eye image but to a different location in the right eye image. Differences in left eye and right eye images that arise in this manner are known as binocular disparities. Disparities are important because they are often among the most reliable cues to the relative depth of a surface or object in space.
Neural Computation 18, 660–682 (2006)
© 2006 Massachusetts Institute of Technology
Observers with normal stereo vision are typically able to make fine depth discriminations because they can resolve differences in horizontal disparities below 1 arc minute (Andrews, Glennerster, & Parker, 2001). How this is accomplished is a matter of current research. Neurophysiological and modeling studies have identified binocular simple and complex cells in primary visual cortex as a likely source of disparity information, and researchers have developed a computational model known as a binocular energy filter to characterize the responses of these cells to visual scenes viewed binocularly (DeAngelis, Ohzawa, & Freeman, 1991; Freeman & Ohzawa, 1990; Ohzawa, DeAngelis, & Freeman, 1990). Based on analyses of binocular energy filters, Qian (1994), Fleet, Wagner, and Heeger (1996), and others have argued, however, that the response of an individual simple or complex cell is ambiguous. In addition to uncertainty introduced by neural noise, ambiguities arise because a cell's preferred disparity depends on the distribution of stimulus frequencies, a cell's tuning response has multiple false peaks (i.e., the cell gives large responses to disparities that differ from its preferred disparity), and image features in a cell's left eye and right eye receptive fields may influence a cell's response even when the features do not arise from the same event in the visual world. These points suggest that in order to overcome the ambiguity of an individual neuron's responses, the neural process responsible for estimating disparity must pool the responses of a large number of neurons. Researchers studying neural codes often use statistical techniques to interpret the activities of neural populations (Abbott & Dayan, 1999; Oram, Földiák, Perrett, & Sengpiel, 1998; Pouget, Dayan, & Zemel, 2003). A matter of current debate among these investigators is the relative importance of considering dependencies, or correlations, among cells in a population when decoding the information that the cells convey about a stimulus. Correlations among neural responses have been investigated as a potentially important component of neural codes for over 30 years (Perkel & Bullock, 1969). Unfortunately, determining the importance of correlations is not straightforward. For methodological reasons, it is typically feasible only to experimentally measure pairwise or second-order correlations among neural responses, meaning that high-order correlations are not measured. Even if correlations are accurately measured, there is no guarantee that these correlations contain useful information: correlations can increase, decrease, or leave unchanged the total information in a neural population (Abbott & Dayan, 1999; Nirenberg & Latham, 2003; Seriès, Latham, & Pouget, 2004). To evaluate the importance of correlations, researchers have often compared the outputs of statistically efficient neural decoders, based on maximum likelihood or Bayesian statistical theory, that make different assumptions regarding correlations. Neural decoders are not models of neural mechanisms, but rather statistical procedures that help determine how much information neural responses contain about a stimulus by expressing this information as a probability distribution (Abbott & Dayan, 1999; Oram et al., 1998; Pouget et al., 2003).
Statistically efficient neural decoders are useful because they provide an upper bound on the amount of information about a stimulus contained in the activity of a neural ensemble. Researchers can evaluate the importance of correlations by comparing the value of this bound when it is computed by a neural decoder that makes use of correlations with the value of this bound when it is computed by a decoder that does not. Alternatively, researchers can compare the performances of neural decoders that use or do not use correlations on a stimulus-relevant task. Several recent studies have suggested that correlations among neurons play only a minor role in encoding stimulus information (e.g., Averbeck & Lee, 2003; Golledge et al., 2003; Nirenberg, Carcieri, Jacobs, & Latham, 2001; Panzeri, Schultz, Treves, & Rolls, 1999; Rolls, Franco, Aggelopoulos, & Reece, 2003), and that the independent responses of neurons carry more than 90% of the total information available in the population response (Averbeck & Lee, 2004). An important limitation of these studies is that they considered only pairwise or second-order correlations among neural responses and thus ignored high-order correlations, either by assuming multivariate gaussian noise distributions (e.g., Averbeck & Lee, 2003) or by using a short-time-scale approximation to the joint distribution of responses and stimuli (e.g., Panzeri et al., 1999; Rolls et al., 2003). These studies therefore did not fairly evaluate the information contained in the response of a neural population when correlations are considered versus when they are ignored. In a population of n neurons, there are on the order of n^p pth-order statistical interactions among neural response variables. Because this number grows rapidly with population size, computing high-order correlations is typically not computationally feasible with current computers. This does not mean, of course, that the nervous system does not make use of high-order correlations or that researchers who fail to consider high-order correlations are justified in concluding that nearly all the information in a neural code is carried by the independent responses of the neurons comprising the population. What is needed is a computationally tractable method for estimating high-order statistics, even if this is done in only an approximate way. This letter addresses these issues through the use of computer simulations of model neurons, known as binocular energy filters, whose binocular sensitivities resemble those of simple and complex cells in primary visual cortex. The responses of the model neurons to binocular views of visual scenes of frontoparallel surfaces were computed. These responses were then decoded in order to measure how much information they carry about the binocular disparities in the left eye and right eye images. Three neural decoders were simulated. The first decoder, referred to as the full joint probability decoder (FJPD), did not make any assumptions regarding statistical correlations. Because it considered all possible combinations of neural responses, it is the gold standard to which all other decoders were compared. The second decoder, known as the dependence tree decoder (DTD), is similar to the FJPD in the sense that it regarded all correlations as potentially important.
However, it used a computationally tractable method to estimate high-order statistics, albeit in an approximate way (Chow & Liu, 1968; Meilă & Jordan, 2000). The final decoder, referred to as the independent response decoder (IRD), assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Via computer simulation, we measured the percentage of information that is lost in a population of disparity-tuned cells when high-order correlations are approximated and when all correlations are ignored. We also examined the abilities of the DTD and IRD (and a decoder limited to second-order correlations) to correctly estimate the disparity of a frontoparallel surface. The results reveal several interesting findings. First, relative to the amount of information about disparity calculated by the FJPD, the amounts of information calculated by the IRD and DTD were proportionally smaller when more model neurons were used. In other words, the informational cost of ignoring correlations or of roughly approximating high-order correlations increased as a function of neural population size. This implies that there is a large amount of information about disparity conveyed by second-order and high-order correlations among model neuron responses. Second, the informational cost of ignoring all correlations (as in the IRD) rose as the number of neural response levels increased. For example, relative to the amount of information calculated by the FJPD, the amount of information calculated by the IRD was smaller when neuron responses were discretized to eight levels (3 bits of information about a neural response) than when they were discretized to four levels (2 bits of information about each neural response). This trend was less evident for the DTD. Third, when used to estimate the disparity in a pair of left eye and right eye images, the DTD consistently outperformed the IRD, and the magnitude of its performance advantage increased rapidly as the neural population size increased and as the number of response levels increased. Because the DTD also outperformed a neural decoder based on a multivariate gaussian distribution, our data again indicate that high-order correlations among model neuron responses contain significant information about binocular disparities. These results have important implications for researchers studying neural codes. They suggest that earlier studies indicating that independent neural responses carry the vast majority of information conveyed by a neural population may be flawed because these studies limited their investigations to second-order correlations and thus did not examine high-order correlations. Furthermore, these results highlight the potential importance of the DTD to neuroscientists. This decoder uses a technique developed in the engineering literature (Chow & Liu, 1968; Meilă & Jordan, 2000), but seemingly unknown in the neuroscientific literature, to approximate high-order statistics. Significantly, it does so in a way that is computationally tractable: the calculation of the approximation requires only knowledge about pairs of neurons. This fact, in the context of the results summarized above, suggests that the DTD can replace the IRD as a better, but still practical, approximation to the information contained in a neural population.
2 Simulated Images
The simulated images were created in a manner similar to the method used by Lippert and Wagner (2002), with the difference that the texture elements used by those authors were random black and white dots, whereas the elements that we used were white noise (luminances were real-valued, as in Tsai & Victor, 2003). Each image depicted a one-dimensional frontoparallel surface on which were painted dots whose luminance values were chosen from a uniform distribution to take values between 0 (dark) and 1 (light). A virtual observer who maintained fixation at a constant depth and horizontal position in the scene viewed the surface as its depth was varied among 15 possible depth values relative to the fixation point. One of these depth values was the depth of the fixation plane; of the remaining depths, 7 were located farther than the fixation point from the observer, and 7 were located nearer the observer. Each image of a scene extended over 5 degrees of visual angle and was divided into 186 pixels per degree. Because each pixel's luminance value was chosen randomly from a uniform distribution, an image contained approximately equal power at all frequencies between 0 cycles per degree and 93 cycles per degree (the Nyquist frequency). For each stereo pair, the left image was generated first; then the right image was created by shifting the left image to the right by a particular number of pixels (this was done with periodic borders; e.g., pixel values that shifted past the right border were assigned to pixels near the left border). This shift varied between −7 and 7 pixels, so that the shift was negative when the surface was nearer the observer, zero when the surface was located at the fixation plane, and positive when the surface was located beyond fixation.
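A minimal sketch of this stimulus-generation procedure (the function and parameter names are ours, chosen for illustration):

```python
import numpy as np

def make_stereo_pair(shift_px, n_pixels=5 * 186, rng=None):
    """Generate a one-dimensional white-noise stereo pair. The right image is
    the left image shifted to the right by shift_px pixels with periodic
    (wrap-around) borders; negative shifts correspond to surfaces nearer than
    fixation, zero to the fixation plane, positive to surfaces beyond it."""
    rng = rng or np.random.default_rng()
    left = rng.uniform(0.0, 1.0, n_pixels)   # luminances in [0, 1]
    right = np.roll(left, shift_px)          # wrap-around shift
    return left, right

# The 15 depths correspond to pixel shifts of -7, ..., 7.
pairs = {shift: make_stereo_pair(shift) for shift in range(-7, 8)}
```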
3 Model Neurons
Model neurons were instances of binocular energy filters, which are computational models developed by Ohzawa et al. (1990). We used binocular energy filters because they provide a good approximation to the binocular sensitivities of simple and complex cells in primary visual cortex. The fidelity of the energy model with respect to the responses of binocular simple and complex cells has been demonstrated in both cat area 17 (Anzai, Ohzawa, & Freeman, 1997; Ohzawa et al., 1990; Ohzawa, DeAngelis, & Freeman, 1997) and in macaque V1 (Cumming & Parker, 1997; Perez, Castro, Justo, Bermudez, & Gonzalez, 2005; Prince, Pointon, Cumming, & Parker, 2002). Although modifications and extensions to the model have been proposed by different researchers (e.g., Fleet et al., 1996; Qian & Zhu, 1997; Read & Cumming, 2003; Tsai & Victor, 2003), the basic form of the energy model remains a widely accepted representation of simple and complex cell responses to binocular stimuli. A simple cell is modeled as comprising left eye and right eye receptive subfields. Each subfield is modeled as a Gabor function, which is a sinusoid multiplied by a gaussian envelope. We used the phase-shift version of the binocular energy model, meaning that the retinal positions of the gaussian envelopes for the left eye and right eye Gabor functions are identical, though the sinusoidal components differ by a phase shift. Formally, the left ($g_l$) and right ($g_r$) simple cell subfields are expressed as the following Gabor functions:

$$g_l = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2} \sin(2\pi\omega x + \phi) \qquad (3.1)$$

$$g_r = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2} \sin(2\pi\omega x + \phi + \delta\phi), \qquad (3.2)$$
where $x$ is the distance to the center of the gaussian, the variance $\sigma^2$ specifies the width of the gaussian envelope, $\omega$ represents the frequency of the sinusoid, $\phi$ represents the base phase of the sinusoid, and $\delta\phi$ represents the phase shift between the sinusoids in the right and left subfields. The response of a simple cell is formed in two stages: first, the convolution of the left eye image with the left subunit Gabor is added to the convolution of the right eye image with the right subunit Gabor; next, this sum is rectified. The response of a complex cell is the sum of the squared outputs of two simple cells whose parameter values are identical except that one has a base phase of 0 and the other has a base phase of π/2.¹ In our simulations, the gaussian envelopes for all neurons were centered at the same point in the visual scene. The parameter values that we used were randomly sampled from the same distributions used by Lippert and Wagner (2002); these investigators picked distributions based on neurophysiological data regarding spatial frequency selectivities of neurons in macaque visual cortex. Preferred spatial frequencies were drawn from a log-normal distribution whose underlying normal distribution had a mean of 1.6 cycles per degree and a standard deviation of 0.7 cycle per degree. The range of these preferred frequencies was clipped at a ceiling value of 20 cycles per degree and a floor value of 0.4 cycle per degree. The simple cells' receptive field sizes were sampled from a normal distribution with a mean of 0.5 period and a standard deviation of 0.25 period, with a floor value of 0.1 period. A cell's preferred disparity, given by $\delta\phi/(2\pi\omega)$, was sampled from a normal distribution with a mean of 0 degrees of visual angle and a standard deviation of 0.5 degree. Figure 1 shows the normalized responses of a typical model complex cell to three different scenes, each using a different white noise pattern to cover the frontoparallel surface.
¹ Note that binocular energy filters are deterministic. The probability distributions we use have nonzero variances because the white noise visual stimuli are stochastic.
[Figure 1 here; axes: normalized response versus stimulus disparity (degrees), with three curves labeled image 1, image 2, and image 3.]
Figure 1: Characteristic responses of an individual model neuron as a function of the disparity (in degrees of visual angle) of the presented surface. The three curves show the normalized responses of a single model binocular energy neuron to each of three sample surfaces presented along a range of disparities (from −0.2 to 0.2 degree). The vertical dotted line indicates the cell's preferred disparity (−0.0417 degree). This figure illustrates the fact that an individual model neuron's response depends on many factors and thus is an ambiguous indicator of stimulus disparity.
Each of the lines in the figure represents the responses of the model neuron as the disparity of a surface was varied. The neuron responded differently to different surfaces, illustrating that a single neuron's response is an ambiguous indicator of stimulus disparity. This finding motivates the importance of decoding the activity of a population of neurons rather than that of a single neuron (Fleet et al., 1996; Qian, 1994).
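A minimal sketch of a model complex cell built from equations 3.1 and 3.2 (simplified: the convolutions are evaluated only at the receptive field center, so they reduce to dot products, and the parameter values are illustrative rather than sampled from the distributions above):

```python
import numpy as np

def gabor(x, sigma, omega, phi):
    """Gabor subfield (equations 3.1 and 3.2): gaussian envelope times sinusoid."""
    return (np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
            * np.sin(2 * np.pi * omega * x + phi))

def complex_cell_response(left, right, x, sigma, omega, dphi):
    """Binocular energy model: sum of squared, rectified responses of two
    simple cells in quadrature (base phases 0 and pi/2)."""
    energy = 0.0
    for phi in (0.0, np.pi / 2):
        gl = gabor(x, sigma, omega, phi)          # left-eye subfield
        gr = gabor(x, sigma, omega, phi + dphi)   # right-eye subfield, phase-shifted
        s = max(np.dot(gl, left) + np.dot(gr, right), 0.0)  # rectified simple cell
        energy += s**2
    return energy

# Example: a 5-degree strip at 186 pixels per degree; right image shifted 3 pixels.
rng = np.random.default_rng(1)
x = np.linspace(-2.5, 2.5, 5 * 186)               # visual angle (degrees)
left = rng.uniform(0.0, 1.0, x.size)
right = np.roll(left, 3)
print(complex_cell_response(left, right, x, sigma=0.3, omega=1.6, dphi=0.5))
```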
4 Neural Decoders
Neural decoders are statistical devices that estimate the distribution of a stimulus parameter based on neural responses. Three different decoders evaluated p(d | r), the distribution of disparity, denoted d, given the responses of the model complex cells, denoted r. The decoders differ in their assumptions about the importance of correlations among neural responses.
4.1 Full Joint Probability Decoder. The FJPD is the simplest of the decoders used, but it also has the highest storage cost, since it requires representing the full joint distribution of disparity and complex cell responses, p(d, r). This distribution has s·b^n states, where s is the number of possible binocular disparities, b is the number of bins or response levels (i.e., each complex cell response was discretized to one of b values), and n is the number of complex cells in the population. The conditional distribution of disparity was calculated as

$$p_{\mathrm{full}}(d \mid r) = \frac{p(d, r)}{p(r)}, \qquad (4.1)$$
where the joint distribution p(d, r) and marginal distribution p(r) were computed directly from the complex cell responses to the visual scenes (histograms giving the frequencies of each of the possible values of r and (d, r) were generated and then normalized; see below). The result of equation 4.1 represents the output of the FJPD.
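A minimal sketch of this counting procedure (our own illustration; it assumes responses have already been discretized, and falls back to a uniform posterior for response vectors never seen in training):

```python
import numpy as np
from collections import Counter

def fit_fjpd(train_disp, train_resp, disparities):
    """Full joint probability decoder: tabulate p(d, r) by counting each
    observed (disparity, response-vector) pair, then condition on r.
    train_resp is an (N, n) array of discretized responses."""
    counts = Counter(zip(train_disp, map(tuple, train_resp)))
    total = sum(counts.values())

    def posterior(r):
        """Return p(d | r) over the listed disparities (equation 4.1)."""
        joint = np.array([counts[(d, tuple(r))] for d in disparities]) / total
        if joint.sum() == 0:
            return np.full(len(disparities), 1.0 / len(disparities))
        return joint / joint.sum()

    return posterior
```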
4.2 Dependence Tree Decoder. The DTD makes use of a data structure and learning algorithm originally proposed in the engineering literature (Chow & Liu, 1968; see also Meilă & Jordan, 2000). It can be viewed as an instance of a graphical model or Bayesian network, a type of model that is currently popular in the artificial intelligence community (Neapolitan, 2004). The basic idea underlying Bayesian networks is that a joint distribution over a set of random variables can be represented by a graph in which nodes correspond to variables and directed edges between nodes correspond to statistical dependencies (e.g., an edge from node x1 to node x2 means that the distribution of variable x2 depends on the value of variable x1; as a matter of terminology, node x1 is referred to as the parent of x2). Dependence trees are Bayesian networks that are restricted in the following ways: (1) the graphical model must be a tree (i.e., ignoring the directions of edges, there are no loops in the graph, meaning that there is exactly one path between every pair of nodes); (2) there is one node that is the root of the tree, and this node has no parents; and (3) all other nodes have exactly one parent. A dependence tree is a graphical representation of the following factorization of a joint distribution:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(i)), \qquad (4.2)$$
[Figure 2 here: a tree rooted at r1 with edges r1→r3, r1→r6, r6→r7, r6→r4, r6→r5, and r4→r2, each edge labeled with the conditional distribution of the child given its parent.]
Figure 2: An example of a dependence tree. Each of the nodes r1, ..., r7 represents a random variable, such as the response of a model neuron. The edges (depicted as arrows) represent the conditional dependencies between variables and are labeled with the conditional distribution of a child variable given its parent, p(child | parent). According to this tree, the joint distribution of these variables is factorized as follows: p(r1, ..., r7) = p(r1) p(r3 | r1) p(r6 | r1) p(r7 | r6) p(r4 | r6) p(r5 | r6) p(r2 | r4).
where p(x1, ..., xn) is the joint distribution of variables x1, ..., xn and p(xi | pa(i)) is the conditional distribution of variable xi given the value of its parent (if xi is the root of the tree, then p(xi | pa(i)) = p(xi)). Figure 2 depicts an example of a dependence tree. Of course, not all joint distributions can be factorized in this way. When a distribution cannot be so factorized, the right-hand side of equation 4.2 gives an approximation to the joint distribution. How can good approximations be discovered?
Chow and Liu (1968) developed an algorithm for finding such approximations and proved that the resulting approximation maximizes the likelihood of the data over all tree distributions. In short, the algorithm has three steps: (1) compute all pairwise marginal distributions p(xi, xj), where xi and xj are a pair of random variables; (2) compute all pairwise mutual informations Iij; and (3) compute the maximum weight spanning tree using Iij as the weight for the edge between nodes xi and xj. This spanning tree is the dependence tree.² Importantly for our purposes, the algorithm has quadratic time complexity in the number of random variables, linear space complexity in the number of random variables, and quadratic space complexity in the number of response levels. That is, discovering the dependence tree that approximates the joint distribution among a set of variables will often be computationally tractable. The dependence tree decoder computes a dependence tree to approximate the joint distribution of complex cell responses given a binocular disparity value.³ This approximation is denoted p_tree(r | d). Using Bayes' rule, the distribution of disparity given cell responses is given by

$$p_{\mathrm{tree}}(d \mid r) = \frac{p_{\mathrm{tree}}(r \mid d)\, p(d)}{p(r)}, \qquad (4.3)$$

where p(d), the distribution of disparities, is a uniform distribution (i.e., all disparities are equally likely), and p(r), the distribution of cell responses, is computed by marginalizing p_tree(r | d) over d. Equation 4.3 is the output of the DTD.
4.3 Independent Response Decoder. Using Bayes' rule, we can rewrite the probability of a disparity d given a response r as

$$p(d \mid r) = \frac{p(r \mid d)\, p(d)}{p(r)}, \qquad (4.4)$$
where p(d) is the prior distribution of binocular disparities and p(r) is a distribution over complex cell responses. Because all disparities were equally likely, we set p(d) to be a uniform distribution. Consequently,
$$p(d \mid r) = k\, p(r \mid d), \qquad (4.5)$$
² The spanning tree is an undirected graph. Our simulations used an equivalent directed graph obtained by choosing an arbitrary node to serve as the root of the tree. The directionality of all edges follows from this choice (all edges point away from the root).
³ Our data structure can be regarded as a mixture of trees in which there is one mixture component (i.e., one dependence tree) for each possible disparity value (Meilă & Jordan, 2000).
where k is a normalization factor equal to p(d)/p(r). The distinguishing feature of the independent response decoder (IRD) is that it assumes that the complex cell responses are statistically independent given the binocular disparity. In other words, the conditional joint distribution of cell responses is equal to the product of the distributions for the individual cells, that is, $p(r \mid d) = \prod_i p(r_i \mid d)$, where r_i is the response of the ith complex cell. Equation 4.5 can therefore be rewritten as

$$p_{\mathrm{ind}}(d \mid r) = k \prod_i p(r_i \mid d). \qquad (4.6)$$
The distribution of disparity as computed by equation 4.6 is the output of the IRD. The conditional distributions for individual cells, p(r_i | d), were approximated in our simulations by normalized histograms based on cell responses to visual scenes.
4.4 Response Histograms. Normalized relative frequency histograms were used in our simulations to approximate the distributions of cell responses. In these histograms, each cell's response was discretized to one of b bins or response levels. This discretization was based on a cell's maximum observed response value. Our procedure was similar to that used by Lippert and Wagner (2002), with one important difference. Because the probability of a response was a rapidly decreasing function of response magnitude, Lippert and Wagner created bins representing responses from zero to half of the maximum observed response value and grouped all responses greater than half-maximum into the final bin. This was necessary to avoid bins corresponding to response values that never (or rarely) occurred. To deal with this same problem, we created histograms whose bin values were a logarithmic function of cell response.⁴
⁴ Specifically, histograms were created as follows. A cell's responses were first linearly normalized by dividing each response by that cell's maximum response across all stimuli. Next, each normalized response was discretized into one of b bins whose boundaries were logarithmically spaced. To get probabilities of responses given a disparity, bin counts were appropriately normalized and then smoothed using a gaussian kernel whose standard deviation equaled one-quarter of a bin width. This was done to avoid probabilities equal to zero.
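The following sketch (our own illustration) combines the logarithmic binning just described with the IRD of equation 4.6 and reads out a maximum a posteriori disparity; for simplicity it replaces the gaussian-kernel smoothing with a small additive floor, which serves the same purpose of avoiding zero probabilities:

```python
import numpy as np

def discretize(responses, b, floor=1e-3):
    """Map normalized responses in [0, 1] to b logarithmically spaced bins."""
    edges = np.geomspace(floor, 1.0, b)               # upper edges of the b bins
    return np.minimum(np.searchsorted(edges, responses), b - 1)

def fit_conditionals(train_resp, train_disp, disparities, b, eps=1e-6):
    """Estimate p(r_i | d) for every cell i and disparity d; train_resp is an
    (N, n) integer array of discretized responses, train_disp an (N,) array."""
    n = train_resp.shape[1]
    p = np.full((len(disparities), n, b), eps)        # additive floor, not smoothing
    for k, d in enumerate(disparities):
        rows = train_resp[train_disp == d]
        for i in range(n):
            p[k, i] += np.bincount(rows[:, i], minlength=b)
    return p / p.sum(axis=2, keepdims=True)

def ird_map_estimate(r, p, disparities):
    """Equation 4.6 in log form: log p(d | r) = const + sum_i log p(r_i | d)."""
    cells = np.arange(len(r))
    log_post = np.log(p[:, cells, r]).sum(axis=1)     # one term per disparity
    return disparities[int(np.argmax(log_post))]
```

A DTD readout proceeds the same way except that the log-likelihood sums the tree's conditional terms, log p(r_child | r_parent, d), instead of the independent ones.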
5 Simulation Results
Two sets of simulations were conducted. The goal of the first set was to compute the informational costs of using the approximate distributions calculated by the IRD, p_ind(d | r), or the DTD, p_tree(d | r), instead of the exact distribution calculated by the FJPD, p_full(d | r). To quantify these costs, we used an information-theoretic measure, referred to as ΔI/I, introduced by Nirenberg et al. (2001). We chose this measure because, unlike other measures of information difference such as I_shuffled (Nirenberg & Latham, 2003; Panzeri et al., 2002) and I_synergy (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000), it is sensitive only to dependencies that are relevant for decoding (Nirenberg & Latham, 2003).⁵ In brief, ΔI/I can be characterized as follows. The numerator of this measure is the Kullback-Leibler distance between the exact distribution and an approximating distribution. This distance is normalized by the mutual information between a stimulus property (e.g., the disparity d) and the neural responses r based on the exact distribution. A small value of ΔI/I means that the decoding produced by an approximate distribution contains similar amounts of information about the stimulus property as the decoding produced by the exact distribution, whereas a large value means that the approximate decoding contains much less information than the exact decoding.
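Written out (one plausible rendering of the Nirenberg et al., 2001, measure; q denotes the approximating distribution, either p_ind or p_tree):

$$\frac{\Delta I}{I} \;=\; \frac{\sum_{r} p(r) \sum_{d} p_{\mathrm{full}}(d \mid r)\, \log_2 \dfrac{p_{\mathrm{full}}(d \mid r)}{q(d \mid r)}}{\sum_{r} p(r) \sum_{d} p_{\mathrm{full}}(d \mid r)\, \log_2 \dfrac{p_{\mathrm{full}}(d \mid r)}{p(d)}},$$

where the numerator is the response-averaged Kullback-Leibler distance and the denominator is the mutual information between d and r under the exact distribution.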
⁵ The best way to measure the distance between two distributions for the purposes of neural decoding is a topic of ongoing scientific discussion (e.g., Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003).
[Figure 3 here: bar chart of ΔI/I for the conditions (n=4,b=3), (n=4,b=8), (n=8,b=3), (n=8,b=8), and (n=16,b=3).]
Figure 3: The informational cost ΔI/I of using the dependence tree decoder (DTD; light bars) or the independent response decoder (IRD; dark bars) as a function of population size (n) and the number of discretized response levels (b). Error bars represent the standard errors of the means.
Simulations were conducted with a variety of neural population sizes (denoted n) and numbers of bins or response levels (denoted b). Neural population sizes were kept small because of the computational cost of computing the exact distribution p_full(d | r). Note that the number of possible values of r equals b^n; for example, if n = 8 and b = 8, then r can take 16,777,216 possible values. Fortunately, in practice, r took a smaller number of values by a factor of about 100, allowing us to compute p_full(d | r) using fewer presentations of visual scenes than would otherwise be the case. We used the responses of model neurons to a collection of 3 × 10⁶ visual scenes in which the frontoparallel surface was located at all possible depths (15 possible depths × 200,000 scenes per depth) to compute each of the probability distributions p_full(d | r), p_tree(d | r), and p_ind(d | r). This process was repeated six times for each combination of neural population size n and number of bins b. The repetitions differed in the parameter values (e.g., spatial frequencies, receptive field sizes) used by the model neurons.
The results are illustrated in Figure 3. The horizontal axis represents the simulation condition (combination of n and b), and the vertical axis represents the measure ΔI/I. Dark bars give the value of this measure for the IRD, and light bars give the value for the DTD. The error bars indicate the standard errors of the means based on six repetitions of each condition. There are at least two interesting trends in the data. First, for both the IRD and DTD approximations, the information cost grows with the size of the neural population. In other words, the approximate distributions provided by these decoders become poorer relative to the exact distribution as the neural population grows in size. A two-way (decoder by population size) ANOVA across the b = 3 conditions confirmed that this effect is significant (F(2,35) = 22.15; p < 0.001), with no significant decoder by population size interaction. This trend is not surprising given that the number of possible high-order correlations grows rapidly with the number of neurons in a population. This result has important implications. Many studies that have attempted to measure the information lost by assuming independence among neural responses have approximated the exact joint distribution with a distribution that takes into account only second-order dependencies (e.g., Abbott & Dayan, 1999; Averbeck & Lee, 2003; Golledge et al., 2003; Nirenberg et al., 2001; Panzeri et al., 1999; Rolls et al., 2003; Seriès et al., 2004). Our results suggest that the difference in relevant information between an approximation based on the assumption that responses are independent and an approximation based on second-order correlations may greatly underestimate the information difference that investigators actually care about: the difference between an approximation based on statistical independence and the exact distribution. If so, this may account for why previous investigators concluded that most of the useful information is in the independent responses of individual neurons.
A second trend in our data is an increase in information cost as the number of discrete response levels increases. This trend is unsurprising, as we would expect the differences between exact and inexact distributions to increase as the resolution of neuron responses increases. A three-way ANOVA (decoder by population size by response levels) confirmed that this trend is significant (F(1,47) = 9.49; p < 0.01), along with a main effect for decoder type (F(1,47) = 4.35; p < 0.05) and a decoder by response levels interaction (F(1,47) = 5.05; p < 0.05), indicating that the effect is significantly greater and more pronounced, respectively, for the IRD than for the DTD. In summary, the results of the first set of simulations suggest that the cost of ignoring or approximating statistical dependencies becomes greater with larger populations and also may tend to increase with more neural response levels.
A limitation of the first set of simulations is that the excessive computational cost of calculating the exact distribution p_full(d | r) prevented us from examining large population sizes. Therefore, a second set of simulations was conducted in which we evaluated the IRD and DTD with large populations using a performance measure that compared the disparity predicted by a decoder with the true disparity present in a pair of left eye and right eye images. The disparity predicted by a decoder was the disparity with the highest conditional probability (i.e., the disparity that maximized p(d | r), known as the maximum a posteriori estimate). The distributions p_ind(d | r) and p_tree(d | r) generated by the IRD and DTD, respectively, were computed on the basis of 150,000 visual scenes in which the frontoparallel surface was located at all possible depths (15 possible depths × 10,000 scenes per depth). However, the performances of the decoders were measured using a different set of scenes. This set consisted of 1400 scenes in which the surface was located at the central seven depths (possible disparities ranged from −3 to 3 pixels × 200 scenes per disparity).
The simulation results are illustrated in Figure 4. The horizontal axis indicates the simulation condition (neural population size n and number of response levels b), and the vertical axis indicates the root mean squared (RMS) error of the disparity estimate. Dark bars give the RMS error value for the IRD, and light bars give the value for the DTD. The error bars indicate the standard errors of the means based on six repetitions of each condition. A three-way ANOVA showed significant main effects for population size (F(2,107) = 9.83; p < 0.0001), for decoder (F(1,107) = 343.55; p < 0.0001), and for the number of discretized response levels (F(2,107) = 12.71; p < 0.0001), along with significant effects (p < 0.0001) for all two-way interactions. Three primary trends can be gleaned from these combined effects. First, performance for the DTD improved as the population size increased. This was also found for the IRD in the b = 5 condition. This trend is unsurprising, as we would expect the amount of information to increase with the size of a neural population.
[Figure 4 here: bar chart of RMS error for the conditions (n=16,b=5) through (n=64,b=20).]
Figure 4: Root mean squared (RMS) error (in pixels) for the DTD (light bars) and IRD (dark bars) as a function of population size (n) and number of discretized response levels (b). RMS errors were calculated by comparing the maximum a posteriori estimates of disparity given by the decoders with the true disparities over 1400 test trials (or novel visual scenes). Error bars indicate the standard errors of the means.
Second, the performance of the DTD became significantly better than that of the IRD with increases in population size, suggesting that the proportion of information about disparity contained in high-order correlations increases with population size compared with the proportion stored in the independent responses of model neurons. Third, the performance of the IRD decreased as the number of discretized response levels increased. In contrast, the performance of the DTD showed the opposite trend; for example, its performance improved slightly from the b = 5 to b = 10 conditions. This trend may seem surprising given that the number of parameters estimated by the DTD grows quadratically with b, while the number of parameters estimated by the IRD grows only linearly. However, the DTD is capable of representing much richer distributions than the IRD.
Increasing the number of discretized response levels, like increasing the number of neurons in a population, increases the possible complexity of correlations. To the extent that information about a stimulus is contained in the possibly high-order response correlations of a neural population, we should expect that any decoder that takes into account these correlations will perform better than the IRD, which, by definition, discards all information in these correlations. Similar to the results of the first set of simulations, the results of the second set suggest that much of the information about disparity is carried by statistical dependencies among model neuron responses. These results do not, however, indicate whether the information carried by response dependencies is limited to second-order dependencies or whether higher-order dependencies also need to be considered. To examine this issue, we evaluated the performance of a decoder that was limited to second-order statistics; it approximated the distribution of neural responses given a disparity, p(r | d), with a multivariate gaussian distribution whose mean vector and covariance matrix were estimated using a maximum likelihood procedure. The performance of this decoder is not plotted because the decoder consistently generated a prediction of disparity equal to 0 pixels (the frontoparallel surface at the depth of the fixation point) regardless of the true disparity in the left eye and right eye images. A decoder that was forced to use a diagonal covariance matrix produced the same behavior. The poor performances of these decoders are not surprising given that the marginal distributions of an individual neuron's response given a disparity, p(r_i | d), are highly nongaussian. The horizontal axis of the graph in Figure 5 represents a normalized response of a typical model neuron, and the vertical axis represents the probability that the neuron will give each response. The light bars indicate the probability when the disparity in a pair of images equals the preferred disparity of the neuron, and the dark bars indicate the probability when the image disparity is different from the neuron's preferred disparity. In both cases, the probability distributions are highly nongaussian; the distributions peak near a response of zero (the neuron most frequently gives a small response) and have relatively long tails (especially the distribution for the case when the image and preferred disparities are equal). This finding is consistent with earlier results, such as those reported by Lippert and Wagner (2002; see Figure 3).
A possible objection to the simulations discussed so far is that they used a very large number of training stimuli. In contrast, neuroscientists use much smaller data sets, and there is no guarantee that the results we have found will also hold when using fewer data items. To address this issue, we conducted new simulations with a relatively small data set (100 training samples for each disparity). Figure 6 shows the results for the IRD and the DTD when the population size was set to 64 neurons and the number of response levels was set to either 5, 10, or 20. Again, the DTD consistently outperformed the IRD, and the trends described above for the large training set appear to hold for the small training set too.
[Figure 5 here: histogram of probability of response versus normalized response (0 to 1), for preferred and nonpreferred disparities.]
Figure 5: Sample response histograms for a typical model neuron. The black bars indicate the probability of a normalized response to an image pair with the cell's preferred disparity, and the white bars indicate the probability of a response to an image pair with an arbitrarily selected nonpreferred disparity. Note that cell responses are highly nongaussian; the probability distributions are skewed, with a peak at very low responses and tails at higher response values. In general, as the selected disparity deviates from the preferred disparity, the mass of the response distribution becomes increasingly concentrated at zero.
A second possible objection to the simulations discussed above is that they used white-noise stimuli; frontoparallel surfaces were covered with dots whose luminance values were independently sampled from a uniform distribution ranging from 0 (dark) to 1 (light). We chose these stimuli for several reasons. White noise stimuli have simple properties that make them amenable to mathematical analyses. Consequently, they have played an important role in engineering, neuroscientific, and behavioral studies. In addition, for our current purposes, we are interested in how binocular disparities can be evaluated in the absence of form information. Furthermore, white noise stimuli do not contain correlations across spatial frequency bands, and thus their use should not introduce biases into our evaluations of the role of high-order correlations when decoding populations of model neurons. Despite the motivations for the use of white noise stimuli, natural visual stimuli have very different properties. Images of natural scenes usually contain a great deal of form information and contain energy in a large range of spatial frequency bands.
[Figure 6 here: bar chart of RMS error for the conditions (n=64,b=5), (n=64,b=10), and (n=64,b=20).]
Figure 6: RMS error of the maximum a posteriori disparity estimate provided by the DTD (light bars) and IRD (dark bars) as a function of the number of discretized response levels (b) in the small training sample case. These data were generated using a fixed population size (n = 64) and using only 100 training samples per disparity rather than the 10,000 training samples per disparity used to generate the data in Figure 4.
Because of dependencies in the energies across frequency bands, we expect that high-order correlations in model neuron responses to natural stimuli should be important during neural decoding, as was found when using white noise stimuli. To partially evaluate this prediction, we repeated some of the preceding simulations using more "naturalistic" stimuli.⁶
⁶ Ideally, we would have liked to conduct simulations using left eye and right eye images of natural scenes. Unfortunately, this was not possible for a variety of reasons. Perhaps most important, there are no available databases, to our knowledge, of large numbers of left eye and right eye images of natural scenes taken by well-calibrated camera systems that include ground truth information (e.g., true disparity or depth at each point in the scene).
[Figure 7 here: bar chart of RMS error for the DTD, IRD, and MVG decoders at (n=64,b=5), (n=64,b=10), and (n=64,b=20).]
Figure 7: RMS error of the maximum a posteriori disparity estimate provided by the DTD (white bars) and IRD (gray bars) as a function of the number of discretized response levels (b), along with the performance of a multivariate gaussian fitted to the training data (black bar), when the training and test surfaces were painted with 1/f noise rather than white noise. These data were generated using a fixed population size (n = 64) and using 10,000 training samples per disparity.
In these new simulations, we exploited the fact that the amplitude spectra of natural images fall as approximately 1/f (Burton & Moorhead, 1987; Field, 1987, 1994; Tolhurst, Tadmor, & Tang, 1992). We generated left eye and right eye images in the manner described above for the white-noise stimuli, with the exception that each image was a "noise texture" with a 1/f amplitude spectrum; the luminance values of the dots on a surface were independently sampled from a uniform distribution and then passed through a 1/f filter (i.e., the luminance values were Fourier transformed, the amplitude at each frequency f was multiplied by 1/f, and the result was inverse Fourier transformed; in addition, the images resulting from this process were normalized so that their luminance values fell in the range from 0 to 1). The graph in Figure 7 shows the results for the IRD and the DTD based on a population of 64 neurons. As was the case with white noise stimuli, the DTD consistently outperformed the IRD, though the performance of both decoders was markedly worse with the 1/f-noise stimuli.
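A sketch of the filtering step just described (our own implementation):

```python
import numpy as np

def one_over_f_texture(n_pixels, rng=None):
    """White-noise luminances passed through a 1/f filter: Fourier transform,
    divide the amplitude at each nonzero frequency f by f (leaving the DC
    component alone), inverse transform, and renormalize to [0, 1]."""
    rng = rng or np.random.default_rng()
    lum = rng.uniform(0.0, 1.0, n_pixels)
    spectrum = np.fft.rfft(lum)
    freqs = np.fft.rfftfreq(n_pixels)
    spectrum[1:] /= freqs[1:]               # 1/f amplitude falloff
    filtered = np.fft.irfft(spectrum, n_pixels)
    filtered -= filtered.min()
    return filtered / filtered.max()
```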
Costs of Ignoring High-Order Correlations
679
consistent with our earlier conclusions that high-order correlations among model neuron responses contain significant information about binocular disparities. 6 Summary Investigators debate the extent to which neural populations use pairwise and higher-order statistical dependencies among neural responses to represent information about a visual stimulus. To study this issue, we used three statistical decoders to extract the information in the responses of model neurons about the binocular disparities present in simulated pairs of left eye and right eye images. The full joint probability decoder (FJPD) considered all possible statistical relations among neural responses as potentially important. The dependence tree decoder (DTD) also considered all possible relations as potentially important, but it approximated high-order statistical correlations using a computationally tractable procedure. Finally, the independent response decoder (IRD) assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Two sets of simulations were performed. The first set examined the informational cost of ignoring all correlations or of approximating high-order correlations by comparing the IRD and DTD with the FJPD. The second set compared the performances of the IRD and DTD on a binocular disparity estimation task when neural population size and number of response levels were varied. The results indicate that high-order correlations among model neuron responses contain significant information about disparity and that the amount of this high-order information increases rapidly as a function of neural population size. In addition, the DTD consistently outperformed the IRD (and also a decoder based on a multivariate gaussian distribution) on the disparity estimation task, and its performance advantage increased with neural population size and the number of neural response levels. These results raise the possibility that previous researchers who have ignored pairwise or high-order statistical dependencies among neuron responses, or who have examined the importance of statistical dependencies in a way that limited their evaluation to pairwise dependencies may not be justified in doing so. Moreover, the results highlight the potential importance of the dependence tree decoder to neuroscientists as a powerful but still practical way of approximating high-order correlations among neural responses. Finally, the strengths and limitations of this work highlight important areas for future research. For example, future investigations will need to make use of databases of natural images, such as databases with many pairs of right eye and left eye images of natural scenes taken by wellcalibrated camera systems, along with ground-truth information about each scene (e.g., depth or disparity information at every point in a scene). Such a database for the study of binocular vision in natural scenes does not
680
M. Michel and R. Jacobs
currently exist. In addition, future computational work will need to use more detailed neural models, such as models of populations of neurons that communicate via action potentials and models of individual neurons that include ion kinetics. We expect that the results reported here will generalize to these more realistic situations, but further work is needed to test this prediction. Acknowledgments We thank A. Pouget for encouraging us to study the contributions to neural computation of high-order statistical dependencies among neuron responses and thank F. Klam and A. Pouget for many helpful discussions on this topic. This work was supported by NIH research grant RO1-EY13149. References Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101. Andrews, T. J., Glennerster, A., & Parker, A. J. (2001). Stereoacuity thresholds in the presence of a reference surface. Vision Research, 41, 3051–3061. Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proceedings of the National Academy of Sciences, 94, 5438–5443. Averbeck, B. B., & Lee, D. (2003). Neural noise and movement-related codes in the macaque supplementary motor area. Journal of Neuroscience, 23, 7630–7641. Averbeck, B. B., & Lee, D. (2004). Coding and transmission of information by neural ensembles. Trends in Neurosciences, 27, 225–230. Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552. Burton, G. J., & Moorhead, I. R. (1987). Color and spatial structure in natural scenes. Applied Optics, 26, 157–170. Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467. Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283. DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159. Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379– 2394. Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601. Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Neural encoding of binocular disparity: Energy models, position shifts, and phase shifts. Vision Research, 36, 1839–1857. Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Research, 30, 1661–1676.
Costs of Ignoring High-Order Correlations
681
Golledge H. D. R., Panzeri, S., Zheng, F., Pola, G., Scannell, J. W., Giannikopoulos, D. V., Mason, R. J., Tov´ee, M. J., & Young, M. P. (2003). Correlations, featurebinding and population coding in primary visual cortex. NeuroReport, 14, 1045– 1050. Lippert, J., & Wagner, H. (2002). Visual depth encoding in populations of neurons with localized receptive fields. Biological Cybernetics, 87, 249–261. Meil˘a, M., & Jordan, M. I. (2000). Learning with mixtures of trees. Journal of Machine Learning Research, 1, 1–48. Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice Hall. Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701. Nirenberg, S., & Latham, P. E. (2003). Decoding neuronal spike trains: How important are correlations? Proceedings of the National Academy of Sciences USA, 100, 7348– 7353. Ohzawa I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041. Ohzawa I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. Journal of Neurophysiology, 77, 2879–2909. ¨ ak, P., Perrett, D. I., & Sengpiel, F. (1998). The “ideal homunculus”: Oram M. W., Foldi` Decoding neural population signals. Trends in Neuroscience, 29, 259–265. Panzeri, S., Golledge, H. D. R., Zheng, F., Pola, G., Blanche, T. J., Tovee, M. J., & Young, M. P. (2002). The role of correlated firing and synchrony in coding information about single and separate objects in cat V1. Neurocomputing, 44–46, 579–584. Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of information in the nervous system. Proceedings of the Royal Society of London Series B, 266, 1001–1012. Perez, R., Castro, A. F., Justo, M. S., Bermudez, M. A., & Gonzalez, F. (2005). Retinal correspondence of monocular receptive fields in disparity-sensitive complex cells from area V1 in the awake monkey. Investigative Ophthalmology and Visual Science, 46, 1533–1539. Perkel, D. H., & Bullock, T. H. (1969). Neural coding. In F. O. Schmitt, T. Melnechuk, G. C. Quarton, & G. Adelman (Eds.), Neurosciences research symposium summaries (pp. 405–527). Cambridge, MA: MIT Press. Pouget, A., Dayan, P., & Zemel, R. S. (2003). Computation and inference with population codes. Annual Review of Neuroscience, 26, 381–410. Prince, S. J. D., Pointon, A. D., Cumming, B. G., & Parker, A. J. (2002). Quantitative analysis of the responses of V1 neurons to horizontal disparity in dynamic random-dot stereograms. Journal of Neurophysiology, 87, 191–208. Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404. Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Research, 37, 1811–1827.
682
M. Michel and R. Jacobs
Read, J. C. A., & Cumming, B. G. (2003). Testing quantitative models of binoculary disparity selectivity in primary visual cortex. Journal of Neurophysiology, 90, 2795– 2817. Rolls, E. T., Franco, L., Aggelopoulos, N. C., & Reece, S. (2003). An information theoretic approach to the contributions of the firing rates and the correlations between the firing of neurons. Journal of Neurophysiology, 89, 2810–2822. Seri`es, P., Latham, P. E., & Pouget, A. (2004). Tuning curve sharpening for orientation selectivity: Coding efficiency and the impact of correlations. Nature Neuroscience, 10, 1129–1135. Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553. Tolhurst D. J., Tadmor, Y., & Tang, C. (1992). The amplitude spectra of natural images. Ophthalmic and Physiological Optics, 12, 229–232. Tsai, J. J., & Victor, J. D. (2003). Reading a population code: A multi-scale neural model for representing binocular disparity. Vision Research, 43, 445–466.
Received February 4, 2005; accepted July 20, 2005.
LETTER
Communicated by Michael Cohen
Dynamical Behaviors of Delayed Neural Network Systems with Discontinuous Activation Functions Wenlian Lu
[email protected],
[email protected] Tianping Chen
[email protected] Laboratory of Nonlinear Mathematics Science, Institute of Mathematics, Fudan University, Shanghai, 200433, People’s Republic of China
In this letter, without assuming the boundedness of the activation functions, we discuss the dynamics of a class of delayed neural networks with discontinuous activation functions. A relaxed set of sufficient conditions is derived, guaranteeing the existence, uniqueness, and global stability of the equilibrium point. Convergence behaviors for both state and output are discussed. The constraints imposed on the feedback matrix are independent of the delay parameter and can be validated by the linear matrix inequality technique. We also prove that the solution of delayed neural networks with discontinuous activation functions can be regarded as a limit of the solutions of delayed neural networks with high-slope continuous activation functions.
1 Introduction It is well known that recurrently connected neural networks (RCNNs), proposed by Cohen and Grossberg (1983) and Hopfield (1984; Hopfield & Tank, 1986), have been extensively studied in both theory and applications. They have been successfully applied in signal processing, pattern recognition, and associative memories, especially in static image treatment. Such applications heavily rely on dynamical behaviors of the neural networks. Therefore, analysis of dynamical behaviors is a necessary step for the practical design of neural networks. In hardware implementation, time delays inevitably occur due the to finite switching speed of the amplifiers and communication time. What is more, to process moving images, one must introduce time delays in the signals transmitted among the cells (see Civalleri, Gilli, & Pabdolfi, 1993). Neural networks with time delay have much more complicated dynamics due to the incorporation of delays. The model concerning delay is described Neural Computation 18, 683–708 (2006)
C 2006 Massachusetts Institute of Technology
684
W. Lu and T. Chen
as follows: d xi (t) = −di xi (t) + a i j g j (x j (t)) + b i j g j (x j (t − τ )) + Ii , dt n
n
j=1
j=1
i = 1, 2, . . . , n,
(1.1)
where n is the number of units in a neural network, xi (t) is the state of the ith unit at time t, and g j (x j (t)) denotes the output of jth unit at time t. a i j denotes the strength of the jth unit on the ith unit at time t, and b i j denotes the strength of the jth unit on the ith unit at time t − τ . Ii denotes the input to the ith unit, τ corresponds to the transmission delay and is a nonnegative constant, and di represents the positive rate with which the ith unit will reset its potential to the resting state in isolation when disconnected from the network and the external inputs Ii . System (1.1) can be rewritten as d x(t) = −Dx(t) + Ag(x(t)) + Bg(x(t − τ )) + I, dt
(1.2)
where x = (x1 , x2 , . . . , xn )T , g(x) = (g1 (x1 ), g2 (x2 ), . . . , gn (xn ))T , I = (I1 , I2 , . . . , In )T , T denotes transpose, D = diag{d1 , d2 , . . . , dn }, A = {a i j } is the feedback matrix, and B = {b i j } is the delayed feedback matrix. Some useful results on the stability analysis of delayed neural networks (DNNs) have already been obtained. Readers can refer to Chen (2001), Zeng, Weng, and Liao (2003), Lu, Rong, and Chen (2003), Joy (2000), and many others. In particular, Lu et al. (2003) and Joy (2000) provided some effective criteria based on LMI. All discussion in these article is based on the assumption that the activation functions are continuous and even Lipshitzean. As Forti and Nistri (2003) pointed out, a brief review of some common neural network models reveals that neural networks with discontinuous activation functions are important and frequently arise in practice. For example, consider the classical Hopfield neural networks with graded response neurons (see Hopfield 1984). The standard assumption is that the activations are used in high-gain limit where they closely approach discontinuous and comparator functions. As shown by Hopfield (1984; Hopfield & Tank, 1986), the high-gain hypothesis is crucial to make negligible the connection to the neural network energy function of the term depending on neuron self-inhibitions, and to favor binary output formation—for example, a hard comparator function sign(s). A conceptually analogous model based on hard comparators are discrete-time neural networks discussed by Harrer, Nossek, and Stelzl (1992). Another important example concerns the class of neural networks introduced by Kennedy and Chua (1988) to solve linear and nonlinear
Delayed Systems with Discontinuous Activations
685
programming problems. Those networks exploit constrained neurons with diode-like input-output activations. Again, in order to guarantee satisfaction of the constraints, the diodes are required to possess a very high slope in the conducting region, that is, they should approximate the discontinuous characteristic of an ideal diode (see Chua, Desoer, & Kuh, 1987). And when dealing with dynamical systems possessing high-slope nonlinear elements, it is often advantageous to model them with a system of differential equations with a discontinuous right-hand side, rather than studying the case where the slope is high but of finite value (see Utkin, 1977). Forti and Nistri (2003) applied the concepts and results of differential equations with discontinuous right-hand side introduced by Filippov (1967) to investigate the global convergence of neural networks with discontinuous neuron activations. Furthermore, they also discussed various types of convergence behaviors of global stability, such as convergence in finite time and convergence in measure. Useful sufficient conditions were obtained to ensure global convergence. But the discontinuous activations are assumed bounded. Lu and Chen (2005) studied global stability of a more general neural network model: Cohen-Grossberg neural networks. In this letter, the discontinuous activation functions were not assumed bounded. However, in both articles, the models do not involve time delays. In this letter, we introduce a new concept of solution for delayed neural networks with discontinuous activation functions. Without assuming the boundedness and the continuity of the neuron activations, we present sufficient conditions for the global stability of neural networks with time delay based on linear matrix inequality, and we discuss their convergence. Moreover, we explore the importance of the concept of the solution presented in this letter.
2 Preliminaries In this section, we present some definitions used in this letter. Definition 1. (See Preliminaries in Forti & Nistri, 2003.) Suppose E ⊂ Rn . Then x → F (x) is called a set-value map from E → Rn , if to each point x of a set E ⊂ Rn , there corresponds a nonempty set F (x) ⊂ Rn . A setvalue map F with nonempty values is said to be upper semicontinuous at x0 ∈ E, if for any open set N containing F (x0 ), there exists a neighborhood M of x0 such that F (M) ⊂ N. F (x) is said to have a closed (convex, compact) image if for each x ∈ E F (x) is closed (convex,compact). Gra ph(F (E)) = {(x, y)|x ∈ E, a nd y ∈ F (x)}, where E is subset of Rn . More details about set value maps can be found in Aubin and Frankowska (1990).
686
W. Lu and T. Chen
Definition 2. Class G¯ of functions: Let g(x) = (g1 (x1 ), g2 (x2 ), . . . , gn (xn ))T . We ¯ if for all i = 1, 2, . . . , n, g (·) satisfies: call g(x) ∈ G, i
1. gi (·) is nondecreasing and continuous, except on a countable set of isolated points {ρki }, where the right and left limits gi+ (ρki ) and gi− (ρki ), satisfy gi+ (ρki ) > gi− (ρki ). Moreover, in every compact set of R, gi (·) has only finite discontinuous points. 2. Denote the set of points {ρki : i = 1, . . . , n; k = . . . , −2, −1, 0, 1, 2, . . .} of i discontinuity in the following way: for any i, ρk+1 > ρki , and there exist constants G i,k > 0, i = 1, . . . , n; k = . . . , −2, −1, 0, 1, 2, . . ., such that 0≤
gi (ξ ) − gi (ζ ) ≤ G i,k ξ −ζ
i f or all ξ = ζ and ξ, ζ ∈ (ρki , ρk+1 ).
Remark 1. Forti and Nistri (2003) assumed that discontinuous activations satisfy the first condition of definition 2. We impose some local Lipschitz continuity on each interval that does not contain any points of discontinuity. Furthermore, we do not assume the activation functions are bounded, which is required by Forti and Nistri (2003). Note that the function gi (·) is undefined at the points, where gi (·) is discontinuous. Such discontinuous functions G¯ include a number of neuron activations of interest for applications—for example, the standard hard comparator function sign(·): sign(s) =
1 s>0 −1 s < 0
.
(2.1)
¯ the right-hand side of equation 1.2 is disconIt is clear that if g(·) ∈ G, tinuous. Therefore, we have to explain the meaning of a solution of the Cauchy problem associated with equation 1.2 before further investigation. Filippov (1967) developed a concept of a solution for differential equations with a discontinuous right-hand side, which was used by Forti et al. (2003) and Lu and Chen (2005) to investigate the stability of neural networks with discontinuous activation functions. In the following, we apply this framework in discussing delayed neural networks with discontinuous activation functions. Now we introduce the concept of Filippov solution. Consider the following system, dx = f (x), dt where f (·) is not continuous.
(2.2)
Delayed Systems with Discontinuous Activations
687
Definition 3. A set-value map is defined as φ(x) =
K [ f (B(x, δ) − N)]
(2.3)
δ>0 µ(N)=0
where K (E) is the closure of the convex hull of set E, B(x, δ) = {y : y − x ≤ δ}, and µ(N) is Lebesgue measure of set N. A solution of the Cauchy problem for equation 2.2 with initial condition x(0) = x0 is an absolutely continuous function x(t), t ∈ [0, T], which satisfies x(0) = x0 , and differential inclusion: dx ∈ φ(x), dt
a .e. t ∈ [0, T].
(2.4)
The concept of the solution in the sense of Filippov is useful in engineering applications because the Filippov solution is a limit of solutions of ordinary differential equations (ODEs) with the continuous right-hand side. Thus, we can model a system that is a near discontinuous system and expect that the Filippov trajectory of the discontinuous system will be close to the real trajectories. This approach is important in many applications, such as variable structure control and nonsmooth analysis (see Utkin, 1997; Aubin & Cellina, 1984; Paden & Sastry, 1987). Moreover, Haddad (1981), Aubin (1991), and Aubin and Cellina (1984) gave a functional differential inclusion with memory as follows: dx (t) ∈ F (t, A(t)x), dt
(2.5)
where F : R × C([−τ, 0], Rn ) → Rn is a given set-value map and [A(t)x](θ ) = xt (θ ) = x(t + θ ).
(2.6)
Now we denote K [g(x)] = (K [g1 (x1 )], K [g2 (x2 )], . . . , K [gn (xn )])T where K [gi (xi )] = [gi− (xi ), gi+ (xi )]. We extend the concept of Filippov solution to the delayed differential equations 1.2 as follows: dx (t) ∈ −Dx(t) + AK [g(x(t))] + B K [g(x(t − τ ))] + I, dt Equivalently, dx (t) = −Dx(t) + Aα(t) + Bβ(t − τ ) + I, dt where α(t) ∈ K [g(x(t))] and β(t) ∈ K [g(x(t))].
for almost all t.
688
W. Lu and T. Chen
In the sequel, we assume α(t) = β(t). It is reasonable. The output of the system should be consistent over time. Therefore, we will consider the following delayed system, dx (t) = −Dx(t) + Aα(t) + Bα(t − τ ) + I, dt
for almost all t,
(2.7)
where output α(t) is measurable, and α(t) ∈ K [g(x(t))],
for almost all t.
Definition 4. A solution of the Cauchy problem for the delayed system 2.7 with initial condition φ(θ ) ∈ C([−τ ,0], Rn ) is an absolutely continuous function x(t) on t ∈ [0, T], such that x(θ ) = φ(θ ) for θ ∈ [−τ, 0], and dx = −Dx(t) + Aα(t) + Bα(t − τ ) + I dt
a .e. t ∈ [0, T],
(2.8)
where α(t) is measurable and for almost t ∈ [0, T], α(t) ∈ K [g(x(t))]. Remark 2. Concerning the solution of the ODEs or functional differential equations (FDEs) with a discontinuous right-hand side, there are various definitions, such as Euler solutions and generalized sampling solutions. Among them, Carath´eodory and a weak solution set are widely studied. Liz and Pouso (2002) gave some general results for the existence of the solutions of first-order discontinuous FDEs subject to nonlinear boundary conditions. The Carath´eodory solution set is defined as follows (here, we compare these solution set in the case without time delay). Consider the following ODE, d x(t) = f (x(t)), dt
t ∈ [0, T],
(2.9)
with initial condition x(0) = x0 . An absolutely continuous function ξ (t) is said to be a Carath´eodory solution if d x(t) = f (x(t)), dt x(0) = x0 .
a .e. t ∈ [0, T] (2.10)
Delayed Systems with Discontinuous Activations
689
A function ζ (t) ∈ L 1 ([0, T]) is said to be a weak solution in L 1 ([0, T]) if for each p(t) ∈ C0∞ ([0, T]), there holds
T
0
d p(t) ζ (t)dt = − dt
T
f (ζ (t)) p(t)dt.
(2.11)
0
It is clear that the Carath´eodory solution set is surely a subset of the Filippov solution set if it involves a discontinuous right-hand side. On the other hand, the weak solution set might contain a discontinuous solution. But if we focus on the absolutely continuous weak solutions, this solution set is equivalent to the Carath´eodory solution set. Both are subsets of the Filippov solution set. Spraker and Biles (1996) pointed out that in a one-dimensional case, the Carath´eodory solution set equals the Filippov solution set if and only if {x, 0 ∈ φ(x)} = {x, f (x) = 0}, where φ(·) is defined as in definition 3. Otherwise, the two solution sets are different. For example, consider the following one-dimensional ODE, d x(t) = −x − q (x), dt
(2.12)
where 1 q (x) = −1 1 2
if
ρ>0
if
ρ 0 and M > 0, such that x(t) − x ≤ Me − t . The letter is organized as follows. In section 3, we discuss the existence of the equilibrium point and the solution for system 2.8. The stability of the equilibrium point and the convergence of the output of the delayed neural networks are studied in section 4. Some numerical examples are presented in section 5. We conclude this letter in section 6. 3 Existence of an Equilibrium Point and Solution In this section, we prove that under some conditions, system 2.8 has an equilibrium point and a solution in the infinite time interval [0, ∞). 3.1 Existence of an Equilibrium Point. First, we investigate the existence of an equilibrium point. For this purpose, consider the following differential inclusion, dy ∈ −Dy(t) + T K [g(y(t))] + I. dt
(3.1)
where y(t) = (y1 (t), y2 (t), . . . , yn (t))T , D, K [g(·)], and I are the same as those in system 2.8. We need the following result. ¯ If there exists a Theorem A. (Lu and Chen, 2005, theorem 2). Suppose g ∈ G. positive definite diagonal matrix P such that −P T − T T P is positive definite, then there exists an equilibrium point of system 3.1; that is, there exist y ∈ Rn and α ∈ K [g(y )], such that 0 = −Dy + Tα + I.
(3.2)
By theorem A, we can prove theorem 1. Theorem 1. If there exist a positive definite diagonal matrix diag{P1 , P2 , . . . , Pn } and a positive definite symmetric matrix Q such that
−P A − AT P − Q −P B −B T P
Q
P=
> 0,
then there exists an equilibrium point of system (see equation 2.8).
(3.3)
Delayed Systems with Discontinuous Activations
691
Proof. By the Schur complement theorem (see Boyd, Ghaoui, Feron, & Balakrishnan (1994), inequality 3.3 is equivalent to −(P A + AT P) > 1 1 1 1 P B Q−1 B T P + Q. By the inequality [Q− 2 B T P − Q 2 ]T [Q− 2 B T P − Q 2 ] ≥ 0, P B Q−1 B T P + Q ≥ P B + B T P holds. Then inequality 3.3 becomes −P(A + B) − (A + B)T P > 0.
(3.4)
From theorem A, there exists an equilibrium point x ∈ Rn and α ∈ K [g(x )] such that 0 = −Dx + (A + B)α + I,
(3.5)
which implies that α is an equilibrium point of system 2.8. Theorem 1 is proved. Suppose that x = (x1 , x2 , . . . , xn )T is an equilibrium point of system 2.8, that is, there exists α = (α1 , α2 , . . . , αn )T ∈ K [g(x)] such that equation 3.5 satisfies. Let u(t) = x(t) − x be a translation of x(t) and γ (t) = α(t) − α be a translation of α(t). Then u(t) = (u1 (t), u2 (t), . . . , un (t))T satisfies du(t) = −Du(t) + Aγ (t) + Bγ (t − τ ), dt
for almost t,
(3.6)
where γ (t) ∈ K [g (u(t))], gi (s) = gi (s + xi ) − γi , i = 1, 2, . . . , n. To simplify, we still use gi (s) to denote gi (s), . Therefore, instead of equation 2.8, we will investigate du(t) = −Du(t) + Aγ (t) + Bγ (t − τ ) dt
for almost t
(3.7)
¯ and 0 ∈ K [g (0)], for all i = 1, 2, . . . , n. where γ (t) ∈ K [g(u(t))], g(·) ∈ G, i It can be seen that the dynamical behavior of equation 2.8 is equivalent to that of equation 3.7. Namely, if there exists a solution u(t) for equation 3.7, then x(t) = u(t) + x must be a solution for equation 2.8; moreover, if all trajectories of equation 3.7 converge to the origin, then the equilibrium x must be globally stable for system 2.8, as defined in definition 6. Therefore, instead of equation 2.8, we will investigate the dynamics of system 3.7. 3.2 Viability. In this section, we investigate the viability of system 3.7, that is, there exists at least one solution of system 3.7 on [0, +∞), which is the prerequisite to study global stability. First, we give following lemma on matrix inequalities:
692
W. Lu and T. Chen
Lemma 1. If P = diag{P1 , P2 , . . . , Pn } with Pi > 0, Q is a positive definite symmetric matrix such that
−P A − AT P − Q −P B Z= Q −B T P
> 0,
(3.8)
then the following two statements hold true: 1. There are a small, positive constant ε < mini di , a positive diagonal matrix ˆ Pˆ = diag{ Pˆ 1 , Pˆ 2 , . . . , Pˆ n }, and a positive definite symmetric matrix Q, such that −2D + ε I A B ˆ ετ Pˆ B AT Pˆ A + AT Pˆ + Qe Z1 = (3.9) ≤ 0. T T ˆ ˆ B −Q B P 2. There are a small, constant > 0, a diagonal matrix P˜ = diag{ P˜ 1 , P˜ 2 , . . . , P˜ n } with P˜ i > 0, and a positive definite symmetric matrix ˜ such that Q, −2D A B (3.10) Z2 = AT P˜ A + AT P˜ + I P˜ B ≤ 0. BT
B T P˜
˜ −Q
ˆ = α Q, where P and Q are defined in inequality 3.8, Proof. Let Pˆ = α P, Q and ε and α are constants determined later. Then, for any x, y, and z ∈ Rn , we have x [x T , yT , zT ]Z1 y = −2x T Dx + εx T x + 2x T Ay + 2x T Bz z + αyT (P A + AT P)y + αyT Qy + 2αyT P Bz − αzT Qz + α(e ετ − 1)yT Qy = −2x T Dx + εx T x + 2x T Ay + 2x T Bz − α[yT , zT ]Z y × + α(e ετ − 1)yT Qy z ≤ −x T Dx + 2x T Ay − αyT [λI − (e ετ − 1)Q]y − x T (D − ε I )x + 2x T Bz − αλzT z T = − D1/2 x − D−1/2 Ay D1/2 x − D−1/2 Ay + yT AT D−1 Ay − αyT [λI − (e ετ − 1)Q]y
Delayed Systems with Discontinuous Activations
693
T − (D − ε I )1/2 x − (D − ε I )−1/2 Bz (D − ε I )1/2 x − (D − ε I )−1/2 Bz + zT B T (D − ε I )−1 Bz − αλzT z ≤ 0,
(3.11)
where λ = λmin (Z) > 0. Pick ε, satisfying ε < mini {di } and e ετ < and α satisfying α > max
λ Q2
+ 1,
AT D−1 A2 B T (D − ε I )−1 B2 , , λmin {λI − (e ετ − 1)Q} λ
where X2 = λmax (XT X) and λmax (Z) and λmin (Z) denote the maximum and minimum eigenvalue of the square matrix Z, respectively. Then x [x T , yT , zT ]Z1 y ≤ 0 z holds for any x, y, z ∈ Rn , which implies Z1 ≤ 0. In a similar way, we can prove equation 3.10. To prove the existence of the solution for system 3.7, we will construct a sequence of functional differential systems and prove that the solutions of these systems converge to a solution of system 3.7. Specifically, consider the following Cauchy problem: dx a .e. t ∈ [0, T] dt (t) ∈ −Dx(t) + Aγ (t) + Bγ (t − τ ), a measurable function γ (t) ∈ K [g(x(t))], for almost t ∈ [0, T] x(θ ) = φ(θ ) θ ∈ [−τ, 0].
(3.12)
Let C = C(Rn , Rn ), and define a family of functions = { f (x) : f (x) = [ f 1 (x1 ), f 2 (x2 ), . . . , f n (xn )]T ∈ C} satisfying: 1. Every f i (·) is nondecreasing for all i = 1, 2, . . . , n 2. Every f i (·) is uniformly locally bounded, that is, for any compact set Z ⊂ Rn , there exists a constant M > 0 independent of f such that | f i (x)| ≤ M
f or all x ∈ Z i = 1, . . . , n.
3. Every f i (·) is locally Lipschitzean continuous, that is, for any compact set Z ⊂ Rn , there exists λ > 0 such that for any ξ , ζ ∈ Z, and
694
W. Lu and T. Chen
i = 1, 2, . . . , n, we have | f i (ξ ) − f i (ζ )| ≤ λ|ξ − ζ |. 4. f i (0) = 0, for all i = 1, 2, . . . , n. As pointed out in Hale (1977), if f ∈ , then the following system,
du f dt
(t) = −Du f (t) + Af (u f (t)) + B f (u f (t − τ ))
u f (θ ) = φ(θ ) θ ∈ [−τ, 0],
(3.13)
has a unique solution u f (t) = (u1 (t), u2 (t), . . . , un (t))T on [0, +∞). Moreover, we can prove the following result. Theorem 2. If the matrix inequality 3.3 is satisfied, then for any φ ∈ C, there exists M = M(φ) > 0 such that ε
|u f (t)| ≤ Me − 2 t ,
f or all t > 0 a nd f ∈ ,
(3.14)
where ε > 0 is a constant. Proof. Let V2 (t) = e εt uTf u f + 2 +
n
e εt Pˆ i
i=1 t
ufi
f i (ρ)dρ 0
ˆ f (u f (s))e ε(s+τ ) ds, f T (u f (s)) Q
t−τ
ˆ and Q ˆ are of the second conclusion of lemma 2. And differentiate where ε, P, V2 (t): d V2 (t) = εe εt u f (t)T u f (t) + 2e εt uTf [−Du f + Af (u f (t)) + B f (u f (t − τ ))] dt ˆ + 2e εt f T (u f (t)) P[−Du f (t) + Af (u f (t)) + B f (u f (t − τ ))] + εe εt
n
Pˆ i
i=1
ufi
ˆ f (u f (t − τ )) f i (ρ)dρ − e εt f T (u f (t − τ )) Q
0
ˆ f (u f (t)). + e ε(t+τ ) f T (u f (t)) Q Pick ε < mini di . Then ε 0
ufi
f i (ρ)dρ ≤ εu f i (t) f i (u f i (t)) ≤ di u f i (t) f i (u f i (t)).
(3.15)
Delayed Systems with Discontinuous Activations
695
By matrix inequality 3.9, we have d V2 (t) ≤ e εt uTf (t), f T (u f (t)), f T (u f (t − τ )) Z1 dt
u f (t) f (u f (t))
.
f (u f (t − τ ))
≤0 Therefore, u f (t)T u f (t)e εt ≤ V√ 2 (t) ≤ V2 (0) ≤ M, where M is a constant independent of f ∈ . Let M1 = M. We have ε
u f (t) ≤ M1 e − 2 t
for all t > 0 and f ∈ .
(3.16)
Construct a sequence of systems with high-slope continuous activation functions, and prove that the sequence converges to a solution of system 3.7. Let {ρk,i } be the set of discontinuous points of gi (·). Pick a strictly decreasing sequence {δk,i,m }m with limm→∞ δk,i,m = 0 such that Ik1 ,i,m Ik2 ,i,n = ∅ holds for any k1 = k2 and m, n ∈ N, where Ik,i,m = [ρk,i − δk,i,m , ρk,i + δk,i,m ]. Define functions {g m (x) = (g1m (x1 ), . . . , gnm (xn ))T }, m = 1, . . ., as follows: ¯ Ik,i,m gi (s) if s ∈ k g (ρ +δ )−g (ρ −δ ) i k,i k,i,m2δk,i,mi k,i k,i,m [x − (ρk,i + δk,i,m )] if 0 = ρk,i a nd s ∈ Ik,i,m gim (s) = + gi (ρk,i + δk,i,m ) gi (ρk,i +δk,i,m ) [x − ρk,i ] if 0 = ρk,i and s ∈ [ρk,i , ρk,i + δk,i,m ] δk,i,m gi (ρk,i −δk,i,m ) − δk,i,m [x − ρk,i ] if 0 = ρk,i and s ∈ [ρk,i − δk,i,m , ρk,i ]. (3.17) It can be seen that every {g m (x)}, m = 1, . . . , satisfies
r r
g m (x) ∈ . For any compact set Z ⊂ Rn , lim d (Gra ph(g m (Z)), Gra ph(K [g(Z)])) = 0
m→∞
where d (A, B) = sup inf x − y. x∈A y∈B
r
For every continuous point s of gi (·), there exists m0 ∈ N such that gim (s) = gi (s), for all m ≥ m0 , i = 1, 2, . . . , n.
696
W. Lu and T. Chen
m T Let um (t) = (um 1 (t) . . . , un (t)) be the solution of the following system:
dum (t) = −Dum (t) + Ag m (um (t)) + Bg m (um (t − τ )) dt um (θ ) = φ(θ ) θ ∈ [−τ, 0].
(3.18)
Next, we will prove that system 3.7 (or, equivalently, system 2.8) has at least one solution. ¯ Theorem 3 (Viability theorem). If the matrix inequality 3.3 holds and g ∈ G, T then the system 3.7 has a solution u(t) = (u1 (t), . . . , un (t)) for t ∈ [0, ∞). Proof. By theorem 2, we know that all the solutions {um (t)} of system m 3.18 are uniformly bounded, which implies that { dudt(t) } are also uniformly bounded. By the Arzela-Ascoli lemma and the diagonal selection principle, we can select a sub-sequence of {um (t)} (still denoted by {um (t)}) such that um (t) uniformly converges to a continuous function u(t) on any compact set of R. Because the derivative of {um (t)} is uniformly bounded, it can be seen that for any fixed T > 0, u(t) is Lipschitz continuous on [0, T]. Therefore, du(t) exists for almost all t and is bounded on [0, T]. dt For each p(t) ∈ C0∞ ([0, T], Rn ) (noticing that C0∞ ([0, T], Rn ) is dense in the Banach space L 1 ([0, T], Rn )),
T
0
dum (t) du(t) − dt dt
T
p(t) dt = − 0
dp(t) m (u (t) − u(t)) dt dt
m
m
holds and { dudt(t) } is uniformly bounded. Therefore, dudt(t) weakly converge on L ∞ ([0, T], Rn ) ⊂ L 1 ([0, T], Rn ). By Mazur’s convexity theorem (see to du dt ∞ Yosida, 1978), we can find constants αlm ≥ 0 with l=m αlm = 1, and for any ∞ m only finite αlm = 0 such that ym (t) = l=m αlm ul (t). Then ym (θ ) = φ(θ ), if θ ∈ [−τ, 0], and lim ym (t) = u(t), on [0, T] uniformly
m→∞
dym (t) du(t) = , for all almost t ∈ [0, T]. m→∞ dt dt lim
Let γ m (s) =
∞ l=m
(3.19) (3.20)
αlm gl (ul (s)). Then
dym (t) = −Dym (t) + Aγ m (t) + Bγ m (t − τ ). dt
(3.21)
Delayed Systems with Discontinuous Activations
697
Finally, we will prove that there exists a measurable function γ (t) that is limit of a sub-sequence of γ m (t) and satisfies du(t) = −Du(t) + Aγ (t) + Bγ (t − τ ), dt
for almost t ∈ [0, T].
First, we consider the time interval of t ∈ [0, τ ]. For s ∈ [−τ, 0], gl (ul (s)) = g (φ(s)). According to the boundedness of φ(·), the uniform boundedness of gm (·) on the image of φ, and the limitedness of the set of discontinuous points of g(·) on the image of φ, we can find a sub-sequence of g m (still denoted by g m ) and a measurable function γ (t) on [−τ, 0] such that g m (φ(s)) converges to γ (s) for all s ∈ [−τ, 0]. Therefore, limm→∞ γ m (φ(s)) = γ (s) for all s ∈ [−τ, 0]. For t ∈ [0, τ ] and then t − τ ∈ [−τ, 0], we can find a measurable function on [0, τ ], still denoted by γ (t), such that l
dym + Dym (t) + Bγ (t − τ ) m→∞ m→∞ dt −1 du + Du(t) + Bγ (t − τ ) for almost t ∈ [0, τ ]. =A dt
γ (t) = lim γ m (t) = A−1 lim
Similarly, we can construct a measurable function γ (t), t ∈ [0, T], such that du = −Du(t) + Aγ (t) + Bγ (t − τ ) dt
for almost t ∈ [0, T].
(3.22)
Then we will prove γ (t) ∈ K [g(u(t))]. Both ym (t) and um (t) converge to u(t) uniformly, and K [g(·)] is an uppersemicontinuous set-valued map. Therefore, for any > 0, there exists N > 0 such that for all m > N and t ∈ [0, T], we have g m (um (t)) ∈ O(K [g(u(t))], ), where O(K [g(u(t))], ) = {x ∈ Rn : d(x, K [g(u(t))]) < }. Because K [g(·)] is convex and compact, γ m (t) ∈ O(K [g(u(t))], ) for t ∈ [0, T]. Letting m → ∞, we have γ (t) ∈ O(K [g(u(t))], ) for t ∈ [0, T]. Because of the arbitrariness of , we conclude that γ (t) ∈ K [g(u(t))]
t ∈ [0, T].
(3.23)
Because T is arbitrary, the solution u(t) can be extended to infinite time interval [0, +∞). Theorem 3 is proved. 4 Global Asymptotic Stability In this section, we study the global stability of system 3.7.
698
W. Lu and T. Chen
Theorem 4 (global exponential asymptotic stability). If the matrix inequality 3.3 ¯ then for any solution u(t) on [0, ∞) of system 3.7, there exists holds and g(·) ∈ G, M = M(φ) > 0 such that ε
u(t) ≤ Me − 2 t
f or all t > 0,
where ε is given by matrix inequality 3.9. Equivalently, for any solution x(t) on [0, ∞) of system 2.8, we have ε
x(t) − x ∗ ≤ Me − 2 t
f or all t > 0.
Proof. Let εt T
V3 (t) = e u (t)u(t) + 2
n
εt
e Pˆ i
s 0
ui (t)
gi (ρ) dρ +
0
i=1
Notice that for pi (s) = gi+ (s)}.
t
ˆ (s)e ε(s+τ ) ds. γ (s)T Qγ
t−τ
gi (ρ) dρ, we have ∂c pi (s) = {v ∈ R : gi− (s) ≤ v ≤
Differentiating V3 (t) by the chain rule (for details, see Baciotti, Conti, & Marcellini, 2000; Clarke, 1983; or Lu & Chen, 2005, for details), we have d V3 (t) = εe εt u(t)T u(t) + 2e εt uT [−Du + Aγ (t) + Bγ (t − τ )] dt ˆ + Aγ (t) + Bγ (t − τ )] + 2e εt γ (t) P[−Du(t) + εe εt
n i=1
+e
ε(t+τ )
Pˆ i
ui
ˆ (t − τ ) gi (ρ)dρ − e εt γ T (t − τ ) Qγ
0
ˆ (t). γ T (t) Qγ
Because ε < mini di , we have ε
ui
gi (ρ)dρ ≤ εui (t)γi (t) ≤ di ui (t)γi (t)
0
and u(t) d V3 (t) ≤ e εt [uT (t), γ T (t), γ T (t − τ )]Z1 γ (t) dt γ (t − τ ) ≤ 0.
(4.1)
Delayed Systems with Discontinuous Activations
699
Then, u(t)T u(t) ≤ V3 (0)e −εt and
ε
V3 (0)e − 2 t ε x(t) − x 2 ≤ V3 (0)e − 2 t u(t)2 ≤
hold. Remark 3. From theorem 4, the uniqueness of the equilibrium is also obtained. Corollary 1. If condition 3.3 holds and gi (·) is locally Lipschitz continuous, then there exist ε > 0 and x ∗ ∈ Rn such that for any solution x(t) on [0, ∞) of system 1.1, there exists M = M(φ) > 0 such that ε
x(t) − x ∗ ≤ Me − 2 t
f or all t > 0
where ε is given by the matrix inequality 3.9. If every xi∗ is a continuous point of the activation functions gi (·), i = 1, . . . , n. For the outputs, we have limt→∞ gi (xi (t)) = gi (xi∗ ). Instead, if for some i, xi∗ is a discontinuous point of the activation function gi (·), we can prove the outputs converge in measure (also see Forti & Nistri, 2003). Theorem 5 (convergence of output). If the matrix inequality 3.3 holds and g(·) ∈ ¯ then the output α(t) of system 2.7 converges to α in measure, that is, µ − G, limt→∞ α(t) = α Proof. Define V5 (t) = uT (t)u(t) + 2
n i=1
P˜ i
ui
gi (ρ) dρ +
0
t
˜ (s) ds, γ (s)T Qγ
t−τ
˜ Q, ˜ and are those in the matrix inequality 3.10 of lemma 1. where P, Differentiate V5 (t): d V5 (t) ˜ = 2uT (t)[−Du(t) + Aγ (t) + Bγ (t − τ )] + 2γ T (t) P[−Du(t) dt ˜ (t) − γ T (t − τ ) Qγ ˜ (t − τ ) + Aγ (t) + Bγ (t − τ )] + γ T (t) Qγ + γ (t)T γ (t) − γ (t)T γ (t)
700
W. Lu and T. Chen
u(t) = [uT (t), γ T (t), γ T (t − τ )]Z2 γ (t) − γ T (t)γ (t) γ (t − τ )
≤ − γ T (t)γ (t).
(4.2)
Then V5 (t) − V5 (0) ≤ −
t
γ T (s)γ (s)ds.
0
Since lim V5 (t) = 0, t→∞
0
∞
1 γ T (s)γ (s)ds ≤ − V5 (0).
For any 1 > 0, let E 1 = {t ∈ [0, ∞) : γ (t) > 1 }: V5 (0) ≥
0
∞
γ T (s)γ (s)ds ≥ E 1
γ T (s)γ (s) ≥ 12 µ(E ).
Therefore, µ(E 1 ) < ∞. From proposition 2 in Forti and Nistri (2003), one can see that γ (t) converges to zero in measure, that is, µ − limt→∞ γ (t) = 0. Therefore, µ − limt→∞ α(t) = α . Remark 4. From the proof of theorem 5, one can see that the equilibrium of output α is also unique. Similar to the concept of the Filippov solution for a system of ordinary differential equations with a discontinuous right-hand side, we propose the concept of the solution for the delayed system 2.8. Suppose g(·) ∈ G¯ is bounded. The solution of system 2.8 can be regarded as an approximation of the solutions of delayed neural networks with high-slope gain functions. From the proof of viability, one can see that any limit of the solutions of delayed neural networks with high-slope activation functions, which converge to the discontinuous activations α(t), is a solution of system 2.8. More precisely, we give the following result: Theorem 6. Suppose g(·) ∈ G¯ is bounded, and the function sequence {g m (x) = (g1m (x1 ), g2m (x2 ), . . . , gnm (xn ))T : m = 1,2, . . .} satisfies: 1. {gim (·)} is nondecreasing for all i = 1, 2, . . . , n. 2. {gim (·)} is locally Lipschitz for all i = 1, 2, . . . , n.
Delayed Systems with Discontinuous Activations
701
3. For any compact set Z ⊂ Rn , lim d (Gra ph(g m (Z)), Gra ph(K [g(Z)])) = 0.
m→∞
xm (t) is the solution of the following system: d xm = −Dxm (t) + Agm (xm (t)) + Bgm (xm (t − τ )) + I dt xm (θ ) = φ(θ ) θ ∈ [−τ, 0].
(4.3)
Then there exists a sub-sequence mk such that for any T > 0, xmk (t) converges uniformly to an absolutely continuous Function x(t) on [0, T], which is a solution of system 2.8 on [0, ∞). And for any sub-sequence mk that converges uniformly to an absolutely continuous function x(t) on any [0, T], x(t) must be a solution of system 2.8 on [0, ∞); moreover, if the solution of the system 2.8 is unique, then the sequence xm (t) itself uniformly converges to x(t) on any [0, T]. The proof is similar to that of theorem 3. Details are omitted. Remark 5. There are close relationships and essential differences between this article and that of Forti and Nistri (2003). The dynamical behaviors of neural networks with discontinuous activations were first investigated in Forti and Nistri (2003). However, in that article, the authors assumed that the discontinuous neuron activations are bounded. We do not make that assumption here. Thus, the discussion of the existence of the equilibrium and solution of system 2.7 is much more difficult. Furthermore, the model in this article is a delayed neural network. Here, we introduce a new concept for the solution of delayed neural networks with discontinuous activation functions. Obviously, if a delayed feedback matrix B = 0, the conclusion that Forti and Nistri (2003) proposed for global stability can be regarded as a corollary. 5 Numerical Examples In this section, we present several numerical examples to verify the theorems we have given in the previous sections. Consider a two-dimensional neural network with time delay, d x(t) = −Dx(t) + Ag(x(t)) + Bg(x(t − τ )) + I, dt
(5.1)
where x(t) = (x1 (t), x2 (t))T ∈ R2 denotes the state and α(t) = (α1 (t), α2 (t))T denotes the output satisfying α1 (t) ∈ K [g1 (x1 (t))], α2 (t) ∈ K [g2 (x2 (t))].
702
W. Lu and T. Chen
In examples 1, 2, and 3, we assume D=
10
A=
01
− 14
2
B=
−10 − 14
1 10 1 − 17 10 1 5
,
and g1 (s) = g2 (s) = g(s) = s + sign(s), τ = 1. By the Matlab LMI and Control Toolbox, we obtain P=
5.8083
0
0
1.1796
Q=
1.1515 0.2456 0.2456 0.4056
,
such that Z=
−P A − AT P −P B −B T P Q
> 0.
By theorem 4, system 5.1 is globally exponentially stable for any input I ∈ R2 . 3
2
1 x1(t) 0
−1
x (t)
−2
2
−3
−4 0
2
4
6
8
10 time
12
14
16
18
Figure 1: Dynamical behaviors of the solution of the neural networks 5.2.
20
Delayed Systems with Discontinuous Activations
703
4
3
2 α1(t) 1
0
−1
−2
−3
α (t) 2
−4
−5
0
2
4
6
8
10 time
12
14
16
18
20
Figure 2: Dynamical behaviors of the outputs of the neural networks 5.2.
5.1 Example 1. Consider the following system: 1 x˙ 1 = −x1 (t) − [x1 (t) + sign(x1 (t))] + 2[x2 (t) + sign(x2 (t))] 4 1 1 + [x1 (t − 1) + sign(x1 (t − 1))] + [x2 (t − 1) + sign(x2 (t − 1))] + 6 5 10 1 x˙ 2 = −x2 (t) − 10[x1 (t) + sign(x1 (t))] − [x2 (t) + sign(x2 (t))] 4 1 1 − [x1 (t − 1) + sign(x1 (t − 1))] + [x2 (t − 1) + sign(x2 (t − 1))] + 10. 7 10 (5.2) Let φ1 (θ ) = −e 10θ , φ2 (θ ) = sin(20θ ), θ ∈ [−1, 0], be the initial condition. By the previous analysis, system 5.2 is globally exponentially stable. The trajectories of x1 (t) and x2 (t) are shown in Figure 1, and the trajectories of output, α1 (t) and α2 (t), are shown in Figure 2. The equilibrium of system 5.2 is
704
W. Lu and T. Chen
3
2
1
x1(t)
0
x (t) 2
−1
−2
−3
0
2
4
6 time
8
10
12
Figure 3: Dynamical behaviors of the solution of the neural networks 5.3.
(0.1974, −1.7346)T . It is clear that 0.1974 and −1.7346 are continuous points of the function ρ + sign(ρ). Thus, the output lim α(t) = (1.1974, −2.7346)T . t→∞
5.2 Example 2. In this example, we change the inputs and consider the following system, 1 x˙ 1 (t) = −x1 (t) − [x1 (t) + sign(x1 (t))] + 2[x2 (t) + sign(x2 (t))] 4 1 1 + [x1 (t − 1) + sign(x1 (t − 1))] + [x2 (t − 1) 5 10 43 + sign(x2 (t − 1))] + 20 1 x˙ 2 (t) = −x2 (t) − 10[x1 (t) + sign(x1 (t))] − [x2 (t) + sign(x2 (t))] 4 1 1 − [x1 (t − 1) + sign(x1 (t − 1))] + [x2 (t − 1) 7 10 1399 + sign(x2 (t − 1))] + , 140
(5.3)
Delayed Systems with Discontinuous Activations
705
4
3
2 α1(t) 1
0
−1 α2(t) −2
−3
−4
0
2
4
6 time
8
10
12
Figure 4: Dynamical behaviors of the outputs of the neural networks 5.3.
with the same initial condition as in example 1. The equilibrium of system 5.3 is (0, 0)T , and the equilibrium of outputs is (1, −1)T . It can be seen that 0 is a discontinuous point of the activation function g(ρ) = ρ + sign(ρ) and K [g(0)] = [1, −1]. In this case, the solution trajectories (x1 (t), x2 (t))T converge to the equilibrium, as indicated by Figure 3. The outputs (α1 (t), α2 (t))T cannot converge in norm, but by theorem 5, they converge in measure, as indicated by Figure 4. 5.3 Example 3. We use this example to verify the validity of theorem 6. Consider the following system, x˙ 1 = −x1 (t) + sign(x1 (t)) + sign(x2 (t)) + sign(x1 (t − 1)) + sign(x2 (t − 1)) x˙ 2 = −x2 (t) + sign(x1 (t)) + sign(x2 (t)) + sign(x1 (t − 1)) + sign(x2 (t − 1)), (5.4) and construct a sequence of systems as follows, x˙ m,1 = −xm,1 (t) + tan h(mxm,1 (t)) + tan h(mxm,2 (t)) + tan h(mxm,1 (t − 1)) + tan h(mxm,2 (t − 1))
706
W. Lu and T. Chen 0.14
0.12
0.1
errorm
0.08
0.06
0.04
0.02
0
0
50
100
150
200
250 m
300
350
400
450
500
Figure 5: Variation of error m with respect to m.
x˙ m,2 = −xm,2 (t) + tan h(mxm,1 (t)) + tan h(mxm,2 (t)) + tan h(mxm,1 (t − 1)) + tan h(mxm,2 (t − 1)),
(5.5)
with the same initial condition φ1 (s) = −4 and φ2 (s) = 10 for s ∈ [−1, 0]. Pick 20 sample points 1 = t1 < t2 < . . . < t20 = 3, and define error m = max x(tk ) − xm (tk ). k
Figure 5 indicates that error m converges to zero with respect to m. Therefore, the solutions {xm (t)} of systems 5.5 with high-slope activation functions converge to the solution x(t) of the delayed neural network with discontinuous activation. 6 Conclusion In this letter, we considered the global stability of delayed neural networks with discontinuous activation functions, which might be unbounded. We extend the Filippov solution to the case of delayed neural networks. Under some conditions, the existence of equilibrium and absolutely continuous solution on infinite time interval is proved. Thus, the Lyapunov-type method
Delayed Systems with Discontinuous Activations
707
can be used to study stability. In this way, we obtain an LMI-based criterion for global convergence. If some components of the equilibrium are discontinuous points of the activation function, the output does not converge in norm. In this case, we prove that the output converges in measure. Furthermore, we point out that the solution of the delayed neural networks with discontinuous activation functions can be regarded as a limit of the solution sequence of delayed systems with high-slope activation functions. Acknowledgments We are very grateful to the reviewers for their comments, which were helpful to us in revising the article. This work is supported by National Science Foundation of China 60374018 and 60574044 and also supported by the Graduate Student Innovation Foundation of Fudan University.
References Aubin, J. P. (1991). Viability theory. Boston: Birkhauser. Aubin, J. P., & Cellina, A. (1984). Differential inclusions. Berlin: Springer-Verlag. Aubin, J. P., & Frankowska, H. (1990). Set-valued analysis. Boston: Birkhauser. Baciotti, A., Conti, R., & Marcellini, P. (2000). Discontinuous ordinary differential equations and stabilization. Tesi di Dottorato di Ricerca in Mathematica. Universita degli STUDI di Firenze. Boyd, S., Ghaoui, L. E., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. Philadelphia: SIAM. Chen, T. P. (2001). Global exponential stability of delayed Hopfield neural networks. Neural Networks, 14, 977–980. Chua, L. O., Desoer, C. A. & Kuh, E. S. (1987). Linear and nonlinear circuits. New York: McGraw-Hill. Civalleri, P. P., Gilli, L. M., & Pabdolfi, L. (1993). On stability of cellular neural networks with delay. IEEE Trans. Circuit Syst., 40, 157–164. Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley. Cohen, M. A., & Grossberg, S. (1983). Absolute stability and global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Syst. Man. Cybern., 13, 815–826. Filippov, A. F. (1967). Classical solution of differential equations with multivalued right-hand side. SIAM J. Control, 5, 609–621. Forti, M., & Nistri, P. (2003). Global convergence of neural networks with discontinuous neuron activations. IEEE Trans. Circuits Syst.-1, 50, 1421–1435. Haddad, G. (1981). Monotine viable trajectories for functional differential inclusions. J. Diff. Equ., 42, 1–24. Hale, J. K. (1977). Theory of functional differential equations. New York: SpringerVerlag. Harrer, H., Nossek, J. A., & Stelzl, R. (1992). An analog implementation of discretetime neural networks. IEEE Trans. Neural Networks, 3, 466–476.
708
W. Lu and T. Chen
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-stage neurons. Proc. Nat. Acad. SCi-Biol., 81, 3088– 3092. Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633. Joy, M. P. (2000). Results concerning the absolute stability of delayed neural networks. Neural Networks, 13, 613–616. Kennedy, M. P., & Chua, L. O. (1988). Neural networks for nonlinear programming. IEEE Trans. Circuits Syst.–1, 35, 554–562. Liz, E., & Pouso, R. L., (2002). Existence theory for first order discontinuous functional differential equations. Proc. Amer. Math. Soc., 130, 3301–3311. Lu, W. L., & Chen, T. P. (2005). Dynamical behaviors of Cohen-Grossberg neural networks with discontinuous activation functions. Neural Networks, 18, 231–242. Lu, W. L., Rong, L. B., & Chen, T. P. (2003). Global convergence of delayed neural networks systems. Int. J. Neural Networks, 13, 193–204. Paden, B. E., & Sastry, S. S. (1987). Calculus for computing Filippov’s differential inclusion with application to the variable structure control of robot manipulator. IEEE Trans. Circuits Syst., 34, 73–82. Spraker, J. S., & Biles, D. C. (1996). A comparison of the Carath´eodory and Filippov solution set. J. Math. Anal. Appl., 198, 571–580. Utkin, V. I. (1977). Variable structure systems with sliding modes. IEEE Trans. Automat. Contr., AC-22, 212–222. Yosida, K. (1978). Functional analysis. New York: Springer-Verlag. Zeng, Z., Wang, J., & Liao, X. (2003). Global exponential stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circuit Syst.–1, 50, 1353–1358.
Received September 24, 2004; accepted July 11, 2005.
LETTER
Communicated by Sun-ichi Amari
A Generalized Contrast Function and Stability Analysis for Overdetermined Blind Separation of Instantaneous Mixtures Xiao-Long Zhu xlzhu
[email protected] Xian-Da Zhang
[email protected] National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
Ji-Min Ye
[email protected] School of Science, Xidian University, Xi’an 710071, China
In this letter, the problem of blind separation of n independent sources from their m linear instantaneous mixtures is considered. First, a generalized contrast function is defined as a valuable extension of the existing classical and nonsymmetrical contrast functions. It is applicable to the overdetermined blind separation (m > n) with an unknown number of sources, because not only independent components but also redundant ones are allowed in the outputs of a separation system. Second, a natural gradient learning algorithm developed primarily for the complete case (m = n) is shown to work as well with an n × m or m × m separating matrix, for each optimizes a certain mutual information contrast function. Finally, we present stability analysis for a newly proposed generalized orthogonal natural gradient algorithm (which can perform the overdetermined blind separation when n is unknown), obtaining an expectable result that its local stability conditions are slightly stricter than those of the conventional natural gradient algorithm using an invertible mixing matrix (m = n). 1 Introduction The problem of blind source separation (BSS) has been studied intensively in recent years (Girolami, 1999; Hyvarinen, Karhunen, & Oja, 2001). In the noise-free instantaneous case, the available sensor vector xt = [x1 (t), . . . , xm (t)]T and the unobservable source vector st = [s1 (t), . . . , sn (t)]T are related by xt = Ast , Neural Computation 18, 709–728 (2006)
(1.1) C 2006 Massachusetts Institute of Technology
710
X.-L. Zhu, X.-D. Zhang, and J.-M. Ye
where A is an unknown m × n mixing matrix. For simplicity, all variables in this article are supposed to be real-valued. The superscript T denotes transpose of a vector or matrix. The objective of BSS is to find a separating matrix B given just a sequence of observations {xt }, such that the output vector, yt = Bxt ,
(1.2)
provides accurate estimates of the n source signals. For this purpose, the following assumptions are usually made: A1. The unknown mixing matrix A is of full column rank with n ≤ m. A2. The source signals s1 (t), . . . , sn (t) are statistically independent. A3. Each source signal is a zero-mean and unit-power stationary process. A4. At most one of the source signals is subject to the normal distribution. The full-column-rank requirement of the mixing matrix guarantees that all source signals are theoretically recoverable (Cao & Liu, 1996; Li & Wang, 2002). More precisely, when the column rank of A is deficient, at most a part of source signals can be successfully extracted unless additional prior information about the mixing matrix or the source signals is available (Lewicki & Sejnowski, 2000). The independence assumption is a premise of the BSS problem, and it holds in many practical situations. The unit power assumption comes from the amplitude indeterminacy inherent in the problem (Comon, 1994; Tong, Liu, Soon, & Huang, 1991), and it incorporates the scale information of each source signal into the corresponding column vector of the mixing matrix. The last assumption is made due to the fact that a linear mixture of several gaussian signals is still gaussian and thus cannot be factorized exclusively. The BSS problem is also solvable if the assumptions regarding the source signals (A2–A4) are replaced by another assumption that all the source signals are statistically nonstationary with distinct variances at two or more time instants (Douglas, 2002; Pham & Cardoso, 2001). Numerous algorithms have been proposed for BSS (see, e.g., Amari & Cichocki, 1998; Cardoso, 1998; Girolami, 1999; Hyvarinen et al., 2001). Of particular note is the natural gradient algorithm (Amari, Cichocki, & Yang, 1996; Amari, 1998; Yang & Amari, 1997), Bt+1 = Bt + ηt I − φ(yt )ytT Bt ,
(1.3)
where ηt is a positive learning rate parameter and φ(yt ) = [ϕ1 (y1 (t)), . . . , ϕn (yn (t))]T is a nonlinear-transformed vector of yt . This
A Generalized Contrast Function and Stability Analysis
711
algorithm is computationally efficient with uniform convergence performance independent of the conditioning of the mixing matrix (Cardoso & Laheld, 1996). Additionally, algorithm 1.3 is developed primarily for complete BSS where both the mixing matrix and the separating matrix are invertible square matrices (m = n); nevertheless, it works effectively as well in the overdetermined case (m > n) by designating an n × m separating matrix Bt (Zhang, Cichocki, & Amari, 1999; Zhu & Zhang, 2004). The extension of algorithm 1.3 to the overdetermined case requires that the source number n should be known a priori. When the source number is unknown, we can exact the source signals one by one using the sequential approaches (Delfosse & Loubaton, 1995; Hyvarinen & Oja, 1997; Thawonmas, Cichocki, & Amari, 1998). They work regardless of the source number, but the quality of the separated source signals would increasingly degrade due to the accumulated errors coming from the deflation process. Moreover, the subsequent signal extraction cannot proceed unless the current deflation process has converged, which is prohibitive in some real-time situations such as wireless communications. As to parallel algorithms that recover source signals simultaneously, the BSS problem with an unknown number of sources can be achieved in two stages. The observations are first preprocessed such that the sensor vector is transformed to a white vector, and at the same time its dimensionality is decreased from m to n; and then an n × n orthogonal matrix is determined to obtain source separation. This scheme suffers from a poor separation for ill-conditioned mixing matrix or weak source signals (Karhunen, Pajunen, & Oja, 1998). Surprisingly, algorithm 1.3 can be employed directly to learn an m × m nonsingular matrix. Extensive experiments show that among m outputs at its convergence, n components are mutually independent, providing the rescaled source signals; each of the remaining m − n components is a copy of some independent component. Therefore, a postprocessing layer can be utilized to remove the redundant signals and estimate the source number. The above finding was reported earlier (Cichocki, Karhunen, Kasprzak, & Vigario, 1999), and later a theoretical justification was given from the viewpoint of contrast function (Ye, Zhu, & Zhang, 2004). As also verified by experiments, if the dimensionality of yt is greater than the source number, then the convergence of algorithm 1.3 is impermanent: it would diverge inevitably because the separation state is not a stationary point. To perform BSS with an unknown number of sources effectively, a generalized orthogonal natural gradient algorithm was proposed (Ye et al., 2004), but the (local) stability conditions remain to be given. The purpose of this article is to fill this gap. Additionally, a generalized contrast function is defined here, which includes the existing classical contrast function and the nonsymmetrical one as its two special cases. The natural gradient algorithm, 1.3, is able to perform (before divergence) the overdetermined BSS (m > n) with an m × m separating matrix in that it maximizes locally a generalized mutual information contrast function.
2 A Generalized Contrast Function

A contrast function (or contrast) plays a crucial role in optimization, since its maximization gives solutions of certain problems. There are two types of contrasts for BSS in the literature: the classical contrast, first defined by Comon (1994), and the nonsymmetrical contrast introduced by Moreau and Thirion-Moreau (1999). Both contrasts address the case where the composite matrix C = BA is a square matrix, which implies that the separating matrix B is an n × m matrix and the source number n is known a priori.

In this section, we use the following notation. C^{n×n} denotes the set of n × n nonsingular matrices, D^{n×n} denotes the set of n × n invertible diagonal matrices, G^{n×n} denotes the set of n × n generalized permutation matrices (which contain only one nonzero entry in each row and each column), S^n is the set of source vectors satisfying the assumptions A2 and A3, and Y^n is the set of n-dimensional output vectors built from y = Cs, where C ∈ C^{n×n} and s ∈ S^n. The sets C^{q×q}, D^{q×q}, G^{q×q}, and Y^q are defined similarly. ∅ represents the empty set containing no element. For convenience, we drop the time index t from the source vector s_t, the observation vector x_t, and the output vector y_t.

Definition 1: Classical Contrast (Comon, 1994; Moreau & Macchi, 1996). A contrast function on Y^n is a multivariate mapping J from the set Y^n to the real number set R, which satisfies the following three requirements:

R1. ∀y ∈ Y^n, ∀C ∈ G^{n×n}, J(Cy) = J(y).
R2. ∀s ∈ S^n, ∀C ∈ C^{n×n}, J(Cs) ≤ J(s).
R3. ∀s ∈ S^n, ∀C ∈ C^{n×n}, J(Cs) = J(s) ⇔ C ∈ G^{n×n}.

Definition 2: Nonsymmetrical Contrast (Moreau & Thirion-Moreau, 1999). A contrast function on Y^n is a multivariate mapping J from the set Y^n to R, which satisfies the following three requirements:

Q1. ∀y ∈ Y^n, ∀D ∈ D^{n×n}, J(Dy) = J(y).
Q2. ∀s ∈ S^n, ∀C ∈ C^{n×n}, J(Cs) ≤ J(s).
Q3. ∀s ∈ S^n, ∀C ∈ C^{n×n}, ∃ Ḡ^{n×n} ⊂ G^{n×n} with Ḡ^{n×n} ≠ ∅, J(Cs) = J(s) ⇔ C ∈ Ḡ^{n×n}.

From the above two definitions, it can be seen that a classical contrast is invariant under all separation states, and its global maximization is a necessary and sufficient condition for source separation. In contrast, a nonsymmetrical contrast is merely invariant to nonzero scaling transformations of its arguments. Although a separation state where the outputs consist of independent components does not necessarily correspond to a maximum, the maximization of a nonsymmetrical contrast does result in source separation.
In BSS, the classical contrasts are the most used, including the maximum likelihood contrasts (Cardoso, 1998), the mutual information contrasts (Amari et al., 1996; Comon, 1994; Pham, 2002; Yang & Amari, 1997), the nonlinear principal component analysis contrasts (Karhunen et al., 1998; Oja, 1997; Zhu & Zhang, 2002), and the high-order statistics contrasts (Cardoso, 1999; Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999). The negative mutual information of the outputs in equation 1.2 is given by (Amari et al., 1996; Comon, 1994; Yang & Amari, 1997)

J(y_1, ..., y_q; B) = H(y_1, ..., y_q) − Σ_{k=1}^{q} H(y_k),   (2.1)

where the differential entropy is defined as (Cover & Thomas, 1991)

H(y_1, ..., y_q) = −∫_{−∞}^{∞} p(y_1, ..., y_q) ln p(y_1, ..., y_q) dy_1 ... dy_q,   (2.2)

in which p(y_1, ..., y_q) is the joint probability density function (PDF) of {y_1, ..., y_q} and ln(·) is the natural logarithm. By the properties of the differential entropy, J(y_1, ..., y_q; B) ≤ 0, with equality if and only if p(y_1, ..., y_q) = Π_{k=1}^{q} p(y_k). It is straightforward that equation 2.1 with q = n is a classical contrast for BSS (Comon, 1994), meaning

J(y_1, ..., y_n; B) = 0  ⇔  BA ∈ G^{n×n}.   (2.3)
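As a rough numerical illustration of the contrast 2.1, the sketch below estimates J for q = 2 with simple histogram entropy estimators. The estimators, bin count, and sample distributions are our own illustrative choices; histogram entropy estimates are biased, so only the sign pattern should be read from the output.

```python
import numpy as np

def entropy_1d(y, bins=50):
    """Histogram estimate of the marginal differential entropy H(y)."""
    p, edges = np.histogram(y, bins=bins, density=True)
    dx = edges[1] - edges[0]
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) * dx

def entropy_2d(y1, y2, bins=50):
    """Histogram estimate of the joint differential entropy H(y1, y2)."""
    p, xe, ye = np.histogram2d(y1, y2, bins=bins, density=True)
    dx, dy = xe[1] - xe[0], ye[1] - ye[0]
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) * dx * dy

rng = np.random.default_rng(1)
s1 = rng.uniform(-1.0, 1.0, 100000)
s2 = rng.laplace(size=100000)

# Independent outputs: J is near its global maximum of zero.
print(entropy_2d(s1, s2) - entropy_1d(s1) - entropy_1d(s2))
# Mixed (dependent) outputs: J is clearly negative.
print(entropy_2d(s1 + s2, s1 - s2) - entropy_1d(s1 + s2) - entropy_1d(s1 - s2))
```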
Since the outputs (y_1, ..., y_q) are linear combinations of n source signals, they cannot be mutually independent, and the global maximum of equation 2.1 cannot be reached if q > n. As mentioned previously, the task of BSS is to recover the source signals from the observations, so the dimensionality of y can be larger than the number of sources. An algorithm also achieves source separation if y_1, ..., y_q (q ≥ n) are made up of n independent components plus q − n redundant components. That is, a sufficient condition for BSS is that all the source signals are recovered at least once and the composite matrix takes the following form,

C = G · [e_1, ..., e_n, c_{n+1}, ..., c_q]^T,   (2.4)

where G ∈ G^{q×q}, [e_1, ..., e_n] = I is the n × n identity matrix, and c_i (i = n + 1, ..., q) is either a null vector or a column of I. To develop such algorithms, the conventional contrasts need to be modified to handle a rectangular as well as a square composite matrix C.

Definition 3: Generalized Contrast. Let ξ be a q × 1 vector (q ≥ n), n components of which are the original source signals and each of the rest q − n
components is a constant zero or some source signal. All q! · (n + 1)^{q−n} possible versions of ξ form the set Ξ^q. Denote Ξ̄^q = {ς | ς = Dξ, ξ ∈ Ξ^q, D ∈ D^{q×q}} and Q^{q×q} = {Q | Qξ ∈ Ξ̄^q, ξ ∈ Ξ^q}, and let W^{q×q} be the set of q × q (singular or nonsingular) matrices. A contrast function on Y^q is a multivariate mapping J from the set Y^q to R, which satisfies the following three requirements:

T1. ∀y ∈ Y^q, ∀D ∈ D^{q×q}, J(Dy) = J(y).
T2. ∀y ∈ Y^q, ∀W ∈ W^{q×q}, ∃δ > 0, J(y + δWy) ≤ J(y) ⇒ y ∈ Ξ̄^q.
T3. ∀ξ ∈ Ξ^q, ∀W ∈ C^{q×q}, ∃ Q̄^{q×q} ⊂ Q^{q×q} with Q̄^{q×q} ≠ ∅, J(Wξ) = J(ξ) ⇔ W ∈ Q̄^{q×q}.

Obviously, when a generalized contrast is maximized, the outputs provide all source signals up to ordering and scaling, a property shared with the classical and nonsymmetrical contrasts. The main difference among the three definitions is that all local maxima of the generalized contrast correspond to source separation (where the q outputs are composed of n independent components and q − n redundant components), while only the global maxima of the classical and nonsymmetrical contrasts do so (where no redundant component is allowed). Hence, the generalized contrast is particularly useful for the overdetermined BSS problem with an unknown number of sources. The classical and nonsymmetrical contrasts, as special instances of the generalized one, are appropriate only for the case where the source number is known a priori.

Now, we turn back to equation 2.1. For a square mixing matrix A, the separating matrix B should be taken as an n × n invertible matrix, resulting in (Amari et al., 1996; Bell & Sejnowski, 1995; Comon, 1994; Yang & Amari, 1997)

H(y_1, ..., y_n) = H(x_1, ..., x_n) + ln |det(B)|,   (2.5)
where |det(B)| is the absolute value of the determinant of B. On the other hand, if the m × n matrix A is of full column rank with n in hand, then an n × m full-row-rank matrix B can be assigned to perform BSS. By the singular value decomposition (Golub & van Loan, 1996), there must be an n × n orthogonal matrix U, an n × n nonsingular diagonal matrix Λ, and an m × m orthogonal matrix V such that B = U[Λ, O]V^T, where O denotes an n × (m − n) null matrix. Let V_1 be the submatrix made up of the first n columns of V and V_2 the remaining m − n columns. The output vector then becomes y = UΛV_1^T x. According to the entropy definition, equation 2.2, and the property det(Λ) = √(det(BB^T)), we obtain

H(y_1, ..., y_n) = H(V_1^T x) + (1/2) ln |det(BB^T)|.   (2.6)
Since B = UΛV_1^T does not depend on V_2, from the formula (Cover & Thomas, 1991)

H(V_1^T x) = H(x) − H(V_2^T x | V_1^T x),   (2.7)

where H(V_2^T x | V_1^T x) represents the differential entropy of V_2^T x conditioned on V_1^T x, it can be deduced that H(V_1^T x) is not a function of B (e.g., H(V_1^T x) = H(x) when V_2^T A = O). To sum up, if the source number is known and equal to the number of outputs, the mutual information contrast, equation 2.1, subject to definition 1 can be simplified to (Zhu & Zhang, 2004)

J_1(y_1, ..., y_n; B) = (1/2) ln |det(BB^T)| − Σ_{k=1}^{n} H(y_k),   (2.8)
in which the trivial term H(x_1, ..., x_n) in equation 2.5 or H(V_1^T x) in equation 2.6 is removed. The separating matrix permits two choices: B_t ∈ R^{n×n} or B_t ∈ R^{n×m}. We notice that another contrast is also useful in this case (Zhang et al., 1999),

J_2(y_1, ..., y_n; B) = ln |det(BE^T)| − Σ_{k=1}^{n} H(y_k),   (2.9)
where E is the identity element of the Lie group Gl(n, m). Clearly, the classical contrasts 2.8 and 2.9 are equivalent, since ln |det(BE^T)| = (1/2) ln |det(BB^T)|. When the source number is unknown, neither the classical contrast nor the nonsymmetrical contrast works, and it is necessary to use the generalized contrast. Since the sensor number m is usually, if not always, in hand, the separating matrix B can be restricted to be an m × m nonsingular matrix. Thus, an n × m submatrix B_1 exists such that B_1 A is invertible. Without loss of generality, we suppose that B_1 is composed of the first n rows of B and denote by B_2 the remaining submatrix. It can be shown that equation 2.1 with q = m becomes (Ye et al., 2004)
J_3(y_1, ..., y_m; B) = −I(y_1, ..., y_n) − Σ_{k=1}^{m−n} H(y_{n+k}),   (2.10)
in which [y_1, ..., y_n]^T = B_1 x, [y_{n+1}, ..., y_m]^T = B_2 x, and I(y_1, ..., y_n) denotes the mutual information of {y_1, ..., y_n}. Because a random variable that is the sum of several independent components has an entropy not less than that of any individual component (Cover & Thomas, 1991), that is,

H(y_{n+k}) ≥ max_{j ∈ {1,...,n}} H([B_2 A]_{n+k,j} s_j),   (2.11)
where C_{pq} denotes the (p, q)th entry of matrix C, the second term (including the minus sign) on the right-hand side of equation 2.10 reaches a local maximum when each component y_{n+k} is a copy of some source signal. Furthermore, the first term attains its global maximum of zero if and only if y_1, ..., y_n are mutually independent. Therefore, we can infer that equation 2.10, or its equivalent version,

J_3(y_1, ..., y_m; B) = ln |det(B)| − Σ_{k=1}^{m} H(y_k),   (2.12)
works as a generalized contrast function for overdetermined BSS. Based on equation 2.12 and natural gradient learning (Amari, 1998), we can readily obtain the following overdetermined BSS algorithm (Cichocki et al., 1999; Ye et al., 2004),

B_{t+1} = B_t + η_t [I − F(y_t) y_t^T] B_t,   (2.13)

where B_t ∈ R^{m×m}, F(y) = [f_1(y_1), ..., f_m(y_m)]^T, and

f_i(y_i) = −(1/p_i(y_i)) · dp_i(y_i)/dy_i,  i = 1, ..., m,   (2.14)
are referred to as the score functions of BSS. Regarding equation 2.13, we make the following remarks.

Remark 1. For the complete BSS problem, the sensor number is equal to the number of sources (m = n), and the natural gradient algorithm, 1.3, which was first proposed heuristically in Cichocki, Unbehauen, and Rummert (1994) and later theoretically justified in Amari (1998), has the same form as equation 2.13 except that B_t ∈ R^{n×n}.

Remark 2. In the overdetermined case (m > n), if the source number n is known, we can apply relative gradient learning (Cardoso & Laheld, 1996) to the classical contrast, equation 2.8 (Zhu & Zhang, 2004), or apply natural gradient learning to equation 2.9 (Zhang et al., 1999), finally obtaining an algorithm homologous to equation 2.13 but with B_t ∈ R^{n×m}.
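The following sketch runs algorithm 2.13 with an m × m separating matrix on m = 3 mixtures of n = 2 sources, substituting tanh for the true score functions 2.14, as is common for super-gaussian sources. The run is kept short with a small learning rate because, as remark 4 below explains, the noiseless update cannot settle exactly and eventually diverges; all parameter values here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T, eta = 2, 3, 20000, 0.005

S = rng.laplace(size=(n, T)) / np.sqrt(2.0)   # unit-power sources
A = rng.normal(size=(m, n))                   # full-column-rank mixing matrix
X = A @ S                                     # overdetermined observations

B = np.eye(m)                                 # m x m separating matrix
for t in range(T):
    y = B @ X[:, t]
    # Algorithm 2.13 with tanh in place of the score functions 2.14.
    B += eta * (np.eye(m) - np.outer(np.tanh(y), y)) @ B

# Each of the m outputs should align with one source, so with m > n at
# least one source is recovered in duplicate.
corr = np.corrcoef(np.vstack([B @ X, S]))[:m, m:]
print(np.round(corr, 2))
```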
Remark 3. When the source number n is unknown, it is necessary to employ algorithm 2.13 with B_t ∈ R^{m×m}. Let A⁺ = (A^T A)^{−1} A^T be the pseudo-inverse of the mixing matrix A, let N(A^T) be an m × (m − n) matrix formed by the basis vector(s) of the null space of A^T, and let B̄ be an (m − n) × m matrix whose rows are made up of some row(s) of A⁺. As long as the separating matrix B_t updates within the equivalence class (Amari, Chen, & Cichocki, 2000; Ye et al., 2004), defined by

ℬ = { B̃ | B̃ = G [(A⁺)^T, N(A^T) + B̄^T]^T, G ∈ G^{m×m} },   (2.15)
the m output components consist of n rescaled source signals and m − n redundant signals. In other words, y_t = Gξ_t and some estimated source signals are replicated. From the viewpoint of the contrast function, a local maximum of equation 2.12 is achieved.

Remark 4. For the ideal noiseless case, the m components of y_t = B_t A s_t, which are linear mixtures of n independent source signals, cannot be mutually independent when m > n. Thus, the stationary condition

E{I − F(y_t) y_t^T} = O   (2.16)
does not hold, and the natural gradient algorithm, 2.13, cannot stabilize in the separation state y_t = Gξ_t. As a result, algorithm 2.13 would diverge inevitably unless the learning rate η_t is sufficiently small. A different phenomenon arises in the noisy case x_t = A s_t + v_t, where v_t is a vector of additive noises. Let Ā be an m × (m − n) matrix such that Ã = [A, Ā] is an m × m invertible matrix. Then the sensor vector can be rewritten as x_t = Ã s̃_t + ṽ_t, with s̃_t = [s_t^T, v̄_t^T]^T and ṽ_t = v_t − Ā v̄_t. Provided that the components of v̄_t are independent of the source signals, this is a noisy but complete BSS model (in the specific case v_t = Ā v̄_t, it reduces to a noiseless one, x_t = Ã s̃_t); thereby, in the noisy case, the behavior of algorithm 2.13 is similar to that of the (noisy) complete natural gradient algorithm 1.3. It will collect n unknown source signals (with reduced noise) and m − n noise signals after convergence, under the condition that the noise is not too large; usually its power is less than 3% of the source signal's (Cichocki, Sabala, Choi, Orsier, & Szupiluk, 1997). Furthermore, algorithm 2.13 no longer diverges, as condition 2.16 is fulfilled.

The above analysis, as well as the efficiency of the framework 2.13 using different forms of separating matrices (B_t ∈ R^{n×n}, B_t ∈ R^{n×m}, or B_t ∈ R^{m×m}), has been verified by extensive experiments (Amari et al., 1996; Cichocki et al., 1994, 1999; Girolami, 1999; Hyvarinen et al., 2001; Ye et al., 2004; Zhu & Zhang, 2004).
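As a numerical sanity check of the class 2.15 as reconstructed here, the sketch below builds one member B̃ and verifies that the composite matrix B̃A has the form 2.4: n rescaled identity rows plus m − n duplicated rows. The particular matrices, the diagonal choice of G, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 3
A = rng.normal(size=(m, n))                 # full-column-rank mixing matrix

Apinv = np.linalg.pinv(A)                   # A+ = (A^T A)^{-1} A^T, shape n x m
U, _, _ = np.linalg.svd(A)                  # last m - n left singular vectors
N = U[:, n:]                                # basis of the null space of A^T
Bbar = Apinv[:m - n, :]                     # (m - n) x m, rows taken from A+

G = np.diag([2.0, -1.0, 0.5])               # a generalized permutation matrix
Btilde = G @ np.vstack([Apinv, N.T + Bbar]) # one member of the class 2.15

# Each row of the composite matrix is a rescaled identity row (form 2.4):
# n rescaled sources plus m - n redundant copies at the outputs.
print(np.round(Btilde @ A, 6))
```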
Remark 5. In algorithm 1.3 or 2.13, the identity matrix I determines the amplitude of the reconstructed signals, and it can be replaced by a nonsingular diagonal matrix Λ, yielding

B_{t+1} = B_t + η_t [Λ − F(y_t) y_t^T] B_t.   (2.17)

If Λ = diag{F(y_t) y_t^T} is the diagonal matrix composed of the on-diagonal elements of F(y_t) y_t^T, then equation 2.17 becomes an orthogonal natural gradient algorithm (Amari et al., 2000; Cichocki et al., 1997). With B_t ∈ R^{m×m} and m > n, this algorithm can make the redundant components decay to zero, and it is particularly useful when the magnitudes of the source signals are intermittent or change rapidly over time (e.g., speech signals). One drawback of the orthogonal natural gradient algorithm is that the redundant signals, although very weak in scale, may still be mixtures of several source signals. Furthermore, the magnitudes of some reconstructed source signals may be very small as well; hence, it is sometimes difficult to discriminate them from the redundant ones.

3 Stability Analysis of a Generalized Natural Gradient Algorithm

The natural gradient algorithm, 2.13, would diverge in the noiseless case, while the orthogonal natural gradient algorithm, 2.17, may have trouble determining whether some weak outputs are recovered source signals or redundant ones. To overcome these problems, we have recently proposed a generalized orthogonal natural gradient algorithm (Ye et al., 2004),

B_{t+1} = B_t + η_t [R − φ(y_t) y_t^T] B_t,   (3.1)

in which

R = E{φ(y_t) y_t^T} |_{y_t = Gξ_t}   (3.2)
is no longer a diagonal matrix when m > n. In practice, R should start as the identity matrix and then take an exponentially windowed average of equation 3.2, that is,

R_t = I for t ≤ T_0;  R_t = λ R_{t−1} + (1 − λ) φ(y_t) y_t^T for t > T_0,   (3.3)
where 0 < λ < 1 is a forgetting factor and T_0 is an empirical number associated with the convergence speed. Additionally, when the source number is known a priori or has been estimated, other choices of R are also possible; they may control what the redundant components are or where they appear (refer to Ye et al., 2004, for more details).
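A minimal sketch of the generalized orthogonal natural gradient algorithm, equations 3.1 to 3.3, follows, again with tanh activations standing in for the unknown score functions. The values of η, λ, and T_0 and the Laplacian sources are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, T = 2, 3, 30000
eta, lam, T0 = 0.003, 0.98, 2000

S = rng.laplace(size=(n, T)) / np.sqrt(2.0)
A = rng.normal(size=(m, n))
X = A @ S

B, R = np.eye(m), np.eye(m)
for t in range(T):
    y = B @ X[:, t]
    phi_y_yT = np.outer(np.tanh(y), y)
    if t > T0:
        # Exponentially windowed average of phi(y_t) y_t^T (equation 3.3).
        R = lam * R + (1.0 - lam) * phi_y_yT
    # Generalized orthogonal natural gradient step (equation 3.1).
    B += eta * (R - phi_y_yT) @ B

# Output-source correlations over the final stretch of the run.
Y = B @ X[:, -2000:]
print(np.round(np.corrcoef(np.vstack([Y, S[:, -2000:]]))[:m, m:], 2))
```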
Notice that the PDFs of the source signals are usually inaccessible in the BSS problem; hence, another group of nonlinearities, called activation functions ϕ_i(y_i), is generally used instead of the score functions in equation 2.14. A pivotal question is: How much can the activation functions ϕ_i(y_i) depart from the score functions f_i(y_i) corresponding to the true distributions before a separating stationary point becomes unstable? The local stability conditions answer such a question (Amari, Chen, & Cichocki, 1997; Amari, 1998; Amari et al., 2000; Cardoso & Laheld, 1996; Cardoso, 2000; Ohata & Matsuoka, 2002; von Hoff, Lindgren, & Kaelin, 2000). To the best of our knowledge, however, none of the existing literature considers the general case where there are more outputs than source signals. The generalized algorithm 3.1 can perform overdetermined BSS with an unknown number of sources, and its validity has been confirmed by computer simulations, but the local stability conditions remain to be given. Here, we study this problem and formulate the following theorem. Again, the time index t is omitted for simplicity.

Theorem 1. When the activation functions and the zero-mean source signals satisfy the conditions

E{ϕ_i(s_i) s_j} = 0   if i ≠ j,   (3.4)

E{ϕ'_i(s_i) s_j s_k} = 0   if j ≠ k,   (3.5)
where ϕ'_i(s_i) denotes the first-order derivative of ϕ_i(s_i), and when the inequalities

E{ϕ'_i(s_i) s_j²} E{ϕ'_j(s_j) s_i²} > E{ϕ_i(s_i) s_i} E{ϕ_j(s_j) s_j},   (3.6)

E{ϕ'_i(s_i) s_j²} > 0   (3.7)
hold for all i, j = 1, ..., n, then the separating matrix B ensuring y = BAs = Gξ is a stable equilibrium point of the generalized orthogonal natural gradient algorithm 3.1.

Proof. The local (asymptotic) stability of a stochastic BSS algorithm can be examined through the response of the composite matrix C = BA near equilibrium points. Multiplying both sides of equation 3.1 by A and taking the mathematical expectation, we have

E{C} := E{C} + η E{[R − φ(y) y^T] C}.   (3.8)
In what follows, the variables on the left-hand side of the notation ":=" have the time index t + 1, while those on the right-hand side have the time index t. In addition, ϕ_i(s_i) and ϕ'_i(s_i) are abbreviated as ϕ_i and ϕ'_i, respectively. Suppose there is a small disturbance matrix Δ of C at a separating point; the local stability issue then reduces to investigating whether and under what conditions E{Δ} converges to a null matrix. Without loss of generality, we assume n < m ≤ 2n and the output vector y = ξ with
ξ_i = s_i for i = 1, ..., n, and ξ_i = s_{i−n} for i = n + 1, ..., m;   (3.9)

namely, the composite matrix is

C = Ĩ = [e_1, ..., e_n, e_1, ..., e_{m−n}]^T,   (3.10)
where e_i is the ith column vector of the n × n identity matrix. Taking the small disturbance into account, the output vector becomes y = ξ + Δs, and the first-order Taylor expansion of φ(y) reads

φ(y) ≈ φ(ξ) + D_{ϕ'} Δs,   (3.11)

in which D_{ϕ'} = diag[ϕ'_1, ..., ϕ'_n, ϕ'_1, ..., ϕ'_{m−n}]. Substituting R = E{φ(ξ) ξ^T} along with the approximation 3.11 into equation 3.8 and neglecting the second-order infinitesimal of Δ, we obtain

E{Δ} := E{Δ} − η E{φ(ξ) s^T Δ̃^T + D_{ϕ'} Δ s s̃^T},   (3.12)

where Δ̃ = Ĩ^T Δ and s̃ = Ĩ^T ξ = [2s_1, ..., 2s_{m−n}, s_{m−n+1}, ..., s_n]^T. Since the
disturbance Δ is uncorrelated with the source signals, expression 3.12 under the assumptions 3.4 and 3.5 can be written into six different cases:

Case 1 (C1). 1 ≤ i ≤ m − n:
E{Δ_{i,i}} := [1 − η E{2ϕ'_i s_i² + ϕ_i s_i}] E{Δ_{i,i}} − η E{ϕ_i s_i} E{Δ_{n+i,i}}.

Case 2 (C2). m − n < i ≤ n:
E{Δ_{i,i}} := [1 − η E{ϕ'_i s_i² + ϕ_i s_i}] E{Δ_{i,i}}.

Case 3 (C3). 1 ≤ i ≤ n, 1 ≤ j ≤ m − n, i ≠ j:
E{Δ_{i,j}} := [1 − η E{2ϕ'_i s_j²}] E{Δ_{i,j}} − η E{ϕ_i s_i} E{Δ_{j,i} + Δ_{n+j,i}}.
Case 4 (C4). 1 ≤ i ≤ n, m − n < j ≤ n, i ≠ j:
E{Δ_{i,j}} := [1 − η E{ϕ'_i s_j²}] E{Δ_{i,j}} − η E{ϕ_i s_i} E{Δ_{j,i}}.

Case 5 (C5). n < i ≤ m, 1 ≤ j ≤ m − n:
E{Δ_{i,j}} := [1 − η E{2ϕ'_{i−n} s_j²}] E{Δ_{i,j}} − η E{ϕ_{i−n} s_{i−n}} E{Δ_{j,i−n} + Δ_{n+j,i−n}}.

Case 6 (C6). n < i ≤ m, m − n < j ≤ n:
E{Δ_{i,j}} := [1 − η E{ϕ'_{i−n} s_j²}] E{Δ_{i,j}} − η E{ϕ_{i−n} s_{i−n}} E{Δ_{j,i−n}}.

Now, we study the convergence of each entry of E{Δ}. The diagonal terms E{Δ_{i,i}}, m − n < i ≤ n, behave according to C2, and for stability, the coefficient [1 − η E{ϕ'_i s_i² + ϕ_i s_i}] must lie between 0 and 1. Assuming a sufficiently small positive learning rate η, the condition

E{ϕ'_i s_i² + ϕ_i s_i} > 0   (3.13)
should be met (m − n < i ≤ n). Note that the diagonal terms E{Δ_{i,i}}, 1 ≤ i ≤ m − n, rely not only on themselves but also on the off-diagonal terms E{Δ_{n+i,i}}; thus, they need to be considered in pairs. Based on C1 and C5, the vector V_0 = [Δ_{i,i}, Δ_{n+i,i}]^T evolves by

E{V_0} := E{V_0} − η Λ_0 E{V_0},   (3.14)

where Λ_0 is the matrix

Λ_0 = [ E{2ϕ'_i s_i² + ϕ_i s_i}   E{ϕ_i s_i}
        E{ϕ_i s_i}                E{2ϕ'_i s_i² + ϕ_i s_i} ].

In order to make V_0 converge to a null vector, both eigenvalues of Λ_0 must be positive, which requires the inequalities

E{2ϕ'_i s_i² + ϕ_i s_i} > 0,   (3.15)

E{2ϕ'_i s_i² + ϕ_i s_i} > |E{ϕ_i s_i}|   (3.16)
to hold for 1 ≤ i ≤ m − n. Concerning the off-diagonal terms E{Δ_{i,j}}, 1 ≤ i ≠ j ≤ m − n, they depend on E{Δ_{i,j}}, E{Δ_{j,i}}, E{Δ_{n+i,j}}, and E{Δ_{n+j,i}}, and hence the quadruples in V_1 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+i,j}, Δ_{n+j,i}]^T should be studied. Owing to C3 and C5, a matrix similar to Λ_0 in equation 3.14 can be obtained as

Λ_1 = [ E{2ϕ'_i s_j²}   E{ϕ_i s_i}      0               E{ϕ_i s_i}
        E{ϕ_j s_j}      E{2ϕ'_j s_i²}   E{ϕ_j s_j}      0
        0               E{ϕ_i s_i}      E{2ϕ'_i s_j²}   E{ϕ_i s_i}
        E{ϕ_j s_j}      0               E{ϕ_j s_j}      E{2ϕ'_j s_i²} ],
which is a positive-definite matrix under the following conditions:

E{ϕ'_i s_j²} > 0,   (3.17)

E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j},   (3.18)
where 1 ≤ i ≠ j ≤ m − n. Based on C4, C3, and C6, the triples in V_2 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+i,j}]^T, 1 ≤ i ≤ m − n, m − n < j ≤ n, are associated with the matrix

Λ_2 = [ E{ϕ'_i s_j²}   E{ϕ_i s_i}      0
        E{ϕ_j s_j}     E{2ϕ'_j s_i²}   E{ϕ_j s_j}
        0              E{ϕ_i s_i}      E{ϕ'_i s_j²} ],
and the convergence is guaranteed by

E{ϕ'_i s_j²} > 0,   (3.19)

E{ϕ'_i s_j² + 2ϕ'_j s_i²} > 0,   (3.20)

E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j}.   (3.21)
Analogously, using C3, C4, and C6, the triples in V_3 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+j,i}]^T, m − n < i ≤ n, 1 ≤ j ≤ m − n, are related to the matrix

Λ_3 = [ E{2ϕ'_i s_j²}   E{ϕ_i s_i}     E{ϕ_i s_i}
        E{ϕ_j s_j}      E{ϕ'_j s_i²}   0
        E{ϕ_j s_j}      0              E{ϕ'_j s_i²} ]

and the conditions

E{ϕ'_j s_i²} > 0,   (3.22)

E{2ϕ'_i s_j² + ϕ'_j s_i²} > 0,   (3.23)
E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j}.   (3.24)
When m − n < i ≠ j ≤ n, the pairs in V_4 = [Δ_{i,j}, Δ_{j,i}]^T behave according to C4, yielding

Λ_4 = [ E{ϕ'_i s_j²}   E{ϕ_i s_i}
        E{ϕ_j s_j}     E{ϕ'_j s_i²} ],
and for stability, the requirements below should be satisfied:

E{ϕ'_i s_j² + ϕ'_j s_i²} > 0,   (3.25)

E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j}.   (3.26)
Following procedures similar to those used for the off-diagonal terms E{Δ_{i,j}}, 1 ≤ i ≠ j ≤ m − n, the quadruples in V_5 = [Δ_{i,j}, Δ_{i−n,j}, Δ_{j,i−n}, Δ_{n+j,i−n}]^T, n < i ≤ m, 1 ≤ j ≤ m − n, produce the matrix

Λ_5 = [ E{2ϕ'_{i−n} s_j²}   0                   E{ϕ_{i−n} s_{i−n}}   E{ϕ_{i−n} s_{i−n}}
        0                   E{2ϕ'_{i−n} s_j²}   E{ϕ_{i−n} s_{i−n}}   E{ϕ_{i−n} s_{i−n}}
        E{ϕ_j s_j}          E{ϕ_j s_j}          E{2ϕ'_j s_{i−n}²}    0
        E{ϕ_j s_j}          E{ϕ_j s_j}          0                    E{2ϕ'_j s_{i−n}²} ],
and they vanish if and only if

E{ϕ'_{i−n} s_j²} > 0,   (3.27)

E{ϕ'_j s_{i−n}²} > 0,   (3.28)

2 E{ϕ'_{i−n} s_j²} E{ϕ'_j s_{i−n}²} > E{ϕ_{i−n} s_{i−n}} E{ϕ_j s_j}.   (3.29)
Finally, the triples in V_6 = [Δ_{i,j}, Δ_{i−n,j}, Δ_{j,i−n}]^T, n < i ≤ m, m − n < j ≤ n, are associated with the matrix

Λ_6 = [ E{ϕ'_{i−n} s_j²}   0                  E{ϕ_{i−n} s_{i−n}}
        0                  E{ϕ'_{i−n} s_j²}   E{ϕ_{i−n} s_{i−n}}
        E{ϕ_j s_j}         E{ϕ_j s_j}         E{2ϕ'_j s_{i−n}²} ]
and the conditions

E{ϕ'_{i−n} s_j²} > 0,   (3.30)

E{ϕ'_{i−n} s_j² + 2ϕ'_j s_{i−n}²} > 0,   (3.31)

E{ϕ'_{i−n} s_j²} E{ϕ'_j s_{i−n}²} > E{ϕ_{i−n} s_{i−n}} E{ϕ_j s_j}.   (3.32)
In conclusion, the conditions 3.13 and 3.15 to 3.32 can be summarized as follows:

Z1. 1 ≤ i ≤ m − n: E{2ϕ'_i s_i² + ϕ_i s_i} > E{ϕ_i s_i}, 2E{ϕ'_i s_i²} > |E{ϕ_i s_i}|.
Z2. m − n < i ≤ n: E{ϕ'_i s_i² + ϕ_i s_i} > 0.
Z3. 1 ≤ i ≤ m − n, 1 ≤ j ≤ n: E{ϕ'_i s_j²} > 0.
Z4. 1 ≤ i ≤ m − n, m − n < j ≤ n: E{ϕ'_i s_j² + 2ϕ'_j s_i²} > 0.
Z5. m − n < i ≠ j ≤ n: E{ϕ'_i s_j² + ϕ'_j s_i²} > 0.
Z6. 1 ≤ i ≠ j ≤ n: E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j}.

Obviously, inequalities 3.6 and 3.7 are sufficient conditions for Z1 to Z6. Moreover, the above analysis is based on the special output vector 3.9. Somewhat tedious but almost identical manipulations show that the same conclusion holds for m > 2n and for the general form of the output vector y = Gξ. This completes the proof of theorem 1.

Remark 6. Taking m = n in Z1 to Z6, we obtain the local stability conditions of the complete BSS algorithm, 1.3, given by

E{ϕ'_i s_i² + ϕ_i s_i} > 0, i = 1, ..., n;
E{ϕ'_i s_j² + ϕ'_j s_i²} > 0, 1 ≤ i ≠ j ≤ n;
E{ϕ'_i s_j²} E{ϕ'_j s_i²} > E{ϕ_i s_i} E{ϕ_j s_j}, 1 ≤ i ≠ j ≤ n.   (3.33)
This result can be found, for example, in Amari (1998), Amari et al. (1997, 2000), Cardoso (2000), Ohata and Matsuoka (2002), and von Hoff et al. (2000). By comparison, it can be seen that the local stability conditions, equations 3.6 and 3.7, for the overdetermined BSS algorithm, 3.1, are somewhat stricter than those in equation 3.33 for the complete one, equation 1.3, which agrees with our conjecture.

Remark 7. The activation functions depend on the distributions of the source signals. Among all nonlinearities, the score functions 2.14 are shown to be the most efficient in that they always satisfy the local stability conditions, they work most robustly against outliers, and they also obtain the best quality of the separated source signals (Hyvarinen et al., 2001; Mathis & Douglas, 2002; von Hoff et al., 2000).
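Conditions 3.6 and 3.7 are easy to probe by Monte Carlo sampling. The sketch below computes sampled margins for ϕ(y) = tanh(y) under two source distributions; the distributions, sample size, and helper names are our own illustrative choices. For a Laplacian pair, both margins come out clearly positive, while for a gaussian pair the margin of condition 3.6 is essentially zero (for gaussian sources, Stein's identity forces equality for smooth nonlinearities, echoing remark 8 below).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000

phi = np.tanh
dphi = lambda y: 1.0 - np.tanh(y) ** 2        # first derivative of tanh

def margins(si, sj):
    """Sampled margins of conditions 3.6 and 3.7; both must be positive."""
    m37 = np.mean(dphi(si) * sj ** 2)
    m36 = (m37 * np.mean(dphi(sj) * si ** 2)
           - np.mean(phi(si) * si) * np.mean(phi(sj) * sj))
    return m36, m37

# Unit-power Laplacian pair: tanh satisfies both conditions.
si, sj = rng.laplace(size=(2, N)) / np.sqrt(2.0)
print(margins(si, sj))

# Gaussian pair: the margin of condition 3.6 collapses to about zero.
si, sj = rng.normal(size=(2, N))
print(margins(si, sj))
```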
Remark 8. For two gaussian source signals s_i and s_j, it can be shown that

E{f'_i(s_i) s_j²} E{f'_j(s_j) s_i²} = E{f_i(s_i) s_i} E{f_j(s_j) s_j},   (3.34)

which contradicts the requirement, equation 3.6. Consequently, the assumption A4 that at most one source signal is gaussian is usually made in BSS.

Remark 9. In the proof of theorem 1, it is conditions 3.4 and 3.5, rather than the statistical independence of the source signals, that are used. Utilizing the independence assumption, we may get an expedient rule for the activation functions:

E{ϕ'_i(s_i)} > 0,  E{ϕ'_i(s_i) s_i² + ϕ_i(s_i) s_i} > 0,  E{ϕ'_i(s_i) s_i²} > E{ϕ_i(s_i) s_i},   (3.35)
in which i = 1, ..., n. The conditions in equation 3.35 are stricter than those in theorem 1 and than those in equation 3.33. (See Amari et al., 1997; Cardoso, 2000; Girolami, 1999; Hyvarinen et al., 2001; and Mathis & Douglas, 2002, for some illustrative nonlinearities ϕ_i.)

Remark 10. Theorem 1, as well as the existing stability analyses (Amari et al., 1997, 2000; Amari, 1998; Cardoso & Laheld, 1996; Cardoso, 2000; Ohata & Matsuoka, 2002; von Hoff et al., 2000), addresses only the noiseless case. When the observations are polluted by additive noises, a rigorous stability analysis for BSS seems very complicated, perhaps impossible. Here, we consider the special case mentioned in remark 4, where the additive noise vector is v_t = Ā v̄_t, so that the noisy observation vector x_t = A s_t + v_t reduces to a noiseless one, x_t = Ã s̃_t, with Ã = [A, Ā] and s̃_t = [s_t^T, v̄_t^T]^T. This is a complete BSS model, and by equation 3.33, the stability conditions depend not only on the distributions of the source signals in s_t but also on those of the noise signals in v̄_t. In other words, to analyze the stability of a noisy BSS algorithm, we must take into account the properties of the additive noises, which is impractical in many applications. Therefore, most algorithms for BSS assume that each additive noise is not too large (otherwise the algorithms would degrade severely or even not work at all; Cichocki et al., 1997), and the activation functions are generally selected according to equation 3.35. In any event, the stability issue of a noisy BSS algorithm deserves further study.

4 Conclusion

In this letter, we defined a generalized contrast function that takes the existing classical contrast and nonsymmetrical contrast as its two special cases. It can handle the general blind separation problem where there are more
mixtures than sources and, at the same time, the source number is unknown. Then, by means of the classical and the generalized mutual information contrast functions, we justified the finding that the natural gradient algorithm can be employed to perform complete or overdetermined BSS whether the source number n is known or not, simply by adjusting the dimension of the separating matrix (n × n, n × m, or m × m). Finally, we analyzed the local stability of the generalized orthogonal natural gradient algorithm (Ye et al., 2004) and obtained the expected result that the nonlinear activation functions of the overdetermined BSS algorithm are specified by somewhat stricter conditions than those of the complete BSS algorithm.

Acknowledgments

We are grateful to the anonymous reviewers and Terrence J. Sejnowski for their valuable comments and suggestions that helped to greatly improve the quality and clarity of the presentation. We also thank the anonymous reviewers for bringing our attention to Cichocki et al. (1994, 1997). This work was supported by the Major Program of the National Natural Science Foundation of China under grant 60496311 and by the Chinese Postdoctoral Science Foundation under grant 2004035061.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S., Chen, T. P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S., Chen, T. P., & Cichocki, A. (2000). Nonholonomic orthogonal learning algorithms for blind source separation. Neural Computation, 12(6), 1463–1484.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing: Neural network approaches. Proceedings of the IEEE, 86(10), 2026–2048.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Cao, X. R., & Liu, R. W. (1996). A general approach to blind source separation. IEEE Trans. on Signal Processing, 44(3), 562–571.
Cardoso, J. F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10), 2009–2025.
Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J. F. (2000). On the stability of source separation algorithms. Journal of VLSI Signal Processing, 26(1), 7–14.
Cardoso, J. F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3029.
Cichocki, A., Karhunen, J., Kasprzak, W., & Vigario, R. (1999). Neural networks for blind separation with unknown number of sources. Neurocomputing, 24(1), 55–93.
Cichocki, A., Sabala, I., Choi, S., Orsier, B., & Szupiluk, R. (1997). Self-adaptive independent component analysis for sub-gaussian and super-gaussian mixtures with an unknown number of sources and additive noise. International Symposium on Nonlinear Theory and Its Applications, 2, 731–734.
Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386–1387.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45(1), 59–83.
Douglas, S. C. (2002). Simple algorithms for decorrelation-based blind source separation. IEEE Workshop on Neural Networks for Signal Processing, 12, 545–554.
Girolami, M. (1999). Self-organising neural networks: Independent component analysis and blind source separation. London: Springer-Verlag.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Karhunen, J., Pajunen, P., & Oja, E. (1998). The nonlinear PCA criterion in blind source separation: Relations with other approaches. Neurocomputing, 22(1), 5–20.
Lewicki, M., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Li, Y. Q., & Wang, J. (2002). Sequential blind extraction of instantaneously mixed sources. IEEE Trans. on Signal Processing, 50(5), 997–1006.
Mathis, H., & Douglas, S. C. (2002). On the existence of universal nonlinearities for blind source separation. IEEE Trans. on Signal Processing, 50(5), 1007–1016.
Moreau, E., & Macchi, O. (1996). High-order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Moreau, E., & Thirion-Moreau, N. (1999). Nonsymmetrical contrasts for source separation. IEEE Trans. on Signal Processing, 47(8), 2241–2252.
Ohata, M., & Matsuoka, K. (2002). Stability analysis of information-theoretic blind separation algorithms in the case where the sources are nonlinear processes. IEEE Trans. on Signal Processing, 50(1), 69–77.
Oja, E. (1997). The nonlinear PCA learning rule and signal separation: Mathematical analysis. Neurocomputing, 17(1), 25–46.
Pham, D. T. (2002). Mutual information approach to blind separation of stationary sources. IEEE Trans. on Information Theory, 48(7), 1935–1946.
Pham, D. T., & Cardoso, J. F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. on Signal Processing, 49(9), 1837–1848.
Thawonmas, R., Cichocki, A., & Amari, S. (1998). A cascade neural network for blind extraction without spurious equilibria. IEICE Trans. on Fundamentals of Electronics, Communications, and Computer Science, E81-A(9), 1833–1846.
Tong, L., Liu, R., Soon, V. C., & Huang, Y. F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38(5), 499–509.
von Hoff, T. P., Lindgren, A. G., & Kaelin, A. N. (2000). Transpose properties in the stability and performance of the classical adaptive algorithms for blind source separation and deconvolution. Signal Processing, 80(9), 1807–1822.
Yang, H. H., & Amari, S. (1997). Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(5), 1457–1482.
Ye, J. M., Zhu, X. L., & Zhang, X. D. (2004). Adaptive blind separation with an unknown number of sources. Neural Computation, 16(8), 1641–1660.
Zhang, L. Q., Cichocki, A., & Amari, S. (1999). Natural gradient algorithm for blind separation of overdetermined mixtures with additive noise. IEEE Signal Processing Letters, 6(11), 293–295.
Zhu, X. L., & Zhang, X. D. (2002). Adaptive RLS algorithm for blind source separation using a natural gradient. IEEE Signal Processing Letters, 9(12), 432–435.
Zhu, X. L., & Zhang, X. D. (2004). Overdetermined blind source separation based on singular value decomposition. Journal of Electronics and Information Technology, 26(3), 337–343. (In Chinese)
Received June 18, 2004; accepted August 10, 2005.
LETTER
Communicated by Nicholas Hatsopoulos
Connection and Coordination: The Interplay Between Architecture and Dynamics in Evolved Model Pattern Generators

Sean Psujek
[email protected] Department of Biology, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
Jeffrey Ames
[email protected] Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
Randall D. Beer
[email protected] Department of Biology and Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
We undertake a systematic study of the role of neural architecture in shaping the dynamics of evolved model pattern generators for a walking task. First, we consider the minimum number of connections necessary to achieve high performance on this task. Next, we identify architectural motifs associated with high fitness. We then examine how high-fitness architectures differ in their ability to evolve. Finally, we demonstrate the existence of distinct parameter subgroups in some architectures and show that these subgroups are characterized by differences in neuron excitabilities and connection signs.

1 Introduction

From molecules to cells to animals to ecosystems, biological systems are typically composed of large numbers of heterogeneous nonlinear dynamical elements densely interconnected in specific networks. Understanding such systems necessarily involves understanding not only the dynamics of their elements, but also their architecture of interconnection. Interest in the role of network architecture in complex systems has been steadily growing for several years, with work on a diverse range of systems, including genetic networks, metabolic networks, signaling networks, nervous systems, food webs, social networks, and the Internet (Watts & Strogatz, 1998; Jeong, Tombor, Albert, Oltvai, & Barabási, 2000; Strogatz, 2001; Guelzim, Bottani,
Bourgine, & Képès, 2002; Garlaschelli, Caldarelli, & Pietronero, 2003; Newman, 2003). Most recent research on complex networks has focused primarily on structural questions. For example, studies of a wide variety of naturally occurring networks have found that small-world structures are common (Watts & Strogatz, 1998). Structural questions have also been a major concern in neuroscience (van Essen, Anderson, & Felleman, 1992; Sporns, Tononi, & Edelman, 2000; Braitenberg, 2001). In addition, research on the dynamics of network growth has begun to provide insight into how observed network structures might arise. For example, preferential attachment of new links during network growth can produce scale-free network architectures (Barabási & Albert, 1999).

An equally important but less well-studied aspect of complex networks is how network architecture shapes the dynamics of the elements it interconnects. For example, some architectures lend robustness to perturbations of both parameters and topology, while others do not (Albert, Jeong, & Barabási, 2000; Stelling, Klamt, Bettenbrock, Schuster, & Gilles, 2002). Again, the influence of circuit architecture on neural activity has long been a major concern in neuroscience, especially in the invertebrate pattern generation community, where detailed cellular and synaptic data are sometimes available (Getting, 1989; Marder & Calabrese, 1996; Roberts, 1998). However, while a great deal of work has been done on nonlinear oscillators coupled in regular patterns (Pikovsky, Rosenblum, & Kurths, 2001), there has been very little research on nonlinear dynamical systems connected in irregular but nonrandom patterns. Yet, arguably, this is the case most relevant to biological systems.

In this article, we undertake a systematic study of the role of network architecture in shaping the dynamics of evolved model pattern-generation circuits for walking (Beer & Gallagher, 1992). While simple, this walking task raises a number of interesting coordination issues and has been extensively analyzed (Chiel, Beer, & Gallagher, 1999; Beer, Chiel, & Gallagher, 1999), providing a solid foundation for detailed studies of the interplay between architecture and dynamics. We first consider the minimum number of connections necessary to achieve high performance on this task. Next, we identify architectural motifs that are associated with high fitness and study the impact of architecture on evolvability. Finally, we demonstrate the existence of distinct parameter subgroups in some architectures and show that these subgroups are characterized by differences in neuron excitabilities and connection signs.

2 Methods

We examined the effect of architecture on the evolution of central pattern generators for walking in a simple legged body (Beer & Gallagher, 1992). The body consisted of a single leg with a joint actuated by two opposing swing
“muscles” and a foot. When the foot was “down,” any torque produced by the muscles served to translate the body under Newtonian mechanics. When the foot was “up,” any torque produced by the muscles served to swing the leg relative to the body. Details of the body model and its analysis can be found in Beer et al. (1999). This leg was controlled by a continuous-time recurrent neural network (CTRNN):

τ_i ẏ_i = −y_i + Σ_{j=1}^{N} w_{ji} σ(y_j + θ_j),  i = 1, ..., N,
where y_i is the state of the ith neuron, ẏ_i denotes the time rate of change of this state, τ_i is the neuron's membrane time constant, w_{ji} is the strength of the connection from the jth to the ith neuron, θ_i is a bias term, and σ(x) = 1/(1 + e^{−x}) is the standard logistic output function. We interpret a self-connection w_{ii} as a simple nonlinear active conductance rather than as a literal connection. We focus here on three-, four-, and five-neuron CTRNNs. Three of these neurons are always motor neurons that control the two opposing muscles of the leg (labeled BS for backward swing and FS for forward swing) and the foot (labeled FT), while any additional neurons are interneurons (labeled INTn) with no preassigned function.

A real-valued genetic algorithm was used to evolve CTRNN parameters. A population of 100 individuals was maintained, with each individual encoded as a vector of N² + 2N real numbers (N time constants, N biases, and N² connection weights). Elitist selection was used to preserve the best individual each generation, whereas the remaining children were generated by mutation of selected parents. Individuals were selected for mutation using a linear rank-based method, with the best individual producing an average of 1.1 offspring. A selected parent was mutated by adding to it a random displacement vector with uniformly distributed direction and normally distributed magnitude (Bäck, 1996). The mutation magnitude had zero mean and a variance of 0.5. Searches were run for 250 generations. Connection weights and biases were constrained to lie in the range ±16, while time constants were constrained to the range [0.5, 10].

The walking performance measure optimized by the genetic algorithm was the average forward velocity of the body. This average velocity was computed in two ways. During evolution, truncated fitness was evaluated by integrating the model for 220 time units using the forward Euler method with a step size of 0.1 and then computing the average velocity (total forward distance covered in 220 time units divided by 220). During analysis, asymptotic fitness was evaluated by integrating the model for 1000 time units to skip transients and then computing its average velocity for one stepping period (with a fitness of 0 assigned to nonoscillatory circuits). Although asymptotic fitness more accurately describes the long-term
performance of a circuit, truncated fitness is much less expensive to compute during evolutionary searches. In both cases, the highest average velocity achievable is known to be 0.627 from a previous analysis of the optimal controller for this task and body model (Beer et al., 1999). The best truncated fitness that can be achieved by a nonoscillatory circuit (which takes only a single step) is also known to be 0.125.

We define an architecture to be a set of directed connections between neurons. Since we do not consider self-connections to be part of an architecture, there are N² − N possible interconnections in an N-neuron circuit, and thus 2^{N²−N} possible architectures. However, not all of these architectures are unique. When more than one interneuron is present, all permutations of interneuron labels that leave an architecture invariant should be counted only once since, unlike motor neurons, the interneuron labels are arbitrary. Counting the number of unique N-neuron architectures is an instance of the partially labeled graph isomorphism problem, which can be solved using Pólya's enumeration theorem (Harary, 1972). We found that there are 64 distinct three-neuron architectures, 4096 distinct four-neuron architectures, and 528,384 distinct five-neuron architectures. Details of the Pólya theory calculations for the five-neuron case can be found in Ames (2003).

The studies described in this article are based on the results of evolutionary searches on a large sample of different three-, four-, and five-neuron architectures. In the three- and four-neuron cases, the samples were exhaustive. We ran all 64 three-neuron architectures (300 random seeds each) and all 4096 four-neuron architectures (200 random seeds each). Because the number of five-neuron architectures was so large, we ran only a sample of 5000 five-neuron architectures (100 random seeds each). Thus, 1,338,400 evolutionary searches were run to form our baseline data set. A total of 850,900 additional experiments were run as described below to augment this baseline data set when necessary.
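For concreteness, a minimal sketch of the circuit model and integration scheme just described follows: forward Euler with step size 0.1 over 220 time units, with parameters drawn from the stated ranges. The body mechanics and fitness evaluation of Beer et al. (1999) are not reproduced, so only the neural dynamics are shown, and the random parameter draw is illustrative rather than an evolved solution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(x):
    """Standard logistic output function."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_ctrnn(w, theta, tau, y0, dt=0.1, steps=2200):
    """Forward Euler integration of the CTRNN equation above.

    w[j, i] is the strength of the connection from neuron j to neuron i.
    Returns the output trajectory with shape (steps, N).
    """
    y = y0.astype(float).copy()
    outputs = np.empty((steps, len(y)))
    for k in range(steps):
        outputs[k] = sigma(y + theta)
        # tau_i dy_i/dt = -y_i + sum_j w_ji * sigma(y_j + theta_j)
        y += dt * (-y + outputs[k] @ w) / tau
    return outputs

N = 3                                           # BS, FS, and FT motor neurons
w = rng.uniform(-16.0, 16.0, size=(N, N))       # weights within the evolved range
theta = rng.uniform(-16.0, 16.0, size=N)        # biases
tau = rng.uniform(0.5, 10.0, size=N)            # time constants
print(simulate_ctrnn(w, theta, tau, np.zeros(N))[-5:])
```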
3 A Small Number of Connections Suffice

How many connections are required to achieve high performance on the walking task? In the absence of any architectural constraints, a common simplification is to use fully interconnected networks because they contain all possible architectures as subcircuits. However, the number of connections between N neurons has been observed to scale roughly linearly in mammals (Stevens, 1989), much slower than the O(N²) scaling produced by full interconnection. Thus, our first task was to characterize the relationship between the number of connections and the best attainable fitness. There are N² − N = 6, 12, and 20 possible connections for three-, four-, and five-neuron circuits, respectively.
Figure 1: Maximum asymptotic fitness obtained by all evolutionary searches in our baseline data set as a function of the number of connections by three-neuron (dashed), four-neuron (gray), and five-neuron (black) circuits.
A uniform sample of architectures leads to a nonuniform sample of the number of connections because there are (N² − N choose C) architectures with C connections (0 ≤ C ≤ N² − N). Thus, most architectures of an N-neuron circuit have close to (N² − N)/2 connections. Because of the binomial distribution of architectures having a given number of connections, our sample of 5000 five-neuron architectures contained very few architectures with few or many connections. In order to compensate for this bias, we augmented our baseline data set with 732,300 additional five-neuron experiments that exhaustively covered the five-neuron architectures having 0 to 5 and 18 to 20 connections.
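The binomial counts underlying this sampling bias are simple to tabulate; the short sketch below prints, for each circuit size, the number of labeled architectures at each connection count (the loop and labels are our own).

```python
from math import comb

# The number of N-neuron architectures with exactly C connections is
# binomial(N^2 - N, C), which peaks near (N^2 - N) / 2 connections.
for N in (3, 4, 5):
    E = N * N - N                     # possible connections, no self-connections
    print(N, [comb(E, C) for C in range(E + 1)])
```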
Figure 1 plots the very best asymptotic fitness obtained for three- (dashed line), four- (gray line), and five-neuron (solid line) architectures as a function of the number of connections. Regardless of the number of neurons, circuits with fewer than two connections have essentially zero fitness, while circuits with more than two connections have high fitness. The reason for this difference is that it takes at least two connections to link three motor neurons, and it takes at least three connections to form an oscillator involving all three motor neurons. Most interesting, although there is an increase in best fitness with larger numbers of connections in four- and five-neuron circuits, the additional benefit has saturated by about five connections. Thus, circuits far sparser than fully interconnected ones are sufficient to achieve high performance on the walking task.

4 Architectural Motifs Predict Performance

Which architectures perform well on the walking task, and what particular connectivity features predict the best fitness that an architecture can achieve? There is growing evidence for recurring network motifs in biological networks, leading to the hope that general structural design principles may exist (Milo et al., 2002). In order to explore the existence of architectural motifs in our model and their correlation with fitness, we analyzed our three-neuron data in detail and then tested the patterns we found against our four- and five-neuron data.

If we plot the best asymptotic fitness obtained over all runs of each three-neuron architecture (see Figure 2A), the data clearly fall into three distinct fitness groups, with wide gaps between them. This suggests that architecture strongly constrains the maximum achievable fitness of a circuit and that three separate classes of architectures may exist. Behaviorally, architectures from the low-fitness group (29/64) produced at most a single step. Architectures from the middle-fitness group (8/64) stepped rhythmically, but either the swing or stance phase of the motion was very slow. Closer inspection revealed that one of the swing motor neurons always adopted a fixed output, while the foot and the other swing motor neuron rhythmically oscillated. Interestingly, the constant outputs adopted by each swing motor neuron were consistently distinct in the best circuits in this group. When BS was the constant-output motor neuron, the output it adopted was always around 0.7. In contrast, when FS was the constant-output motor neuron, it adopted an output value around 0.3. Finally, architectures from the high-fitness group (27/64) exhibited fast rhythmic stepping.

What architectural features characterize these three fitness groups? An example three-neuron architecture from each group is shown in the left column of Figure 3.
Figure 2: Fitness classification of architectural motifs. In all cases, the horizontal axis represents an arbitrary architecture label. (A) The maximum asymptotic fitnesses obtained by all evolutionary searches with each of the 64 possible three-neuron architectures (black points) fall into three distinct fitness groups (indicated by gray rectangles). Architectures can independently be classified by their connectivity patterns (labeled class 1, class 2, and class 3). Note that architecture class strongly predicts fitness group in three-neuron circuits. (B) Maximum asymptotic fitness for all four-neuron baseline searches, with the data classified as class 1 (black points), class 2 (gray points), or class 3 (crosses) based on the connectivity pattern of each architecture. The dashed line indicates the maximum fitness obtainable given the strategy used by the best class 2 architectures. (C) Maximum asymptotic fitness for all five-neuron baseline searches classified by connectivity pattern.
Figure 3: Three-, four-, and five-neuron examples of the three architecture classes identified in Figure 2.
Architectures in the low-fitness group lack feedback loops that link foot and swing motor neurons. Because they cannot achieve oscillatory activity involving both the foot and a swing motor neuron, these circuits are unable to produce rhythmic stepping. Architectures in the middle-fitness group possess feedback loops between the foot and one of the swing motor neurons, but these feedback loops do not drive the other swing motor neuron. Thus, these circuits can oscillate, but one direction of leg motion is always slowed by constant activity in the opposing swing motor neuron.
Table 1: Definition of the Three Architecture Classes.

CD(FT)   CD(BS)   CD(FS)   Class   Fitness
T        T        T        1       High
T        T        F        2       Medium
T        F        T        2       Medium
T        F        F        3       Low
F        T        T        3       Low
F        T        F        3       Low
F        F        T        3       Low
F        F        F        3       Low

Note: The predicate CycleDriven(m) has been abbreviated to CD(m). T = true; F = false.
Architectures in the high-fitness group contain feedback loops that either involve or drive all three motor neurons. This pattern of feedback allows these circuits to produce coordinated oscillations in all three motor neurons.

These results suggest that neural circuits can be partitioned into distinct classes based solely on their connectivity and that these architecture classes might strongly predict the best obtainable fitness. In order to test the generality of these predictions in larger circuits, we must first state the definition of each architecture class precisely and in such a way that it can be applied to circuits with interneurons. Let the predicate CycleDriven(m) be true of a motor neuron m in a particular architecture if and only if m either participates in or is driven by a feedback loop in that architecture. Since we have three motor neurons, there are eight possibilities, which are classified according to the architectural patterns observed above (see Table 1). By definition, classes 1, 2, and 3 for three-neuron circuits are fully consistent with the high-, middle-, and low-fitness groups shown in Figure 2A, respectively. Examples of each of the three classes for four-neuron and five-neuron circuits are shown in Figure 3.

We next tested the ability of this architecture classification to predict best asymptotic fitness in our four- and five-neuron data sets (see Figures 2B and 2C, respectively). In the four-neuron circuits, 2617/4096 architectures were class 1, 528/4096 were class 2, and 951/4096 were class 3. In the five-neuron circuits, 3991/5000 were class 1, 488/5000 were class 2, and 521/5000 were class 3. In both cases, class 1 (black points), class 2 (gray points), and class 3 (crosses) were strongly correlated with high, middle, and low fitness, respectively. However, unlike in the three-neuron case, there was some fitness overlap between class 1 and class 2 architectures (12/4096 for four-neuron circuits and 37/5000 for five-neuron circuits) and a small amount of fitness overlap between class 2 and class 3 architectures in the five-neuron case (2/5000).

We hypothesized that this overlap was caused by an insufficient number of searches for these architectures, so that these architectures had not yet
achieved their best attainable fitness. To test this hypothesis, we performed additional searches on all overlap architectures. As a control, we also ran the same number of additional searches for two class 2 architectures with comparable fitness for each class 1 overlap architecture and for two class 3 architectures for each class 2 overlap architecture. We ran 43,500 additional experiments in the four-neuron case and 75,100 additional experiments in the five-neuron case. After these additional experiments, only three class 1 overlap architectures remained in the four-neuron case (see Figure 4A), and only two class 1 overlap architectures and one class 2 overlap architecture remained in the five-neuron case (see Figure 4B). Interestingly, all remaining overlap architectures contained independent cycles, in which subgroups of motor neurons were driven by separate feedback loops (see Figure 4C). Even if the oscillations produced by independent cycles are individually appropriate for walking, they will not in general be properly coordinated unless their initial conditions are precisely set. However, this cannot be done stably unless some other source of coupling is present, such as shared sensory feedback or mechanical coupling of the relevant body degrees of freedom.

Independent cycle architectures aside, the fitness separation between class 2 and class 3 architectures is quite large. However, the boundary between class 1 and class 2 architectures is very sharp, occurring at a fitness value of around 0.47. What defines this fitness boundary, and how can we calculate its exact value? As noted above, one swing neuron always has a constant output in class 2 architectures. By repeating the optimal fitness calculations described in appendix A of Beer et al. (1999) with the constraint that either BS or FS must be held constant, we obtain expressions for the optimal fitness achievable as a function of these constant values:

V*(BS) = (55/2) [ 85/√(6 BS) + 4√(5π/3)/√(1 − BS) ]^{−1},

V*(FS) = 165 √(FS(1 − FS)) / [ 85√(6 FS) + 8√(15π(1 − FS)) ].
These expressions can be maximized exactly, but it is sufficient for our purposes here to do so only numerically. We find that when BS is the constant motor neuron, the highest fitness is achieved at BS* ≈ 0.709. In contrast, when FS is held constant, the highest fitness is achieved at FS* ≈ 0.291. Note that these values correspond closely to the constant outputs observed in the best-evolved class 2 circuits. The maximum fitnesses are the same in both cases: V*(FS*) = V*(BS*) ≈ 0.473, which is very close to the observed boundary between class 1 and class 2 (dashed lines in Figure 2). This value serves as an upper bound for the best fitness achievable by a class 2 architecture.
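Either expression can be maximized numerically in a few lines. The sketch below uses scipy with a placeholder V, which should be replaced by the closed-form expression of interest; the placeholder shown is not one of them.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def V(c):
        # placeholder; substitute V*(BS) or V*(FS) from the text here
        return np.sqrt(c * (1.0 - c))

    res = minimize_scalar(lambda c: -V(c), bounds=(1e-6, 1.0 - 1e-6), method="bounded")
    print(res.x, -res.fun)  # location and value of the maximum on (0, 1)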
Figure 4: Investigation of the fitness overlaps between architecture classes in Figures 2B and 2C. (A) Data obtained from additional evolutionary searches with the class 1 overlap architectures from Figure 2B and class 2 architectures of comparable fitness. Note that only three class 1 overlap architectures remain. (B) Data obtained from additional evolutionary searches with the class 1 and class 2 overlap architectures from Figure 2C and, respectively, class 2 and class 3 architectures of comparable fitness. Note that only two class 1 and one class 2 overlap architectures remain. (C) Examples of overlap architectures from A and B. (Left) A class 1 four-neuron independent cycles architecture from A whose best fitness lies in the range characteristic of class 2 architectures. Note that the BS and FS motor neurons occur in separate cycles. (Middle) A class 1 five-neuron independent cycles architecture from B whose best fitness lies in the range characteristic of class 2. Note that BS and FS occur in separate cycles. (Right) A class 2 five-neuron independent cycles architecture from B whose best fitness lies in the range characteristic of class 3. Note that the foot motor neuron FT occurs in a cycle separate from both BS and FS.
5 Architecture Influences Evolvability

Are high-fitness circuits easier to evolve with some architectures than others? The best fitness obtained over a set of evolutionary searches provides a lower bound on the maximum locomotion performance that can be achieved with a given architecture. In contrast, the average of the best fitness obtained in each of a set of searches provides information about the difficulty of finding high-performance circuits with that architecture through evolutionary search. The lower this average is relative to the best fitness achievable with a given architecture, the less frequently evolutionary runs with that architecture attain high fitness, and thus the more difficult that architecture is to evolve.
In order to examine the impact of architecture on evolvability, we examined scatter plots of best and average asymptotic fitness for all five-neuron circuit architectures that we evolved (see Figure 5A). Qualitatively identical results were obtained for the three- and four-neuron circuit architectures when using average or median fitness as a surrogate for evolvability. In this plot, the three architecture classes described in the previous section are apparent along the best fitness (horizontal) axis, but no such groupings exist along the average fitness (vertical) axis. Instead, for any given best fitness, there is a range of average fitnesses. This suggests that architectures with the same best achievable fitness can differ significantly in their evolvability. Interestingly, the spread of average fitnesses increases with best fitness, so that the largest range of evolvability occurs for the best architectures in each architecture class. We will focus on the class 1 architectures with the highest best fitness.
In order to characterize these differences in evolvability, two subpopulations of five-neuron architectures whose best fitness was greater than 0.6 were studied. The high-evolvability subgroup had average fitnesses that were greater than 0.38 (N = 39), while the low-evolvability subgroup had average fitnesses that were less than 0.1 (N = 34). These subgroups are indicated by light gray rectangles in Figure 5A.
Figure 5: An analysis of evolvability. (A) A scatter plot of average and best fitness for all five-neuron evolutionary searches in our baseline data set. Architectures of class 1, 2, and 3 are represented as black points, gray points, and crosses, respectively, as in Figure 2. The gray rectangles indicate two subgroups of high best fitness architectures with high- (upper rectangle) and low- (lower rectangle) evolvability. (B) Mean truncated fitness distributions for the high- (black) and low- (gray) evolvability subgroups from A based on 10^6 parameter samples for each architecture. Note that the vertical scale is logarithmic. (C) Fraction (mean ± SD) of parameter space samples whose truncated fitness is greater than 0.125 for low- (gray bars) and high- (white bars) evolvability subgroups of three-, four-, and five-neuron architectures.
Using 10^6 random samples from the parameter spaces of each architecture, we computed the mean truncated fitness distribution for each subgroup of architectures (see Figure 5B), with the high-evolvability subgroup distribution denoted by a black line and the low-evolvability subgroup distribution denoted by a gray line. Truncated rather than asymptotic fitness is the appropriate measure here because it is the one used during evolution.
These fitness distributions exhibit several interesting features. The fraction of samples below a truncated fitness of 0.125 is several orders of magnitude larger than the fraction above 0.125. This reflects the extreme scarcity of high-fitness oscillatory behavior in the parameter spaces of architectures in both subgroups. Below 0.125, the distributions are nearly identical for the two subgroups, with strong peaks at 0 (no steps) and 0.125 (a single step). However, above 0.125, the fitness distributions of the two subgroups exhibit a clear difference. While both the low- and high-evolvability subgroups follow power law distributions within this range (with exponents of −2.23 and −3.28, respectively), a larger fraction of the parameter spaces of the high-evolvability architectures clearly have a fitness greater than 0.125. This suggests that the difference in evolvability between the two subgroups is due primarily to differences in the probability of finding an oscillatory circuit whose fitness is higher than that of a single stepper. Plots of the mean fraction of parameter space volume with truncated fitness greater than 0.125 (see Figure 5C) demonstrate that this conclusion holds not only for five-neuron circuits, but also for analogous subgroups of low- and high-evolvability three-neuron and four-neuron architectures. Using a two-sample Kolmogorov-Smirnov test, the four- and five-neuron differences are highly significant (p < 0.00001), while the significance of the three-neuron difference is borderline (p < 0.07).
Ultimately, we would like to relate the observed differences in evolvability to particular architectural features, as we did for best fitness in section 4. Although we found some evidence for correlations between evolvability and the presence of particular feedback loops (Ames, 2003), none of these correlations was particularly strong. The best predictor of evolvability that we found was the fraction of an architecture's parameter space with fitness greater than 0.125.
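That fraction can be estimated by straightforward Monte Carlo sampling. In the sketch below, truncated_fitness is an assumed user-supplied evaluator of a circuit with a given architecture and parameter vector; the uniform sampling bounds are illustrative, while the sample count follows the 10^6 samples per architecture used in the text.

    import numpy as np

    def fraction_above(arch, truncated_fitness, n_params,
                       lo=-16.0, hi=16.0, n_samples=10**6, threshold=0.125):
        # lo/hi are assumed parameter bounds, not the study's actual ranges
        rng = np.random.default_rng(0)
        hits = 0
        for _ in range(n_samples):
            theta = rng.uniform(lo, hi, size=n_params)
            if truncated_fitness(arch, theta) > threshold:
                hits += 1
        return hits / n_samples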
6 Beyond Architecture

The results clearly demonstrate that circuit architecture plays a major role in determining both the maximum attainable performance and the evolvability of model pattern generators. Of course, architecture alone is insufficient to completely specify circuit dynamics. Ultimately, we would like to refine our architectural classification with quantitative information. Are there patterns in the best parameter sets discovered by multiple evolutionary searches with a given architecture?
To begin to explore this question, we studied one of the highest-performing three-neuron architectures: a class 1 architecture consisting of a counterclockwise ring of connections among the three motor neurons. We performed a principal component analysis of the parameter sets of all evolutionary runs with this architecture whose best truncated fitness exceeded 0.5 (90/300). The first two principal components are plotted in Figure 6. Clearly, the evolved parameter sets are neither identical nor randomly distributed. Instead, they fall into two distinct clusters.
What network features underlie these parameter clusters? Computing the means of the circuit parameters for each cluster separately reveals that they correspond to distinct sign patterns. Circuits in the left cluster have three intrinsically active neurons arranged in an inhibitory ring oscillator. In contrast, circuits in the right cluster have one intrinsically active and two intrinsically inactive neurons arranged in a mixed excitatory/inhibitory ring oscillator. In addition, the sign of the self-weight of FT changes from negative to positive between the left and right clusters, and the self-weight sign of BS changes from positive to negative. This suggests that neuron excitabilities and connection signs may represent an intermediate level of analysis between connectivity and raw parameter values.
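The projection in Figure 6 is a standard principal component analysis. A minimal sketch (not the authors' code), assuming the evolved parameter sets are stacked in an array params of shape (n_runs, n_params):

    import numpy as np

    def pca_2d(params):
        centered = params - params.mean(axis=0)
        # SVD of the centered data gives principal directions in rows of Vt
        U, S, Vt = np.linalg.svd(centered, full_matrices=False)
        scores = centered @ Vt[:2].T              # coordinates on PCA1, PCA2
        var_explained = S**2 / np.sum(S**2)       # fraction of variance per PC
        return scores, var_explained[:2]

Cluster membership can then be read off the PCA1 coordinate, along which the two subgroups in Figure 6 separate, and cluster means of the raw parameters give the sign patterns described above.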
Figure 6: Two variants of the three-neuron architecture consisting of a counterclockwise ring. A principal components analysis of the parameters of evolved circuits whose best truncated fitness exceeds 0.5 reveals two subgroups corresponding to distinct neuron excitability and connection sign patterns. Here inhibitory connections are denoted by a filled circle, excitatory connections are denoted by a short line, and neurons are shaded according to whether they are intrinsically active (white) or inactive (black). These two principal components account for 87.6% of the variance (78.8% for PCA1 and 8.8% for PCA2).
7 Discussion

Complex networks are ubiquitous in the biological world, and understanding the dynamics of such networks is arguably one of the most important theoretical obstacles to progress in many subdisciplines of biology. Most research on networks has focused on either structural questions that largely ignore node dynamics or network dynamics questions that assume a regular or random connection topology. However, realistic networks have both nontrivial node dynamics and specific but irregular connection topologies (Strogatz, 2001).
As a first step in this direction, we have systematically studied the impact of neural architecture on walking performance in a large population of evolved model pattern generators for walking. Specifically, we have shown that a small number of connections is sufficient to achieve high fitness on this task, characterized the correlation between architectural motifs and fitness, explored the impact of architecture on evolvability, and demonstrated the existence of parameter subgroups with distinct neuron excitabilities and connection signs. These results lay the essential groundwork for a more detailed analysis of the interplay between architecture and dynamics.
We have explained the observed correlations between architecture and best fitness in terms of the structure of feedback loops in the circuits, while the relationship between architecture and evolvability was explained in terms of the fraction of an architecture's parameter space that contains oscillatory dynamics whose fitness is greater than that obtainable by nonoscillatory circuits. However, several questions remain. How do different architectures differ in their dynamical operation? Which excitability and sign variants of a given architecture can achieve high fitness? What underlies the fitness differences between architectures within a class? What architectural properties produce the parameter space differences responsible for the observed differences in evolvability? Ultimately such questions can be answered only by detailed studies of particular circuits in our existing data set (Beer, 1995; Chiel et al., 1999; Beer et al., 1999).
There has been a great deal of interest in the use of evolutionary algorithms to evolve not only neural parameters but also neural architecture (Angeline, Saunders, & Pollack, 1994; Yao, 1999; Stanley & Miikkulainen, 2002). However, this previous work provides little understanding as to why a particular architecture is chosen for a given problem, or how the structure of the space of architectures biases an evolutionary search. Our approach is complementary. While it is obviously impractical to evaluate all possible architectures on a given task, a systematic study such as ours can provide a foundation for the analysis of architectural evolution. Given our substantial data on the best architectures and their evolvability for the walking task, it could serve as an interesting benchmark for comparing different architecture evolution algorithms and analyzing their behavior. In fact, several such algorithms have already been applied to a multilegged version of exactly this walking task (Gruau, 1995; Kodjabachian & Meyer, 1998).
While the questions explored in this article are general ones, the importance of the particular feedback structures we have described is obviously specific to our walking task. Likewise, our evolvability results depend on the structure of the fitness space of each architecture, which in turn depends on the particular neural and mechanical models we chose and the performance measure we used. Examining a wider variety of neural models and tasks will be necessary to identify any general principles that might exist. As we have done here, it will be important for such studies to examine large populations of circuits, so that trends can be identified. In addition, the development of more powerful mathematical tools for studying the interplay of architecture and dynamics is essential. One promising development along these lines is the recent application of symmetry groupoid methods to analyze the constraints that network topology imposes on network dynamics (Stewart, Golubitsky, & Pivato, 2003).

Acknowledgments

We thank Hillel Chiel and Chad Seys for their feedback on an earlier draft of this article. This research was supported in part by NSF grant EIA-0130773 and in part by an NSF IGERT fellowship.

References

Albert, R., Jeong, H., & Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378–382.
Ames, J. C. (2003). Design methods for pattern generation circuits. Master's thesis, Case Western Reserve University.
Angeline, P. J., Saunders, G. M., & Pollack, J. B. (1994). An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. Neural Networks, 5, 54–65.
Bäck, T. (1996). Evolutionary algorithms in theory and practice. New York: Oxford University Press.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Beer, R. D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3, 469–509.
Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Computational Neuroscience, 7, 119–147.
Beer, R. D., & Gallagher, J. C. (1992). Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior, 1, 91–122.
Braitenberg, V. (2001). Brain size and number of neurons: An exercise in synthetic neuroanatomy. J. Computational Neuroscience, 10, 71–77.
Chiel, H. J., Beer, R. D., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking I. Dynamical modules. J. Computational Neuroscience, 7, 99–118.
Garlaschelli, D., Caldarelli, G., & Pietronero, L. (2003). Universal scaling relations in food webs. Nature, 423, 165–168.
Getting, P. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185–204.
Gruau, F. (1995). Automatic definition of modular neural networks. Adaptive Behavior, 3, 151–183.
Guelzim, N., Bottani, S., Bourgine, P., & Képès, F. (2002). Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31, 60–63.
Harary, F. (1972). Graph theory. Reading, MA: Addison-Wesley.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature, 407, 651–654.
Kodjabachian, J., & Meyer, J.-A. (1998). Evolution and development of neural controllers for locomotion, gradient-following, and obstacle avoidance in artificial insects. IEEE Trans. Neural Networks, 9, 796–812.
Marder, E., & Calabrese, R. L. (1996). Principles of rhythmic motor pattern generation. Physiological Reviews, 76, 687–717.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., & Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science, 298, 824–827.
Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review, 45, 167–256.
Pikovsky, A., Rosenblum, M., & Kurths, J. (2001). Synchronization: A universal concept in nonlinear sciences. Cambridge: Cambridge University Press.
Roberts, P. D. (1998). Classification of temporal patterns in dynamic biological networks. Neural Computation, 10, 1831–1846.
Sporns, O., Tononi, G., & Edelman, G. M. (2000). Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices. Cerebral Cortex, 10, 127–141.
Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 99–127.
Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S., & Gilles, E. D. (2002). Metabolic network structure determines key aspects of functionality and regulation. Nature, 420, 190–193.
Stevens, C. F. (1989). How cortical interconnectedness varies with network size. Neural Computation, 1, 473–479.
Stewart, I., Golubitsky, M., & Pivato, M. (2003). Symmetry groupoids and patterns of synchrony in coupled cell networks. SIAM J. Applied Dynamical Systems, 2, 609–646.
Strogatz, S. H. (2001). Exploring complex networks. Nature, 410, 268–276.
van Essen, D. C., Anderson, C. H., & Felleman, D. J. (1992). Information processing in the primate visual system: An integrated systems perspective. Science, 255, 419–423.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440–442.
Yao, X. (1999). Evolving artificial neural networks. Proc. of the IEEE, 87, 1423–1447.
Received August 24, 2004; accepted August 9, 2005.
NOTE
Communicated by Bernard Haasdonk
An Invariance Property of Predictors in Kernel-Induced Hypothesis Spaces

Nicola Ancona
[email protected] Institute of Intelligent Systems for Automation, C. N. R., Bari, Italy
Sebastiano Stramaglia
[email protected] TIRES, Center of Innovative Technologies for Signal Detection and Processing, University of Bari, Italy; Dipartimento Interateneo di Fisica, University of Bari, Italy; and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
We consider kernel-based learning methods for regression and analyze what happens to the risk minimizer when new variables, statistically independent of input and target variables, are added to the set of input variables. This problem arises, for example, in the detection of causality relations between two time series. We find that the risk minimizer remains unchanged if we constrain the risk minimization to hypothesis spaces induced by suitable kernel functions. We show that not all kernel-induced hypothesis spaces enjoy this property. We present sufficient conditions ensuring that the risk minimizer does not change and show that they hold for inhomogeneous polynomial and gaussian radial basis function kernels. We also provide examples of kernel-induced hypothesis spaces whose risk minimizer changes if independent variables are added as input.

Neural Computation 18, 749–759 (2006)
C 2006 Massachusetts Institute of Technology

1 Introduction

Recent advances in kernel-based learning algorithms have brought the field of machine learning closer to the goal of autonomy: providing learning systems that require as little intervention as possible on the part of a human user (Vapnik, 1998). Kernel algorithms work by embedding data into a Hilbert space and searching for linear relations in that space. The embedding is performed implicitly, by specifying the inner product between pairs of points. Kernel-based approaches are generally formulated as convex optimization problems, with a single minimum, and thus do not require heuristic choices of learning rates, start configuration, or other free parameters. The choice of the kernel and the corresponding feature space are central choices that generally must be made by a human user. While this provides opportunities to use prior knowledge about the problem at hand, in practice it is
difficult to find prior justification for the use of one kernel instead of another (Shawe-Taylor & Cristianini, 2004). The purpose of this work is to introduce a novel property enjoyed by some kernel-based learning machines, which is of particular relevance when a machine learning approach is developed to evaluate causality between two simultaneously acquired signals. In this article, we define a learning machine to be invariant with respect to independent variables (property IIV) if it does not change when statistically independent variables are added to the set of input variables. We show that the risk minimizer constrained to belong to suitable kernel-induced hypothesis spaces is IIV. This property holds true for hypothesis spaces induced by inhomogeneous polynomial and gaussian kernel functions. We discuss the case of quadratic loss function and provide sufficient conditions for a kernel machine to be IIV. We also present examples of kernels that induce spaces where the risk minimizer is not IIV, and they should not be used to measure causality.

2 Preliminaries

We focus on the problem of predicting the value of a random variable (r.v.) s ∈ R with a function f(x) of the r.v. vector x ∈ R^d. Given a loss function V and a set of functions called the hypothesis space H, the best predictor is sought in H as the minimizer f* of the prediction error or generalization error or risk, defined as

R[f] = ∫ V(s, f(x)) p(x, s) dx ds,    (2.1)
where p(x, s) is the joint density function of x and s. Given another r.v. y ∈ R^q, let us add y to the input variables and define a new vector appending x and y: z = (x^T, y^T)^T. Let us also consider the predictor f*(z) of s, based on the knowledge of the r.v. x and y, minimizing the risk:

R′[f] = ∫ V(s, f(z)) p(z, s) dz ds.    (2.2)
If y is statistically independent of x and s, it is intuitive to require that f*(x) and f*(z) coincide and have the same risk. Indeed in this case, y variables do not convey any information on the problem at hand. The property stated above is important when predictors are used to identify causal relations among simultaneously acquired signals, an important problem with applications in many fields ranging from economics to physiology (see, e.g., Ancona, Marinazzo, & Stramaglia, 2004). The major approach to this problem examines if the prediction of one series could be improved by incorporating information of the other, as proposed by Granger (1969). In particular, if the prediction error of the first time series is reduced by including
measurements from the second time series in the regression model, then the second time series is said to have a causal influence on the first time series. However, not all prediction schemes are suitable to evaluate causality between two time series; they should be invariant with regard to independent variables, so that, at least asymptotically, they would be able to recognize variables without causality relationship. In this work, we consider as predictor the function minimizing the risk, and we show that it does not always enjoy this property. In particular, we show that if we constrain the minimization of the risk to suitable hypothesis spaces, then the risk minimizer is IIV (stable under inclusion of independent variables). We limit our analysis to the case of quadratic loss function V(s, f(x)) = (s − f(x))^2.

2.1 Unconstrained H. If we do not constrain the hypothesis space, then H is the space of measurable functions for which R is well defined. It is well known (Papoulis, 1985) that the minimizer of equation 2.1 is the regression function:

f*(x) = ∫ s p(s|x) ds.
Note that if y is independent of x and s, then p(s|x) = p(s|x, y), and this implies

f*(z) = ∫ s p(s|x, y) ds = ∫ s p(s|x) ds = f*(x).
Hence the regression function does not change if y is also used for predicting s; the regression function is stable under inclusion of independent variables.

2.2 Linear Hypothesis Spaces. Let us consider the case of linear hypothesis spaces:

H = { f | f(x) = w_x^T x, w_x ∈ R^d }.

Here, and in all the other hypothesis spaces that we consider in this article, we assume that the mean value of the predictor and the mean of s coincide:

E{s − w_x^T x} = 0,    (2.3)
where E{·} means the expectation value. This can be easily achieved by adding a constant component (equal to one) to the x vector. Equation 2.3 is
a sufficient condition for property IIV in the case of linear kernels. Indeed, let us consider the risk associated with an element of H:

R[w_x] = ∫ (s − w_x^T x)^2 p(x, s) dx ds.    (2.4)
The parameter vector w_x*, minimizing the risk, is a solution of the following linear system:

E{x x^T} w_x = E{s x}.    (2.5)
Let us consider the hypothesis space of linear functions in the z = (x^T, y^T)^T variable:

H′ = { f | f(z) = w_z^T z, w_z ∈ R^{d+q} }.

Writing w_z = (w_x^T, w_y^T)^T with w_y ∈ R^q, let us consider the risk associated with an element of H′:

R′[w_z] = ∫ (s − w_x^T x − w_y^T y)^2 p(x, y, s) dx dy ds.    (2.6)
If y is independent of x and s, then equation 2.6 can be written, due to equation 2.3, as

R′[w_z] = R[w_x] + ∫ (w_y^T y)^2 p(y) dy.    (2.7)
It follows that the minimum of R′ corresponds to w_y = 0. In conclusion, if y is independent of x and s, the predictors f*(x) = w_x*^T x and f*(z) = w_z*^T z, which minimize the risks of equations 2.4 and 2.6, respectively, coincide (i.e., f*(x) = f*(x, y) for every x and every y). Moreover, the weights associated with the components of the y vector are identically null. So the risk minimizer in linear hypothesis spaces is an IIV predictor.
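This conclusion is easy to verify numerically. The following sketch (with made-up data, not from the paper) appends to x a variable y that is independent of x and s and fits ordinary least squares with a constant component, as assumed in equation 2.3; the estimated weight on y is near zero and shrinks further as the sample grows.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000
    x = rng.normal(size=(n, 2))
    s = x @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=n)
    y = rng.normal(size=(n, 1))                  # independent of x and s

    Z = np.hstack([np.ones((n, 1)), x, y])       # constant component, x, then y
    w, *_ = np.linalg.lstsq(Z, s, rcond=None)
    print(w.round(3))                            # weight on y (last entry) is ~0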
3 Nonlinear Hypothesis Spaces

Let us now consider nonlinear hypothesis spaces. An important class of nonlinear models is obtained mapping the input space to a higher-dimensional feature space and finding a linear predictor in this new space. Let φ be a nonlinear mapping function that associates with x ∈ R^d the vector φ(x) = (φ_1(x), φ_2(x), . . . , φ_h(x)) ∈ R^h, where φ_1, φ_2, . . . , φ_h are h fixed real-valued functions. Let us consider linear predictors in the space spanned by the functions φ_i for i = 1, 2, . . . , h. The hypothesis space is then

H = { f | f(x) = w_x^T φ(x), w_x ∈ R^h }.

In this space, the best linear predictor of s is the function f* ∈ H minimizing the risk:

R[w_x] = ∫ (s − w_x^T φ(x))^2 p(x, s) dx ds.    (3.1)

Let us denote w_x* the minimizer of equation 3.1. We first restrict to the case of a single additional new feature: let y be a new real random variable, statistically independent of s and x, and denote γ(z), with z = (x^T, y)^T, a generic new feature involving the y variable. For predicting the r.v. s, we use the linear model involving the new feature, f(z) = w_z^T φ′(z), where φ′(z) = (φ(x)^T, γ(z))^T and w_z = (w_x^T, v)^T has to be fixed, minimizing

R′[w_z] = ∫ (s − w_x^T φ(x) − v γ(x, y))^2 p(x, s) p(y) dx dy ds.    (3.2)
We would like to have v = 0 at the minimum of R′. Therefore, let us evaluate

∂R′/∂v |_0 = −2 ∫ γ(x, y) (s − w_x*^T φ(x)) p(x, s) p(y) dx dy ds,

where |_0 means that the derivative is evaluated at v = 0 and w_x = w_x*, where w_x* minimizes the risk of equation 3.1. If ∂R′/∂v|_0 is not zero, then the predictor is changed after inclusion of feature γ. Therefore, ∂R′/∂v|_0 = 0 is the condition that must be satisfied by all the features involving y, to constitute an IIV (stable) predictor. It is easy to show that if γ does not depend on x, then this condition holds. More important, it holds if γ is the product of a function γ(y) of y alone and of a component φ_i of the feature vector φ(x):

γ(x, y) = γ(y) φ_i(x)  for some i ∈ {1, . . . , h}.    (3.3)
Indeed, in this case we have

∂R′/∂v |_0 = −2 ( ∫ γ(y) p(y) dy ) ( ∫ φ_i(x) (s − w_x*^T φ(x)) p(x, s) dx ds ) = 0,
because the second integral vanishes as w_x* minimizes the risk of equation 3.1 when only x variables are used to predict s. We observe that the second derivative,

∂²R′/∂v² |_0 = 2 ∫ γ(x, y)^2 p(x, s) p(y) dx dy ds,

is positive; (w_x*, 0) remains a minimum after inclusion of the y variable. In conclusion, if the new feature γ involving y verifies equation 3.3, then the predictor f*(z), which uses both x and y for predicting s, minimizing equation 3.2, and the predictor f*(x) minimizing equation 3.1 coincide. This shows that the risk minimizer is unchanged after inclusion of y in the input variables. This preliminary result, which is used in the next section, may be easily seen to hold also for finite-dimensional vectorial y.

3.1 Kernel-Induced Hypothesis Spaces. In this section, we analyze whether our invariance property holds true in specific hypothesis spaces that are relevant for many learning schemes such as support vector machines (Vapnik, 1998) and regularization networks (Evgeniou, Pontil, & Poggio, 2000), citing just a few. In order to predict s, we map x in a higher-dimensional feature space H by using the mapping

φ(x) = (√α_1 ψ_1(x), √α_2 ψ_2(x), . . . , √α_h ψ_h(x), . . .),

where α_i and ψ_i are the eigenvalues and eigenfunctions of an integral operator whose kernel K(x, x′) is a positive-definite symmetric function with the property K(x, x′) = φ(x)^T φ(x′) (see Mercer's theorem in Vapnik, 1998). Let us now consider in detail two important kernels.

3.2 Case K(x, x′) = (1 + x^T x′)^p. Let us consider the hypothesis space induced by this kernel,

H = { f | f(x) = w_x^T φ(x) },

where the components φ_i(x) of φ(x) are the monomials, up to the pth degree, which enjoy the following property: φ(x)^T φ(x′) = (1 + x^T x′)^p. Let f*(x) be the minimizer of the risk in H. Moreover, let z = (x^T, y^T)^T, and consider the hypothesis space H′ induced by the mapping φ′(z) such that φ′(z)^T φ′(z′) = (1 + z^T z′)^p.
Let f*(z) be the minimizer of the risk in H′. If y is independent of x and s, then f*(x) and f*(z) coincide. In fact the components of φ′(z) are all the monomials, in the variables x and y, up to the pth degree: it follows trivially that φ′(z) can be written as φ′(z) = (φ(x)^T, γ(z)^T)^T, where each component γ_i(z) of the vector γ(z) verifies equation 3.3, that is, it is given by the product of a component φ_j(x) of the vector φ(x) and of a function γ_i(y) of the variable y only: γ_i(z) = φ_j(x) γ_i(y). As an example, we show this property for the case of x = (x_1, x_2)^T, z = (x_1, x_2, y)^T, and p = 2. In this case, the mapping functions φ(x) and φ′(z) are

φ(x) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2)^T,

φ′(z) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2, √2 y, √2 x_1 y, √2 x_2 y, y^2)^T,

where one can easily check that φ(x)^T φ(x′) = (1 + x^T x′)^2 and φ′(z)^T φ′(z′) = (1 + z^T z′)^2. In this case, the vector γ(z) is

γ(z) = (φ_1(x) √2 y, φ_2(x) y, φ_3(x) y, φ_1(x) y^2)^T.

According to the argument already made, the risk minimizer in this hypothesis space satisfies the invariance property. Note that, remarkably, the risk minimizer in the hypothesis space induced by the homogeneous polynomial kernel K(x, x′) = (x^T x′)^p does not have the invariance property for a generic probability density, as one can easily check working out explicitly the p = 2 case.
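The stated identities are easy to verify numerically. A quick check (a sketch, not the authors' code) of the p = 2 feature maps above:

    import numpy as np

    def phi_x(x1, x2):
        s = np.sqrt(2.0)
        return np.array([1.0, s*x1, s*x2, s*x1*x2, x1**2, x2**2])

    def phi_z(x1, x2, y):
        s = np.sqrt(2.0)
        return np.array([1.0, s*x1, s*x2, s*x1*x2, x1**2, x2**2,
                         s*y, s*x1*y, s*x2*y, y**2])

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=3), rng.normal(size=3)   # two points z = (x1, x2, y)
    lhs = phi_z(*a) @ phi_z(*b)
    rhs = (1.0 + a @ b) ** 2
    print(np.isclose(lhs, rhs))   # True: phi'(z).phi'(z') == (1 + z.z')^2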
3.3 Translation-Invariant Kernels. In this section we present a formalism that generalizes our discussion to the case of hypothesis spaces whose features constitute an uncountable set. We show that the IIV property holds for linear predictors on feature spaces induced by translation-invariant kernels. In fact, let K(x, x′) = K(x − x′) be a positive-definite kernel function, with x, x′ ∈ R^d. Let K̃(ω_x) be the Fourier transform of K(x): K(x) ↔ K̃(ω_x). By the time-shifting property, we have that K(x − x′) ↔ K̃(ω_x) e^{−j ω_x^T x′}. By definition of the inverse Fourier transform, neglecting constant factors, we know that (Girosi, 1998)

K(x − x′) = ∫_{R^d} K̃(ω_x) e^{−j ω_x^T x′} e^{j ω_x^T x} dω_x.
Since K is positive definite, we can write

K(x − x′) = ∫_{R^d} ( √(K̃(ω_x)) e^{j ω_x^T x} ) ( √(K̃(ω_x)) e^{j ω_x^T x′} )* dω_x,

where * indicates conjugate. Then we can write K(x, x′) = ⟨φ_x, φ_{x′}⟩, where

φ_x(ω_x) = √(K̃(ω_x)) e^{j ω_x^T x}    (3.4)
are the generalized eigenfunctions. Note that in this case, the mapping function φ_x associates a function to x, that is, φ_x maps the input vector x in a feature space with an infinite and uncountable number of features. Let us consider the hypothesis space induced by K:

H = { f | f(x) = ⟨w_x, φ_x⟩, w_x ∈ W_x },

where

⟨w_x, φ_x⟩ = ∫_{R^d} w_x(ω_x) φ_x*(ω_x) dω_x,    (3.5)
and W_x is the set of complex measurable functions for which equation 3.5 is well defined and real.¹ Note that w_x is now a complex function; it is no longer a vector. In this space, the best linear predictor is the function f̄ = ⟨w̄_x, φ_x⟩ in H minimizing the risk functional:

R[w_x] = E{(s − ⟨w_x, φ_x⟩)^2}.

It is easy to show that the optimal function w̄_x is a solution of the following integral equation,

E{s e^{−j ω_x^T x}} = ∫_{R^d} w_x(ξ_x) √(K̃(ξ_x)) Φ_x*(ω_x + ξ_x) dξ_x,    (3.6)
where ξ_x is a dummy integration variable and Φ_x(ω_x) = E{e^{j ω_x^T x}} is the characteristic function² of the r.v. x (Papoulis, 1985). Let us indicate F̃(ω_x) = w_x(ω_x) √(K̃(ω_x)) and G̃(ω_x) = E{s e^{j ω_x^T x}}. Then equation 3.6 can be written as

G̃(ω_x) = F̃(ω_x) ⋆ Φ_x(ω_x),
¹ In particular, elements of W_x satisfy w_x(−ω_x) = w_x*(ω_x).
² Φ_x(−ω_x) is the Fourier transform of the probability density p(x) of the r.v. x.
where ⋆ indicates cross-correlation between complex functions. In the spatial domain, this implies G(x) = F*(x) p(−x). In conclusion, assuming that the density p(x) is strictly positive, the function w̄_x(ω_x) minimizing the risk is unique, and it is given by

w̄_x(ω_x) = F[G*(x)/p(−x)] / √(K̃(ω_x)),

where F denotes the Fourier transform. Substituting this expression into equation 3.5 leads to

f̄(x) = ∫ s p(s|x) ds,

that is, the risk minimizer coincides with the regression function. In other words, the hypothesis space H, induced by K, is sufficiently large to contain the regression function. This proves that translation-invariant kernels are IIV.
It is interesting to work out and explicitly prove the IIV property in the case of translation-invariant and separable kernels. As in the previous section, let y ∈ R^q be an r.v. vector independent of x and s and use the vector z = (x^T, y^T)^T for predicting s. Now let us consider the following mapping function,

φ_z(ω_z) = √(K̃(ω_z)) e^{j ω_z^T z},    (3.7)
where ω_z = (ω_x^T, ω_y^T)^T and ⟨φ_z, φ_{z′}⟩ = K(z − z′). Let us consider the hypothesis space induced by K:

H′ = { f | f(z) = ⟨w_z, φ_z⟩, w_z ∈ W_z }.

The best linear predictor is the function f̄ = ⟨w̄_z, φ_z⟩ in H′ minimizing the risk functional,

R′[w_z] = E{(s − ⟨w_z, φ_z⟩)^2},

where the optimal function w̄_z is the solution of the integral equation (see equation 3.6)

E{s e^{−j ω_z^T z}} = ∫_{R^{d+q}} w_z(ξ_z) √(K̃(ξ_z)) Φ_z*(ω_z + ξ_z) dξ_z,    (3.8)
where ω_z = (ω_x^T, ω_y^T)^T. Note that, being x and y independent, the characteristic function of z factorizes: Φ_z(ω_z) = Φ_x(ω_x) Φ_y(ω_y). If K(z) is separable,

K(z) = K(x) H(y),    (3.9)

then its Fourier transform takes the form K̃(ω_z) = K̃(ω_x) H̃(ω_y). Being E{s e^{−j ω_z^T z}} = E{s e^{−j ω_x^T x}} E{e^{−j ω_y^T y}}, equation 3.8 becomes

E{s e^{−j ω_x^T x}} Φ_y*(ω_y) = ∫_{R^{d+q}} w_z(ξ_z) √(K̃(ξ_x) H̃(ξ_y)) Φ_x*(ω_x + ξ_x) Φ_y*(ω_y + ξ_y) dξ_z.    (3.10)

The risk minimizer w̄_z solving equation 3.10 is

w̄_z(ω_x, ω_y) = w̄_x(ω_x) δ(ω_y) / √(H̃(0)).    (3.11)
This can be checked by substituting equation 3.11 in equation 3.10 and using equation 3.6. The structure of equation 3.11 guarantees that the predictor is unchanged under inclusion of variables y. This is the case, in particular, for the gaussian radial basis function kernel. Finally note that a property similar to equation 3.3 holds true in this hypothesis space too. In fact, as K is separable, equation 3.7 implies that

φ_z(ω_z) = φ_x(ω_x) γ_y(ω_y),    (3.12)
where γ_y(ω_y) = √(H̃(ω_y)) e^{j ω_y^T y}, with the property ⟨γ_y, γ_{y′}⟩ = H(y − y′). Equation 3.12 may be seen as a continuum version of property 3.3.
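The gaussian case is easy to probe numerically. The sketch below uses synthetic data and scikit-learn kernel ridge regression (a regularized stand-in for risk minimization, which the paper does not use) to append an independent input y and compare the two fitted predictors; since the IIV property concerns the risk minimizer, the printed difference is expected to shrink as the sample size grows rather than vanish exactly.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.uniform(-1, 1, size=(n, 1))
    s = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=n)
    y = rng.normal(size=(n, 1))              # independent of x and s
    z = np.hstack([x, y])

    f_x = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(x, s)
    f_z = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(z, s)

    x_test = np.linspace(-0.9, 0.9, 50)[:, None]
    z_test = np.hstack([x_test, np.zeros_like(x_test)])  # evaluate at y = 0
    print(np.max(np.abs(f_x.predict(x_test) - f_z.predict(z_test))))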
4 Discussion

In this work, we consider, in the frame of kernel methods for regression, the following question: Does the risk minimizer change when statistically independent variables are added to the set of input variables? We show that this property is not guaranteed by all the hypothesis spaces. We outline sufficient conditions ensuring this property and show that it holds for inhomogeneous polynomial and gaussian radial basis function kernels. While these results are relevant to construct machine learning approaches to study causality between time series, in our opinion they might also be important in the more general task of kernel selection.
Our discussion concerns the risk minimizer; hence, it holds only in the asymptotic regime. The analysis of the practical implications of our results—when only a finite data set is available to train the learning machine—is a matter for further research. It is worth noting, however, that our results hold also for a finite set of data if the probability distribution is replaced by the empirical measure. Another interesting question is how this scenario changes when a regularization constraint is imposed on the risk minimizer (Poggio & Girosi, 1990) and loss functions different from the quadratic one are considered. Moreover, it would be interesting to analyze the connections between our results and classical problems of machine learning such as feature selection and sparse representation, that is, the determination of a solution with only a few nonvanishing components. If we look for the solution in overcomplete or redundant spaces of vectors or functions, where more than one representation exists, then it makes sense to impose a sparsity constraint on the solution. In the case here considered, the sparsity of w* emerges as a consequence of the existence of independent input variables using a quadratic loss function.

Acknowledgments

We thank two anonymous reviewers whose comments were valuable in improving the presentation of this work.

References

Ancona, N., Marinazzo, D., & Stramaglia, S. (2004). Radial basis function approach to nonlinear Granger causality of time series. Physical Review E, 70, 56221–56227.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 1–50.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480.
Granger, C. W. J. (1969). Testing for causality and feedback. Econometrica, 37, 424–438.
Papoulis, A. (1985). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–986.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Received February 25, 2005; accepted August 10, 2005.
LETTER
Communicated by Emanuel Todorov
Modeling Sensorimotor Learning with Linear Dynamical Systems

Sen Cheng
[email protected]

Philip N. Sabes
[email protected]

Sloan-Swartz Center for Theoretical Neurobiology, W. M. Keck Foundation Center for Integrative Neuroscience and Department of Physiology, University of California, San Francisco, CA 94143-0444, U.S.A.
Recent studies have employed simple linear dynamical systems to model trial-by-trial dynamics in various sensorimotor learning tasks. Here we explore the theoretical and practical considerations that arise when employing the general class of linear dynamical systems (LDS) as a model for sensorimotor learning. In this framework, the state of the system is a set of parameters that define the current sensorimotor transformation—the function that maps sensory inputs to motor outputs. The class of LDS models provides a first-order approximation for any Markovian (state-dependent) learning rule that specifies the changes in the sensorimotor transformation that result from sensory feedback on each movement. We show that modeling the trial-by-trial dynamics of learning provides a substantially enhanced picture of the process of adaptation compared to measurements of the steady state of adaptation derived from more traditional blocked-exposure experiments. Specifically, these models can be used to quantify sensory and performance biases, the extent to which learned changes in the sensorimotor transformation decay over time, and the portion of motor variability due to either learning or performance variability. We show that previous attempts to fit such models with linear regression have not generally yielded consistent parameter estimates. Instead, we present an expectation-maximization algorithm for fitting LDS models to experimental data and describe the difficulties inherent in estimating the parameters associated with feedback-driven learning. Finally, we demonstrate the application of these methods in a simple sensorimotor learning experiment: adaptation to shifted visual feedback during reaching.

Neural Computation 18, 760–793 (2006)
C 2006 Massachusetts Institute of Technology

1 Introduction

Sensorimotor learning is an adaptive change in motor behavior in response to sensory inputs. Here, we explore a dynamical systems approach to modeling sensorimotor learning. In this approach, the mapping from sensory
inputs to motor outputs is described by a sensorimotor transformation (see Figure 1, top), which constitutes the state of a dynamical system. The evolution of this state is governed by the dynamics of the system (see Figure 1, bottom), which may depend on both exogenous sensory inputs and sensory feedback. The goal is to quantify how these sensory signals drive trial-by-trial changes in the state of the sensorimotor transformations underlying movement. To accomplish this goal, empirical data are fit with linear dynamical systems (LDS), a general, parametric class of dynamical systems.
The approach is best illustrated with an example. Consider the case of prism adaptation of visually guided reaching, a well-studied form of sensorimotor learning in which shifted visual feedback drives rapid recalibration of visually guided reaching (von Helmholtz, 1867). Prism adaptation has almost always been studied in a blocked experimental design, with exposure to shifted visual feedback occurring over a block of reaching trials. Adaptation is then quantified by the after-effect, the change in the mean reach error across two blocks of no-feedback test reaches—one before and one after the exposure block (Held & Gottlieb, 1958). This experimental approach has had many successes, for example, identifying different components of adaptation (Hay & Pick, 1966; Welch, Choe, & Heinrich, 1974) and the experimental factors that influence the quality of adaptation (e.g., Redding & Wallace, 1990; Norris, Greger, Martin, & Thach, 2001; Baraduc & Wolpert, 2002). However, adaptation is a dynamical process, with changes
Figure 1: Sensorimotor learning modeled as a dynamic system in the space of sensorimotor transformations. For definitions of variables, see section 2.1.
762
S. Cheng and P. Sabes
neural changes in both the behavior and the underlying patterns of neural activity occurring on every trial. Our goal is to describe how the state of the system, which in this case could be modeled as the mean reach error, changes after each trial in response to error feedback (e.g., reach errors, perceived visual-proprioceptive misalignment) on that trial. As we will describe, a comparison of the performance before and after the training block is not sufficient to characterize this process. Only recently have investigations of sensorimotor learning from a dynamical systems perspective begun to appear (Thoroughman & Shadmehr, 2000; Scheidt, Dingwell, & Mussa-Ivaldi, 2001; Baddeley, Ingram, & Miall, 2003; Donchin, Francis, & Shadmehr, 2003). While these investigations have all made use of the LDS model class, they primarily focused on the application of these methods. A number of important algorithmic and statistical issues that arise when applying these methods remain unaddressed. Here we outline a general framework for modeling sensorimotor learning with LDS models, discuss the key analytical properties of these models, and address the statistical issues that arise when estimating model parameters from experimental data. We show how LDS can account for performance bias and the decay of learning over time, observed properties of adaptation that have not been included in previous studies. We show that the decay effect can be confounded with the effects of sensory feedback and that it can be difficult to separate these effects statistically. In contrast, the effects of exogenous inputs that are uncorrelated with the state of the sensorimotor transformation are much easier to characterize. We describe a novel resampling-based hypothesis test that can be used to identify the significance of such effects. The estimation of the LDS model parameters requires an iterative, maximum likelihood, system identification algorithm (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996), which we present in a slightly modified form. This iterative algorithm is necessary because, as we show, simple linear regression approaches are biased or inefficient, or both. The maximumlikelihood model can be used to quantify characteristics of the dynamics of sensorimotor learning and can make testable predictions for future experiments. Finally, we illustrate this framework with an application to a modern variant of the prism adaptation experiment. 2 A Linear Dynamical Systems Model for Sensorimotor Learning 2.1 General Formulation of the Model. Movement control can be described as a transformation of sensory signals into motor outputs. This transformation is generally a continuous-time stochastic process that includes both internal (“efference copy”) and sensory feedback loops. We will use the term sensorimotor transformation to refer to the input-output mapping of this entire process—feedback loops and all. This definition is useful in the case of discrete movements and other situations where continuous
Modeling Sensorimotor Learning with Linear Dynamical Systems
763
time can be discretized in a manner that permits a concise description of the feedback processes within each time step. We assume that at any given movement trial or discrete time step, indexed by t, the motor output can be quantified by a vector y_t. In general, this output might depend on both a vector of inputs w_t to the system and the "output noise" γ_t, the combined effect of sensory, motor, and processing variability. As shown in the upper box of Figure 1, the sensorimotor transformation can be formalized as a time-varying function of these two variables,

y_t = F_t(w_t, γ_t).    (2.1a)
We next define sensorimotor learning as a change in the transformation F_t in response to the sensorimotor experience of previous movements, as shown in the lower box of Figure 1. We let u_t represent the vector of sensorimotor variables at time step t that drive such learning. This vector might include exogenous inputs r_t, and since feedback typically plays a large role in learning, the motor outputs y_t. The input u_t might have all, some, or no components in common with the inputs w_t. Generally, learning after time step t can depend on the complete history of this variable, U_t ≡ {u_1, . . . , u_t}. Sensorimotor learning can then be modeled as a discrete-time dynamical system whose state is the current sensorimotor transformation, F_t, and whose state-update equation is the "learning rule" that specifies how F changes over time:

F_{t+1} = L({F_τ}_{τ=1}^t, U_t, η_t, t),    (2.1b)
where the noise term η_t includes sensory feedback noise as well as intrinsic variability in the mechanisms of learning and will be referred to as learning noise. Previous studies that have adopted a dynamical systems approach to studying sensorimotor learning have taken only the output noise into account (Thoroughman & Shadmehr, 2000; Donchin et al., 2003; Baddeley et al., 2003). However, it seems likely that variability exists in both the sensorimotor output and the process of sensorimotor learning. Attempts to fit empirical data with parametric models of learning that do not account for learning noise may yield incorrect results (see section 4.4 for an example). Aside from these practical concerns, it is also of intrinsic interest to quantify the relative contributions of the output and learning noise to performance variability.

2.2 The Class of LDS Models. The model class defined in equation 2.1 provides a general framework for thinking about sensorimotor learning, but it is too general to be of practical use. Here we outline a series of assumptions that lead us from the general formulation to the specific class
of LDS models, which can be a practical yet powerful tool for interpreting sensorimotor learning experiments:

- Stationarity: On the timescale that it takes to collect a typical data set, the learning rule L does not explicitly depend on time.

- Parameterization: F_t can be written in parametric form with a finite number of parameters, x_t ∈ R^m. This is not a serious restriction, as many finite basis function sets can describe large classes of functions. The parameter vector x_t now represents the state of the dynamical system at time t, and X_t is the history of these states. The full model, consisting of the learning rule L and sensorimotor transformation F, is now given by

  x_{t+1} = L(X_t, U_t, η_t),    (2.2a)
  y_t = F(x_t, w_t, γ_t).    (2.2b)

- Markov property: The evolution of the system depends on only the current state and inputs, not on the full history:

  x_{t+1} = L(x_t, u_t, η_t),    (2.3a)
  y_t = F(x_t, w_t, γ_t).    (2.3b)

  In other words, we assume online or incremental learning, as opposed to batch learning, a standard assumption for models of biological learning.

- Linear approximation and gaussian noise: The functions F and L can be linearized about some equilibrium values for the states (x_e), inputs (u_e and w_e), and outputs (y_e):

  x_{t+1} − x_e = A(x_t − x_e) + B(u_t − u_e) + η_t,    (2.4a)
  y_t − y_e = C(x_t − x_e) + D(w_t − w_e) + γ_t.    (2.4b)

  Thus, if the system were initially set up in equilibrium, the dynamics would be solely driven by random fluctuations about that equilibrium. The linear approximation is not unreasonable for many experimental contexts in which the magnitude of the experimental manipulation, that is, the inputs, is small, since in these cases, the deviations from equilibrium of the state and the output are small. The combined effect of the constant equilibrium terms in equation 2.4 can be lumped into a single constant "bias" term for each equation:

  x_{t+1} = A x_t + B u_t + b_x + η_t,    (2.5a)
  y_t = C x_t + D w_t + b_y + γ_t.    (2.5b)
We will show in section 3.2 that it is possible to remove the effects of the bias terms b_x and b_y from the LDS. In anticipation of that result, we drop the bias terms in the following discussion. With the additional assumption that η_t and γ_t are zero-mean, gaussian white noise processes, we arrive at the LDS model class we use below:

x_{t+1} = A x_t + B u_t + η_t,    (2.6a)
y_t = C x_t + D w_t + γ_t,    (2.6b)

with

η_t ∼ N(0, Q),  γ_t ∼ N(0, R).    (2.6c)
In principle, signal-dependent motor noise (Clamann, 1969; Matthews, 1996; Harris & Wolpert, 1998; Todorov & Jordan, 2002) could be incorporated into this model by allowing the output variance R to vary with t. In practice, this would complicate parameter estimation. In the case where the data consist of a set of discrete movements with similar kinematics (e.g., repeated reaches with only modest variation in start and end locations), the modulation of R with t would be negligible. We will restrict our consideration to the case of stationary R.
Among the LDS models defined in equation 2.6 there are distinct subclasses that are functionally equivalent. The parameters of two equivalent models are related to each other by a similarity transformation of the states x_t:

x_t → P x_t,  A → P A P^{−1},  B → P B,  Q → P Q P^T,  C → C P^{−1},  D → D,  R → R,    (2.7)
where P is an invertible matrix. This equivalence exists because the state cannot be directly observed, but must be inferred from the outputs y_t. An LDS from one subclass of equivalent models cannot be transformed into an LDS of another subclass via equation 2.7. In particular, a similarity transformation of an LDS with A = I always yields another LDS with A = I since

(A = I) → (P A P^{−1} = I).    (2.8)

Therefore, it is restrictive to assume that A = I—that there is no "decay" of input-driven state changes over time.
The equivalence under similarity transformation can be useful if one wishes to place certain constraints on the LDS parameters. For instance, if
one wishes to identify state components that evolve independently in the absence of sensory inputs, then the matrix A has to be diagonal. In many cases,¹ this constraint can be met by performing unconstrained parameter estimation and then transforming the parameters with P = [v_1 · · · v_n]^{−1}, where v_i are the eigenvectors of the estimate of A. The transformed matrix A′ = P A P^{−1} is a diagonal matrix composed of the eigenvalues of A. As another example, the relationship between the state and the output might be known, that is, C = C_0. If both C_0 and the estimated value of C are invertible, this constraint is achieved by transforming the estimated LDS with P = C_0^{−1} Ĉ.
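In code, the eigenvector-based choice of P is a line of linear algebra. A minimal sketch with an arbitrary example A (the footnoted assumption that A has n linearly independent eigenvectors applies):

    import numpy as np

    A = np.array([[0.90, 0.05],
                  [0.10, 0.80]])
    B = np.array([[0.2], [0.1]])
    C = np.eye(2)
    Q = 1e-3 * np.eye(2)

    evals, V = np.linalg.eig(A)     # columns of V are eigenvectors v_i
    P = np.linalg.inv(V)            # P = [v_1 ... v_n]^(-1)
    P_inv = V
    A_diag = P @ A @ P_inv          # diagonal, up to round-off
    B2, C2, Q2 = P @ B, C @ P_inv, P @ Q @ P.T   # equation 2.7
    print(np.round(A_diag, 6))      # diag of the eigenvalues of A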
2.3 Feedback in LDS Models. In the LDS model of equation 2.6, learning is driven by an input vector u_t. In an experimental setting, the exact nature of this signal will depend on the details of the task and is likely to be unknown. In general, it can include sensory feedback of the previous movement as well as exogenous sensory inputs. When we consider the problem of parameter estimation in section 4, we will show that the parameters corresponding to these two components of the input have different statistical properties. Therefore, we will explicitly write the input vector as u_t^T = [r_t^T y_t^T], where the vector r_t contains the exogenous input signals. We will similarly partition the input parameter, B = [G H]. This yields the form of the LDS model that will be used in the subsequent discussion:

x_{t+1} = A x_t + [G H] [r_t^T y_t^T]^T + η_t,    (2.9a)
y_t = C x_t + D w_t + γ_t,    (2.9b)
η_t ∼ N(0, Q),  γ_t ∼ N(0, R).    (2.9c)
The decomposition of u_t specified above is not overly restrictive, since any feedback signal can be divided into a component that is uncorrelated with the output (r_t above) and a component that is a linear transformation of the output. Furthermore, equation 2.9 can capture a variety of common feedback models. To illustrate this point, we consider three forms of error-corrective learning. In the first case, learning is driven by the previous performance error, u_t = y_t − y_t*, where y_t* is the target output. As an example, y_t* could be a visual reach target, and u_t could be the visually perceived reach error. If we let r_t = −y_t* and G = H, then equation 2.9a acts as a feedback controller designed to reduce performance error.
1 This transformation exists only if there are n linearly independent eigenvectors, where n is the dimensionality of the state.
As a second form of error-corrective learning, consider the case where learning is driven by the unexpected motor output, u_t = y_t − ŷ_t, where ŷ_t = C x_t + D w_t is the predicted motor output. This learning rule would be used if the goal of the learner were to accurately predict the output of the system given the inputs u_t and w_t, that is, to learn a “forward model” of the physical plant ŷ(u_t, w_t, x_t) (Jordan & Rumelhart, 1992). Writing this learning rule in the LDS form of equation 2.6a, we obtain

x_{t+1} = A x_t + B(y_t − C x_t − D w_t) + η_t
        = (A − BC) x_t − BD w_t + B y_t + η_t,

where A' ≡ A − BC, G' ≡ −BD, and H' ≡ B.
Thus, this scheme can be expressed in the form of equation 2.9a, with r_t = w_t and parameters A', G', and H'. Finally, learning could be driven by the predicted performance error, u_t = ŷ_t − y_t^* (e.g., Jordan & Rumelhart, 1992; Donchin et al., 2003). This scheme would be useful if the learner already had access to an accurate forward model. Using the predicted performance to estimate movement error would eliminate the effects that motor variability and feedback sensor noise have on learning. Also, in the context of continuous-time systems, learning from the predicted performance error minimizes the effects of delays in the feedback loop. Putting this learning rule into the form of equation 2.6a, we obtain

x_{t+1} = A x_t + B(C x_t + D w_t − y_t^*) + η_t
        = (A + BC) x_t + [BD −B] [w_t; y_t^*] + η_t,

where A″ ≡ A + BC, G″ ≡ [BD −B], and r_t ≡ [w_t; y_t^*].
Again, this scheme is consistent with equation 2.9a, with the input r_t and parameters A″ and G″, and with H = 0.

2.4 Example Applications. LDS models can be applied to a wide range of sensorimotor learning tasks, but there are some restrictions. The true dynamics of learning must be consistent with the assumptions underlying the LDS framework, as discussed in section 2.2. Most notably, both the learning dynamics and the motor output have to be approximately linear within the range of states and inputs experienced. In addition, LDS models can be fit to experimental data only if the inputs u_t and outputs y_t are well defined and can be measured by the experimenter. Identifying inputs amounts to defining the error signals that could potentially drive learning. While the true inputs will typically not be known a priori, it is often the case that several candidate input signals are available. Hypothesis testing can then
be used to determine which signals contribute significantly to the dynamics of learning, as discussed in section 4.2. The outputs y_t must be causally related to the state of the sensorimotor system, since they function as a noisy readout of the state. Several illustrative examples are described here.

Consider first the case where t represents discrete movements. Two example tasks would be goal-directed reaching and hammering. A reasonable choice of state variable for these tasks would be the average positional error at the end of the movement. In this case, y_t would be the error on each trial. In the hammering task, one might also include task-relevant dynamic information such as the magnitude and direction of the impact force. In some circumstances, these end-point-specific variables might be affected too strongly by online feedback to serve as a readout of the sensorimotor state. In such a case, one may choose to focus on variables from earlier in the movement (e.g., Thoroughman & Shadmehr, 2000; Donchin et al., 2003). Indeed, a more complete description of the sensorimotor state might be obtained by augmenting the output vector with multiple kinematic or dynamic parameters of the movement and similarly increasing the dimensionality of the state. In the reaching task, for example, y_t could contain the position and velocity at several time points during the reach. Similarly, the output for the hammering task might contain snapshots of the kinematics of the hammer or the forces exerted on the hammer by the hand. If learning is thought to occur independently in different components of the movement, then the state and output variables for each component should be handled by separate LDS models in order to reduce the overall model complexity.

Next, consider the example of gain adaptation in the vestibulo-ocular reflex (VOR). The VOR stabilizes gaze direction during head rotation. The state of the VOR can be characterized by its “gain,” the ratio of the angular speed of the eye response to the speed of the rotation stimulus. When magnifying or minimizing lenses are used to alter the relationship between head rotation and visual motion, VOR gain adaptation is observed (Miles & Fuller, 1974). An LDS could be used to model this form of sensorimotor learning with the VOR gain as the state of the system. If the output is chosen to be the empirical ratio of eye and head velocity, then a linear equation relating output to state is reasonable. The input to the LDS could be the speed of visual motion or the ratio of that speed to the speed of head rotation. On average, such input would be zero if the VOR had perfect gain. A more elaborate model could include separate input, state, and output variables for movement about the horizontal and vertical axes. The dynamics of VOR adaptation could be modeled across discrete trials, consisting, for example, of several cycles of sinusoidal head rotation about some axis. The variables y_t and u_t could then be time averages over trial t of the respective variables. On the other hand, VOR gain adaptation is more accurately described as a continuous learning process. This view can also be incorporated into the LDS framework: time is discretized into steps indexed by t, and the variables y_t and u_t represent averages over each time step.
In the examples described so far, the input and output error signals are explicitly defined with respect to some task-relevant goal. It is important to note, however, that the movement goal need not be explicitly specified or directly measurable. There are many examples where sensorimotor learning occurs without an explicit task goal: when auditory feedback of vowel production is pitch-shifted, subjects alter their speech output to compensate for the shift (Houde & Jordan, 1998); when reaching movements are performed in a rotating room, subjects adapt to the Coriolis forces to produce nearly straight arm trajectories even without visual feedback (Lackner & Dizio, 1994); when the visually perceived curvature of reach trajectories is artificially shifted, subjects adapt their true arm trajectories to compensate for the apparent curvature (Wolpert, Ghahramani, & Jordan, 1995; Flanagan & Rao, 1995). What is common to these examples is an apparent implicit movement goal that subjects are trying to achieve. The LDS approach can still be applied in this common experimental scenario. In this case, a measure of the trial-by-trial deviation from a baseline (preexposure) movement can serve as a measure of the state of sensorimotor adaptation or as an error feedback signal. This approach has been applied successfully in the study of reach adaptation to force perturbations (Thoroughman & Shadmehr, 2000; Donchin et al., 2003).

3 Characteristics of LDS Models

We now describe how the LDS parameters relate to two important characteristics of sensorimotor learning: the steady-state behavior of the learner and the effects of performance bias.

3.1 Dynamics vs. Steady State. Most experiments on sensorimotor learning have focused on the after-effect of learning, measured as the change in motor output following a block of repeated exposure trials. The LDS can be used to model such blocked-exposure designs. An LDS with constant exogenous inputs (r_t = r, w_t = w) will, after many trials, approach a steady state in which the state and output are constant except for fluctuations due to noise. The after-effect is essentially the expected value of the steady-state output,

y_∞ = lim_{t→∞} E(y_t) = C x_∞ + D w.    (3.1)
An expression for the steady state x_∞ = lim_{t→∞} E(x_t) is obtained by taking the expectation value and then the limit of equation 2.9a, yielding

x_∞ = A x_∞ + G r + H y_∞.    (3.2)
Combining equations 3.1 and 3.2, the steady state is given by the solution of

−(A + HC − I) x_∞ = G r + H D w.    (3.3)
A unique solution to equation 3.3 exists only if A + HC − I is invertible. One sufficient condition for this is asymptotic stability of the system, meaning that the state converges to zero as t → ∞ in the absence of any inputs or noise. This follows from the fact that the system is asymptotically stable if and only if all of the eigenvalues of A + HC have magnitude less than unity (Kailath, 1980). When a unique solution exists, it is given by

x_∞ = −(A + HC − I)^{-1} (G r + H D w),    (3.4)

and the steady-state output is

y_∞ = −C (A + HC − I)^{-1} (G r + H D w) + D w.    (3.5)
This last expression can be broken down into the effective gains for the inputs r and w: the coefficients C(A + HC − I)^{-1} G and C(A + HC − I)^{-1} H D + D, respectively. While these two gains can be estimated from the steady-state output in a blocked-exposure experiment, they are insufficient to determine the full system dynamics. In fact, none of the parameters of the LDS model in equation 2.9 can be directly estimated from these two gains.

The difference between studying the dynamics and the steady state is best illustrated with examples. We consider the performance of two specific LDS models. For simplicity, we place several constraints on the LDS: C = I, meaning the state x represents the mean motor output; D = 0, meaning the exogenous input has no direct effect on the current movement; and G and H invertible, meaning that no dimensions of the input r or the feedback y_t are ignored during learning. In the first example, we consider the case where A = I. This is a system with no intrinsic decay of the learned state. From equation 3.5, we see that the steady-state output in this case converges to the value

y_∞ = −H^{-1} G r.    (3.6)
If this is a simple error-corrective learning algorithm as in the first example of section 2.3, then G = H and the output converges to a completely adaptive response in the steady state, y∞ = −r . In such a case, the steady-state output is independent of the value of H, and so experiments that measure only the steady-state performance will miss the dynamical effects due to the structure of H. Such effects are illustrated in Figure 2, which shows
Figure 2: Illustration of the difference between the trial-by-trial state of adaptation (connected arrows) and the steady state of adaptation (open circles) in a simple simulation of error-corrective learning. The data were simulated with no noise and with diagonal matrices G = H. The learning rate in the x-direction, H_{11}, was 40% smaller than in the y-direction, H_{22}. Four different input vectors r_t = −y_t^* were used, as shown in the inset in the corresponding shade of gray.
the simulated evolution of the state of the system when the learning rate along the x-direction (H_{11}) is 40% smaller than that in the y-direction (H_{22}). Such spatial anisotropies in the dynamics of learning provide important clues about the underlying mechanisms of learning. For example, anisotropies could be used to identify differences in the learning and control of specific spatial components of movement (Pine, Krakauer, Gordon, & Ghez, 1996; Favilla, Hening, & Ghez, 1989; Krakauer, Pine, Ghilardi, & Ghez, 2000).

In the second example, we consider a learning rule that does not achieve complete adaptation, a more typical outcome in experimental settings. Unlike in the last example, we assume that the system is isotropic: A = a I, G = −g I, and H = −h I, where a, g, and h are positive scalars. In this case, the steady-state output is

y_∞ = −g r / ((1 − a) + h).    (3.7)
A system with a = 1 and g = h will exhibit complete adaptation: y∞ = −r . However, incomplete adaptation, |y∞ | < |r |, could be due to either state decay (a < 1) or a greater weighting of the feedback signal compared to the exogenous shift (h > g). Measurements of the steady state are not sufficient to distinguish between these two qualitatively different scenarios. These examples illustrate that knowing the steady-state output does not fully constrain the important features of the underlying mechanisms of adaptation.
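These steady-state expressions are easy to check numerically. The sketch below (our code, with arbitrary illustrative parameters) evaluates equations 3.4 and 3.5 for an isotropic system and confirms the closed form of equation 3.7:

import numpy as np

# Illustrative isotropic two-dimensional system (our parameter choices).
n = 2
I = np.eye(n)
a, g, h = 0.98, 0.08, 0.1
A, G, H = a * I, -g * I, -h * I
C, D = I, np.zeros((n, n))
r = np.array([1.0, 0.5])
w = np.zeros(n)

# Steady state from equations 3.4 and 3.5.
M = A + H @ C - I
x_inf = -np.linalg.solve(M, G @ r + H @ D @ w)
y_inf = C @ x_inf + D @ w

# Compare with the scalar closed form of equation 3.7.
assert np.allclose(y_inf, -g * r / ((1 - a) + h))
print(y_inf)   # incomplete adaptation: |y_inf| < |r| when a < 1 or h > g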
3.2 Modeling Sensorimotor Bias. The presence of systematic errors in human motor performance is well documented (e.g., Weber & Daroff, 1971; Soechting & Flanders, 1989; Gordon, Ghilardi, Cooper, & Ghez, 1994). Such sensorimotor biases could arise from a combination of sensory, motor, and processing errors. While bias has not been considered in previous studies using LDS models, failing to account for it can lead to poor estimates of the model parameters. Here, we show how sensorimotor bias can be incorporated into the LDS model and how the effects of bias on parameter estimation can be avoided.

It might seem that the simplest way to account for sensorimotor bias would be to add a bias vector b_y to the output equation of equation 2.9:

x_{t+1} = A x_t + G r_t + H y_t + η_t,    (3.8a)
y_t = C x_t + D w_t + b_y + γ_t.    (3.8b)
However, in a stable feedback system, feedback-dependent learning will effectively adapt away the bias. This can be seen by examining the steady state of equation 3.8:

x_∞ = −(A + HC − I)^{-1} (G r + H D w + H b_y),
y_∞ = −C (A + HC − I)^{-1} (G r + H D w + H b_y) + D w + b_y.

Considering the case where A = I and C and H are invertible, the steady state compensates for the bias, x_∞ = −C^{-1} (H^{-1} G r + D w + b_y), and so the bias term vanishes entirely from the asymptotic output: y_∞ = −H^{-1} G r. The simplest way to capture a stationary sensorimotor bias in an LDS model is therefore to incorporate it into the learning equation. For reasons that will become clear below, we add a bias term −H b_x:

x_{t+1} = A x_t + G r_t + H y_t − H b_x + η_t,    (3.9a)
y_t = C x_t + D w_t + γ_t.    (3.9b)
Now the bias carries through to the sensorimotor output:

x_∞ = −(A + HC − I)^{-1} (G r + H D w − H b_x),
y_∞ = −C (A + HC − I)^{-1} (G r + H D w − H b_x) + D w.

Again assuming that A = I and that C and H are invertible, the stationary output becomes y_∞ = −H^{-1} G r + b_x, which is the unbiased steady-state output plus the bias vector b_x. As described in section 2, the sensorimotor biases defined above are closely related to the equilibrium terms x_e, y_e, and so on in equation 2.4. If
equation 3.9 were fit to experimental data, the bias term b_x would capture the combined effects of all the equilibrium terms in equation 2.4a. Similarly, a bias term b_y in the output equation would account for the equilibrium terms in equation 2.4b. However, adding these bias parameters to the model would increase the model complexity with little benefit, as it would be very difficult to interpret these composite terms. Here we show how bias can be removed from the data so that these model parameters can be safely ignored.

With T being the total number of trials or time steps, let x̄ = (1/(T−1)) Σ_{t=1}^{T−1} x_t, with ū, w̄, and ȳ defined accordingly. Averaging equation 2.5 over t, we get

(1/(T−1)) Σ_{t=1}^{T−1} x_{t+1} = A x̄ + B ū + b_x + (1/(T−1)) Σ_{t=1}^{T−1} η_t,
ȳ = C x̄ + D w̄ + b_y + (1/(T−1)) Σ_{t=1}^{T−1} γ_t.
With the approximations (1/(T−1)) Σ_{t=1}^{T−1} x_{t+1} ≈ x̄ + O(1/T), (1/(T−1)) Σ_{t=1}^{T−1} η_t ≈ 0 + O(1/√T), and (1/(T−1)) Σ_{t=1}^{T−1} γ_t ≈ 0 + O(1/√T), which are quite good for large T, we get

x̄ = A x̄ + B ū + b_x,
ȳ = C x̄ + D w̄ + b_y.

Subtracting these equations from equation 2.5 leads to

x_{t+1} − x̄ = A (x_t − x̄) + B (u_t − ū) + η_t,    (3.10)
y_t − ȳ = C (x_t − x̄) + D (w_t − w̄) + γ_t.    (3.11)
Therefore, the mean-subtracted values of the empirical input and output variables obey the same dynamics as the raw data, but without the bias terms. This justifies using equation 2.6 to model experimental data, as long as the inputs and outputs are understood to be the mean-subtracted values.
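In practice this preprocessing is a one-line operation. A minimal sketch, assuming the inputs and outputs are stored as arrays with one row per trial (hypothetical data; variable names are ours):

import numpy as np

def mean_subtract(v):
    # Remove the time average so that equation 2.6 can be fit without
    # explicit bias terms (cf. equations 3.10 and 3.11).
    return v - v.mean(axis=0)

# Hypothetical data: T trials of two-dimensional inputs and outputs.
rng = np.random.default_rng(1)
T = 500
r = rng.normal(size=(T, 2)) + 0.3   # exogenous inputs r_t, with a bias
w = rng.normal(size=(T, 2))         # feedthrough inputs w_t
y = rng.normal(size=(T, 2)) - 0.5   # measured outputs y_t, with a bias

r0, w0, y0 = (mean_subtract(v) for v in (r, w, y))
# r0, w0, and y0 obey the same dynamics as the raw data but without
# the bias terms b_x and b_y, so equation 2.6 applies directly.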
4 Parameter Estimation

Ultimately, LDS models of sensorimotor learning are useful only if they can be fit to experimental data. The process of selecting the LDS model that best accounts for a sequence of inputs and outputs is called system identification (Ljung, 1999). Here we take a maximum-likelihood approach to system identification. Given a sequence (or sequences) of input and output data,
we wish to determine the model parameters for equation 2.9 for which the data set has the highest likelihood; that is, we want the maximum likelihood estimates (MLE) of the model parameters. Since no analytical solution exists in this case, we employ the expectation-maximization (EM) algorithm to calculate the MLE numerically (Dempster, Laird, & Rubin, 1977). A review of the EM algorithm for LDS models (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996) is presented in the appendix, with one algorithmic refinement. A Matlab implementation of this EM algorithm is freely available online at http://keck.ucsf.edu/~sabes/software/. Here we discuss several issues that commonly arise when trying to fit LDS models that include sensory feedback.

4.1 Correlation Between Parameter Estimates. Generally, identification of a system operating in closed loop (i.e., where the output is fed back to the learning rule) is more difficult than if the same system were operating in open loop (no feedback). This is partly because the closed loop makes the system less sensitive to external input (Ljung, 1999). In addition, and perhaps more important for our application, since the output directly affects the state, these two quantities tend to be correlated. This makes it difficult to distinguish their effects on learning, that is, to fit the parameters A and H. To determine the conditions that give rise to this difficulty, consider two hypothetical LDS models,

x_{t+1} = A x_t + G r_t + H y_t + η_t,    (4.1)
x_{t+1} = A' x_t + G r_t + H' y_t + η_t,    (4.2)
which are related to each other by A' = A − δC and H' = H + δ. These models differ in how much the current state affects the subsequent state directly (A) or through output feedback (H), and the difference is controlled by δ. However, the total effect of the current state is the same in both models: A + HC = A' + H'C. Distinguishing these two models is thus equivalent to separating the effects of the current state and the feedback on learning. To determine when this distinction is possible, we rewrite the second LDS in terms of A and H:

x_{t+1} = (A − δC) x_t + G r_t + (H + δ) y_t + η_t
        = A x_t + G r_t + H y_t + δ (y_t − C x_t) + η_t
        = A x_t + G r_t + H y_t + δ (D w_t + γ_t) + η_t,    (4.3)
where the last step uses equation 2.9b. Comparing equations 4.1 and 4.3, we see that the two models are easily distinguished if the contribution of the δD w_t term is large. However, in many experimental settings, the
Figure 3: Correlations between LDS parameter estimates across 1000 simulated data sets. Each panel corresponds to a particular value for T (rows: T = 400 and T = 4000) and √(R/Q) (columns: 1/8, 1, and 8). Simulations used an LDS with parameters A = 0.8, |Gr| = 0.5, H = −0.2, C = 1, D = 0, Q = 1, and zero-mean, white noise inputs r_t with unit variance.
exogenous input has little effect on the ongoing motor output (e.g., in the terminal feedback paradigm described in section 5), implying that D w_t is negligible. In this case, the main difference between these two models is the noise term δγ_t, and the ability to distinguish between the models depends on the relative magnitudes of the variances of the output and learning noise terms, R = Var(γ_t) and Q = Var(η_t). For |R| ≫ |Q|, separation will be relatively easy. In other cases, the signal due to δγ_t may be lost in the learning noise.

We next fit a one-dimensional LDS to simulated data in order to confirm the correlation between the estimates for A and H and to investigate other potential correlations among parameter estimates. In these simulations, the inputs r_t and the learning noise η_t were both zero-mean, gaussian white noise processes. Both variables had unit variance, determining the scale for the other parameters' values, which are listed in Figure 3. Three different values for √(R/Q) and two data set lengths T were used. Note that the input affects learning via the product G r, and so the standard deviation of this product quantifies the input effect. Here and below we use |Gr| to refer to this standard deviation. For each pair of values for √(R/Q) and T, we simulated 1000 data sets and estimated the LDS parameters with the EM algorithm. We then calculated the correlations between the various parameter estimates across simulated data sets (see Figure 3). With sufficiently large data sets, the MLE are consistent and exhibit little variability
(see Figure 3, bottom row). If the data sets are relatively small, two different effects can be seen, depending on the relative magnitudes of R and Q. As predicted above, when R < Q, there is large variability in the estimates Â and Ĥ, and they are negatively correlated (see Figure 3, top-left panel). In separate simulations (data not shown), we confirmed that this correlation disappears when there are substantial feedthrough inputs, that is, when |Dw| is large. In contrast, when the output noise has large variance, R > Q, the estimate R̂ covaries with the other model parameters (see Figure 3, top-right panel). This effect has an intuitive explanation: when the estimated output variance is larger than the true value, even structured variability in the output is counted as noise. In other words, a large R̂ masks the effects of the other parameters, which are thus estimated to be closer to zero.

Next, we isolate and quantify in more detail the correlation between the estimates Â and Ĥ. We computed the MLE for simulated data under two different constraint conditions. In the first case, only Ĥ was fit to the data, while all other parameters were fixed at their true values (H unknown, A known). In the second case, the sum A + HC and all parameters other than A and H were fixed to their true values (A and H unknown, A + HC known). In this case, Â and Ĥ were constrained to be A − δC and H + δ, respectively, and the value of δ was fit to the data. This condition tests directly whether the maximum likelihood approach can distinguish the effects of the current state and the feedback on learning. Under both of these conditions, there is only a single free parameter, and so a simple line search algorithm was used to find the MLE of the parameter from simulated data. Data were simulated with the same parameters as in Figure 3 while three quantities were varied: √(R/Q), |Gr|, and T (see Figure 4). For a given set of parameter values, we simulated 1000 data sets and found the MLE Ĥ for each. The 95% confidence interval (CI) for Ĥ was computed as the symmetric interval around the true H containing 95% of the fit values Ĥ.

The results for the first constraint case (H unknown) are shown on the left side of Figure 4. Uncertainty in Ĥ is largely invariant to the magnitude of the output noise or of the exogenous input signal, |Gr|, although there is a small interaction between the two at high input levels. For later comparison, we note that with T = 1000, we obtain a 95% CI of ±0.05. The results are very different when A + HC is fixed but A and H are unknown (right side of Figure 4). While the uncertainty in Ĥ is independent of the input magnitude, it is highly dependent on the magnitude of the output noise. As predicted, when R is small relative to Q, there is much greater uncertainty in Ĥ compared to the first constraint condition. For example, if √(R/Q) ≈ 2 (comparable to the empirical results shown in Figure 8C below), several thousand trials would be needed to reach a 95% CI of ±0.05. In order to match the precision of Ĥ obtained in the first constraint condition with T = 1000, √(R/Q) ≈ 8 is needed.
Figure 4: Uncertainty in the feedback parameter Ĥ in two different constraint conditions. All data were simulated with parameters A = 0.8, H = −0.2, C = 1, D = 0. Q sets the scale. (A) Variability of Ĥ as a function of the output noise magnitude, when all other parameters, in particular A, are known (T = 400 trials). Lines correspond to different values of the magnitude of the exogenous input signal. (B) Variability of Ĥ as a function of data set length, T, for |Gr| = 0 (no exogenous input). Lines correspond to different levels of output noise. (C, D) Variability of Ĥ when both H and A are unknown but A + HC, as well as all other parameters, are known; otherwise as in panels A and B, respectively.
4.2 Hypothesis Testing. One important goal in studying sensorimotor learning is to identify which sensory signal, or signals, drive learning. Consider the case of a k-dimensional vector of potential input signals, r_t. We wish to determine whether the ith component of this input has a significant effect on the evolution of the state, that is, whether G_i, the ith column of
G, is significantly different from zero. Specifically, we would like to test the hypothesis H1: G_i ≠ 0 against the null hypothesis H0: G_i = 0. Given the framework of maximum likelihood estimation, we could use a generic likelihood ratio test to assess the significance of the parameters G_i (Draper & Smith, 1998). However, the likelihood ratio test is valid only in the limit of large data sets. Given the complexity of these models, that limit may not be achieved in typical experimental settings. Instead, we propose the use of a permutation test, a simple class of nonparametric, resampling-based hypothesis tests (Good, 2000).

Consider again the null hypothesis H0: G_i = 0, which implies that r_t^i, the ith component of the exogenous input r_t, does not influence the evolution of the system. If H0 were true, then randomly permuting the values of r_t^i across t should not affect the quality of the fit; in any case, we expect Ĝ_i to be near zero. This suggests the following permutation test. We randomly permute the ith component of the exogenous input, determine the MLE parameters of the LDS model from the permuted data, and compute a statistic representing the goodness of fit of the MLE, in our case the log likelihood of the permuted data given the MLE, L_perm. This process is repeated many times. The null hypothesis is then rejected at the (1 − α) level if the fraction of L_perm values that exceed the value of L computed from the original data set is below α. Alternatively, the magnitude of Ĝ_i itself could be used as the test statistic, since G_i is expected to be near zero for the permuted data sets. A code sketch of this procedure is given at the end of this section.

To evaluate the usefulness of the permutation test outlined above, we performed a power analysis on data sets simulated from the one-dimensional LDS described in Figure 5. For each combination of parameters, we simulated 100 data sets. For each of these data sets, we determined the significance of the scalar G using the permutation test described above with k = i = 1, α = 0.05, and 500 permutations of r_t. The fraction of data sets for which the null hypothesis was (correctly) rejected represents the power of the permutation test for those parameter values. The top panel of Figure 5 shows the power as a function of the input magnitude and trial length T, given √(R/Q) = 2. The lower panel shows the magnitude of the input required to obtain a power of 80%, as a function of √(R/Q) and T. Plots such as these should be used during experimental design. However, since neither G nor the output and learning noise magnitudes are typically known, heuristic values must be used. With √(R/Q) = 2 and |Gr|/√Q = 0.2, approximately 1600 trials are needed to obtain 80% power. Note, however, that the exogenous input signal is often under experimental control. In this case, the same power could be achieved with about 400 trials if the magnitude of that signal were doubled. In general, increasing the amplitude of the input signal will allow the experimenter to achieve any desired statistical power for this test. Practically, however, the size of a perturbative input signal is usually limited by several factors, including physical constraints or a requirement that subjects remain
Figure 5: Power analysis of the permutation test for the significance of G. Simulation parameters: A = 0.8, H = −0.2, C = 1, and D = 0. Q sets the scale. (A) Statistical power when √(R/Q) = 2. (B) Input magnitude required to achieve 80% power, as a ratio of √Q. In both panels, α = 0.05, and line type indicates the data set length T.
unaware of the manipulation. Therefore, large numbers of trials may often be required to achieve sufficient power.
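The following sketch outlines the permutation test described above. It assumes a helper fit_lds(r, w, y) that runs the EM algorithm of the appendix and returns the maximized log likelihood; that helper name is hypothetical, standing in for whatever system identification routine is available:

import numpy as np

def permutation_test_G(r, w, y, i, fit_lds, n_perm=500, alpha=0.05, seed=0):
    # Test H0: G_i = 0 by permuting the i-th exogenous input across t.
    # fit_lds(r, w, y) is a hypothetical helper returning the log
    # likelihood of the data under the maximum likelihood parameters.
    rng = np.random.default_rng(seed)
    L_obs = fit_lds(r, w, y)
    L_perm = np.empty(n_perm)
    for k in range(n_perm):
        r_perm = r.copy()
        r_perm[:, i] = rng.permutation(r_perm[:, i])  # shuffle input i
        L_perm[k] = fit_lds(r_perm, w, y)
    # Reject H0 if few permuted fits reach the observed likelihood.
    p_value = np.mean(L_perm >= L_obs)
    return p_value, p_value < alpha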
4.3 Combining Multiple Data Sets. A practical approach to collecting larger data sets is to perform repeated experiments, either during multiple sessions with a single subject or with multiple subjects. In either case, accurate model fitting requires that the learning rule be stationary (i.e., constant LDS parameters) across sessions and subjects. There are two possible approaches to combining N data sets of length T. First, the data from different sessions can be concatenated to form a “super data set” of length NT. The
system identification procedure outlined in the appendix can be directly applied to the super data set, with the caveat that the initial state x_{t=1} has to be reset at the beginning of each data set. A second approach can be taken when nominally identical experiments are repeated across sessions, that is, when the input sequences r_t and w_t are the same in each session. In this approach, the inputs and outputs are averaged across sessions, yielding a single average data set with inputs r̃_t and w̃_t and outputs ỹ_t. The dynamics underlying this average data set are obtained by averaging the learning and output equations for each t (see equation 2.9) across data sets:

x̃_{t+1} = A x̃_t + [G H] [r̃_t; ỹ_t] + η̃_t,    (4.4a)
ỹ_t = C x̃_t + D w̃_t + γ̃_t.    (4.4b)
Note that the only difference between this model and that describing the individual data sets is that the noise variances have been scaled: Var(η̃_t) = Q/N and Var(γ̃_t) = R/N. Therefore, the EM algorithm (see the appendix) can be directly applied to average data sets as well. Since both approaches to combining multiple data sets are valid, we ran simulated experiments to determine which approach produces better parameter estimates, that is, tighter confidence intervals. Simulations were performed with the model described in Figure 6, varying R and Q while maintaining a fixed ratio √(R/Q) = 2.
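A sketch of the two ways of combining N nominally identical sessions (our code; the arrays are placeholders for real recordings):

import numpy as np

# Hypothetical stack of N repeated sessions with identical input
# sequences: arrays of shape (N, T, dim).
rng = np.random.default_rng(2)
N, T, dim = 10, 200, 2
r = np.broadcast_to(rng.normal(size=(1, T, dim)), (N, T, dim))
w = np.zeros((N, T, dim))
y = rng.normal(size=(N, T, dim))   # stand-in for measured outputs

# Averaging across sessions (equation 4.4). The averaged data obey
# the same LDS, but with Var(eta) = Q/N and Var(gamma) = R/N.
r_avg, w_avg, y_avg = r.mean(axis=0), w.mean(axis=0), y.mean(axis=0)

# The alternative is concatenation into one "super data set" of
# length N*T, resetting the initial state at each session boundary.
y_cat = y.reshape(N * T, dim)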
^
half-width of 95% CI of G
0.2
0.15
Q 2 1 1/2 1/4 1/8
0.1
0.05
0
0
1000
2000
3000
4000
dataset length
Figure 6: 95% confidence intervals for the MLE of the input parameter G, computed from 1000 simulated data sets. All simulations were run with √(R/Q) = 2 and gaussian white noise inputs with zero mean and unit variance. Lines correspond to different values of Q (2, 1, 1/2, 1/4, and 1/8). A = 0.8, G = −0.3, H = −0.2, C = 1, and D = 0.
The preferred approach for combining data sets depends on which parameter one is trying to estimate. CIs for the exogenous input parameter Ĝ are shown in Figure 6. Variability depends strongly on the number of trials but even more so on the noise variances. For example, with 200 trials, R = 1, and Q = 1/2, the 95% CI is ±0.145. Multiplying the number of trials by 4 results in a 41% improvement in the CI, yet dividing the noise variances by 4 yields a 76% improvement. Therefore, for estimating G, it is best to average the data sets. It should be noted, however, that the advantage is somewhat weaker for larger input variances (Figure 6 was generated with unit input variance; other data not shown). By contrast, the variability of the MLE for H depends only mildly on the noise variances (data not shown), if at all. Therefore, increasing the data set length by concatenation produces better estimates for H than decreasing the noise variances by averaging.

4.4 Linear Regression. As noted above, there is no analytic solution for the maximum likelihood estimators of the LDS parameters, so the EM algorithm is used (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996). While EM has many attractive properties (Dempster et al., 1977), it is a computationally expensive, iterative algorithm, and it can get caught in local minima. The computation time becomes particularly problematic when system identification is used in conjunction with resampling schemes for statistical analyses such as bootstrapping (Efron & Tibshirani, 1993) or the permutation tests described above. It is therefore tempting to circumvent the inefficiencies of the EM algorithm by converting the problem into one that can be solved analytically. Specifically, there have been attempts to fit a subclass of LDS models using linear regression (e.g., Donchin et al., 2003). It is in fact possible to find a regression equation for the input parameters G, H, and D if we assume that A = I, a fairly strict constraint implying no state decay. There are then two possible approaches to transforming system identification into a linear regression problem. As we describe here, however, both approaches can lead to inefficient, or worse, inconsistent estimates.

First we consider the subtraction approach. If A = I, the following expression for the trial-by-trial change in output can be derived from the LDS in equation 2.6 (i.e., the form without explicit feedback):

y_{t+1} − y_t = C B u_t + D (w_{t+1} − w_t) + γ_{t+1} − γ_t + C η_t.    (4.5)
This equation suggests that the trial-by-trial changes in y_t can be regressed on the inputs u_t and w_t in order to determine the parameters CB and D. Obtaining an estimate of B requires knowing C; however, that is not a significant constraint, due to the existence of the similarity classes described in equation 2.7. One complication with this approach is that the noise terms in equation 4.5, γ_{t+1} − γ_t + C η_t, are correlated across trials. Such colored noise can be accommodated in linear regression (Draper & Smith, 1998);
Figure 7: Bias in Ĥ using two different linear regression fits of simulated data. The data sets were simulated with no exogenous inputs and the LDS model parameters A = 1, G = 0, H = −0.2, C = 1, and D = 0. Q sets the scale. The lower (black) data points represent the average Ĥ over 1000 simulated data sets using the subtraction approach. The upper (gray) data points are for the summation approach. The true value of H is shown as the dotted line. Error bars represent standard deviations.
however, the ratio of R and Q would have to be known in order to construct an appropriate covariance matrix for the noise. A more serious problem with the regression model in equation 4.5 arises in feedback systems, in which B u_t = G r_t + H y_t. In this case, the right-hand side of equation 4.5 contains the same term, y_t, as the left-hand side. This correlation between the independent variable y_t and the dependent variable y_{t+1} − y_t leads to a biased estimate for H. As an example, consider a sequence y_t that is generated by a pure white-noise process with no feedback dynamics. If the regression model of equation 4.5 were applied to this sequence, with C = 1 and no exogenous inputs, the expected value of the regression parameter would be Ĥ = −1, compared to the true value H = 0. Application of the regression model in equation 4.5 to simulated data confirms this pattern of bias in Ĥ (see Figure 7, black line); a small numerical demonstration is sketched at the end of this section. In the general multivariate case, Ĥ will be biased toward −I as long as the output noise is large compared to the learning noise.

The bias described above arises from the fact that the term y_t occurs on both sides of the regression equation. This problem can be circumvented by deriving a simple expression for the output as a linear function of the initial state and all inputs up to, but not including, the current time step:
y_t = C x_1 + C Σ_{τ=1}^{t−1} (G r_τ + H y_τ + η_τ) + D w_t + γ_t.    (4.6)
This expression suggests the summation approach to system identification by linear regression. The set of equations 4.6 for each t can be combined into the standard matrix regression equation Y = Xβ + noise, where

row t of Y:   y_t^T,
row t of X:   [1   Σ_{τ=1}^{t−1} r_τ^T   Σ_{τ=1}^{t−1} y_τ^T   w_t^T],
β = [x_1^T C^T;  G^T C^T;  H^T C^T;  D^T],
row t of the noise:   (C Σ_{τ=1}^{t−1} η_τ + γ_t)^T,    (4.7)

with semicolons separating the block rows of β.
For a given value of C, regression would produce estimates for x_1, G, H, and D. One pitfall with this approach is that the variance of the noise terms grows linearly with the trial count t:

Var(C Σ_{τ=1}^{t−1} η_τ + γ_t) = (t − 1) C Q C^T + R.    (4.8)
This problem is negligible if |Q| ≪ |R|. Otherwise, as the trial count increases, the accumulated learning noise will dominate all other correlations, and the estimated parameter values will approach zero. This effect can be seen for our simulated data sets in Figure 7, gray line. Of course, this problem could be addressed by modeling the full covariance matrix of the noise terms and including it in the regression (equation 4.8 gives the diagonal terms of the matrix). However, this requires knowing Q and R in advance. Also, the linear growth in variance means that later terms will be heavily discounted, forfeiting the benefit of large data sets.
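The subtraction-approach bias, in particular, is easy to reproduce. The sketch below (our simulation) applies the regression of y_{t+1} − y_t on y_t to a pure white-noise output with no feedback dynamics and recovers Ĥ ≈ −1 even though the true H is 0 (cf. Figure 7):

import numpy as np

rng = np.random.default_rng(3)

# Pure white-noise output: no feedback dynamics, so the true H is 0.
T = 10000
y = rng.normal(size=T)

# Subtraction approach with C = 1 and no exogenous inputs: regress
# y_{t+1} - y_t on y_t (equation 4.5 with B u_t = H y_t).
dy = y[1:] - y[:-1]
H_hat = np.dot(y[:-1], dy) / np.dot(y[:-1], y[:-1])
print(H_hat)   # approximately -1, despite the true H being 0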
5 Example: Reaching with Shifted Visual Feedback

Here we present an application of the LDS framework using a well-studied form of feedback error learning: reach adaptation due to shifted visual feedback. In this experiment, subjects made reaching movements in a virtual environment with artificially shifted visual feedback of the hand. The goal is to determine whether a dynamically changing feedback shift drives reach adaptation and, if so, what the dynamics of learning are. Subjects were seated with their unseen right arm resting on a table. At the beginning of each trial, the index fingertip was positioned at a fixed start location, a virtual visual target was displayed at location g_t, and after a short delay an audible go signal instructed subjects to reach to the visual target. The movement began with no visual feedback, but just before the
end of the movement (determined online from fingertip velocity), a white disk indicating the fingertip position appeared (“terminal feedback”). The feedback disk tracked the location of the fingertip, offset from the finger by a vector r_t. The fingertip position at the end of the movement is represented by f_t. Each subject performed 200 reach trials. The sequence of visual shifts r_t was a random walk in two-dimensional space. The two components of each step, r_{t+1} − r_t, were drawn independently from a zero-mean normal distribution with a standard deviation of 10 mm and with the step magnitude limited to 20 mm. In addition, each component of r_t was limited to the ±60 mm range, with reflective boundaries. These limitations were chosen to ensure that subjects did not become aware of the visual shift.

In order to model this learning process with an LDS, we need to define its inputs and outputs. Reach adaptation is traditionally quantified with the after-effect, that is, the reach error y_t = f_t − g_t measured in a no-feedback test reach (Held & Gottlieb, 1958). In our case, the terminal feedback appeared sufficiently late to preclude feedback-driven movement corrections. Therefore, the error on each movement, y_t, is a trial-by-trial measure of the state of adaptation. We will also define the state of the system x_t to be the mean reach error at a given trial, that is, the average y_t that would be measured across repeated reaches if learning were frozen at trial t. This definition is consistent with the output equation of the LDS model, equation 2.9b, if two constraints are placed on the LDS: D = 0 and C = I. The first constraint is valid because of the late onset of the visual feedback. The second constraint resolves the ambiguity of the remaining LDS parameters, as discussed in section 2.2. We will also assume that the input to the LDS is the visually perceived error, u_t = y_t + r_t. Thus, we are modeling reach adaptation as error-corrective learning, with a target output y_t^* = −r_t (see section 2.3). Note that with this input variable, B u_t = H y_t + G r_t if B = H = G.

Using the EM algorithm, the sequence of visually perceived errors (inputs) and reach errors (outputs) from each subject was used to estimate the LDS parameters A, B, Q, and R. The parameter fits from four subjects' data are shown in Figure 8A. The decay parameter Â is nearly diagonal for all subjects, implying that the two components of the state do not mix and thus evolve independently. Also, the diagonal terms of Â are close to 1, which means there is little decay of the adaptive state from one trial to the next. The individual components of the input parameter B̂ are considerably more variable across subjects. A useful quantity to consider is the square root of the determinant of the estimated input matrix, √|det B̂|, which is the geometric mean of the magnitudes of the eigenvalues of B̂. This value, shown in Figure 8B, is a scalar measure of the incremental state correction due to the visually perceived error on each trial. To determine whether these responses are statistically significant, we performed the permutation test
Figure 8: (A) Estimated LDS parameters Â, B̂, Q̂, and R̂ for four subjects (Q̂ and R̂ in mm²). Labels on the x-axis indicate the components of each matrix. Each bar shading corresponds to a different subject (S1-S4). (B) Results of the permutation test for the input parameter B for each subject. The square marks √|det B̂|, and the error bars show the 95% confidence interval for that value given H0: det B = 0, generated from 1000 permuted data sets. (C) Estimate of the ratio of output to learning noise standard deviations.
described in section 4.2 on the value of √|det B̂|.² The null hypothesis was H0: det B = 0. Figure 8B shows that H0 can be rejected with 95% confidence for all four subjects, and so we conclude that the visually perceived error significantly contributes to the dynamics of reach adaptation.

As discussed in section 4, the statistical properties of the MLE parameters depend to a large degree on the ratio of the learning and output noise terms. In the present example, the covariance matrices are two-dimensional, and so we quantify the magnitudes of the noise terms by the square roots of their determinants, √(det Q̂) and √(det R̂), respectively. The ratio of standard deviations is thus (det R̂/det Q̂)^{1/4}. The experimental value of this ratio ranges from 1.2 to 3.7 with a mean of 2.5 (see Figure 8C). These novel findings suggest that learning noise might contribute almost as much to behavioral variability as motor performance noise.
2 Since det B = 0 could result from any singular matrix B, even if B ≠ 0, this test is more stringent than testing B = 0. A singular matrix B implies that some components of the input do not affect the dynamics.
The results presented in Figure 8 depend on the assumption that the visually perceived error u_t drives learning. This guess was based on the fact that u_t is visually accessible to the subject and is a direct measure of the subject's visually perceived task performance. However, several alternative inputs exist, even if we restrict consideration to the variables already discussed. Note again that the input term B u_t can be expanded to H y_t + G r_t. The hypothesis that the visually perceived error drives learning is thus equivalent to the claim that H = G. However, the true reach error y_t and the visual shift r_t could affect learning independently. These variables are not accessible from visual feedback alone, but they could be estimated from comparisons of visual and proprioceptive signals or from the predictions of internal forward models. While these estimates might be noisier than those derived directly from vision, this sensory variability is included in the learning noise term, as discussed in section 2.1. Note that an incorrect choice of input variable u_t means that the LDS model cannot capture the true learning dynamics, and so the resulting estimate of the learning noise should be high. Indeed, one explanation for the large Q in the model fits above is that an important input signal was missing from the model. The LDS framework itself can be used to address such issues. In this case, we can test the alternative H ≠ G against the null hypothesis H = G by asking whether a significantly better fit is obtained with the two inputs [u_t, r_t] compared to the single input u_t. The permutation test, with permuted r_t, would be used. We showed in section 4.1, however, that such comparisons require more than the 200 trials per subject collected here.

6 Discussion

Quantitative models, even very simple ones, can be extremely powerful tools for studying behavior. They are often used to clarify difficult concepts, quantitatively test intuitive ideas, and rapidly test alternative hypotheses with virtual experiments. However, successful application of such models depends on understanding the properties of the model class being used. The class of LDS models is an increasingly popular tool for studying sensorimotor learning. We have therefore studied the properties of this model class and identified the key issues that arise in its application.

We explored the steady-state behavior of LDS models and related that behavior to the traditional measures of learning in blocked-exposure experimental designs. These results demonstrate why the dynamical systems approach provides a clearer picture of the mechanisms of sensorimotor learning. We described the EM algorithm for system identification and discussed some of the details and difficulties involved in estimating model parameters from empirical data. Most important, in closed-loop systems, it is difficult to separate the effects of state decay (A) and feedback (H) on the dynamics
of learning. Note that this limitation is an example of a more general difficulty with all linear models. If any two variables in either the learning or output equations are correlated, then it will be difficult to independently estimate the coefficients of those variables. For example, if the exogenous learning signal r_t and the feedthrough input w_t are correlated in a feedback system, then it is difficult to distinguish the exogenous and feedback learning parameters G and H. As a second example, if the exogenous input vector r_t is autocorrelated with a timescale longer than that of learning, then the input and the state will be correlated across trials. In this case, it would be difficult to distinguish A and G. Such is likely to be the case in experiments that include blocks of constant input, giving a compelling argument for experimental designs with random manipulations.

One attractive feature of LDS models is that they contain two sources of variability: an output or performance noise and a learning noise. Both sources of variability will contribute to the overall variance in any measure of performance. We know of no prior attempts to quantify these variables independently, despite the fact that they are conceptually quite different. In addition, the ratio of the two noise contributions has a large effect on the statistical properties of LDS model fits.

We motivated the LDS model class as a linearization of the true dynamics about equilibrium values for the state and inputs. Linearization is justifiable for modeling a set of movements with similar kinematics (e.g., repeated reaches with small variations in start and end locations) and small driving signals. However, many experiments consist of a set of distinct trial types that are quite different from each other, for example, a task with K different reach targets. It is a straightforward extension of the LDS model presented here to include separate parameters and state variables for each of the K trial types. In this case, the effect of the input variables (feedback and exogenous) on a given trial will be different for each of the K state variables (i.e., for the future performance of each trial type). The parameters G_{ij} and H_{ij} that describe these cross-condition effects (the effects of a type i trial on the jth state variable) are essentially measures of generalization across the trial types (Thoroughman & Shadmehr, 2000; Donchin et al., 2003). In addition, each trial type could be associated with a different learning noise variance Q_i and output noise variance R_i to account for signal-dependent noise. All of the practical issues raised in this letter apply, except that the additional parameters (whose number goes as K²) will require more data for equivalent power.

Finally, we note that if the output is known to be highly nonlinear, it is fairly straightforward to replace the linear output equation, equation 2.6b, with a known, nonlinear model of the state-dependent output, equation 2.2b. In that case, the Kalman smoother in the E-step of the EM algorithm would have to be replaced by the extended Kalman smoother, and the closed-form solution of the M-step would likely have to be replaced with an iterative optimization routine.
Appendix: Maximum Likelihood Estimation

We take a maximum likelihood approach to system identification. The maximum likelihood estimator (MLE) for the LDS parameters is given by

Θ̂ = argmax_Θ log P(y_1, . . . , y_T | Θ; u_1, . . . , u_T, w_1, . . . , w_T),    (A.1)
where Θ ≡ {A, B, C, D, R, Q} is the complete set of parameters of the model in equation 2.6. In the following, we will suppress the explicit dependence on the inputs u_1, . . . , u_T and w_1, . . . , w_T and use the notation X_t ≡ {x_1, . . . , x_t} and Y_t ≡ {y_1, . . . , y_t}. Generally, equation A.1 cannot be solved analytically, and numerical optimization is needed. Here we discuss the application of the EM algorithm (Dempster et al., 1977) to system identification of the LDS model defined in equation 2.6. The EM algorithm is chosen for its attractive numerical and computational properties. In most practical cases, it is numerically stable; every iteration increases the log likelihood monotonically:

log P(Y_T | Θ̂_{i+1}) ≥ log P(Y_T | Θ̂_i),    (A.2)

where Θ̂_i is the parameter estimate in the ith iteration, and convergence to a stationary point of the log likelihood is guaranteed (Dempster et al., 1977). In addition, the two iterative steps of the EM algorithm are often easy to implement. The E-step consists of calculating the expected value of the complete log likelihood, log P(X_T, Y_T | Θ), as a function of Θ, given the current parameter estimate Θ̂_i:

Λ(Θ, Θ̂_i) = E(log P(X_T, Y_T | Θ))_{Y_T, Θ̂_i}.    (A.3)

In the M-step, the parameters that maximize the expected log likelihood are found:

Θ̂_{i+1} = argmax_Θ Λ(Θ, Θ̂_i).    (A.4)
The starting point in the formulation of the EM algorithm is the derivation of the complete likelihood function, which is generally straightforward if the likelihood factorizes. Thus, we begin by asking whether the likelihood of an LDS model factorizes, even when there are feedback loops (cf. equation 2.9). From the graphical model in Figure 9, it is evident that y_t and x_{t+1} are conditionally independent of all other previous states and inputs, given x_t. The joint probability of y_t and x_{t+1}, given by Bayes' theorem, is

P(x_{t+1}, y_t | x_t) = P(x_{t+1} | y_t, x_t) P(y_t | x_t).
· · · → x_{t−1} → x_t → x_{t+1} → · · ·
           ↓         ↓        ↓
        y_{t−1}     y_t     y_{t+1}

Figure 9: Graphical model of the statistical relationship between the states and the outputs of the closed-loop system. The dependence on the deterministic inputs has been suppressed for clarity.
The complete likelihood function is thus

P(X_T, Y_T) = P(x_1) ∏_{t=1}^{T−1} P(x_{t+1} | y_t, x_t) ∏_{t=1}^{T} P(y_t | x_t).    (A.5)
This factorization means that for the purposes of this algorithm, we can regard the feedback as just another input variable. This view corresponds to the direct approach to closed-loop system identification (Ljung, 1999). The two steps of the EM algorithm for identifying the LDS model in equation 2.6, when B = D = 0 and C is known, were first reported by Shumway and Stoffer (1982). A more general version, which included estimation of C, was presented by Ghahramani and Hinton (1996). The EM algorithm for the general LDS of equation 2.6 is a straightforward extension, and we present it here without derivation.

A.1 E-Step. The E-step consists of calculating the following expectations and covariances:

x̂_t ≡ E(x_t | Y_T, Θ̂_i),    (A.6a)
V_t ≡ cov(x_t | Y_T, Θ̂_i),    (A.6b)
V_{t,τ} ≡ cov(x_t, x_τ | Y_T, Θ̂_i).    (A.6c)
These are computed by Kalman smoothing (Anderson & Moore, 1979), which consists of two passes through the sequence of trials. The forward pass is specified by

x_t^{t−1} = A x_{t−1}^{t−1} + B u_{t−1},    (A.7a)
V_t^{t−1} = Q + A V_{t−1}^{t−1} A^T,    (A.7b)
K_t = V_t^{t−1} C^T (C V_t^{t−1} C^T + R)^{−1},    (A.7c)
x_t^t = x_t^{t−1} + K_t (y_t − C x_t^{t−1} − D w_t),    (A.7d)
V_t^t = (I − K_t C) V_t^{t−1}.    (A.7e)
This pass is initialized with x_1^0 = π and V_1^0 = Π, where π is the estimate for the initial state and Π is its variance. If there are multiple data sets, all initial state estimates are set to π with variance Π. In fact, these parameters are included in Θ and will be estimated in the M-step. The backward pass is initialized with x_T^T and V_T^T from the last iteration of the forward pass and is given by

J_t = V_t^t A^T (V_{t+1}^t)^{−1},    (A.8a)
x_t^T = x_t^t + J_t (x_{t+1}^T − x_{t+1}^t),    (A.8b)
V_t^T = V_t^t + J_t (V_{t+1}^T − V_{t+1}^t) J_t^T.    (A.8c)
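For reference, the following sketch implements these E-step recursions (our code, assuming the general LDS of equation 2.6 with inputs u_t and w_t supplied). It also accumulates the cross-covariance V_{t+1,t}, whose closed form is given next, in equation A.8d:

import numpy as np

def kalman_smoother(y, u, w, A, B, C, D, Q, R, pi, Pi):
    # E-step recursions of equations A.7 and A.8 (a sketch).
    T, n = len(y), A.shape[0]
    x_f = np.zeros((T, n)); V_f = np.zeros((T, n, n))   # filtered, x_t^t
    x_p = np.zeros((T, n)); V_p = np.zeros((T, n, n))   # predicted, x_t^{t-1}
    # Forward (Kalman filter) pass, equations A.7a-e.
    for t in range(T):
        if t == 0:
            x_p[0], V_p[0] = pi, Pi
        else:
            x_p[t] = A @ x_f[t-1] + B @ u[t-1]
            V_p[t] = Q + A @ V_f[t-1] @ A.T
        K = V_p[t] @ C.T @ np.linalg.inv(C @ V_p[t] @ C.T + R)
        x_f[t] = x_p[t] + K @ (y[t] - C @ x_p[t] - D @ w[t])
        V_f[t] = V_p[t] - K @ C @ V_p[t]
    # Backward (smoothing) pass, equations A.8a-d.
    x_s, V_s = x_f.copy(), V_f.copy()
    V_cross = np.zeros((T-1, n, n))        # V_{t+1,t} given all T trials
    for t in range(T-2, -1, -1):
        J = V_f[t] @ A.T @ np.linalg.inv(V_p[t+1])
        x_s[t] = x_f[t] + J @ (x_s[t+1] - x_p[t+1])
        V_s[t] = V_f[t] + J @ (V_s[t+1] - V_p[t+1]) @ J.T
        V_cross[t] = V_s[t+1] @ J.T
    return x_s, V_s, V_cross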
The only quantity that remains to be computed is the covariance V_{t+1,t}, for which we present a closed-form expression:

V_{t+1,t}^T = V_{t+1}^T J_t^T.    (A.8d)
It is simple to show that this expression is equivalent to the recursive equation given by Shumway and Stoffer (1982) and Ghahramani and Hinton (1996). With the previous estimates of the parameters, Θ̂_i, and the state estimates from the E-step, it is possible to compute the value of the incomplete log likelihood function using the innovations form (Shumway & Stoffer, 1982):

log P(Y_T | Θ̂_i) = −(mT/2) log(2π) − (1/2) Σ_{t=1}^{T} log |R_t| − (1/2) Σ_{t=1}^{T} δ_t^T R_t^{−1} δ_t,    (A.9)
where δ_t = y_t − C x_t^{t−1} − D w_t are the innovations, R_t = C V_t^{t−1} C^T + R their variances, and m is the dimensionality of the output vectors y_t.

A.2 M-Step. The quantities computed in the E-step are used in the M-step to determine the argmax of the expected complete log likelihood Λ(Θ, Θ̂_i). Using the definitions P_t ≡ x̂_t x̂_t^T + V_t and P_{t,τ} ≡ x̂_t x̂_τ^T + V_{t,τ}, the solution to the M-step is given by

π = x̂_1,    (A.10a)
Π = V_1,    (A.10b)
[A B] = ( Σ [P_{t+1,t}   x̂_{t+1} u_t^T] ) ( Σ [P_t   x̂_t u_t^T; u_t x̂_t^T   u_t u_t^T] )^{−1},    (A.10c)

[C D] = ( Σ [y_t x̂_t^T   y_t w_t^T] ) ( Σ [P_t   x̂_t w_t^T; w_t x̂_t^T   w_t w_t^T] )^{−1},    (A.10d)

where semicolons separate the block rows of the matrices,
Q = (1/(T−1)) Σ_{t=1}^{T−1} (P_{t+1} − A P_{t,t+1} − B u_t x̂_{t+1}^T),    (A.10e)

R = (1/T) Σ_{t=1}^{T} (y_t − C x̂_t − D w_t) y_t^T,    (A.10f)
where Σ = Σ_{t=1}^{T−1} in equation A.10c and Σ = Σ_{t=1}^{T} in equation A.10d. The parameters A and B in equation A.10e are the current best estimates computed from equation A.10c, and C and D in equation A.10f are the solutions from equation A.10d. The above equations, except for equations A.10a and A.10b, generalize to multiple data sets. The sums are then understood to extend over all the data sets. For multiple data sets, equation A.10a is replaced by an average over the estimates of the initial state, x̂_1^{(i)}, across the N data sets:

π = (1/N) Σ_{i=1}^{N} x̂_1^{(i)}.    (A.11a)
Its variance includes the variances of the initial state estimates as well as variations of the initial state across the data sets:
=
1
1
V1(i) + (xˆ 1(i) − π)(xˆ 1(i) − π)T . N N N
N
i=1
i=1
(3.11b)
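A matching sketch of the M-step regressions (equations A.10a–A.10f), again for a single data set; xs, Vs, and Vc are the outputs of the E-step sketch above, and the einsum calls are simply one way to accumulate the outer-product sums. The multiple-data-set averages of equations A.11a and A.11b are omitted for brevity.

import numpy as np

def m_step(y, u, w, xs, Vs, Vc):
    """M-step updates per equations A.10a-A.10f (single data set)."""
    T = y.shape[0]
    n = xs.shape[1]
    P = np.einsum('ti,tj->tij', xs, xs) + Vs              # P_t
    Pc = np.einsum('ti,tj->tij', xs[1:], xs[:-1]) + Vc    # P_{t+1,t}
    pi, Sigma = xs[0], Vs[0]                              # (A.10a), (A.10b)
    # [A B]: regression of x_{t+1} on (x_t, u_t), sums over t = 1..T-1.  (A.10c)
    G = np.block([[P[:-1].sum(0), np.einsum('ti,tj->ij', xs[:-1], u[:-1])],
                  [np.einsum('ti,tj->ij', u[:-1], xs[:-1]),
                   np.einsum('ti,tj->ij', u[:-1], u[:-1])]])
    H = np.hstack([Pc.sum(0), np.einsum('ti,tj->ij', xs[1:], u[:-1])])
    AB = H @ np.linalg.inv(G)
    A, B = AB[:, :n], AB[:, n:]
    # [C D]: regression of y_t on (x_t, w_t), sums over t = 1..T.        (A.10d)
    G2 = np.block([[P.sum(0), np.einsum('ti,tj->ij', xs, w)],
                   [np.einsum('ti,tj->ij', w, xs),
                    np.einsum('ti,tj->ij', w, w)]])
    H2 = np.hstack([np.einsum('ti,tj->ij', y, xs), np.einsum('ti,tj->ij', y, w)])
    CD = H2 @ np.linalg.inv(G2)
    C, D = CD[:, :n], CD[:, n:]
    # State and output noise covariances.                 (A.10e), (A.10f)
    Q = (P[1:].sum(0) - A @ Pc.transpose(0, 2, 1).sum(0)
         - B @ np.einsum('ti,tj->ij', u[:-1], xs[1:])) / (T - 1)
    resid = y - xs @ C.T - w @ D.T
    R = np.einsum('ti,tj->ij', resid, y) / T
    return pi, Sigma, A, B, C, D, Q, R

Alternating kalman_smooth and m_step until the innovations log likelihood stops increasing implements the EM iteration described in the text.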
Acknowledgments

This work was supported by the Swartz Foundation, the Alfred P. Sloan Foundation, the McKnight Endowment Fund for Neuroscience, the National Eye Institute (R01 EY015679), and HHMI Biomedical Research Support Program grant 5300246 to the UCSF School of Medicine.

References

Anderson, B. D. O., & Moore, J. B. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Baddeley, R. J., Ingram, H. A., & Miall, R. C. (2003). System identification applied to a visuomotor task: Near-optimal human performance in a noisy changing task. J. Neurosci., 23(7), 3066–3075. Baraduc, P., & Wolpert, D. M. (2002). Adaptation to a visuomotor shift depends on the starting posture. J. Neurophysiol., 88(2), 973–981. Clamann, H. P. (1969). Statistical analysis of motor unit firing patterns in a human skeletal muscle. Biophys. J., 9(10), 1233–1251. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, Series B, 39, 1–38. Donchin, O., Francis, J. T., & Shadmehr, R. (2003). Quantifying generalization from trial-by-trial behavior of adaptive systems that learn with basis functions: Theory and experiments in human motor control. J. Neurosci., 23(27), 9032–9045. Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New York: Wiley. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall. Favilla, M., Hening, W., & Ghez, C. (1989). Trajectory control in targeted force impulses. VI. Independent specification of response amplitude and direction. Exp. Brain Res., 75(2), 280–294. Flanagan, J. R., & Rao, A. K. (1995). Trajectory adaptation to a nonlinear visuomotor transformation: Evidence of motion planning in visually perceived space. J. Neurophysiol., 74(5), 2174–2178. Ghahramani, Z., & Hinton, G. E. (1996). Parameter estimation for linear dynamical systems (Tech. Rep. No. CRG-TR-96-2). Toronto: University of Toronto. Good, P. I. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd ed.). New York: Springer-Verlag. Gordon, J., Ghilardi, M. F., Cooper, S. E., & Ghez, C. (1994). Accuracy of planar reaching movements. II. Systematic extent errors resulting from inertial anisotropy. Exp. Brain Res., 99(1), 112–130. Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394(6695), 780–784. Hay, J. C., & Pick, H. L. (1966). Visual and proprioceptive adaptation to optical displacement of the visual stimulus. J. Exp. Psychol., 71(1), 150–158. Held, R., & Gottlieb, N. (1958). Technique for studying adaptation to disarranged hand-eye coordination. Percept. Mot. Skills, 8, 83–86. Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279(5354), 1213–1216. Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3), 307–354. Kailath, T. (1980). Linear systems. Englewood Cliffs, NJ: Prentice Hall. Krakauer, J. W., Pine, Z. M., Ghilardi, M. F., & Ghez, C. (2000). Learning of visuomotor transformations for vectorial planning of reaching trajectories. J. Neurosci., 20(23), 8916–8924. Lackner, J. R., & Dizio, P. (1994). Rapid adaptation to coriolis force perturbations of arm trajectory. J. Neurophysiol., 72(1), 299–313.
Ljung, L. (1999). System identification: Theory for the user (2nd ed.). Upper Saddle River, NJ: Prentice Hall. Matthews, P. B. (1996). Relationship of firing intervals of human motor units to the trajectory of post-spike after-hyperpolarization and synaptic noise. J. Physiol., 492, 597–628. Miles, F., & Fuller, J. (1974). Adaptive plasticity in the vestibulo-ocular responses of the rhesus monkey. Brain Res., 80(3), 512–516. Norris, S., Greger, B., Martin, T., & Thach, W. (2001). Prism adaptation of reaching is dependent on the type of visual feedback of hand and target position. Brain Res., 905(1–2), 207–219. Pine, Z. M., Krakauer, J. W., Gordon, J., & Ghez, C. (1996). Learning of scaling factors and reference axes for reaching movements. Neuroreport, 7(14), 2357–2361. Redding, G., & Wallace, B. (1990). Effects on prism adaptation of duration and timing of visual feedback during pointing. J. Mot. Behav., 22(2), 209–224. Scheidt, R., Dingwell, J., & Mussa-Ivaldi, F. (2001). Learning to move amid uncertainty. J. Neurophysiol., 86(2), 971–985. Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4), 253–264. Soechting, J. F., & Flanders, M. (1989). Errors in pointing are due to approximations in sensorimotor transformations. J. Neurophysiol., 62(2), 595–608. Thoroughman, K. A., & Shadmehr, R. (2000). Learning of action through adaptive combination of motor primitives. Nature, 407, 742–747. Todorov, E., & Jordan, M. (2002). Optimal feedback control as a theory of motor coordination. Nat. Neurosci., 5(11), 1226–1235. von Helmholtz, H. (1867). Handbuch der physiologischen Optik. Leipzig: Leopold Voss. Weber, R. B., & Daroff, R. B. (1971). The metrics of horizontal saccadic eye movements in normal humans. Vision Res., 11(9), 921–928. Welch, R. B., Choe, C. S., & Heinrich, D. R. (1974). Evidence for a three-component model of prism adaptation. J. Exp. Psychol., 103(4), 700–705. Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). Are arm trajectories planned in kinematic or dynamic coordinates? An adaptation study. Exp. Brain Res., 103(3), 460–470.
Received January 6, 2005; accepted September 9, 2005.
LETTER
Communicated by Terrence Sejnowski
Changing Roles for Temporal Representation of Odorant During the Oscillatory Response of the Olfactory Bulb

Soyoun Kim
[email protected]

Benjamin H. Singer
[email protected]

Michal Zochowski
[email protected]

Department of Physics and Biophysics Research Division, University of Michigan, Ann Arbor, MI 48109, U.S.A.
It has been hypothesized that the brain uses combinatorial as well as temporal coding strategies to represent stimulus properties. The mechanisms and properties of the temporal coding remain undetermined, although it has been postulated that oscillations can mediate formation of this type of code. Here we use a generic model of the vertebrate olfactory bulb to explore the possible role of oscillatory behavior in temporal coding. We show that three mechanisms (synaptic inhibition, slow self-inhibition, and input properties) mediate formation of a temporal sequence of simultaneous activations of glomerular modules associated with specific odorants within the oscillatory response. The sequence formed depends on the relative properties of odorant features and thus may mediate discrimination of odorants activating overlapping sets of glomeruli. We suggest that period-doubling transitions may be driven through excitatory feedback from a portion of the olfactory network acting as a coincidence modulator. Furthermore, we hypothesize that the period-doubling transition transforms the temporal code from a roster of odorant components to a signal of odorant identity and facilitates discrimination of individual odorants within mixtures.

Neural Computation 18, 794–816 (2006)
C 2006 Massachusetts Institute of Technology

1 Introduction

Odor encoding is a spatially distributed process resulting from activation of different cell populations. However, it has been postulated that in addition to combinatorial codes, the brain may use temporal coding strategies based on synchronization of cell activities within or between different functional networks (Laurent et al., 2001; Laurent, Wehr, & Davidowitz, 1996; Wehr & Laurent, 1996). While the basic properties of combinatorial coding have been relatively well established, the properties of temporal coding are largely undetermined. It has been hypothesized
that synchronization, mediated through ubiquitously observed stimulus-evoked oscillations (Eckhorn, 1994; Gelperin & Tank, 1990; Laurent & Naraghi, 1994), can underlie binding between different stimulus features and thus promote formation of neural representations of processed stimuli (Eckhorn, 2000; Singer, 1999; von der Malsburg, 1995, 1999).

Since the discovery of odorant-evoked oscillations in the olfactory system, it has been shown that the presentation of odorant evokes different types of oscillatory patterning across species (Adrian, 1950; Hughes & Mazurowski, 1962; Lam, Cohen, Wachowiak, & Zochowski, 2000; Laurent & Naraghi, 1994; MacLeod, Backer, & Laurent, 1998; Wehr & Laurent, 1996). The mechanisms and the properties of the observed oscillations have been studied extensively, but their role during information processing has not yet been established. However, it has been postulated that, as in the case of other sensory modalities, these oscillations mediate binding of different odorant features that are detected by the next processing stages in the brain (the piriform cortex or mushroom body in vertebrates or insects, respectively; MacLeod et al., 1998). This hypothesis was partially confirmed by showing that oscillations in the insect mediate behavioral discrimination of structurally similar odorants, linking oscillatory patterning with cognitive processes (Hosler, Buxton, & Smith, 2000; Stopfer, Bhagavan, Smith, & Laurent, 1997; Teyke & Gelperin, 1999).

The olfactory system displays a high degree of structural homology among vertebrate phyla, both in the laminar structure, cell types, and connectivities found in the olfactory bulb (OB) itself and in the centripetal projections of the olfactory bulb to secondary structures (Allison, 1953; Crosby & Humphrey, 1939). In the vertebrate OB, oscillations are primarily mediated by the interaction of the excitatory M/T cells and the inhibitory periglomerular (PG) and granule cells (Aroniadou-Anderjaska, Zhou, Priest, Ennis, & Shipley, 2000; Rall & Shepherd, 1968). It has been reported that odor stimuli elicit oscillations in different frequency ranges in vertebrates (Bressler, 1984; Bressler & Freeman, 1980; Eeckman & Freeman, 1990; Gray & Skinner, 1988; Lam et al., 2000) as well as in mollusks (Inokuma, Inoue, Watanabe, & Kirino, 2002). A number of groups have also addressed the function of the insect antennal lobe (AL) (the homologue of the OB in vertebrates) through mathematical models and have explored the generation of oscillations from the point of view of network dynamics and cellular biophysics (Bazhenov, Stopfer, Rabinovich, Abarbanel, et al., 2001; Bazhenov, Stopfer, Rabinovich, Huerta, et al., 2001; Brody & Hopfield, 2003; Hendin, Horn, & Tsodyks, 1998; Li & Hertz, 2000; Linster & Cleland, 2001; Olufsen, Whittington, Camperi, & Kopell, 2003); however, the role and the mechanism of these oscillations are still not clear.

Our earlier experiments showed that the presentation of odorant evokes rostral (14 Hz) and caudal (14 and 7 Hz) oscillations in the turtle OB (Lam et al., 2000; Lam, Cohen, & Zochowski, 2003; see Figure 1).
Figure 1: Oscillatory response and formation of the temporal sequence due to odorant presentation. The two oscillations detected in the turtle OB are shown: the caudal oscillation (top) and the rostral oscillation (bottom). The two oscillations start nearly simultaneously and initially have the same frequency (14 Hz), with the caudal oscillation undergoing a period-doubling transition resulting in a 7 Hz oscillation. (Stimulus: 10% isoamyl acetate; scale bars: ΔF/F = 10⁻³ and 1000 ms.)

The oscillations were measured optically as a fractional change in the fluorescence of a voltage-sensitive dye. It is assumed that the peaks of those oscillations correspond to coincident firing of large cell populations, while the troughs represent a relatively quiescent state of population activity. The two oscillations appear simultaneously, with the caudal oscillation initially at 14 Hz and eventually undergoing a period-doubling transition to 7 Hz. Period-doubling transitions have been observed in the mammalian OB, as have odor-evoked changes in the relative power of oscillations in the γ (50–100 Hz) and β (15–40 Hz) bands of local field potential recordings (Martin, Gervais, Hugues, Messaoudi, & Ravel, 2004; Neville & Haberly, 2003; Ravel et al., 2003). These studies have suggested that the appearance of β activity is dependent on centrifugal feedback from extrabulbar areas and is a correlate of learned olfactory recognition.

Here we use a network model incorporating basic properties of the vertebrate OB to capture experimentally observed properties of odor-evoked oscillations and to explore basic mechanisms underlying formation of the temporal code at the level of activity of individual glomerular modules within this oscillatory response. We show in our model that (1) in the absence of feedback from higher regions, spatial codes are temporally separated in the OB circuitry depending on the relative strength of the odorant components, and (2) the introduction of feedback from higher-order processing regions leads to synchronization that forms a unified temporal representation of the presented odorant, producing at the same time a period-doubling
transition. Thus, we suggest that at the time of the bifurcation, the role of the oscillation may be redefined from discrimination of odorant features, grouped by input strength, to the discrimination of whole-odorant identity.

2 Methods

2.1 Description of the Model. The OB has a relatively simple layered cortical structure and is composed of excitatory mitral/tufted (M/T) and inhibitory periglomerular (PG) and granule cells. The axons of olfactory receptor neurons (ORNs) expressing limited receptor types (a single type in mammals) are sorted and converge onto specific glomeruli to form a chemotopic map of odorant in the OB (Bozza & Kauer, 1998; Malnic, Hirono, Sato, & Buck, 1999; Mombaerts et al., 1996; Ressler, Sullivan, & Buck, 1993). The glomeruli therefore represent specific types of receptors and are tuned to specific molecular features of the odorant. The glomeruli are relatively large spherical neuropils, where axons of sensory neurons synapse onto the dendrites of relatively few excitatory M/T cells and an extensive number of inhibitory PG cells (Bozza & Kauer, 1998; Mori, Nagao, & Yoshihara, 1999; Pinching & Powell, 1971; Shepherd, 1998). The populations of PG cells and M/T cells innervating the same glomeruli form a glomerular module (GM). M/T cells distribute their axons in secondary cortical structures and thus are the output cells of the OB. The most deeply positioned layer in the OB contains a very large pool of inhibitory granule cells and a substantial number of centrifugal fibers from various forebrain regions.

Our model has a simplified structure resembling the basic vertebrate OB anatomy. The excitatory M/T and inhibitory PG cells are grouped into five GMs, each having 4 M/T cells and 20 PG cells (see Figure 2A). In addition, 100 granule cells are located outside the GMs in the OB. The PG cells receive inputs from ORNs while the granule cells do not (Aungst et al., 2003; Mori et al., 1999). The ratio of PG/granule cells to M/T cells is different for different species (for example, 200:1 in the mouse OB; Mori et al., 1999). We tested ratios between 5:1 and 40:1 and could produce the same results by tuning connectivity parameters.

2.1.1 Model: Input. It has been established that an odorant activates several GMs, generating specific odor chemotopic maps, as every GM is associated with the input of a single class of receptor neurons (Mombaerts et al., 1996; Vosshall, Wong, & Axel, 2000). In this model, the odorant is simplified as a set of static input strengths received by a set of GMs, and the relative amplitude of these inputs (see Figure 2B) represents the initial signature of the presented odorants. Excitatory cells in the same GM receive the same strength of input, while inhibitory PG cells receive a 50% smaller input than excitatory cells in the same GM. The input builds up linearly to its maximum amplitude over a 200 ms interval after stimulus onset, mimicking the time course of calcium release from ORN terminals (Wachowiak, Cohen, & Zochowski, 2002), and remains constant thereafter. Since in this model we do not include dynamic features of input to the OB, our network model does not show dynamical patterns that might be related to the input dynamics. Neither the differences in receptor activation timescales nor mechanisms such as presynaptic inhibition are considered in this model. Thus, our result does not reproduce slowly modulated patterns of single neuronal activity that have been observed experimentally (Harrison & Scott, 1986; Laurent et al., 1996, 2001; MacLeod & Laurent, 1996).

Figure 2: A reduced model of olfactory processing in the turtle olfactory bulb. (A) Model of the OB: The model is divided into glomerular modules (GMs; dashed line). Every GM has an excitatory (M/T cells) population and an inhibitory (PG cells) population. Outside of the GMs, the inhibitory granule cells are located in the OB. Network interactions are denoted by arrows. Every GM has a different input strength corresponding to the combined activation received from specific olfactory receptor neurons (ORNs). A group of excitatory and inhibitory cells lies outside the OB network and receives converging input from M/T cells in pairs of GMs, acting as a coincidence modulator (see section 2). (B) Definition of the odorant. The odorant is defined as a combination of inputs to different GMs. (C) Time course of SSI (bottom) of the excitatory cell population. The SSI current is activated when an excitatory cell fires.

2.1.2 Model: Connectivity Within a GM. It has been observed that excitatory neurons in the same GM are strongly electrotonically coupled (Schoppa & Westbrook, 2002) with each other as well as with PG cells (Aroniadou-Anderjaska et al., 2000; Ennis et al., 2001; Wachowiak & Cohen, 1999). Here, the connectivities of the different cell types are greatly simplified from the known synaptic organization of the OB. Excitatory cells innervating the same GM are fully connected with each other and coupled with 40% of the PG inhibitory neurons within a GM. The PG cells form short-range connections with 50% of the other PG cells within the same GM.

2.1.3 Model: Connectivity Between GMs. The granule and PG cells form short-range connections with other inhibitory neurons and excitatory cells in different GMs (Gracia-Llanes, Crespo, Blasco-Ibanez, Marques-Mari, & Martinez-Guijarro, 2003), while anatomical studies have failed to provide any evidence of monosynaptic excitatory connections between M/T cells innervating different glomeruli (Pinching & Powell, 1971). In this model, there is no monosynaptic pathway between excitatory neurons belonging to different GMs, while the M/T neurons project to a fraction of both granule and PG cells. The fraction depends on the distance between GMs. The distance between adjacent GMs is set to unity and increases linearly. Distances between granule cells are calculated by assigning a subset of 20 granule cells to every position of every GM. The excitatory M/T cells in GM_i are connected to a fraction of inhibitory PG cells in GM_j and granule cells outside the GM. This fraction decreases as the distance between GMs (or between a GM and a subset of granule cells) increases:

\[
\mathrm{Conn}_{i,j} = \frac{\mathrm{Conn}_0}{|i - j| + 1},
\]

where i and j are the GM numbers of the individual neurons and Conn_0 is the connectivity at zero distance (|i − j| = 0); Conn_0 = 40% for PG (or granule)-M/T connections. The connectivity between the inhibitory cells scales similarly, with Conn_0 = 50% for PG (or granule)-PG (or granule) connections.
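As an illustration, the distance-dependent fan-out can be instantiated as a random connectivity matrix. This is a minimal NumPy sketch under our own reading of Conn_{i,j} as a connection probability between individual cells; the function name and the GM bookkeeping are hypothetical.

import numpy as np

def connection_mask(gm_pre, gm_post, conn0, rng):
    """Binary connectivity: each pre-post pair is connected with
    probability conn0 / (|i - j| + 1), where i and j are the GM indices
    of the presynaptic and postsynaptic cells."""
    prob = conn0 / (np.abs(gm_pre[:, None] - gm_post[None, :]) + 1.0)
    return rng.random(prob.shape) < prob

rng = np.random.default_rng(0)
gm_mt = np.repeat(np.arange(5), 4)    # GM index of each of the 20 M/T cells
gm_pg = np.repeat(np.arange(5), 20)   # GM index of each of the 100 PG cells
mt_to_pg = connection_mask(gm_mt, gm_pg, conn0=0.40, rng=rng)  # M/T -> PG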
2.1.4 Model: Description of Individual Neurons in the OB. Individual neurons are defined by Hodgkin-Huxley-type equations, and network interactions are adopted with modifications from Kopell, Ermentrout, Whittington, and Traub (2000). The excitatory M/T neurons are known to be modulated through self-induced excitation and inhibition (Schoppa & Urban, 2003). Thus, the group of excitatory neurons has the added feature of an inhibitory slow after-hyperpolarization current, which mimics the slow self-inhibition (SSI) observed in the M/T cells (Jahr & Nicoll, 1980; Margrie, Sakmann, & Urban, 2001; Schoppa, Kinzie, Sahara, Segerson, & Westbrook, 1998; Schoppa & Urban, 2003). Although self-inhibition reflects the activation of interneurons by glutamate released from M/T cell dendrites, which in turn leads to GABA release back onto the M/T cell (Jahr & Nicoll, 1982; Mori & Takagi, 1978; Nowycky, Mori, & Shepherd, 1981; Schoppa et al., 1998), here for model simplicity we modeled the total effect of the interplay of the SSI and self-excitation as a single current that produces relatively long-lasting hyperpolarization of the excitatory cells (see Figure 2C). Results of the model simulation were independent of the particular shape of this current.

The excitatory neurons are defined by

\[
C \dot{V} = -g_l (V - V_l) - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) - g_{SI}\, (w - 0.025)(V - V_\infty) + I^{e}_{syn} + I^{e}_{appl}.
\]

The inhibitory neurons are defined by

\[
C \dot{V} = -g_l (V - V_l) - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) + I^{i}_{syn} + I^{i}_{appl},
\]

where I^e_syn = I^{ie}_syn + I^{ee}_syn + I^{ce}_syn and I^i_syn = I^{ei}_syn + I^{ii}_syn + I^{ci}_syn. The indices i, e, and c denote, respectively, inhibitory, excitatory, and coincidence-modulator cells. The first three currents represent leakage, sodium, and potassium currents, respectively, where g_l = 0.1 mS/cm², g_Na = 100 mS/cm², g_K = 80 mS/cm², V_l = −67 mV, V_Na = 50 mV, V_K = −100 mV, and

\[
\begin{aligned}
\dot{m} &= \frac{0.32\,(54 + V)}{1 - \exp(-0.25\,(V + 54))}\,(1 - m) - \frac{0.28\,(V + 27)}{\exp(0.2\,(V + 27)) - 1}\,m, \\
\dot{h} &= 0.128 \exp\!\big(-(50 + V)/18\big)\,(1 - h) - \frac{4}{1 + \exp(-0.2\,(V + 27))}\,h, \\
\dot{n} &= \frac{0.032\,(V + 52)}{1 - \exp(-0.2\,(V + 52))}\,(1 - n) - 0.5 \exp\!\big(-(57 + V)/40\big)\,n, \\
\dot{w} &= \frac{w_\infty(V) - w}{\tau_w(V)}, \\
w_\infty(V) &= \frac{1}{1 + \exp(-0.1\,(V + 35))}, \\
\tau_w(V) &= \frac{400}{3.3 \exp(0.05\,(V + 35)) + \exp(-0.05\,(V + 35))}, \\
V_\infty(V) &= \frac{150}{0.1 \exp(-0.03\,(V + 70)) - 1}.
\end{aligned}
\]
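To make the single-cell dynamics concrete, here is a minimal forward-Euler sketch of one excitatory cell with the SSI current. The paper does not state the integration scheme, so the explicit Euler step and the time step are our assumptions, and the removable singularities of the rate functions (e.g., at V = −54 and V = −52 mV) are not handled.

import numpy as np

def dm(V, m):
    return (0.32 * (54 + V) / (1 - np.exp(-0.25 * (V + 54))) * (1 - m)
            - 0.28 * (V + 27) / (np.exp(0.2 * (V + 27)) - 1) * m)

def dh(V, h):
    return (0.128 * np.exp(-(50 + V) / 18) * (1 - h)
            - 4 / (1 + np.exp(-0.2 * (V + 27))) * h)

def dn(V, n):
    return (0.032 * (V + 52) / (1 - np.exp(-0.2 * (V + 52))) * (1 - n)
            - 0.5 * np.exp(-(57 + V) / 40) * n)

def dw(V, w):
    w_inf = 1 / (1 + np.exp(-0.1 * (V + 35)))
    tau_w = 400 / (3.3 * np.exp(0.05 * (V + 35)) + np.exp(-0.05 * (V + 35)))
    return (w_inf - w) / tau_w

def V_inf(V):
    return 150 / (0.1 * np.exp(-0.03 * (V + 70)) - 1)

# Parameters from the text (mS/cm^2, mV, uF/cm^2).
gl, gNa, gK, gSI = 0.1, 100.0, 80.0, 0.7
Vl, VNa, VK, Cm = -67.0, 50.0, -100.0, 1.0

def step(V, m, h, n, w, I_appl, dt=0.01):
    """One forward-Euler step (dt in ms) for a single excitatory cell
    with the SSI current and no synaptic input."""
    I_ion = (-gl * (V - Vl) - gNa * m**3 * h * (V - VNa)
             - gK * n**4 * (V - VK)
             - gSI * (w - 0.025) * (V - V_inf(V)))
    return (V + dt * (I_ion + I_appl) / Cm,
            m + dt * dm(V, m), h + dt * dh(V, h),
            n + dt * dn(V, n), w + dt * dw(V, w))

# One would iterate step(...) with a constant I_appl in the 1.4-5 uA/cm^2
# range quoted below to simulate a cell within one GM.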
The term g_SI (w − 0.025)(V − V_∞) in the definition of the excitatory neuron defines the SSI current (see Figure 2C), with g_SI = 0.7 mS/cm². The membrane capacitance is C = 1 µF/cm² for all cell populations. The currents I^e_appl and I^i_appl represent external currents from receptor neurons to excitatory and inhibitory cells, respectively. The strength of the external input (defining activation of particular GMs) to the excitatory cells is set between 1.4 µA/cm² and 5 µA/cm² (depending on the GM), whereas that for inhibitory cells is limited to 50% of the excitatory input of the same GMs. I^{jk}_syn denotes a synaptic current from cell type j to cell type k (j, k = e, i, c). For instance, I^{ci}_syn indicates a current from coincidence-modulator (c) cells to inhibitory (i) cells. The synaptic currents for the excitatory neurons are given by

\[
I^{ie}_{syn} + I^{ce}_{syn} = -g^{ie} \sum_j s^i_j(t)\,(V + 80) \;-\; g^{ce} \sum_j s^c_j(t)\,V_k.
\]

Additionally, the M/T cells innervating the same GM are coupled:

\[
I^{ee}_{syn} = -g^{ee} \sum_{j \in G} s^e_j(t)\,V.
\]

For inhibitory neurons,

\[
I^{ei}_{syn} + I^{ii}_{syn} + I^{ci}_{syn} = -g^{ei} \sum_j s^e_j(t - \tau)\,V \;-\; g^{ii} \sum_j s^i_j(t)\,(V + 80) \;-\; g^{ci} \sum_j s^c_j(t)\,V_k,
\]

where j indexes the connected neurons from which the synaptic input originates, and V_k = V + 80 for inhibitory feedback (when c = I) or V_k = V for excitatory feedback (when c = E). The synaptic gating variables for excitatory and inhibitory cells are defined by

\[
\dot{s}_e = 5\,\big(1 + \tanh(V/4)\big)(1 - s_e) - s_e/2,
\qquad
\dot{s}_i = 2\,\big(1 + \tanh(V/4)\big)(1 - s_i) - s_i/15,
\]
and s_c is defined by the same equation as either s_e or s_i, depending on whether feedback from the coincidence modulator is modeled as excitatory (c = E) or inhibitory (c = I). The average transmission delay between excitatory and inhibitory neurons is τ = 2 ms. The synaptic efficacies between interconnected neurons are defined to be g^ie = 0.03 mS/cm², g^ee = 0.04 mS/cm², g^ii = 0.005 mS/cm², g^ei = 0.03 mS/cm², g^ce = 0.4 mS/cm², and g^ci = 0.3 mS/cm². The behavior of the model was tested for a range of connectivity parameters and yielded essentially the same results. The initial values of the parameters were taken from Kopell et al. (2000). Additional tuning was performed only to obtain balanced firing rates of inhibitory and excitatory cells.

2.1.5 Model: Cortical Processing as Coincidence Modulator. The experimentally observed period-doubling transition in the caudal oscillation cannot be mediated solely by our simulated structure of the OB. However, it has been shown that a period-doubling transition could take place on a population level and might be mediated by excitatory-to-excitatory interaction (Ermentrout & Kopell, 1998; Kopell et al., 2000; Whittington, Stanford, Colling, Jefferys, & Traub, 1997). This transition results from mutual synchronization of excitatory cells firing at different oscillatory cycles. It has been well established that the M/T cells project to overlapping fields of termination (Wilson, 2001; Zou, Horowitz, Montmayeur, Snapper, & Buck, 2001), and the OB receives feedback from higher centers (Luskin & Price, 1983; Macrides, 1983; Martin et al., 2004; Neville & Haberly, 2003; Pinching & Powell, 1972; Scott, McBride, & Schneider, 1980). Thus, feedback from central areas may be extensively involved in OB information processing. In this model, we investigate the hypothesis that feedback may mediate the experimentally observed period doubling in the caudal oscillation and that the transition is formed through the mutual synchronization of activated M/T cells. This is mediated via an additional layer of 10 excitatory and 10 inhibitory cells outside the OB structure (see Figure 2A). Excitatory (E) and inhibitory (I) cells in the coincidence modulator follow the equations of the excitatory and inhibitory cells in the OB, except that the applied currents are I^I_appl = 0 and I^E_appl = −g^ec Σ_j s^e_j(t) V (where e and E refer to excitatory cells in the OB and the coincidence modulator, respectively), the converging input from M/T cells in two sets of GMs, with g^ec = 0.03 mS/cm². The excitatory neurons in the coincidence modulator are described by

\[
C \dot{V} = -g_l (V - V_l) - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) - g_{SI}\, (w - 0.025)(V - V_\infty) + I^{IE}_{syn} + I^{EE}_{syn} + I^{E}_{appl}.
\]

The inhibitory neurons in the coincidence modulator are described by

\[
C \dot{V} = -g_l (V - V_l) - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) + I^{EI}_{syn} + I^{II}_{syn}.
\]
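Pulling the synaptic pieces of sections 2.1.4 and 2.1.5 together, here is a small sketch of the gating dynamics and one resulting current term, under the same forward-Euler assumption as above; the function names and the explicit V_k argument are our own conventions.

import numpy as np

def ds_e(V_pre, s):
    # Excitatory gating: s_e' = 5(1 + tanh(V/4))(1 - s_e) - s_e / 2
    return 5 * (1 + np.tanh(V_pre / 4)) * (1 - s) - s / 2

def ds_i(V_pre, s):
    # Inhibitory gating: s_i' = 2(1 + tanh(V/4))(1 - s_i) - s_i / 15
    return 2 * (1 + np.tanh(V_pre / 4)) * (1 - s) - s / 15

def I_syn_exc(V_post, s_i_pre, s_c_pre, g_ie=0.03, g_ce=0.4, V_k=0.0):
    """Synaptic current onto an excitatory OB cell: inhibitory inputs
    drive (V + 80), while coincidence-modulator feedback drives V_k
    (V for excitatory feedback, V + 80 for inhibitory feedback)."""
    return (-g_ie * np.sum(s_i_pre) * (V_post + 80)
            - g_ce * np.sum(s_c_pre) * V_k)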
The parameters defining the respective currents are the same as in the OB, and the connectivity within the coincidence modulator is 70%.

2.1.6 Model: Connectivity Between the Coincidence Modulator and the OB. We investigate two types of connectivity between the coincidence modulator and the OB: (1) connections from excitatory (E) cells in the coincidence modulator project directly to M/T and PG cells in the OB, and separately (2) connections from inhibitory (I) cells in the coincidence modulator project to M/T and PG cells in the OB. In the first case, M/T cells innervating the same GMs are connected to two assigned E cells in the coincidence modulator, sending output to the E cells in the coincidence modulator and, in turn, getting feedback from them. Fifty percent of PG/granule cells get feedback from two randomly selected E cells in the coincidence modulator. In the second case, M/T and PG/granule cells get feedback in the same way as in the first case, except that the feedback comes from I cells in the coincidence modulator rather than E cells. With tuning of synaptic weights, these two conditions generated essentially the same result, corresponding to either direct excitation of the OB in the first case or disinhibition of the OB in the second case.

2.2 Analysis: Calculation of Angular Separation of Odor Representation. We calculate the angle between odor representations as a measure of odor discrimination. Each odorant is represented as an n-dimensional vector (n is the total number of GMs), and each dimension indicates the fraction of active (firing) excitatory neurons in the corresponding GM. Angular separation is defined as the arc cosine of the normalized inner product of the vectors representing the odorants in five dimensions:

\[
\theta = \arccos\!\left( \frac{\vec{P}_1 \cdot \vec{P}_2}{\|\vec{P}_1\|\,\|\vec{P}_2\|} \right),
\]

where \(\vec{P}_i\) is the vector representation of the ith odorant.

2.3 Analysis: Calculation of Temporal Vector Representation and Overlap. To compare the correlation of individual GMs in an odorant, we calculated the fraction of excitatory cells firing within each GM for 10 consecutive oscillatory cycles, so that each GM is represented by a 10-dimensional vector. We then calculated the dot product of these vectors to quantify the correlation between GMs; this value measures the similarity of the firing patterns among GMs.
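Both analysis measures reduce to a few lines of linear algebra. A minimal sketch follows; the function names are ours, and the clip guards against floating-point values just outside [−1, 1].

import numpy as np

def angular_separation(p1, p2):
    """Angle (degrees) between two odorant representations; each p holds
    the fraction of active excitatory cells in each of the n GMs."""
    c = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def gm_overlap(f1, f2):
    """Dot product of two 10-cycle firing-fraction vectors (one per GM),
    measuring the similarity of their temporal firing patterns."""
    return float(np.dot(f1, f2))

# Example: two odorants activating the same five GMs at different strengths.
print(angular_separation(np.array([0.8, 0.2, 0.3, 0.7, 0.1]),
                         np.array([0.2, 0.8, 0.7, 0.3, 0.1])))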
3 Results

3.1 Formation of Temporal Clustering Based on Input Strength. The odorant is defined as a set of activation amplitudes received by the GMs (see Figure 3D inset). When the network is presented with an odorant and the coincidence modulator is not present, two oscillatory patterns appear (see Figure 3A, two bottom traces), generated by the excitatory and inhibitory cell populations. The observed oscillatory patterning was robust to changes in parameter values. Excitatory cells in GMs that share similar input amplitudes (see Figure 3D inset) fire at the same cycles of the oscillation, whereas cells belonging to differentially activated GMs fire on alternating cycles, creating a segmented temporal sequence of cell populations activated on different cycles within the oscillatory response. The firings of all four excitatory cells innervating the same GM are synchronized due to their coupling.

Three mechanisms contribute to the formation of this temporal pattern. Fast synaptic inhibition within and between the GMs is known to be responsible for the formation of oscillatory patterning in the OB by inhibiting the simultaneous activation of excitatory cell populations belonging to different glomeruli (MacLeod & Laurent, 1996; Stopfer et al., 1997). Elimination of synaptic inhibition in the network (see Figure 3C) abolishes population oscillatory patterning and the associated temporal code, as the cells fire without temporal locking. The individual GM's activity remains similar but desynchronized from the other GMs.

The second mechanism is based on the SSI current (see Figure 2C), which prohibits firing of excitatory cells on consecutive cycles of the population oscillation. This mechanism is responsible for the actual formation of the temporal sequence within the oscillatory response through selective inactivation of cell populations. This is because cells that fired at a given cycle and received additional slow inhibition do not recover in time to fire at the next oscillatory cycle. The elimination of SSI does not disrupt oscillatory patterning; however, it abolishes the formation of a temporal sequence, as most of the excitatory cells fire at every cycle of the oscillation (see Figure 3B).

The third mechanism is based on the patterning of the input received by GMs (see Figure 3D inset). The differences and similarities in input strength to different GMs underlie the temporal binding and segmentation of different cell populations. If the GMs receive similar input, the excitatory neurons composing them will robustly fire in the same cycles of the oscillations. Excitatory neurons in GMs receiving different levels of input fire at different cycles. This is because the cells receiving higher input activate first and inhibit firing of the cells receiving weaker input through lateral inhibition.
Figure 3: Response to the odorant presentation. Traces are averaged membrane potentials of the excitatory populations innervating the same GM (matched number), and averaged total membrane potential for both excitatory (black) and inhibitory (gray) cell populations. Each GM fires periodically due to the constant ORN input, with periodicity dependent on its input amplitude, and therefore different GMs desynchronize. The structure of the odorant is included in the inset of D. (A) Synaptic and SSI combined. Excitatory cell populations receiving inputs of similar strength fire on the same cycle. After a transient, excitatory populations fire only on alternate cycles. (B) Synaptic inhibition only. All excitatory cell populations fire at every oscillatory cycle, abolishing the temporal features of the firing pattern. (C) SSI only. The oscillatory response is abolished. Excitatory cell populations receiving similar input do not synchronize. Bottom: time course of the odorant delivery. (D) Segmentation of the temporal pattern depending on the relative input strength. The temporal pattern of the excitatory population in GM 5 is compared to the patterns of GMs 2 and 3 (I^e_appl = 2.4, 2.5 µA/cm²; solid line) or GMs 1 and 4 (I^e_appl = 3.0, 2.95 µA/cm²; dashed line). The structure of the odorant is included in the inset (I^e_appl = 3.0, 2.4, 2.5, 2.95, 2.3 µA/cm² for GMs 1–5).
3.2 Temporal Segmentation of a GM as a Function of Input. We tested how the temporal patterning of a single GM changes as a function of input amplitude in the context of a glomerular network. We assigned fixed input strengths to four GMs (GMs 1 and 4: I^e_appl = 3.0, 2.95 µA/cm²; GMs 2 and 3: I^e_appl = 2.4, 2.5 µA/cm²; see Figure 3D inset) and systematically varied the input of GM 5. We then calculated the overlap of the temporal vectors (see the methods in section 2.3) formed based on the activity of excitatory cells within a given GM (see Figure 3D). When the input to GM 5 is small (below 2.0 µA/cm²), the M/T cells of that GM do not activate. Above the threshold input, the excitatory neurons in GM 5 start to fire on any cycle of the oscillation. With further increase of the input, the tested GM aligns itself with the activity of the GMs with lower input. When the input increases to 3.0 µA/cm², there is an abrupt change as the tested GM aligns itself with the activity of the GMs with higher input. With further increase of the input (above 3.8 µA/cm²), the GM 5 module becomes active on both oscillatory cycles, as it recovers relatively quickly from the self-inhibition and the lateral inhibition received from other GMs. Thus, the segmentation of glomerular activity is driven by the relative strengths of inputs to the glomeruli and shows robust temporal classification over a relatively wide range of parameter values. It is also important to note that the absolute values of the currents are not as important as the relative strength of input among the GMs.
Figure 4: Temporal segmentation and discrimination of two similar odorants. (A) Definitions of the similar odorants. The two odorants activate the same glomerular units (100% spatial overlap) but with different relative intensities. (B) Differences in temporal activation. Averaged membrane potentials (left: odorant 1; right: odorant 2) of the excitatory populations (black) innervating different GMs, and of the entire inhibitory population (gray). (C) Odor discrimination by angular separation of odorant representations over time. The angle between odorant representations is computed as the arc cosine of the normalized inner product of the vectors representing each odorant in five dimensions. When SSI is present, temporal coding is maintained, with the angular separation bounded away from 0. With synaptic inhibition only, the separation of representations goes to 0.

3.3 Formation of Different Temporal Sequences to Presentation of Similar Odorants. Presentation of similar odorants (i.e., odorants activating the same GMs but with different input amplitudes; see Figure 4A) results in the formation of different temporal sequences of activation in different glomeruli (see Figure 4B). Thus, different neuronal assemblies of active excitatory cells are formed temporally. The elimination of the SSI preserves the oscillatory response but abolishes the formation of the specific temporal sequential assemblies, making the presented odorants indistinguishable. Such a temporal code may not be critical for dissimilar odorants, since their differences in spatial representation may be large enough to separate the spatiotemporal codes of two dissimilar odorants, but for similar odorants,
temporal segmentation may provide a mechanism of discrimination. To better illustrate the role of the SSI and temporal coding in distinguishing similar stimuli, we calculate the angular separation of the representations of two similar odorants, individually presented to the network (see Figure 4C and the methods in section 2.2). When the SSI is present, the angular separation evolves from zero and stays nonzero, with a mean value of θ = 74.17°, indicating distinct spatiotemporal representations of the two odorants. The angular separation remains almost zero when the SSI is abolished, indicating that the spatiotemporal representations of both odorants are nearly the same.

3.4 Interaction of the Olfactory Bulb and a Coincidence Modulator. Introduction of feedback from cortical regions could provide a polysynaptic pathway between different GMs and modify the odor representation in the OB circuitry. To establish whether such a polysynaptic connection could mediate the period-doubling transition, we introduce an additional layer of excitatory and inhibitory neurons outside the OB structure, the coincidence modulator, which receives input from the M/T cells and provides feedback to the OB circuit (see section 2 and Figure 2A). We assume convergence of the OB output on the coincidence modulator excitatory neurons and test two types of feedback (see section 2): (1) excitatory feedback (see Figure 5A) and (2) inhibitory feedback (see Figure 5B) to the M/T and PG/granule cells.

Providing an odor input to the OB network in the absence of feedback from the coincidence modulator results in oscillations of the same frequency in both the inhibitory and excitatory cell populations (see Figure 5). At time t = 500 ms, we activate the excitatory (or inhibitory) feedback connection between the OB and the coincidence modulator (the black arrows in Figures 5A and 5B, respectively). The excitatory neurons synchronize, rapidly generating a period-doubling transition, resulting in the slow oscillation. At the same time, the frequency of the oscillation in the inhibitory population remains unchanged. Similar results were obtained with both types of feedback (see Figure 5). This is due to the fact that direct excitatory feedback to M/T cells plays the same role in the OB network as disinhibition of the OB via inhibitory feedback to PG/granule cells. In these simulations, we found that synchronization within the coincidence modulator itself was required for the induction of period doubling in the OB through feedback.

Figure 5: Feedback from a coincidence modulator can mediate the period-doubling transition. (A) Excitatory feedback to M/T and PG/granule cells. (B) Inhibitory feedback to M/T and PG/granule cells. The feedback is activated at 500 ms (black arrows with dotted line) in both cases. The two cases produce essentially the same result (period doubling). The transition to a slow oscillation happens abruptly, whereas there is no significant change in the oscillation generated by the inhibitory cells.

4 Summary and Discussion

Here we present a reduced model of the vertebrate OB that reproduces basic dynamical patterns observed across phyla. The focus of this model is to propose a role for oscillatory patterning in odorant discrimination within a fixed set of GMs through the formation of combinatorial assemblies of GM
activity. We have linked the oscillations to possible underlying mechanisms of temporal code formation and to possible functional roles in information processing. The results are presented here as an analysis of steady-state behavior in response to static input, but the illustrated principle of information processing may form a component of dynamic encoding schemes.

We show that the interaction of three effects underlies the initial formation of temporal patterning within the oscillatory response in the absence of feedback from higher cortical regions: fast synaptic interactions, responsible for the actual formation of the oscillatory patterning; the SSI, responsible for alternating activation of different neuronal groups at different cycles; and the differential input strength to different GMs. Processing individual components of an odorant through asynchronous activity of individual GMs allows temporal clustering within a given set of GMs. Thus, coding of odorant features may depend not only on which GMs are activated but on when, and in combination with which other GMs, activation occurs. Our results indicate that this initial segmentation is mediated by the relative input strength at different GMs. We further show that although input strength to different GMs may vary continuously, segmentation of GMs into temporal assemblies occurs in a discrete manner. This temporal segmentation underlies the formation of discriminable representations of odorants that activate the same GMs but with different strengths. This differentiation occurs even in the absence of dynamic modulation of OB input by such mechanisms as differential time courses of receptor cell activation and presynaptic inhibition, which produce slow modulation in neural activity (Laurent et al., 1996) that could additionally aid odorant discrimination. Abolition of temporal segmentation in the model leads to the formation of identical spatiotemporal representations of different odorants, despite their dissimilar input to the OB.

SSI alone has already been shown to play an important role during discrimination of similar odorants (Bazhenov, Stopfer, Rabinovich, Abarbanel et al., 2001; Bazhenov, Stopfer, Rabinovich, Huerta et al., 2001). In that model, SSI is mostly responsible for slow modulation of the observed activity patterns. Here, however, we suggest how the interaction of three essential components is responsible for the initial formation of a temporal representation of odorant features through pattern segmentation.

This dependence of discrimination on temporal patterning has been observed experimentally in insects (Hosler et al., 2000; Stopfer et al., 1997; Teyke & Gelperin, 1999). This further corroborates the hypothesis that the discrimination of similar odorants is mediated by differences in the temporal segmentation of individual features of the presented odorant (Stopfer, Jayaraman, & Laurent, 2003). Temporal segmentation of odorant features may serve as a substrate for the activation of different cell populations
at the next processing stage, which has been shown to be activated on a cycle-by-cycle basis (Perez-Orive et al., 2002). An experimental study in insects (Perez-Orive, Bazhenov, & Laurent, 2004) and theoretical studies (Abeles, 1982; Konig, Engel, & Singer, 1996) have proposed that neurons in higher processing centers act as coincidence detectors. We show that feedback from higher cortical regions, modeled here as a simple coincidence modulator network, can cause the experimentally observed period-doubling transition (Lam et al., 2000; Lam et al., 2003) and at the same time redefine the information content of the oscillation. In the absence of feedback, initial feature resolution is based on relative input strength among a set of GMs. With the presence of feedback, this is replaced by binding and synchronization of GM activity corresponding to all components of the odorant. We show that this mechanism is relatively robust and can be obtained using two different feedback mechanisms: direct excitation of M/T cells or disinhibition of appropriate GMs. In both cases, we assume a simple connectivity pattern: M/T cells converge pairwise onto cells in the coincidence modulator and receive pairwise feedback, while inhibitory cells in the OB are connected to coincidence modulator cells at random.

We hypothesize that the synchrony of individual GMs in the presence of feedback may represent a transition from a component-wise peripheral representation based purely on odorant features to a representation of odor identity based on the interaction between the initial representation in the OB and central processing. Odor segmentation and discrimination in the presence of feedback from the cortex has been discussed by Li and Hertz (2000). They show that when central feedback is present, odorants can be distinguished by oscillatory patterning in the OB. Here, however, we emphasize the mechanisms of initial segmentation by input properties that form the representation passed to higher processing centers. Initial temporal feature binding may be a necessary substrate for odor identity binding, which is induced by feedback from higher stages. This view is consistent with the hypothesis of two steps of odor recognition processing, before and after involvement of odor memory in higher regions (Wilson & Stevenson, 2003).

Although we propose here that the formation of temporally segmented assemblies of GMs may be an effective strategy for odor encoding, it is not the only strategy. Future work will illuminate the relationship of temporal segmentation and odor-evoked oscillations to evolving input patterns.
References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18(1), 83–92.
Adrian, E. D. (1950). The electrical activity of the mammalian olfactory bulb. Electroencephalogr. Clin. Neurophysiol., 2(4), 377–388. Allison, A. C. (1953). The morphology of the olfactory system in the vertebrates. Biological Reviews of the Cambridge Philosophical Society, 28(2), 195–244. Aroniadou-Anderjaska, V., Zhou, F. M., Priest, C. A., Ennis, M., & Shipley, M. T. (2000). Tonic and synaptically evoked presynaptic inhibition of sensory input to the rat olfactory bulb via GABA(B) heteroreceptors. J. Neurophysiol., 84(3), 1194–1203. Aungst, J. L., Heyward, P. M., Puche, A. C., Karnup, S. V., Hayar, A., Szabo, G., & Shipley, M. T. (2003). Centre-surround inhibition among olfactory bulb glomeruli. Nature, 426(6967), 623–629. Bazhenov, M., Stopfer, M., Rabinovich, M., Abarbanel, H. D., Sejnowski, T. J., & Laurent, G. (2001). Model of cellular and network mechanisms for odor-evoked temporal patterning in the locust antennal lobe. Neuron, 30(2), 569–581. Bazhenov, M., Stopfer, M., Rabinovich, M., Huerta, R., Abarbanel, H. D., Sejnowski, T. J., & Laurent, G. (2001). Model of transient oscillatory synchronization in the locust antennal lobe. Neuron, 30(2), 553–567. Bozza, T. C., & Kauer, J. S. (1998). Odorant response properties of convergent olfactory receptor neurons. J. Neurosci., 18(12), 4560–4569. Bressler, S. L. (1984). Spatial organization of EEGs from olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol., 57(3), 270–276. Bressler, S. L., & Freeman, W. J. (1980). Frequency analysis of olfactory system EEG in cat, rabbit, and rat. Electroencephalogr. Clin. Neurophysiol., 50(1–2), 19–24. Brody, C. D., & Hopfield, J. J. (2003). Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron, 37(5), 843–852. Crosby, E., & Humphrey, T. (1939). Studies of the vertebrate telencephalon I: The nuclear configuration of the olfactory and accessory olfactory formations and of the nucleus olfactorius anterior of certain reptiles, birds and mammals. J. Comp. Neurol., 71, 121–213. Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res., 102, 405–426. Eckhorn, R. (2000). Cortical synchronization suggests neural principles of visual feature grouping. Acta Neurobiol. Exp. (Wars), 60(2), 261–269. Eeckman, F. H., & Freeman, W. J. (1990). Correlations between unit firing and EEG in the rat olfactory system. Brain Res., 528(2), 238–244. Ennis, M., Zhou, F. M., Ciombor, K. J., Aroniadou-Anderjaska, V., Hayar, A., Borrelli, E., Zimmer, L. A., Margolis, F., & Shipley, M. T. (2001). Dopamine D2 receptor-mediated presynaptic inhibition of olfactory nerve terminals. J. Neurophysiol., 86(6), 2986–2997. Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95(3), 1259–1264. Gelperin, A., & Tank, D. W. (1990). Odour-modulated collective network oscillations of olfactory interneurons in a terrestrial mollusc. Nature, 345(6274), 437–440. Gracia-Llanes, F. J., Crespo, C., Blasco-Ibanez, J. M., Marques-Mari, A. I., & Martinez-Guijarro, F. J. (2003). VIP-containing deep short-axon cells of the olfactory bulb
innervate interneurons different from granule cells. Eur. J. Neurosci., 18(7), 1751–1763. Gray, C. M., & Skinner, J. E. (1988). Field potential response changes in the rabbit olfactory bulb accompany behavioral habituation during the repeated presentation of unreinforced odors. Exp. Brain Res., 73(1), 189–197. Harrison, T. A., & Scott, J. W. (1986). Olfactory bulb responses to odor stimulation: Analysis of response pattern and intensity relationships. J. Neurophysiol., 56(6), 1571–1589. Hendin, O., Horn, D., & Tsodyks, M. V. (1998). Associative memory and segmentation in an oscillatory neural model of the olfactory bulb. J. Comput. Neurosci., 5(2), 157–169. Hosler, J. S., Buxton, K. L., & Smith, B. H. (2000). Impairment of olfactory discrimination by blockade of GABA and nitric oxide activity in the honey bee antennal lobes. Behav. Neurosci., 114(3), 514–525. Hughes, J. R., & Mazurowski, J. A. (1962). Studies on the supracallosal mesial cortex of unanesthetized, conscious mammals. II. Monkey. B. Responses from the olfactory bulb. Electroencephalogr. Clin. Neurophysiol., 14, 635–645. Inokuma, Y., Inoue, T., Watanabe, S., & Kirino, Y. (2002). Two types of network oscillations and their odor responses in the primary olfactory center of a terrestrial mollusk. J. Neurophysiol., 87(6), 3160–3164. Jahr, C. E., & Nicoll, R. A. (1980). Dendrodendritic inhibition: Demonstration with intracellular recording. Science, 207(4438), 1473–1475. Jahr, C. E., & Nicoll, R. A. (1982). An intracellular analysis of dendrodendritic inhibition in the turtle in vitro olfactory bulb. J. Physiol., 326, 213–234. Konig, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19(4), 130–137. Kopell, N., Ermentrout, G. B., Whittington, M. A., & Traub, R. D. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci. USA, 97(4), 1867–1872. Lam, Y. W., Cohen, L. B., Wachowiak, M., & Zochowski, M. R. (2000). Odors elicit three different oscillations in the turtle olfactory bulb. J. Neurosci., 20(2), 749– 762. Lam, Y. W., Cohen, L. B., & Zochowski, M. R. (2003). Odorant specificity of three oscillations and the DC signal in the turtle olfactory bulb. Eur. J. Neurosci., 17(3), 436–446. Laurent, G., & Naraghi, M. (1994). Odorant-induced oscillations in the mushroom bodies of the locust. J. Neurosci., 14(5 Pt. 2), 2993–3004. Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. (2001). Odor encoding as an active, dynamical process: Experiments, computation, and theory. Annu. Rev. Neurosci., 24, 263–297. Laurent, G., Wehr, M., & Davidowitz, H. (1996). Temporal representations of odors in an olfactory network. J. Neurosci., 16(12), 3837–3847. Li, Z., & Hertz, J. (2000). Odour recognition and segmentation by a model olfactory bulb and cortex. Network, 11(1), 83–102. Linster, C., & Cleland, T. A. (2001). How spike synchronization among olfactory neurons can contribute to sensory discrimination. J. Comput. Neurosci., 10(2), 187– 193.
Luskin, M. B., & Price, J. L. (1983). The topographic organization of associational fibers of the olfactory system in the rat, including centrifugal fibers to the olfactory bulb. J. Comp. Neurol., 216(3), 264–291. MacLeod, K., Backer, A., & Laurent, G. (1998). Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395(6703), 693–698. MacLeod, K., & Laurent, G. (1996). Distinct mechanisms for synchronization and temporal patterning of odor-encoding neural assemblies. Science, 274(5289), 976–979. Macrides, F. D. B. (1983). The olfactory bulb. In P. C. Emson (Ed.), Neuroanatomy (pp. 391–426). New York: Raven Press. Malnic, B., Hirono, J., Sato, T., & Buck, L. B. (1999). Combinatorial receptor codes for odors. Cell, 96(5), 713–723. Margrie, T. W., Sakmann, B., & Urban, N. N. (2001). Action potential propagation in mitral cell lateral dendrites is decremental and controls recurrent and lateral inhibition in the mammalian olfactory bulb. Proc. Natl. Acad. Sci. USA, 98(1), 319–324. Martin, C., Gervais, R., Hugues, E., Messaoudi, B., & Ravel, N. (2004). Learning modulation of odor-induced oscillatory responses in the rat olfactory bulb: A correlate of odor recognition? J. Neurosci., 24(2), 389–397. Mombaerts, P., Wang, F., Dulac, C., Chao, S. K., Nemes, A., Mendelsohn, M., Edmenson, J., & Axel, R. (1996). Visualizing an olfactory sensory map. Cell, 87(4), 675–686. Mori, K., Nagao, H., & Yoshihara, Y. (1999). The olfactory bulb: Coding and processing of odor molecule information. Science, 286(5440), 711–715. Mori, K., & Takagi, S. F. (1978). An intracellular study of dendrodendritic inhibitory synapses on mitral cells in the rabbit olfactory bulb. J. Physiol., 279, 569–588. Neville, K. R., & Haberly, L. B. (2003). Beta and gamma oscillations in the olfactory system of the urethane-anesthetized rat. J. Neurophysiol., 90(6), 3921–3930. Nowycky, M. C., Mori, K., & Shepherd, G. M. (1981). GABAergic mechanisms of dendrodendritic synapses in isolated turtle olfactory bulb. J. Neurophysiol., 46(3), 639–648. Olufsen, M. S., Whittington, M. A., Camperi, M., & Kopell, N. (2003). New roles for the gamma rhythm: Population tuning and preprocessing for the beta rhythm. J. Comput. Neurosci., 14(1), 33–54. Perez-Orive, J., Bazhenov, M., & Laurent, G. (2004). Intrinsic and circuit properties favor coincidence detection for decoding oscillatory input. J. Neurosci., 24(26), 6037–6047. Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and sparsening of odor representations in the mushroom body. Science, 297(5580), 359–365. Pinching, A. J., & Powell, T. P. (1971). The neuropil of the glomeruli of the olfactory bulb. J. Cell. Sci., 9(2), 347–377. Pinching, A. J., & Powell, T. P. (1972). The termination of centrifugal fibres in the glomerular layer of the olfactory bulb. J. Cell. Sci., 10(3), 621–635. Rall, W., & Shepherd, G. M. (1968). Theoretical reconstruction of field potentials and dendrodendritic synaptic interactions in olfactory bulb. J. Neurophysiol., 31(6), 884–915.
Temporal Patterning During Odorant Processing
815
Ravel, N., Chabaud, P., Martin, C., Gaveau, V., Hugues, E., Tallon-Baudry, C., Bertrand, O., & Gervais, R. (2003). Olfactory learning modifies the expression of odour-induced oscillatory responses in the gamma (60–90 Hz) and beta (15–40 Hz) bands in the rat olfactory bulb. Eur. J. Neurosci., 17(2), 350– 358. Ressler, K. J., Sullivan, S. L., & Buck, L. B. (1993). A zonal organization of odorant receptor gene expression in the olfactory epithelium. Cell, 73(3), 597–609. Schoppa, N. E., Kinzie, J. M., Sahara, Y., Segerson, T. P., & Westbrook, G. L. (1998). Dendrodendritic inhibition in the olfactory bulb is driven by NMDA receptors. J. Neurosci., 18(17), 6790–6802. Schoppa, N. E., & Urban, N. N. (2003). Dendritic processing within olfactory bulb circuits. Trends Neurosci, 26(9), 501–506. Schoppa, N. E., & Westbrook, G. L. (2002). AMPA autoreceptors drive correlated spiking in olfactory bulb glomeruli. Nat. Neurosci., 5(11), 1194–1202. Scott, J. W., McBride, R. L., & Schneider, S. P. (1980). The organization of projections from the olfactory bulb to the piriform cortex and olfactory tubercle in the rat. J. Comp. Neurol., 194(3), 519–534. Shepherd, G. M. (1998). The synaptic organization of the brain (4th ed.). New York: Oxford University Press. Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24(1), 49–65, 111–125. Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390(6655), 70–74. Stopfer, M., Jayaraman, V., & Laurent, G. (2003). Intensity versus identity coding in an olfactory system. Neuron, 39(6), 991–1004. Teyke, T., & Gelperin, A. (1999). Olfactory oscillations augment odor discrimination not odor identification by Limax CNS. Neuroreport, 10(5), 1061–1068. von der Malsburg, C. (1995). Binding in models of perception and brain function. Curr. Opin. Neurobiol., 5(4), 520–526. von der Malsburg, C. (1999). The what and why of binding: The modeler’s perspective. Neuron, 24(1), 95–104, 111–125. Vosshall, L. B., Wong, A. M., & Axel, R. (2000). An olfactory sensory map in the fly brain. Cell, 102(2), 147–159. Wachowiak, M., & Cohen, L. B. (1999). Presynaptic inhibition of primary olfactory afferents mediated by different mechanisms in lobster and turtle. J. Neurosci., 19(20), 8808–8817. Wachowiak, M., Cohen, L. B., & Zochowski, M. R. (2002). Distributed and concentration-invariant spatial representations of odorants by receptor neuron input to the turtle olfactory bulb. J. Neurophysiol., 87(2), 1035–1045. Wehr, M., & Laurent, G. (1996). Odour encoding by temporal sequences of firing in oscillating neural assemblies. Nature, 384(6605), 162–166. Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G., & Traub, R. D. (1997). Spatiotemporal patterns of gamma frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol., 502 (Pt. 3), 591–607. Wilson, D. A. (2001). Receptive fields in the rat piriform cortex. Chem. Senses, 26(5), 577–584.
816
S. Kim, B. Singer, and M. Zochowski
Wilson, D. A., & Stevenson, R. J. (2003). Olfactory perceptual learning: The critical role of memory in odor discrimination. Neurosci. Biobehav. Rev., 27(4), 307–328. Zou, Z., Horowitz, L. F., Montmayeur, J. P., Snapper, S., & Buck, L. B. (2001). Genetic tracing reveals a stereotyped sensory map in the olfactory cortex. Nature, 414(6860), 173–179.
Received February 15, 2005; accepted September 29, 2005.
LETTER
Communicated by Bard Ermentrout
Computation of the Phase Response Curve: A Direct Numerical Approach W. Govaerts
[email protected] B. Sautois
[email protected] Department of Applied Mathematics and Computer Science, Ghent University, B-9000 Ghent, Belgium
Neurons are often modeled by dynamical systems—parameterized systems of differential equations. A typical behavioral pattern of neurons is periodic spiking; this corresponds to the presence of stable limit cycles in the dynamical systems model. The phase resetting and phase response curves (PRCs) describe the reaction of the spiking neuron to an input pulse at each point of the cycle. We develop a new method for computing these curves as a by-product of the solution of the boundary value problem for the stable limit cycle. The method is mathematically equivalent to the adjoint method, but our implementation is computationally much faster and more robust than any existing method. In fact, it can compute PRCs even where the limit cycle can hardly be found by time integration, for example, because it is close to another stable limit cycle. In addition, we obtain the discretized phase response curve in a form that is ideally suited for most applications. We present several examples and provide the implementation in a freely available Matlab code.
1 Introduction Dynamical systems are often used to model individual neurons or neural networks to study the general behavior of neurons and how neural networks respond to different kinds of inputs. In this field, an important concept is the phase resetting or response curve. When a neuron is firing a train of spikes (action potentials), a short input pulse can modify the ongoing spike train. Even if the incoming pulse is purely excitatory, depending on the model and the exact timing of the pulse, it can either speed up or delay the next action potential. This affects properties of networks such as synchronization and phase locking (Ermentrout, 1996; Hansel, Mato, & Meunier, 1995). Applications of the phase response curve (PRC) to the stochastic response dynamics of weakly coupled or uncoupled neural populations can be found Neural Computation 18, 817–847 (2006)
C 2006 Massachusetts Institute of Technology
818
W. Govaerts and B. Sautois
in Brown, Moehlis, and Holmes (2004). They also derive PRCs for cycles close to bifurcations such as homoclinic orbits and Hopf points. The PRC quantifies the linearized effect of an input pulse at a given point of the orbit on the following train of action potentials. If an input pulse does not affect the spike train, the period is unchanged, and the PRC at that point of the curve is zero. If the input pulse delays the next spike, the period increases, and the PRC is negative; if the pulse speeds up the next spike, the PRC is positive. The PRC can be used to compute the effect of any perturbing influence on the curve if the perturbation does not essentially change the dynamic pattern of the neuron; in particular, it should not move the state of the neuron into a domain of attraction different from that of the spiking orbit. In the case of coupling, PRCs can be used to compute the influence of weak coupling. In this letter, we present a new numerical way to compute the PRC of a spiking neuron. Mathematically it is an implementation of the adjoint method, introduced by Ermentrout and Kopell (1991), but our implementation, as part of the boundary value problem for the limit cycle of the stable orbit, is faster and more robust than any existing method; in fact, we get the PRC as a by-product of the computation of the limit cycle. Our method is particularly useful in the context of the numerical continuation of orbits with a variable parameter of the system. We describe the details of the method, its implementation in freely available software, and several examples in neural models. We provide tests to compare the speed and accuracy with the direct (simulation) method in the case of the Hodgkin-Huxley equations and with the classical adjoint method in two other classical neural models. Also, we provide extensive tests in cases where the periodic orbits are close to homoclinics and to saddle nodes of cycles. We also describe the computation of the response to a function. The input function f is given over the whole periodic orbit. This is more realistic in the case of coupled neurons. We further apply this functionality to a simple study of synchronization of phase models.
2 Response or Resetting? The terms phase response curve and phase resetting curve, both abbreviated to PRC, are used interchangeably in the literature. Since there seems to be some confusion, we start with the precise definitions that will be used in the rest of this article. 2.1 Peak-Based Phase Response Curve. This is a rather intuitive and biologically oriented definition that can easily be applied to experimental data.
Computation of the Phase Response Curve
819
The phase response curve is a curve that is defined over the time span of one cycle in the spike train of a firing neuron with no extra input. It starts at the peak of a spike and extends to the peak of the following spike. At each point of the cycle, it indicates the effect on the oncoming spikes of an input pulse to the neuron at that particular time in its cycle. If we define the period of the unperturbed cycle as Told and Tnew is the period when the input pulse I is given at time t, the phase response curve is defined as G(t, I ) =
Told − Tnew . Told
(2.1)
2.2 Phase Resetting Curve. This notion can be defined mathematically for any stable periodic orbit of a dynamical system. Let the period of the orbit be denoted as Told , and suppose that a point of the orbit is chosen as the base point (in a neural model, this point typically corresponds to a peak in the spike train). The phase resetting curve is defined in the interval [0, 1], in which the variable φ is called the phase and is defined as φ=
t (mod 1), Told
(2.2)
with t ∈ [0, Told ]. Now suppose that an input pulse I is given to the oscillator when it is at the point x(t). Mathematically, the pulse I could be any vector in the state space of the system. In the biological application, it is usually a vector in the direction of the voltage component of the state space, since that component is affected by the synaptic input from other neurons. Since the orbit starting from x(t) + I will close in on the stable limit cycle, there is exactly one point x(t1 ), t1 ∈ [0, Told ] that has the property that d(y(t), z(t)) → 0 for if
y(0) = x(t) + I z(0) = x(t1 )
t→∞
(2.3)
.
Here x(t1 ) and x(t) + I are said to be on the same isochron (cf. Guckenheimer, 1975). We then define g(φ, I ) =
t1 . Told
(2.4)
820
W. Govaerts and B. Sautois
2.3 Universal Definition of the Phase Response Curve (PRC). The definition of the peak-based response curve in section 2.1 is satisfactory only if each period contains precisely one easily identifiable spike and transients can be ignored. In this case, the next spike will occur at time Tnew = (φ + (1 − g(φ, I ))) Told .
(2.5)
So according to equation 2.1, PRC(φ, I ) = G(Tφ, I ) =
Told − (φ + (1 − g(φ, I ))) Told = g(φ, I ) − φ. Told (2.6)
To avoid the difficulties related to transients, we redefine the phase response curve by PRC(φ, I ) = g(φ, I ) − φ
(2.7)
in general. This definition is mathematically unambiguous and reduces to the definition in section 2.1 in the case of no transients. 3 Survey of Methods Currently, two classes of methods are often used to compute PRCs. The simplest methods are direct applications of definition 2.1: using simulations, one passes through the cycle repeatedly, each time giving an input pulse at a different time and measuring the delay of the spike. From this, a PRC curve can be computed. This was done by Guevara, Glass, Mackey, and Shrier (1983) and many others. This method has several advantages. It is simple and does not require any special software (only a good time integrator is needed). It can be used for arbitrarily large perturbations, even if the next spike would be delayed by more than the period of one cycle in which case the term delay might be confusing. Depending on the required accuracy of the PRC, this method typically takes a few seconds. In applications where the PRC of one given limit cycle is desired, it is quite satisfactory for its typical applications. The other methods are based on the use of the equation adjoint to the dynamical system that describes the activity of the neuron. The mathematical basis of this approach goes back to Malkin (1949, 1956; see also Blechman, 1971, and Ermentrout, 1981). An easily accessible reference is in Hoppensteadt and Izhikevich (1997, sec. 9.2). The idea to use this method for numerical implementation goes back to Ermentrout and Kopell (1991). The implementation described in Ermentrout (1996) and available through Ermentrout’s software package
Computation of the Phase Response Curve
821
XPPAUT (Ermentrout, 2002) is based on the backward integration of the adjoint linear equation of the dynamical system. This method is mathematically better defined and more general than the simulation method because it does not use the notion of a spike. It is also more restricted because it does not measure the influence of a real finite perturbation but rather a linearization of this effect that becomes more accurate if the perturbation gets smaller. As Ermentrout (1996) noted, the accuracy of the method based on backward integration is limited for two reasons. First, the adjoint Jacobian of the dynamical system is not available analytically. It has to be approximated by finite differences from a previously numerically computed discretized form of the orbit. Second, the integration of the linear equations produces the usual numerical errors. The method that we propose is mathematically equivalent to the adjoint method; only the implementation is new. The main advantage is that it is much faster. Timing results are given in section 6.2. Although the increase in speed is impressive, this is not very relevant if only one or a few PRCs are needed. The main application of our method therefore lies in cases where a large number of PRCs is needed, such as where the evolution of PRCs is important when a parameter is changed. The sources of error from the existing method do not apply to the method that we propose. In fact, we can compute PRCs even where the stable periodic orbits are hard to find by direct time integration. 4 Used Software Our continuation computations are based on several software packages. Mostly we used MATCONT and CL MATCONT (Dhooge, Govaerts, Kuznetsov, Mestrom, & Riet, 2003). CL MATCONT is a MATLAB package for the study of dynamical systems and their bifurcations; MATCONT (Dhooge, Govaerts, & Kuznetsov, 2003) is its GUI version. Both packages are freely available online at http://matcont.UGent.be/. CL MATCONT and MATCONT are successor packages to AUTO (Doedel et al., 1997–2000) and CONTENT (Kuznetsov & Levitin, 1998), which are written in compiled languages (Fortran, C, C++ ). Recently we made some structural changes to CL MATCONT, included C-code to speed it up, and added some functionalities (Govaerts & Sautois, 2005a). It is this new version that we used test to our method for computing the PRC. This version can be downloaded from http://allserv.UGent.be/∼bsautois. Currently the two versions are being merged into one. 5 Methods 5.1 The PRC as an Adjoint Problem. For background on dynamical systems theory we refer to the standard textbooks (e.g., Guckenheimer
822
W. Govaerts and B. Sautois
& Holmes, 1983, or Kuznetsov, 2004). In this section, we introduce the adjoint method to compute the phase response curve. Our exposition is selfcontained and involves Floquet multipliers and the monodromy matrix; however, these are not used in the actual computations. Let a neural model be defined by a system of differential equations, X˙ = f (X, α),
(5.1)
where X ∈ Rn represents the state variables and α is a vector of parameters. We consider a solution to equation 5.1 that corresponds to the periodically spiking behavior of the neuron with period T. By rescaling time, we find a periodic solution x(t) with period 1, solution to the equations
x˙ − T f (x, α) = 0 x(0) − x(1) = 0
.
(5.2)
We consider the periodic function A(t) = f x (x(t)) with period 1. To equation 5.2, we associate the fundamental matrix solution (t) of the nonautonomous linear equation:
˙ (t) − T A(t) (t) = 0 (0) = In
.
(5.3)
It is also useful to consider the equation adjoint to equation 5.3:
˙ (t) + T A(t)T (t) = 0 (0) = In
.
(5.4)
From equations 5.3 and 5.4, it follows that the time derivative of T (t)(t) is identically zero. By the initial conditions, this implies T (t)(t) ≡ 1.
(5.5)
By taking derivatives of equation 5.2, we find that the tangent vector v(t) = x(t) ˙ satisfies
v(t) ˙ − T A(t) v(t) = 0 v(0) − v(1) = 0
.
(5.6)
From this we infer v(t) = (t)v(0).
(5.7)
Computation of the Phase Response Curve
823
The monodromy matrix M(t) = (t + 1)(t)−1 is the linearized return map of the dynamical system. It is easy to see that (t + 1) = M(t)(t) = (t)M(0).
(5.8)
Hence, M(t) = (t)M(0)(t)−1 .
(5.9)
The eigenvalues of M(t) are called multipliers; the similarity, equation 5.9, implies that they are independent of t. M(t) always has a multiplier equal to +1. By equations 5.6 and 5.7, v(0) = v(1) is a right eigenvector for M(0) = (1) for the eigenvalue 1. By equation 5.9, v(t) is a right eigenvector of M(t) for the eigenvalue 1 for all t. Let us assume that the limit cycle is stable, such that all multipliers different from 1 have modulus strictly less than 1. Then in particular, M(0) T has a unique left eigenvector vl0 for the multiplier 1, for which vl0 v = 1. For all t, we define vl (t) = (t)vl0 . It is now easy to see that v˙l (t) + T A(t)T vl (t) = 0 vl (0) − vl (1) = 0
(5.10)
.
(5.11)
Also, for all t, vl (t) is a left eigenvector of M(t) for the eigenvalue 1 and vl (t)T v(t) = 1.
(5.12)
Let R(t) be the joint right (n − 1)-dimensional eigenspace of M(t) that corresponds to all multipliers different from 1, that is, R(t) is the space orthogonal to vl (t). Now, let I be a pulse given at time t. We can decompose this pulse uniquely as I = Iv + Ir ,
(5.13)
with
Iv = c v(t) Ir ∈ R(t)
where c ∈ R.
,
(5.14)
824
W. Govaerts and B. Sautois
Linearizing this perturbation of x(t), we find that the effect of Ir will die out, while the effect of Iv will be to move the system vector along the orbit. The amount of the change in time is equal to c as defined in equation 5.14. The linearized PRC, which for simplicity we just call the PRC, at t for pulse I is therefore (note that we have rescaled to period 1) PRC(t, I ) = c.
(5.15)
Since vl (t) is orthogonal to Ir we get from equations 5.13 and 5.12 that vl (t)T I = vl (t)T Iv = c,
(5.16)
PRC(t, I ) = vl (t)T I.
(5.17)
so
The adjoint equation to equation 5.1, as used by Ermentrout (1996), is defined by the system
˙ Z(t) = −A(t)T Z(t) , 1 T ˙ =1 Z(t)T X(t)dt T
(5.18)
0
and the periodicity condition Z(T) = Z(0). It is related to our solution by the equation Z(t) = T vl
t . T
(5.19)
In the case of the unscaled system 5.1, if a pulse I is given at time t = Tφ, we find that PRC(φ, I ) = vl (φ)T I.
(5.20)
If I is the unit vector along the first (voltage) component, we have PRC(φ, I ) = (vl (φ))1 ,
(5.21)
where (.)1 denotes the first component. This situation is so common in neural models that (vl (φ))1 is sometimes also referred to as the phase response curve. Instead of giving an impulse I at a fixed time t = Tφ, we can also add a (small) vector function g( Tt ), g continuous over [0, 1], to the right-hand side
Computation of the Phase Response Curve
825
of equation 5.1 to model the ongoing input from other neurons. The phase response (P R) to this ongoing input is then given by
1
PR(g) =
vl (φ)T g(φ)dφ.
(5.22)
0
5.2 Implementation. Our implementation was done using CL MATCONT (see section 4). In this package, as in AUTO (Doedel et al., 1997– 2000) and CONTENT (Kuznetsov & Levitin, 1998), the continuation of limit cycles is based on equation 5.2 and uses orthogonal collocation. We briefly summarize this technique before describing its extension to compute phase response curves. (For details on the orthogonal collocation method, see Ascher & Russell, 1981, and De Boor & Swartz, 1973.) The idea to use the discretized equation to solve the adjoint problem was used in a completely different context in Kuznetsov, Govaerts, Doedel, & Dhooge (in press). Since a limit cycle is periodic, we need an additional equation to fix the phase. For this phase equation, an integral condition is used. The system is the following: ˙ − T f (x(t), α) = 0 x(t) x(0) − x(1) = 0 , 1 T ˙x˜ (t)x(t) dt = 0 0
(5.23)
where x˜ is some initial guess for the solution, typically obtained from the previous continuation step. This system is referred to as the boundary value problem (BVP). To describe a continuous limit cycle numerically, it needs to be stored in a discretized form. First, the interval [0, 1] is subdivided into N intervals: 0 = τ0 < τ1 < · · · < τ N = 1.
(5.24)
The grid points τ0 , τ1 , . . . , τ N form the coarse mesh. In each interval [τi , τi+1 ], the limit cycle is approximated by a degree m polynomial, whose values are stored in equidistant mesh points on each interval, namely, τi, j = τi +
j (τi+1 − τi ) ( j = 0, 1, . . . , m). m
(5.25)
These grid points form the fine mesh. In each interval [τi , τi+1 ], we then require the polynomials to satisfy the BVP exactly at m collocation points. The best choices for these collocation points are the gauss points ζi, j , that is, the roots of the Legendre polynomial of degree m, relative to the interval [τi , τi+1 ] (De Boor & Swartz, 1973).
826
W. Govaerts and B. Sautois
For a given vector function η ∈ C 1 ([0, 1], Rn ), we consider two different discretizations and two weight forms:
r r r r
η M ∈ R(Nm+1)n , the vector of the function values at the mesh points ηC ∈ R Nmn , the vector of the function values at the collocation points ηWL ∈ R(Nm+1)n , the vector of the function values at the mesh points, each multiplied with its coefficient for piecewise Lagrange quadrature ηWG = [ ηηWW1 ] ∈ R(Nm+1)n , where ηW1 is the vector of the function values 2 at the collocation points multiplied by the Gauss-Legendre weights and the lengths of the corresponding mesh intervals, and ηW2 = η(0).
To explain the use of the weight forms, we first consider a scalar function 1 f (t) ∈ C 0 ([0, 1], R). Then the integral 0 f (t)dt can be numerically approximated by appropriate linear combinations of function values. This can be done in several ways. (For background on quadrature methods, we refer to Deuflhard & Hohmann, 1991.) If the fine mesh points are used, then the best approximation has the form m N−1
lm, j ( f M )i, j (τi+1 − τi )
(5.26)
i=0 j=0
=
N−1 m−1
( f WL )i, j + ( f WL ) N−1,m .
(5.27)
i=0 j=0
In equation 5.26, the coefficients lm, j are the Lagrange quadrature coefficients and ( f M )i, j = f (τi, j ); this equation is the exact integral if f (t) is a piecewise polynomial of degree m or less. Equation 5.27 is a reorganization of equation 5.26 and defines f WL . 1 The integral 0 f (t)g(t)dt ( f (t), g(t) ∈ C 0 ([0, 1], R)) is then approximated T g M . For vector functions f (t), g(t) ∈ by the vector inner product f WL 1 0 n T C ([0, 1], R ), the integral 0 f (t) g(t)dt is formally approximated by the T same expression: f WL gM. If the collocation points are used, then the best approximation has the form m N−1
i=0 j=1
ωm, j ( f C )i, j (τi+1 − τi ) =
m N−1
( f W1 )i, j , i=0 j=1
(5.28)
Computation of the Phase Response Curve
827
where ( f C )i, j = f (ζi, j ) and ωm, j are the Gauss-Legendre quadrature coefficients. Formula 5.28 delivers the exact integral if f (t) is a piecewise polynomial of degree 2m − 1 or less. 1 The integral 0 f (t)g(t)dt ( f (t), g(t) ∈ C 0 ([0, 1], R)) is approximated with T T Gauss-Legendre by f W g = fW L C×M g M . For vector functions f (t), g(t) ∈ 1 C 1 1 0 n C ([0, 1], R ), the integral 0 f (t)T g(t)dt is formally approximated by the T T same expression: f W g = fW L C×M g M . 1 C 1 Here we formally used the structured sparse matrix L C×M that converts a vector η M of function values at the mesh points into the vector ηC of its values at the collocation points: ηC = L C×M η M .
(5.29)
This matrix is usually not formed explicitly; its entries are fixed and given by the values of the Lagrange interpolation functions in the collocation points. In the Newton steps for the computation of the solution to equation 5.23, we solve matrix equations with the Jacobian matrix of the discretization of this equation:
(D − T A(t))C×M (− f (x(t), α))C (δ0 − δ1 )TM (x˜˙
T
(t))TWL
0
.
(5.30)
0
In equation 5.30, the matrix (D − T A(t))C×M is the discretized form of the operator D − T A(t) where D is the differentiation operator. So we have (D − T A(t))C×M η M = (η(t) ˙ − T A(t)η(t))C . We note that this is a large, sparse, and well-structured matrix. In AUTO (Doedel et al., 1997–2000) and CONTENT (Kuznetsov & Levitin, 1998), this structure is fully exploited; in CL MATCONT, only the sparsity is exploited by using the MATLAB sparse solvers. We note that the evaluation of (D − T A(t))C×M takes most of the computing time in the numerical continuation of limit cycles. Furthermore, (δ0 − δ1 )TM is the discretization in the fine mesh points of the operator δ0 − δ1 where δ0 , δ1 are the Dirac evaluation operators in 0 and 1, respectively. So the second block row in equation 5.30 is an (n × (Nm + 1)n)matrix whose first (n × n)-block is the identity matrix In and whose last but one (n × n)-block is −In ; all other entries are zero. We note that in a continuation context, equation 5.30 is extended by an additional column, which contains (−T f α (x(t), α))C where α is the free parameter, and by an additional row, which is added by the continuation algorithm.
828
W. Govaerts and B. Sautois
Now if the limit cycle is computed, we also want to compute vl (t), solution to equations 5.11 and 5.12. So vl (t) is defined up to scaling by the condition that it is a null vector of the operator φ2 : C 1 ([0, 1], Rn ) → C 0 ([0, 1], Rn ) × Rn , where φ2 (ζ ) =
ζ˙ + T AT ζ
.
ζ (0) − ζ (1)
By routine computations, one proves that this is equivalent to
ζ
⊥ φ1 (C 1 ([0, 1], Rn ))
ζ (0) where
φ1 : C 1 ([0, 1], Rn ) → C 0 ([0, 1], Rn ) × Rn , φ1 (ζ ) =
ζ˙ − T Aζ ζ (0) − ζ (1)
.
(t) A(t) ] is orthogonal to the range of [ D−T ]. Now, by equation 5.12 So [ vvll (0) δ0 −δ1 and the fact that v(t) = x(t), ˙ we can state that
[((vl )WG ) 0] T
(D − T A(t))C×M (− f (x(t), α))C (δ0 − δ1 )TM
0
T (x˜˙ (t))TWL
0
=
1 0 − T
.
(5.31)
The matrix in equation 5.31 is freely available, since it is the same one as in equation 5.30. Obtaining (vl )WG from equation 5.31 is equivalent to solving a large, sparse system, a task done efficiently by MATLAB. The first part (vl )W1 of the obtained vector (vl )WG represents vl (t) but is evaluated in the collocation points and still weighted with the interval widths and the Gauss-Legendre weights. The second part (vl )W2 of the obtained vector (vl )WG is equal to vl (0). In many important cases (see, e.g.,
Computation of the Phase Response Curve
829
section 7), (vl )W1 is precisely what we need because vl (t) will be used to compute integrals of the form
1
vl , g = 0
vlT g(t)dt,
where g(t) is a given vector function that is continuous in [0, 1]. This integral is then numerically approximated by the vector inner product, (vl )TW1 gC = (vl )TW1 L C×M g M .
(5.32)
In some cases, we want to know (vl ) M . Since we know the GaussLegendre weights and the interval widths, we can eliminate them explicitly from (vl )W1 and obtain (vl )C . Now equation 5.29 converts values in mesh points to values in collocation points. In addition, we know that vl (0) = vl (1) = (vl )W2 . Combining these results, we get
(vl )C vl (0)
=
L C×M 0.5 0 · · · 0 0.5
(vl ) M ,
(5.33)
] is sparse, square, and well conditioned and so equawhere [ 0.5 0L C×M ... 0 0.5 tion 5.33 can be solved easily by MATLAB to get (vl ) M . We note that in this case (and only in this case), it is necessary to form L C×M explicitly. Now we have access to all elements needed to compute the PRC using equation 5.20. 6 Test Results 6.1 Correctness of Our Method 6.1.1 Comparison to Direct Method. Here we compare our method of computing the PRC to the direct method, which consists of giving input pulses at different points in the cycle and measuring the resulting changes in the cycle period (cf. section 3). As a test system, we choose the Hodgkin-Huxley system. Figure 1 shows the PRC for this model for I = 12, as computed using our new method. The period of the limit cycle is then 13.72 seconds. During a continuation experiment, this computation takes about 0.04 seconds. Figure 2 shows two PRCs for the same model and the same parameter values, which were computed in the direct way. The PRCs were computed for different pulse amplitudes and durations: pulse amplitudes are 10 and 20, and pulse durations are 0.05 and 0.15 seconds, for Figures 2A and 2B, respectively.
830
W. Govaerts and B. Sautois 0.03 0.025 0.02 0.015 0.01 0.005 0 – 0.005 – 0.01 – 0.015 0
0.2
0.4
0.6
0.8
1
Figure 1: PRC of the Hodgkin-Huxley model at I = 12.
Visually, it is clear that the shapes of the curves match. A rough computation shows that the matching is also quantitative. Indeed, the situation of Figure 2A corresponds to a resetting pulse of size 10 × 0.05 = 0.5 (millivolts). Dividing the maximal value of the computed PRC by 0.5, we obtain 0.014/0.5 = 0.028 (per millivolt). Similarly, for the situation of Figure 2B, we obtain 0.075/(20 × 0.15) = 0.025 (per millivolt). This closely matches the computed maximal value in Figure 1. This shows that our method is accurate and applicable not only for infinitesimally small input pulses but also for pulses of finite size and duration. The direct computation of these experimental PRCs, with the limited precision they have (as is obvious from the figures), took between 60 and 70 seconds each. Also, the smaller the inputs for an experimental computation, the higher the precision has to be, since the PRC amplitudes shrink, and thus noise due to imprecision increases in relative size. This is already evident from the difference between Figures 2A and 2B. And computation time increases with increasing precision. 6.1.2 Three Classical Situations. In this section we check that in some standard situations, our results correspond with those in the literature. In fact, our figures almost perfectly match the ones computed by Ermentrout (1996), except for the scaling factor (which is the cycle period). This should not come as a surprise since in the absence of computational errors (e.g., rounding, truncation), the methods used are equivalent. Here we present a
Computation of the Phase Response Curve 0.015
831
A
0.01
0.005
0
– 0.005
–0.01 0
0.08
0.2
0.4
0.6
0.2
0.4
0.6
0.8
1
B
0.06 0.04 0.02 0 –0.02 –0.04 – 0.06 0
0.8
1
Figure 2: Experimentally obtained PRCs for the Hodgkin-Huxley model. (A) Pulse amplitude = 10, pulse duration = 0.05. (B) Pulse amplitude = 20, pulse duration = 0.15.
832
W. Govaerts and B. Sautois
couple of the most widely known models. The equations and fixed parameter values for all models are listed in the appendix. The Hodgkin-Huxley model, as described in the appendix in section A.1, is known to exhibit a PRC of type II, which means a PRC with a positive and a negative region. So two coupled Hodgkin-Huxley neurons with excitatory connections can still slow down each other’s spike rate. Figure 1 shows the PRC for this model, for I = 12, where the limit cycle has a period of 13.72. The Morris-Lecar model is given in section A.2. It is known to have different types of behavior and phase response curves at different parameter values (Govaerts & Sautois, 2005b). Figure 3A shows the PRC at V3 = 4 and I = 45, where the cycle period is 62.38. We clearly see a negative and positive regime in the PRC. In Figure 3B, the PRC is shown at V3 = 15 and I = 39, where the period is 106.38; the PRC is practically nonnegative. Finally, we show a result for the Connor model (Connor, Walter, & McKown, 1977), a six-dimensional model, given in section A.3, which has a nonnegative PRC. Figure 4 depicts this PRC, for I = 8.5 and period 98.63. 6.2 Three Further Applications. In this section, we compute families of PRCs in cases where the shape of the PRC is known to have important consequences for networking behavior. The change of the shapes of the PRCs under parameter variation therefore is an interesting object of study. Moreover, this will allow us to obtain timing results for the computation of the PRCs in both absolute and relative terms. We further illustrate the robustness of the method by computing the PRC for a limit cycle that is hard to find by direct integration because of its small domain of attraction. 6.2.1 Morris-Lecar: Limit Cycles Close to Homoclinics. Brown et al. (2004) study the response dynamics of weakly coupled or uncoupled neural populations using the PRCs of periodic orbits near certain bifurcation points. In particular, they obtain the PRCs of the spiking periodic orbits near homoclinic to saddle node orbits (HSN; Brown et al. call this the SNIPER bifurcation) and homoclinic to hyperbolic saddle orbits (HHS; they call this the homoclinic bifurcation). They obtain standard forms for these PRCs, based on a normal form analysis of the homoclinic orbits. In numerical calculations that involve the Hindmarsh-Rose model in the first (HSN) case and the Morris-Lecar model in the second (HHS) case, the normal form predictions are largely confirmed. It turns out that the PRCs in the two cases look very different. On the other hand, it is well known that the transition from HSN to HHS orbits is generic when a curve of HSN or HHS orbits is computed. The generic transition point is a noncentral homoclinic to saddle node orbit (NCHSN). Moreover, this transition is not uncommon in neural models, and indeed we found it in the ubiquitous Morris-Lecar model (cf. Govaerts & Sautois, 2005b).
Computation of the Phase Response Curve 0.12
833
A
0.1 0.08 0.06 0.04 0.02 0 – 0.02 0
0.2
0.2
0.4
0.6
0.8
1
0.2
0.4
0.6
0.8
1
B
0.15
0.1
0.05
0
– 0.05
0
Figure 3: PRCs of the Morris-Lecar model. (A) PRC at V3 = 4 and I = 45. (B) PRC at V3 = 15 and I = 39.
834
W. Govaerts and B. Sautois 0.03 0.025 0.02 0.015 0.01 0.005 0 – 0.005 0
0.2
0.4
0.6
0.8
1
Figure 4: PRC of the Connor model at I = 8.5.
It follows that the PRCs near HSN orbits can be transformed smoothly into the PRCs near HHS orbits. Analytically this calls for a normal form analysis of the NCHSN orbits. We have not performed this but used the situation as a testing ground for the computation of PRCs. We computed a branch of spiking orbits with fixed high period (this forces the branch to follow the homoclinic orbits closely) from the HHS regime into the HSN regime, computed the PRCs in a large number of points, and plotted the resulting PRCs in a single picture to get geometric insight into the way the PRCs are transformed from one into the other. In Figure 5A, part of the phase plane is shown for the Morris-Lecar model. Since the curves are very close together, we added a qualitative diagram (see Figure 5B). The pictures show the saddle node curve (thin line) and the curve of HHS orbits (dashed line), which coincide and form a curve of HSN orbits (thick line). The point of coincidence is encircled; this is the NCHSN point. Close to the curves of HHS and HSN orbits is the curve (dotted) of limit cycles with fixed period 237.665 along which we have computed PRCs. The continuation was done using 80 mesh intervals and 4 collocation points per interval. Figure 5C shows the resulting 100 computed PRCs. We started the limit cycle continuation from a cycle with V3 = 10 and I = 40.536. The corresponding PRC is the one that, in Figure 5C, has the left-most peak; it is also slightly darker in the picture than the following PRCs. The picture shows the smooth transition of consecutive PRCs. The gray PRCs are the
Computation of the Phase Response Curve
15
835
A
V3
14 13 12 11 10 39
39.5
40
40.5
I –4
12
x 10
C
10 8 6 4 2 0 –2 0
0.2
0.4
0.6
0.8
1
Figure 5: (A) Phase plot for the Morris-Lecar model. (B) Qualitative picture to clarify relative positions of the curves. Thin line = limit point curve; thick line = HSN curve; dashed line = HHS curve; circle = NCHSN point; dotted line = curve of limit cycles with period 237,6646. (C) PRCs for limit cycles along limit cycle curve. Gray = PRCs for limit cycles close to HHS curves; black = PRCs for limit cycles close to HSN curves.
ones for limit cycles at values of V3 lower than the value at the NCHSN point (V3 = 12.4213). The dark PRCs correspond to limit cycles at higher V3 values. The shapes obtained for PRCs near HSN orbits and those near HHS orbits both confirm the results of Brown et al. (2004). Two significant facts stand out from the picture. First, the dark PRCs are (at least close to) strictly positive, while the gray PRCs have a distinct
836
W. Govaerts and B. Sautois
negative region. Second, the PRCs closest to the V3 value of the NCHSN point are also the PRCs with the lowest peak, that is, the PRCs in the bottom of the “valley” formed by the consecutive PRCs for limit cycles further away from that particular value, in either direction. This suggests that the NCHSN orbit has a distinct influence on the shape of the PRCs of nearby limit cycles. This is not surprising but to our knowledge has never been investigated. The computation of the 100 PRCs along the limit cycle curve took a total time of 2.34 seconds: 0.0234 seconds per PRC. To compare this with standard methods, we note that Ermentrout (2002, section 9.5.3) states that to compute one PRC takes 1 or 2 seconds. This is certainly acceptable if one is interested in a single PRC but hardly acceptable if one is interested in the evolution of the PRC under a change of parameters. The full continuation run for these 100 limit cycles took 62.41 seconds. So our PRC computations took only about 3.75% of the total time needed for the experiment. If one used a PRC computing method that takes, for example, 2 seconds per PRC, this would cause the total run time of the program to increase up to 260 seconds, an increase of more than 300%. 6.2.2 Hodgkin-Huxley: Robustness. Our method is robust in the sense that it can compute PRCs for all limit cycles that can be found through continuation; there are no further sources of error as in the traditional implementation of the computation of the PRC (cf. Ermentrout, 1996). In fact we can compute PRCs even for limit cycles that are hard to find by any means other than numerical continuation, which can happen when their domain of attraction is not easily found. In the Hodgkin-Huxley model, there is typically a short interval in which a stable equilibrium and two stable limit cycles coexist. In our case, for the parameter values specified in section A.1, this happens between values I = 7.8588 and 7.933. These limit cycles are shown in Figure 6A. The smaller of the two stable limit cycles in the picture exists for only a short I -interval and has a very small attraction domain. Therefore, it is extremely hard to find by, for example, time integration. This implies that it is also not trivial to compute the PRC corresponding to that particular limit cycle. Our method, however, has no problem computing it. In Figure 6B, the PRCs are depicted that correspond to the limit cycles from Figure 6A. The shapes of the two PRCs are very different. Also note that the darker PRC was actually larger in amplitude but was rescaled to the same ranges as the gray PRC. 6.2.3 Hodgkin-Huxley: Limit Cycles Close to Saddle Node of Cycles. As a third test, we have done a continuation of both coexisting stable limit cycles in the Hodgkin-Huxley system, mentioned in section 6.2.2. In both cases we did the continuation for decreasing values of I , approaching a saddle node of limit cycles (also limit point of cycles, LPC).
Computation of the Phase Response Curve 1
837
A
0.9 0.8 0.7
m
0.6 0.5 0.4 0.3 0.2 0.1 0 –20
0
20
40
60
80
100
V 1
B
0.8 0.6 0.4 0.2 0 – 0.2 – 0.4 – 0.6 – 0.8 –1 0
0.2
0.4
0.6
0.8
1
Figure 6: (A) Two stable limit cycles for the Hodgkin-Huxley model at parameter value I = 7.919. (B) Corresponding PRCs. The gray PRC corresponds to the gray limit cycle and the black PRC to the black limit cycle.
838
W. Govaerts and B. Sautois
The big stable limit cycle loses its stability at an LPC that occurs at I = 6.276. Starting from I = 7.919, we computed 150 continuation steps, with 80 mesh intervals and 4 collocation points. The final computed limit cycle was at I = 6.377. The last 100 PRCs are shown in Figure 7A, where the palest one, with the biggest amplitude, is the PRC corresponding to the limit cycle closest to the LPC. The small, stable limit cycle loses its stability at an LPC that occurs at I = 7.8588. Starting from I = 7.919, we computed 100 continuation steps, with smaller step sizes than for the big limit cycle. The final computed limit cycle was at I = 7.8592. The corresponding 100 PRCs are shown in Figure 7B. Again the palest one is closest to the LPC. During the continuations, the actual PRC computations took around 4 seconds total for 100 PRCs—0.04 second per PRC. The full continuation run took around 100 seconds, so the PRC computations took about 4% of the total run time. Again, the use of another method to compute the PRC, which takes 2 seconds per limit cycle, would increase the time needed to about 300 seconds, an increase by 200%. Both PRCs have positive and negative regions, but otherwise their shapes are very different. It is noteworthy that the shapes of the PRCs near the big LPC look similar to the PRCs predicted near the Bautin bifurcation by Brown et al. (2004). It is well known that branches of LPCs generically are born at Bautin bifurcation points. However, the shapes of the PRCs near the small LPC are very different. We note that both LPCs are far from Bautin bifurcation points. This is again a subject for further investigation. 7 Response to a Function In section 5.1 we discussed the computation of spike train responses to any given input function g. We can now compute the right-hand side of equation 5.22 numerically. In fact, using equation 5.29, we obtain the highorder approximation PR(g) = (vl )TW1 (g(φ))C = (vl )TW1 L C×M (g(φ)) M .
(7.1)
Hence, the computation of the phase response to a function is reduced to the discretization of that function and the computation of a vector inner product. In this section, we briefly show how this is done in a classical and well-understood situation: the phase dynamics of two coupled Poincar´e oscillators. The Poincar´e oscillator has been used many times as a model of biological oscillations (e.g., in Glass & Mackey, 1988, and Glass, 2003). It is sufficient for present purposes to note that it has two state variables x, y and a stable limit cycle with period 1 given by x = cos 2πt, y = cos 2πt.
Computation of the Phase Response Curve 0.25
839
A
0.2 0.15 0.1 0.05 0 –0.05 –0.1 –0.15
0
10
0.2
0.4
0.6
0.8
1
0.2
0.4
0.6
0.8
1
B
8 6 4 2 0 –2 –4 –6 –8 –10 –12
0
Figure 7: (A) PRCs of big Hodgkin-Huxley limit cycles approaching an LPC at I = 6.276. (B) PRCs of small Hodgkin-Huxley limit cycles approaching an LPC at I = 7.919.
840
W. Govaerts and B. Sautois
The dynamics of weakly coupled oscillators can be reduced to their phase dynamics (cf. Hansel, Mato, & Meunier, 1993, where further references can be found). We restrict to the simple setting discussed in Ermentrout (2002). Consider two identical, weakly coupled oscillators, with autonomous period T:
X1 = F(X1 ) + G 1 (X2 , X1 )
X2 = F(X2 ) + G 2 (X1 , X2 )
,
(7.2)
with G 1 and G 2 two possibly different coupling functions, and a small positive number. Let X0 (t) be an asymptotically stable periodic solution of X = F(X) with period T. Then, for sufficiently small, X j (t) = X0 (θ j ) + O() ( j = 1, 2)
(7.3)
with θ1 = 1 + H1 (θ2 − θ1 ) θ2 = 1 + H2 (θ1 − θ2 ) where H j are T-periodic functions given by H j (ψ) ≡
1 T
T
Z(t)T G j [X0 (t + ψ), X0 (t)]dt,
(7.4)
0
where Z is the adjoint solution as defined in equation 5.18. Now consider two identical Poincar´e oscillators. A natural choice for the coupling functions G j is G 1 (Xi , X j ) = G 2 (Xi , X j ) = Xi − X j . It follows that H1 (φ) = H2 (φ) = 0
1
vlT (t)[X0 (t + φ) − X0 (t)]dt.
(7.5)
One finds easily that
1
(θ2 − θ1 ) = 2 sin[2π(θ1 − θ2 )] 0
vl (t)
T
− sin(2πt) dt. cos(2πt)
(7.6)
Computation of the Phase Response Curve
841
We set ζ = θ2 − θ1 . The constant function ζ = 12 is a solution to equation 7.6; it corresponds to antiphase synchronization. On the other hand, if at the starting time ζ = 12 (mod 1), then by elementary calculus, we obtain tan(πζ ) = C exp(−4π α t),
(7.7)
for some constant C = 0, where
1
α=
vl (t)
T
− sin(2πt) cos(2πt)
0
dt.
(7.8)
This leads to the following conclusion about the behavior of ζ for t → ∞: α > 0 ⇒ ζ → 0/1 α < 0⇒ζ →
1 2
α = 0 ⇒ ζ → constant.
(7.9)
In the first case, the two oscillators converge to in-phase synchronization, in the second case they move toward antiphase synchronization, and in the third case, the oscillators remain in the out-of-phase synchronization they started with. Now α is nothing else than the phase response to a function and can be computed by equation 7.1. In this case, we find that α > 0, so two identical Poincar´e oscillators coupled in both directions by the function G(X2 , X1 ) = X2 − X1 always converge to in-phase synchronization, except when they start in perfect antiphase synchronization, in which case they will remain in that state. This is easily checked by numerical simulation (and can be shown analytically). Analogously, it is easy to compute α for other coupling functions and to check the result by simulations. For example, if G 1 (X2 , X1 ) = G 2 (X2 , X1 ) = X1 − X2 , then α < 0, so the two Poincar´e oscillators always converge to antiphase synchronization. A more interesting case is obtained if we set X1 =
x1 x2
, X2 =
G 1 (X2 , X1 ) =
x3
,
x4
x3 − x1 0
, G 2 (X1 , X2 ) =
x1 − x3 0
.
842
W. Govaerts and B. Sautois
In this case, we again find α > 0, and there is always in-phase synchronization, except for when we start in exact antiphase. When we set G 1 (X2 , X1 ) =
x4 − x2 0
, G 2 (X1 , X2 ) =
x2 − x4 0
,
then α is zero (up to truncation and rounding errors), so the oscillators keep their initial out-of-phase synchronization. This is confirmed by numerical simulations. 8 Conclusion We have developed a new method for computing the phase response curve of a spiking neuron. The new method is mathematically equivalent to the adjoint method, but our implementation is faster and more robust than any existing method. It can be used as part of the continuation of the boundary value problem defining the limit cycle that describes the spiking neuron. In that case, the computational work has few costs. Tests on several well-established neural models have shown that our method produces correct results for the PRCs very quickly. We have also extended the code with the possibility of computing the spike train response to any given function. Appendix: Neural Models A.1 Hodgkin-Huxley Model. The Hodgkin-Huxley model is defined by the following equations: C
dV = I − g Na m3 h(V − VNa ) − g K n4 (V − VK ) − g L (V − VL ) dt
dm = φ ((1 − m) αm − m βm ); dt dh = φ ((1 − h) αh − h βh ); dt dn = φ ((1 − n) αn − n βn ); dt
φ=3
T−6.3 10
(A.1)
Computation of the Phase Response Curve
843
25 − V 10 ψαm αm = exp[ψαm ] − 1
ψαm =
βm = 4 exp[−V/18] αh = 0.07 exp[−V/20] βh =
ψαn =
1 1 + exp[(30 − V)/10] 10 − V 10
αn = 0.1
ψαn exp[ψαn ] − 1
βn = 0.125 exp[−V/80]. In this letter, the parameters C = 1, g Na = 120, VNa = 115, g K = 36, VK = −12, gl = 0.3, Vl = 10.559, and T = 6.3 are fixed. I is varied according to the tests. A.2 Morris-Lecar Model. The Morris-Lecar model is defined by the following equations: C
dV = Ie xt − g L (V − VL ) − gCa M∞ (V − VCa ) − g K N(V − VK ) dt dN = τ N (N∞ − N) dt
(A.2)
1 M∞ = (1 + tanh((V − V1 )/V2 )) 2 1 N∞ = (1 + tanh((V − V3 )/V4 )) 2 τ N = φ cosh((V − V3 )/2V4 ) In our tests, C = 5, g L = 2, VL = −60, gCa = 4, VCa = 120, g K = 8, VK = −80, 1 φ = 15 , V1 = −1.2, V2 = 18, and V4 = 17.4 are fixed. I and V3 are varied, according to the tests.
844
W. Govaerts and B. Sautois
A.3 Connor Model. The Connor model is defined by the following equations: C
dV = I − g L (V − E L ) − g Na m3 h(V − E Na ) − g K n4 (V − E K ) dt −g A A3 B(V − E A)
dm m∞ (V) − m = dt τm (V) dh h ∞ (V) − h = dt τh (V) dn n∞ (V) − n = dt τn (V) d A A∞ (V) − A = dt τ A(V) dB B∞ (V) − B = dt τ B (V) with αm =
0.1(V + 29.7) 1 − exp(−0.1(V + 29.7))
βm = 4 exp[−(V + 54.7)/18] αm m∞ = αm + βm τm =
1 1 3.8 αm + βm
αh = .07 exp[−0.05(V + 48)] 1 1 + exp[−0.1(V + 18)] αh h∞ = αh + βh βh =
τh =
1 1 3.8 αh + βh
αn = 0.01
V + 45.7 1 − exp[−0.1(V + 45.7)]
(A.3)
Computation of the Phase Response Curve
845
βn = 0.125 exp(−0.0125(V + 55.7)) αn n∞ = αn + βn τn =
1 2 3.8 αn + βn
1/3 exp[(V + 94.22)/31.84] A∞ = 0.0761 1 + exp[(V + 1.17)/28.93)] τ A = 0.3632 + B∞ =
1.158 1 + exp[(V + 55.96)/20.12]
1 (1 + exp[(V + 53.3)/14.54])4
τ B = 1.24 +
2.678 . 1 + exp[(V + 50)/16.027]
In this article, the parameters C = 1, g L = 0.3, E L = −17, g Na = 120, g A = −47.7, E Na = 55, g K = 20, E K = −72, and E A = −75 are fixed. I is varied according to the tests. Acknowledgments B. S. thanks the Fund for Scientific Research Flanders, FWO for funding the research reported in this article. Both authors thank the two referees for several critical remarks that led to substantial improvements in the article content and presentation. References Ascher, U., & Russell, R. D. (1981). Reformulation of boundary value problems in standard form. SIAM Rev., 23(2), 238–254. Blechman, I. I. (1971). Synchronization of dynamical systems [in Russian: Sinchronizatzia dinamicheskich sistem]. Moscow: Nauka. Brown, E., Moehlis, J., & Holmes, P. (2004). On the phase reduction and response dynamics of neural oscillator populations. Neural Comput., 16, 673–715. Connor, J. A., Walter, D., & McKown, R. (1977). Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustatean axons. Biophys. J., 18, 81– 102. De Boor, C., & Swartz, B. (1973). Collocation at gaussian points. SIAM J. Numer. Anal., 10, 582–606. Deuflhard, P., & Hohmann, A. (1991). Numerische Mathematik: Eine algorithmisch ori¨ entierte Einfuhrung. New York: Walter de Gruyter.
846
W. Govaerts and B. Sautois
Dhooge, A., Govaerts, W., & Kuznetsov, Yu. A. (2003). MATCONT: A MATLAB package for numerical bifurcation analysis of ODEs. ACM TOMS, 29(2), 141– 164. Dhooge, A., Govaerts, W., Kuznetsov, Yu. A., Mestrom, W., & Riet, A. M. (2003). A continuation Toolbox in MATLAB. Available online at http://allserv. UGent.be/∼ajdhooge/doc cl matcont.zip. Doedel, E. J., Champneys, A. R., Fairgrieve, T. F., Kuznetsov, Yu. A., Sandstede, B., & Wang, X. J. (1997–2000). AUTO97: Continuation and bifurcation software for ordinary differential equations (with HomCont). Available at http://indy.cs.concordia.ca/auto. Ermentrout, G. (1996). Type I membranes, phase resetting curves and synchrony. Neural Comput., 8(5), 979–1001. Ermentrout, G. B. (1981). n : m Phase-locking of weakly coupled oscillators. J. Math. Biol., 12, 327–342. Ermentrout, G. (2002). Simulating, analyzing, and animating dynamical systems: A guide to XXPAUT for researchers and students. Philadelphia: SIAM. Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol., 29, 195–217. Glass, L. (2003). Resetting and entraining biological rhythms. In A. Beuter, L. Glass, M. C. Mackey, & M. S. Titcombe (Eds.), Nonlinear dynamics in physiology and medicine. New York: Springer-Verlag. Glass, L., & Mackey, M. C. (1988). From clocks to chaos: The rhythms of life. Princeton, NJ: Princeton University Press. Govaerts, W., & Sautois, B. (2005a). Bifurcation software in MATLAB with applications in neuronal modeling. Comput. Meth. Prog. Bio., 77(2), 141– 153. Govaerts, W., & Sautois, B. (2005b). The onset and extinction of neural spiking: A numerical bifurcation approach. J. Comput. Neurosci., 18(3), 273–282. Guckenheimer, J. (1975). Isochrons and phaseless sets. J. Math. Biology, 1, 259–273. Guckenheimer, J., & Holmes, Ph. (1983). Nonlinear oscillations, dynamical systems and bifurcations of vector fields. New York: Springer. Guevara, M. R., Glass, L., Mackey, M. C., & Shier, A. (1983). Chaos in neurobiology. IEEE Trans. Syst. Man Cybern., 13(5), 790–798. Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Neurophys. Lett., 23(5), 367–372. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337. Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag. Kuznetsov, Yu. A. (2004). Elements of applied bifurcation theory (3rd ed.). New York: Springer-Verlag. Kuznetsov, Yu. A., Govaerts, W., Doedel, E. J., & Dhooge, A. (in press). Numerical periodic normalization for codim 1 bifurcations of limit cycles. SIAM J. Numer. Anal. Kuznetsov, Yu. A., & Levitin, V. V. (1998). CONTENT: Integrated environment for analysis of dynamical systems, version 1.5. Amsterdam: CWI. Available online at http://ftp.cwi.nl/CONTENT.
Computation of the Phase Response Curve
847
Malkin, I. G. (1949). Methods of Poincar´e and Lyapunov in the theory of non-linear oscillations [in Russian: Metodi Puankare i Liapunova v teorii nelineinix kolebanii]. Moscow: Gostexizdat. Malkin, I. G. (1956). Some problems in nonlinear oscillation theory [in Russian: Nekotorye zadachi teorii nelineinix kolebanii]. Moscow: Gostexizdat.
Received February 9, 2005; accepted June 27, 2005.
LETTER
Communicated by Richard Hahnloser
Multiperiodicity and Exponential Attractivity Evoked by Periodic External Inputs in Delayed Cellular Neural Networks Zhigang Zeng
[email protected] School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China
Jun Wang
[email protected] Department of Automation and Computer-Aided Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
We show that an n-neuron cellular neural network with time-varying delay can have 2n periodic orbits located in saturation regions and these periodic orbits are locally exponentially attractive. In addition, we give some conditions for ascertaining periodic orbits to be locally or globally exponentially attractive and allow them to locate in any designated region. As a special case of exponential periodicity, exponential stability of delayed cellular neural networks is also characterized. These conditions improve and extend the existing results in the literature. To illustrate and compare the results, simulation results are discussed in three numerical examples. 1 Introduction Cellular neural networks (CNNs) and delayed cellular neural networks (DCNNs) are arrays of dynamical cells that are suitable for solving many complex computational problems. In recent years, both have been extensively studied and successfully applied for signal processing and solving nonlinear algebraic equations. As dynamic systems with a special structure, CNNs and DCNNs have many interesting properties that deserve theoretical studies. In general, there are two interesting nonlinear neurodynamic properties in CNNs and DCNNs: stability and periodic oscillations. The stability of a CNN or a DCNN at an equilibrium point means that for a given activation function and a constant input vector, an equilibrium of the network exists and any state in the neighborhood converges to the equilibrium. The stability of neuron activation states at an equilibrium is prerequisite for most applications. Some neurodynamics have multiple (two) stable equilibria and may be stable at any equilibrium depending on the initial state, which is called Neural Computation 18, 848–870 (2006)
C 2006 Massachusetts Institute of Technology
Multiperiodicity and Exponential Attractivity of CNNs
849
multistability (bistability). For stability, either an equilibrium or a set of equilibria is the attractor. Besides stability, an activation state may be periodically oscillatory around an orbit. In this case, the attractor is a limit set. Periodic oscillation in recurrent neural networks is an interesting dynamic behavior, as many biological and cognitive activities (e.g., heartbeat, respiration, mastication, locomotion, and memorization) require repetition. Persistent oscillation, such as limit cycles, represents a common feature of neural firing patterns produced by dynamic interplay between cellular and synaptic mechanisms. Stimulus-evoked oscillatory synchronization was observed in many biological neural systems, including the cerebral cortex of mammals and the brain of insects. It was also known that time delays can cause oscillations in neurodynamics (Gopalsamy & Leung, 1996; Belair, Campbell, & Driessche, 1996). In addition, periodic oscillations in recurrent neural networks have found many applications, such as associative memories (Nishikawa, Lai, & Hoppensteadt, 2004), pattern recognition (Wang, 1995; Chen, Wang, & Liu, 2000), machine learning (Ruiz, Owens, & Townley, 1998; Townley et al., 2000), and robot motion control (Jin & Zacksenhouse, 2003). The analysis of periodic oscillation of neural networks is more general than stability analysis since an equilibrium point can be viewed as a special case of oscillation with any arbitrary period. The stability of CNNs and DCNNs has been widely investigated (e.g., Chua & Roska, 1990, 1992; Civalleri, Gilli, & Pandolfi, 1993; Liao, Wu, & Yu, 1999; Roska, Wu, Balsi, & Chua, 1992; Roska, Wu, & Chua, 1993; Setti, Thiran, & Serpico, 1998; Takahashi, 2000; Zeng, Wang, & Liao, 2003). The existence of periodic orbits together with global exponential stability of CNNs and DCNNs is studied in Yi, Heng, and Vadakkepat (2002) and Wang and Zou (2004). Most existing studies (Berns, Moiola, & Chen, 1998; Jiang & Teng, 2004; Kanamaru & Sekeine, 2004; Liao & Wang, 2003; Liu, Chen, Cao, & Huang, 2003; Wang & Zou, 2004) are based on the assumption that the equilibrium point of CNNs or DCNNs is globally stable or the periodic orbit of CNNs or DCNNs is globally attractive; hence, CNNs or DCNNs have only one equilibrium point or one periodic orbit. However, in most applications, it is required that CNNs or DCNNs exhibit more than one stable equilibrium point (e.g., Yi, Tan, & Lee, 2003; Zeng, Wang, & Liao, 2004), or more than one exponentially attractive periodic orbit instead of a single globally stable equilibrium point. In this letter, we investigate the multiperiodicity and multistablity of DCNNs. We show that an n-neuron DCNN can have 2n periodic orbits that are locally exponentially attractive. Moreover, we present the estimates of attractive domain of such 2n locally exponentially attractive periodic orbits. In addition, we give the conditions for periodic orbits to be locally or globally exponentially attractive when the periodic orbits locate in a designated position. All of these conditions are very easy to be verified.
The remaining part of this letter consists of six sections. In section 2, relevant background information is given. The main results are stated in sections 3, 4, and 5. In section 6, three illustrative examples are provided with simulation results. Finally, concluding remarks are given in section 7.

2 Preliminaries

Consider the DCNN model governed by the following normalized dynamic equations:

$$\frac{dx_i(t)}{dt} = -x_i(t) + \sum_{j=1}^{n} a_{ij} f(x_j(t)) + \sum_{j=1}^{n} b_{ij} f(x_j(t - \tau_{ij}(t))) + u_i(t), \quad i = 1, \ldots, n, \qquad (2.1)$$
where $x = (x_1, \ldots, x_n)^T \in \mathbb{R}^n$ is the state vector, $A = (a_{ij})$ and $B = (b_{ij})$ are connection weight matrices that are not assumed to be symmetric, $u(t) = (u_1(t), \ldots, u_n(t))^T \in \mathbb{R}^n$ is a periodic input vector with period ω (i.e., there exists a constant ω > 0 such that $u_i(t + \omega) = u_i(t)$ for all $t \ge 0$ and all $i \in \{1, 2, \ldots, n\}$), $\tau_{ij}(t)$ is the time-varying delay, which satisfies $0 \le \tau_{ij}(t) \le \tau$ (τ constant), and f(·) is the piecewise linear activation function defined by $f(v) = (|v + 1| - |v - 1|)/2$. In particular, when $b_{ij} \equiv 0$ (for all i, j = 1, 2, ..., n), the DCNN degenerates into a CNN.

Let $C([t_0 - \tau, t_0], D)$ be the space of continuous functions mapping $[t_0 - \tau, t_0]$ into $D \subset \mathbb{R}^n$ with the norm defined by $\|\phi\|_{t_0} = \max_{1 \le i \le n} \{\sup_{u \in [t_0 - \tau, t_0]} |\phi_i(u)|\}$, where $\phi(s) = (\phi_1(s), \phi_2(s), \ldots, \phi_n(s))^T$. Denote $\|x\| = \max_{1 \le i \le n} \{|x_i|\}$ as the norm of the vector $x = (x_1, \ldots, x_n)^T$. For all $\phi, \varphi \in C([t_0 - \tau, t_0], D)$, where $\phi(s) = (\phi_1(s), \ldots, \phi_n(s))^T$ and $\varphi(s) = (\varphi_1(s), \ldots, \varphi_n(s))^T$, denote $\|\phi, \varphi\|_{t_0} = \max_{1 \le i \le n} \{\sup_{t_0 - \tau \le s \le t_0} |\phi_i(s) - \varphi_i(s)|\}$ as a measurement in $C([t_0 - \tau, t_0], D)$.

The initial condition of DCNN model 2.1 is assumed to be $\phi(\vartheta) = (\phi_1(\vartheta), \phi_2(\vartheta), \ldots, \phi_n(\vartheta))^T$, where $\phi(\vartheta) \in C([t_0 - \tau, t_0], \mathbb{R}^n)$. Denote $x(t; t_0, \phi)$ as the state of DCNN model 2.1 with initial condition $(t_0, \phi)$; that is, $x(t; t_0, \phi)$ is continuous, satisfies equation 2.1, and $x(s; t_0, \phi) = \phi(s)$ for $s \in [t_0 - \tau, t_0]$.

Denote $(-\infty, -1) = (-\infty, -1)^1 \times [-1, 1]^0 \times (1, +\infty)^0$, $[-1, 1] = (-\infty, -1)^0 \times [-1, 1]^1 \times (1, +\infty)^0$, and $(1, +\infty) = (-\infty, -1)^0 \times [-1, 1]^0 \times (1, +\infty)^1$, with $\Omega = (-\infty, +\infty) = (-\infty, -1) \cup [-1, 1] \cup (1, +\infty)$, so that $(-\infty, +\infty)^n$ can be divided into $3^n$ subspaces:

$$\Omega = \Big\{ \prod_{i=1}^{n} (-\infty, -1)^{\delta_1^{(i)}} \times [-1, 1]^{\delta_2^{(i)}} \times (1, +\infty)^{\delta_3^{(i)}} : \big(\delta_1^{(i)}, \delta_2^{(i)}, \delta_3^{(i)}\big) = (1, 0, 0), (0, 1, 0), \text{ or } (0, 0, 1), \ i = 1, \ldots, n \Big\}; \qquad (2.2)$$
and $\Omega$ can be divided into three subspaces:

$$\Omega_1 = \{[-1, 1]^n\},$$
$$\Omega_2 = \Big\{ \prod_{i=1}^{n} (-\infty, -1)^{\delta^{(i)}} \times (1, +\infty)^{1 - \delta^{(i)}}, \ \delta^{(i)} = 1 \text{ or } 0, \ i = 1, \ldots, n \Big\},$$
$$\Omega_3 = \Omega - \Omega_1 - \Omega_2.$$

Hence, $\Omega_1$ is composed of one region, $\Omega_2$ is composed of $2^n$ regions, and $\Omega_3$ is composed of $3^n - 2^n - 1$ regions.

Definition 1. A periodic orbit $x^*(t)$ is said to be a limit cycle of a DCNN if $x^*(t)$ is an isolated periodic orbit of the DCNN; that is, there exists ω > 0 such that $x^*(t + \omega) = x^*(t)$ for all $t \ge t_0$, and there exists δ > 0 such that no $x \in \{x \mid 0 < \|x, x^*(t)\| < \delta, \ x \in \mathbb{R}^n, \ t \ge t_0\}$ is a point on any other periodic orbit of the DCNN.

Definition 2. A periodic orbit $x^*(t)$ of a DCNN is said to be locally exponentially attractive in a region $\Lambda$ if there exist constants α > 0, β > 0 such that for all $t \ge t_0$,

$$\|x(t; t_0, \phi) - x^*(t)\| \le \beta \|\phi\|_{t_0} \exp\{-\alpha(t - t_0)\},$$

where $x(t; t_0, \phi)$ is the state of the DCNN with any initial condition $(t_0, \phi)$, $\phi(\vartheta) \in C([t_0 - \tau, t_0], \Lambda)$; $\Lambda$ is then said to be a locally exponentially attractive set of the periodic orbit $x^*(t)$. When $\Lambda = \mathbb{R}^n$, $x^*(t)$ is said to be globally exponentially attractive. In particular, if $x^*(t)$ is a fixed point $x^*$, then the DCNN is said to be globally exponentially stable.

Lemma 1 (Kosaku, 1978). Let D be a compact set in $\mathbb{R}^n$, and let H be a mapping on the complete metric space $(C([t_0 - \tau, t_0], D), \|\cdot, \cdot\|_{t_0})$. If $H(C([t_0 - \tau, t_0], D)) \subset C([t_0 - \tau, t_0], D)$, and there exists a constant α < 1 such that $\|H(\phi), H(\varphi)\|_{t_0} \le \alpha \|\phi, \varphi\|_{t_0}$ for all $\phi, \varphi \in C([t_0 - \tau, t_0], D)$, then there exists $\phi^* \in C([t_0 - \tau, t_0], D)$ such that $H(\phi^*) = \phi^*$.

Consider the following coupled system:

$$\frac{dx(t)}{dt} = -x(t) + Ay(t) + By(t - \tau(t)), \qquad (2.3)$$
$$\frac{dy(t)}{dt} = h(t, y(t), y(t - \tau(t))), \qquad (2.4)$$

where $x(t) \in \mathbb{R}^n$, $y(t) \in \mathbb{R}^m$, A and B are n × m matrices, $h \in C(\mathbb{R} \times \mathbb{R}^m \times \mathbb{R}^m, \mathbb{R}^m)$, and $C(\mathbb{R} \times \mathbb{R}^m \times \mathbb{R}^m, \mathbb{R}^m)$ is the space of continuous functions mapping $\mathbb{R} \times \mathbb{R}^m \times \mathbb{R}^m$ into $\mathbb{R}^m$.
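To make the DCNN dynamics 2.1 concrete, the following sketch integrates them with the forward Euler method. It is our illustration rather than anything specified in the letter: the function names are invented, and a single constant delay τ shared by all connections is assumed for simplicity (the model itself allows time-varying delays $\tau_{ij}(t)$).

```python
import numpy as np

def f(v):
    """Piecewise linear activation f(v) = (|v + 1| - |v - 1|) / 2."""
    return (np.abs(v + 1.0) - np.abs(v - 1.0)) / 2.0

def simulate_dcnn(A, B, u, tau, history, dt=0.01, T=30.0):
    """Forward Euler integration of equation 2.1, assuming one constant
    delay tau for every connection. `history` holds the initial function
    phi on [t0 - tau, t0], sampled every dt (shape: (d + 1, n))."""
    n = A.shape[0]
    d = int(round(tau / dt))            # delay expressed in time steps
    steps = int(round(T / dt))
    x = np.empty((steps + d + 1, n))
    x[: d + 1] = history                # initial function on [t0 - tau, t0]
    for k in range(d, steps + d):
        t = (k - d) * dt
        dx = -x[k] + A @ f(x[k]) + B @ f(x[k - d]) + u(t)
        x[k + 1] = x[k] + dt * dx
    return x[d:]                        # trajectory on [t0, t0 + T]
```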
Lemma 2. If system 2.4 is globally exponentially stable, then system 2.3 is also globally exponentially stable.

Proof. By the variation-of-constants formula, the solution x(t) of equation 2.3 can be expressed as

$$x(t) = \exp\{-(t - t_0)\} x(t_0) + \int_{t_0}^{t} \exp\{-(t - s)\} \big(Ay(s) + By(s - \tau(s))\big) \, ds.$$

Since equation 2.4 is globally exponentially stable, there exist constants α, β > 0 such that $|y(s)| \le \beta \exp\{-\alpha(s - t_0)\}$. Hence, $|Ay(s) + By(s - \tau(s))| \le \bar{\beta} \exp\{-\alpha(s - t_0)\}$, where $\bar{\beta} = (\|A\| + \|B\| \exp\{\alpha\tau\}) \beta$. Then, for all $t \ge t_0$: when α = 1, $|x(t)| \le |x(t_0)| \exp\{-(t - t_0)\} + \bar{\beta}(t - t_0) \exp\{-(t - t_0)\}$; when α ≠ 1, $|x(t)| \le |x(t_0)| \exp\{-(t - t_0)\} + \bar{\beta}(\exp\{-(t - t_0)\} + \exp\{-\alpha(t - t_0)\}) / |1 - \alpha|$. That is, equation 2.3 is also globally exponentially stable.

Throughout this article, we assume that $N_1 \cup N_2 \cup N_3 = \{1, 2, \ldots, n\}$ and that $N_1 \cap N_2$, $N_1 \cap N_3$, and $N_2 \cap N_3$ are empty. Denote

$$D_1 = \{x \in \mathbb{R}^n \mid x_i \in (-\infty, -1), \ i \in N_1; \ x_i \in (1, \infty), \ i \in N_2; \ x_i \in [-1, 1], \ i \in N_3\}.$$

Note that $D_1 \subset \Omega$, where $\Omega$ is defined in equation 2.2. If $N_3$ is empty, then denote

$$D_2 = \{x \in \mathbb{R}^n \mid x_i \in (-\infty, -1), \ i \in N_1; \ x_i \in (1, \infty), \ i \in N_2\}.$$

3 Locally Exponentially Attractive Multiperiodicity in a Saturation Region

In this section, we show that an n-neuron delayed cellular neural network can have $2^n$ periodic orbits located in saturation regions and that these periodic orbits are locally exponentially attractive.

Theorem 1. If for all $i \in \{1, 2, \ldots, n\}$ and all $t \ge t_0$,

$$|u_i(t)| < a_{ii} - 1 - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}|, \qquad (3.1)$$
then DCNN (see equation 2.1) has $2^n$ locally exponentially attractive limit cycles.

Proof. If $x(s) \in D_2$ for all $s \in [t_0 - \tau, t]$, then from equation 2.1, for all i = 1, 2, ..., n,

$$\frac{dx_i(t)}{dt} = -x_i(t) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t). \qquad (3.2)$$

When $i \in N_2$ and $x_i(t) = 1$, from equations 3.1 and 3.2,

$$\frac{dx_i(t)}{dt} = -1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t) > 0. \qquad (3.3)$$

When $i \in N_1$ and $x_i(t) = -1$, from equations 3.1 and 3.2,

$$\frac{dx_i(t)}{dt} = 1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t) < 0. \qquad (3.4)$$

Equations 3.3 and 3.4 imply that if $\phi \in C([t_0 - \tau, t_0], D_2)$, then $x(t; t_0, \phi)$ remains in $D_2$, and $D_2$ is an invariant set of DCNN (see equation 2.1). So $x(t) \in D_2$ for all $t \ge t_0 - \tau$. Hence, DCNN, equation 2.1, can be rewritten as equation 3.2. Let $x(t; t_0, \phi)$ and $x(t; t_0, \varphi)$ be two states of DCNN, equation 2.1, with initial conditions $(t_0, \phi)$ and $(t_0, \varphi)$, where $\phi, \varphi \in C([t_0 - \tau, t_0], D_2)$. From equations 2.1 and 3.2, for all $i \in \{1, 2, \ldots, n\}$ and $t \ge t_0$,

$$\frac{d(x_i(t; t_0, \phi) - x_i(t; t_0, \varphi))}{dt} = -(x_i(t; t_0, \phi) - x_i(t; t_0, \varphi)). \qquad (3.5)$$
Hence, for all i = 1, 2, ..., n and $t \ge t_0$,

$$|x_i(t; t_0, \phi) - x_i(t; t_0, \varphi)| \le \|\phi, \varphi\|_{t_0} \exp\{-(t - t_0)\}. \qquad (3.6)$$

Define $x_\phi^{(t)}(\theta) = x(t + \theta; t_0, \phi)$, $\theta \in [t_0 - \tau, t_0]$. Then from equations 3.3 and 3.4, $x_\phi^{(t)} \in C([t_0 - \tau, t_0], D_2)$ for all $\phi \in C([t_0 - \tau, t_0], D_2)$. Define a mapping $H : C([t_0 - \tau, t_0], D_2) \to C([t_0 - \tau, t_0], D_2)$ by $H(\phi) = x_\phi^{(\omega)}$. Then

$$H(C([t_0 - \tau, t_0], D_2)) \subset C([t_0 - \tau, t_0], D_2),$$

and $H^m(\phi) = x_\phi^{(m\omega)}$. We can choose a positive integer m such that $\exp\{-(m\omega - \tau)\} \le \alpha < 1$. Hence, from equation 3.6,

$$\|H^m(\phi), H^m(\varphi)\|_{t_0} \le \max_{1 \le i \le n} \sup_{\theta \in [t_0 - \tau, t_0]} |x_i(m\omega + \theta; t_0, \phi) - x_i(m\omega + \theta; t_0, \varphi)| \le \|\phi, \varphi\|_{t_0} \exp\{-(m\omega + t_0 - \tau - t_0)\} \le \alpha \|\phi, \varphi\|_{t_0}.$$

Based on lemma 1, there exists a unique fixed point $\phi^* \in C([t_0 - \tau, t_0], D_2)$ such that $H^m(\phi^*) = \phi^*$. In addition, $H^m(H(\phi^*)) = H(H^m(\phi^*)) = H(\phi^*)$. This shows that $H(\phi^*)$ is also a fixed point of $H^m$. Hence, by the uniqueness of the fixed
point of the mapping $H^m$, $H(\phi^*) = \phi^*$; that is, $x_{\phi^*}^{(\omega)} = \phi^*$. Let $x(t; t_0, \phi^*)$ be a state of DCNN, equation 2.1, with initial condition $(t_0, \phi^*)$. Then from equation 2.1, for all i = 1, 2, ..., n and $t \ge t_0$,

$$\frac{dx_i(t; t_0, \phi^*)}{dt} = -x_i(t; t_0, \phi^*) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t).$$

Hence, for all i = 1, 2, ..., n and $t + \omega \ge t_0$,

$$\frac{dx_i(t + \omega; t_0, \phi^*)}{dt} = -x_i(t + \omega; t_0, \phi^*) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t + \omega) = -x_i(t + \omega; t_0, \phi^*) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t).$$

This implies that $x(t + \omega; t_0, \phi^*)$ is also a state of DCNN, equation 2.1, with initial condition $(t_0, \phi^*)$. $x_{\phi^*}^{(\omega)} = \phi^*$ implies that for all $t \ge t_0$,

$$x(t + \omega; t_0, \phi^*) = x(t; t_0, x_{\phi^*}^{(\omega)}) = x(t; t_0, \phi^*).$$
Hence, $x(t; t_0, \phi^*)$ is a periodic orbit of DCNN, equation 2.1, with period ω. From equation 3.5, it is easy to see that any state of DCNN, equation 2.1, with initial condition $(t_0, \phi)$, $\phi \in C([t_0 - \tau, t_0], D_2)$, converges to this periodic orbit exponentially as $t \to +\infty$. Hence, the isolated periodic orbit $x(t; t_0, \phi^*)$ located in $D_2$ is locally exponentially attractive, and $D_2$ is a locally exponentially attractive set of $x(t; t_0, \phi^*)$. Since there exist $2^n$ elements in $\Omega_2$, there exist $2^n$ isolated periodic orbits in $\Omega_2$, and such $2^n$ isolated periodic orbits are locally exponentially attractive.

When the periodic external input u(t) degenerates into a constant vector, we have the following corollary:

Corollary 1. If for all $i \in \{1, 2, \ldots, n\}$ and $t \ge t_0$, $u_i(t) \equiv u_i$ (constant), and

$$|u_i| < a_{ii} - 1 - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}|,$$

then DCNN (see equation 2.1) has $2^n$ locally exponentially stable equilibrium points.
Proof. Since $u_i(t) \equiv u_i$ (constant), for an arbitrary constant $\nu \in \mathbb{R}$, $u_i(t + \nu) \equiv u_i \equiv u_i(t)$. According to theorem 1, DCNN, equation 2.1, has $2^n$ locally exponentially attractive limit cycles with period ν. The arbitrariness of the constant ν implies that such limit cycles are fixed points. Hence, DCNN, equation 2.1, has $2^n$ locally exponentially attractive equilibrium points.

Remark 1. In theorem 1 and corollary 1, it is necessary for $a_{ii}$ to be dominant, in the sense that $a_{ii} > 1 + \sum_{j=1, j \ne i}^{n} |a_{ij}| + \sum_{j=1}^{n} |b_{ij}|$.

Remark 2. A main objective in designing associative memories is to store a large number of patterns as stable equilibria or limit cycles, such that stored patterns can be retrieved when the initial probes contain sufficient information about the patterns. CNNs and DCNNs are also suitable for very large-scale integration (VLSI) implementations of associative memories. It is also expected that they can be applied to associative memories by storing patterns as periodic limit cycles. According to theorem 1 and corollary 1, the n-neuron DCNN model, equation 2.1, can store up to $2^n$ patterns in locally exponentially attractive limit cycles or equilibria, which can be retrieved when the input vector satisfies condition 3.1. This implies that the external stimuli also play a major role in encoding and decoding patterns in DCNN associative memories, in contrast with the zero input vector in bidirectional associative memories and in autoassociative memories based on the Hopfield network.
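Condition 3.1 can be checked mechanically. The sketch below is our illustration (the function name and the use of a single bound $\sup_t |u_i(t)|$ per neuron are assumptions, not notation from the letter); it returns the margin of condition 3.1 for each neuron, and theorem 1 applies when every margin is positive.

```python
import numpy as np

def theorem1_margins(A, B, u_sup):
    """Per-neuron margin of condition (3.1):
    a_ii - 1 - sum_{j != i} |a_ij| - sum_j |b_ij| - sup_t |u_i(t)|."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    off_diag = np.abs(A).sum(axis=1) - np.abs(np.diag(A))  # sum_{j != i} |a_ij|
    return (np.diag(A) - 1.0 - off_diag
            - np.abs(B).sum(axis=1) - np.asarray(u_sup, dtype=float))
```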
4 Locally Exponentially Attractive Periodicity in a Designated Region

As the limit cycles are stimulus driven (nonautonomous), some information can be encoded in the phases of the oscillating states $x_i$ relative to the inputs $u_i$. Hence, it is necessary to find some conditions on the inputs $u_i$ when the periodic orbit x(t) is desired to be located in a designated region. In this section, we give conditions that allow a periodic orbit to be locally exponentially attractive and located in any designated region.
Theorem 2. If for all $t \ge t_0$,

$$u_i(t) < -1 + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) - \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|), \quad i \in N_1, \qquad (4.1)$$

$$u_i(t) > 1 + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|), \quad i \in N_2, \qquad (4.2)$$

$$u_i(t) < 1 - a_{ii} - \sum_{j \in N_3, j \ne i} |a_{ij}| - \sum_{j \in N_3} |b_{ij}| + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}), \quad i \in N_3, \qquad (4.3)$$

$$u_i(t) > a_{ii} - 1 + \sum_{j \in N_3, j \ne i} |a_{ij}| + \sum_{j \in N_3} |b_{ij}| + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}), \quad i \in N_3, \qquad (4.4)$$

and $\tau_{ij}(t) = \tau_{ij}(t + \omega)$ for all $i \in \{1, 2, \ldots, n\}$, $j \in N_3$, then DCNN, equation 2.1, has only one limit cycle located in $D_1$, which is locally exponentially attractive in $D_1$.

Proof. If $x(s) \in D_1$ for all $s \in [t_0 - \tau, t]$, then from equation 2.1, for all $s \in [t_0, t]$ and all i = 1, 2, ..., n,

$$\frac{dx_i(s)}{ds} = -x_i(s) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} a_{ij} x_j(s) + \sum_{j \in N_3} b_{ij} x_j(s - \tau_{ij}(s)) + u_i(s). \qquad (4.5)$$
When $i \in N_1$ and $x_i(t) = -1$, from equations 4.1 and 4.5,

$$\frac{dx_i(t)}{dt} \le 1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|) + u_i(t) < 0. \qquad (4.6)$$

When $i \in N_2$ and $x_i(t) = 1$, from equations 4.2 and 4.5,

$$\frac{dx_i(t)}{dt} \ge -1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + u_i(t) > 0. \qquad (4.7)$$

When $i \in N_3$ and $x_i(t) = 1$, from equations 4.3 and 4.5,

$$\frac{dx_i(t)}{dt} \le -1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + a_{ii} + \sum_{j \in N_3, j \ne i} |a_{ij}| + \sum_{j \in N_3} |b_{ij}| + u_i(t) < 0. \qquad (4.8)$$
When $i \in N_3$ and $x_i(t) = -1$, from equations 4.4 and 4.5,

$$\frac{dx_i(t)}{dt} \ge 1 - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) - a_{ii} - \sum_{j \in N_3, j \ne i} |a_{ij}| - \sum_{j \in N_3} |b_{ij}| + u_i(t) > 0. \qquad (4.9)$$

Equations 4.6 to 4.9 imply that if $\phi(s) \in D_1$ for all $s \in [t_0 - \tau, t_0]$, then $x(t; t_0, \phi)$ remains in $D_1$, and $D_1$ is an invariant set of DCNN (see equation 2.1). So $x(t) \in D_1$ for all $t \ge t_0 - \tau$. Hence, for all $t \ge t_0$, DCNN, equation 2.1, can be rewritten as

$$\frac{dx_i(t)}{dt} = -x_i(t) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} a_{ij} x_j(t) + \sum_{j \in N_3} b_{ij} x_j(t - \tau_{ij}(t)) + u_i(t), \quad i = 1, 2, \ldots, n. \qquad (4.10)$$

Let $x(t; t_0, \phi)$ and $x(t; t_0, \varphi)$ be two states of DCNN (equation 4.10) with initial conditions $(t_0, \phi)$ and $(t_0, \varphi)$, where $\phi, \varphi \in C([t_0 - \tau, t_0], D_1)$. From equation 4.10, for all i = 1, 2, ..., n and $t \ge t_0$,

$$\frac{d(x_i(t; t_0, \phi) - x_i(t; t_0, \varphi))}{dt} = -(x_i(t; t_0, \phi) - x_i(t; t_0, \varphi)) + \sum_{j \in N_3} \big( a_{ij} (x_j(t; t_0, \phi) - x_j(t; t_0, \varphi)) + b_{ij} (x_j(t - \tau_{ij}(t); t_0, \phi) - x_j(t - \tau_{ij}(t); t_0, \varphi)) \big). \qquad (4.11)$$

Let $y_i(t) = x_i(t; t_0, \phi) - x_i(t; t_0, \varphi)$. Then from equation 4.11, for i = 1, ..., n and $t \ge t_0$,

$$\frac{dy_i(t)}{dt} = -y_i(t) + \sum_{j \in N_3} a_{ij} y_j(t) + \sum_{j \in N_3} b_{ij} y_j(t - \tau_{ij}(t)). \qquad (4.12)$$

From equations 4.3 and 4.4, for $i \in N_3$,

$$a_{ii} + |b_{ii}| + \sum_{j \in N_3, j \ne i} (|a_{ij}| + |b_{ij}|) + \Big| \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) - u_i(t) \Big| < 1.$$

Hence, there exists ϑ > 0 such that

$$(1 - a_{ii}) - \Big( \sum_{j \in N_3, j \ne i} |a_{ij}| + \sum_{j \in N_3} |b_{ij}| \exp\{\vartheta\tau\} + \vartheta \Big) \ge 0, \quad i \in N_3. \qquad (4.13)$$
Consider a subsystem of equation 4.12:

$$\frac{dy_i(t)}{dt} = -y_i(t) + \sum_{j \in N_3} a_{ij} y_j(t) + \sum_{j \in N_3} b_{ij} y_j(t - \tau_{ij}(t)), \quad t \ge t_0, \ i \in N_3. \qquad (4.14)$$

Denote $\|\bar{y}\|_{t_0} = \max_{t_0 - \tau \le s \le t_0} \{\|y(s)\|\}$. Then for $i \in N_3$ and all $t \ge t_0$, $|y_i(t)| \le \|\bar{y}\|_{t_0} \exp\{-\vartheta(t - t_0)\}$. Otherwise, one of the following two cases holds:

Case i. There exist $t_2 > t_1 \ge t_0$, $k \in N_3$, and sufficiently small $\varepsilon_1 > 0$ such that $y_k(t_1) - \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_1 - t_0)\} = 0$, $y_k(t_2) - \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_2 - t_0)\} = \varepsilon_1$, and, when $s \in [t_0 - \tau, t_2]$, for all $i \in N_3$, $|y_i(s)| - \|\bar{y}\|_{t_0} \exp\{-\vartheta(s - t_0)\} \le \varepsilon_1$, and

$$\frac{dy_k(t)}{dt}\Big|_{t=t_1} + \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_1 - t_0)\} \ge 0, \qquad \frac{dy_k(t)}{dt}\Big|_{t=t_2} + \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_2 - t_0)\} > 0. \qquad (4.15)$$

Case ii. There exist $t_4 > t_3 \ge t_0$, $j \in N_3$, and sufficiently small $\varepsilon_2 > 0$ such that $y_j(t_3) + \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_3 - t_0)\} = 0$, $y_j(t_4) + \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_4 - t_0)\} = -\varepsilon_2$, and, when $s \in [t_0 - \tau, t_4]$, for all $i \in N_3$, $|y_i(s)| - \|\bar{y}\|_{t_0} \exp\{-\vartheta(s - t_0)\} \ge -\varepsilon_2$, and

$$\frac{dy_j(t)}{dt}\Big|_{t=t_3} - \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_3 - t_0)\} \le 0, \qquad \frac{dy_j(t)}{dt}\Big|_{t=t_4} - \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_4 - t_0)\} < 0. \qquad (4.16)$$

It follows from equations 4.13 and 4.14 that for $k \in N_3$,

$$\frac{dy_k(t)}{dt}\Big|_{t=t_2} = -y_k(t_2) + \sum_{j \in N_3} \big( a_{kj} y_j(t_2) + b_{kj} y_j(t_2 - \tau_{kj}(t_2)) \big) \le \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_2 - t_0)\} \Big( -1 + a_{kk} + \sum_{j \in N_3, j \ne k} |a_{kj}| + \sum_{j \in N_3} |b_{kj}| \exp\{\vartheta\tau\} + \vartheta \Big) - \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_2 - t_0)\} + \varepsilon_1 \Big( -1 + a_{kk} + \sum_{j \in N_3, j \ne k} |a_{kj}| + \sum_{j \in N_3} |b_{kj}| \Big) \le -\vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_2 - t_0)\}.$$
This contradicts equation 4.15. Similarly, it follows from equations 4.13 and 4.14 that

$$\frac{dy_j(t)}{dt}\Big|_{t=t_4} \ge \vartheta \|\bar{y}\|_{t_0} \exp\{-\vartheta(t_4 - t_0)\}.$$

This contradicts equation 4.16. The two contradictions show that for $i \in N_3$ and all $t \ge t_0$, $|y_i(t)| \le \|\bar{y}\|_{t_0} \exp\{-\vartheta(t - t_0)\}$. Hence, according to lemma 2, there exists $\bar{\vartheta} > 0$ such that for all i = 1, 2, ..., n and $t \ge t_0$,

$$|x_i(t; t_0, \phi) - x_i(t; t_0, \varphi)| \le \|\phi, \varphi\|_{t_0} \exp\{-\bar{\vartheta}(t - t_0)\}. \qquad (4.17)$$

Define $x_\phi^{(t)}(\theta) = x(t + \theta; t_0, \phi)$, $\theta \in [t_0 - \tau, t_0]$. From equations 4.6 to 4.9, if $\phi \in C([t_0 - \tau, t_0], D_1)$, then $x_\phi^{(t)} \in C([t_0 - \tau, t_0], D_1)$. Define a mapping $\bar{H} : C([t_0 - \tau, t_0], D_1) \to C([t_0 - \tau, t_0], D_1)$ by $\bar{H}(\phi) = x_\phi^{(\omega)}$. Then $\bar{H}(C([t_0 - \tau, t_0], D_1)) \subset C([t_0 - \tau, t_0], D_1)$, and $\bar{H}^m(\phi) = x_\phi^{(m\omega)}$. Similar to the proof of theorem 1, there exists a periodic orbit $x(t; t_0, \phi^*)$ of DCNN, equation 2.1, with period ω such that $x(t; t_0, \phi^*) \in D_1$ for all $t \ge t_0$, and all other states of DCNN, equation 2.1, with initial condition $(t_0, \phi)$, $\phi \in C([t_0 - \tau, t_0], D_1)$, converge to this periodic orbit exponentially as $t \to +\infty$. Hence, the isolated periodic orbit $x(t; t_0, \phi^*)$ located in $D_1$ is locally exponentially attractive, and $D_1$ is a locally exponentially attractive set of $x(t; t_0, \phi^*)$.

Remark 3. From equations 4.1 to 4.4, we can see that the input vector u(t) can control the location of a limit cycle that represents a memory pattern in a designated region. Specifically, when condition 4.1 holds, the corresponding coordinate of the limit cycle is located in $(-\infty, -1)$; when condition 4.2 holds, the corresponding coordinate is located in $(1, +\infty)$; and when conditions 4.3 and 4.4 hold, the corresponding coordinate is located in $[-1, 1]$.

When $N_3$ is empty, we have the following corollary:

Corollary 2. Let $N_1 \cup N_2 = \{1, 2, \ldots, n\}$, and let $N_1 \cap N_2$ be empty. If for all $t \ge t_0$,

$$u_i(t) < \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) - 1, \quad i \in N_1, \qquad (4.18)$$

$$u_i(t) > \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) + 1, \quad i \in N_2, \qquad (4.19)$$
then DCNN, equation 2.1, has exactly one limit cycle located in $D_2$, and such a limit cycle is locally exponentially attractive.

Proof. Let $N_3$ in theorem 2 be an empty set. According to theorem 2, corollary 2 holds.

When the periodic external input u(t) degenerates into a constant, we have the following corollary:

Corollary 3. If for all $t \ge t_0$, $u_i(t) \equiv u_i$ (constant), and

$$u_i < -1 + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) - \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|), \quad i \in N_1,$$

$$u_i > 1 + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} (|a_{ij}| + |b_{ij}|), \quad i \in N_2,$$

$$u_i < 1 - a_{ii} - \sum_{j \in N_3, j \ne i} |a_{ij}| - \sum_{j \in N_3} |b_{ij}| + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}), \quad i \in N_3,$$

$$u_i > a_{ii} - 1 + \sum_{j \in N_3, j \ne i} |a_{ij}| + \sum_{j \in N_3} |b_{ij}| + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij}), \quad i \in N_3,$$

and $\tau_{ij}(t) \equiv \tau_{ij}$ (constant) for all $i \in \{1, 2, \ldots, n\}$, $j \in N_3$, then DCNN, equation 2.1, has only one equilibrium point located in $D_1$, which is locally exponentially stable.

Proof. Since $u_i(t) \equiv u_i$ (constant), for an arbitrary constant $\nu \in \mathbb{R}$, $u_i(t + \nu) \equiv u_i \equiv u_i(t)$. According to theorem 2, DCNN, equation 2.1, has only one limit cycle located in $D_1$, which is locally exponentially attractive in $D_1$. The arbitrariness of the constant ν implies that such a limit cycle is a fixed point. Hence, DCNN, equation 2.1, has only one equilibrium point located in $D_1$, which is locally exponentially stable.
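The bounds in conditions 4.1 to 4.4 depend only on A, B, and the chosen partition $(N_1, N_2, N_3)$. The following sketch is our illustration (function name and the 0-based index convention are ours); it computes, for each neuron, the upper or lower bound that $u_i(t)$ must respect for all t.

```python
import numpy as np

def theorem2_bounds(A, B, N1, N2, N3):
    """Bounds of conditions (4.1)-(4.4) for a designated region D1,
    using 0-based neuron indices. Returns (upper, lower): u_i(t) must
    satisfy u_i(t) < upper[i] and u_i(t) > lower[i] wherever defined."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    AB = A + B
    s1 = AB[:, N1].sum(axis=1)                  # sum_{j in N1} (a_ij + b_ij)
    s2 = AB[:, N2].sum(axis=1)                  # sum_{j in N2} (a_ij + b_ij)
    s3 = (np.abs(A[:, N3]) + np.abs(B[:, N3])).sum(axis=1)
    upper, lower = {}, {}
    for i in N1:
        upper[i] = -1.0 + s1[i] - s2[i] - s3[i]           # condition (4.1)
    for i in N2:
        lower[i] = 1.0 + s1[i] - s2[i] + s3[i]            # condition (4.2)
    for i in N3:
        rest = s3[i] - abs(A[i, i])             # drops |a_ii|, keeps |b_ii|
        upper[i] = 1.0 - A[i, i] - rest + s1[i] - s2[i]   # condition (4.3)
        lower[i] = A[i, i] - 1.0 + rest + s1[i] - s2[i]   # condition (4.4)
    return upper, lower
```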
5 Globally Exponentially Attractive Periodicity in a Designated Region

In order to obtain optimal spatiotemporal coding in the periodic orbit and reduce computational time, it is desirable for a neural network to be globally exponentially attractive to a periodic orbit in a designated region. In this section, we give some conditions that allow a periodic orbit to be globally exponentially attractive and to be located in any designated region.

Theorem 3. If for all $t \ge t_0$,

$$u_i(t) < -1 - \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_1, \qquad (5.1)$$
$$u_i(t) > 1 + \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_2, \qquad (5.2)$$

$$|u_i(t)| < 1 - a_{ii} - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}|, \quad i \in N_3, \qquad (5.3)$$

and $\tau_{ij}(t) = \tau_{ij}(t + \omega)$ for all $i \in \{1, 2, \ldots, n\}$, $j \in N_3$, then DCNN, equation 2.1, has a unique limit cycle located in $D_1$, and such a limit cycle is globally exponentially attractive.

Proof. When $i \in N_1$ and $x_i(t) \ge -1$, from equations 2.1 and 5.1,

$$\frac{dx_i(t)}{dt} \le 1 + \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|) + u_i(t) < 0. \qquad (5.4)$$

When $i \in N_2$ and $x_i(t) \le 1$, from equations 2.1 and 5.2,

$$\frac{dx_i(t)}{dt} \ge -1 - \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|) + u_i(t) > 0. \qquad (5.5)$$

When $i \in N_3$ and $x_i(t) \le -1$, from equations 2.1 and 5.3,

$$\frac{dx_i(t)}{dt} \ge 1 - a_{ii} - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}| + u_i(t) > 0. \qquad (5.6)$$

When $i \in N_3$ and $x_i(t) \ge 1$, from equations 2.1 and 5.3,

$$\frac{dx_i(t)}{dt} \le -1 + a_{ii} + \sum_{j=1, j \ne i}^{n} |a_{ij}| + \sum_{j=1}^{n} |b_{ij}| + u_i(t) < 0. \qquad (5.7)$$

Equations 5.4 to 5.7 imply that $x(t; t_0, \phi)$ will enter and remain in $D_1$, where $\phi \in C([t_0 - \tau, t_0], \mathbb{R}^n)$. So there exists T > 0 such that $x(t) \in D_1$ for all $t \ge T$. Hence, for all $t \ge T + \tau$, DCNN, equation 2.1, can be rewritten as

$$\frac{dx_i(t)}{dt} = -x_i(t) - \sum_{j \in N_1} (a_{ij} + b_{ij}) + \sum_{j \in N_2} (a_{ij} + b_{ij}) + \sum_{j \in N_3} a_{ij} x_j(t) + \sum_{j \in N_3} b_{ij} x_j(t - \tau_{ij}(t)) + u_i(t), \quad i = 1, 2, \ldots, n.$$
Similar to the proof of theorem 2, DCNN, equation 2.1, has a unique limit cycle located in $D_1$, and such a limit cycle is globally exponentially attractive.

Remark 4. By comparison, we can see that if conditions 5.1 to 5.3 hold, then conditions 4.1 to 4.4 also hold, but not vice versa, as will be shown in examples 2 and 3. In other words, the conditions in theorem 3 are stronger than those in theorem 2.

When $N_1 \cup N_2$ is empty, we have the following corollary:

Corollary 4. If $\tau_{ij}(t) = \tau_{ij}(t + \omega)$ for all $i, j \in \{1, 2, \ldots, n\}$, and for all $t \ge t_0$,

$$|u_i(t)| < 1 - a_{ii} - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}|,$$
then the DCNN, equation 2.1, has a unique limit cycle located in $[-1, 1]^n$, which is globally exponentially attractive.

Proof. Choose $N_3 = \{1, 2, \ldots, n\}$ in theorem 3. According to theorem 3, the corollary holds.

When $N_3$ is empty, we have the following corollary:

Corollary 5. Let $N_3$ be empty. If

$$u_i(t) < -1 - \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_1, \qquad (5.8)$$

$$u_i(t) > 1 + \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_2, \qquad (5.9)$$
then DCNN, equation 2.1, has a unique limit cycle located in $D_2$. Moreover, such a limit cycle is globally exponentially attractive.

Proof. Since $N_3$ is empty, according to theorem 3, corollary 5 holds.

Remark 5. Since $-1 - \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|) \le -1 + \sum_{j \in N_1} (a_{ij} + b_{ij}) - \sum_{j \in N_2} (a_{ij} + b_{ij})$, if condition 5.8 holds, then condition 4.18 also holds, but not vice versa. Similarly, if condition 5.9 holds, then condition 4.19 also holds, but not vice versa. This implies that the conditions in corollary 5 are stronger than those in corollary 2. In addition, corollary 5 shows that a DCNN has a globally exponentially attractive limit cycle if its periodic external stimulus is sufficiently strong.
When the periodic external input u(t) degenerates into a constant vector, we have the following corollary:

Corollary 6. If for all $t \ge t_0$, $u_i(t) \equiv u_i$ (constant), and

$$u_i < -1 - \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_1,$$

$$u_i > 1 + \sum_{j=1}^{n} (|a_{ij}| + |b_{ij}|), \quad i \in N_2,$$

$$|u_i| < 1 - a_{ii} - \sum_{j=1, j \ne i}^{n} |a_{ij}| - \sum_{j=1}^{n} |b_{ij}|, \quad i \in N_3,$$

and $\tau_{ij}(t) \equiv \tau_{ij}$ (constant) for all $i \in \{1, 2, \ldots, n\}$, $j \in N_3$, then DCNN, equation 2.1, has a unique equilibrium point located in $D_1$ and is globally exponentially stable at such an equilibrium point.

Proof. Since $u_i(t) \equiv u_i$ (constant), for an arbitrary constant $\nu \in \mathbb{R}$, $u_i(t + \nu) \equiv u_i(t) \equiv u_i$. According to theorem 3, DCNN, equation 2.1, has a unique limit cycle located in $D_1$, which is globally exponentially attractive. The arbitrariness of the constant ν implies that such a limit cycle is a fixed point. Hence, DCNN, equation 2.1, has a unique equilibrium point located in $D_1$, and such an equilibrium point is globally exponentially stable.
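Analogously, the global conditions 5.1 to 5.3 reduce to per-neuron bounds built from absolute row sums. A minimal sketch of this check (ours; 0-based indices assumed):

```python
import numpy as np

def theorem3_bounds(A, B, N1, N2, N3):
    """Bounds of conditions (5.1)-(5.3): u_i(t) < upper[i] for i in N1,
    u_i(t) > lower[i] for i in N2, and |u_i(t)| < band[i] for i in N3."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    row = (np.abs(A) + np.abs(B)).sum(axis=1)   # sum_j (|a_ij| + |b_ij|)
    upper = {i: -1.0 - row[i] for i in N1}      # condition (5.1)
    lower = {i: 1.0 + row[i] for i in N2}       # condition (5.2)
    band = {i: 1.0 - A[i, i]
               - (np.abs(A[i]).sum() - abs(A[i, i]))
               - np.abs(B[i]).sum() for i in N3}            # condition (5.3)
    return upper, lower, band
```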
6 Illustrative Examples

In this section, we give three numerical examples to illustrate the new results.
6.1 Example 1. Consider a CNN, where

$$A = \begin{pmatrix} 2 & 0.2 & 0.2 \\ 0.2 & 2.4 & 0.4 \\ 0.2 & 0.6 & 2.4 \end{pmatrix}, \qquad u(t) = \begin{pmatrix} 0.5 \sin(t) \\ -0.6 \cos(t) \\ -0.2(\sin(t) + \cos(t)) \end{pmatrix}.$$

According to theorem 1, this CNN has $2^3 = 8$ limit cycles, which are locally exponentially attractive. Simulation results with 136 random initial states are depicted in Figures 1 to 3.
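The example can be reproduced with a few lines of code. This is our sketch: it assumes the matrix entries as reconstructed above and uses forward Euler integration (a CNN has no delay terms, so no initial history is needed). Each random initial state settles into one of the eight saturation regions.

```python
import numpy as np

# Matrix and input as reconstructed above (an assumption of this sketch).
A = np.array([[2.0, 0.2, 0.2],
              [0.2, 2.4, 0.4],
              [0.2, 0.6, 2.4]])
u = lambda t: np.array([0.5 * np.sin(t),
                        -0.6 * np.cos(t),
                        -0.2 * (np.sin(t) + np.cos(t))])
f = lambda v: (np.abs(v + 1.0) - np.abs(v - 1.0)) / 2.0

rng = np.random.default_rng(0)
dt, steps = 0.01, 3000
for trial in range(8):
    x = rng.uniform(-4.0, 4.0, size=3)      # random initial state
    for k in range(steps):
        x = x + dt * (-x + A @ f(x) + u(k * dt))
    print(np.sign(x))                       # which saturation region was reached
```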
Figure 1: Transient behavior of x1 , x2 , x3 in Example 1.
Figure 2: Transient behavior of (x1 , x3 ) in Example 1.
Figure 3: Transient behavior of (x1 , x2 , x3 ) in Example 1.
6.2 Example 2. Consider a CNN, where
$$A = \begin{pmatrix} 2 & 0.2 & 0.2 \\ 0.5 & 0.4 & 0.5 \\ -1.4 & 0.4 & 0.8 \end{pmatrix}, \qquad u(t) = \begin{pmatrix} 0.5 \sin(t) \\ 0.5 \cos(t) \\ 0.5(\sin(t) + \cos(t)) \end{pmatrix}.$$
Choose $N_1 = \{1\}$, $N_2 = \{3\}$, $N_3 = \{2\}$. Since $u_1(t) < -1 + a_{11} - a_{13} - |a_{12}|$, $a_{22} + |a_{21} - a_{23} - u_2(t)| < 1$, and $u_3(t) > 1 + a_{31} - a_{33} + |a_{32}|$, according to theorem 2, this CNN has a limit cycle located in $D_1 = \{x \mid x_1 < -1, |x_2| \le 1, x_3 > 1\}$, which is locally exponentially attractive in $D_1$. Choose $N_1 = \{3\}$, $N_2 = \{1\}$, $N_3 = \{2\}$. Since $u_1(t) > 1 + a_{13} - a_{11} + |a_{12}|$, $a_{22} + |a_{21} - a_{23} - u_2(t)| < 1$, and $u_3(t) < -1 + a_{33} - a_{31} - |a_{32}|$, according to theorem 2, this CNN has a limit cycle located in $D_1 = \{x \mid x_3 < -1, |x_2| \le 1, x_1 > 1\}$, which is locally exponentially attractive in $D_1$. However, since $u_1(t) = 0.5 \sin(t) < -1 - (2 + 0.2 + 0.2) = -3.4$ does not hold (i.e., condition 5.1 does not hold), the conditions in theorem 3 are not satisfied. Simulation results with 136 random initial states are depicted in Figures 4 and 5.
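The three inequalities of the first choice can be verified numerically. In the sketch below (ours; it assumes the matrix entries as reconstructed above, with 0-based indices), each printed value should be True:

```python
import numpy as np

A = np.array([[2.0, 0.2, 0.2],
              [0.5, 0.4, 0.5],
              [-1.4, 0.4, 0.8]])
u_sup = np.array([0.5, 0.5, 0.5 * np.sqrt(2.0)])   # sup_t |u_i(t)|

# First choice: N1 = {1}, N2 = {3}, N3 = {2} (1-based, as in the text).
print(u_sup[0] < -1 + A[0, 0] - A[0, 2] - abs(A[0, 1]))    # condition (4.1)
print(A[1, 1] + abs(A[1, 0] - A[1, 2]) + u_sup[1] < 1)     # conditions (4.3)-(4.4)
print(-u_sup[2] > 1 + A[2, 0] - A[2, 2] + abs(A[2, 1]))    # condition (4.2)
```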
Figure 4: Transient behavior of x1 , x2 , x3 in Example 2.
Figure 5: Transient behavior of (x1 , x2 , x3 ) in Example 2.
Figure 6: Transient behavior of x1 , x2 , x3 in Example 3.
6.3 Example 3. Consider a CNN, where
$$A = \begin{pmatrix} 0.2 & 0.2 & 0.2 \\ 0.2 & -2 & 0.6 \\ 0.2 & 0.2 & 0.2 \end{pmatrix}, \qquad u(t) = \begin{pmatrix} 0.8 \sin(t) - 2.6 \\ 0.8 \cos(t) \\ 0.8(\sin(t) + \cos(t)) + 2.8 \end{pmatrix}.$$
Choose $N_1 = \{1\}$, $N_2 = \{3\}$, $N_3 = \{2\}$. Since $u_1(t) < -1 - \sum_{j=1}^{3} |a_{1j}|$, $|u_2(t)| < 1 - a_{22} - \sum_{j=1, j \ne 2}^{3} |a_{2j}|$, and $u_3(t) > 1 + \sum_{j=1}^{3} |a_{3j}|$, according to theorem 3, this CNN has a limit cycle located in $D_1 = \{x \mid x_1 < -1, |x_2| \le 1, x_3 > 1\}$, which is globally exponentially attractive. Since $u_1(t) < -1 + a_{11} - |a_{12}| - a_{13}$, $u_2(t) < 1 - a_{22} + a_{21} - a_{23}$, $u_2(t) > -1 + a_{22} + a_{21} - a_{23}$, and $u_3(t) > 1 + a_{31} + |a_{32}| - a_{33}$, conditions 4.1 to 4.4 also hold. According to theorem 2, this CNN has a limit cycle located in $D_1$, which is also locally exponentially attractive. However, since $a_{11} > 0$ and $a_{33} > 0$, the conditions in Yi et al. (2003) cannot be used to ascertain the complete stability of this CNN. Simulation results with 136 random initial states are depicted in Figures 6 and 7.
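Again, the global conditions can be verified numerically. In this sketch (ours, assuming the entries reconstructed above), each printed value should be True:

```python
import numpy as np

A = np.array([[0.2, 0.2, 0.2],
              [0.2, -2.0, 0.6],
              [0.2, 0.2, 0.2]])
# Input extrema: u1 peaks at -1.8, |u2| peaks at 0.8,
# u3 dips to 2.8 - 0.8 * sqrt(2).
u1_max, u2_max, u3_min = -1.8, 0.8, 2.8 - 0.8 * np.sqrt(2.0)

print(u1_max < -1 - np.abs(A[0]).sum())                    # condition (5.1)
print(u2_max < 1 - A[1, 1] - np.abs(A[1, [0, 2]]).sum())   # condition (5.3)
print(u3_min > 1 + np.abs(A[2]).sum())                     # condition (5.2)
```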
Figure 7: Transient behavior of (x1 , x2 , x3 ) in Example 3.
7 Concluding Remarks

Rhythmicity represents one of the most striking manifestations of dynamic behavior in biological systems. CNNs and DCNNs, which have been shown to be capable of operating in a pacemaker or pattern-generator mode, are studied here as oscillatory mechanisms in response to periodic external stimuli. Some information can be encoded in the oscillating activation states relative to external inputs, and these relative phases change as a function of the chosen limit cycle. In this article, we show that the number of locally exponentially attractive periodic orbits located in saturation regions of a DCNN is exponential in the number of neurons. In view of the fact that neural information is often desired to be encoded in a designated region, we also give conditions that allow a globally exponentially attractive periodic orbit to be located in any designated region. The theoretical results are supplemented by simulation results in three illustrative examples.
Acknowledgments

This work was supported by the Hong Kong Research Grants Council under grant CUHK4165/03E, the Natural Science Foundation of China
under grant 60405002, and the China Postdoctoral Science Foundation under grant 2004035579.
References

Belair, J., Campbell, S., & Driessche, P. (1996). Frustration, stability and delay-induced oscillation in a neural network model. SIAM J. Applied Math., 56, 245–255.
Berns, D. W., Moiola, J. L., & Chen, G. (1998). Predicting period-doubling bifurcations and multiple oscillations in nonlinear time-delayed feedback systems. IEEE Trans. Circuits Syst. I, 45(7), 759–763.
Chen, K., Wang, D. L., & Liu, X. (2000). Weight adaptation and oscillatory correlation for image segmentation. IEEE Trans. Neural Networks, 11, 1106–1123.
Chua, L. O., & Roska, T. (1990). Stability of a class of nonreciprocal cellular neural networks. IEEE Trans. Circuits Syst., 37, 1520–1527.
Chua, L. O., & Roska, T. (1992). Cellular neural networks with nonlinear and delay-type template elements and non-uniform grids. Int. J. Circuit Theory and Applicat., 20, 449–451.
Civalleri, P. P., Gilli, M., & Pandolfi, L. (1993). On stability of cellular neural networks with delay. IEEE Trans. Circuits Syst. I, 40, 157–164.
Gopalsamy, K., & Leung, I. (1996). Delay induced periodicity in a neural netlet of excitation and inhibition. Physica D, 89, 395–426.
Jiang, H., & Teng, Z. (2004). Boundedness and stability for nonautonomous bidirectional associative neural networks with delay. IEEE Trans. Circuits Syst. II, 51(4), 174–180.
Jin, H. L., & Zacksenhouse, M. (2003). Oscillatory neural networks for robotic yo-yo control. IEEE Trans. Neural Networks, 14(2), 317–325.
Kanamaru, T., & Sekine, M. (2004). An analysis of globally connected active rotators with excitatory and inhibitory connections having different time constants using the nonlinear Fokker-Planck equations. IEEE Trans. Neural Networks, 15(5), 1009–1017.
Kosaku, Y. (1978). Functional analysis. Berlin: Springer-Verlag.
Liao, X. X., & Wang, J. (2003). Algebraic criteria for global exponential stability of cellular neural networks with multiple time delays. IEEE Trans. Circuits and Syst. I, 50(2), 268–275.
Liao, X., Wu, Z., & Yu, J. (1999). Stability switches and bifurcation analysis of a neural network with continuous delay. IEEE Trans. Systems, Man and Cybernetics, Part A, 29(6), 692–696.
Liu, Z., Chen, A., Cao, J., & Huang, L. (2003). Existence and global exponential stability of periodic solution for BAM neural networks with periodic coefficients and time-varying delays. IEEE Trans. Circuits Syst. I, 50(9), 1162–1173.
Nishikawa, T., Lai, Y. C., & Hoppensteadt, F. C. (2004). Capacity of oscillatory associative-memory networks with error-free retrieval. Physical Review Letters, 92(10), 108101.
Roska, T., Wu, C. W., Balsi, M., & Chua, L. O. (1992). Stability and dynamics of delay-type general and cellular neural networks. IEEE Trans. Circuits Syst. I, 39, 487–490.
Roska, T., Wu, C. W., & Chua, L. O. (1993). Stability of cellular neural networks with dominant nonlinear and delay-type templates. IEEE Trans. Circuits Syst. I, 40, 270–272.
Ruiz, A., Owens, D. H., & Townley, S. (1998). Existence, learning, and replication of periodic motion in recurrent neural networks. IEEE Trans. Neural Networks, 9(4), 651–661.
Setti, G., Thiran, P., & Serpico, C. (1998). An approach to information propagation in 1-D cellular neural networks, part II: Global propagation. IEEE Trans. Circuits and Syst. I, 45(8), 790–811.
Takahashi, N. (2000). A new sufficient condition for complete stability of cellular neural networks with delay. IEEE Trans. Circuits Syst. I, 47, 793–799.
Townley, S., Ilchmann, A., Weiss, M. G., McClements, W., Ruiz, A. C., Owens, D. H., & Pratzel-Wolters, D. (2000). Existence and learning of oscillations in recurrent neural networks. IEEE Trans. Neural Networks, 11, 205–214.
Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Networks, 6, 941–948.
Wang, L., & Zou, X. (2004). Capacity of stable periodic solutions in discrete-time bidirectional associative memory neural networks. IEEE Trans. Circuits Syst. II, 51(6), 315–319.
Yi, Z., Heng, P. A., & Vadakkepat, P. (2002). Absolute periodicity and absolute stability of delayed neural networks. IEEE Trans. Circuits Syst. I, 49(2), 256–261.
Yi, Z., Tan, K. K., & Lee, T. H. (2003). Multistability analysis for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 15(3), 639–662.
Zeng, Z., Wang, J., & Liao, X. X. (2003). Global exponential stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circuits and Syst. I, 50(10), 1353–1358.
Zeng, Z., Wang, J., & Liao, X. X. (2004). Stability analysis of delayed cellular neural networks described using cloning templates. IEEE Trans. Circuits and Syst. I, 51(11), 2313–2324.
Received November 30, 2004; accepted June 28, 2005.
LETTER
Communicated by Ennio Mingolla
Smooth Gradient Representations as a Unifying Account of Chevreul's Illusion, Mach Bands, and a Variant of the Ehrenstein Disk

Matthias S. Keil
[email protected]
Instituto de Microelectrónica de Sevilla, Centro Nacional de Microelectrónica, E-41012 Seville, Spain
Recent evidence suggests that the primate visual system generates representations for object surfaces (where we consider representations for the surface attribute brightness). Object recognition can be expected to perform robustly if those representations are invariant despite environmental changes (e.g., in illumination). In real-world scenes, it happens, however, that surfaces are often overlaid by luminance gradients, which we define as smooth variations in intensity. Luminance gradients encode highly variable information, which may represent surface properties (curvature), nonsurface properties (e.g., specular highlights, cast shadows, illumination inhomogeneities), or information about depth relationships (cast shadows, blur). We argue, on grounds of the unpredictable nature of luminance gradients, that the visual system should establish corresponding representations, in addition to surface representations. We accordingly present a neuronal architecture, the so-called gradient system, which clarifies how spatially accurate gradient representations can be obtained by relying on only high-resolution retinal responses. Although the gradient system was designed and optimized for segregating, and generating, representations of luminance gradients with real-world luminance images, it is capable of quantitatively predicting psychophysical data on both Mach bands and Chevreul's illusion. It furthermore accounts qualitatively for a modified Ehrenstein disk.

1 Introduction

Reflectance, or albedo, is a physical property of surface materials that measures how much of the incident light is reflected from a surface. In real-world environments, reflectance is often composed of a diffusive or lambertian component (light is scattered isotropically in all directions) and a specular component (light is scattered anisotropically in a limited subset of directions). Recent psychophysical data show that the visual system can suppress the specular component for judging the apparent color of a surface (Todd, Norman, & Mingolla, 2004). Specifically, it was found that
achromatic color or lightness is almost entirely determined by the diffusive component of surface reflectance. On the other hand, there is now evidence that the visual system generates explicit representations for object surfaces (e.g., Komatsu, Murakami, & Kinoshita, 1996; Rossi, Rittenhouse, & Paradiso, 1996; MacEvoy, Kim, & Paradiso, 1998; Bakin, Nakayama, & Gilbert, 2000; Castelo-Branco, Goebel, Neuenschwander, & Singer, 2000; Kinoshita & Komatsu, 2001; Sasaki & Watanabe, 2004). The neuronal activity associated with these representations is thought to encode perceived surface attributes, such as color, motion, lightness, or depth (e.g., Paradiso & Nakayama, 1991; Rossi & Paradiso, 1996; Paradiso & Hahn, 1996; Davey, Maddess, & Srinivasan, 1998; Nishina, Okada, & Kawato, 2003; Sasaki & Watanabe, 2004).

Given that lightness constancy is preserved in the presence of specular highlights, the visual system must somehow identify and discount the specular component of surface reflectance in the (neuronal) representations of lightness or brightness, respectively. Lightness constancy means that the perception of achromatic surface reflectance remains the same despite changes in illumination. Notice that changes in illumination may be caused not only by changes of the illumination source per se, such as changes in intensity or spectral composition, but also by, for example, self-motion of the organism, or the presence of different objects near the object under consideration (Bloj, Kersten, & Hurlbert, 1999). Lightness constancy implies that the activity of a corresponding neuronal representation remains approximately constant in spite of changes in illumination, an effect that is experimentally observed as early as in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996). Since specular highlights are viewpoint and illumination dependent, lightness constancy implies furthermore that highlights should be discounted from lightness representations. As obvious from introspection, however, highlights are not discounted perceptually, and one may argue accordingly that the different components of surface reflectance (diffusive and specular) are segregated and subsequently encoded by separated representations.

This hypothesis is supported by two observations. First, the location of highlights on an object's surfaces varies with viewing direction. When trying to recover an object's 3D shape from disparity cues ("shape from stereo"), the left and the right retina see the highlights at different positions, and thus binocular disparity should provide contradicting information about shape from stereo. However, it was demonstrated that perceptual highlights actually enhance the appearance of stereoscopic depth instead of impairing it (Todd, Norman, Koenderink, & Kappers, 1997). Second, when an object is observed in motion (shape from motion), specular highlights do not stay at fixed positions on the object's surfaces. Rather, they are subjected to deformation. Even so, human observers reveal no apparent difficulties in interpreting them (Norman & Todd, 1994). Segregating the constant components of surface reflectance from its variable components and encoding these components by different representations would have the advantage that downstream mechanisms
for object recognition can select, at each instant, the most reliable cue for determining an object's 3D shape in order to ensure steadily robust performance for object recognition.

Specular highlights are an example of a smooth gradation in surface luminance. Such gradations are often referred to as shading. Shading can encode important information about 3D surface structure (curvature; e.g., Todd & Mingolla, 1983; Mingolla & Todd, 1986; Ramachandran, 1988; Todd & Reichel, 1989; Knill & Kersten, 1991). For instance, a golf ball can be distinguished visually from a table tennis ball by the smooth intensity variations occurring at its dimples. However, shading is not an unambiguous cue for deriving an object's 3D shape, since shading(-like) effects may be generated by various sources, such as the spectral composition of illumination, local surface reflectance, local surface orientation, penumbras of cast shadows, attenuation of light with distance, interreflections among different surfaces, and observer position (Tittle & Todd, 1995; Todd, 2003; Todd et al., 2004).

Shading effects represent, from a more general point of view, a subset of smooth luminance gradients. Here, we define luminance gradients as smooth variations in intensity,[1] which includes optical effects such as focal blur (due to the limited depth-of-field representation of the retinal optics; see Elder & Zucker, 1998). Local image blur provides important information about the arrangement of objects in depth (depth from blur; e.g., Deschênes, Ziou, & Fuchs, 2004). In general, smooth gradients in luminance can be generated by spatial variations in reflectance of a surface that is homogeneously illuminated, or by gradual variations in illumination of a surface with constant reflectance (Paradiso, 2000). Obviously the information conveyed by luminance gradients may subserve different purposes, such as recovering the 3D shape of objects, estimating surface reflectance, depth estimation from blur, or depth estimation from shadows (Kersten, Mamassian, & Knill, 1997; Mamassian, Knill, & Kersten, 1998).

[1] This is to say that the relevant perceptual quantity considered here is the degree of smoothness of a luminance variation. We disregard the direction of such variations.

It is important to recognize that smooth gradients can encode surface properties (e.g., curvature) or not (e.g., shadows). This means that the visual system cannot always bind smooth gradients to surfaces in a static fashion, or that at some point in the visual system, smooth gradients must be segregated from surface representations. Given that lightness constancy is already observed in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996) and smooth gradients are not lost to perception, we propose that they should be explicitly represented in the visual system, in parallel with, and as early as, corresponding surface representations. Therefore, smooth luminance gradients could be unspecifically suppressed in V1 surface representations, without losing corresponding information. In this way, gradients could be recruited or ignored, respectively, by downstream mechanisms for object recognition, dependent on whether they aid in disambiguating the recognition process.

Although mechanisms were proposed for generating two-dimensional (e.g., Grossberg & Todorović, 1988; Pessoa & Ross, 2000; Neumann, Pessoa, & Hansen, 2001) or three-dimensional (e.g., Grossberg & Pessoa, 1998; Kelly & Grossberg, 2000; Grossberg & Howe, 2003) surface representations, theories addressing the processing and representation of luminance gradients in the visual system are scarce (an approach based on differential geometry was suggested by Ben-Shahar, Huggins, & Zucker, 2002; see section 4). Here we propose a corresponding two-dimensional (monocular) architecture, called a gradient system. The gradient system is thought to operate in parallel, and to interact, with mechanisms for generating surface representations (representations for both chromatic and achromatic surface reflectance). Put another way, we propose that representations of brightness gradients as generated by the gradient system are involved in lightness computations.

Direct evidence for separated representations of gradients and surfaces comes from the observation that chromatic variations and spatially aligned luminance gradients occur in the first place due to variations in surface reflectance. In contrast, shadows, or shading, generate luminance variations, which are independent from chromatic variations. In line with these observations, recent evidence suggests that the visual system utilizes chromatic and achromatic information for different purposes (Kingdom, 2003), where achromatic information seems to generate the 3D shape for chromatic surfaces.

Apart from the conceptional advantages of having gradient representations, the purpose of the gradient system is to clarify, by suggesting corresponding neuronal mechanisms, how such representations can be generated with high spatial acuity from foveal retinal responses. The gradient system consists of two further subsystems. The first subsystem detects luminance gradients by using solely foveal retinal responses. The second subsystem generates representations of luminance gradients. Notice that luminance gradients may extend over distances that may exceed several times the receptive field size of foveal ganglion cells and V1 cells, respectively. Our proposed mechanism detects gradients indirectly, by attenuating high spatial frequency information, which is typically met at lines and points (even-symmetric luminance features) or sharply bounded surface edges (odd-symmetric features), respectively. In what follows, we designate these latter features nongradient features. Attenuation of the nongradient features results in sparse local measurements for gradient evidence in the first subsystem. These measurements are often noisy and fragmented. The second subsystem generates from these measurements smooth and dense gradient representations by means of a novel diffusion paradigm (clamped diffusion), which is proposed as an underlying neural substrate. Clamped diffusion shares some analogies with filling-in mechanisms, since both involve a locally controlled lateral
propagation of activity. Filling in was proposed as a mechanism for creating cortical surface representations (see above references), and is now supported both psychophysically (Gerrits, de Haan, & Vendrik, 1966; Gerrits & Vendrik, 1970; Paradiso & Nakayama, 1991; Rossi & Paradiso, 1996; Paradiso & Hahn, 1996; Davey et al., 1998; Rudd & Arrington, 2001) and neurophysiologically (Rossi et al., 1996; Rossi & Paradiso, 1999; but see Pessoa & Neumann, 1998; Pessoa, Thompson, & Noë, 1998). To represent luminance gradients within the filling-in framework, Grossberg and Mingolla (1987) demonstrated that boundary webs can build up in regions of continuous luminance gradients. Whereas nongradient features (e.g., lines and edges) usually constitute impermeable barriers for the filling-in process, boundary webs can lead to a partial blocking, thereby generating effects of smooth shading in perceived brightness (see also Pessoa, Mingolla, & Neumann, 1995). Notice, however, that by incorporating smooth gradients into surface representations, segregation of such gradients is only postponed. As exemplified above, lightness constancy and illumination-invariant object recognition, respectively, imply the suppression of those smooth gradients that are not properties of object surfaces.

In general, most models do not distinguish explicitly between surfaces (achromatic surface reflectance) and gradients (specular highlights, local blur), thus ignoring the diverse semantics of luminance gradients. This is to say that typically a mixed representation of both structures is created, which stands in contrast to having them separated as proposed here. Typical approaches generate such mixed representations by superimposing responses from bandpass filters of various sizes or scales (e.g., Blakeslee & McCourt, 1997, 1999, 2001; Sepp & Neumann, 1999; McArthur & Moulden, 1999), implying that gradient information is implicitly represented in filter responses.

In a nutshell, the proposed framework is as follows. The visual system computes bottom-up estimates for brightness gradients and surface brightness in parallel. At this level, surface representations are assumed to be devoid of any activity corresponding to smooth luminance gradients. Although not considered in this article, the latter process may involve low-level interactions between representations for surfaces and gradients. Subsequently, gradient representations contribute to deriving reliable estimates for surface lightness. Lightness computations may involve feedback from downstream mechanisms involved in object recognition, hence establishing high-level interactions between surface and gradient representations.

Our line of argumentation is similar to one adopted earlier (Ginsburg, 1975): if gradient representations do exist in the visual system, then it should be possible for the gradient system to predict certain perceptual data. This is indeed the case: the gradient system accounts for psychophysical data on Mach bands and Chevreul's illusion and a variant of the Ehrenstein disk (with line inducers), despite not taking into account possible interactions
between representations for surfaces and gradients. Notice that these illusions are traditionally explained by quite different mechanisms. The universality and robustness of the proposed mechanisms are further demonstrated by successfully segregating luminance gradients with real-world images.

2 Formulation of the Gradient System

The gradient system is composed of two further subsystems. The first detects gradient evidence in a given luminance distribution L. The output of the first subsystem represents sparse activity maps for gradient evidence. From these sparse maps, dense representations for brightness gradients are subsequently recovered by the second subsystem. The model was designed under the assumption that all stages relax to equilibrium before changes in luminance occur.[2] This is to say that we define L such that $\partial L_{ij}(t)/\partial t = 0$, where (i, j) are spatial indices and t denotes time. The latter assumption is consistent with the notion that shading and blur are so-called pictorial depth cues, which are available with individual and static images (Todd, 2003). In all simulations, black was assigned the intensity L = 0 (minimum value) and white L = 1 (maximum value). Responses of all cells are presumed to represent average firing rates. Model parameters were optimized for obtaining good results for gradient segregation over a set of real-world images. The exact parameter values are not significant for the model's performance, as long as they stay within their given orders of magnitude. Our model is minimally complex in the sense that it cannot be reduced further: each equation and parameter, respectively, is crucial to arrive at our conclusions.

[2] When simulating real-time perception, however, model stages can no longer be expected to reach a steady state.

2.1 Retinal Stage. The retina is the first stage in processing luminance information. We assume that the cortex has access only to responses of the lateral geniculate nucleus (LGN) and treat the LGN as relaying retinal responses. We also assume that any residual DC part in the retinal responses was discounted (see Neumann, 1994; Pessoa et al., 1995; Rossi & Paradiso, 1999). No additional information, such as large filter scales or an "extra luminance channel" (low-pass information), is used (Burt & Adelson, 1983; du Buf, 1992; Neumann, 1994; Pessoa et al., 1995; du Buf & Fischer, 1995). Ganglion cell responses to a luminance distribution L are computed under the assumption of space-time separability (Wandell, 1995; Rodieck, 1965), where we assumed a constant temporal term by definition. The receptive field structure of our ganglion cells approximates a Laplacian (Marr & Hildreth, 1980), with center activity C and surround activity S. Center width is 1 pixel; hence, $C_{ij} \equiv L_{ij}$. Surround activity is computed by convolving L
with a 3 × 3 kernel with zero weight in the center, $\exp(-1)/\eta$ for the four nearest neighbors, and $\exp(-2)/\eta$ for the four next-nearest neighbors. The constant η is chosen such that the kernel integrates to one. Retinal responses are evaluated at the steady state of

$$\frac{dV_{ij}(t)}{dt} = g_{leak}(V_{rest} - V_{ij}) + E_{cent} C_{ij} + E_{surr} S_{ij} + g_{ij}^{(si)} (E_{si} - V_{ij}), \qquad (2.1)$$

where $g_{leak} = 1$ denotes the leakage conductance of the cell membrane and $V_{rest} = 0$ is the cell's resting potential. $g_{ij}^{(si)} \equiv [E_{cent} C_{ij} + E_{surr} S_{ij}]^+$ denotes self-inhibition with reversal potential $E_{si} = 0$, where $[\cdot]^+ \equiv \max(\cdot, 0)$. Self-inhibition implements the compressive and nonlinear response curve observed in biological X-type cells (Kaplan, Lee, & Shapley, 1990). From equation 2.1, we obtain two types of retinal responses, one selective for luminance increments (ON-type) and another selective for luminance decrements (OFF-type). ON-cell responses $x_{ij}^\oplus \equiv [V_{ij}]^+$ are obtained by setting $E_{cent} = 1$ and $E_{surr} = -1$, and OFF-cell responses $x_{ij}^\ominus \equiv [V_{ij}]^+$ by $E_{cent} = -1$ and $E_{surr} = 1$. Equation 2.1 implies that responses of associated ON and OFF cells are equal, or balanced, except for a sign reversal (parallel pathways).[3] Although a given type of biological ON and OFF ganglion cells constitutes nearly parallel pathways with respect to the impulse response function kinetics (DeVries & Baylor, 1997; Benardete & Kaplan, 1999; Keat, Reinagel, Reid, & Meister, 2001; Chichilnisky & Kalmar, 2002), they do not with respect to other properties. For instance, ON cells fire spontaneously at a higher rate than OFF cells under photopic stimulation (Cleland, Levick, & Sanderson, 1973; Kaplan, Purpura, & Shapley, 1987; Troy & Robson, 1992; Passaglia, Enroth-Cugell, & Troy, 2001; Chichilnisky & Kalmar, 2002; Zaghloul, Boahen, & Demb, 2003). Model predictions for synthetic stimuli (such as a luminance ramp or a sine wave grating) are not influenced in a significant way by using different retinal models (e.g., the one proposed in Pessoa et al., 1995, or one with tonic ON activity). The use of different retinal models, however, leads to different results for real-world images. These results are worse with one channel, but not the other, having tonic baseline activity (not shown). Due to the absence of tonic or baseline activity, equation 2.1 may be considered a compact formulation for describing the outcome of a retino-geniculate competition (Maffei & Fiorentini, 1972), which previously has been modeled by separated stages (Neumann, 1996).

[3] Associated responses are responses that "belong together"; they were generated by exactly one luminance feature (such as an edge, a line, or a bar).
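A compact way to see what this stage computes is to solve equation 2.1 directly at steady state: setting $dV_{ij}/dt = 0$ with $V_{rest} = E_{si} = 0$ gives $V_{ij} = (E_{cent} C_{ij} + E_{surr} S_{ij})/(g_{leak} + g_{ij}^{(si)})$. The sketch below is our illustration of that computation (function names are invented, and the boundary handling of the convolution is an arbitrary choice):

```python
import numpy as np
from scipy.ndimage import convolve

def retina_on_off(L, g_leak=1.0):
    """Steady-state ON/OFF ganglion responses of equation 2.1."""
    e1, e2 = np.exp(-1.0), np.exp(-2.0)
    eta = 4.0 * e1 + 4.0 * e2              # makes the kernel integrate to one
    k = np.array([[e2, e1, e2],
                  [e1, 0.0, e1],
                  [e2, e1, e2]]) / eta
    C = np.asarray(L, dtype=float)         # 1-pixel center: C_ij = L_ij
    S = convolve(C, k, mode='nearest')     # surround activity

    def response(drive):
        g_si = np.maximum(drive, 0.0)      # self-inhibition conductance
        return np.maximum(drive / (g_leak + g_si), 0.0)

    x_on = response(C - S)                 # E_cent = 1, E_surr = -1
    x_off = response(S - C)                # E_cent = -1, E_surr = 1
    return x_on, x_off
```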
Figure 1: Two types of luminance gradients. (A) A sine wave grating is an example for a nonlinear gradient. Graphs (columns for a fixed row number) show luminance (big curves) and corresponding retinal ON-responses (small and darker curves) and OFF-responses (small curves). (B) A ramp is an example for a linear gradient. At the “knee points,” where the luminance ramp meets the bright and dark luminance plateau, respectively, Mach bands are perceived (Mach, 1865). Mach bands are illusory overshoots in brightness (labeled “bright”) and darkness (“dark”) at positions indicated by the arrows.
2.2 Subsystem I: Gradient Detection. Ganglion cell responses provide the input to the first stage, the detection subsystem. Responses of our ganglion cells are computed by using a Laplacian-like receptive field. Hence, lines or edges (sharply bounded luminance features) generate responses with activity peaks of ON- and OFF-cells occurring in close vicinity (see Neumann, 1994, for a mathematical treatment). In contrast, luminance gradients (e.g., blurred lines and edges) give rise to spatially more separated ON- and OFF-activity peaks. This leads to the notion that luminance gradients can be detected by suppressing ON- and OFF-responses occurring closely together, while at the same time enhancing more separated responses (see Figure 1). Of course, finding ON- and OFF-responses that are close together amounts to boundary detection. For our purposes, the specific mechanism for detecting boundaries is not relevant; it is the quality of the resulting boundary map that matters. Thus, in order to keep things simple, we detect ON- and OFF-responses occurring in close vicinity (= nongradients) by multiplicative interactions between retinal channels:

$$\frac{dg_{ij}^{1\circ}(t)}{dt} = -g_{leak}\, g_{ij}^{1\circ} + x_{ij}^\oplus (1 + g_{ij}^{1\bullet})(E_{ex} - g_{ij}^{1\circ}) + \nabla^2 g_{ij}^{1\circ},$$
$$\frac{dg_{ij}^{1\bullet}(t)}{dt} = -g_{leak}\, g_{ij}^{1\bullet} + x_{ij}^\ominus (1 + g_{ij}^{1\circ})(E_{ex} - g_{ij}^{1\bullet}) + \nabla^2 g_{ij}^{1\bullet}. \qquad (2.2)$$
The superscript ◦ denotes activity that encodes brightness, and • denotes darkness. Superscript numbers indicate the respective level of the model
hierarchy. $\nabla^2$ is the Laplacian, which models lateral voltage spread (e.g., Winfree, 1995; Lamb, 1976; Naka & Rushton, 1967). In neurophysiological terms, the latter equation approximates responses of orientationally pooled cortical complex cells (we understand cortical complex cells as being equivalently excited by even-symmetric [lines, points] and odd-symmetric [edges] luminance features; cf. Hubel & Wiesel, 1962, 1968). Notice that one could substitute equation 2.2 by a more sophisticated boundary processing stage, such as proposed, for example, in Gove, Grossberg, and Mingolla (1995) and Mingolla, Ross, and Grossberg (1999), if one aimed to increase the biological realism of our model. We used adiabatic boundary conditions in our simulations.

The leakage conductance $g_{leak} = 0.35$ specifies the spatial extent of lateral voltage spread (cf. Benda, Bock, Rujan, & Ammermüller, 2001). Therefore, varying $g_{leak}$ provides an easy means for adjusting the spatial frequency sensitivity of the detection circuit: lower values of $g_{leak}$ make equation 2.2 respond to image features with a higher degree of blur (see Figure 7, bottom graph). Excitatory input (reversal potential $E_{ex} = 1$) into both diffusion layers is provided by two sources. The first source is the output of retinal cells $x^\oplus$ and $x^\ominus$, respectively. Their activities propagate laterally in each respective diffusion layer. If the input is caused by nongradients (e.g., lines or edges), then in the end, activities in both layers spatially overlap. This situation is captured by multiplicative interaction between diffusion layers and provides the second excitatory input (terms $x_{ij}^\oplus g_{ij}^{1\bullet}$ and $x_{ij}^\ominus g_{ij}^{1\circ}$, respectively). Thus, multiplicative interaction leads to mutual amplification of activity in both layers at nongradient positions. Equations 2.2 were fixed-point-iterated 50 times at steady state.[4] Finally, the activity of (notice that $g^{1\circ}, g^{1\bullet} \ge 0$ for all t)

$$g_{ij}^{(2)} = g_{ij}^{1\circ} \times g_{ij}^{1\bullet} \qquad (2.3)$$
correlates with high-spatial-frequency features at position (i, j) (see Gabbiani, Krapp, Koch, & Laurent, 2002, for a biophysical mechanism implementing multiplication). Nongradient suppression is brought about by inhibitory activity g^{in}_{ij} ≡ α · g^{(2)}_{ij}. Inhibitory activity interacts with retinal responses by means of gradient neurons:

\frac{dg_{ij}^{3\circ}(t)}{dt} = -g_{\text{leak}}\,g_{ij}^{3\circ} + x_{ij}^{\oplus}\bigl(E_{\text{ex}} - g_{ij}^{3\circ}\bigr) + g_{ij}^{\text{in}}\bigl(E_{\text{in}} - g_{ij}^{3\circ}\bigr),
\frac{dg_{ij}^{3\bullet}(t)}{dt} = -g_{\text{leak}}\,g_{ij}^{3\bullet} + x_{ij}^{\ominus}\bigl(E_{\text{ex}} - g_{ij}^{3\bullet}\bigr) + g_{ij}^{\text{in}}\bigl(E_{\text{in}} - g_{ij}^{3\bullet}\bigr).    (2.4)

4. We compared steady-state fixed-point iteration with direct integration using Euler's method and an explicit fourth-order Runge-Kutta scheme (Δt = 0.75). We did not observe any differences in our results.
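Because equation 2.4 contains no lateral spread term, its steady state can be written in closed form. A minimal sketch (function name ours; parameter values are those given in the text below):

import numpy as np

def gradient_neurons(x_on, x_off, g2, alpha=35.0,
                     g_leak=0.75, E_ex=1.0, E_in=-1.0):
    # Closed-form fixed point of equation 2.4 (set dg/dt = 0, solve for g).
    # g2 is the nongradient map of equation 2.3.
    g_in = alpha * g2    # inhibition driven by nongradients
    g3_on = (x_on * E_ex + g_in * E_in) / (g_leak + x_on + g_in)
    g3_off = (x_off * E_ex + g_in * E_in) / (g_leak + x_off + g_in)
    return g3_on, g3_off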
α = 35 is an inhibitory weight constant, chosen such that the last equation cannot produce responses to isolated lines and edges. Equation 2.4 is evaluated at steady state (g_leak = 0.75, E_ex = 1, and E_in = −1). The output g̃^{3◦}_{ij} and g̃^{3•}_{ij} of the two parts of equation 2.4 is computed by means of an activity-dependent threshold Θ ≡ β · ⟨g^{(2)}⟩ (the operator ⟨·⟩ computes the average value over all spatial indices):

\tilde{g}_{ij}^{3\circ} = \begin{cases} g_{ij}^{3\circ} & \text{if } g_{ij}^{3\circ} > \Theta \\ 0 & \text{otherwise.} \end{cases}    (2.5)
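A sketch of the thresholding step (assuming, as reconstructed above, that the threshold is β times the spatial mean of g^{(2)}):

import numpy as np

def adaptive_threshold(g3, g2, beta=1.75):
    # Equation 2.5: the threshold adapts to the mean nongradient
    # activity over all positions instead of being a fixed constant.
    theta = beta * g2.mean()
    return np.where(g3 > theta, g3, 0.0)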
g̃^{3•}_{ij} is computed in an analogous way. Activity-dependent thresholding can be considered an adaptational mechanism at the network level. Note that, as opposed to simple, fixed-valued thresholds, it depends on the activation of all other neurons. In this respect it is similar to the so-called quenching threshold proposed by Grossberg (1983). Here, it reduces responses to spurious gradient features.5 β = 1.75 is a constant for elevating the threshold such that most of the spurious gradient features are suppressed. The robustness of gradient detection with real-world images is further improved by eliminating those spurious features that survived the previous stages. For this purpose, adjacent responses g̃^{3◦} and g̃^{3•} are subjected to mutual inhibition. Mutual inhibition is implemented via the operator A, which adds an activity (say, a_ij) at position (i, j) to its four nearest neighbors; that is, A(a_ij) implies a_{i−1,j} := a_{i−1,j} + a_ij, a_{i,j−1} := a_{i,j−1} + a_ij, a_{i+1,j} := a_{i+1,j} + a_ij, and a_{i,j+1} := a_{i,j+1} + a_ij, where the symbol := denotes "is replaced by." A is reduced to existing neighbors at boundary locations. Then

\tilde{g}_{ij}^{4\circ} = \bigl[\tilde{g}_{ij}^{3\circ} - A\bigl(\tilde{g}_{ij}^{3\bullet}\bigr)\bigr]^{+},
\tilde{g}_{ij}^{4\bullet} = \bigl[\tilde{g}_{ij}^{3\bullet} - A\bigl(\tilde{g}_{ij}^{3\circ}\bigr)\bigr]^{+}    (2.6)

(where [·]^{+} denotes half-wave rectification)
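The neighbor-spreading operator A and the rectified mutual inhibition of equation 2.6 can be sketched as follows (names are ours; the array slicing automatically reduces A to existing neighbors at the image borders):

import numpy as np

def op_A(a):
    # Operator A: each activity is added to its four nearest neighbors.
    out = a.copy()
    out[:-1, :] += a[1:, :]
    out[1:, :] += a[:-1, :]
    out[:, :-1] += a[:, 1:]
    out[:, 1:] += a[:, :-1]
    return out

def mutual_inhibition(g3_on_thr, g3_off_thr):
    # Equation 2.6: half-wave-rectified mutual inhibition between the
    # thresholded brightness and darkness gradient maps.
    g4_on = np.maximum(g3_on_thr - op_A(g3_off_thr), 0.0)
    g4_off = np.maximum(g3_off_thr - op_A(g3_on_thr), 0.0)
    return g4_on, g4_off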
defines the output of the first subsystem, where g̃^{4◦} represents gradient brightness and g̃^{4•} gradient darkness. Notice that this output by itself cannot yet give rise to gradient percepts since, for example, the g̃^{4◦} and g̃^{4•} responses to a luminance ramp do not correspond to human perception (see Figure 1). Although no responses of equation 2.1 and equation 2.6, respectively, are obtained along the ramp, humans nevertheless perceive a brightness gradient there. This observation leads to the proposal of a second subsystem, where perceived gradient representations are generated by means of a novel diffusion paradigm (clamped diffusion). The responses of the individual stages are illustrated in Figure 2 for a real-world image at various levels of gaussian blur.

5. These are erroneously detected gradients that would survive for Θ = 0.
Figure 2: Subsystem I stages. Visualization of individual stages for an unblurred image (top row) and gaussian-blurred images (σ = 1 and σ = 2 pixels, respectively). The column luminance shows the input (both L ∈ [0, 1]), retina shows the output of equation 2.1 (x⊕ [x⊖] activity is indicated by lighter [darker] gray), and nongradients shows the output of equation 2.3. Finally, gradient neurons shows the output of equation 2.4. Images were normalized individually to achieve better visualization. The number at each image denotes its respective maximum activity value.
2.3 Subsystem II: Gradient Generation.

Luminance gradients can be subdivided into two classes, according to their retinal responses (see Figure 2). Responses of model ganglion cells are obtained to luminance regions with varying slope (nonlinear gradients), but they are unresponsive within regions of constant slope (linear gradients).6 Consequently, for a luminance ramp, we obtain nonzero ganglion cell responses only at the "knee points" where the luminance ramp meets the plateaus. In contrast, ganglion cell responses smoothly follow a sine wave–modulated luminance function. Ideally, we would like the spatial layout of gradient representations as generated by the model to be isomorphic with perceived brightness gradients.7 Thus,

6. In biological ganglion cells, this distinction may be relaxed. In fact, what matters is that (retinal) responses to linear and nonlinear gradients are different.
7. For image processing tasks, such isomorphism may be understood such that the input (stimulus) and the output (gradient representation) are equal. In general, however, such equivalence is not true for lightness perception, due to nonlinear information processing in the visual system (e.g., the presence of compressive functions).
we have to explicitly generate a gradient in perceived activity in the case of a linear luminance gradient, whereas perceived activity should match the retinal activity pattern with nonlinear gradients (perceived or perceptual activity denotes the activity that leads to the perception of luminance gradients). The generation and representation of both types of luminance gradients are accomplished by a single mechanism, the clamped diffusion equation:

\frac{dg_{ij}^{(5)}(t)}{dt} = -g_{\text{leak}}\,g_{ij}^{(5)} + \gamma\,g_{ij}^{(2)}\bigl(E_{\text{in}} - g_{ij}^{(5)}\bigr) + \underbrace{\bigl(\tilde{g}_{ij}^{4\circ} + x_{ij}^{\oplus}\bigr)}_{\text{source}} - \underbrace{\bigl(\tilde{g}_{ij}^{4\bullet} + x_{ij}^{\ominus}\bigr)}_{\text{sink}} + \nabla^2 g_{ij}^{(5)}.    (2.7)
Parameter values are g_leak = 0.0025, E_in = 0, and γ = 250. Equation 2.7 was integrated through fixed-point iteration at steady state (see note 4). A brightness source (or, equivalently, a darkness sink) is defined as retinal ON-activity x⊕ enhanced by gradient brightness g̃^{4◦}. Likewise, a darkness source (brightness sink) is defined by OFF-activity x⊖ enhanced by gradient darkness g̃^{4•}. g^{(5)} already expresses perceptual gradient activity: perceived darkness is represented by negative values of g^{(5)} and perceived brightness by positive values. Note that no thresholding operation is applied. The resting potential of equation 2.7 is zero and is identified with the perceptual Eigengrau value (Gerrits et al., 1966; Gerrits & Vendrik, 1970; Gerrits, 1979; Knau & Spillman, 1997). In a biophysically realistic scenario, equation 2.7 has to be replaced by two equations: one for describing perceived brightness and one for perceived darkness. Then, to compute perceptual activity, rectified darkness activity has to be subtracted from, and rectified brightness activity has to be added to, a suitably chosen Eigengrau value. The single equation 2.7 is equivalent to the "brightness-minus-darkness" case for a zero Eigengrau value, given that both the equation for brightness and the equation for darkness make use of shunting inhibition (reversal potentials = 0) but not of subtractive inhibition (reversal potentials < 0). The anchoring process with an Eigengrau level is adopted as a simple means for visualizing the output of the gradient system, especially with real-world images. It is not supposed to represent a fully qualified mechanism for brightness anchoring (cf. Gilchrist et al., 1999). Rather, the gradient system could in principle help solve the lightness anchoring problem for surface representations, for example, according to the recently proposed blurred-highest-luminance-as-white (BHLAW) rule (Grossberg & Hong, 2003; Hong & Grossberg, 2004).
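Again assuming a Jacobi-style fixed-point iteration (and reusing nbr_sum from the sketch of equation 2.2), equation 2.7 could be iterated as follows; because E_in = 0, the shunting inhibition appears only in the denominator of the fixed point:

import numpy as np

def clamped_diffusion(x_on, x_off, g4_on, g4_off, g2,
                      g_leak=0.0025, gamma=250.0, n_iter=2000):
    src = g4_on + x_on      # brightness sources (= darkness sinks)
    snk = g4_off + x_off    # darkness sources (= brightness sinks)
    g5 = np.zeros_like(x_on, dtype=float)
    for _ in range(n_iter):
        # dg/dt = 0 solved for g; the -4*g part of the Laplacian and the
        # shunting term gamma * g2 both end up in the denominator.
        g5 = (src - snk + nbr_sum(g5)) / (g_leak + gamma * g2 + 4.0)
    return g5    # > 0: perceived brightness; < 0: perceived darkness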
Equation 2.7 establishes a monotonic relationship between intensity and perceived gradient activity by (1) the leakage conductance g_leak, (2) shunting inhibition (E_in = 0) triggered by nongradient features (e.g., lines and edges), and (3) diffusion, which brings about an activity flux between clamped sources and sinks. Shunting inhibition by nongradient features can be conceived of as locally increasing the leakage conductance at positions of sharply bounded luminance features. In total, equation 2.7 achieves gradient segregation by two mechanisms: enhancement of gradient activity by gradient neurons (see equation 2.4) and simultaneous suppression of nongradient features by equation 2.3.

3 Model Predictions

Results obtained from the gradient system are evaluated in the context of brightness perception. As an approximation, we assume that representations for surfaces and brightness gradients are computed independently of each other (i.e., they do not interact). Also, any activity associated with smooth luminance gradients is assumed to be discounted from surface representations. Although only predictions from the gradient system are shown below, we assume that overall brightness is obtained by some linear combination of both representations, thereby neglecting that, for example, downstream mechanisms could suppress or enhance one or the other representation. Both representations are thought to be in spatial register (retinotopic). By definition, in all graphs here, negative values represent perceived darkness activity and positive values perceived brightness activity. In the images shown, perceived brightness (darkness) corresponds to brighter (darker) levels of gray.

3.1 Nonlinear and Linear Gradients.

The evolution of g^{(5)} for a nonlinear and a linear luminance gradient is juxtaposed in Figure 3, where a sine wave grating and a triangular-shaped luminance profile (or triangular grating), respectively, served as input. For the latter stimulus, humans perceive Mach band–like overshoots in brightness and darkness. In fact, psychophysical data from Ross, Morrone, and Burr (1989) support the hypothesis that Mach bands and the Mach band–like overshoots derive from the same neuronal mechanism, given their similar detection thresholds. Our predictions are consistent with this finding: both types of Mach features correspond to clamped sources and sinks, respectively, which are visible while gradient representations for luminance ramps are generated. This means that while representations for linear gradients are generated, clamped sources and sinks remain visible for relatively longer than the actual gradient representation. The situation is different for nonlinear gradients: luminance and clamped sources and sinks have similar activity profiles, and hence corresponding gradient representations retain their spatial layout at each time step (see Figures 1 and 3). In other words, with nonlinear gradients, the dynamics of
Figure 3: State of equation 2.7 at different times. The images show the perceptual activity g^{(5)}, with numbers indicating elapsed time steps. The luminance images show the respective stimuli: a sine wave grating (= nonlinear gradient) and a triangular-shaped luminance profile (= linear gradient).
Figure 4: Perceptual activity for linear and nonlinear gradients. Images (insets; 128 × 128 pixels) show gradient representations (i.e., the state of equation 2.7 at t = 500 iterations), and graphs show the corresponding profile g^{(5)}_{ij} for i = 64, j = 1, . . . , 128, along with normalized luminance (see legend). Stimuli were the same as in Figure 3.
clamped diffusion preserves the shape of retinal responses at gradient positions, and perceived activity patterns retain a constant shape (compare Figure 1 with Figure 4). In order to understand the relationship between input intensity levels and the magnitude of perceived gradient activity, we now consider the convergence behavior of equation 2.7 for both gradient types. In the top graph of Figure 5, a sine wave grating was used to examine the interaction of clamped sources and sinks in the presence of negligible inhibition by nongradients (such as lines or edges). For the two gratings with equal spatial frequency (0.03 cycles per pixel), the initial activity difference was preserved during the dynamics. However, doubling the spatial frequency of the full-contrast grating to 0.06 cycles per pixel leads to a lower level of converged activity.
Figure 5: Evolution of equation 2.7 with time. (A) Evolution of g^{(5)} for a sine wave grating of different contrasts and spatial frequencies at the bright phase position (see inset; full contrast means that L is scaled to the range 0, . . . , 1). (B) Evolution of g^{(5)} at the position of the bright Mach band for different ramp widths (see inset).
This is because an increase in spatial frequency decreases the separation between clamped sources and sinks, which in turn causes an increased activity flux between them. The bottom graph of Figure 5 demonstrates, with a luminance ramp, the effect of shunting inhibition by nongradient features. Although decreasing the ramp width also leads to an increased activity flux between clamped sources and sinks, shunting inhibition by nongradient features (in this case, a luminance edge) is nevertheless the dominating effect in establishing the observed convergence behavior. Large ramp widths (10 pixels) do not trigger nongradient inhibition, and even with intermediate ramp widths (5 pixels), nongradient inhibition is negligible (cf. Figure 7). However, for small ramp widths (2 pixels), nongradient inhibition decreases final activity levels significantly. Summarizing, the final magnitude of perceptual activity |g^{(5)}| depends monotonically on input intensity and the initial amplitude of clamped sources and sinks, respectively, but this monotonic relationship is modulated by the strength of nongradient inhibition and the spatial separation between clamped sources and sinks.

3.2 Linear Gradients and Mach Bands.

After more than a century of investigation, there is still no agreement about the mechanisms underlying the generation of Mach bands (Pessoa, 1996; Ratliff, 1965). Within our theory, both Mach bands and the Mach band–like overshoots associated with a triangular-shaped luminance profile occur as a result of clamped sources and sinks remaining visible for relatively longer than the gradient representation that is about to form. Ross et al. (1989) measured Mach band strength as a function of the spatial frequency of a trapezoidal wave and found a maximal perceived strength at some intermediate frequency (inverted-U behavior). Both narrow and wide ramps decreased the visibility of Mach bands, and Mach bands were barely seen, or not visible at all, at luminance steps.
Figure 6: Threshold contrasts for seeing light (A) and dark (B) Mach bands according to Ross et al. (1989). The data for generating the above plots were extracted from Figure 5 in Ross et al. (1989). The t-value specifies the ratio of ramp width to period (varying from 0 for a square wave to 0.5 for a triangle wave). The ramp width was estimated from spatial frequencies by defining a maximum display size (2048 pixels) at the minimum spatial frequency (0.05 cycles per degree). For each spatial frequency and t-value, the original stimulus waveforms were generated with equation 1 in Ross et al., and the ramp width (in pixels) was measured. Notice, however, that the value for the maximum display size is arbitrary, implying that the above data are defined only up to a translation along the abscissa.
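Equation 1 of Ross et al. (1989) is not reproduced in the paper; the following sketch uses a clipped triangle wave, which is one parameterization consistent with the caption (t = 0 yields a square wave, t = 0.5 a triangle wave, and the ramp occupies the fraction t of each period). All names here are ours:

import numpy as np

def trapezoid(n, freq, t):
    # Trapezoidal wave with ramp-width-to-period ratio t; freq is in
    # cycles per pixel.
    phase = (np.arange(n) * freq) % 1.0
    tri = 4.0 * np.abs(phase - 0.5) - 1.0    # triangle wave in [-1, 1]
    t = max(t, 1e-6)                         # avoid division by zero
    return np.clip(tri / (2.0 * t), -1.0, 1.0)

def ramp_width_pixels(wave):
    # Measured ramp width: longest run of samples strictly between
    # the two plateau levels.
    interior = (wave > wave.min() + 1e-9) & (wave < wave.max() - 1e-9)
    best = run = 0
    for flag in interior:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best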
From Figure 5 in Ross et al. (1989), ramp widths of the original stimuli were estimated in order to allow comparison with the gradient system (details are given in the legend of Figure 6). Notice the tendency for the maxima to group around a specific ramp width, regardless of the ramp form used (as specified by the t-value). Figure 7 shows light Mach band strength as a function of ramp width. The gradient system predicted the inverted-U behavior, in agreement with the original data. However, the gradient system makes identical predictions for the light and the dark Mach bands, whereas the data of Ross et al. (1989) indicate lower threshold contrasts for dark Mach bands (see Figure 6). Nevertheless, although asymmetries seem to exist, studies diverge as to whether light bands are stronger than dark ones, or vice versa (see Pessoa, 1996). In addition, the gradient system predicted a shift of maximum perceived gradient activity toward smaller ramp widths as the ramp contrast was decreased (data not shown). The driving force in establishing the inverted-U behavior is inhibition of high-spatial-frequency (i.e., nongradient) features and an increased activity flux with narrow ramps. Wide ramps, on the other hand, generate smaller retinal responses at knee-point positions and consequently smaller perceptual activities. Varying the value of g_leak in equations 2.1 and 2.2 causes the maximum Mach band strength to shift (cf. Figure 7), thus providing an
Figure 7: Inverted-U behavior of the bright Mach band as predicted by the gradient system. These graphs should be compared with Figure 6. Curves in A (B) show the shift of maximum Mach band strength for different values of g_leak in equation 2.1 (equation 2.2). These simulations involved t = 2000 iterations of equation 2.7 for each data point.
easy means for fitting the model output to psychophysical data. Parameters of the gradient system, however, were optimized for robust performance with real-world images. The gradient system also predicts psychophysical data addressing the interaction of Mach bands with adjacently placed stimuli (data not shown; see Keil, Cristóbal, & Neumann, in press).

3.3 Chevreul Illusion.

A luminance staircase gives rise to the Chevreul illusion (Chevreul, 1890). The illusion is that brightness across the staircase seems inhomogeneous (except at the first and the last plateau), although luminance is constant at each plateau. Chevreul's illusion is traditionally explained by the contrast sensitivity function of the visual system. More recent explanations rely, for example, on spatially dissipative filling-in activity (Pessoa et al., 1995) or on even-symmetric filter responses from coarse filter scales (e.g., Watt & Morgan, 1985; Kingdom & Moulden, 1992; Morrone, Burr, & Ross, 1994; du Buf & Fischer, 1995). Figure 8 illustrates the illusion for different numbers of plateaus. With a three-plateau configuration, human observers perceive the inhomogeneity as significantly weaker at the middle plateau. Despite the absence of physical luminance gradients, the gradient system nevertheless engenders representations (see Figure 9). Put another way, Chevreul's illusion is predicted as a consequence of "incorrect" gradient representations at the plateaus. One must not forget, however, that luminance staircases trigger surface representations at the same time. With a luminance staircase, perceptual activity is assumed to be significantly higher for surface representations than for the corresponding gradient representations. Hence, the surface representation constitutes the dominating perceptual event, which is overlaid by a relatively weak gradient
Figure 8: Chevreul illusion. Although luminance is homogeneous across each plateau, brightness is perceived inhomogeneously (with the exception of the first and the last plateau): overshoots in brightness are perceived at the left side of each plateau and overshoots in darkness at each right side. The spatial layout of the illusion appears to change with an increasing number of plateaus, with a corresponding increase in the nonuniformity of brightness (128 × 128 pixels). Notice that this effect may be difficult to see due to the photographic reproduction process.
representation. Why does the gradient system predict an increase of the illusion with an increasing number of steps? As compared with the inverted-U characteristic of Mach bands (see Figure 7), this increment is not an effect of increasing amplitude of perceptual activity (see the corresponding numbers indicating maximum activities in Figure 9). Rather, the profile plots in Figure 9 reveal that peaks (see graphs) decrease, and the generated gradients get steeper, with an increasing number of plateaus (peaks are created by clamped sources and sinks). With three plateaus, the perceptual gradient is shallow, and peaks are relatively accentuated. This means that the generated activity gradient between the peaks has a relatively small slope at the middle plateau. With an increasing number of plateaus, however, the peaks get smaller and broader, while gradients increase in slope. This means that gradient activity varies strongly across plateaus, which gives rise to stronger predicted illusions. Notice, however, that we are not aware of psychophysical studies that have investigated the strength of Chevreul's illusion as a function of the number of plateaus. Model ganglion cells are unresponsive to luminance gradients with constant slope. Hence, one can devise a "teeth"-shaped luminance profile that makes ganglion cells respond in the same way as to a luminance staircase (see Figure 10). This implies that the gradient system cannot distinguish between the two luminance profiles and generates equivalent representations for both of them. These situations can be distinguished only by taking into account the corresponding surface representations, where a brightness staircase is generated for the luminance staircase (brightness levels increase) and a flat brightness profile for the "teeth" profile. These observations can furthermore be used to verify our claim that Chevreul's illusion is caused by a representation of a linear luminance gradient: superimposing a teeth profile onto a luminance staircase should give the impression of an enhanced Chevreul effect.
Figure 9: Predictions for the Chevreul illusion. Predictions of the gradient system, for the luminance configurations shown in Figure 8, after 2000 iterations. (A) For better visualization, each image was normalized individually (numbers indicate maximum activity values of g^{(5)}). (B) Graphs show g^{(5)}_{ij} for i = 64 and j = 1, . . . , 128 (black) and luminance (gray, individually renormalized). The illusion is predicted to be weaker at the first and the last plateau of each staircase, consistent with psychophysical data.
Figure 10: Two luminance distributions giving rise to the same gradient representations. A luminance staircase (A) and a “teeth” profile (B) generate the same retinal response patterns and consequently lead to identical gradient representations. However, the luminance staircase triggers surface representations with different brightness levels, whereas the teeth profile generates a surface representation with constant brightness. This difference causes gradient representations to be the dominating perceptual event in the latter case but not in the former one.
Figure 11: Enhancing Chevreul’s illusion. A “teeth” profile (last image) is superimposed on a luminance staircase (first image). The mixture was done according to mixed image = (1 − η) × staircase + η × teeth with η ∈ {0.25, 0.50} corresponding to the indicated numbers {50%, 100%}. Overlaying the teeth profile on the staircase seems to enhance the Chevreul illusion (compare the middle images with the right image of Figure 8). This experiment corroborates our claim that Chevreul’s illusion is the consequence of a weakly perceived luminance gradient representation.
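A sketch of the stimuli used in Figure 11 (constructing the teeth profile as the staircase minus its linear trend is our assumption; it is consistent with the statement that constant-slope segments are invisible to the model's ganglion cells, so that both profiles yield the same retinal responses):

import numpy as np

def staircase(n=128, n_plateaus=5):
    # Luminance staircase: equal-width plateaus rising from 0 to 1.
    levels = np.linspace(0.0, 1.0, n_plateaus)
    profile = np.repeat(levels, int(np.ceil(n / n_plateaus)))
    return profile[:n]

def teeth(n=128, n_plateaus=5):
    # Subtracting the linear trend keeps the jumps intact but removes the
    # net rise, giving the sawtooth-like "teeth" profile of Figure 10.
    return staircase(n, n_plateaus) - np.linspace(0.0, 1.0, n)

def mixed(eta, n=128, n_plateaus=5):
    # Figure 11: mixed image = (1 - eta) * staircase + eta * teeth.
    return (1.0 - eta) * staircase(n, n_plateaus) + eta * teeth(n, n_plateaus)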
The reader may verify this by scrutinizing Figure 11, where a teeth profile and a luminance staircase were superimposed with varying relative amplitudes. Remarkably, the enhanced illusion appears to be very similar to the percept of a staircase with more plateaus (compare Figure 11 with the right image of Figure 8).8

8. Although we conducted no sophisticated psychophysical study, 10 out of 10 persons we asked considered this to be the case (all participants were naive on the subject).
Figure 12: Gradient representation for a line-based Ehrenstein disk. Luminance shows the input to the gradient system (128 × 128 pixels, 8 lines, inner disk radius 15 pixels), and gradient shows the corresponding gradient representations (1000 iterations). In order to illustrate the brightening effect of the disk, perceptual activity at positions around the lines was set to zero (i.e., the line inducers were deleted from the gradient representations).
Notice that the teeth profile only enhances the illusion; it does not alter its spatial appearance. In other words, the enhanced illusion does not appear unnatural or distorted to us. Hence, this paradigm could be used in psychophysical experiments as a means for quantifying the strength of Chevreul's illusion.

3.4 Modified Ehrenstein Disk.

The original Ehrenstein disk is an illusory brightening of a disk-shaped region enclosed by a set of radially arranged bars. Here we consider an Ehrenstein disk induced by lines, since the gradient system does not predict the bar-induced Ehrenstein disk. Figure 12 demonstrates that the Ehrenstein disk is generated even if the lines do not define an entire (or closed) circle. Although possibly difficult to see in Figure 12, the gradient system also predicts an illusory darkening of inducer line ends, along with brightness buttons just outside both ends of the lines (Day & Jory, 1978; Kennedy, 1979). Nevertheless, the perceptual activity is comparatively small (order 10^{-5}) at the disk positions where humans perceive an illusory brightening. Furthermore, the gradient system cannot explain other similar illusions like Kanizsa's square. This indicates that for this class of illusions, corresponding surface representations play the dominant role for perception. Notice that this result was not included
Figure 13: Gradient representations for the images of Figure 2. The images show the perceptual activity g^{(5)}_{ij} at 2000 iterations, with maximum values indicated by numbers (for visualization, images were normalized individually). As is obvious from the maximum values, activity levels of gradient representations increase with the degree of blur.
to explain the Ehrenstein disk, but rather to indicate that clamped diffusion is functionally more universal than originally thought.

3.5 Real-World Images.

The robustness of our proposed mechanisms is demonstrated with real-world images. We believe that successful processing of real-world images is an important point, since the primate visual system evolved with such images; it also opens the door to potential applications. Figure 13 shows how gradient representations depend on the degree of gaussian blurring. Low-pass filtering an image by gaussian blurring transforms it entirely into a nonlinear luminance gradient, causing model ganglion cells to respond along gradients. Therefore, gradient representations of blurred images are similar to their originals, albeit slightly deblurred and with enhanced contrasts. Speed is another relevant issue: representations for nonlinear luminance gradients are readily available, since nonlinear gradients need not be "reconstructed" (i.e., explicitly generated) by clamped diffusion. The degree of local blur correlates with the magnitude of perceptual activity: the higher the degree of blurring, the higher the perceptual activity at corresponding positions in the gradient representation. Figure 14 shows gradient representations for two standard test images. This experiment nicely demonstrates how gradient representations are triggered in a localized fashion. Specifically, the left image reveals focal blur with the peppers in the upper left corner, a consequence of objects being at remote positions. The specular highlights appearing on various peppers are also represented in the gradient image. Furthermore, gradient representations are engendered where penumbral blur (the transition region between a shadow and the illuminated region surrounding it) is present in the original image. The right image also contains focal blur (the
Figure 14: Gradient representations for real-world images. The images show the perceptual activity g^{(5)}_{ij} at 2000 iterations for two standard real-world images (luminance; size 256 × 256 pixels). Notice that sharply bounded luminance features (nongradients) are attenuated, whereas out-of-focus image features (e.g., specular highlights on the peppers and the bars in the background, respectively) are enhanced.
bar in the background, the mirror, and the mirror image) and highlights (e.g., at the nose). Again, gradient representations are triggered for the corresponding image regions. The deblurring effect is especially prominent for the out-of-focus bar in the background, which appears better focused in the gradient representation. In both examples, no gradient representations are triggered in image regions without blurring. Similarly, surface texture (i.e., even-symmetric luminance features) impedes the generation of luminance gradients. Notice that in both images, the specular highlights provide information about surface curvature, and downstream mechanisms for recovering 3D shape could readily retrieve the corresponding information from the gradient representations.
4 Summary and Conclusions

With this contribution, we propose a novel theory on the perception and representation of smooth gradations in luminance (here called luminance gradients) in the early stages of the primate visual system. The corresponding proposed mechanism is called a gradient system. Luminance gradients are, besides object surfaces, prevailing constituents of real-world scenes. Because luminance gradients may encode different types of information, they may accordingly be of different value for object recognition. For example, shading effects can reveal information about the curvature of object surfaces and thus may contribute to recovering an object's 3D shape. On the other hand, if a soft shadow from one object is cast over a second object's surface, then it should be ignored for recovering the second object's shape, because it probably is not a surface attribute. However, in the latter situation, the cast shadow provides information about the objects' arrangement in depth. Since it is a priori unknown whether the information that is locally encoded by a luminance gradient corresponds to surface properties, the visual system cannot by default bind smooth luminance gradients to object surfaces. But information on luminance gradients is unlikely to be lost, as suggested by everyday perceptual experience. The latter suggests that smooth luminance gradients are explicitly represented in the visual system in parallel with surface representations. Given the coexistence of gradient representations, we propose that smooth luminance gradients are discounted from surface representations, consistent with the observation of lightness constancy in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996). Having separate representations for surfaces and gradients enables downstream, or higher-level, mechanisms for object recognition to locally "decide" whether to use gradient information, according to its consistency with other processes for deriving shape information. The gradient system is proposed to interact with brightness or lightness perception, respectively, and makes consistent predictions: it provides a novel account of Mach band formation and Chevreul's illusion, whereby in both cases, available psychophysical data are successfully predicted. Mach bands are explained as the consequence of handling both linear and nonlinear gradients with a single mechanism for generating gradient representations (clamped diffusion). Clamped diffusion preserves the shape of nonlinear gradients but triggers an explicit generation process for building representations of linear gradients. In the latter case, clamped sources and sinks of activity are visible relatively longer during the dynamics of gradient formation (as illustrated by Figure 3): bright Mach bands correspond to clamped sources of activity and dark Mach bands to sinks of activity. Furthermore, the gradient system predicts that the Mach band–like overshoots in brightness and darkness, which are observed with a triangular luminance profile, are caused by
identical neural mechanisms, in agreement with psychophysical data (Ross et al., 1989). This is because a triangular profile is also a linear gradient. A luminance staircase, on the other hand, triggers an erroneous representation of a linear, sawtooth-shaped luminance gradient. Since this gradient is actually absent from the luminance staircase, it forms the underlying substrate of Chevreul's illusion. We verified this prediction by superimposing a sawtooth-shaped luminance pattern onto a luminance staircase (see Figure 11). This causes Chevreul's illusion to be enhanced but without perceptually distorting it (i.e., the artificially enhanced illusion is supposed to look like a different Chevreul illusion with a correspondingly greater number of plateaus; see note 8). Mixing a sawtooth-shaped luminance profile with a luminance staircase also provides a relatively simple means to quantify the strength of Chevreul's illusion in psychophysical experiments: one can use it either to cancel the illusion or to match its strength. Notice that other models of brightness perception typically explain Chevreul's illusion as a by-product of some nonlinearity involved in the superposition of bandpass filter responses (e.g., du Buf & Fischer, 1995; Pessoa et al., 1995; McArthur & Moulden, 1999). To our knowledge, no other model of brightness perception has attempted to examine the dependence of the strength of Chevreul's illusion on the number of plateaus with a fixed display size. Clamped diffusion is also successful in predicting the (also illusory) brightening effect of an Ehrenstein disk based on line inducers (see Figure 12). This result should be compared with other models of brightness perception that explain the Ehrenstein disk by neural grouping mechanisms (Heitger, von der Heydt, Peterhans, Rosenthaler, & Kübler, 1998; Gove et al., 1995). Although we do not claim to account for brightness illusions like the Kanizsa triangle or the Ehrenstein disk by means of gradient representations, this result is a further demonstration that the gradient system makes consistent predictions and is computationally more universal than originally planned. This universality and its robustness are furthermore demonstrated with real-world images (see Figures 13 and 14), where gradient representations for specular highlights, cast shadows (i.e., penumbral blur), and optically blurred image features (e.g., due to arrangement in depth) are locally triggered. The formation of gradient representations is impeded at image locations with high-contrast texture, that is, even-symmetric contrast configurations (see the right images of Figure 14). One may argue that gradient representations as generated by the gradient system are not useful for the visual system, since they unspecifically lump together all kinds of smooth gradients. However, one has to take into account that the gradient system in its present form does not involve any interpretation of gradient structures. Rather, it clarifies how to segregate and recover smooth luminance gradients from retinal information by using model cells with small receptive fields (i.e., nonglobal information).
Thus, the situation is similar to the one faced with surface representations in V1, where neuronal responses are initially quite unspecific and only later indicate figure-ground relationships (Lamme, 1995; Lee, Mumford, Romero, & Lamme, 1998). Segregation of objects from background is thought to be mediated by feedback from extrastriate or higher visual areas. Is there any evidence for corresponding high-level interactions involving smooth gradients? Knill and Kersten (1991) showed that luminance gradients are interpreted differently depending on an object's contour curvature. Similarly, Ramachandran (1988) provided a demonstration of how the form of an object's outline changes the interpretation and perception of luminance gradients generated by illumination and surface curvature, respectively. Adopting a conservative point of view, these findings may be interpreted as evidence in favor of an explicit representation for smooth gradients associated with illumination (i.e., a specific representation), most likely in an extrastriate area. But where should one look for unspecific gradient representations in the brain? Because there are no specific data available to address the issue directly, one can only speculate by extrapolating data about surface representations. A widespread assumption is that surface representations are an element of midlevel vision. However, a recent fMRI study conducted by Sasaki and Watanabe (2004) suggests that V1 is possibly the only cortical area involved in generating surface representations. Nevertheless, their data indicated a different situation for contour processing, where the early visual areas V1, V2, V3/VP, and V4v seem to be involved. According to these recent data, we suggest that unspecific representations of smooth luminance gradients also reside in V1. In addition, the anatomical requirements necessary for the gradient system are met in V1. First, the gradient system relies on neurons providing information about contours, such as simple cells or complex cells. Second, the gradient system processes all information at the highest spatial acuity, and V1 is the largest visual cortical area with the highest spatial resolution. Third, evidence exists that long-range horizontal connections are involved in lateral activity propagation (e.g., Bringuier, Chavane, Glaeser, & Frégnac, 1999), and specifically in isotropic filling-in processes for generating surface representations (e.g., Sasaki & Watanabe, 2004). Thus, long-range horizontal connections between cortical microcolumns could also serve as a substrate for implementing the clamped diffusion process. Properties of neurons that are involved in unspecific gradient representations can be derived from the model. For example, just like the filling-in process, gradient representations involve a lateral propagation of activity, and hence one should observe latency effects when stimulating such neurons with smooth variations in intensity. One should observe different response latencies for linear and nonlinear gradient stimuli, respectively (latencies associated with linear gradients are predicted to be longer). With
stimulation inside the receptive field, the depolarization of gradient neurons should increase with the degree of blur of the stimulus. In contrast, bars and lines should hyperpolarize such neurons where stimulus contours fall within their receptive field. In analogy to surface representations, the responses of gradient neurons may be modulated by attention, and initially unspecific responses may encode a particular type of feature in their later response phase (e.g., surface curvature, a cast shadow, or illumination gradients). Such modification of responses most likely requires feedback from extrastriate visual areas. It is also conceivable that more specific representations are found in extrastriate areas. However, since extrastriate areas usually have lower spatial resolution than V1 (Paradiso, 2002), these areas can be expected to interact with V1 to preserve spatial accuracy. Recently, Ben-Shahar et al. (2002) proposed a differential geometry–based approach for processing shading information. In their approach, shading is initially measured by oriented operators selective for low spatial frequencies. The resulting flow field is then compared with a generic flow field model (the osculating flow field), which is locally parameterized over the curvature values of the originally measured flow field. A relaxation labeling algorithm (Hummel & Zucker, 1983) is used to update the original flow field according to the generic model. The latter process finally results in a consistent flow field. In order to increase stability, Ben-Shahar et al. incorporated edge information into the relaxation labeling algorithm. The use of edges can additionally be motivated by recognizing that edges in many cases correlate with material changes. Edge maps were computed using standard techniques: Canny's edge detector (Canny, 1986) and the logical/linear detector (Iverson & Zucker, 1995). The approach of Ben-Shahar et al. (2002) compares to the gradient system as follows. Whereas the model of Ben-Shahar et al. (2002) relies on orientation-selective low-spatial-frequency operators for detecting shading, the gradient system detects shading indirectly by suppressing features with high spatial frequencies. Nevertheless, boundary maps or edge maps (i.e., high-frequency information) subserve similar goals in both approaches: in the gradient system, they are indispensable for ensuring the spatial accuracy of gradient representations, and in the relaxation labeling algorithm, they regulate the growth of flow structures. Since in both approaches edges act in an inhibitory fashion, both gradient representations and flow fields, respectively, are rejected in textured regions of an image. A further difference between the two methods lies in their explanatory power. The gradient system focuses on the prediction of psychophysical data (Keil et al., 2005) and on linking them to concrete neuronal mechanisms. On the other hand, the generic flow field model predicts anatomical data about long-range horizontal connections acting on perceptual variables such as shading (Ben-Shahar & Zucker, 2004), with relaxation labeling as an abstraction of the corresponding neuronal computations for processing the underlying flow
fields. The latter emphasizes the importance of the orientation component of horizontal connections. Conversely, the gradient system relies on only isotropic operators in its perceptual module (clamped diffusion). Hence, horizontal connections in that case may serve to implement an isotropic (clamped) diffusion process. Notice that the steady state of clamped diffusion in fact corresponds to a dynamic equilibrium, and clamped diffusion establishes nothing but a representation of a shading flow field. As a consequence, the approach of Ben-Shahar et al. (2002) could in principle act on gradient representations as a subsequent stage in the processing hierarchy. We emphasize that the performance of the gradient system was optimized for real-world images, where psychophysical data were used only to constrain the circuitry. The gradient system was not designed ad hoc for the explanation of psychophysical data. All employed mechanisms lie in the range of biophysical possibilities and do not contradict existing anatomical data. The gradient system is a minimally complex model, since all stages are necessary to arrive at our results.
Acknowledgments This work has been partially supported by the following grants: LOCUST IST 2001-38097, and AMOVIP INCO-DC 961646. I thank the two reviewers whose comments helped to significantly improve the first drafts of this article.
References

Bakin, J., Nakayama, K., & Gilbert, C. (2000). Visual responses in monkey areas V1 and V2 to three-dimensional surface configurations. Journal of Neuroscience, 20(21), 8188–8198.
Ben-Shahar, O., Huggins, P., & Zucker, S. (2002). On computing visual flows with boundaries: The case of shading and edges. In H.-H. Bülthoff, S.-W. Lee, T. Poggio, & C. Wallraven (Eds.), 2nd Workshop on Biologically Motivated Computer Vision (BMCV 2002) (Vol. LNCS 2525, pp. 189–198). Berlin: Springer-Verlag.
Ben-Shahar, O., & Zucker, S. (2004). Geometrical computations explain projection patterns of long range horizontal connections in visual cortex. Neural Computation, 16(3), 445–476.
Benardete, E., & Kaplan, E. (1999). The dynamics of primate M retinal ganglion cells. Visual Neuroscience, 16, 355–368.
Benda, J., Bock, R., Rujan, P., & Ammermüller, J. (2001). Asymmetrical dynamics of voltage spread in retinal horizontal cell networks. Visual Neuroscience, 18(5), 835–848.
Blakeslee, B., & McCourt, M. (1997). Similar mechanisms underlie simultaneous brightness contrast and grating induction. Vision Research, 37, 2849–2869.
Blakeslee, B., & McCourt, M. (1999). A multiscale spatial filtering account of the white effect, simultaneous brightness contrast and grating induction. Vision Research, 39, 4361–4377.
Blakeslee, B., & McCourt, M. (2001). A multiscale spatial filtering account of the Wertheimer-Benary effect and the corrugated Mondrian. Vision Research, 41, 2487–2502.
Bloj, M., Kersten, D., & Hurlbert, A. (1999). Perception of three-dimensional shape influences colour perception through mutual illumination. Nature, 402, 877–879.
Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Burt, P., & Adelson, E. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), 532–540.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.
Castelo-Branco, M., Goebel, R., Neuenschwander, S., & Singer, W. (2000). Neural synchrony correlates with surface segregation rules. Nature, 405, 685–689.
Chevreul, M. (1890). In C. T. Martel (Ed.), The principles of harmony and contrast of colours, and their applications to arts. London: Bell.
Chichilnisky, E., & Kalmar, R. (2002). Functional asymmetries in ON and OFF ganglion cells of primate retina. Journal of Neuroscience, 22(7), 2737–2747.
Cleland, B., Levick, W., & Sanderson, K. (1973). Properties of sustained and transient ganglion cells in the cat retina. Journal of Physiology (London), 228, 649–680.
Davey, M., Maddess, T., & Srinivasan, M. (1998). The spatiotemporal properties of the Craik-O'Brien-Cornsweet effect are consistent with "filling-in." Vision Research, 38, 2037–2046.
Day, R., & Jory, M. (1978). Subjective contours, visual acuity, and line contrast. In J. A. J. Krauskopf & B. Wooten (Eds.), Visual psychophysics and physiology (pp. 331–349). New York: Academic Press.
Deschênes, F., Ziou, D., & Fuchs, P. (2004). A unified approach for a simultaneous and cooperative estimation of defocus blur and spatial shifts. Image and Vision Computing, 22, 35–57.
DeVries, S., & Baylor, D. (1997). Mosaic arrangement of ganglion cell receptive fields in rabbit retina. Journal of Neurophysiology, 78(4), 2048–2060.
du Buf, J. (1992). Lowpass channels and White's effect. Perception, 21, A80.
du Buf, J., & Fischer, S. (1995). Modeling brightness perception and syntactical image coding. Optical Engineering, 34(7), 1900–1911.
Elder, J., & Zucker, S. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7), 699–716.
Gabbiani, F., Krapp, H., Koch, C., & Laurent, G. (2002). Multiplicative computation in a visual neuron sensitive to looming. Nature, 420, 320–324.
Gerrits, H. (1979). Apparent movements induced by stroboscopic illumination of stabilized images. Experimental Brain Research, 34, 471–488.
Gerrits, H., de Haan, B., & Vendrik, A. (1966). Experiments with retinal stabilized images: Relations between the observations and neural data. Vision Research, 6, 427–440.
Gerrits, H., & Vendrik, A. (1970). Simultaneous contrast, filling-in process and information processing in man's visual system. Experimental Brain Research, 11, 411–430.
Gilchrist, A., Kossyfidis, C., Bonato, F., Agostini, T., Cataliotti, J., Li, X., Spehar, B., Annan, V., & Economou, E. (1999). An anchoring theory of lightness perception. Psychological Review, 106(4), 795–834.
Ginsburg, A. (1975). Is the illusory triangle physical or imaginary? Nature, 257, 219–220.
Gove, A., Grossberg, S., & Mingolla, E. (1995). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neuroscience, 12, 1027–1052.
Grossberg, S. (1983). The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behavioral and Brain Sciences, 6, 625–692.
Grossberg, S., & Hong, S. (2003). Cortical dynamics of surface lightness anchoring, filling-in, and perception (abstract). Journal of Vision, 3(9), 415a.
Grossberg, S., & Howe, P. (2003). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Research, 43, 801–829.
Grossberg, S., & Mingolla, E. (1987). Neural dynamics of surface perception: Boundary webs, illuminants, and shape-from-shading. Computer Vision, Graphics, and Image Processing, 37, 116–165.
Grossberg, S., & Pessoa, L. (1998). Texture segregation, surface representation, and figure-ground separation. Vision Research, 38, 2657–2684.
Grossberg, S., & Todorović, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241–277.
Heitger, F., von der Heydt, R., Peterhans, E., Rosenthaler, L., & Kübler, O. (1998). Simulation of neural contour mechanisms: Representing anomalous contours. Image and Vision Computing, 16, 407–421.
Hong, S., & Grossberg, S. (2004). A neuromorphic model for achromatic and chromatic surface representation of natural images. Neural Networks, 17(5–6), 787–808.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, London, 160, 106–154.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, London, 195, 214–243.
Hummel, R., & Zucker, S. (1983). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 267–287.
Iverson, L., & Zucker, S. (1995). Logical/linear operators for image curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10), 982–996.
Kaplan, E., Lee, B., & Shapley, R. (1990). New views of primate retinal function. Progress in Retinal Research, 9, 273–336.
Kaplan, E., Purpura, K., & Shapley, R. (1987). Contrast affects the transmission of visual information through the mammalian lateral geniculate nucleus. Journal of Physiology (London), 391, 267–288.
Keat, J., Reinagel, P., Reid, R., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30, 803–817.
Keil, M., Cristóbal, G., & Neumann, H. (in revision). Gradient representation and perception in the early visual system—a novel account to Mach band formation. Vision Research.
Kelly, F., & Grossberg, S. (2000). Neural dynamics of 3-D surface perception: Figure-ground separation and lightness perception. Perception and Psychophysics, 62, 1596–1619.
Kennedy, J. (1979). Subjective contours, contrast and assimilation. In C. Nodine & D. Fisher (Eds.), Perception and pictorial representation. New York: Praeger.
Kersten, D., Mamassian, P., & Knill, D. (1997). Moving cast shadows induce apparent motion in depth. Perception, 26, 171–192.
Kingdom, F. (2003). Color brings relief to human vision. Nature Neuroscience, 6(6), 641–644.
Kingdom, F., & Moulden, B. (1992). A multi-channel approach to brightness coding. Vision Research, 32, 1565–1582.
Kinoshita, M., & Komatsu, H. (2001). Neural representations of the luminance and brightness of a uniform surface in the macaque primary visual cortex. Journal of Neurophysiology, 86, 2559–2570.
Knau, H., & Spillman, L. (1997). Brightness fading during Ganzfeld adaptation. Journal of the Optical Society of America A, 14(6), 1213–1222.
Knill, D., & Kersten, D. (1991). Apparent surface curvature affects lightness perception. Nature, 351, 228–230.
Komatsu, H., Murakami, I., & Kinoshita, M. (1996). Surface representations in the visual system. Cognitive Brain Research, 5, 97–104.
Lamb, T. (1976). Spatial properties of horizontal cell responses in the turtle retina. Journal of Physiology, 263, 239–255.
Lamme, V. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15, 1605–1615.
Lee, T., Mumford, D., Romero, R., & Lamme, V. (1998). The role of the primary visual cortex in higher level vision. Vision Research, 38, 2429–2454.
MacEvoy, S., Kim, W., & Paradiso, M. (1998). Integration of surface information in the primary visual cortex. Nature Neuroscience, 1(7), 616–620.
MacEvoy, S., & Paradiso, M. (2001). Lightness constancy in the primary visual cortex. Proceedings of the National Academy of Sciences USA, 98(15), 8827–8831.
Mach, E. (1865). Über die Wirkung der räumlichen Verteilung des Lichtreizes auf die Netzhaut, I. Sitzungsberichte der mathematisch-naturwissenschaftlichen Klasse der Kaiserlichen Akademie der Wissenschaften, 52, 303–322.
Maffei, L., & Fiorentini, A. (1972). Retinogeniculate convergence and analysis of contrast. Journal of Neurophysiology, 35, 65–72.
Mamassian, P., Knill, D., & Kersten, D. (1998). The perception of cast shadows. Trends in Cognitive Sciences, 2(8), 288–294.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proc. R. Soc. Lond. B, 207, 187–217.
McArthur, J., & Moulden, B. (1999). A two-dimensional model of brightness perception based on spatial filtering consistent with retinal processing. Vision Research, 39, 1199–1219.
Mingolla, E., Ross, W., & Grossberg, S. (1999). A neural network for enhancing boundaries and surfaces in synthetic aperture radar images. Neural Networks, 12, 499–511.
Mingolla, E., & Todd, J. (1986). Perception of solid shape from shading. Biological Cybernetics, 53, 137–151.
Morrone, M., Burr, D., & Ross, J. (1994). Illusory brightness step in the Chevreul illusion. Vision Research, 12, 1567–1574.
Naka, K.-I., & Rushton, W. (1967). The generation and spread of S-potentials in fish (cyprinidae). Journal of Physiology, 193, 437–461.
Neumann, H. (1994). An outline of a neural architecture for unified visual contrast and brightness perception (Tech. Rep. No. CAS/CNS-94-003). Boston: Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems.
Neumann, H. (1996). Mechanisms of neural architectures for visual contrast and brightness perception. Neural Networks, 9(6), 921–936.
Neumann, H., Pessoa, L., & Hansen, T. (2001). Visual filling-in for computing perceptual surface properties. Biological Cybernetics, 85, 355–369.
Nishina, S., Okada, M., & Kawato, M. (2003). Spatio-temporal dynamics of depth propagation on uniform region. Vision Research, 43, 2493–2503.
Norman, J., & Todd, J. (1994). The perception of rigid motion in depth from the optical deformations of shadows and occlusion boundaries. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 343–356.
Paradiso, M. (2000). Visual neuroscience: Illuminating the dark corners. Current Biology, 10(1), R15–R18.
Paradiso, M. (2002). Perceptual and neural correspondence in primary visual cortex. Current Opinion in Neurobiology, 12, 155–161.
Paradiso, M., & Hahn, S. (1996). Filling-in percepts produced by luminance modulation. Vision Research, 36(17), 2657–2663.
Paradiso, M., & Nakayama, K. (1991). Brightness perception and filling-in. Vision Research, 31(7/8), 1221–1236.
Passaglia, C., Enroth-Cugell, C., & Troy, J. (2001). Effects of remote stimulation on the mean firing rate of cat retinal ganglion cells. Journal of Neuroscience, 21, 5794–5803.
Pessoa, L. (1996). Mach-bands: How many models are possible? Recent experimental findings and modeling attempts. Vision Research, 36(19), 3205–3277.
Pessoa, L., Mingolla, E., & Neumann, H. (1995). A contrast- and luminance-driven multiscale network model of brightness perception. Vision Research, 35(15), 2201–2223.
Pessoa, L., & Neumann, H. (1998). Why does the brain fill-in? Trends in Cognitive Sciences, 2, 422–424.
Pessoa, L., & Ross, W. (2000). Lightness from contrast: A selective integration model. Perception and Psychophysics, 62(6), 1160–1181.
Pessoa, L., Thompson, E., & Noë, A. (1998). Finding out about filling-in: A guide to perceptual completion for visual science and the philosophy of perception. Behavioral and Brain Sciences, 21, 723–802.
Ramachandran, V. (1988). Perception of shape from shading. Nature, 331, 163–166.
Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden Day.
Rodieck, R. W. (1965). Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5, 583–601.
Ross, J., Morrone, M., & Burr, D. (1989). The conditions under which Mach bands are visible. Vision Research, 29(6), 699–715.
Rossi, A., & Paradiso, M. (1996). Temporal limits of brightness induction and mechanisms of brightness perception. Vision Research, 36(10), 1391–1398.
Early Gradient Representations
903
Rossi, A., & Paradiso, M. (1999). Neural correlates of perceived brightness in the retina, lateral geniculate nucleus, and striate cortex. Journal of Neuroscience, 193(14), 6145–6156. Rossi, A., Rittenhouse, C., & Paradiso, M. (1996). The representation of brightness in primary visual cortex. Science, 273, 1391–1398. Rudd, M., & Arrington, K. (2001). Darkness filling-in: A neural model of darkness induction. Vision Research, 41(27), 3649–3662. Sasaki, Y., & Watanabe, T. (2004). The primary visual cortex fills in color. Proceedings of the National Academy of Sciences USA, 52(101), 18251–18256. Sepp, W., & Neumann, H. (1999). A multi-resolution filling-in model for brightness perception. In ICANN99, Ninth International Conference on Artificial Neural Networks (Vol. 470, pp. 461–466). University of Edinburgh. Tittle, J., & Todd, J. (1995). Perception of three-dimensional structure. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 715–718). Cambridge, MA: MIT Press. Todd, J. (2003). Perception of three-dimensional structure. In M. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 868–871). Cambridge, MA: MIT Press. Todd, J., & Mingolla, E. (1983). Perception of surface curvature and direction of illumination from patterns of shading. Journal of Experimental Psychology: Human Perception and Performance, 9(4), 583–595. Todd, J., Norman, J., Koendernik, J., & Kappers, A. (1997). Effects of texture, illumination and surface reflectance on stereoscopic shape perception. Perception, 26, 806–822. Todd, J., Norman, J., & Mingolla, E. (2004). Lightness constancy in the presence of specular highlights. Psychological Science, 15(1), 33–39. Todd, J., & Reichel, F. (1989). Ordinal structure in the visual perception and cognition of smoothly curved surfaces. Psychological Review, 96(1), 643–657. Troy, J., & Robson, J. (1992). Steady discharges of X and Y retinal ganglion cells of cat under photopic illuminance. Visual Neuroscience, 9, 535–553. Wandell, B. (1995). Foundations of vision. Sunderland, MA: Sinauer. Watt, R., & Morgan, M. (1985). A theory of the primitive spatial code in human vision. Vision Research, 25, 1661–1674. Winfree, A. (1995). Wave propagation in cardiac muscle and in nerve networks. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1054–1056). Cambridge, MA: MIT Press. Zaghloul, K., Boahen, K., & Demb, J. (2003). Different circuits for ON and OFF retinal ganglion cells cause different contrast sensitivities. Journal of Neurosciene, 23(7), 2645–2654.
Received February 23, 2004; accepted August 23, 2005.
LETTER
Communicated by Jean-Pierre Nadal
Memory Capacity for Sequences in a Recurrent Network with Biological Constraints Christian Leibold
[email protected] Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Germany, and Neuroscience Research Center, Charité, Medical Faculty of Berlin, Germany
Richard Kempter
[email protected] Institute for Theoretical Biology, Humboldt-Universität zu Berlin; Bernstein Center for Computational Neuroscience, Berlin; and Neuroscience Research Center, Charité, Medical Faculty of Berlin, Germany
The CA3 region of the hippocampus is a recurrent neural network that is essential for the storage and replay of sequences of patterns that represent behavioral events. Here we present a theoretical framework to calculate a sparsely connected network's capacity to store such sequences. As in CA3, only a limited subset of neurons in the network is active at any one time, pattern retrieval is subject to error, and the resources for plasticity are limited. Our analysis combines an analytical mean field approach, stochastic dynamics, and cellular simulations of a time-discrete McCulloch-Pitts network with binary synapses. To maximize the number of sequences that can be stored in the network, we concurrently optimize the number of active neurons, that is, the pattern size, and the firing threshold. We find that for one-step associations (i.e., minimal sequences), the optimal pattern size is inversely proportional to the mean connectivity c, whereas the optimal firing threshold is independent of the connectivity. If the number of synapses per neuron is fixed, the maximum number P of stored sequences in a sufficiently large, nonmodular network is independent of its number N of cells. On the other hand, if the number of synapses scales as the network size to the power of 3/2, the number of sequences P is proportional to N. In other words, sequential memory is scalable. Furthermore, we find that there is an optimal ratio r between silent and nonsilent synapses at which the storage capacity α = P/[c(1 + r)N] assumes a maximum. For long sequences, the capacity of sequential memory is about one order of magnitude below the capacity for minimal sequences, but otherwise behaves similarly to the case of minimal sequences. In a biologically inspired scenario, the information content per synapse is far below theoretical optimality, suggesting that the brain trades off error tolerance against information content in encoding sequential memories. Neural Computation 18, 904–941 (2006)
C 2006 Massachusetts Institute of Technology
1 Introduction

Recurrent neuronal networks are thought to serve as a physical basis for learning and memory. A fundamental strategy of memory organization in animals and humans is the storage of sequences of behavioral events. One of the brain regions of special importance for sequence learning is the hippocampus (Brun et al., 2002; Fortin, Agster, & Eichenbaum, 2002; Kesner, Gilbert, & Barua, 2002). The recurrent network in the CA3 region of hippocampus, in particular, is critically involved in the rapid acquisition of single-trial, or one-shot, episodic-like memory (Nakazawa, McHugh, Wilson, & Tonegawa, 2004), that is, memory of the sequential ordering of events.

It is generally assumed that the hippocampus can operate in at least two states (Lörincz & Buzsáki, 2000). One state, called theta, is dedicated to fast, or one-shot, learning; the other state, referred to as sharp-wave ripple, is dedicated to the replay of stored sequences. Experiments by Wilson and McNaughton (1994), Nádasdy, Hirase, Czurkó, Csicsvari, and Buzsáki (1999), and Lee and Wilson (2002) strongly corroborate the hypothesis that the hippocampus can replay sequences of previously experienced events. The sequences are assumed to be stored within the highly plastic synapses that recurrently connect the pyramidal cells of the CA3 region (Csicsvari, Hirase, Mamiya, & Buzsáki, 2000).

In this letter, we tackle the problem of how many sequences can be stored in a recurrent neuronal network such that their replay can be triggered by an activation of adequate cue patterns. This question is fundamental to neural computation, and many classical papers calculate the storage capacity of pattern memories. There, one can roughly distinguish between two major classes of network models: perceptron-like feedforward networks in which associations occur within one time step (Willshaw, Buneman, & Longuet-Higgins, 1969; Gardner, 1987; Nadal & Toulouse, 1990; Brunel, Nadal, & Toulouse, 1992) and recurrent networks that describe memories as attractors of a time-discrete dynamics (Little, 1974; Hopfield, 1982; Amit, Gutfreund, & Sompolinsky, 1987; Golomb, Rubin, & Sompolinsky, 1990). Also for networks that act as memory for sequences, capacities have been calculated in both the perceptron (Nadal, 1991) and the attractor case (Herz, Li, & van Hemmen, 1991). An important result is that the capacity of sequence memory in Hopfield-type networks is about twice as large as that of a static attractor network (Düring, Coolen, & Sherrington, 1998).

Here, we describe sequence replay in a sparsely connected network by means of time-discrete dynamics, binary neurons, and binary synapses. Our model for sequential replay of activity patterns is different from attractor-type models (Sompolinsky & Kanter, 1986; Buhmann & Schulten, 1987; Amit, 1988). In fact, we completely dispense with fixed points of the network dynamics. Instead, we discuss transients that are far from equilibrium
(Levy, 1996; August & Levy, 1999; Jensen & Lisman, 2005). In the case of a sequence consisting of a single transition between two patterns (a minimal sequence), the mathematical structure we choose is similar to the one of Nadal (1991) for an autoassociative Willshaw network (Willshaw et al., 1969). For longer sequences, our analysis resembles that of synfire networks (Diesmann, Gewaltig, & Aertsen, 1999), although we take expectation values as late as possible (Nowotny & Huerta, 2003).

Some of the previous approaches to memory capacity explicitly focus on questions of biological applicability. Golomb et al. (1990), for example, address the problem of low firing rates. Herrmann, Hertz, and Prügel-Bennett (1995) explore the biological plausibility of synfire chains. Other approaches assess the dependence of storage capacity on restrictions to connectivity (Gutfreund & Mézard, 1988; Deshpande & Dasgupta, 1991; Maravall, 1999) and on the distribution of synaptic states (Brunel, Hakim, Isope, Nadal, & Barbour, 2004).

We propose a framework that allows us to discuss how a combination of several biological constraints affects the performance of neuronal networks that are operational in the brain. An important constraint that supports dynamical stability at a low level of activity is a low mean connectivity. Another one is imposed by limited resources for synaptic plasticity; that is, not every synapse that may combinatorially be possible can really be established. This constraint sets an upper bound to the maximum connectivity between two groups of neurons that are to be associated. Moreover, the number of synapses per neuron may be limited. Another important constraint for sequential memories is the length of replayed sequences, which interferes with dynamical properties of the network. Finally, the capacity of sequential memory is also influenced by the specific nature of a neuronal structure that reads out replayed patterns. This influence is often neglected by assuming a perfect detector for the network states.

Our approach explicitly takes into account that synapses are usually classified into activated and silent ones (Montgomery, Pavlidis, & Madison, 2001; Nusser et al., 1998). Activated synapses have a nonzero efficacy or weight and are essential for the network dynamics. Silent synapses, which do not contain postsynaptic AMPA receptors (Isaac, Nicoll, & Malenka, 1995), are assumed not to contribute to the network dynamics. Changing the state of synapses from the silent to the nonsilent state, and vice versa, acts as a resource for plasticity for the storage of sequences. Synaptic learning rules can set a fixed ratio between silent and nonsilent synapses, which gives rise to an additional constraint.

In this letter, we calculate the capacity of sequential memory in a constrained recurrent network by means of a probabilistic theory as well as a mean field approach. Both theoretical models are compared to cellular simulations of networks of spiking units. We thereby describe the memory capacity for sequences in dependence on five free parameters. The network size N, the mean connectivity c, and the ratio r between silent and nonsilent
synapses are three network parameters. In addition there are two replay parameters: the sequence length Q and the threshold γ of pattern detection. The number M of active neurons per pattern and the neuronal firing threshold θ are largely considered as dependent variables. It is shown how M and θ are to be optimized to allow replaying a maximum number of sequences. Scaling laws are then derived by using the optimal values for M and θ, both being functions of the five free parameters.

2 Model of a Recurrent Network for the Replay of Sequences

In this section we specify notations to describe the dynamics and morphology of a recurrent network that allows for a replay of sequences of predefined activity patterns. A list of symbols used throughout this article can be found in appendix A.

2.1 Binary Synapses Connect Binary Neurons. Let us consider a network of N McCulloch-Pitts (McCulloch & Pitts, 1943) neurons that are described by binary variables x_k, 1 ≤ k ≤ N. At each discrete point in time t, neuron k can be either active, x_k(t) = 1, or inactive, x_k(t) = 0. The state of the network is then denoted by a binary N-vector x(t) = [x_1(t), . . . , x_N(t)]^T. A neuron k that is active at time t provides input to a neuron k′ at time t + 1 if there is an activated synaptic connection from k to k′. Neuron k′ fires at time t + 1 if its synaptic input crosses some firing threshold θ > 0.

In order to specify a neuron's input, we classify synapses into activated and silent ones. All activated connections contribute equally to the synaptic input. Silent synapses have no influence on the dynamics. Therefore, the synaptic input of neuron k′ at time t + 1 equals the number of active neurons at time t that have an activated synapse to neuron k′. Silent synapses are assumed to act as a resource for plastic changes, although this article does not directly incorporate plasticity rules.

The total number cN^2 of activated synapses in the network defines a mean connectivity c > 0, which later will be interpreted as the probability of having an activated synapse connecting a particular neuron to another one. The connectivity through activated synapses in the network is described by the N × N binary matrix C, where C_{k′k} = 1 if there is an activated synapse from neuron k to neuron k′, and C_{k′k} = 0 if there is a silent synapse or no synapse at all. Similarly, the connectivity through silent synapses is denoted by c_s, and the total number of silent synapses in the network is c_s N^2. Then each neuron has on average (c + c_s)N morphological synapses, which in turn defines the morphological connectivity c_m = c + c_s. Experimental literature (Montgomery et al., 2001; Nusser et al., 1998) usually assesses the ratio c_s/c between the silent and nonsilent connectivities. For convenience, we introduce the abbreviation r = c_s/c. We note that the four connectivity parameters c, c_m, c_s, and r have two independent degrees of freedom.
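The update rule of this binary dynamics is compact enough to state as code. The following is a minimal sketch (Python with numpy; the function and variable names are ours, not the authors'): one time step maps the state vector x(t) to x(t + 1) by thresholding the vector of synaptic input counts.

```python
import numpy as np

def step(C, x, theta):
    """One time step of the McCulloch-Pitts dynamics.

    C     : (N, N) binary matrix; C[k2, k1] = 1 means an activated
            synapse from neuron k1 to neuron k2 (silent or absent
            synapses are both 0).
    x     : (N,) binary state vector at time t.
    theta : firing threshold (> 0).

    A neuron fires at t + 1 iff the number of its active presynaptic
    partners with an activated synapse reaches the threshold.
    """
    synaptic_input = C @ x                      # integer input counts
    return (synaptic_input >= theta).astype(x.dtype)
```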
2.2 Patterns and Sequences. A pattern or event is defined as a binary N-vector ξ where M elements of ξ are 1 and N − M elements are 0. The network is in the state ξ at time t if x(t) = ξ. An ordered series of events is called a sequence. A minimal sequence is defined as a series of two events, say, a cue pattern ξ^A preceding a target pattern ξ^B. The minimal sequence ξ^A → ξ^B is said to be stored in the network if initialization with the cue x(t) ≈ ξ^A at time t leads to the target x(t + 1) ≈ ξ^B one time step later. Typically, the network only approximately recalls or replays the events of a sequence (see section 4). Sequences of arbitrary length, denoted by Q ≥ 1, are obtained by concatenating minimal sequences of length Q = 1. In the next section, we specify how to set up the connectivity such that a recurrent network can act as a memory for sequences.

3 Embedding Sequences and Storage Capacity

For a minimal sequence ξ^A → ξ^B to be stored in the network, one requires an above-average connectivity through activated synapses from the cells that are active in the cue ξ^A to those that are supposed to be active during the recall of the target ξ^B. In what follows, we assume that all morphological synapses from neurons active in the cue pattern to cells active in the target pattern are switched on and none of them is silent. Such a network can be constructed similarly to the one in Willshaw et al., 1969 (see also Nadal & Toulouse, 1990, and Buckingham & Willshaw, 1993). Let us therefore consider a randomly connected network: the probability of having a morphological synapse from one neuron to another one is c_m. Beginning with all synapses being in the silent state, one randomly defines pairs of patterns that are to be connected into minimal sequences. Then one converts those silent synapses into active ones that connect the M active neurons in a cue pattern to the M active neurons in the corresponding target pattern. Imprinting of sequences stops when the overall connectivity through activated synapses reaches the value c; that is, the total number of activated synapses in the network attains a value of cN^2. (A code sketch of this imprinting procedure is given below, after equation 3.1.)

3.1 Capacity of Sequential Memory. Let us now address the question of how many sequences can be concurrently stored using the above algorithm for a given mean connectivity c and morphological connectivity c_m > c. In so doing, we define the capacity α of sequential memory as the maximum number P of minimal sequences that can be stored, normalized by the number c_m N = (1 + r)cN of morphological synapses per neuron,

    α := P/(c_m N).    (3.1)
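Before estimating P, here is the promised sketch of the imprinting procedure (Python/numpy; helper names and parameter values are ours and merely illustrative): starting from a random morphological skeleton with all synapses silent, each stored minimal sequence switches on every morphological synapse from an active cue cell to an active target cell, until the mean activated connectivity reaches c.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pattern(N, M):
    """Binary N-vector with exactly M ones."""
    xi = np.zeros(N, dtype=np.uint8)
    xi[rng.choice(N, size=M, replace=False)] = 1
    return xi

def imprint(morph, C, cue, target):
    """Activate all morphological synapses from active cue cells
    (columns) to active target cells (rows)."""
    post, pre = np.flatnonzero(target), np.flatnonzero(cue)
    C[np.ix_(post, pre)] = morph[np.ix_(post, pre)]

N, M, c, r = 2000, 100, 0.05, 1.0
c_m = c * (1 + r)
morph = (rng.random((N, N)) < c_m).astype(np.uint8)  # morphological synapses
C = np.zeros_like(morph)                             # all synapses silent
P = 0
while C.mean() < c:                                  # stop at connectivity c
    imprint(morph, C, random_pattern(N, M), random_pattern(N, M))
    P += 1
```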
The number P of minimal sequences that can be stored is assessed by extending the classical derivation of Willshaw et al. (1969). Suppose that we
have two groups of M cells that should be linked into a minimal sequence. For each morphological synapse in the network, the probability that the presynaptic neuron is active in the cue pattern is M/N, and the probability that the postsynaptic neuron is active in the target pattern is also M/N. Then the probability that a synapse is not involved in this specific minimal sequence is 1 − M^2/N^2. Given P stored minimal sequences, the probability that a synapse does not contribute to any of those sequences is [1 − M^2/N^2]^P, and therefore the probability of a synapse being in a nonsilent state is C = 1 − [1 − M^2/N^2]^P. For a mean connectivity c, on the other hand, the probability C also equals the ratio between the number cN^2 of activated synapses and the total number c_m N^2 of synapses in the network: C = c/c_m. Combining the two approaches, we can derive P for any given pair of connectivities c and c_m = c(1 + r) and find

    α = log(1 − c/c_m) / [c_m N log(1 − M^2/N^2)].    (3.2)
Equation 3.2 is valid for all biologically reasonable choices of M, c, and c_m and also can account for nonorthogonal patterns, as in Willshaw et al. (1969). A somewhat simpler expression for α can be obtained in the case M/N ≪ 1. Independent of specific values of c and c_m, we can expand [1 − M^2/N^2]^P ≈ 1 − P M^2/N^2 to end up with

    α = cN/(c_m^2 M^2)    for 1 ≪ M ≪ N.    (3.3)
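Both expressions are easy to evaluate numerically; a minimal sketch (our function names) computes the exact capacity of equation 3.2 next to the sparse-coding approximation of equation 3.3.

```python
import numpy as np

def alpha_exact(c, c_m, M, N):
    """Capacity per morphological synapse, equation 3.2."""
    return np.log(1 - c / c_m) / (c_m * N * np.log(1 - M**2 / N**2))

def alpha_sparse(c, c_m, M, N):
    """Approximation of equation 3.3, for 1 << M << N."""
    return c * N / (c_m**2 * M**2)

c, r, M, N = 0.05, 1.0, 1600, 100_000
print(alpha_exact(c, c * (1 + r), M, N), alpha_sparse(c, c * (1 + r), M, N))
```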
Equation 3.3 can also be interpreted through a different way of estimating the number P of minimal sequences that can be stored: P roughly equals the ratio between the total number cN^2 of activated synapses in the network and the number c_m M^2 of activated synapses that link two patterns: P = cN^2/(c_m M^2). This estimate, however, requires that different patterns are represented by different groups of neurons; there is no overlap between the patterns, which is an excellent approximation for sparsely coded patterns, M/N ≪ 1.

Equations 3.2 and 3.3 for the capacity α of sequential memory, however, do not tell us whether embedded sequences can actually be replayed. In the next section, we therefore introduce a method to quantify sequence replay.

4 Replaying Sequences

We consider a sequence as being stored in the network if and only if it can be replayed at a given quality. In order to be able to efficiently simulate replay in large networks, this section introduces a probabilistic Markovian
dynamics that approximates the deterministic cellular simulations well. Finally, we define a measure to quantify the quality of sequence replay.

4.1 Capacity and Dynamical Stability. Let us design a network and patterns such that the number of sequences that can be concurrently stored is as large as possible. From equations 3.2 and 3.3 we see that the capacity α is maximized if the pattern size M is as small as possible. However, M cannot be arbitrarily small, which will be illustrated below and explained in detail in section 5.

Examples of how sequence replay depends on network parameters are illustrated by simulations of a network of N = 100,000 McCulloch-Pitts units at low connectivities c = c_s = 0.05. The choice r = c_s/c = 1 roughly resembles values experimentally obtained by Nusser et al. (1998), Montgomery et al. (2001), and Isope and Barbour (2002). Minimal sequences have been concatenated so that nonminimal sequences [ξ^0 → ξ^1 → . . . → ξ^Q] of length Q = 20 are obtained. In the simulations, the network is always initialized with the cue pattern ξ^0 at time step 0. The replay of nonminimal sequences at times t > 0 is then indicated through two order parameters: the number of correctly activated neurons (hits), m_t := x(t) · ξ^t, and the number of incorrectly activated neurons (false alarms), n_t := x(t) · (1 − ξ^t), where 1 = [1, . . . , 1]^T and the symbol '·' denotes the standard dot product.

Figure 1 shows sequence replay in cellular simulations for two different pattern sizes (M = 800 and M = 1600). Sequence replay crucially depends on the value of the firing threshold θ. In general, if the threshold is too high, the network becomes silent after a few iterations. If the threshold is too low, the whole network becomes activated within a few time steps. Whether there exist values of θ at which a sequence can be fully replayed, however, also critically depends on the pattern size M. At a small pattern size of M = 800, there is no such firing threshold, whereas for a pattern size M = 1600, there is a whole range of thresholds that allow replaying the full sequence. So there is a conflict between the maximization of the capacity of the network, which requires M to be small, and the dynamical stability of replay, which becomes more robust for larger values of M (cf. section 7).

In section 5, we will derive a lower bound for the pattern size below which replay is impossible, and we also determine the respective firing threshold. In connection with equation 3.2, these results enable us to calculate the maximum number of sequences that can be simultaneously stored in a recurrent network such that all stored sequences can be replayed. These calculations require a simultaneous optimization of pattern size M and threshold θ. A numerical treatment as shown in Figure 1, however, is infeasible for much larger networks. Therefore, we first introduce a numerically less costly approach.
Figure 1: Stability-capacity conflict. Sequence replay critically depends on both the firing threshold θ and the pattern size M. In all graphs, we show the fraction m_t/M of hits (disks) at time step t and the fraction n_t/(N − M) of false alarms (crosses) during the replay of a nonminimal sequence of length Q = 20. The network consists of N = 10^5 McCulloch-Pitts neurons with a mean connectivity of activated synapses of c = 5%. The ratio of silent and activated synapses is r = 1. (A) For a pattern size M = 800, full replay is impossible. For high thresholds θ ≥ 64, the sequence dies out, whereas for low thresholds θ ≤ 63, the network activity explodes. (B) For a pattern size of M = 1600, sequence replay is possible for a broad range of thresholds θ between 114 and 133.
4.2 Markovian Dynamics. Assessing the dynamics of large networks of neurons by means of computer simulations is mainly constrained by the amount of accessible memory. Simulations of a network of N = 10^5 cells with a connectivity of about c = 5%, as the ones shown in Figure 1, require about 2 GB of computer memory. A doubling of neurons would therefore result in 8 GB and is thus already close to the limit of today's conventional computing facilities. Networks with more than 10^6 cells that need at least 200 GB are very inconvenient. It is therefore reasonable to follow a different approach for investigating scaling laws of sequential memory.

To be able to simulate pattern replay in large networks, we reduce the dynamical degrees of freedom of the network to the two order parameters defined in the previous section: the number m_t of correctly activated neurons (hits) and the number n_t of incorrectly activated neurons (false alarms) at time t (see also Figure 2A). Furthermore, we take advantage of the fact that the network dynamics has only a one-step memory and thus reduce the full network dynamics to a discrete probabilistic dynamics governed by a transition matrix T (Nowotny & Huerta, 2003; Gutfreund & Mézard, 1988).
Figure 2: Pattern size and connectivity matrix. (A) At some time t, the network is assumed to be associated with a specific event ξ^t = ξ^A of size M. We therefore divide the network of N neurons into two parts. The first part consists of the M neurons that are supposed to be active if an event ξ^t is perfectly represented. The second part contains the N − M neurons that are supposed to be inactive. The quantities m_t (hits) and n_t (false alarms) denote the number of active neurons in the two groups at time t. (B) The number m_{t+1} of hits and the number n_{t+1} of false alarms with respect to pattern ξ^{t+1} = ξ^B at time t + 1 are determined by the state of the network at time t, x(t) = ξ^A, and the connectivity matrix C. The average number of synaptic connections between the four groups of cells is described by the reduced connectivity matrix (c_11, c_10; c_01, c_00), which is defined in section 4.2.1.
The transition matrix is defined as the conditional probability T(m_{t+1}, n_{t+1}|m_t, n_t) that a network state (m_{t+1}, n_{t+1}) follows the state (m_t, n_t). We note that due to this probabilistic interpretation, the dynamics of (m_t, n_t) is stochastic, although single units behave deterministically. More precisely, we derive a dynamics for the probability distribution of (m_t, n_t). How to interpret expectation values with respect to this distribution is specified next.

4.2.1 Reduced Connectivity Matrix. In the limit of a large pattern size M, the connectivities c and c_m can be interpreted as probabilities of having synaptic connections; in other words, the probability that in the embedded sequence ξ^A → ξ^B there is an activated synapse from a cell active in ξ^A to a cell active in ξ^B is c_m. This probabilistic interpretation can be formalized by means of a reduction of the binary connectivity matrix C to four mean connectivities (c_11, c_10; c_01, c_00), which are average values over all P minimal sequences stored (see also Figure 2B). First, we define the mean connectivity c_11 between neurons that
are supposed to be active in cue patterns and those that are supposed to be active in their corresponding targets,

    c_11 = (1/P) Σ_{{A,B}} (1/M^2) Σ_{k,k′=1}^{N} ξ_k^A C_{k′k} ξ_{k′}^B.    (4.1)
Here the sum over {A, B} is meant to be taken over the cue-target pairs of the P different minimal sequences. By construction (see section 3), c_11 is at its maximum c_m. Second, the connectivity c_10 describes activated synapses from cells that are active in cue patterns to cells that are supposed to be inactive in target patterns. Similarly, the mean connectivity c_01 describes activated synapses from neurons that are supposed to be inactive in the cue to those that should be active in the target pattern. Finally, c_00 denotes the mean connectivity between cells that are to be silent in both the cue and the target. The four mean connectivities are summarized in the reduced connectivity matrix (c_11, c_10; c_01, c_00) (see also Figure 2B). The interpretation of the mean connectivities as probabilities of having activated synaptic connections between two neurons can be considered as the assumption of binomial statistics. This assumption is a good approximation for Willshaw-type networks in the limit N ≫ M ≫ 1 (Buckingham & Willshaw, 1992).

Cues and targets of minimal sequences are assumed to be linked as tightly as possible, which results in c_11 = c_m = c(1 + r). The remaining three entries of the reduced connectivity matrix follow from normalization conditions: since every active neuron in a target pattern, for example, ξ^B, receives, on average, cN activated synapses, and those synapses originate from two different groups of neurons in a cue pattern, for example, ξ^A, we have cN = c_11 M + c_01 (N − M). Similarly, every inactive neuron in the target pattern receives, on average, cN = c_10 M + c_00 (N − M) activated synapses. As a consequence of recurrence, every neuron of a cue pattern projects, on average, to cN postsynaptic neurons. From that we obtain two similar conditions with c_10 and c_01 interchanged and thus c_10 = c_01. All entries of the reduced connectivity matrix can therefore be expressed in terms of the mean connectivity c, the ratio r of silent and nonsilent connectivities, the pattern size M, and the network size N,
    ( c_11  c_10 )        ( 1 + r            1 − r M/(N−M)     )
    ( c_01  c_00 )  =  c  ( 1 − r M/(N−M)    1 + r M^2/(N−M)^2 ).    (4.2)
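A short numerical sketch of equation 4.2 (Python/numpy; the function name is ours), including a check of the two normalization conditions derived above:

```python
import numpy as np

def reduced_connectivity(c, r, M, N):
    """Reduced connectivity matrix [[c11, c10], [c01, c00]] of eq. 4.2."""
    f = M / (N - M)
    return c * np.array([[1 + r,     1 - r * f],
                         [1 - r * f, 1 + r * f**2]])

c, r, M, N = 0.05, 1.0, 1600, 100_000
(c11, c10), (c01, c00) = reduced_connectivity(c, r, M, N)
# Every neuron receives, on average, cN activated synapses:
assert np.isclose(c11 * M + c01 * (N - M), c * N)
assert np.isclose(c10 * M + c00 * (N - M), c * N)
```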
The assumption of binomial statistics together with the reduced connectivity matrix enables us to specify the transition matrix T as it was defined at the beginning of section 4.2.

Calculation of the capacity α for an arbitrary connectivity c_11 between cue and target patterns, that is, c < c_11 < c_m, is somewhat more involved than
in the case of section 3, where patterns were connected with the maximum morphological connectivity c_11 = c_m. The scenario c < c_11 < c_m is outlined in appendix B. For 1 ≪ M ≪ N, however, equation 3.3 with c_m replaced by c_11 turns out to be an excellent approximation.

4.2.2 Transition Matrix. Due to statistical independence of the activation of different postsynaptic neurons, the transition matrix can be separated,

    T(m_{t+1}, n_{t+1}|m_t, n_t) = p(m_{t+1}|m_t, n_t) q(n_{t+1}|m_t, n_t),    (4.3)
where p(m_{t+1}|m_t, n_t) is the probability that at time t + 1 a number of m_{t+1} cells are correctly activated, and q(n_{t+1}|m_t, n_t) is the probability of having n_{t+1} cells incorrectly active, given m_t and n_t. Defining the binomial probability

    b_{j,l}(x) = \binom{l}{j} x^j (1 − x)^{l−j},    (4.4)
with 0 ≤ x ≤ 1 and 0 ≤ j ≤ l, we obtain

    p(m_{t+1}|m_t, n_t) = b_{m_{t+1},M}(ρ_{m_t n_t})  and  q(n_{t+1}|m_t, n_t) = b_{n_{t+1},N−M}(λ_{m_t n_t}),    (4.5)

with ρ_{m_t n_t} and λ_{m_t n_t} denoting the probabilities of correct (ρ) and incorrect (λ) activation of a single cell, respectively. Both are specified by the reduced connectivity matrix (c_11, c_10; c_01, c_00) and the firing threshold θ,

    ρ_{m_t n_t} = Σ_{j,k; j+k≥θ} b_{j,M}((m_t/M) c_11) b_{k,N−M}((n_t/(N−M)) c_01),    (4.6)

    λ_{m_t n_t} = Σ_{j,k; j+k≥θ} b_{j,M}((m_t/M) c_10) b_{k,N−M}((n_t/(N−M)) c_00).    (4.7)
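Equations 4.5 to 4.7 translate directly into a simulation of the reduced dynamics. The following sketch (Python with numpy/scipy; function names are ours, and collapsing the double sum to a sum over j via the binomial survival function is our implementation choice) evaluates the single-cell activation probabilities and then samples trajectories (m_t, n_t) instead of propagating the full distribution. It uses the matrix format returned by reduced_connectivity above.

```python
import numpy as np
from scipy.stats import binom

def activation_prob(m, n, c_hit, c_false, M, N, theta):
    """P(j + k >= theta) with j ~ Binom(M, m*c_hit/M) and
    k ~ Binom(N-M, n*c_false/(N-M)); equations 4.6 and 4.7."""
    j = np.arange(M + 1)
    pmf_j = binom.pmf(j, M, m * c_hit / M)
    tail_k = binom.sf(theta - j - 1, N - M, n * c_false / (N - M))
    return float(np.sum(pmf_j * tail_k))

def sample_replay(M, N, theta, C_red, Q, rng):
    """One stochastic trajectory of the Markovian dynamics (eq. 4.5),
    started from the perfect cue (m_0, n_0) = (M, 0)."""
    (c11, c10), (c01, c00) = C_red
    m, n = M, 0
    for _ in range(Q):
        rho = activation_prob(m, n, c11, c01, M, N, theta)
        lam = activation_prob(m, n, c10, c00, M, N, theta)
        m = rng.binomial(M, rho)
        n = rng.binomial(N - M, lam)
    return m, n
```

Repeating sample_replay many times estimates the distribution of (m_Q, n_Q); the quality measure introduced in section 4.3 then decides whether replay counts as successful.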
Equations 4.6 and 4.7 can be understood as adding up the probabilities of all combinations of the number j of hits and the number k of false alarms that together cross the firing threshold θ.

The transition matrix T gives rise to probability distributions 𝒫_t for the number m_t of hits and the number n_t of false alarms. To be able to compare the Markovian dynamics with the network dynamics obtained from cellular simulations (see Figure 1), we calculate the expectation values ⟨m_t⟩ and ⟨n_t⟩ of hits and false alarms with respect to the probability distribution 𝒫_t for
t ≥ 1:

    ⟨m_t⟩ = Σ_{µ=1}^{M} Σ_{ν=1}^{N−M} µ 𝒫_t(µ, ν|m_0, n_0),    (4.8)

    ⟨n_t⟩ = Σ_{µ=1}^{M} Σ_{ν=1}^{N−M} ν 𝒫_t(µ, ν|m_0, n_0),    (4.9)
where

    𝒫_t(m_t, n_t|m_0, n_0) = Σ_{{(m_1,n_1)}} · · · Σ_{{(m_{t−1},n_{t−1})}} Π_{τ=1}^{t} T(m_τ, n_τ|m_{τ−1}, n_{τ−1})    (4.10)
is the probability of having m_t hits and n_t false alarms at some time t, given that the network has been initialized with m_0 = M hits and n_0 = 0 false alarms at time zero. Equation 4.10 can be derived from the recursive formula 𝒫_t(·|·) = Σ_{(·)} T(·|·) 𝒫_{t−1}(·|·), and the sums in equation 4.10 are meant to be over all pairs (m_τ, n_τ) ∈ {0, . . . , M} ⊗ {0, . . . , N − M}, for 1 ≤ τ ≤ t − 1.

An increase in numerical efficiency is gained from the fact that sums over binomial probabilities can be evaluated by means of the incomplete beta function (Press, Flannery, Teukolsky, & Vetterling, 1992). Moreover, numerical treatment of the Markovian dynamics can take advantage of the separability of T = p q (see equation 4.3). But still, for large N, computing and multiplying p and q in full is costly. We therefore reduced p and q to at most 5000 interpolation points, where each of them is assigned to the same portion of probability. The reduced vectors are then used to calculate an iteration step t → t + 1. Numerical results provided are thus estimates in the above sense and serve as approximations to the full Markov dynamics.

Figure 3 shows a numerical evaluation of the Markovian dynamics for the same parameter regime as used for the cellular simulations in Figure 1. We observe a qualitative agreement between the two approaches but also small differences regarding the upper and lower bounds for the set of firing thresholds allowing stable sequence replay. A further comparison is postponed to section 5.
Figure 3: Stability-capacity conflict for Markovian network dynamics. The expected fraction of hits ⟨m_t⟩/M (disks) and false alarms ⟨n_t⟩/(N − M) (crosses) are plotted as a function of time t after the network has been initialized with the cue pattern at t = 0. The parameters N = 10^5, c = 5%, r = 1, and Q = 20 are the same as in Figure 1. (A) For a pattern size M = 800, full replay is impossible. For high firing thresholds θ ≥ 64, the sequence dies out, whereas for low thresholds θ ≤ 63, the network activity explodes, which is identical to Figure 1 although the time courses of hits and false alarms slightly differ. (B) For M = 1600, sequence replay is possible for thresholds 112 ≤ θ ≤ 133, whereas in Figure 1, we have obtained 114 ≤ θ ≤ 133.
4.3 Quality of Replay and Detection Criterion. In the examples shown in Figures 1 and 3, the quality of sequence replay at a certain time step is obvious because we typically have to distinguish among only three scenarios: (1) all neurons are silent, (2) all neurons are active, and (3) a pattern is properly represented. If, however, the network exhibits intermediate states, one needs a quantitative measure of whether a particular sequence is actually replayed. For this purpose, we specify the quality Γ at which single patterns ξ^t are represented by the actual network state x(t). We consider Γ to be a function of the numbers m_t and n_t of hits and false alarms, respectively (see section 4.2). The quality function

    Γ(m_t, n_t) := m_t/M − n_t/(N − M)    (4.11)

is chosen such that a perfect representation of a pattern is indicated by Γ = 1. Random activation of the network, on the other hand, yields |Γ| ≪ 1 in the generic scenario 1 ≪ M ≪ N. The quality function weighs hits much more strongly than false alarms, similar to the so-called "normalized winner-take-all" recall as introduced by Graham and Willshaw (1997) or Maravall (1999). Equation 4.11 is physiologically inspired by a downstream neuron receiving excitation from hits and inhibition from false alarms.
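In code, the quality measure and the detection criterion defined next are one-liners; a minimal sketch (our names; Γ is rendered as quality):

```python
def quality(m, n, M, N):
    """Quality of replay, equation 4.11: hits weighted against false alarms."""
    return m / M - n / (N - M)

def detected(m, n, M, N, gamma):
    """Detection criterion of equation 4.12: quality at least gamma."""
    return quality(m, n, M, N) >= gamma
```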
We say a pattern is replayed correctly at time t if the detection criterion

    Γ(m_t, n_t) ≥ γ    (4.12)
is satisfied, where γ denotes the threshold of detection. A sequence of Q patterns is said to be replayed if the final target pattern in the Qth time step is correctly represented: Γ(m_Q, n_Q) ≥ γ. Here, we implicitly assume that all the patterns of a sequence are represented at least as properly as the last one. Similar to equation 4.12, we specify a detection criterion for sequence replay approximated by the Markovian dynamics,

    Γ̄(m_Q, n_Q) := Γ(⟨m_Q⟩, ⟨n_Q⟩) ≥ γ̄,    (4.13)
where the expectation values ⟨m_Q⟩ and ⟨n_Q⟩ are obtained from the Q-times iterated transition matrix T^Q for the initial condition (m_0, n_0) = (M, 0). The criteria 4.12 and 4.13 are obviously different. For 1 ≪ M ≪ N, however, they are almost equivalent with γ̄ ≈ γ because the distribution of the quality measure Γ is typically unimodal and sharply peaked, with variance below 1/(4M) + 1/[4(N − M)]. Moreover, we will see in the next section that the specific value of the detection threshold does not affect scaling laws for sequential memory.

5 Scaling Laws for Minimal Sequences

The capacity α of sequential memory is proportional to M^{−2} (see equation 3.3). In order to maximize α, one therefore seeks a minimal pattern size M at which the replay of sequences serves a given detection threshold γ. In this section, we assess this minimal pattern size for minimal sequences (Q = 1) and sparse patterns (1 ≪ M ≪ N). In particular, we explain why the minimal pattern size is independent of the network size N.

In the case 1 ≪ M ≪ N, the reduced connectivity matrix in equation 4.2 can be approximated through

    ( c_11  c_10 )        ( 1 + r   1 )
    ( c_01  c_00 )  ≈  c  ( 1       1 );

neurons that are active in cue patterns are connected to neurons that should be active in target patterns with probability c_m = c(1 + r). Otherwise, the connectivity is about c (see Figure 2).

5.1 Hits and False Alarms in Pattern Recall. At some time t, only those M neurons are supposed to be active that belong to the cue pattern ξ^A. We then require a particular minimal sequence ξ^A → ξ^B to be imprinted such that at time t + 1 event ξ^B is recalled. We have assumed that the number j of inputs to each of the M "on" neurons that should be active at time t + 1 is binomially distributed with probability distribution b_{j,M}(c + c_s) (see equation 4.4). In the same way, the input distribution for the N − M "off" cells that should be inactive at time t + 1 is b_{j,M}(c).
Figure 4: Mean quality Γ̄ of replay and threshold parameters κ_+ and κ_−. (A) Probability density of the number of synaptic inputs for "off" units, which are supposed to be inactive during the recall of a target pattern. The vertical dashed line indicates the firing threshold θ. The gray area represents the probability ⟨n⟩/(N − M) of having a false alarm. (B) Same as in A but for "on" units, which are supposed to be active. The gray area represents the probability ⟨m⟩/M of having a hit. (C) For 1 ≪ M ≪ N, the binomial distributions in A and B can be approximated by normal distributions. The probability of hits minus that of false alarms equals the gray area under the normal distribution between −κ_− and κ_+. From equation 5.3, we see that this area can also be interpreted as the mean quality Γ̄ of replay. (D) Pattern size M as a function of κ_+ for different replay qualities Γ̄ at constant r = 1 and c = 0.01; see equations 5.1 and 5.4. The dashed line connects the minima of M.
As a result, a neuron that is supposed to be inactive receives, on average, input through cM activated synapses, with a standard deviation √(c(1 − c)M). To avoid unintended firing, we require a firing threshold θ that is somewhat larger than cM. The larger the threshold, the more noise-induced firing due to fluctuations in the number of synapses is suppressed. Let us take a threshold θ = cM + κ_+ √(c(1 − c)M), where κ_+ is a threshold parameter that determines the number of incorrectly activated neurons (Brunel et al., 2004), called false alarms. For κ_+ = 1, for example, we have ⟨n_{t+1}⟩ ≈ 0.16 (N − M) false alarms (see Figure 4A).

On the other hand, the threshold θ has to be small enough so that a neuron that is supposed to be active during event ξ^B is indeed activated. Each of these neurons receives, on average,
c_m M inputs, with standard deviation √(c_m(1 − c_m)M). A recall of ξ^B is therefore achieved by a threshold that is somewhat smaller than c_m M, that is, θ = c_m M − κ_− √(c_m(1 − c_m)M), where κ_− is another threshold parameter that determines the number of correctly activated neurons, called hits. For κ_− = 2, for example, we have ⟨m_{t+1}⟩ ≈ 0.98 M hits (see Figure 4B).

The firing threshold θ is assumed to be the same for all neurons. Hence, combining the above two conditions, we find cM + κ_+ √(cM(1 − c)) = c_m M − κ_− √(c_m M(1 − c_m)), which then leads to expressions for the pattern size

    M = (1/c) [ (κ_+ √(1 − c) + κ_− √((r + 1)(1 − c(1 + r)))) / r ]^2    (5.1)
and the firing threshold

    θ = cM + κ_+ √(c(1 − c)M).    (5.2)
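Equations 5.1 and 5.2 are straightforward to evaluate; a short sketch (our function names), reused for the optimization of section 5.2 below:

```python
import numpy as np

def pattern_size(kappa_p, kappa_m, r, c):
    """Pattern size M of equation 5.1."""
    num = kappa_p * np.sqrt(1 - c) \
        + kappa_m * np.sqrt((r + 1) * (1 - c * (1 + r)))
    return (num / r) ** 2 / c

def firing_threshold(M, kappa_p, c):
    """Firing threshold theta of equation 5.2."""
    return c * M + kappa_p * np.sqrt(c * (1 - c) * M)
```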
The pattern size M in equation 5.1 is independent of the network size N and scales like c^{−1} for small values of c. Moreover, the firing threshold θ in equation 5.2 is independent of the network size N. For small mean connectivities c, the firing threshold θ is also independent of c. We emphasize that the validity of these scaling laws requires an almost perfect initialization of the cue pattern.

5.2 Optimal Pattern Size and Optimal Firing Threshold. We now argue that the firing threshold parameters κ_+ and κ_− in equation 5.1 can be chosen such that M is minimal and, hence, the storage capacity is maximal. As indicated by the gray areas of the binomial distributions in Figures 4A and 4B, κ_+ and κ_− determine the mean numbers of false alarms ⟨n⟩ and hits ⟨m⟩, respectively. For 1 ≪ M ≪ N, these binomial distributions are well approximated by gaussians, and we have ⟨n⟩/(N − M) = [1 − erf(κ_+/√2)]/2 and ⟨m⟩/M = [erf(κ_−/√2) + 1]/2, where the error function erf(x) := (2/√π) ∫_0^x dt exp(−t^2) is the cumulative distribution of a gaussian. These approximations yield

    Γ̄(m, n) = [erf(κ_−/√2) + erf(κ_+/√2)]/2,    (5.3)
which can be interpreted as the area under a normal distribution between −κ_− and +κ_+ (see Figure 4C).

From equation 5.3, we see that for a given mean quality Γ̄ of replay, the threshold parameters κ_+ and κ_− are not independent. More precisely,
for some given detection criterion γ = Γ̄ and κ_+ > √2 erf^{−1}(2γ − 1), equation 5.3 yields

    κ_− = √2 erf^{−1}[2γ − erf(κ_+/√2)].    (5.4)
For fixed Γ̄ = γ one therefore can choose κ_+ in equation 5.1 such that the pattern size M becomes minimal. At this minimal pattern size, the capacity α in equation 3.3 reaches its maximum, and encoding of events is as sparse as possible. Let us therefore call this minimum value of M the optimal pattern size M_opt for sequential memory. The dashed line in Figure 4D indicates that M_opt := min_{κ_+} M is located at values κ_+ ≈ 1. We also observe that the larger the detection threshold γ, the larger is M_opt. From equation 5.1, we find that for small connectivities c ≪ 1, as they occur in biological networks, the minimum pattern size M_opt can be phrased as
    M_opt = (1/c) [M(r, γ) + O(c)],    (5.5)
where M(r, γ) is a function of r and γ that has to be obtained by numerical minimization. Here, the order function O(c^k) is defined through lim_{c→0} c^{−k} O(c^k) = const. for k > 0. At values r = 1 and γ = 0.7, for example, we have M = 6.1, that is, M_opt ≈ 6.1/c, corroborating the scaling law M_opt ∝ c^{−1}.

For an optimal pattern size M_opt, we can find the optimal firing threshold θ_opt from equation 5.2. In first approximation, θ_opt is independent of the connectivity c and the network size N, but depends on r and γ,

    θ_opt = T(r, γ) + O(c).    (5.6)

For example, r = 1 and γ = 0.7 account for θ_opt ≈ 9.1. The dependencies of M_opt and θ_opt on the connectivity c are indicated in Figure 5 through solid lines. Both M_opt and θ_opt increase with increasing detection threshold γ. These mean field results match numerical simulations well: in cellular network simulations (open circles in Figure 5), M_opt and θ_opt were determined as the minimal M and the corresponding θ that account for replay at a fixed detection threshold γ = 0.5. The numerical evaluation of the Markovian network dynamics as defined in section 4.2 (filled symbols in Figure 5) confirms the analytical results for a wider range of c and γ.
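The numerical minimization behind equations 5.5 and 5.6 can be sketched with a coarse grid search (Python/scipy; our names; it reuses pattern_size and firing_threshold from the sketch above, and the grid stands in for a proper one-dimensional minimizer):

```python
import numpy as np
from scipy.special import erf, erfinv

def optimal_pattern_and_threshold(gamma, r, c, grid=2001):
    """Minimize M over kappa_+ with kappa_- tied by equation 5.4;
    returns (M_opt, theta_opt) of equations 5.5 and 5.6."""
    kappa_lo = np.sqrt(2) * erfinv(2 * gamma - 1)     # admissible bound
    kappa_p = np.linspace(kappa_lo + 0.05, kappa_lo + 4.0, grid)
    kappa_m = np.sqrt(2) * erfinv(2 * gamma - erf(kappa_p / np.sqrt(2)))
    M = pattern_size(kappa_p, kappa_m, r, c)
    i = np.argmin(M)
    return M[i], firing_threshold(M[i], kappa_p[i], c)

# For r = 1, gamma = 0.7, c = 0.01 this gives M_opt of order 6/c and
# theta_opt of order 9, consistent with the values quoted in the text.
print(optimal_pattern_and_threshold(0.7, 1.0, 0.01))
```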
Figure 5: Optimal pattern size M_opt and optimal firing threshold θ_opt. Lines depict results from the mean field theory (equations 5.5 and 5.6). We also show numerical simulations (c.s.) of the network introduced in section 4.1 (empty circles, γ = 0.5) and Markovian dynamics defined in section 4.2 (filled symbols, γ = 0.5, 0.7, 0.8, 0.9). (A) For small connectivities c, the optimal pattern size M_opt scales like c^{−1} and increases with increasing γ. (B) The optimal threshold θ_opt is almost independent of the connectivity c for c ≲ 10%, and θ_opt increases with increasing γ. Further parameters: sequence length Q = 1, network size N = 250,000, plasticity resources r = 1. For the Markovian dynamics, we used Brent's method (Press et al., 1992) to numerically find a firing threshold θ as a root of the implicit equation ⟨m_Q⟩/M − ⟨n_Q⟩/(N − M) = γ, which is the detection criterion. By subsequently reducing M, we end up with a minimal value M_opt for which the detection criterion Γ̄ = γ can be fulfilled. The threshold root that is obtained at M_opt is called θ_opt.
The lower bound M_opt for the pattern size in equation 5.5 enables us to determine an upper bound for the capacity α of sequential memory. Combining equations 3.3 and 5.5, we find

    α = cN / [(1 + r)^2 M(r, γ)^2] + O(c^2).    (5.7)
We note that α is linear in the connectivity c and the network size N, decreases with increasing γ, and has a nontrivial dependency on the plasticity resources r that will be evaluated below. This scaling law for minimal sequences can now be used to study the storage of sequences in biologically feasible networks that face certain constraints.

6 Constrained Sequence Capacity

Biological and computational networks generally face certain constraints. Those constraints can lead to limiting values and interdependencies of the network parameters c, N, and r. Some constraints and their implications on
the optimization of the capacity α of sequential memory in equation 5.7 are discussed in this section.

6.1 Limited Number of Synapses per Neuron. A biological neuron may have a limited number cN of synapses. If cN is constant, we find from equation 5.7 (for constant r and γ)

    α = const.  and  P = const.
Increasing the capacity α therefore cannot be achieved by increasing N. Numerical results in Figure 6A (symbols) confirm this behavior for c ≪ 1.
Figure 6: Influence of constraints on the optimal pattern size M_opt and the capacity α of sequential memory. (A) Synapses-per-neuron constraint. For a fixed number cN of synapses per neuron, we find M_opt ∝ N and α = const. as N → ∞. The capacity α increases with increasing cN. Tilted solid lines connect symbols that refer to constant connectivities, for example, c = 0.01, 0.1, 0.25, 0.5. (B) Synapses-per-network constraint. For a fixed number cN^2 of synapses in the network, we find M_opt ∝ N^2 and α ∝ N^{−1} as N → ∞. The capacity α increases with increasing cN^2. For both constraints, cN = const. and cN^2 = const., there is an optimal network size at which the capacity α reaches its maximum. For r = 1 this maximum occurs at c ≈ 0.5. A further increase in c is impossible since the morphological connectivity c_m = c(1 + r) cannot exceed 1. Other parameters are Q = 1 and γ = 0.7. Dotted lines link symbols and are not obtained by mean field theory.
For r = 1, the capacity α reaches its maximum at c ≈ 0.5, where we have c_m = c(1 + r) = 1, and the network can be considered an undiluted Willshaw one. For c → 0.5, the scaling law α = const. (solid line) underestimates the storage capacity because the O(c^2) term in the mean field equation 5.7 has been neglected.

In biologically relevant networks, we typically have c ≪ 1, and thus, for cN = const., we face the scaling law P = α(1 + r)cN = const. Therefore, the number cN of synapses a single neuron can support fully determines the network's computational power for replaying sequences, in the sense that adding more neurons to the network does not increase α or P.

In the CA3 region of the rat hippocampus, for example, we have c_m N ≈ 12,000 recurrent synapses at each pyramidal cell (see Urban, Henze, & Barrionuevo, 2001, for a review). The network size of CA3 is N ≈ 2.4 · 10^5 (Rapp & Gallagher, 1996). From these numbers, r = 1, and cN = const., we derive the connectivity c ≈ 0.025. A comparison of these numbers with Figure 6A leads to estimates for the minimal pattern size being in the order of M_opt ≈ 200 cells, a storage capacity of α ≈ 15 minimal sequences per synapse at a neuron, and P ≈ 1.8 · 10^5 minimal sequences per CA3 network. The saturation of α and P at about N = 10^5 for cN = 6,000 (see Figure 6A) may explain why the CA3 region has relatively few neurons (N ≲ 10^6 in humans) despite its seminal importance for episodic memory.

6.2 Limited Number of Synapses in the Network. Numerical simulations of artificial networks are constrained by the available computer memory, which limits the number cN^2 of activated synapses in the network. For cN^2 = const. we find from equation 5.7 (for constant r and γ)

    α ∝ N^{−1}  and  P ∝ N^{−2}.
Therefore an increase in both α and P can be achieved only by reducing the network size N at the expense of increasing the connectivity c. Numerical results in Figure 6B confirm this behavior for c ≪ 1. The capacity α increases with increasing c and, for r = 1, assumes its maximum at the upper bound c = 0.5 when c_m = 1. For c → 0.5, the scaling law α ∝ N^{−1} (solid line) underestimates the storage capacity, similar to Figure 6A. We conclude that computer simulations of neural networks with constant cN^2 perform worse in storing sequences the more the connectivity resembles the biologically realistic scenario c ≪ 1.

6.3 On the Ratio of Silent and Activated Synapses. In the previous two sections, we have assumed a constant ratio r between the connectivity c_s through silent synapses and the connectivity c through nonsilent synapses. The specific choice r = 1 was motivated by neurophysiological estimates from Nusser et al. (1998) and Montgomery et al. (2001).
Figure 7: Dependence of sequence replay on the resources r of synaptic plasticity for a constant total number (c + c_s)N of synapses per neuron. Mean field theory (solid lines) explains numerical results obtained from the Markovian dynamics (symbols) well as long as r < 10 and θ_opt ≳ 4. Below θ_opt ≈ 4, the discreteness of θ_opt limits the validity of the mean field theory. (A) The optimal pattern size M_opt (top) decreases with increasing r and saturates at values (c + c_s)M_opt = 1 (symbols). As a result, the capacity α (bottom) increases with r until M_opt has reached its lower bound, and α exhibits a maximum. A further increase in r reduces c but leaves M_opt constant and thus leads to a decrease of α (see equations 3.2 and 3.3). (B) The optimal firing threshold θ_opt decreases with increasing r to its lower bound 1 (symbols). The weak dependence of θ_opt on c_m = c + c_s is indicated by the gray lines. Further parameters for A and B: N = 250,000, γ = 0.7, Q = 1.
We now focus on how the storage capacity α depends on this ratio r, assuming that the total number c_m N of morphological synapses per neuron is constant. Because of c_m = c(1 + r), an increase in r increases c_s but reduces c. We note that this constraint is equivalent to a fixed number cN of activated synapses per neuron for constant r, a scenario evaluated in section 6.1.

For constant (c + c_s)N, numerical results in Figure 7A (symbols) show that the capacity α exhibits a maximum as a function of r. The maximum capacity occurs at a pattern size at which (c + c_s)M_opt = c_11 M_opt = 1, that
is, the association from a cue pattern to an "on" neuron of a target pattern is supported by a single spike, on average. For larger r, the optimal firing threshold θ_opt remains at its minimum value of one (see Figure 7B). An increase of r beyond its optimum reduces c but leaves M_opt constant and thus leads to a decrease of α (see equations 3.2 and 3.3).

Values of r larger than 1 are thus beneficial for good memory performance, in our case for the storage capacity α. Similarly, Brunel et al. (2004) find a high ratio r to be necessary for increasing the signal-to-noise ratio of readout from a perceptron-like structure. These findings raise the question why values of r found in some experiments (Nusser et al., 1998; Montgomery et al., 2001) are in the range r ≈ 1. We suppose that the specific value of r is due to the interplay between the recurrence in the hippocampus and the locality of the synaptic learning rule (see section 9). In contrast, Isope and Barbour (2002) report r ≈ 4 at the cerebellar parallel fibers, which, locally, is a feedforward system.

6.4 Scale-Invariance of Sequential Memory. Given the scaling law α ∝ cN of the storage capacity in equation 5.7, we can ask how the connectivity in a brain region should be set up in order to have scale-invariant sequential memory, which means P ∝ N. From P = α c_m N ∝ c^2 N^2 (see equation 5.7) we then find c ∝ N^{−1/2} or, equivalently, that the total number cN^2 of synapses in the network is proportional to N^{3/2}. Surprisingly, the latter result is in agreement with findings from Stevens (2001; personal communication, 2005) in visual cortex and other brain areas. Thus, an N^{3/2} law for the number of synapses can generate a scalable architecture for associative memory.

To summarize this section, constraints have seminal influence on the scaling laws of the capacity of sequential memory, and different constraints lead to fundamentally different strategies for optimizing the performance of networks for replaying sequences.

7 Nonminimal Sequences

In addition to constraints on intrinsic features of the network like a small connectivity or a limited number of synapses, there are also constraints that may be imposed on a sequence memory device from outside, for example, a fixed detection threshold γ and a nonminimal sequence length Q.

7.1 Finite Sequences (Q > 1). To determine the capacity α for nonminimal sequences Q > 1 in dependence on the network size N, we apply the
Markovian approximation as introduced in section 4.2. As in the case Q = 1, replay of sequences is initialized with an ideal representation of the cue pattern, (m_t, n_t) = (M, 0) for t = 0. Patterns that occur later in the sequence at t ≥ 1, however, are not represented in an ideal way; typically, there is a finite number of false alarms n_t, and the number of hits m_t is generally below M (see also Figure 3). The recall of patterns amid a sequence therefore depends on noisy cues. As a consequence, for Q > 1, we expect that the dependence of the optimal pattern size M_opt and the optimal threshold θ_opt on the network size N differs from that in the case Q = 1.

Assuming a constrained number of synapses per neuron (cN = const.), we nevertheless find that for Q > 1 the dependence of the optimal pattern size M_opt on N is almost linear for large N (see Figure 8A). Accordingly, the capacity α is nearly independent of N (see Figure 8B). Moreover, the optimal firing threshold θ_opt is almost constant for large N (see Figure 8C).
Figure 8: Optimal event size Mopt, capacity α, and optimal firing threshold θopt for nonminimal sequences in networks with a constrained number cₘN = 10,000 of synapses per neuron. (A) The optimal pattern size Mopt increases by about half an order of magnitude between Q = 1 and Q = 4 (three bottom lines) and saturates for Q ≥ 8 (three topmost lines). (B) The capacity α is almost constant for large N and decreases with increasing Q. (C) The optimal firing threshold θopt reflects the dependencies of the optimal pattern size Mopt. Numerical results (symbols) are obtained for γ = 0.7 and r = 1. Graphs for Q → ∞ (lines) are obtained from the mean field equation 7.1.
These results for Q > 1 resemble the ones for Q = 1 shown in Figure 6, some of which are also indicated by disks in Figure 8. One reason for this correspondence is that patterns within a sequence are typically replayed at high quality (see Figures 1 and 3). Figure 8B also shows that α is a decreasing function of Q. We note that α still refers to the number of minimal sequences; the maximum number of stored sequences of length Q is then the Qth fraction of P = α(1 + r)cN. Compared to Q = 1, the storage capacity α drops by about an order of magnitude for Q = 2 and 4 and soon, at Q ≈ 8, arrives at a baseline value for Q → ∞ (solid line) that was obtained from a mean field approximation to be explained below. We thus conclude that nonminimal sequences impose no fundamental limit on the memory capacity for sequences. However, because of its discrete time, our model cannot capture the temporal dispersion of synchronous firing, which may limit the replay of long sequences in biological networks (Diesmann et al., 1999).

7.2 Infinite Sequences (Q → ∞). A beneficial consequence of the weak dependence of the capacity α on the sequence length for Q ≳ 8 is that sequential memory for large Q can be discussed more easily in the limit Q → ∞. Such a discussion requires finding the fixed-point distributions of the transition matrix T defined in equation 4.3. Assuming that the fixed-point distributions for hits m and false alarms n are unimodal, and given N ≫ 1, we can reduce the problem of finding fixed-point distributions of m and n to the much simpler problem of finding fixed points of the mean values ⟨m⟩ and ⟨n⟩. Let us therefore introduce the iterated map,
$$\begin{pmatrix}\langle m_{t+1}\rangle\\ \langle n_{t+1}\rangle\end{pmatrix} = T_{\langle\cdot\rangle}\begin{pmatrix}\langle m_t\rangle\\ \langle n_t\rangle\end{pmatrix}, \tag{7.1}$$
for the mean values of the order parameters. To specify the map T⟨·⟩ in accordance with the Markovian dynamics introduced in section 4.2, we define the mean synaptic inputs to "on" and "off" units,

$$\mu_{\mathrm{on}} = c_{11}\langle m\rangle + c_{01}\langle n\rangle \quad\text{and}\quad \mu_{\mathrm{off}} = c_{10}\langle m\rangle + c_{00}\langle n\rangle,$$

respectively, as well as the variances,

$$\sigma_{\mathrm{on}}^2 = c_{11}\langle m\rangle\,(1 - c_{11}\langle m\rangle/M) + c_{01}\langle n\rangle\,[1 - c_{01}\langle n\rangle/(N - M)],$$
$$\sigma_{\mathrm{off}}^2 = c_{10}\langle m\rangle\,(1 - c_{10}\langle m\rangle/M) + c_{00}\langle n\rangle\,[1 - c_{00}\langle n\rangle/(N - M)],$$
which are determined by the reduced connectivity matrix $\begin{pmatrix} c_{11} & c_{10} \\ c_{01} & c_{00} \end{pmatrix}$ from equation 4.2. A gaussian approximation to the binomial statistics then yields

$$T_{\langle\cdot\rangle}\begin{pmatrix}\langle m\rangle\\ \langle n\rangle\end{pmatrix} = \frac{1}{2}\begin{pmatrix} M\left[1 + \operatorname{erf}\!\left((\mu_{\mathrm{on}} - \theta)/\sqrt{2\sigma_{\mathrm{on}}^2}\right)\right]\\ (N - M)\left[1 + \operatorname{erf}\!\left((\mu_{\mathrm{off}} - \theta)/\sqrt{2\sigma_{\mathrm{off}}^2}\right)\right]\end{pmatrix}. \tag{7.2}$$
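The following is a minimal Python sketch, not code from the letter, of iterating this gaussian-approximated mean-field map; the entries c11, c10, c01, c00 of the reduced connectivity matrix and all network parameters must be supplied by the user, and the small variance floor is purely a numerical safeguard, not part of the model.

    import numpy as np
    from scipy.special import erf

    def meanfield_step(m, n, M, N, theta, c11, c10, c01, c00):
        """One application of the map T (equations 7.1 and 7.2)."""
        mu_on = c11 * m + c01 * n
        mu_off = c10 * m + c00 * n
        var_on = c11 * m * (1 - c11 * m / M) + c01 * n * (1 - c01 * n / (N - M))
        var_off = c10 * m * (1 - c10 * m / M) + c00 * n * (1 - c00 * n / (N - M))
        var_on, var_off = max(var_on, 1e-12), max(var_off, 1e-12)
        m_next = 0.5 * M * (1 + erf((mu_on - theta) / np.sqrt(2 * var_on)))
        n_next = 0.5 * (N - M) * (1 + erf((mu_off - theta) / np.sqrt(2 * var_off)))
        return m_next, n_next

    def fixed_point(M, N, theta, c11, c10, c01, c00, steps=1000):
        """Iterate from an ideal cue (m, n) = (M, 0) until the map settles."""
        m, n = float(M), 0.0
        for _ in range(steps):
            m, n = meanfield_step(m, n, M, N, theta, c11, c10, c01, c00)
        return m, n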
Numerical iteration of equation 7.1 yields the fixed points (⟨m⟩∗, ⟨n⟩∗) of the mean field dynamics and their basins of attraction (see Figure 9A). The iterated map has two trivial fixed points that are largely independent of the choice of firing threshold θ and pattern size M. These trivial fixed points represent complete activation of the network, on the one hand, and no activity at all, on the other hand. Shape and size of their basins of attraction (black and white areas in Figure 9A), however, are modulated by the specific values of M and θ. We also observe a third type of fixed point comprising a large number of hits and a small number of false alarms; numerically we always find a hit fraction ⟨m⟩∗/M close to 1 at this fixed point of infinite sequence replay. Its basin of attraction is plotted in gray and extends over a small interval of false-alarm rates; note the logarithmic scale on the ordinates in Figure 9A. In Figure 9A we also see that the smaller the pattern size, the narrower the range of thresholds allowing infinite sequence replay. For a large enough pattern size, the range of possible thresholds is broad (see also Figures 1 and 3).

The region in the (M, θ) space where infinite sequence replay can occur is summarized in Figure 9B. The wedge-shaped stability regions are not much affected by N but strongly depend on c. The borders of such a stability region in Figure 9B can be described by upper and lower bounds for the threshold, θ^upper and θ^lower, that can be approximated through linear functions of the pattern size M. The upper bound θ^upper is interpreted as an isoline that separates the region of the completely deactivated state with fixed point ⟨m⟩∗ = ⟨n⟩∗ = 0 from the region of stable sequence replay where ⟨m⟩∗ ≈ M and ⟨n⟩∗ ≪ M. From the first line of equation 7.1, we then obtain

$$\theta^{\mathrm{upper}} \approx c_{11} M \,\frac{\langle m\rangle^*}{M} - \operatorname{erf}^{-1}\!\left(2\,\frac{\langle m\rangle^*}{M} - 1\right) O\!\left(\sqrt{\theta^{\mathrm{upper}}}\right).$$

Thus, for large M, the bound θ^upper is an almost linear function of the pattern size with a slope c₁₁⟨m⟩∗/M ≈ c₁₁. Similarly, from the second line of equation 7.1, we obtain the boundary θ^lower between the region where ⟨m⟩∗ ≈ M and ⟨n⟩∗ ≪ N and the region of the completely activated state, ⟨m⟩∗/M = ⟨n⟩∗/(N − M) = 1:

$$\theta^{\mathrm{lower}} \approx c_{10} M + c_{10} N \left(1 - \frac{\langle n\rangle^*}{N - M}\right) + \operatorname{erf}^{-1}\!\left(2\,\frac{\langle n\rangle^*}{N - M} - 1\right) O\!\left(\sqrt{\theta^{\mathrm{lower}}}\right).$$
[Figure 9: (A) a grid of basin-of-attraction plots, hits ⟨m⟩/M versus false alarms ⟨n⟩/(N − M), for pattern sizes M = 700, 800, 900, 1200, 1600 and firing thresholds θ between 53 and 134; (B) stability regions in the (M, θ) plane for connectivities c = 0.05, 0.1, 0.2 and network sizes N = 10⁵, 10⁶, 10⁷.]
Figure 9: Fixed points of infinite sequence replay. (A) Basins of attraction of the mean field dynamics in equation 7.1 depend on pattern size M and firing threshold θ. The discrete dynamics of mean hit rates ⟨m⟩/M and mean false alarm rates ⟨n⟩/(N − M) exhibits two trivial fixed points. The first is a completely deactivated state, ⟨m⟩∗ = ⟨n⟩∗ = 0, with basins of attraction represented by a white area. The second fixed point represents maximal activation, ⟨m⟩∗/M = ⟨n⟩∗/(N − M) = 1, with basins of attraction painted black. For a few pairs of (M, θ), we also observe nontrivial fixed points (black dots) corresponding to sequence replay. Their basins of attraction are depicted by gray areas. Parameters (N = 10⁵, c = 0.05, r = 1) are the same as in Figures 1 and 3. (B) Regions of stable sequence replay in the (M, θ) space are plotted in gray; connectivities are c = 0.05, 0.1, 0.2, and network sizes are N = 10⁵, 10⁶, 10⁷ for r = 1. The slopes of the upper and lower borders of these stability regions approximately equal the connectivities c₁₁ and c₁₀, respectively.
The slope of θ^lower is about c₁₀, which for r = 1 is about half the slope of θ^upper. These predicted slopes agree with the numerical results in Figure 9B. The size of the region of infinite sequence replay is therefore proportional to c₁₁ − c₁₀ ∝ r. The larger the ratio r between silent and nonsilent synapses, the larger are the stability regions and, hence, the more robust is sequence replay. We emphasize that the above expressions for θ^upper and θ^lower are rough estimates that correspond to large pattern sizes M, at which the distributions of synaptic inputs to "off" and "on" units do not overlap too much (see Figures 4A and 4B). Moreover, the optimal parameters Mopt and θopt at the tip of a stability region cannot be determined explicitly because we cannot assess the exact value of the fixed-point hit fraction ⟨m⟩∗/M analytically.

The mean field results in Figure 9A are largely consistent with the cellular simulations in Figure 1, but there are also important differences. Cellular simulations were obtained for finite sequences of length Q = 20, whereas the mean field results are valid for Q → ∞. Further discrepancies at the edges of the stability regions occur because random fluctuations in cellular simulations can kick the network into complete activation or deactivation. The edges of the wedge-shaped regions in Figure 9B therefore describe the behavior of cellular networks only approximately.

To summarize, the higher the capacity, the less robust is sequence replay against variations of the parameters M and θ. The wedge-shaped structure of the stability regions in Figure 9B indicates that maximal sequence capacity and, hence, minimal M go along with a critical dependence of stability on the firing threshold. In the limit M → Mopt, the network lives on the edge of dynamical (in)stability.

8 Information Content for N → ∞

The detection criterion we proposed in section 4.3 permits a limited number of errors. It is intuitively clear that these retrieval errors allow an increase of the storage capacity α as compared to an errorless case. However, the more errors occur, the more deteriorated is the representation of each pattern during replay. The common way of measuring the balance of these two opposing effects of retrieval errors is to calculate the information content I. The latter can be understood as the logarithm of the number of all possible ways of concurrently storing a number P of associations or, more precisely (Nadal & Toulouse, 1990),
$$I = \lg_2 \left[\binom{N}{m+n} \middle/ \binom{M}{m}\binom{N-M}{n}\right]^{P}. \tag{8.1}$$

Here, $\binom{N}{m+n}\big/\big[\binom{M}{m}\binom{N-M}{n}\big]$ is the number of patterns of size M that can be represented in a network of size N, given the hits m and false alarms n.
We note that the number P of associations between patterns depends on the performance of the readout device, and so does the information content.

The information content is often calculated as a function of the so-called coding ratio f = M/N, which is interpreted as a firing rate. In biologically relevant networks, the firing rate is low (f → 0), while the networks are required to remain operable in the limit N → ∞. This asymptotic behavior of networks is extensively discussed in the literature (e.g., Willshaw et al., 1969; Gardner, 1987; Golomb et al., 1990). In what follows, we show that in our framework we also have lim_{N→∞} f = 0. In this limit, we assess the information content I for Q = 1 and Q → ∞.

From equation 8.1, we derive an approximation of I for f → 0, given that the number n of false alarms is considerably smaller than the pattern size M, as motivated in section 7.2. For a fixed fraction η := m/M ≲ 1 of hits, we can approximate I by evaluating equation 8.1 with n = 0. Then, applying Stirling's formula and introducing the mixing entropy s(x) = −x lg₂ x − (1 − x) lg₂(1 − x), we obtain

$$I/(c_m N^2) = \alpha f \left[\eta\,|\lg_2 \eta f| - s(\eta)\right]. \tag{8.2}$$
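Equation 8.2 is easy to evaluate numerically. A short sketch, assuming 0 < η < 1 and 0 < ηf < 1 so that the logarithms are well defined (function names are ours):

    import numpy as np

    def mixing_entropy(x):
        """s(x) = -x lg2 x - (1 - x) lg2(1 - x), for 0 < x < 1."""
        return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

    def info_per_synapse(alpha, f, eta):
        """Right-hand side of equation 8.2: I / (c_m N^2)."""
        return alpha * f * (eta * np.abs(np.log2(eta * f)) - mixing_entropy(eta))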
From equation 3.3 we know that the storage capacity α scales like N/M², and thus I/N² ∝ |ln(M/N)|/M. As a corollary, this shows that minimizing M maximizes not only α but also I.

In the case Q = 1, a combined optimization of θ and M leads to Mopt being independent of the network size N (see section 5). As a result, we obtain f ∝ 1/N. Accordingly, the information content per synapse, I/(cₘN²) ∝ ln N, increases with network size N.

In order to also obtain the asymptotic behavior of I in the case of large sequence length, we have assessed the optimal pattern size Mopt for Q → ∞ as a function of the network size N for a fixed connectivity c, that is, without any constraint (see Figure 10). Numerics reveals a sublogarithmic behavior, Mopt(N) ∝ (ln N)^0.82; the coding ratio f ∝ (ln N)^0.82/N also falls below any positive bound as N → ∞, that is, coding becomes arbitrarily sparse. Together with equation 3.3, the unconstrained storage capacity diverges like α ∝ N/(ln N)^1.64. From equation 8.2, we thus find for N → ∞ that the information content per synapse increases sublogarithmically: I/N² ∝ (ln N)^0.18. The information content per synapse thus diverges for N → ∞, though very slowly. In fact, I/N² grows so slowly that in the range of biologically
Figure 10: The dependence of the optimal pattern size Mopt on the network size N is weak in the case of sequence length Q → ∞; note the logarithmic scale on the abscissa. The connectivity c is fixed, that is, no constraint is imposed (crosses: c = 0.1; circles: c = 0.2). The symbols represent optimal pattern sizes obtained from the numerical solution of the fixed-point equation (⟨m⟩∗, ⟨n⟩∗) = T⟨·⟩(⟨m⟩∗, ⟨n⟩∗) (see equation 7.1). The solid line illustrates the asymptotic behavior Mopt ∝ (ln N)^0.82 found from linear regression.
reasonable network sizes, 10³ < N < 10⁷, the information content per synapse varies only by a factor of (7/3)^0.18 ≈ 1.2. To summarize, the combined optimization of M and θ provides an efficient algorithm for setting up a sequential memory network for a broad range of network sizes. However, combining the results from sections 6 and 7 for biologically relevant parameter regimes, one obtains information contents I ≈ |lg₂ f| α f cₘN² that are far below the theoretical maximum cₘN² of one bit per synapse.
9 Discussion

This article combines analytical and numerical methods to assess the capacity for storing sequences of activity patterns in a recurrent network of McCulloch-Pitts units. Results from mean field theory are validated through simulations of cellular networks and a probabilistic dynamical description. Our approach is new in that we concurrently optimize the pattern size M and the firing threshold θ in order to maximize the storage capacity α. Within this framework, we derive the capacity α as a function of five system parameters: network size N, mean connectivity c, synaptic plasticity resources r, sequence length Q, and detection threshold γ.

The storage capacity of a network crucially depends on the criterion for pattern detection. One typically requires that the quality of replay of patterns exceed some detection threshold γ (see equations 4.12 and 4.13). Our retrieval criterion with γ < 1, which allows errors in the replay of patterns, is fundamentally different from the error-free criterion γ → 1 in
the classical Willshaw et al. (1969) network, where the storage capacity is subject to Gardner's bound (Gardner, 1987). In the original Willshaw model, as well as in our approach for minimal sequences (Q = 1), the network is initialized with a perfect representation of the cue pattern, m₀ = M and n₀ = 0. The Willshaw model, however, requires a perfect retrieval of a target pattern in that the number of hits is maximal, m₁ = M, and there is less than one false alarm on average, n₁ < 1; furthermore, the firing threshold θ is set to the pattern size M. Binomial statistics then yields the well-known logarithmic scaling laws for the optimal pattern size, Mopt ∝ log N, and the capacity, α_Willshaw ∝ N/log² N (Willshaw et al., 1969; Gardner, 1987; see also equation 3.3). In terms of the coding ratio f = Mopt/N, they find α_Willshaw ∝ 1/(f |ln f|) for N → ∞. In contrast, in this article, we optimize both the firing threshold θ and the pattern size M, and we use a readout criterion that permits errors. Thus, the storage capacity α ∝ 1/f (see equation 5.7) diverges faster than α_Willshaw.

A pattern representation that contains errors is in agreement with the situation in the brain, for example, in the hippocampal CA3 network. There, the recurrently connected pyramidal cells also have feedforward connections to the pyramidal cells in CA1 via highly plastic synapses. It is generally assumed (Hasselmo, 1999) that these synapses are adjusted by CA3 activity and local learning rules; that is, CA1 can learn replayed patterns. Readout in CA1 may therefore be successful even if the absolute number of false alarms in CA3 exceeds the number of hits. The detection criterion in equation 4.13 can be motivated by such downstream neurons that receive excitation from the correctly activated neurons and inhibition from the incorrectly activated ones (e.g., via a globally coupled network of interneurons).

For sequence length Q = 1, the concurrent optimization of M and θ leads to scaling laws for the replay of minimal sequences at biologically relevant connectivities c ≪ 1: the optimal pattern size is inversely proportional to the mean connectivity, Mopt ∝ c⁻¹, and the optimal firing threshold θopt is independent of c. Both θopt and Mopt are independent of the network size N. These dependencies finally lead to a capacity of sequential memory that scales like α ∝ cN (see equation 5.7). Moreover, the number of associations that can be stored scales like P ∝ c²N².

A main conclusion from the scaling laws α ∝ cN and P ∝ c²N² is that for a constrained number of synapses per cell (synapses-per-neuron constraint, cN = const.), the capacity α and the number P are constant, that is, independent of the network size N (see Figures 6 and 8). This means that it is impossible to increase the computational power of the network by increasing N. One could argue, however, that taking two independent networks doubles P and would therefore account for a performance increase that is linear in N. The drawback of this strategy is that each pattern can then be connected to only half of the other patterns, namely those located in the same network module.
A technically relevant constraint (e.g., in a computer simulation) is a constant total number of synapses in the network (synapses-per-network constraint, cN² = const.). From the above scaling laws, we conclude that α and P then necessarily decrease with increasing network size (see Figure 6B). One can also ask whether there is a scaling law for the connectivity that accounts for scale-invariant storage, that is, P ∝ N (see section 6.3). In so doing, we find scale invariance for c ∝ 1/√N. As a result, the total number of synapses is then proportional to N^(3/2), which is in line with results by Stevens (2001; personal communication, 2005).

For the synapses-per-neuron constraint, there is an optimal value for the ratio r between silent and nonsilent synapses. For generic parameter regimes, this optimal value is rather large (r ≈ 10; see Figure 7). However, α exhibits a broad maximum as a function of r, and therefore the exact value of r is not critical for sequential memory. If one considers the network connectivity to be determined by local Hebbian learning rules, such as spike-timing-dependent synaptic plasticity, ratios r that strongly deviate from 1 are implausible, since synaptic LTP at a specific pair of pre- and postsynaptic neurons can be compensated locally only by LTD of another synapse at the very same pair of neurons (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Bi & Poo, 1998; Kempter, Gerstner, & van Hemmen, 1999). One can thus argue that the functional benefit of a very large amount of plastic resources may no longer justify the expense of nonlocal signaling in synaptic plasticity. In short, ratios r ≈ 1 may be sufficient for an excellent performance of sequential memory.

This article also shows that for long sequences (e.g., Q > 8), memory capacity becomes virtually independent of Q (see Figure 8). For large Q, however, the optimal pattern size is necessarily such that the network is close to dynamical instability (see Figure 9). Yet from the point of view of maximizing the storage capacity α, the strategy of avoiding dynamical instabilities by increasing the pattern size M is problematic, since α is proportional to M⁻² (see equation 3.3). In order to approach the maximal storage capacity without the danger of complete activation or silencing of the network, one might rather introduce an activity-dependent stabilization mechanism that provides negative feedback after a certain number of time steps. A biological realization at hand is a network of inhibitory interneurons (Bragin et al., 1995; Battaglia & Treves, 1998; Traub et al., 2000; Csicsvari, Jamieson, Wise, & Buzsaki, 2003). This, of course, may come at the cost of limiting the sequence length Q or reducing the detection threshold γ.

Our results for large sequence lengths Q are not immediately applicable to synfire chains (Abeles, 1991; Herrmann et al., 1995; Diesmann et al., 1999). The chief difficulty in translating our model into a more realistic network with continuous dynamics is to preserve the temporal separation between distinct patterns. The functional constraint of minimal sequence lengths is thus more likely a constraint on the temporal precision of network dynamics than on counting statistics. We speculate that for biological networks, spike
desynchronization restricts the applicability of our results to small values of Q.

The framework here is limited to orthogonal sequences; a particular pattern is not allowed to occur presynaptically in more than one minimal sequence. Nonorthogonal or loop-like sequential memories can be taken into account by, for example, generalizing the framework to neurons with more than one-step memory (Dehaene, Changeux, & Nadal, 1987; Guyon, Personnaz, Nadal, & Dreyfus, 1988) or adding "internal patterns" that represent repetitions (Amit, 1988) or context (Levy, 1996).

A possible neurophysiological application of our theory can be found in the hippocampus. During slow-wave sleep, low levels of the neuromodulator acetylcholine boost the impact of the excitatory feedback connections within CA3 (see Hasselmo, 1999, for a review). Slow-wave sleep goes along with a phenomenon called sharp-wave ripples, which is speculated to result from the replay of short sequences (Draguhn, Traub, Bibbig, & Schmitz, 2000; Csicsvari et al., 2000). A sharp-wave ripple burst is a pulselike incident of the local field potential in CA3 that is accompanied by 200 Hz oscillations. The latter are supposed to be generated by CA3 pyramidal cells (Behrens, van den Boom, de Hoz, Friedman, & Heinemann, 2005) and may reflect sequence replay (Wilson & McNaughton, 1994; Nádasdy et al., 1999; Lee & Wilson, 2002) occurring in time slices of about 5 ms. The total duration of a ripple of about 40 ms limits the number of putative events in a sequence to fewer than about eight. The temporal extent of a sharp wave may be controlled by inhibition (Maier, Nimmrich, & Draguhn, 2003), which would hint at dynamical stabilization of the network activity at a high level of storage capacity (see above).

In Figure 8 we plotted the coding ratio f = Mopt/N and the storage capacity α as a function of network size for various sequence lengths. If we apply these results to the situation in the hippocampal CA3 region of rats, with a sequence length of Q = 8, a network size of N = 240,000, a synapses-per-neuron constraint of cₘN = 10,000 synapses per cell, and plasticity resources r = 1, we find the optimal pattern size to be about 1,500 cells. As a consequence, the storage capacity is about α = 1.2 minimal sequences per synapse at a cell, which corresponds to about 1,600 full sequences of length 8 stored in the network. Interestingly, the firing threshold we obtain is 55, which is approximately the same as that assumed by Diesmann et al. (1999) for cortical synfire networks.

To summarize, this letter provides a simple rule for how to choose pattern size and threshold in order to optimize storage capacity under biologically realistic constraints such as low connectivity and similar amounts of silent and nonsilent synapses. From that, one can conclude that sequence completion in the recurrent network operates far below maximal information content. To put it more positively, information seems to be redundantly distributed over a large number of synapses, which is consistent with the picture that memories are stored in a way that is robust against synaptic noise and some variability of morphological plasticity.
Appendix A: List of Symbols

Symbol                  Meaning (Location of First Use)
t                       discrete time (section 2.1)
x                       binary network state vector (section 2.1)
θ                       firing threshold (section 2.1)
c                       mean connectivity of activated synapses (section 2.1)
cₛ                      mean connectivity of silent synapses (section 2.1)
cₘ                      mean morphological connectivity (section 2.1)
r = cₛ/c                ratio between silent and active connectivity (section 2.1)
C = (Cnn′)              connectivity matrix of activated synapses (section 2.1)
N                       network size (section 2.1)
M                       pattern size (section 2.2)
ξ                       binary pattern vector (section 2.2)
Q                       sequence length (section 2.2)
P                       number of minimal sequences stored (section 3)
α = P/(cₘN)             capacity of sequential memory (section 3)
m                       number of hits (section 4.1)
n                       number of false alarms (section 4.1)
c₁₁, c₁₀, c₀₁, c₀₀      reduced connectivity matrix (equation 4.1)
T                       transition matrix (equation 4.3)
b                       binomial probability (equation 4.4)
p                       conditional probability of hits (equation 4.5)
q                       conditional probability of false alarms (equation 4.5)
ρ                       conditional probability of one hit (equation 4.6)
λ                       conditional probability of one false alarm (equation 4.7)
                        quality of replay (equation 4.11)
γ, γ̃                    detection thresholds (equations 4.12 and 4.13)
T⟨·⟩                    mean transition function (equation 7.2)
f = M/N                 coding ratio (section 8)
I                       information content (section 8)
Appendix B: Memory Capacity Revisited

Let us consider a naive network of size N that initially has no synapses at all. To imprint the first minimal sequence ξA → ξB in the network, we need M²c₁₁ functional synapses in order to link two groups of M neurons at connectivity c₁₁ (see Figure 11). Let us first discuss the simpler case c₁₁ = cₘ. For the second sequence, ξC → ξD, fewer synapses are needed because we have to take into account that some cells in pattern ξC are already connected to cells in ξD because of overlap with the first sequence. For random patterns, the probability that a neuron is active in a specific pattern is f = M/N, which is also called the coding ratio. As a result, the mean number of cells that are active in both of a given pair of patterns is Mf. Consequently, the Mf presynaptic cells that belong to both cue patterns ξA and ξC only have to be connected to the M(1 − f) postsynaptic neurons of ξD that do not overlap with ξB. The number of new synapses needed is Mf · c₁₁ · M(1 − f). In order to complete the second minimal sequence,
Figure 11: Consumption of synapses by successively storing minimal sequences. The first minimal sequence ξA → ξB consumes c₁₁M² synapses. The patterns of a second minimal sequence ξC → ξD have some overlap f with ξA and ξB; there are Mf cells (gray) both pre- and postsynaptically that contribute to both the first and the second minimal sequences. The number of synapses consumed by ξC → ξD is thereby reduced by a factor of (1 − f²); see the text.
we are left with connecting the remaining M(1 − f) presynaptic cells of ξC to all M postsynaptic cells of ξD. In summary, the second sequence consumes Mf · c₁₁ · M(1 − f) + M(1 − f) · c₁₁ · M = M²c₁₁(1 − f²) synapses. Similarly, the kth minimal sequence consumes M²c₁₁(1 − f²)^(k−1) synapses that have not yet been accounted for. Summing up all contributions until we reach the limit N²c of available nonsilent synapses yields a condition on the maximal number P of minimal sequences,
$$N^2 c \overset{!}{=} M^2 c_{11} \sum_{k=1}^{P} (1 - f^2)^{k-1} = M^2 c_{11} f^{-2}\left[1 - (1 - f^2)^P\right]. \tag{B.1}$$
In the case c₁₁ < cₘ, we need to take into account the probability c₁₁/cₘ of having a morphological synapse from cue to target in the nonsilent state. The transformation f² → f²c₁₁/cₘ is sufficient to generalize the result in equation B.1. Solving the generalized version of equation B.1 for P and normalizing the result by N, we find the capacity to be

$$\alpha = \frac{\log(1 - c/c_m)}{c_m N \log(1 - f^2 c_{11}/c_m)},$$

which can be approximated for f ≪ 1 by

$$\alpha = \frac{cN}{c_m c_{11} M^2}. \tag{B.2}$$
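A direct transcription of this counting argument, with function names of our own choosing, compares the exact capacity from the generalized equation B.1 with the small-f approximation of equation B.2; parameters must satisfy c < cₘ and f²c₁₁/cₘ < 1:

    import numpy as np

    def capacity_exact(N, M, c, cm, c11):
        """Capacity alpha obtained by solving the generalized equation B.1 for P."""
        f = M / N  # coding ratio
        P = np.log(1 - c / cm) / np.log(1 - f**2 * c11 / cm)
        return P / (cm * N)

    def capacity_approx(N, M, c, cm, c11):
        """Approximation for f << 1 (equation B.2)."""
        return c * N / (cm * c11 * M**2)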
Equation B.2 is an extension of the results for the case c₁₁ = cₘ = 1 originally obtained by Willshaw et al. (1969), Nadal and Toulouse (1990), and Nadal (1991). In the main part of this article, we discuss the scenario c₁₁ = cₘ < 1; see equations 3.2 and 3.3.

Acknowledgments

We thank D. Schmitz for valuable discussions on the hippocampal circuitry, A. V. M. Herz for ongoing support, and R. Gütig, R. Schaette, M. Stemmler, K. Thurley, and L. Wiskott for discussions and comments on the manuscript. This research was supported by the Deutsche Forschungsgemeinschaft (Emmy Noether Programm: Ke 788/1-3, SFB 618) and the Bundesministerium für Bildung und Forschung (Bernstein Center for Computational Neuroscience).

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J. (1988). Neural networks counting chimes. Proc. Natl. Acad. Sci. USA, 85, 2141–2145.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Information storage in a network with low levels of activity. Phys. Rev. A, 35, 2293–2303.
August, D. A., & Levy, W. B. (1999). Temporal sequence compression by an integrate-and-fire model of hippocampal area CA3. J. Comput. Neurosci., 6, 71–90.
Battaglia, F. P., & Treves, A. (1998). Stable and rapid recurrent processing in realistic autoassociative memories. Neural Comput., 10, 431–450.
Behrens, C. J., van den Boom, L. P., de Hoz, L., Friedman, A., & Heinemann, U. (2005). Induction of sharp wave-ripple complexes in vitro and reorganization of hippocampal networks. Nat. Neurosci., 8, 560–567.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Brun, V. H., Otnass, M. K., Molden, S., Steffenach, H. A., Witter, M. P., Moser, M. B., & Moser, E. I. (2002). Place cells and place recognition maintained by direct entorhinal-hippocampal circuitry. Science, 296, 2243–2246.
Brunel, N., Hakim, V., Isope, P., Nadal, J.-P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cell. Neuron, 43, 745–757.
Brunel, N., Nadal, J.-P., & Toulouse, G. (1992). Information capacity of a perceptron. J. Phys. A: Math. Gen., 25, 5017–5037.
Buckingham, J., & Willshaw, D. (1992). Performance characteristics of the associative net. Network, 3, 407–414.
Buckingham, J., & Willshaw, D. (1993). On setting unit thresholds in an incompletely connected associative net. Network, 4, 441–459.
Buhmann, J., & Schulten, K. (1987). Noise-driven temporal association in neural networks. Europhys. Lett., 4, 1205–1209.
Csicsvari, J., Hirase, H., Mamiya, A., & Buzsaki, G. (2000). Ensemble patterns of hippocampal CA3-CA1 neurons during sharp wave associated population events. Neuron, 28, 585–594.
Csicsvari, J., Jamieson, B., Wise, K. D., & Buzsaki, G. (2003). Mechanisms of gamma oscillations in the hippocampus of the behaving rat. Neuron, 37, 311–322.
Dehaene, S., Changeux, J.-P., & Nadal, J.-P. (1987). Neural networks that learn temporal sequences by selection. Proc. Natl. Acad. Sci. USA, 84, 2727–2731.
Deshpande, V., & Dasgupta, C. (1991). A neural network for storing individual patterns in limit cycles. J. Phys. A: Math. Gen., 24, 5105–5119.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Draguhn, A., Traub, R. D., Bibbig, A., & Schmitz, D. (2000). Ripple (approximately 200-Hz) oscillations in temporal structures. J. Clin. Neurophysiol., 17, 361–376.
Düring, A., Coolen, A. C. C., & Sherrington, D. (1998). Phase diagram and storage capacity of sequence processing neural networks. J. Phys. A: Math. Gen., 31, 8607–8621.
Fortin, N. J., Agster, K. L., & Eichenbaum, H. B. (2002). Critical role of the hippocampus in memory for sequences of events. Nat. Neurosci., 5, 458–462.
Gardner, E. (1987). Maximum storage capacity in neural networks. Europhys. Lett., 4, 481–485.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Graham, B., & Willshaw, D. (1997). Capacity and information efficiency of the associative net. Network: Comput. Neural Syst., 8, 35–54.
Gutfreund, H., & Mézard, M. (1988). Processing of temporal sequences in neural networks. Phys. Rev. Lett., 61, 235–238.
Guyon, I., Personnaz, L., Nadal, J.-P., & Dreyfus, G. (1988). Storage and retrieval of complex sequences in neural networks. Phys. Rev. A, 38, 6365–6372.
Hasselmo, M. E. (1999). Neuromodulation: Acetylcholine and memory consolidation. Trends Cogn. Sci., 3, 351–359.
Herrmann, M., Hertz, J. A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network: Comput. Neural Syst., 6, 403–414.
Herz, A. V. M., Li, Z., & van Hemmen, J. L. (1991). Statistical mechanics of temporal association in neural networks with transmission delays. Phys. Rev. Lett., 66, 1370–1373.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Isaac, J. T., Nicoll, R. A., & Malenka, R. C. (1995). Evidence for silent synapses: Implications for the expression of LTP. Neuron, 15, 427–444.
Isope, P., & Barbour, B. (2002). Properties of unitary granule cell → Purkinje cell synapses in adult rat cerebellar slices. J. Neurosci., 22, 9668–9678.
Jensen, O., & Lisman, J. E. (2005). Hippocampal sequence-encoding driven by a cortical multi-item working memory buffer. Trends Neurosci., 28, 67–72.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kesner, R. P., Gilbert, P. E., & Barua, L. A. (2002). The role of the hippocampus in memory for the temporal order of a sequence of odors. Behav. Neurosci., 116, 286–290.
Lee, A. K., & Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron, 36, 1183–1194.
Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6, 579–590.
Little, W. A. (1974). Existence of persistent states in the brain. Math. Biosci., 19, 101–120.
Lőrincz, A., & Buzsáki, G. (2000). Two-phase computational model training long-term memories in the entorhinal-hippocampal region. Ann. New York Acad. Sci., 911, 83–111.
Maier, N., Nimmrich, V., & Draguhn, A. (2003). Cellular and network mechanisms underlying spontaneous sharp wave-ripple complexes in mouse hippocampal slices. J. Physiol., 550, 873–887.
Maravall, M. (1999). Sparsification from dilute connectivity in a neural network model of memory. Network: Comput. Neural Syst., 10, 15–39.
McCulloch, W. S., & Pitts, W. (1943). Logical calculus of ideas immanent in nervous activity. Bull. of Math. Biophys., 5, 115–133.
Montgomery, J. M., Pavlidis, P., & Madison, D. V. (2001). Pair recordings reveal all-silent synaptic connections and the postsynaptic expression of long-term potentiation. Neuron, 29, 691–701.
Nadal, J.-P. (1991). Associative memory: On the (puzzling) sparse coding limit. J. Phys. A: Math. Gen., 24, 1093–1101.
Nadal, J.-P., & Toulouse, G. (1990). Information storage in sparsely-coded memory nets. Network, 1, 61–74.
Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., & Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. J. Neurosci., 19, 9497–9507.
Nakazawa, K., McHugh, T. J., Wilson, M. A., & Tonegawa, S. (2004). NMDA receptors, place cells and hippocampal spatial memory. Nature Rev. Neurosci., 5, 361–372.
Nowotny, T., & Huerta, R. (2003). Explaining synchrony in feed-forward networks: Are McCulloch-Pitts neurons good enough? Biol. Cybern., 89, 237–241.
Nusser, Z., Lujan, R., Laube, G., Roberts, J. D. B., Molnar, E., & Somogyi, P. (1998). Cell type and pathway dependence of synaptic AMPA receptor number and variability in the hippocampus. Neuron, 21, 545–559.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rapp, P. R., & Gallagher, M. (1996). Preserved neuron number in the hippocampus of aged rats with spatial learning deficits. Proc. Natl. Acad. Sci. USA, 93, 9926–9930.
Sompolinsky, H., & Kanter, I. (1986). Temporal associations in asymmetric neural networks. Phys. Rev. Lett., 57, 2861–2864.
Stevens, C. F. (2001). An evolutionary scaling law for the primate visual system and its basis in cortical function. Nature, 411, 193–195.
Traub, R. D., Bibbig, A., Fisahn, A., LeBeau, F. E. N., Whittington, M. A., & Buhl, E. H. (2000). A model of gamma-frequency network oscillations induced in the rat CA3 region by carbachol in vitro. Europ. J. Neurosci., 12, 4093–4106.
Urban, N. N., Henze, D. A., & Barrionuevo, G. (2001). Revisiting the role of the hippocampal mossy fiber synapse. Hippocampus, 11, 408–417.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Nonholographic associative memory. Nature, 222, 960–962.
Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265, 676–679.
Received May 10, 2005; accepted September 8, 2005.
LETTER
Communicated by Christopher Williams
Kernel Fisher Discriminants for Outlier Detection Volker Roth
[email protected] ETH Zurich, Institute of Computational Science, CH-8092 Zurich, Switzerland
The problem of detecting atypical objects or outliers is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The method presented in this letter bridges the gap between kernelized one-class classification and gaussian density estimation in the induced feature space. Having established the exact relation between the two concepts, it is now possible to identify atypical objects by quantifying their deviations from the gaussian model. This model-based formalization of outliers overcomes the main conceptual shortcoming of most one-class approaches, which, in a strict sense, are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. In order to overcome the inherent model selection problem of unsupervised kernel methods, a cross-validated likelihood criterion for selecting all free model parameters is applied. Experiments for detecting atypical objects in image databases effectively demonstrate the applicability of the proposed method in real-world scenarios.
1 Introduction

A one-class classifier attempts to find a separating boundary between a data set and the rest of the feature space. A natural application of such a classifier is estimating a contour line of the underlying data density for a certain quantile value. Such contour lines may be used to separate typical objects from atypical ones. Objects that look sufficiently atypical are often considered to be outliers, for which one rejects the hypothesis that they come from the same distribution as the majority of the objects. Thus, a useful application scenario would be to find a boundary that separates the jointly distributed objects from the outliers. Finding such a boundary defines a classification problem in which, however, usually only sufficiently many labeled samples from one class are available. In most practical problems, no labeled samples from the outlier class are available at all, and it is even unknown if any outliers are present. Since the contour lines of the data density often have a complicated form, highly
Neural Computation 18, 942–960 (2006)
C 2006 Massachusetts Institute of Technology
nonlinear classification models are needed in such an outlier-detection scenario. Recently, it has been proposed to address this problem by exploiting the modeling power of kernel-based support vector machine (SVM) classifiers (see, e.g., Tax & Duin, 1999; Schölkopf, Williamson, Smola, & Shawe-Taylor, 2000). These one-class SVMs are able to infer highly nonlinear decision boundaries, although at the price of a severe model selection problem.

The approach of directly estimating a boundary, as opposed to first estimating the whole density, follows one of the main ideas in learning theory, which states that one should avoid solving an intermediate problem that is too hard. While this line of reasoning seems to be appealing from a theoretical point of view, it leads to a severe problem in practical applications: when it comes to detecting outliers, the restriction to estimating only a boundary makes it impossible to derive a formal characterization of outliers without prior assumptions on the expected fraction of outliers or even on their distribution. In practice, however, any such prior assumptions can hardly be justified. The fundamental problem of the one-class approach lies in the fact that outlier detection is a (partially) unsupervised task that has been squeezed into a classification framework. The missing part of information has been shifted to prior assumptions that require detailed information about the data source.

This letter aims at overcoming this problem by linking kernel-based one-class classifiers to gaussian density estimation in the induced feature space. Objects that have an unexpectedly high Mahalanobis distance to the sample mean are considered atypical objects, or outliers. A particular Mahalanobis distance is considered unexpected if it is very unlikely to observe an object that far away from the mean vector in a random sample of a certain size. We formalize this concept in section 3 by way of fitting linear models in quantile-quantile plots.

The main technical ingredient of our method is the one-class kernel Fisher discriminant classifier (OC-KFD), for which the relation to gaussian density estimation is shown. From the classification side, the OC-KFD-based model inherits both the modeling power of Mercer kernels and the simple complexity control mechanism of regularization techniques. Viewed as a function of the input space variables, the model acts as a nonparametric density estimator. The explicit relation to gaussian density estimation in the kernel-induced feature space, however, makes it possible to formalize the notion of an atypical object by observing deviations from the gaussian model. Like any other kernel-based algorithm, however, the OC-KFD model contains some free parameters that control the complexity of the inference mechanism, and it is clear that deviations from gaussianity will heavily depend on the actual choice of these model parameters. In order to characterize outliers, it is thus necessary to select a suitable model in advance. This model selection problem is overcome by using a likelihood-based cross-validation framework for inferring the free parameters.
2 Gaussian Density Estimation and One-Class LDA

Let X denote the n × d data matrix that contains the n input vectors xᵢ ∈ R^d as rows. It has been proposed to estimate a one-class decision boundary by separating the data set from the origin (Schölkopf et al., 2000), which effectively coincides with replicating all xᵢ with the opposite sign and separating X and −X. Typically, a ν-SVM classifier with a radial basis kernel function is used. The parameter ν upper-bounds the fraction of outliers in the data set and must be selected a priori. There are, however, no principled ways of choosing ν in a general (unsupervised) outlier-detection scenario. Such unsupervised scenarios are characterized by the lack of class labels that would assign the objects to either the typical class or the outlier class.

The method proposed here follows the same idea of separating the data from their negatively replicated counterparts. Instead of an SVM, however, a kernel Fisher discriminant (KFD) classifier is used (Mika, Rätsch, Weston, Schölkopf, & Müller, 1999; Roth & Steinhage, 2000). The latter has the advantage that it is closely related to gaussian density estimation in the induced feature space. By making this relation explicit, outliers can be identified without specifying the expected fraction of outliers in advance. We start with a linear discriminant analysis (LDA) model and then introduce kernels. The intuition behind this relation to gaussian density estimation is that discriminant analysis assumes a gaussian class-conditional data density.

Let Xₐ = (X⊤, −X⊤)⊤ denote the augmented (2n × d) data matrix, which also contains the negative samples −xᵢ. Without loss of generality, we assume that the sample mean µ₊ := (1/n) Σᵢ xᵢ ≠ 0, so that the sample means of the positive data and the negative data differ: µ₊ ≠ µ₋. Let us now assume that our data are realizations of a normally distributed random variable in d dimensions: X ∼ N_d(µ, Σ). Denoting by X_c the centered data matrix, the estimator for Σ takes the form Σ̂ = (1/n)X_c⊤X_c =: W. The LDA solution β∗ maximizes the between-class scatter β∗⊤Bβ∗ with B = µ₊µ₊⊤ + µ₋µ₋⊤ under the constraint on the within-class scatter β∗⊤Wβ∗ = 1. Note that in our special case with Xₐ = (X⊤, −X⊤)⊤, the usual pooled within-class matrix W simply reduces to the above-defined W = (1/n)X_c⊤X_c. It is well known (see, e.g., Duda, Hart, & Stork, 2001) that the LDA solution (up to a scaling factor) can be found by minimizing the least-squares functional

$$\hat\beta = \arg\min_\beta \|y_a - X_a\beta\|^2, \tag{2.1}$$
where yₐ = (2, . . . , 2, −2, . . . , −2)⊤ denotes a 2n-dimensional indicator vector for membership in class + or −. Hastie, Buja, and Tibshirani (1995) describe a slightly more general form of the problem in which the above functional is minimized under a constraint on β, which in the simplest case amounts to adding a term γβ⊤β to the functional. Such a ridge regression model assumes a penalized total covariance of the form Σ_T = (1/(2n)) · Xₐ⊤Xₐ + γI = (1/n) · X⊤X + γI. Defining an n-vector of ones, y = (1, . . . , 1)⊤, the solution vector β̂ reads

$$\hat\beta = \left(X_a^\top X_a + \gamma I\right)^{-1} X_a^\top y_a = \left(X^\top X + \gamma I\right)^{-1} X^\top y. \tag{2.2}$$
An appropriate scaling factor is defined in terms of the quantity

$$s^2 = (1/n)\, y^\top \hat y = (1/n)\, y^\top X\hat\beta, \tag{2.3}$$
which leads us to the correctly scaled LDA vector β∗ = s⁻¹(1 − s²)⁻¹ᐟ² β̂ that fulfills the normalization condition β∗⊤Wβ∗ = 1. One further derives from Hastie et al. (1995) that the mean vector of X, projected onto the one-dimensional LDA subspace, has the coordinate value m₊ = s(1 − s²)⁻¹ᐟ², and that the Mahalanobis distance from a vector x to the sample mean µ₊ is the sum of the squared Euclidean distance in the projected space and an orthogonal distance term:

$$D(x, \mu_+) = (\beta_*^\top x - m_+)^2 + D_\perp, \quad\text{with } D_\perp = -(1 - s^2)(\beta_*^\top x)^2 + x^\top \Sigma_T^{-1} x. \tag{2.4}$$
While in the standard LDA setting all discriminative information is contained in the first term, we have to add the orthogonal term D⊥ to establish the link to density estimation. Note, however, that it is the term D⊥ that makes the density estimation model essentially different from OC classification: while the latter considers only distances in the direction of the projection vector β, the true density model also takes into account the distances in the orthogonal subspace.

Since the assumption X ∼ N_d(µ, Σ) is very restrictive, we propose to relax it by assuming that we have found a suitable transformation of our input data, φ: R^d → R^p, x ↦ φ(x), such that the transformed data are gaussian in p dimensions. If the transformation is carried out implicitly by introducing a Mercer kernel k(xᵢ, xⱼ), we arrive at an equivalent problem in terms of the kernel matrix K, with entries Kᵢⱼ = k(xᵢ, xⱼ), and the expansion coefficients α:

$$\hat\alpha = (K + \gamma I)^{-1} y. \tag{2.5}$$
From Schölkopf et al. (1999) it follows that the mapped vectors can be represented in Rⁿ as φ(x) = K⁻¹ᐟ² k(x), with k(x) = (k(x, x₁), . . . , k(x, xₙ))⊤.¹ Finally, we derive the following form of the Mahalanobis distances, which again consists of the Euclidean distance in the classification subspace plus an orthogonal term,

$$D(x, \mu_+) = (\alpha_*^\top k(x) - m_+)^2 - (1 - s^2)(\alpha_*^\top k(x))^2 + n\,\Omega(x), \tag{2.6}$$

where α∗ = s⁻¹(1 − s²)⁻¹ᐟ² α̂ and Ω(x) = k⊤(x)(K + γI)⁻¹K⁻¹k(x). Equation 2.6 establishes the desired link between OC-KFD and gaussian density estimation, since for our outlier detection mechanism, only Mahalanobis distances are needed. While it may seem rather complicated to estimate a density by the above procedure, the main benefit over directly estimating the mean and the covariance lies in the inherent complexity regulation properties of ridge regression. Such a complexity control mechanism is of particular importance in highly nonlinear kernel models. Moreover, for ridge regression models it is possible to analytically calculate the effective degrees of freedom, a quantity that will be of particular interest when it comes to detecting outliers.
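All quantities entering equation 2.6 can be computed from the kernel matrix alone. The following is a minimal sketch, not code from the letter, assuming a strictly positive definite K (so that K⁻¹ exists) and 0 < s² < 1; the function and argument names are ours:

    import numpy as np

    def ockfd_mahalanobis(K, k_test, gamma):
        """Mahalanobis distances D(x, mu_+) of the OC-KFD model (equations 2.3-2.6).

        K      : (n, n) kernel matrix on the training sample
        k_test : (m, n) cross-kernel values k(x) for m test points
        gamma  : ridge parameter
        """
        n = K.shape[0]
        y = np.ones(n)
        alpha_hat = np.linalg.solve(K + gamma * np.eye(n), y)    # equation 2.5
        s2 = (y @ (K @ alpha_hat)) / n                           # equation 2.3
        alpha_star = alpha_hat / np.sqrt(s2 * (1.0 - s2))        # scaled coefficients
        m_plus = np.sqrt(s2 / (1.0 - s2))                        # projected mean coordinate
        proj = k_test @ alpha_star
        # Orthogonal term Omega(x) = k(x)^T (K + gamma I)^{-1} K^{-1} k(x)
        A = np.linalg.solve(K + gamma * np.eye(n), k_test.T)
        B = np.linalg.solve(K, k_test.T)
        omega = np.sum(A * B, axis=0)
        return (proj - m_plus) ** 2 - (1.0 - s2) * proj ** 2 + n * omega  # equation 2.6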
3 Detecting Outliers

Let us assume that the model is completely specified: both the kernel function k(·, ·) and the regularization parameter γ are fixed. The central lemma that helps us detect outliers can be found in most statistical textbooks:

Lemma 1. Let X be a gaussian random variable, X ∼ N_d(µ, Σ). Then Δ := (X − µ)⊤Σ⁻¹(X − µ) follows a χ²-distribution on d degrees of freedom.

For penalized regression models, it may be more appropriate to use the effective degrees of freedom df instead of d in the above lemma. In the case of one-class LDA with ridge penalties, we can easily estimate it as df = trace(X(X⊤X + γI)⁻¹X⊤) (Moody, 1992), which for a kernel model translates into

$$\mathrm{df} = \operatorname{trace}\!\left(K (K + \gamma I)^{-1}\right). \tag{3.1}$$
1 With a slight abuse of notation, we denote both the map and its empirical counterpart by φ(x).
The intuitive interpretation of the quantity df is the following: denoting by V the matrix of eigenvectors of K and by {λᵢ}ᵢ₌₁ⁿ the corresponding eigenvalues, the fitted values ŷ read

$$\hat y = V \operatorname{diag}\big(\underbrace{\lambda_i/(\lambda_i + \gamma)}_{=:\,\delta_i}\big)\, V^\top y. \tag{3.2}$$
It follows that, compared to the unpenalized case where all eigenvectors vᵢ are constantly weighted by 1, the contribution of the ith eigenvector vᵢ is downweighted by a factor δᵢ/1 = δᵢ. If the ordered eigenvalues decrease rapidly, however, the values δᵢ are either close to zero or close to one, and df determines the number of terms that are essentially different from zero. A similar interpretation is possible for the orthogonal distance term in equation 2.6.

From lemma 1, we conclude that if the data are well described by a gaussian model in the kernel feature space, the observed Mahalanobis distances should look like a sample from a χ²-distribution with df degrees of freedom. A graphical way to test this hypothesis is to plot the observed quantiles against the theoretical χ² quantiles, which in the ideal case gives a straight line. Such a quantile-quantile plot is constructed as follows. Let Δ₍ᵢ₎ denote the observed Mahalanobis distances ordered from lowest to highest, and pᵢ the cumulative proportion before each Δ₍ᵢ₎, given by pᵢ = (i − 1/2)/n. Further, let zᵢ = F⁻¹(pᵢ) denote the theoretical quantile at position pᵢ, where F is the cumulative χ²-distribution function. The quantile-quantile plot is then obtained by plotting Δ₍ᵢ₎ against zᵢ.

Deviations from linearity can be formalized by fitting a linear model on the observed quantiles and calculating confidence intervals around the fit. Observations falling outside the confidence interval are then treated as outliers. A potential problem of this approach is that the outliers themselves heavily influence the quantile-quantile fit. In order to overcome this problem, the use of robust fitting procedures has been proposed in the literature (see, e.g., Huber, 1981). In the experiments below we use an M-estimator with Huber loss function. For estimating confidence intervals around the fit, we use the standard formula (see, e.g., Fox, 1997; Kendall & Stuart, 1977),
$$\sigma(\Delta_{(i)}) = b \cdot \big(\chi^{2\prime}(z_i)\big)^{-1} \sqrt{p_i(1 - p_i)/n}, \tag{3.3}$$

where χ²′ denotes the χ² density function. This formula can be intuitively understood as the product of the slope b and the standard error of the quantiles. A 100(1 − ε)% envelope around the fit is then defined as Δ₍ᵢ₎ ± z_{ε/2} σ(Δ₍ᵢ₎), where z_{ε/2} is the 1 − ε/2 quantile of the standard normal distribution.

The choice of the confidence level ε is somewhat arbitrary, and from a conceptual viewpoint, one might even argue that the problem of specifying
one free parameter (i.e., the expected fraction of outliers) has simply been transferred into the problem of specifying another one. In practice, however, selecting ε is a much more intuitive procedure than guessing the fraction of outliers. Whereas the latter requires problem-specific prior knowledge, which is hardly available in practice, the former depends only on the variance of a linear model fit. Thus, ε can be specified in a problem-independent way. Note that a 100(1 − ε)% envelope defines a relative confidence criterion. Since we identify all objects outside the envelope as outliers and remove them from the model (see algorithm 1), it might be plausible to set ε ← ε/n, which defines an absolute criterion.

As described above, we use a robust fitting procedure for the linear quantile fit. To further diminish the influence of the outliers, the iterative exclusion and refitting procedure presented in algorithm 1 has proved very successful.

Algorithm 1: Iterative Outlier Detection (Given Estimated Mahalanobis Distances)
repeat
    Fit a robust line into the quantile plot.
    Compute the 100(1 − ε/n)% envelope.
    Among the objects within the upper quartile range of Mahalanobis distances, remove the one with the highest positive deviation from the upper envelope.
until no further outliers are present.

The OC-KFD model detects outliers by measuring deviations from gaussianity. The reader should notice, however, that for kernel maps that transform the input data into a higher-dimensional space, a severe modeling problem occurs: in a strict sense, the gaussian assumption in the feature space will always be violated, since the transformed data lie on a d-dimensional submanifold. For regularized kernel models, the effective dimension of the feature space (measured by df) can be much lower than the original feature space dimension. If the chosen model parameters induce a feature space where df does not exceed the input space dimension d, the gaussian model might still provide a plausible data description. We conclude that the user should be alarmed if the chosen model has df ≫ d. In such a case, the proposed outlier detection mechanism might produce unreliable results, since one expects large deviations from gaussianity anyway. The experiments presented in section 5, on the other hand, demonstrate that if a model with df ≈ d is selected, the OC-KFD approach successfully overcomes the problem of specifying the fraction of outliers in advance, which seems to be inherent in the ν-SVMs.
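A compact sketch of algorithm 1 under simplifying assumptions: an ordinary least-squares quantile fit stands in for the robust Huber M-estimator used in the letter, and the envelope follows equation 3.3 with the absolute criterion ε ← ε/n. Inputs are the estimated Mahalanobis distances d and the effective degrees of freedom df from equation 3.1; all names are ours.

    import numpy as np
    from scipy.stats import chi2, norm

    def iterative_outliers(d, df, eps=0.05):
        """Iteratively remove the worst point above the upper quantile envelope."""
        d = np.asarray(d, dtype=float)
        idx = np.arange(len(d))          # indices of objects still in the model
        outliers = []
        while True:
            n = len(idx)
            if n < 10:                   # too few points for a meaningful fit
                break
            order = np.argsort(d[idx])
            obs = d[idx][order]                          # observed quantiles
            p = (np.arange(1, n + 1) - 0.5) / n
            z = chi2.ppf(p, df)                          # theoretical quantiles
            b, a = np.polyfit(z, obs, 1)                 # linear quantile fit (slope b)
            sigma = b * np.sqrt(p * (1 - p) / n) / chi2.pdf(z, df)   # equation 3.3
            upper = a + b * z + norm.ppf(1 - (eps / n) / 2) * sigma  # envelope
            q75 = np.quantile(obs, 0.75)
            candidates = np.where((obs > upper) & (obs >= q75))[0]   # upper quartile only
            if candidates.size == 0:
                break
            worst = candidates[np.argmax(obs[candidates] - upper[candidates])]
            outliers.append(idx[order][worst])           # index in the original ordering
            idx = np.delete(idx, order[worst])
        return np.array(outliers, dtype=int)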
4 Model Selection

In our model, the data are first mapped into some feature space in which a gaussian model is fitted. Mahalanobis distances to the mean of this gaussian are computed by evaluating equation 2.6. The feature space mapping is implicitly defined by the kernel function, for which we assume that it is parameterized by a kernel parameter σ. For selecting all free parameters in equation 2.6, we are thus left with the problem of selecting θ = (σ, γ). The idea is now to select θ by maximizing the cross-validated likelihood. From a theoretical viewpoint, the cross-validated (CV) likelihood framework is appealing, since in van der Laan, Dudoit, and Keles (2004), the CV likelihood selector has been shown to asymptotically perform as well as the optimal benchmark selector, which characterizes the best possible model (in terms of Kullback-Leibler divergence to the true distribution) contained in the parametric family.

For kernels that map into a space with dimension p > n, however, two problems arise: (1) the subspace spanned by the mapped samples varies with different sample sizes, and (2) not the whole feature space is accessible for vectors in the input space. As a consequence, it is difficult to find a proper normalization of the gaussian density in the induced feature space. This problem can be easily avoided by considering the likelihood in the input space rather than in the feature space; that is, we are looking for a properly normalized density model p(x|·) in some bounded subset S ⊂ R^d such that the contour lines of p(x|·) and the gaussian model in the feature space have the same shape: p(x_i|·) = p(x_j|·) ⇔ p(φ(x_i)|·) = p(φ(x_j)|·).² Denoting by X_n = {x_i}_{i=1}^{n} a sample from p(x) from which the kernel matrix K is built, a natural input space model is

p_n(x \mid X_n, \theta) = Z^{-1} \exp\big( -\tfrac{1}{2} D(x; X_n, \theta) \big), \quad \text{with } Z = \int_S \exp\big( -\tfrac{1}{2} D(x; X_n, \theta) \big) \, dx,    (4.1)
where D(x; X_n, θ) denotes the (parameterized) Mahalanobis distance, equation 2.6, of a gaussian model in the feature space. Note that this density model in the input space has the same functional form as our gaussian model in the feature space, except for the different normalization constant Z. Only the interpretation of the models differs: the input space model is viewed as a function in x, whereas the feature space model is treated as a function in φ(x). The former can be viewed as a nonparametric density estimator (note that for RBF kernels, the functional form of the Mahalanobis distance in the exponent of equation 4.1 is closely related to a Parzen-window estimator). The feature-space model, on the other hand, defines a parametric density. Having selected the maximum likelihood model in the input space, the parametric form of the corresponding feature space model is then used for detecting atypical objects.

Computing the constant Z in equation 4.1 requires us to solve a normalization integral over the d-dimensional set S. Since in general this integral is not analytically tractable for nonlinear kernel models, we propose to approximate Z by a Monte Carlo sampling method. In our experiments, for instance, the VEGAS algorithm (Lepage, 1980), which implements a mixed importance-stratified sampling approach, was a reasonable method for up to 15 input dimensions. The term reasonable here is not of a qualitative nature but refers solely to the time needed for approximating the integral with sufficient precision. For the ten-dimensional examples presented in the next section, for instance, the sampling takes approximately 1 minute on a standard PC. The choice of the subset S on which the numerical integration takes place is somewhat arbitrary, but choosing S to be the smallest hypercube including all training data has proven a reasonable strategy in the experiments.

² In order to guarantee integrability, we assume that the input density has bounded support. Since in practice we have to approximate the integral by sampling anyway, this assumption does not limit the applicability of the proposed method.
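A crude Monte Carlo estimate of Z can be written in a few lines; the sketch below samples uniformly over the bounding hypercube S instead of using the adaptive VEGAS algorithm employed in the article, and the callable `mahalanobis` is a hypothetical stand-in for the distance function D(x; X_n, θ) of equation 2.6.

```python
import numpy as np

def estimate_Z(X, mahalanobis, n_samples=100_000, rng=None):
    """Monte Carlo approximation of the normalization constant Z in
    equation 4.1, integrating over the smallest hypercube S that
    contains the training data X (shape (n, d))."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)     # bounding hypercube S
    volume = np.prod(hi - lo)
    u = rng.uniform(lo, hi, size=(n_samples, X.shape[1]))
    return volume * np.mean(np.exp(-0.5 * mahalanobis(u)))
```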
5 Experiments

5.1 Detecting Outliers in Face Databases. In a first experiment, the performance of the proposed method is demonstrated for an outlier detection task in the field of face recognition. The Olivetti face database³ contains 10 different images of each of 40 distinct subjects, taken under different lighting conditions and with different facial expressions and facial details (glasses/no glasses). None of the subjects, however, wears sunglasses. All the images are taken against a homogeneous background with the subjects in an upright, frontal position. In this experiment, we additionally corrupted the data set by including two images in which we artificially changed normal glasses to “sunglasses,” as can be seen in Figure 1. The goal is to demonstrate that the proposed method is able to identify these two atypical images without any problem-dependent prior assumptions on the number of outliers or on their distribution. In order to exclude illumination differences, the images are standardized by subtracting the mean intensity and normalizing to unit variance. Each of the 402 images is then represented as a ten-dimensional vector that contains the projections onto the leading 10 eigenfaces (eigenfaces are simply the eigenvectors of the images treated as pixel-wise vectorial objects). From these vectors, an RBF kernel of the form k(x_i, x_j) = exp(−‖x_i − x_j‖²/σ) is built. In order to guarantee the reproducibility of the results, both the data set and an R-script for computing the OC-KFD model can be downloaded from www.inf.ethz.ch/personal/vroth/OC-KFD/index.html.
³ See http://www.uk.research.att.com/facedatabase.html.
Figure 1: Original and corrupted images with in-painted “sunglasses.”
Whereas our visual impression suggests that the “sunglasses” images might be easy to detect, the automated identification of these outliers based on the eigenface representation is not trivial: due to the high heterogeneity of the database, these images do not exhibit extremal coordinate values in the directions of the leading principal components. For the first principal direction, for example, the coordinate values of the two outlier images are −0.34 and −0.52, whereas the values of all images range from −1.5 to +1.1. Another indicator of the difficulty of the problem is that a one-class SVM has severe problems identifying the sunglasses images, as will be shown.

In a first step of the proposed procedure, the free model parameters are selected by maximizing the cross-validated likelihood. A simple two-fold cross-validation scheme is used: the data set is randomly split into a training set and a test set of equal size, the model is built from the training set (including the numerical solution of the normalization integral), and finally the likelihood is evaluated on the test set. This procedure is repeated for different values of (σ, γ). In order to simplify the selection process, we kept γ = 10⁻⁴ fixed and varied only σ. Both the test likelihood and the corresponding model complexity, measured in terms of the effective degrees of freedom (df), are plotted in Figure 2 as a function of the (natural) logarithm of σ. One can clearly identify both an overfitting and an underfitting regime, separated by a broad plateau of models with similarly high likelihood. The df curve shows a similar plateau, indicating that all these models have comparable complexity. This observation suggests that the results should be rather insensitive to variations of σ over values contained in this plateau. This suggestion is indeed confirmed by the results in Figures 3 and 4, where we compare the quantile-quantile plots for different parameter values (marked as I to IV in Figure 2). The plots for models II and III look very similar, and in both cases, two objects clearly fall outside a 100(1 − 0.1/n)% envelope around the linear fit. Outside the plateau, the number of objects considered as outliers drastically increases in the overfitting regime (model I, σ too small) or decreases to zero in the underfitting regime (model IV, σ too large).
Figure 2: Selecting the kernel width σ by cross-validated likelihood (solid line). The dashed line shows the corresponding effective degrees of freedom (df).
The upper-right panel in Figure 3 shows that the two outliers found by the maximum-likelihood model II are indeed the two sunglasses images. Furthermore, we observe that the quantile plot, after removing the two outliers by way of algorithm 1, appears gaussian-like (bottom right panel). Despite the potential problems of fitting gaussian models in kernel-induced feature spaces, this observation may be explained by the similar degrees of freedom in the input space and the kernel space: the plateau of the likelihood curve in Figure 2 corresponds to approximately 10 to 11 effective degrees of freedom. In spite of the fact that the width of the maximum-likelihood RBF kernel is relatively large (σ = 2250), the kernelized model is still different from a standard linear model. Repeating the model selection experiment with a linear kernel for different values of the regularization parameter γ, the highest test likelihood is found for a model with df = 6.5 degrees of freedom. A reliable identification of the two sunglasses images, however, is not possible with this model: one image falls clearly within the envelope, and the other only slightly exceeds the upper envelope.

In order to compare the results with standard approaches to outlier detection, a one-class ν-SVM with RBF kernel is trained on the same data set. The main practical problem with the ν-SVM is the lack of a plausible selection criterion for both the σ- and the ν-parameter. Taking into account the conceptual similarity of the SVM approach and the proposed OC-KFD method, we decided to use the maximum likelihood kernel emerging from the above model selection procedure (model II in Figure 2). The choice of the ν-parameter that upper-bounds the fraction of outliers turned out to be even more problematic: for different ν-values, the SVM model identifies roughly ν · 402 outliers in the data set (cf. Figure 5).
Figure 3: Quantile-quantile plots for different models. (Left column) Overfitting model I from Figure 2: initial qq-plot (top) and final plot after subsequently removing the outliers (bottom). (Right column) Optimal model II.
Note that in the ν-SVM model, the quantity ν · 402, where 402 is the size of the data set, provides an upper bound on the number of outliers. The observed almost linear increase in the number of detected outliers means, however, that the SVM model does not provide us with a plausible characterization of outliers. We basically “see” as many outliers as we have specified in advance by choosing ν. Furthermore, it is interesting to see that the SVM model has problems identifying the two sunglasses images: for ν = 0.0102, the SVM detects two outliers, which, however, do not correspond to the desired sunglasses images (see the right panel of Figure 5). To find the sunglasses images within the outlier set, we have to increase ν to 0.026, which “produces” nine outliers in total.
Figure 4: Quantile-quantile plots. (Left) Slightly suboptimal model III. (Right) Underfitting model IV.

Figure 5: One-class SVM results. (Left) Number of detected outliers versus ν-parameter. The solid line defines the theoretical upper bound ν · 402. (Right) The two outliers identified with ν = 0.0102.
One might argue that the observed correlation between the ν-parameter and the number of identified outliers is an artifact of using a kernel width that was selected for a different method (i.e., OC-KFD instead of ν-SVM). When the SVM experiments are repeated with both a 10 times smaller width (σ = 225) and a 10 times larger width (σ = 22,500), one observes the same almost linear dependency of the number of outliers on ν. The problem of identifying the sunglasses images persists as well: in all tested cases, the most dominant outlier is not a sunglasses image.
Figure 6: USPS data set. Cross-validated likelihoods as a function of the kernel parameter σ. Each curve corresponds to a separate digit.
This observation indicates that, at least in this example, the problems of the ν-SVM are rather independent of the kernel used.

5.2 Handwritten Digits from the USPS Database. In a second experiment, the proposed method is applied to the USPS database of handwritten digits. The data are divided into a training set and a test set, each consisting of 16 × 16 gray-value images of handwritten digits from postal codes. It is well known that the test data set contains many outliers. The problem of detecting these outlier images has been studied before in the context of one-class SVMs (see Schölkopf & Smola, 2002). Whereas for the face data set we used the unsupervised eigenface method (i.e., PCA) to derive a low-dimensional data representation, in this case we are given class labels, which allow us to employ a supervised projection method, such as LDA. For 10 classes, LDA produces a data description in a 9-dimensional subspace. For actually computing the projection, a penalized LDA model (Hastie et al., 1995) was fitted to the training set. Given the trained LDA model, the test set vectors were projected onto the subspace spanned by the nine LDA vectors. To each of the classes, we then fitted an OC-KFD outlier detection model. The test-likelihood curves for the individual classes are depicted in Figure 6. For most of the classes, the likelihood attains a maximum around σ ≈ 1300. Classes 1 and 7 require a slightly more complex model with σ ≈ 350. The maximum likelihood models correspond to approximately 9 to 11 effective degrees of freedom in the kernel space, which is not too far from the input space dimensionality.
Figure 7: Outlier detection for the digit 9. The iteration terminates after excluding two images (top left → bottom left → top right panel).
Once the optimal kernel parameters are selected, the outliers are detected by iteratively excluding the object with the highest deviation from the upper envelope around the linear quantile fit (cf. algorithm 1). For the digit 9, the individual steps of this iteration are depicted in Figure 7. The iteration terminates after excluding two outlier images. All remaining typical images with high Mahalanobis distances fall within the envelope (top right panel). For the combined set of outliers over all digits, Figure 8 shows the first 18 outlier images, ordered according to their deviations from the quantile fits. Many European-style 1s and “crossed” 7s are successfully identified as atypical objects in this collection of U.S. postal codes. Moreover, some almost unreadable digits are detected, like the “0,” which has the form of a horizontal bar (middle row), or the “9” in the bottom row.

In order to compare the results with a standard technique, a one-class ν-SVM was also trained on the data. As in the previous experiment, the width of the RBF kernel was set to the maximum-likelihood value identified by the above model selection procedure. For the digit 9, the dependency of the number of identified outliers on the ν-parameter is depicted in Figure 9. The almost linear dependency again emphasizes the problem that the SVM approach does not provide us with a meaningful characterization of outliers. Rather, one “sees” (roughly) as many outliers as specified in advance by choosing ν.
Figure 8: The first 18 detected outliers in the U.S. Postal Service test data set, ordered according to decreasing deviation from the quantile fits. The caption below each image shows the label provided by the database.
Figure 9: One-class SVM results for the digit 9. (Left) Number of detected outliers as a function of the ν-parameter. The solid line defines the theoretical upper bound ν · 177. (Right) The two outliers identified with ν = 0.03.
When setting ν = 0.03, the SVM identifies two outliers, which equals the number of outliers identified in the OC-KFD experiment. The two outlier images are depicted in the right panel. Comparing the results with those of the OC-KFD approach, we see that both methods identified the almost unreadable 9 as the dominant outlier.
5.3 Some Implementation Details. Presumably the easiest way of implementing the model is to carry out an eigenvalue decomposition of K. Both the effective degrees of freedom df = \sum_i \lambda_i / (\lambda_i + \gamma) and the Mahalanobis distances in equation 2.6 can then be derived easily from this decomposition. For practical use, consider the pseudocode presented in algorithm 2. A complete R-script for computing the OC-KFD model can be downloaded from www.inf.ethz.ch/personal/vroth/OCKFD/index.html.

Algorithm 2: OC-KFD
l_max ← −∞
for θ on a specified grid do
    Split the data into two subsets X_train and X_test of size n/2.
    Compute the kernel matrix K(X_train, σ), its eigenvectors V, and eigenvalues {λ_i}.
    Compute \hat{\alpha} = V diag{1/(λ_i + γ)} V^T y.
    Compute the normalization integral Z by Monte Carlo sampling (see equation 4.1).
    Compute Mahalanobis distances by equations 2.6 and 3.2, and evaluate the log likelihood on the test set:
        l(X_test | θ) = \sum_{x_i \in X_test} -\tfrac{1}{2} D(x_i; X_train, θ) - (n/2) \ln(Z).
    if l(X_test | θ) > l_max then
        l_max = l(X_test | θ), θ_opt = θ.
    end if
end for
Given θ_opt, compute K(X, σ_opt), V, {λ_i}.
Compute the Mahalanobis distances and df (see equations 2.6 and 3.2).
Detect outliers using algorithm 1.
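In Python, the eigendecomposition step at the core of algorithm 2 might look as follows (a sketch with hypothetical names; the published implementation is the R-script cited above, and the Mahalanobis distances of equation 2.6 are omitted because that equation is defined earlier in the letter):

```python
import numpy as np

def eig_step(K, y, gamma):
    """Eigendecomposition step of algorithm 2.

    K: (n, n) kernel matrix on X_train; y: targets; gamma: regularizer.
    Returns alpha-hat and the effective degrees of freedom df.
    """
    lam, V = np.linalg.eigh(K)                 # eigenvalues and eigenvectors of K
    df = np.sum(lam / (lam + gamma))           # df = sum_i lambda_i / (lambda_i + gamma)
    alpha = V @ ((V.T @ y) / (lam + gamma))    # V diag{1/(lambda_i + gamma)} V^T y
    return alpha, df
```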
6 Conclusion

Detecting outliers by way of one-class classifiers aims at finding a boundary that separates typical objects in a data sample from the atypical ones. Standard approaches of this kind suffer from the problem that they require prior knowledge about the expected fraction of outliers. For the purpose of outlier detection, however, the availability of such prior information seems to be an unrealistic (or even contradictory) assumption. The method proposed in this article overcomes this shortcoming by using a one-class KFD classifier directly related to gaussian density estimation in the induced feature space. The model benefits from both the built-in classification method and the explicit parametric density model in the feature space: from the former, it inherits the simple complexity regulation mechanism based on only two free parameters. Moreover, within the classification framework, it is possible to quantify the model complexity in terms of the effective degrees of freedom df. The gaussian density model, on the other hand, makes it possible to derive a formal description of atypical objects by way of hypothesis testing: Mahalanobis distances are expected to follow a χ² distribution in df dimensions, and deviations from this distribution can be quantified by confidence intervals around a fitted line in a quantile-quantile plot.

Since the density model is parameterized by both the kernel function and the regularization constant, it is necessary to select these free parameters before the outlier detection phase. This parameter selection is achieved by observing the cross-validated likelihood for different parameter values and choosing the parameters that maximize this quantity. The theoretical motivation for this selection process follows from van der Laan et al. (2004), where it has been shown that the cross-validation selector asymptotically performs as well as the so-called benchmark selector, which selects the best model contained in the model family.

The experiments on detecting outliers in image databases effectively demonstrate that the proposed method is able to detect atypical objects without problem-specific prior assumptions on the expected fraction of outliers. This property constitutes a significant practical advantage over the traditional ν-SVM approach. The latter does not provide a plausible characterization of outliers: one “detects” (roughly) as many outliers as one has specified in advance by choosing ν. Prior knowledge about ν, on the other hand, will hardly be available in general outlier-detection scenarios. In particular, the presented experiments demonstrate that the whole processing pipeline, consisting of model selection by cross-validated likelihood, fitting linear quantile-quantile models, and detecting outliers by considering confidence intervals around the fit, works very well in practical applications with reasonably small input dimensions. For input dimensions above approximately 15, the numerical solution of the normalization integral becomes rather time-consuming when using the VEGAS algorithm. Evaluating the usefulness of more sophisticated sampling methods, such as Markov chain Monte Carlo, for this particular task will be the subject of future work.

Acknowledgments

I thank the referees who helped to improve this letter. Special thanks go to Tilman Lange, Mikio Braun, and Joachim M. Buhmann for helpful discussions and suggestions.
References

Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. Hoboken, NJ: Wiley.
Fox, J. (1997). Applied regression, linear models, and related methods. Thousand Oaks, CA: Sage.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23, 73–102.
Huber, P. (1981). Robust statistics. Hoboken, NJ: Wiley.
Kendall, M., & Stuart, A. (1977). The advanced theory of statistics (Vol. 1). New York: Macmillan.
Lepage, G. (1980). VEGAS: An adaptive multidimensional integration program (Tech. Rep. CLNS-80/447). Ithaca, NY: Cornell University.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). Piscataway, NJ: IEEE.
Moody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). Cambridge, MA: MIT Press.
Roth, V., & Steinhage, V. (2000). Nonlinear discriminant analysis using kernel functions. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 568–574). Cambridge, MA: MIT Press.
Schölkopf, B., Mika, S., Burges, C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. (1999). Input space vs. feature space in kernel-based methods. IEEE Trans. Neural Networks, 10(5), 1000–1017.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (2000). SV estimation of a distribution's support. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 582–588). Cambridge, MA: MIT Press.
Tax, D., & Duin, R. (1999). Support vector data description. Pattern Recognition Letters, 20(11–13), 1191–1199.
van der Laan, M., Dudoit, S., & Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1), art. 4.
Received January 27, 2005; accepted August 25, 2005.
LETTER
Communicated by Mario Figueiredo
Feature Scaling for Kernel Fisher Discriminant Analysis Using Leave-One-Out Cross Validation

Liefeng Bo
[email protected]
Ling Wang
[email protected]
Licheng Jiao
[email protected]
Institute of Intelligent Information Processing, Xidian University, Xi’an 710071, China
Kernel Fisher discriminant analysis (KFD) is a successful approach to classification. It is well known that the key challenge in KFD lies in the selection of free parameters such as kernel parameters and regularization parameters. Here we focus on the feature-scaling kernel, in which each feature is individually associated with a scaling factor. A novel algorithm, named FS-KFD, is developed to tune the scaling factors and regularization parameters for the feature-scaling kernel. The proposed algorithm is based on optimizing the smooth leave-one-out error via a gradient-descent method and has been demonstrated to be computationally feasible. FS-KFD is motivated by two fundamental facts: the leave-one-out error of KFD can be expressed in closed form, and the step function can be approximated by a sigmoid function. Empirical comparisons on artificial and benchmark data sets suggest that FS-KFD improves KFD in terms of classification accuracy.
1 Introduction

Fisher linear discriminant analysis (Fisher, 1936; Fukunaga, 1990) is a classical classifier whose fundamental idea is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. In many applications, Fisher linear discriminant analysis has proved to be very powerful. However, for real-world problems, linear discriminant analysis alone is often not good enough. Mika, Rätsch, and Weston (1999) and Mika (2002) introduced a class of nonlinear Fisher discriminant analyses using kernel tricks, named KFD. Extensive empirical comparisons have shown that KFD is comparable to other kernel-based classifiers, such as support vector machines (SVMs) (Vapnik, 1995, 1998) and least-squares support vector machines (LS-SVMs) (Gestel et al., 2002; Suykens & Vandewalle, 1999).

Neural Computation 18, 961–978 (2006)
C 2006 Massachusetts Institute of Technology
For kernel-based learning algorithms, the key challenge lies in the selection of kernel parameters and regularization parameters. Many researchers have identified this problem and tried to solve it. Weston et al. (2001) performed feature selection for SVMs by combining the feature-scaling technique with a leave-one-out error bound. Chapelle, Vapnik, Bousquet, and Mukherjee (2002) tuned multiple parameters for two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong and Smola (2003) applied semidefinite programming to learn the kernel function via hyperkernels. Lanckriet, Cristianini, Bartlett, Ghaoui, and Jordan (2004) designed the kernel matrix directly by semidefinite programming. All of these algorithms have proved to be effective and have boosted the development of this field. We focus here on tuning the scaling factors of the feature-scaling kernel (Williams & Barber, 1998; Krishnapuram, Hartemink, Carin, & Figueiredo, 2004). Two of the most popular feature-scaling kernels are the polynomial kernel and the gaussian kernel, given below:

K_\theta(x_i, x_j) = \Big( 1 + \sum_{k=1}^{d} \theta_k x_i^{(k)} x_j^{(k)} \Big)^r,    (1.1)

K_\theta(x_i, x_j) = \exp\Big( -\sum_{k=1}^{d} \theta_k \big( x_i^{(k)} - x_j^{(k)} \big)^2 \Big).    (1.2)
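To make the two kernels concrete, the following Python sketch evaluates equations 1.1 and 1.2 for data matrices X1 and X2 (the function names are hypothetical; this is an illustration, not part of the original letter):

```python
import numpy as np

def scaled_gaussian_kernel(X1, X2, theta):
    """Feature-scaling gaussian kernel of equation 1.2."""
    diff = X1[:, None, :] - X2[None, :, :]             # pairwise feature differences
    return np.exp(-np.sum(theta * diff ** 2, axis=-1))

def scaled_polynomial_kernel(X1, X2, theta, r):
    """Feature-scaling polynomial kernel of equation 1.1."""
    return (1.0 + (X1 * theta) @ X2.T) ** r
```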
In a feature-scaling kernel, each feature has its own scaling factor. If some feature is insignificant or irrelevant for classification, the associated scaling factor will be set smaller; otherwise, it will be set larger. Cawley and Talbot (2003) gave a closed form of the leave-one-out error of KFD and demonstrated that it was superior to n-fold cross-validation error in terms of computational complexity. Motivated by this fact, we develop a novel algorithm, named FS-KFD, to tune multiple parameters for the feature-scaling kernel. FS-KFD is constructed in two steps: replacing the step function in the leave-one-out error with a sigmoid function and then optimizing the resulting smooth leave-one-out error via a gradient-descent algorithm. In FS-KFD, all the free parameters are analytically chosen, so the learning process is fully automatic. Extensive experimental comparisons show that FS-KFD improves the performance of KFD in the presence of many irrelevant features and obtains good classification accuracy.

The remainder of the letter is organized as follows. In section 2, kernel Fisher discriminant analysis is briefly reviewed. The expressions for the smooth leave-one-out error and for its derivative are given in section 3. FS-KFD is extended to multiclass classification in section 4. In section 5, the experimental results are reported. The direction of future research is indicated in section 6.
2 Kernel Fisher Discriminant Analysis

For real-world problems, linear discriminant analysis is not enough. Mika et al. constructed linear discriminant analysis in the feature space induced by a Mercer kernel, thus implicitly yielding a nonlinear discriminant analysis in the input space. The resulting model is named KFD, in which two scatter matrices, the between-class scatter matrix and the within-class scatter matrix, are defined by

S_b^F = (m_1^F - m_2^F)(m_1^F - m_2^F)^T \quad \text{and} \quad S_w^F = \sum_{i=1}^{2} \sum_{j=1}^{l_i} \big( \Phi(x_j^i) - m_i^F \big)\big( \Phi(x_j^i) - m_i^F \big)^T,

where the mean of the ith class is m_i^F = \frac{1}{l_i} \sum_{j=1}^{l_i} \Phi(x_j^i). An optimal transformation w is given by maximizing the between-class scatter while simultaneously minimizing the within-class scatter:

\max_w \frac{w^T S_b^F w}{w^T S_w^F w}.    (2.1)

In terms of reproducing kernel theory (Aronszajn, 1950), w can be formulated as w = \sum_{j=1}^{l} \alpha_j \Phi(x_j). With equation 2.1, we can calculate α by

\max_\alpha \frac{\alpha^T \bar{S}_b^F \alpha}{\alpha^T \bar{S}_w^F \alpha},    (2.2)

where

\bar{S}_b^F = (\bar{m}_1^F - \bar{m}_2^F)(\bar{m}_1^F - \bar{m}_2^F)^T \quad \text{and} \quad \bar{S}_w^F = \sum_{i=1}^{2} \sum_{j=1}^{l_i} (\beta_j^i - \bar{m}_i^F)(\beta_j^i - \bar{m}_i^F)^T,

with \beta_j^i = [K(x_1, x_j^i), \ldots, K(x_l, x_j^i)]^T and \bar{m}_i^F = \frac{1}{l_i} \big[ \sum_{j=1}^{l_i} K(x_1, x_j^i), \ldots, \sum_{j=1}^{l_i} K(x_l, x_j^i) \big]^T. It can be seen that KFD is equivalent to finding the leading eigenvector of (\bar{S}_w^F)^{-1} \bar{S}_b^F. To improve numerical stability and generalization ability, we replace \bar{S}_w^F with \bar{S}_w^F + \lambda I, where λ is a regularization constant and I is an identity matrix. For a new sample x, we can predict its label by

g(x) = \mathrm{sgn}\big( (w \cdot \Phi(x)) + b \big) = \mathrm{sgn}\Big( \sum_{j=1}^{l} \alpha_j K(x_j, x) + b \Big),    (2.3)

where b = -\alpha^T \frac{l_1 \bar{m}_1^F + l_2 \bar{m}_2^F}{l}.
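As an illustration of this construction, a minimal Python sketch of regularized KFD in its α-parameterization follows (hypothetical names; a sketch, not the authors' implementation):

```python
import numpy as np

def kfd_train(K, labels, lam):
    """Two-class KFD of section 2: leading eigenvector of
    (S_w + lam*I)^{-1} S_b in the alpha-parameterization.

    K: (l, l) kernel matrix; labels: array with values in {1, 2}.
    """
    l = K.shape[0]
    m = [K[:, labels == c].mean(axis=1) for c in (1, 2)]   # class means m_i
    Sb = np.outer(m[0] - m[1], m[0] - m[1])                # between-class scatter
    Sw = np.zeros((l, l))
    for c, mc in zip((1, 2), m):
        B = K[:, labels == c] - mc[:, None]                # centered beta vectors
        Sw += B @ B.T                                      # within-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + lam * np.eye(l), Sb))
    return np.real(evecs[:, np.argmax(np.real(evals))])
```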
3 Optimization of the Smooth Leave-One-Out Cross-Validation Error

Let us denote the leave-one-out error by \mathcal{L}(x_1, y_1, \ldots, x_l, y_l). It is well known that the leave-one-out error is an almost unbiased estimate of the expected generalization error.
Lemma 1 (Luntz & Brailovsky, 1969; Schölkopf & Smola, 2002).

E\big[ p_{error}^{l-1} \big] = E\Big[ \frac{1}{l}\, \mathcal{L}(x_1, y_1, \ldots, x_l, y_l) \Big],

where p_{error}^{l-1} is the probability of test error for the model trained on samples of size l − 1 and the expectations are taken over the random choice of the samples.
This lemma suggests that the leave-one-out error is a good estimate of the generalization error. However, leave-one-out cross validation is rarely adopted in medium- or large-scale applications due to its high computational cost: it requires running the training algorithm l times. The training algorithms for kernel machines, such as KFD, typically require a computational cost of O(l³). In this case, the computational cost of the leave-one-out cross-validation procedure is O(l⁴), which quickly becomes intractable as the number of training samples increases. Fortunately, there exists a computationally efficient implementation of the leave-one-out cross-validation procedure for KFD, which incurs a computational cost of only O(l³). Xu, Zhang, and Li (2001) showed that KFD is equivalent to minimizing the following loss function,

f(\bar{\alpha}) = \bar{\alpha}^T (C^T C + \lambda U) \bar{\alpha} - 2 \bar{\alpha}^T C^T y + y^T y,    (3.1)

where \bar{\alpha} = \begin{bmatrix} \alpha \\ b \end{bmatrix}, C = [K \;\; 1], U = \begin{bmatrix} I & 0 \\ 0^T & 0 \end{bmatrix}, and I denotes the unit matrix. Let g_i(x) be the ith kernel Fisher classifier constructed from the data set excluding the ith training sample. Defining the residual error by r_i = y_i − g_i(x_i) for the ith training sample, Cawley and Talbot (2003) demonstrated the following:

Lemma 2. r = \big( (I - H)\, y \big) \oslash (1 - D(H)), where H = C (C^T C + \lambda U)^{-1} C^T, D(H) denotes the diagonal elements of H, and \oslash denotes element-wise division.

A straightforward corollary of lemma 2 is that the leave-one-out error of KFD can be computed at a cost of O(l³). This indicates that it is feasible to apply leave-one-out model selection to a medium-size problem. In the following, we discuss the smooth leave-one-out error derived by replacing the step function with a sigmoid function. According to lemma 2, the leave-one-out error of KFD is given by

loo(\theta, \lambda) = \frac{1}{l} \sum_{i=1}^{l} \frac{1 - y_i\, \mathrm{sign}(y_i - r_i)}{2},    (3.2)
where sign(a) is 1 if a ≥ 0 and −1 otherwise. From equation 3.2, we observe that the step function sign(·) in loo(θ, λ) makes it nondifferentiable. In order to use a gradient-descent method to minimize this estimate, we approximate the step function by a sigmoid function,

\tanh(\gamma t) = \frac{\exp(\gamma t) - \exp(-\gamma t)}{\exp(\gamma t) + \exp(-\gamma t)},    (3.3)

where we set γ to be 10. Then the smooth leave-one-out error can be expressed as

loo(\theta, \lambda) = \frac{1}{l} \sum_{i=1}^{l} \frac{1 - y_i \tanh\big( \gamma (y_i - r_i) \big)}{2}.    (3.4)
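Lemma 2 together with equations 3.2 and 3.4 translates directly into code; the following sketch (hypothetical names) computes the residuals and both error estimates from the kernel matrix:

```python
import numpy as np

def loo_errors(K, y, lam, gamma=10.0):
    """Closed-form LOO residuals (lemma 2) and the errors of
    equations 3.2 and 3.4. y holds labels in {-1, +1}."""
    l = K.shape[0]
    C = np.hstack([K, np.ones((l, 1))])              # C = [K 1]
    U = np.eye(l + 1); U[-1, -1] = 0.0               # U = [[I, 0], [0, 0]]
    H = C @ np.linalg.solve(C.T @ C + lam * U, C.T)  # hat matrix
    r = ((np.eye(l) - H) @ y) / (1.0 - np.diag(H))   # lemma 2
    s = np.where(y - r >= 0, 1.0, -1.0)              # sign(a) = 1 if a >= 0
    hard = np.mean((1.0 - y * s) / 2.0)              # equation 3.2
    smooth = np.mean((1.0 - y * np.tanh(gamma * (y - r))) / 2.0)  # equation 3.4
    return r, hard, smooth
```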
Figure 1 shows the leave-one-out error and the smooth leave-one-out error on the Breast Cancer data set. It can be seen from Figure 1 that the smooth leave-one-out error successfully follows the trend of the leave-one-out error. Thus, we can expect that a small smooth leave-one-out error guarantees good generalization ability.

According to the chain rule, the derivative of loo(θ, λ) is formulated as

\frac{\partial\, loo(\theta, \lambda)}{\partial \theta_k} = \frac{\partial\, loo(\theta, \lambda)}{\partial r^T} \frac{\partial r}{\partial \theta_k}.    (3.5)

It follows that we need only calculate \partial\, loo(\theta, \lambda) / \partial r^T and \partial r / \partial \theta_k, respectively. With \partial \tanh(t) / \partial t = \mathrm{sech}^2(t), we have

\frac{\partial\, loo(\theta, \lambda)}{\partial r^T} = \Big( \frac{\gamma}{2l}\, y \otimes \mathrm{sech}^2\big( \gamma (y - r) \big) \Big)^T,    (3.6)

where ⊗ denotes an element-wise product. The derivative of r with respect to θ_k is given by

\frac{\partial r}{\partial \theta_k} = -\Big( \frac{\partial H}{\partial \theta_k}\, y \Big) \oslash (1 - D(H)) + \Big( \big( (I - H) y \big) \oslash (1 - D(H)) \oslash (1 - D(H)) \Big) \otimes D\Big( \frac{\partial H}{\partial \theta_k} \Big).    (3.7)
Figure 1: (A) Variation of the leave-one-out error with log2(λ) on the Breast data set. (B) Variation of the smooth leave-one-out error with log2(λ) on the Breast data set.
The derivative of H with respect to θ_k is given by

\frac{\partial H}{\partial \theta_k} = \frac{\partial C}{\partial \theta_k} (C^T C + \lambda U)^{-1} C^T + C\, \frac{\partial (C^T C + \lambda U)^{-1}}{\partial \theta_k}\, C^T + C (C^T C + \lambda U)^{-1} \frac{\partial C^T}{\partial \theta_k}.    (3.8)

Now let us focus on computing \partial (C^T C + \lambda U)^{-1} / \partial \theta_k. A good solution is based on the equality T^{-1} T = I (Bengio, 2000). Differentiating both sides of the equality with respect to θ_k and then isolating \partial T^{-1} / \partial \theta_k, we have

\frac{\partial T^{-1}}{\partial \theta_k} = -T^{-1} \frac{\partial T}{\partial \theta_k} T^{-1}.    (3.9)

Substituting C^T C + λU for T, we have

\frac{\partial (C^T C + \lambda U)^{-1}}{\partial \theta_k} = -(C^T C + \lambda U)^{-1} \frac{\partial (C^T C + \lambda U)}{\partial \theta_k} (C^T C + \lambda U)^{-1} = -(C^T C + \lambda U)^{-1} \Big( \frac{\partial C^T}{\partial \theta_k} C + C^T \frac{\partial C}{\partial \theta_k} \Big) (C^T C + \lambda U)^{-1}.    (3.10)

Combining equations 3.5, 3.6, 3.7, 3.8, and 3.10, we can compute the derivative of the smooth leave-one-out error with respect to θ_k. The derivative of H with respect to λ is given by

\frac{\partial H}{\partial \lambda} = -C (C^T C + \lambda U)^{-1}\, U\, (C^T C + \lambda U)^{-1} C^T.    (3.11)

So we can compute the derivative of loo(θ, λ) with respect to λ in a similar manner. From the derivation, it can easily be verified that the computational complexity of FS-KFD is

\#(\text{Iteration}) \times \#(\text{free parameters}) \times l^3.    (3.12)
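Equation 3.11 is easy to verify numerically; the sketch below compares it against a central finite difference on random data (all names and sizes are arbitrary illustrations):

```python
import numpy as np

def dH_dlam(C, U, lam):
    """Derivative of H = C (C^T C + lam*U)^{-1} C^T with respect to
    lam, as in equation 3.11."""
    Tinv = np.linalg.inv(C.T @ C + lam * U)
    return -C @ Tinv @ U @ Tinv @ C.T

rng = np.random.default_rng(0)
C = rng.normal(size=(20, 6))
U = np.eye(6); U[-1, -1] = 0.0
lam, h = 0.5, 1e-6
H = lambda t: C @ np.linalg.solve(C.T @ C + t * U, C.T)
numeric = (H(lam + h) - H(lam - h)) / (2 * h)        # central difference
assert np.allclose(numeric, dH_dlam(C, U, lam), atol=1e-4)
```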
4 Extension to Multiclass Classification

In this section, we attempt to extend FS-KFD to multiclass classification using the one-against-all scheme, which has been independently devised by several researchers. Rifkin and Klautau (2004) carefully compared the one-against-all scheme with some other popular multiclass schemes and concluded that it is as accurate as any other scheme if the underlying binary classifiers are well-tuned, regularized classifiers. One-against-all reduces a c-class problem to c binary problems. For the sth binary problem, all samples labeled y_i = s are considered positive samples and the others negative samples. For a new sample, c classifiers are run, and the classifier that outputs the largest value is chosen. Let g^{(s)}(x_i) denote the output of the sth binary classifier on a sample x_i. According to the one-against-all scheme, the predicted label for x_i is

\hat{y}_i = \arg\max_{s \in \{1, \ldots, c\}} g^{(s)}(x_i).    (4.1)
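Equation 4.1 corresponds to a one-line decision rule; a sketch (hypothetical names):

```python
import numpy as np

def one_against_all_predict(G):
    """Prediction rule of equation 4.1. G is an (n, c) matrix whose
    column s holds g^{(s)}(x_i); returns labels in {1, ..., c}."""
    return np.argmax(G, axis=1) + 1
```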
Thus, the leave-one-out error of multiclass classification can be written as

mloo(\theta, \lambda) = \frac{1}{l} \sum_{i=1}^{l} \Big( 1 - \mathrm{equal}\big( y_i, \arg\max_s g^{(s)}(x_i) \big) \Big),    (4.2)
where equal(a, b) = 1 if a = b and equal(a, b) = 0 otherwise, and y_i ∈ {1, 2, . . . , c}. It is intractable to approximate equation 4.2 by a sigmoid function due to the discontinuity of the inner function \arg\max_s g^{(s)}(x_i). In the following, we consider an alternative strategy in which an upper bound on the leave-one-out error of multiclass classification is optimized.

Theorem 1. Let loo^{(s)} denote the leave-one-out error of the sth binary classifier. If the one-against-all scheme is used, the following inequality holds:

mloo \le \sum_{s=1}^{c} loo^{(s)}.    (4.3)
Proof. If all c binary classifiers classify the sample x_i correctly, we have

y_i^{(s)} g^{(s)}(x_i) > 0, \quad s = 1, \ldots, c,    (4.4)

where y_i^{(s)} = 1 if y_i = s and y_i^{(s)} = -1 otherwise. Inequality 4.4 can be further simplified to

g^{(y_i)}(x_i) > 0 \quad \text{and} \quad g^{(s)}(x_i) < 0 \ \text{for } s \ne y_i.    (4.5)

Since only the output of the y_i th classifier is greater than zero, we have

\arg\max_s g^{(s)}(x_i) = y_i.    (4.6)
This means that if all c binary classifiers classify the sample x_i correctly, the final multiclass classifier also classifies x_i correctly. The equivalent proposition is that if the multiclass classifier classifies x_i incorrectly, there exists at least one binary classifier misclassifying x_i. This completes the proof of theorem 1.

This theorem allows us to control the leave-one-out error of multiclass classification by controlling the sum of the leave-one-out errors of all the binary classifiers. Three multiclass schemes can be derived by considering whether the kernel parameters and the regularization parameters are shared by all the binary classifiers.

In the first scheme, all the binary classifiers share the kernel parameters and the regularization parameters (Hsu & Lin, 2002; Rifkin & Klautau, 2004).
The sum of the smooth leave-one-out errors of the c binary classifiers can be formulated as

sloo(\theta, \lambda) = \sum_{s=1}^{c} loo^{(s)}(\theta, \lambda).    (4.7)

loo^{(s)}(\theta, \lambda) can be expanded into

loo^{(s)}(\theta, \lambda) = \frac{1}{l} \sum_{i=1}^{l} \frac{1 - y_i^{(s)} \tanh\big( \gamma (y_i^{(s)} - r_i^{(s)}) \big)}{2},    (4.8)

where r_i^{(s)} is the residual error on the ith sample for the sth binary problem. The derivative of sloo(θ, λ) with respect to θ_k is given by

\frac{\partial\, sloo(\theta, \lambda)}{\partial \theta_k} = \sum_{s=1}^{c} \frac{\partial\, loo^{(s)}(\theta, \lambda)}{\partial (r^{(s)})^T} \frac{\partial r^{(s)}}{\partial \theta_k},    (4.9)

where

\frac{\partial\, loo^{(s)}(\theta, \lambda)}{\partial (r^{(s)})^T} = \Big( \frac{\gamma}{2l}\, y^{(s)} \otimes \mathrm{sech}^2\big( \gamma (y^{(s)} - r^{(s)}) \big) \Big)^T,    (4.10)

\frac{\partial r^{(s)}}{\partial \theta_k} = -\Big( \frac{\partial H}{\partial \theta_k}\, y^{(s)} \Big) \oslash (1 - D(H)) + \Big( \big( (I - H) y^{(s)} \big) \oslash (1 - D(H)) \oslash (1 - D(H)) \Big) \otimes D\Big( \frac{\partial H}{\partial \theta_k} \Big).    (4.11)

Thus, we can compute the derivative of sloo(θ, λ) with respect to θ_k by combining equations 4.9, 4.10, and 4.11. The derivative of sloo(θ, λ) with respect to λ can be computed in a similar manner. It is easily checked that the computational complexity of this multiclass scheme is the same as that of FS-KFD for binary classification, since all the binary classifiers share H.

In the second scheme, only the kernel parameters are shared. As a result, the binary classifiers no longer share H due to the differences among the regularization parameters. The computational complexity of this scheme becomes

c \times \#(\text{Iteration}) \times \#(\text{free parameters}) \times l^3.    (4.12)
In the third scheme, the kernel parameters and the regularization parameters are not shared. Therefore, we independently optimize the free parameters of each binary classifier. The computational complexity of this scheme is the same as that of the second one.
5 Performance Comparison

In order to demonstrate the effectiveness of FS-KFD, we compare its performance with those of KFD, SVMs, and k-nearest neighbors (KNN) (Lowe, 1995) on an artificial XOR problem, benchmark data sets from the UCI Machine Learning Repository (Blake & Merz, 1998), and a radar target recognition problem. All the algorithms were implemented in MATLAB 7.0, and all the experiments were run on a personal computer with a 2.4 GHz P4 processor, 2 GB memory, and the Windows XP operating system. Unless otherwise specified, the FS-KFD mentioned in the following uses the gaussian kernel.

For FS-KFD, a gradient-descent method is used to search for the optimal values of the free parameters, so one needs to choose good optimization software; we recommend using an available optimization package to avoid numerical problems. Here we use the function fminunc in the optimization toolbox of MATLAB, which implements a BFGS quasi-Newton algorithm for medium-scale problems. The maximum number of iterations allowed is set to 50, the termination tolerance on the function value and the variable value is set to 0.0001, and a cubic polynomial line search procedure is used to find the optimal step size. To avoid adding positivity constraints to the optimization problem, we use the parameterization β = (log(θ), log(λ)). The initial values of the scaling factors and regularization parameters are log(θ) = log(1/d) × 1 and log(λ) = 0, respectively, where d is the feature dimensionality. In general, choosing the optimal value for γ is difficult. Throughout the article, γ is set to 10. We have found that using the same setting for various data sets works well. One can also try several different values for γ and choose the one leading to the smallest leave-one-out error.
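A rough Python equivalent of this setup is sketched below; scipy's BFGS routine stands in for MATLAB's fminunc, `smooth_loo` is a hypothetical callable implementing equation 3.4, and the tolerances only approximate the fminunc settings described above:

```python
import numpy as np
from scipy.optimize import minimize

def tune_fs_kfd(X, y, smooth_loo, d):
    """Search over beta = (log(theta), log(lambda)) as in the text."""
    beta0 = np.concatenate([np.log(np.full(d, 1.0 / d)), [0.0]])  # log(1/d), log(lambda)=0
    objective = lambda b: smooth_loo(np.exp(b[:d]), np.exp(b[d]), X, y)
    res = minimize(objective, beta0, method="BFGS",
                   options={"maxiter": 50, "gtol": 1e-4})
    return np.exp(res.x[:d]), np.exp(res.x[d])        # theta, lambda
```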
5.1 Artificial XOR Problem. This experiment aims at validating the robustness of FS-KFD against the inclusion of irrelevant features. To this end, a variant of XOR is constructed, with each feature drawn from a uniform distribution on the interval [−1, 1]. Regardless of the feature dimensionality d, the output label for a given data point is related only to the first two features of the data and is defined as

y = \begin{cases} +1 & \text{if } x_1 x_2 \ge 0 \\ -1 & \text{otherwise} \end{cases}, \qquad x_1, x_2 \in U(-1, +1).    (5.1)
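Equation 5.1 is straightforward to sample from; a sketch (hypothetical names):

```python
import numpy as np

def make_xor(n, d, rng=None):
    """XOR variant of equation 5.1: only the first two of the d
    features determine the label; the rest are irrelevant."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.where(X[:, 0] * X[:, 1] >= 0, 1, -1)
    return X, y
```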
Figure 2: (A) Variation of the errors of FS-KFD and KFD with the dimensionality. (B) Scaling factors with the dimensionality d = 20.
This suggests that there exist d − 2 irrelevant features for the data with d features. The optimal decision function of this problem is nonlinear, and the highest recognition rate achievable by linear classifiers is only 66.67%. FS-KFD and KFD are constructed on a training set with 200 samples and tested on an independent test set with 5000 samples. The results are averaged over 10 random realizations. To study how the errors of FS-KFD and KFD scale with the feature dimensionality, we sequentially increase the feature dimensionality from 2 to 20 at intervals of 2. The errors of the two algorithms as a function of the feature dimensionality are shown in Figure 2A. The scaling factors for dimensionality d = 20 are shown in Figure 2B.

From Figure 2A, we observe that FS-KFD is much more robust to the inclusion of irrelevant features than KFD.
Table 1: Information on Benchmark Data Sets.

Problem    Training/Test   Class   Attribute
Breast     400/299         2         9
German     600/400         2        20
Liver      200/145         2         6
Diabetes   400/368         2         8
Vote       250/185         2        16
Glass      150/64          6         9
Yeast      100/108         5        79
Splice     500/1675        3       240
Segment    500/1810        7        18
Vehicle    500/346         4        18
Furthermore, the feature selection ability of FS-KFD is clearly exhibited in Figure 2B: the scaling factors corresponding to the relevant features are significantly larger than those corresponding to the irrelevant features. The rapid performance degradation of KFD suggests that the feature-scaling technique is indeed necessary in the presence of many irrelevant features.

5.2 Benchmark Comparison. The purpose of this experiment is to compare FS-KFD with KFD, SVM, and KNN on a collection of benchmark data sets from the UCI Machine Learning Repository. These data sets have been extensively used in testing the performance of diverse kinds of learning algorithms. Information on these benchmark data sets is summarized in Table 1. The sizes of the training and test sets are shown in the second column of Table 1. For each training-test pair, the training samples are scaled to zero mean and unit variance, and the test samples are adjusted using the same linear transformation. The final errors, averaged over 10 random splits of the full data sets, are reported in Tables 2 and 3. Note that all model selection procedures are performed independently for each training-test pair, so that the standard error of the mean includes the variability due to the sensitivity of the model selection criterion to the partitioning of the data. The detailed experimental setups are summarized as follows:

1. For KFD, the leave-one-out error is used for model selection. We perform a grid search on the intervals log2(θ) = [−12, −10, . . . , 4] and log2(λ) = [−10, −9, . . . , 1]. Three possible multiclass schemes are considered: KFD with shared kernel parameters and regularization parameters, KFD with only shared kernel parameters, and KFD without shared free parameters.
Table 2: Mean and Variance of Test Errors Obtained by FS-KFD, KFD, Span Bound–Based SVM, and KNN.

Problem    FS-KFD(1)      KFD(1)         SVM(Span)      KNN
Breast      4.05 ± 0.71    4.11 ± 0.77    4.45 ± 0.76    3.85 ± 1.01
German     24.75 ± 1.88   23.35 ± 2.74   24.22 ± 2.19   27.35 ± 2.10
Diabetes   24.67 ± 1.75   23.45 ± 2.05   24.86 ± 1.59   26.68 ± 2.06
Liver      30.14 ± 5.34   29.72 ± 5.16   31.72 ± 5.26   39.66 ± 4.09
Vote        5.14 ± 1.40    5.62 ± 1.98    5.08 ± 1.86    7.08 ± 1.83
Glass      32.81 ± 9.63   33.28 ± 7.51   32.97 ± 6.93   31.87 ± 5.95
Splice      6.33 ± 1.27    6.90 ± 1.09    6.91 ± 0.61   10.32 ± 1.06
Yeast       5.83 ± 1.80    5.85 ± 2.02    6.67 ± 1.79    8.89 ± 2.23
Segment     4.59 ± 0.67    7.87 ± 0.80    6.57 ± 1.25    8.25 ± 1.03
Vehicle    17.72 ± 2.21   20.17 ± 2.00   17.05 ± 2.38   31.56 ± 1.83

Notes: FS-KFD(1) denotes FS-KFD with shared kernel parameters and regularization parameters. KFD(1) denotes KFD with shared kernel parameters and regularization parameters.
2. For SVM, the span bound (Vapnik & Chapelle, 2000) is used to optimize the kernel parameters and the regularization parameters. The initial setups are the same as in FS-KFD.

3. For KNN, the leave-one-out error is used to find the best number of neighbors k. We consider 50 different values from the interval [1, . . . , l − 1] (uniformly in logarithm) (Rätsch, 2001), where l is the size of the training set.

Two-tailed t-tests with significance level 0.05 are performed to determine whether there is a significant difference between FS-KFD and the other algorithms. The conclusions are summarized as follows. FS-KFD is significantly better than KFD on the Segment and Vehicle data sets. On the remaining data sets, FS-KFD and KFD achieve similar performance.

Table 3: Mean and Variance of Test Errors Obtained by FS-KFD and KFD.

Problem    FS-KFD(2)       FS-KFD(3)      KFD(2)         KFD(3)
Glass      34.53 ± 11.13   31.87 ± 8.31   33.44 ± 9.44   31.71 ± 9.58
Splice      6.16 ± 1.19     5.87 ± 1.00    6.95 ± 0.93    6.71 ± 0.93
Yeast       6.29 ± 2.61     6.67 ± 2.38    6.48 ± 2.58    7.59 ± 3.02
Segment     4.36 ± 0.66     4.61 ± 0.75    8.04 ± 0.98    7.62 ± 1.00
Vehicle    17.89 ± 2.07    18.58 ± 2.09   20.64 ± 2.13   20.40 ± 1.93

Notes: FS-KFD(2) and FS-KFD(3) denote FS-KFD with only shared kernel parameters and without shared free parameters, respectively. KFD(2) and KFD(3) denote KFD with only shared kernel parameters and without shared free parameters, respectively.
FS-KFD and the span bound–based SVM obtain similar performance on all data sets except Segment. FS-KFD is much better than KNN on all data sets except Breast and Glass. Pairwise two-tailed t-tests with a significance level of 0.05 are performed to determine whether there is a significant difference among the three multiclass schemes of FS-KFD and KFD. The resulting p-values indicate that there is no significant difference among the three multiclass schemes. In general, the feature-scaling technique improves the generalization performance of KFD and leads to a natural feature selection when irrelevant features occur. For example, on the Segment data set, the four largest scaling factors are 13.98, 3.50, 2.72, and 2.26, while all other scaling factors are smaller than 0.5.

5.3 Radar Target Recognition. Radar target recognition refers to the detection and recognition of target signatures using high-resolution range profiles, in our case from inverse synthetic aperture radar. A radar image represents a spatial distribution of microwave reflectivity that is sufficient to characterize the illuminated target. Range resolution allows the sorting of reflected signals on the basis of range. When range-gating or time-delay sorting is used to interrogate the entire range extent of the target space, a one-dimensional image, called a range profile, is generated. Figure 3 shows an example of such signatures for three different planes: J-6, J-7, and B-52. Our task is to recognize the range profiles of the three plane models, J-6, J-7, and B-52, based on experimental data acquired in a microwave anechoic chamber. The dimensionality of the range profiles is 64. The full data set is split into 359 training samples and 719 test samples. The training samples consist of 103 one-dimensional images of J-6, 149 of J-7, and 107 of B-52. The test samples consist of 206 one-dimensional images of J-6, 299 of J-7, and 214 of B-52. Experimental results for several classifiers are summarized in Table 4. It can be observed that FS-KFD is superior to the other classifiers in terms of classification accuracy on this data set.

6 Discussion

Our algorithm is not yet applicable to problems where the feature dimensionality is on the order of several hundred and the number of training samples on the order of several thousand, due to the high computational cost. This limitation can be overcome by integrating a feature preselection step into FS-KFD. An alternative way to break this limitation is to allow some associated features to share the same scaling factors. For example, in image recognition problems, it is reasonable that neighboring features share the same scaling factors. Exploiting effective feature preselection and reasonable feature-sharing schemes is an interesting research direction.
Figure 3: (A) One-dimensional image of J-6. (B) One-dimensional image of J-7. (C) One-dimensional image of B-52.
It is well known that the kernel function plays an important role in KFD. Choosing different kernel functions may result in different performance. The determination of an appropriate kernel for a specific application is far from fully understood. Consequently, combining FS-KFD with kernel construction tricks to improve the performance of KFD in a specific application is of potential importance.
Table 4: Number of Misclassifications of Several Classifiers on the Radar Target Recognition Problem.

Classifier                                   J-6/J-7/B-52
SVM (gaussian kernel)                        11
LS-SVM (gaussian kernel)                     11
RVM (gaussian kernel) (Tipping, 2001)        12
SPR (gaussian kernel) (Figueiredo, 2003)     12
KFD (gaussian kernel)                        13
FS-KFD (feature-scaling gaussian kernel)      7
One phenomenon worth mentioning is that the leave-one-out error resulting from the gradient-descent algorithm is smaller than the test error. The reason is that the leave-one-out error suffers from a large variance in small-sample cases. If some countermeasure, such as regularization of the leave-one-out error, is taken, this problem can be overcome. This is a topic we will pursue in future research.

Acknowledgments

We thank the two reviewers for their helpful comments that greatly improved the article and Lin Shi for her help in proofreading the manuscript. This work was supported by the National Natural Science Foundation of China under grants 60372050 and 60133010 and National 863 Project grant 2002AA135080.
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bengio, Y. (2000). Gradient-based optimization of hyper-parameters. Neural Computation, 12, 1889–1900.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36, 2585–2592.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Figueiredo, M. A. T. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1150–1159.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). Orlando, FL: Academic Press.
Gestel, T. V., Suykens, J., Lanckriet, G., Lambrechts, A., Moor, B. D., & Vandewalle, J. (2002). Bayesian framework for least squares support vector machine classifiers, gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 15, 1115–1148.
Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.
Krishnapuram, B., Hartemink, A., Carin, L., & Figueiredo, M. (2004). A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1105–1111.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7, 72–85.
Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition (in Russian). Techicheskaya Kibernetica, 3.
Mika, S. (2002). Kernel Fisher discriminants. Unpublished doctoral dissertation, University of Technology, Berlin.
Mika, S., Rätsch, G., & Weston, J. (1999). Fisher discriminant analysis with kernels. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 41–48). Piscataway, NJ: IEEE Press.
Ong, C. S., & Smola, A. J. (2003). Machine learning with hyperkernels. In Proceedings of the Twentieth International Conference on Machine Learning (pp. 568–575). Menlo Park, CA: AAAI Press.
Rätsch, G. (2001). Robust boosting via convex optimization. Unpublished doctoral dissertation, University of Potsdam, Potsdam, Germany.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12, 2013–2036.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 668–674). Cambridge, MA: MIT Press.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.
Xu, J., Zhang, X., & Li, Y. (2001). Kernel MSE algorithm: A unified framework for KFD, LS-SVM and KRR. In Proceedings of the International Joint Conference on Neural Networks (pp. 1486–1491). Piscataway, NJ: IEEE Press.
Received October 26, 2004; accepted August 4, 2005.
LETTER
Communicated by Todd Leen
Class-Incremental Generalized Discriminant Analysis

Wenming Zheng
[email protected]
Research Center for Learning Science, Southeast University, Nanjing, Jiangsu 210096, China
Generalized discriminant analysis (GDA) is the nonlinear extension of the classical linear discriminant analysis (LDA) via the kernel trick. Mathematically, GDA aims to solve a generalized eigenequation problem, which has always been implemented by the use of singular value decomposition (SVD) in the previously proposed GDA algorithms. A major drawback of SVD, however, is the difficulty of designing an incremental solution for the eigenvalue problem. Moreover, there are still numerical problems in computing the eigenvalue problem of large matrices. In this article, we propose another algorithm for solving GDA in the case of the small sample size problem, which applies QR decomposition rather than SVD. A major contribution of the proposed algorithm is that it can incrementally update the discriminant vectors when new classes are inserted into the training set. The other major contribution of this article is the presentation of the modified kernel Gram-Schmidt (MKGS) orthogonalization algorithm for implementing the QR decomposition in the feature space, which is more numerically stable than the kernel Gram-Schmidt (KGS) algorithm. We conduct experiments on both simulated and real data to demonstrate the better performance of the proposed methods.

1 Introduction

Generalized discriminant analysis (GDA) was proposed by Baudat and Anouar (2000) as the nonlinear extension of the classical linear discriminant analysis (LDA) (Duda & Hart, 1973) from the input space to a high-dimensional feature space via the kernel trick (Vapnik, 1995; Schölkopf, Smola, & Müller, 1998). However, the GDA method often suffers from the so-called small sample size (SSS) problem (Chen, Liao, Ko, Lin, & Yu, 2000; Zheng, Zhao, & Zou, 2004a; Cevikalp & Wilkes, 2004; Cevikalp, Neamtu, Wilkes, & Barkana, 2005), where the dimensionality of the feature space nonlinearly mapped from the input space is generally much larger than the number of the training samples, such that the optimal discriminant vectors of GDA lie in the null space of the within-class scatter matrix (Zheng, Zhao, & Zou, 2004b).

Neural Computation 18, 979–1006 (2006)
eigenequation (Baudat & Anouar, 2000; Zheng et al., 2004b):

S_B ω = λ S_T ω,

where S_B and S_T represent the between-class scatter matrix and the total scatter matrix in the feature space, respectively. However, in the case of the small sample size problem, the optimal discriminant vectors of GDA can be found from the null space of S_W (Zheng et al., 2004b), where S_W denotes the within-class scatter matrix. A simple and efficient way of solving GDA in this case is to find an orthonormal basis of the subspace S̄_T(0) ∩ S_W(0) (Zheng, Zhao, & Zou, 2005), where S_T(0) and S_W(0) represent the null spaces of S_T and S_W, respectively, and S̄_T(0) represents the complement of S_T(0).

Although it is tractable to solve GDA by utilizing Mercer kernels, a common aspect of the previously proposed algorithms is the use of singular value decomposition (SVD) (Baudat & Anouar, 2000; Zheng et al., 2005; Yang, Frangi, Jin, & Yang, 2004; Liu, Wang, Li, & Tan, 2004). A major common drawback of these algorithms is the difficulty of designing an incremental solution for the eigenvalue problem. The other major drawback of these SVD-based algorithms is numerical instability: the eigenvalues determined by the eigenvalue decomposition approach may be very close to one another, which results in instability of the eigenvectors according to perturbation theory (Fukunaga, 1990).

Recently, Xiong, Ye, Li, Cherkassky, and Janardan (2005) proposed a kernel discriminant analysis algorithm via QR decomposition (KDA/QR) to reduce the computational complexity of kernel discriminant analysis (KDA). However, similar to the kernel direct discriminant analysis (KDDA) approach (Lu, Plataniotis, & Venetsanopoulos, 2003), this method finds the discriminant vectors of KDA by limiting attention to the range space of S_B, which may not yield the optimal discriminant vectors in terms of the Fisher discriminant criterion, in particular in the case of the small sample size problem (Zheng et al., 2005). The other drawback of KDA/QR is that it is only a batch method, which requires that all the training data be available before computing the discriminant vectors (Ye et al., 2004). Thus, it is still time-consuming to update the discriminant vectors when new data items are inserted into the training set.

In this article, we propose a computationally efficient and numerically stable algorithm for GDA in the case of the small sample size problem. The proposed method can directly solve for the optimal discriminant vectors of GDA by applying only QR decomposition. More important, the proposed method introduces an incremental technique to update the discriminant vectors when new data items are inserted into the training set, which is very desirable for designing a dynamic recognition system. Moreover, this article also proposes a modified kernel Gram-Schmidt (MKGS) orthogonalization algorithm for implementing the QR decomposition in the feature space, which is much more numerically stable than the kernel Gram-Schmidt (KGS) orthogonalization algorithm proposed by Wolf and Shashua (2003).
In the next section, we review the KGS algorithm, and we then propose the MKGS algorithm in section 3. In section 4, we propose the batch GDA algorithm and the class-incremental GDA algorithm, respectively, using the MKGS algorithm. In section 5, we present the feature extraction method for classification based on the proposed GDA algorithm. Section 6 is devoted to the experiments on both simulated and real data. The conclusion is given in the last section.

2 KGS Algorithm

Let A be a matrix with k columns α_1, . . . , α_k, where α_i ∈ R^n (i = 1, . . . , k). Let Φ(·) be a mapping that maps the elements of R^n into a high-dimensional Hilbert space F, that is, Φ: R^n → F. Let

A^Φ = [Φ(α_1), . . . , Φ(α_k)].
(2.1)
Suppose that β_1, . . . , β_k are the equivalent orthonormal vectors corresponding to the columns of A^Φ. Then β_i (i = 1, . . . , k) can be computed by using the following classical Gram-Schmidt (CGS) orthonormal procedure (Björck, 1994):

1. β_1 = Φ(α_1);
2. Repeat for j = 2, . . . , k:
   β_j = Φ(α_j) − Σ_{i=1}^{j−1} (β_i^T Φ(α_j) / β_i^T β_i) β_i;
3. Repeat for j = 1, . . . , k: β_j = β_j/‖β_j‖, where ‖·‖ stands for the Euclidean norm.

However, directly computing the orthonormal vectors β_i (i = 1, . . . , k) is an intractable task because the mapping function Φ is hard to evaluate explicitly. Wolf and Shashua (2003) proposed an indirect approach to implement the above orthonormal procedure via the kernel trick (hereafter the KGS algorithm). More specifically, assume that k(x, y) is the reproducing kernel defined on the feature space F such that

k(x, y) = ⟨Φ(x), Φ(y)⟩ = (Φ(x))^T Φ(y),
(2.2)
where ⟨Φ(x), Φ(y)⟩ stands for the inner product of Φ(x) and Φ(y). Then, according to Wolf and Shashua (2003), the KGS algorithm can be summarized as follows:
KGS algorithm (Wolf & Shashua, 2003). Let A^Φ be a matrix with columns Φ(α_1), . . . , Φ(α_k), where Φ(α_1), . . . , Φ(α_k) are k linearly independent vectors. Then the corresponding orthonormal vectors of the columns of A^Φ can be obtained using the following steps, where s_j and t_j (j = 1, . . . , k) are k-dimensional vectors, D is a k by k diagonal matrix, and e_j = (0, . . . , 1, . . . , 0)^T is a k-dimensional vector whose jth entry is 1.

1. Let s_1 = t_1 = e_1, D_11 = k(α_1, α_1), where e_1 = (1, 0, . . . , 0)^T;
2. Repeat for j = 2, . . . , k:
   a. Compute s_j = (t_11 k(α_1, α_j)/D_11, . . . , Σ_{q=1}^{j−1} t_{q(j−1)} k(α_q, α_j)/D_{(j−1)(j−1)}, 1, 0, . . . , 0)^T;
   b. Compute t_j = (−t_1, . . . , −t_{j−1}, e_j, 0, . . . , 0)s_j;
   c. Compute D_jj = Σ_{p,q=1}^{j} t_{pj} t_{qj} k(α_p, α_q);
3. R = D^{1/2}[s_1, . . . , s_k];
4. R^{−1} = [t_1, . . . , t_k]D^{−1/2}.

The columns of the matrix [β_1, . . . , β_k] = [Φ(α_1), . . . , Φ(α_k)][t_1, . . . , t_k]D^{−1/2} are the corresponding orthonormal vectors of the columns of A^Φ.

3 MKGS Algorithm

The KGS algorithm proposed by Wolf and Shashua (2003) is essentially the kernelized version of the CGS algorithm. Thus, many numerical properties of CGS carry over to KGS. However, the experimental results of Rice (1966) and the theoretical analysis of Björck (1967) indicate that the CGS procedure is very sensitive to round-off errors. In other words, if the matrix A^Φ is ill conditioned, the computed vectors β_1, . . . , β_k will soon lose their orthogonality, and reorthogonalization will be needed. Thus, it is very desirable to modify the KGS algorithm to obtain a numerically superior algorithm for orthogonalizing the columns of A^Φ.

It is notable that the modified Gram-Schmidt (MGS) orthogonalization procedure is numerically superior to CGS; more details can be found in Rice (1966) and Björck (1967). Thus, we adopt the MGS procedure to modify the KGS algorithm. In general, the MGS procedure comes in two versions: the row-oriented procedure and the column-oriented procedure (Björck, 1994). According to Björck (1994), the two procedures are numerically equivalent: the operations and rounding errors are the same, and both produce the same numerical results. The main difference is that the column-oriented procedure is more appropriate when the orthogonalized vectors are obtained sequentially. Based on the column-oriented procedure, the orthonormal vectors β_1, . . . , β_k can be computed as follows (for simplicity, we use the notation Φ(α_i)^(0) = Φ(α_i), i = 1, . . . , k):
1. β_1 = Φ(α_1)^(0);
2. Repeat for j = 2, . . . , k:
   a. Repeat for m = 1, . . . , j − 1:
      Φ(α_j)^(m) = Φ(α_j)^(m−1) − (β_m^T Φ(α_j)^(m−1) / β_m^T β_m) β_m;
   b. β_j = Φ(α_j)^(j−1);
3. Repeat for j = 1, . . . , k: β_j = β_j/‖β_j‖.

Similar to the KGS algorithm, we implement the above procedure via the kernel trick; the result is hereafter called the MKGS algorithm.

MKGS Algorithm. Let A^Φ be a matrix with columns Φ(α_1), . . . , Φ(α_k), where Φ(α_1), . . . , Φ(α_k) are k linearly independent vectors. Then the corresponding orthonormal vectors of the columns of A^Φ can be obtained using the following steps, where s_j and t_j (j = 1, . . . , k) are k-dimensional vectors, D is a k by k diagonal matrix, Ψ is a k by k matrix, and e_j = (0, . . . , 1, . . . , 0)^T is a k-dimensional vector whose jth entry is 1.

1. Let s_1 = t_1 = e_1, D_11 = k(α_1, α_1), Ψ_i1 = k(α_i, α_1) (i = 1, . . . , k);
2. Repeat for j = 2, . . . , k:
   a. t_j^(1) = e_j;
   b. Repeat for i = 1, . . . , j − 1:
      s_ji = Σ_{p=1}^{j} Ψ_pi t_pj^(i) / D_ii;
      t_j^(i+1) = t_j^(i) − s_ji t_i;
   c. t_j = t_j^(j);
   d. Repeat for p = 1, . . . , k: Ψ_pj = Σ_{q=1}^{j} t_qj k(α_q, α_p);
   e. Compute D_jj = Σ_{p=1}^{j} Ψ_pj t_pj;
3. R = D^{1/2}[s_1, . . . , s_k], where s_i = [s_i1, s_i2, . . . , s_{i(i−1)}, 1, 0, . . . , 0]^T;
4. R^{−1} = [t_1, . . . , t_k]D^{−1/2}.

The columns of the matrix [β_1, . . . , β_k] = [Φ(α_1), . . . , Φ(α_k)][t_1, . . . , t_k]D^{−1/2} are the corresponding orthonormal vectors of the columns of A^Φ. By calculating the computational cost of each line of the above algorithm, we can easily see that the complexity of the MKGS algorithm is O(k³).
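To make the contrast between the two procedures concrete, the following is a minimal NumPy sketch of both kernelized orthogonalizations, transcribed from the pseudocode above (the function names kgs and mkgs and the linear-kernel demo are ours, not part of the original algorithms). Both work only with the Gram matrix K[p, q] = k(α_p, α_q) and return the coefficient vectors t_j together with the diagonal entries D_jj. The exact error figures printed by the demo depend on the random draw, but on ill-conditioned data KGS should lose noticeably more orthogonality than MKGS, which is the point the text makes.

import numpy as np

def kgs(K):
    """Kernel Gram-Schmidt (classical GS via the kernel trick).
    K[p, q] = k(alpha_p, alpha_q). Returns (T, d) so that the columns of
    A_Phi @ T @ diag(d) ** (-1/2) are orthonormal."""
    k = K.shape[0]
    T, d = np.zeros((k, k)), np.zeros(k)
    T[0, 0], d[0] = 1.0, K[0, 0]
    for j in range(1, k):
        t = np.zeros(k)
        t[j] = 1.0
        for i in range(j):
            # CGS: every projection coefficient uses the original column K[:, j]
            t -= (T[:, i] @ K[:, j] / d[i]) * T[:, i]
        T[:, j] = t
        d[j] = t @ K @ t                      # step 2c: D_jj = t^T K t
    return T, d

def mkgs(K):
    """Modified kernel Gram-Schmidt (column-oriented MGS via the kernel trick)."""
    k = K.shape[0]
    T, d = np.zeros((k, k)), np.zeros(k)
    Psi = np.zeros((k, k))                    # Psi[p, i] = <Phi(alpha_p), unnormalized beta_i>
    T[0, 0], d[0] = 1.0, K[0, 0]
    Psi[:, 0] = K[:, 0]
    for j in range(1, k):
        t = np.zeros(k)
        t[j] = 1.0
        for i in range(j):
            # MGS: the coefficient s_ji uses the *current*, already updated t
            t -= (Psi[:, i] @ t / d[i]) * T[:, i]
        T[:, j] = t
        Psi[:, j] = K @ t                     # step 2d
        d[j] = Psi[:, j] @ t                  # step 2e
    return T, d

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal(50)
    X = base + 1e-6 * rng.standard_normal((6, 50))   # six nearly collinear vectors
    K = X @ X.T                                      # linear kernel, so results are checkable
    for name, algo in (("KGS ", kgs), ("MKGS", mkgs)):
        T, d = algo(K)
        B = (X.T @ T) / np.sqrt(d)                   # the beta_j, made explicit for checking
        print(name, "max deviation from orthonormality:",
              np.abs(B.T @ B - np.eye(6)).max())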
4 GDA/MKGS: GDA via MKGS

Let X = {x_i^j}_{i=1,...,c; j=1,...,N_i} be an n-dimensional training sample set with N elements, where c is the number of classes and N_i is the number of samples in the ith class. The between-class scatter matrix S_B, the within-class scatter matrix S_W, and the total scatter matrix S_T in the feature space are respectively defined as

S_B = Σ_{i=1}^{c} N_i (u_i − u)(u_i − u)^T   (4.1)

S_W = Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u_i)(Φ(x_i^j) − u_i)^T   (4.2)

S_T = Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u)(Φ(x_i^j) − u)^T,   (4.3)

where x^T denotes the transpose of x, Φ(x_i^j) is the jth sample of the ith class mapped into the feature space, u_i is the mean of the ith class samples, and u is the mean of all samples in F:

u_i = (1/N_i) Σ_{j=1}^{N_i} Φ(x_i^j),   u = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{N_i} Φ(x_i^j).   (4.4)
4.1 Batch GDA/MKGS Algorithm. Let S_B(0) and S_W(0) denote the null spaces of S_B and S_W, and let S̄_B(0) and S̄_W(0) denote the complements of S_B(0) and S_W(0), respectively. Then, from the expressions of S_B, S_W, and S_T in equations 4.1, 4.2, and 4.3, we obtain

S̄_B(0) = span{u_i − u | i = 1, . . . , c}   (4.5)

S̄_W(0) = span{Φ(x_i^j) − u_i | i = 1, . . . , c; j = 1, . . . , N_i}   (4.6)

S̄_T(0) = span{Φ(x_i^j) − u | i = 1, . . . , c; j = 1, . . . , N_i}.   (4.7)

Note that Φ(x_i^j) − u = (Φ(x_i^j) − u_i) + (u_i − u). Thus, from equations 4.5, 4.6, and 4.7, we have

S̄_T(0) ⊆ span{Φ(x_i^j) − u_i, u_i − u | i = 1, . . . , c; j = 1, . . . , N_i}.   (4.8)

Moreover, we have the following two important theorems about S̄_B(0) and S̄_W(0):
Theorem 1. S̄_B(0) can be spanned by u_i − u_1 (i = 2, . . . , c), that is, S̄_B(0) = span{u_i − u_1 | i = 2, . . . , c}.

Theorem 2. S̄_W(0) can be spanned by Φ(x_i^j) − Φ(x_i^1) (i = 1, . . . , c; j = 2, . . . , N_i), that is, S̄_W(0) = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, . . . , c; j = 2, . . . , N_i}.

The proofs of theorems 1 and 2 are given in appendixes A and B, respectively. Theorem 1 will be useful for designing an incremental algorithm for updating the basis of S̄_B(0) when new classes are inserted into the training set, since the basis can be expressed as the span of the vectors u_i − u_1 (i = 2, . . . , c), which does not depend on the ensemble mean u of the training set. Similarly, theorem 2 will be useful for designing an incremental algorithm for updating the basis of S̄_W(0) when new instances are inserted into existing classes of the training set, since the basis can be expressed as the span of the vectors Φ(x_i^j) − Φ(x_i^1) (i = 1, . . . , c; j = 2, . . . , N_i), which does not depend on the class means u_i (i = 1, . . . , c). In the next section, we will show that theorems 1 and 2 are crucial for designing the class-incremental GDA/MKGS algorithm. Now let

A^Φ = [A_1^Φ, . . . , A_c^Φ, A_{c+1}^Φ],   (4.9)

where the matrices A_i^Φ are respectively defined by

A_i^Φ = [Φ(x_i^2) − Φ(x_i^1), Φ(x_i^3) − Φ(x_i^1), . . . , Φ(x_i^{N_i}) − Φ(x_i^1)]   (i = 1, . . . , c)   (4.10)

A_{c+1}^Φ = [u_2 − u_1, u_3 − u_1, . . . , u_c − u_1].   (4.11)

From theorems 1 and 2 and equations 4.9, 4.10, and 4.11, we obtain that S̄_W(0) and S̄_B(0) can be spanned, respectively, by the first N − c columns and the last c − 1 columns of the matrix A^Φ. Moreover, from equation 4.8, we obtain that S̄_T(0) lies in the span of the N − 1 columns of the matrix A^Φ. Without loss of generality, we assume that Φ(x_i^j) (i = 1, . . . , c; j = 1, . . . , N_i) are linearly independent. Then we have the following theorem regarding the rank of S̄_T(0):

Theorem 3. Suppose that Φ(x_i^j) (i = 1, . . . , c; j = 1, . . . , N_i) are linearly independent. Then the rank of S̄_T(0) is N − 1, that is, rank(S̄_T(0)) = N − 1.

The proof of theorem 3 is given in appendix C. Recall that S̄_T(0) lies in the span of the N − 1 columns of the matrix A^Φ. Thus, theorem 3 indicates that the N − 1 columns of the matrix A^Φ form a basis of S̄_T(0), where the first
N − c columns form a basis of S̄_W(0). Note that the optimal discriminant vectors of GDA lie in the subspace S̄_T(0) ∩ S_W(0) in the case of the small sample size problem (Zheng et al., 2004b). Thus, our goal is to obtain a basis of S̄_T(0) ∩ S_W(0). This can be accomplished by using the MKGS orthogonalization procedure.

Let A_ip^Φ denote the pth column of the matrix A_i^Φ. Then for i, j, m, n = 1, . . . , c, p = 1, . . . , N_i − 1, and q = 1, . . . , N_j − 1, we have

(A_ip^Φ)^T A_jq^Φ = (Φ(x_i^{p+1}) − Φ(x_i^1))^T (Φ(x_j^{q+1}) − Φ(x_j^1))
                = k(x_i^{p+1}, x_j^{q+1}) − k(x_i^{p+1}, x_j^1) − k(x_i^1, x_j^{q+1}) + k(x_i^1, x_j^1)   (4.12)

(A_ip^Φ)^T u_m = (1/N_m) Σ_{t=1}^{N_m} (Φ(x_i^{p+1}) − Φ(x_i^1))^T Φ(x_m^t)
              = (1/N_m) Σ_{t=1}^{N_m} (k(x_i^{p+1}, x_m^t) − k(x_i^1, x_m^t))   (4.13)

u_m^T u_n = (1/(N_m N_n)) (Σ_{p=1}^{N_m} Φ(x_m^p))^T (Σ_{q=1}^{N_n} Φ(x_n^q)) = (1/(N_m N_n)) Σ_{p=1}^{N_m} Σ_{q=1}^{N_n} k(x_m^p, x_n^q).   (4.14)

Let K be the (N − 1) by (N − 1) matrix defined by

K = (K_ij)_{i=1,...,c+1; j=1,...,c+1},   (4.15)

where

K_ij = (A_i^Φ)^T A_j^Φ.   (4.16)

Let (K_ij)_pq denote the element in the pth row and qth column of the matrix K_ij. Then for i, j = 1, . . . , c, m, n = 1, . . . , c − 1, p = 1, . . . , N_i − 1, and q = 1, . . . , N_j − 1, we have

(K_ij)_pq = (A_ip^Φ)^T A_jq^Φ   (4.17)

(K_{(c+1)j})_mq = (K_{j(c+1)})_qm = (A_jq^Φ)^T (u_{m+1} − u_1) = (A_jq^Φ)^T u_{m+1} − (A_jq^Φ)^T u_1   (4.18)

(K_{(c+1)(c+1)})_mn = (A_{(c+1)m}^Φ)^T A_{(c+1)n}^Φ = (u_{m+1} − u_1)^T (u_{n+1} − u_1)
                   = u_{m+1}^T u_{n+1} − u_{m+1}^T u_1 − u_1^T u_{n+1} + u_1^T u_1.   (4.19)
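Every entry of K thus reduces to evaluations of the base kernel on training samples. As a concrete illustration, the following sketch (ours; the helper names gda_kernel_matrix, rbf, and monomial are hypothetical) assembles K all at once: each column of A^Φ is a fixed linear combination of the mapped samples, collected in a coefficient matrix P (the same P that reappears in equation 5.6), so that K = P^T K_full P, where K_full is the N × N Gram matrix of all training samples. This is algebraically equivalent to evaluating equations 4.12 to 4.19 entry by entry.

import numpy as np

def rbf(X, Y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / sigma) of section 6."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def monomial(X, Y, d=2):
    """Monomial kernel k(x, y) = (x^T y)^d of section 6."""
    return (X @ Y.T) ** d

def gda_kernel_matrix(X_classes, kernel):
    """Build the (N-1) x (N-1) matrix K of equation 4.15 for class-partitioned
    data X_classes = [X_1, ..., X_c], each X_i of shape (N_i, n).
    Returns K and the coefficient matrix P with A^Phi = Phi(X) P."""
    sizes = [len(Xi) for Xi in X_classes]
    N, c = sum(sizes), len(X_classes)
    X = np.vstack(X_classes)
    K_full = kernel(X, X)                     # N x N Gram matrix of all samples
    P = np.zeros((N, N - 1))
    L, col, start = [], 0, 0
    for Ni in sizes:
        Li = np.zeros(N)
        Li[start:start + Ni] = 1.0 / Ni       # L_i of equation 5.2: u_i = Phi(X) L_i
        L.append(Li)
        for j in range(1, Ni):                # columns Phi(x_i^{j+1}) - Phi(x_i^1)
            P[start + j, col] = 1.0
            P[start, col] = -1.0
            col += 1
        start += Ni
    for i in range(1, c):                     # columns u_{i+1} - u_1
        P[:, col] = L[i] - L[0]
        col += 1
    return P.T @ K_full @ P, P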
According to equations 4.12 to 4.19, we can easily calculate the matrix K. Let K(i, j) denote the element in the ith row and jth column of K. According to the MKGS algorithm, we obtain the following batch GDA/MKGS algorithm, where s_j and t_j (j = 1, . . . , N − 1) are (N − 1)-dimensional vectors, D is an (N − 1) by (N − 1) diagonal matrix, and Ψ is an (N − 1) by (N − 1) matrix:

Batch GDA/MKGS Algorithm
1. Let s_1 = t_1 = e_1, D_11 = K(1, 1), Ψ_i1 = K(i, 1) (i = 1, . . . , N − 1);
2. Repeat for j = 2, . . . , N − 1:
   a. t_j^(1) = e_j;
   b. Repeat for i = 1, . . . , j − 1:
      s = Σ_{p=1}^{j} Ψ_pi t_pj^(i) / D_ii;
      t_j^(i+1) = t_j^(i) − s t_i;
   c. t_j = t_j^(j);
   d. Repeat for p = 1, . . . , N − 1: Ψ_pj = Σ_{q=1}^{j} t_qj K(q, p);
   e. Compute D_jj = Σ_{p=1}^{j} Ψ_pj t_pj;
3. [β_{N−c+1}, . . . , β_{N−1}] = A^Φ [t_{N−c+1}, . . . , t_{N−1}] (D_ij)^{−1/2}_{i,j=N−c+1,...,N−1}.

The c − 1 vectors β_{N−c+1}, . . . , β_{N−1} form an orthonormal basis of S̄_T(0) ∩ S_W(0); they are the discriminant vectors of GDA in the case of the small sample size problem. By calculating the computational cost of the above algorithm, we obtain that the complexity of the batch GDA/MKGS algorithm is O(N³). An end-to-end code sketch combining this algorithm with the classifier of section 5 follows equation 5.18.

4.2 Class-Incremental GDA/MKGS Algorithm. This section aims to design an incremental algorithm for updating the discriminant vectors of GDA/MKGS when new data items are inserted into the training set. We consider two distinct cases of inserted instances: (1) the instances belong to a new class, and (2) the instances belong to an existing class.

4.2.1 Insertion of a New Class. Recalling that we have c classes, let {x_{c+1}^j | j = 1, . . . , N_{c+1}} be the (c + 1)th class being inserted, where N_{c+1} is
the number of the new training samples. In this case, the expression in equation 4.9 can be rewritten as

Ã^Φ = [A_1^Φ, . . . , A_c^Φ, Ã_{c+1}^Φ, Ã_{c+2}^Φ],   (4.20)

where A_i^Φ (i = 1, . . . , c) are defined in equation 4.10, and Ã_{c+1}^Φ and Ã_{c+2}^Φ are respectively defined as follows:

Ã_{c+1}^Φ = [Φ(x_{c+1}^2) − Φ(x_{c+1}^1), Φ(x_{c+1}^3) − Φ(x_{c+1}^1), . . . , Φ(x_{c+1}^{N_{c+1}}) − Φ(x_{c+1}^1)]   (4.21)

Ã_{c+2}^Φ = [u_2 − u_1, u_3 − u_1, . . . , u_{c+1} − u_1].   (4.22)

The kernel matrix K in equation 4.15 is replaced by

K̃ = (Ã^Φ)^T Ã^Φ.   (4.23)

The elements of K̃ can be calculated by utilizing the kernel function. According to the batch GDA/MKGS algorithm, we have the following algorithm for updating the discriminant vectors when the (c + 1)th class is inserted into the training set, where K̃(i, j) denotes the element in the ith row and jth column of K̃, s̃_j and t̃_j (j = 1, . . . , N + N_{c+1} − 1) are (N + N_{c+1} − 1)-dimensional vectors, D̃ is an (N + N_{c+1} − 1) by (N + N_{c+1} − 1) diagonal matrix, and Ψ̃ is an (N + N_{c+1} − 1) by (N + N_{c+1} − 1) matrix.

Class-Incremental GDA/MKGS Algorithm 1: Updating Discriminant Vectors with the Insertion of the (c + 1)th Class
1. Repeat for j = 1, . . . , N − c:
   a. Compute t̃_j = (t_j^T, 0, . . . , 0)^T;
   b. Compute D̃_jj = D_jj;
   c. Repeat for i = 1, . . . , N − c: Ψ̃_ij = Ψ_ij;
   d. Repeat for i = N − c + 1, . . . , N + N_{c+1} − 1: Ψ̃_ij = Σ_{q=1}^{N−c} t̃_qj K̃(q, i);
2. Repeat for j = N − c + 1, . . . , N + N_{c+1} − 1:
   a. t̃_j^(1) = e_j;
   b. Repeat for i = 1, . . . , j − 1:
      s = Σ_{p=1}^{j} Ψ̃_pi t̃_pj^(i) / D̃_ii;
      t̃_j^(i+1) = t̃_j^(i) − s t̃_i;
   c. t̃_j = t̃_j^(j);
   d. Repeat for p = 1, . . . , N + N_{c+1} − 1: Ψ̃_pj = Σ_{q=1}^{j} t̃_qj K̃(q, p);
   e. Compute D̃_jj = Σ_{p=1}^{j} Ψ̃_pj t̃_pj;
3. [β_{N+N_{c+1}−c}, . . . , β_{N+N_{c+1}−1}] = Ã^Φ [t̃_{N+N_{c+1}−c}, . . . , t̃_{N+N_{c+1}−1}] (D̃_ij)^{−1/2}_{i,j=N+N_{c+1}−c,...,N+N_{c+1}−1}.

The c vectors β_{N+N_{c+1}−c}, . . . , β_{N+N_{c+1}−1} are the new discriminant vectors of GDA after the (c + 1)th class is inserted into the training set. By calculating the computational cost of the above algorithm, we obtain that the complexity of the class-incremental GDA/MKGS algorithm for updating the discriminant vectors with the insertion of the (c + 1)th class is O{(N_{c+1} + c)(N + N_{c+1})²}.
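The computational saving comes from a warm start: the first N − c columns of Ã^Φ coincide with the first N − c columns of A^Φ, so the top-left (N − c) × (N − c) block of K̃ equals that of K, and the previously computed t_j and D_jj remain valid after zero-padding. A minimal sketch of this core step (ours, written against the conventions of the mkgs sketch in section 3, with keep = N − c) follows; only the last N_{c+1} + c − 1 columns are orthogonalized from scratch, which is where the complexity count above comes from.

import numpy as np

def mkgs_extend(K_new, T_old, d_old, keep):
    """Continue MKGS on an enlarged Gram matrix K_new, reusing the first
    `keep` already-orthogonalized columns (steps 1a-1d of the incremental
    algorithm); the remaining columns are processed by the plain MKGS loop."""
    k = K_new.shape[0]
    T, d = np.zeros((k, k)), np.zeros(k)
    T[:T_old.shape[0], :keep] = T_old[:, :keep]   # step 1a: zero-padded old t_j
    d[:keep] = d_old[:keep]                       # step 1b: old D_jj are reused
    Psi = np.zeros((k, k))
    Psi[:, :keep] = K_new @ T[:, :keep]           # steps 1c-1d in one product
    for j in range(keep, k):                      # step 2: ordinary MKGS
        t = np.zeros(k)
        t[j] = 1.0
        for i in range(j):
            t -= (Psi[:, i] @ t / d[i]) * T[:, i]
        T[:, j] = t
        Psi[:, j] = K_new @ t
        d[j] = Psi[:, j] @ t
    return T, d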
4.2.2 Insertion of a New Instance from an Existing Class. Suppose that x is an instance being inserted into the ith (1 ≤ i ≤ c) class. For simplicity of notation, we denote x by x_i^{N_i+1}, since there are N_i samples in the ith class. Then the mean of the ith class, denoted by ũ_i, is expressed as

ũ_i = (1/(N_i + 1)) Σ_{j=1}^{N_i+1} Φ(x_i^j).   (4.24)
Without loss of generality, we assume that i > 1. Then the expression in equation 4.9 can be rewritten as

Ã^Φ = [A_1^Φ, . . . , A_c^Φ, Φ(x_i^{N_i+1}) − Φ(x_i^1), Ã_{c+1}^Φ],   (4.25)

where A_i^Φ (i = 1, . . . , c) are defined in equation 4.10, and Ã_{c+1}^Φ is defined as

Ã_{c+1}^Φ = [u_2 − u_1, . . . , u_{i−1} − u_1, ũ_i − u_1, u_{i+1} − u_1, . . . , u_c − u_1].   (4.26)

The new kernel matrix, denoted by K̃, is expressed as

K̃ = (Ã^Φ)^T Ã^Φ.   (4.27)

According to the batch GDA/MKGS algorithm, we have the following incremental algorithm for updating the discriminant vectors with the insertion of an instance in the ith class, where K̃(i, j) denotes the element in the ith row and jth column of K̃, s̃_j and t̃_j (j = 1, . . . , N) are N-dimensional vectors, D̃ is an N by N diagonal matrix, and Ψ̃ is an N by N matrix.
Class-Incremental GDA/MKGS Algorithm 2: Updating Discriminant Vectors with the Insertion of a New Instance in the ith (i ≤ c) Class
1. Repeat for j = 1, . . . , N − c:
   a. Compute t̃_j = (t_j^T, 0, . . . , 0)^T;
   b. Compute D̃_jj = D_jj;
   c. Repeat for i = 1, . . . , N − c: Ψ̃_ij = Ψ_ij;
   d. Repeat for i = N − c + 1, . . . , N: Ψ̃_ij = Σ_{q=1}^{N−c} t̃_qj K̃(q, i);
2. Repeat for j = N − c + 1, . . . , N:
   a. t̃_j^(1) = e_j;
   b. Repeat for i = 1, . . . , j − 1:
      s = Σ_{p=1}^{j} Ψ̃_pi t̃_pj^(i) / D̃_ii;
      t̃_j^(i+1) = t̃_j^(i) − s t̃_i;
   c. t̃_j = t̃_j^(j);
   d. Repeat for p = 1, . . . , N: Ψ̃_pj = Σ_{q=1}^{j} t̃_qj K̃(q, p);
   e. Compute D̃_jj = Σ_{p=1}^{j} Ψ̃_pj t̃_pj;
3. [β_{N+2−c}, . . . , β_N] = Ã^Φ [t̃_{N+2−c}, . . . , t̃_N] (D̃_ij)^{−1/2}_{i,j=N+2−c,...,N}.

The c − 1 vectors β_{N+2−c}, . . . , β_N are the new discriminant vectors of GDA after the new instance of the ith class is inserted into the training set. By calculating the computational cost of the above algorithm, we obtain that the complexity of the class-incremental GDA/MKGS algorithm for updating the discriminant vectors with the insertion of a new instance is O(cN²).

5 Feature Extraction for Classification

Let Φ(X) = [Φ(x_1^1) · · · Φ(x_1^{N_1}) · · · Φ(x_c^1) · · · Φ(x_c^{N_c})]. Then we have

Φ(x_i^j) = Φ(X) e_{k+j},   (5.1)

where k = Σ_{t=1}^{i−1} N_t. From equations 4.4, 4.10, and 5.1, we have

u_i = (1/N_i) Σ_{t=1}^{N_i} Φ(X) e_{k+t} = Φ(X) ((1/N_i) Σ_{t=1}^{N_i} e_{k+t}) = Φ(X) L_i   (5.2)

A_i^Φ = Φ(X)[e_{k+2} − e_{k+1}, e_{k+3} − e_{k+1}, . . . , e_{k+N_i} − e_{k+1}]   (i = 1, . . . , c),   (5.3)

where L_i = (1/N_i) Σ_{t=1}^{N_i} e_{k+t}.
From equations 4.11 and 5.2, we have
A_{c+1}^Φ = Φ(X)[L_2 − L_1, L_3 − L_1, . . . , L_c − L_1].   (5.4)

Combining equations 4.9, 5.3, and 5.4, we obtain

A^Φ = [A_1^Φ, . . . , A_c^Φ, A_{c+1}^Φ] = Φ(X) P,   (5.5)

where

P = [e_2 − e_1, . . . , e_{N_1} − e_1, e_{N_1+2} − e_{N_1+1}, . . . , e_N − e_{N−N_c+1}, L_2 − L_1, . . . , L_c − L_1].   (5.6)

Thus, according to the batch GDA/MKGS algorithm, the projection matrix of GDA can be expressed as

W_{GDA/MKGS} = [β_{N−c+1}, . . . , β_{N−1}] = Φ(X) P T,   (5.7)

where

T = [(1/√D_{(N−c+1)(N−c+1)}) t_{N−c+1}, . . . , (1/√D_{(N−1)(N−1)}) t_{N−1}].   (5.8)
The projection of a test point Φ(x_test) onto W_{GDA/MKGS} can be calculated by

y_test = W_{GDA/MKGS}^T Φ(x_test) = T^T P^T K_test,   (5.9)

where

K_test = [k(x_1^1, x_test), k(x_1^2, x_test), . . . , k(x_c^{N_c}, x_test)]^T.   (5.10)

Note that the discriminant vectors of GDA/MKGS lie in the subspace S_W(0). Thus, we have

W_{GDA/MKGS}^T S_W = 0.   (5.11)

From equations 4.2 and 5.11, we have

W_{GDA/MKGS}^T [Φ(x_1^1) − u_1, . . . , Φ(x_1^{N_1}) − u_1, . . . , Φ(x_c^{N_c}) − u_c] = 0.   (5.12)

From equation 5.12, we obtain that

W_{GDA/MKGS}^T Φ(x_i^j) = W_{GDA/MKGS}^T u_i   (j = 1, . . . , N_i; i = 1, . . . , c).   (5.13)
Equation 5.13 means that the training data of each class are projected onto the same point in the projection space. Now let X_i denote the ith class data set in the feature space, and let y_i^j and ȳ_i represent the respective projections of Φ(x_i^j) and u_i onto the projection matrix W_{GDA/MKGS}, that is,

y_i^j = W_{GDA/MKGS}^T Φ(x_i^j),   ȳ_i = W_{GDA/MKGS}^T u_i.   (5.14)

From equations 5.13 and 5.14, we have

y_i^1 = y_i^2 = · · · = y_i^j = · · · = y_i^{N_i} = ȳ_i.   (5.15)
Based on the nearest-neighbor rule, we define the distance between the projection of the test point Φ(x_test) and the data set X_i as follows:

d_p(Φ(x_test), X_i) = min{‖y_test − y_i^j‖},   j = 1, . . . , N_i.   (5.16)

Combining equation 5.16 with 5.15, we obtain that

d_p(Φ(x_test), X_i) = ‖y_test − y_i^1‖.   (5.17)

Therefore, based on the nearest-neighbor rule, the associated class index of the test point can be obtained as follows:

c* = arg min_i d_p(Φ(x_test), X_i) = arg min_i ‖y_test − y_i^1‖.   (5.18)
r r
Monomial kernel: k(x, y) = (x T y)d , where d is the monomial degree Gaussian kernel: k(x, y) = exp(−x − y2 σ ), where σ is the parameter of the gaussian kernel
6.1 Toy Example. In this example, we use toy data to test the performance of MKGS and KGS for implementing the QR decomposition. The GDA algorithm via KGS (GDA/KGS) includes the batch GDA/KGS algorithm and the class-incremental GDA/KGS algorithm for updating discriminant vectors with the insertion of the (c + 1)th class; these two algorithms are given in appendixes D and E, respectively. Four clusters of artificial two-dimensional data are generated by the function y = x² + 0.5 + ε, where the x values have a uniform distribution on [−1, 1] and ε is a uniformly distributed random number on the interval [0, 0.5]. For each cluster, we generate 100 samples. The gaussian kernel with parameter σ = 0.1 is used throughout the experiment to calculate the kernel matrix. The experiment has three steps:

1. Choose the samples in the first two clusters as training data to compute the discriminant vector using the batch GDA/MKGS algorithm and the batch GDA/KGS algorithm, respectively. Then compute and display the projections of the test data onto the computed discriminant vector. Figures 1a and 2a, respectively, display the experimental results, where each panel shows the feature values (indicated by shade of gray) and contour lines of identical feature values.

2. Insert the third cluster into the training set, and then compute two discriminant vectors for discriminating the three clusters using the class-incremental GDA/MKGS algorithm and the class-incremental GDA/KGS algorithm, respectively. Figures 1b and 1c display the projections of the test data onto the two discriminant vectors of GDA/MKGS, while Figures 2b and 2c display the projections of the test data onto the two discriminant vectors of GDA/KGS.

3. Insert the fourth cluster into the training set, and then compute three discriminant vectors for discriminating the four clusters using the class-incremental GDA/MKGS algorithm and the class-incremental GDA/KGS algorithm, respectively. Figures 1d through 1f display the projections of the test data onto the three discriminant vectors of GDA/MKGS, while Figures 2d through 2f display the projections of the test data onto the three discriminant vectors of GDA/KGS.

Comparing the experimental results of Figure 1 with those of Figure 2, we can see that the GDA/MKGS algorithm achieves much better performance than the GDA/KGS algorithm: the projections of the toy data are nicely separated in Figure 1, whereas they are not well separated in Figure 2.

6.2 Face Recognition. This example aims to demonstrate the better performance of the class-incremental GDA/MKGS algorithm in terms of recognition accuracy and training time for the dynamic face recognition task. We use the AR face database (Martinez & Benavente, 1998) to perform
Figure 1: Feature extraction by class-incremental GDA/MKGS. (a) Projections onto the discriminant vector computed from the first two clusters using the batch GDA/MKGS algorithm. (b, c) Projections onto the two discriminant vectors computed from the first three clusters using the class-incremental GDA/MKGS algorithm. (d–f) Projections onto the three discriminant vectors computed from the four clusters using the class-incremental GDA/MKGS algorithm.
this experiment. The AR face database consists of over 3000 facial images of 126 subjects. Each subject has 26 facial images recorded in two different sessions separated by two weeks, and each session consists of 13 images. The original image size is 768 × 576 pixels, and each pixel is represented by 24 bits of RGB color values. Figure 3 shows the 26 images for one subject; images 1 to 13 were taken in the first session and images 14 to 26 in the second session. Among the 126 subjects, we randomly select 70 subjects (50 males and 20 females) for this experiment. Similar to Cevikalp et al. (2005), we use only the nonoccluded images (those numbered 1 to 7 and 14 to 20) for the experiment. Before the experiment, all images are centered and cropped to the size of 468 × 476 pixels and then down-sampled to 100 × 100 pixels.
Figure 2: Feature extraction by class-incremental GDA/KGS. (a) Projections onto the discriminant vector computed from the first two clusters using the batch GDA/KGS algorithm. (b, c) Projections onto the two discriminant vectors computed from the first three clusters using the class-incremental GDA/KGS algorithm. (d–f) Projections onto the three discriminant vectors computed from the four clusters using the class-incremental GDA/KGS algorithm.
We use the twofold cross-validation method (Fukunaga, 1990) to perform the experiment: divide all the images into two subsets, and use one subset as the training set and the other as the test set. After performing the experiment, we swap the training set and the test set and repeat the experiment. Because this experiment aims to demonstrate the better performance of the class-incremental GDA/MKGS algorithm for the dynamic face recognition task in terms of recognition accuracy and training time, it focuses on the dynamic recognition procedure in which new subjects are inserted into the training set. More specifically, we design the two trials of the twofold cross-validation as follows. In the first trial, we choose 61 of the 70 subjects for the experiment; we use the seven images
Figure 3: Images of one subject in the AR face database.
numbered 1, 2, 3, 4, 14, 15, and 16 of each subject as training images and the other seven images numbered 5, 6, 7, 17, 18, 19, and 20 of each subject as test images. The discriminant vectors are computed using the batch GDA/MKGS algorithm, and the test recognition rate is then calculated based on the nearest-neighbor classifier. After finishing the recognition, we choose one of the remaining subjects and insert its seven images numbered 1, 2, 3, 4, 14, 15, and 16 into the training set and the other seven images into the test set. Then we update the discriminant vectors using the class-incremental GDA/MKGS algorithm and recalculate the test recognition rate. This procedure is repeated until all 70 subjects are included in the training set and the test set. In the second trial, we swap the training images and the test images for each subject; that is, for each subject, we use the seven images numbered 1, 2, 3, 4, 14, 15, and 16 as test images and the other seven images numbered 5, 6, 7, 17, 18, 19, and 20 as training images, and then perform the same recognition procedure as in the first trial.

For comparison, we conduct the same experiment using other commonly used face recognition methods, including the Eigenfaces method (Turk & Pentland, 1991), the Fisherfaces method (Belhumeur, Hespanha, & Kriegman, 1997), the LDA method via the MGS algorithm (LDA/MGS; Zheng, Zou, & Zhao, 2004), the KPCA method (Schölkopf et al., 1998), and the standard GDA method (Baudat & Anouar, 2000). The monomial kernel with degree d = 2 and the gaussian kernel with σ = 1e8 are used in the
Figure 4: Average recognition rates of various systems with respect to the number of the training classes in the AR face recognition experiment, where the monomial kernel with degree d = 2 is used.
experiments. Figure 4 shows the average recognition rates of the two trials with respect to the number of training classes when the monomial kernel with degree d = 2 is used, and Figure 5 shows the average recognition rates when the gaussian kernel with σ = 1e8 is used. As we can see from Figures 4 and 5, the GDA/MKGS method achieves the best recognition rate among the commonly used face recognition methods. Moreover, to demonstrate the incremental technique of the proposed algorithm, we compare the average training time of the batch GDA/MKGS algorithm and the class-incremental GDA/MKGS algorithm for updating the discriminant vectors when a new class is inserted into the training set. Figure 6 shows the experimental results. From Figure 6 we can clearly see that the class-incremental GDA/MKGS approach requires much less training time than the batch GDA/MKGS approach.

6.3 Handwritten Digital Character Recognition. In this example, we conduct a handwritten digital character recognition experiment to further demonstrate the recognition performance of the proposed algorithm.
Figure 5: Average recognition rates of various systems with respect to the number of the training classes in the AR face recognition experiment, where the gaussian kernel with parameter σ = 1e8 is used.
We use the handwritten digits database of the U.S. Postal Service (USPS), collected from mail envelopes in Buffalo, as the experimental data. The USPS database contains 7291 training points and 2007 test points of dimensionality 256 (Schölkopf et al., 1998). We choose the first 1500 training points as training data and all 2007 test points as test data for the experiment. For comparison, this experiment is also conducted using the PCA method, the LDA method, the KPCA method, and the traditional GDA method, respectively, where we use the monomial kernel with a different degree in each trial to calculate the kernel matrix for the kernel-based algorithms. The nearest-neighbor classifier is used throughout the experiments for the classification task. Table 1 shows the best average recognition rates of the various systems. From Table 1, we can see that the GDA/MKGS method achieves the best recognition rate among the systems. Moreover, we also compare the average recognition rates of the kernel-based algorithms with respect to the choice of the monomial degree. The experimental results are shown in Figure 7.
Figure 6: Average training time of computing the discriminant vectors with respect to the number of the training classes.

Table 1: Average Recognition Rates of Various Systems.

Method                         PCA     LDA     KPCA    GDA     GDA/MKGS
Average recognition rate (%)   92.28   86.10   92.63   93.17   93.82

It can be clearly seen from Figure 7 that the GDA/MKGS
algorithm can significantly improve the performance of the traditional GDA algorithm.

Figure 7: Average recognition rates of three kernel-based algorithms with respect to the different choice of monomial degree.

7 Conclusion

In this work, an efficient algorithm has been presented to compute the discriminant vectors of GDA in the case of the small sample size problem. By applying QR decomposition rather than SVD, the proposed algorithm is computationally fast compared to the traditional GDA algorithms. More important, the proposed algorithm introduces an effective technique to update the discriminant vectors of GDA when new classes are inserted into
the training set, which is very desirable for designing dynamic recognition systems. Moreover, we have proposed a modified KGS algorithm in this work to replace the KGS algorithm proposed by Wolf and Shashua (2003) for implementing the QR decomposition in the feature space. This algorithm has turned out to be much more numerically stable than the KGS algorithm. Experiments on both simulated and real data sets demonstrated the better performance of the class-incremental GDA/MKGS algorithm. On the simulated toy data, our experiment demonstrated the superiority of MKGS to KGS for implementing the QR decomposition. And the face recognition experiments on the AR face database and handwritten digital character recognition on the USPS database demonstrated the high recognition rates of the proposed GDA/MKGS algorithm in contrast to other commonly used face recognition methods. Moreover, our face experiments demonstrated the computational advantage of the class-incremental GDA/MKGS algorithm when new classes are dynamically inserted into the training set.
Appendix A: Proof of Theorem 1

Proof. Because u_i − u_1 = (u_i − u) − (u_1 − u), we get

span{u_i − u_1 | i = 2, . . . , c} ⊆ span{u_i − u | i = 1, . . . , c}.   (A.1)

Note that u − u_1 = (1/N) Σ_{i=1}^{c} N_i (u_i − u_1) = (1/N) Σ_{i=2}^{c} N_i (u_i − u_1). Thus, we get

span{u_i − u_1 | i = 2, . . . , c} = span{u − u_1, u_i − u_1 | i = 2, . . . , c}.   (A.2)

Because u_i − u = (u_i − u_1) − (u − u_1), we get

span{u_i − u | i = 1, . . . , c} ⊆ span{u − u_1, u_i − u_1 | i = 2, . . . , c}.   (A.3)

From equations A.2 and A.3, we obtain

span{u_i − u | i = 1, . . . , c} ⊆ span{u_i − u_1 | i = 2, . . . , c}.   (A.4)

From equations A.1 and A.4, we obtain

span{u_i − u | i = 1, . . . , c} = span{u_i − u_1 | i = 2, . . . , c}.   (A.5)
Combining equations 4.1 and A.5, we obtain S̄_B(0) = span{u_i − u_1 | i = 2, . . . , c}.

Appendix B: Proof of Theorem 2

Proof. Note that

Φ(x_i^j) − Φ(x_i^1) = (Φ(x_i^j) − u_i) − (Φ(x_i^1) − u_i).
(B.1)
Thus, we obtain

span{Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i} ⊆ span{Φ(x_i^j) − u_i | j = 1, . . . , N_i}.   (B.2)

Note that

Σ_{j=2}^{N_i} (Φ(x_i^j) − Φ(x_i^1)) = Σ_{j=1}^{N_i} (Φ(x_i^j) − Φ(x_i^1)) = N_i (u_i − Φ(x_i^1)).   (B.3)

Thus, we get

span{Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i} = span{u_i − Φ(x_i^1), Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i}.   (B.4)

Because Φ(x_i^j) − u_i = (Φ(x_i^j) − Φ(x_i^1)) − (u_i − Φ(x_i^1)), we get

span{Φ(x_i^j) − u_i | j = 1, . . . , N_i} ⊆ span{u_i − Φ(x_i^1), Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i}.   (B.5)

From equations B.4 and B.5, we get

span{Φ(x_i^j) − u_i | j = 1, . . . , N_i} ⊆ span{Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i}.   (B.6)

From equations B.2 and B.6, we obtain

span{Φ(x_i^j) − u_i | j = 1, . . . , N_i} = span{Φ(x_i^j) − Φ(x_i^1) | j = 2, . . . , N_i}.   (B.7)

From equation B.7, we obtain

span{Φ(x_i^j) − u_i | i = 1, . . . , c; j = 1, . . . , N_i} = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, . . . , c; j = 2, . . . , N_i}.   (B.8)

From equations 4.2 and B.8, we obtain

S̄_W(0) = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, . . . , c; j = 2, . . . , N_i}.
Appendix C: Proof of Theorem 3

Proof. Because Φ(x_i^j) (i = 1, . . . , c; j = 1, . . . , N_i) are linearly independent, we have

rank[Φ(x_1^1), Φ(x_1^2), . . . , Φ(x_c^{N_c})] = Σ_i N_i = N.   (C.1)

Moreover, we have

rank[Φ(x_1^1), Φ(x_1^2), . . . , Φ(x_c^{N_c})] = rank[Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1)].   (C.2)

From equations C.1 and C.2, we obtain that

rank[Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1)] = N.   (C.3)

Equation C.3 means that the N vectors Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1) are linearly independent. Thus, we obtain that

rank[Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1)] = N − 1.   (C.4)

Moreover, we have

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, . . . , Φ(x_c^{N_c}) − u] = rank[Φ(x_1^1) − u, Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1)] ≥ rank[Φ(x_1^2) − Φ(x_1^1), . . . , Φ(x_c^{N_c}) − Φ(x_1^1)].   (C.5)

From equations C.4 and C.5, we obtain that

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, . . . , Φ(x_c^{N_c}) − u] ≥ N − 1.   (C.6)

Note that Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u) = 0. Thus, we have

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, . . . , Φ(x_c^{N_c}) − u] = rank[Φ(x_1^2) − u, . . . , Φ(x_c^{N_c}) − u] ≤ N − 1.   (C.7)

Combining equations C.6 and C.7, we obtain that

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, . . . , Φ(x_c^{N_c}) − u] = N − 1.   (C.8)

Note that S̄_T(0) = span{Φ(x_i^j) − u | i = 1, . . . , c; j = 1, . . . , N_i}. Thus, from equation C.8, we obtain that rank(S̄_T(0)) = N − 1.
Appendix D: Batch GDA/KGS Algorithm

1. Let s_1 = t_1 = e_1, D_11 = K(1, 1);
2. Repeat for j = 2, . . . , N − 1:
   a. Compute s_j = (t_11 K(1, j)/D_11, . . . , Σ_{q=1}^{j−1} t_{q(j−1)} K(q, j)/D_{(j−1)(j−1)}, 1, 0, . . . , 0)^T;
   b. Compute t_j = (−t_1, . . . , −t_{j−1}, e_j, 0, . . . , 0)s_j;
   c. Compute D_jj = Σ_{p,q=1}^{j} t_pj t_qj K(p, q);
3. [β_{N−c+1}, . . . , β_{N−1}] = A^Φ [t_{N−c+1}, . . . , t_{N−1}] (D_ij)^{−1/2}_{i,j=N−c+1,...,N−1}.

The c − 1 vectors β_{N−c+1}, . . . , β_{N−1} are the discriminant vectors of GDA in the case of the small sample size problem.

Appendix E: Class-Incremental GDA/KGS Algorithm: Updating Discriminant Vectors with the Insertion of the (c + 1)th Class

1. Repeat for j = 1, . . . , N − c:
   a. Compute t̃_j = (t_j^T, 0, . . . , 0)^T;
   b. Compute D̃_jj = D_jj;
2. Repeat for j = N − c + 1, . . . , N + N_{c+1} − 1:
   a. Compute s̃_j = (t̃_11 K̃(1, j)/D̃_11, . . . , Σ_{q=1}^{j−1} t̃_{q(j−1)} K̃(q, j)/D̃_{(j−1)(j−1)}, 1, 0, . . . , 0)^T;
   b. Compute t̃_j = (−t̃_1, . . . , −t̃_{j−1}, e_j, 0, . . . , 0)s̃_j;
   c. Compute D̃_jj = Σ_{p,q=1}^{j} t̃_pj t̃_qj K̃(p, q);
3. [β_{N+N_{c+1}−c}, . . . , β_{N+N_{c+1}−1}] = Ã^Φ [t̃_{N+N_{c+1}−c}, . . . , t̃_{N+N_{c+1}−1}] (D̃_ij)^{−1/2}_{i,j=N+N_{c+1}−c,...,N+N_{c+1}−1}.

The c vectors β_{N+N_{c+1}−c}, . . . , β_{N+N_{c+1}−1} are the new discriminant vectors of GDA after the (c + 1)th class is inserted into the training set.

Acknowledgments

I thank the anonymous reviewers for their valuable comments and suggestions. I also thank Songcan Chen from the Department of Computer Science and Engineering, University of Aeronautics and Astronautics, China, for his kind discussions and valuable advice. This work was supported in part by the National Natural Science Foundation of China under grants 60503023 and 60473035, and in part by the Jiangsu Natural Science Foundation under grants BK2005407 and BK2005122.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Björck, Å. (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT, 7, 1–21.
Björck, Å. (1994). Numerics of Gram-Schmidt orthogonalization. Linear Algebra and Its Applications, 197–198, 297–316.
Cevikalp, H., Neamtu, M., Wilkes, M., & Barkana, A. (2005). Discriminative common vectors for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 4–13.
Cevikalp, H., & Wilkes, M. (2004). Face recognition by using discriminant common vectors. In Proceedings of the 17th International Conference on Pattern Recognition (pp. 326–329). Piscataway, NJ: IEEE Computer Society.
Chen, L. F., Liao, H. Y. M., Ko, M. T., Lin, J. C., & Yu, G. J. (2000). A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33, 1713–1726.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Liu, W., Wang, Y., Li, S. Z., & Tan, T. (2004). Null space-based kernel Fisher discriminant analysis for face recognition. In Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (pp. 369–374). Piscataway, NJ: IEEE Computer Society.
Lu, J., Plataniotis, K., & Venetsanopoulos, A. (2003). Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1), 117–126.
Martinez, A. M., & Benavente, R. (1998). The AR face database (CVC Tech. Rep. No. 24). Barcelona, Spain: Computer Vision Center.
Rice, J. R. (1966). Experiments on Gram-Schmidt orthogonalization. Mathematics of Computation, 20, 325–328.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Wolf, L., & Shashua, A. (2003). Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4, 913–931.
Xiong, T., Ye, J., Li, Q., Cherkassky, V., & Janardan, R. (2005). Efficient kernel discriminant analysis via QR decomposition. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1529–1536). Cambridge, MA: MIT Press.
Yang, J., Frangi, A. F., Jin, Z., & Yang, J.-Y. (2004). Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition, 37, 2097–2100.
Ye, J., Li, Q., Xiong, H., Park, H., Janardan, R., & Kumar, V. (2004). IDR/QR: An incremental dimension reduction algorithm via QR decomposition. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 364–373). New York: ACM Press.
Zheng, W., Zhao, L., & Zou, C. (2004a). An efficient algorithm to solve the small sample size problem for LDA. Pattern Recognition, 37, 1077–1079.
Zheng, W., Zhao, L., & Zou, C. (2004b). A modified algorithm for generalized discriminant analysis. Neural Computation, 16, 1283–1297.
Zheng, W., Zhao, L., & Zou, C. (2005). Foley-Sammon optimal discriminant vectors using kernel approach. IEEE Transactions on Neural Networks, 16(1), 1–9.
Zheng, W., Zou, C., & Zhao, L. (2004). Real-time face recognition using Gram-Schmidt orthogonalization algorithm for LDA. In Proceedings of the 17th International Conference on Pattern Recognition (pp. 403–406). Piscataway, NJ: IEEE Computer Society.
Received January 10, 2005; accepted August 19, 2005.
ARTICLE
Communicated by Noboru Murata
Singularities Affect Dynamics of Learning in Neuromanifolds
Shun-ichi Amari
[email protected] RIKEN Brain Science Institute, Saitama, 351-0198, Japan
Hyeyoung Park
[email protected] Kyungpook National University, Korea
Tomoko Ozeki
[email protected] RIKEN Brain Science Institute, Saitama, 351-0198, Japan
The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, the ARMA time-series model, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and, in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problems caused by singularity, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates; that model selection criteria such as AIC, BIC, and MDL fail to hold in these models; that a smooth Bayesian prior becomes singular in such models; and that the trajectories of the dynamics of learning are strongly affected by the singularity, causing plateaus or slow
manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.

1 Introduction

The multilayer perceptron is an adaptive nonlinear system that receives input signals and transforms them into adequate output signals. Learning takes place by modifying the connection weights and the thresholds of neurons. Since perceptrons are specified by a set of these parameters, we may regard the whole set of perceptrons as a high-dimensional space or manifold whose coordinate system is given by these modifiable parameters. We call this a neuromanifold.

Let us assume that the behavior of a perceptron is disturbed by noise, so that on receiving input signal x, a perceptron emits output y stochastically. This stochastic behavior is determined by the parameters. Let us also assume that a pair (x, y) of input signal x and corresponding answer y is given from the outside by a teacher. A number of examples are generated by an unknown probability distribution, p₀(x, y), or the conditional probability p₀(y|x), of the teacher. Let us denote the set of examples as (x₁, y₁), (x₂, y₂), · · · , (xₙ, yₙ). A perceptron learns from these examples to imitate the stochastic behavior of the teacher.

The behavior of a perceptron under noise is described by a conditional probability distribution p(y|x), which is the probability of output y given input x, so it can be regarded as a statistical model that includes a number of unknown parameters. From the statistical point of view, estimation of the parameters is carried out from examples generated by the unknown probability distribution of the teacher network. Learning, especially online learning, is a type of estimation in which the parameters are modified sequentially, using the examples one by one. The parameters change by learning, forming a trajectory in the neuromanifold. Therefore, we need to study the geometrical features of the neuromanifold to elucidate the behavior of learning.

The neuromanifold of multilayer perceptrons is a special statistical model because it includes singular points, where the Fisher information matrix degenerates. This is due to the symmetry of hidden units, and the number of hidden units substantially decreases when two hidden neurons are identical. The identifiability of parameters is lost at such singular positions. Such a structure was described in the pioneering work of Brockett (1976) in the case of linear systems, and in the case of multilayer perceptrons by Chen, Lu, and Hecht-Nielsen (1993), Sussmann (1992), Kůrková and Kainen (1994), and Rüger and Ossen (1997). This type of structure is ubiquitous in many hierarchical models such as the model of probability densities given by gaussian mixtures, the ARMA time-series model, and the model of linear systems whose transfer functions are given by rational functions. The
Riemannian metric degenerates at such singular points, which are not isolated but form a continuum. In all of these models, when we summarize the parameters that give the same behavior (probability distribution), the set of behaviors is known to have a generic cone-type singularity embedded in a finite-dimensional, sometimes infinite-dimensional, regular manifold (Dacunha-Castelle & Gassiat, 1997).

Many interesting problems arise in such singular models. Since the Fisher information matrix degenerates at singular points, its inverse does not exist. Therefore, the Cramér-Rao paradigm of classic statistical theory cannot be applied. The maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, although the consistency of behavior holds. The criteria used for model selection (such as AIC, BIC, and MDL) are derived from the gaussianity of the maximum likelihood estimator, with the covariance given by the inverse of the Fisher information, so their validity is lost in such hierarchical models. The generalization error has so far been evaluated based on the Cramér-Rao paradigm, so we need a new theoretical method to attack the problem. This is related to the strange behavior of the log-likelihood ratio statistic in a singular model. These problems have not been fully explored by conventional statistics.

When the true distribution lies at a regular point, the classical Cramér-Rao paradigm is still valid, provided the point is sufficiently separated from the singularities. However, the dynamics of learning is a global problem that takes place throughout the entire neuromanifold. It has been shown that once the parameters are attracted to singular points, they are very slow to move away from them. This is the plateau phenomenon ubiquitously observed in backpropagation learning. Therefore, even when the true point is a regular point, the singular structure strongly affects the dynamics of learning.

Although there is no unified theory, the problems caused by the singular structure in hierarchical models have been remarked on by many researchers, and various new approaches have been proposed. This is a new area of research that is attracting much attention. Hagiwara, Toda, and Usui (1993) noticed this problem first. They used AIC (the Akaike information criterion) to determine the size of perceptrons to be used for learning and found that AIC did not work well. AIC is a criterion for selecting the model that minimizes the generalization error. However, it has long been reported that AIC does not give good model selection performance in the case of multilayer perceptrons. Hagiwara et al. found that this is because of the singular structure of such a hierarchical model and investigated ways to overcome this difficulty (Hagiwara, 2002a, 2002b; Hagiwara, Hayasaka, Toda, Usui, & Kuno, 2001; Hagiwara et al., 1993; Kitahara, Hayasaka, Toda, & Usui, 2000).

To accelerate the dynamics of learning, Amari (1998) proposed the natural or Riemannian gradient method of learning, which takes into account the geometrical structure of the neuromanifold. Through this method, one can avoid the plateau phenomenon in learning (Amari, 1998; Amari &
Ozeki, 2001; Amari, Park, & Fukumizu, 2000; Park, Amari, & Fukumizu, 2000). However, the Fisher information matrix, or the Riemannian metric, degenerates at some places because of singularity, so we need to develop a new theory of the dynamics of learning (see Amari, 1967, for the dynamics of learning in regular multilayer perceptrons) to understand its behavior, in particular the effects of singularity on the ordinary backpropagation method and the natural gradient method. Work done regarding this aspect includes that of Fukumizu and Amari (2000) as well as the statistical-mechanical approaches taken by Saad and Solla (1995), Rattray, Saad, and Amari (1998), and Rattray and Saad (1999).

Fukumizu used a simple linear model and showed that the generalization error of multilayer perceptrons with singularities differs from that of the regular statistical model (Fukumizu, 1999). This problem is related to the analysis of the log-likelihood-ratio statistic of the perceptron model at a singularity (Fukumizu, 2003). The strange behavior of the log-likelihood statistic in the gaussian mixture has been remarked on since the time of Hotelling (1939) and Weyl (1939), but only recently has it become possible to derive its asymptotic behavior (Hartigan, 1985; Liu & Shao, 2003). Fukumizu (2003) extended the idea of the gaussian random field (Hartigan, 1985; Dacunha-Castelle & Gassiat, 1997) to make it applicable to multilayer perceptrons and formulated the asymptotics of the generalization error of multilayer perceptrons.

Watanabe (2001a, 2001b, 2001c) was the first to study the effect of singularity in Bayesian inference. He and his colleagues introduced algebraic geometry and algebraic analysis, using Hironaka's theorem of singularity resolution and Sato's formula in algebraic analysis, to evaluate the asymptotic performance of the Bayesian predictive distribution in various hierarchical singular models; remarkable results have been derived (Watanabe, 2001a, 2001b, 2001c; Yamazaki & Watanabe, 2002, 2003; Watanabe & Amari, 2003).

In this article, we give an overview of the strange behaviors of singular models studied so far. They include estimation, testing, Bayesian inference, model selection, and generalization and training errors. Special attention is paid to the dynamics of learning from the information-geometric point of view by summarizing our previous results (Amari & Ozeki, 2001; Amari, Park, & Ozeki, 2001, 2002; Amari, Ozeki, & Park, 2003). In particular, we show new results concerning the fast and slow submanifolds in the learning of gaussian mixtures.

The article is organized as follows. We give various examples of singular models in section 2. They include models of simple multilayer perceptrons with one hidden unit and two hidden units, the gaussian mixture model of probability distributions, and a toy model of the cone that is used to give an exact analysis. The analysis of the gaussian mixture is newly presented here. In section 3, we explain the theory developed by Dacunha-Castelle and Gassiat (1997), which shows the generic cone structure of a singular model. This elucidates why strange behaviors emerge in a singular model.
Singularities Affect Dynamics of Learning in Neuromanifolds
1011
We then show that such models with singularity have strange behaviors, differing from those of ordinary regular statistical models, in parameter estimation, Bayesian predictive distribution, the dynamics of learning, and model selection in section 4. Section 5 is devoted to a detailed analysis of the dynamics of learning. We use simple models to show that there appear slow and fast manifolds in the neighborhood of a singularity, which explains the plateau phenomenon in the ordinary gradient learning method. The natural gradient method is shown to resolve this difficulty. Section 6 deals with the generalization error and training error in simple singular models, where the gaussian random field (Hartigan, 1985; Dacunha-Castelle & Gassiat, 1997; Fukumizu, 2003) is used and the special potential functions are introduced (Amari & Ozeki, 2001). Explicit formulas will be given in the cases of the maximum likelihood estimator and the Bayesian estimator. 2 Singular Statistical Models and Their Geometrical Structure The manifolds, or the parameter spaces, of many hierarchical models such as multilayer perceptrons inevitably include singularities. Typical examples are presented in this section. 2.1 Single Hidden Unit Perceptron. We begin with a trivial model of a single hidden unit perceptron, which receives input vector x = (x1 , · · · , xm ) and emits scalar output y. The hidden unit calculates the weighted sum w · x = wi xi of the input, where w = (w1 , · · · , wm ) is the weight vector, and emits its nonlinear function ϕ(w · x) as the output, where ϕ(u) is the activation function ϕ(u) = tanh(u). The output unit is linear, so its output is vϕ(w · x), which is eventually disturbed by gaussian noise . Hence, the final output is y = vϕ(w · x) + .
(2.1)
The parameters to specify a perceptron are summarized into a single vector θ = (w, v). The average of y, given x, is E[y] = f (x, θ ) = vϕ(w · x),
(2.2)
where E denotes expectation, given x. Any point θ in the (m + 1)-dimensional parameter space M = {θ } specifies a perceptron and its average output function f (x, θ ). However, the parameters are redundant or unidentifiable in some cases. Since ϕ is an odd function, (w, v) and (−w, −v) give the same function, f (x, θ ) = f (x, −θ ).
(2.3)
1012
S.-I. Amari, H. Park, and T. Ozeki
Figure 1: (Left) Parameter space M of a single hidden unit perceptron. (Right) ˜ of the average output functions of perceptrons with a singular strucSpace M ture.
Moreover, when v = 0 (or w = 0), f (x, θ ) = 0, whatever value w takes (v takes). Hence, the function f (x, θ ) is the same in the set C ⊂ M, C = {(v, w)| vw = 0}
(2.4)
(see Figure 1, left), which we call the critical set. Therefore, if we summarize the points that have the same average output function f (x, θ ) into one, the parameter space shrinks such that the critical set C is reduced to a single ˜ which consists of all the differpoint. The parameter space reduces to M, ent average output functions of perceptrons. It consists of two components ˜ has connected by a single point (see Figure 1, right). In other words, M singularity. Note that M is the parameter space, each point of which corresponds to a perceptron. However, the behaviors of some perceptrons are ˜ is the set the same even if their parameters are different. The reduced M of perceptron behaviors that correspond to the average output functions or the probability distributions specified thereby. The probability density function of the input-output pair (x, y) is given by 1 1 p(y, x, θ ) = √ q (x) exp − (y − f (x, θ ))2 , 2 2π
(2.5)
where in equation 2.1 is subject to the standard gaussian distribution N(0, 1) and q (x) is the probability density function of input x. The Fisher information matrix is defined by G(θ ) = E
∂l(y, x, θ ) ∂l(y, x, θ )T ∂θ ∂θ
,
(2.6)
Singularities Affect Dynamics of Learning in Neuromanifolds
1013
where E denotes the expectation with respect to p(y, x, θ ) and l(y, x, θ ) = log p(y, x, θ ) = − 12 (y − f (x, θ ))2 + log q (x), is a fundamental quantity in statistics. It is positive definite in a regular statistical model and plays the role of the Riemannian metric of the parameter space, as is shown by the information geometry (Amari & Nagaoka, 2000). The Fisher information gives the average amount of information included in one pair (y, x) of data, which is used to estimate the parameter θ . Cram´er-Rao Theorem. Let θˆ be an unbiased estimator from n examples in a regular statistical model. Then the error covariance of θˆ satisfies 1 E[(θˆ − θ )(θˆ − θ )T ] ≥ G −1 (θ ). n
(2.7)
Moreover, the equality holds asymptotically (i.e., for large n) for the MLE (maximum likelihood estimator) θˆ , and it is asymptotically subject to the gaussian distribution with mean θ and covariance matrix (1/n)G −1 (θ )}. However, this does not hold when v = 0 or w = 0, because ∂l(y, x, θ ) =0 ∂v
(2.8)
∂l(y, x, θ ) = 0. ∂w
(2.9)
or
When v = 0, p(y, x, θ ) is kept constant even when w changes, and the same situation holds when w = 0. Hence, the Fisher information matrix, equation 2.6, degenerates, and its inverse G −1 (θ ) does not exist in the set C : vw = 0. The Cram´er-Rao theorem is no longer valid at the critical set C. The model is singular on C. This makes it difficult to analyze the performance of estimation and learning when the true distribution is in C or in the neighborhood of C. 2.2 Gaussian Mixture. The Gaussian mixture is a statistical model of probability distributions that has long been known to include singularities (Hotelling, 1939; Weyl, 1939). Let us assume that the probability density of the real variable x is given by a mixture of k gaussian distributions as
p(x, θ ) =
k i=1
vi ψ(x − µi ),
(2.10)
1014
S.-I. Amari, H. Park, and T. Ozeki
√ 2 where ψ(x) = (1/ 2π) exp{−x /2}. The parameters are θ = (v1 , · · · , vk ; µ1 , · · · , µk ) and 0 ≤ vi ≤ 1, vi = 1. When we know the number k of components, all we have to do is to estimate the unknown parameters θ from observed data x1 , x2 , · · · , xn . However, in the usual situation, k is unknown. To make the story simple, let us consider the case of k = 2. The model is given by p(x, θ ) = vψ (x − µ1 ) + (1 − v)ψ (x − µ2 ) .
(2.11)
The parameter space M is three-dimensional, θ = (v, µ1 , µ2 ). If µ1 = µ2 holds, p(x, θ ) actually consists of one component, as is the case with k = 1. In this case, the distribution is the same whatever value v takes, so we cannot identify v. Moreover if v = 0 or v = 1, the distribution is the same whatever value µ1 or µ2 takes, so the parameters are unidentifiable. Hence, in the parameter space M = {θ }, some parameters are unidentifiable in the region C: C : v(1 − v) (µ1 − µ2 ) = 0.
(2.12)
This is depicted in the shaded area in Figure 2. We call this the critical set. In the critical set, the determinant of the Fisher information matrix becomes 0, and there is no inverse matrix. Let us look at the critical set C ⊂ M carefully. In C, any point on the three lines—µ1 = µ2 = µ0 , v = 0, µ2 = µ0 , and v = 1, µ1 = µ0 —(see Figure 2 left), represents the same gaussian distribution, ψ(x − µ0 ). If we regard the parameter points representing the same distribution as one and the same, these three lines shrink to one point.
˜ Figure 2: Parameter space of gaussian mixture M and singular structure M.
Singularities Affect Dynamics of Learning in Neuromanifolds
1015
˜ = M/ ≈ through Mathematically speaking, we obtain the residue class M ˜ depicted in Figure the equivalence relation ≈. Then we get the space M 2 (right), which is the set of probability distributions (not the set M of parameters). This is the space where singularities are accumulated on a line C˜ and the dimensionality is reduced on the line. The line corresponds to the critical set C. To analyze the nature of singularity, let us introduce new variables w and u : w is the center of gravity of the two peaks, or the mean value of the distribution, and u is the difference between the locations of the two peaks, w = vµ1 + (1 − v)µ2
(2.13)
u = µ2 − µ1 .
(2.14)
Estimation of the mean parameter w of the distribution is easy, because the Fisher information matrix is nondegenerate in this direction. The problem is estimation of u and v when uv(1 − v) is small, because the critical region C is given by uv(1 − v). To make discussions simpler, consider the case where w = 0 is known, and only u and v are unknown. The distribution is then written as p(x, u, v) = vψ x − (1 − v)u + (1 − v)ψ(x + vu).
(2.15)
Let us consider only the region where u ≈ 0 and v(1 − v) > c holds for some constant c. By Taylor expansion of the above equation around u = 0, we get 1 1 p(x, u, v) ≈ ψ(x) 1 + c 2 (v)H2 (x)u2 + c 3 (v)H3 (x)u3 2 6
1 1 4 5 + c 4 (v)H4 (x)u + c 5 (v)H5 (x)u + · · · , 24 120
where c i (v) = v(1 − v)i + (1 − v)(−v)i , meaning that c 2 (v) = v(1 − v), c 3 (v) = v(1 − v)(1 − 2v), c 4 (v) = v(1 − v)(1 − 3v + 3v 2 ), and
(2.16)
1016
S.-I. Amari, H. Park, and T. Ozeki
H2 (x) = x 2 − 1, H3 (x) = x 3 − 3x, H4 (x) = x 4 − 6x 2 + 3 are the Hermite polynomials. We can embed this singular model locally in a regular model S as its singular subset. This is our new strategy of studying the singular structure by locally embedding it in an exponential family of distributions whose structure is well known and regular statistical analysis is possible. Let us consider a regular exponential family S specified by the regular parameters θ = (θ1 , θ2 , θ3 ), p(x, θ ) = ψ(x) exp θ1 H2 (x) + θ2 H3 (x) + θ3 H4 (x) ,
(2.17)
where ψ(x) is a dominating measure. Let us denote by S the parameter space of θ . Now we calculate l(x, u, v) = log p(x, u, v), by taking the logarithm of equation 2.16 and performing Taylor expansion again,
l(x, u, v) = log ψ + +
u2 u3 c 2 (v)H2 (x) + c 3 (v)H3 (x) 2 6
u4 {c 4 (v)H4 (x) − 3c 2 (v)2 H2 (x)2 }. 24
(2.18)
Hence, the parameter space M of gaussian mixtures is approximately embedded in S by 1 θ1 = c 2 (v)u2 , 2 1 θ2 = c 3 (v)u3 , 6 1 θ3 = {c 4 (v) − 3c 2 (v)2 }u4 . 24
(2.19) (2.20) (2.21)
This embedding from (u, v) to θ is singular. We first consider how M is embedded in the two-dimensional space (θ1 , θ2 ) by equations 2.19 and 2.20 in Figure 3a, where θ3 is ignored. The shape near the singularity u ≈ 0 becomes clear by using (u, v)-coordinates of M and their map in S. Let us consider a line in M where v is constant. This line v = const is
Singularities Affect Dynamics of Learning in Neuromanifolds
1017
(a)
(b) Figure 3: Gaussian mixture distribution M embedded in S. (a) M embedded in the parameter space S, (θ1 , θ2 ). (b) The picture where M cannot be embedded in S and sticks out into the higher dimension.
embedded in S as θ1 = a 1 u2 θ2 = a 2 u3 , where a 1 and a 2 are constants depending on v. This curve is not smooth but
1018
S.-I. Amari, H. Park, and T. Ozeki
is a cusp in S. The transformation from (u, v) to (θ1 , θ2 ) is singular at u = 0, and the Jacobian determinant, ∂θ1 ∂u |J | = det ∂θ 1 ∂v
, ∂θ2 ∂v
∂θ2 ∂u
(2.22)
vanishes at u = 0. To elucidate the dynamics of learning near the singularity u = 0, θ3 is necessary, as we will show later. Note that the above approximation, which uses Taylor expansion of u, is not applicable in the case where u is not small, but v is close to 0 or 1. Equation 2.16 is valid only in the case of small u. Taylor expansion is not suitable when v(1 − v) becomes small and u is large. Another consideration in terms of a random gaussian field is necessary, because an infinite-dimensional regular space is required to include the singular M. We have drawn the picture of M embedded in the three dimensions of S = {θ = (θ1 , θ2 , θ3 )} by calculating the higher-order term of u4 in Figure 3b. If u becomes large or v approaches 0 or 1, the surface of the embedded M includes higher terms ˜ sticks out and that cannot be represented in equation 2.17 and embedded M is wrapped in higher dimensions. As v(1 − v) approaches 0, the points in ˜ shrink toward the origin as u → 0 but expand into infinite dimensions. M This situation arises in the non-Donsker case (Dacunha-Castelle & Gassiat, 1997). Our primary interest is to know the influence of the singularity on the dynamics of learning. Because the natural gradient learning method takes the geometrical structure into account, it will not be greatly affected by the singularity. However, the effect of the singularity is not negligible when the true distribution is at, or near, the singularity C. In gradient descent learning algorithms, the critical set C works as a pseudo-attractor (the Milnor attractor) even when the true distribution is at a regular point far from it. The singular point resembles a black hole in the dynamics of learning, and it is necessary to investigate its influence on the dynamics, as we discuss in a later section. The Fisher information matrix degenerates in the critical set. We calculate it explicitly in a later section. 2.3 Multilayer Perceptron. We consider here the multilayer perceptron with hidden units (Rosenblatt, 1961; Amari, 1967; Minsky & Papert, 1969; Rumelhart, Hinton, & Williams, 1986). It is a singular model where the output y is written as
y=
k i=1
vi ϕ (wi · x) + ε.
(2.23)
Singularities Affect Dynamics of Learning in Neuromanifolds
1019
Here, x is an input vector, ϕ(wi · x) is the output of the ith neuron of the hidden layer, and wi is its weight vector. The neuron of the output layer summarizes all the outputs of the hidden layer by taking the weighted sum with weights vi . Gaussian noise ε ∼ N(0, 1) is added at the end, so the output y is a random variable. Let us summarize all the modifiable parameters in one vector, θ = (w1 , · · · , w k ; v1 , · · · , vk ), and then the average output is given by
E[y] = f (x, θ ) =
k
vi ϕ (wi · x) .
(2.24)
i=1
This model has a structure similar to the gaussian mixture distribution. When the parameter wi of the ith neuron is 0, this neuron is useless because ϕ(0) = 0 and vi may take any value. Moreover, when vi = 0, whatever value wi takes, this term is 0. In the meantime, if wi = w j (or wi = −w j ), these two neurons emit the same output (or the negative output). Then vi ϕ(wi · x) + v j ϕ(w j · x) is the same, not depending on particular values of vi and v j , provided vi + v j (or vi − v j ) takes a fixed value. Therefore, we can identify their sum (or difference), but each of vi and v j remains unidentifiable. The neuromanifold M of perceptrons is a space whose admissible coordinate system is given by θ . The critical set C is defined by the join of the two subsets, vi wi = 0,
wi = ±w j ,
(2.25)
in which the parameters are unidentifiable. In other words, there is a direction such that the behavior (the input-output relation) is the same even when the parameters change in this direction. Since the Fisher information is 0 along this direction, the Fisher information matrix degenerates, and its inverse diverges to infinity. In statistical study, the Cram´er-Rao theorem guarantees that the error of estimation from a large number of data is given by the inverse of the Fisher information matrix in the regular case. However, because the inverse of the Fisher information matrix diverges, the classical theory is not applicable here. In geometrical terms, the Riemannian metric, which is determined by the Fisher information matrix, degenerates. Therefore, the distance becomes 0 along a certain direction. This is what the singular structure gives rise to. Remark.
We have absorbed the bias term in the weighted sum as
wi · x =
wi xi + wi0 x0 = w ˜ i · x˜ + wi0 ,
(2.26)
1020
S.-I. Amari, H. Park, and T. Ozeki
where x0 = 1. Because x0 is constant, this causes another nonidentifiability. When w˜ i = w˜ j = 0,
(2.27)
if the other parameters satisfy vi ϕ(wi0 ) + v j ϕ(v j0 ) = const.,
(2.28)
they give the same behavior. In this article, we ignore such a case for simplicity. There are no other types of nonidentifiability (see Sussmann, 1992; ˚ Kurkov´ a & Kainen, 1994). We regard two perceptrons as being equivalent when their behaviors are the same even if their parameters are different. Let us shrink the manifold M by reducing the equivalent points of M to one point, giving the reduced ˜ In the mathematical sense, the reduced M ˜ corresponds to the residue M. class of M because of the equivalence. In this case, degeneration of the dimensionality occurs in the critical set of the neuromanifold. We showed this in the trivial case of one hidden neuron. We now show this through another simple example. Let us consider a perceptron with two hidden units with w = 1, w = (cos θ, sin θ ), which is represented by f (x, v, θ ) = vϕ(x1 cos θ + x2 sin θ + b),
(2.29)
where b is a fixed bias term. In this case, the parameter space is twodimensional with coordinates θ = (v, θ ). Consider the space S consisting of functions of the form f (x, θ ) = θ3 ϕ(θ1 x1 + θ2 x2 + b). Then equation 2.29 is ˜ in S, embedded in S by θ1 = cos θ, θ2 = sin θ , and θ3 = v. The embedded M ˜ = {θ| θ1 = cos θ, θ2 = sin θ, θ3 = v} , M is a cone depicted in Figure 4, and the apex is the singular point. 2.4 Cone Model. Amari and Ozeki (2001) analyzed a toy model, the cone model, to examine the exact behavior of the estimation error and the dynamics of learning. We introduce this model here for later use. Let us consider the statistical model described by a random variable x ∈ Rd+2 , which is subject to a gaussian distribution with mean µ and covariance matrix I , where I is the identity matrix, 1 1 p(x; µ) = √ exp − ||x − µ||2 . 2 ( 2π)d+2
(2.30)
Singularities Affect Dynamics of Learning in Neuromanifolds
1021
Figure 4: Singular structure of the neuromanifold.
The parameter space S = {µ} is a (d + 2)-dimensional Euclidean space with a coordinate system µ. When the mean parameter µ is restricted on the surface of a cone in S, the family of the gaussian distributions is called the cone model M. We first consider the case with d = 1, that is, d + 2 = 3, in which the cone is given by µ1 = ξ, µ2 = ξ cos θ, µ3 = ξ sin θ
(2.31)
(see Figure 5), where (ξ, θ ) are the parameters used to specify M. In the general case, the cone is parameterized by (ξ, ω), ξ
M:µ= √ 1 + c2
1 cω
= ξ a(ω),
(2.32)
where c is a constant that specifies the shape of the cone, ω is a vector on the d-dimensional unit sphere that specifies the directions of the cone, and ξ is the distance from the origin. In the case of d = 1, the sphere is a circle (see Figure 5), and the parameter ω is replaced by θ . The model M, which
1022
S.-I. Amari, H. Park, and T. Ozeki
Figure 5: Cone Model.
is given by the parameters (ξ, ω), is embedded in the d + 2–dimensional space S, and consists of two cones, one for ξ ≥ 0 and the other for ξ ≤ 0, of d + 1 dimensions, R × Sd , which are connected at the apex ξ = 0. The apex ξ = 0 is the singularity. Amari et al. analyzed the behavior of the maximum likelihood estimator in the case where the true parameters are on the singularity (Amari et al., 2001, 2002, 2003). In this case, the true distribution p0 (x) is given by x ∼ N(0, Id+2 ). The simple multilayer perceptron with one hidden unit (Amari et al., 2001, 2002, 2003) has a similar cone structure. Through a transformation of parameters—β = w, ξ = vβ, and ω = wβ ∈ Sd−1 —we get y = ξ ϕβ (ω · x) + ε,
(2.33)
where we put ϕβ (u) = β1 ϕ(βu). This is a cone in the space of (ξ, ω). In previous articles, we have analyzed the behavior of learning when the true parameter is on the singularity (ξ = 0). In this article, we analyze the behavior of learning when the true parameter is not necessarily at the singularity and show that the singularity strongly affects the dynamics. 2.5 Other Models. There are many other statistical models with singular structures. Hierarchical models include such singular structures in many cases. The space of the ARMA time-series model and that of linear rational systems are good examples (Amari, 1987; Brockett, 1976), but little is known about the effects of singularity. The estimation of points of change,
Singularities Affect Dynamics of Learning in Neuromanifolds
1023
which is called the Nile River problem, is also a well-known example of singularity. Let us consider another model, the model of population coding with multiple stimuli in a neural field (Amari, 1977), on which neurons are lined up continuously along a one-dimensional positional axis z. When a stimulus from the outside is applied at a specific place corresponding to z = µ, the neurons located around z = µ are excited. Excitation of the neural field—that is, the firing rate of a neuron at z—is written in this case as r (z) = vψ(z − µ) + ε(z),
(2.34)
where v is the strength of the stimulus, µ is its center, and ε(z) is a noise term dependent on z. Function ψ is unimodal and is called the tuning function. This model has been applied in population coding where the problem is to estimate µ from the neural response r (z). Its statistical analysis has been given in terms of the Fisher information in many reports (for example, Wu, Nakahara, & Amari, 2001; Wu, Amari, & Nakahara, 2002). When two stimuli are simultaneously given at locations µ1 and µ2 with intensities v and 1 − v, the response of the field is written as r (z; v, µ1 , µ2 ) = vψ (z − µ1 ) + (1 − v)ψ (z − µ2 ) + ε(z).
(2.35)
In this case, the same singular structure as that of the gaussian mixture appears in the parameter space θ = (v, µ1 , µ2 ). The strange behavior of the maximum likelihood estimator is analyzed in Amari and Burnashev (2003) and Amari and Nakahara (2005). 3 Locally Conic Structure and Gaussian Random Field The local structure of singular statistical model is studied by DacunhaCastelle and Gassiat (1997) in a unified manner. The local structure of a regular statistical model is represented by the tangent space of the manifold of the statistical model, where the first-order asymptotic theory is well formulated. The concepts of affine connections and related e- and m-curvature are necessary for the higher-order asymptotic theory, as is shown by information geometry (Amari & Nagaoka, 2000). A singular statistical model does not have the tangent space at singularity, and instead the tangent cone is useful for analyzing its local structure. We summarize the results of Dacunha-Castelle and Gassiat (1997) in this section without mathematical rigor but intuitively. The locally conic structure and the related random gaussian field play a fundamental role in analyzing the behaviors of the likelihood ratio statistics (Hartigan, 1985; Fukumizu, 2003) and also of the MLE and its generalization ability (Amari
1024
S.-I. Amari, H. Park, and T. Ozeki
& Ozeki, 2001). Another general framework for singular models is given from algebraic geometry, which we do not summarize here. (See Watanabe, 2000a, 2000b, and 2000c, and related papers.) 3.1 Locally Conic Structure. All the examples in section 2 have a locally conic structure. For a singular statistical model M = { p(x, θ ), θ ∈ Rk } in which the critical set C exists, Dacunha-Castelle and Gassiat (1997) introduced the following parameters in the neighborhood of C. Let us denote by p0 (x) a probability density in C where the identifiability is lost. Given p(x, θ ), let ξ be the Hellinger distance between p(x, θ ) and C, ξ = inf
p0 (x)∈C
( p0 (x) − p(x, θ ))2 d x.
(3.1)
It is further assumed that p(x, θ ) can be parameterized in a neighborhood of p0 (x) ∈ C by (ξ, ω), where ω ∈ is (k − 1)-dimensional. We thus have the new parameterization p(x, ξ, ω), where lim p(x, ξ, ω) = p(x, 0, ω) = p0 (x).
ξ →0
(3.2)
The critical set C is given by ξ = 0, where p(x, 0, ω) represents p0 (x) ∈ C so that ω is not identifiable. The score function with respect to ξ is the directional derivative of log likelihood and is denoted by v(x, ω) =
d log p(x, 0, ω) dξ
(3.3)
at ξ = 0. It depends on ω ∈ . Since ξ is the Hellinger distance, we have, E[{v(x, ω)}2 ] = 1,
(3.4)
that is, the Fisher information in the ξ -direction is normalized to 1 at any ω. Now consider l(x, ξ, ω) = log p(x, ξ, ω)
(3.5)
in the function space of random variable x. The model M is embedded in it, where the points in C are reduced to equivalent points, so that the ˜ Let us fix dimension reduction takes place. Its image is the reduced set M. a point p0 (x) in C. In its neighborhood, the Taylor expansion gives l(x, ξ, ω) = log p0 (x) + ξ v(x, ω),
(3.6)
Singularities Affect Dynamics of Learning in Neuromanifolds
1025
so that when ξ is very small, the image of M forms a cone in the space of functions of x, whose apex is log p0 (x) and whose edges are spanned by v(x, ω), ω ∈ . This is called the tangent cone and is different from the tangent space of a regular statistical model. 3.2 MLE and Gaussian Random Field. Given n examples D = {x1 , · · · , xn } from a singular model M, the log likelihood is written as L(D, ξ, ω) =
n
log p(xi , ξ, ω).
(3.7)
i=1
The MLE is the maximizer of L, but it is difficult to calculate because the derivatives of L with respect to ω are 0 at ξ = 0 in some directions, ∂ L(D, 0, ω) ∂ 2 L(D, 0, ω) = = · · · = 0. ∂ω ∂ω∂ω
(3.8)
We now fix ω and calculate the maximizer ξˆ (D, ω) of L. By expansion with respect to ξ , we have L(D, ξ, ω) = L(D, 0, ω) +
∂L 1 ∂2 L 2 ξ+ ξ + ···. ∂ξ 2 ∂ξ 2
(3.9)
Since ξˆ is small when the true distribution is p0 (x), that is, ξ = 0, the maximizer is given from −
∂2 L ∂L ξˆ = . ∂ξ 2 ∂ξ
(3.10)
The term of the second derivative, 1 ∂ 2 log p(xi , ξ, ω) 1 ∂2 L =− , 2 n ∂ξ n ∂ξ 2 n
−
(3.11)
i=1
converges to the Fisher information in the direction ω, because of the law of large numbers. The second term of the first derivative is 1 ∂l(xi , 0, ω) 1 1 ∂L = √ = √ Yn (ω) = √ v(xi , ω), ∂ξ n ∂ξ n i=1 n i=1 n
n
(3.12)
which converges to the gaussian random variable Y(ω) in law because of the central limit theorem. For ω = ω , Y(ω) and Y(ω ) are correlated in general.
1026
S.-I. Amari, H. Park, and T. Ozeki
A family of gaussian distributions {Y(ω), ω ∈ } forms a random gaussian field over . The maximizer ξˆ is given by 1 ξˆ (D, ω) = √ Yn (ω). n
(3.13)
3.3 MLE. Let us substitute ξˆ in equation 3.7, obtaining the partially maximized likelihood, ˆ L(D, ω) = L(D, ξˆ (D, ω), ω) 1 = log p0 (xi ) + Yn (ω)2 . 2 Hence, the MLE is given by the maximizer of the random field, ωˆ = argmax Yn (ω)2 .
(3.14)
It is difficult to calculate this and study the properties of the MLE in general. 3.4 Likelihood Ratio Statistics. The log likelihood ratio statistics, λ=2
log
p(xi , θˆ ) , p(xi , θ 0 )
(3.15)
is used for testing the null hypothesis H0 : θ = θ 0 against the alternative H1 : θ = θ 0 , where θˆ is the mle. The statistic λ is asymptotically subject to the χ 2 distribution with k degrees of freedom in a regular model, and hence E[λ] = k
(3.16)
asymptotically. However, this does not hold in a singular model. The log likelihood ratio statistics λ in a singular model is λ = 2 sup ξ,ω
log
p(xi , ξ, ω) p0 (xi )
(3.17)
= 2 sup{L(D, ξˆ , ω) − L(D, 0, ω)}
(3.18)
= 2 sup Yn (ω)2 .
(3.19)
ω ω
Hence, it is given by the supremum of the gaussian random field. Hartigan (1985) suggested that λ ∼ log log n in the gaussian mixture model by extracting m = log n almost independent Y(ω1 ), · · · , Y(ωm ).
Singularities Affect Dynamics of Learning in Neuromanifolds
1027
Fukumizu (2003) followed the idea to evaluate λ in the case of multilayer perceptrons and derived λ ∼ log n. 4 Singularity Causes Strange Behaviors in Estimation and Learning In the framework of singular statistical models, we give a glimpse of strange behaviors of estimation, testing, model selection, and online learning. We study three cases. In the first case, the true distribution, or the distribution that best approximates the true distribution, is exactly at the singularity. In this case, the parameters are not identifiable and the model is redundant (a smaller model suffices), but we can estimate its behavior (or the equivalent class) consistently. The gaussian random field plays a key role. In the second case, the true distribution is near the singularity. In the last case, the true distribution is at a regular point. In the last case, the classical theory can be applied locally. However, when studying the dynamics of learning, we need to take the influence of the singularity into account. The trajectories of learning cover the entire space, so it is a global problem in the entire space. The log likelihood ratio test and MLE are known to be asymptotically optimal in the regular model. The likelihood principle is the belief that statistical inference should be based on likelihood. In singular models, however, this is not always true, and their optimality is not guaranteed (Amari & Burnashev, 2003). The behavior of Bayesian estimation and estimation with a regularization term also shows a different aspect from the regular case. There are many interesting problems to be studied, such as learning and its dynamics. 4.1 Statistical Testing in the Neighborhood of Critical Set. Statistical testing is a general method to judge from data whether the true distribution lies at the singularity. In the case of gaussian mixtures, we judge whether k = 1 or k = 2 through a statistical test. We take equation 2.12 as the null hypothesis and perform testing against the alternative that this equation is not true. In a general regular case, the log likelihood ratio statistic λ obeys the χ 2 distribution with the degrees of freedom equal to the number of parameters when the number of data is large enough. However, when the model is singular, the log likelihood ratio statistic may not be subject to the χ 2 distribution and may diverge to infinity in proportion to the number n of observations. This was shown in the classical works of Weyl (1939) and Hotelling (1939). Only recently was a precise asymptotic evaluation of the log likelihood ratio statistic given (Fukumizu, 2003; Hartigan, 1985; Liu & Shao, 2003) for some singular models. It is unfortunate that such tangled problems have usually been excluded as pathological cases and have not been well studied. Such cases are not pathological; they are ubiquitous in hierarchical engineering models. Let us consider the statistical test H0 : θ = θ 0 against H1 : θ = θ 0 . When the true point θ 0 is a regular point—that is, it is not in the critical set—the
1028
S.-I. Amari, H. Park, and T. Ozeki
MLE is asymptotically subject to a gaussian distribution with mean θ 0 and a variance-covariance matrix G −1 (θ 0 ) /n, where G(θ ) is the Fisher information matrix. In such a case, the log likelihood ratio statistic is expanded in the Taylor series, giving T λ = n θˆ − θ 0 G −1 (θ 0 ) θˆ − θ 0 .
(4.1)
Hence, this is subject to the χ 2 distribution of the degrees of freedom equal to the number k of parameters. Its expectation is E[λ] = k.
(4.2)
However, when the true distribution θ 0 lies on the critical set, the situation changes. The Fisher information matrix degenerates, and G −1 diverges, so the expansion is no longer valid. The expectation of the log likelihood estimator is asymptotically written as E[λ] = c(n)k,
(4.3)
where the term c(n) takes various forms depending on the nature of singularities. By evaluating the property of the gaussian random field Y(ω), Fukumizu (2003) showed that c(n) = log n
(4.4)
in the case of multilayer perceptrons under a certain condition. In the case of the gaussian mixture, c(n) =
log log n
(4.5)
holds (Hartigan, 1985; Liu & Shao, 2003). 4.2 Estimation, Training Error, and Generalization Error. When the true parameter is at the singularity (or close to it), the MLE is no longer subject to the gaussian distribution, even asymptotically. This causes strange behaviors of training and generalization errors. The standard theory (Amari & Murata, 1993; Murata, Yoshizawa, & Amari, 1994) does not hold. This will be discussed in more detail in a later section. 4.3 Bayesian Estimator. The Bayesian estimator is used in many cases where an adequate prior distribution π(θ ) is assumed. When the prior distribution penalizes complex models, it plays a role equivalent to the regularization term. When a set of independently and identically (i.i.d) data
Singularities Affect Dynamics of Learning in Neuromanifolds
1029
D = {x1 , · · · , xn } generated by p(x; θ 0 ) is given, the posterior distribution of the parameters is written as p(θ |D) =
π(θ ) p (D|θ ) , p(D)
(4.6)
p(D|θ )π(θ )dθ
(4.7)
where p(D) =
is the distribution of data D. The maximum a posterior (MAP) estimator is given by the parameter θˆ that maximizes the posterior distribution. Also, the Bayesian predictive distribution is used as the distribution of a new data x based on D. It is given by averaging the distribution p(x|θ ) over the posterior distribution p(θ |D), p(x|D) =
p(x, θ ) p(θ |D)dθ .
(4.8)
It is empirically known that the Bayesian predictive distribution or MAP behaves well in the case of large-scale neural networks. In such a case, one uses a smooth prior π(θ ) > 0 on the neuromanifold. Obviously, if π(θ 0 ) = ∞ at a specific point, the MAP estimator is attracted to that specific point. This is not fair. When π(θ ) > 0 is smooth, its influence decreases as n approaches ∞, and it approaches the MLE, which is regarded as the MAP under the uniform prior. ˜ of However, a smooth prior on M is singular in the equivalence class M the neuromanifold, because a singular point in this class includes infinitely many equivalent parameters of M. Hence, the prior density is infinitely large on the singular points compared with that at regular points. This implies that the Bayesian smooth prior is in favor of singular points (perceptrons with a smaller number of hidden units) with an infinitely large factor. Hence, the Bayesian method works well in such a case to avoid overfitting. One may use a very large perceptron with a smooth Bayesian prior, and an adequate smaller model will be selected, although no theory exists that explains how to choose the prior. The Bayesian estimator of singular models was studied by Watanabe (2001a, 2001b) and Yamazaki and Watanabe (2002, 2003) by using algebraic geometry, in particular, Hironaka’s theory of singularity resolution and Sato’s formula in the theory of algebraic analysis. 4.4 Model Selection. To obtain an adequate model, one should select a model from many alternatives based on the data. In the case of hierarchical models, one should determine the preferred model size, that is, the number
1030
S.-I. Amari, H. Park, and T. Ozeki
of hidden units. This is the problem of model selection. AIC, BIC, and MDL have been widely used as model selection criteria. NIC is a version of AIC applicable to the general cost function of a neural network (Murata et al., 1994). AIC (Akaike, 1974) is a criterion to minimize the generalization error. The model that minimizes AIC = 2 × training error +
2k n
(4.9)
is selected according to this criterion. This is derived from asymptotic statistical analysis, where the MLE estimator θˆ is assumed to be asymptotically subject to the gaussian distribution with the covariance matrix equal to the inverse of the Fisher information matrix divided by n. MDL (Rissanen, 1986) is a criterion to minimize the length of encoding for the observed data by using a family of parametric models. It is given asymptotically by the minimizer of MDL = training error +
log n k. 2n
(4.10)
The Bayesian BIC (Schwarz, 1978) gives the same criterion as MDL. These criteria are derived also through the same assumption regarding the Gaussianity of the MLE. In the case of multilayer perceptrons, the neuromanifold with a smaller number of hidden units is included in that with a larger number, but the smaller one forms a critical set within the larger neuromanifold. Therefore, the MLE (or any other efficient estimator) is no longer subject to the gaussian distribution, even asymptotically, provided the true distribution belongs to a smaller model. Model selection is required when the estimator is close to the critical set, but the validity of AIC and MDL fails to hold. Akaho and Kappen (2000) noted this in the gaussian mixture model. One should evaluate the log likelihood ratio statistic more carefully in such a case (Amari, 2003). The situation is the same in other hierarchical models with singularity. There have been reported many comparisons of AIC and MDL by computer simulations. Sometimes AIC works better, while MDL does better in other cases. Such confusing reports seem to be the result of the difference between regular and singular models and also the difference in the nature of singularities. 4.5 Dynamics of Learning. Let us consider online learning of multilayer perceptrons through the gradient descent method. Let us define the error by the square of the difference between the network output and the teacher’s signal. When the noise term is gaussian, the square of the error is
Singularities Affect Dynamics of Learning in Neuromanifolds
1031
equal to the negative of the log likelihood. The minimization of the error is then equivalent to the maximization of the likelihood, and the result of learning locally converges to the maximum likelihood estimator. The stochastic gradient descent method was proposed by Amari (1967) and was named the backpropagation method (Rumelhart et al., 1986), while the natural gradient method (Amari, 1998; Amari et al., 2000; Park et al., 2000) takes the Riemannian structure of the space into account. For an input-output example (x, y), the loss function or the negative log likelihood is given by l(x, y; θ ) =
2 1 y − f (x, θ ) . 2
(4.11)
Its expectation is given by averaging it with respect to the true distribution p0 (x, y), L(θ) = E p0 [l(x, y; θ )] .
(4.12)
The backpropagation and natural gradient learning algorithm (Amari, 1998) are written, respectively, as θ t+1 = θ t − η∇l (x t , yt ; θ t )
(4.13)
θ t+1 = θ t − ηG −1 (θ t )∇l(x t , yt ; θ t ),
(4.14)
when example (x t , yt ) is given at time t = 1, 2, · · ·. Here, η is a learning constant, ∇ is the gradient ∂/∂θ, and G is the Fisher information matrix. It is generally difficult to calculate G(θ ), because the distribution q (x) of inputs is unknown. Moreover, its inversion is costly. The adaptive natural gradient method estimates G −1 (θ t ) adaptively from data (Amari et al., 2000; Park et al., 2000). It has been shown that the natural gradient method is locally equivalent to the Newton method, giving a Fisher efficient estimator, while the backpropagation is not Fisher efficient. The natural gradient method is capable of near-optimal performances (Rattray et al., 1998; Rattray & Saad, 1999). In the present formulation, the natural gradient is equivalent to the adaptive version of the Gauss-Newton method, but it is different and more powerful in other cost functions (Park et al., 2000). −1 The adaptive update of G −1 t = G (θ t ) is calculated online by −1 −1 −1 ˆ ˆ T G −1 t+1 = (1 + τ )G t − τ G t ∇l(x t , yt , θ t )[G t ∇l(x t , yt , θ t )] .
(4.15)
The learning constant τ should not be large in order to guarantee the stability of the estimation of G −1 , but should not be too small to guarantee that the estimator G t = G(θˆ t ) traces the change of θˆ t well. Inoue, Park and
1032
S.-I. Amari, H. Park, and T. Ozeki
Okada (2003) show that the ratio η/τ should be kept within an adequate range. Since examples are generated stochastically, the dynamics of learning, equations 4.13 and 4.14, are represented by the stochastic difference equations. However, when η is small, stochastic fluctuation is averaged out. Hence, we investigate the behavior of the averaged learning equation where ∇l is replaced by its expectation, ∇ L(θ ) = E[∇l(x, y; θ )].
(4.16)
If continuous time is used, these become differential equations: dθ = −η∇ L(θ ), dt dθ = −ηG −1 (θ )∇ L(θ ). dt
(4.17)
The solution draws a trajectory in the neuromanifold. The problem is how the trajectory is influenced by the singular structure. Kang et al. used a three-layer perceptron with binary weights and found that the parameters are attracted to the critical set that forms a singularity and are very slow to move away from it (Kang, Oh, Kwon, & Park, 1993). Saad and Solla (1995) and Riegler and Biehl (1995) analyzed the dynamics in a more general case and showed that such a phenomenon is universal. They argued that the slowness in backpropagation learning, or the plateau phenomenon, is caused by this singularity. The natural gradient learning method takes the geometrical structure into account. It enables the influence of the singular structure to be reduced, and the trajectory is not trapped in the plateaus. Rattray et al. analyzed the dynamics of natural gradient learning by means of statistical physics and showed that it is almost ideal (Rattray & Saad, 1999; Rattray et al., 1998). To examine the dynamics of learning in more detail, let us consider perceptrons consisting of two hidden units. The parameter space is M = {θ }, θ = (w 1 , w 2 , v1 , v2 ), and let us consider the subset Q(w, v) specified by w and v, Q(w, v) = {w 1 = w2 = w, v1 + v2 = v} ,
(4.18)
which is included in the critical set C. The behavior of each perceptron in Q is the same and corresponds to that of a perceptron having only one hidden unit, where the weight vector is w and the output weight is v, and the behavior is y = vϕ(w · x) + . Let the true parameters be θ 0 = {w 1 , w 2 , v1 , v2 }, where w1 = w2 , so two different hidden units are used.
Singularities Affect Dynamics of Learning in Neuromanifolds
1033
Let θ¯ = (w, ¯ v¯ ) be the perceptron with only one hidden unit that best approximates the input-output function f (x, θ 0 ) of the true perceptron. Then all the perceptrons of two hidden units on the line Q(w, ¯ v¯ ), w1 = w2 = w, ¯
v1 + v2 = v¯
(4.19)
correspond to the best approximation. Let us transform the two weights as w=
1 (w 1 + w2 ) , 2
u=
1 (w1 − w2 ) . 2
(4.20)
The derivative of L(θ ) along the line Q is then 0 because all the perceptrons are equivalent along the line. The derivatives in the direction of changing w ¯ and v¯ are also zero, because they are the best approximators. The derivative in the direction of u is again 0, because the perceptron having u is equivalent to that having −u, which is derived by interchanging the two hidden units. Hence, the line Q forms critical points of the cost function. This implies that it is very difficult to get rid of it once the parameters are attracted to Q (w, ¯ v¯ ). Fukumizu and Amari (2000) calculated the Hessian of L on the line. When it is positive definite, the line is attractive. When it includes negative eigenvalues, the state eventually escapes in these directions. They showed that in some cases, part of the line can be truly attractive, although it is not a usual asymptotically stable equilibrium but has directions of escape (even though the derivative is 0) in other parts. This is not a usual saddle point and belongs to the special type called the Milner attractor. In such a case, the perceptron is truly attracted to the line and stays inside the line Q(w, ¯ v¯ ), fluctuating around it because of random noise, until it finds a place from which it can escape. This explains the plateau phenomenon. The problem of plateau cannot be resolved by simply increasing η, because even when the state goes outside Q because of a large η, it may again return to it. To show why the natural gradient method works well, we need to evaluate its behavior in the neighborhood of the critical points. We can then prove that the natural gradient has the effect of strong repulsion to escape from the neighborhood of the critical set, and the plateau phenomenon will disappear. Computer simulations confirm this observation. Inoue et al. (2003) investigated the trajectory of learning by using the committee machine perceptron with two hidden units. They observed the following behavior of the trajectory. It approaches the singularity and stays near the singularity for a while before escaping from it. More precisely, they observed that w1 and w2 first came close to each other, both approaching w, ¯ which is the optimal in C, and then they moved away in different directions. Inoue et al. (2003) also studied the effectiveness of the adaptive natural gradient method, and showed the importance of controlling the two learning constants η and τ .
1034
S.-I. Amari, H. Park, and T. Ozeki
Figure 6: Learning trajectory near the singularities.
What is the trajectory of learning when the true parameters are on the singularity? Park, Inoue, and Okada (2003) investigated this problem by using three-layer perceptrons with two hidden units. Once the trajectory reaches C in Figure 6, all points are equivalent and suboptimal. Where does the trajectory enter C? To answer this question, we need to examine the dynamics near C. Because the component of the flow entering C is extremely slow near C, the flow component parallel to C is relatively strong. Analysis of the dynamics makes it clear that the trajectory does not stop at any point in the line where w 1 = w2 = w and v1 + v2 = v is constant, v1 = 0, v2 = 0, and that it approaches the point where either v1 or v2 is zero. This is an interesting observation. What is the trajectory of learning in the natural gradient method? Since the metric degenerates in C, its inverse diverges to infinity. However, ∇ L = 0 even when the true distribution is outside C. If we consider G −1 ∇ L, it becomes a multiplication of 0 and ∞. However, if we evaluate G −1 ∇ L near C, we can see that the infinitely strong repulsive force works in the direction of escape from C (Fukumizu & Amari, 2000). That is, the force going out from the plateau is strong, and the trajectory moves away without being attracted in the natural gradient. 5 Dynamics of Learning: Slow and Fast Manifolds In this section, we show in detail the effect of singularities in the dynamics of learning for three simple models: the one-dimensional cone, the simple
Singularities Affect Dynamics of Learning in Neuromanifolds
1035
MLP, and the gaussian mixture. Note that the structure of the gaussian mixture is very similar to that of the multilayer perceptron. We calculate the average trajectories of the standard and natural gradient learning methods to show how the trajectories approach the optimal point. We show that a slow manifold emerges around the critical set to which the state is quickly attracted by fast dynamics, and then the state escapes toward the optimal point slowly in the slow manifold. This is a universal feature of the plateau phenomenon. 5.1 Cone Model. Here, we investigate the dynamics of learning in the cone model introduced in section 2. The parameter space M is twodimensional with coordinates (ξ, θ ). The cost function is defined as the negative log likelihood l(x; ξ, θ ) =
1 (x1 − ξ )2 + (x2 − cξ cos θ )2 + (x3 − cξ sin θ )2 . 2
(5.1)
For the average learning dynamics under the standard gradient learning method, we can easily obtain
ξ˙ (t) ˙ θ(t)
= −ηt
x¯ 1 + c(x¯ 2 cos θ + x¯ 3 sin θ ) − (1 + c 2 )ξ −cξ (x¯ 2 sin θ − x¯ 3 cos θ )
SGD
,
(5.2)
where x¯ i = E[xi ]. Similarly, the average dynamics of natural gradient learning can also be obtained as
ξ˙ (t) ˙ θ(t)
= −ηt
1 1+c 2
NGD
x¯ 1 + c(x¯ 2 cos θ + x¯ 3 sin θ ) − (1 + c 2 )ξ − cξ1 (x¯ 2 sin θ − x¯ 3 cos θ )
, (5.3)
where we calculate the Fisher information matrix as G(ξ, θ ) =
1 + c2 0 . 0 c2ξ 2
(5.4)
To consider the effect of singularity (i.e., ξ = 0) on the dynamics of learning, we define two submanifolds satisfying ξ˙ = 0 and θ˙ = 0, respectively. These are Mξ = {(θ, ξ ) : x¯ 1 + c(x¯ 2 cos θ + x¯ 3 sin θ ) − (1 + c 2 )ξ = 0}
(5.5)
1036
S.-I. Amari, H. Park, and T. Ozeki
Figure 7: Learning trajectories in the cone model. (Left) Standard gradient (the dotted line is the slow manifold). (Right) Natural gradient.
and Mθ = {(θ, ξ ) : x¯ 2 sin θ − x¯ 3 cos θ = 0}.
(5.6)
The intersection of Mξ and Mθ is the equilibrium of the dynamics. From the standard gradient learning equation, we see that ξ˙ is of order O(1), whereas θ˙ is of order O(ξ ). Therefore, in the neighborhood of the singularity where ξ is small, the speed of change in ξ is much faster than that of θ . Therefore, the state is attracted toward Mξ (the dashed line in Figure 7 (left)) by the fast dynamics. Then the state moves along the line ∂l/∂ξ = 0 or Mξ , which is the slow manifold. The dynamics becomes especially slow when ξ is small (slow dynamics). This explains the plateau phenomenon in leaning curves. On the other hand, with the natural gradient learning equation 5.3, one can see that ξ˙ is of order 1 and θ˙ is of order ξ −1 , so no slow manifolds appear. Moreover, the update term around the singularity is large, so that a strong repulsive force acts from the singularity. This explains why the plateau disappears in the natural gradient. In computer simulations, we set c = 1. For the true parameters, we took ξ ∗ = 1 and θ ∗ = 0, so we had (x¯ 1 , x¯ 2 , x¯ 3 ) = (1, 1, 0). For the standard gradient and the natural gradient, we traced a number of trajectories with different initial values of ξ and θ . The trajectories in the parameter space are shown in Figure 7 using polar coordinates. Note that the center of the polar coordinates, (0, 0), is the singular point corresponding to the apex of the cone. In Figure 7 (left) for the standard gradient, we can see that the trajectories were attracted to the singular point or the slow manifold and then finally
Singularities Affect Dynamics of Learning in Neuromanifolds
1037
Figure 8: Time evolution of expected loss in the cone model. (Solid line: standard gradient, dashed line: natural gradient).
were attracted to the optimal point. In contrast, for the natural gradient, Figure 7 (right), such attraction and retardation did not appear. Figure 8 shows the time evolution of the expected loss for the initial condition (ξ, θ ) = (1.5, 5π/6). We can see a clear plateau in the standard gradient learning curve, whereas there was no plateau in the natural gradient learning curve. 5.2 Simple MLP. We also investigated the dynamics of learning of the simple MLP defined by equation 2.29, which is also discussed in section 2. The loss function, which is the squared error or the negative log likelihood, is given by l(y, x; ξ, θ ) =
2 1 y − ξ ϕ(cos θ x1 + sin θ x2 + b) . 2
(5.7)
The mathematical analysis is similar, and the fast and slow manifolds are obtained from θ˙ = 0 and ξ˙ = 0, respectively. For our computer simulations, we set b = 0.5. For the true parameters, we took ξ ∗ = 1 and θ ∗ = 0. As for the cone model, we traced a number of
1038
S.-I. Amari, H. Park, and T. Ozeki
Figure 9: Learning trajectories in the simple MLP model. (Left) Standard gradient. (Right) Natural gradient.
trajectories with different initial values of ξ and θ . The trajectories in the parameter space are shown in Figure 9 using polar coordinates. Note that for this MLP model, the center of the polar coordinates is the singular point, which corresponds to the shrink point of Figure 4. In Figure 9 (left) for the standard gradient, we can see phenomena similar to those for the cone model. That is, the trajectory was attracted to the plateau near the singular point and then slowly reached the optimal point. For the natural gradient, Figure 7 (right), we did not see that kind of attraction and retardation. Figure 10 shows the time evolutions of the expected loss for the initial condition (ξ, θ ) = (1.5, 5π/6). We can see a clear plateau in the curve of standard gradient learning, but none in that of natural gradient learning. 5.3 Gaussian Mixture. Next, we consider a more realistic model, the gaussian mixture. This is an original study in this article. To investigate the dynamics of learning of the gaussian mixture model, we begin with the Taylor expansion, equation 2.16, where v(1 − v) > c should be kept in mind. In this model, the singularity exists at u = 0, so the Taylor expansion for small values of u is useful. The cost function, the negative of the log likelihood, is further expanded as 1 1 l(x; u, v) = − log ψ(x) + c 2 (v)H2 (x)u2 + c 3 (v)H3 (x)u3 2 6 1 1 2 4 2 4 5 + c 4 (v)H4 (x)u − c 2 (x)H2 (x)u + O(u ) . 24 8
(5.8)
Singularities Affect Dynamics of Learning in Neuromanifolds
1039
Figure 10: Time evolutions of expected loss in simple MLP. (Solid line: standard gradient, dashed line: natural gradient).
Let (u∗ , v ∗ ) be the true parameter from which the learning data are generated. We can calculate the average learning dynamics around small u. Lemma. When u is small, the gradient of l evaluated at the true parameter (u∗ , v ∗ ), is given by E ∗ [∂ul] = −c 2 (v){u∗ 2 c 2 (v ∗ ) − u2 c 2 (v)}u 1 − c 3 (v){u∗ 3 c 3 (v ∗ ) − u3 c 3 (v))}u2 + O(u3 ) 2 1 E ∗ [∂v l] = − c 2 (v){u∗ 2 c 2 (v ∗ ) − u2 c 2 (v)}u2 2 1 − c 3 (v){u∗ 3 c 3 (v ∗ ) − u3 c 3 (v))}u3 + O(u4 ), 6
(5.9)
(5.10)
where E ∗ denotes the expectation with respect to p(x, u∗ , v ∗ ) and c i (v) = dc i (v)/dv. The proof is given in the appendix.
1040
S.-I. Amari, H. Park, and T. Ozeki
The averaged equation for the standard gradient is given by
u(t) ˙
= −η
v(t) ˙
E ∗ [∂ul] E ∗ [∂v l]
SGD
.
(5.11)
We put f 1 (u, v) = u∗ 2 c 2 (v ∗ ) − u2 c 2 (v), f 2 (u, v) = u∗ 3 c 3 (v ∗ ) − u3 c 3 (v). The averaged learning equations are then given by
u(t) ˙
v(t) ˙
= −η SGD
+
c 2 (v) f 1 (u, v)u + 12 c 3 (v) f 2 (u, v)u2 1 c (v) f 1 (u, v)u2 2 2
O(u3 ) O(u4 )
+ 16 c 3 (v) f 2 (u, v)u3
.
(5.12)
We now consider the trajectories of learning in two cases: u∗ = 0 (singularity) and u∗ = 0 (regular). Case I: u∗ = 0. By putting u∗ = 0 in equation 5.12 and ignoring higher-order terms, we have u˙ = −ηc 2 (v)2 u3 , v˙ = −η
c 2 (v)c 2 (v) 4 u. 2
(5.13)
From this, the trajectory of dynamics is given by dv v˙ 1 − 2v = = u. du u˙ 2v(1 − v)
(5.14)
The equation can be integrated to give u2 =
1 1 (1 − 2v)2 − log(1 − 2v) + c, 4 2
(5.15)
where c is constant. When u is small, v˙ is of O(u4 ) and is much smaller than u, ˙ which is of order u3 . Hence, dv/du ≈ 0, and the trajectories are almost parallel to the u-axis, as shown in Figure 11. In other words, u converges to 0 without significantly changing v.
Singularities Affect Dynamics of Learning in Neuromanifolds
1041
Figure 11: Trajectories of learning in the gaussian mixture model.
Case II: u∗ = 0. We also evaluate the dynamics when u is small. The equation is u˙ = ηc 2 (v)c 2 (v ∗ )u∗ 2 u η v˙ = c 2 (v)c 2 (v ∗ )u∗ 2 u2 2
(5.16) (5.17)
and u c 2 (v) dv = du 2 c 2 (v)
(5.18)
irrespective of (u∗ , v ∗ ). Incidentally, the equation is the same as that of equation 5.14; hence, the trajectories are the same (see Figure 11), but the directions are opposite, and the state is escaping from u = 0 toward u∗ in this case; that is, the directions in Figure 11 are reversed. The equilibrium is given by the intersection of the two manifolds, MF : f 2 (u, v) = 0, MS : f 1 (u, v) = 0.
1042
S.-I. Amari, H. Park, and T. Ozeki
0.85
0.85 v
v
OPTIMUM
0.75
0.75 OPTIMUM
0.65
0.65 0.5
0.7 u
0.9
0.5
0.7 u
0.9
Figure 12: Learning trajectories in the simple gaussian mixture model. (Left) Standard gradient. (Right) Natural gradient.
When u is small, the first term of f 1 (u, v) dominates, and the state is quickly attracted to MS . Then it moves in MS slowly to the intersection of MS and MF . Computer simulations confirm this observation (see Figure 12 left). We next studied the natural gradient method. Using an approximation, equation 5.8, we can also obtain an explicit form of the Fisher information matrix. This is given by
3 2 2 2 4 1 3 5 c (v)u + (v)u c (v)u + c (v)(2c (v) + 1)u 2c 3 2 2 2 3 2 2 G(u, v) = . 1 3 5 1 1 2 4 6 c 3 (v)u + c 3 (v)(2c 2 (v) + 1)u (c (v)) u + ( 6 − c 2 (v))u 2 2 2 (5.19) For the natural gradient method, the dynamics of learning is
u(t) ˙ v(t) ˙
NGD
η = 3 u
c˜1 f 2 (u, v)u + c˜2 f 1 (u, v)u2 c˜3 f 2 (u, v) + c˜4 f 1 (u, v)u
,
(5.20)
where c˜i are functions of v. The equilibrium is again the intersection of MS and MF , but the roles of MS and MF are reversed. Repulsion is strong when u is small, and no plateau appears. For our computer simulations, we set the true parameters as u∗ = 0.75 and v ∗ = 0.7. Since the analytic expression of the average dynamics given in equations 5.12 and 5.20 are an approximation around a small u, we cannot apply this for the whole trajectory. Therefore, we use the Monte Carlo
Singularities Affect Dynamics of Learning in Neuromanifolds
1043
Figure 13: Time evolutions of the expected loss in the simple gaussian mixture model. Solid line: standard gradient, dashed line: natural gradient.
method to get the expectation of ∂l/∂u and ∂l/∂v. At each learning step, we generate 106 samples according to the true input distribution and take the sample means. We traced a number of trajectories with different initial values for u and v. The trajectories in the parameter space are shown in Figure 12. In Figure 12 (left), we can see the line (slow manifold) on which the parameters first converge. Therefore, the state proceeds by learning toward the line satisfying u∗ 2 c 2 (v ∗ ) = u2 c 2 (v), which is shown as a dashed line in Figure 12 (left). However, for the natural gradient, Figure 12 (right), the update terms of u˙ and v˙ have terms of order O(u−2 ) and O(u−3 ), respectively, which lead to much faster dynamics around the singularity. Figure 13 shows the time evolutions of the expected loss for the initial condition (u, v) = (0.9, 0.6). We can see the slow convergence in the standard gradient learning curve, whereas this sort of retardation is not apparent in natural gradient learning.
1044
S.-I. Amari, H. Park, and T. Ozeki
6 Generalization and Training Errors When the True Distribution Is at a Singular Point It is important for model selection to evaluate the generalization error relative to the training error. Since AIC and MDL have been derived from the asymptotic gaussianity of the estimator, they cannot be applied to the singular case. In particular, for hierarchical models such as multilayer perceptrons and gaussian mixtures, smaller models are embedded in the larger models as critical sets. Therefore, we need to find new model selection criteria for the singular case. In the regular case, the MLE and Bayes predictive estimators give asymptotically the same estimation performance. However, these are not guaranteed in the singular case. As a preliminary study, we analyzed the asymptotic behavior of the MLE and Bayes predictive distribution by using simple toy models when the true distribution lay at a singular point, that is, in a smaller model. The behavior of an estimator is evaluated by the relation between the expected generalization error and the expected training error. Let D = {x 1 , . . . , x n } be observed data, or D = {(x 1 , y1 ), . . . , (x n , yn )} in the case of perceptrons. When we have an estimated probability density function pˆ (x; D), the generalization error can be defined by the Kullback-Leibler divergence from the true probability density po to the estimated density function, po (x) K L[ po (x) : pˆ (x; D)] = E po log . pˆ (x; D)
(6.1)
In the case of perceptrons, the estimated density function is the conditional probability density pˆ (y|x; D), and a similar formulation follows. For the evaluation, we take an expectation of the generalization error with respect to the data, and we call it the expected generalization error, po (x) , E gen = E D E po log pˆ (x; D)
(6.2)
where E D denotes expectation with respect to the observed data D. Similarly, the training error of the estimated density function pˆ (x; D) is defined by the sample average, 1 po (x i ) log , n pˆ (x i ; D) n
i=1
(6.3)
Singularities Affect Dynamics of Learning in Neuromanifolds
1045
which is the expectation of log( p0 / pˆ ) with respect to the empirical distribution of data D, pemp (x) =
1 δ(x − x i ). n
(6.4)
The expected training error is defined as E train = E D
n 1 po (x i ) log . n pˆ (x i ; D)
(6.5)
i=1
For the MLE, the estimated density function is given by p(x, θˆ ) where ˆθ is the MLE. One can also see the relation between the expected training error and the likelihood ratio statistics λ in equation 3.17, E train = −
1 E D [λ]. 2n
For the Bayes estimation, the estimation density function is given by the Bayes predictive distribution of the form pˆ Bayes (x|D) =
p(x, θ ) p(θ |D)dθ .
(6.6)
6.1 Maximum Likelihood Estimator 6.1.1 Cone Model. For the cone model defined in equation 2.30, the log likelihood of data D = {x i }i=1,...,n is written as 1 ||x i − ξ a(ω)||2 . 2 n
L(D, ξ, ω) = −
(6.7)
i=1
The MLE is the one that maximizes L(D, ξ, ω). However, ∂ k L/∂ωk = 0 at ξ = 0 for any k, so we cannot analyze the behaviors of the MLE by Taylor expansion at ξ = 0. Therefore, we first fix ω and search for the ξ that maximizes L. The maximum ξˆ is given by 1 ξˆ (ω) = argmaxξ L(D, ξ, ω) = √ Yn (ω), n
(6.8)
where 1 x˜ = √ xi . n i=1 n
Yn (ω) = a(ω) · x˜ ,
(6.9)
1046
S.-I. Amari, H. Park, and T. Ozeki
Its limit Y(ω) is a zero-mean gaussian random field with covariance a(ω) · a(ω ), because x˜ ∼ N(0, I ) when the true distribution is at ξ = 0. The MLE ωˆ is given by the maximizer of ωˆ = argmaxω Yn2 (ω).
(6.10)
Using the MLE, we obtain the expected generalization and training errors in the following theorem. Theorem 1. MLE satisfies
For the cone model, when the true distribution is at ξ = 0, the
1 po (x) = E D maxω Y2 (ω) , ˆ 2n p(x|ξ , ω) ˆ n 1 po (x i ) 1 E train = E D log = − E D maxω Y2 (ω) . ˆ n 2n p(x i |ξ , ω) ˆ E gen = E D E po log
(6.11) (6.12)
i=1
A more detailed derivation is appendix. In addition, we can given in the obtain the explicit value of E D maxω Y2 (ω) within the limit of large d. Corollary 1.
E gen ≈
When d is large, the MLE satisfies
1 + 2c
E train ≈ −
√ 2 π
d + c2 (d + 1)
≈
2n(1 + c 2 ) √ 1 + 2c π2 d + c 2 (d + 1) 2n(1 + c 2 )
c2d , 2n(1 + c 2 ) ≈−
c2d . 2n(1 + c 2 )
(6.13)
(6.14)
The proof is also given in the appendix. Among the results is an interesting one concerning the antisymmetry between E gen and E train ; that is, E gen = −E train , which is proved in the regular case (Amari & Murata, 1993). Note also that the generalization and training errors depend on the shape parameter c as well as the dimension number d. In the regular case, they depend only on d. As one can easily see, when c is small, the cone looks like a needle, and its behavior resembles a one-dimensional model. When c is large, the cone resembles two (d + 1)-dimensional hypersurfaces, so its behavior is like a (d + 1)–dimensional regular model. Such observations are confirmed by equations 6.13 and 6.14.
Singularities Affect Dynamics of Learning in Neuromanifolds
1047
6.1.2 Simple MLP with One Hidden Unit. For the simple MLP defined in equation 2.33, we can also apply the same approach. The log likelihood of data set D = {(x i , yi )}i=1,...,n is written as
L(D; ξ, ω) = −
n 2 1 yi − ξ ϕβ (ω · x i ) . 2
(6.15)
i=1
Let us define two random variables depending on D and ω, 1 yi ϕβ (ω · x i ) , Yn (ω) = √ n i=1 n
1 2 ϕβ (ω · x i ). n
(6.16)
n
An (ω) =
(6.17)
i=1
Note that An (ω) converges to A(ω) = E x [ϕβ2 (ω · x)] as n goes to infinity, but is not normalized to 1 in the present case. Y(ω) defines a gaussian random field on , with mean 0 and covariance A(ω, ω ) = E x [ϕβ (ω · x)ϕβ (ω · x)]. One should be careful that An (ω, β) of equation 6.17 approaches 0 as β → 0. This belongs to the non-Donsker class, and our theory does not hold in such a case. Using the MLE, we get the following theorem. Theorem 2. For the simple MLP model, when the teacher perceptron is ξ = 0, the MLE satisfies Y2 (ω) 1 po (y|x) = E D supω , 2n An (ω) p(y|x, ξˆ , ω) ˆ n 1 Y2 (ω) po (y|x i ) 1 log . = − E D supω E train = E D n 2n An (ω) p(y|x i , ξˆ , ω) ˆ i=1 E gen = E D E po ,q log
(6.18) (6.19)
The details of this derivation are given in the appendix. From the results, we can see a nice correspondence between the cone model and MLP. However, note that there is no sufficient statistic in the MLP case, while all the data are summarized in the sufficient statistic x˜ in the cone model. In addition, due to the nonlinearity of the hidden unit, we cannot easily determine the explicit relation through which the training and generalization errors depend on the dimension number of the parameters, which we have for the cone model in corollary 1.
1048
S.-I. Amari, H. Park, and T. Ozeki
6.2 Bayes Predictive Distribution 6.2.1 Cone Model. Different from the regular case, the asymptotic behavior of the Bayesian predictive distribution depends on the prior. Let us define the prior as π(ξ, ω). The probability density of the observed sample is then given by Zn = p(D) =
π(ξ, ω)
n
p(x n |ξ, ω)dξ dω.
(6.20)
i=1
When new data x n+1 are given, we can similarly obtain the joint probability density p(x n+1 , D) as Zn+1 = p(x n+1 , D) =
π(ξ, ω)
n+1
p(x i |ξ, ω)dξ dω.
(6.21)
i=1
From the Bayes theorem, we can easily see that the Bayes predictive distribution is given by pˆ Bayes (x|D) =
Zn+1 , Zn
(6.22)
where x = x n+1 . When we assume a specific prior for the parameter ξ and ω, we can calculate Zn explicitly. When π(ξ ) = 1 and ω is uniform on , we can obtain the Bayes predictive distribution and the generalization error explicitly, as in the following theorems. Theorem 3. Under the uniform prior on ξ , the Bayes predictive distribution of the cone model is given by 1 1 pˆ BAYES (x|D) = √ e xp − x2 2 ( 2π)d+2
∇∇ SdU 1 1 tr H (x) , × 1 + √ ∇logSdU ( x˜ ) · x + 2 2n n SdU
(6.23)
where H2 (x) = x x T − I and
SdU ( x˜ ) =
e xp
1 Yn (ω)2 dω, 2
1 a (ω) · x i = a(ω) · x˜ . Yn = √ n i=1 n
Singularities Affect Dynamics of Learning in Neuromanifolds
1049
Theorem 4. Under the uniform prior on ξ , the generalization and training errors of the Bayes predictive distribution of the cone model are given by 2 1 po (x) 1 E gen = E D E po log = E D ∇logSdU ( x˜ ) = , (6.24) p(x|D) 2n 2n n 1 1 po (x i ) E D log = E gen − E D ∇logSdU ( x˜ ) · x˜ . (6.25) E train = n i=1 p(x i |D) n The details of this derivation are given in the appendix. Remark. For any prior π(ξ, ω) that is positive and smooth, theorems 3 and 4 also hold asymptotically without any change. The Jeffreys’ prior is given by the square root of the determinant of the Fisher information matrix, π(ξ, ω) ∝
|G(ξ, ω)|.
(6.26)
In this case, π(ξ ) ∝ |ξ |d and ω are uniformly distributed on Sd . The Jeffreys prior is not smooth and is 0 at ξ = 0. This is completely different from the regular case. We can conduct a similar analysis and obtain the following theorems. Theorem 5. Under the Jeffreys’ prior, the Bayes predictive distribution of the cone model is given asymptotically by 1 1 pˆ BAYES (x|D) = √ e xp − x2 2 ( 2π)d+2
∇∇ SdJ 1 1 × 1 + √ ∇logSdJ ( x˜ ) · x + H (x) , tr 2 2n n SdJ
(6.27)
where # " Id (Yn (ω)) e xp 1 Yn (ω)2 dω, 2 " # ! |z + u|d e xp − 1 z2 dz. Id (u) = √1 2 2π
SdJ ( x˜ ) =
!
(6.28) (6.29)
Theorem 6. Under the Jeffreys’ prior, the generalization and training errors of the Bayes predictive distribution of the cone model are given asymptotically by E gen =
2 d +1 1 E D ∇logSdJ ( x˜ ) = 2n 2n
(6.30)
1050
S.-I. Amari, H. Park, and T. Ozeki
E train = E gen −
1 E D ∇logSdJ ( x˜ ) · x˜ . n
(6.31)
For the proof, the same derivation process as that of the uniform case can be applied, although the process is fairly complicated. These results are rather surprising. Under the uniform prior, the generalization error is constant and does not depend on d, which is the complexity of the model. Hence, no overfitting occurs whatever complex models we use. This is completely different from the regular case. However, this striking result arises from the uniform prior on ξ . The uniform prior puts a strong emphasis on the singularity because there are infinitely many equivalent points at ξ = 0, so the prior density is infinitely large if we consider the ˜ of the probability distributions or behaviors. Hence, one should be space M very careful in choosing a prior when the model includes singularities. In the case of Jeffreys’ prior, the generalization error increases in proportion to d, which is similar to the regular case. In addition, the antisymmetric duality between E gen and E train does not hold for both the uniform prior and Jeffreys’ prior. 6.2.2 Simple MLP with One Hidden Unit. For the simple MLP model, we conducted a similar analysis for the uniform prior and Jeffreys’ prior, and obtained the following theorems. Theorem 7. Under the uniform prior on ξ , the Bayes predictive distribution of the simple MLP model is given by ! 12 1 y ∇ QU d (Yn , ω)ϕβ (ω · x)dω pˆ BAYES (y|x, D) = √ e xp − 2 1+ √ y n PdU (Yn ) 2π !
∇∇ QU 1 1 d (Yn , ω)An (ω)dω + + O H2 (y) , (6.32) U 2n n2 Pd (Yn ) where 1 Yn (ω)2
, e xp 2 A(ω) A(ω) PdU (Yn ) = QU d (Yn , ω)dω,
QU d (Yn , ω) =
1
(6.33) (6.34)
1 1 yi ϕβ (ω · x i ) = √ yi ϕi , Yn (ω) = √ n i=1 n i=1 A(ω) = E x ϕβ2 (ω · x) . n
n
(6.35) (6.36)
Singularities Affect Dynamics of Learning in Neuromanifolds
1051
Theorem 8. Under the uniform prior on ξ , the generalization and training errors of the Bayes predictive distribution of the simple MLP model are given by po (y|x) E gen = E D E po q log p(y|x, D) ! U ∇ QU 1 d (Yn , ω)∇ Qd (Yn , ω )A(ωω )dωω = ED U 2 2n Pd (Yn ) =
1 2n n 1
(6.37)
po (yi |x i ) E D log E train = n i=1 p(yi |x i , D) ! ∇ QU 1 d (Yn , ω)Yn (ω)dω . = E gen − E D n PdU (Yn )
(6.38)
Theorem 9. Under Jeffreys’ prior on ξ , the Bayes predictive distribution of the simple MLP model is given by ! 1 1 y ∇ QdJ (Yn , ω)ϕβ (ω · x)dω pˆ BAYES (y|x, D) = √ e xp − y2 1 + √ 2 n PdJ (Yn ) 2π !
∇∇ QdJ (Yn , ω)An (ω)dω 1 1 + + O H2 (y) , (6.39) 2n n2 PdJ (Yn ) where QdJ (Yn , ω) = PdJ
(D) =
1
I d+1 d
A(ω)
1 Yn (ω)2 Yn (ω)
e xp 2 A(ω) A(ω)
(6.40)
QdJ (Yn , ω)dω.
1 Id (u) = √ 2π
(6.41)
1 |z + u|d e xp − z2 dz. 2
(6.42)
Theorem 10. Under Jeffreys’ prior on ξ , the generalization and training errors of the Bayes predictive distribution of the simple MLP model are given by 1 E gen = ED 2n =
d +1 2n
!
∇ QdJ (Yn , ω)∇ QdJ (Yn , ω )A(ωω )dωω 2 J Pd (D)
(6.43)
1052
S.-I. Amari, H. Park, and T. Ozeki
1 E train = E gen − E D n
!
∇ QdJ (Yn , ω)Yn (ω)dω . PdJ (D)
(6.44)
All of these results for the simple MLP correspond well with those for the cone model. For both the cone and MLP models, we can see that the generalization error is strongly dependent on the prior distribution of the parameters. This differs from the classic theory for the regular models. 7 Conclusion It has long been known that some kinds of statistical model violate ordinary conditions such as the existence of a regular Fisher information matrix. Unfortunately, in classical statistical theories, singular models of this type have been regarded as pathological and have received little attention. However, to understand the behavior of hierarchical models such as multilayer perceptrons, singularity problems cannot be ignored. The singularity is closely related to basic problems regarding these models—such as the slow dynamics of learning in plateaus and a strange relation between generalization and training errors—and also the criteria of model selection. It is premature to give a general theory of estimation, testing, Bayesian inference, and learning dynamics for singular models. In this article, we have summarized our recent results regarding these problems using simple toy models. Although the results are preliminary, we believe that we have succeeded in elucidating various aspects of the singularity and have found some interesting paths to follow in future studies. Appendix: Proofs of Theorems Proof of Theorem 1.
The log likelihood of D is given by 1 x i − ξ a(ω)2 . 2 n
L(D, ξ, ω) = −
(A.1)
i=1
From the definition of the generalization and training errors, we obtain 1 2 ˆ ˆ E gen = E D E po −ξ (ω)a( ˆ (a(ω) ˆ · a(ω)) ˆ ˆ ω) ˆ · x + ξ (ω) 2 1 2 ξˆ (ω) ˆ = ED 2 =
1 E D maxω Yn2 (ω) . 2n
(A.2)
Singularities Affect Dynamics of Learning in Neuromanifolds
1053
n 1 1 2 E train = E D ˆ (a(ω) ˆ · a(ω)) ˆ −ξˆ (ω)a( ˆ ω) ˆ · x i + ξˆ (ω) n 2 i=1
1 1 ˆ √ a(ω) ˆ ˆ · x˜ + ξˆ 2 (ω) = E D −ξˆ (ω) 2 n 1 ˆ = E D − ξˆ 2 (ω) 2 =−
1 E D maxω Yn2 (ω) . 2n
(A.3)
In order to get the final results, we need to calculate
Proof of Corollary 7. maxω Yn2 (ω). Let
1 (x˜ 1 + cω · x˜ ) a(ω) · x˜ = √ 1 + c2 where x˜ = (x˜ 2 , . . . , x˜ d+2 )T . Then 1 2 x˜ 1 + 2c x˜ 1 ω · x˜ + c 2 (ω · x˜ )2 2 1+c 1 2 x˜ 1 + 2c x˜ 1 Ae · ω + c 2 A2 (e · ω)2 , = 2 1+c
Yn2 (ω) = a(ω) · x˜ 2 =
where x˜ = x˜ e = Ae, e = 1. Then we obtain ωˆ = argmaxω Yn2 (ω) = sgn(x˜ 1 )e, and maxω Yn2 (ω) =
1 2 x˜ 1 + 2c|x˜ 1 |A + c 2 A2 . 2 1+c
From the fact that E D [A2 ] = d + 1, E D [A] = E
x˜ 22
+ ··· +
2 x˜ d+2
d!! = (d − 1)!!
where d!! = d(d − 2)(d − 4) · · ·, we finally get
$
2 π
(−1)d
≈
√ d,
(A.4)
1054
S.-I. Amari, H. Park, and T. Ozeki
E D maxω Yn2 (ω) =
1 1 + 2c E D [|x˜ 1 |]E D [A] + c 2 (d + 1) 2 1+c √ 1 + 2c π2 d + c 2 (d + 1) c2d ≈ . ≈ 2 (1 + c ) 1 + c2
(A.5)
The log likelihood of D is given by
Proof of Theorem 2.
1 {yi − ξ ϕβ (ω · x i )}2 2 n
L(D, ξ, ω) = −
i=1
=−
1 2 1 2 yi + ξ yi ϕβ (ω · x i ) + − ξ 2 ϕβ (ω · x i ). 2 2 n
n
n
i=1
i=1
i=1
(A.6)
By using 1 2 1 Yn2 (ω) yi + , 2 2 An (ω) n
L(ξˆ (ω), ω) = −
(A.7)
i=1
ωˆ = argmaxω
Yn2 (ω) , An (ω)
(A.8)
1 2 2 ˆ ˆ ξ ( ω ˆ · x) + ( ω)ϕ ˆ ( ω ˆ · x) E gen = E D E y,x −ξˆ (ω)yϕ β β 2 1 2 Y2 (ω) 1 ˆ ξ (ω)A ˆ n (ω) ˆ = = ED Ex E D supω , 2 2n An (ω)
(A.9)
n 1 1 2 2 ˆ ˆ −ξ (ω)y ˆ i ϕβ (ω · x i ) + ξ (ω)ϕ ˆ β (ωˆ · x i ) E train = E D n 2 i=1
1 1 ˆ n (ω) ˆ ˆ + ξˆ 2 (ω)A = E D −ξˆ (ω) ˆ √ Y(ω) 2 n Y2 (ω) 1 1 2 ˆ = E D − ξ (ω)A ˆ n (ω) ˆ = − E D supω . 2 2n An (ω) Let us define
Proof of Theorem 3. Zn = p(D) =
(A.10)
π(ξ, ω)
n i=1
p(x i |ξ, ω)dξ dω,
(A.11)
Singularities Affect Dynamics of Learning in Neuromanifolds
Zn+1 = p(x, D) =
π(ξ, ω)
n+1
p(x i |ξ, ω)dξ dω.
1055
(A.12)
i=1
Then the Bayesian predictive distribution can be written by
pˆ BAYES (x|D) =
Zn+1 . Zn
(A.13)
Under the uniform prior, we can easily get 1 n 1 x i − ξ 2 dξ dω exp − x i 2 + ξ a(ω) · Zn = √ 2 2 ( 2π)n(d+2) √ 1 2π 1 x i 2 Yn (ω)2 dω. exp (A.14) = √ √ exp − n(d+2) 2 2 ( 2π) n
From equations A.14 and A.13, we obtain
1 pˆ BAYES (x|D) = √ ( 2π)d+2
$
n x2 SdU ( x˜ n+1 ) , exp − n+1 2 SdU ( x˜ )
where 1
%
x+ x˜ n+1 = √ n+1
&
xi ,
SdU ( x˜ )
=
1 2 exp Yn (ω) dω. 2
Using the approximation of the form,
x˜ n+1 ≈ x˜ +
x 1 x˜ √ − n 2n
and from equation A.15, we obtain
= x˜ + δx,
(A.15)
1056
S.-I. Amari, H. Park, and T. Ozeki
1 1 1 pˆ BAYES (x|D) = √ exp − x2 1 − 2 2n ( 2π)d+2
∇ SU ( x˜ ) ∇∇ SdU ( x˜ ) 1 T × 1 + Ud · δx + tr δxδx 2 Sd ( x˜ ) SdU ( x˜ ) 1 1 1 ∇ SdU ( x˜ ) = √ exp − x2 1 + √ ·x 2 n SdU ( x˜ ) ( 2π)d+2
∇∇ SdU T ∇ SdU 1 tr − x x · x ˜ − 1 . + 2n SdU SdU
(A.16)
Using the fact that ∇ SdU ( x˜ ) · x˜ + SdU ( x˜ ) = tr{∇∇ SdU ( x˜ )}, we can obtain the final result. From equation A.16,
Proof of Theorem 4.
∇∇ SdU 1 1 U tr H2 (x) E gen = E D E po − log 1 + √ ∇ log Sd ( x˜ ) · x + 2n n SdU
∇∇ SdU 1 1 H (x) = E D E po − √ ∇ log SdU ( x˜ ) · x − tr 2 2n n SdU 2 1 ∇ log SdU ( x˜ ) · x + 2n 2 1 = E D ∇ log SdU ( x˜ ) 2n (A.17)
Similarly, for the training error, we get
n 1 1 ∇∇ SdU 1 U E D − log 1 + √ ∇ log Sd ( x˜ ) · x i + H2 (x i ) E train = n 2n n SdU i=1
n ∇∇ SdU 1 1 1 U = E D − √ ∇ log Sd ( x˜ ) · x i − H2 (x i ) tr n 2n n SdU i=1 2 1 + ∇ log SdU ( x˜ ) · x i 2n 1 = − E D ∇ log SdU ( x˜ ) · x˜ n
n 2 ∇∇ SdU 1 1 1 U . E D − tr H2 (x i ) + + ∇ log Sd ( x˜ ) · x i n 2n 2n SdU i=1
Singularities Affect Dynamics of Learning in Neuromanifolds
1057
Here, we use the expansion 1 x˜ ≈ x˜ i + √ x i , n
(A.18)
1 f ( x˜ ) ≈ f ( x˜ i ) − √ ∇ f ( x˜ i ) · x i , n where x˜ i =
√1 n
(A.19)
x j , and we finally get
j=i
n 1 2 1 1 U U ∇ log Sd ( x˜i ) · x ED E train = − E D ∇ log Sd ( x˜ ) · x˜ + n n 2n i=1
1 E D ∇ log SdU ( x˜ ) · x˜ . n
= E gen −
(A.20)
On the other hand, since Yn (ω) and Yn+1 (ω) have the same distributions, we easily get E gen = H0 +
1 , n
where H0 is the entropy of the distribution p0 (x). Let us define
Proof of Theorem 5. !
'n |ξ |d i=1 p(x i |ξ, ω)dξ dω, ! ' n+1 J = p(x, D) = |ξ |d i=1 p(x i |ξ, ω)dξ dω. Zn+1 ZnJ = p(D) =
(A.21) (A.22)
Then the Bayesian predictive distribution can be written as pˆ BAYES (x|D) =
J Zn+1 . ZnJ
(A.23)
Under Jeffreys’ prior, we get ZnJ
1 n 2 2 |ξ | exp − x i + ξ a(ω) · = √ x i − ξ dξ dω 2 2 ( 2π)n(d+2) 1 1 x i 2 exp − = √ 2 ( 2π)n(d+2) # " n √ × |ξ |d exp − ξ 2 + n(a(ω) · x˜ )ξ dξ dω 2 1
d
1058
S.-I. Amari, H. Park, and T. Ozeki
1 1 = √ exp − x i 2 2 ( 2π)n(d+2) ( ( )
) a(ω) · x˜ 2 (a(ω) · x˜ ) 2 n d ξ− √ × |ξ | exp − exp dξ dω 2 2 n 1 1 2 exp − = √ x i √ d+1 2 ( 2π)n(d+2) n ( ) 2 d (a(ω) · x˜ ) 2 z z + a(ω) · x˜ exp − × exp dzdω 2 2 √ 1 2π 2 x exp − = √ i √ d+1 2 ( 2π)n(d+2) n ( ) (a(ω) · x˜ ) 2 × Id (a(ω) · x˜ ) exp dω. 2 Therefore, the predictive distribution is given from equation A.14 as 1 pˆ BAYES (x|D) = √ ( 2π)d+2
$
n n+1
d+1
x2 SdJ ( x˜ n+1 ) . exp − 2 SdJ ( x˜ )
(A.24)
By again using the same expansion as for the uniform case, we obtain
d 1+ pˆ BAYES (x|D) = √ 2n ( 2π)d+2
∇ SdJ ( x˜ ) ∇∇ SdJ ( x˜ ) 1 T × 1+ J · δx + tr δxδx 2 Sd ( x˜ ) SdJ ( x˜ ) 1 1 exp − x2 = √ 2 ( 2π)d+2
∇∇ SdJ 1 1 × 1 + √ ∇ log SdJ ( x˜ ) · x + H (x) . tr 2 2n n SdJ 1
Proof of Theorem 7. Zn =
n
1 exp − x2 2
(A.25)
Let us define
p(yi |x i , ξ, ω)dξ dω,
(A.26)
p(yi |x i , ξ, ω)dξ dω.
(A.27)
i=1
Zn+1 =
n+1 i=1
Singularities Affect Dynamics of Learning in Neuromanifolds
1059
Then the Bayesian predictive distribution can be written as pˆ BAYES (y|x, D) =
Zn+1 . Zn
(A.28)
Under the uniform prior, we can easily get 1 1 2 1 2 yi ϕi − Zn = √ n exp − yi + ξ ϕi dξ dω 2 2 2π √ 1 Yn (ω)2 1 2π 1 2 exp yi dω. = √ n √ exp − 2 An 2 An (ω) 2π n
(A.29)
Therefore, the predictive distribution can be written as 1 pˆ BAYES (y|x, D) = √ 2π
$
U Pd (Dn+1 ) n 1 , exp − y2 n+1 2 PdU (Dn )
(A.30)
where 1 1 Yn (ω)2 exp dω An 2 An (ω)
PdU (Dn ) =
1 2 ϕβ (ω · x i ). n n
An (ω) =
i=1
Noting that An (ω) converges to A(ω) within the limit of large n, we substitute PdU (Dn ) = PdU (Yn ). Using the approximation of the form,
Yn+1 ≈ Yn +
yϕ 1 Yn , √ − n 2n
(A.31)
1 Q(Yn+1 , ω) ≈ Q(Yn , ω) + √ ∇ Q(Yn , ω)yϕ n 1 (A.32) ∇∇ Q(Yn , ω)y2 ϕ 2 − ∇ Q(Yn , ω)Yn , 2n y ∇ Q(Yn , ω)ϕdω P(Yn+1 ) ≈ P(Yn ) + √ n
1 2 2 y ∇∇ Q(Yn , ω)ϕ dω − ∇ Q(Yn , ω)Yn dω , (A.33) + 2n +
1060
S.-I. Amari, H. Park, and T. Ozeki
we obtain
1 2 1 1 PdU (Yn+1 ) pˆ BAYES (y|x, D) = √ exp − y 1− 2 2n PdU (Yn ) 2π ! 1 1 y ∇ Q(Yn , ω)ϕdω = √ exp − y2 1 + √ 2 P(Yn ) n 2π
! ∇∇ Q(Yn , ω)ϕ 2 dω 1 + y2 2n P(Yn ) ! ∇ Q(Yn , ω)Yn dω P(Yn ) − − . (A.34) P(Yn ) P(Yn ) By using the fact that ∂2 ∂ Q(Y)A(ω) = Q(Y)A(ω) + Q(Y), ∂Y2 ∂Y we can finally obtain the result. Proof of Theorem 8. From equation 6.39 and the definition of the generalization error, we get ! y ∇ QU d (Yn , ω)ϕβ (ω · x)dω E gen = −E D E y,x log 1 + √ n PdU (Yn ) ! ∇∇ QU 1 d (Yn , ω)An (ω)dω + H2 (y) 2n PdU (Yn ) ! y ∇ QU d (Yn , ω)ϕβ (ω · x)dω ≈ −E D E y,x √ n PdU (Yn ) ! ∇∇ QU 1 d (Yn , ω)An (ω)dω + H2 (y) 2n PdU (Yn ) 2
! ∇ QU y2 d (Yn , ω)ϕβ (ω · x)dω − 2n PdU (Yn ) ! U ∇ QU 1 d (Yn , ω)∇ Qd (Yn , ω )A(ωω )dωω = . ED 2 U 2n P (Yn ) d
Similarly, for the training error, we get ! n yi ∇ QU 1 d (Yn , ω)ϕβ (ω · x i )dω E train = − E D log 1 + √ n n PdU (Yn ) i=1
(A.35)
Singularities Affect Dynamics of Learning in Neuromanifolds
! ∇∇ QU 1 d (Yn , ω)An (ω)dω + H2 (yi ) 2n PdU (Yn ) ! n yi ∇ QU 1 d (Yn , ω)ϕβ (ω · x i )dω ED √ ≈− n n PdU (Yn ) i=1 ! ∇∇ QU 1 d (Yn , ω)An (ω)dω H2 (yi ) + 2n PdU (Yn ) ! 2
∇ QU yi2 d (Yn , ω)ϕβ (ω · x i )dω − 2n PdU (Yn ) ! ∇ QU 1 d (Yn , ω)Yn (ω)dω . = E gen − E D n PdU (Yn )
1061
(A.36)
On the other hand, from equation A.30, the generalization error is written as E gen =
n+1 1 log + E D E po q log PdU (Dn ) − E D E po q log PdU (Dn+1 ) . 2 n
From the fact that lim E D E po q log PdU (Dn ) < ∞,
n→∞
we finally get E gen = Let us define
Proof of Theorem 9. Zn = p(D) =
1 . 2n
|ξ |d
n
p(yi |x i , ξ, ω)dξ dω,
(A.37)
i=1
Zn+1 = p(y, x, D) =
|ξ |d
n+1
p(yi |x i , ξ, ω)dξ dω.
(A.38)
i=1
The Bayesian predictive distribution can then be written as pˆ BAYES (x|D) =
Zn+1 . Zn
(A.39)
1062
S.-I. Amari, H. Park, and T. Ozeki
Similar to the cone model, we get 1 2 1 2 d |ξ | y ϕ exp − + ξ y ϕ − i i n i i dξ dω 2 2 2π √ 1 2 2π yi = √ n √ d+1 exp − 2 2π n Yn (ω) 1 1 Yn (ω)2
× dω. (A.40) I exp √ d+1 d 2 An (ω) An (ω) An
Zn = √
1
Therefore, the predictive distribution can be written as 1 pˆ BAYES (y|x, D) = √ 2π
$
n n+1
d+1
1 2 PdJ (Dn+1 ) . exp − y 2 PdJ (Dn )
(A.41)
By using the same approaches as for the uniform prior, we can easily obtain the final results. Proof of Theorem 10. written as E gen =
From equation A.14, the generalization error is
n+1 d +1 log + E D E po q log PdJ (Dn ) 2 n − E D E po q log PdJ (Dn+1 ) .
(A.42)
From the fact that lim E D E po q log PdJ (Dn ) < ∞,
n→∞
(A.43)
we finally get E gen =
d +1 2n
(A.44)
For the proof, the same derivation process as for the uniform case can be applied. References Akaho, S., & Kappen, H. J. (2000). Nonmonotonic generalization bias of gaussian mixture models. Neural Computation, 12, 6, 1411–1428.
Singularities Affect Dynamics of Learning in Neuromanifolds
1063
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Appl. Comp., AC-19, 716–723. Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Trans., EC-16(3), 299– 307. Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics. 27, 77–87. Amari, S. (1987). Differential geometry of a parametric family of invertible linear systems—Riemannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20, 53–82. Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276. Amari, S. (2003). New consideration on criteria of model selection. In L. Rutkowski & J. Kacprzyk (Eds.), Neural networks and soft computing (Proceedings of the Sixth International Conference on Neural Networks and Soft Computing) (pp. 25–30). Heidelberg: Physica Verlag. Amari, S., & Burnashev, M. V. (2003). On some singularities in parameter estimation problems. Problems of Information Transmission. 39, 352–372. Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–154. Amari, S., & Nagaoka, H. (2000). Information geometry. New York: AMS and Oxford University Press. Amari, S., & Nakahara, H. (2005). Difficulty of singularity in population coding. Neural Computation, 17, 839–858. Amari, S., & Ozeki, T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Trans., E84-A, 31–38. Amari, S., Ozeki, T., & Park, H. (2003). Learning and inference in hierarchical models with singularities. Systems and Computers in Japan, 34, 34–42. Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12, 1399–1409. Amari, S., Park, H., & Ozeki, T. (2001). Statistical inference in nonidentifiable and singular statistical models. J. of the Korean Statistical Society, 30(2), 179–192. Amari, S., Park, H., & Ozeki, T. (2002). Geometrical singularities in the neuromanifold of multilayer perceptrons. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 343–350). Cambridge, MA: MIT Press. Brockett, R. W. (1976). Some geometric questions in the theory of linear systems. IEEE Trans. on Automatic Control, 21, 449–455. Chen, A. M., Lu, H., & Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation, 5, 910–927. ´ (1997). Testing in locally conic models, and Dacunha-Castelle, D., & Gassiat, E. application to mixture models. Probability and Statistics, 1, 285–317. Fukumizu, K. (1999). Generalization error of linear neural networks in unidentifiable cases. In O. Watanabe & T. Yokomori (Eds.), Algorithmic learning theory: Proceedings of the 10th International Conference on Algorithmic Learning Theory (ALT’99) (pp. 51– 62). Berlin: Springer-Verlag. Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3), 833–851.
1064
S.-I. Amari, H. Park, and T. Ozeki
Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13, 317–327. Hagiwara, K. (2002a). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002. Hagiwara, K. (2002b). Regularization learning, early stopping and biased estimator. Neurocomputing, 48, 937–955. Hagiwara, K., Hayasaka, T., Toda, N., Usui, S., & Kuno, K. (2001). Upper bound of the expected training error of neural network regression for a gaussian noise sequence. Neural Networks, 14, 1419–1429. Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proceedings of IJCNN (Vol. 3, pp. 2263–2266). Nagoya, Japan. Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. Proc. Barkeley Conf. in Honor of J. Neyman and J. Kiefer, 2, 807–810. Hotelling, H. (1939). Tubes and spheres in n-spaces, and a class of statistical problems. Amer. J. Math., 61, 440–460. Inoue, M., Park, H., & Okada, M. (2003). On-line learning theory of soft committee machines with correlated hidden units—Steepest gradient descent and natural gradient descent. J. Phys. Soc. Jpn., 72(4), 805–810. Kang, K., Oh, J.-H., Kwon, S., & Park, Y. (1993). Generalization in a two-layer neural networks. Phys. Rev. E, 48(6), 4805–4809. Kitahara, M., Hayasaka, T., Toda, N., & Usui, S. (2000). On the statistical properties of least squares estimators of layered neural networks (in Japanese). IEICE Transactions, J86-D-II, 563–570. ˚ Kurkov´ a, V., & Kainen, P. C. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6, 543–558. Liu, X., & Shao, Y. (2003). Asymptotics for likelihood ratio tests under loss of identifiability. Annals of Statistics, 31(3), 807–832. Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press. Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion— determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872. Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13, 755–764. Park, H., Inoue, M., & Okada, M. (2003). On-line learning dynamics of multilayer perceptrons with unidentifiable parameters. Submitted to J. Phys. A: Math. Gen., 36, 11753–11764. Rattray, M., & Saad, D. (1999). Analysis of natural gradient descent for multilayer neural networks. Physical Review E, 59, 4523–4532. Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461–5464. Riegler, P., & Biehl, M. (1995). On-line backpropagation in two-layered neural networks. J. Phys. A; Math. Gen., 28, L507–L513. Risssanen, J. (1986). Stochastic complexity and modeling. Ann. Statist. 14, 1080–1100. Rosenblatt, F. (1961). Principles of neurodynamics. New York: Spartan. ¨ Ruger, S. M., & Ossen, A. (1997). The metric of weight space. Neural Processing Letters, 5, 63–72.
Singularities Affect Dynamics of Learning in Neuromanifolds
1065
Rumelhart, D., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error backpropagation. In D. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press. Saad, D., & Solla, A. (1995). On-line learning in soft committee machines. Phys. Rev. E, 52, 4225–4243. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks. 5, 589–593. Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13, 899–933. Watanabe, S. (2001b). Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8), 1409–1060. Watanabe, S. (2001c). Algebraic information geometry for learning machines with singularities. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 329–336). Cambridge, MA: MIT Press. Watanabe, S., & Amari, S. (2003). Learning coefficients of layered models when the true distribution mismatches the singularities. Neural Computation, 15(5), 1013– 1033. Weyl, H. (1939). On the volume of tubes. Amer. J. Math., 61, 461–472. Wu, S., Amari, S., & Nakahara, H. (2002). Population coding and decoding in a neural field: A computational study. Neural Computation, 14, 999–1026. Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797. Yamazaki, K., & Watanabe, S. (2002). A probabilistic algorithm to calculate the learning curves of hierarchical learning machines with singularities. Trans. on IEICE, J85-D-II(3), 363–372. Yamazaki, K., & Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. International Journal of Neural Networks, 16(7), 1029–1038.
Received June 28, 2004; accepted May 26, 2005.
LETTER
Communicated by Paul Tiesinga
How Noise Affects the Synchronization Properties of Recurrent Networks of Inhibitory Neurons Nicolas Brunel
[email protected] David Hansel
[email protected] Laboratory of Neurophysics and Physiology, CNRS UMR 8119, Universit´e Paris Ren´e Descartes, 75270 Paris Cedex 05, France
GABAergic interneurons play a major role in the emergence of various types of synchronous oscillatory patterns of activity in the central nervous system. Motivated by these experimental facts, modeling studies have investigated mechanisms for the emergence of coherent activity in networks of inhibitory neurons. However, most of these studies have focused either when the noise in the network is absent or weak or in the opposite situation when it is strong. Hence, a full picture of how noise affects the dynamics of such systems is still lacking. The aim of this letter is to provide a more comprehensive understanding of the mechanisms by which the asynchronous states in large, fully connected networks of inhibitory neurons are destabilized as a function of the noise level. Three types of single neuron models are considered: the leaky integrateand-fire (LIF) model, the exponential integrate-and-fire (EIF), model and conductance-based models involving sodium and potassium HodgkinHuxley (HH) currents. We show that in all models, the instabilities of the asynchronous state can be classified in two classes. The first one consists of clustering instabilities, which exist in a restricted range of noise. These instabilities lead to synchronous patterns in which the population of neurons is broken into clusters of synchronously firing neurons. The irregularity of the firing patterns of the neurons is weak. The second class of instabilities, termed oscillatory firing rate instabilities, exists at any value of noise. They lead to cluster state at low noise. As the noise is increased, the instability occurs at larger coupling, and the pattern of firing that emerges becomes more irregular. In the regime of high noise and strong coupling, these instabilities lead to stochastic oscillations in which neurons fire in an approximately Poisson way with a common instantaneous probability of firing that oscillates in time.
Neural Computation 18, 1066–1110 (2006)
C 2006 Massachusetts Institute of Technology
Synchronization Properties of Inhibitory Networks
1067
1 Introduction Of the various patterns of synchronous oscillations that occur in the brain, episodes of activity during which local field potentials display fast oscillations with frequencies in the range 40 to 200 Hz have elicited particular interest. Such episodes have been recorded in vivo in several brain areas, in particular in the rat hippocampus (Buzsaki, Urioste, Hetke, & Wise, 1992; Bragin et al., 1995; Csicsvari, Hirase, Czurko, Mamiya, & Buzs´aki, 1999a, 1999b; Siapas & Wilson, 1998; Hormuzdi et al., 2001). During these episodes, single-neuron firing rates are typically much lower than local field potential (LFP) frequencies (Csicsvari et al., 1999a), and single-neuron discharges appear very irregular. Although a detailed analysis of the single-cell firing statistics during these episodes is lacking, this irregularity is consistent with findings of high variability in interspike intervals of cortical neurons in various contexts (see, e.g., Softky & Koch, 1993; Compte et al., 2003), and the observation of large fluctuations in membrane potentials in intracellular recordings in cortex in vivo (see, e.g., Destexhe & Par´e 1999; Anderson, Lampl, Gillespie, & Ferster, 2000). Recent theoretical studies have shown that fast synchronous population oscillations in which single-cell firing is highly irregular emerge in networks of strongly interacting inhibitory neurons in the presence of high noise. Brunel and Hakim (1999) investigated analytically the emergence of such oscillations in networks of sparsely connected leaky integrate-and-fire neurons activated by an external noisy input. They showed that the frequency of the population oscillations increases rapidly when the synaptic delay decreases and that it can be much larger than the firing frequency of the neurons. For instance, a network of neurons firing with an average rate of 10 Hz can oscillate at frequencies that can be on the order of 200 Hz when synaptic delays are on the order of 1 to 2 msec (Brunel & Hakim, 1999; Brunel & Wang, 2003). Tiesinga and Jose (2000) found similar collective states in numerical simulations of a fully connected network of inhibitory conductance-based neurons activated with a noisy external input. However, the frequency of the population oscillations in their model was smaller (in the range 20–80 Hz), and the variability of the spike trains was weaker than in the leaky integrate-and-fire (LIF) network for similar synaptic time constants and average firing rates of the neurons. Following the terminology of Tiesinga and Jose (2000), we will call this type of state a stochastic synchronous state. Stochastic synchrony occurs in the presence of strong noise, in contrast to so-called cluster states, which are found when noise and heterogeneities are weak. In the simplest cluster state, all neurons tend to spike together in a narrow window of time; they form one cluster. In such a state, the population oscillation frequency is close to the average frequency of the neurons (Abbott & van Vreeswijk, 1993; Tsodyks, Mit’kov, & Sompolinsky, ˜ & Kopell, 1993; Hansel, Mato, & Meunier, 1995; White, Chow, Soto-Trevino,
1068
N. Brunel and D. Hansel
1998; Golomb & Hansel, 2000; Hansel & Mato, 2003). Clustering in which neurons are divided into two or more groups can also occur (Golomb, Hansel, Shraiman, & Sompolinsky, 1992; Golomb & Rinzel, 1994; Hansel et al., 1995; van Vreeswijk, 1996). Within each of these groups, neurons fire at a similar phase of the population oscillation, but different groups fire at different phases. Thus, the frequency of the population oscillations can be very different from the neuronal firing rate, as in stochastic synchrony. However, in contrast to stochastic synchrony, in cluster states the population frequency is always close to a multiple of the neuronal firing rate. Moreover, single-neuron activity in cluster states is much more regular than in stochastic synchronous states. In this letter, we examine the dynamics of networks of inhibitory neurons in the presence of noisy input. For simplicity, we consider a fully connected network; similarities and differences with more realistic randomly connected networks will be mentioned in the discussion. We consider three classes of models: the LIF model (Lapicque, 1907; Tuckwell, 1988), the exponential integrate-and-fire model (EIF) (Fourcaud-Trocm´e, Hansel, van Vreeswijk, & Brunel, 2003), and simple conductance-based (CB) models (Hodgkin & Huxley, 1952) with two active currents, a sodium and a potassium current. In these models, we study the instabilities of the asynchronous state. In this state, the population-averaged firing rate becomes constant in the large N limit, and correlations between neurons vanish in this limit. We investigate in particular how the instability responsible for stochastic synchrony relates to other types of instabilities of the asynchronous state. Section 2 is devoted to the LIF model. We fully characterize the spectrum of the instabilities of the asynchronous state and explore in what ways these instabilities depend on the noise amplitude, the average firing rate of the neurons, and the synaptic time constants (latency, rise, and decay time). This can be performed analytically, due to the simplicity of the LIF model and the simplified all-to-all architecture. The LIF neuron has the great advantage of analytical tractability, but it often exhibits nongeneric properties. For instance, the frequency-current relationship that characterizes the response of the neuron to a steady external current exhibits a logarithmic behavior near current threshold, while generic type I neurons have a square root behavior. LIF and standard Hodgkin-Huxley (HH) models may also display substantially different synchronization behaviors in the low-noise regime (Pfeuty, Golomb, Mato, & Hansel, 2003; Pfeuty, Mato, Golomb, & Hansel, 2005). Crucially, LIF neurons respond in a nongeneric way to fast oscillatory external inputs (Fourcaud-Trocm´e et al., 2003). This motivates an investigation of synchronization properties in models with more realistic dynamics. In section 3, we combine analytical calculations with numerical simulations to study a network of EIF neurons. In this model, single-neuron dynamics depend on a voltage-activated current that triggers action potentials. This framework allows us to make predictions regarding the way sodium currents
Synchronization Properties of Inhibitory Networks
1069
affect the emergence of fast oscillations. In section 4 we simulate several conductance-based network models to compare their behaviors with those of the LIF and the EIF. We conclude that although quantitative aspects of the phenomenology of stochastic synchrony depend on the details of the neuronal dynamics, the occurrence of this type of collective state is a generic feature of large neuronal networks of strongly coupled inhibitory neurons. 2 Networks of Leaky Integrate-and-Fire Neurons 2.1 The Model. We consider a fully connected network of N inhibitory LIF neurons. The membrane potential, Vi , of neuron i (i = 1, . . . , N) evolves in the subthreshold range Vi (t) ≤ Vt where Vt is the firing threshold, according to τm V˙ i (t) = −Vi (t) + Irec (t) + Ii,ext (t),
(2.1)
where τm is the membrane time constant, Ii,ext (t) is an external input, and Irec (t) is the recurrent input due to the interactions between the neurons. When the voltage reaches threshold Vt , a spike is emitted and the voltage is reset to Vr . The external input is modeled as a current, √ Ii,ext (t) = I0 + σ τm ηi (t),
(2.2)
where I0 is a constant DC input, σ measures the fluctuations around I0 , and ηi (t) is a white noise that is uncorrelated from neuron to neuron, ηi (t) = 0, ηi (t)η j (t ) = δi j δ(t − t ). Since the network is fully connected, all neurons receive the same recurrent synaptic input. It is modeled as J s t − t kj , N k N
Irec (t) = −
(2.3)
j=1
where the postsynaptic current elicited by apresynaptic spike in neuron j occurring at time t kj , is denoted by s t − t kj , and J, which measures the strength of the synaptic interactions, has the dimension of a voltage. The first sum on the right-hand side in this equation represents a sum over synapses; the second sum is over spikes. For the function s(t), we take s(t) =
0 A(exp[−(t − τl )/τd ] − exp[−(t − τl )/τr ])
t < τl t ≥ τl ,
(2.4)
1070
N. Brunel and D. Hansel
where τl is the latency, τr is the rise time, τd is the decay time of the postsynaptic current, and A = τm /(τd − τr ) is a normalization factor that ensures that the integral of a unitary postsynaptic current remains independent of the synaptic time constants. With this normalization, s(t)dt = τm . Such kinetics are obtained from the set of differential equations, τd s˙ (t) = −s(t) + x(t)
(2.5)
τr x(t) ˙ = −x(t) + τm δ(t − τl ),
(2.6)
where x is an auxiliary variable. In the following we assume that all the neurons are identical, and we take τm = 10 ms (McCormick, Connors, Lighthall, & Prince, 1985), Vt = 20 mV, Vr = 14 mV (Troyer & Miller 1997). 2.2 The Asynchronous State. The dynamics of the LIF network can be studied using a mean-field approach (Treves, 1993; Abbott & van Vreeswijk, 1993); Brunel & Hakim, 1999; Brunel, 2000). In the thermodynamic limit N → ∞, the √recurrent input to the neurons can be written up to corrections of order 1/ N: Irec (t) = −J s(t)
(2.7)
τd s˙ = −s + x
(2.8)
τr x˙ = −x + τm ν(t − τl ),
(2.9)
where ν(t) is the instantaneous firing rate of the neurons at time t averaged over the network. The dynamical state of the network can then be described by a probability density function (PDF) for the voltage, P(V, t), which satisfies the Fokker-Planck equation, τm
∂P ∂ σ 2 ∂2 P + = [(V − Irec (t) − I0 )P], 2 ∂t 2 ∂V ∂V
(2.10)
with the boundary conditions P(Vt , t) = 0 ∂P 2ν(t)τm (Vt , t) = − ∂V σ2 lim(P(Vr + , t) − P(Vr − , t)) = 0
→0
∂P ∂P 2ν(t)τm lim (Vr + , t) − (Vr − , t) = − . →0 ∂ V ∂V σ2
(2.11) (2.12) (2.13) (2.14)
Synchronization Properties of Inhibitory Networks
1071
Finally, P(V, t) must obey P(V, t)d V = 1 at all times t. In the stationary state of the network, the PDF of the membrane potential as well as the population average of the firing rate, ν(t) = ν0 , are constant in time, and the neurons fire asynchronously. Integrating equation 2.10 together with the condition ∂ P/∂t = 0 and the boundary conditions, equation 2.14, one can show that the stationary firing rate ν0 is given by Ricciardi (1977), Amit and Tsodyks (1991), and Amit and Brunel (1997), √ 1 = π ν0 τm
yt
2
e x [1 + erf(x)] d x
(2.15)
yr
where erf is the error function (Abramowitz & Stegun, 1970), Vt + Jν0 τm − I0 σ Vr + Jν0 τm − I0 . yr = σ yt =
(2.16) (2.17)
The coefficient of variation of the interspike interval distribution, which measures the irregularity of single-neuron firing, can also be computed in the asynchronous state. This yields (Brunel, 2000; Tuckwell, 1988) CV2 = 2π(ν0 τm )2
yt yr
2
ex dx
x
−∞
2
e y [1 + erf(y)]2 dy.
Figure 1 shows how the coefficient of variation (CV) increases with noise level σ , for several values of the average firing rate. 2.3 Stability Analysis of the Asynchronous State. The asynchronous state is stable if any small perturbation from it decays back to zero. To study the instabilities of the asynchronous state, one approach is to diagonalize the linear operator, which describes the dynamics of small perturbations from the asynchronous state (see appendix A). A specific eigenmode, with eigenvalue λ, is stable (resp. unstable) if Re(λ) < 0 (resp. Re(λ) > 0). The frequency at which the mode oscillates is ω/(2π) where ω = Im(λ). The asynchronous state is stable if the real part of the eigenvalue is negative for all the eigenmodes. Here we present an alternative approach that directly provides the equation that determines the critical manifolds in parameter space on which eigenmodes change stability (see also Brunel & Wang, 2003). The derivation proceeds in two steps. The first step is to compute the recurrent current, assuming that the population firing rate of the network is weakly modulated and oscillatory with a frequency
1072
N. Brunel and D. Hansel
1 10 Hz
0.1 CV
30 Hz 50 Hz
0.01
0.001 0.001
0.01
1 0.1 SD of noise σ (mV)
10
Figure 1: Coefficient of variation as a function of the noise level, σ , for Vt = 20 mV, Vr = 14 mV, τm = 10 ms, and several values of the average firing rate ν0 , indicated on the graph.
ω : ν(t) = ν0 + ν1 exp(iωt) with ν1 1 (for notational simplicity, we use complex numbers in the following linear analysis). At leading order, the recurrent current is also oscillatory at the same frequency, and its modulation can be written as I1 =
−Jν1 τm exp(−iωτl ) ≡ Jν1 τm AS (ω) exp[iπ + i S (ω)], (1 + iωτd )(1 + iωτr )
(2.18)
where AS (ω) =
1 (1 +
ω2 τd2 )(1
+ ω2 τr2 )
,
S (ω) = ωτl + atan(ωτr ) + atan(ωτd ).
(2.19) (2.20)
The phase shift of the synaptic current with respect to the oscillatory presynaptic firing rate modulation is the sum of four contributions. Three of them, on the right-hand side of equation 2.20, depend on the latency, the rise time, and the decay time of the synapses and vary with ω. The fourth contribution, which does not depend on ω, is due to the factor exp(iπ) in equation 2.18. It results from the inhibitory nature of the synaptic interactions.
Synchronization Properties of Inhibitory Networks
1073
The second step is to compute the firing rate in response to an oscillatory input, equation A.3. It is given by (Brunel & Hakim, 1999; Brunel, Chance, Fourcaud, & Abbott, 2001), ν1 =
∂U (yt , iω) − ∂U (y , iω) I1 ν0 ∂y ∂y r , σ (1 + iωτm ) U(yt , iω) − U(yr , iω)
(2.21)
where yt , yr are given by equations 2.16 and 2.17 and the function U is defined in appendix A, equation A.5. The modulation ν1 can also be written as ν1 =
I1 ν0 AN (ω) exp[i N (ω)], σ
(2.22)
where (I1 ν0 /σ )AN (ω) is the amplitude of the firing rate modulation and N (ω) is the phase shift of the firing rate with respect to the oscillatory input current. A negative (resp. positive) phase shift means that the modulation of the neuronal response is in advance (resp. delayed) compared to the modulation of the input. Since in the network, the modulation of the firing rate and the modulation of the recurrent current must be consistent, combining equations 2.18 and 2.21, the following self-consistent condition must be satisfied: 1=
Jν0 τm AN (ω)AS (ω) exp[iπ + i S (ω) + i N (ω)]. σ
(2.23)
This equation is a necessary and sufficient condition for the existence of self-sustained oscillations of the population firing rate with vanishingly small amplitude and frequency ω. Therefore, it is identical to the condition that an eigenmode of the linearized dynamics, oscillating at frequency ω, has marginal stability (i.e., the real part of its eigenvalue vanishes). Note that AN depends on ν0 , σ , Vt , and Vr , whereas AS depends on synaptic time constants τl , τr , and τd . The complex equation 2.23 is equivalent to two real equations. The first of these equations is S (ω) + N (ω) = (2k + 1)π,
k = 0, 1, 2, . . .
(2.24)
Equation 2.24 does not depend on the coupling strength J > 0. It determines the frequency spectrum of the eigenmodes with marginal stability. The second equation is 1=
Jν0 τm AN (ω)AS (ω). σ
(2.25)
1074
N. Brunel and D. Hansel
It determines the coupling J c (ω), J c (ω) =
σ , ν0 τm AN (ω)AS (ω)
(2.26)
at which a mode with frequency ω has marginal stability. In the following, we study the instability of the asynchronous state in the J-σ plane, at a fixed firing rate ν0 . When J and σ are varied, ν0 is kept fixed by a suitable adjustment of I0 . The effect of the firing rate ν0 on instabilities is then studied separately in section 2.4.4. For given ν0 and σ , one expects the asynchronous state to be stable when J is sufficiently small. The first eigenmode to lose stability when J increases determines the stability boundary of the asynchronous state. Therefore, this boundary is given by
J˜c =
min
{ω| S (ω)+ N (ω)=(2k+1)π }
σ , ν0 τm AN (ω)AS (ω)
(2.27)
where the minimum is computed over all the solutions to equation 2.24. 2.4 The Spectrum of Instabilities. Inspection of the qualitative properties of the synaptic and neuronal phase lag helps us to understand how the instabilities of the asynchronous state occur and how they depend on the noise level (see also Fuhrmann, Markram, & Tsodyks, 2002, Brunel & Wang, 2003, for similar considerations). 2.4.1 The Solutions to Equation 2.24. The synaptic phase lag S (ω) is an increasing function of frequency. For low and high frequencies, it behaves like S (ω) ∼ ω(τl + τr + τd ) and S (ω) ∼ π + ωτl , respectively. The function S (ω), which does not depend on the noise level, is plotted in Figure 2A. The neuronal phase lag, N (ω), depends markedly on the noise level. For low noise levels, N (ω) has a sawtooth profile with peaks and very sharp variations at frequencies ω = 2π f n , where f n are in the limit of zero noise integer multiples of the firing rate of the neurons, f n = nν0 , n = 1, 2 . . . (see Figure 2B). Provided the latency is strictly positive, τl > 0, the function N + S goes to infinity with a sawtooth profile in the large ω limit (see Figure 2C). Since the frequencies of the eigenmodes with marginal stability are solutions to equation 2.24, they can be determined graphically by looking at the intersections of the graph of the function N + S with the horizontal lines at ordinates (2k + 1)π. For τl > 0, one can show that an odd number, 2 p + 1, of such intersections exists for any k. For instance, for the parameters of Figure 2C, there are for k = 0, 13 intersections for σ = 0.01 mV, 3 intersections for σ = 0.05 mV, and only 1 intersection for σ ≥ 0.1 mV. Out of these 2 p + 1 intersections, p + 1 are for frequencies close to f n for values of n in a range that depends on the noise level and also on the synaptic
Synchronization Properties of Inhibitory Networks
1075
Synaptic phase lag
A 270 180 90 0
Neuronal phase lag
B
Total phase lag
C
0
50
100
150
200
0
50
100
150
200
0
50
100 150 Frequency (Hz)
200
90 0 -90 -180 270 180 90 0 -90
Figure 2: Interpretation of equation 2.24 in terms of synaptic and neuronal phase lags. (A) Synaptic phase lag for τl = 1 ms, τr = 1 ms, τd = 6 ms. (B) Neuronal phase lag for Vt = 20 mV, Vr = 14 mV, ν0 = 30 Hz, τm = 10 ms, and five noise levels: 0.01 mV (dot-dashed), 0.05 mV (dashed), 0.1 mV (thin solid), 1 mV (medium solid), and 10 mV (thick solid). Note the sharp variations at integer multiples of the firing rate ν0 (30, 60, 90, . . . Hz) for low noise levels, which disappear as noise becomes stronger. (C) Total phase lag (sum of synaptic and neuronal phase lags, for the same noise levels as in B). Solutions to equation. 2.24 for a given noise level are at the intersection of the curve representing the total phase lag and the horizontal dotted line at 180 degrees. Note the large number of intersections for low noise levels that disappear as noise increases until a single intersection is left.
kinetics, as will be shown later. For instance, n = 3, . . . , 9 for σ = 0.01 mV, n = 4, 5 for σ = 0.05 mV, as shown in Figure 2C. The remaining p intersections are at intermediate frequencies nν0 < ω/(2π) < (n + 1)ν0 .
1076
N. Brunel and D. Hansel
Figure 2B shows that the amplitude of the peaks in N (ω) decreases, and the peaks themselves become less sharp and broader when the noise increases. As a result, pairs of intersections with the 180 horizontal line coalesce and disappear. For example, in Figure 2C, a pair of intersections—one near 60 Hz and the other at a frequency between 60 and 90 Hz—disappears when the noise is increased from σ = 0.01 (dot-dashed curve) to 0.05 mV (dashed curve). Eventually, for sufficiently large noise, all the intersections except one have disappeared, and the neuronal phase N becomes a monotonously increasing function of ω (see Figure 2C, for σ ≥ 0.1 mV). For the parameters of Figure 2C, this single intersection is between 90 Hz (for σ = 10 mV) and 120 Hz (for σ = 0.1 mV), three to four times larger than the average firing rate of the neurons, ν0 = 30 Hz. In general, the value of ω at this intersection depends not only on ν0 but also on the synaptic time constants, τl , τr , and τd as will be discussed below. A similar picture holds for the intersections with horizontal lines at (2k + 1)π with k ≥ 1, although they occur at much larger frequencies. For example, intersections with the line at 540 degrees (k = 1) occur around 1000 Hz for the parameters of Figure 2. Once the frequencies of the marginal modes have been obtained from equation 2.24, the corresponding critical couplings are determined by solving equation 2.26. The results can be represented in the plane J-σ . 2.4.2 The Instability Spectrum in the J − σ Plane. We start by describing the structure of the instability spectrum in the case where the synaptic time constants are τr = 1 ms, τd = 6 ms, and τl = 1 ms. What happens when these parameters change is briefly discussed in section 2.5. We first consider the case ν0 = 30 Hz. The lines on which eigenmodes are marginal are plotted in the σ − J plane in Figure 3A. The frequencies of the marginal modes are plotted as a function of noise in Figure 3B. Figure 3A shows that there are several families of lines. Each family corresponds to the set of solutions to the phase equation, 2.24, for a given k = 0, 1, 2, 3, . . . (from low J to high J ; for clarity, only lines belonging to k = 0, 1 families are shown) as σ varies. As discussed in the previous section, an odd number 2 p + 1 of solutions to the phase condition exist for any k. These 2 p + 1 solutions can be divided into p pairs of solutions that coalesce at some level of noise and one solution that exists at any noise level. We first discuss the p lines corresponding to the 2 p solutions that disappear as the noise level increases. Each line is composed of an upper and a lower branch that approach each other as the noise level increases and subsequently meet at a critical value of the noise. The area enclosed by the line is the region in which the corresponding eigenmode is unstable (i.e., the region to the left of the curve). The frequency of the mode on the lower part of this line is very close to an integer multiple of the average neuronal firing rate nν0 , while on the upper part of the line, the frequency is between
Synchronization Properties of Inhibitory Networks
1077
[Figure 3 here. Panel A: total synaptic inhibitory coupling (mV, 1–100) versus noise amplitude (mV, 0.001–10), with the region marked "asynchronous state stable" and branches labeled k = 0, 1 and n = 3–8. Panel B: frequency (Hz, 0–1200) versus noise amplitude (mV), with branches labeled n = 2–11.]
Figure 3: (A) Lines on which eigenmodes become unstable, obtained from equations 2.24 and 2.25, in the plane J − σ , for the parameters of Figure 2. The asynchronous state is stable below the lowest line (region marked “asynchronous state stable”). Only lines corresponding to families of solutions at k = 0 and k = 1 (marked in the figure) are indicated. Each family is composed of individual branches labeled by integer values of n (indicated only for k = 0). (B) Frequency of marginal modes. The thick curve in A is the stability boundary of the asynchronous state. The thick curve in B is the frequency of the unstable mode on this boundary plotted against the noise.
nν0 and (n + 1)ν0 or (n − 1)ν0 (see Figure 3B). The two parts of the line meet at the noise level at which the eigenmode becomes stable for any value of the coupling. Hence, we can index all eigenmodes, and the lines on which they have marginal stability in the σ − J plane, by the integer n. This index is the number of clusters that emerge via the instability on the lower part of the line. If n = 1, one cluster emerges, and all the neurons in the network tend to fire simultaneously. For n ≥ 2, they tend to split into groups of neurons that fire in synchrony one after the other (see Figure 4B). Thus, the frequency of the population oscillations can be significantly larger than the average firing rate of the neurons, ν0, if n is large. All these instabilities exist only in a limited range of noise amplitude. For example, in Figures 3A and 3B, the n = 2 curve exists for σ < 0.005 mV, the n = 3 curve for σ < 0.04 mV, and so forth. This reflects the sensitivity of clustering to noise. As noise increases, neurons are less able to maintain their phase locking to the population oscillation: they skip more and more between clusters and spend less time bound to a specific cluster, so clustering cannot emerge. The instabilities corresponding to these p eigenmodes are called clustering instabilities. An additional solution to the phase condition differs from the others in that it survives even for large noise. As noise increases, it generates a single-valued curve in the J − σ plane. On this line, the desynchronizing effect of the noise, which would otherwise suppress the instability, can be compensated for by increasing the coupling strength (see Figure 3A). The mode that has marginal stability on this line can also be indexed by an integer n, since in the limit of weak noise, its frequency goes to one of the integer multiples of the firing rate, nν0 (n = 4 in Figure 3). Hence, at low noise levels, this instability, like the clustering instabilities described above, leads to a cluster state in which the neurons fire in a regular manner. When the noise becomes strong, it leads to a state in which individual neurons fire in a highly irregular way while the population activity oscillates: a stochastic synchronous state. In this state, the neurons increase their firing probability together with the oscillatory population activity; that is, they display "rate oscillations" (see Figure 4D). To distinguish this instability from those described above, we term it an oscillatory rate instability.

2.4.3 Stability Boundary of the Asynchronous State. For each value of σ, the asynchronous state is stable for J < J̃c, where J̃c is given by equation 2.27. Typically, J̃c = J(ω1), where ω1 is the smallest solution of equation 2.24. This is because if ω1 < ω2 < · · · are solutions to equation 2.24, we have A_S(ω1) > A_S(ω2) > · · ·, and likewise A_N(ω1) > A_N(ω2) > · · ·. The bold lines in Figure 3A indicate the boundary of the region in which the asynchronous state is stable. When the noise is weak, the stability boundary coincides with one of the lines where a clustering instability occurs (n = 2 for σ < 0.005 mV, n = 3 for
[Figure 4 here. Four panels (A–D), each showing spike rasters of 20 neurons and the instantaneous population firing rate (Hz, 0–200) against time (ms, 2000–2200).]
Figure 4: Simulations of a network of 1000 LIF neurons. All four panels show spike trains of 20 selected neurons (rasters) and the instantaneous population firing rate, computed in 1 ms bins. (A, B) Low coupling-low noise region. In A, the asynchronous state is stable (J = 1 mV, σ = 0.04 mV). In B, the noise is decreased (σ = 0.02 mV); the network now settles in a three-cluster state, as predicted by the analytical results. (C, D) Strong coupling-strong noise region. In C (J = 100 mV, σ = 10 mV), the asynchronous state is stable. In D, decreasing σ to 4 mV leads to a stochastic oscillatory state, as predicted by the analytical results. See appendix C for more details on the numerical simulations.
0.005 mV < σ < 0.04 mV), and the CV of the firing is smaller than 0.04. When the noise is sufficiently large (σ > 0.04 mV), the stability boundary coincides with the oscillatory rate instability. On this part of the boundary, the CV increases from 0.04 to about 1 for σ ∼ 5 mV (see Figure 1). The frequency of the marginal mode on the stability boundary of the asynchronous state is shown in Figure 3B. At low noise levels, the index of the marginal mode is n = 2, and the frequency is about 2ν0 = 60 Hz. It increases discontinuously when σ ≈ 0.005 mV to about 3ν0 = 90 Hz, as the index changes from n = 2 to n = 3. A second discontinuity occurs for σ ≈ 0.04 mV, where the marginal mode on the boundary becomes the oscillatory rate mode. The index n changes from 3 to 4, and the frequency jumps to 120 Hz. For further increases of σ, the frequency changes smoothly with the noise and remains significantly larger than ν0. The two regions of the stability boundary (clustering and rate oscillation) are characterized by different relationships between J̃c and σ. At low noise levels, in the clustering region, J̃c ∼ σ² (Abbott & van Vreeswijk, 1993), while at high noise levels, in the rate oscillation region, J̃c ∼ σ/√ln σ (see appendix B for details of the derivation).

2.4.4 How the Stability Boundary Depends on the Firing Rate ν0. The instabilities that occur on the asynchronous state stability boundary depend on ν0, as shown in Figure 5. For ν0 = 10 Hz, clustering instabilities with n = 7, 8, 9, 10, 11 occur at very low noise levels. For σ ≈ 10⁻⁴ mV, the instability becomes an oscillatory rate instability with n = 12. As ν0 increases, the asynchronous state loses stability at smaller J and σ. Intuitively, this is because the inhibitory feedback responsible for the destabilization of the asynchronous state increases with ν0. Moreover, clustering instabilities occur in a larger range of noise, the number of emerging clusters becomes smaller, and the transition to the oscillatory rate instability moves toward larger noise levels. The last effect is a consequence of the fact that as the firing rate increases, the spike trains become more regular. For ν0 = 30 Hz, this transition occurs around σ = 0.04 mV, whereas for ν0 = 50 Hz, it is at about σ = 0.3 mV. For all these values of ν0, the CV is in the range 0.01 to 0.1 at this transition.

2.5 The Effect of Synaptic Kinetics. The synaptic time constants and delay have a strong effect on the instability spectrum of the asynchronous state and on its stability boundary. Three qualitatively different situations can occur. First, when the latency and the rise time are both zero, the synaptic phase Φ_S(ω) is bounded by π/2. Since the neuronal phase is smaller than π/2 (see Figure 2), equation 2.24 has no solutions. Hence, the asynchronous state is stable for any J, σ. Second, when there is no latency (τl = 0) but the rise time is finite, the synaptic phase is bounded by π. Hence, solutions to equation 2.24 exist
[Figure 5 here. Panel A: total synaptic inhibitory coupling (mV, 0.001–1000) versus noise (mV, 1e-05–10) for ν0 = 10, 30, 50 Hz. Panel B: oscillation frequency (Hz, 40–160) versus noise (mV), with branches labeled n = 2–12.]
Figure 5: Instabilities of the asynchronous state versus firing rate. (A) Stability boundary of the asynchronous state in the σ − J plane. (B) Frequency of the marginal mode on this boundary. In both panels, curves for three firing rates are shown: ν0 = 10 Hz (thin curves), 30 Hz (intermediate curves), 50 Hz (thick curves). Dashed lines indicate cluster state instabilities; the solid line indicates the oscillatory rate instability.
[Figure 6 here. Top panels: total synaptic inhibitory coupling (mV) versus noise (mV, 0.001–10) for cases A and B, each with a region marked "asynchronous state stable" and branches labeled by k and n. Bottom panels: oscillation frequency (Hz, 0–1200) versus noise (mV), with branches labeled by n.]
Figure 6: Influence of the shape of the inhibitory synaptic currents on the location of the instabilities in the σ − J plane and on the frequencies of the eigenmodes with marginal stability. (A) Instantaneous synaptic currents (τr = τd = 0) and latency τl = 2 ms. The rate oscillation mode has an index n = 8 (frequency about 180 Hz at σ = 10 mV). When noise decreases, there is a succession of transitions to cluster modes with lower n. (B) Synaptic currents with τr = 2 ms, τd = 6 ms, and no latency (τl = 0). The oscillatory rate instability has a large index n (n ∼ 25). The rate oscillations have a frequency of about 120 Hz for σ = 10 mV. The critical coupling J̃c varies nonmonotonically as σ increases. Note the different ordinate scales in the top panels of A and B.
only for k = 0. An example of such a case is shown in Figure 6B. Note that in this case, the oscillatory rate instability has a very large index n, and there is a range of J in which the stability of the asynchronous state varies nonmonotonically as the noise level increases. The asynchronous state is
first unstable due to clustering instabilities, then becomes stable, becomes unstable again due to the oscillatory rate instability, and finally becomes stable again as the noise increases. Third, with a finite latency, the synaptic phase is unbounded as ω increases. Hence, solutions to equation 2.24 exist for any k, leading to families of eigenmodes associated with each integer k, as in Figure 3, for any value (zero or nonzero) of the rise and decay times. For example, the families of lines on which eigenmodes have marginal stability in the σ − J plane, and the frequencies of the marginal modes, are shown in Figure 6A for k = 0 and k = 1. However, the region of stability of the asynchronous state is larger for nonzero decay time and/or rise time, as seen by comparing Figure 6A and Figure 3A (note that the scale of the ordinates is 10 times larger in Figure 3A than in Figure 6A). The number of clusters emerging on the asynchronous state stability boundary (i.e., the index n of the corresponding instability mode) in the weak noise region depends on the amplitude of the noise, but also on all the synaptic time constants (latency, rise time, decay time): the shorter the synaptic time constants, the larger the number of clusters. This is shown in Figure 7, where the number of clusters is plotted as a function of σ and of a scaling factor, α, applied to all three synaptic time constants (τl = α × 1 ms, τr = α × 1 ms, τd = α × 6 ms). The solid lines in this figure show the boundaries between regions of the σ − α plane in which the instability mode has frequency ∼ nν0, that is, in which the corresponding instability leads to n-cluster states (the number of clusters n is marked in each region). The number of clusters can vary from 1 for slow synaptic currents (for example, with τl = 3 ms, τr = 3 ms, τd = 18 ms) to infinity as the synaptic currents become infinitely fast (van Vreeswijk, 1996). These lines delineate open stripes in the σ − α plane; upon crossing such a line, a pair of solutions to equation 2.24 either appears or vanishes, one of which corresponds to the stability boundary of the asynchronous state. For example, for α = 1, the instability line on the stability boundary of the asynchronous state is the n = 3 line for σ = 0.01 mV; as σ increases, the pair of solutions corresponding to n = 3 disappears, and the instability line becomes the n = 4 line. The dotted line separates the region where the number of solutions to equation 2.24 is strictly larger than one (to the left of the line) from a region where a single solution exists (to the right of the line); hence, a pair of solutions vanishes when crossing the line from left to right. The difference between the dotted line and the solid lines is that on the dotted line, none of the vanishing solutions corresponds to the stability boundary of the asynchronous state. Taking again the case α = 1 as an example, crossing the dotted line marks the point where the n = 5 pair coalesces (see Figure 3). The set of lines that marks the boundary of the open, large-noise region (composed of a set of alternating solid and dotted lines) can be defined as the boundary of the stochastic synchrony region. For the LIF neuron, this set of lines remains
[Figure 7 here. Synaptic scale factor α (0.5–3) versus noise (mV, 0.01–1), with regions labeled by the number of clusters n = 1–8.]
Figure 7: The nature of the unstable modes on the stability boundary of the asynchronous state as a function of noise and of a global scaling factor α of the synaptic kinetics. The synaptic time constants are τr = α ms, τd = 6α ms, and τl = α ms. The solid lines are the boundaries between regions in which the instability mode has a frequency ∼ nν0, that is, in which the corresponding instability leads to n-cluster states, where n is indicated in each region. The dotted line separates a weak noise regime, where the number of solutions to equation 2.24 is strictly larger than 1, from a strong noise regime, where there is only one solution left. In the weak noise regime, the instabilities are cluster-type instabilities; in the strong noise regime, the instability is the oscillatory rate instability.
confined to a range of values of σ between 0.05 and 0.2 mV for the range of α shown in the figure.

3 Networks of EIF Neurons

3.1 The Model. In the following, we consider a network of N inhibitory, fully connected exponential integrate-and-fire neurons (EIF; Fourcaud-Trocmé et al., 2003) receiving noisy external inputs. In the EIF, the fixed threshold condition of the LIF neuron is replaced by a spike-generating
current that depends exponentially on voltage. In this model, the voltage of neuron i is described by

τm V̇i(t) = −Vi(t) + ΔT exp((Vi(t) − VT)/ΔT) + Irec(t) + Ii,ext(t),    (3.1)
where the external and the recurrent currents are modeled as in the LIF network. The parameter VT is the highest voltage at which the membrane potential can be maintained by injecting a steady external input, and the slope factor ΔT measures the sharpness of the action potential generation. When the external current is large enough, the voltage diverges to infinity in finite time, owing to the exponential term on the right-hand side of equation 3.1. This divergence defines the time of the spike. At that time, the voltage is reset to a fixed value Vr, where it remains during an absolute refractory period τARP. Unless specified otherwise, the results presented below were obtained for the following set of parameters: τm = 10 ms, VT = 5.1 mV, Vr = −3 mV, τARP = 1.7 ms, and ΔT = 3.5 mV (Fourcaud-Trocmé et al., 2003). The EIF model is a good compromise between the simplified LIF neuron, which has an unrealistic spike generation mechanism, and more realistic HH-type models. It is simple enough that analytical techniques can be applied to study its dynamics. Moreover, simple HH models can be mapped onto EIF models, as shown in Fourcaud-Trocmé et al. (2003). In the following, we study the stability properties of the asynchronous state in this model and compare them to those of the LIF model, which can be thought of as an EIF neuron with infinitely sharp spike initiation (ΔT = 0). More generally, we investigate how the synchronization properties of the EIF network depend on the sharpness of the spike initiation.

3.2 Stability of the Asynchronous State. The approach described in section 2 can be applied to study the stability of the asynchronous state of the EIF network. Marginal modes are still determined by equations 2.24 to 2.27, but A_N(ω) and Φ_N(ω) now represent the amplitude and the phase shift of the instantaneous firing rate modulation of a single EIF neuron in response to an oscillatory input at frequency ω. Fourcaud-Trocmé et al. (2003) computed these functions in the low- and high-frequency limits. Obtaining them analytically at any ω is a difficult task. Hence, we compute them for various noise levels using numerical simulations. Examples are shown in Figure 8. Once Φ_N(ω) and A_N(ω) are known, we solve equations 2.24 to 2.26 to find the frequency of the unstable modes at the onset of the instabilities, together with the critical coupling strength at which these instabilities appear. The instability spectrum derived using this approach is plotted in the σ − J plane in Figure 9A for ν0 = 30 Hz, τl = 1 ms, τr = 1 ms, τd = 6 ms. In this figure, only the lines corresponding to k = 0 and n = 1, 2, 3 are shown. This instability spectrum bears some resemblance to the one obtained in the LIF
[Figure 8 here. Panel A: neuronal phase shift (degrees, 0–180) versus frequency (Hz, 0–100). Panel B: total phase shift (degrees, 0–360) versus frequency (Hz, 0–100).]
Figure 8: (A) Single-neuron phase shift for low noise (small circles, σ = 0.5 mV) and high noise (large circles, σ = 5 mV). Note that for low noise, the phase shift increases sharply when the frequency is close to an integer multiple of the stationary frequency (here, 30 Hz) and decreases between two successive multiples, while for high noise, the phase increases monotonically with frequency. (B) Total phase shifts (neuronal + synaptic, the left-hand side of equation 2.24) for the same two noise levels as in A. The intersections of the curves with the horizontal dotted line at 180 degrees give the frequencies of the oscillatory instabilities. Note that for low noise, these intersections are close to integer multiples of the firing rate (∼30, 60, 90 Hz) and to intermediate frequencies (∼45, 75 Hz), while for high noise, there is a single intersection around 50 Hz that is unrelated to the single-cell firing frequency.
[Figure 9 here. Panel A: network frequency (Hz, 20–100) versus noise (mV, 0.1–10), with branches labeled n = 1, 2, 3 and a fast-oscillation line. Panel B: total synaptic strength (mV, 1–100) versus noise (mV), with the region marked "asynchronous state stable".]
Figure 9: Oscillatory instabilities in the EIF model (circles and solid lines) and in simulations of the Wang-Buzsáki model (stars and dashed lines; see details in section 4). (A) Frequency of the oscillatory instabilities versus noise. (B) Critical lines on which eigenmodes become unstable in the σ − J plane. In both panels, cluster instabilities with n = 1, 2, 3, corresponding to the resonances at 30, 60, 90 Hz (see Figure 8), are shown. Note that the frequency and the critical coupling strength on the stability boundary of the asynchronous state are very close in the two models.
model for the same parameters (see Figure 3A). However, there are several significant differences between the two figures. One is that in the LIF model, the marginal modes for k = 0 have indices n ≥ 3, whereas the first mode to become unstable in the EIF model has n = 1. Moreover, in the EIF network, the oscillatory rate instability for k = 0 has an index n = 1, whereas it is n = 4 in the LIF network for the same parameters. As a consequence, the oscillations that emerge from the rate instability are slower in the EIF model than in the LIF model (40–70 Hz versus 90–120 Hz; compare Figure 9B with Figure 3B). Another difference is that the noise level at which all instabilities but the oscillatory rate instability have disappeared is ten times greater in the EIF than in the LIF. These differences can be understood by comparing the functions Φ_N(ω) and Φ_N(ω) + Φ_S(ω) in the two models. These functions are plotted in Figure 8 for the EIF neuron (same parameters as in Figure 9A) for weak and strong noise. For low noise levels, the neuronal phase shift Φ_N(ω) displays sharp variations close to integer multiples of the average firing frequency ν0. This is similar to what happens in the LIF model. However, in contrast to the LIF model, where Φ_N(ω) is close to 90 degrees at the peaks, here it is close to 180 degrees. Hence, for fixed synaptic time constants, solutions to equation 2.24 exist for lower values of n in the EIF model than in the LIF model. In particular, in the EIF model, solutions to equation 2.24 exist for k = 0, n = 1, and n = 2, while this is not the case in the LIF model. Similarly, one can understand why for high noise levels, the frequency of the oscillatory rate mode is typically smaller in the EIF than in the LIF. This is because the function Φ_N(ω) increases monotonically from 0 to 90 degrees in the EIF, whereas in the LIF, it is smaller than 45 degrees (Fourcaud-Trocmé et al., 2003; Geisler, Brunel, & Wang, 2005). Finally, the sharp variations of Φ_N(ω) at integer multiples of the frequency ν0 are more resistant to noise in the EIF neuron (they disappear at larger values of the noise) than in the LIF neuron. As a matter of fact, the clustering instabilities persist until the noise level is in the range 1 to 2 mV, which is much larger than in the LIF model. The effects of changing the synaptic time constants or the spike sharpness parameter ΔT on the stability of the asynchronous state are depicted in Figure 10. This figure shows how the nature of the instability on the asynchronous state stability boundary depends on the noise level and on a global scaling factor α of the synaptic time constants. Three values of ΔT are considered: ΔT = 3.5 mV, ΔT = 1 mV, and ΔT = 0 (the LIF model, already shown in Figure 7, included for comparison). At low noise levels and fixed α, clustering instabilities occur with an index n that increases when ΔT decreases. This reflects the fact that the amplitude of the peaks of Φ_N(ω) decreases as ΔT decreases, and hence solutions of equation 2.24 with small n disappear. This effect can be compensated for by increasing α, that is, by making the synapses slower. The n = 1 cluster instability for ΔT = 0 requires
[Figure 10 here. Synaptic scale factor α (0.5–3) versus noise (mV, 0.01–10), with cluster-number regions (n = 1–8) for the LIF model (ΔT = 0) and the EIF model (ΔT = 1 mV and ΔT = 3.5 mV).]
Figure 10: The nature of the unstable modes on the stability boundary of the asynchronous state as a function of noise and of a global scaling factor α of the synaptic kinetics. The results are displayed for three values of ΔT: ΔT = 0 (LIF model, thin lines and labels; see Figure 7); ΔT = 1 mV (intermediate lines and labels); ΔT = 3.5 mV (thick lines and labels). For clarity, the dotted line for ΔT = 3.5 mV is truncated for σ < 1 mV.
synapses three times slower than for ΔT = 3.5 mV. The cluster instabilities are more resistant to noise when ΔT increases (for the n = 1 instability, up to about 0.2 mV for ΔT = 0, 1 mV for ΔT = 1 mV, and 2 mV for ΔT = 3.5 mV), reflecting the fact that the peaks in Φ_N(ω) are more resistant to noise for larger
ΔT. Finally, the frequency of the firing rate oscillations that appear in the high-noise region also depends on ΔT: as ΔT increases, the frequency of these oscillations decreases. Frequencies in this regime range from 40 to 70 Hz for ΔT = 3.5 mV, from 60 to 90 Hz for ΔT = 1 mV, and from 90 to 120 Hz for ΔT = 0 (the LIF model).

4 Network of Conductance-Based Neurons

In this section, we study networks of conductance-based neurons in which the action potential dynamics involve sodium and potassium currents. Using numerical simulations, we characterize the instabilities by which
synchronous oscillations emerge in these models and compare the results with those presented above for the EIF and the LIF. In particular, we examine whether the fact that in the EIF the frequency of the rate oscillations increases with the sharpness of the spike initiation also holds in simple conductance-based models.

4.1 The Models. We consider single-compartment conductance-based models in which the membrane potential, V, of a neuron obeys the equation

C dV/dt = I_L − Σ_ion I_ion + Irec(t) + Iext(t),    (4.1)
where C is the capacitance of the neuron, I_L = −g_L(V − V_L) is the leak current, Σ_ion I_ion is the sum over all voltage-dependent ionic currents, Iext is the external current, and Irec denotes the recurrent current received by the neuron. The voltage-dependent currents are an inactivating sodium current, I_Na = g_Na m∞³ h (V − V_Na), and a delayed rectifier potassium current, I_K = g_K n⁴ (V − V_K). As in Wang and Buzsáki (1996), the activation of the sodium current is assumed to be instantaneous,

m∞(V) = αm(V) / (αm(V) + βm(V)),
while the kinetics of the gating variables h and n are given by (see, e.g., Hodgkin & Huxley, 1952)

dx/dt = αx(V)(1 − x) − βx(V)x,    (4.2)
with x = h, n. The functions αx and βx for the three models we consider are given in appendix D. For simplicity, we neglect the driving force in the synaptic interactions. Hence, the recurrent current has the form

Irec(t) = −(G/N) Σ_{j=1}^{N} Σ_k s(t − t_j^k),    (4.3)
where the function s(t) is given by equation 2.4 (in which τm = C/g_L), and G, which has the dimension of a current density (mA/cm²), measures the strength of the synapses. The external input is also modeled as a noisy current,
Iext(t) = I0 + σ √(C g_L) η(t),    (4.4)
where I0 is a constant DC input, η(t) is a white noise with zero mean and unit variance, and σ, which has the dimension of a voltage, measures the amplitude of the temporal fluctuations of the external input.

4.2 Characterization of the Instability of the Asynchronous State. To characterize the degree of synchrony in the network, we define the population-averaged membrane potential,

V̄(t) = (1/N) Σ_{i=1}^{N} Vi(t),    (4.5)
and the variance

σV² = ⟨[V̄(t)]²⟩t − [⟨V̄(t)⟩t]²    (4.6)
of its temporal fluctuations, where ⟨· · ·⟩t denotes time averaging. After normalization of σV² to the population average of the variances of the single-cell membrane potentials, σVi² = ⟨[Vi(t)]²⟩t − [⟨Vi(t)⟩t]², one defines χ(N) (Hansel & Sompolinsky, 1992, 1996; Golomb & Rinzel, 1993, 1994; Ginzburg & Sompolinsky, 1994),

χ(N) = [σV² / ((1/N) Σ_{i=1}^{N} σVi²)]^{1/2},    (4.7)
which varies between 0 and 1. The central limit theorem implies that in the limit N → ∞, χ(N) behaves as

χ(N) = χ∞ + δχ/√N + O(1/N),    (4.8)
where χ∞ is the large N limit of χ(N) and δχ measures the finite size correction to χ at the leading order. For a given strength of the coupling,
the network is in the asynchronous state for sufficiently large noise. This means that χ∞ = 0. When the noise level decreases, the asynchronous state loses stability via a Hopf bifurcation, and synchrony appears: χ∞ > 0. Near the bifurcation, χ∞ behaves at the leading order as

χ∞ = A(σc − σ)^γ    for σ < σc,    (4.9)
χ∞ = 0    for σ > σc,    (4.10)
with an exponent γ = 1/2 (Kuramoto, 1984; van Vreeswijk, 1996). To locate the noise level at which the instability of the asynchronous state occurs for a given value of the synaptic coupling, we simulate networks of various sizes N. Comparing the values of χ(N) for the different N, we estimate δχ and χ∞ as functions of σ. The obtained values of χ∞ are subsequently fitted according to equation 4.10. In a network of LIF neurons, in which the stability boundary of the asynchronous state can be computed analytically, we show in appendix C that this method gives an accurate prediction of that boundary for network sizes on the order of 1000. The results of this analysis for the Wang-Buzsáki network and two values of the synaptic coupling, G = 2 mA/cm² and G = 12 mA/cm², are plotted in Figure 11. For G = 2 mA/cm², δχ(σ) can be reliably estimated from simulations with N = 800 and N = 3200. For G = 12 mA/cm², the finite size effects are larger because the coupling is stronger, and a substantially better account of these effects requires simulations of larger networks (N = 1600 and N = 6400). This is shown in Figures 11B and 11C. Still, the estimates of σc obtained from the fits in these two figures are very close. Once the instability has been located, we compute the frequency of the population oscillations at the onset of the instability. To this end, we simulate the network at a noise level σ ≈ σc. A good estimate of the oscillation frequency is provided by the autocorrelation of the population average of the membrane potentials of the neurons. If the bifurcation at σc is supercritical, the frequency estimates on the two sides of the transition are similar. If the bifurcation is subcritical, they may differ significantly; in that case, the frequency of the unstable mode has to be determined in the vicinity of the transition, on the side where the asynchronous state is still stable. In fact, in the simulations reported below, the bifurcations were found to be supercritical. We checked on several examples that the frequency estimates thus obtained were only weakly sensitive to the size of the network for N > 800. Hence, in the results displayed below, the population frequency was estimated from simulations of networks with N = 800. The traces of two neurons, the membrane potential averaged over the whole network, and its autocorrelation in the vicinity of the onset of the asynchronous state instability are shown in Figure 12 for G = 12 mA/cm². The oscillations present in the population activity are not clearly reflected in the spiking activity at the single-cell level, which is highly
[Figure 11 here. Panels A–C: the synchrony measure χ (0–0.6) versus σ (mV; 1–3 in A, 6.5–9.5 in B and C) for the network sizes listed in the caption.]
Figure 11: The measure of synchrony χ as a function of the noise in the Wang-Buzsáki model for two values of the coupling. In each panel, the finite size correction δχ is estimated by comparing the simulation results (circles and crosses) for two sizes of the network. Subtracting the finite size correction leads to estimates of χ∞ (squares) as a function of σ, which are fitted using equation 4.10 (solid lines). (A) G = 2 mA/cm²; crosses: N = 800; circles: N = 3200; fit: σc = 1.61 mV, A = 0.84 mV^{−1/2}. (B) G = 12 mA/cm²; crosses: N = 800; circles: N = 3200; fit: σc = 7.68 mV, A = 0.48 mV^{−1/2}. (C) G = 12 mA/cm²; crosses: N = 1600; circles: N = 6400; fit: σc = 7.62 mV, A = 0.5 mV^{−1/2}. In all these simulations, τl = 1 ms, τr = 1 ms, τd = 6 ms. The DC part of the external input, I0, was tuned to obtain an average firing rate of the neurons of 30 ± 0.5 Hz near the onset of synchrony.
irregular. In contrast, the population average of the membrane potentials (or the population activity, not shown) and its autocorrelation reveal the existence of population oscillations at a frequency of about 70 Hz, compared to an average firing rate of the neurons of 29.5 Hz.
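For concreteness, the sketch below shows how the synchrony measure and finite-size analysis of equations 4.5 to 4.8 could be computed from simulated voltage traces. It is a minimal NumPy implementation under our own naming; array shapes and function names are assumptions, not part of the original.

```python
import numpy as np

def chi(V):
    """Synchrony measure chi(N) of equation 4.7 for an array V of shape
    (N, T): membrane potentials of N neurons at T sampled times."""
    v_bar = V.mean(axis=0)                 # population average, equation 4.5
    var_pop = v_bar.var()                  # sigma_V^2, equation 4.6
    var_cells = V.var(axis=1).mean()       # (1/N) * sum_i sigma_{V_i}^2
    return np.sqrt(var_pop / var_cells)    # lies between 0 and 1

def extrapolate_chi(chi_1, n_1, chi_2, n_2):
    """Estimate chi_inf and the finite-size coefficient delta_chi from two
    network sizes, using chi(N) ~ chi_inf + delta_chi / sqrt(N) (eq. 4.8)."""
    a_1, a_2 = n_1 ** -0.5, n_2 ** -0.5
    delta_chi = (chi_1 - chi_2) / (a_1 - a_2)
    return chi_1 - delta_chi * a_1, delta_chi

# The resulting chi_inf(sigma) values can then be fitted on the
# synchronized side with chi_inf = A * sqrt(sigma_c - sigma)
# (equation 4.9), e.g. with scipy.optimize.curve_fit, to locate sigma_c.
```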
[Figure 12 here. Panels A and B: voltage traces (scale bars 25 mV and 5 mV; 500 ms). Panel C: autocorrelation AC (mV²) versus lag t (ms, −100 to 100).]
Figure 12: The Wang-Buzsáki network near the onset of instability of the asynchronous state. N = 800, G = 12 mA/cm², σ = 7.75 mV. Synaptic time constants and delay as in Figure 11. The average firing rate of the neurons is 29.5 Hz. The coefficient of variation of the interspike interval distribution is 0.72. (A) Membrane potential of two neurons in the network. The noise masks the fast oscillations of the subthreshold membrane potential. (B) The fast oscillations of the membrane potential are revealed by averaging over many neurons (here, over all the neurons in the network). (C) Autocorrelation of the population-averaged membrane potential (averaged over 2 s). The frequency of the oscillations is 66 Hz.
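The 66 Hz value quoted in the caption is read off the lag of the first peak of this autocorrelation (about 15 ms). A minimal sketch of such an estimate, with scaffolding of our own choosing, is:

```python
import numpy as np

def population_frequency(v_bar, dt):
    """Estimate the population oscillation frequency (Hz) from the lag of
    the first positive-lag peak of the autocorrelation of the
    population-averaged potential v_bar, sampled every dt seconds.

    Assumes the autocorrelation does show a peak (i.e., the network
    oscillates); a monotonically decaying autocorrelation has none."""
    x = v_bar - v_bar.mean()
    ac = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0, 1, 2, ...
    k = 1
    # Walk past the zero-lag maximum down to the first trough, then up
    # to the first local maximum.
    while k + 1 < ac.size and not (ac[k] > ac[k - 1] and ac[k] >= ac[k + 1]):
        k += 1
    return 1.0 / (k * dt)
```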
4.3 The Stability Boundary of the Asynchronous State. We performed simulations for the three models described in appendix D. These models differ in the sharpness of the activation function of their sodium current. In each model, we varied the synaptic coupling strength and looked for the critical noise level, σc, at which the asynchronous state becomes unstable. The external input was adjusted together with G so that for σ near σc, the average firing rate of the neurons was 30 ± 0.5 Hz. Once the transition was located, the frequency of the population oscillations emerging at this transition was estimated, as explained above. The results obtained for the Wang-Buzsáki model are summarized in Figure 9. The agreement with the predictions from the EIF model (see
[Figure 13 here. Panel A: activation functions m∞(V) versus V (mV, −100 to 50). Panel B: firing frequency f (Hz, 0–250) versus current I (mA/cm², −2 to 2). Panels C and D: spike shapes and voltage traces (scale bars 20 mV; 1 ms and 100 ms).]
Figure 13: Sodium activation curves and firing properties of the conductance-based models. In the three top panels, solid line: Wang-Buzsáki model; dashed line: model 1; dotted line: model 2. (A) Activation functions m∞(V). (B) Current-frequency (f-I) curves of the three models. (C) Action potentials of the three models. (D) Voltage traces in response to a constant current step. From left to right: model 1, Wang-Buzsáki model, model 2.
section 3) is remarkably good, with larger discrepancies at small coupling and therefore small σ. This suggests that for high noise and coupling strength, the instability is mainly determined by the synaptic time constants and delays and by the properties of the sodium current responsible for spike initiation. In the EIF model, the frequency of the unstable mode at the instability onset depends on the parameter ΔT, which characterizes the sharpness of the spike initiation driven by the sodium current. In fact, we found in section 3 that this frequency increases when ΔT decreases. We have also shown that the index n of the rate oscillatory instability increases when ΔT decreases. To test whether the sharpness of spike initiation has similar effects in the conductance-based models, we simulated networks of neurons that differ from the Wang-Buzsáki (WB) model only in the function m∞. The activation functions of these models are given in appendix D and are plotted in Figure 13. The slopes at half-height are 2.01 × 10⁻² mV⁻¹ (dashed line, model 1) and 3.44 × 10⁻² mV⁻¹ (dotted line, model 2), compared to 2.64 × 10⁻² mV⁻¹ in the WB model (solid line; see Figure 13A). The change in the activation curve has a substantial effect on the threshold of the f-I curve (see Figure 13B) and on the shape of the spikes (see Figure 13C).
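The slope quoted for the Wang-Buzsáki model can be checked numerically from the rate functions of appendix D (equations D.1 and D.2). The sketch below does this; the helper names are ours, and the printed value is what the quoted 2.64 × 10⁻² mV⁻¹ should reproduce under those rate functions.

```python
import numpy as np

def alpha_m(v):
    """Sodium activation rate of equation D.1, with the removable
    singularity at v = -35 handled (the limit there is 1.0)."""
    u = np.asarray(v, dtype=float) + 35.0
    small = np.abs(u) < 1e-9
    denom = np.where(small, 1.0, 1.0 - np.exp(-u / 10.0))
    return np.where(small, 1.0, 0.1 * u / denom)

def beta_m(v):
    """Sodium deactivation rate of equation D.2."""
    return 4.0 * np.exp(-(np.asarray(v, dtype=float) + 60.0) / 18.0)

def m_inf(v):
    """Steady-state activation m_inf = alpha_m / (alpha_m + beta_m)."""
    a = alpha_m(v)
    return a / (a + beta_m(v))

def slope_at_half_height(v_lo=-80.0, v_hi=20.0, n=200_001):
    """Numerical slope d(m_inf)/dV (mV^-1) where m_inf crosses 1/2."""
    v = np.linspace(v_lo, v_hi, n)
    m = m_inf(v)
    i = int(np.argmin(np.abs(m - 0.5)))
    return (m[i + 1] - m[i - 1]) / (v[i + 1] - v[i - 1])

print(slope_at_half_height())   # ~2.64e-2 mV^-1, as quoted in the text
```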
[Figure 14 here. Panel A: G (mA/cm², 0–15) versus σ (mV, 0–8), separating "Stable" and "Unstable" regions. Panel B: oscillation frequency (Hz, 30–80) versus σ (mV, 0–8).]
Figure 14: (A) Stability boundary of the asynchronous state in the σ − G plane. (B) Frequency of the population oscillations near the onset of synchrony as a function of the noise. In both panels, solid line: Wang-Buzsáki model; dashed line: model 1; dotted line: model 2.
The stability boundary of the asynchronous state and the frequency of the population oscillations on this boundary are plotted for the three models in the σ − G plane in Figure 14. In all the models, for sufficiently weak noise or sufficiently weak coupling, the frequency is close to the average firing rate of the neurons, ν0 = 30 Hz. This indicates that in this limit, the index of the instability mode is n = 1 (with k = 0) for the three models. The frequency
[Figure 15 here. Two single-neuron voltage traces and the population-averaged voltage (scale bars 20 mV and 10 mV; 250 ms), with stars marking antiphase spikes.]
Figure 15: Clustering in model 2 for G = 3 mA/cm² and σ = 1.2 mV. Synaptic time constants: τr = 1 ms, τd = 6 ms, τl = 1 ms. Size of the network: N = 1600. The pattern of synchrony corresponds to a smeared two-cluster state. The two neurons shown (two upper traces) fire in alternating periods in which they are nearly in phase and nearly in antiphase. Stars indicate spikes in antiphase. The maxima of the population-averaged voltage coincide with the action potential of at least one of the two neurons (lower traces).
increases with the noise. In the Wang-Buzsáki model and in model 1, the increase is smooth. Hence, the index of the instability remains n = 1 at large coupling, and the destabilization of the asynchronous state always occurs via the n = 1 rate oscillatory instability. At σ ≈ 3 mV (G ≈ 5 mA/cm²), the frequencies of the oscillations in the Wang-Buzsáki model and in model 1 begin to differ. The frequency of the oscillation is smaller in model 1 than in the Wang-Buzsáki model. This is consistent with our analysis of the EIF model, which predicts that in the high-noise regime, the frequency of the rate oscillatory mode should decrease as the spike initiation becomes less sharp (larger ΔT). In model 2, the population frequency jumps to a value close to 60 Hz for σ ≈ 1 mV (G ≈ 2.5 mA/cm²). Beyond this value, it increases smoothly with σ. This indicates that the index of the instability changes from n = 1 to n = 2 as the coupling (or equivalently, the noise) increases and that the index of the rate oscillatory instability is n = 2. Just after the change in n, the instability leads to a two-cluster state. This is confirmed in Figure 15, where the membrane potentials of two neurons are plotted together with the population average of the membrane potential for G = 3 mA/cm². However, because of the local noise, neurons do not belong to the same cluster all the time but rather switch between the two clusters. Still, on average, at any time each cluster comprises half of the neurons in the network (not shown). This behavior, and the fact that in the high-noise
[Figure 16 here. Voltage traces and population-averaged voltage for α = 1 (left) and α = 0.5 (right); scale bars 20 mV and 10 mV; 250 ms; stars mark antiphase spikes.]
Figure 16: The effect of synaptic kinetics on the pattern of synchrony near the onset of instability of the asynchronous state in model 2. The coupling strength is G = 1 mA/cm². The size of the network is N = 1600. The average firing rate of the neurons is about 30 Hz. (Left) The control case: the synaptic rise and decay times are τr = 1 ms and τd = 6 ms, respectively, and the synaptic delay is τl = 1 ms. The spikes of the two neurons are well synchronized, and both fire at almost every cycle of the oscillations of the population-averaged voltage. Noise: σ = 0.66 mV. (Right) Fast synapses: τr = 0.5 ms, τd = 3 ms, τl = 0.5 ms. The pattern of synchrony corresponds to a smeared two-cluster state. The two neurons fire in alternating periods in which they are nearly in phase and nearly in antiphase (spikes indicated by a star). The maxima of the population-averaged voltage coincide with the action potential of at least one of the two neurons. Noise: σ = 0.57 mV.
regime the population oscillations are faster in model 2 than in the Wang-Buzsáki model, are in line with the conclusions of our analysis of the LIF and EIF networks. Finally, we found that for the conductance-based models, fast synapses and short delays favor clustering, as predicted by Figure 10. An example is shown in Figure 16, where the voltage traces of two neurons are plotted together with the average membrane potential for two sets of synaptic time constants and delays. In the control condition (α = 1), the two neurons always tend to spike in synchrony, and their spikes coincide most of the time with the maximum of the population oscillation. For synapses and delays twice as fast (α = 0.5), the two neurons alternate between periods of nearly in-phase and nearly antiphase spiking, while their spikes still coincide in general with the maxima of the oscillations of the population-averaged voltage.

5 Discussion

Our study sheds new light on the instabilities of the asynchronous state in networks of inhibitory neurons in the presence of noisy external input. Previous
studies investigated the synchronization properties of inhibitory neurons in fully connected networks at zero noise or in the weak noise regime (Abbott & van Vreeswijk, 1993; Treves, 1993; van Vreeswijk, 1996; Gerstner, van Hemmen, & Cowan, 1996; Wang & Buzsáki, 1996; White et al., 1998; Neltner, Hansel, Mato, & Meunier, 2000), in sparsely connected networks in the absence of noise (Wang & Buzsáki, 1996; Golomb & Hansel, 2000), or in sparsely connected networks in the strong noise-strong coupling region (Brunel & Hakim, 1999; Tiesinga & José, 2000; Brunel & Wang, 2003). These studies found qualitatively distinct types of instabilities of the asynchronous state in the weak and strong noise regimes. This article shows how the two types of instabilities are related when the noise level is varied and investigates for the first time stochastic synchrony in the simpler framework of fully connected networks.

5.1 The Two Types of Eigenmodes in Large Neuronal Networks of Inhibitory Neurons. Our main result is the existence of two types of eigenmodes of the dynamics linearized around the asynchronous state, which differ in the effect of noise on their stability. One type of mode can be unstable only if the noise is sufficiently small. Such an instability occurs when the neurons lock resonantly to the oscillatory synaptic input. When the noise is too high, resonant locking is destroyed and these modes are stable. We termed these eigenmodes clustering modes, because when destabilized, they lead to clustering. Eigenmodes of the second type can be destabilized at weak noise and weak coupling, leading to clustering, but also at strong noise, provided the coupling is sufficiently strong, leading to a coherent modulation of the firing probability of the neurons. In this regime, the network displays stochastic synchrony. We termed the eigenmodes undergoing this instability oscillatory rate modes. Which of the clustering eigenmodes or oscillatory rate modes becomes unstable first when the coupling strength increases depends on the noise level, the synaptic kinetics, and the intrinsic properties of the neurons. However, as a general rule, clustering eigenmodes are the first to become unstable at low noise levels and for fast synapses. At sufficiently large noise, the oscillatory rate mode is the only remaining unstable mode. The transition between these two regimes of synchrony and the frequency of the stochastic oscillations depend on the synaptic and single-cell properties, as briefly discussed below. In particular, if the synapses are sufficiently fast, the frequency of the emerging oscillations can be significantly faster than the firing rate of the individual neurons (Brunel & Hakim, 1999; Brunel & Wang, 2003). Previous studies have shown that in networks of integrate-and-fire inhibitory neurons, a discrete spectrum of eigenmodes exists at zero noise (Abbott & van Vreeswijk, 1993; Treves, 1993; van Vreeswijk, 1996; Hansel & Mato, 2003). Abbott and van Vreeswijk (1993) showed that such eigenmodes become stable at very low values of noise, in a model with
phase noise and no latency. However, the existence of specific eigenmodes displaying instabilities robust to noise had not been reported in those studies.

5.2 Beyond the Instability Line of the Asynchronous State. We also used numerical simulations to explore the dynamics of both the LIF and the Wang-Buzsáki networks in the synchronous regime beyond the stability boundary of the asynchronous state. A detailed description of the dynamics of the various synchronized states of the LIF networks is presented in appendix C. Using numerical simulations, we showed that the bifurcation leading to the stochastic synchronous state is supercritical, consistent with the results of Brunel and Hakim (1999). We also found multistability between different types of cluster states in the low-noise, low-coupling region, with cluster states disappearing one after the other as the noise level increases.

5.3 On the Role of Intrinsic Properties in Stochastic Synchrony. Theoretical studies have shown that the intrinsic properties of neurons have a strong influence on the stability of the asynchronous state in large neuronal networks at weak noise (Hansel et al., 1995; van Vreeswijk & Hansel, 2001; Ermentrout, Pascal, & Gutkin, 2001; Pfeuty et al., 2003, 2005). A key concept for grasping this influence is the phase response function, which characterizes the way a tonically firing neuron responds to small perturbations (Kuramoto, 1984; Ermentrout & Kopell, 1986; Hansel, Mato, & Meunier, 1993; Rinzel & Ermentrout, 1988). The shape of the phase response function depends on the intrinsic properties of the neurons. In conductance-based models, it is determined by the sodium, calcium, and potassium currents involved in the neuronal dynamics (Hansel et al., 1993, 1995; van Vreeswijk, Abbott, & Ermentrout, 1994; Ermentrout et al., 2001). Hence, the single-neuron dynamics can be related to the network dynamics. This idea has been applied in the framework of simple integrate-and-fire models as well as in conductance-based models (for reviews, see Golomb, Hansel, & Mato, 2001; Mato, 2005). In the strong noise-strong coupling region, the phase response function is no longer relevant, and other approaches, such as the one used in this article, are required.

5.3.1 The Effect of Spike Initiation and Repolarization. Besides the synaptic time constants, we showed that the sharpness of the spike initiation is an important parameter affecting the quantitative features of the stochastic synchronous oscillations at their onset (see also Geisler et al., 2005). In the EIF model, the frequency of the stochastic oscillations and the transition between the clustering and rate oscillation regions are strongly affected by the parameter ΔT. The sharpness of spike initiation greatly influences the noise amplitude at which the transition between the two modes of synchrony occurs. For LIF neurons (ΔT = 0), this transition occurs at very
low noise levels (below 0.1 mV). When the parameter ΔT increases, the transition moves to higher noise levels. For instance, it is larger than 1 mV when ΔT = 3.5 mV. Similarly, in the conductance-based models, we found a significant influence of the shape of the function m∞ on the frequency of the stochastic synchronous oscillations and on the noise level required for their appearance. In contrast to the role of the spike initiation dynamics, our work suggests that the dynamics of membrane potential repolarization following an action potential are much less critical for stochastic synchronous oscillations. In fact, for the quantitative features of the stochastic synchronous oscillation instability in the Wang-Buzsáki and EIF models to be similar, it is sufficient to choose the parameter ΔT so as to match the spike initiation dynamics of the EIF to those of the Wang-Buzsáki model.

5.3.2 The Effect of Subthreshold Active Currents on Stochastic Oscillations. We have focused on inhibitory networks of integrate-and-fire neurons and of specific HH-type neurons. In particular, the panoply of ionic currents in the conductance-based models we studied is limited to the standard fast sodium current and the delayed-rectifier potassium current. The synchronization properties of neurons with additional types of ionic currents, including those significantly activated in the subthreshold range, remain to be explored in the high-noise regime. However, we believe that the stochastic synchronous state should be robust to the addition of such currents, because they do not substantially affect the modulation of the neuronal discharge in response to oscillatory inputs at high frequency (Fourcaud-Trocmé et al., 2003).

5.4 The Effect of Temporal Correlations in the External Noisy Input. In this article, we have considered white noise. Colored noise modifies the phase shift of LIF neurons at high frequencies: in this limit, the phase lag is 0 degrees for colored noise, compared to 45 degrees for white noise (Brunel et al., 2001). One consequence is a larger population frequency in the stochastic synchronous state (Brunel & Wang, 2003). Note that in the particular case of synaptic currents with no latency, such a change in the properties of the noise can lead to a drastic change in the synchronization properties of the network. Indeed, with white noise, the network displays the stochastic synchronous instability in the high-noise regime, whereas for sufficiently colored noise, no such instability can be found, because in this case the sum of the synaptic and neuronal phase lags is bounded by 180 degrees. For EIF and conductance-based neurons, the differences in the neuronal firing rate response between white and colored noise are much less important (Fourcaud-Trocmé et al., 2003), and therefore the synchronization properties of networks of such neurons should depend only weakly on the nature of the noise.
For simplicity, we have also considered current-based rather than conductance-based inputs. Conductance-based inputs are expected to yield qualitatively similar results, though the frequency of the stochastic oscillation is known to increase as the input conductance of the neurons increases (Geisler et al., 2005).

5.5 Perspectives. In the models studied here, all the neurons have the same intrinsic properties, and the connectivity is all-to-all. The addition of heterogeneities in cellular properties and in the external inputs would tend to stabilize the modes leading to clustering instabilities (Neltner et al., 2000; Hansel & Mato, 2001, 2003). If the heterogeneities are too strong, these modes are stable for any value of the coupling. This fragility of cluster states to heterogeneity is due to the fact that the appearance of such states relies on neuronal resonances at integer multiples of the firing rate; any heterogeneity leading to pronounced cell-to-cell variability of the firing rates will therefore destroy such states. In contrast, we conjecture that the instability of the oscillatory rate mode is very robust to heterogeneities, although the coupling strength at which it occurs is likely to depend on the level of heterogeneity. This robustness would be due to the fact that synchrony in this regime no longer depends on resonant peaks at integer multiples of the single-cell firing rate. More work is required to confirm this conjecture. Brunel and Hakim (1999) found stochastic synchrony in a network of N LIF neurons connected at random with an average of C ≫ 1 synapses per neuron, in the limit N → ∞ with C/N ≪ 1. Their analytical approach and the one used in this article are similar in spirit. However, an important difference is that when the connectivity is random, an additional noise term contributes to the synaptic inputs. The study of the stability of the asynchronous state then requires knowing how oscillations in the variance of the inputs to a neuron affect the single-cell firing rate. This analysis can be performed when the synaptic interactions are modeled as a delta function and the presynaptic neurons fire approximately as Poisson processes, as in Brunel and Hakim (1999). Unfortunately, it does not generalize easily to situations in which finite rise and decay times are present (Fourcaud & Brunel, 2002) and/or the neurons fire in a significantly non-Poissonian fashion. Still, we expect that if the connectivity is very large, these fluctuations will not destroy the oscillatory rate instability. In contrast, if the connectivity is too sparse, the spatial fluctuations in the synaptic inputs, which increase with the synaptic strength, can prevent the oscillatory rate instability from occurring (see, e.g., Golomb & Hansel, 2000). This will happen if the connectivity C is smaller than some critical number that depends on the synaptic kinetics, the average firing rate of the neurons, and their intrinsic properties. The exact conditions for the existence of the oscillatory rate instability in such sparse networks should be clarified in future work.
Appendix A: Linear Stability Analysis of the Asynchronous State in LIF Networks

The asynchronous state described by equations 2.15 to 2.17 is stable if any small perturbation from it decays back to zero. In the mean field framework, a perturbation of the asynchronous state can be described by its effect on the average firing rate, on the PDF, and on the recurrent current, as follows:

ν(t) = ν0 + εν1(t)    (A.1)
P(V, t) = P0(V) + εP1(V, t)    (A.2)
Irec(t) = Jτm ν0 + εI1(t)    (A.3)

where ε ≪ 1, and ν1(t), P1(V, t), and I1(t) are finite. Inserting these expressions in equations 2.9, 2.10, and 2.14, keeping only the leading order in ε, and looking for solutions P1, ν1, I1 ∝ exp(λt), with λ a complex number, yields

1 = [−Jν0τm exp(−λτl) / (σ(1 + λτd)(1 + λτr)(1 + λτm))] × [∂U/∂y(yt, λ) − ∂U/∂y(yr, λ)] / [U(yt, λ) − U(yr, λ)],    (A.4)
where yt and yr are given in equations 2.16 and 2.17, and the function U(y, λ) is given in terms of combinations of hypergeometric functions (Abramowitz & Stegun, 1970):

U(y, λ) = [e^{y²} / Γ((1 + λτm)/2)] M((1 − λτm)/2, 1/2, −y²) + [2y e^{y²} / Γ(λτm/2)] M(1 − λτm/2, 3/2, −y²).    (A.5)
The asynchronous state is stable if Re(λ) < 0 for all solutions of this equation. Conversely, the existence of at least one solution with Re(λ) > 0 signals that the asynchronous state is unstable. Thus, an onset of instability of the asynchronous state in parameter space is determined by setting λ = iω in equation A.4. At this onset, a Hopf bifurcation occurs, and ω is the frequency of oscillation of the network activity in the corresponding unstable mode. In cases where the Hopf bifurcation is supercritical, ω is also the frequency of the synchronous oscillations that emerge at the instability onset.
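Equation A.4, with U given by equation A.5, can be evaluated numerically. A minimal sketch using mpmath's arbitrary-precision Kummer function M = 1F1 follows; the function names, the finite-difference derivative, and the step size are our own choices made for brevity, not part of the original analysis (the derivative of U can also be obtained from contiguous relations of 1F1).

```python
import mpmath as mp

def U(y, lam, tau_m):
    """The function of equation A.5; M is the Kummer function 1F1."""
    a = lam * tau_m
    e = mp.exp(y ** 2)
    return (e / mp.gamma((1 + a) / 2)
            * mp.hyp1f1((1 - a) / 2, mp.mpf(1) / 2, -y ** 2)
            + 2 * y * e / mp.gamma(a / 2)
            * mp.hyp1f1(1 - a / 2, mp.mpf(3) / 2, -y ** 2))

def dU_dy(y, lam, tau_m, h=mp.mpf("1e-6")):
    """Central-difference derivative of U with respect to y."""
    return (U(y + h, lam, tau_m) - U(y - h, lam, tau_m)) / (2 * h)

def rhs_A4(omega, J, nu0, sigma, y_t, y_r, tau_m, tau_l, tau_r, tau_d):
    """Right-hand side of equation A.4 evaluated on the imaginary axis,
    lam = i*omega; a marginal mode satisfies rhs_A4(...) == 1."""
    lam = mp.mpc(0, omega)
    pref = (-J * nu0 * tau_m * mp.exp(-lam * tau_l)
            / (sigma * (1 + lam * tau_d) * (1 + lam * tau_r)
               * (1 + lam * tau_m)))
    return (pref * (dU_dy(y_t, lam, tau_m) - dU_dy(y_r, lam, tau_m))
            / (U(y_t, lam, tau_m) - U(y_r, lam, tau_m)))
```

Scanning ω and J for points where this quantity equals 1 would trace out the marginal lines of Figure 3.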
Appendix B: Scaling of the Critical Coupling Strength with Noise for Large Noise

When σ goes to infinity, an expansion of equation 2.15 yields

1/(ν0τm) = [(Vt − Vr)/σ] Ψ(yr) + [(Vt − Vr)²/(2σ²)] Ψ′(yr) + · · ·    (B.1)

where Ψ(x) = √π exp(x²)(1 + erf(x)). To keep a finite rate ν0 as σ goes to ∞, yr must go to −∞ as yr ∼ −√ln(σ). In addition,

[∂U/∂y(yt, λ) − ∂U/∂y(yr, λ)] / [U(yt, λ) − U(yr, λ)] ∼ |yr|

in the limit yr → −∞. Equation 2.23 then implies that the critical coupling strength goes as

Jc ∼ σ/√ln(σ)
for large σ.

Appendix C: Numerical Simulations of LIF Networks

Simulations of LIF networks were performed at various levels of J and σ close to the predicted stability boundary of the asynchronous state, for different network sizes (N = 1000, 2000, 4000). The methods are as described in section 4.2. We show in Figure 17 the resulting phase diagram for ν0 = 30 Hz, τl = 1 ms, τr = 1 ms, and τd = 6 ms. At low coupling levels (J < 2 mV), the asynchronous state destabilizes through a subcritical bifurcation. There is a small parameter range where the asynchronous state coexists with the n = 3 cluster state. At low noise levels, at least three types of cluster states coexist (with 1, 2, and 3 clusters). The state selected by the network depends on the initial conditions. These states destabilize successively through discontinuous transitions (for J = 1 mV, n = 1 destabilizes at σ = 0.16 mV, n = 2 at 0.25 mV, and n = 3 at 0.32 mV). Above a coupling level of about 2 mV, the asynchronous state destabilizes through a supercritical Hopf bifurcation to the stochastic synchronous state. The stochastic synchronous state coexists with the other cluster states in some range of noise amplitudes. As the coupling strength increases, the stochastic
[Figure 17 appears here: panels of the phase diagram with noise amplitude (mV) on the horizontal axis and total synaptic strength (mV) on the vertical axis; insets plot the synchrony measure χ and the CV against noise.]
Figure 17: Beyond the linear stability analysis: phase diagram of the network of LIF neurons (ν₀ = 30 Hz, τ_l = 1 ms, τ_r = 1 ms, τ_d = 6 ms) obtained with numerical simulations. Solid lines: instability lines obtained by analytical calculations (same as in Figure 3; for the sake of clarity, only the lines corresponding to n = 3 and n = 4 are shown). Filled triangles indicate the limit of stability of the asynchronous state, as obtained by numerical simulations (see the text for details). Other filled symbols: limits of stability of the n = 1 (circles), n = 2 (squares), and n = 3 (diamonds) cluster states. Open symbols: limits of stability of the firing rate oscillation (diamonds), n = 3 (squares), and n = 2 (circles) states. The insets show how the synchrony level χ and the CV (in both types of panels, triangles: asynchronous state and firing rate oscillation; circles: 1 cluster; squares: 2 clusters; diamonds: 3 clusters) vary in these various states as a function of noise at various coupling strengths (range indicated by dotted lines).
synchronous state merges successively with the n = 3 cluster state (around 8 mV), the n = 2 cluster state (above 12 mV), and finally the n = 1 cluster state (above 30 mV). Thus, at high coupling strengths (see top panel for a coupling strength of 100 mV), there is a single synchronous state that varies continuously from the fully synchronized state at σ = 0 mV (population frequency equal to firing rate) to the stochastic synchronous state close to the bifurcation at σ = 5 mV (population frequency about 90 Hz, much larger than single-cell firing rate, CV close to 1). The phase diagram
shown in Figure 17 is representative of networks with short synaptic time constants, when the population firing rate in the high-noise region is larger than the single-cell firing rate, though the specifics of which cluster states are obtained depend on the parameters. On the other hand, for large synaptic time constants, when the population frequency at high noise is of the same order as or smaller than the single-cell firing rate and the destabilization of the asynchronous state occurs exclusively on the n = 1 instability line, the phase diagram simplifies drastically: the asynchronous state destabilizes at any J through a supercritical bifurcation, a scenario similar to that at high coupling strengths in Figure 17.

Appendix D: The Conductance-Based Models

In all three conductance-based models dealt with in this work, the activation and inactivation functions of the potassium and sodium currents are as in the model of Wang and Buzsáki (1996):

α_m(V) = 0.1(V + 35) / (1 + e^{−(V+35)/10})   (D.1)
β_m(V) = 4 e^{−(V+60)/18}   (D.2)
α_n(V) = 0.03(V + 34) / (1 − e^{−(V+34)/10})   (D.3)
β_n(V) = 0.375 e^{−(V+44)/80}   (D.4)
α_h(V) = 0.21 e^{−(V+58)/20}   (D.5)
β_h(V) = 3 / (1 + e^{−(V+28)/10})   (D.6)
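For concreteness, the rate functions above can be evaluated numerically. The sketch below (Python with NumPy; entirely our own illustration, not code from the paper) computes the steady-state gating curves x∞(V) = α_x/(α_x + β_x) and reports their maximal slope, a crude measure of the sharpness discussed in the text.

```python
# Sketch: evaluating the rate functions D.1-D.6 and the derived steady-state
# activations x_inf = alpha_x/(alpha_x + beta_x).
import numpy as np

def alpha_m(V): return 0.1*(V + 35)/(1 + np.exp(-(V + 35)/10))   # eq. D.1
def beta_m(V):  return 4*np.exp(-(V + 60)/18)                    # eq. D.2
def alpha_n(V): return 0.03*(V + 34)/(1 - np.exp(-(V + 34)/10))  # eq. D.3
def beta_n(V):  return 0.375*np.exp(-(V + 44)/80)                # eq. D.4
def alpha_h(V): return 0.21*np.exp(-(V + 58)/20)                 # eq. D.5
def beta_h(V):  return 3/(1 + np.exp(-(V + 28)/10))              # eq. D.6

# Grid offset slightly to avoid the removable singularity of alpha_n at V = -34.
V = np.linspace(-79.99, 0.01, 801)
for name, a, b in [("m", alpha_m, beta_m), ("n", alpha_n, beta_n), ("h", alpha_h, beta_h)]:
    x_inf = a(V)/(a(V) + b(V))
    slope = np.max(np.abs(np.gradient(x_inf, V)))   # sharpness of the curve
    print(f"{name}: max |d x_inf/dV| = {slope:.4f} per mV")
```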
In order to study how the activation of the sodium current affects the instability of the asynchronous state and the frequency of the population oscillations at synchrony onset, we also considered two other models that differ from the Wang-Buzsáki model in the sharpness of the sigmoidal function m∞(V) (see also Figure 13). In model 1:

α_m(V) = 0.1(V + 30) / (1 + e^{−(V+30)/10})   (D.7)
β_m(V) = 4 e^{−(V+55)/12}   (D.8)
Hence, the slope of the activation curve at the inflexion point of the sigmoid is smaller in model 1 than in the Wang-Buzsáki model (see also Figure 13A, dashed line).
In model 2:

α_m(V) = 0.1(V + 35) / (1 + e^{−(V+35)/20})   (D.9)
β_m(V) = 4 e^{−(V+47.4)/18}   (D.10)
In this model, the activation curve is sharper than in the Wang-Buzsáki model (see also Figure 13A, dotted line). In all three models, g_Na = 35 mS/cm², V_Na = 55 mV, V_K = −90 mV, g_L = 0.1 mS/cm², V_L = −65 mV, and C = 1 µF/cm². In particular, the passive membrane time constant of the neuron is τ_m = C/g_L = 10 ms, as in Wang and Buzsáki (1996).

Acknowledgments

We thank Alex Roxin and Carl van Vreeswijk for careful reading of the manuscript.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates I: Substrate–spikes, rates and neuronal gain. Network, 2, 259–274.
Anderson, J. S., Lampl, I., Gillespie, D. C., & Ferster, D. (2000). The contribution of noise to contrast invariance of orientation tuning in cat visual cortex. Science, 290, 1968–1972.
Bragin, A., Jandó, G., Nádasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N., Chance, F., Fourcaud, N., & Abbott, L. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11, 1621–1671.
Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? J. Neurophysiol., 90, 415–430.
Buzsáki, G., Urioste, R., Hetke, J., & Wise, K. (1992). High-frequency network oscillation in the hippocampus. Science, 256, 1025–1027.
Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454.
Csicsvari, J., Hirase, H., Czurko, A., Mamiya, A., & Buzsáki, G. (1999a). Fast network oscillations in the hippocampal CA1 region of the behaving rat. J. Neurosci., 19, RC20.
Csicsvari, J., Hirase, H., Czurko, A., Mamiya, A., & Buzsáki, G. (1999b). Oscillatory coupling of hippocampal pyramidal cells and interneurons in the behaving rat. J. Neurosci., 19, 274–287.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comput., 13, 1285–1310.
Fourcaud, N., & Brunel, N. (2002). Dynamics of firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Fourcaud-Trocmé, N., Hansel, D., van Vreeswijk, C., & Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. J. Neurosci., 23, 11628–11640.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophysiol., 88, 761–770.
Geisler, C., Brunel, N., & Wang, X.-J. (2005). Contributions of intrinsic membrane dynamics to fast network oscillations with irregular neuronal discharges. J. Neurophysiol., 94, 4344–4361.
Gerstner, W., van Hemmen, L., & Cowan, J. (1996). What matters in neuronal locking? Neural Computation, 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation, 12, 1095–1139.
Golomb, D., Hansel, D., & Mato, G. (2001). Theory of synchrony of neuronal activity. In S. Gielen & F. Moss (Eds.), Handbook of biological physics. Dordrecht: Elsevier.
Golomb, D., Hansel, D., Shraiman, D., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Phys. Rev. A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comp., 15, 1–56. Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–370. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337. Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721. Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computational Neurosci., 3, 7–34. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conductance and excitation in nerve. J. Physiol., 117, 500–544. Hormuzdi, S. G., Pais, I., LeBeau, F. E., Towers, S. K., Rozov, A., Buhl, E. H., Whittington, M. A., & Monyer, H. (2001). Impaired electrical signaling disrupts gamma frequency oscillations in connexin 36-deficient mice. Neuron, 31, 487–495. Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: SpringerVerlag. Lapicque, L. (1907). Recherches quantitatives sur l’excitabilit´e e´ lectrique des nerfs trait´ee comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635. Mato, G. (2005). Theory of neural synchrony. In C. Chow, B. Gutkin, D. Hansel, C. Meunier, & J. Dalibard (Eds.), Methods and models in neurophysics. Dordrecht: Elsevier. McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons in the neocortex. J. Neurophysiol., 54, 782–806. Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous neural networks. Neural Comp., 12, 1607–1641. Pfeuty, B., Golomb, D., Mato, G., & Hansel, D. (2003). Electrical synapses and synchrony: The role of intrinsic currents. J. Neurosci., 23, 6280–6294. Pfeuty, B., Mato, G., Golomb, D., & Hansel, D. (2005). The combined effects of inhibitory and electrical synapses in synchrony. Neural Comp., 17, 633–670. Ricciardi, L. M. (1977). Diffusion processes and related topics on biology. Berlin: SpringerVerlag. Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling, pp. 251–291. Cambridge, MA: MIT Press. Siapas, A. G., & Wilson, M. A. (1998). Coordinated interactions between hippocampal ripples and cortical spindles during slow-wave sleep. Neuron, 21, 1123–1128. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334– 350. Tiesinga, P. H., & Jose, J. V. (2000). Robust gamma oscillations in networks of inhibitory hippocampal interneurons. Network, 11, 1–23. Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259– 284.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983. Tsodyks, M. V., Mit’kov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537. van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321. van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Computation, 13, 959–992. Wang, X.-J., & Buzs´aki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413. ˜ C., & Kopell, N. (1998). Synchronization and White, J. A., Chow, C. C., Soto-Trevino, oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Received May 19, 2005; accepted August 23, 2005.
LETTER
Communicated by Bard Ermentrout
Analysis of Synchronization Between Two Modules of Pulse Neural Networks with Excitatory and Inhibitory Connections Takashi Kanamaru
[email protected] Department of Basic Engineering in Global Environment, Faculty of Engineering, Kogakuin University, 2665-1 Nakano, Hachioji, Tokyo 192-0015, Japan
To study the synchronized oscillations among distant neurons in the visual cortex, we analyzed the synchronization between two modules of pulse neural networks using the phase response function. It was found that the intermodule connections from excitatory to excitatory ensembles tend to stabilize the antiphase synchronization and that the intermodule connections from excitatory to inhibitory ensembles tend to stabilize the in-phase synchronization. It was also found that the intermodule synchronization was more noticeable when the inner-module synchronization was weak.

1 Introduction

The average behavior of neurons often shows synchronized oscillations in many areas of the brain, for example, in the visual cortex (Gray & Singer, 1989), the hippocampus (Buzsáki, Horváth, Urioste, Hetke, & Wise, 1992; Bragin et al., 1995), the auditory neocortex (Traub, Bibbig, LeBeau, Cunningham, & Whittington, 2005), and the entorhinal cortex (Cunningham, Davies, Buhl, Kopell, & Whittington, 2003), and they have attracted considerable attention in the past 20 years. Synchronized oscillations with gamma frequency (20–70 Hz) among nearby neurons with overlapping receptive fields have been observed in the visual cortex. Moreover, when correlated visual stimulations were presented, synchronized oscillations were observed even among distant neurons that had nonoverlapping receptive fields and were separated by 7 mm. Based on such observations, it was proposed that the correlations among neuronal activities might be related to the binding of visual information (for reviews, see Gray, 1994). Several mechanisms likely contribute to the generation of such synchronized oscillations in the visual cortex. First, the lateral geniculate nucleus (LGN) often provides oscillating inputs to the visual cortex. However, the range of projections from the LGN cannot explain the cortical synchronization among distant neurons. Therefore, the synchronized oscillations in the visual cortex are thought to be generated by an intracortical mechanism, not by oscillating inputs from the LGN (Gray & Singer, 1989). However, it
Neural Computation 18, 1111–1131 (2006)
C 2006 Massachusetts Institute of Technology
is unknown whether the oscillations are caused by the properties of single neurons or by intracortical network interactions. As for the theory that the oscillations are caused by the properties of single neurons, it was reported that chattering cells in the visual cortex show periodic bursts of gamma frequency, which might be related to the generation of oscillatory responses (Gray & McCormick, 1996). On the other hand, physiological evidence supports the theory that the oscillations are generated by intracortical network interactions (Jagadeesh, Gray, & Ferster, 1992; Gray, 1994). In the hippocampus, it was reported that the network that contains inhibitory neurons contributes to the generation of oscillations (Buzsáki et al., 1992; Whittington, Traub, & Jefferys, 1995; Fisahn, Pike, Buhl, & Paulsen, 1998). Concerning the generation of synchronized oscillations in the neuronal network, we have been studying pulse neural networks that are composed of excitatory neurons and inhibitory neurons. In previous studies, the dynamics of a single network module were analyzed using the Fokker-Planck equation, and various synchronized firings were found depending on the values of the parameters (Kanamaru & Sekine, 2004, 2005). Such synchronized firings might be related to the synchronized oscillations among nearby neurons. In the study reported here, in order to elucidate the mechanism of synchronized oscillations among distant neurons, we analyzed the synchronization between two modules of networks, in which each module was composed of excitatory neurons and inhibitory neurons. Ermentrout and Kopell (1998) analyzed a similar network of two modules, each of which contained an excitatory cell (E-cell) and an inhibitory cell (I-cell). The E-cell and I-cell each represented populations of neurons, and their dynamics obeyed the equations for the spiking neuron model. Therefore, the neurons in each population were assumed to be perfectly synchronized. However, when the neurons in each module are not perfectly synchronized but are partially synchronized (van Vreeswijk, 1996), their analysis cannot hold, because each neuron in a module receives many pulses from other neurons in that module and from neurons in the other module. In our model, perfect synchronization is not realized because of noise; therefore, probabilistic representations are required to describe the dynamics of each module. In section 2, the definition of a module of a pulse neural network is given, and its dynamics are analyzed using the Fokker-Planck equation. Some examples of synchronized firings are presented. In section 3, a system with two modules of networks is introduced, and the intermodule synchronization is analyzed using the phase response function (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout, Pascal, & Gutkin, 2001; Nomura, Fukai, & Aoyagi, 2003). As a result, it was found that the intermodule connections from excitatory to excitatory ensembles tend to stabilize the antiphase synchronization, and the intermodule connections from excitatory to inhibitory ensembles tend to stabilize the in-phase synchronization. Moreover, it was found that the intermodule synchronization
is more noticeable when the inner-module synchronization is weak. The final section provides a discussion and conclusions.

2 One-Module System

In this section, we consider a module of a pulse neural network composed of excitatory neurons with internal states θ_E^(i) (i = 1, 2, . . . , N_E) and inhibitory neurons with internal states θ_I^(i) (i = 1, 2, . . . , N_I) that are written as

θ̇_E^(i) = (1 − cos θ_E^(i)) + (1 + cos θ_E^(i)) × (r_E + ξ_E^(i)(t) + g_EE I_E(t) − g_EI I_I(t)),   (2.1)
θ̇_I^(i) = (1 − cos θ_I^(i)) + (1 + cos θ_I^(i)) × (r_I + ξ_I^(i)(t) + g_IE I_E(t) − g_II I_I(t)),   (2.2)
I_X(t) = (1/N_X) Σ_{i=1}^{N_X} δ(θ_X^(i) − π),   (2.3)
⟨ξ_X^(i)(t) ξ_Y^(j)(t′)⟩ = D δ_XY δ_ij δ(t − t′),   (2.4)

where X, Y = E or I, g_XY is the connection strength from ensemble Y to ensemble X, r_E and r_I are system parameters, and δ_XY and δ_ij are Kronecker's deltas. I_X(t) is the synaptic input from ensemble X, and ξ_X^(i)(t) is the noise in the ith neuron in ensemble X. In the following, we call this network with excitatory and inhibitory ensembles the one-module system. The dynamics of this one-module system are nearly identical to the dynamics of the pulse-coupled active rotators analyzed by Kanamaru and Sekine (2005); therefore, we describe them only briefly here. Note that the model of neurons with θ̇ = (1 − cos θ) + (1 + cos θ)r is the canonical model of class 1 neurons (Ermentrout & Kopell, 1986; Ermentrout, 1996), and arbitrary class 1 neurons near their bifurcation points can be transformed into the canonical model. The canonical model was previously extended to networks of weakly connected class 1 neurons (Hoppensteadt & Izhikevich, 1997; Izhikevich, 1999), and the system governed by equations 2.1, 2.2, and 2.3 has the form of this canonical model of weakly connected class 1 neurons. Thus, networks of weakly connected arbitrary class 1 neurons with global connections can be transformed into the above form with the appropriate change of variables. Here we restrict the parameters so that the system parameters r_E and r_I and the noise intensity D are uniform in the network. Moreover, the restrictions g_EE = g_II ≡ g_int = 4 and g_EI = g_IE ≡ g_ext are placed on the connection strengths for simplicity.
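A direct way to see these dynamics is to integrate equations 2.1 to 2.4 with the Euler-Maruyama scheme. The sketch below (Python with NumPy) is our own minimal illustration: the pulse input I_X(t) is approximated by counting threshold crossings of π per time step, with the factor 1/2 of equation 2.6; the parameter values correspond to Figure 2A, while the time step is an assumption.

```python
# Sketch: Euler-Maruyama simulation of the one-module system, eqs. 2.1-2.4.
import numpy as np

rng = np.random.default_rng(0)
NE = NI = 1000
rE, rI, D = -0.025, -0.05, 0.005
g_int = 4.0
g_ext = 0.5*g_int                                  # g_ext/g_int = 0.5, as in Figure 2A
dt, T = 0.01, 300.0                                # time step is our own choice

thE = np.full(NE, -np.arccos((1 + rE)/(1 - rE)))   # rest state, eq. 2.5
thI = np.full(NI, -np.arccos((1 + rI)/(1 - rI)))
IE = II = 0.0
JE_trace = []
for step in range(int(T/dt)):
    xiE = rng.normal(0.0, np.sqrt(D/dt), NE)       # discretized white noise, eq. 2.4
    xiI = rng.normal(0.0, np.sqrt(D/dt), NI)
    dE = (1 - np.cos(thE)) + (1 + np.cos(thE))*(rE + xiE + g_int*IE - g_ext*II)
    dI = (1 - np.cos(thI)) + (1 + np.cos(thI))*(rI + xiI + g_ext*IE - g_int*II)
    newE, newI = thE + dt*dE, thI + dt*dI
    # pulse rates from crossings of pi; the 1/2 reflects theta_dot = 2 at pi (eq. 2.6)
    IE = np.sum((thE < np.pi) & (newE >= np.pi))/(2*NE*dt)
    II = np.sum((thI < np.pi) & (newI >= np.pi))/(2*NI*dt)
    thE = np.mod(newE + np.pi, 2*np.pi) - np.pi    # keep theta in (-pi, pi]
    thI = np.mod(newI + np.pi, 2*np.pi) - np.pi
    JE_trace.append(2*IE)                          # J_E = 2 I_E, cf. eq. A.11
```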
In the absence of noise ξ_X^(i)(t) and synaptic input I_X(t), a single neuron shows self-oscillation for r_X > 0. For r_X < 0, this neuron becomes an excitable system with a stable equilibrium given by

θ₀ = −arccos((1 + r_X)/(1 − r_X)),   (2.5)

in which θ₀ is close to zero for r_X ∼ 0. We define the firing time of the neuron as the time at which θ_X^(i) exceeds π, because π is away from θ₀ (∼ 0). Note that the relation θ̇_E^(i) = θ̇_I^(i) = 2 > 0 holds at θ = π independent of the synaptic input I_X(t) and the noise ξ_X^(i)(t); therefore, the firing of the neuron can be defined naturally. In the following, we use values of the parameters where r_X < 0, and we consider the dynamics of networks of excitable neurons. Note that the synaptic input I_X(t) from ensemble X can be rewritten as

I_X(t) = (1/(2N_X)) Σ_{i=1}^{N_X} Σ_j δ(t − t_j^(i)),   (2.6)

where t_j^(i) is the jth firing time of the ith neuron in ensemble X. In the limit of N_E, N_I → ∞, the average behavior of neurons in the system can be analyzed with the Fokker-Planck equations, which describe the development of the probability density of the system over time, as shown in appendix A. It is notable that asynchronous firings and synchronized firings of neurons in the network correspond to a stationary solution and a time-varying solution of the Fokker-Planck equations, respectively. When the Fokker-Planck equations are used, a bifurcation set can be obtained numerically by the method shown in appendix B; the bifurcation set for the parameters r_E = −0.025 and r_I = −0.05 is shown in Figure 1. Generally, synchronized firings of neurons are observed when the chosen values of the noise intensity D and the connection strength g_ext are in the area enclosed by the SNL (saddle-node-on-limit-cycle) and Hopf bifurcation lines (see Figure 1). For more detailed information about the bifurcation, see Kanamaru and Sekine (2005). Typical synchronized firings of neurons in a one-module system are shown in Figure 2. The change in the probability flux J_E, which is defined in appendix A, at θ_E = π over time for various values of D and g_ext/g_int is shown in Figures 2A, 2C, and 2E. Note that the probability flux J_E can be interpreted as the instantaneous firing rate of the excitatory ensemble. The raster plots of the firing times of the excitatory neurons in the system with N_E = N_I = 1000 are shown in Figures 2B, 2D, and 2F. As shown in Figures 2A and 2B, in cases where D and g_ext/g_int are near the saddle-node-on-limit-cycle bifurcation, the synchronized firings of neurons have strong correlations and long periods because the system stays a long time in the
Figure 1: Bifurcation set in the (D, g_ext) plane for r_E = −0.025 and r_I = −0.05. SN, saddle node; SNL, saddle-node-on-limit cycle; DLC, double limit cycle.
area where the original saddle and node existed. As shown in Figures 2C and 2D, in cases where D and g_ext/g_int are near the Hopf bifurcation, the synchronized firings of neurons have weak correlations and high frequencies because a limit cycle that corresponds to these synchronized firings is created around the stable equilibrium with large probability fluxes. The synchronized firings of neurons shown in Figures 2E and 2F are weakly synchronized periodic firings (Kanamaru & Sekine, 2004) where only a small percentage of the neurons fire in each period. Such firings are realized when the peak value of the probability flux J_E is very small, as shown in Figure 2E, and when each neuron receives subthreshold periodic inputs. We assume that these weakly synchronized periodic firings may be related to the physiologically observed synchronized firings because their degree of synchronization is also weak (Gray & Singer, 1989; Buzsáki et al., 1992; Fisahn et al., 1998). However, in the physiological environment, the properties of single neurons are not uniform and the structures of the networks are more complex. Therefore, more detailed theoretical analyses are required to validate the presence of neurons with weakly synchronized periodic firings in physiological environments.
Figure 2: Synchronized firings of neurons in the one-module system in the case where r_E = −0.025 and r_I = −0.05. (A, C, E) Change in the probability flux J_E over time at θ_E = π. (B, D, F) Raster plots of the firing times of the excitatory neurons in the system with N_E = N_I = 1000. (A, B) Synchronized firings of neurons where D and g_ext/g_int are near the saddle-node-on-limit-cycle bifurcation. The results in the case of D = 0.005, g_ext/g_int = 0.5, and g_int = 4 are shown. (C, D) Synchronized firings of neurons where D and g_ext/g_int are near the Hopf bifurcation. The results in the case of D = 0.02, g_ext/g_int = 0.5, and g_int = 4 are shown. (E, F) Weakly synchronized periodic firings of neurons where D = 0.005, g_ext/g_int = 1.5, and g_int = 4.
internal states of the neurons are defined as

θ̇_Ek^(i) = (1 − cos θ_Ek^(i)) + (1 + cos θ_Ek^(i)) × (r_Ek + ξ_Ek^(i)(t) + g_EkEk I_Ek(t) − g_EkIk I_Ik(t) + ε_EkEl I_El(t) − ε_EkIl I_Il(t)),   (3.1)
θ̇_Ik^(i) = (1 − cos θ_Ik^(i)) + (1 + cos θ_Ik^(i)) × (r_Ik + ξ_Ik^(i)(t) + g_IkEk I_Ek(t) − g_IkIk I_Ik(t) + ε_IkEl I_El(t) − ε_IkIl I_Il(t)),   (3.2)
l ≡ 3 − k,   (3.3)

where k = 1, 2 and represents the first and second modules, respectively. For simplicity, we set the inner-module connection strengths as g_XkYk = g_XY and the intermodule connection strengths as ε_XkYl ≡ ε_XY (k ≠ l). Moreover, we assume that the intermodule connection strengths are very weak, namely, ε_XY ≪ 1, and that the intermodule connections originate only from the excitatory ensembles, namely, ε_EI = ε_II = 0, because the intercolumnar long-range connections in the cortex are excitatory (Gilbert & Wiesel, 1983; Ts'o, Gilbert, & Wiesel, 1986). A similar network of two modules, each of which contains an excitatory cell (E-cell) and an inhibitory cell (I-cell), was previously analyzed by Ermentrout and Kopell (1998). The E-cell and I-cell each represented populations of neurons, and their dynamics obeyed the equations for the spiking neuron model. The neurons in each population were assumed to have perfectly synchronized firings. However, as shown in Figure 2, our inner-module neurons do not show perfectly synchronized firings; therefore, the average behavior of the neurons in each module cannot be represented by that of a single neuron. Instead, we use the probabilistic representation of the Fokker-Planck equation to describe the dynamics of each module. In the limit of N_Ek, N_Ik → ∞, the dynamics of the probability density of each module are governed by the Fokker-Planck equation shown in appendix A, and the Fourier coefficients of the probability densities follow the ordinary differential equation ẋ = f(x), which is defined in appendix B. When each module shows inner-module synchronized firings, the vector x moves on a limit cycle x = x₀(t). In a system of two modules that have weak intermodule connections ε_XY, the two limit cycles are connected weakly, and such a system can be analyzed using the phase response function (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout, Pascal, & Gutkin, 2001; Nomura et al., 2003), as summarized in appendix C. Using this method, we can transform the weakly connected ordinary differential equation ẋ = f(x) of the Fourier coefficients into the averaged phase equations C.5 and C.6, and we can analyze the stationary phase differences using C.10.
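Before turning to the results, a minimal sketch of the two-module dynamics may be helpful. It extends the one-module scheme of section 2 to equations 3.1 to 3.3; the values of ε_EE and ε_IE are those quoted for Figure 3, while the time step and duration are our own choices.

```python
# Sketch: two weakly coupled modules, eqs. 3.1-3.3, with intermodule
# connections only from the excitatory ensembles (eps_EI = eps_II = 0).
import numpy as np

rng = np.random.default_rng(2)
N = 1000
rE, rI, D = -0.025, -0.05, 0.005
g_int, g_ext = 4.0, 2.0                  # g_ext/g_int = 0.5
eps_EE, eps_IE = 0.025, 0.025            # intermodule strengths, as in Figure 3
dt = 0.01

th = {("E", k): np.full(N, -np.arccos((1 + rE)/(1 - rE))) for k in (1, 2)}
th.update({("I", k): np.full(N, -np.arccos((1 + rI)/(1 - rI))) for k in (1, 2)})
I = {key: 0.0 for key in th}             # pulse rates I_X(t), per module

for step in range(30000):
    Inew = {}
    for k in (1, 2):
        l = 3 - k                        # the other module, eq. 3.3
        for X, rX in (("E", rE), ("I", rI)):
            gE = g_int if X == "E" else g_ext      # g_EE = g_II = g_int
            gI = g_ext if X == "E" else g_int      # g_EI = g_IE = g_ext
            eps = eps_EE if X == "E" else eps_IE
            xi = rng.normal(0.0, np.sqrt(D/dt), N)
            inp = rX + xi + gE*I[("E", k)] - gI*I[("I", k)] + eps*I[("E", l)]
            new = th[(X, k)] + dt*((1 - np.cos(th[(X, k)]))
                                   + (1 + np.cos(th[(X, k)]))*inp)
            # pulse rate from threshold crossings of pi, cf. eq. 2.6
            Inew[(X, k)] = np.sum((th[(X, k)] < np.pi) & (new >= np.pi))/(2*N*dt)
            th[(X, k)] = np.mod(new + np.pi, 2*np.pi) - np.pi
    I = Inew
```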
The dependence of the stationary phase differences on the ratio γ, which is defined as

ε_EE : ε_IE = 1 − γ : γ,   (3.4)
in cases with different values of D and g_ext/g_int is shown in Figures 3, 4, and 5. In the following, in-phase and antiphase synchronization are defined as the stationary solutions with phase difference Δφ = 0 and Δφ = 0.5, respectively. In all cases, it was found that the connections from excitatory to excitatory ensembles (ε_EE) tended to stabilize the antiphase synchronization, while the connections from excitatory to inhibitory ensembles (ε_IE) tended to stabilize the in-phase synchronization. However, when the inner-module synchronization was strong (see Figure 3), the antiphase synchronization remained stable even when ε_IE was large, and therefore intermodule synchronization was harder to attain than in the other cases. On the other hand, when the inner-module synchronization was weak (see Figure 5), the in-phase synchronization was stable over a wide range of γ, and therefore intermodule synchronization was easily attained. As shown above, ε_EE and ε_IE contribute to the intermodule synchronization in different ways because their phase responses have different properties. Note that the phase response describes the change in frequency at φ in response to small perturbations, as shown in appendix C. The phase response function Z(t) is a vector function whose components represent the effects of inputs to the Fourier components of the Fokker-Planck equation. To make the phase response easier to understand, the phase responses Γ_δE and Γ_δI, upon injection of the delta function into the excitatory or inhibitory ensemble, are calculated by equation C.7, and the results are shown in Figure 6 for three sets of parameters. Generally, the two phase responses have opposite signs in the three cases; therefore, it can be concluded that the connections to excitatory and inhibitory ensembles have opposite synchronization properties. Moreover, when the system is close to the saddle-node-on-limit-cycle bifurcation point (see Figures 3, 6A, and 6B), the phase response of the inhibitory ensemble is much smaller than that of the excitatory ensemble. Thus, the in-phase synchronization is hard to attain in Figure 3. Although the phase responses in Figures 6D and 6F have similar forms, the amplitude of J_I is smaller than that of J_E in Figure 6C. Thus, the effect of the inhibitory ensemble is weak when the system is close to the Hopf bifurcation point and its firing rates are high (see Figures 4, 6C, and 6D), and the in-phase synchronization is harder to attain than in the case of the weakly synchronized periodic firings in Figure 5. Next, let us consider a system with a transmission delay d between two modules. Such a system can be analyzed with the equation
Γ_d(φ_α − φ_α′) = (1/T) ∫₀ᵀ Z(t + φ_α) · p(t + φ_α, t + φ_α′ − d) dt,   (3.5)

which was obtained by incorporating the delay d into Γ(φ_α − φ_α′) in equation C.7 (Hansel, Mato, & Meunier, 1995). The areas where the in-phase or antiphase synchronization is stable are obtained numerically, and their
Figure 3: (A) Dependence of the stationary phase differences between the two modules of a two-module system on the connection ratio γ in the case where r_E = −0.025, r_I = −0.05, D = 0.005, g_ext/g_int = 0.5, and g_int = 4. The phase variable is normalized with the period T. The solid and dotted lines denote the stable and unstable phase differences, respectively. (B) Change in J_E over time in the case where γ = 0.5. The solid and dotted lines denote modules 1 and 2, respectively. (C) Raster plot of the firing times of excitatory neurons in a two-module system where N_E = N_I = 1000. In B and C, the intermodule connection strengths were set at ε_EE = ε_IE = 0.025.
Figure 4: Dependence of the stationary phase differences between the two modules of a two-module system on the connection ratio γ in the case where D = 0.02, g_ext/g_int = 0.5, and g_int = 4. The explanations are the same as those in Figure 3 except for the values of the parameters.
Figure 5: Dependence of the stationary phase differences between the two modules of a two-module system on the connection ratio γ in the case where D = 0.005, g_ext/g_int = 1.5, and g_int = 4. The explanations are the same as those in Figure 3 except for the values of the parameters.
[Figure 6 appears here. Panels: (A, B) D = 0.005, g_ext/g_int = 0.5; (C, D) D = 0.02, g_ext/g_int = 0.5; (E, F) D = 0.005, g_ext/g_int = 1.5. The upper panel of each pair plots J_E and J_I against the phase φ; the lower panel plots Γ_δE(φ) and Γ_δI(φ).]
Figure 6: The phase responses Γ_δE and Γ_δI on injection of the delta function into the excitatory or inhibitory ensemble, respectively. (A, C, E) Change in J_E and J_I over time during a single period. (B, D, F) Phase responses. The parameters are shown in each figure.
dependence on the connection ratio γ and the delay d is shown in Figure 7, where the delay is normalized by the period T. It is observed that in cases where d is small, ε_IE stabilizes the in-phase synchronization, and in cases where d is large, ε_EE stabilizes the in-phase synchronization. Let us consider the physiologically valid values of d for a gamma oscillation of 40 Hz
[Figure 7 appears here. Panels: (A) D = 0.005, g_ext/g_int = 0.5; (B) D = 0.02, g_ext/g_int = 0.5; (C) D = 0.005, g_ext/g_int = 1.5. Each panel shows the (γ, d/T) plane with regions labeled I and A.]
Figure 7: Areas where the in-phase or antiphase synchronization is stable in the (γ , d) plane. The solid and dotted lines are the boundaries for the stable region of the in-phase or antiphase synchronization, respectively. The delay d is normalized with the period T. The in-phase synchronization is stable in the areas labeled I, and the antiphase synchronization is stable in the areas labeled A. The other stable phase differences are omitted for simplicity.
(T = 25 ms). The major components of the delay in signal transmission between two neurons are the transmission delay along the axon and the synaptic delay in transmitting the signal across the synaptic cleft (Nicholls, Martin, Wallace, & Fuchs, 2001). Because the conduction velocity along a myelinated axon is 1 to 100 m per second, the conduction delay between two neurons separated by 7 mm is estimated to be 0.07 to 7 ms. The synaptic delay is known to be about 1 to 2 ms. Thus, we roughly estimate that d < 10 ms and obtain the relationship d/T < 0.4. Under this condition, ε_IE stabilizes the in-phase synchronization, as shown in Figure 7. Moreover, for d/T < 0.4, the area with stable in-phase synchronization was widest for the weakly synchronized periodic firings (see Figure 7C).
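The stability analysis of this section reduces to locating the zeros of the odd part of Γ with negative slope, with the delay entering as a shift of Γ as in equation 3.5. The following sketch (Python with NumPy) illustrates the procedure on a toy Γ of our own; in an actual computation, Γ would come from equation C.7.

```python
# Sketch: stable phase differences from Gamma_odd, including a delay d.
import numpy as np

T = 25.0                                           # period in ms (gamma, 40 Hz)
phi = np.linspace(0.0, T, 500, endpoint=False)
Gamma = np.sin(2*np.pi*phi/T) + 0.3*np.sin(4*np.pi*phi/T)   # toy Gamma, for illustration

def stable_phase_differences(Gamma, d=0.0):
    """Zeros of Gamma_odd with negative slope. The delay acts as a shift,
    Gamma_d(dphi) = Gamma(dphi + d), cf. equation 3.5."""
    Gd = np.interp((phi + d) % T, phi, Gamma)
    Godd = Gd - np.interp((-phi) % T, phi, Gd)     # eqs. C.9-C.10 with delay
    stable = []
    for i in range(len(phi)):
        j = (i + 1) % len(phi)
        if Godd[i] > 0.0 >= Godd[j]:               # downward crossing: stable
            stable.append(round(phi[j]/T, 3))      # normalized phase difference
    return stable

for d in (0.0, 5.0, 10.0):
    print(f"d/T = {d/T:.2f}: stable dphi/T = {stable_phase_differences(Gamma, d)}")
```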
4 Discussion and Conclusions

To study the mechanism through which synchronized oscillations occur in the brain, we analyzed the properties of synchronization of class 1 pulse neural networks. In the one-module system, which was composed of excitatory and inhibitory neurons, various synchronized firings were observed depending on the connection strengths and the noise intensity, and they might be related to the synchronized oscillations with gamma frequency among nearby neurons in the visual cortex. Note that such synchronized firings can be observed only when the excitatory neurons and inhibitory neurons are connected with each other (see the area with g_ext = 0 in Figure 1). In other words, the synchronized firings observed in our model were generated by the interactions between the excitatory ensemble and inhibitory ensemble in the network. On the other hand, it is known that self-oscillating neurons that consist of only excitatory (or inhibitory) neurons in a network can synchronize with each other (Mirollo & Strogatz, 1990). This difference might arise because our network is composed of excitable, but not self-oscillating, neurons. To elucidate the mechanism by which synchronized oscillations occur among distant neurons, we analyzed the synchronization between two modules of networks using the phase response function. A similar network of two modules, each of which showed perfect synchronization, was previously analyzed by Ermentrout and Kopell (1998). However, as shown in Figure 2, the neurons in our module do not show perfect synchronization; therefore, a probabilistic representation with the Fokker-Planck equation was required to describe the dynamics of each module. As a result, it was found that the intermodule connections from excitatory to excitatory (E → E) ensembles tended to stabilize the antiphase synchronization, while the intermodule connections from excitatory to inhibitory (E → I) ensembles tended to stabilize the in-phase synchronization. Moreover, it was found that intermodule synchronization was more easily attained when the inner-module synchronizations were weak. Our finding that the E → E intermodule connections stabilize antiphase synchronization is analogous to the previous results that a pair of excitatory neurons with slow connections has a stable antiphase solution (Hansel et al., 1995; van Vreeswijk, 1996; Sato & Shiino, 2002). Moreover, our finding that the E → I intermodule connections tend to stabilize the in-phase synchronization is similar to the previous result that the E → I and I → E connections stabilize the in-phase synchronization despite the existence of a delay (Ermentrout & Kopell, 1998). However, the mechanism of synchronization in their model differs from that in our model. In the model of Ermentrout and Kopell (1998), the timing of the pulses played important roles in synchronization because their network contained only four neurons: two E-cells and two I-cells. They stated that a pair of pulses (doublet) of
the I-cell was important in the process of synchronization. However, in our network, there are many neurons, and each neuron receives many pulses from other neurons (see Figures 2B, 2D, and 2F). Therefore, the timing of the pulses is less important in our model than in their model. Nevertheless, similar results on the roles of E → I connections were obtained. Moreover, in our model, it was found that the degree of synchronization in one module affects the properties of the intermodule synchronization. In summary, in our model, the oscillations in a neuronal ensemble were generated by a local network composed of excitatory neurons and inhibitory neurons, and their synchronization was realized by the long-range connections from excitatory to inhibitory ensembles. We modeled the average dynamics of the module using probabilistic representation with the Fokker-Planck equation. In physiological environments, the properties of single neurons are not uniform and the networks are more complex; therefore, probabilistic representation may be crucial for understanding their dynamics. However, in our research, we assumed that the properties of the neurons and the structure of the module were uniform. Therefore, more detailed analyses are required. Moreover, we confirmed that the analysis with the phase response function is applicable to the stochastic system whose average dynamics obey the Fokker-Planck equation. It is known that the phase response function can be calculated from physiological data (Reyes & Fetz, 1993a,1993b; Jones, Mulloney, Kaper, & Kopell, 2003); therefore, our method might widen application of the phase response function in theoretical and experimental fields.
Appendix A: The Fokker-Planck Equation for the One-Module System

To analyze the dynamics of the one-module system, we use the Fokker-Planck equations (Kuramoto, 1984; Gerstner & Kistler, 2002), which are written as

∂n_E/∂t = −∂/∂θ_E (A_E n_E) + (D/2) ∂/∂θ_E [B_E ∂/∂θ_E (B_E n_E)],   (A.1)
∂n_I/∂t = −∂/∂θ_I (A_I n_I) + (D/2) ∂/∂θ_I [B_I ∂/∂θ_I (B_I n_I)],   (A.2)
A_E(θ_E, t) = (1 − cos θ_E) + (1 + cos θ_E) × (r_E + g_EE I_E(t) − g_EI I_I(t)),   (A.3)
A_I(θ_I, t) = (1 − cos θ_I) + (1 + cos θ_I) × (r_I + g_IE I_E(t) − g_II I_I(t)),   (A.4)
B_E(θ_E, t) = 1 + cos θ_E,   (A.5)
B_I(θ_I, t) = 1 + cos θ_I,   (A.6)

for the normalized number densities of excitatory and inhibitory neurons, in which

n_E(θ_E, t) ≡ (1/N_E) Σ_{i=1}^{N_E} δ(θ_E − θ_E^(i)),   (A.7)
n_I(θ_I, t) ≡ (1/N_I) Σ_{i=1}^{N_I} δ(θ_I − θ_I^(i)),   (A.8)

in the limit of N_E, N_I → ∞. The probability flux for each ensemble is defined as

J_E(θ_E, t) = A_E n_E − (D/2) B_E ∂/∂θ_E (B_E n_E),   (A.9)
J_I(θ_I, t) = A_I n_I − (D/2) B_I ∂/∂θ_I (B_I n_I),   (A.10)

respectively. In the limit of N_X → ∞, I_X(t) in equation 2.6 can be written as

I_X(t) = (1/2) J_X(t)   (A.11)
       = n_X(π, t),   (A.12)

where J_X(t) ≡ J_X(π, t) is the probability flux at θ_X = π. By integrating the Fokker-Planck equations A.1 and A.2 with equation A.12, the dynamics of the network governed by equations 2.1 and 2.2 can be analyzed.

Appendix B: Numerical Integration of the Fokker-Planck Equations

In this section, we provide a method of performing numerical integration of the Fokker-Planck equations A.1 and A.2. Because the normalized number densities given by equations A.7 and A.8 are 2π-periodic functions of θ_E
and θ_I, respectively, they can be expanded as

n_E(θ_E, t) = (1/2π) [1 + Σ_{k=1}^{∞} (a_k^E(t) cos(kθ_E) + b_k^E(t) sin(kθ_E))],   (B.1)
n_I(θ_I, t) = (1/2π) [1 + Σ_{k=1}^{∞} (a_k^I(t) cos(kθ_I) + b_k^I(t) sin(kθ_I))],   (B.2)

and, by substituting them, equations A.1 and A.2 are transformed into an ordinary differential equation ẋ = f(x), where x ≡ (a_1^E, b_1^E, a_1^I, b_1^I, a_2^E, b_2^E, a_2^I, b_2^I, · · ·)^t,

da_k^(X)/dt = −(r_X + K_X + 1) k b_k^(X) − (r_X + K_X − 1)(k/2)(b_{k−1}^(X) + b_{k+1}^(X)) − (Dk/8) g(a_k^(X)),   (B.3)
db_k^(X)/dt = (r_X + K_X + 1) k a_k^(X) + (r_X + K_X − 1)(k/2)(a_{k−1}^(X) + a_{k+1}^(X)) − (Dk/8) g(b_k^(X)),   (B.4)
g(x_k) = (k − 1)x_{k−2} + 2(2k − 1)x_{k−1} + 6k x_k + 2(2k + 1)x_{k+1} + (k + 1)x_{k+2},   (B.5)
K_X ≡ g_XE I_E − g_XI I_I,   (B.6)
a_0^(X) ≡ 1/π,   (B.7)
b_0^(X) ≡ 0,   (B.8)

and X = E or I. By integrating this ordinary differential equation numerically, the time series of the probability fluxes J_E and J_I are obtained. For numerical calculations, each Fourier series is truncated at the first 40 or 60 terms.
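As a concrete illustration of this procedure, the sketch below (Python with NumPy, our own implementation) integrates the truncated system B.3 to B.8 for both ensembles with a forward Euler step; the value of g_ext and the step size are assumptions.

```python
# Sketch: forward Euler integration of the truncated Fourier-coefficient ODE.
import numpy as np

K = 40                                             # truncation order
rE, rI, D = -0.025, -0.05, 0.005
g = {"EE": 4.0, "II": 4.0, "EI": 2.0, "IE": 2.0}   # g_int = 4, g_ext = 2 (assumed)
dt = 0.001                                         # step size (assumed)

a = {X: np.zeros(K + 3) for X in "EI"}             # a_k with two padding entries
b = {X: np.zeros(K + 3) for X in "EI"}
for X in "EI":
    a[X][0] = 1/np.pi                              # eq. B.7; b_0 = 0 by eq. B.8

def flux(aX):
    """J_X = 2 n_X(pi, t); the sin(k pi) terms vanish at theta = pi."""
    k = np.arange(1, K + 1)
    return 2*(1 + np.sum(aX[1:K+1]*np.cos(k*np.pi)))/(2*np.pi)

def gfun(x, k):
    """Equation B.5; for k = 1 the x_{k-2} term has coefficient zero."""
    return ((k-1)*x[k-2] + 2*(2*k-1)*x[k-1] + 6*k*x[k]
            + 2*(2*k+1)*x[k+1] + (k+1)*x[k+2])

for step in range(100000):
    IE, II = flux(a["E"])/2, flux(a["I"])/2        # eq. A.11
    for X, rX in (("E", rE), ("I", rI)):
        KX = g[X + "E"]*IE - g[X + "I"]*II         # eq. B.6
        da, db = np.zeros(K + 3), np.zeros(K + 3)
        for k in range(1, K + 1):
            da[k] = (-(rX + KX + 1)*k*b[X][k]
                     - (rX + KX - 1)*(k/2)*(b[X][k-1] + b[X][k+1])
                     - D*k/8*gfun(a[X], k))
            db[k] = ((rX + KX + 1)*k*a[X][k]
                     + (rX + KX - 1)*(k/2)*(a[X][k-1] + a[X][k+1])
                     - D*k/8*gfun(b[X], k))
        a[X] = a[X] + dt*da
        b[X] = b[X] + dt*db
    if step % 20000 == 0:
        print(f"t = {step*dt:7.2f}  J_E = {flux(a['E']):.4f}")
```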
The bifurcation lines of the Hopf bifurcation and the saddle-node bifurcation in Figure 1 were obtained as follows. First, a stationary solution x_s was numerically obtained by the Newton method (Press, Flannery, Teukolsky, & Vetterling, 1988), and the eigenvalues of the Jacobian matrix Df(x_s), which had been numerically obtained by using the QR algorithm (Press et al., 1988), were examined to find the bifurcation lines. The bifurcation lines of the global bifurcations, such as the homoclinic bifurcation and the double limit-cycle bifurcation, were obtained by observing the long-time behaviors of the solutions of ẋ = f(x).

Appendix C: Analysis with the Phase Response Function

In this section, we summarize the method of analyzing the dynamics of two weakly coupled oscillators. Let us consider a dynamical system ẋ = f(x) that has a stable limit cycle with period T as its solution, written as x = x₀(t) (x₀(t) = x₀(t + T)), and then introduce a weak perturbation p(x, x′) from the other module x′. Then the dynamics of the module are governed by a differential equation written as

ẋ = f(x) + p(x, x′),   (C.1)

and it can be reduced to

φ̇ = 1 + Z(φ) · p(x₀, x₀′),   (C.2)

where φ = t mod T and Z(φ) is the phase response function that describes the change in frequency at φ in response to small perturbations (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout et al., 2001; Nomura et al., 2003). Z(φ) can be numerically obtained using the method shown by Ermentrout (1996). First, let us consider a linear differential equation,

Ż = −Df(x₀(t))ᵗ · Z(t),   (C.3)

and integrate it backward in time with random initial conditions. After Z(t) converges to a periodic orbit, the normalization

(1/T) ∫₀ᵀ Z(t) · ẋ₀(t) dt = 1   (C.4)

is performed, and Z(t) is obtained. Let us denote the phases of the two modules as φ₁ and φ₂, respectively. After averaging, the two phases obey

φ̇₁ = 1 + Γ(φ₁ − φ₂),   (C.5)
φ̇₂ = 1 + Γ(φ₂ − φ₁),   (C.6)
Γ(φ_α − φ_α′) = (1/T) ∫₀ᵀ Z(t + φ_α) · p(t + φ_α, t + φ_α′) dt,   (C.7)
p(t + φ_α, t + φ_α′) = p(x₀(t + φ_α), x₀(t + φ_α′)).   (C.8)

Using Γ(φ), the phase difference Δφ ≡ φ₁ − φ₂ of the two phases obeys

Δφ̇ = Γ(Δφ) − Γ(−Δφ)   (C.9)
    ≡ Γ_odd(Δφ).   (C.10)

We can obtain the stable phase difference Δφ, which satisfies Γ_odd(Δφ) = 0 and Γ′_odd(Δφ) < 0.
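The backward-integration recipe of equations C.3 and C.4 can be written compactly as follows. This is a minimal sketch (Python with NumPy, our own implementation): f, its Jacobian Df, and one sampled period x₀ of the limit cycle (for instance, from the Fourier-mode ODE of appendix B) are assumed to be given.

```python
# Sketch: phase response function by backward integration (eq. C.3) and
# normalization (eq. C.4), plus the averaged coupling function (eq. C.7).
import numpy as np

def phase_response(f, Df, x0, dt, n_relax=20):
    """x0: array of shape (n, dim) sampling one period; returns Z, shape (n, dim)."""
    n, dim = x0.shape
    Z = np.random.randn(dim)                 # random initial condition
    Ztraj = np.zeros_like(x0)
    for p in range(n_relax):                 # repeat until Z(t) is periodic
        for t in range(n - 1, -1, -1):       # step backward in time
            Z = Z + dt * Df(x0[t]).T @ Z     # reversed-time form of eq. C.3
            if p == n_relax - 1:
                Ztraj[t] = Z
    xdot = np.array([f(x) for x in x0])      # dx0/dt along the cycle
    scale = np.mean(np.sum(Ztraj * xdot, axis=1))   # (1/T) int Z . x0dot dt
    return Ztraj / scale

def Gamma(Z, x0, p_func):
    """Equation C.7 on the sampling grid: the averaged effect of the
    perturbation for each relative phase shift between the two modules."""
    n = len(x0)
    return np.array([np.mean([Z[t] @ p_func(x0[t], x0[(t + s) % n])
                              for t in range(n)]) for s in range(n)])
```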
Acknowledgments

This research was partially supported by a Grant-in-Aid for Encouragement of Young Scientists (B) (No. 17700226) from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.

References

Bragin, A., Jandó, G., Nádasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Buzsáki, G., Horváth, Z., Urioste, R., Hetke, J., & Wise, K. (1992). High-frequency network oscillation in the hippocampus. Science, 256, 1025–1027.
Cunningham, M. O., Davies, C. H., Buhl, E. H., Kopell, N., & Whittington, M. A. (2003). Gamma oscillations induced by kainate receptor activation in the entorhinal cortex in vitro. J. Neurosci., 23, 9761–9769.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol., 29, 195–217.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comput., 13, 1285–1310.
Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40 Hz in the hippocampus in vitro. Nature, 394, 186–189.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133. Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci., 1, 11–38. Gray, C. M., & McCormick, D. A. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109–113. Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337. Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer. Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Trans. Neural Networks, 10, 499–507. Jagadeesh, B., Gray, C. M., & Ferster, D. (1992). Visually evoked oscillations of membrane potential in cells of cat visual cortex. Science, 257, 552–554. Jones, S. R., Mulloney, B., Kaper, T. J., & Kopell, N. (2003). Coordination of cellular pattern-generating circuits that control limb movements: The sources of stable differences in intersegmental phases. J. Neurosci., 23, 3457–3468. Kanamaru, T., & Sekine, M. (2004). An analysis of globally connected active rotators with excitatory and inhibitory connections having different time constants using the nonlinear Fokker-Planck equations. IEEE Trans. Neural Networks, 15, 1009– 1017. Kanamaru, T., & Sekine, M. (2005). Synchronized firings in the networks of class 1 excitable neurons with excitatory and inhibitory connections and their dependences on the forms of interactions. Neural Comput., 17, 1315–1338. Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer. Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662. Nicholls, J. G., Martin, A. R., Wallace, B. G., & Fuchs, P. A. (2001). From neuron to Brain. Sunderland, MA: Sinauer. Nomura, M., Fukai, T., & Aoyagi, T. (2003). Synchrony of fast-spiking interneurons interconnected by GABAergic and electrical synapses. Neural Comput., 15, 2179– 2198. Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical recipes in C. Combridege: Cambridge University Press. Reyes, A. D., & Fetz, E. E. (1993a). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672. Reyes, A. D., & Fetz, E. E. (1993b). Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol., 69, 1673–1683. Sato, Y. D., & Shiino, M. (2002). Spiking neuron models with excitatory or inhibitory synaptic couplings and synchronization phenomena. Phys. Rev. E, 66, 041903. Traub, R. D., Bibbig, A., LeBeau, F. E. N., Cunningham, M. O., & Whittington, M. A. (2005). Persistent gamma oscillations in superficial layers of rat auditory neocortex: Experiment and model. J. Physiolo., 562, 3–8.
Ts'o, D. Y., Gilbert, C. D., & Wiesel, T. N. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6, 1160–1170.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Received June 1, 2005; accepted September 14, 2005.
LETTER
Communicated by Gregor Schoener
A Sensorimotor Map: Modulating Lateral Interactions for Anticipation and Planning Marc Toussaint
[email protected] School of Informatics, University of Edinburgh, Edinburgh, Scotland, U.K.
Experimental studies of reasoning and planned behavior have provided evidence that nervous systems use internal models to perform predictive motor control, imagery, inference, and planning. Classical (model-free) reinforcement learning approaches omit such a model; standard sensorimotor models account for forward and backward functions of sensorimotor dependencies but do not provide a proper neural representation on which to realize planning. We propose a sensorimotor map to represent such an internal model. The map learns a state representation similar to self-organizing maps but is inherently coupled to sensor and motor signals. Motor activations modulate the lateral connection strengths and thereby induce anticipatory shifts of the activity peak on the sensorimotor map. This mechanism encodes a model of the change of stimuli depending on the current motor activities. The activation dynamics on the map are derived from neural field models. An additional dynamic process on the sensorimotor map (derived from dynamic programming) realizes planning and emits corresponding goal-directed motor sequences, for instance, to navigate through a maze.

1 Introduction

Köhler's (1917) studies with monkeys were one of the first systematic investigations of the capability of planned behavior in animals. In one of his classic experiments, monkeys had to reach for a banana mounted below the ceiling. After many attempts in vain, one of the monkeys eventually exhibited the behavior that Köhler found so fascinating: the monkey retreated and sat quietly in a corner for minutes, staring at the banana and at some point also staring at a nearby table. It started to saccade several times between the banana and the table while still sitting quietly. Then it suddenly rushed up, grabbed the table, pulled it below the banana, mounted it, and jumped to grab the banana. Reading these experiment scripts today, one realizes how little we know about the neural processes in the monkey's brain when Köhler read in its face the effort to reason about sequential behaviors to reach a goal. Classical (model-free) reinforcement learning approaches explicitly omit
Neural Computation 18, 1132–1155 (2006)
C 2006 Massachusetts Institute of Technology
internal models (Sutton & Barto, 1998; see also Majors & Richards, 1997). More recent studies in the cognitive sciences converge to the postulate that nervous systems use internal models to perform predictive motor control, imagery, and planning in a way that involves a simulation of actions and their perceptual implications (Grush, 2004). Based on experiments with humans, who were asked to imagine the way from a starting position in a maze to a goal position, Hesslow (2002) formulates three assumptions that may explain a simulation theory of cognitive functions: (1) behavior can be simulated by activating motor structures, as during an overt action, but suppressing its execution; (2) perception can be simulated by internal activation of sensory cortex, as during normal perception of external stimuli; and (3) both overt (executed) and covert (suppressed) actions can elicit perceptual simulation of their normal consequences. The evidence in favor of internal models and the hypotheses developed in cognitive science raise the challenge to propose concrete models of how neural systems are capable of these processes. Such systems must be able to anticipate the sensorial implications of motor activities, but they also must account for planned, goal-oriented behavior. The sensorimotor map we propose in this letter provides mechanisms to self-organize a representation of sensorimotor data that encodes the dependencies between motor activity and predictable changes of stimuli (see also Toussaint, 2004). The self-organization process largely adopts the classical approaches to self-organizing neural stimulus representations (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) and their extensions with respect to growing representations (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992; Fritzke, 1995; Bednar, Kelkar, & Miikkulainen, 2002) and temporal dependencies (Bishop, Hinton, & Strachan, 1997; Euliano & Principe, 1999; Somervuo, 1999; Wiemer, 2003; Varsta, 2002; Klemm & Alstrom, 2002). However, unlike previous self-organizing maps, our model couples sensor and motor signals in a joint representational layer. The activation dynamics on the sensorimotor map are adopted from dynamic field models of a homogeneous, laterally connected neural layer (Amari, 1977). In the language of neural fields, the anticipation of a new stimulus corresponds to a shift of the activity peak, which is induced by a modulation of the lateral connection strengths. A key ingredient of our model is that the modulation depends on the current motor activities. A motor representation is coupled to the neural field by modulating the lateral connectivity instead of connecting directly to the neural units. By this mechanism, different motor activities lead to different shifts of the peak. The coupling encodes all the information necessary for anticipating a stimulus change depending on the motor activations and also for planning goal-directed motor sequences. On the sensorimotor map, an additional dynamic process similar to spreading activation dynamics (Bagchi, Biswas, & Kawamura, 2000) accounts for planning. The same coupling to the motor
representation allows the system to emit motor excitations that execute the plan. The next section briefly recalls the relevant aspects of standard neural field dynamics. Section 3 gives an overview of the considered architecture. The sensorimotor map and how it couples to sensor and motor representations is introduced in section 4. Section 5 shows how the topology and parameters of the sensorimotor map can be learned online from data gathered during sensorimotor exploration. A demonstration of anticipation with the sensorimotor map is given in section 6, while section 7 introduces and demonstrates planning. In section 8, we briefly address possible extensions of the basic model before discussing related work in more detail in section 9. A discussion concludes.

2 Neural Field Dynamics

Amari (1977) investigated a spatially homogeneous neural field as an approximation of a dense layer of interconnected neurons. His main interest was in a theory of the dynamics of activity pattern formation on such substrates. The lateral connectivity is assumed to induce local excitation and widespread inhibition, as typically described with a Mexican hat–type interaction kernel. The most elementary interesting stable solution to such a dynamic system is the single peak solution (also called activity bump or packet), where the activity is localized and stabilized around a center while the widespread inhibition emitted from the peak inhibits any spontaneous activation in the neighborhood. This simple solution has some important functional properties: if the peak is induced by a stimulus, it stabilizes its representation against noise; it may even stabilize the representation when the stimulus vanishes or is temporally occluded; it fuses two nearby stimuli while implementing a competition between distal stimuli; and it exhibits some delay to shift the peak to a new position when the stimulus switches (hysteresis). These properties make the model appealing for sensory processing and decision making, as well as motor control, where the dynamics effectively allow the system to filter noisy signals, decide among conflicting signals, and stabilize such decisions (Erlhagen & Schöner, 2002). Consequently, neural fields also find application in motor control and robotics problems (Schöner & Dose, 1992; Schöner, Dose, & Engels, 1995; Iossifidis & Steinhage, 2001; Dahm, Bruckhoff, & Joublin, 1998; Bergener et al., 1999). We introduce here a discrete implementation of such a neural field, following Erlhagen and Schöner (2002). In this implementation, the activation m_i of a unit i (denoted by m to anticipate the meaning of motor activations) is governed by the dynamics

\tau_m \dot{m}_i = -m_i + h_m + A_i + \sum_j w_{ij} \phi(m_j) + [\xi \sim N(0, \rho_m)].   (2.1)
Here, τ_m is the timescale of the dynamics, h_m the resting level, A_i some feedforward input to unit i, ξ a gaussian noise term with variance ρ_m, and φ(m) a sigmoid. We choose

\phi(m) = \hat{m} = \begin{cases} 0 & m < 0 \\ m & 0 \le m \le 1 \\ 1 & m > 1 \end{cases}   (2.2)

as a simple parameterless, piecewise-linear sigmoid. The crucial term in these dynamics is the interaction strength w_ij between units i and j. In spatially homogeneous neural fields, this strength is usually assumed to depend on only the distance between the locations r_i and r_j of the two neurons. Namely, for short distances, the interaction is excitatory, while for longer distances, it is inhibitory:

w_{ij} = w_E \exp\!\left(\frac{-(r_i - r_j)^2}{2\sigma_E^2}\right) - w_I.   (2.3)
The parameters here are the strengths of excitation (w_E) and inhibition (w_I), and the width σ_E of the excitatory range. We generally omit indicating the time dependence of dynamic variables except when we need to refer to the time steps of the Euler integration m_i^{(t)} = m_i^{(t-1)} + \dot{m}_i^{(t)} that we use to simulate the dynamics.

3 Overview of the Sensorimotor Architecture

Figure 1 displays the sensorimotor architecture that we will use in the experiments. The architecture is composed of three layers. The bottom layer is an arbitrary sensor representation. In the experiments, the representation will comprise either 2 units for the x- and y-coordinates of a limb or 40 units encoding range sensor data from a maze. The top layer is the motor representation, which we choose to be a one-dimensional cyclic neural field. Different units in the field will encode different bearing directions of movements. The dynamics of these units are exactly as given in equation 2.1; the "distance" |r_i − r_j| between two units that determines the excitatory kernel in equation 2.3 is taken as the minimal distance on the circle, measured by how many units are between j and i. The central layer is the sensorimotor map governed by equation 4.1 given below. The key architectural feature is that the motor units project to lateral connections (ij) between two sensorimotor units j and i by multiplicatively modulating the signal transmission of that lateral connection. In contrast, sensor units project directly to sensorimotor units, as is typical for self-organizing maps.
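As a concrete illustration of the neural field dynamics of equations 2.1 to 2.3, the following is a minimal NumPy sketch (our own illustration, not the author's code). It uses the motor-layer parameters listed later in Table 1 and the minimal-distance-on-the-circle metric just described for the cyclic motor field; all function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(m):
    """Piecewise-linear sigmoid of equation 2.2: clips activations to [0, 1]."""
    return np.clip(m, 0.0, 1.0)

def cyclic_kernel(n, w_E=1.0, w_I=0.6, sigma_E=2.0):
    """Interaction strengths of equation 2.3 for a 1D cyclic field of n units.
    The distance is the minimal number of units between i and j on the circle."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d).astype(float)
    return w_E * np.exp(-d**2 / (2.0 * sigma_E**2)) - w_I

def field_step(m, A, W, tau_m=5.0, h_m=-1.0, rho_m=0.01):
    """One Euler step (unit step size) of the field dynamics, equation 2.1."""
    xi = rng.normal(0.0, np.sqrt(rho_m), size=m.shape)
    m_dot = (-m + h_m + A + W @ phi(m) + xi) / tau_m
    return m + m_dot

# Example: a single feedforward input bump settles into a localized peak
# stabilized by local excitation and widespread inhibition.
n = 20
W = cyclic_kernel(n)
m = np.full(n, -1.0)          # start at the resting level
A = np.zeros(n); A[5] = 1.0   # stimulus to unit 5
for _ in range(200):
    m = field_step(m, A, W)
```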
Figure 1: Schema of the considered architecture. (A) The bottom layer is a sensor representation, projecting to units of the sensorimotor map via gaussian kernels. The top layer is a motor representation that projects to lateral connections (ij) between sensorimotor units j and i. (B) This coupling induces a multiplicative modulation of the lateral interactions in the sensorimotor map, which depends on the current motor activations.
4 Modulating the Lateral Interactions

The core of the architecture is the sensorimotor map. Its activation dynamics are very similar to those of neural fields and read

\tau_x \dot{x}_i = -x_i + h_x + S_i + \eta \sum_j [M_{ij} w_{ij} - w_I] \phi(x_j) + [\xi \sim N(0, \rho_x)].   (4.1)
As for the neural field, the first term −x_i induces an exponential relaxation of the dynamics, the second term h_x is the resting level, and the third term S_i is a feedforward input from the sensor representation to unit i. We assume that the sensorial input is given as an (unnormalized) gaussian kernel,

S_i = \exp\!\left(\frac{-(s_i - s)^2}{2\sigma_S^2}\right),   (4.2)
that compares the input weight vector s_i (or codebook vector) of the unit i with the current sensor activations s. The fourth term describes the lateral interactions between units in the sensorimotor map. The lateral topology is not necessarily homogeneous but should reflect the topology of the state space and possible state transitions; it is given by the lateral weights w_ij. In this article, we assume that w_ij = 0 if there exists no connection and w_ij = 1 if there exists one (see Toussaint, 2004, and section 8 for a version where w_ij is continuous and learned with a temporal Hebb rule). The parameter w_I specifies the global inhibition. The crucial difference to a standard neural field is the modulation M_ij of the lateral interactions. This modulation is how motor signals couple into the sensorimotor map. More precisely, we assume that

M_{ij} = \langle m_{ij}, \hat{m} \rangle,   (4.3)

which is the scalar product of the weight vector m_ij and the current motor activations m̂. Thus, lateral interactions are modulated multiplicatively depending on the current motor activation. The weight vector m_ij, which is associated with every lateral connection (ij), could be thought of as the codebook vector of that connection. In a sense, lateral connections "respond" to certain motor activations. Due to the multiplicative coupling, a lateral connection contributes to lateral interaction only when the current motor activity "matches" the weight vector of this connection. Biologically plausible implementations of such modulation are, for example, pre- or postsynaptic inhibition of the signal transmission. In the case of presynaptic inhibition (Rose & Scott, 2003), synapses attach directly to the presynaptic terminal of other synapses, thereby modulating their transmission. In the case of postsynaptic inhibition (shunting inhibition), inhibitory synapses attach to branches of the dendritic tree near the soma, thereby modulating the transmission of the dendritic input accumulated at this dendritic branch (Abbott, 1991). Generally, modulation is a fundamental principle in biological neural systems (Phillips & Singer, 1997). The modulation may also be regarded as a special variant of sigma-pi neural networks (Mel, 1990; Mel & Koch, 1990).
5 Learning the Sensorimotor Map

The self-organization and learning of the sensorimotor map combines standard techniques from self-organizing maps (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) and their extensions with respect to growing representations (Carpenter et al., 1992; Fritzke, 1995) and the learning of temporal dependencies in lateral connections (Bishop et al., 1997; Wiemer, 2003). The free variables that need to be adapted are (1) the number of units in the map and their lateral connectivity and (2) the weight vectors s_i and m_ij coupling to the sensor and motor layers, respectively. Except for the adaptation of the motor coupling m_ij, all the adaptation mechanisms are standard, and we keep their description brief.

The topology. There already exist numerous techniques for the self-organization of representational maps, mostly based on the early work on self-organizing maps (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) or vector quantization techniques (Gersho & Gray, 1991). We prefer not to predetermine the state space topology but learn it, and hence adopt the technique of growing neural gas (Fritzke, 1995) to self-organize the lateral connectivity and that of fuzzy ARTMAPs (Carpenter et al., 1992) to account for the insertion of new units when the representation needs to be expanded. We detect novelty when the difference between the current stimulus s and the best matching weight vector s_i becomes too large. We make this criterion more robust against noise by using a low-pass filter (leaky integrator) of this representation error. More precisely, if i* is the unit with the best match, i* = argmax_i S_i, we integrate the error measure e_{i*} via τ_e ė_{i*} = −e_{i*} + (1 − S_{i*}). Note that S_{i*} = 1 ⟺ s_{i*} = s. Whenever this error measure exceeds a threshold ν ∈ [0, 1] termed vigilance, e_{i*} > ν, we generate a new unit j and reset the error measures, e_{i*} ← 0, e_j ← 0. Exactly as for growing neural gas, we add new lateral connections between i* and j* = argmax_{i≠i*} S_i if they were not already connected. To organize the deletion of lateral connections, we associate an "age" a_ij with every connection, which is increased at every time step by an amount of M_ij φ(x_j) and is reset to zero when i and j are the best and second-best matching units. If a connection's age exceeds a threshold a_max, the connection is deleted.

The sensor and motor coupling. Standard self-organizing maps adapt the input weight vectors s_i of a unit i in a Hebbian way such that s_i converges to the average stimulus for which i is the best matching unit. To avoid introducing additional learning parameters and to make the convergence more robust, we realize this with a weighted averaging,

s_i^{(T)} = \frac{1}{\sum_{t'=1}^{T} \alpha_i^{(t')}} \sum_{t=1}^{T} \alpha_i^{(t)} s^{(t)},   (5.1)

where α_i^{(t)} ∈ {0, 1} determines whether i is the best matching unit at time t. The averaging can efficiently be realized incrementally without additional parameters. We follow the same approach to adapt the motor coupling,

m_{ij}^{(T)} = \frac{1}{\sum_{t'=1}^{T} \alpha_{ij}^{(t')}} \sum_{t=1}^{T} \alpha_{ij}^{(t)} \hat{m}^{(t)}.   (5.2)
Here, the averaging weight α_{ij}^{(t)} ∈ {0, 1} is chosen such that m_ij learns the average motor signals that lead to an increasing postsynaptic and a decreasing presynaptic activity. In that way, m_ij learns which motor signals contribute, on average, to a transition from the stimulus s_j to a stimulus s_i. The simplest realization of this rule is

\alpha_{ij}^{(t)} = \begin{cases} 1 & \text{if } \dot{x}_i > 0 \text{ and } \dot{x}_j < 0 \\ 0 & \text{else.} \end{cases}   (5.3)
5.1 Experiments. All experiments will consider the problem of controlling a limb with position y ∈ [−1, 1]² in a two-dimensional plane or maze. In this experiment, the sensor representation is directly the 2D coordinate of this limb, that is, s = y (see section 8 for an example where the sensor representation is based on range measurements). The motor representation is given by 20 units, m̂ ∈ [0, 1]^{20}, which encode 20 different bearing directions ϕ_i ∈ {0°, 18°, . . . , 342°}. Activations of motor units directly lead to a limb movement with velocity ẏ according to the law

\begin{pmatrix} \dot{y}_1 \\ \dot{y}_2 \end{pmatrix} = \sum_{i=1}^{20} \hat{m}_i \begin{pmatrix} \cos\varphi_i \\ \sin\varphi_i \end{pmatrix}.   (5.4)
At the borders or walls of a maze, this law is violated such that ẏ_1 or ẏ_2 is set to zero when the border or wall would otherwise be crossed. In the first experiment, the limb performs random movements that are induced by explicitly coupling a random signal A_i into the motor layer (see equation 2.1). A random signal A_i is generated by randomly picking a motor unit i* and choosing A_{i*} = 1 while A_i = 0 for all i ≠ i*. The signal is not randomized at every time step; instead, at each time step, with a probability of .8 the signal remains unchanged, and with a probability of .2 a new i* is chosen. These movements generate the data, the sequences of sensor and motor signals m^{(t)} and s^{(t)}, from which the sensorimotor map learns the dependencies between motor signals and stimulus changes. Our choice of parameters for the dynamics of the sensorimotor map and motor layer is shown in Table 1. Those for adaptation are τ_e = 10, ν = .2, and a_max = 300. During the learning phase, the lateral coupling (which will induce anticipation) is switched off (η = 0). Figure 2A displays the topology of the sensorimotor map that has been learned for the 2D plane after various time steps. In all displays, the units are positioned according to their sensor weight vectors s_i. Concerning the topology, we basically reproduce the standard behavior of growing neural gas: in the early phase, the map grows as more and more regions are explored.
Table 1: Parameters.

                   τ     h    w_E   σ_E   w_I   ρ     η    σ_S
Sensorimotor map   2     0    –     –     .5    .01   0    .05
Motor layer        5    −1    1     2     .6    .01   –    –
In the late phase, unnecessary connections are deleted, leading to a Voronoi-like graph. Figures 2B and 2C are two different illustrations of the learned motor weight vectors m_ij. To compute these diagrams, we first associate an angle θ_ij = ∠(s_j − s_i) with every connection in the sensorimotor map. These angles θ_ij correspond to the true geometrical direction of a transition from j to i. Figure 2B displays tuning diagrams for 10 different motor units. For a given motor unit k, we consider all connections (ij) and draw a line with orientation θ_ij and length (m_ij)_k. The diagrams exhibit that motor units that represent a certain bearing ϕ_k have larger weights to connections with similar bearing θ_ij. The tuning curve 2C displays the same data in another way: for every motor unit k and connection (ij), the weight (m_ij)_k is plotted against the difference θ_ij − ϕ_k. Finally, Figure 2D displays the learning curve with respect to an error measure for the weight vectors m_ij: as every motor unit k corresponds to a bearing ϕ_k, every activation pattern m̂ over motor units corresponds to an average bearing ϕ(m̂) (cf. equation 5.4). The weight vectors m_ij are such activation patterns and thus correspond to average bearings ϕ(m_ij). The error measure is the absolute difference between this bearing ϕ(m_ij) and the geometrical direction θ_ij, averaged over all connections (ij). The graph shows that this error measure does not fully converge to zero. Indeed, most of this error is accumulated at the border of the region for an obvious reason: according to the "physics" we defined, a motor command that would diagonally cross a border leads to a movement parallel to the border instead of a full stop. Thus, at the borders, a whole variety of motor commands exists that all lead to the same movement parallel to the border. Connections between two units parallel to a border thus learn an average of motor commands that also includes diagonal motor commands.
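For concreteness, the exploration signal and the movement law of equation 5.4 used in these experiments could be realized as in the following sketch (our illustration; names and the random generator are assumptions).

```python
import numpy as np

rng = np.random.default_rng()
phis = np.deg2rad(np.arange(0, 360, 18))      # the 20 encoded bearings

def random_motor_input(A, p_keep=0.8):
    """Exploration signal of section 5.1: with probability .8 keep the active
    unit, otherwise activate a single new randomly chosen motor unit."""
    if A is None or rng.random() >= p_keep:
        A = np.zeros(len(phis))
        A[rng.integers(len(phis))] = 1.0
    return A

def limb_velocity(m_hat):
    """Movement law of equation 5.4: bearing vectors weighted by activations."""
    return np.array([np.sum(m_hat * np.cos(phis)),
                     np.sum(m_hat * np.sin(phis))])
```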
6 Anticipation

The sensorimotor map as introduced so far is sufficient for short-term anticipations. When the sensorimotor space is explored as previously with random movements, and given the map as learned in the previous example,
[Figure 2 appears here; the panels in row A show the learned map at t = 1000, 2000, 10000, and 100000.]
Figure 2: (A) The topology of the sensorimotor representation learned for the 2D region at different times. (B) Tuning diagrams for 10 of the 20 motor units (we display only every second unit to save space): for a motor unit k, lines with length (m_ij)_k and orientation θ_ij are drawn. (C) The tuning curve of motor units for all motor units and lateral connections: the weight (m_ij)_k is plotted against the difference of orientation of the motor unit (ϕ_k) and the connection (θ_ij). (D) The learning curve of an error measure for the difference in bearing represented by m_ij and θ_ij. Errors occur mostly at the borders. See section 5.1 for more details.
we may compare the actual current stimulus s to what the sensorimotor map currently represents,

\bar{s} = \frac{1}{\sum_i x_i} \sum_i x_i s_i.   (6.1)
We term the quantity s̄ the represented stimulus, which may in general differ from the true stimulus s; we term the difference Δs̄ = s̄ − s the representational shift. The approximate nature of the representation is one obvious source of representational shift: even when the lateral couplings are turned off (η = 0), there might be small shifts because the map is coarse-grained. In our case, most of such representation errors stem, again, from the borders. Since there exist no units to represent positions beyond a border and since the activations x_i typically have a gaussian-like shape over the units i, the represented stimulus s̄ for a stimulus s at the border will always have a slight inward shift of the order of σ. The results we give will omit this effect by discarding data from the border of the region. We collected data for three different strengths η ∈ {0, .2, .5} of lateral interaction. The two measures we discuss are the norm,

\mathrm{RSN} = |\Delta\bar{s}|,   (6.2)
of the representational shift and the directional match RSD of the representational shift with the true change in stimulus Δs^{(t)} = s^{(t+1)} − s^{(t)} that occurs due to the motor activations,

\mathrm{RSD} = \frac{\langle \Delta\bar{s}, \Delta s \rangle}{|\Delta\bar{s}|\,|\Delta s|} \in [-1, 1].   (6.3)
The results are displayed in Figure 3. All numbers are the averages (and standard deviations) over 2205 data points taken when the limb moves, in random directions as described previously, in the central area of the plane. For η = 0, we find that the norm of the representational shift (RSN = .015 ± .009) is, as expected, very small when compared to the kernel width σ = .05. The shift direction is not correlated to the true stimulus change (RSD = .0054 ± .7). Thus, for η = 0, the internally represented stimulus s¯ is fully dominated by the true stimulus s, and small representational shifts stem from the approximate nature of the representation. For η = .2 we find significantly larger shifts, RSN = .075 ± .042. More important, though, we find a strong correlation in the direction of the representational shift and the true future change of the stimulus, RSD = .89 ± .27. For η = .5, both effects are even stronger: RSN = .20 ± .17 and
Figure 3: The norm RSN of the representational shift and the correlation measure RSD between representational shift and the current change of stimulus, for different strengths η of the lateral coupling in the sensorimotor map. With nonzero lateral coupling, the represented stimulus is shifted in the same direction as the current true stimulus change.
RSD = .93 ± .21. For any η, the norm of the true stimulus change is |Δs| = 0.036 ± 0.017. The results clearly show that the representational shift Δs̄ encodes an anticipation of the true change of stimulus; that is, the represented stimulus s̄ is an anticipation of a future stimulus that will be perceived depending on the current motor activations. The motor modulation of the lateral interactions is able to direct the representational shift toward the direction that corresponds to the motor signals. This effect can be seen much better visually, by watching the recordings1 of the activations on the sensorimotor map and the dynamics of the two positions that correspond to s̄ and s (see also Figure 4). For η = 0, both s̄ and s move very coherently, almost always overlapping; only at the borders is there a systematic inward shift. For η = .2, the activity peak of the field x is always slightly ahead of the true stimulus; the represented position s̄ always runs ahead of the true limb position s. When the motor activations change, s̄ sweeps in front of s toward the new movement bearing. For η = .5, the situation becomes more dramatic. The lateral interactions become dominant, such that the field activations x actually run away from the true stimulus, traveling self-sustained in the direction of the current movement. This "wave" breaks down at the border of the sensorimotor map, and the activation peak is recreated at the current stimulus. Thus, the represented position s̄ travels quickly away from the true limb position s in the movement direction until it hits the border and restarts from s.
1 Access and watch the recordings online at www.marc-toussaint.net/projects.
Figure 4: Anticipation of future stimuli. (A) The forward excitation S_i, which encodes the true current stimulus s. The gray shading indicates the value of S_i ∈ [0, 1]; for better visibility, edges (ij) are shaded with the average value (S_i + S_j)/2. The black arrow indicates the direction encoded by the current motor activations. (B) The activation field x_i on the sensorimotor map. It exhibits a significant shift in the direction of movement, thus encoding an anticipation of future stimuli depending on the current motor activations. See also note 1.
7 The Dynamics of Planning

To organize goal-oriented behavior, we assume that, in parallel to the activation dynamics of x, there exists a second dynamic process that can be motivated from classical approaches to reinforcement learning (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). Recall the Bellman equation,

V^*_\pi(i) = \sum_a \pi(a|i) \sum_j P(j|i,a)\,[r(j) + \gamma V^*_\pi(j)],   (7.1)
yielded by the expectation V*(i) of the discounted future return R^{(t)} = \sum_{\tau=1}^{\infty} \gamma^{\tau-1} r^{(t+\tau)} (for which R^{(t)} = r^{(t+1)} + \gamma R^{(t+1)}). Here, i is a state index, and γ is the discount factor. We presumed that the received rewards r^{(t)} actually depend on only the state and thus enter equation 7.1 only in terms of the reward function r(i) (we neglect here that rewards may directly depend on the action). Behavior is described by a stochastic policy π(a|i), the probability of executing an action a in state i. Given the property 7.1 of V*, it is straightforward to define a recursion algorithm for an approximation V
of V* such that V converges to V*. This recursion algorithm is called value iteration (Sutton & Barto, 1998) and reads

\tau_v \Delta V_\pi(i) = -V_\pi(i) + \sum_a \pi(a|i) \sum_j P(j|i,a)\,[r(j) + \gamma V_\pi(j)],   (7.2)
with a "reciprocal learning rate" or time constant τ_v. Note that equation 7.1 is the fixed point equation of equation 7.2. Equation 7.2 provides an iterative scheme to compute the state-value function V based on only local information. The practical meaning of the state-value function is that it quantifies how desirable and promising it is to reach a state i, also accounting for future rewards to be expected. If rewards are given only at a single goal state, V has its maximum at this goal and is higher the more easily the goal can be reached from a given state. Thus, if the current state is i, it is a simple and efficient rule of behavior to choose an action a that will lead to the neighbor state j with maximal V(j) (the greedy policy). In that sense, V(i) provides a smooth gradient toward desirable goals. Note, though, that direct value iteration presumes that the state and action spaces are known and finite and that the current state and the world model P(j|i,a) are known. In transferring these classical ideas to our model, we assume that the system is given a goal stimulus g, that is, it is given the command to reach a state that corresponds to perceiving the stimulus g. Just as ordinary stimuli induce an input S_i to the field activations x_i, we let the goal stimulus induce a reward excitation,

R_i = \frac{1}{Z} \exp\!\left(\frac{-(s_i - g)^2}{2\sigma_R^2}\right),   (7.3)
for each unit i, where Z is chosen such that \sum_i R_i = 1. Besides the activations x_i, we introduce an additional field over the sensorimotor map, the value field v_i, which is in analogy to the state-value function V(i). The dynamics are

\tau_v \dot{v}_i = -v_i + R_i + \gamma \max_j (w_{ji} v_j),   (7.4)
and well comparable to equation 7.2. One difference is that v_i estimates the "current-plus-future" reward r^{(t)} + \gamma R^{(t)} rather than the future reward only. In the upper notation, this corresponds to the value iteration \tau_v \Delta V_\pi(i) = -V_\pi(i) + r(i) + \sum_a \pi(a|i) \sum_j P(j|i,a)\,\gamma V_\pi(j). As is commonly done for value iteration, we assumed π to be the greedy policy. More precisely, we considered only that action (i.e., that connection (ji)) that leads to the neighbor state j with maximal value w_{ji} v_j. In effect, the summations over a as well as over j can be replaced by a maximization over j.
Finally, we replaced the probability factor P(j|i,a) by w_{ji}. In practice, the value field will relax quickly to its fixed point v_i^* = R_i + \gamma \max_j (w_{ji} v_j^*) and stay there if the goal does not change. The quasi-stationary value field v_i together with the current (nonstationary) activations x_i allow the system to generate motor excitations that lead toward the goal. More precisely, the gradient v_j − v_i in the value field indicates how desirable motor activations m_{ji} are when the current "state" is i. Goal-directed motor excitations can thus be generated as a weighted average of the motor activations m_{ji} that have been learned for the connections,

A = \frac{1}{Z} \sum_{i,j} x_i\, w_{ji}\, (v_j - v_i)\, m_{ji},   (7.5)
where Z is chosen to normalize |A| = 1. These excitations enter the motor activation dynamics, equation 2.1. Hence, the signal flow between the sensorimotor map and the motor system is in both directions. In the anticipation process, the signals flow from the motor layer to the sensorimotor map: motor signals activate the corresponding connections and cause lateral, predictive excitations. In the action selection process, the signals are emitted from the sensorimotor map back to the motor layer to induce the motor excitations that should turn predictions into reality.

7.1 Experiments. To demonstrate the planning capabilities of the sensorimotor map, we consider a 2D maze. In the first phase, a sensorimotor map is learned that represents the specific maze environment, using random explorations as described in section 5. Figure 5A displays the topology of the learned sensorimotor map after 100,000 iterations, now with a kernel width σ_S = .01. In the planning phase, a goal stimulus is applied that corresponds to the position indicated by a triangle in Figure 5B. This goal stimulus induces reward excitations R_i on units that match the goal stimulus closely. The value field dynamics, equation 7.4, quickly relax to the fixed point, which is displayed in Figure 5C. The parameters we used are τ_v = 5, γ = .9, and σ_R = σ_S/4. As expected, the value field activations are high for units representing the proximity of the goal location and decay smoothly along the connectivity of the sensorimotor map. Note that this value field is not a decaying function of the Euclidean distance to the goal, but approximately a decaying function of the topological distance to the goal, that is, the shortest path length with respect to the learned topology. Figure 5B illustrates a trial where the limb is initially located in the upper-right corner of the maze. The activation field x_i represents this current location. Together with the gradient of the value field at the current location (see equation 7.5), motor excitations are induced that let the limb
Figure 5: Experiments with a maze. (A) The topology of the sensorimotor map learned. (B) The activation field x_i on the sensorimotor map at the start of the trajectory. (C) The attractor state of the value field on the sensorimotor map, spreading from the goal location in the lower right.
move toward the bottom left. As the limb moves, the sensorimotor activities x_i follow its current position, and new motor excitations are induced continuously, which leads to a movement trajectory ascending the gradient of the value field until the goal is reached. In the experiment, once the goal is reached, we switch the goal to a new random location, inducing new reward excitations R_i. The value dynamics, equation 7.4, respond quickly to this change and relax to a new fixed point, providing the gradient to the new goal. A standard quality measure for planning techniques is the required time to goal. Figure 6 displays the time intervals between switching the goals, which are the times needed to reach the new goal position from the previous goal position. First, we see that the goal is always reached in finite time, indicating that planning is always successful. Further, the graph compares the time to reach the goal with the length of the shortest path. This shortest path length was computed from a coarse (block-wise) discretization of the maze with dynamic programming. The clear correlation between the time to reach the goal and the shortest path length shows that the movement of the limb indeed follows a planned shortest-path trajectory from the initial position to the goal. Another indicator for successful action selection is whether the current movement is in the direction of the value gradient. Figure 7A displays the bearing of the local value gradient, ∠(\sum_{i,j} x_i (v_j − v_i)(s_j − s_i)), and the bearing of the current movement, ∠(ẏ), for the first 300 time steps of the experiment. We observe a clear correlation between both bearings, though with a systematic time delay. This time delay is approximately six time steps, as can be read from Figure 7B, and corresponds to the timescale of the motor dynamics, τ_m = 5. (See note 1 to access more recordings of planning experiments.)
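As a compact sketch of the planning machinery (our code, with assumed array conventions): equation 7.4 is relaxed to its fixed point by Euler integration, and equation 7.5 turns the value gradient into motor excitations.

```python
import numpy as np

def relax_value_field(R, W, gamma=0.9, tau_v=5.0, iters=500):
    """Euler-integrate the value dynamics of equation 7.4 toward the fixed
    point v_i* = R_i + gamma * max_j(w_ji v_j*). W[j, i] holds w_ji."""
    v = np.zeros_like(R)
    for _ in range(iters):
        v_dot = (-v + R + gamma * np.max(W * v[:, None], axis=0)) / tau_v
        v = v + v_dot
    return v

def motor_excitation(x, v, W, m_codebook):
    """Equation 7.5: average of the learned motor vectors m_ji, weighted by
    the current activations and the value gradient; normalized to |A| = 1.
    m_codebook[j, i] holds m_ji."""
    A = np.zeros(m_codebook.shape[-1])
    for i in range(len(x)):
        for j in range(len(x)):
            A += x[i] * W[j, i] * (v[j] - v[i]) * m_codebook[j, i]
    norm = np.linalg.norm(A)
    return A / norm if norm > 0 else A
```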
Figure 6: Times needed to move from a random start position to a random goal position in the maze when the sensorimotor map plans and controls the movement. The time is plotted against the length of the shortest path connection, which was computed from a coarse (block-wise) discretization of the maze with dynamic programming.
8 Extensions

We tried to keep the model introduced so far simple and focused on the key mechanisms. This basic model can be extended in many straightforward ways to realize other desired functionalities. For instance, the path generated by the sensorimotor map in the previous example (see Figure 5B) clearly hits the walls very often. This should be no surprise, since there is no mechanism of obstacle avoidance implicit in the model so far. But it is easy to apply a standard obstacle avoidance technique: given distance signals d_i ∈ [0, 1] from 20 range sensors (in the 20 different bearings ϕ_i) around the limb, an inhibition (e.g., proportional to (1 − d_i)³) can be directly coupled into the motor activations m_i (a sketch follows below). Figure 8A displays a trajectory generated with this obstacle avoidance switched on.

Perhaps more important is the question of whether the local lateral weights w_ij should be learned instead of fixed to 1 if a connection exists and 0 otherwise. In Toussaint (2004) we presented a learning scheme for these weights based on the temporal Hebb rule. One of the main reasons to consider the continuous plasticity of these lateral weights was that this allows the model to adapt to change. We decided to keep the simpler alternative where w_ij ∈ {0, 1}. The adaptability can also be achieved with the basic mechanism of adapting the topology to a change of the world, that is, by keeping on adding or deleting connections.
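The obstacle avoidance coupling mentioned above could look as follows; this is our own sketch, and both the gain k and the exact coupling point (added to the feedforward motor input A_i of equation 2.1) are assumptions.

```python
import numpy as np

def avoidance_inhibition(d, k=1.0):
    """Inhibition from range readings d_i in [0, 1] for the 20 bearings,
    proportional to (1 - d_i)^3 as suggested in the text (gain k assumed)."""
    return -k * (1.0 - np.asarray(d, dtype=float))**3
```

In use, the returned vector would simply be added to the motor units' feedforward input so that movement toward nearby obstacles is suppressed.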
Figure 7: (A) The movement direction ∠(ẏ) and the direction of the value gradient ∠(\sum_{i,j} x_i (v_j − v_i)(s_j − s_i)) for the first 300 time steps. (B) Similar to a convolution between both curves, we plot the average squared difference \sum_t \|h(t) − f(t + τ)\|^2 between both curves when one of them is shifted by a time delay τ. (We chose a norm \|\cdot\|^2 that accounts for the cyclic metric in angle space.) The typical time shift is τ* = 6.
Figure 8: Results of three different extensions of the sensorimotor map. (A) A trajectory with obstacle avoidance. (B) A trajectory from start S to goal G when the paths were blocked at A and B. (C) A sensorimotor map learned from range sensors. See section 8 for details.
Recall our rule for deleting connections. As for growing neural gas (Fritzke, 1995), we associate an "age" a_ij with every connection and delete a connection when its age exceeds a threshold a_max. The difference from Fritzke is that we increase all connections' ages by an amount of M_ij φ(x_j) at every time step. As a result, if, during execution of a planned trajectory, an anticipated transition to a new stimulus does not occur, then the connections that contribute to this anticipation (for which M_ij φ(x_j) is high) will eventually be deleted. This adaptation of the topology has a crucial influence on the dynamics of the value field. If all connections of a specific pathway are deleted, the attractor of the value field rearranges to guide a way around this blocked pathway.
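The aging rule can be written compactly; the following is our sketch, with the ages kept in a matrix aligned with the lateral weights.

```python
import numpy as np

def age_and_prune(age, M, phi_x, best, second, a_max=300):
    """Connection aging of sections 5 and 8: every connection (i, j) ages by
    M_ij * phi(x_j); the (best, second-best) pair is reset; connections whose
    age exceeds a_max are marked for deletion via the returned keep-mask."""
    age = age + M * phi_x[None, :]      # age[i, j] += M[i, j] * phi(x_j)
    age[best, second] = 0.0
    age[second, best] = 0.0
    return age, age <= a_max
```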
Figure 8B displays such an interesting trajectory, generated for a max = 10: the limb is initially located at S and the goal location is G. The system has learned a representation of the original maze, as given in Figure 5A. But the maze has now been modified by blocking the pathways at A and B. The system first tries to follow a direct path, which is blocked at A. When the limb hits this barrier and continuously activates the connections crossing A (in terms of Mi j φ(x j )), they are eventually deleted, which makes the value field rearrange and guide along another path crossing B. Now the limb hits the barrier at B and connections are deleted, which finally leads to a path that allows the limb to reach the goal. (See note 1 to access recordings of these experiments.) Finally, since the stimulus kernel (see equation 4.2) was chosen as a gaussian, the stimulus can be also given in representations other than directly as the location y of the limb. For instance, Figure 8C displays the sensorimotor map learned for the plane when the input was given as a 40-dimensional range vector (d1 , . . . , d40 ), each di ∈ [0, 1] for 40 different bearings. The only difference with the setup in section 5 was the choice of the kernel width: now we used σ S = 1. The learned topology is slightly more dense close to the borders. This stems from the fact that the range vector changes more dramatically close to a wall since the visual angle under which the wall is seen (and thus the number of entries of the range vector affected by the wall) varies more. Anticipation and planning equally work for this stimulus representation. However, the model is not sufficient to handle ambiguous (partially observable) stimuli. For example, in the maze, there exist many positions with very similar range sensor profile. The sensorimotor map learned of the maze with only range sensor data would lead to an incorrect topology (cf. section 10). 9 Discussion A key mechanism of the sensorimotor map is that motor activations modulate the lateral connection strengths and thereby induce anticipatory shifts of the activity peak on the sensorimotor map. This modulatory sensorimotor coupling encodes a model of the change of stimuli depending on the current motor activities. The mechanism attributes a specific role to the lateral connectivity, namely, motor-modulated anticipatory excitation, which differs significantly from previously proposed roles for lateral connections. However, we believe that the different views on the roles of lateral connections do not compete but complement each other; lateral connections may play different roles depending on the context and function of the respective neural representation. For instance, the role of lateral connections has been extensively discussed in the context of pure sensor representations, in particular, for the visual cortex (Miikkulainen, Bednar, Choe, & Sirosh, 2005). In such sensor representations, the function of lateral connections could be subsumed as either enforcing coherence or competition between laterally
connected units. The formation of topographic maps, columnar structure, or patterns of orientation-selective receptive fields can be explained on this basis (e.g., Bednar et al., 2002). Also the stabilization of noisy or occluded stimuli, or the disambiguation between contradicting stimuli, can be modeled, for example, with standard neural field dynamics involving local excitatory and global inhibitory lateral connections (Erlhagen & Schöner, 2002). In the context of temporal signal representations, the function of lateral connections is to induce specific temporal dynamics, learned, for example, with a temporal Hebb rule (spike-time dependent plasticity; Dayan & Abbott, 2001). Self-organized temporal map models (Euliano & Principe, 1999; Somervuo, 1999; Wiemer, 2003; Varsta, 2002; Klemm & Alstrom, 2002) can learn to anticipate stimuli, for example, when a stimulus B always follows a stimulus A. The role we attributed to lateral connections naturally differs from these models since we consider a sensorimotor representation where anticipation needs to depend on the current motor activities and for which we proposed the modulatory sensorimotor coupling. Long-term prediction, for example, path integration (see Etienne & Jeffery, 2004, for a review), is clearly related to the sensorimotor anticipation that we addressed with our model. Some existing models of path integration are based on one- or two-dimensional representational maps of position or head direction, and anticipation is realized by a motor-dependent translation of the activity pattern. For instance, in Hartmann and Wehner (1995) and Song and Wang (2005), the translational shift on a one-dimensional head direction representation is realized with two additional layers of inhibitory neurons—one for right and one for left movements—that are coupled to the motor system. Zhang (1996) achieves an exact translation of the activity pattern by coupling a derivative term in the dynamics, while Stringer, Rolls, Trappenberg, and Araujo (2002) induce translational shifts on a two-dimensional place field representation with a complex coupling of head direction units and forward velocity units into the lateral place field dynamics, in effect similar to our approach. None of these approaches addresses the problem of planning or the emission of motor signals based on the learned forward model. Although our model implements sensorimotor anticipation, it is in its current form limited with regard to the task of exact path integration: only the direction but not the magnitude of anticipatory shifts is guaranteed to be correlated with the true movement, as the experiments in section 6 demonstrate. However, future extensions of the model might solve this problem, for example, by a precise tuning of the parameter η that allows us to calibrate the magnitude of the anticipatory shift (see Figure 3). In the context of machine learning, predictive forward models are typically learned as a function, for example, with a neural network (Jordan & Rumelhart, 1992; Wolpert, Ghahramani, & Jordan, 1995; Ghahramani, Wolpert, & Jordan, 1997; Wolpert, Ghahramani, & Flanagan, 2001). It is assumed that a goal trajectory is readily available such that the learned
function allows them to compute the motor signals necessary to follow this trajectory. A representational map of the state space is not formed. In contrast, some model-based reinforcement learning systems have addressed the self-organization of state space representations (Kröse & Eecen, 1994; Zimmer, 1996; Appl, 2000), for example, by using discrete fuzzy representations (e.g., the Fuzzy-ARTMAPs; Carpenter et al., 1992). However, these approaches do not propose a direct coupling of motor activities into a sensorimotor representation to realize anticipation and planning by neural dynamics on this representation; instead, they use the learned representation as an input to separate, standard reinforcement learning architectures.

10 Conclusion

The sensorimotor map we describe in this letter proposes a new mechanism of how motor signals can jointly be coupled with sensor signals on a sensorimotor representation. The immediate function of this sensorimotor map and the proposed modulatory sensorimotor coupling is the anticipation of the change of stimuli depending on the current motor activity. Anticipation on its own is a fundamental ingredient of sensorimotor control, for example, to consolidate noisy sensorial information by fusing it with the temporal prediction or to bridge the inherent time lag of sensorial information. However, the ability to anticipate also provides the basic ingredient for planning. The forward model implicitly encoded by the sensorimotor map can be used to realize planning. We considered standard reinforcement learning techniques as a starting point and proposed a value dynamics on the sensorimotor map that performs basically the same computations as value iteration. For this to work in a neural systems framework, it is crucial that there exists a neural representation of the state space. The sensorimotor map provides such a representation. The self-organization and learning processes that develop the sensorimotor map do not set principled constraints on the type of sensor and motor representations coupled to the map. However, a more general problem was not solved and remains a limiting factor. Also in our model, the self-organization of the sensorimotor representation is mainly sensor driven. This leads to problems, as when different states induce the same stimulus (partial observability, stimulus ambiguity), since the current self-organization process will not be able to grow separate units for the same stimulus. The self-sustaining and anticipatory dynamics of the sensorimotor map are able to disambiguate such states depending on the temporal context. But a prerequisite is the existence of multiple units associated with the same stimulus. This leads us back to the challenge of understanding higher-level cognitive processes like internal simulation and planning, as mentioned in the context of Köhler's classic monkey experiments. The basic mechanisms of anticipation and planning proposed in this letter, in particular, the
action-dependent modulation of lateral interactions, might be transferable to such higher-level representations. An open question is how animals and humans are able to organize such higher-level abstract representations, which clearly are not purely sensor based but state abstractions that capture both the sensor context and its relevance for behavior.

Acknowledgments

I thank the German Research Foundation (DFG) for funding the Emmy Noether fellowship TO 409/1-1, allowing me to pursue this research.

References

Abbott, L. (1991). Realistic synaptic inputs for network models. Network: Computation in Neural Systems, 2, 245–258.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Appl, M. (2000). Model-based reinforcement learning in continuous environments. Unpublished doctoral dissertation, Institut für Informatik, Technische Universität München.
Bagchi, S., Biswas, G., & Kawamura, K. (2000). Task planning under uncertainty using a spreading activation network. IEEE Transactions on Systems, Man and Cybernetics A, 30, 639–650.
Bednar, J. A., Kelkar, A., & Miikkulainen, R. (2002). Modeling large cortical networks with growing self-organizing maps. Neurocomputing, 44–46, 315–321.
Bergener, T., Bruckhoff, C., Dahm, P., Janßen, H., Joublin, F., Menzner, R., Steinhage, A., & von Seelen, W. (1999). Complex behavior by means of dynamical systems for an anthropomorphic robot. Neural Networks, 12, 1087–1099.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Nashua, NH: Athena Scientific.
Bishop, C., Hinton, G., & Strachan, I. (1997). GTM through time. In Proc. of IEE Fifth Int. Conf. on Artificial Neural Networks (pp. 111–116). London: IEE.
Carpenter, G., Grossberg, S., Markuzon, N., Reynolds, J., & Rosen, D. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 5, 698–713.
Dahm, P., Bruckhoff, C., & Joublin, F. (1998). A neural field approach to robot motion control. In Proc. of 1998 IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC 1998) (pp. 3460–3465). New York: IEEE.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Erlhagen, W., & Schöner, G. (2002). Dynamic field theory of movement preparation. Psychological Review, 109, 545–572.
Etienne, A. S., & Jeffery, K. J. (2004). Path integration in mammals. Hippocampus, 14, 180–192.
Euliano, N., & Principe, J. (1999). A spatio-temporal memory based on SOMs with activity diffusion. In Workshop on Self-Organizing Maps (pp. 253–266). Helsinki.
Fritzke, B. (1995). A growing neural gas network learns topologies. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 625–632). Cambridge, MA: MIT Press.
Gersho, A., & Gray, R. (1991). Vector quantization and signal compression. Boston: Kluwer.
Ghahramani, Z., Wolpert, D. M., & Jordan, M. I. (1997). Computational models of sensorimotor integration. In P. Morasso & V. Sanguineti (Eds.), Self-organization, computational maps and motor control (pp. 117–147). Dordrecht: Elsevier.
Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioral and Brain Sciences, 27, 377–396.
Hartmann, G., & Wehner, R. (1995). The ant's path integration system: A neural architecture. Biological Cybernetics, 73, 483–497.
Hesslow, G. (2002). Conscious thought as simulation of behaviour and perception. Trends in Cognitive Sciences, 6, 242–247.
Iossifidis, I., & Steinhage, A. (2001). Controlling an 8 DOF manipulator by means of neural fields. In A. Halme, R. Chatila, & E. Prassler (Eds.), Int. Conf. on Field and Service Robotics (FSR 2001). Helsinki.
Jordan, M., & Rumelhart, D. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Klemm, K., & Alstrom, P. (2002). Emergence of memory. Europhysics Letters, 59, 662.
Köhler, W. (1917). Intelligenzprüfungen am Menschenaffen (3rd ed.). Berlin: Springer.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.
Kröse, B., & Eecen, M. (1994). A self-organizing representation of sensor space for mobile robot navigation. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS 1994) (pp. 9–14). New York: IEEE.
Majors, M., & Richards, R. (1997). Comparing model-free and model-based reinforcement learning (Tech. Rep. No. CUED/F-INENG/TR.286). Cambridge: Cambridge University Engineering Department.
Mel, B. W. (1990). The sigma-pi column: A model of associative learning in cerebral cortex (Tech. Rep. CNS Memo 6). Pasadena: Computation and Neural Systems Program, California Institute of Technology.
Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 474–481). San Mateo, CA: Morgan Kaufmann.
Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (2005). Computational maps in the visual cortex. Berlin: Springer.
Phillips, W., & Singer, W. (1997). In search of common foundations for cortical computation. Behavioral and Brain Sciences, 20, 657–722.
Rose, P., & Scott, S. (2003). Sensory-motor control: A long-awaited behavioral correlate of presynaptic inhibition. Nature Neuroscience, 12, 1309–1316.
Schöner, G., & Dose, M. (1992). A dynamical system approach to task level system integration used to plan and control autonomous vehicle motion. Robotics and Autonomous Systems, 10, 253–267.
Schöner, G., Dose, M., & Engels, C. (1995). Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems, 16, 213–245.
Somervuo, P. (1999). Time topology for the self-organizing map. In Proc. of Int. Joint Conf. on Neural Networks (IJCNN 1999) (pp. 1900–1905). New York: IEEE.
Song, P., & Wang, X.-J. (2005). Angular path integration by moving "hill of activity": A spiking neuron model without recurrent excitation of the head-direction system. J. Neuroscience, 25, 1002–1014.
Stringer, S. M., Rolls, E. T., Trappenberg, T. P., & Araujo, I. E. T. de. (2002). Self-organizing continuous attractor networks and path integration: Two-dimensional models of place cells. Network, 13, 429–446.
Sutton, R., & Barto, A. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Toussaint, M. (2004). Learning a world model and planning with a self-organizing, dynamic neural system. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (NIPS 2003) (pp. 929–936). Cambridge, MA: MIT Press.
Varsta, M. (2002). Self-organizing maps in sequence processing. Unpublished doctoral dissertation, Helsinki University of Technology.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100.
Wiemer, J. (2003). The time-organized map algorithm: Extending the self-organizing map to spatiotemporal signals. Neural Computation, 15, 1143–1171.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, B194, 431–445.
Wolpert, D., Ghahramani, Z., & Flanagan, J. (2001). Perspectives and problems in motor learning. Trends in Cognitive Science, 5, 487–494.
Wolpert, D., Ghahramani, Z., & Jordan, M. (1995). An internal model for sensorimotor integration. Science, 269, 1880–1882.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neuroscience, 16, 2112–2126.
Zimmer, U. (1996). Robust world-modelling and navigation in a real world. NeuroComputing, 13, 247–260.
Received November 9, 2004; accepted August 18, 2005.
LETTER
Communicated by Randall Beer
A Reflexive Neural Network for Dynamic Biped Walking Control

Tao Geng
[email protected] Department of Psychology, University of Stirling, Stirling, U.K.
Bernd Porr
[email protected] Department of Electronics and Electrical Engineering, University of Glasgow, Glasgow, U.K.
Florentin Wörgötter

[email protected] Department of Psychology, University of Stirling, Stirling, U.K., and Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany
Biped walking remains a difficult problem, and robot models can greatly facilitate our understanding of the underlying biomechanical principles as well as their neuronal control. The goal of this study is to specifically demonstrate that stable biped walking can be achieved by combining the physical properties of the walking robot with a small, reflex-based neuronal network governed mainly by local sensor signals. Building on earlier work (Taga, 1995; Cruse, Kindermann, Schumm, Dean, & Schmitz, 1998), this study shows that human-like gaits emerge without specific position or trajectory control and that the walker is able to compensate small disturbances through its own dynamical properties. The reflexive controller used here has the following characteristics, which are different from earlier approaches: (1) Control is mainly local. Hence, it uses only two signals (anterior extreme angle and ground contact), which operate at the interjoint level. All other signals operate only at single joints. (2) Neither position control nor trajectory tracking control is used. Instead, the approximate nature of the local reflexes on each joint allows the robot mechanics itself (e.g., its passive dynamics) to contribute substantially to the overall gait trajectory computation. (3) The motor control scheme used in the local reflexes of our robot is more straightforward and has more biological plausibility than that of other robots, because the outputs of the motor neurons in our reflexive controller are directly driving the motors of the joints rather than working as references for position or velocity control. As a consequence, the neural controller and the robot
mechanics are closely coupled as a neuromechanical system, and this study emphasizes that dynamically stable biped walking gaits emerge from the coupling between neural computation and physical computation. This is demonstrated by different walking experiments using a real robot as well as by a Poincaré map analysis applied on a model of the robot in order to assess its stability.

1 Introduction

There are two distinct schemes for leg coordination discussed in the literature on animal locomotion and on biologically inspired robotics: CPGs (central pattern generators) and reflexive controllers. It was found that motor neurons (and hence rhythmical movements) in many animals are driven by central networks of interneurons that generate the essential features of the motor pattern. However, sensory feedback signals also play a crucial role in such control systems by turning a stereotyped unstable pattern into the coordinated rhythm of natural movement (Reeve, 1999). These networks were referred to as CPGs. On the other hand, Cruse developed a reflexive controller model to understand the locomotion control of a slowly walking stick insect (Carausius morosus). In his model, reflexive mechanisms in each leg generate the step cycle of each individual leg. For interleg coordination, in accordance with observations in insects, he presented six mechanisms that can reestablish coordination in the case of minor disturbances (Cruse, Kindermann, Schumm, Dean, & Schmitz, 1998; Cruse & Warnecke, 1992). While neural systems modeled as CPGs or reflexive controllers explicitly or implicitly compute walking gaits, the mechanics also "compute" a large part of the walking movements (Lewis, 2001). This is called physical computation, namely, exploiting the system's physics, rather than explicit models, for global trajectory generation and control. One distinct example of physical computation in animal locomotion is the "preflex": the nonlinear, passive visco-elastic properties of the musculoskeletal system itself (Brown & Loeb, 1999). Due to the physical nature of the preflex, the system can respond rapidly to disturbances (Cham, Bailey, & Cutkosky, 2000). Thus, in all animals, locomotion control is shared between neural computation and physical computation. In this letter, we present our design of a novel reflexive neural controller that has been implemented on a planar biped robot. We will show how a dynamically stable biped walking gait emerges on our robot as a result of a combination of neural and physical computation. Several issues are addressed in this article that we believe are relevant for understanding biologically motivated walking control. Specifically, we will show that it is possible to design a walking robot with a very sparse set of input signals and with a controller that operates in an approximate and self-regulating way. Both aspects may be of importance in biological systems too, because they
allow for a much more limited structure of the neural network and reduce the complexity of the required information processing. Furthermore, in our robot, the controller is directly linked to the robot's motors (its "muscles"), leading to a more realistic, reflexive sensor-motor coupling than implemented in related approaches. These mechanisms allowed us for the first time to arrive at a dynamically stable artificial biped combining physical computation with a purely reflexive controller. The experimental part of this study is complemented by a dynamical model and the assessment of its stability using a Poincaré map approach.

Robot simulations have recently been criticized, raising the issue that complex systems like a walking robot cannot be fully simulated because of uncontrollable contingencies in the design and in the world in which the robot is embedded. This notion, known as the embodiment problem, has been discussed to a large extent in the robotics literature (Porr & Wörgötter, 2005; Ziemke, 2001). This issue reappears in our case, where we find that the simulations and their analysis do indeed match the experiments and raise confidence in the design, while stopping short of the rich detail of the real system.

This letter is organized as follows. First, we describe the mechanical design of our biped robot. Next, we present our neural model of a reflexive network for walking control. Then we demonstrate the results of several biped walking experiments and apply a Poincaré map analysis to the robot model. Finally, we compare our reflexive controller with other walking control mechanisms.

2 The Robot

Reflexive controllers such as Cruse's model involve no central processing unit that demands information on the real-time state of every limb and computes the global trajectory explicitly. Instead, the local reflexes of every limb require very little information concerning the state of the other limbs. Coordinated locomotion emerges from the interaction between the local reflexes and the ground. Such a distributed structure can thus immensely decrease the computational burden of the locomotion controller. Owing to these advantages, Cruse's reflexive controller and its variants have been implemented on several multilegged robots (Ferrell, 1995). In the case of biped robots, by contrast, although some of them also exploit some form of reflexive mechanism, their reflexes usually serve an auxiliary function or act as infrastructural units for other nonreflexive high-level or parallel controllers. For example, on a simulated 3D biped robot (Boone & Hodgins, 1997), specifically designed reflexive mechanisms were used to respond to two types of ground surface contact errors, slipping and tripping, while the robot's hopping height, forward velocity, and body attitude were separately controlled by three decoupled conventional controllers. On a real biped robot (Funabashi, Takeda, Itoh, & Higuchi, 2001), two prewired reflexes are
implemented to compensate for two distinct types of disturbances, representing an impulsive force and a continuous force, respectively. To date, no real biped robot has existed that depends exclusively on reflexive controllers for walking control. This may be because of the intrinsic instability specific to biped walking, which makes the dynamic stability of biped robots much more difficult to control than that of multilegged robots. After all, a purely local reflexive controller by itself involves no mechanisms to ensure the global stability of the biped.

While the controllers of biped walking robots generally require some kind of continuous position feedback for trajectory computation and stability control, the fast locomotion of some animals is largely self-stabilized by the passive, visco-elastic properties of their musculoskeletal system (Full & Tu, 1990). Not surprisingly, some robots can display a similar self-stabilization property (Iida & Pfeifer, 2004). Passive biped robots can walk down a shallow slope with no sensing, control, or actuation. However, compared with a powered biped, passive biped robots have obvious drawbacks, for example, the need to walk down a slope and the inability to control speed (Pratt, 2000). Some researchers have proposed equipping a passive biped with actuators to improve its performance. Van der Linde (1998) made a biped robot walk on level ground by pumping energy into a passive machine at each step. Nevertheless, no one has yet built a passive biped robot that has the capabilities of powered robots, such as walking at various speeds on various terrains (Pratt, 2000).

Passive biped robots are usually equipped with circular feet, which can increase the basin of attraction of stable walking gaits and make the motion of the stance leg look smoother. In contrast, powered biped robots typically use flat feet so that their ankles can effectively apply torque to propel the robot forward in the stance phase and to facilitate stability control. Although our robot is a powered biped, it has no actuated ankle joints, rendering its stability control even more difficult than that of other powered bipeds. Since we intended to exploit our robot's passive dynamics during some stages of its gait cycle, similar to the passive bipeds, its foot bottom also follows a curved form with a radius equal to the leg length.

As for the mechanical design of our robot, it is 23 cm high, foot to hip. It has four joints: left hip, right hip, left knee, and right knee. Each joint is driven by an RC servo motor. A hard mechanical stop is installed on the knee joints, preventing the knee joint from going into hyperextension, similar to the function of kneecaps on animals' legs. The built-in PWM (pulse width modulation) control circuits of the RC motors are disconnected, while each motor's built-in potentiometer is used to measure the joint angle; its output voltage is sent to a PC through a DA/AD board (USB DUX, www.linuxusb-daq.co.uk). Each foot is equipped with a modified piezo transducer (DN 0714071 from Farnell) to sense ground contact events. We constrain the robot to the sagittal plane by a boom. All three axes (pitch, roll, and
Figure 1: (A) The robot and (B) a schematic of the joint angles of one leg. (C) The structure of the boom. All three orthogonal axes (pitch, roll, and yaw) rotate freely, thus having no influence on the robot dynamics in its sagittal plane.
yaw) of the boom can rotate freely (see Figure 1C), thus having no influence on the dynamics of the robot in the sagittal plane. Note that the robot is not supported by the boom in the sagittal plane; in fact, it is always prone to trip and fall. The most important consideration in the mechanical design of our robot is the location of its center of mass. Its links are made of aluminium alloy, which is light yet strong enough. The motor of each hip joint is an
Figure 2: Illustration of a walking step of the robot.
HS-475HB from Hitec. It weighs 40 g and can output a torque of up to 5.5 kg·cm. Due to the effect of the mechanical stop, the motor of the knee joint bears a smaller torque than the hip joint in stance phases but must rotate quickly during swing phases for foot clearance. We use a PARK HPXF from Supertec on the knee joint, which is light (19 g) but fast (21 rad/s). About 70% of the robot's weight is concentrated in its trunk. The parts of the trunk are assembled in such a way that its center of mass is located as far forward as possible (see Figure 2).

The effect of this design is illustrated in Figure 2. As shown, one walking step has two stages: from A to B and from B to C. During the first stage, the robot has to use its own momentum to rise up on the stance leg. When walking at a low speed, the robot may not have enough momentum to do this, so the distance the center of mass has to cover in this stage should be as short as possible, which can be achieved by locating the center of mass of the trunk far forward. In the second stage, the robot falls forward naturally and catches itself on the next stance leg (see Figure 2). Then the walking cycle is repeated. The figure also clearly shows the movement of the curved foot of the stance leg. A stance phase begins with the heel touching the ground and terminates with the toe leaving the ground.

3 The Neural Structure of Our Reflexive Controller

The reflexive controller model of Cruse et al. (1998) and Cruse and Saavedra (1996) that has been applied on many robots can be roughly divided into two levels: the single-leg level and the interleg level. Figure 3 illustrates how Cruse's model creates a single-leg movement pattern. A protracting leg switches to retraction as soon as it attains the AEP (anterior extreme position).
Figure 3: Single-leg movement pattern of Cruse's reflexive controller model (Cruse et al., 1998).
A retracting leg switches to protraction when it attains the PEP (posterior extreme position). On the interleg level, six different mechanisms have been described (Cruse et al., 1998), which coordinate leg movements by modifying the AEP and PEP of a receiving leg according to the state of a sending leg.

Although Cruse's model, as a reflexive controller, is for hexapod locomotion, where the problem of interleg coordination is much more complex than in biped walking, we can still compare its mechanism for the generation of single-leg movement patterns with that of our reflexive controller. Cruse's model depends on PEP, AEP, and GC (ground contact) signals to generate the movement pattern of the individual legs, whereas the reflexive controller presented here uses only GC and AEA (the anterior extreme angle of the hip joints) to trigger switching between the stance and swing phases of each leg. The creation of the single-leg movement pattern in our model is illustrated in Figure 4. Figures 4A to 4E represent a series of snapshots of the robot configuration while it is walking. At the time of Figure 4B, the left foot (black) has just touched the ground. This event triggers four local joint reflexes at the same time: the flexor of the left hip, the extensor of the left knee, the extensor of the right hip, and the flexor of the right knee. At the time of Figure 4E, the right hip joint attains its AEA, which triggers only the extensor reflex of the right knee. When the right foot (gray) contacts the ground, a new walking cycle begins.
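To make this event-to-reflex wiring concrete, the following Python sketch restates the trigger logic just described as a plain lookup table. It is our illustration rather than the authors' implementation: all identifiers are ours, and the entries for a right-foot contact mirror the left-foot case described above by symmetry.

    # Trigger logic of the single-leg movement pattern (Figure 4), written as
    # a lookup table for illustration. In the robot itself this mapping is
    # realized by the synaptic connections of the circuit in Figure 5.

    # A ground contact event excites four local joint reflexes at once (Figure 4B).
    GROUND_CONTACT_TRIGGERS = {
        "left_foot": ("left_hip_flexor", "left_knee_extensor",
                      "right_hip_extensor", "right_knee_flexor"),
        "right_foot": ("right_hip_flexor", "right_knee_extensor",
                       "left_hip_extensor", "left_knee_flexor"),
    }

    # A hip reaching its anterior extreme angle (AEA) excites only the
    # extensor reflex of the knee on the same leg (Figure 4E).
    AEA_TRIGGERS = {
        "left_hip": ("left_knee_extensor",),
        "right_hip": ("right_knee_extensor",),
    }

    def triggered_reflexes(event, source):
        """Return the reflexes excited by a sensory event
        ('ground_contact' or 'aea')."""
        table = GROUND_CONTACT_TRIGGERS if event == "ground_contact" else AEA_TRIGGERS
        return table[source]

    # The left foot touching the ground triggers four reflexes simultaneously:
    print(triggered_reflexes("ground_contact", "left_foot"))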
Figure 4: Illustration of single-leg movement pattern generation.
Figure 5: The neuron model of the reflexive controller on our robot. Gray circles = sensor neurons or receptors; vertical ovals = interneurons; horizontal ovals = motor neurons. Synapses: black circle = excitatory, black triangle = inhibitory. See Table 1 for abbreviations.
Note that on the hip and knee joints, extensor means forward movement, while flexor means backward movement. The reflexive walking controller of our robot follows a hierarchical structure (see Figure 5). The bottom level is the reflex circuit local to the joints,
Table 1: Meaning of Some Abbreviations.

AL (AR)    Stretch receptor for the anterior angle of the left (right) hip
GL (GR)    Sensor neuron for ground contact of the left (right) foot
EI (FI)    Extensor (flexor) reflex interneuron
EM (FM)    Extensor (flexor) reflex motor neuron
ES (FS)    Extensor (flexor) reflex sensor neuron
including the motor neurons and angle sensor neurons involved in the joint reflexes. The top level is a distributed neural network consisting of the hip stretch receptors, the ground contact sensor neurons, and the interneurons of the reflexes. Neurons are modeled as nonspiking units, simulated on a Linux PC and coupled to the robot via the DA/AD board. Though somewhat simplified, they still retain some of the prominent neuronal characteristics.

3.1 Model Neuron Circuit of the Top Level. The joint coordination mechanism of the top level is implemented with the neuron circuit illustrated in Figure 5. Each of the ground contact sensor neurons has excitatory connections to the interneurons of the ipsilateral hip flexor and knee extensor as well as to those of the contralateral hip extensor and knee flexor. The stretch receptor of each hip has an excitatory connection to the ipsilateral interneuron of the knee extensor and an inhibitory connection to the ipsilateral interneuron of the knee flexor. Detailed models of the interneuron, stretch receptor, and ground contact sensor neuron are described in the following subsections.

3.1.1 Interneuron Model. The interneuron model is adapted from one used in the neural controller of a hexapod simulating insect locomotion (Beer & Chiel, 1992). The state of each model neuron is governed by equations 3.1 and 3.2 (Gallagher, Beer, Espenschied, & Quinn, 1996):
    τ_i dy_i/dt = −y_i + Σ_j ω_{i,j} u_j,    (3.1)

    u_j = (1 + e^{Θ_j − y_j})^{−1},    (3.2)
where y_i represents the mean membrane potential of the neuron. Equation 3.2 is a sigmoidal function that can be interpreted as the neuron's short-term average firing frequency, and Θ_j is a bias constant that controls the firing threshold. τ_i is a time constant associated with the passive properties of the cell membrane (Gallagher et al., 1996), and ω_{i,j} represents the connection strength from the jth neuron to the ith neuron.
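As an illustration of equations 3.1 and 3.2, the following minimal Python sketch integrates a small population of such neurons with a forward-Euler step. It is our sketch, not the authors' code; the weight and bias values are placeholders chosen in the spirit of Tables 2 and 3, and the 5 ms time constant is the one quoted in section 4.

    import numpy as np

    def neuron_step(y, w, theta, tau, dt):
        """One forward-Euler step of equations 3.1 and 3.2:
        tau_i dy_i/dt = -y_i + sum_j w[i, j] * u_j, with the sigmoidal
        output u_j = (1 + exp(theta_j - y_j))**-1 read as a short-term
        average firing rate."""
        u = 1.0 / (1.0 + np.exp(theta - y))    # equation 3.2
        y_next = y + dt * (-y + w @ u) / tau   # equation 3.1
        return y_next, u

    # Two mutually inhibiting interneurons (weight -10 as used on the robot,
    # bias 5 as in Table 2, tau = 5 ms as stated in section 4).
    w = np.array([[0.0, -10.0], [-10.0, 0.0]])
    theta = np.array([5.0, 5.0])
    y = np.array([1.0, -1.0])
    for _ in range(200):
        y, u = neuron_step(y, w, theta, tau=0.005, dt=0.001)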
3.1.2 Stretch Receptors. Stretch receptors play a crucial role in animal locomotion control. When the limb of an animal reaches an extreme position, its stretch receptor sends a signal to the controller, resetting the phase of the limbs. There is also evidence that phasic feedback from stretch receptors is essential for maintaining the frequency and duration of normal locomotive movements in some insects (Chiel & Beer, 1997). While other biologically inspired locomotive models and robots use two stretch receptors on each leg to signal the attainment of the leg's AEP and PEP, respectively, our robot has only one stretch receptor on each leg, signaling the AEA of its hip joint. Furthermore, the function of the stretch receptor on our robot is only to trigger the extensor reflex on the knee joint of the same leg, rather than to explicitly (in the case of CPG models) or implicitly (in the case of reflexive controllers) reset the phase relations between different legs. As a hip joint approaches the AEA, the outputs of the stretch receptors for the left (AL) and the right hip (AR) increase as
    ρ_AL = (1 + e^{α_AL(Θ_AL − φ)})^{−1},    (3.3)

    ρ_AR = (1 + e^{α_AR(Θ_AR − φ)})^{−1},    (3.4)
where φ is the real-time angular position of the hip joint, Θ_AL and Θ_AR are the hip anterior extreme angles, whose values are tuned by hand in an experiment, and α_AL and α_AR are positive constants. This model is inspired by a sensor neuron model presented in Wadden and Ekeberg (1998) that is thought capable of emulating the response characteristics of populations of sensor neurons in animals.

3.1.3 Ground Contact Sensor Neurons. The other kind of sensor neuron incorporated in the top level is the ground contact sensor neuron, which is active when the foot is in contact with the ground. Its output, similar to that of the stretch receptors, changes according to
    ρ_GL = (1 + e^{α_GL(Θ_GL − V_L + V_R)})^{−1},    (3.5)

    ρ_GR = (1 + e^{α_GR(Θ_GR − V_R + V_L)})^{−1},    (3.6)
where V_L and V_R are the output voltage signals from the piezo sensors of the left and right foot, respectively, Θ_GL and Θ_GR work as thresholds, and α_GL and α_GR are positive constants. While AEP and PEP signals account for the switching between stance phase and swing phase in other walking control structures, ground contact signals play a crucial role in the phase transition control of our reflexive controller.
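A minimal sketch of the two kinds of top-level sensor neurons (equations 3.3 to 3.6) may help make the sigmoidal threshold behavior concrete. The parameter values follow Table 3; the function names are ours.

    import math

    def stretch_receptor(phi, theta_A, alpha_A=4.0):
        """Equations 3.3/3.4: the output approaches 1 as the hip angle phi
        nears its anterior extreme angle theta_A."""
        return 1.0 / (1.0 + math.exp(alpha_A * (theta_A - phi)))

    def ground_contact_neuron(V_own, V_other, theta_G=2.0, alpha_G=4.0):
        """Equations 3.5/3.6: the output approaches 1 when the foot's own
        piezo voltage V_own exceeds the threshold; the other foot's voltage
        V_other counteracts it."""
        return 1.0 / (1.0 + math.exp(alpha_G * (theta_G - V_own + V_other)))

    print(stretch_receptor(phi=90.0, theta_A=115.0))   # far from the AEA: ~0
    print(stretch_receptor(phi=120.0, theta_A=115.0))  # past the AEA: ~1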
This emphasized role of the ground contact signal also has some biological plausibility. When held in a standing position on a firm, flat surface, a newborn baby will make stepping movements, alternating flexion and extension of each leg, in a pattern that looks like walking. This "stepping reflex" is elicited by the foot's touching a flat surface, and there is considerable evidence that the stepping reflex, though different from actual walking, eventually develops into independent walking (Yang, Stephens, & Vishram, 1998). Concerning its nonlinear dynamics, the biped model is hybrid in nature, containing continuous (in the swing and stance phases) and discrete (at the ground contact event) elements. Hurmuzlu (1993) applied discrete mapping techniques to study the stability of bipedal locomotion and found that the timing of ground contact events has a crucial effect on the stability of biped walking. Our preference for a ground contact signal instead of PEP or AEP signals also has other reasons. In PEP/AEP models, the movement pattern of a leg breaks down as soon as the AEP or PEP cannot be reached, which may happen as a consequence of an unexpected disturbance from the environment or due to intrinsic failure. This can be catastrophic for a biped, though tolerable for a hexapod with its high degree of redundancy.

3.2 Neural Circuit of the Bottom Level. In animals, a reflex is a local motor response to a local sensation, triggered in response to a suprathreshold stimulus. The quickest reflex in animals is the "monosynaptic reflex," in which the sensor neuron directly contacts the motor neuron. The bottom-level reflex system of our robot consists of reflexes local to each joint (see Figure 5). The neuron module for one reflex is composed of one angle sensor neuron and the motor neuron it contacts (see Figure 5). Each joint is equipped with two reflexes, an extensor reflex and a flexor reflex, both modeled as monosynaptic reflexes; that is, whenever its threshold is exceeded, the angle sensor neuron directly excites the corresponding motor neuron. This direct connection between angle sensor neuron and motor neuron is inspired by a reflex described in cockroach locomotion (Beer, Quinn, Chiel, & Ritzmann, 1997). In addition, the motor neurons of the local reflexes receive an excitatory synapse and an inhibitory synapse from the interneurons of the top level, by which the top level can modulate the bottom-level reflexes. Each joint has two angle sensor neurons: one for the extensor reflex and the other for the flexor reflex (see Figure 5). Their models are similar to that of the stretch receptors described above. The extensor angle sensor neuron changes its output according to
    ρ_ES = (1 + e^{α_ES(φ − Θ_ES)})^{−1},    (3.7)
where φ is the real-time angular position obtained from the potentiometer of the joint (see Figure 1B), Θ_ES is the threshold of the extensor reflex (see Figure 1B), and α_ES is a positive constant. Likewise, the output of the flexor sensor neuron is modeled as

    ρ_FS = (1 + e^{α_FS(Θ_FS − φ)})^{−1},    (3.8)
where Θ_FS and α_FS are defined analogously. It should be particularly noted that the thresholds of the sensor neurons in the reflex modules do not work as desired positions for joint control, because our reflexive controller does not involve any exact position control algorithm that would ensure that the joint positions converge to a desired value. In fact, as will be shown in the next section, the joints often pass these thresholds in the swing and stance phases and begin their passive movement thereafter. The sensor neurons involved in the local reflex module of each joint can affect the movements of only the joint they belong to, having no direct or indirect connection to other joints. This is different for the phasic feedback signal, the AEA, which works at the top level (i.e., the interjoint level), sensing the position of the hip joints and contacting the motor neurons of the knee joints.

The model of the motor neuron is the same as that of the interneurons presented in section 3.1.1. Note that on this robot, the output value of the motor neurons, after multiplication by a gain coefficient, is sent to the servo amplifier to directly drive the joint motor.1 In this way, the neural dynamics are directly coupled with the motor dynamics and, furthermore, with the biped walking dynamics. Thus, the robot and its neural controller constitute a closely coupled neuromechanical system. The voltage of a joint motor is determined by

    Motor voltage = M_AMP (g_EM u_EM + g_FM u_FM),    (3.9)
where M_AMP represents the magnitude of the servo amplifier, g_EM and g_FM are the output gains of the motor neurons of the extensor and flexor reflex, respectively, and u_EM and u_FM are the outputs of the motor neurons.

4 Robot Walking Experiments

The model neuron parameters, chosen jointly for all experiments, are listed in Tables 2 and 3. Only the thresholds of the sensor neurons and the output gains of the motor neurons are changed in the different experiments. The time constants τ_i of all neurons take the same value of 5 ms. The weights of all inhibitory connections are set to −10. The weights of all excitatory connections are 10, except those between interneurons and motor neurons, which are 0.1.
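Equation 3.9 is simple enough to state directly in code; the fragment below is our illustration, using the hip gains of Table 4. The sign convention of the gains depends on the motor wiring of each leg, as noted in section 4.1, and M_AMP here is a placeholder value.

    def motor_voltage(u_EM, u_FM, g_EM, g_FM, M_AMP=1.0):
        """Equation 3.9: the motor neuron outputs u_EM and u_FM drive the
        joint motor directly, scaled by the output gains and the servo
        amplifier magnitude M_AMP."""
        return M_AMP * (g_EM * u_EM + g_FM * u_FM)

    # A dominant extensor motor neuron produces a net positive hip voltage:
    print(motor_voltage(u_EM=0.9, u_FM=0.1, g_EM=2.0, g_FM=-2.0))  # 1.6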
1 While we use motors to drive the robot, animals use muscles for walking. Muscles have special properties that make them particularly suitable for walking behaviors, for example, the preflex, the nonlinear, passive visco-elastic property of the musculoskeletal system of animals (Brown & Loeb, 1999). Due to the physical nature of the preflex, the system can respond to disturbances rapidly. In the next stage of our work, we will build a Hill-type muscle model with RC motors. The motor neurons of our reflexive controller currently drive the motors directly; in the next stage, they will drive the muscle model directly, just as in animals.
Table 2: Parameters of Neurons for Hip and Knee Joints.

              Θ_EI   Θ_FI   Θ_EM   Θ_FM   α_ES   α_FS
Hip joints      5      5      5      5      4      1
Knee joints     5      5      5      5      4      4

Note: For the meaning of the subscripts, see Table 1.

Table 3: Parameters of Stretch Receptors and Ground Contact Sensor Neurons.

Θ_GL (V)   Θ_GR (V)   Θ_AL (deg)   Θ_AR (deg)   α_GL   α_GR   α_AL   α_AR
    2          2        = Θ_ES       = Θ_ES       4      4      4      4

Table 4: Specific Parameters for Walking Experiments.

              Θ_ES (deg)   Θ_FS (deg)   g_EM    g_FM
Hip joints        115           70       ±2      ±2
Knee joints       180          100      ±1.8    ±1.8
We encourage readers to watch the video clips of the robot walking experiments:

Walking fast on a flat floor: http://www.cn.stir.ac.uk/~tgeng/robot/walkingfast.mpg
Walking at a medium speed: http://www.cn.stir.ac.uk/~tgeng/robot/walkingmedium.mpg
Walking slowly: http://www.cn.stir.ac.uk/~tgeng/robot/walkingslow.mpg
Climbing a shallow slope: http://www.cn.stir.ac.uk/~tgeng/robot/climbingslope.mpg

These videos can be viewed online with Windows Media Player (www.microsoft.com) or RealPlayer (www.real.com).

4.1 Passive Movements of the Robot. In a walking experiment with the specific parameters given in Table 4, the passive part of the movement of the robot is seen most clearly. (The signs of g_EM and g_FM depend on the hardware configurations of the motors on the left and right legs.)
Figure 6: Real-time data of one hip joint. (A) Hip joint angle. (B) Motor voltage measured directly at the motor neurons of the hip joint. During some periods (the gray areas), the motor voltage remains zero, and the hip joint moves passively.
Figure 6 shows the motor voltage and the angular movement of one hip joint while the robot is walking. During more than half of every gait cycle, the hip joint is moving passively. As shown in Figure 7, during part of every gait cycle (the gray area in Figure 7), the motor voltages at the motor neurons of all four joints remain zero, so all joints move passively until the swing leg touches the ground (see Figure 8). During this time, which is roughly one-third of a gait cycle (see Figures 7 and 8), the movement of the whole robot is exclusively under the control of "physical computation," following its passive dynamics; no feedback-based active control acts on it. This demonstrates very clearly how neurons and mechanical properties work together to generate the whole gait trajectory. It is also analogous to what happens in animal locomotion, where muscle control usually exploits the natural dynamics of the limbs. For instance, during the swing phase of the human walking gait, the leg muscles first produce a brief burst of power to begin the leg swing and then remain limp throughout the rest of the swing phase, similar to what is shown in Figure 8. Note that in Figure 8 and the corresponding stick diagrams of the walking gait, we omitted the detailed movement of the curved foot in order to show the leg movements clearly. The point on which the stance leg stands is the orthographic projection of the midpoint of the foot, not its exact ground contact point.
Figure 7: Motor voltages of the four joints measured directly at the motor neurons, while the robot is walking. (A) Left hip. (B) Right hip. (C) Left knee. (D) Right knee. Note that during one period of every gait cycle (gray area), all four motor voltages remain zero, and all four joints (i.e., the whole robot) move passively (see Figure 8).
4.2 Walking at Different Speeds and a Perturbed Gait. The walking speed of the robot can be changed easily by adjusting only the thresholds of the reflex sensor neurons and the output gains of the motor neurons (see Table 5). Figures 9A and 9B show two phase plots of the hip and knee joint positions, recorded while the robot was walking at different speeds on a flat floor. Figure 9C shows a perturbed walking gait. The bulk of the trajectory represents the normal orbit of the walking gait, while the few outlying
Figure 8: (A) A series of sequential frames of a walking gait cycle. The interval between every two adjacent frames is 33 ms. Note that during the time between frame 10 and frame 15, which is nearly one-third of the length of a gait cycle (corresponding to the gray area in Figure 7), the robot is moving passively. At the time of frame 15, the swing leg touches the floor, and a new gait cycle begins. (B) Stick diagram of the gait drawn from the frames in A. The interval between any two consecutive snapshots is 67 ms.
trajectories are caused by external disturbances, induced by small obstacles such as thin books (less than 4% of the robot's size) obstructing the robot's path. After a disturbance, the trajectory soon returns to its normal orbit, demonstrating that the walking gaits are stable and to some degree robust against external disturbances. Here, robustness is defined as rapid convergence to a steady-state behavior despite unexpected perturbations (Lewis, 2001). When the neuron parameters are changed, as in the cases of fast and slow walking, the walking dynamics are implicitly drawn into a different gait cycle (see Figure 9). Figure 9D shows an experiment in which the neuron parameters are changed abruptly online while the robot is walking at a slow speed
Table 5: The Values of Neuron Parameters Chosen to Generate Different Speeds (see Figure 9).

                                              Θ_ES (deg)   Θ_FS (deg)   g_EM    g_FM
Low-speed walking (Figure 9A)    Hip joints       120           70      ±1.4    ±1.3
                                 Knee joints      180          100      ±1.5    ±1.5
High-speed walking (Figure 9B)   Hip joints       110           85      ±2.5    ±2.5
                                 Knee joints      180          100      ±1.8    ±1.8
Perturbed walking gait           Hip joints       115           90      ±2.5    ±2.5
  (Figure 9C)                    Knee joints      180          100      ±1.5    ±1.5
Figure 9: Phase diagrams of the hip joint position and knee joint position of one leg. Robot speed: (A) 28 cm/s; (B) 63 cm/s. (C) A perturbed walking gait. For the values of the neuron parameters chosen in these experiments, see Table 5. Note that the hip joint angle in these figures is an absolute value, not the angle relative to the robot body shown in Figure 1B. (D) The walking speed is changed online.
(33 cm/s, the big orbit). After a short transient stage (the outlying trajectories), the gait cycle of the robot automatically transfers into another stable, high-speed orbit (the small one, 57 cm/s). In other words, when the neuron parameters are changed, physical computation closely coupled
Figure 10: The robot is climbing a shallow slope. The interval between any two consecutive snapshots is 67 ms.
with neural computation can autonomously shift the system into another global trajectory that is also dynamically stable. This experiment shows that our biped robot, as a neuromechanical system, is stable over a relatively large domain of its neuron parameters. With other real-time biped walking controllers, based either on biologically inspired mechanisms (e.g., CPGs) or on conventional trajectory preplanning and tracking control, it remains a puzzling problem how to change walking speed on the fly without undermining dynamical stability. This experiment shows that the walking speed of our robot can be drastically changed (nearly doubled) on the fly while stability is retained thanks to physical computation.

4.3 Walking Up a Shallow Slope. Figure 10 is a stick diagram of the robot walking up a shallow slope of about 4 degrees; steeper slopes could not be mastered. In Figure 10, we can see that as the robot climbs the slope, its step length becomes smaller and the movement of its stance leg becomes slower (its stick snapshots become denser). Note that these adjustments of the gait take place autonomously due to the robot's physical properties (physical computation), not relying on any preplanned trajectory or precise control mechanism. This experiment demonstrates that such a closely coupled neuromechanical system can, to some degree, autonomously adapt to an unstructured terrain.

5 Stability Analysis of the Walking Gaits

5.1 Dynamic Model of the Robot. The dynamics of our robot are modeled as shown in Figure 11. With the Lagrange method, we obtain the equations that govern the motion of the robot, which can be written in the form

    D(q) q̈ + C(q, q̇) + G(q) = τ,    (5.1)
where q = [φ, θ_1, θ_2, ψ]^T is a vector describing the configuration of the robot
Figure 11: Model of the dynamics of our robot. Sizes and masses are the same as those of the real robot.
(for the definitions of φ, θ_1, θ_2, and ψ, see Figure 11). D(q) is the 4 × 4 inertia matrix, C(q, q̇) is the 4 × 1 vector of centripetal and Coriolis forces, and G(q) is the 4 × 1 vector representing the gravity forces. τ = [0, τ_1, τ_2, τ_3]^T, where τ_1, τ_2, and τ_3 are the torques applied to the stance hip (the hip joint of the stance leg in Figure 11), the swing hip, and the swing knee joints, respectively. Details of this equation can be found in the appendix. The dynamics of the DC motor (including gears) of each joint can be described by the following equations (here the hip of the stance leg is taken as an example; the other joints are modeled likewise):
    L_a di_a/dt + R_a i_a + n k_3 θ̇_1 = V_1,    (5.2)

    τ_1 + I_1 θ̈_1 + k_f θ̇_1 = n k_2 i_a,    (5.3)
where V_1 is the applied armature voltage of the stance hip motor, which is obtained from the output of the motor neurons according to equation 3.9; i_a is the armature current, L_a the armature inductance, and R_a the armature resistance. k_3 is the emf constant, and k_2 is the motor torque constant. I_1 is the combined moment of inertia of the stance-hip motor and gear train, referred to the gear output shaft. k_f is the viscous-friction coefficient of the combination of the motor and the gear, and n is the gear ratio. Considering that the electrical time constant of the motor is much smaller than the mechanical time constant of the robot, we neglect the dynamics of the electrical circuit of the motor, that is, we set di_a/dt = 0. Thus, equation 5.2 reduces to

    i_a = (V_1 − n k_3 θ̇_1) / R_a.    (5.4)
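To illustrate how equations 5.1, 5.3, and 5.4 combine into a simulable model, the following sketch computes the torque a motor delivers for a given voltage and then solves equation 5.1 for the joint accelerations. It is a schematic of the procedure only, written by us; all parameter values would have to come from the identified robot model.

    import numpy as np

    def motor_torque(V, dtheta, Ra, n, k2, k3, kf):
        """Equations 5.4 and 5.3 combined: with the armature inductance
        neglected, i_a = (V - n*k3*dtheta) / Ra, so the motor and gear
        deliver the torque n*k2*i_a - kf*dtheta to the link. (The reflected
        motor inertia I from equation 5.3 can be absorbed into the inertia
        matrix.)"""
        i_a = (V - n * k3 * dtheta) / Ra      # equation 5.4
        return n * k2 * i_a - kf * dtheta

    def accelerations(D, C, G, tau):
        """Equation 5.1 solved for the joint accelerations:
        q'' = D^{-1} (tau - C(q, q') - G(q))."""
        return np.linalg.solve(D, tau - C - G)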
Combining equations 5.1, 5.3, and 5.4, we obtain the dynamics model of the robot with the applied motor voltages as its control input. The heel strike at the end of a swing phase and the knee strike at the end of the knee extensor reflex are assumed to be inelastic impacts, which is in accordance with observations on our real robot and on existing passive biped robots. This assumption implies conservation of the angular momentum of the robot just before and after the strikes, with which the value of q̇ just after a strike can be computed from its value just before the strike. Because the transient double support phase is very short in our robot's walking (usually less than 40 ms), it is neglected in our simulation, as is often done in the analysis of other passive bipeds (Garcia, 1999).

5.2 Stability Analysis with Poincaré Maps. The method of Poincaré maps is usually employed for the stability analysis of cyclic movements of nonlinear dynamic systems such as passive bipeds (Garcia, 1999). Because our reflexive controller exploits natural dynamics for the robot's motion generation, rather than trajectory planning or tracking control, the Poincaré map approach can also be applied to the dynamics model of our robot together with the reflexive network as its controller. We choose the Poincaré section (Garcia, 1999) to lie right after the heel strike of the swing leg. Each cyclic walking gait is a limit cycle in the state space, corresponding to a fixed point on the Poincaré section. Fixed points can be found by solving for the roots of the mapping equation

    P(x_n) − x_n = 0,    (5.5)
where x_n = [q, q̇]^T = [φ, θ_1, θ_2, ψ, φ̇, θ̇_1, θ̇_2, ψ̇]^T is the state vector on the Poincaré section at the beginning of the nth gait cycle, and P(x_n) is the map function mapping x_n to x_n+1, which is built by combining the reflexive controller and the robot dynamics model described above.
Table 6: Fixed Parameters of the Knee Joints.

              Θ_ES,k (deg)   Θ_FS,k (deg)   G_M,k
Knee joints       180            110       0.9 G_M,h
Near a fixed point x*, the map function can be linearized as (Garcia, 1999)

    P(x* + x̂) ≈ P(x*) + J x̂,    (5.6)

where J is the 8 × 8 Jacobian matrix of the partial derivatives of P:

    J = ∂P/∂x.    (5.7)
For any fixed point, J can be obtained by numerically evaluating P eight times in a small neighborhood of the fixed point. According to equation 5.6, a small perturbation x̂_i to the limit cycle x* at the start of the ith step will grow or decay from the ith step to the (i + 1)th step approximately according to x̂_{i+1} ≈ J x̂_i. So if all eigenvalues of J lie within the unit circle, any small perturbation will decay to 0, and the perturbed walking gait will return to its limit cycle, which means the limit cycle is asymptotically stable (Garcia, 1999).

The movements of the knee joints are needed mainly for timely ground clearance and have little influence on the stability of the walking gait. Therefore, in the simulation analysis and the real experiments below, we set the knees' neuron parameters to fixed values (see Table 6) that ensure fast movements of the knee joints, preventing any possible scuffing of the swing leg. For simplicity, we also fix the threshold of the flexor sensor neurons of the hips (Θ_FS,h) to 85 degrees in the simulations and real experiments below. This does not damage the generality of the results, because similar results can be obtained provided that Θ_FS,h lies in the interval of 70 to 90 degrees; for values outside this range, the robot will either fall or produce very unnatural gaits. Thus, we only need to tune two parameters of the hip joints: the threshold of the extensor sensor neurons (Θ_ES,h) and the gain of the motor neurons of the hip joints (G_M,h), which together determine the gait properties. Θ_ES,h − Θ_FS,h roughly determines the stride length (not exactly, because the hip joint moves passively after passing Θ_ES,h), while G_M,h determines the amplitude of the voltage applied to the motors of the hip joints. Since these two parameters have such clear physical interpretations, their tuning is straightforward. For each setting of the controller parameters Θ_ES,h and G_M,h, we use a multidimensional Newton-Raphson method solving equation 5.5 to find the fixed point (Garcia, 1999).
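The numerical procedure just outlined is easy to state in code. The sketch below is ours and assumes a map function P is available (e.g., a simulation of one walking step from Poincaré section to Poincaré section); it approximates the Jacobian by finite differences, iterates Newton-Raphson on equation 5.5, and tests the eigenvalues of J. The toy linear map at the end merely stands in for the real robot model.

    import numpy as np

    def jacobian(P, x, eps=1e-6):
        """Finite-difference Jacobian of the Poincare map P at state x
        (equation 5.7); for the robot model, x is 8-dimensional."""
        n = len(x)
        J = np.zeros((n, n))
        for k in range(n):
            dx = np.zeros(n)
            dx[k] = eps
            J[:, k] = (P(x + dx) - P(x - dx)) / (2 * eps)
        return J

    def find_fixed_point(P, x0, tol=1e-10, max_iter=50):
        """Newton-Raphson on equation 5.5: solve P(x) - x = 0."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            F = P(x) - x
            if np.linalg.norm(F) < tol:
                break
            J = jacobian(P, x) - np.eye(len(x))   # Jacobian of P(x) - x
            x = x - np.linalg.solve(J, F)
        return x

    def is_stable(P, x_star):
        """Asymptotically stable if all eigenvalues of J lie strictly
        inside the unit circle."""
        return bool(np.all(np.abs(np.linalg.eigvals(jacobian(P, x_star))) < 1.0))

    # Toy stand-in for the robot's Poincare map: a contraction toward a point.
    A = np.diag([0.5, -0.3])
    b = np.array([0.2, 0.1])
    P = lambda x: A @ x + b
    x_star = find_fixed_point(P, np.zeros(2))
    print(x_star, is_stable(P, x_star))   # fixed point of the toy map; True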
Figure 12: Stable domain of the controller parameters Θ_ES,h and G_M,h. The big area enclosed by the outer curve represents the range, obtained in simulations, in which the fixed points are stable. The shaded area is the range of the two parameters in which stable gaits appear in experiments performed with the real robot. The maximum permitted value of G_M,h is 2.95 (higher values would destroy the motor of the hip joint). The two closed curves are a manual, continuous interpolation of the discrete boundaries obtained in the simulations and the real experiments, respectively.
Then we compute the Jacobian matrix J of the fixed point using the approach described above and evaluate the stability of the fixed point according to its eigenvalues. The result of this Poincaré map analysis is shown in Figure 12. We have found that asymptotically stable fixed points exist in a considerably large range of the controller parameters Θ_ES,h and G_M,h (see Figure 12). For comparison, Figure 12 also shows the stable range of these two parameters obtained in real robot experiments. For the real robot, because no definite stability criterion based on eigenvalues is applicable, we regard a walking gait as stable if the robot does not fall.

The best way to visualize the properties of a limit cycle is the phase plane, which can easily be obtained in the simulations but is not available for our real robot due to the lack of absolute position and speed sensors. Figure 13 shows two phase plane plots of the absolute angular position of one hip joint, φ (see Figure 11), and its derivative, φ̇. After being perturbed, the walking gait returns to its limit cycle within only a few steps, which is in accordance with the experimental results of the real robot presented in the last section.
Figure 13: Two limit cycles in the phase plane of φ and φ̇. (A) Corresponds to a fixed point found with the controller parameters Θ_ES,h = 125 deg, G_M,h = 2.8. (B) Corresponds to Θ_ES,h = 110 deg, G_M,h = 2.5.
Because some details of the robot dynamics, such as the uncertainties of the ground contact, the nonlinear friction in the joints, and the inevitable noise and lag of the sensors, are difficult, if not impossible, to model precisely, the results of the simulations and the real experiments are not exactly consistent (see Figure 12). However, the stability analysis and the experiments with our real robot have in general shown that our biped robot, under the control of the reflexive network, demonstrates stable walking gaits over a wide range of the critical controller parameters and that it returns to its normal orbit quickly after a disturbance.

6 Discussion and Comparison with Other Walking Control Mechanisms

6.1 Minimal Set of Phasic Feedbacks. The aim of locomotion control structures (modeled either with CPGs or with reflexive controllers) is to control the phase relations between limbs or joints, attaining a stable phase locking that leads to a stable gait. The locomotion controller therefore needs phasic feedback from the legs or joints. In the case of reflexive controllers like Cruse's model (Cruse et al., 1998), the phasic feedback signals sent to the controller are the AEP and PEP signals, which can provide sufficient information on the phase relations at least between adjacent legs. It is according to this information that the reflexive controller adjusts the PEP value of a leg, effectively changing the period of the leg and synchronizing it in or out of phase with its adjacent legs (Klavins, Komsuoglu, Full, & Koditschek, 2002). On the other hand, a CPG model, which can generate rhythmic movement patterns even without sensory feedback, must nonetheless be entrained by phasic feedback from the legs in order to achieve realistic locomotion gaits. In some animals, evidence exists that every limb involved
in cyclic locomotion has its own CPG (Delcomyn, 1980), and phasic feedback from the muscles is indispensable for keeping these CPGs in phase with the real-time movement of the limbs. Not surprisingly, CPG mechanisms used on various locomotive robots also require phasic feedback. Lewis, Etienne-Cummings, Hartmann, Xu, and Cohen (2003) implemented a CPG oscillator circuit to control a simple biped. AEP and PEP signals from its hip joints define the feedback to the CPG, resetting its oscillator circuit. Removal of the AEP or PEP signals caused quick deterioration of this biped's gait. On another, quadruped robot (Fukuoka, Kimura, & Cohen, 2003), continuous position signals of the hip joints, instead of discrete AEP and PEP signals, provide the feedback to the neural oscillators of the CPG. The neural oscillator parameters were tuned in such a way that the minimum and maximum of the hip positions reset the flexor and extensor oscillators, respectively; this scheme apparently functions identically to AEP/PEP feedback. In summary, because AEP and PEP provide sufficient information about the phase relations between legs, walking control structures usually depend on them (or their equivalents) as phasic feedback from the legs.

However, the top level of the reflexive controller on our robot requires only the AEA signals as phasic feedback. Furthermore, the AEA signal only triggers the extensor reflex on the knee joint of the same leg rather than triggering stance phases as in other robots. In this sense, the role (and the number) of the phasic feedback signals is much reduced in our reflexive controller. Despite the fact that the AEA signal is by itself not sufficient to control the phase relations between the legs, stable walking gaits appeared in our robot walking experiments (see section 4). This is because the reflexive controller and physical computation cooperate to accomplish the task of phasic walking gait control, showing that physical computation can help to simplify the controller structure.

As described above, CPGs have been successfully applied on some real-time quadruped, hexapod, and other multilegged robots. However, for biped walking control based on CPG models, most of the current studies are performed with computer simulations. To our knowledge, no one has successfully realized real-time dynamic biped walking using a CPG model as the sole controller, because the CPG model itself cannot ensure the stability of the biped gait. A rather well-known biped robot controlled by a CPG chip has been developed by Lewis et al. (2003). Its walking and running gaits look very nice, though on a treadmill instead of on a floor. But this biped robot has a critical weakness in that its hips are fixed on a boom (not rotating freely around the boom axes as in our robot), so it is actually supported by the boom. The boom greatly facilitates its control, avoiding the most difficult problem, that of the dynamic stability control specific to biped robots. Thus, this robot is not a dynamic biped in the true sense; rather, it is more nearly equivalent to one pair of legs of a multilegged robot. Using computer simulations, Taga (1995) found that stable biped gaits can be generated by combining CPGs and human biomechanics. In animals,
a CPG is a neural structure that is much more complex than the local reflex in both anatomy and function. There is evidence that in mammalian and human locomotion, CPGs work on top of reflexes and take effect by modulating them. In evolution, simple monosynaptic reflexes must have appeared much earlier than the far more complex CPG structures. With simulation analysis as well as real-system experiments, the current study has shown that local neuronal reflexes connected by a simple sensor-driven network are sufficient as a controller for dynamic biped walking, the most difficult form of legged locomotion in view of dynamic stability.

6.2 Physical Computation and Approximation. In contrast to exact representations and world models, physical computation often implies approximation, and approximation in the control mechanism leaves more room for physical computation. While conventional robots rely on precise trajectory planning and tracking control, biologically inspired robots rarely use preplanned or explicitly computed trajectories. Instead, they compute their movements approximately by exploiting the physical properties of themselves and the world, thus avoiding the accurate calibration and modeling required by conventional robotics. But in order to achieve a real-time walking gait in the real world, even these biologically inspired robots often have to depend on some kind of position or velocity control of their joints. For example, on a hexapod simulating the distributed locomotion control of insects (Beer et al., 1997), the outputs of the motor neurons were integrated to produce a trajectory of joint positions that was tracked using proportional feedback position control. On a quadruped built by Kimura's group that implemented CPGs (neural oscillators) and local reflexes, all joints are PD controlled to move to their desired angles (Fukuoka et al., 2003). Even on a half-passive biped controlled by a CPG chip, position control acted on the hip joints, though the passive dynamics of the knee joints was exploited for physical computation (Lewis, 2001).

The principle of approximation embodied in the reflexive controller of our robot goes one step further, in the sense that no position or velocity control whatsoever is implemented on our robot. The neural structure of our reflexive controller does not depend on, or ensure the tracking of, any desired position. Indeed, it is this approximate nature of our reflexive controller that allows the physical properties of the robot itself, especially its passive dynamics (see Figure 8), to contribute implicitly to the generation of the overall gait trajectories, and that ensures its stability and robustness to some extent. Just as argued by Raibert and Hodgins (1993, p. 353), "Many researchers in neural motor control think of the nervous system as a source of commands that are issued to the body as direct orders. We believe that the mechanical system has a mind of its own, governed by the physical structure and the laws of physics. Rather than issuing commands,
the nervous system can only make suggestions which are reconciled with the physics of the system and the task."

7 Conclusions

In this article, we presented our design of a novel neuromechanical structure for reflexive walking control, together with several walking experiments performed with it. We demonstrated with a closely coupled neuromechanical system how physical computation can be exploited to generate a dynamically stable biped walking gait. The experiments of walking at different speeds and of climbing a shallow slope also showed that the coupled dynamics of this neuromechanical system are sufficient to induce an autonomous, albeit limited, adaptation of the gait.

While the biologically inspired model neurons used in our reflexive controller retain some properties of real neurons, they do not include one of the most significant features of neurons, synaptic plasticity. As has been observed in human and animal locomotion, while walking gait generation may be reflexive, the stability control of walking behavior has to be predictive. Although physical computation can ensure autonomy and stability to some extent, in order to obtain a stable walking gait over a wide parameter range, we will have to rely on the plasticity of the neural structure. In the near future, we will apply proactive learning to this neuromechanical system (Porr & Wörgötter, 2003). The basic idea is to use the waveform resulting from the ground contact sensors to predict, and thus avoid, possible instabilities of the next walking step.
Appendix

In the following, we list the terms of the equations of motion (equation 5.1) used in the simulation to build the Poincaré map function. For the definitions of l_1, l_2, l_3, l_4, l_5, φ, θ_1, θ_2, and ψ, see Figure 11. r is the radius of the curved foot, m_t is the mass of the trunk, m_h the mass of the thigh, m_k the mass of the shank with foot, and g the gravitational acceleration.

D_11 = −4 m_k r^2 cos(φ) − 2 m_k l_4 l_2 + 2 m_k r l_4 + 2 m_k l_1^2 + l_2^2
 + 2 m_t l_1 l_2 − 2 l_1 r − 2 m_t r l_2
 + 4 m_k r cos(φ) l_2 − 2 m_k r cos(φ) l_4 + 2 m_k l_4^2 + 2 m_t r cos(φ) l_2
 + 2 m_t l_2 l_5 cos(θ_1) − 2 m_t r l_5 cos(θ_1)
 + 2 m_t r cos(φ) l_1 + 2 m_t r l_5 cos(θ_1 − φ) + 2 m_t l_1 l_5 cos(θ_1)
 − 2 m_t r^2 cos(φ) + m_t l_5^2
 − 4 m_h r^2 cos(φ) − 2 m_h l_1 l_3 − 2 m_h l_2 l_3
 + 2 m_h r l_3 + 2 m_k l_1 l_2 − 2 m_k l_1 r − 4 m_k r l_2 + 4 m_k r^2
 + 2 m_k l_2^2 + 4 m_h l_1 l_2 − 4 m_h l_1 r − 4 m_h r l_2
 + 2 m_h l_1^2 + 4 m_h r^2 + 2 m_h l_2^2
 − 2 m_h r l_3 cos(−θ_2 + φ)
 − 2 m_h l_1 l_3 cos(−θ_1 + θ_2) − 2 m_h l_2 l_3 cos(−θ_1 + θ_2)
 + 2 m_h r l_3 cos(−θ_1 + θ_2) − 2 m_k r l_4 cos(−θ_2 + ψ)
 + 2 m_k r l_4 cos(−θ_1 + θ_2 + ψ + φ) − 2 m_k r l_1 cos(−θ_1 + θ_2 + φ)
 − 2 m_k l_1 l_4 cos(ψ) + 2 m_k l_4 l_2 cos(−θ_1 + θ_2 + ψ)
 + 2 m_k l_1 l_4 cos(−θ_1 + θ_2 + ψ)
 + 2 m_k r l_1 cos(−θ_1 + θ_2) + 4 m_h r cos(φ) l_2
 − 2 m_h r cos(φ) l_3 + 4 m_h r cos(φ) l_1
 − 2 m_k l_1^2 cos(−θ_1 + θ_2)
 + 2 m_h l_3^2 + m_t l_1^2 + 2 m_t r^2

D_12 = −m_t r l_5 cos(θ_1 − φ) + m_t r l_5 cos(θ_1) − m_t (l_1 + l_2) l_5 cos(θ_1)
 − m_h r l_3 cos(−θ_1 + θ_2 + φ)
 − m_h r l_3 cos(−θ_1 + θ_2) + m_h l_2 l_3 cos(−θ_1 + θ_2)
 + m_h l_1 l_3 cos(−θ_1 + θ_2) − m_h l_3^2
 + m_k r l_1 cos(−θ_1 + θ_2 + φ) − m_k r l_4 cos(−θ_1 + θ_2 + ψ + φ)
 + m_k l_2 l_1 cos(−θ_1 + θ_2) − m_k r l_1 cos(−θ_1 + θ_2)
 + 2 m_k l_1 l_4 cos(ψ) + m_k l_1^2 cos(−θ_1 + θ_2)
 + m_k l_1 l_4 cos(−θ_1 + θ_2 + ψ) − m_k l_4 l_2 cos(−θ_1 + θ_2 + ψ)
 + m_k r l_4 cos(−θ_1 + θ_2 + ψ) − m_k l_1^2 − m_k l_4^2

D_13 = −m_h r l_3 cos(−θ_1 + θ_2 + φ)
 + m_h r l_3 cos(−θ_1 + θ_2) − m_h l_2 l_3 cos(−θ_1 + θ_2)
 − m_h l_1 l_3 cos(−θ_1 + θ_2) + m_h l_3^2 + m_k l_1^2 − m_k r l_1 cos(−θ_1 + θ_2)
 + m_k r l_4 cos(−θ_1 + θ_2) + m_k l_2 l_1 cos(−θ_1 + θ_2)
 + m_k r l_1 cos(−θ_1 + θ_2)
 − 2 m_k l_1 l_4 cos(ψ) − m_k l_1^2 cos(−θ_1 + θ_2)
 + m_k l_1 l_4 cos(−θ_1 + θ_2 + ψ) + m_k l_4 l_2 cos(−θ_1 + θ_2 + ψ)
 − m_k r l_4 cos(−θ_1 + θ_2 + ψ) + m_k l_1^2 + m_k l_4^2

D_14 = m_k r l_4 cos(−θ_1 + θ_2 + ψ + φ) − m_k r l_4 cos(−θ_1 + θ_2 + ψ)
 + m_k l_4 (l_1 + l_2) cos(−θ_1 + θ_2 + ψ) − m_k l_1 l_4 cos(ψ) + m_k l_4^2

D_21 = D_12
D_22 = m_t l_5^2 + m_h l_3^2 + m_k l_1^2 − 2 m_k l_1 l_4 cos(ψ) + m_k l_4^2
D_23 = −m_h l_3^2 − m_k l_1^2 + 2 m_k l_1 l_4 cos(ψ) − m_k l_4^2
D_24 = m_k l_1 l_4 cos(ψ) − m_k l_4^2
D_31 = D_13
D_32 = D_23
D_33 = m_h l_3^2 + m_k l_1^2 − 2 m_k l_1 l_4 cos(ψ) + m_k l_4^2
D_34 = −m_k l_1 l_4 cos(ψ) + m_k l_4^2
D_41 = D_14
D_42 = D_24
D_43 = D_34
D_44 = m_k l_4^2

C_1 = 2 m_k sin(φ) φ̇^2 r^2 − 4 m_k sin(φ) φ̇^2 l_2 r − 2 m_t sin(φ) φ̇^2 (l_1 + l_2) r
 + 2 m_t l_5 sin(θ_1 − φ) φ̇^2 r − 2 m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ l_1
 + m_t l_5 sin(θ_1 − φ) θ̇_1^2 r + 2 m_t sin(φ) φ̇^2 r^2 + 4 m_h sin(φ) φ̇^2 r^2
 − 3 m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ r + 2 m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) r
 + 2 m_h l_3 sin(φ) φ̇^2 r + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1^2 r
 + 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ r
 + 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ) l_2
 + 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ) l_1
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ) l_2
 + 3 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ r
 − m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ) r
 + 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 cos(φ) r
 − 2 m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 cos(φ)
 − 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 cos(φ) l_2
 − 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) r
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ sin(φ) l_2
 − 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 − 2 m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ)
 + m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ)
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 sin(φ) l_1
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ sin(φ) r
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 sin(φ) l_1
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ sin(φ) l_1
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 sin(φ) l_2
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 l_1 sin(−θ_1 + θ_2 + φ)
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 sin(φ) r
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 sin(φ) l_1
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 sin(φ) r
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ sin(φ) r
 + 2 m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ)
 − m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ) l_2
 − m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ)
 + m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ) r
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 sin(φ) l_2
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ l_2
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ sin(φ) l_1
 − 2 m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ) r
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) θ̇_2
 − 2 m_k l_4 cos(−θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) φ̇
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ r
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 sin(φ) r
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ sin(φ) l_2
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 sin(φ) l_2
 − m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ)
 + m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ) r
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2^2 cos(φ) r
 − 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) r
 − m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ) l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2^2 cos(φ) l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2^2 cos(φ) r
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ r
 + 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2^2 r
 − 2 m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ r
 + 2 m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
 + 2 m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ) l_2
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ) l_2
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ) l_1
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ) r
 − 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 sin(φ) r
 + m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2^2 sin(φ) r
 − 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_1
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ) l_1
 + 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 cos(φ) r
 + 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) r
 − 2 m_h l_3 sin(−θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ) l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ) l_1
 − 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 cos(φ) l_2
 + 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
 + 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_1
 + 2 m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_1
 − 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ l_1
 − 4 m_h sin(φ) φ̇^2 l_2 r − 4 m_h sin(φ) φ̇^2 l_1 r
 + 2 m_h l_3 sin(−θ_1 + θ_2 + φ) φ̇^2 r
 − 2 m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ l_2
 + m_t l_5 cos(θ_1 − φ) θ̇_1^2 sin(φ) l_2
 + m_t l_5 cos(θ_1 − φ) θ̇_1^2 sin(φ) l_1
 − m_t l_5 cos(θ_1 − φ) θ̇_1^2 sin(φ) r
 + m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1^2 sin(φ) r
 − m_t l_5 sin(θ_1 − φ) θ̇_1^2 cos(φ) r
 − 2 m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) l_2
 + 2 m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ r
 + m_t l_5 sin(θ_1 − φ) θ̇_1^2 cos(φ) l_2
 + m_t l_5 sin(θ_1 − φ) θ̇_1^2 cos(φ) l_1
 − 2 m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) l_1
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 r
 − 3 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇^2 cos(φ) l_2
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ r
 − 3 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ r
 + 3 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ r
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) l_2
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 r
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ l_2
 − 2 m_k sin(φ) φ̇^2 l_1 r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1^2 r
 + 2 m_k l_1 sin(−θ_1 + θ_2 + φ) φ̇^2 r
 + 3 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ r
 − 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 θ̇_2 r
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ cos(φ) l_2
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ cos(φ) l_1
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ cos(φ) r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ cos(φ) l_1
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ l_1 cos(−θ_1 + θ_2 + φ)
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 cos(φ) l_2
 − 3 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2^2 r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 cos(φ) l_1
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ cos(φ) l_2
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ cos(φ) l_1
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 cos(φ) r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ cos(φ) r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 cos(φ) l_2
 + 4 m_k sin(φ) φ̇^2 r^2
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ cos(φ) l_1
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2^2 cos(φ) l_1
 + 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_2
 + 2 m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ)
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) φ̇^2 r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 cos(φ) l_2
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ cos(φ) l_2
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 cos(φ) r
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ l_1 cos(−θ_1 + θ_2 + φ)
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ cos(φ) r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ cos(φ) l_2
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1^2 cos(φ) r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2^2 cos(φ) l_2
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 θ̇_2 cos(φ) l_1
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ cos(φ) r
 + m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_2^2 cos(φ)
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇^2 r
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 sin(φ) r
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 sin(φ) l_2
 + 2 m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1^2 cos(φ) l_2
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ l_1
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ l_1
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) θ̇_1
 − 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ l_2
 − 2 m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ r
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ r
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 sin(φ) φ̇ l_2
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 sin(φ) φ̇ l_1
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 sin(φ) φ̇ r
 + 2 m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇^2 cos(φ) r
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) l_1
 + 2 m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ r
 − 2 m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ l_1 cos(−θ_1 + θ_2 + φ)

C_2 = −m_t l_5 sin(θ_1 − φ) φ̇^2 r
 + m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ l_1
 + m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ r − m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) r
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 + m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ)
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 l_1 sin(−θ_1 + θ_2 + φ)
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ l_2
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) θ̇_2
 + 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) φ̇
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ r
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) r
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_2
 − m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_1
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) r
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
 − m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_1
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_1
 − m_h l_3 sin(−θ_1 + θ_2 + φ) φ̇^2 r
 + m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ l_2
 + m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) l_2
 − m_t l_5 cos(θ_1 − φ) θ̇_1 sin(φ) φ̇ r
 + m_t l_5 sin(θ_1 − φ) θ̇_1 φ̇ cos(φ) l_1
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ r
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ r
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) l_2
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ l_2
 − m_k l_1 sin(−θ_1 + θ_2 + φ) φ̇^2 r
 − m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇^2 l_1 cos(−θ_1 + θ_2 + φ)
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 φ̇ cos(φ) r
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_2 ψ̇ l_1 cos(−θ_1 + θ_2 + φ)
 + m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ cos(φ) l_1
 − m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_2
 − m_k l_1^2 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ)
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) φ̇^2 r
 + 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 ψ̇ l_1 cos(−θ_1 + θ_2 + φ)
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) φ̇ cos(φ) r
 − m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) θ̇_1 φ̇ cos(φ) l_2
 − m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) r
 + m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ l_2
 + m_k l_4 cos(−θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ l_1
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) sin(φ) φ̇ l_1
 − 2 m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ l_1 sin(−θ_1 + θ_2 + φ) θ̇_1
 + m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ l_2
 + m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ r
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_1 sin(φ) φ̇ r
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) φ̇ l_1
 + m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) θ̇_2 sin(φ) φ̇ r
 − m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇
 + m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ cos(φ) l_1
 − m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ r
 + m_k l_1^2 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇
 − 2 m_k l_4 sin(−θ_1 + θ_2 + ψ + φ) ψ̇ φ̇ l_1 cos(−θ_1 + θ_2 + φ)

C_3 = m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_2 sin(φ) φ̇ r
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ r
 − m_k l_1 sin(−θ_1 + θ_2 + φ) φ̇ cos(φ) r
 − m_k l_1 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇^2 l_1 sin(−θ_1 + θ_2 + φ)
 − m_k l_4 cos(−θ_1 + θ_2 + ψ + φ) ψ̇ sin(φ) φ̇ r
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_2 φ̇ cos(φ) l_2
 − m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ r
 + m_k l_1 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_1
 + m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) r
 − m_h l_3 sin(−θ_1 + θ_2 + φ) θ̇_1 φ̇ cos(φ) l_2
 + m_h l_3 cos(−θ_1 + θ_2 + φ) θ̇_1 sin(φ) φ̇ l_2
1191
¨ otter ¨ T. Geng, B. Porr, and F. Worg
1192
˙ 1 +mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl ˙ 1 −mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl +mh l3 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φr ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φr ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φr −mk l4 sin (−θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l2 ˙ 2 −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl +mk l1 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ +mk l1 sin (−θ1 + θ2 + φ) θ˙2 φr −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l2 +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l1 +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) r +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) r ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψl ˙ −mk l1 sin (−θ1 + θ2 + φ) θ˙1 φr +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l1 +mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 +mk l1 2 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) −mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ 2 r ˙ 1 cos (−θ1 + θ2 + φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψl −mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ cos (φ) r +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l2 +mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r ˙ 2 −mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ 1 −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl ˙ 1 +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φl ˙ 1 sin (−θ1 + θ2 + φ) θ˙1 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 2 −mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ −mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φr
A Reflexive Neural Network for Dynamic Biped Walking Control
˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φr ˙ 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ 1 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φr +mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l1 ˙ +mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr −mk l1 2 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φ˙ ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φl C4 = −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ 2 ˙ 1 ψ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ 2 ψ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ ψ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr −mk l4 cos (−θ1 + θ2 + ψ + φ) l1 sin (−θ1 + θ2 + φ) φ˙ ψ˙ ˙ θ˙1 +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr ˙ ψ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr +mk l4 cos (−θ1 + θ2 + ψ + φ) l1 sin (−θ1 + θ2 + φ) θ˙1 ψ˙ ˙ 1 ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl +mk l4 sin (−θ1 + θ2 + ψ + φ) l1 cos (−θ1 + θ2 + φ) φ˙ ψ˙ ˙ 2 θ˙1 −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ θ˙2 −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr ˙ 1 θ˙2 +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ 2 θ˙2 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 2 θ˙1 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ θ˙2 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr ˙ θ˙1 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr ˙ 1 θ˙2 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 2 ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 1 θ˙1 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl
1193
¨ otter ¨ T. Geng, B. Porr, and F. Worg
1194
+mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ θ˙1 −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ θ˙2 G 1 = mt g sin (φ) r − mt g sin (φ) l2 −mt g sin (φ) l1 + mt gl5 sin (θ1 − φ) +2mh g sin (φ) r − 2mh g sin (φ) l2 −2mh g sin (φ) l1 +mh gl3 sin (φ) +mh gl3 sin (−θ1 + θ2 + φ) +2mk g sin (φ) r + mk g sin (φ) l4 −2mk g sin (φ) l2 −mk g sin (φ) l1 + mk gl1 sin (−θ1 + θ2 + φ) −mk gl4 sin (−θ1 + θ2 + ψ + φ) G 2 = −mt gl5 sin (θ1 − φ) −mh gl3 sin (−θ1 + θ2 + φ) −mk gl1 sin (−θ1 + θ2 + φ) +mk gl4 sin (−θ1 + θ2 + ψ + φ) G 3 = mh gl3 sin (−θ1 + θ2 + φ) +mk gl1 sin (−θ1 + θ2 + φ) −mk gl4 sin (−θ1 + θ2 + ψ + φ) G 4 = −mk gl4 sin (−θ1 + θ2 + ψ + φ) Acknowledgments This work was supported by SHEFC grant INCITE to F. W. We thank Kevin Swingler for correction of the text. References Beer, R., & Chiel, H. (1992). A distributed neural network for hexapod robot locomotion. Neural Computation, 4, 356–365. Beer, R., Quinn, R., Chiel, H., & Ritzmann, R. (1997). Biologically inspired approaches to robotics. Communications of the ACM, 40(3), 30–38.
A Reflexive Neural Network for Dynamic Biped Walking Control
1195
Boone, G., & Hodgins, J. (1997). Slipping and tripping reflexes for bipedal robots. Autonomous Robots, 4(3), 259–271. Brown, I., & Loeb, G. (1999). Biomechanics and neural control of movement. New York: Springer-Verlag. Cham, J., Bailey, S., & Cutkosky, M. (2000). Robust dynamic locomotion through feedforward-preflex interaction. In ASME IMECE Proceedings, Orlando, FL. Chiel, H., & Beer, R. (1997). The brain has a body: Adaptive behavior emerges from interactions of nervous system, body, and environment. Trends in Neuroscience, 20, 553–557. Cruse, H., Kindermann, T., Schumm, M., Dean, J., & Schmitz, J. (1998). Walknet—a biologically inspired network to control six-legged walking. Neural Networks, 11(7–8), 1435–1447. Cruse, H., & Saavedra, M. (1996). Curve walking in crayfish. Journal of Experimental Biology, 199, 1477–1482. Cruse, H., & Warnecke, H. (1992). Coordination of the legs of a slow-walking cat. Experimental Brain Research, 89, 147–156. Delcomyn, F. (1980). Neural basis of rhythmic behavior in animals. Science, 210, 492–498. Ferrell, C. (1995). A comparison of three insect-inspired locomotion controllers. Robotics and Autonomous Systems, 16, 135–159. Fukuoka, Y., Kimura, H., & Cohen, A. (2003). Adaptive dynamic walking of a quadruped robot on irregular terrain based on biological concepts. Int. J. of Robotics Research, 22, 187–202. Full, R. J., & Tu, M. S. (1990). Mechanics of six-legged runners. Journal of Experimental Biology, 148, 129–146. Funabashi, H., Takeda, Y., Itoh, S., & Higuchi, M. (2001). Disturbance compensating control of a biped walking machine based on reflex motions. JSME International Journal Series C—Mechanical Systems,Machine Elements and Manufacturing, 44, 724– 730. Gallagher, J., Beer, R., Espenschied, K., & Quinn, R. (1996). Application of evolved locomotion controllers to a hexapod robot. Robotics and Autonomous Systems, 19, 95–103. Garcia, M. (1999). Stability, scaling, and chaos in passive-dynamic gait models. Unpublished doctoral dissertation, Cornell University. Hurmuzlu, Y. (1993). Dynamics of bipedal gait; part II: Stability analysis of a planar five-link biped. ASME Journal of Applied Mechanics, 60(2), 337–343. Iida, F., & Pfeifer, R. (2004). Self-stabilization and behavioral diversity of embodied adaptive locomotion. Lecture Notes in Artificial Intelligence, 3139, 119–129. Klavins, E., Komsuoglu, H., Full, R., & Koditschek, D. (2002). Neurotechnology for biomimetic robots. Cambridge, MA: MIT Press. Lewis, M. (2001). Certain principles of biomorphic robots. Autonomous Robots, 11, 221–226. Lewis, M., Etienne-Cummings, R., Hartmann, M., Xu, Z., & Cohen, A. (2003). An in silico central pattern generator: Silicon oscillator, coupling, entrainment, and physical computation. Biological Cybernetics, 88, 137–151. ¨ otter, ¨ Porr, B., & Worg F. (2003). Isotropic sequence order learning. Neural Comp., 15, 831–864.
1196
¨ otter ¨ T. Geng, B. Porr, and F. Worg
¨ otter, ¨ Porr, B., & Worg F. (2005). Inside embodiment what means embodiment for radical constructivists? Kybernetes, 34, 105–117. Pratt, J. (2000). Exploiting inherent robustness and natural dynamics in the control of bipedal walking robots. Unpublished doctoral dissertation, Massachusetts Institute of Technology. Raibert, H., & Hodgins, J. (1993). Biological neural networks in invertebrate neuroethology and robotics. Orlando, FL: Academic Press. Reeve, R. (1999). Generating walking behaviors in legged robots. Unpublished doctoral dissertation, University of Edinburgh. Taga, G. (1995). A model of the neuro-musculo-skeletal systems for human locomotion. Biological Cybernetics, 73, 97–111. Van der Linde, R. Q. V. (1998). Active leg compliance for passive walking. In Proceedings of IEEE International Conference on Robotics and Automation. Piscataway, NJ: IEEE. Wadden, T., & Ekeberg, O. (1998). A neuro-mechanical model of legged locomotion: Single leg control. Biological Cybernetics, 79, 161–173. Yang, J., Stephens, M., & Vishram, R. (1998). Infant stepping: A method to study the sensory control of human walking. J. Physiol. (London), 507, 927–937. Ziemke, T. (2001). Are robots embodied? In First International Workshop on Epigenetic Robotics Modeling Cognitive Development in Robotic Systems (pp. 75–93).
Received December 20, 2004; accepted September 28, 2005.
LETTER
Communicated by Robert Kass
A Set Probability Technique for Detecting Relative Time Order Across Multiple Neurons Anne C. Smith
[email protected] Department of Anesthesiology and Pain Medicine, University of California at Davis, Davis, CA 95616, U.S.A.
Peter Smith
[email protected] Department of Mathematics, University of Keele, Keele, Staffordshire, ST5 5BG, U.K.
With the development of multielectrode recording techniques, it is possible to measure the cell firing patterns of multiple neurons simultaneously, generating a large quantity of data. Identification of the firing patterns within these large groups of cells is an important and a challenging problem in data analysis. Here, we consider the problem of measuring the significance of a repeat in the cell firing sequence across arbitrary numbers of cells. In particular, we consider the question, given a ranked order of cells numbered 1 to N, what is the probability that another sequence of length n contains j consecutive increasing elements? Assuming each element of the sequence is drawn with replacement from the numbers 1 through N, we derive a recursive formula for the probability of the sequence of length j or more. For n < 2 j, a closed-form solution is derived. For n ≥ 2 j, we obtain upper and lower bounds for these probabilities for various combinations of parameter values. These can be computed very quickly. For a typical case with small N ( 0 such that π(x∗ ) ≥ π(x) for all x ∈ whose distance from x∗ is less than ε, and if π(x∗ ) = π(x) implies x∗ = x, then x∗ is said to be a strict local solution. Note that the solutions of equation 3.1 remain the same if matrix A is replaced with A + kee , where k is an arbitrary constant. In addition, observe that maximizing a nonhomogeneous quadratic form x Qx + 2c x over is equivalent to solving equation 3.1 with A = Q + ec + ce (Bomze, 1998). A point x ∈ satisfies the Karush-Kuhn-Tucker (KKT) conditions for problem 3.1, that is, the first-order necessary conditions for local optimality, if there exist n + 1 real constants µ1 , . . . , µn and λ, with µi ≥ 0 for all i = 1 . . . n, such that: (Ax)i − λ + µi = 0 for all i = 1 . . . n, and n
xi µi = 0.
i=1
Note that since both xi and µi are nonnegative for all i = 1, . . . , n, the latter condition is equivalent to saying that i ∈ σ (x) implies µi = 0. Hence, the KKT conditions can be rewritten as (Ax)i
=λ
if i ∈ σ (x)
≤λ
if i ∈ / σ (x)
(3.2)
for some real constant λ. On the other hand, it is clear that λ = x Ax. A point x ∈ satisfying equation 3.2 will be called a KKT point throughout. The following easily proved results establish a first connection between standard quadratic programs and payoff-monotonic dynamics. Proposition 2. If x ∈ is a KKT point for equation 3.1, then it is a stationary point of any payoff-monotonic dynamics. If x ∈ int(), then the converse also holds. Proof. The proof is a straightforward consequence of proposition 1 and the KKT conditions 3.2.
1224
M. Pelillo and A. Torsello
Clearly, not all equilibria of payoff-monotonic dynamics correspond to KKT points of equation 3.1 (e.g., think about the vertices of ), but if they are approached by an interior trajectory, then this comes true. Proposition 3. Let x = limt→∞ x(t) be the limit point to a trajectory under any payoff-monotonic dynamics. If x(0) ∈ int(), then x is a KKT point for program 3.1. Proof. Since x is a limit point of a trajectory, it is a stationary point (see, ¨ 1970), and hence by proposition 1, πi (x) = π(x) for e.g., Bhatia & Szego, all i ∈ σ (x). Suppose now, to the contrary, that πi (x) > π(x) for some j ∈ / σ (x). Because of payoff monotonicity and stationarity of x, we have g j (x) > gi (x) = 0 for all i ∈ σ (x), and by continuity, there exists a neighborhood U of x such that g j (y) > 0 for all y ∈ U. Then, for a sufficiently large T, g j (x(t)) > 0 for all t ≥ T, and since x j (t) > 0 for all t (recall in fact that int() is invariant), we have x˙ j (t) > 0 for t ≥ T. This implies x j = limt→∞ x j (t) > 0, a contradiction. The next proposition, which will be useful later, provides another necessary condition for local solutions of equation 3.1, when the payoff matrix has a particular structure. Proposition 4. Let x ∈ be a stationary point of any payoff-monotonic dynamics, and suppose that the payoff matrix A is symmetric with positive diagonal entries, that is, a ii > 0 for all i = 1, . . . , n. Suppose that there exist i, j ∈ σ (x) such that a i j = 0. For 0 < δ ≤ x j let y(δ) = x + δ(ei − e j ) ∈ . Then y(δ) Ay(δ) > x Ax. Proof. From the symmetry of A, we have: y(δ) Ay(δ) = [x + δ(ei − e j )] A[x + δ(ei − e j )] = x Ax + 2δ(ei − e j ) Ax + δ 2 (ei − e j ) A(ei − e j ) = x Ax + 2δ[(Ax)i − (Ax) j ] + δ 2 (a ii − 2a i j + a j j ). But since x is a stationary point, we have (Ax)i = (Ax) j (recall in fact that i, j ∈ σ (x)), and by the hypothesis a i j = 0 we have y(δ) Ay(δ) = x Ax + δ 2 (a ii + a j j ) > x Ax, which proves the proposition.
Game Dynamics and Maximum Clique
1225
In an unpublished paper, Hofbauer (1995) showed that for symmetric payoff matrices, the population mean payoff x Ax is strictly increasing along the trajectories of any payoff-monotonic dynamics. This result generalizes the celebrated “fundamental theorem of natural selection” (Hofbauer & Sigmund, 1998; Weibull, 1995), whose original form traces back to R. A. Fisher (1930). Here, we provide a different proof, adapting a technique from Fudenberg and Levine (1998). Theorem 1. If the payoff matrix A is symmetric, then π(x) = x Ax is strictly increasing along any nonconstant trajectory of any payoff-monotonic dynamics. In other words, π(x(t)) ˙ ≥ 0 for all t, with equality if and only if x = x(t) is a stationary point. Proof. See Hofbauer (1995), or Pelillo (2002) for a different proof. Apart from the monotonicity result that provides a (strict) Lyapunov function for payoff-monotonic dynamics, the previous theorem also rules out complicated attractors like cycles, invariant tori, or even strange attractors. It also allows us to establish a strong connection between the stability properties of these dynamics and the solutions of equation 3.1. To this end, we need an auxiliary result. Lemma 1. Let x be a strict local solution of equation 3.1, and put S = σ (x). Then x is the only stationary point of any payoff-monotonic dynamics in int( S ). Moreover, x Ax > y Ay for all y ∈ S . Proof. Clearly, since x is a strict local solution of equation 3.1, it is a KKT point and hence is stationary under any payoff-monotonic dynamics by proposition 2. Suppose by contradiction that y ∈ int( S ) \ {x} is stationary too. Then it is easy to see that all points on the segment joining x and y, which is contained in int( S ) because of its convexity, consists entirely of stationary points. Hence, by theorem 1, π˙ = 0 on this segment, which means that π is constant, but this contradicts the hypothesis that x is a strict local solution of equation 3.1. Moreover, for a sufficiently small ε > 0 we have x Ax > [(1 − ε)x + εy] A[(1 − ε)x + εy] = (1 − ε)2 x Ax + 2ε(1 − ε)y Ax + ε 2 y Ay, but since x is stationary and σ (y) ⊆ σ (x), y Ax = x Ax, from which we readily obtain x Ax > y Ay. Theorem 2. A point x ∈ is a strict local solution of program 3.1 if and only if it is asymptotically stable under any payoff-monotonic dynamics.
1226
M. Pelillo and A. Torsello
Proof. If x is asymptotically stable, then there exists a neighborhood U of x in such that every trajectory starting in a point y ∈ U will converge to x. Then, by virtue of Theorem 1, we have y Ay > x Ax for all y ∈ U\{x}, which shows that x is a strict local solution for equation 3.1. On the other hand, suppose that x is a strict local solution of equation 3.1, and let S = σ (x). By lemma 1, the function V : int( S ) → R defined as V(y) = π(x) − π(y) is clearly nonnegative in int( S ), and it vanishes only when y = x. Furthermore, V˙ ≤ 0 by theorem 1 and, again from lemma 1, V˙ < 0 in int( S )\{x}, as x is the only stationary point in int( S ). This means that V is a strict Lyapunov function for any payoff-monotonic dynamics, ¨ 1970; Hirsch and hence x is asymptotically stable (see, e.g., Bhatia & Szego, & Smale, 1974). The results presented in this section show that continuous-time payoffmonotonic dynamics can be usefully employed to find (local) solutions of standard quadratic programs. In the rest of the article, we focus the discussion on a particular class of quadratic optimization problems that arise in conjunction with the maximum clique problem. 4 A Family of Quadratic Programs for Maximum Clique Let G = (V, E) be an undirected graph with no self-loops, where V = {1, . . . , n} is the set of vertices and E ⊆ V × V is the set of edges. The order of G is the number of its vertices, and its size is the number of edges. Two vertices i, j ∈ V are said to be adjacent if (i, j) ∈ E. The adjacency matrix of G is the n × n symmetric matrix AG = (a i j ) defined as follows: ai j =
1,
if (i, j) ∈ E,
0,
otherwise.
The degree of a vertex i ∈ V relative to a subset of vertices C, denoted by degC (i), is the number of vertices in C adjacent to it, that is, degC (i) =
ai j .
j∈C
Clearly, when C = V we obtain the standard degree notion, in which case we shall write deg(i) instead of degV (i). A subset C of vertices in G is called a clique if all its vertices are mutually adjacent; that is, for all i, j ∈ C, with i = j, we have (i, j) ∈ E. A clique is said to be maximal if it is not contained in any larger clique and maximum if it is the largest clique in the graph. The clique number, denoted by ω(G), is defined as the cardinality of the maximum clique. The maximum clique problem is to find a clique whose cardinality equals the clique number.
Game Dynamics and Maximum Clique
1227
In the mid-1960s, Motzkin and Straus (1965) established a remarkable connection between the maximum clique problem and the following standard quadratic program: maximize subject to
f (x) = x AG x x ∈ ⊂ Rn ,
(4.1)
where n is the order of G. Specifically, if x∗ is a global solution of equation 4.1, they proved that the clique number of G is related to f (x∗ ) by the following formula: ω(G) =
1 . 1 − f (x∗ )
(4.2)
Additionally, they showed that a subset of vertices C is a maximum clique of G if and only if its characteristic vector xC , which is the vector of defined as 1/|C|, if i ∈ C C xi = 0, otherwise, is a global maximizer of f on .4 Gibbons, Hearn, Pardalos, and Ramana (1997), and Pelillo and Jagota (1995), extended the Motzkin-Straus theorem by providing a characterization of maximal cliques in terms of local maximizers of f on . One drawback associated with the original Motzkin-Straus formulation, however, relates to the existence of “infeasible” solutions, that is, maximizers of f that are not in the form of characteristic vectors. Pelillo and Jagota (1995) have provided general characterizations of such solutions. To overcome this problem, consider the following family of standard quadratic programs: maximize
f α (x) = x (AG + α I )x
subject to
x ∈ ⊂ Rn ,
(4.3)
where α is an arbitrary real parameter and I is the identity matrix, which includes as special cases the original Motzkin-Straus program (see equation 4.1) and the regularized version proposed by Bomze (1997) (corresponding to the cases α = 0 and α = 12 , respectively).
4
In their original paper, Motzkin and Straus proved just the “only-if” part of this theorem. The converse direction is, however, a straightforward consequence of their result (Pelillo & Jagota, 1995).
1228
M. Pelillo and A. Torsello
Proposition 5. Let x be a KKT point for program 4.3 with α < 1, and let C = σ (x) be the support of x. If C is a clique of G, then it is a maximal clique. Proof. Suppose by contradiction that C is a nonmaximal clique. Hence, there exists j ∈ V\C such that (i, j) ∈ E for all i ∈ C. Since α < 1, we have for all i ∈ C: (AG x + αx)i = (AG x)i + αxi = 1 − (1 − α)xi < 1 = (AG x) j = (AG x + αx) j . But due to equation 3.2, this contradicts the hypothesis that x is a KKT point for equation 4.3. In general, however, the fact that a point x ∈ satisfies the KKT conditions does not imply that σ (x) is a clique of G. For instance, it is easy to show that if for a subset C we have degC (i) = k for all i ∈ C (i.e., C induces a k-regular subgraph), and degC (i) ≤ k for all i ∈ / C, then xC is a KKT point for equation 4.3 provided that α ≥ 0. The following theorem, which generalizes an earlier result by Bomze (1997), establishes a one-to-one correspondence between localglobal solutions of equation 4.3 and maximal-maximum cliques of G. By adapting the proof technique from Bomze (1997), it has also been proved previously in Bomze et al. (2002) using concepts and results from evolutionary game theory. Here we provide a different proof based on standard facts from optimization theory. Theorem 3. Let C be a subset of vertices of a graph G, and let xC be its characteristic vector. Then, for any 0 < α < 1, C is a maximal (maximum) clique of G if and only if xC is a local (global) solution of equation 4.3. Moreover, all solutions of the equation are strict and are characteristic vectors of maximal cliques of G. Proof. See Bomze et al. (2002) for a proof that requires several previous results from evolutionary game theory or appendix B for a self-contained proof which uses only basic concepts from optimization theory. Corollary 1. Let C be a subset of vertices of a graph G with xC as its characteristic vector, and let 0 < α < 1. Then C is a maximal clique of G if and only if xC is an asymptotically stable stationary point under any payoff-monotonic dynamics with payoff matrix A = AG + α I . Proof. The proof is obvious from theorems 2 and 3. These results naturally suggest any dynamics in the payoff-monotonic class as a useful heuristic for the maximum clique problem, and this will be the subject of the next section.
Game Dynamics and Maximum Clique
1229
5 Clique-Finding Payoff-Monotonic Dynamics Let G = (V, E) be a graph, and let AG denote its adjacency matrix. By using A = AG + α I
(0 < α < 1)
(5.1)
as the payoff matrix, any payoff-monotonic dynamics, starting from an arbitrary initial state, will eventually be attracted with probability one by the nearest asymptotically stable point, which, by virtue of corollary 1, will then correspond to a maximal clique of G. Clearly, in theory, there is no guarantee that the converged solution will be a global solution of equation 4.3 and therefore that it will yield a maximum clique in G. In practice, it is not unlikely, however, that the system converges toward a stationary point that is unstable, that is, a saddle of the Lyapunov function x Ax. This can be the case when the dynamics is started from the simplex barycenter and symmetry is not broken. Proposition 3 ensures, however, that the limit point of any interior trajectory will be at least a KKT point of program 4.3. The next proposition translates this fact in a different language. Proposition 6. Let x ∈ be the limit point of a trajectory of any payoff-monotonic dynamics starting in the interior of . Then either σ (x) is a maximal clique or it is not a clique. Proof. The proof is obvious from propositions 3 and 5. The practical significance of the previous result reveals itself in large graphs: even if these are quite dense, cliques are usually much smaller than the graph itself. Now suppose we are returned a KKT point x. Then we put C = σ (x) and check whether C is a clique. This requires O(m2 ) steps if C contains m vertices, while checking whether this clique is maximal would require O(mn) steps and, as stressed above, usually m n. But proposition 6 now guarantees that the obtained clique C (if it is one) must automatically be maximal, and thus we are spared trying to add external vertices. 5.1 Experimental Results. To assess the ability of our payoff-monotonic models to extract large cliques, we performed extensive experimental evaluations on both random and DIMACS benchmark graphs.5 For our simulations we used discretized versions of the continuous-time linear (see equation 2.6) and exponential (see equation 2.7) replicator dynamics (see appendix A for a description of our discretizations). We started the processes from the simplex barycenter and stopped them when a maximal clique (i.e., a strict local maximizer of f ) was found. Occasionally, when
5
Data can be found online at http://dimacs.rutgers.edu.
1230
M. Pelillo and A. Torsello
the system converged to a nonclique KKT point, we randomly perturbed the solution and let the game dynamics start from this new point. Since unstable stationary points have a basin of attraction of measure zero around them, the process is pulled away from it with probability one, to converge, eventually, to another (hopefully asymptotically stable) stationary point. In an attempt to improve the quality of the final results, we used a mixed strategy as far as the regularization parameter α is concerned. Indeed, Bomze et al. (1997) showed that the original Motzkin-Straus formulation (i.e., α = 0), which is plagued with the presence of spurious solutions, usually yields slightly better results than its regularized version. Accordingly, we started the dynamics using α = 0 and, after convergence, we restarted it from the converged point using α = 12 . This way, we are guaranteed to avoid spurious solutions, thereby obtaining a maximal clique. In the first set of experiments we ran the algorithms over random graphs of order 100, 200, 300, 400, 500, and 1000 and with edge densities ranging from 0.25 to 0.95. For each order and density value, 100 different graphs were generated. Table 1 shows the results obtained in terms of clique size. Here, n refers to the graph order, ρ is the edge density, and the labels “RD linear” and “RD exp” indicate the first-order and the exponential replicator dynamics, respectively. The results are compared with the following state-of-the-art neural network heuristics for maximum clique: Jagota’s continuous Hopfield dynamics (CHD) and mean field annealing (MFA) (Jagota, 1995; Jagota et al., 1996), the saturated linear dynamical network (SLDN) by Pekergin et al. (1999), an approximation approach introduced by Funabiki et al. (FTL) (1992), the iterative Hopfield nets (IHN) algorithm by Bertoni et al. (2002), and the Hopfield network learning (HNL) of Wang et al. (2003). Figure 1 plots the corresponding CPU timings obtained with a (nonoptimized) C++ implementation on a machine equipped with a 2.5 GHz Pentium 4 processor. Since random graphs are notoriously easy to deal with, a second set of experiments was also performed on the DIMACS benchmark graphs (see Tables 2 and 3). Here, columns marked with n and ρ contain the number of vertices in the graph and the edge density respectively. Columns “Clique Size” contain the size of the cliques found by the competing algorithms, while the column “Time” reports the CPU timings for the proposed dynamics. The sizes of the cliques obtained are compared against several algorithms present in either the neural network or the continuous optimization literature, and against the best result over all algorithm featured on the DIMACS challenge (Johnson & Trick, 1996) (DIMACS best). The neural-based approaches include mean field annealing (MFA), the inverted neurons network (INN) model by Grossman (1996), and the IHN algorithm, while algorithms from the continuous optimization literature include the continuous-based heuristic (CBH) by Gibbons et al. (1997) and the QSH algorithm by Busygin, Butenko, and Pardalos (2002). The results are taken from the cited papers. No results are presented for CHD, SLDN, FTL, and
Game Dynamics and Maximum Clique
1231
Table 1: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on Random Graphs with Varying Order and Density. n
ρ
100 0.25 0.50 0.75 0.90 0.95 200 0.25 0.50 0.75 0.90 0.95 300 0.25 0.50 0.75 0.90 0.95 400 0.25 0.50 0.75 0.90 0.95 500 0.25 0.50 0.75 0.90 0.95 1000 0.25 0.50 0.75 0.90 0.95
RD Linear 4.90 ± 0.56 8.01 ± 0.66 15.10 ± 1.05 28.50 ± 1.87 41.81 ± 1.80 5.34 ± 0.68 9.04 ± 0.76 17.77 ± 1.35 36.74 ± 1.65 57.33 ± 2.33 5.58 ± 0.62 9.54 ± 0.90 19.05 ± 1.27 40.74 ± 1.97 66.48 ± 2.69 5.73 ± 0.65 9.99 ± 0.80 20.17 ± 1.16 43.63 ± 1.73 73.11 ± 2.37 5.81 ± 0.69 10.14 ± 0.93 20.90 ± 1.32 46.10 ± 2.25 78.73 ± 2.66 6.23 ± 0.62 10.74 ± 0.80 22.88 ± 1.28 52.60 ± 2.02 93.75 ± 3.12
RD Exponential CHD 4.81 ± 0.60 8.07 ± 0.64 15.15 ± 1.11 28.92 ± 1.74 42.04 ± 1.78 5.35 ± 0.64 9.11 ± 0.77 18.05 ± 1.23 37.41 ± 1.55 58.24 ± 2.09 5.61 ± 0.62 9.57 ± 0.88 19.40 ± 1.17 41.48 ± 1.81 67.43 ± 2.47 5.73 ± 0.60 10.07 ± 0.77 20.42 ± 1.10 44.44 ± 1.64 74.25 ± 2.30 5.74 ± 2.30 10.31 ± 0.91 21.31 ± 1.27 46.93 ± 2.29 79.94 ± 2.68 6.17 ± 0.57 10.83 ± 0.82 23.04 ± 1.35 53.15 ± 2.11 94.80 ± 2.96
4.48 7.38 13.87 27.92 — — — — — — — — — — — 5.53 9.24 18.79 43.24 — — — — — — 6.03 10.25 21.26 — —
MFA
SLDN
FTL
IHN
HNL
— 8.50 — 30.02 — — — — — — — — — — — — 10.36 — 49.94 — — — — — — — — — — —
4.83 8.07 15.05 — — — — — — — — — — — — 5.70 9.91 20.44 — — — — — — — 6.17 10.93 23.19 — —
4.2 8.0 14.1 — — 4.9 8.5 — — — 5.1 8.9 — — — 4.9 8.9 17.7 — — 6.2 9.4 — — — 5.8 10.4 21.4 — —
— 9.13 — — — — 10.60 — — — — 11.60 — — — — 12.30 — — — — 12.80 — — — — — — — —
— 9 — 30 — — 11 — 39 — — 11 — 46 — — — — — — — 12 — 56 — — — — — —
HNL since the authors did not provide results on the DIMACS graphs. For the same reason, we did not report results on random graphs for INN, CBH, and QSH. A number of conclusions can be drawn from these results. First, the exponential dynamics provides slightly better results than the linear one, being, however, dramatically faster, especially on dense graphs. These results confirm earlier findings reported in Pelillo (1999, 2002) on graph classes arising from graph and tree matching problems. As for the comparison with the other algorithms, we note that our dynamics substantially outperform CHD and FTL and are, overall, as effective as SLDN (for which no results on DIMACS graphs are available). Observe that these approaches do not incorporate any procedure to escape from poor local solutions and, hence,
1232
M. Pelillo and A. Torsello 0.45 linear 0.4 exponential 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
1.8 linear 1.6 exponential 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
100 vertices 4 3.5
1
200 vertices 7
linear exponential
linear exponential
6
3
5
2.5
4
2
3
1.5 1
2
0.5
1
0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
300 vertices 10 linear 9 exponential 8 7 6 5 4 3 2 1 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
500 vertices
1
400 vertices 40 35
linear exponential
30 25 20 15 10 5 1
0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
1000 vertices
Figure 1: CPU time of replicator dynamics on random graphs with varying order and density. The x-axis represents the edge density, while the y-axis denotes time (in seconds).
are close in spirit to ours. Clearly our results are worse than those obtained with algorithms that do use some form of annealing or, in any case, are explicitly designed to avoid local optima, such as IHN, HNL, CBH, and QSH. Interestingly, however, the results are close to (and in some instances
Clique Size Graph brock200 1 brock200 2 brock200 3 brock200 4 brock400 1 brock400 2 brock400 3 brock400 4 brock800 1 brock800 2 brock800 3 brock800 4 c-fat200 1 c-fat200 2 c-fat200 5 c-fat500-1 c-fat500-2 c-fat500-5 c-fat500-10 hamming6-2 hamming6-4
n 200 200 200 200 400 400 400 400 800 800 800 800 200 200 200 500 500 500 500 64 64
ρ
DIMACS Best
RD Linear
RD Exponential
0.75 0.50 0.61 0.66 0.75 0.75 0.75 0.75 0.65 0.65 0.65 0.65 0.08 0.16 0.43 0.04 0.07 0.19 0.37 0.90 0.35
21 12 15 17 24 25 25 33 21 21 21 21 12 24 58 14 26 64 126 32 4
17 8 9 13 21 20 19 19 16 16 15 17 12 24 58 14 26 64 126 32 4
17 8 10 13 21 21 18 20 16 16 18 17 12 24 58 14 26 64 126 32 4
Time (secs)
MFA
INN Average
IHN
CBH
QSH
19 9 11 14 24 21 22 23 16 16 16 15 6 24 58 — — — — — —
— 9.1 — 13.4 — 21.3 — 21.3 — 16.9 — 16.5 — — — — — — — — —
— — — — — — — — — — — — 12 24 58 14 26 64 — 32 4
20 12 14 16 23 24 23 24 20 19 20 19 12 24 58 14 26 64 126 32 4
21 12 15 17 27 29 31 33 17 24 25 26 12 24 58 14 26 64 126 32 4
RD Linear 0.33 0.23 0.25 0.28 1.36 1.35 1.38 1.37 4.68 4.61 4.72 4.63 0.22 0.15 0.16 0.97 1 1.01 1.09 0.09 0.03
RD Exponential 0.2 0.17 0.18 0.16 0.69 0.73 0.69 0.74 2.86 2.77 2.92 2.7 0.17 0.16 0.16 1.01 1.02 0.99 1 0.03 0.02
Game Dynamics and Maximum Clique
Table 2: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part I.
1233
1234
Table 2: Continued Clique Size Graph
ρ
256 256 65536 65536 28 70 120 496 171 776 3361
0.97 0.64 0.99 0.83 0.56 0.77 0.76 0.88 0.65 0.75 0.82
128 16 512 36 4 14 8 16 11 27 59
RD Linear
RD Exponential
82 10 312 32 4 14 8 16 7 15 31
81 10 297 32 4 14 8 16 7 15 31
Time (secs)
MFA
INN Average
IHN
CBH
— — — — — — — — — — —
— 16.0 — 32.0 — — — — 8.4 16.7 33.4
128 16 512 36 4 14 8 16 — — —
128 16 — — 4 14 8 16 10 21 —
QSH
RD Linear
RD Exponential
128 16 — — 4 14 8 16 11 24 —
3.4 0.44 459.04 24.27 0.01 0.04 0.12 5.11 0.21 5.43 174.89
0.51 0.27 34.48 6.05 0.00 0.02 0.07 1.36 0.14 2.64 54.34
M. Pelillo and A. Torsello
hamming8-2 hamming8-4 hamming10-2 hamming10-4 johnson8-2-4 johnson8-4-4 johnson16-2-4 johnson32-2-4 keller4 keller5 keller6
n
DIMACS Best
Clique Size Graph p hat300-1 p hat300-2 p hat300-3 p hat500-1 p hat500-2 p hat500-3 p hat700-1 p hat700-2 p hat700-3 p hat1000-1 p hat1000-2 p hat1000-3 p hat1500-1 p hat1500-2 p hat1500-3 san200 0.7 1 san200 0.7 2 san200 0.9 1 san200 0.9 2 san200 0.9 3
n
ρ
DIMACS Best
300 300 300 500 500 500 700 700 700 1000 1000 1000 1500 1500 1500 200 200 200 200 200
0.24 0.49 0.74 0.25 0.50 0.75 0.25 0.50 0.75 0.25 0.50 0.75 0.25 0.50 0.75 0.70 0.70 0.90 0.90 0.90
8 25 36 9 36 50 11 44 62 10 46 66 11 65 94 30 18 70 60 44
RD Linear
RD Exponential
6 24 32 8 34 47 9 43 58 8 42 61 9 61 88 15 12 45 36 32
6 24 33 8 34 48 9 43 59 8 44 63 9 61 88 15 12 45 36 32
Time (secs)
MFA
INN Average
IHN
CBH
— — — — — — — — — — — — — — — — — 42 — 33
7.0 21.9 33.1 — — — 8.4 42.0 58.2 — — — 9.4 60.2 86.2 — — — — —
8 25 36 9 36 49 11 44 61 10 46 68 — — — 30 15 70 41 —
8 25 36 9 35 49 11 44 60 — — — — — — 15 12 46 36 30
QSH
RD Linear
RD Exponential
7 24 33 9 33 46 8 42 59 — — — — — — 30 18 70 60 35
0.41 0.69 1.04 1.18 2.07 3.75 2.39 4.31 7.35 4.89 8.02 14.07 11.03 22.96 36.72 0.68 0.96 1.74 1.11 0.76
0.39 0.43 0.43 1.07 1.14 1.23 2.1 2.2 2.43 4.5 4.36 4.8 9.71 10.37 10.89 0.22 0.23 0.35 0.3 0.26
Game Dynamics and Maximum Clique
Table 3: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part II.
1235
1236
Table 3: Continued Clique Size Graph
ρ
400 400 400 400 400 1000 200 200 400 400
0.50 0.70 0.70 0.70 0.90 0.50 0.70 0.90 0.50 0.70
13 40 30 22 100 10 18 42 13 21
RD Linear
RD Exponential
7 20 15 12 44 8 14 37 11 18
7 20 15 12 50 8 16 39 11 18
Time (secs)
MFA
INN Average
IHN
CBH
— — — — — — 18 41 — —
— — — — — — — — — —
— 40 30 — 100 10 17 41 12 21
8 20 15 14 50 — 18 41 12 20
QSH
RD Linear
RD Exponential
9 40 30 16 100 — 15 37 11 18
2.71 2.41 3.38 4 2.98 29.95 0.29 0.72 0.88 1.24
0.87 0.95 0.97 0.89 0.89 6.91 0.2 0.24 0.61 0.69
M. Pelillo and A. Torsello
san400 0.5 1 san400 0.7 1 san400 0.7 2 san400 0.7 3 san400 0.9 1 san1000 sanr200 0.7 sanr200 0.9 sanr400 0.5 sanr400 0.7
n
DIMACS Best
Game Dynamics and Maximum Clique
1237
even better than) those obtained with Jagota’s MFA and Grossman’s INN, which are also in this family. 6 Annealed Imitation Dynamics: Evolving Toward Larger Cliques In an attempt to avoid inefficient local solutions, we now follow Bomze et al. (2002) and investigate the stability properties of equilibria of payoffmonotonic dynamics when the parameter α is allowed to take on negative values. Indeed, we shall restrict our analysis to imitation dynamics (see equation 2.5), but we first make a few observations pertaining to general selection dynamics. For any regular selection dynamics x˙ i = xi gi (x), we have: ∂gi (x) ∂ x˙ i = δi j gi (x) + xi , ∂xj ∂xj
(6.1)
where δi j is the Kronecker delta, defined as δi j = 1 if i = j and δi j = 0 otherwise. Assuming without loss of generality that σ (x) = {1, . . . , m}, the Jacobian of any regular selection dynamics at a stationary point x has therefore the following block triangular form, J (x) =
M(x) N(x) O
D(x)
,
(6.2)
, O is the (possibly where the entries of M(x) and N(x) are given by xi ∂g∂ ix(x) j empty) matrix containing all zeros, and D(x) = diag{gm+1 (x), . . . , gn (x)}. An immediate consequence of this observation is that we can already say something about the spectrum of J (x), when m < n. In fact, the eigenvalues of J (x) are those of M(x) together with those of D(x), and since D(x) is diagonal, its eigenvalues coincide with its diagonal entries, that is, gm+1 (x), . . . , gn (x). This set of eigenvalues governs the asymptotic behavior of the external flow under the system obtained by linearization around x and is usually called transversal eigenvalues (Hofbauer & Sigmund, 1998). Without knowing the form of the growth functions gi , however, it is difficult to provide further insights into the spectral properties of J (x), and therefore we now specialize our discussion to imitation dynamics (see equation 2.5). In this case, we have gi (x) = φ(πi (x)) −
n k=1
xk φ(πk (x)),
(6.3)
1238
M. Pelillo and A. Torsello
where φ is a strictly increasing function, and hence ∂ x˙ i = δi j φ(πi (x)) − xk φ(πk (x)) ∂xj k
+ xi a i j φ (πi (x)) − φ(π j (x)) −
xk a k j φ (πk (x)) .
(6.4)
k
When x is an equilibrium point, and hence πi (x) = π(x) for all i ∈ σ (x), the previous expression simplifies to ∂ x˙ i = δi j [φ(πi (x)) − φ(π(x))] ∂xj + xi [a i j φ (π(x)) − φ(π j (x)) − φ (π(x))π j (x)].
(6.5)
Before we provide the main result of this section, we prove the following useful proposition, which generalizes an earlier result by Bomze (1986). Proposition 7. Let x be a stationary point of any imitation dynamics, equation 2.5. Then: a. −φ(π(x)) is an eigenvalue of J (x), with x as an associated eigenvector. b. If y is an eigenvector of J (x) associated with an eigenvalue λ = −φ(π(x)), then e y = i yi = 0. Proof. Recall from proposition 1 that x is an equilibrium point for equation 2.5 if and only if πi (x) = π(x) for all i ∈ σ (x). Hence, for i = 1, . . . , n, we have: n j=1
xj
∂ x˙ i = xi [φ(πi (x)) − φ(π(x))] ∂xj + xi
n
x j [a i j φ (π(x)) − φ(π j (x)) − π j (x)φ (π(x))]
j=1
= xi [φ(πi (x)) − φ(π(x)) + φ (π(x))πi (x) − φ(π(x)) − φ (π(x))π(x)] = −xi φ(π(x)). In other words, we have shown that J (x)x = −φ(π(x))x, which proves part a of the proposition.
Game Dynamics and Maximum Clique
1239
To prove part b, first note that the columns of J (x) have a nice property: they all sum up to −φ(π(x)). Indeed, for all j = 1, . . . , n we have: n ∂ x˙ i = φ(π j (x)) − φ(π(x)) ∂xj i=1
+
n
xi a i j φ (π(x)) − φ(π j (x)) − π j (x)φ (π(x))
i=1
= φ(π j (x)) − φ(π(x)) + φ (π(x))π j (x) − φ(π j (x)) − φ (π(x))π j (x) = −φ(π(x)). Now, the hypothesis J (x)y = λy yields λ
yi =
(J (x)y)i = yj J (x)i j = −φ(π(x)) yj ,
i
which implies
i
j
i
yi = 0, since λ = −φ(π(x)).
i
j
Since we analyze the behavior of imitation dynamics restricted to the standard simplex , we are interested only in the eigenvalues of J (x) associated with eigenvectors belonging to the tangent space e⊥ = {y ∈ Rn : e y = 0}. The previous result therefore implies that the eigenvalue −φ(π(x)) can be neglected in our analysis, and that the remaining ones, including the transversal eigenvalues, are indeed all relevant. We now return to the maximum clique problem. Let a graph G = (V, E) be given, and for a subset of vertices C, let γ (C) = max degC (i) − |C| + 1 . i ∈C /
(6.6)
Note that if C is a maximal clique, then γ (C) ≤ 0. The next theorem shows that γ (C) plays a key role in determining the stability of equilibria of imitation dynamics. Theorem 4. Let C be a maximal clique of graph G = (V, E), and let xC be its characteristic vector. If γ (C) < α < 1, then xC is an asymptotically stable stationary point under any imitation dynamics (see equation 2.5) with payoff matrix A = AG + α I , and hence a (strict) local maximizer of f α in . Moreover, assuming C = V, if α < γ (C), then xC becomes unstable. Proof. Assume without loss of generality that C = {1, . . . , m} and suppose that γ (C) < α < 1. To simplify notations, put x = xC . We shall see that the eigenvalues of J (x) are real and negative. This implies that x is a sink and
1240
M. Pelillo and A. Torsello
hence an asymptotically stable point (Hirsch & Smale, 1974). The fact that x is a strict local maximizer of f α in follows directly from theorem 2. As already noticed in the previous discussion, because of its block di
agonal form, the eigenvalues of J (x) are those of M(x) = xi ∂g∂ ix(x) j
i, j=1,...,m
together with the n − m transversal eigenvalues gm+1 (x), . . . , gn (x), where: gi (x) = φ(πi (x)) −
n
xk φ(πk (x)).
k=1
Since C is a (maximal) clique, πk (x) = π(x) for all k ∈ C = σ (x), and therefore gi (x) = φ(πi (x)) − φ(π(x)). But φ is a strictly increasing function, and hence gi (x) < 0 if and only if πi (x) < π(x). Now, since C is a maximal clique, πi (x) = (AG x)i = degC (i)/m for all i > m, and π(x) = (m − 1 + α)/m. But for all i > m, we have degC (i) − m + 1 ≤ γ (C) < α, and this yields πi (x) < π(x). Hence, all transversal eigenvalues are negative. It remains to show that the eigenvalues of M(x) are negative too. When A = AG + α I , we have: M(x)i j = xi
∂ x˙ i 1 = [(a i j + αδi j )φ (π(x)) − φ(π(x)) − φ (π(x))π(x)]. ∂xj m
Hence, in matrix form, we have φ (π(x)) M(x) = m
φ(π(x)) 1− − π(x) ee + (α − 1)I , φ (π(x))
where ee is the m × m matrix containing all ones, and the eigenvalues of M(x) are λ1 =
φ (π(x)) (α − 1) m
with multiplicity m − 1, and φ (π(x)) λ2 = m
φ(π(x)) 1− − π(x) m + α − 1 = −φ(π(x)) φ (π(x))
with multiplicity 1. Since α < 1 and φ is strictly increasing, we have λ1 < 0. Moreover, recall from proposition 7 that eigenvalue λ2 = −φ(π(x)) is not
Game Dynamics and Maximum Clique
1241
relevant to the imitation dynamics on the simplex , since its eigenvector x does not belong to the tangent space e⊥ . Hence, as far as the dynamics in the simplex is concerned, we can ignore it. Finally, to conclude the proof, suppose that α < γ (C) = maxi>m degC (i) − m + 1. Then there exists i > m such that m − 1 + α < degC (i) and hence, dividing by m, we get πi (x) − π(x) > 0 and then gi (x) = φ(πi (x)) − φ(π(x)) > 0, which implies that a transversal eigenvalue of J (x) is positive, that is, x is unstable. Theorem 4 provides us with an immediate strategy to avoid unwanted local solutions: maximal cliques that are not maximum. Suppose that C is a maximal clique in G that we want to avoid. By letting α < γ (C), its characteristic vector xC becomes an unstable stationary point of any imitation dynamics under f α , and thus will not be approached by any interior trajectory. Hence, if there is a clique D such that still γ (D) < α holds, there is a (more or less justified) hope to obtain in the limit x D , which yields automatically a larger maximal clique D. Unfortunately, two other cases could occur: (1) no other clique T satisfies γ (T) < α, that is, α has a too large absolute value, and (2) even if there is such a clique, other attractors could emerge that are not characteristic vectors of a clique (note that this is excluded if α > 0 by theorem 3). The proper choice of the parameter α is therefore a trade-off between the desire to remove unwanted maximal cliques and the emergence of spurious solutions. Instead of keeping the value of α fixed, our approach is to start with a sufficiently large negative α and adaptively increase it during the optimization process, in much the same spirit as simulated or mean field annealing procedures. Of course, in our case, the annealing parameter has no interpretation in terms of a hypothetical temperature. The rationale behind this idea is that for values of α that are sufficiently negative, only the characteristic vectors of large maximal cliques will be stable, attractive points for the imitation dynamics, together with a set of spurious solutions. As the value of α increases, spurious solutions disappear, and at the same time, (characteristic vectors of) smaller maximal cliques become stable. We expect that at the beginning of the annealing process, the dynamics is attracted toward “promising” regions, and the search is further refined as the annealing parameter increases. In summary, a high-level description of the proposed algorithm is shown in Figure 2. Note that the last step in the algorithm is necessary if we also want to extract the vertices comprising the clique found, as shown in theorem 3. It is clear that for the algorithm to work, we need to select an appropriate annealing schedule. To this end, we employ the following heuristic suggested in Bomze et al. (2002). Suppose that the underlying graph is a random one in the sense that edges are generated independent of each other with a certain probability q (in applications, q will be replaced by the actual graph density), and suppose that C is an
1242
M. Pelillo and A. Torsello
Algorithm 1. Start with a sufficiently large negative α. 2. Let b be the barycenter of ∆ and set x = b. 3. Run any imitation dynamics starting from x, under AG + αI until convergence and let x be the converged point. 4. Unless a stopping condition is met, increase α and goto 3. 5. Select α ˆ with 0 < α ˆ < 1 (e.g., α ˆ = 12 ), run any imitation dynamics starting from current x under AG + α ˆI until convergence, and extract a maximal clique from the converged solution.
Figure 2: Annealed Imitation Heuristic.
unwanted clique of size m. Take δ > 0 small, say 0.01, and consider the quantity γ m = 1 − (1 − q )m −
mq (1 − q ) δ ν ,
(6.7)
where ν = 1/2(n − m). Bomze et al. (2002) proved that γ (C) exceeds γ m with probability 1 − δ. Thus, it makes sense to use γ m as a heuristic proxy for the lower bound of γ (C), to avoid being attracted by a clique of size m. Furthermore, note that among all graphs with n vertices and m edges, the maximum possible clique number is the only integer c that satisfies the following relations: c+1 c ≤m< , 2 2
(6.8)
Game Dynamics and Maximum Clique
1243
which, after some algebra, yields
8m + 1 1 − ⇔ πi (t) > π j (t). xi (t) x j (t) It is a well-known result in evolutionary game theory (Weibull, 1995; Hofbauer & Sigmund, 1998) that the fundamental theorem of natural selection (see theorem 1) also holds for the first-order linear dynamics (see equation A.1)—namely, the average consistency x Ax is a (strict) Lyapunov function for equation A.1, provided that A = A . In other words: x(t) Ax(t) < x(t + 1) Ax(t + 1) unless x(t) is a stationary point. Unfortunately, unlike the continuous-time case, there is no such result for the discrete exponential dynamics, equation A.2. That is, there is no guarantee that for any fixed value of the parameter, κ, the dynamics increases the value of x Ax. Indeed, with high values of this parameter, the dynamics can exhibit an oscillatory behavior (Pelillo, 1999). However, a recent result by Bomze (2005) allows us to define an adaptive approach that is guaranteed to find a (local) maximizer for x Ax.
1252
M. Pelillo and A. Torsello
We define the ε-stationary points as Statε = x ∈ : xi ((Ax)i − x Ax)2 < ε . i
Clearly, this set is composed of the union of open neighborhoods around the stationary points of any payoff-monotonic dynamics, and for ε → 0 shrinks toward the stationary points themselves. Let m ¯ A = maxi j |a i j |, span(A) = maxi j a i j − mini j a i j , and, for any given ε, define κ A(ε) as the unique κ > 0, which satisfies κ exp(2m ¯ Aκ) =
2ε . span(A) ε + 2m ¯ 2A
Theorem 5. Suppose A = A . Then for arbitrary ε > 0, for any positive κ ≤ κ A(ε), the objective function x Ax is strictly increasing over time along the parts of trajectories under equation A.2, which are not ε-stationary, that is, x(t) Ax(t) < x(t + 1) Ax(t + 1) if
x(t) ∈ Statε .
Proof. See Bomze (2005). This means that for each point x ∈ we can find a κ for which one iteration of equation A.2 increases x Ax. That is, by setting at each iteration κ = κ A(ε), we are guaranteed to increase x Ax along the trajectories of the system. Note, however, that this estimate of κ A(ε) is not tight. In particular, our experience shows that it severely underestimated the value of κ, slowing the convergence of the dynamics considerably. In order to obtain a better estimate of the parameter κ and improve the performance of the approach, in our experiments we employed the adaptive exponential dynamics described in Figure 4, which, as the next proposition shows, has x Ax as a Lyapunov function. Proposition 8. If the payoff matrix A is symmetric, then the function x Ax is strictly increasing along any nonconstant trajectory of the adaptive exponential dynamics defined above. In other words, x(t) Ax(t) ≤ x(t + 1) Ax(t + 1) for all t, with equality if and only if x = x(t) is a stationary point. Proof. By construction, the function is guaranteed to grow as long a κ that increases x Ax can be found. Theorem 5 guarantees that such a κ can indeed be found.
Game Dynamics and Maximum Clique
1253
Algorithm 1. Start with a sufficiently large κ and from an arbitrary x(0) ∈ ∆. Set t ← 0. 2. While x(t) is not stationary do 3.
Compute x(t + 1) using equation A.2;
4.
While x (t + 1)Ax(t + 1) ≤ x (t)Ax(t) do
5.
Reduce κ;
6.
Recompute x(t + 1) using equation A.2;
7.
Endwhile;
8.
t ← t + 1;
9. Endwhile;
Figure 4: Adaptive exponential (discrete-time) replicator dynamics.
Appendix B: Proof of Theorem 3 Theorem 3. Let C be a subset of vertices of a graph G, and let xC be its characteristic vector. Then, for any 0 < α < 1, C is a maximal (maximum) clique of G if and only if xC is a local (global) solution of equation 4.3. Moreover, all solutions of the equation 4.3 are strict and are characteristic vectors of maximal cliques of G. Proof. Suppose that C is a maximal clique of G, and let |C| = m. We shall prove that xC is a strict local solution of program 4.3. To this end, we use standard second-order sufficiency conditions for constrained optimization
1254
M. Pelillo and A. Torsello
(Luenberger, 1984). Let AG = (a i j ) be the adjacency matrix of G and, for notational simplicity, put A = AG + α I. First, we need to show that xC is a KKT point for equation 4.3. It is easy to see that since C is a maximal clique, we have: C
(AG x )i
= ≤
m−1 m m−1 m
if i ∈ C if i ∈ / C.
Hence, if i ∈ C, then (AxC )i = (AG xC )i + αxiC =
α m−1 + , m m
(B.1)
and if i ∈ / C, (AxC )i = (AG xC )i ≤
m−1 m−1 α < + . m m m
(B.2)
Therefore, conditions 3.2 are satisfied and xC is a KKT point. Note that the Lagrange multipliers µi ’s defined in section 3 are given by µi =
m−1+α − (AxC )i . m
To conclude the first part of the proof, it remains to show that the Hessian of the Lagrangian associated with program 4.3, which in this case is simply AG + α I , is negative definite on the following subspace: = {y ∈ Rn : e y = 0 and yi = 0 for all i ∈ ϒ}, where ϒ = {i ∈ V : xiC = 0 and µi > 0} . But from equation B.2, ϒ = V\C. Hence, for all y ∈ , we have: y Ay =
n
yi
i=1
=
i∈C
n
ai j yj + α
j=1
yi
j∈C
n
yi2
i=1
ai j yj + α
i∈C
yi2
Game Dynamics and Maximum Clique
=
yi
i∈C
= (α − 1)
y j − yi + α
j∈C
1255
yi2
i∈C
yi2
i∈C
= (α − 1)y y ≤0 with equality if and only if y = 0, the null vector. This proves that AG + α I is negative definite on , as required. To prove the inverse direction, suppose that xC ∈ is a local solution to equation 4.3 and hence a KKT point. By proposition 2, xC is also a stationary point for payoff-monotonic dynamics, and since Ahas positive diagonal entries a ii = α > 0, all the hypotheses of proposition 4 are fulfilled. Therefore, it follows that C is a clique (i.e., a i j > 0 for all i, j ∈ C); otherwise xC could not be a local solution of equation 4.3. On the other hand, from proposition 5, C is also a maximal clique. Furthermore, if x is any local solution, and hence a KKT point of equation 4.3, then necessarily x = x S where S = σ (x). Geometrically, this means that x is the barycenter of its own face. In fact, from the previous discussion, S has to be a (maximal) clique. Therefore, for all i ∈ S, (AG x)i + αxi = 1 − (1 − α)xi = λ, for some constant λ. This amounts to saying that xi is constant for all i ∈ σ (x), and i xi = 1 yields xi = 1/|S|. From what we have seen in the first part of the proof, this also shows that all local solutions of equation 4.3 are strict. Finally, as for the “global/maximum” part of the theorem, simply notice that at local solutions x = x S of equation 4.3, S = σ (x) being a maximal clique, the value of the objective function f α is 1 − (1 − α)/|S|.
Acknowledgments

We thank Manuel Bomze for many stimulating discussions and Claudio Rossi for his help in early stages of this work.

References

Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. New York: Wiley.
Ballard, D. H., Gardner, P. C., & Srinivas, M. A. (1987). Graph problems and connectionist architectures (Tech. Rep. No. TR 167). Rochester, NY: University of Rochester.
Bertoni, A., Campadelli, P., & Grossi, G. (2002). A neural algorithm for the maximum clique problem: Analysis, experiments and circuit implementation. Algorithmica, 33(1), 71–88.
Bhatia, N. P., & Szegő, G. P. (1970). Stability theory of dynamical systems. Berlin: Springer-Verlag.
Bomze, I. M. (1986). Non-cooperative two-person games in biology: A classification. Int. J. Game Theory, 15(1), 31–57.
Bomze, I. M. (1997). Evolution towards the maximum clique. J. Global Optim., 10, 143–164.
Bomze, I. M. (1998). On standard quadratic optimization problems. J. Global Optim., 13, 369–387.
Bomze, I. (2005). Portfolio selection via replicator dynamics and projections of indefinite estimated covariances. Dynamics of Continuous, Discrete and Impulsive Systems B, 12, 527–564.
Bomze, I. M., Budinich, M., Pardalos, P. M., & Pelillo, M. (1999). The maximum clique problem. In D.-Z. Du & P. M. Pardalos (Eds.), Handbook of combinatorial optimization (Suppl. Vol. A, pp. 1–74). Boston: Kluwer.
Bomze, I. M., Budinich, M., Pelillo, M., & Rossi, C. (2002). Annealed replication: A new heuristic for the maximum clique problem. Discr. Appl. Math., 121(1–3), 27–49.
Bomze, I. M., Pelillo, M., & Giacomini, R. (1997). Evolutionary approach to the maximum clique problem: Empirical evidence on a larger scale. In I. M. Bomze, T. Csendes, R. Horst, & P. M. Pardalos (Eds.), Developments in global optimization (pp. 95–108). Dordrecht: Kluwer.
Bomze, I. M., Pelillo, M., & Stix, V. (2000). Approximating the maximum weight clique using replicator dynamics. IEEE Trans. Neural Networks, 11(6), 1228–1241.
Boppana, R., & Halldórsson, M. M. (1992). Approximating maximum independent sets by excluding subgraphs. BIT, 32, 180–196.
Brockington, M., & Culberson, J. C. (1996). Camouflaging independent sets in quasi-random graphs. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 75–88). Providence, RI: American Mathematical Society.
Busygin, S., Butenko, S., & Pardalos, P. M. (2002). A heuristic for the maximum independent set problem based on optimization of a quadratic over a sphere. J. Comb. Optim., 6, 287–297.
Cabrales, A., & Sobel, J. (1992). On the limit points of discrete selection dynamics. J. Econom. Theory, 57, 407–419.
Fisher, R. A. (1930). The genetical theory of natural selection. New York: Oxford University Press.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA: MIT Press.
Funabiki, N., Takefuji, Y., & Lee, K.-C. (1992). A neural network model for finding a near-maximum clique. J. Parallel Distrib. Comput., 14, 340–344.
Gaunersdorfer, A., & Hofbauer, J. (1995). Fictitious play, Shapley polygons, and the replicator equation. Games Econom. Behav., 11, 279–303.
Gee, A. W., & Prager, R. W. (1994). Polyhedral combinatorics and neural networks. Neural Computation, 6, 161–180.
Gibbons, L. E., Hearn, D. W., Pardalos, P. M., & Ramana, M. V. (1997). Continuous characterizations of the maximum clique problem. Math. Oper. Res., 22, 754–768.
Godbeer, G. H., Lipscomb, J., & Luby, M. (1988). On the computational complexity of finding stable state vectors in connectionist models (Hopfield nets) (Tech. Rep. No. 208/88). Toronto: University of Toronto.
Grossman, T. (1996). Applying the INN model to the maximum clique problem. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 122–145). Providence, RI: American Mathematical Society.
Grötschel, M., Lovász, L., & Schrijver, A. (1993). Geometric algorithms and combinatorial optimization. Berlin: Springer-Verlag.
Håstad, J. (1996). Clique is hard to approximate within n^(1−ε). In Proc. 37th Ann. Symp. Found. Comput. Sci. (pp. 627–636). Los Alamitos, CA: IEEE Computer Society Press.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. New York: Academic Press.
Hofbauer, J. (1995). Imitation dynamics for games. Collegium Budapest. Unpublished manuscript.
Hofbauer, J., & Sigmund, K. (1998). Evolutionary games and population dynamics. Cambridge: Cambridge University Press.
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Trans. Pattern Anal. Machine Intell., 5, 267–287.
Jagota, A. (1995). Approximating maximum clique with a Hopfield neural network. IEEE Trans. Neural Networks, 6, 724–735.
Jagota, A., Pelillo, M., & Rangarajan, A. (2000). A new deterministic annealing algorithm for maximum clique. In Proc. IJCNN'2000: Int. J. Conf. Neural Networks (pp. 505–508). Piscataway, NJ: IEEE Press.
Jagota, A., & Regan, K. W. (1997). Performance of neural net heuristics for maximum clique on diverse highly compressible graphs. J. Global Optim., 10, 439–465.
Jagota, A., Sanchis, L., & Ganesan, R. (1996). Approximately solving maximum clique using neural networks and related heuristics. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 169–204). Providence, RI: American Mathematical Society.
Johnson, D., & Trick, M. (1996). Cliques, coloring and satisfiability: Second DIMACS implementation challenge. Providence, RI: American Mathematical Society.
Lin, F., & Lee, K. (1992). A parallel computation network for the maximum clique problem. In Proc. 1st Int. Conf. Fuzzy Theory Tech. Baton Rouge, LA.
Luenberger, D. G. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Maynard Smith, J. (1982). Evolution and the theory of games. Cambridge: Cambridge University Press.
Miller, D. A., & Zucker, S. W. (1992). Efficient simplex-like methods for equilibria of nonsymmetric analog networks. Neural Computation, 4, 167–190.
Miller, D. A., & Zucker, S. W. (1999). Computing with self-excitatory cliques: A model and an application to hyperacuity-scale computation in visual cortex. Neural Computation, 11, 21–66.
Motzkin, T. S., & Straus, E. G. (1965). Maxima for graphs and a new proof of a theorem of Turán. Canad. J. Math., 17, 533–540.
Papadimitriou, C. H., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. Englewood Cliffs, NJ: Prentice Hall.
Pardalos, P. M., & Rodgers, G. P. (1990). Computational aspects of a branch and bound algorithm for quadratic zero-one programming. Computing, 45, 131–144.
Pekergin, F., Morgül, Ö., & Güzeliş, C. (1999). A saturated linear dynamical network for approximating maximum clique. IEEE Trans. Circuits and Syst.—I, 46(6), 677–685.
Pelillo, M. (1995). Relaxation labeling networks for the maximum clique problem. J. Artif. Neural Networks, 2, 313–328.
Pelillo, M. (1999). Replicator equations, maximal cliques, and graph isomorphism. Neural Computation, 11(8), 2023–2045.
Pelillo, M. (2002). Matching free trees, maximal cliques, and monotone game dynamics. IEEE Trans. Pattern Anal. Machine Intell., 24(11), 1535–1541.
Pelillo, M., & Jagota, A. (1995). Feasible and infeasible maxima in a quadratic program for maximum clique. J. Artif. Neural Networks, 2, 411–420.
Pelillo, M., Siddiqi, K., & Zucker, S. W. (1999). Matching hierarchical structures using association graphs. IEEE Trans. Pattern Anal. Machine Intell., 21(11), 1105–1120.
Ramanujam, J., & Sadayappan, P. (1988). Optimization by neural networks. In Proc. IEEE Int. Conf. Neural Networks (pp. 325–332). Piscataway, NJ: IEEE Press.
Rosenfeld, A., Hummel, R. A., & Zucker, S. W. (1976). Scene labeling by relaxation operations. IEEE Trans. Syst. Man and Cybern., 6, 420–433.
Samuelson, L. (1997). Evolutionary games and equilibrium selection. Cambridge, MA: MIT Press.
Shrivastava, Y., Dasgupta, S., & Reddy, S. (1990). Neural network solutions to a graph theoretic problem. In Proc. IEEE Int. Symp. Circuits Syst. (pp. 2528–2531). Piscataway, NJ: IEEE Press.
Shrivastava, Y., Dasgupta, S., & Reddy, S. M. (1992). Guaranteed convergence in a class of Hopfield networks. IEEE Trans. Neural Networks, 3, 951–961.
Takefuji, Y., Chen, L., Lee, K., & Huffman, J. (1990). Parallel algorithms for finding a near-maximum independent set of a circle graph. IEEE Trans. Neural Networks, 1, 263–267.
Wang, R. L., Tang, Z., & Cao, Q. P. (2003). An efficient approximation algorithm for finding a maximum clique using Hopfield network learning. Neural Computation, 15, 1605–1619.
Weibull, J. W. (1995). Evolutionary game theory. Cambridge, MA: MIT Press.
Wu, J., Harada, T., & Fukao, T. (1994). New method to the solution of maximum clique problem: Mean-field approximation algorithm and its experimentation. In Proc. IEEE Int. Conf. Syst., Man, Cybern. (pp. 2248–2253). Piscataway, NJ: IEEE Press.
Received January 27, 2005; accepted September 9, 2005.
NOTE
Communicated by Anthony Bell
Correlation and Independence in the Neural Code

Shun-ichi Amari
[email protected]
Hiroyuki Nakahara
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
The decoding scheme of a stimulus can be different from the stochastic encoding scheme in the neural population coding. The stochastic fluctuations are not independent in general, but an independent version could be used for the ease of decoding. How much information is lost by using this unfaithful model for decoding? There are discussions concerning loss of information (Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003). We elucidate the Nirenberg-Latham loss from the point of view of information geometry.
1 Introduction

The brain retains information of a stimulus by an excitation pattern of a population of neurons. The neural excitation is stochastic, so that the information is kept and processed in terms of probability distributions. The stochastic process of generating an excitation pattern from a stimulus is described by an encoding scheme, and in general this encoding scheme produces correlations among neurons. However, it is tractable and plausible for the brain to use an uncorrelated model for further processing and decoding. How much information is lost by using an unfaithful model for decoding? This is a fundamental question discussed in Nirenberg and Latham (2003) and Schneidman, Bialek, and Berry (2003). Wu, Amari, and Nakahara (2001) studied this problem in terms of Fisher information and concluded that the loss of information is small in this specific structure while the decoding process is greatly simplified. Nirenberg and Latham (2003) proposed a measure of loss of information caused by the use of the unfaithful independent model for decoding. Here, the Kullback-Leibler (KL) divergence is used for describing the loss (see also Nirenberg & Latham, 2005). Schneidman et al. (2003) questioned this definition by posing a fundamental problem: how the amount of information should be defined in neural encoding and decoding. They studied various quantities related to this problem in terms of Shannon information. The relations concerning various information-theoretical concepts are
elucidated. There have been heated discussions, and the effects caused by using different models for the encoding and decoding schemes remain to be clarified. It is true that there is no unique measure of loss of information in neural coding and decoding. Shannon information is justified when it is used for the purpose of reproducing or transmitting messages correctly, because it gives the minimal length of code to describe a message. However, there is no clear justification of this measure when one applies it for other purposes. Similarly, the Fisher information is justified for the purpose of estimating the parameter given in the form of a stimulus, because it is the only invariant local measure in the manifold of probability distributions (Amari & Nagaoka, 2000). However, neural decoding is not merely for estimation but more naturally for generating an action. In such a case, the Bayes loss plays a role. Therefore, although these measures are useful in various respects and widely used, there are no unique measures. Various measures have their own meanings in their own right, as Schneidman et al. (2003) pointed out. It is not the purpose of this article to give a unique correct definition of loss of information in neural coding and decoding. We are afraid that such arguments might lead us to theological debates. Instead, here we study the mathematical structures of various measures of difference between (conditional) probability distributions. In particular, we focus on the structure of the Nirenberg-Latham loss of information and prove that it is very natural from the information geometry point of view. Information geometry (Amari & Nagaoka, 2000) studies the intrinsic properties in the manifold of probability distributions and hence is useful for studying stochastic neural encoding and decoding. The KL divergence is the canonical invariant divergence between two probability distributions in a dually flat manifold (Amari & Nagaoka, 2000). The Shannon information, Fisher information, Jensen-Shannon divergence, and many other invariant structures are derived from it. We study the properties of the Nirenberg-Latham loss of information from the point of view of information geometry. We give a necessary and sufficient condition that guarantees no loss of information in the sense of Nirenberg and Latham when using the unfaithful (i.e., uncorrelated) model for decoding. Moreover, the KL divergence of the encoding scheme (which is the conditional mutual information of noises) is decomposed orthogonally into two terms: one is the Nirenberg-Latham loss, and the other is due to an irrelevant normalization term. This elucidates the use of the Nirenberg-Latham loss for analyzing the decoding process. We are interested in extending the theory to a more general process of integrating different evidence in the Bayesian framework, denoted by posterior probabilities of a stimulus. The present study is a first step toward this interesting problem from the information geometrical viewpoint (Amari, 2005).
2 Encoding and Decoding

Let us consider a population of neurons activated by stimulus s. The firing pattern is represented by a vector r = (r1, . . . , rn), where ri is the activity of the ith neuron. Given stimulus s, the neurons fire stochastically, and the (conditional) probability p(r|s) represents the encoding scheme. The expectation of r,

r̄(s) = ∫ r p(r|s) dr,   (2.1)

is called the tuning curve, where the integral should be replaced by summation when the ri are discrete. Decoding is a process to infer s from the activity pattern r. An estimator is given by a function ŝ = ŝ(r), and its accuracy is bounded by the inverse of the Fisher information. From the Bayesian standpoint, it is natural to consider the posterior distribution of s given r,

p(s|r) = p(r|s) p(s) / p(r),   (2.2)

where p(s) is the prior distribution of the stimulus and

p(r) = ∫ p(s) p(r|s) ds.   (2.3)

Here, we use the same letter p to represent probabilities, and the meaning is clear from the context. The posterior distribution retains all the information concerning s, and the neural system infers s based on it. The encoding probability p(r|s) is not independent in general, and the activities ri are correlated. However, it would be difficult to take the correlation structure into account, and the brain might use a simplified independent version for decoding. It is given by

q(r|s) = ∏_{i=1}^n pi(ri|s),   (2.4)

where

pi(ri|s) = ∫ p(r1, . . . , rn|s) dr1 · · · dri−1 dri+1 · · · drn   (2.5)

is the marginal distribution of ri. q(r|s) is the independent version of p(r|s), sometimes called the shuffled distribution. Wu, Amari, and Nakahara (2001) used this model (the unfaithful model) for decoding s in a neural field and analyzed the loss of Fisher information. The posterior distribution under the independence assumption is given by

q(s|r) = q(r|s) p(s) / q(r),   (2.6)

where

q(r) = ∫ q(r|s) p(s) ds.   (2.7)
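To make the shuffled distribution concrete, the following sketch (our own illustration, not from the original note) builds q(r|s) from the marginals of an arbitrary two-neuron binary encoding and compares the two posteriors of equations 2.2 and 2.6:

```python
import numpy as np

# p[s, r1, r2]: conditional probabilities p(r|s) for two binary neurons, two stimuli.
p_r_given_s = np.array([[[0.5, 0.1],     # s = 0: correlated activity
                         [0.1, 0.3]],
                        [[0.1, 0.2],     # s = 1
                         [0.2, 0.5]]])
p_s = np.array([0.5, 0.5])               # prior over stimuli

# Independent ("shuffled") version q(r|s) = p1(r1|s) p2(r2|s), equation 2.4.
p1 = p_r_given_s.sum(axis=2)             # marginal of r1 given s
p2 = p_r_given_s.sum(axis=1)             # marginal of r2 given s
q_r_given_s = p1[:, :, None] * p2[:, None, :]

# Posteriors p(s|r) and q(s|r), equations 2.2 and 2.6.
p_joint = p_s[:, None, None] * p_r_given_s
q_joint = p_s[:, None, None] * q_r_given_s
p_post = p_joint / p_joint.sum(axis=0, keepdims=True)
q_post = q_joint / q_joint.sum(axis=0, keepdims=True)
print(p_post[:, 0, 0], q_post[:, 0, 0])  # decoding r = (0, 0) under the two schemes
```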
3 Nirenberg-Latham Loss of Information

Nirenberg and Latham (2003) proposed

I = KL[p(s|r) : q(s|r)] = ∫ p(r) p(s|r) log [ p(s|r)/q(s|r) ] ds dr   (3.1)

as the loss of information. This is the average KL divergence between the two posterior distributions in the decoding scheme. One may consider a similar quantity,

Ĩ = KL[p(r|s) : q(r|s)] = ∫ p(s) p(r|s) log [ p(r|s)/q(r|s) ] dr ds,   (3.2)

which is the average KL divergence in the encoding scheme. We show their properties and relation.

Theorem 1.

Ĩ = I + KL[p(r) : q(r)].   (3.3)

The proof is immediate from the definition. It is obvious that Ĩ = 0 if and only if p(r|s) = q(r|s), that is, no correlation exists in the true encoding scheme. Moreover, Ĩ ≥ I, and Ĩ = 0 implies I = 0, but the converse does not necessarily hold. This asymmetry arises from the fact that p(s) = q(s) is common but p(r) ≠ q(r). Hence, it is interesting to see the difference between Ĩ = 0 and I = 0.
We show an interesting property of Ĩ:

Theorem 2.

Ĩ = I(R1, R2, . . . , Rn | S),   (3.4)

where the right-hand side is the conditional mutual information of R1, . . . , Rn.

Proof. For fixed s, we have

KL[p(r|s) : q(r|s) | s] = E_{p(r|s)} [ log ( p(r|s) / ∏i pi(ri|s) ) ]
 = −H(R1, . . . , Rn|s) + Σ_{i=1}^n H(Ri|s)
 = I(R1, . . . , Rn|s),   (3.5)
which is the conditional mutual information. By averaging it over s, we have the theorem.

This quantity is called the strength of noise correlations and is studied in Schneidman et al. (2003).

4 How Related Are I and Ĩ?

We show when I = 0 occurs in spite of Ĩ ≠ 0:

Theorem 3. I = 0 if and only if

p(r|s) = k(r) q(r|s)   (4.1)

for some k(r) not depending on s.

Proof. When I = 0, from

I = E_{p(r)} KL[p(s|r) : q(s|r) | r],   (4.2)

we have

p(s|r) = q(s|r)   (4.3)

for almost all r (that is, for all r for which p(r) ≠ 0). Multiplying both sides by p(r)q(r)/p(s) and using Bayes' theorem (see equations 2.2 and 2.6), we have

q(r) p(r|s) = p(r) q(r|s).   (4.4)

Hence, we have equation 4.1 with k(r) = p(r)/q(r).
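The following numerical sketch (ours; the matrices and prior are arbitrary) illustrates Theorem 3: multiplying a reference conditional q(r|s) by an s-independent factor k(r) leaves the posterior, and hence the loss I, unchanged. Here q(r|s) is simply an arbitrary conditional; only the relation p = k·q matters.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.random((2, 4)); q /= q.sum(axis=1, keepdims=True)   # q(r|s), r = 0..3

# Find v(r) with sum_r v(r) q(r|s) = 0 for both s, so k(r) = 1 + c*v(r) keeps
# each row of p normalized: take a null-space direction of the 2x4 matrix q.
_, _, Vt = np.linalg.svd(q)
v = Vt[-1]
k = 1.0 + 0.5 * v / np.abs(v).max()        # positive, s-independent factor
p = k * q                                  # p(r|s) = k(r) q(r|s), equation 4.1

prior = np.array([0.4, 0.6])
post = lambda c: (prior[:, None] * c) / (prior[:, None] * c).sum(axis=0)
print(np.allclose(p.sum(axis=1), 1.0))     # p is a proper conditional
print(np.allclose(post(p), post(q)))       # True: identical posteriors, so I = 0
```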
The theorem shows that even when the encoding scheme is correlated, I = 0 if the correlational part does not depend on s. In other words, I = 0 when the log likelihood is the same for the two cases except for a constant term log k(r) not depending on s. We can restate this by using the score function, which is the derivative of the log likelihood with respect to s.

Corollary 1. I = 0 when and only when the score functions are the same for the two encoding schemes.

We show an example first.

Example 1. For s > 0, the encoding scheme

p(r1, . . . , rn|s) = [1/(√(2πs))^n] (1 + tanh r1 · · · tanh rn) exp( −Σi ri²/2s )   (4.5)

is not independent, and the ri are positively correlated. The marginal distributions are

pi(ri|s) = [1/√(2πs)] exp( −ri²/2s ),   (4.6)

and

q(r|s) = [1/(√(2πs))^n] exp( −Σi ri²/2s )   (4.7)

is different from p(r|s). In this case I = 0, but Ĩ ≠ 0.

Theorem 4. When a set of functions f(r; s) = p(r|s) of r, parameterized by s, spans the L²-space of r, where p(r|s) is a marginal distribution of any ri, I = 0 and Ĩ = 0 are equivalent.

Proof. We consider the case with n = 2. By integrating equation 4.1 with respect to r2, we have

p1(r1|s) = [ ∫ k(r1, r2) p2(r2|s) dr2 ] p1(r1|s).   (4.8)
This implies

∫ {1 − k(r1, r2)} p2(r2|s) dr2 = 0.   (4.9)

When {p2(r2|s)} spans the L²-space, that is, forms a complete basis, 1 − k(r1, r2) ≡ 0. Hence, under these conditions, when I = 0, p(r|s) = q(r|s), so that Ĩ = 0.

Example 2. For additive encoding,

ri = fi(s) + ni,   (4.10)

where fi(s) denotes a unimodal and continuous tuning function and the ni are jointly gaussian, I = 0 and Ĩ = 0 are equivalent.

In the above case, the marginal distribution of ri is

pi(ri|s) = const exp[ −(ri − fi(s))²/2σ² ].   (4.11)

For t = fi(s),

f(ri, t) = pi(ri|t) = const exp[ −(t − ri)²/2σ² ]   (4.12)

is a kernel whose eigenfunctions are complete, spanning the L²-space. On the other hand, the functions 4.6 span only even functions of ri and are not complete.

5 Geometry of Loss of Information

Finally, we study the relation between I and Ĩ from the point of view of information geometry. This may justify the use of I as the loss caused by using the unfaithful independent model. The Bayesian posterior is a probability distribution whose total mass is normalized to 1. However, it is computationally easier to retain a posterior distribution without normalization, without causing any loss of information. Hence, we consider two unnormalized distributions p(s, r) and q(s, r) over s where r is given. In other words, we regard p(s, r) and q(s, r) as unnormalized distributions of s where r is fixed, instead of regarding them as joint distributions of (s, r). Given two such positive unnormalized
distributions p̃(s) and q̃(s), for which ∫ p̃(s) ds ≠ 1 and ∫ q̃(s) ds ≠ 1, their KL divergence is given by

KL[p̃(s) : q̃(s)] = ∫ [ p̃(s) log ( p̃(s)/q̃(s) ) + q̃(s) − p̃(s) ] ds   (5.1)
(Amari & Nagaoka, 2000). Information geometry shows that one can decompose it into the following orthogonal sum of two nonnegative terms,

KL[p̃(s) : q̃(s)] = KL[p̃(s) : q̂(s)] + KL[q̂(s) : q̃(s)],   (5.2)

where, by putting cp = ∫ p̃(s) ds and cq = ∫ q̃(s) ds,

q̂(s) = (cp/cq) q̃(s)   (5.3)

has the same mass cp as p̃(s). It is easy to show that equation 5.2 follows from equation 5.1, and both terms on the right-hand side are nonnegative. The term

KL[p̃ : q̂] = cp KL[p : q],   (5.4)

where p = p̃/cp and q = q̃/cq are probability distributions, gives the difference between the normalized distributions, and

KL[q̂ : q̃] = cp log (cp/cq) + cq − cp ≥ 0,   (5.5)

which shows the difference in the total masses of q̂ and q̃. The decomposition is orthogonal, where the Pythagorean relation holds (Amari & Nagaoka, 2000). The orthogonal decomposition, equation 5.2, in the present case is

KL[p(s, r) : q(s, r) | r] = p(r) KL[p(s|r) : q(s|r) | r] + [ p(r) log ( p(r)/q(r) ) + q(r) − p(r) ].   (5.6)
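A quick numerical check (ours, with arbitrary positive vectors) of the Pythagorean decomposition for unnormalized distributions:

```python
import numpy as np

def kl_unnorm(p, q):
    # KL divergence for unnormalized distributions, equation 5.1.
    return np.sum(p * np.log(p / q) + q - p)

rng = np.random.default_rng(2)
p_t = rng.random(5)           # unnormalized p~(s)
q_t = rng.random(5)           # unnormalized q~(s)
cp, cq = p_t.sum(), q_t.sum()
q_hat = (cp / cq) * q_t       # equation 5.3: q^ has the same mass as p~

lhs = kl_unnorm(p_t, q_t)
rhs = kl_unnorm(p_t, q_hat) + kl_unnorm(q_hat, q_t)
print(np.isclose(lhs, rhs))   # True: orthogonal decomposition, equation 5.2
print(np.isclose(kl_unnorm(q_hat, q_t), cp * np.log(cp / cq) + cq - cp))  # eq. 5.5
```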
This decomposition is different from equation 3.4 of Schneidman et al. (2003), which is

Ĩ = Syn(R1, . . . , Rn) + I(R1, . . . , Rn),   (5.7)

where

Syn(R1, . . . , Rn) = I(S : R1, . . . , Rn) − Σi I(S : Ri),   (5.8)

and

I(R1, . . . , Rn) = KL[ p(r) : ∏i pi(ri) ]   (5.9)

is the mutual information among R1, . . . , Rn.

6 Conclusions

The information geometry framework is applied to elucidate the loss of information caused by the use of the independent version of the encoding scheme for decoding. A necessary and sufficient condition for the loss to vanish is given. Its meaning is newly given, justifying the use of this loss in the Bayesian framework.

Acknowledgments

We thank Peter Dayan for useful discussions.

References

Amari, S. (2005). Generalization of Bayes predictive distribution and optimality. Manuscript submitted for publication.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI: American Mathematical Society, and New York: Oxford University Press.
Nirenberg, S., & Latham, P. (2003). Decoding neural spike trains: How important are correlations? Proceedings of the National Academy of Sciences, USA, 100, 7348–7353.
Nirenberg, S., & Latham, P. (2005). Synergy, redundancy and independence in population codes. Journal of Neuroscience, 25, 5195–5206.
Schneidman, E., Bialek, W., & Berry, M. J., II. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553.
Wu, S., Amari, S., & Nakahara, H. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Received February 3, 2005; accepted October 3, 2005.
LETTER
Communicated by Haim Sompolinsky
Analysis of Spike Statistics in Neuronal Systems with Continuous Attractors or Multiple, Discrete Attractor States

Paul Miller
[email protected]
Department of Physics and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02454, U.S.A.
Attractor networks are likely to underlie working memory and integrator circuits in the brain. It is unknown whether continuous quantities are stored in an analog manner or discretized and stored in a set of discrete attractors. In order to investigate the important issue of how to differentiate the two systems, here we compare the neuronal spiking activity that arises from a continuous (line) attractor with that from a series of discrete attractors. Stochastic fluctuations cause the position of the system along its continuous attractor to vary as a random walk, whereas in a discrete attractor, noise causes spontaneous transitions to occur between discrete states at random intervals. We calculate the statistics of spike trains of neurons firing as a Poisson process with rates that vary according to the underlying attractor network. Since individual neurons fire spikes probabilistically and since the state of the network as a whole drifts randomly, the spike trains of individual neurons follow a doubly stochastic (Poisson) point process. We compare the series of spike trains from the two systems using the autocorrelation function, Fano factor, and interspike interval (ISI) distribution. Although the variation in rate can be dramatically different, especially for short time intervals, surprisingly both the autocorrelation functions and Fano factors are identical, given appropriate scaling of the noise terms. Since the range of firing rates is limited in neurons, we also investigate systems for which the variation in rate is bounded by either rigid limits or because of leak to a single attractor state, such as the Ornstein-Uhlenbeck process. In these cases, the time dependence of the variance in rate can be different between discrete and continuous systems, so that in principle, these processes can be distinguished using second-order spike statistics.

1 Introduction

Many processes inside the brain, particularly in the cerebral cortex, operate in a noisy environment (Shadlen & Newsome, 1994; Buzsaki, 2004). Noise is apparent in the trial-to-trial variation in times and number of
spikes produced by any neuron, under what are intended to be identical external conditions. The effects of noise on the spiking statistics can tell us about the underlying states of a network of neurons (Ginzburg & Sompolinsky, 1994), which can be described by the firing rates of neurons as a function of time. States of the network where the firing rate would remain constant in the absence of noise are stationary attractor states (Hopfield, 1982; Amit, 1989; Zipser, Kehoe, Littlewort, & Fuster, 1993). The concept of attractor states is important in understanding many functions of the brain (Hopfield, 1982; Amit, 1989; Hopfield & Herz, 1995; Goldberg, Rokni, & Sompolinsky, 2004). Hippocampal place fields (O’Keefe & Dostrovsky, 1971; Samsonovich & McNaughton, 1997), working memory (Zipser et al., 1993; Camperi & Wang, 1998; Romo, Brody, Hern´andez, & Lemus, 1999; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Durstewitz, Seamans, & Sejnowski, 2000; Miller, Brody, Romo, & Wang, 2003), and integration—for example, integration of velocity to obtain position (Cannon, Robinson, & Shamma, 1983; Robinson, 1989; Seung, 1996; Aksay, Baker, Seung, & Tank, 2000; Seung, Lee, Reis, & Tank, 2000; Sharp, Blair, & Cho, 2001; Taube & Bassett, 2003)—could all require a quasicontinuum of attractor states. In order to understand how the brain performs these functions, it is important to ask whether the brain digitizes information of a continuous quantity using a set of discrete attractor states as suggested in some models (Koulakov, Raghavachari, Kepecs, & Lisman, 2002; Goldman, Levine, Tank, & Seung, 2003), or if that information is stored as an analog variable in a continuous attractor as suggested by others (Seung, 1996; Seung et al., 2000; Compte et al., 2000; Miller et al., 2003; Loewenstein & Sompolinsky, 2003; Durstewitz, 2003). A single trial is typically insufficient to determine a precise rate, as the variance is on the same order as the firing rate for most cortical activity. For example, a system with stable firing rates of 16 Hz (standard deviation, 4 Hz, assuming a coefficient of variation (CV) of 1) and 25 Hz (standard deviation 5 Hz) would need to spend approximately 1 second at each rate in order for the two rates to be distinguishable with any confidence. Hence, if the system were to drift continuously between the two rates in less than 2 seconds, a single trial would not be enough to distinguish such behavior from a discrete jump. If the levels were any closer together, the time spent near each level of rate would have to be correspondingly longer to separate out the rates. So it is unlikely that a single trial (lasting up to a few seconds for most experimental protocols) would distinguish discrete states from a continuous range. Hence, we consider statistical measures calculated from many trials to address the characteristics of continuous and discrete attractors in a neuronal network. Correlations in the spike times of neurons normally decay on the timescale of synaptic time constants. However, in continuous attractor networks, fluctuations on a short timescale can be temporally integrated
and accumulate in the manner of a random walk. Hence, the statistics of neuronal spike times can contain slower correlations due to changes in the underlying rate (Cox & Lewis, 1966; Perkel, Gerstein, & Moore, 1967; Saleh, 1978; Ginzburg & Sompolinsky, 1994; Middleton, Chacron, Lindner, & Longtin, 2003). Statistical properties of the spike times can therefore be a useful tool for analyzing the network in terms of its attractors, which determine the firing rates. Important quantities characterizing the statistics of neuronal spike times are the spike time correlation function, Fano factor, and distribution of interspike intervals (ISIs). The correlation function measures the relative likelihood of one spike time compared to others. The Fano factor measures the variance across trials in the total number of spikes, relative to the mean. The ISI distribution is a histogram of times between spikes. We calculate autocorrelation functions (Zohary, Shadlen, & Newsome, 1994; Bair, Zohary, & Newsome, 2001), Fano factors (Dayan & Abbott, 2001), and ISI distributions for spike trains produced according to a Poisson process. Our systems are examples of doubly stochastic Poisson point processes (Saleh, 1978), so called because not only are spikes generated probabilistically (with probability r(t)δt of a spike in a small time window δt, where r(t) is the underlying rate), but the underlying rate, r(t), of the Poisson spike train also varies randomly. We can think of the two types of noise as operating over different spatial and temporal scales, with variations in rate being relatively slow fluctuations in the entire network that result from the punctate, Poisson noise inherent in the spike train of each individual neuron. We do not consider regular spiking with a varying rate (in which case the following analysis is unnecessary, since the rate could be read out as a function of time for each spike train), since the high CV, near one, of real neurons is better approximated by a Poisson process.
2 Autocorrelation Function for a Poisson Process with Stochastic Rate Variation

We analyze the effects of rate variations by solving for a Poisson process, where no correlations in spike times exist apart from those due to any underlying change of rate. To be precise, we assume the state of the network drifts or changes discretely but slowly, with trial-to-trial randomness. We assume then that the firing rate of each neuron is primarily determined by the state of the network, which provides the synaptic input. Hence the slow, large-scale random variations in the state of the network induce similar random variations in the underlying firing rate, r(t), of each neuron. However, neurons do not fire regularly at a slowly varying firing rate, but emit spikes with a high CV. Since a Poisson process has a high CV of one and to make the problem tractable, we assume that each neuron emits spikes as a Poisson process, whose rate varies randomly with time and across trials.
Hence, in any time interval, the probability of a spike is r(t)dt, but r(t) is not a fixed quantity. The statistics of such processes are determined by the probability distribution of the firing rate, P(r, t). For example, in a continuous attractor, the state of the system follows a random walk, which leads to a gaussian distribution of firing rates, whereas for a set of discrete states, only specific firing rates are possible, with probabilities calculated from the binomial distribution. Such processes are termed doubly stochastic Poisson point processes (Saleh, 1978) and have been studied previously in physics with regard to the emission of photoelectrons from a surface. The probability distribution at one time, P(r, t), can be dependent on a known rate, r0, at an earlier time, t0, in which case we write the conditional probability distribution as P(r, t|r0, t0). The underlying firing rate of any neuron fluctuates from trial to trial, about some mean value, r̄(t), with trial-to-trial variance, σ²(t). First, consider a process with constant average rate, r̄(t) = r0, such that

∫_{−∞}^{∞} r P(r, t|r0, t0) dr = r0   (2.1)

and variance, σ²(t′), defined by

∫_{−∞}^{∞} r² P(r, t|r0, t0) dr = r0² + σ²(t′),   (2.2)

where t′ = t − t0. The autocorrelation function measures how much more or less likely it is, on any particular trial, to observe a spike at time t1 + τ, given a spike at t1, compared to what is predicted by the average rates. For a stationary process, an average can be taken over all values of t1, so the autocorrelation function, Cxx(τ), is a function of the time lag, τ, between two spikes. A negative value for Cxx(τ) indicates it is less likely than average to see two spikes separated by τ on the same trial (for example, due to the refractory period), while a positive value means a spike at one time predicts another spike is more likely than chance to occur after an interval of τ. Cxx(τ) is zero if spike times are uncorrelated. For a Poisson process, the probability of a spike at time t is proportional to the rate, r(t), so when the rate is known only probabilistically, the spike probability is proportional to ∫_{−∞}^{∞} r P(r, t|r0, 0) dr. Hence, the autocorrelation function is given, for positive τ, by (Brody, 1999)
Cxx^ave(τ, T) = [1/(T − τ)] ∫_0^{T−τ} dt1 Cxx(t1, t1 + τ)   (2.3)
 = [1/(T − τ)] ∫_0^{T−τ} dt1 [ ∫_{−∞}^{∞} dr1 r1 P(r1, t1|r0, 0) ∫_{−∞}^{∞} dr2 r2 P(r2, t1 + τ|r1, t1)
 − ∫_{−∞}^{∞} dr1 r1 P(r1, t1|r0, 0) ∫_{−∞}^{∞} dr2 r2 P(r2, t1 + τ|r0, 0) ],
where T is the total measurement interval. In all cases described here (apart from the leaky integrators of section 5), the correlation functions are symmetric (Cxx (t1 , t2 ) = Cxx (t2 , t1 )) so we include formulas only for positive τ . If there is no correlation between the rates at one time and another time, or if the probability distribution of rates does not change with time, then the first term in equation 2.3 is equal to the second term (the shuffle correction; Brody, 1998). In such a case, the autocorrelation is zero, as expected. The autocorrelation function depends on the time lag, τ , alone if the underlying process is stationary, so that at least the first two moments (mean and variance) of the rate are constant. Since neural processes are rarely stationary, the subtraction of the shuffle correction in the above equation is designed to remove effects of nonstationarity in the average rate. However, if the variance in firing rate is not constant through the measurement interval, the autocorrelation function may also depend on the total measurement time, T. In general, we can rewrite the terms in the integrand of equation 2.3 as Cxx (t1 , t1 + τ ) =
∫_{−∞}^{∞} dr1 r1 P(r1, t1|r0, 0) ⟨r(t1 + τ)|r1, t1⟩ − ⟨r(t1)|r0, 0⟩⟨r(t1 + τ)|r0, 0⟩,   (2.4)
where ⟨r(t2)|r1, t1⟩ is the mean value of the rate at time t2, conditional on its earlier value, r1, at time t1. If the average rate remains constant, at any time the rate is equally likely to increase or decrease. Hence, if the rate is known at a certain time, that value is the new average rate for later times. In such a case, the second integral of the first term in equation 2.3 becomes equivalent to equation 2.1, resulting in
∫_{−∞}^{∞} dr2 r2 P(r2, t1 + τ|r1, t1) = r1.   (2.5)
Hence, the first term of equation 2.3 gives

[1/(T − τ)] ∫_0^{T−τ} dt1 ∫_{−∞}^{∞} dr1 r1² P(r1, t1|r0, 0) = [1/(T − τ)] ∫_0^{T−τ} dt1 [r0² + σ²(t1)]
 = r0² + ⟨σ²(t)⟩_(0,T−τ),   (2.6)

which is the square of the mean rate plus the variance in rate averaged over the time interval of measurement. We have used the notation ⟨σ²(t)⟩_(0,T−τ) = [1/(T − τ)] ∫_0^{T−τ} σ²(t) dt. The shuffle correction leads to subtraction of a term r0², canceling part of the above term. Hence, for a process where the rate varies on a trial-to-trial basis, maintaining a fixed overall average, r0, the autocorrelation function is given by

Cxx^ave(τ, T) = ⟨σ²(t)⟩_(0,T−τ).   (2.7)
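As an illustration (ours, with arbitrary parameter values; not from the paper's appendices), the following simulation generates the doubly stochastic process described above: rates diffuse so that their trial-to-trial variance grows as αt, and the resulting spike counts are over-dispersed relative to a fixed-rate Poisson process:

```python
import numpy as np

rng = np.random.default_rng(3)
r0, alpha, dt, T, trials = 50.0, 20.0, 0.01, 10.0, 2000
n = int(T / dt)

# Rates follow a random walk (Wiener process): Var[r(t)] = alpha * t across trials.
steps = rng.normal(0.0, np.sqrt(alpha * dt), size=(trials, n))
rates = r0 + np.cumsum(steps, axis=1)

# Poisson spikes given the rate on each trial (probability r(t) dt per bin).
spikes = rng.random((trials, n)) < np.clip(rates, 0, None) * dt

t_idx = n // 2
print(rates[:, t_idx].var(), alpha * (t_idx + 1) * dt)   # variance grows as alpha*t
counts = spikes.sum(axis=1)
print(counts.var() / counts.mean())   # Fano factor well above 1 (cf. section 3)
```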
2.1 Correlation Functions for a Continuous Attractor. A continuous attractor, or line attractor, is a range of stable states with no distinct changes between states. A neuronal network with a continuous attractor can be an integrator for any stimulus that causes the network's state to shift along the attractor. Once the stimulus is removed, the network remains stable, so it does not change systematically in the absence of input. Such a property makes a neuronal integrator equivalent to a continuous memory device. When a neuronal network has a continuous attractor, noise causes the state to change as a random walk (see Figure 1A). A random walk, or more technically a Wiener process, is essentially the temporal integral of uncorrelated gaussian noise. A continuous attractor integrates any noise that causes the state of the network to shift along the attractor (see Figure 2) in the same manner as it can integrate a stimulus. Hence, the property that allows a continuous attractor to store the memory of a stimulus also causes it to have memory of (and, hence, integrate) the noise, leading to a random walk of firing rates. For a random walk process with uniform, uncorrelated noise, the variance is linear in time, σ²(t) = αt, which leads to an autocorrelation function proportional to the measurement period, T − τ:

Cxx^ave(τ, T) = α(T − τ)/2.   (2.8)
This is an unusual result, as correlation functions for spike times typically decay with lag time, τ , while an increase of total measurement time, T, improves the sampling. The autocorrelation function increases with measurement time for a random walk process, because the trial-to-trial variability
Figure 1: (A) Rate as a function of time for different trials of an unbiased random walk process. Dashed line shows initial rate, of 50 Hz, with smooth curves indicating the standard deviation across trials, 50 Hz ± 20 Hz √t. (B) Time variation of the rate for a process with discrete states and stochastic transitions between states.
in rates increases with time. The longer the time period of measurement is, the more the spike rate on one trial is distinguishable from the spike rate on another trial. This shows up as a positive autocorrelation, since observation of a spike at a certain time is more likely to occur on a trial with high rate,
Figure 2: The line attractor is defined by the tuning curve of each neuron, fA(s), fB(s), fC(s). Fluctuations in s lead to fluctuations in the firing rates of neurons that are either positively correlated (fA, fB) or negatively correlated (fA, fC).
and an above-average rate at one point in time means that spikes are more likely than on average for any later time. We extend the above result to find the autocorrelation and cross-correlation functions between neurons whose rates depend on an underlying continuous attractor, parameterized by s, as shown in Figure 2. The firing rates of each neuron are partially determined by the position of the system along the continuous attractor, s, such that changes in s lead to coherent changes in the firing rates of all neurons in the system. We assume that an initial stimulus places the system at an initial point on the attractor with average location s̄0 and standard deviation σ0. The firing rates of neurons (labeled A and B) also fluctuate independently, with standard deviations σA and σB, about the value determined by the position in the attractor. So for two neurons, A and B, the probability distribution of their firing rates is given by
PA(r|s) = [1/√(2πσA²)] exp[ −(r − fA(s))²/(2σA²) ],   (2.9)

PB(r|s) = [1/√(2πσB²)] exp[ −(r − fB(s))²/(2σB²) ],   (2.10)
where f A,B are the tuning curves (average firing rate as a function of stimulus) that describe the attractor (as depicted in Figure 2). In the calculations,
we expand these firing rate curves to second order in the fluctuations about the initial average stimulus, s̄0:

fA(s) ≈ fA⁰ + fA′(s − s̄0) + fA″(s − s̄0)²/2.   (2.11)
The expansion in s − s̄0 is valid if the change in the network due to noise is small compared to the change in the network due to the complete range of stimuli. We assume the diffusion along the continuous attractor is specified by a random walk in s, such that

P(s, t|s0, t0) = [1/√(2πα(t − t0))] exp[ −(s − s0)²/(2α(t − t0)) ],   (2.12)

where s0 is the initial position on the attractor for a given trial. Noise, of amplitude σ0, during the initial stimulus presentation leads to a distribution of s0, such that

P(s0, t0) = [1/√(2πσ0²)] exp[ −(s0 − s̄0)²/(2σ0²) ].   (2.13)
The cross-correlation function now is solved by integrating over all possibilities. It is expressed in the somewhat cumbersome form (writing t2 = t1 + τ):

C_AB^ave(τ, T) = [1/(T − τ)] ∫_0^{T−τ} dt1 ∫_{−∞}^{∞} P(s0) ds0 ∫_{−∞}^{∞} P(s1, t1|s0, t0) ds1 ∫_{−∞}^{∞} P(s2, t2|s1, t1) ds2
  × ∫_{−∞}^{∞} PA(r1|s1) r1 dr1 ∫_{−∞}^{∞} PB(r2|s2) r2 dr2
 − [1/(T − τ)] ∫_0^{T−τ} dt1 [ ∫_{−∞}^{∞} P(s0) ds0 ∫_{−∞}^{∞} P(s1, t1|s0, t0) ds1 ∫_{−∞}^{∞} PA(r1|s1) r1 dr1 ]
  × [ ∫_{−∞}^{∞} P(s0) ds0 ∫_{−∞}^{∞} P(s1, t1|s0, t0) ds1 ∫_{−∞}^{∞} P(s2, t2|s1, t1) ds2 ∫_{−∞}^{∞} PB(r2|s2) r2 dr2 ],   (2.14)

with the complete final result:

C_AB^ave(τ, T) = fA′ fB′ [σ0² + α(T − τ)/2] + (fA″ fB″/2)[σ0⁴ + σ0² α(T − τ) + α²(T − τ)²/3].   (2.15)

For clarity, we consider just the terms including up to the first derivative of fA,B from here on. In this case, we have

C_AB^ave(τ, T) = fA′ fB′ [σ0² + α(T − τ)/2],   (2.16)
which tells us that the cross-correlation is proportional to the product of derivatives of the two tuning curves, fA′ fB′ (Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Pouget, Zhang, Deneve, & Latham, 1998). Note also the two terms inside the brackets. The first is equal to the variance during the stimulus, which is a fixed quantity. The second term is exactly that derived earlier for a random walk with fixed noise, where the variance increases linearly with time. For the autocorrelation, we can simply set fA′ = fB′ = 1, measuring the noise along the line attractor in terms of rate instead of s. Note that given a position in the continuous attractor labeled by s(t), uncorrelated fluctuations in the rates do not affect the correlation function. Specifically, the cross-correlations and autocorrelations are unaffected by the quantities σA, σB in equations 2.9 and 2.10. Rather, it is the manner in which the average rate of a neuron depends on the stimulus that determines the correlations.

2.2 Discrete Levels of Rate: Autocorrelation. For a system with discrete states, noise-driven changes in rate occur probabilistically. We assume the system has a set of equally spaced states, each with an identical average lifetime, τx, with stable rates for a particular neuron separated by Δr (see Figure 1B). The discrete "hopping" between different values of rate is then described by the exponential time distribution between "hops," such that the probability between t and t + δt of remaining in one state is exp[−t/τx]δt/τx. Such an exponential distribution arises when the probability of hopping per unit time is fixed at 1/τx. The probability distribution of rates after M hops is binomial, with mean r0 and variance σ²(M) = M(Δr)². The probability of M jumps in total time, t, follows the Poisson distribution with mean t/τx. The variance of rate as a function of time can be calculated (see appendix A) to give σ²(t) = (Δr)² t/τx. So when equation 2.7 is used,
1278
Paul Miller
the autocorrelation function for a process with discrete hopping is identical to that of a continuous random walk,
ave Cxx (τ, T) =
(r )2 (T − τ ) , 2τx
(2.17)
if the noise terms for the two processes are correctly scaled such that α = (r )2 /τx in equation 2.8. 3 Fano Factor: General Results The Fano factor is a measure of the variation in total spike count for a fixed process. Typically it varies between zero for regular spiking and one for Poisson firing. Time variation in the Fano factor indicates temporal correlations in the processes that lead to spikes. The Fano factor is defined by averaging over many trials of fixed time, T,
F (T) =
N2 − N2 , N
(3.1)
where N is the number of spikes in the interval of length T. The expected number of spikes, λ(T), in time, T, is given by the time integral of firing rate: λ(T) =
T
r (t)dt.
(3.2)
0
For a Poisson process, where spikes are uncorrelated, occurring with probability r (t)δt in the time interval t to t + δt, then the probability of N spikes in total time T is given by the Poisson distribution,
P(N|T) =
λN exp(−λ). N!
(3.3)
We are interested in processes where the rate is not fixed but can fluctuate from trial to trial, leading to different values of λ, in which case P(N|T) =
∞
−∞
dλP(λ|T)
λN exp(−λ). N!
(3.4)
Using the above result, we see that the mean number of spikes, ⟨N⟩, is as required:

⟨N⟩ = Σ_{N=0}^∞ N P(N) = ∫_{−∞}^{∞} dλ λ P(λ|T) = ⟨λ(T)⟩.   (3.5)
The mean square number (or second moment) is calculated similarly (Saleh, 1978):

⟨N²⟩ = Σ_{N=0}^∞ N² P(N) = ∫_{−∞}^{∞} dλ P(λ|T) Σ_{N=0}^∞ [N(N − 1) + N] (λ^N/N!) exp(−λ)
 = ∫_{−∞}^{∞} dλ P(λ|T) (λ² + λ) = ⟨λ|T⟩ + ⟨λ²|T⟩.   (3.6)
Hence, the Fano factor for a Poisson process is given in terms of the moments of the time-integrated rate, λ, as

F(T) = [⟨N²⟩ − ⟨N⟩²]/⟨N⟩ = [⟨λ²⟩ + ⟨λ⟩ − ⟨λ⟩²]/⟨λ⟩ = 1 + Var(λ)/⟨λ⟩.   (3.7)
Note that if the process is not Poisson but has either more regular or less regular spiking statistics, we can calculate the Fano factor by assuming the number of spikes has a mean given by the time integral of the rate (this must be true by definition of the rate) but with some variation around the mean, such that

P(N|λ) = [1/√(4πσ²)] exp[ −(N − λ)²/(2σ²) ].   (3.8)
A calculation of F(T) leads to

[⟨N²⟩ − ⟨N⟩²]/⟨N⟩ = σ²/⟨λ⟩ + Var(λ)/⟨λ⟩.   (3.9)
Hence in general, the Fano factor splits into two terms—one from irregularities in the spiking statistics and a second arising from irregularities in the underlying rate. The first term is zero for a regular process and one for
a Poisson process, and typically gives a fixed, constant value. We consider the effects of the second, rate-dependent term, in the following sections.

3.1 Calculation of Fano Factor from the Autocorrelation Function. In this section we use Bayes' rule to relate the Fano factor to the temporal integral of the autocorrelation function. We evaluate the derivative in time of the second moment of λ as

d⟨λ²(t)⟩/dt = 2⟨λ(t)r(t)⟩ = 2 ∫ dr2 P(r2, t) ⟨λ(t)|r2, t⟩ r2,   (3.10)
where ⟨λ(t)|r2, t⟩ is the average over all trajectories that reach a specific rate, r2, at time, t. Since we can write
⟨λ(t)|r2, t⟩ = ∫_0^t dt1 ⟨r1(t1)|r2, t⟩,   (3.11)
we can substitute into equation 3.10 to give:

d⟨λ²(t)⟩/dt = 2 ∫ dr2 P(r2, t) r2 ∫_0^t dt1 ⟨r1(t1)|r2, t⟩
 = 2 ∫_0^t dt1 ∫ dr2 P(r2, t) r2 ∫ dr1 P(r1, t1|r2, t) r1.   (3.12)
Bayes' rule allows us to replace P(r2, t)P(r1, t1|r2, t) with P(r1, t1)P(r2, t|r1, t1), so the conditional rate is based on a prior rate. Hence equation 3.12 yields

d⟨λ²(t)⟩/dt = 2 ∫_0^t dt1 ∫ dr1 P(r1, t1) r1 ∫ dr2 P(r2, t|r1, t1) r2,   (3.13)
which is similar to the first term of equation 2.3. Care with the integral limits, such that all spikes and hence firing rates are integrated over a window 0 < t < T, leads to

Var(λ(T)) = ⟨λ²(T)⟩ − ⟨λ(T)⟩² = 2 ∫_0^T dτ (T − τ) Cxx^ave(τ, T),   (3.14)
where Cxx^ave(τ, T) is given by equation 2.3. Hence, the Fano factor for a doubly stochastic Poisson process becomes

F(T) = 1 + [ 2 ∫_0^T dτ (T − τ) Cxx^ave(τ, T) ] / [ ∫_0^T dt r(t) ].   (3.15)
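As a consistency check (ours), inserting the random-walk autocorrelation of equation 2.8 into equation 3.15 numerically reproduces the quadratic Fano factor derived in equation 3.23 below; the parameter values are arbitrary:

```python
import numpy as np

alpha, r0, T = 20.0, 50.0, 10.0
tau = np.linspace(0.0, T, 100001)
C = 0.5 * alpha * (T - tau)                                  # equation 2.8
fano = 1.0 + 2.0 * np.trapz((T - tau) * C, tau) / (r0 * T)   # equation 3.15
print(fano, 1.0 + alpha * T**2 / (3.0 * r0))                 # both give 14.333...
```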
3.2 Fano Factor for Processes with Constant Mean Rate. If the mean rate is a constant, r0 , it is clear that the average value of λ is given by
⟨λ(T)⟩ = ∫_0^T dt ∫_{−∞}^{∞} dr r P(r, t|r0, t = 0) = ∫_0^T dt r0 = r0 T.   (3.16)
The second moment, ⟨λ²⟩, includes correlations, so it is not straightforward to calculate. We begin by noting that

d⟨λ²⟩/dt = 2⟨λ dλ/dt⟩   (3.17)
 = 2⟨λ(t)r(t)⟩,   (3.18)
where the average is over trials. Similarly,

d²⟨λ²⟩/dt² = 2⟨r²(t)⟩ + 2⟨λ dr/dt⟩,   (3.19)
where the second term yields zero for a process with constant mean rate. So in general for such a process,

⟨λ²(t)⟩ = ∫_0^T dt ∫_0^t dt′ [2r0² + 2σ²(t′)],   (3.20)
giving

Var[λ(T)] = ∫_0^T dt ∫_0^t dt′ 2σ²(t′),   (3.21)

where σ²(t) is the variance in rate at time, t.
For both the random walk process and the discrete hopping process, where the variance in rates increases linearly with t, we have

Var[λ(T)] = αT³/3,   (3.22)
where α = (Δr)²/τx. Hence, the Fano factor in both cases has a quadratic dependence on T and ⟨N⟩:

F(T) = 1 + [⟨λ²⟩ − ⟨λ⟩²]/⟨λ⟩ = 1 + αT²/(3r0) = 1 + α⟨N⟩²/(3r0³),   (3.23)
where for a discrete network α = (Δr)²/τx. It is therefore difficult to distinguish the spike statistics of continuous versus discrete processes. The problem arises because the variances of the two processes increase linearly with time. Hence, common statistical measures, which depend on only the variance of firing rates, contain terms such as α = (Δr)²/τx, which is the constant time derivative of the variance in firing rates. Statistical measures that depend on Δr separately from this combination (which can remain constant as Δr → 0 for a random walk process) are needed to distinguish a continuous from a discrete system. This is true for statistics involving the fourth-order cumulant in spike counts, since the fourth moment of the firing rates differs between the two systems, such that

⟨r⁴⟩ − 4⟨r³⟩⟨r⟩ − 3⟨r²⟩² + 12⟨r²⟩⟨r⟩² − 6⟨r⟩⁴ = (Δr)² αt.   (3.24)
Setting Δr = 0 for the continuous system, equation 3.24 is an example that cumulants higher than second order are zero for the gaussian distribution. The details of obtaining a similar result in the moments of the spike count, N, and the proof of equation 3.24 are given in appendix A. The above result is an indication that with the discrete hopping process, it is more probable for the rate to change by a large amount (more than a couple of standard deviations) than for the continuous random walk process. Since the probability of large excursions in rate is small, many trials are necessary to make use of this statistic, as we show below. In principle, the following set of moments,

[⟨N⁴⟩ − 3⟨N²⟩² + 2⟨N⟩⁴ − 6⟨N³⟩ + 6⟨N²⟩⟨N⟩ + 11⟨N²⟩ − 3⟨N⟩² − 6⟨N⟩] / [3⟨N²⟩ − 3⟨N⟩² − 3⟨N⟩] = (Δr)² t²/5,   (3.25)
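The following sketch (ours, with arbitrary parameters) illustrates the idea behind equation 3.24: for matched variance α = (Δr)²/τx, the fourth cumulant of the rate change is (Δr)²αt for the hopping process but vanishes for the gaussian random walk:

```python
import numpy as np

rng = np.random.default_rng(4)
trials, t, tau_x, dr = 200000, 4.0, 1.0, 10.0
alpha = dr**2 / tau_x

jumps = rng.poisson(t / tau_x, trials)                # number of hops per trial
hops = dr * (2 * rng.binomial(jumps, 0.5) - jumps)    # net displacement, +-dr steps
walk = rng.normal(0.0, np.sqrt(alpha * t), trials)    # matched random walk

def kappa4(x):
    # Fourth cumulant of a zero-mean variable: <x^4> - 3<x^2>^2.
    return np.mean(x**4) - 3 * np.mean(x**2) ** 2

print(kappa4(hops), dr**2 * alpha * t)   # ~ (dr)^2 * alpha * t for hopping
print(kappa4(walk))                      # ~ 0 for the gaussian random walk
```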
can be used to differentiate systems with continuous versus discrete states. We tested the formula by generating sets of spike times stochastically on a computer. Since the process is very noisy, an experimentally unfeasible number of trials (5000) was required to produce each curve. The four curves for each system (see Figure 3B) indicate the variation that remains even after this large number of trials. We used a bounded system (see the section below) to test this statistical quantity. The boundaries quickly slow the increase in the fourth moment and cause the combination of moments to reach a maximum and decline. However, when we compare the curves with discrete states versus continuous states, the curve never rises significantly above zero for the continuous system. Hence, if this statistic is measured to be significantly above zero, it is an indication of a discrete set of attractors. If the deviation is not significantly above zero, it is most likely due to lack of statistical power.

4 Boundaries

Firing rates of neurons are limited, with an upper bound arising from saturation of synapses or channel kinetics that is rarely reached for energetic reasons, and a lower bound of zero arising from the physical impossibility of the firing rate becoming negative. Hence, it is important to consider the effects of boundaries on the range of firing rate for both continuous and discrete systems. In the calculations, we assume the boundaries are reflecting, which means that whenever the unbounded random process would cause the firing rate to fall beyond the boundary by an amount, rx, we instead assume the rate falls inside the boundary by the same amount, rx. That is, noise still can move the rate from the boundary (the boundary is not an attractor state) but only to within the bounds. We match different discrete and continuous bounded systems by ensuring that the initial linear increase in variance of rate with time is identical (as we did with the unbounded system) and also that the variance in rate as t → ∞ is identical between systems. This means that, assuming a discrete system with three states at r0, r0 ± Δr, the corresponding five-state system has states at r0, r0 ± Δr′, r0 ± 2Δr′, where Δr′ = Δr/√3; the
four-state system has states at r0 ± Δr′/2 and r0 ± 3Δr′/2, where Δr′ = √(6/11) Δr; the two-state system has states at r0 ± Δr′/2, where Δr′ = √2 Δr; and the continuous system has boundaries at r0 ± rB, where rB = √(3/2) Δr. Matching the initial linear increase in variance of rate requires us to keep a constant value for α = (Δr)²/τx for all systems.

4.1 Continuous Random Walk with Reflecting Boundaries. When a random walk has reflecting boundaries, each reflection can be accounted for by a trajectory of an unbounded random walk from a mirror source. If the initial firing rate is r1, then mirror sources exist for all n at 2r0 +
Figure 3: Plots of moments of the spike count from spike trains produced by computer simulations. The systems have an average rate of 100 Hz. The discrete system with three states has Δr = 80 Hz and τx = 16 s. The five-state and continuous systems are scaled to have the same initial and asymptotic variance in rates. Each dark line corresponds to the average of 5000 trials. Gray lines are analytic results for the corresponding unbounded processes [(Δr)² t²/5]. (A) The Fano factors. (B) The fourth-order combination of moments of N (see equation 3.25).
2rb − r1 + 4nrb (odd numbers of reflections) and r1 + 4nrb (even numbers of reflections), where the original source is included. The probability distribution of rate, P[r(t)], is, for any rate between the two boundaries, the sum of the unbounded random walk probabilities of reaching that rate when starting from any of the mirror sources. That is, for r0 − rb ≤ r ≤ r0 + rb,

P[r, t1 + τ|r1, t1] = [1/√(2πατ)] Σ_{n=−∞}^{∞} { exp[ −(r − 2r0 + r1 − 2rb + 4nrb)²/(2ατ) ] + exp[ −(r − r1 + 4nrb)²/(2ατ) ] }.   (4.1)

In practice, unless the calculation requires long time intervals, only a few values of n are required, between −nmax and nmax such that 2ατ ≪ (4nmax rb)². In the computational results, we use −8 ≤ n ≤ 8. This allows us to calculate the mean rate conditional on an earlier rate, ⟨r(t)|r1(t1)⟩ (as needed in equation 2.4), and the variance in rate, Var[r(t)|r1(0)], by integrating the above probability distribution between the boundaries. The full formulas used in the calculation are given in appendix B.

4.2 Small Number of Discrete States. For the system with discrete states, we simply limit the total number of states to bound the firing rate. We solve the system analytically for two, three, four, and five states. In all cases, the probability of a transition between states per time interval, dt, is dt/τx, with τx scaled for each system to maintain (Δr)²/τx = α. The only difference from the unbounded system is that for an edge state, the direction of the transition is fixed to be back to an intermediate state. So for two states, the system becomes simple to analyze, since the direction of the jump is predetermined for both states; only the time of the jump is stochastic. After an even number of jumps, the system always returns to its starting state, and after an odd number of jumps, the system is in the other state. So, using the Poisson distribution for the number of jumps in time t, we find that after time t, the probabilities of remaining in the same state, P(even), or being in the other state, P(odd), are given by

P(even) = (1/2)[1 + exp(−2t/τx)]
P(odd) = (1/2)[1 − exp(−2t/τx)].   (4.2)
Similarly, in the three-state system, if the rate is at the central state at any time, after an even number of jumps (with probability P(even) in equation 4.2) it will return to the same value. After an odd number of jumps,
1286
Paul Miller
the rate is equally likely to be r0 ± r . Hence, the system can be solved. We use similar methods to solve for the four-state and five-state system, as detailed in appendix C. Once the conditional probabilities for the firing rate dependent on its earlier value are known, we can use the standard methods described earlier to calculate the autocorrelation function and Fano factor. 4.3 Results for Bounded Systems. One observation for a bounded process is that if the initial rate is low, it will drift up on average; similarly, an initially high rate will drift down. This is seen in Figure 4A, where as expected, the average rate tends to the midpoint of the two boundaries a long time after the stimulus, as the system loses memory of the initial condition. Figure 4B contains two curves for each system showing the variance in rate as a function of time. For each system, the curve that is higher at early times is for a process where the initial rate is at the midpoint between the two boundaries. The lower solid curve is the result when the initial rate is at a boundary. Observation of such a dependence on the initial stimulus of the variance in firing rate as a function of time, as well as drift of the average firing rate, would be strong evidence for this type of bounded set of attractor states. Figures 4C and 4 D (solid lines) demonstrate that the autocorrelation decays on the timescale of the memory loss of the initial rate (cf. Figure 4A). Figure 4C is calculated as Cxx (t1 , τ ) where t1 = 1 s, whereas Figure 4D has t1 = 15 s. Since the autocorrelation with τ = 0 is equal to the variance at that time, the range of variances at small time in Figure 4B produces the range of magnitudes of autocorrelation in Figure 4C. At large times, the more discrete states in the system, the more time constants there are in the system, so that the slower time constants contribute to a slower decay in the autocorrelation function. The continuous random walk (solid lines) has the slowest and least purely exponential decay of all. The Fano factors (see Figure 4E) vary across systems at small times, reflecting the variance in firing rate (see Figure 4B) and average initial
Figure 4: Statistics for bounded systems. (A) Average rate given different initial rates. (B) Variance in rates. For each system, the upper curve is the variance when the initial rate is at the midpoint, r0 , and the lower curve is when the initial rate is at a boundary. For the three-state system, the variance is independent of the initial rate, but in general, the more states, the greater is the range. (C) Autocorrelation functions, Cxx (t1 , t1 + τ ) with t1 = 1s; same legend as B. (D) Autocorrelation functions, Cxx (t1 , t1 + τ ) with t1 = 15 s. Same legend as B. (E) Fano factors when the initial rate is r0 , the midpoint of the system. Same legend as B. (F) Fano factors when the initial rate is at the lower boundary (upper curve) or upper boundary (lower curve) of the system. Same legend as B.
Spike Statistics of Graded Memory Systems
60
50
2
Continuous 5-states 3-states 2-states
Rate variance (Hz )
B 50
Continuous 5-states 4-states 2,3-states 0 0
40 0
Time (sec)
8
C
D
Time (sec)
8
50
Autocorrelation
Autocorrelation
50
0 0
Time (sec)
0 0
8
E
F Fano factor
5
0 0
Continuous 5-states 3-states 10 Time (sec)
20
Time (sec)
8
5
Fano factor
Mean Rate (Hz)
A
1287
0 0
Continuous 3-states
Time (sec)
20
1288
Paul Miller
firing rates. At large times, as with the Ornstein-Uhlenbeck process (see Figure 5B), the Fano factors for the bounded systems approach a constant value greater than one due to the constant area under the autocorrelation functions and constant average rate. 5 Leaky Integrators A second method for limiting the range of firing rates is to make the integrator imperfect, or leaky. A leaky integrator is defined by a stable rate, r A, which corresponds to an attractor state, and a time constant, τ L , for decay of firing rates to that state. In practice, neuronal systems are likely to have both “hard” boundaries for the firing rate and a slow drift to an attractor state. For the continuous system, reflecting boundaries are analogous to walls of a flat (square) potential well, whereas a leak term makes the effective potential quadratic with a minimum at the attractor state. Adding a leak to the continuous random walk leads to an OrnsteinUhlenbeck (OU) process for the rate dr r − rA + αη(t), =− dt τL
(5.1)
where η(t) is the uncorrelated gaussian noise term. For the discrete integrator, we assume that the discrete levels of rate each decay toward a single value, r A, such that the distance between them decays exponentially: r → r exp(−t/τ L ). For both the continuous random walk and the discrete case, the mean rate follows: r (t) = r A + (r0 − r A) exp(−t/τ L ).
(5.2)
The variance in rate for the set of discrete states is then given by Var[r (t)] = r (t)2 − r (t)2 =
(r )2 t exp(−2t/τ L ) = αt exp(−2t/τ L ). τx
(5.3)
However, for the OU process we have (Gillespie, 1992) Var[r (t)] =
−2t ατ L 1 − exp . 2 τL
(5.4)
These results are compared in Figure 5B, using the specific values, τ L = 2s, r = 10 Hz, τx = 2 s, α = 50 s−3 . The OU process has a finite variance in rate at large times due to stochastic jitter about the attractor state (equivalent to noise around a smooth potential minimum). In contrast, for the discrete
Spike Statistics of Graded Memory Systems
B
50
5 Continuous Discrete
50
8
0
D t1=1s t1=2s t1=4s t1=8s t1=16s
Autocorrelation
50
0 0
Lag (sec)
F
8
Time (sec)
20
t1=1s t1=2s t1=4s t1=8s t1=16s
0 0
8
T=1s T=2s T=4s T=8s T=16s
Lag (sec)
50
50
Autocorrelation
0 0
E
Time (sec)
Continuous Discrete
Autocorrelation
0 0
Autocorrelation
C
Fano Factor
2
Rate variance (Hz )
A
1289
0 0
Lag (sec)
8
T=1s T=2s T=4s T=8s T=16s
Lag (sec)
8
Figure 5: Statistics for the leaky integrators. System with an attractor state at r A = 50 Hz, τ L = 2 s, α = 25 s−3 for the continuous system and r = 10 Hz, τx = 2 s for the discrete system. (A) The variance in rate for both continuous and discrete systems. (B) The Fano factor for both systems. For each system, the initial rate is 10 Hz for the upper curve and 50 Hz for the lower curve. (C) Autocorrelation, Cxx (t1 , t1 + τ ) for the discrete system. (D) Autocorrelation, ave (τ, T) Cxx (t1 , t1 + τ ) for the continuous system. (E) Average autocorrelation, Cxx ave (τ, T) for the continuous for the discrete system. (F) Average autocorrelation, Cxx system.
1290
Paul Miller
system, all allowed states collapse to a single rate at large times, so the variance in rate approaches zero. For both leaky processes, the trial-averaged value of the rate at a later time, t2 , given its value, r1 , at an earlier time, t1 , is no longer constant but given by the decay to the attractor state as −(t2 − t1 ) r (t2 )|r1 , t1 = r A + (r1 − r A) exp . τL
(5.5)
This allows us to find the autocorrelation function for both systems, using equation 2.3: −τ Cxx (t1 , t1 + τ ) = r12 (t1 ) exp τL −τ r1 (t1 ) − r1 (t1 )r2 (t1 + τ ) + r A 1 − exp τL −τ = Var[r (t1 )] exp . (5.6) τL So for the continuous, OU process, we have: −2t ατ L −τ 1 − exp (5.7) exp 2 τL τL −τ −2(T − τ ) τL ατ L ave exp 1 − exp 1− , Cxx (τ, T) = 2 τL 2(T − τ ) τL
Cxx (t1 , t1 + τ ) =
whereas for the discrete leaky system, we have: −2t1 −τ exp (5.8) τL τL −τ −2(T − τ ) τL ατ L ave Cxx (τ, T) = exp 1 − exp 2 τL 2(T − τ ) τL −2(T − τ ) . − exp τL
Cxx (t1 , t1 + τ ) = αt1 exp
The results are plotted in Figures 5C to 5F. Note that the averaged autocorrelation function for the OU process (see Figure 5F) has a linear decay to zero at small times, like the full random walk, but has an exponential decay with a longer measurement period.
Spike Statistics of Graded Memory Systems
1291
We use equation 3.15 to calculate the Fano factors for the two types of leaky integrator. For the discrete system, we have ατ L2 t exp (−2t/τ L ) + τ L 1 − exp (−t/τ L ) 1 − 3 exp (−t/τ L ) , F (t) = 1 + 2 r At + τ L (r0 − r A) 1 − exp (−t/τ L ) (5.9) and for the continuous system ατ L2 2t − 4τ L 1 − exp (−t/τ L ) + τ L 1 − exp (−2t/τ L ) . F (t) = 1 + 2 r At + τ L (r0 − r A) 1 − exp (−t/τ L ) (5.10) Figure 5B demonstrates that while the Fano factors for both processes begin by increasing quadratically together, for the discrete process, a maximum occurs on a timescale of the order of τ L , after which the Fano factor decays as 1/t back toward a value of one. In contrast, the Fano factor for the OU process rises to an asymptotic value of 1 + ατ L2 /r A (a value of 5 in Figure 5B). 6 Distribution of Interspike Intervals We have shown that whereas the second-order statistics of the two types of process can be identical, a difference does occur in the fourth-order statistics. The distribution of ISIs for the two processes will have identical second-order moments but will differ in their higher-order moments and hence have a different overall shape. In this section, we calculate the ISI distribution for a continuous attractor versus a set of discrete attractors and highlight their differences. For a Poisson process, following a spike at time t1 , the interspike interval, τs (t1 ), has a probability distribution, P(τs ), given by P(τs )dτs = exp −
t1 +τs
r (t )dt r (t1 + τs )dτs .
(6.1)
t1
t +τ For a nonstationary process, the quantity in the exponent, t11 s r (t )dt = λ(t1 , t1 + τs ) has a probability distribution that depends not only on the time interval, τs , but also on the initial time, t1 . For a random walk and OU process, the probability distribution of λ is gaussian (see appendix D) and can be calculated if the rate is known at the start of the interval, r1 (t1 ), and at the end of the interval, r2 (t2 ).
1292
Paul Miller
For such continuous random walks, the probability distribution for r2 is known, 1 (r2 − r1 )2 P(r2 , t1 + τs |r1 , t1 ) = √ exp − , 2ατs 2πατs
(6.2)
and the probability distribution for r1 given a spike occurring at t1 is, by Bayes’ rule, P(r1 , t1 |tsp = t1 ) = P(r1 , t1 ) × P(tsp = t1 |r1 , t1 )/P(tsp = t1 ) r1 (r1 − r0 )2 1 exp − . =√ 2αt1 r (t1 ) 2παt1
(6.3) (6.4)
We have used the notation tsp for the time of one spike in the spike train, so, for example, P(tsp = t1 ) is the probability of a spike at time, t1 . We can integrate over all probability distributions to find the distribution of interspike intervals for a random walk with a Poisson spike train: P(τs |tsp = t1 ) =
dr1 P(r1 |tsp = t1 )
dr2 P(r2 , t1 + τs |r1 , t1 )r2
dλ exp[−λ]P(λ|r1 , t1 ; r2 , t2 ) = r0 exp[−r0 τs + ατs2 t1 /2 + ατs3 /6] ατs t1 2ατs2 ατs t1 2 αt1 1− ) + 2 − (1 − . r0 r0 r0 r0
(6.5)
(6.6)
For a random walk of total duration T, the probability distribution for ISIs is given by the integral of the above expression over t1 up to a maximal time of T − τs , normalized by the expected number of ISIs, r0 T − 1, which yields P(τs ) =
r0 exp[−r0 τs + ατs3 /6] r0 T − 1 × exp[ατs2 (T − τs )/2] − 1
2 ατs2 (T − τs ) 2ατs2 6 4 4ατ × 1− + 2 2 + − 2 r0 r0 τs r0 τs r0 4ατs 2α(T − τs ) 4 6 2 + exp[ατs (T − τs )/2] + − − 2 2 . (6.7) r0 τs r02 r02 r0 τs
Spike Statistics of Graded Memory Systems
1293
When the spread in rates is significantly smaller than the average rate, r0 , the probability distribution is dominated by the initial term, r0 exp(−r0 τs ), which is the result for a static process. The result, equation 6.7, is limited to small τs r0 /(αT), because it is for an unbounded random walk, where the rate is unconstrained and unphysical contributions from negative rates can dominate the result for large τs . So we extend the result to the leaky integrator, for which the rates are more constrained, allowing a wider range of validity in τs . For the OU process, the resulting ISI distribution yields (see Figure 6D): P(τs |tsp = t1 ) =
exp (−r Aτs ) exp − (r 1 − r A) f (τs ) 1 + e −τs /τL (6.8) r1 2 2 −σλ,τ f 2 (τs ) −τs /τL −σr,t s 1 e +1 exp exp 2 2
−τs /τL 2 2 2 −τs /τ L e r 1 − f (τs )σr,t + 1 + σr,t e 1 1
2 2 r 1 − f (τs )σr,t f (τs ) , e −τs /τL + 1 r A 1 − e −τs /τL − σr,τ 1 s
where f (τs ) = τ L tanh[−τs /(2τ L )] (see equations D.6 and D.7), r 1 = r (t1 ) 2 (see equation 5.2) and σλ,τ = (ατ L2 /2){2τs − 4τ L [1 − exp(−τs /τ L )] + τ L [1 − s exp(−2τs /τ L )]} (see equation 5.10). For the discrete random walk, we do not have the full distribution of λ but can make progress by assuming that at most, one transition between states occurs in the interspike interval. Hence, our calculation is to first order in τs /τx . To make progress, we use the probability distribution of firing rate, r1 , at time, t, given by ∞ −t t N 1
N+n N−n . exp P (r1 = r0 + nr, t) = N τ τ 2 ! 2 ! x L 2 N=n
(6.9)
Assuming at most one transition, the probability of being at the same rate at the end of the ISI is P0 = 1 − exp(−τs /τx ) and of increasing or decreasing by r is P±1 = exp(−τs /τx )/2. In the case of no transition, exp(−λ) = exp(−r1 τs ). With a single transition, the jump in rates is equally likely to occur at any time in the ISI, so we can evaluate exp(−λ) by calculating exp(−λ) as a function of the time of transition and integrating across transition times. These results can be used in equation 6.5. For a jump up in rates within τs to a higher rate, r1 + r , exp(−λ) = exp(−r1 τs )
[exp(r τs ) − 1] . r τs
(6.10)
1294
Paul Miller
We can then evaluate the ISI distribution using P(τs , t1 ) =
dr1
r1 P(r1 , t1 ) r (t1 )
dr2 r2 P(r2 , t2 |r1 , t1 )exp(−λ)
(6.11)
and allowing only r2 = r1 , r1 ± r . The resulting ISI distribution is given by: t1 exp (−r0 τs ) P(τs |tsp = t1 ) = exp [cosh(r τs ) − 1] r0 τx 2 t1 2 t1 r0 − r sinh(r τs ) + (r ) cosh(r τs ) τx τx τs sinh(r τs ) × 1+ −1 τx r τs τs [1 − cosh(r τs )] t1 + r0 − r sinh(r τs ) . (6.12) τx τx r τs Note the similarity in form to the continuous random walk result (see equation 6.6), which the above formula reproduces in the limit r, τx → 0 with (r )2 /τx = α. A similar calculation can be used to evaluate the ISI distribution for the leaky discrete integrator (see Figure 6C). To first-order in τs /τx , we find
P(τs |tsp = t1 ) =
t1 exp (−r 1 τs ) cosh(r τ ) − 1 exp r1 τx t1 −t1 /τ L r1 − e r sinh(r τ ) × τx t1 × r 2 − e −t1 /τL e −τs /τL r sinh(r τ ) τx t1 + e −2t1 /τL e −τs /τL (r )2 cosh(r τs ) τx τs 1 × 1+ (exp(−dλ) + exp(+dλ)) − 1 τx 2 t1 τs r e −t1 /τL e −τs /τL + r 1 − e −t1 /τL r sinh(r τ ) τx τx 1 × (exp(−dλ) − exp(+dλ)) , 2
(6.13)
Spike Statistics of Graded Memory Systems
B
Ln(ISI probability)
10
Continuous Discrete
5 0 -5 0
C
2 0
-2 0
0.5
Time (sec)
Continuous Discrete
4 ISI deviation
A
1295
Time (sec)
0.2
D 1
t1=2s t1=8s
ISI deviation
ISI deviation
0.5
t1=16s
0
t1=2s t1=8s t1=16s
0.5 0
-0.5 0
Lag, τ (sec)
0.25
0
Lag, τ (sec)
0.25
Figure 6: ISI distributions. (A, B) Data from computer simulations of bounded system. Four sets of data, each the average of 5000 trials. The discrete system has states at 20 Hz, 100 Hz, and 180 Hz, with τx = 16 s. The continuous system is matched in average rate and variance of rate (see text). (A) Log of ISI probability. (B) ISI probability with result for a constant rate subtracted. (C) Analytic results for a leaky discrete integrator, τ L = 10 s, Dr = 10 Hz and τx = 10 s. Black: r (0) = 50 Hz; gray: r0 = 10 Hz. (D) Analytic results for a leaky continuous integrator, τ L = 10 s, α = 10 s−3 . Black: r (0) = 50 Hz; gray: r0 = 10 Hz.
where we have defined:
r τ ≡ r τ L e exp(−dλ) ≡
−t1 /τ L
−τs 1 − exp τL
τL exp(r τ L e −t1 /τL e −τs /τL ) τs
× [E 1 (r τ L e −t1 /τL e −τs /τL ) − E 1 (r τ L e −t1 /τL )] τL exp(−r τ L e −t1 /τL e −τs /τL ) exp(+dλ) ≡ τs ×[E 1 (−r τ L e −t1 /τL e −τs /τL ) − E 1 (−r τ L e −t1 /τL )].
(6.14)
1296
Paul Miller
In ∞ equation 6.14, the E 1 function is defined by the integral E 1 (x) = 1 dt exp(−xt)/t. We compare discrete and continuous processes in Figures 6A and 6B by numerically evaluating the ISI distributions. In all cases, we calculate the mean rate from the total number of spikes and subtract the ISI distribution expected for a Poisson process at this constant mean rate, r . We correct for a finite measurement interval, so subtract Pconst (τs , T) = exp (−r τs )
r (T − τs ) . rT − 1
(6.15)
The term T − τs appears as the integration limit (as no ISIs longer than the measurement time are possible), and the denominator is the total number of ISIs (number of spikes minus one). The main contribution to the ISI distribution for Poisson processes is the sum of exponentials, exp(−r τs ) with a distribution P(r ). When we subtract an exponential of the form exp(−r τ ), we find extra ISIs of low and high τs and a minimum near τs = 1/r , whose depth increases with the variance in rates (see Figures 6B to 6D). On a logarithmic plot of the ISI frequency, we find an initially steep gradient that becomes shallower for larger τs (see Figure 6A), as is expected for the sum of many exponentials. Since the firing rate in the discrete system of attractors is more likely to move far from its initial rate in a short space of time, the occurrence of a low firing rate and corresponding long ISIs is more prevalent (though rare). Figure 6A demonstrates such an excess of long ISIs in the discrete system over the excess seen in the continuous system. A strong difference is seen between the ISI deviation of the two leaky systems (Figure 6D) because the variance in rate remains high for the continuous (OU) system, leading to increasing ISI deviations with time (Figures 6C and 6D). On the other hand, the variance in rates of the discrete system eventually falls to zero, so the ISI deviation diminishes at longer measurement intervals (Figure 6C). 7 Discussion We have compared two types of doubly stochastic Poisson point processes. The processes are doubly stochastic because neuronal spikes are emitted stochastically with a probability proportional to an underlying rate, which also varies stochastically. We find that key statistical features of the spike trains averaged over many trials are identical. While the underlying rate can vary continuously or switch between discrete values, the autocorrelation functions and the Fano factors are identical. This is because both statistical measures depend on the second moment of the underlying rate as a function of time, which in both cases increases linearly with time. Similarly, as a result of the identical time dependence of their Fano factors, the two processes have the same distribution of ISIs (Saleh, 1978). This leads to a difficulty
Spike Statistics of Graded Memory Systems
1297
in distinguishing a continuously varying rate (such as required for analog memory storage, or in steady ramping activity) from a discretely jumping rate, where the jump times are stochastic but give the same behavior for the average rate. Only fourth-order statistics can distinguish the two cases, but in practice, calculations of such high-order statistics contain too much random error to be useful. The results presented here emphasize that single neuronal spike trains do not contain enough information to differentiate continuous attractor from discrete attractor networks, unless the jump in rates between discrete states is very large or many thousands of seconds of data are recorded. Since the system could change over many thousands of seconds, simultaneous recordings of multiple neurons involved in the network are probably needed to distinguish the two types of attractor systems. Working memory systems have been proposed that are based on either a continuous attractor (Seung et al., 2000; Miller et al., 2003) or a set of discrete attractors (Koulakov et al., 2002). The goal of these model systems, like integrator networks, is to produce neurons whose average firing rate is constant during the delay, when no stimulus is present. Both discrete and continuous systems exhibit the unusual property of an autocorrelation function, which depends on only the time interval of comparison. So it depends linearly on the time lag, τ (Ben-Yishai et al., 1995; Miller & Wang, in press) and increases with the measurement interval, T (Lewis, Gebber, Larsen, & Barman, 2001; Constantinidis & Goldman-Rakic, 2002). Such power law behavior relies on noise fluctuations being integrated, so does not occur if the noise in a discrete system is insufficient to cause the network to change states (Miller & Wang, in press). In the analysis in this article, the discrete system does have stochastic transitions between states. We find that just one parameter, the average lifetime between transitions, needs to be adjusted to match the behavior of the discrete with the continuous system. Cross-correlation functions include a term that is proportional to the product of the gradient of the tuning curve of the two neurons, as has been pointed out by Pouget (Pouget et al., 1998) and others (Ben-Yishai et al., 1995; C. D. Brody, personal communication, 2004). In general, two terms occur in all correlation functions. An initial, constant term arises from fluctuations during the stimulus. A second term is linear in the time of measurement, such as the delay time for a working memory task, due to integration of noise during the memory period. Such behavior is unusual as correlation functions typically decay with time, but in memory systems affected by noise, the correlations can last for the same duration as the memory of a stimulus. An unusual result is also found for the Fano factors, which increase quadratically with time for both systems (Gaspard & Wang, 1993). The Fano factor is a measure of the trial-to-trial variability in spike times. We see here that the Fano factor is the sum of two terms. The commonly observed term is a constant number due to the variability in the spike generation process
1298
Paul Miller
when the underlying rates are stable. The second term, which can vary in time, contains the effects of trial-to-trial variations in the underlying rates. Fano factors that increase as a power law have been observed experimentally in neurons responding to vision (Teich, Heneghan, Lowen, Ozaki, & Kaplan, 1997; Baddeley et al., 1997) and in the auditory system (Turcott et al., 1994). Such power-law behavior has been considered in terms of optimal encoding of natural stimuli, but in the working memory systems we consider here, it arises from the internal dynamics of noise-driven fluctuations in an attractor network. If, as in real systems, the firing rates are bounded, the variance in rates cannot increase inexorably with time, but reaches a constant value on a timescale of the order of the leak time or the time for the rate to cross between the boundaries. The correlation functions become exponential on this timescale, and the Fano factors approach a constant value after a quadratic rise at small times. A result for rigidly bounded systems that can be tested in real experimental data is that the trial-to-trial variance, as well as the mean, firing rate behaves differently depending on the starting point. Like a leaky integrator, the mean rate drifts toward one value (the midpoint between the boundaries). However, unlike the leaky integrator, for a system with rigid boundaries for the firing rate, the variance in rate increases more slowly when the initial rate is near a boundary (see Figure 4B). Such behavior of both mean and variance in the firing rate, if seen in real data, would be strong evidence for such a bounded system of attractors. Appendix A: Moments of Rate and Spike Count We find the moments of rate for a system with discrete hopping, by combining the Poisson distribution for the expected number of transitions, M, in time T: P(M|T) =
1 M!
T τx
M
e −T/τx
(A.1)
with the binomial distribution for the distribution of rates after M hops: P (r = r0 − Mr + 2nr | M) =
M! p n (1 − p) M−n . n!(M − n)!
(A.2)
This allows us to calculate the following moments of the rate: r (t) = r0
(A.3)
r (t)2 = r02 + αt
(A.4)
r (t)
(A.5)
3
= r03
+ 3r0 αt
Spike Statistics of Graded Memory Systems
r (t)4 = r04 + 6r02 αt + 3α 2 t 2 + (r )2 αt,
1299
(A.6)
where we have written α = (r )2 /τx . Notably, the variance increases linearly in time, r (t)2 − r (t)2 = αt,
(A.7)
and the fourth-order cumulant is nonzero only when the gaps between states are discrete: r 4 − 4r 3 r − 3r 2 2 + 12r 2 r 2 − 6r 4 = (r )2 αt.
(A.8)
For a Poisson spiking process, the probability of N spikes in time T is given by P(N|T) = where λ = of λ:
T 0
λ N −λ e , N!
(A.9)
r (t)dt. This allows us to evaluate the moments of N in terms
λ N N −λ e N! = λ
N =
N2 =
λ N N2 −λ e N!
= λ2 + λ N3 =
λ N N3 −λ e N!
= λ3 + 3λ2 + λ N4 =
λ N N4 −λ e N!
= λ4 + 6λ3 + 7λ2 + λ.
(A.10)
For a process where the average rate is constant but the variance increases linearly with time as αt, we can evaluate moments of λ in terms of moments of the rate by taking the time derivative, using (Gillespie, 1992): dλm = mλm−1 r (t) dt dλm r k (t) k! = mλm−1 r k+1 (t) + α λm r k−2 (t). dt 2!(k − 2)!
(A.11)
1300
Paul Miller
This leads to: dλ = r (t) dt d 2 λ2 = 2r 2 (t) dt 2 d 3 λ3 = 6r 3 (t) + 6αλ dt 3 d 4 λ4 = 24r 4 (t) + 48αr 2 (t) + 48r0 αr (t). dt 4 (A.12) Using equations A.3 to A.6, we can then evaluate: λ = r0 t λ2 = r02 t 2 + αt 3 /3 λ3 = r03 t 3 + r0 αt 4 λ4 = r04 t 4 + 2r02 αt 5 + α 2 t 6 /3 + (r )2 αt 5 /5. Combining the results for the moments of λ with the results for the moments of N yields N(t) = r0 t N(t)2 = r02 t 2 + r0 t + αt 3 /3 N(t)3 = r03 t 3 + r0 αt 4 + 3r02 t 2 + αt 3 + r0 t N(t)4 = r04 t 4 + 2r02 αt 5 + α 2 t 6 /3 + (r )2 αt 5 /5 + 6r03 t 3 + 6r0 αt 4 + 7r02 t 2 + 7αt 3 /3 + r0 t.
(A.13)
So the Fano factor is given by N(t)2 − N(t)2 = 1 + αt 2 /(3r0 ), N(t)
(A.14)
and we can find a combination of moments up to fourth order in N that depends on only the gap in rates between states: N4 − 3N2 2 + 2N4 − 6N3 + 6N2 N + 11N2 − 3N2 − 6N 3N2 − 3N2 − 3N = (r )2 t 2 /5.
(A.15)
Spike Statistics of Graded Memory Systems
1301
To calculate the full spike count distribution, P(N), we need to know the full distribution, P[λ(t)]. For the continuous random walk and continuous leaky integrator, we have a gaussian distribution for λ with mean λ and variance σλ2 (see appendix D) to give: P(N) =
dλP(N|λ)P(λ) 1
= 2σλ2
−(λ − λ)2 λN exp(−λ) exp dλ N! 2σλ2
= exp(−λ) exp
σλ2 2
N/2 k=0
N/2 2
σλ = exp(−λ) exp 2
k=0
N−2k (2k − 1)!! σλ2k λ − σλ2 (2k)!(N − 2k)! 1 k!(N − 2k)!
σλ2 2
k
N−2k λ − σλ2 .
(A.16)
Appendix B: Small Numbers of Discrete States We summarize here the conditional probabilities of the firing rate for systems with two, three, four, or five discrete states. We assume a Poisson distribution for the number of transitions in time, t (see equation A.1), and that transitions are equally likely to a state of higher rate as to a state of lower rate, unless the system is in a boundary state. For the two-state system, both states are boundary states. For the two-state system, we have −2τ 1 1 + exp 2 τx −2τ 1 1 − exp P [r0 − (r )/2, t + τ |r0 + (r )/2, t] = 2 τx P [r0 + (r )/2, t + τ |r0 + (r )/2, t] =
(B.1)
and by symmetry, −2τ 1 1 + exp 2 τx −2τ 1 1 − exp . P [r0 + (r )/2, t + τ |r0 − (r )/2, t] = 2 τx
P [r0 − (r )/2, t + τ |r0 − (r )/2, t] =
(B.2)
For the other systems, we omit the symmetrically identical results for brevity.
1302
Paul Miller
For the three-state system, we have −2τ 1 1 + exp 2 τx −2τ 1 1 − exp P [r0 ± (r )/2, t + τ |r0 , t] = 4 τx P [r0 , t + τ |r0 , t] =
(B.3)
and −τ −2τ 1 1 + 2 exp exp P [r0 + r, t + τ |r0 + r, t] = 4 τx τx −2τ 1 1 − exp P [r0 , t + τ |r0 + r, t] = 2 τx −τ −2τ 1 1 − 2 exp exp . (B.4) P [r0 − r, t + τ |r0 + r, t] = 4 τx τx For the four-state system, we have P[r0 + 3(r )/2, t + τ |r0 + (r )/2, t] 1 −τ τ τ = exp sinh + sinh 3 τx τx 2τx P [r0 + (r )/2, t + τ |r0 + (r )/2, t] −τ τ τ 1 = exp 2 cosh + cosh 3 τx τx 2τx P [r0 − (r )/2, t + τ |r0 + (r )/2, t] −τ τ τ 1 = exp 2 sinh − sinh 3 τx τx 2τx P [r0 − 3(r )/2, t + τ |r0 + (r )/2, t] −τ τ τ 1 = exp cosh − cosh 3 τx τx 2τx and P [r0 + 3(r )/2, t + τ |r0 + 3(r )/2, t] 1 −τ τ τ = exp cosh + 2 cosh 3 τx τx 2τx
(B.5)
Spike Statistics of Graded Memory Systems
1303
P [r0 + (r )/2, t + τ |r0 + 3(r )/2, t] −τ τ τ 1 2 sinh + 2 sinh = exp 3 τx τx 2τx P [r0 − (r )/2, t + τ |r0 + 3(r )/2, t] 1 −τ τ τ = exp 2 cosh − 2 cosh 3 τx τx 2τx P [r0 − 3(r )/2, t + τ |r0 + 3(r )/2, t] −τ τ τ 1 = exp sinh − 2 sinh . 3 τx τx 2τx For the five-state system, we have −τ τ 1 sinh2 P [r0 ± 2r, t + τ |r0 , t] = exp 2 τx 2τx τ 1 −τ sinh P [r0 ± r, t + τ |r0 , t] = exp 2 τx τx τ −τ cosh2 P [r0 , t + τ |r0 , t] = exp τx 2τx
(B.6)
(B.7)
and P [r0 + 2r, t + τ |r0 + r, t] √ τ −τ τ 1 = exp sinh + 2 sinh √ 4 τx τx 2τx P [r0 + r, t + τ |r0 + r, t] √ τ −τ τ 1 = exp cosh + 2 cosh √ 2 τx τx 2τx P [r0 , t + τ |r0 + r, t] τ 1 −τ sinh = exp 2 τx τx P [r0 − r, t + τ |r0 + r, t] √ τ −τ τ 1 = exp cosh − 2 cosh √ 2 τx τx 2τx P [r0 − 2r, t + τ |r0 + r, t] √ 1 τ τ −τ = exp sinh − 2 sinh √ 4 τx τx 2τx
(B.8)
1304
Paul Miller
and P [r0 + 2r, t + τ |r0 + 2r, t] 1 −τ τ τ 2 = exp cosh + cosh √ 2 τx 2τx 2τx P [r0 + r, t + τ |r0 + 2r, t] √ τ −τ τ 1 = exp sinh + 2 sinh √ 2 τx τx 2τx τ −τ 2 sinh P [r0 , t + τ |r0 + 2r, t] = exp τx 2τx P [r0 − r, t + τ |r0 + 2r, t] √ τ −τ τ 1 = exp sinh − 2 sinh √ 2 τx τx 2τx P [r0 − 2r, t + τ |r0 + 2r, t] −τ τ τ 1 = exp cosh2 − cosh √ . 2 τx 2τx 2τx
(B.9)
This leads to the following results for the average rate, r (t)|r1 , 0 given an initial rate, r1 , the variance in rate, Var[r (t)|r1 , 0], autocorrelation function, Cxx [t, τ |r1 , 0] and Fano factor, F [t|r1 , 0], given by 1 + Var[λ(t)|r1 , 0]/λ(t)|r1 , 0 which are plotted in Figure 4. For the two-state system, −2t 1 r (t)|r0 + (r )/2, 0 = r0 + r exp 2 τx 2 −4t (r ) 1 − exp Var [r (t)|r0 + (r )/2, 0] = 4 τx −2τ −4t (r )2 Cxx [t, t + τ |r0 + (r )/2, 0] = exp 1 − exp 4 τx τx −2t r τx 1 − exp λ(t)|r0 + (r )/2, 0 = r0 t + 4 τx Var [λ(t)|r0 + (r )/2, 0] =
(r )2 τx2 4 t 1 3 −2t −4t × − exp . − exp τx 4 τx 4 τx (B.10)
Spike Statistics of Graded Memory Systems
1305
For the three-state system, starting at the middle state, r (t)|r0 , 0 = r0
−2t (r )2 1 − exp 2 τx −τ −2t (r )2 Cxx [t, t + τ |r0 , 0] = exp 1 − exp 2 τx τx Var[r (t)|r0 , 0] =
λ(t)|r0 , 0 = r0 t Var [λ(t)|r0 , 0] =
(r )2 τx2 4 t −t −2t 3 1 × − + 2 exp − exp , τx 2 τx 2 τx
(B.11)
and starting from a boundary state,
−t τx 2 −2t (r ) 1 − exp Var [r (t)|r0 + r, 0] = 2 τx 2 −τ −2t (r ) Cxx [t, t + τ |r0 + r, 0] = exp 1 − exp 2 τx τx −t λ(t)|r0 + r, 0 = r0 t + r τx 1 − exp τx r (t)|r0 + r, 0 = r0 + r exp
Var [λ(t)|r0 + r, 0] =
(r )2 τx2 4 t 1 3 −t −2t × − exp . − + 2 exp τx 2 τx 2 τx (B.12)
For the four-state system, starting at the upper of the two inner states [r (0) = r0 + (r )/2], −t −2t r r (t)|r0 + (r )/2, 0 = r0 + 4 exp − exp 6 2τx τx −t −3t (r )2 Var [r (t)|r0 + (r )/2, 0] = 33 − 16 exp − 24 exp 36 τx 2τx −4t 5t − exp + 8 exp 2τx τx
1306
Paul Miller
−τ −t (r )2 Cxx [t, t + τ |r0 + (r )/2, 0] = exp 32 − 16 exp 36 2τx τx 5t −3t + 4 exp − 20 exp 2τx 2τx −3t −2τ 1 − 4 exp + exp τx 2τx −4t 5t − exp + 4 exp 2τx τx −t r τx 15 − 16 exp λ(t)|r0 + (r )/2, 0 = r0 t + 12 2τx −2t + exp τx 2 2 (r ) τx t Var [λ(t)|r0 + (r )/2, 0] = 516 − 1475 144 τx −t −t − 256 exp + 1824 exp 2τx τx −2t −3t − 60 exp − 64 exp 2τx τx −4t −5t − exp , (B.13) + 32 exp 2τx τx while starting from the upper boundary state, −t −2t r 8 exp + exp r (t)|r0 + 3(r )/2, 0 = r0 + 6 2τx τx 2 −t −3t (r ) 33 − 64 exp + 48 exp Var [r (t)|r0 + 3(r )/2, 0] = 36 τx 2τx −4t 5t − exp −16 exp 2τx τx 2 −τ −t (r ) exp 32 − 64 exp Cxx [t, t + τ |r0 + 3(r )/2, 0] = 36 2τx τx 5t −3t − 8 exp + 40 exp 2τx 2τx −3t −2τ 1 + 8 exp + exp τx 2τx −4t 5t − exp − 8 exp 2τx τx
Spike Statistics of Graded Memory Systems
1307
−t r τx 15 − 16 exp 12 2τx −2t + exp τx t (r )2 τx2 −t 516 − 1667 + 2496 exp Var [λ(t)|r0 + 3(r )/2, 0] = 144 τx 2τx −3t −t + 128 exp −1024 exp τx 2τx −2t −132 exp τx −4t −5t − exp . (B.14) − 64 exp 2τx τx λ(t)|r0 + 3(r )/2, 0 = r0 t +
For the five-state system, starting at the middle state,
r (t)|r0 , 0 = r0
−t −t (r )2 Var[r (t)|r0 , 0] = 1 − exp 3 − exp 2 τx τx τ (r )2 −τ cosh √ Cxx [t, t + τ |r0 , 0] = exp 2 τx 2τx −t −t × 3 − exp 1 − exp τx τx √ τ −t 1 − exp + 2 2 sinh √ τx 2τx λ(t)|r0 , 0 = r0 t Var[λ(t)|r0 , 0] =
t −2t (r )2 τx2 −t 10 − 45 − 4 exp + exp 4 τx τx τx √ t −t t 48 cosh √ + 36 2 sinh √ , + exp τx 2τx 2τx (B.15)
and starting from the upper boundary state,
1308
Paul Miller
r (t)|r0 + 2r, 0 = r0 + r exp
−t τx
t 2 cosh √ 2τx
t 2 sinh √ 2τx −t (r )2 Var [r (t)|r0 + 2r, 0] = 6 + 8 exp 4 τx √ −2t 2t 2 + 3 cosh − exp τx τx √ √ 2t + 2 2 sinh τx +
√
−τ τ (r )2 exp cosh √ 2 τx 2τx −t × 3 + 4 exp τx √ −2t τ + 2 2 sinh √ − exp τx 2τx −t × 1 + exp τx √ τ + 2t τ + 2t − 6 cosh √ + 2 2 sinh √ 2τx 2τx −2t × exp τx −t t 3 cosh √ λ(t)|r0 + 2r, 0 = r0 t + r τx 6 − exp τx 2τx √ t + 2 2 sinh √ 2τx 2 2 −t (r ) τx Var [λ(t)|r0 + 2r, 0] = 3 + 4 exp 2 τx √ 2t −2t − exp 1 + 3 cosh τx τx √ √ 2t + 2 2 sinh . (B.16) τx
Cxx [t, t + τ |r0 + 2r, 0] =
Spike Statistics of Graded Memory Systems
1309
Appendix C: Analysis of Random Walk with Reflecting Boundaries In order to calculate the autocorrelation functions (see Figures 4C and 4D) from equation 2.4 and hence Fano factor (see Figure 4E) from equation 3.14 we need to evaluate the mean rate, r (t2 ), at a later time, t2 , conditional on its value, r1 = r0 + d1 , at an earlier time, t1 . For the random walk with reflecting boundaries this is given, using equation 4.1, by r (t2 )|r0 + d1 , t1 =
r0 +rb
r0 −rb
dr2 r2 √
1 2ατ
∞ n=−∞
−(r2 − r0 + d1 − 2rb + 4nrb )2 2ατ −(r2 − r0−d1 + 4nrb )2 + exp 2ατ ∞ −d1 + (4n + 1)rb erf = r 0 + d1 √ 2αt n=1 −d1 − (4n − 1)rb + erf √ 2αt −d1 − (4n + 1)rb −d1 + (4n − 1)rb + erf + erf √ √ 2αt 2αt −d1 + rb −d1 − rb d1 + erf − erf √ √ 2 2αt 2αt ∞ −d1 + (4n − 1)rb + 2rb 1 + erf √ 2αt n=1 −d1 − (4n + 1)rb + 1 − erf √ 2αt −d1 − rb + 2rb 1 + erf √ 2αt ∞ − [−d1 + (4n + 1)rb ]2 2αt + exp π 2αt n=0 − [−d1 − (4n + 1)rb ]2 − exp 2αt − [−d1 − (4n − 1)rb ]2 − exp 2αt
× exp
1310
Paul Miller
− [−d1 + (4n − 1)rb ]2 − exp 2αt 2αt − (d1 + rb )2 − (d1 − rb )2 + exp − exp . π 2αt 2αt (C.1)
The variance in rate plotted in Figure 4B is given for initial rates of r0 and r0 − rb . In both cases, the mirror sources are evenly spaced, simplifying the calculation a little. For r (0) = r0 , we have σr2 = −r02 +
r0 +rb
dr2 r22 √
∞
1
exp
−(r2 − r0 + 2nrb )2 2ατ
2ατ n=−∞ ∞ √ (2n + 1)rb 2αt π (2n + 1)rb 1 − erf = αt + 4rb √ √ π 2αt 2αt n=0 −(2n + 1)2 rb2 − exp . 2αt r0 −rb
For r (0) = r0 − rb , we have
σr2
= −r (t)|r0 − rb , 0 + √ 2
2 2ατ
r0 +rb
r0 −rb
−(r2 − r0 + rb + 4nrb )2 × exp 2ατ
dr2 r22
∞ n=−∞
= −r (t)|r0 − rb , 02 + (r0 − rb )2 + αt ∞ 2(2n + 1)rb + 16rb2 √ (2n + 1) 1 − erf 2αt n=0 2(2n + 1)rb √ (2n + 1) 1 − erf 2αt n=0 4nrb − 2n 1 − erf √ 2αt √ ∞ −4rb2 (2n)2 π(2n + 1)rb + 2(r0 − rb ) −1+2 exp √ 2αt 2αt n=0 + 8rb (r0 − rb )
∞
(C.2)
Spike Statistics of Graded Memory Systems
1311
−4rb2 (2n + 1)2 − exp 2αt √ ∞ −4rb2 (2n + 1)2 π(2n + 1)rb − 8rb . exp √ 2αt 2αt n=0
(C.3)
Appendix D: Gaussian Distribution of λ for Continuous Random Walks In order to calculate the ISI distribution from equation 6.1, it is necessary t +τ to know the probability distribution of λ(t1 , t1 + τs ) = t11 s r (t )dt . In this appendix, we prove by induction that the distribution is gaussian, that is,
P [λ(t1 , t1 + τ ) = λ] =
1 2 2πσλ(τ )
exp
2 − λ−λ 2 2σλ(τ )
(D.1)
for both the continuous random walk and the continuous leaky integrator (the Ornstein-Uhlenbeck, OU, process). For both processes, the distribution of rates as a function of time is known to be gaussian, with mean, r (t) and standard deviation, σr (t). We will use in the derivation the distribution, P(λ|r ), which denotes the distribution of λ for all paths that end at a certain rate, r , such that P(λ) =
dr P(r )P(λ|r ).
(D.2)
For gaussian P(r ) and gaussian P(λ) we must also have gaussian P(λ|r ). We can calculate the mean of the distribution, λ|r , by integrating over time the mean rates conditional on the final rate, r (t): λ(t)|r (t) =
dt1 r1 (t1 )|r (t).
(D.3)
The conditional mean rate is found using Bayes’ rule: r1 (t1 )|r (t) = =
dr1 P[r1 , t1 |r (t)]
(D.4)
dr1 P[r1 , t1 ]P[r (t)|r1 , t1 ] , P[r (t)]
(D.5)
1312
Paul Miller
which results for the leaky random walk in
λ(t)|r (t) = r At + [r0 − 2r A + r (t)] τ L
1 − exp 1 + exp
−t τL −t τL
,
(D.6)
which setting τ L → ∞ gives the uniform random walk result of λ(t)|r (t) = [r (t) + r0 ]t/2. We separate out the term linear in r (t), writing in general λ(t)|r (t) = f (t)r (t) + g(t),
(D.7)
where f and g are found from equation D.6. Substituting from equation D.7 into equation D.2, using gaussian distri2 butions and writing σλ|r as the variance of P[λ|r ], we find the condition 2 σλ2 = f (t)2 σr2 + σλ|r .
(D.8)
We now check the general condition for the nth moment of λ, dλn = nλn−1 r (t), dt
(D.9)
by evaluating the left- and right-hand sides separately, assuming gaussian probability distributions. To evaluate the left-hand side, we find λ =
1
n
=
2πσλ2
−(λ − λ)2 dλλ exp 2σλ2
n
n/2 n!(2k − 1)!! n−2k 2k λ σλ (2k)!(n − 2k)!
(D.10)
k=0
for even-n with the upper entry of the sum replaced by (n − 1)/2 for odd-n. Taking the derivative with respect to time gives n/2 dλn n!(2k − 1)!! 2k−2 n−2k−1 dσ 2 = σλ λ (n − 2k)r σλ2 + kλ λ . dt (2k)!(n − 2k)! dt k=0
(D.11)
Spike Statistics of Graded Memory Systems
1313
To evaluate the right-hand side, we find
n−1
nλ
1
r (t) =
2πσr2
drr
1 2 2πσλ|r
n−1
dλλ
−(λ − λ|r )2 exp 2 2σλ|r
n/2 n!(2k − 1)!! 2k−2 n−2k−1 σ λ = (2k)!(n − 2k)! λ k=0 2 × (n − 2k) σλ|r r − σr2 f (t)g(t) + nσr2 f (t)λ
=
n/2 n!(2k − 1)!! 2k−2 n−2k−1 σλ λ (n − 2k)σλ2 r + 2kσr2 f (t)λ , (2k)!(n − 2k)! k=0
(D.12) where in the last line we have used equation D.8 and the identity λ = f (t)r + g(t) from equation D.7. Equating equations D.11 and D.12 leads to the requirement dσλ2 = 2 f (t)σr2 . dt
(D.13)
We can evaluate a similar requirement on the conditional probability distribution being gaussian, from the requirement P [λ(t2 )|r2 (t2 )] =
P(r2 , t2 |r1 , t1 ) dr1 P(r1 , t1 ) P(r2 , t2 )
× P [λ − λ1 |r1 (t1 ), r2 (t2 )] ,
dλ1 P [λ1 (t1 )|r1 (t1 )] (D.14)
where P[λ − λ1 |r1 (t1 ), r2 (t2 )] is the probability of the integral of firing rate between r1 at time t1 and r2 at time t2 being equal to λ − λ1 . Integrating from a rate of r0 at time t = 0 is implicitly assumed in the other terms. If the probability distributions are gaussian, they depend on only their variance, which is given by equation D.8 and mean, which we write as λr1 ,r2 ,τ = d(τ )r1 + f (τ )r2 + h(τ ),
(D.15)
where τ = t2 − t1 . All the processes we consider are Markov, so given a definite rate at an earlier time, the distribution can depend on only the time difference at a later time, not the total time elapsed. With these definitions,
1314
Paul Miller
equation D.14 becomes the condition:
1 2 2σλ|r,t 2)
2 2σr,τ
− [λ − d(t2 )r0 − f (t2 )r2 − h(t2 )]2 exp 2 2σλ|r,t 2
2 2σr,t 2 [r2 − r (t2 )]2 exp 2 2σr,t 2 2 2 2 2σr,t 2σλ|r,t 2σλ|r,τ 1 1
dr1 exp
− [r2 − r (t2 )|r1 , t1 ]2 − [r1 − r (t1 )]2 exp 2 2 2σr,τ 2σr,t 1
dλ1 exp
− [λ1 − d(t1 )r0 − f (t1 )r1 − h(t1 )]2 2 2σλ|r,t 1
×
=
×
× exp
− [λ − λ1 − d(τ )r1 − f (τ )r2 − h(τ )]2 . 2 2σλ|r,τ
(D.16)
For both the leaky integrator and the continuous random walk, the mean rate at later times depends only linearly on an earlier known rate (see equation 5.2) so r (t2 )|r1 , t1 = b(τ )r1 + c(τ )
(D.17)
where τ = t2 − t1 , b(τ ) = exp(−τ/τ L ) and c(τ ) = r A[1 − exp(−t/τ L )]. Given such linear dependence, the above integrals can be solved to give the following algebraic requirement: 2 2
2 2 2 2 2 σλ|r,t = f 2 (t1 ) + d 2 (τ ) σr,t σ + σλ|r,t + σλ|r,τ σr,τ + b 2 (τ )σr,t . 2 1 r,τ 1 1 (D.18) Substituting terms, using equations 5.2, D.8, and D.7 allows us to confirm equation D.18 for the leaky (and therefore also nonleaky) continuous random walks. Direct use of equation A.11 confirms the lower moments are consistent with a gaussian probability distribution for λ and satisfaction of equation D.13, along with equation D.14 (and hence equation D.18) proves we can use a gaussian distribution for λ when calculating ISI distributions for these continuous random walk processes. Acknowledgments I am grateful to NIH-NIMH for support with a K25 Career Award. I appreciate helpful discussions with Alfonso Renart, David Luxat, Sridhar
Spike Statistics of Graded Memory Systems
1315
Raghavachari, Caroline Geisler, Carlos Brody, and Xiao-Jing Wang during the preparation of this work. References Aksay, E., Baker, R., Seung, H. S., & Tank, D. W. (2000). Anatomy and discharge properties of pre-motor neurons in the goldfish medulla that have eye-position signals during fixations. J. Neurophysiol., 84, 1035–1049. Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press. Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A. & Roll’s, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond.: Biol. Sci., 264, 1775–1783. Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: time scales and relationship to behavior. J. Neurosci., 21, 1676– 1697. Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848. Brody, C. D. (1998). Slow covariations in neuronal resting potentials can lead to artefactually fast cross-correlations in their spike trains. J. Neurophysiol., 80, 3345– 3351. Brody, C. D. (1999). Correlations without synchrony. Neural Comput., 11, 1537– 1551. Buzsaki, G. (2004). Large-scale recording of neuronal ensembles. Nat. Neurosci., 7, 446–451. Camperi, M., & Wang, X.-J. (1998). A model of visuospatial short-term memory in prefrontal cortex: Recurrent network and cellular bistability. J. Comput. Neurosci., 5, 383–405. Cannon, S. C., Robinson, D. A., & Shamma, S. (1983). A proposed neural network for the integrator of the oculomotor system. Biol. Cybern., 49, 127–136. Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10, 910–923. Constantinidis, C., & Goldman-Rakic, P. S. (2002). Correlated discharges among putative pyramidal neurons and interneurons in the primate prefrontal cortex. J. Neurophysiol., 88, 3487–3497. Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. New York: Wiley. Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press. Durstewitz, D. (2003). Self-organizing neural integrator predicts interval times through climbing activity. J. Neurosci., 23, 5342–5353. Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. J. Neurophysiol., 83, 1733–1750. Gaspard, P., & Wang, X.-J. (1993). Noise, chaos and ( ,τ )-entropy per unit time. Physics Reports, 6, 291–345. Gillespie, D. T. (1992). Markov processes. Orlando, FL: Academic Press.
1316
Paul Miller
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191. Goldberg, J. A., Rokni, U., & Sompolinsky, H. (2004). Patterns of ongoing activity and the functional architecture of the primary visual cortex. Neuron, 42, 489– 500. Goldman, M. S., Levine, J. H., Tank, G. M. D. W., & Seung, H. S. (2003). Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cereb. Cortex, 13, 1185–1195. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558. Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire neurons. Proc. Natl. Acad. USA, 92, 6655–6662. Koulakov, A. A., Raghavachari, S., Kepecs, A., & Lisman, J. E. (2002). Model for a robust neural integrator. Nat. Neurosci., 5, 775–782. Lewis, C. D., Gebber, G. L., Larsen, P. D., & Barman, S. M. (2001). Long-term correlations in the spike trains of medullary sympathetic neurons. J. Neurophysiol., 85, 1614–1622. Loewenstein, Y., & Sompolinsky, H. (2003). Temporal integration by calcium dynamics in a model neuron. Nat. Neurosci., 6, 961–967. Middleton, J. W., Chacron, M. J., Lindner, B., & Longtin, A. (2003). Firing statistics of a neuron model driven by long-range correlated noise. Phys. Rev. E, 68, 21920– 21927. Miller, P., Brody, C. D., Romo, R., & Wang, X.-J. (2003). A recurrent network model of somatosensory parametric working memory in the prefrontal cortex. Cereb. Cortex, 13, 1208–1218. Miller, P., & Wang, X.-J. (2005). Power-law neuronal fluctuations in a recurrent network model of parametric working memory. J. Neurophysiol. O’Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely moving rat. Experimental Brain Research, 34, 171–175. Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophys. J., 7, 391–418. Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population code. Neural Comput., 10, 373–401. Robinson, D. A. (1989). Integrating with neurons. Annu. Rev. Neurosci., 12, 33–45. Romo, R., Brody, C. D., Hern´andez, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474. Saleh, B. (1978). Photoelectron statistics. New York: Springer-Verlag. Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17, 5900–5920. Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344. Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Spike Statistics of Graded Memory Systems
1317
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579. Sharp, P. E., Blair, H. T., & Cho, J. (2001). The anatomical and computational basis of the rat head-direction cell signal. Trends in Neurosci., 24, 289–294. Taube, J. S., & Bassett, J. P. (2003). Persistent neural activity in head direction cells. Cereb. Cortex, 13(11), 1162–1172. Teich, M. C., Heneghan, C., Lowen, S. B., Ozaki, T., & Kaplan, E. (1997). Fractal nature of the neural spike train in the visual system of the cat. J. Opt. Soc. Am. A, 14, 529–546. Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C. & Teich, M. C. (1994). A nonstationary poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biol. Cybern., 70, 209–217. Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. J. Neurosci., 13, 3406–3420. Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
Received January 26, 2005; accepted September 29, 2005.
LETTER
Communicated by Rajesh Rao
Optimal Spike-Timing-Dependent Plasticity for Precise Action Potential Firing in Supervised Learning Jean-Pascal Pfister jean-pascal.pfister@epfl.ch
Taro Toyoizumi
[email protected] David Barber
[email protected] Wulfram Gerstner wulfram.gerstner@epfl.ch Laboratory of Computational Neuroscience, School of Computer and Communication Sciences and Brain-Mind Institute, Ecole Polytechnique F´ed´erale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
In timing-based neural codes, neurons have to emit action potentials at precise moments in time. We use a supervised learning paradigm to derive a synaptic update rule that optimizes by gradient ascent the likelihood of postsynaptic firing at one or several desired firing times. We find that the optimal strategy of up- and downregulating synaptic efficacies depends on the relative timing between presynaptic spike arrival and desired postsynaptic firing. If the presynaptic spike arrives before the desired postsynaptic spike timing, our optimal learning rule predicts that the synapse should become potentiated. The dependence of the potentiation on spike timing directly reflects the time course of an excitatory postsynaptic potential. However, our approach gives no unique reason for synaptic depression under reversed spike timing. In fact, the presence and amplitude of depression of synaptic efficacies for reversed spike timing depend on how constraints are implemented in the optimization problem. Two different constraints, control of postsynaptic rates and control of temporal locality, are studied. The relation of our results to spike-timing-dependent plasticity and reinforcement learning is discussed. 1 Introduction Experimental evidence suggests that precise timing of spikes is important in several brain systems. In the barn owl auditory system, for example, coincidence-detecting neurons receive volleys of temporally precise spikes from both ears (Carr & Konishi, 1990). In the electrosensory system of mormyrid electric fish, medium ganglion cells receive input at precisely Neural Computation 18, 1318–1348 (2006)
C 2006 Massachusetts Institute of Technology
Optimal STDP in Supervised Learning
1319
timed delays after electric pulse emission (Bell, Han, Sugawara, & Grant, 1997). Under the influence of a common oscillatory drive as present in the rat hippocampus or olfactory system, the strength of a constant stimulus is coded in the relative timing of neuronal action potentials (Hopfield, 1995; Brody & Hopfield, 2003; Mehta, Lee, & Wilson, 2002). In humans, precise timing of first spikes in tactile afferents encodes touch signals at the fingertips (Johansson & Birznieks, 2004). Similar codes have also been suggested for rapid visual processing (Thorpe, Delorme, & Van Rullen, 2001), and for the rat’s whisker response (Panzeri, Peterson, Schultz, Lebedev, & Diamond, 2001). The precise timing of neuronal action potentials also plays an important role in spike-timing-dependent plasticity (STDP). If a presynaptic spike arrives at the synapse before the postsynaptic action potential, the synapse is potentiated; if the timing is reversed, the synapse is depressed (Markram, ¨ Lubke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998, 1999, 2001). This biphasic STDP function is reminiscent of a temporal contrast or temporal derivative filter and suggests that STDP is sensitive to the temporal features of a neural code. Indeed, theoretical studies have shown that given a biphasic STDP function, synaptic plasticity can lead to a stabilization of synaptic weight dynamics (Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001) while the neuron remains sensitive to temporal structure in the input (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Roberts, 1999; Kempter et al., 1999; Kistler & van Hemmen, 2000; Rao & Sejnowski, 2001; Gerstner & Kistler, 2002a). While the relative firing time of pre- and postsynaptic neurons, and hence temporal aspects of a neural code, play a role in STDP, it is less clear whether STDP is useful to learn a temporal code. In order to elucidate the computational function of STDP, we ask in this letter the following question: What is the ideal form of an STDP function in order to generate action potentials of the postsynaptic neuron with high temporal precision? This question naturally leads to a supervised learning paradigm: the task to be learned by the neuron is to fire at a predefined desired firing time t des . Supervised paradigms are common in machine learning in the context of classification and prediction problems (Minsky & Papert, 1969; Haykin, 1994; Bishop, 1995), but have more recently also been studied for spiking neurons in feedforward and recurrent networks (Legenstein, Naeger, & Maass, 2005; Rao & Sejnowski, 2001; Barber, 2003; Gerstner, Ritz, & van Hemmen, 1993; Izhikevich, 2003). Compared to unsupervised or reward-based learning paradigms, supervised paradigms on the level of single spikes are obviously less relevant from a biological point, since it is questionable what type of signal could tell the neuron about the “desired” firing time. Nevertheless, we think it is worth addressing the problem of supervised learning—first, as a problem in its own right, and second, as a starting point of spike-based reinforcement learning (Xie & Seung, 2004;
1320
J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner
Seung, 2003). Reinforcement learning in a temporal coding paradigm implies that certain sequences of firing times are rewarded, whereas others are not. The “desired firing times” are hence defined indirectly via the presence or absence of a reward signal. The exact relation of our supervised paradigm to reward-based reinforcement learning will be presented in section 4. Section 2 introduces the stochastic neuron model and coding paradigm, which are used to derive the results presented in section 3. 2 Model 2.1 Coding Paradigm. In order to explain our computational paradigm, we focus on the example of temporal coding of human touch stimuli (Johansson & Birznieks, 2004), but the same ideas would apply analogously to the other neuronal systems with temporal codes already mentioned (Carr & Konishi, 1990; Bell et al., 1997; Hopfield, 1995; Brody & Hopfield, 2003; Mehta et al., 2002; Panzeri et al., 2001). For a given touch stimulus, spikes in an ensemble of N tactile afferents occur in a precise temporal order. If the same touch stimulus with identical surface properties and force vector is repeated several times, the relative timing of action potentials is reliably reproduced, whereas the spike timing in the same ensemble of afferents is different for other stimuli (Johansson & Birznieks, 2004). In our model, we assume that all input lines, labeled by the index j with 1 ≤ j ≤ N, converge onto one or several postsynaptic neurons. We think of the postsynaptic neuron as a detector for a given spatiotemporal spike pattern in the input. The full spike pattern detection paradigm will be used in section 3.3. As a preparation and first steps toward the full coding paradigm, we also consider the response of a postsynaptic neuron to a single presynaptic spike (section 3.1) or to one given spatiotemporal firing pattern (section 3.2). 2.2 Neuron Model. Let us consider a neuron i that is receiving input from N presynaptic neurons. Let us denote the ensemble of all spikes of N neuron j by x j = {t 1j , . . . , t j j }, where t kj denotes the time when neuron j fired its kth spike. The spatiotemporal spike pattern of all presynaptic neurons 1 ≤ j ≤ N will be denoted by boldface x = {x1 , . . . , xN }. f A presynaptic spike elicited at time t j evokes an excitatory postsynaptic f
potential (EPSP) of amplitude wi j and time course (t − t j ). For simplicity, we approximate the EPSP time course by a double exponential, s s (s) = 0 exp − − exp − (s), τm τs
(2.1)
with a membrane time constant of τm = 10 ms and a synaptic time constant of τs = 0.7 ms, which yields an EPSP rise time of 2 ms. Here (s) denotes the Heaviside step function with (s) = 1 for s > 0 and (s) = 0 otherwise.
Optimal STDP in Supervised Learning
1321
We set 0 = 1.3 mV such that a spike at a synapse with wi j = 1 evokes an EPSP with amplitude of approximately 1 mV. Since the EPSP amplitude is a measure of the strength of a synapse, we refer to wi j also as the efficacy (or “weight”) of the synapse between neuron j and i. Let us further suppose that the postsynaptic neuron i receives an additional input I (t) that could arise from either a second group of neurons or from intracellular current injection. We think of the second input as a teaching input that increases the probability that the neuron fires at or close to the desired firing time t des . For simplicity, we model the teaching input as a square current pulse I (t) = I0 (t − t des + 0.5T)(t des + 0.5T − t) of amplitude I0 and duration T. The effect of the teaching current on the membrane potential is
∞
uteach (t) =
k(s)I (t − s)ds
(2.2)
0
with $k(s) = k_0\exp(-s/\tau_m)$, where $k_0$ is a constant that is inversely proportional to the capacitance of the neuronal membrane. In the context of the human touch paradigm discussed in section 2.1, the teaching input could represent some preprocessed visual information (“object touched by fingers starts to slip now”), feedback from muscle activity (“strong counterforce applied now”), cross-talk from other detector neurons in the same population (“your colleagues are active now”), or unspecific modulatory input due to arousal or reward (“be aware—something interesting happening now”). In the context of training of recurrent networks (e.g., Rao & Sejnowski, 2001), the teaching input consists of a short pulse of an amplitude that guarantees action potential firing. The membrane potential of the postsynaptic neuron i (spike response model; Gerstner & Kistler, 2002b) is influenced by the EPSPs evoked by all afferent spikes of stimulus x, the “teaching” signal, and the refractory effects generated by spikes $t_i^f$ of the postsynaptic neuron:
$$u_i(t|\mathbf{x}, y_t) = u_{rest} + \sum_{j=1}^{N} w_{ij}\sum_{t_j^f \in x_j}\epsilon(t - t_j^f) + \sum_{t_i^f \in y_t}\eta(t - t_i^f) + u^{teach}(t), \quad (2.3)$$
where $u_{rest} = -70$ mV is the resting potential, $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ is the set of postsynaptic spikes that occurred before t, and $t_i^F$ always denotes the last postsynaptic spike before t. On the right-hand side of equation 2.3, $\eta(s)$ denotes the spike afterpotential generated by an action potential. We take
$$\eta(s) = \eta_0\exp\left(-\frac{s}{\tau_m}\right)\Theta(s), \quad (2.4)$$
Figure 1: (A) Escape rate $g(u) = \rho_0\exp\left(\frac{u - \vartheta}{\Delta u}\right)$. (B) Firing rate of the postsynaptic neuron as a function of the amplitude $I_0$ of a constant stimulation current (arbitrary units). (C) Interspike interval (ISI) distribution for different input currents ($I_0 = 1$, 1.5, and 2).
where $\eta_0 < 0$ is a reset parameter that describes how much the voltage is reset after each spike (for the relation to integrate-and-fire neurons, see Gerstner & Kistler, 2002b). The spikes themselves are not modeled explicitly but reduced to formal firing times. Unless specified otherwise, we take $\eta_0 = -5$ mV. In a deterministic version of the model, output spikes would be generated whenever the membrane potential $u_i$ reaches a threshold $\vartheta$. In order to account for intrinsic noise and also for a small amount of synaptic noise generated by stochastic spike arrival from additional excitatory and inhibitory presynaptic neurons that are not modeled explicitly, we replace the strict threshold by a stochastic one. More precisely, we adopt the following procedure (Gerstner & Kistler, 2002b). Action potentials of the postsynaptic neuron i are generated by a point process with time-dependent stochastic intensity $\rho_i(t) = g(u_i(t))$ that depends nonlinearly on the membrane potential $u_i$. Since the membrane potential in turn depends on both the input and the firing history of the postsynaptic neuron, we write
$$\rho_i(t|\mathbf{x}, y_t) = g(u_i(t|\mathbf{x}, y_t)). \quad (2.5)$$
We take an exponential to describe the stochastic escape across threshold, $g(u) = \rho_0\exp\left(\frac{u - \vartheta}{\Delta u}\right)$, where $\vartheta = -50$ mV is the formal threshold, $\Delta u = 3$ mV is the width of the threshold region and therefore tunes the stochasticity of the neuron, and $\rho_0 = 1$/ms is the stochastic intensity at threshold (see Figure 1). Other choices of the escape function g are possible with no qualitative change of the results. For $\Delta u \to 0$, the model is identical to the deterministic leaky integrate-and-fire model with synaptic current injection (Gerstner & Kistler, 2002b).
We note that the stochastic process defined in equation 2.5 is similar to, but different from, a Poisson process, since the stochastic intensity depends on the set $y_t$ of the previous spikes of the postsynaptic neuron. Thus, the neuron model has some memory of previous spikes.

2.3 Stochastic Generative Model. The advantage of the probabilistic framework introduced above via the noisy threshold is that it is possible to describe the probability density¹ $P_i(y|\mathbf{x})$ of an entire spike train² $Y(t) = \sum_{t_i^f \in y}\delta(t - t_i^f)$ (see appendix A for details):
$$P_i(y|\mathbf{x}) = \left[\prod_{t_i^f \in y}\rho_i(t_i^f|\mathbf{x}, y_{t_i^f})\right]\exp\left(-\int_0^T \rho_i(s|\mathbf{x}, y_s)\,ds\right) = \exp\left(\int_0^T\left[\log(\rho_i(s|\mathbf{x}, y_s))\,Y(s) - \rho_i(s|\mathbf{x}, y_s)\right]ds\right). \quad (2.6)$$
Thus, we have a generative model that allows us to describe explicitly the likelihood $P_i(y|\mathbf{x})$ of emitting a set of spikes y for a given input x. Moreover, since the likelihood in equation 2.6 is a smooth function of its parameters, it is straightforward to differentiate it with respect to any variable. Let us differentiate $P_i(y|\mathbf{x})$ with respect to the synaptic efficacy $w_{ij}$, since this is a quantity that we will use later:
$$\frac{\partial\log P_i(y|\mathbf{x})}{\partial w_{ij}} = \int_0^T\left[Y(s) - \rho_i(s|\mathbf{x}, y_s)\right]\frac{\rho_i'(s|\mathbf{x}, y_s)}{\rho_i(s|\mathbf{x}, y_s)}\sum_{t_j^f \in x_j}\epsilon(s - t_j^f)\,ds, \quad (2.7)$$
where $\rho_i'(s|\mathbf{x}, y_s) = \frac{dg}{du}\big|_{u = u_i(s|\mathbf{x}, y_s)}$. In this letter, we propose three different optimal models: A, B, and C (see Table 1). The models differ in the stimulation paradigm and the specific task of the neuron. In section 3, the task and hence the optimality criteria are supposed to be given explicitly. However, the task in model C could also be defined indirectly by the presence or absence of a reward signal, as discussed in section 4.1. The common idea behind all three approaches is the notion of optimal performance. Optimality is defined by an objective function L that is directly related to the likelihood formula of equation 2.6 and that can be maximized by changes of the synaptic weights.
¹ For simplicity, we denoted the set of postsynaptic spikes from 0 to T by y instead of $y_T$.
² Capital Y is the spike train generated by the ensemble (lowercase) y.
Table 1: Summary of the Optimality Criterion L for the Unconstrained Scenarios (Au, Bu, Cu) and the Constrained Scenarios (Ac, Bc, Cc).

Unconstrained scenarios:
Au — Postsynaptic spike imposed: $L_{A_u} = \log(\rho(t^{des}))$
Bu — Postsynaptic spike imposed + spontaneous activity: $L_{B_u} = \log(\bar\rho(t^{des}))$
Cu — Postsynaptic spike patterns imposed: $L_{C_u} = \log\left[\prod_i P_i(y^i|\mathbf{x}^i)\left(\prod_{k\neq i} P_i(0|\mathbf{x}^k)\right)^{\gamma/(M-1)}\right]$

Constrained scenarios:
Ac — No activity: $L_{A_c} = L_{A_u} - \int_0^T \rho(t)\,dt$
Bc — Stabilized activity: $L_{B_c} = L_{B_u} - \frac{1}{2T\sigma^2}\int_0^T(\bar\rho(t) - \nu_0)^2\,dt$
Cc — Temporal locality constraint: $L_{C_c} = L_{C_u}$, with penalty matrix $P_{\ell\ell} = a\,(\ell - \tilde{T}_0)^2$

Notes: The constraint for scenario C is not included in the likelihood function $L_{C_c}$ itself, but rather in the deconvolution with a matrix P that penalizes quadratically the terms that are nonlocal in time. See appendix C for more details.
Throughout the article, this optimization is done by a standard technique of gradient ascent,
$$\Delta w_{ij} = \alpha\,\frac{\partial L}{\partial w_{ij}}, \quad (2.8)$$
with a learning rate α. Since the three models correspond to three different tasks, they have slightly different objective functions. Therefore, gradient ascent yields slightly different strategies for synaptic update. In the following, we start with the simplest model with the aim of illustrating the basic principles that generalize to the more complex models.

3 Results

In this section, we present synaptic update rules derived by optimizing the likelihood of postsynaptic spike firing at some desired firing time $t^{des}$. The essence of the argument is introduced in a particularly simple scenario, where the neuron is stimulated by one presynaptic spike and the neuron is inactive except at the desired firing time $t^{des}$. This is the raw scenario that is developed in several directions. First, we may ask how the postsynaptic spike at the desired time $t^{des}$ is generated. The spike could simply be given by a supervisor. As always in maximum likelihood approaches, we then optimize the likelihood that this spike could have been generated by the neuron model (i.e., the generative model) given the known input. Or the spike could have been generated by a strong current pulse of short duration applied by the supervisor (teaching input). In this case, the a priori likelihood that the generative model fires at or close to the desired firing time is much higher. The two conceptual paradigms give slightly different results, as discussed in scenario A.
Second, we may, in addition to the spike at the desired time t des , allow for other postsynaptic spikes generated spontaneously. The consequences of spontaneous activity for the STDP function are discussed in scenario B. Third, instead of imposing a single postsynaptic spike at a desired firing time t des , we can think of a temporal coding scheme where the postsynaptic neuron responds to one (out of M) presynaptic spike pattern with a desired output spike train containing several spikes while staying inactive for the other M − 1 presynaptic spike patterns. This corresponds to a pattern classification task, which is the topic of scenario C. Moreover, optimization can be performed in an unconstrained fashion or under some constraint. As we will see in this section, the specific form of the constraint influences the results on STDP, in particular, the strength of synaptic depression for post-before-pre timing. To emphasize this aspect, we discuss two constraints. The first constraint is motivated by the observation that neurons have a preferred working point defined by a typical mean firing rate that is stabilized by homeostatic synaptic processes (Turrigiano & Nelson, 2004). Penalizing deviations from a target firing rate is the constraint that we will use in scenario B. For a very low target firing rate, the constraint reduces to the condition of “no activity,” which is the constraint implemented in scenario A. The second type of constraint is motivated by the notion of STDP itself: changes of synaptic plasticity should depend on the relative timing of preand postsynaptic spike firing and not on other factors. If STDP is to be implemented by some physical or chemical mechanisms with finite time constants, we must require the STDP function to be local in time, that is, the amplitude of the STDP function approaches zero for large time differences. This is the temporal locality constraint used in scenario C. While the unconstrained optimization problems are labeled with the subscript u (Au , Bu , Cu ), the constrained problems are marked by the subscript c (Ac , Bc , Cc ) (see Table 1).
3.1 Scenario A: One Postsynaptic Spike Imposed. Let us start with a particularly simple model, which consists of one presynaptic neuron and one postsynaptic neuron (see Figure 2A). Let us suppose that the task of the postsynaptic neuron i is to fire a single spike at time $t^{des}$ in response to the input, which consists of a single presynaptic spike at time $t^{pre}$; that is, the input is $x = \{t^{pre}\}$ and the desired output of the postsynaptic neuron is $y = \{t^{des}\}$. Since there is only a single pre- and a single postsynaptic neuron involved, we drop in this section the indices j and i of the two neurons.

3.1.1 Unconstrained Scenario Au: One Spike at $t^{des}$. In this section, we assume that the postsynaptic neuron has not been active in the recent past; that is, refractory effects are negligible. In this case, we have $\rho(t|x, y_t) = \rho(t|x)$ because of the absence of previous spikes. Moreover, since there is
Figure 2: (A) Scenario A: a single presynaptic neuron connected to a postsynaptic neuron with a synapse of weight w. (B) Optimal weight change given by equation 3.2 for scenario Au . This weight change is exactly the mirror image of an EPSP.
only a single presynaptic spike (i.e., $x = \{t^{pre}\}$), we write $\rho(t|t^{pre})$ instead of $\rho(t|x)$. Since the task of the postsynaptic neuron is to fire at time $t^{des}$, we can define the optimality criterion $L_{A_u}$ as the log likelihood of the firing intensity at time $t^{des}$,
$$L_{A_u} = \log\left(\rho(t^{des}|t^{pre})\right). \quad (3.1)$$
The gradient ascent on this function leads to the following STDP function,
$$\Delta w_{A_u} = \alpha\,\frac{\partial L_{A_u}}{\partial w} = \alpha\,\frac{\rho'(t^{des}|t^{pre})}{\rho(t^{des}|t^{pre})}\,\epsilon(t^{des} - t^{pre}), \quad (3.2)$$
where $\rho'(t|t^{pre}) \equiv \frac{dg}{du}\big|_{u = u(t|t^{pre})}$. Since this optimal weight change $\Delta w_{A_u}$ can be calculated for any presynaptic firing time $t^{pre}$, we get an STDP function that depends on the time difference $\Delta t = t^{pre} - t^{des}$ (see Figure 2B). As we can see directly from equation 3.2, the shape of the potentiation is exactly a mirror image of an EPSP. This result is independent of the specific choice of the function g(u). The drawback of this simple model becomes apparent if the STDP function given by equation 3.2 is iterated over several repetitions of the experiment. Ideally, it should converge to an optimal solution given by $\Delta w_{A_u} = 0$ in equation 3.2. However, the optimal solution given by $\Delta w_{A_u} = 0$ is problematic: for $\Delta t < 0$, the optimal weight tends toward ∞, whereas for $\Delta t \geq 0$,
there is no unique optimal weight ($\Delta w_{A_u} = 0$ for all w). The reason for this problem is that the model describes only potentiation and includes no mechanism for depression.

3.1.2 Constrained Scenario Ac: No Other Spikes Than at $t^{des}$. In order to get some insight into where the depression could come from, let us consider a small modification of the previous model. In addition to the fact that the neuron has to fire at time $t^{des}$, let us suppose that it should not fire anywhere else. This condition can be implemented by an application of equation 2.6 to the case of a single input spike $x = \{t^{pre}\}$ and a single output spike $y = \{t^{des}\}$. In terms of notation, we set $P(y|x) = P(t^{des}|t^{pre})$ and similarly $\rho(s|x, y) = \rho(s|t^{pre}, t^{des})$, and use equation 2.6 to find
$$P(t^{des}|t^{pre}) = \rho(t^{des}|t^{pre})\exp\left(-\int_0^T \rho(s|t^{pre}, t^{des})\,ds\right). \quad (3.3)$$
Note that for $s \leq t^{des}$, the firing intensity does not depend on $t^{des}$; hence, $\rho(s|t^{pre}, t^{des}) = \rho(s|t^{pre})$ for $s \leq t^{des}$. We define the objective function $L_{A_c}$ as the log likelihood of generating a single output spike at time $t^{des}$, given a single input spike at $t^{pre}$. Hence, with equation 3.3,
$$L_{A_c} = \log(P(t^{des}|t^{pre})) = \log(\rho(t^{des}|t^{pre})) - \int_0^T \rho(s|t^{pre}, t^{des})\,ds, \quad (3.4)$$
and the gradient ascent rule $\Delta w_{A_c} = \alpha\,\partial L_{A_c}/\partial w$ yields
$$\Delta w_{A_c} = \alpha\,\frac{\rho'(t^{des}|t^{pre})}{\rho(t^{des}|t^{pre})}\,\epsilon(t^{des} - t^{pre}) - \alpha\int_0^T \rho'(s|t^{pre}, t^{des})\,\epsilon(s - t^{pre})\,ds. \quad (3.5)$$
Since we have a single postsynaptic spike at $t^{des}$, equation 3.5 can directly be plotted as an STDP function. In Figure 3 we distinguish two different cases. In Figure 3A we optimize the likelihood $L_{A_c}$ in the absence of any teaching input. To understand this scenario, we may imagine that a postsynaptic spike has occurred spontaneously at the desired firing time $t^{des}$. Applying the appropriate weight update calculated from equation 3.5 will make such a timing more likely the next time the presynaptic stimulus is repeated. The reset amplitude $\eta_0$ has only a small influence. In Figure 3B, we consider a case where firing of the postsynaptic spike at the appropriate time was made highly likely by a teaching input of duration $\Delta T = 1$ ms centered around the desired firing time $t^{des}$. The form of the STDP function depends on the amount $\eta_0$ of the reset.
Figure 3: Optimal weight adaptation for scenario Ac given by equation 3.5 in the absence of a teaching signal (A) and in the presence of a teaching signal (B), for $\eta_0 = -10$ mV, $\eta_0 = -5$ mV, and $\eta_0 = 0$ mV. The weight change in the post-before-pre region is governed by the spike afterpotential $u_{AP}(t) = \eta(t) + u^{teach}(t)$. The duration of the teaching input is $\Delta T = 1$ ms. The amplitude of the current $I_0$ is chosen so that $\max_t u^{teach}(t) = 5$ mV. $u_{rest}$ is chosen such that the spontaneous firing rate $g(u_{rest})$ matches the desired firing rate 1/T: $u_{rest} = \Delta u\,\log\left(\frac{1}{T\rho_0}\right) + \vartheta \approx -60$ mV. The weight strength is w = 1.
If there is no reset ($\eta_0 = 0$), the STDP function shows strong synaptic depression of synapses that become active after the postsynaptic spike. This is due to the fact that the teaching input causes an increase of the membrane potential that decays back to rest with the membrane time constant $\tau_m$. Hence, the window of synaptic depression is also exponential with the same time constant. Qualitatively the same is true if we include a weak reset. The form of the depression window remains the same, but its amplitude is reduced. The inverse of the effect occurs only for strong reset to or below resting potential. A weak reset is standard in applications of integrate-and-fire models to in vivo data and is one of the possibilities for explaining the high coefficient of variation of neuronal spike trains in vivo (Bugmann, Christodoulou, & Taylor, 1997; Troyer & Miller, 1997). A further property of the STDP functions in Figure 3 is a negative offset for $|t^{pre} - t^{des}| \to \infty$. The amplitude of the offset can be calculated for $w \to 0$ and $\Delta t > 0$, that is, $\Delta w_0 = -\rho'(u_{rest})\int_0^{\infty}\epsilon(s)\,ds$. This offset is due to the fact that we do not want spikes at other times than $t^{des}$. As a result, the optimal weight w (the solution of $\Delta w_{A_u} = 0$) should be as negative as possible ($w \to -\infty$, or $w \to w^{min}$ in the presence of a lower bound) for $\Delta t > 0$ or $\Delta t \gg 0$.

3.2 Scenario B: Spontaneous Activity. The constraint in scenario Ac of having strictly no other postsynaptic spikes than the one at time $t^{des}$ may seem artificial. Moreover, it is this constraint that leads to the negative
Figure 4: Scenario B. (A) N = 200 presynaptic neurons are firing one after the other at time $t_j = j\delta t$ with $\delta t = 1$ ms. (B) The optimal STDP function of scenario Bu.
offset of the STDP function discussed at the end of the previous paragraph. In order to relax the constraint of no spiking, we allow in scenario B for a reasonable spontaneous activity. As above, we start with an unconstrained scenario Bu before we turn to the constrained scenario Bc.

3.2.1 Unconstrained Scenario Bu: Maximize the Firing Rate at $t^{des}$. Let us start with the simplest model, which includes spontaneous activity. Scenario Bu is the analog of model Au, but with two differences. First, we include spontaneous activity in the model. Since $\rho(t|\mathbf{x}, y_t)$ depends on the spiking history for any given trial, we have to define a quantity that is independent of the specific realizations y of the postsynaptic spike train. Second, instead of considering only one presynaptic neuron, we consider N = 200 presynaptic neurons, each emitting a single spike at time $t_j = j\delta t$, where $\delta t = 1$ ms (see Figure 4A). The input pattern will therefore be described by the set of delayed spikes $\mathbf{x} = \{x_j = \{t_j\},\ j = 1, \ldots, N\}$. As long as we consider only a single spatiotemporal spike pattern in the input, it is always possible to relabel neurons appropriately so that neuron j + 1 fires after neuron j. Let us define the instantaneous firing rate $\bar\rho(t)$ that can be calculated by averaging $\rho(t|y_t)$ over all realizations of postsynaptic spike trains:
$$\bar\rho(t|\mathbf{x}) = \left\langle \rho(t|\mathbf{x}, y_t)\right\rangle_{y_t|\mathbf{x}}. \quad (3.6)$$
Here the notation $\langle\cdot\rangle_{y_t|\mathbf{x}}$ means taking the average over all possible configurations of postsynaptic spikes up to t for a given input x. In analogy to a Poisson process, a specific spike train with firing times $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ is generated with probability $P(y_t|\mathbf{x})$ given by equation 2.6. Hence, the
average $\langle\cdot\rangle_{y_t|\mathbf{x}}$ of equation 3.6 can be written as follows (see appendix B for a numerical evaluation of $\bar\rho(t)$):
$$\bar\rho(t|\mathbf{x}) = \sum_{F=0}^{\infty}\frac{1}{F!}\int_0^t\cdots\int_0^t \rho(t|\mathbf{x}, y_t)\,P(y_t|\mathbf{x})\,dt_i^F\cdots dt_i^1. \quad (3.7)$$
Analogous to model Au, we can define the quality criterion as the log likelihood $L_{B_u}$ of firing at the desired time $t^{des}$:
$$L_{B_u} = \log\left(\bar\rho(t^{des}|\mathbf{x})\right). \quad (3.8)$$
Thus, the optimal weight adaptation of synapse j is given by
$$\Delta w_j^{B_u} = \alpha\,\frac{\partial\bar\rho(t^{des}|\mathbf{x})/\partial w_j}{\bar\rho(t^{des}|\mathbf{x})}, \quad (3.9)$$
where $\partial\bar\rho(t|\mathbf{x})/\partial w_j$ is given by
$$\frac{\partial\bar\rho(t|\mathbf{x})}{\partial w_j} = \bar\rho'(t|\mathbf{x})\,\epsilon(t - t_j) + \left\langle \rho(t|\mathbf{x}, y_t)\,\frac{\partial}{\partial w_j}\log P(y_t|\mathbf{x})\right\rangle_{y_t|\mathbf{x}}, \quad (3.10)$$
where $\frac{\partial}{\partial w_j}\log P(y_t|\mathbf{x})$ is given by equation 2.7 and $\bar\rho'(t|\mathbf{x}) = \left\langle\frac{dg}{du}\big|_{u = u(t|\mathbf{x}, y_t)}\right\rangle_{y_t|\mathbf{x}}$. Figure 4B shows that for our standard set of parameters, the differences to scenario Au are negligible. Figure 5A depicts the STDP function for various values of the parameter $\Delta u$ at a higher postsynaptic firing rate. We can see a small undershoot in the pre-before-post region. The presence of this small undershoot can be understood as follows: enhancing a synapse of a presynaptic neuron that fires too early would induce a postsynaptic spike that arrives before the desired firing time and, because of refractoriness, would therefore prevent the generation of a spike at the desired time. The depth of this undershoot decreases with the stochasticity of the neuron and increases with the amplitude of the refractory period (if $\eta_0 = 0$, there is no undershoot). In fact, correlations between pre- and postsynaptic firing reflect the shape of an EPSP in the high-noise regime, whereas they show a trough for low noise (Poliakov, Powers, & Binder, 1997; Gerstner, 2001). Our theory shows that the pre-before-post region of the optimal plasticity function is a mirror image of these correlations.
3.2.2 Constrained Scenario Bc: Firing Rate Close to $\nu_0$. In analogy to model Ac, we introduce a constraint. Instead of imposing strictly no spikes at times $t \neq t^{des}$, we can relax the condition and minimize deviations of the
Figure 5: (A) The optimal STDP functions of scenario Bu for different levels of stochasticity described by the parameter $\Delta u$. The standard value ($\Delta u = 3$ mV) is given by the solid line; decreased noise ($\Delta u = 1$ mV and $\Delta u = 0.5$ mV) is indicated by dot-dashed and dashed lines, respectively. In the low-noise regime, enhancing a synapse that fires slightly too early can prevent the firing at the desired firing time $t^{des}$ due to refractoriness. To increase the firing rate at $t^{des}$, it is advantageous to decrease the firing probability shortly before $t^{des}$. Methods: For each value of $\Delta u$, the initial weight $w_0$ is set such that the spontaneous firing rate is $\bar\rho = 30$ Hz. In all three cases, $\Delta w$ has been multiplied by $\Delta u$ in order to normalize the amplitude of the STDP function. Reset: $\eta_0 = -5$ mV. (B) Scenario Bc. Optimal STDP function for scenario Bc given by equation 3.13 for a teaching signal of duration $\Delta T = 1$ ms and different values of σ (4, 6, and 8 Hz). The maximal increase of the membrane potential after 1 ms of stimulation with the teaching input is $\max_t u^{teach}(t) = 5$ mV. Synaptic efficacies $w_{ij}$ are initialized such that $u_0 = -60$ mV, which gives a spontaneous rate of $\bar\rho = \nu_0 = 5$ Hz. Standard noise level: $\Delta u = 3$ mV.
instantaneous firing rate $\bar\rho(t|\mathbf{x}, t^{des})$ from a reference firing rate $\nu_0$. This can be done by introducing into equation 3.8 a penalty term $P_B$ given by
$$P_B = \exp\left(-\frac{1}{T}\int_0^T\frac{(\bar\rho(t|\mathbf{x}, t^{des}) - \nu_0)^2}{2\sigma^2}\,dt\right). \quad (3.11)$$
For small σ, deviations from the reference rate yield a large penalty. For σ → ∞, the penalty term has no influence. The optimality criterion is a combination of a high firing rate $\bar\rho$ at the desired time under the constraint of small deviations from the reference rate $\nu_0$. If we impose the penalty as a multiplicative factor and take as before the logarithm, we get
$$L_{B_c} = \log\left(\bar\rho(t^{des}|\mathbf{x})\,P_B\right). \quad (3.12)$$
Figure 6: Scenario C. N presynaptic neurons are fully connected to M postsynaptic neurons. Each postsynaptic neuron is trained to respond to a specific input pattern and not respond to M − 1 other patterns as described by the objective function of equation 3.14.
Hence the optimal weight adaptation is given by
$$\Delta w_j^{B_c} = \alpha\,\frac{\partial\bar\rho(t^{des}|\mathbf{x})/\partial w_j}{\bar\rho(t^{des}|\mathbf{x})} - \frac{\alpha}{T\sigma^2}\int_0^T\left(\bar\rho(t|\mathbf{x}, t^{des}) - \nu_0\right)\frac{\partial}{\partial w_j}\bar\rho(t|\mathbf{x}, t^{des})\,dt. \quad (3.13)$$
Since in scenario B each presynaptic neuron j fires exactly once at time $t_j = j\delta t$ and the postsynaptic neuron is trained to fire at time $t^{des}$, we can interpret the weight adaptation $\Delta w_j^{B_c}$ of equation 3.13 as an STDP function $\Delta w^{B_c}$ that depends on the time difference $\Delta t = t^{pre} - t^{des}$. Figure 5B shows this STDP function for different values of the free parameter σ of equation 3.11. The higher the standard deviation σ, the less effective is the penalty term. In the limit of σ → ∞, the penalty term can be ignored, and the situation is identical to that of scenario Bu.

3.3 Scenario C: Pattern Detection

3.3.1 Unconstrained Scenario Cu: Spike Pattern Imposed. This last scenario is a generalization of scenario Ac. Instead of restricting the study to a single pre- and postsynaptic neuron, we consider N presynaptic neurons and M postsynaptic neurons (see Figure 6). The idea is to construct M independent detector neurons. Each detector neuron i = 1, ..., M should respond best to a specific prototype stimulus, say $\mathbf{x}^i$, by producing a desired spike train $y^i$, but should not respond to other stimuli: $y^i = 0$ for all $\mathbf{x}^k$, $k \neq i$ (see Figure 7). The aim is to find a set of synaptic weights that maximizes the probability
Figure 7: Pattern detection after learning. (Top) The left raster plot represents the input pattern the ith neuron has to be sensitive to. Each line corresponds to one of the N = 400 presynaptic neurons. Each dot represents an action potential. The right plot represents one of the patterns the ith neuron should not respond to. (Middle) The left raster plot corresponds to 1000 repetitions of the output of neuron i when the corresponding pattern $\mathbf{x}^i$ is presented. The right plot is the response of neuron i to one of the patterns it should not respond to. (Bottom) The left graph represents the probability density of firing when pattern $\mathbf{x}^i$ is presented. This plot can be seen as the PSTH of the middle graph. Arrows indicate the supervised timing neuron i learned. The right graph describes the probability density of firing when pattern $\mathbf{x}^k$ is presented. Note the different scales of the vertical axes.
that neuron i produces $y^i$ when $\mathbf{x}^i$ is presented and produces no output when $\mathbf{x}^k$, $k \neq i$, is presented. Let the likelihood function $L_{C_u}$ be
$$L_{C_u} = \log\left[\prod_{i=1}^{M} P_i(y^i|\mathbf{x}^i)\left(\prod_{k=1, k\neq i}^{M} P_i(0|\mathbf{x}^k)\right)^{\frac{\gamma}{M-1}}\right], \quad (3.14)$$
where $P_i(y^i|\mathbf{x}^i)$ (see equation 2.6) is the probability that neuron i produces the spike train $y^i$ when the stimulus $\mathbf{x}^i$ is presented. The parameter γ
characterizes the relative importance of the patterns that should not be learned compared to those that should be learned. We get
$$L_{C_u} = \sum_{i=1}^{M}\left[\log(P_i(y^i|\mathbf{x}^i)) + \gamma\left\langle\log(P_i(0|\mathbf{x}^k))\right\rangle_{\mathbf{x}^k \neq \mathbf{x}^i}\right], \quad (3.15)$$
i=1 1 M where the notation ·xk =xi ≡ M−1 k=i means taking the average over all patterns other than xi . The optimal weight adaptation yields
$$\Delta w_{ij}^{C} = \alpha\,\frac{\partial}{\partial w_{ij}}\log P_i(y^i|\mathbf{x}^i) + \alpha\gamma\left\langle\frac{\partial}{\partial w_{ij}}\log P_i(0|\mathbf{x}^k)\right\rangle_{\mathbf{x}^k \neq \mathbf{x}^i}. \quad (3.16)$$
The learning rule of equation 3.16 gives the optimal weight change for each synapse and can be evaluated after presentation of all pre- and postsynaptic spike patterns; it is a “batch” update rule. Since each pre- and postsynaptic neuron emits many spikes in the interval [0, T], we cannot directly interpret the result of equation 3.16 as a function of the time difference $\Delta t = t^{pre} - t^{des}$ as we did in scenario A or B. Ideally, we would like to write the total weight change of the optimal rule given by equation 3.16 as a sum of contributions
$$\Delta w_{ij}^{C} = \sum_{t^{pre} \in x_j^i}\ \sum_{t^{des} \in y^i}\Delta W^{C_u}(t^{pre} - t^{des}), \quad (3.17)$$
where $\Delta W^{C_u}(t^{pre} - t^{des})$ is an STDP function and the summation runs over all pairs of pre- and postsynaptic spikes. The number of pairs of pre- and postsynaptic spikes with a given time shift is given by the correlation function, which is best defined in discrete time. We assume time steps of duration $\delta t = 0.5$ ms. Since the correlation will depend on the presynaptic neuron j and the postsynaptic neuron i under consideration, we introduce a new index, k = N(i − 1) + j. We define the correlation in discrete time by its matrix elements $C_{k\ell}$ that describe the correlation between the presynaptic spike train $X_j^i(t)$ and the postsynaptic spike train $Y^i(t - T_0 + \ell\delta t)$. For example, $C_{3\ell} = 7$ implies that seven spike pairs of presynaptic neuron j = 3 with postsynaptic neuron i = 1 have a relative time shift of $T_0 - \ell\delta t$. With this definition, we can rewrite equation 3.17 in vector notation (see section C.1 for more details) as
$$\Delta\mathbf{w}^{C} \stackrel{!}{=} C\,\Delta\mathbf{W}^{C_u}, \quad (3.18)$$
Figure 8: (A) Optimal weight change for scenario Cu. In this case, no locality constraint is imposed, and the result is similar to the STDP function of scenario Ac (with $\eta_0 = 0$ and $u^{teach}(t) = 0$) represented in Figure 3. (B) Optimal weight change for scenario Cc as a function of the locality constraint characterized by a. The stronger the importance of the locality constraint, the narrower is the spike-spike interaction. For A and B, M = 20 and $\eta_0 = -5$ mV. The initial weights $w_{ij}$ are chosen so that the spontaneous firing rate matches the imposed firing rate.

where $\Delta\mathbf{w}^{C} = (\Delta w_{11}^{C}, \ldots, \Delta w_{1N}^{C}, \Delta w_{21}^{C}, \ldots, \Delta w_{MN}^{C})^T$ is the vector containing all the optimal weight changes given by equation 3.16 and $\Delta\mathbf{W}^{C_u}$ is the vector containing the discretized STDP function with components $\Delta W_\ell^{C_u} = \Delta W^{C_u}(\ell\delta t - T_0)$ for $1 \leq \ell \leq 2\tilde{T}_0$ with $\tilde{T}_0 = T_0/\delta t$. In particular, the center of the STDP function ($t^{pre} = t^{des}$) corresponds to the index $\ell = \tilde{T}_0$.
The symbol $\stackrel{!}{=}$ expresses the fact that we want to find $\Delta\mathbf{W}^{C_u}$ such that $\Delta\mathbf{w}^{C}$ is as close as possible to $C\,\Delta\mathbf{W}^{C_u}$. By taking the pseudo-inverse $C^+ = (C^T C)^{-1}C^T$ of C, we can invert equation 3.18 and get
$$\Delta\mathbf{W}^{C_u} = C^+\,\Delta\mathbf{w}^{C}. \quad (3.19)$$
The resulting STDP function is plotted in Figure 8A. As was the case for scenario Ac, the STDP function exhibits a negative offset. In addition to the fact that the postsynaptic neuron i should not fire at other times than the ones given by $y^i$, it should also not fire whenever pattern $\mathbf{x}^k$, $k \neq i$, is presented. The presence of the negative offset is due to these two factors.

3.3.2 Constrained Scenario Cc: Temporal Locality. In the previous paragraph, we obtained an STDP function with a negative offset. This negative offset does not seem realistic because it implies that the STDP function is not localized in time. In order to impose temporal locality (finite memory
Figure 9: (A) Optimal STDP function as a function of the number of input patterns M (M = 20, 60, 100; a = 0.04, N = 400). (B) Optimal weight change as a function of the weight w ($w = 0.5w_0$, $w_0$, $1.5w_0$). If the weights are small (dashed line), potentiation dominates, whereas if they are large (dotted line), depression dominates.
span of the learning rule), we modify equation 3.19 in the following way (see section C.2 for more details):
$$\Delta\mathbf{W}^{C_c} = (C^T C + P)^{-1}C^T\,\Delta\mathbf{w}^{C}, \quad (3.20)$$
where P is a diagonal matrix that penalizes nonlocal terms. In this article, we take a quadratic suppression of terms that are nonlocal in time. With respect to a postsynaptic spike at $t^{des}$, the penalty term is proportional to $(t - t^{des})^2$. In matrix notation and using our convention that the postsynaptic spike corresponds to $\ell = \tilde{T}_0$, we have
$$P_{\ell\ell'} = a\,(\ell - \tilde{T}_0)^2\,\delta_{\ell\ell'}. \quad (3.21)$$
The resulting STDP functions for different values of a are plotted in Figure 8B. The higher the parameter a, the more the nonlocal terms are penalized and the narrower the STDP function becomes. Figure 9A shows the STDP functions for various numbers of patterns M. No significant change can be observed for different numbers of input patterns M. This is due to the appropriately chosen normalization factor 1/(M − 1) in the exponent of equation 3.14. The target spike trains $y^i$ have a certain number of spikes during the time window T; they set a target value for the mean rate. Let $\nu^{post} = \frac{1}{TM}\sum_{i=1}^{M}\int_0^T Y^i(t)\,dt$ be the imposed firing rate. Let $w_0$ denote the amplitude of the synaptic strength such that the firing rate $\bar\rho_{w_0}$ given by those weights is identical to the imposed firing rate: $\bar\rho_{w_0} = \nu^{post}$. If the actual weights are
Figure 10: Correlation plot between the optimal synaptic weight change $\Delta w^{opt} = \Delta w^{C_u}$ and the reconstructed weight change $\Delta w^{rec} = C\,\Delta W^{C_c}$ using the temporal locality constraint. (A) No locality constraint; a = 0. Deviations from the diagonal are due to the fact that the optimal weight change given by equation 3.16 cannot be perfectly accounted for by the sum of pair effects. The mean deviations are given by equation C.7. (B) A weak locality constraint (a = 0.04) almost does not change the quality of the weight change reconstruction. (C) Strong locality constraint (a = 0.4). The horizontal lines arise since most synapses are subject to a few strong updates induced by pairs of pre- and postsynaptic spike times with small time shifts.
smaller than $w_0$, almost all the weights should increase, whereas if they are bigger than $w_0$, depression should dominate (see Figure 9B). Thus, the exact form of the optimal STDP function depends on the initial weight value $w_0$. Alternatively, homeostatic processes could ensure that the mean weight value is always in the appropriate regime. In equations 3.17 and 3.18, we imposed that the total weight change should be generated as a sum over pairs of pre- and postsynaptic spikes. This is an assumption that has been made in order to establish a link to standard STDP theory and experiments where spike pairs have been at the center of interest (Gerstner et al., 1996; Kempter et al., 1999; Kistler & van Hemmen, 2000; Markram et al., 1997; Bi & Poo, 1998; Zhang et al., 1998). It is, however, clear by now that the timing of spike pairs is only one of several factors contributing to synaptic plasticity. We therefore asked how much we miss if we attribute the optimal weight changes calculated in equation 3.16 to spike pair effects only. To answer this question, we compared the optimal weight change $\Delta w_{ij}^{C}$ from equation 3.16 with that derived from the pair-based STDP rule $\Delta w_{ij}^{rec} = \sum_{t^{pre} \in x_j^i}\sum_{t^{des} \in y^i}\Delta W^{C_c}(t^{pre} - t^{des})$, with or without locality constraint, that is, for different values of the locality parameter (a = 0, 0.04, 0.4) (see Figure 10). More precisely, we simulate M = 20 detector neurons, each having N = 400 presynaptic inputs, so each subplot of Figure 10 contains 8000 points. Each point in a graph corresponds to the optimal change of one weight for one detector neuron (x-axis) compared to the weight change of the same weight due to pair-based STDP
(y-axis). We found that in the absence of a locality constraint, the pair-wise contributions are well correlated with the optimal weight changes. With strong locality constraints, the quality of the correlation drops significantly. However, for a weak locality constraint that corresponds to an STDP function with reasonable potentiation and depression regimes, the correlation of the pair-based STDP rule with the optimal update is still good. This suggests that synaptic updates with an STDP function based on pairs of pre- and postsynaptic spikes are close to optimal in the pattern detection paradigm.
4 Discussion

4.1 Supervised versus Unsupervised and Reinforcement Learning. Our approach is based on the maximization of the probability of firing at desired times $t^{des}$, with or without constraints. From the point of view of machine learning, this is a supervised learning paradigm implemented as a maximum likelihood approach using the spike response model with escape noise as a generative model. Our work can be seen as a continuous-time extension of the maximum likelihood approach proposed in Barber (2003). The starting point of all supervised paradigms is the comparison of a desired output with the actual output a neuron has, or would have, generated. The difference between the desired and actual output is then used as the driving signal for synaptic updates in typical model approaches (Minsky & Papert, 1969; Haykin, 1994; Bishop, 1995). How does this compare to experimental approaches? Experiments focusing on STDP have been mostly performed in vitro (Markram et al., 1997; Magee & Johnston, 1997; Bi & Poo, 1998). Since in typical experimental paradigms, firing of the postsynaptic neuron is enforced by strong pulses of current injection, the neuron is not in a natural unsupervised setting; but the situation is also not fully supervised, since there is never a conflict between the desired and actual output of a neuron. In one of the rare in vivo experiments on STDP (Frégnac, Shulz, Thorpe, & Bienenstock, 1988, 1992), the spikes of the postsynaptic neuron are also imposed by current injection. Thus, a classification of STDP experiments in terms of supervised, unsupervised, or reward based is not as clear-cut as it may seem at first glance. From the point of view of neuroscience, paradigms of unsupervised or reinforcement learning are probably much more relevant than the supervised scenario discussed here. However, most of our results from the supervised scenario analyzed in this article can be reinterpreted in the context of reinforcement learning following the approach proposed by Xie and Seung (2004). To illustrate the link between reinforcement learning and supervised learning, we define a global reinforcement signal $R(\mathbf{x}, y)$ that depends on the spike timing of the presynaptic neurons x and the postsynaptic neuron y.
The quantity optimized in reinforcement learning is the expected reward $\langle R\rangle_{\mathbf{x},y}$ averaged over all pre- and postsynaptic spike trains:
$$\langle R\rangle_{\mathbf{x},y} = \sum_{\mathbf{x},y} R(\mathbf{x}, y)\,P(y|\mathbf{x})\,P(\mathbf{x}). \quad (4.1)$$
If the goal of learning is to maximize the expected reward, we can define a learning rule that achieves this goal by changing synaptic efficacies in the direction of the gradient of the expected reward $\langle R\rangle_{\mathbf{x},y}$:
$$\Delta w = \alpha\left\langle R(\mathbf{x}, y)\,\frac{\partial\log P(y|\mathbf{x})}{\partial w}\right\rangle_{\mathbf{x},y}, \quad (4.2)$$
where α is a learning parameter and $\frac{\partial\log P(y|\mathbf{x})}{\partial w}$ is the quantity we discussed in this article. Thus, the quantities optimized in our supervised paradigm reappear naturally in a reinforcement learning paradigm. For an intuitive interpretation of the link between reinforcement learning and supervised learning, consider a postsynaptic spike that (spontaneously) occurred at time $t_0$. If no reward is given, no synaptic change takes place. However, if the postsynaptic spike at $t_0$ is linked to a rewarding situation, the synapse will try to recreate in the next trial a spike at the same time; that is, $t_0$ takes the role of the desired firing time $t^{des}$ introduced in this article. Thus, the STDP function with respect to a postsynaptic spike at $t^{des}$ derived in this article can be seen as the spike timing dependence that maximizes the expected reward in a spike-based reinforcement learning paradigm.

4.2 Interpretation of STDP Function. Let us now summarize and discuss our results in a broader context. In all three scenarios, we found an STDP function with potentiation for pre-before-post timing. Thus, this result is structurally stable and independent of model details. However, depression for post-before-pre timing does depend on model details. In scenario A, we saw that the behavior of the post-before-pre region is determined by the spike afterpotential (see Table 2 for a summary of the results of the three models). In the presence of a teaching input and firing rate constraints, a weak reset of the membrane potential after the spike means that the neuron effectively has a depolarizing spike afterpotential (DAP). In experiments, DAPs have been observed by Feldman (2000), Markram et al. (1997), and Bi and Poo (1998) for strong presynaptic input. Other studies have shown that the level of depression does not depend on the postsynaptic membrane potential (Sjöström, Turrigiano, & Nelson, 2001). In any case, a weak reset (i.e., to a value below threshold rather than to the resting potential) is consistent with the findings of other researchers that used integrate-and-fire models to account for the high coefficient of
Table 2: Main Results for Each Scenario.

Unconstrained scenarios:
Au — pre-before-post: LTP ∼ EPSP
Bu — pre-before-post: LTP/LTD ∼ reverse correlation
Cu — pre-before-post: LTP ∼ EPSP, LTD ∼ background patterns

Constrained scenarios:
Ac — post-before-pre: LTD (or LTP) ∼ spike afterpotential
Bc — post-before-pre: LTD ∼ increased firing rate
Cc — post-before-pre: LTD ∼ background patterns ∼ temporal locality
variation of spike trains in vivo (Bugmann et al., 1997; Troyer & Miller, 1997). In the presence of spontaneous activity (scenario B), a constraint on the spontaneous firing rate causes the optimal weight change to elicit a depression of presynaptic spikes that arrive immediately after the postsynaptic one. In fact, the reason for the presence of the depression in scenario Bc is directly related to the presence of a DAP caused by the strong teaching stimulus. In both scenarios A and B, depression occurs in order to compensate the increased firing probability due to the DAP. In scenario C, it has been shown that the best way to adapt the weights (in a task where the postsynaptic neuron has to detect a specific input pattern among others) can be described as an STDP function. This task is similar to the one in Izhikevich (2003) in the sense that a neuron is designed to be sensitive to a specific input pattern, but different since our work does not assume any axonal delays. The depression part in this scenario arises from a locality constraint. We impose that weight changes are explained by a sum of pair-based STDP functions. There are various ways of defining objective functions, and we have used three different objective functions in this article. The formulation of an objective function gives a mathematical expression of the functional role we assign to a neuron. The functional role depends on the type of coding (temporal coding or rate coding) and hence on the information the postsynaptic neurons will read out. The functional role also depends on the task or context in which a neuron is embedded. It might seem that different tasks and coding schemes could thus give rise to a huge number of objective functions. However, the reinterpretation of our approach in the context of reinforcement learning provides a unifying viewpoint: even if the functional role of some neurons in a specific region of the brain can be different from other neurons of a different region, it is still possible to see the different objective functions as different instantiations of the same underlying concept: the maximization of the reward, where the reward is task specific. More specifically, all objective functions used in this letter maximized the firing probability at a desired firing time t des , reflecting the fact that
in the framework of timing-based codes, the task of a neuron is to fire at precise moments in time. With a different assumption on the neuron's role in signal processing, different objective functions need to be used. An extreme case is a situation where the neuron's task is to avoid firing at time $t^{des}$. A good illustration is given by the experiments done in the electrosensory lobe (ELL) of the electric fish (Bell et al., 1997). These cells receive two sets of input: the first one contains the pulses coming from the electric organ, and the second input conveys information about the sensory stimulus. Since a large fraction of the sensory stimulus can be predicted by the information coming from the electric organ, it is computationally interesting to subtract the predictable contribution and focus on only the unpredictable part of the sensory stimulus. In this context, a reasonable task would be to ask the neuron not to fire at time $t^{des}$, where $t^{des}$ is the time when the predictable stimulation arrives, and this task could be defined indirectly by an appropriate reward signal. An objective function of this type would, in the end, reverse the sign of the weight change of the causal part (LTD for the pre-before-post region), and this is precisely what is seen experimentally (Bell et al., 1997). In our framework, the definition of the objective function is closely related to the neuronal coding. In scenario C, we postulate that neurons emit a precise spike train whenever the “correct” input is presented and are silent otherwise. This coding scheme is clearly not the most efficient one. Another possibility is to require postsynaptic neurons to produce a specific but different spike train for each input pattern, and not only for the “correct” input. Such a modification of the scenario does not dramatically change the results. The only effect is to reduce the amount of depression and increase the amount of potentiation.

4.3 Optimality Approaches versus Mechanistic Models. Theoretical approaches to neurophysiological phenomena in general, and to synaptic plasticity in particular, can be roughly grouped into three categories: biophysical models that aim at explaining the STDP function from principles of ion channel dynamics and intracellular processes (Senn, Tsodyks, & Markram, 2001; Shouval, Bear, & Cooper, 2002; Abarbanel, Huerta, & Rabinovich, 2002; Karmarkar & Buonomano, 2002); mathematical models that start from a given STDP function and analyze computational principles such as intrinsic normalization of summed efficacies or sensitivity to correlations in the input (Kempter et al., 1999; Roberts, 1999; Roberts & Bell, 2000; van Rossum et al., 2000; Kistler & van Hemmen, 2000; Song et al., 2000; Song & Abbott, 2001; Kempter et al., 2001; Gütig, Aharonov, Rotter, & Sompolinsky, 2003); and models that derive "optimal" STDP properties for a given computational task (Chechik, 2003; Dayan & Häusser, 2004; Hopfield & Brody, 2004; Bohte & Mozer, 2005; Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005a, 2005b). Optimizing the likelihood of postsynaptic firing in a predefined interval, as we did in this letter, is only one possibility
among others of introducing concepts of optimality (Barlow, 1961; Atick & Redlich, 1990; Bell & Sejnowski, 1995) into the field of STDP. Chechik (2003) uses concepts from information theory but restricts his study to the classification of stationary patterns. The paradigm considered in Bohte and Mozer (2005) is similar to our scenario Bc, in that they use a fairly strong teaching input to make the postsynaptic neuron fire. Bell and Parra (2005) and Toyoizumi et al. (2005a) are also using concepts from information theory, but they are applying them to the pre- and postsynaptic spike trains. The work of Toyoizumi et al. (2005a) is a clear-cut unsupervised learning paradigm and hence distinct from our approach. Dayan and Häusser (2004) use concepts of optimal filter theory but are not interested in precise firing of the postsynaptic neuron. The work of Hopfield and Brody (2004) is similar to our approach in that it focuses on recognition of temporal input patterns, but we are also interested in triggering postsynaptic firing with precise timing. Hopfield and Brody emphasize the repair of disrupted synapses in a network that has previously acquired its function of temporal pattern detector. Optimality approaches such as ours will never be able to make strict predictions about the properties of neurons or synapses. Optimality criteria may, however, help to elucidate computational principles and provide insights into potential tasks of electrophysiological phenomena such as STDP.

Appendix A: Probability Density of a Spike Train

The probability density of generating a spike train $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ with the stochastic process defined by equation 2.5 can be expressed as follows,
$$P(y_t) = P(t_i^1, \ldots, t_i^F)\,R(t|y_t), \quad (A.1)$$
where $P(t_i^1, \ldots, t_i^F)$ is the probability density of having F spikes at times $t_i^1, \ldots, t_i^F$ and $R(t|y_t) = \exp\left(-\int_{t_i^F}^{t}\rho(t'|y_{t'})\,dt'\right)$ corresponds to the probability of having no spikes from $t_i^F$ to t. Since the joint probability $P(t_i^1, \ldots, t_i^F)$ can be expressed as a product of conditional probabilities,
$$P(t_i^1, \ldots, t_i^F) = P(t_i^1)\prod_{f=2}^{F} P\left(t_i^f \mid t_i^{f-1}, \ldots, t_i^1\right), \quad (A.2)$$
equation A.1 becomes
$$P(y_t) = \rho(t_i^1|y_{t_i^1})\exp\left(-\int_0^{t_i^1}\rho(t'|y_{t'})\,dt'\right) \cdot \prod_{f=2}^{F}\rho(t_i^f|y_{t_i^f})\exp\left(-\int_{t_i^{f-1}}^{t_i^f}\rho(t'|y_{t'})\,dt'\right)\exp\left(-\int_{t_i^F}^{t}\rho(t'|y_{t'})\,dt'\right) = \prod_{t_i^f \in y_t}\rho(t_i^f|y_{t_i^f})\exp\left(-\int_0^{t}\rho(t'|y_{t'})\,dt'\right). \quad (A.3)$$
Appendix B: Numerical Evaluation of $\bar\rho(t)$

Since it is impossible to numerically evaluate the instantaneous firing rate $\bar\rho(t)$ with the analytical expression given by equation 3.6, we have to do it in a different way. In fact, there are two ways to evaluate $\bar\rho(t)$. Before going into the details, let us first recall that from the law of large numbers, the instantaneous firing rate is equal to the empirical density of spikes at time t,
$$\langle\rho(t|y_t)\rangle_{y_t} = \langle Y(t)\rangle_{Y(t)}, \quad (B.1)$$
where $Y(t) = \sum_{t_i^f \in y_t}\delta(t - t_i^f)$ is one realization of the postsynaptic spike train. Thus, the first and simpler method, based on the right-hand side of equation B.1, is to build a PSTH by counting spikes in small time bins [t, t + δt] over, say, K = 10,000 repetitions of an experiment. The second, and more advanced, method consists in evaluating the left-hand side of equation B.1 by Monte Carlo sampling. Instead of averaging over all possible spike trains $y_t$, we generate K = 10,000 spike trains by repetition of the same stimulus. A specific spike train $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ will automatically appear with appropriate probability given by equation 2.6. The Monte Carlo estimation $\tilde\rho(t)$ of $\bar\rho(t)$ can be written as
$$\tilde\rho(t) = \frac{1}{K}\sum_{m=1}^{K}\rho(t|y_t^m), \quad (B.2)$$
where $y_t^m$ is the mth spike train generated by the stochastic process given by equation 2.5. Since we use the analytical expression of $\rho(t|y_t^m)$, we will call equation B.2 a semianalytical estimation. Let us note that the semianalytical estimation $\tilde\rho(t)$ converges more rapidly to the true value $\bar\rho(t)$ than the empirical estimation based on the PSTH. In the limit of a Poisson process ($\eta_0 = 0$), the semianalytical estimation $\tilde\rho(t)$ given by equation B.2 is equal to the analytical expression of equation 3.6, since the instantaneous firing rate ρ of a Poisson process is independent of the firing history $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ of the postsynaptic neuron.
Appendix C: Deconvolution

C.1 Deconvolution for Spike Pairs. With a learning rule such as equation 3.16, we know the optimal weight change $\Delta w_{ij}$ for each synapse, but we still do not know the corresponding STDP function. Let us first define the correlation function $c_k(\tau)$, k = N(i − 1) + j, between the presynaptic spike train $X_j^i(t) = \sum_{t^{pre} \in x_j^i}\delta(t - t^{pre})$ and the postsynaptic spike train $Y^i(t) = \sum_{t^{des} \in y^i}\delta(t - t^{des})$:
$$c_k(\tau) = \int_0^T X_j^i(s)\,Y^i(s + \tau)\,ds, \quad k = 1, \ldots, NM, \quad (C.1)$$
where we allow a range $-T_0 \leq \tau \leq T_0$, with $T_0 \ll T$. Since the sum of the pair-based weight changes $\Delta W$ should be equal to the total adaptation of the weights $\Delta w_k$, we can write
$$\int_{-T_0}^{T_0} c_k(s)\,\Delta W(s)\,ds \stackrel{!}{=} \Delta w_k, \quad k = 1, \ldots, NM. \quad (C.2)$$
If we want to express equation C.1 in matrix form, we need to discretize time in small bins δt and define the matrix element
$$C_{k\ell} = \int_{\ell\delta t - T_0}^{(\ell+1)\delta t - T_0} c_k(s)\,ds. \quad (C.3)$$
Now equation C.2 becomes
$$\Delta\mathbf{w} \stackrel{!}{=} C\,\Delta\mathbf{W}, \quad (C.4)$$
where $\Delta\mathbf{w} = (\Delta w_{11}, \ldots, \Delta w_{1N}, \Delta w_{21}, \ldots, \Delta w_{MN})^T$ is the vector containing all the optimal weight changes and $\Delta\mathbf{W}$ is the vector containing the discretized STDP function, $\Delta W_\ell = \Delta W(\ell\delta t - T_0)$ for $\ell = 1, \ldots, 2\tilde{T}_0$ with $\tilde{T}_0 = T_0/\delta t$. In order to solve the last matrix equation, we have to compute the inverse of the nonsquare $NM \times 2\tilde{T}_0$ matrix C, which is known as the Moore-Penrose inverse (or the pseudo-inverse),
$$C^+ = (C^T C)^{-1}C^T, \quad (C.5)$$
which exists only if $(C^T C)^{-1}$ exists. In fact, the solution given by
$$\Delta\mathbf{W} = C^+\,\Delta\mathbf{w} \quad (C.6)$$
minimizes the square distance
$$D = \frac{1}{2}\left(C\,\Delta\mathbf{W} - \Delta\mathbf{w}\right)^2. \quad (C.7)$$
C.2 Temporal Locality Constraint. If we want to impose a constraint of locality, we can add a term in the minimization process of equation C.7 and define the following:
$$E = D + \frac{1}{2}\,\Delta\mathbf{W}^T P\,\Delta\mathbf{W}, \quad (C.8)$$
where P is a diagonal matrix that penalizes nonlocal terms. In this article, we take a quadratic suppression of terms that are nonlocal in time:
$$P_{\ell\ell'} = a\,(\ell - \tilde{T}_0)^2\,\delta_{\ell\ell'}. \quad (C.9)$$
$\tilde{T}_0$ corresponds to the index of the vector $\Delta\mathbf{W}$ in equations C.4 and C.8 for which $t^{pre} - t^{des} = 0$. Calculating the gradient of E given by equation C.8 with respect to $\Delta\mathbf{W}$ yields
$$\nabla_{\Delta\mathbf{W}}\,E = C^T(C\,\Delta\mathbf{W} - \Delta\mathbf{w}) + P\,\Delta\mathbf{W}. \quad (C.10)$$
By looking at the minimal value of E, that is, $\nabla_{\Delta\mathbf{W}}\,E = 0$, we have
$$\Delta\mathbf{W} = (C^T C + P)^{-1}C^T\,\Delta\mathbf{w}. \quad (C.11)$$
By setting a = 0, we recover the previous case.

Acknowledgments

This work was supported by the Swiss National Science Foundation (200020-103530/1 and 200020-108093/1). T.T. was supported by the Research Fellowships of the Japan Society for the Promotion of Science for Young Scientists and a Grant-in-Aid for JSPS Fellows.

References

Abarbanel, H., Huerta, R., & Rabinovich, M. (2002). Dynamical model of long-term synaptic plasticity. Proc. Natl. Academy of Sci. USA, 59, 10137–10143.
Atick, J., & Redlich, A. (1990). Towards a theory of early visual processing. Neural Computation, 4, 559–572.
Barber, D. (2003). Learning in spiking neural assemblies. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 149–156). Cambridge, MA: MIT Press.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bell, C., Han, V., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bi, G., & Poo, M. (2001). Synaptic modification of correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 201–208). Cambridge, MA: MIT Press.
Brody, C., & Hopfield, J. (2003). Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron, 37, 843–852.
Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Computation, 9, 985–1000.
Carr, C. E., & Konishi, M. (1990). A circuit for detection of interaural time differences in the brain stem of the barn owl. J. Neurosci., 10, 3227–3246.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Feldman, D. (2000). Timing-based LTP and LTD and vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Frégnac, Y., Shulz, D. E., Thorpe, S., & Bienenstock, E. (1988). A cellular analogue of visual cortical plasticity. Nature, 333(6171), 367–370.
Frégnac, Y., Shulz, D. E., Thorpe, S., & Bienenstock, E. (1992). Cellular analogs of visual cortical epigenesis. I: Plasticity of orientation selectivity. Journal of Neuroscience, 12(4), 1280–1300.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse- and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. K. (2002a). Mathematical formulations of Hebbian learning. Biological Cybernetics, 87, 404–415.
Gerstner, W., & Kistler, W. K. (2002b). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neuroscience, 23, 3697–3714.
Haykin, S. (1994). Neural networks. Upper Saddle River, NJ: Prentice Hall.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing-based computation networks. Proc. Natl. Acad. Sci. USA, 101, 337–342.
Izhikevich, E. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14, 1569–1572.
Johansson, R., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature Neuroscience, 7, 170–177.
Karmarkar, U., & Buonomano, D. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors. J. Neurophysiology, 88, 507–513.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic potentials. Neural Comput., 12, 385–405.
Legenstein, R., Naeger, C., & Maass, W. (2005). What target functions can be learnt with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic AP and EPSP. Science, 275, 213–215.
Mehta, M. R., Lee, A. K., & Wilson, M. A. (2002). Role of experience and oscillations in transforming a rate code into a temporal code. Nature, 417, 741–746.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Panzeri, S., Petersen, R., Schultz, S., Lebedev, M., & Diamond, M. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Poliakov, A. V., Powers, R. K., & Binder, M. C. (1997). Functional identification of the input-output transforms of motoneurones in the rat and cat. J. Physiology, 504, 401–424.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.
Roberts, P. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Computational Neuroscience, 7, 235–246.
Roberts, P., & Bell, C. (2000). Computational consequences of temporally asymmetric learning rules: II. Sensory image cancellation. Computational Neuroscience, 9, 67–83.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367.
Senn, W., Tsodyks, M., & Markram, H. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67.
Seung, S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.
Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. USA, 99, 10831–10836.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., & Abbott, L. (2001). Column and map development and cortical re-mapping through spike-timing dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-time-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14, 715–725.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005a). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005b). Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proc. Natl. Acad. Sci. USA, 102, 5239–5244.
Troyer, T. W., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Turrigiano, G., & Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neuroscience, 20, 8812–8821.
Xie, X., & Seung, S. (2004). Learning in neural networks by reinforcement of irregular spiking. Phys. Rev. E, 69, 041909.
Zhang, L., Tao, H., Holt, C., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received August 10, 2004; accepted September 8, 2005.
LETTER
Communicated by Peter Latham
When Response Variability Increases Neural Network Robustness to Synaptic Noise
Gleb Basalyga
[email protected]
Emilio Salinas
[email protected]
Department of Neurobiology and Anatomy, Wake Forest University School of Medicine, Winston-Salem, NC 27157-1010, U.S.A.
Cortical sensory neurons are known to be highly variable, in the sense that responses evoked by identical stimuli often change dramatically from trial to trial. The origin of this variability is uncertain, but it is usually interpreted as detrimental noise that reduces the computational accuracy of neural circuits. Here we investigate the possibility that such response variability might in fact be beneficial, because it may partially compensate for a decrease in accuracy due to stochastic changes in the synaptic strengths of a network. We study the interplay between two kinds of noise, response (or neuronal) noise and synaptic noise, by analyzing their joint influence on the accuracy of neural networks trained to perform various tasks. We find an interesting, generic interaction: when fluctuations in the synaptic connections are proportional to their strengths (multiplicative noise), a certain amount of response noise in the input neurons can significantly improve network performance, compared to the same network without response noise. Performance is enhanced because response noise and multiplicative synaptic noise are in some ways equivalent. So if the algorithm used to find the optimal synaptic weights can take into account the variability of the model neurons, it can also take into account the variability of the synapses. Thus, the connection patterns generated with response noise are typically more resistant to synaptic degradation than those obtained without response noise. As a consequence of this interplay, if multiplicative synaptic noise is present, it is better to have response noise in the network than not to have it. These results are demonstrated analytically for the most basic network consisting of two input neurons and one output neuron performing a simple classification task, but computer simulations show that the phenomenon persists in a wide range of architectures, including recurrent (attractor) networks and sensorimotor networks that perform coordinate transformations. The results suggest that response variability could play an important dynamic role in networks that continuously learn.
Neural Computation 18, 1349–1379 (2006)
C 2006 Massachusetts Institute of Technology
1 Introduction

Neuronal networks face an inescapable trade-off between learning new associations and forgetting previously stored information. In competitive learning models, this is sometimes referred to as the stability-plasticity dilemma (Carpenter & Grossberg, 1987; Hertz, Krogh, & Palmer, 1991): in terms of inputs and outputs, learning to respond to new inputs will interfere with the learned responses to familiar inputs. A particularly severe form of performance degradation is known as catastrophic interference (McCloskey & Cohen, 1989). It refers to situations in which the learning of new information causes the virtually complete loss of previously stored associations. Biological networks must face a similar problem, because once a task has been mastered, plasticity mechanisms will inevitably produce further changes in the internal structural elements, leading to decreased performance. That is, within subnetworks that have already learned to perform a specific function, synaptic plasticity must at least partly appear as a source of noise. In the cortex, this problem must be quite significant, given that even primary sensory areas show a large capacity for reorganization (Wang, Merzenich, Sameshima, & Jenkins, 1995; Kilgard & Merzenich, 1998; Crist, Li, & Gilbert, 2001). Some mechanisms, such as homeostatic regulation (Turrigiano & Nelson, 2000) and specific types of synaptic modification rules (Hopfield & Brody, 2004), may help alleviate the problem, but by and large, how nervous systems cope with it remains unknown.

Another factor that is typically considered a limitation for neural computation capacity is response variability. The activity of cortical neurons is highly variable, as measured by either the temporal structure of spike trains produced during constant stimulation conditions or spike counts collected in a given time interval and compared across identical behavioral trials (Dean, 1981; Softky & Koch, 1992, 1993; Holt, Softky, Koch, & Douglas, 1996). Some of the biophysical factors that give rise to this variability, such as the balance between excitation and inhibition, have been identified (Softky & Koch, 1993; Shadlen & Newsome, 1994; Stevens & Zador, 1998). But its functional significance, if any, is not understood.

Here we consider a possible relationship between the two sources of randomness just discussed, whereby response variability helps counteract the destabilizing effects of synaptic changes. Although noise generally hampers performance, recent studies have shown that, in nonlinear dynamical systems such as neural networks, this is not always the case. The best-known example is stochastic resonance, in which noise enhances the sensitivity of sensory neurons to weak periodic signals (Levin & Miller, 1996; Gammaitoni, Hänggi, Jung, & Marchesoni, 1998; Nozaki, Mar, Grigg, & Collins, 1999), but noise may play other constructive roles as well. For instance, when a system has an internal source of noise, externally added
noise can reduce the total noise of the output (Vilar & Rubi, 2000). Also, adding noise to the synaptic connections of a network during learning produces networks that, after training, are more robust to synaptic corruption and have a higher capacity to generalize (Murray & Edwards, 1994).

In this letter, we study another beneficial effect of noise on neural network performance. In this case, adding randomness to the neural responses reduces the impact of fluctuations in synaptic strength. That is, here, performance depends on two sources of variability, response noise and synaptic noise, and adding some amount of response noise produces better performance than having synaptic noise alone. The reason for this paradoxical effect is that response noise acts as a regularization factor that favors connectivity matrices with many small synaptic weights over connectivity matrices with few large weights, and this minimizes the impact of a synapse that is lost or has a wrong value. We study this regularization effect in three different cases: (1) a classification task, which in its simplest instantiation can be studied analytically; (2) a sensorimotor transformation; and (3) an attractor network that produces self-sustained activity. For the latter two, the interaction between noise terms is demonstrated by extensive numerical simulations.

2 General Framework

First we consider networks with two layers: an input layer that contains N sensory neurons and an output layer with K output neurons. A matrix r is used to denote the firing rates of the input neurons in response to M stimuli, so rij is the firing rate of input unit i when stimulus j is presented. These rates have a mean component r̄ plus noise, as described in detail below. The output units are driven by the first-layer responses, such that the firing rate of output unit k evoked by stimulus j is

$$R_{kj} = \sum_{i=1}^{N} w_{ki}\, r_{ij}, \qquad (2.1)$$
or in matrix form, R = wr, where w is the K × N matrix of synaptic connections between input and output neurons. The output neurons also have a set of desired responses F, where Fkj is the firing rate that output unit k should produce when stimulus j is presented. In other words, F contains target values that the outputs are supposed to learn. The error E is the mean squared difference between the actual driven responses Rkj and the desired ones,

$$E = \frac{1}{KM} \sum_{k=1}^{K} \sum_{j=1}^{M} \left\langle \left(R_{kj} - F_{kj}\right)^2 \right\rangle, \qquad (2.2)$$
or in matrix notation,

$$E = \frac{1}{KM}\, \mathrm{Tr}\left\langle (wr - F)(wr - F)^T \right\rangle. \qquad (2.3)$$

Here, Tr(A) = Σi Aii is the trace of a matrix, and the angle brackets indicate an average over multiple trials, which corresponds to multiple samples of the noise in the inputs r. The optimal synaptic connections W are those that make the error as small as possible. These can be found by computing the derivative of equation 2.3 with respect to w (or with respect to wab, if the summations are written explicitly) and setting the result equal to zero (see, e.g., Golub & van Loan, 1996). These steps give
$$W = F \bar{r}^T C^{-1}, \qquad (2.4)$$

where r̄ = ⟨r⟩ and C⁻¹ is the inverse (or the pseudo-inverse) of the correlation matrix C = ⟨r rᵀ⟩. The general outline of the computer experiments proceeds in five steps as follows. First, the matrix r̄ with the mean input responses is generated together with the desired output responses F. These two quantities define the input-output transformation that the network is supposed to implement. Second, response noise is added to the mean input rates, such that
$$r_{ij} = \bar{r}_{ij}\left(1 + \eta_{ij}\right). \qquad (2.5)$$
The random variables ηij are independently drawn from a distribution with zero mean and variance σr²,

$$\langle \eta_{ij} \rangle = 0, \qquad \langle \eta_{ij}^2 \rangle = \sigma_r^2, \qquad (2.6)$$
where the brackets again denote an average over trials. We refer to this as multiplicative noise. Third, the optimal connections are found using equation 2.4. Note that these connections take into account the response noise through its effect on the correlation matrix C. Fourth, the connections are corrupted by multiplicative synaptic noise with variance σW², that is,

$$W_{ij} \to W_{ij}\left(1 + \xi_{ij}\right), \qquad (2.7)$$
where

$$\langle \xi_{ij} \rangle = 0, \qquad \langle \xi_{ij}^2 \rangle = \sigma_W^2. \qquad (2.8)$$
Finally, the network’s performance is evaluated. For this, we measure the network error EW, which is the square error obtained with the optimal but corrupted weights W, averaged over both types of noise,

$$E_W = \frac{1}{KM}\, \mathrm{Tr}\left\langle (Wr - F)(Wr - F)^T \right\rangle. \qquad (2.9)$$
Thus, the brackets in this case indicate an average over multiple trials and multiple networks, that is, multiple corruptions of the optimal weights W.

The main result we report here is an interaction between the two types of noise: in all the network architectures that we have explored, for a fixed amount of synaptic noise σW, the best performance is typically found when the response noise has a certain nonzero variance. So, given that there is synaptic noise in the network, it is better to have some response noise rather than to have none.

Before addressing the first example, we should highlight some features of the chosen noise models. Regarding response noise, equations 2.5 and 2.6, other models were tested in which the fluctuations were additive rather than multiplicative. Also, gaussian, uniform, and exponential distributions were tested. The results for all combinations were qualitatively the same, so the shape of the response noise distribution does not seem to play an important role; what counts is mainly the variance. On the other hand, the benefit of response noise is observed only when the synaptic noise is multiplicative; it disappears with additive synaptic noise. However, we do test several variants of the multiplicative model, including one in which the random variables ξij are drawn from a gaussian distribution and another in which they are binary, 0 or −1. The latter case represents a situation in which connections are eliminated randomly with a fixed probability.

3 Noise Interactions in a Classification Task

First we consider a task in which the two-layer, fully connected network is used to approximate a binary function. The task is to classify M stimuli on the basis of the N input firing rates evoked by each stimulus. Only one output neuron is needed, so K = 1. The desired response of this output neuron is the classification function

$$F_j = \begin{cases} 1 & \text{if } j \le M/2 \\ 0 & \text{otherwise,} \end{cases} \qquad (3.1)$$
where j goes from 1 to M. Therefore, the job of the output unit is to produce a 1 for the first M/2 input stimuli and a 0 for the rest.

3.1 A Minimal Network. In order to obtain an analytical description of the noise interactions, we first consider the simplest possible network that
exhibits the effect, which consists of two input neurons and two stimuli. Thus, N = M = 2, and the desired output is F = (1, 0). Note that with a single output neuron, the matrices W and F become row vectors. Now we proceed according to the five steps outlined in the preceding section; the goal is to show analytically that in the presence of synaptic noise, performance is typically better for a nonzero amount of response noise. The matrix of mean input firing rates is set to

$$\bar{r} = \begin{pmatrix} 1 & r_0 \\ r_0 & 1 \end{pmatrix}, \qquad (3.2)$$
where r0 is a parameter that controls the difficulty of the classification. When it is close to 1, the pairs of responses evoked by the two stimuli are very similar, and large errors in the output are expected; when it is close to 0, the input responses are most different, and the classification should be more accurate. After combining the mean responses with multiplicative noise, as prescribed by equation 2.5, the input responses in a given trial become

$$r = \begin{pmatrix} 1 + \eta_{11} & r_0\,(1 + \eta_{12}) \\ r_0\,(1 + \eta_{21}) & 1 + \eta_{22} \end{pmatrix}. \qquad (3.3)$$
Assuming that the fluctuations are independent across neurons, the correlation matrix is therefore

$$C = \langle r r^T \rangle = \begin{pmatrix} \left(1+r_0^2\right)\left(1+\sigma_r^2\right) & 2r_0 \\ 2r_0 & \left(1+r_0^2\right)\left(1+\sigma_r^2\right) \end{pmatrix}. \qquad (3.4)$$
Next, after calculating the inverse of C, equation 2.4 is used to find the optimal weights, which are

$$W_1 = \frac{\sigma_r^2\left(1+r_0^2\right) + 1 - r_0^2}{\left(1+\sigma_r^2\right)^2\left(1+r_0^2\right)^2 - 4r_0^2}, \qquad W_2 = \frac{\sigma_r^2\left(1+r_0^2\right) - \left(1 - r_0^2\right)}{\left(1+\sigma_r^2\right)^2\left(1+r_0^2\right)^2 - 4r_0^2}\, r_0. \qquad (3.5)$$
Notice that these connections take into account the response variability through their dependence on σr . The next step is to corrupt these synaptic weights as prescribed by equation 2.7 and substitute the resulting
Figure 1: Noise interaction for a simple network of two input neurons and one output neuron (K = 1, N = M = 2). Both input responses and synaptic weights were corrupted by multiplicative gaussian noise. For all curves, solid lines are theoretical results, and symbols are simulation results averaged over 1000 networks and 100 trials per network. In all cases, r0 = 0.8. (A) Average square difference between observed and desired output responses, E W , as a function of the standard deviation (SD) of the response noise, σr . Squares and dashed line correspond to the error without synaptic noise (σW = 0); circles and solid lines correspond to the error with synaptic noise (σW = 0.15, 0.20, 0.25). (B) Dependence of the (uncorrupted) optimal weights W on σr .
expressions into equation 2.9. After making all the substitutions, calculating the averages, and simplifying, we obtain the average error,

$$E_W = \frac{1}{2}\left[\sigma_W^2\left(W_1^2 + W_2^2\right)\left(1+\sigma_r^2\right)\left(1+r_0^2\right) - W_1 - r_0 W_2 + 1\right]. \qquad (3.6)$$
This is the average square difference between the desired and actual responses of the output neuron given the two types of noise. It is a function of only three parameters, σr , σW , and r0 , because the optimal weights themselves depend on σr and r0 . The interaction between noise terms for this simple N = M = 2 case is illustrated in Figure 1A, which plots the error as a function of σr with and without synaptic variability. Here, dashed and solid lines represent the theoretical results given by equations 3.5 and 3.6, and symbols correspond to simulation results averaged over 1000 networks and 100 trials per network. Without synaptic noise (dashed line), the error increases monotonically with σr , as one would normally expect when adding response variability. In contrast, when σW = 0.15, 0.2, or 0.25 (solid lines), the error initially decreases and then starts increasing again, slowly approaching the curve obtained with response noise alone.
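These closed-form results are straightforward to cross-check against a direct simulation of the five-step procedure of section 2. The sketch below (ours, in Python with NumPy; parameter values follow Figure 1A, and all names are illustrative) compares equation 3.6 with a Monte Carlo estimate of equation 2.9:

```python
import numpy as np

rng = np.random.default_rng(0)

def optimal_weights(r0, sr):
    # Optimal weights for the 2 x 2 network, equation 3.5.
    den = (1 + sr**2)**2 * (1 + r0**2)**2 - 4 * r0**2
    W1 = (sr**2 * (1 + r0**2) + 1 - r0**2) / den
    W2 = (sr**2 * (1 + r0**2) - (1 - r0**2)) * r0 / den
    return np.array([W1, W2])

def error_theory(r0, sr, sw):
    # Average error of the corrupted network, equation 3.6.
    W1, W2 = optimal_weights(r0, sr)
    return 0.5 * (sw**2 * (W1**2 + W2**2) * (1 + sr**2) * (1 + r0**2)
                  - W1 - r0 * W2 + 1)

def error_mc(r0, sr, sw, n_nets=1000, n_trials=100):
    # Monte Carlo estimate of E_W (equation 2.9): corrupt the optimal
    # weights once per network (eq. 2.7) and add fresh multiplicative
    # response noise on every trial (eq. 2.5).
    rbar = np.array([[1.0, r0], [r0, 1.0]])   # mean rates, equation 3.2
    F = np.array([1.0, 0.0])                  # desired outputs, F = (1, 0)
    W = optimal_weights(r0, sr)
    total = 0.0
    for _ in range(n_nets):
        Wc = W * (1 + sw * rng.standard_normal(2))
        r = rbar[None] * (1 + sr * rng.standard_normal((n_trials, 2, 2)))
        R = np.einsum('i,tij->tj', Wc, r)
        total += np.mean((R - F) ** 2)
    return total / n_nets

for sr in (0.0, 0.1, 0.2, 0.3):
    print(sr, error_theory(0.8, sr, 0.25), error_mc(0.8, sr, 0.25))
```

The two columns should agree, and with σW = 0.25 the error should dip below its σr = 0 value for moderate response noise, reproducing the nonmonotonic solid curves of Figure 1A.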
Figure 1B shows how the optimal weights depend on σr. The solid lines were obtained from equations 3.5. The curves show that the effect of response noise is to decrease the absolute values of the optimal synaptic weights. Intuitively, that is why response variability is advantageous; smaller synaptic weights also mean smaller synaptic fluctuations, because the standard deviations (SD) are proportional to the mean values. So, there is a trade-off: the intrinsic effect of increasing σr is to increase the error, but with synaptic noise present, σr also decreases the magnitude of the weights, which lowers the impact of the synaptic fluctuations. That the impact of synaptic noise grows directly with the magnitude of the weights is also apparent from the first term in equation 3.6.

The magnitude of the noise interaction can be quantified by the ratio E min /E 0, where the numerator is the minimal value of the error curve and the denominator is the error obtained when only synaptic noise is present, that is, when σr = 0. The minimum error E min occurs at the optimal value of σr, denoted as σmin. The ratio E min /E 0 is equal to 1 if response variability provides no advantage and approaches 0 as σmin cancels more of the error due to synaptic noise. For the lowest solid curve in Figure 1A, the ratio is approximately 0.8, so response variability cancels about 20% of the square error generated by synaptic fluctuations. Note, however, that in these examples, the error is below E 0 for a large range of values of σr, not only near σmin, so response noise may be beneficial even if it is not precisely matched to the amount of synaptic noise.

Figure 2 further characterizes the strength of the interaction between the two types of noise. Figures 2A and 2B show how the error and the optimal amount of response variability vary as functions of σW. These graphs indicate that the fraction of the error that σr is able to compensate for, as well as the optimal amount of response noise, increases with the SD of the synaptic noise. The minimum error, E min, grows steadily with σW; clearly, σr cannot completely compensate for synaptic corruption. Also, σW has to be bigger than a critical value for the noise interaction to be observed (σW > 0.1, approximately). However, except when synaptic noise is very small, the optimal strategy is to add some response noise to the network.

As in the previous figure, symbols and lines in Figure 2 correspond to simulation and theoretical results, respectively. To obtain the latter, the key is to calculate σmin. This is done by first substituting the optimal synaptic weights of equation 3.5 into the expression for the average error, equation 3.6, and second, calculating the derivative of the error with respect to σr² and equating it to zero. The resulting expression gives σmin² as a function of the only two remaining parameters, σW and r0. The dependence, however, is highly nonlinear, so in general the solution is implicit:

$$\begin{aligned} &\sigma_r^8\left(1-\sigma_W^2\right) + 2\sigma_r^6\left(1 + a^2\left(1-2\sigma_W^2\right)\right) + 6\sigma_r^4 a^2\left(1-\sigma_W^2\right) \\ &\quad + 2\sigma_r^2 a^2\left(1 + a^2 + 2a^2\sigma_W^2 - 4\sigma_W^2\right) + a^4\left(1+3\sigma_W^2\right) - 4a^2\sigma_W^2 = 0, \end{aligned} \qquad (3.7)$$

where

$$a \equiv \frac{1-r_0^2}{1+r_0^2}. \qquad (3.8)$$

Figure 2: Optimal amount of response noise in the minimal classification network. Same network with two sensory neurons and one output neuron as in Figure 1. Lines and symbols indicate theoretical and simulation results, respectively, averaged over 1000 networks and 100 trials per network. (A) Strength of the noise interaction quantified by E min (dashed line) and E min /E 0 (solid line), as a function of σW, which determines the synaptic variability. Here and in B, r0 = 0.8. (B) Optimal amount of response variability, σmin, as a function of σW, for the same data in A. (C) Strength of the noise interaction as a function of r0, which parameterizes the discriminability of the mean input responses evoked by the two stimuli. Here and in D, σW = 1. (D) σmin, as a function of r0 for the same data in C.
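Returning to equation 3.7, its zero is easy to locate numerically. A minimal sketch (ours, NumPy assumed) treats the polynomial as a quartic in s = σr² and checks the σW = 1 case against the closed form of equation 3.9 derived below:

```python
import numpy as np

def sigma_min(r0, sw):
    # Solve equation 3.7 for s = sigma_r^2; a is defined in equation 3.8.
    a2 = ((1 - r0**2) / (1 + r0**2)) ** 2
    w2 = sw**2
    coeffs = [1 - w2,                                   # s^4
              2 * (1 + a2 * (1 - 2 * w2)),              # s^3
              6 * a2 * (1 - w2),                        # s^2
              2 * a2 * (1 + a2 + 2 * a2 * w2 - 4 * w2), # s^1
              a2**2 * (1 + 3 * w2) - 4 * a2 * w2]       # s^0
    roots = np.roots(coeffs)
    # Keep real positive roots; if several appear, the relevant one
    # should be checked against equation 3.6.
    s = [x.real for x in roots if abs(x.imag) < 1e-9 and x.real > 0]
    return np.sqrt(min(s)) if s else 0.0  # no positive root: sigma_min = 0

r0 = 0.8
print(sigma_min(r0, 1.0))   # numerical root at sigma_W = 1
print(np.sqrt((1 - r0**2)**(2 / 3) / (1 + r0**2)
              * ((1 + r0)**(2 / 3) + (1 - r0)**(2 / 3))))  # equation 3.9
```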
The value of σr that makes equation 3.7 true is σmin. For Figures 2A and 2B, the zero of the polynomial was found numerically for each combination of r0 and σW. Figures 2C and 2D show how E min, E min /E 0, and σmin depend on the separation between evoked input responses, as parameterized by r0. For these two plots, we chose a special case in which σmin can be obtained analytically from equation 3.7: σW = 1. In this particular case, the dependence of σmin on r0 has a closed form,

$$\sigma_{min}^2 = \frac{\left(1-r_0^2\right)^{2/3}}{1+r_0^2}\left((1+r_0)^{2/3} + (1-r_0)^{2/3}\right). \qquad (3.9)$$
This function is shown in Figure 2D. In general, the numerical simulations are in good agreement with the theory, except that the scatter in Figure 2D tends to increase as r0 approaches 0. This is due to a key feature of the noise interaction, which is that it depends on the overlap between input responses across stimuli. This can be seen as follows. First, notice that in Figure 2C, the relative error approaches 1 as r0 gets closer to 0. Thus, the noise interaction becomes weaker when there is less overlap between input responses, which is precisely what r0 represents in equation 3.2. If there is no overlap at all, the benefit of response noise vanishes. This fact explains why more than one neuron is needed to observe the noise interaction in the first place. This observation can be demonstrated analytically by setting r0 = 0 in equations 3.5 and 3.6, in which case the average square error becomes

$$E_W(r_0 = 0) = \frac{1}{2}\left[\frac{\sigma_W^2 - 1}{1 + \sigma_r^2} + 1\right]. \qquad (3.10)$$
This result has interesting implications. If σW² = 1, response noise makes no difference, so there is no optimal value. If σW² < 1, the error increases monotonically with response noise, so the optimal value is 0. And if σW² > 1, the optimal strategy is to add as much noise as possible. In this case, the variance of the output neuron is so high that there is no hope of finding a reasonable solution; the best thing to do is set the mean weights to zero, disconnecting the output unit. Thus, without overlap, either the synaptic noise is so high that the network is effectively useless, or, if σW is tolerable, response noise does not improve performance. At r0 = 0, the numerical solutions oscillate between these two extremes, producing an average error of 0.5 (left-most point in Figure 2C). In general, however, with nonzero overlap, there is a true optimal amount of response noise, and the more overlap there is, the larger its benefits are, as shown in Figure 2C.

The simulation data points in Figure 2 were obtained using fluctuations ξ and η in equations 2.7 and 3.3, respectively, sampled from gaussian
distributions. The results, however, were virtually identical when the distribution functions were either uniform or exponential. Thus, as noted earlier, the exact shapes of the noise distributions do not restrict the observed effect.

3.2 Regularization by Noise. Above, we mentioned that response noise tends to decrease the absolute value of the optimal synaptic weights. Why is this? The reason is that minimization of the mean square error in the presence of response noise is mathematically equivalent to minimization of the same error without response noise but with an imposed constraint forcing the optimal weights to be small. This is as follows.

Consider equation 2.4, which specifies the optimal weights in the two-layer network. Response noise enters into the expression through the correlation matrix. By separating the input responses into mean plus noise, we have

$$C = \left\langle (\bar{r} + \eta)(\bar{r} + \eta)^T \right\rangle = \bar{r}\bar{r}^T + \left\langle \eta\eta^T \right\rangle = \bar{r}\bar{r}^T + D_\sigma, \qquad (3.11)$$
where we have assumed that the noise is additive and uncorrelated across neurons (additivity is considered for simplicity but is not necessary). This results in the diagonal matrix Dσ containing the variances of individual units, such that element j along the diagonal is the total variance, summed over all stimuli, of input neuron j. Thus, uncorrelated response noise adds a diagonal matrix to the correlation between average responses. In that case, equation 2.4 can be rewritten as

$$W = F \bar{r}^T \left(\bar{r}\bar{r}^T + D_\sigma\right)^{-1}. \qquad (3.12)$$
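The decomposition 3.11 is simple to verify numerically. A minimal sketch (ours; NumPy assumed, with additive gaussian noise of a fixed SD per input neuron) compares a sampled correlation matrix with r̄r̄ᵀ + Dσ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, trials = 3, 5, 50000
rbar = rng.uniform(0.0, 1.0, (N, M))   # mean responses
sd = np.array([0.1, 0.2, 0.3])         # additive-noise SD of each input neuron

# Monte Carlo estimate of C = <r r^T> with r = rbar + eta.
C = np.zeros((N, N))
for _ in range(trials):
    r = rbar + sd[:, None] * rng.standard_normal((N, M))
    C += r @ r.T
C /= trials

# Equation 3.11: the diagonal matrix D_sigma holds each neuron's noise
# variance summed over the M stimuli.
D_sigma = np.diag(M * sd**2)
print(np.round(C - (rbar @ rbar.T + D_sigma), 3))   # approximately zero
```

With C estimated this way, equation 3.12 follows by inserting the decomposition into equation 2.4.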
Now consider the mean square error without any noise but with an additional term that penalizes large weights. To restrict, for instance, the total synaptic weight provided by each input neuron, add the penalty term

$$\frac{1}{KM} \sum_{i,j} \lambda_i\, w_{ji}^2 \qquad (3.13)$$
to the original error expression, equation 2.3. Here, λi determines how much input neuron i is taxed for its total synaptic weight. Rewriting this as a trace, the total error to be minimized in this case becomes

$$E = \frac{1}{KM}\left( \mathrm{Tr}\left[(w\bar{r} - F)(w\bar{r} - F)^T\right] + \mathrm{Tr}\left(w D_\lambda w^T\right) \right), \qquad (3.14)$$
where Dλ is a diagonal matrix that contains the penalty coefficients λi along the diagonal. The synaptic weights that minimize this error function are given by

$$F \bar{r}^T \left(\bar{r}\bar{r}^T + D_\lambda\right)^{-1}. \qquad (3.15)$$
But this solution has exactly the same form as equation 3.12, which minimizes the error in the presence of response noise alone, without any other constraints. Therefore, adding response noise is equivalent to imposing a constraint on the magnitude of the synaptic weights, with more noise corresponding to smaller weights. The penalty term in equation 3.13 can also be interpreted as a regularization term, which refers to a common type of constraint used to force the solution of an optimization problem to vary smoothly (Hinton, 1989; Haykin, 1999). Therefore, as has been pointed out previously (Bishop, 1995), the effect of response fluctuations can be described as regularization by noise.

In our model, we assumed that the fluctuations in synaptic connections are proportional to their size. What happens, then, is that response noise forces the optimal weights to be small, and this significantly decreases the part of the error that depends on σW. In this way, smaller synaptic weights—and therefore a nonzero σr—typically lead to smaller output errors.

Another way to look at the relationship between the two types of noise is to calculate the optimal mean synaptic weights taking the synaptic variability directly into account. For simplicity, suppose that there is no response noise. Substitute equation 2.7 directly into equation 2.3, and minimize with respect to W, now averaging over the synaptic fluctuations. With multiplicative noise, the result is again an expression similar to equations 3.12 and 3.15, where a correction proportional to the synaptic variance is added to the diagonal of the correlation matrix. In contrast, with additive synaptic noise, the resulting optimal weights are exactly the same as without any variability, because this type of noise cannot be compensated for. Therefore, the recipe for counteracting response noise is equivalent to the recipe for counteracting multiplicative synaptic noise. An argument outlining why this is generally true is presented in section 6.1.

3.3 Classification in Larger Networks. When the simple classification task is extended to larger numbers of first-layer neurons (N > 2) and more input stimuli to classify (M > 2), an important question can be studied: How does the interaction between synaptic and response noise depend on the dimensionality of the problem, that is, on N and M? To address this issue, we did the following. Each entry in the N × M matrix r̄ of mean responses was taken from a uniform distribution between 0 and 1. The desired output still consisted of a single neuron’s response given by equation 3.1, as before. So each one of the M input stimuli evoked a set of N neuronal responses,
Figure 3: Interaction between synaptic noise and response noise during the classification of M input stimuli. For each stimulus, the mean responses of N input neurons were randomly selected from a uniform distribution between 0 and 1. The output unit of the network had to classify the M response patterns by producing either a 1 or a 0. The synaptic noise SD was σW = 0.5. Results (circles) are averages over 1000 networks and 100 trials per network. All data are from computer simulations. (A) Relative error, E min /E 0 , as a function of the number of input neurons, N. The number of stimuli was kept constant at M = 10. (B) Optimal value of the response noise SD, σmin , as a function of the number of input neurons, N. Same simulations as in A. (C) Relative error as a function of the number of input stimuli, M. The number of input neurons was kept constant at N = 10. (D) Optimal value of the response noise SD as a function of M for the same simulations as in C.
each set drawn from the same distribution, and the output neuron had to divide the M evoked firing rate patterns into two categories. The optimal amount of response noise was found, and the process was repeated for different combinations of N and M. The results from these simulations are shown in Figure 3. All data points were obtained with the same amount of synaptic variability,
σW = 0.5. Each point represents an average over 1000 networks for which the optimal connections were corrupted. The amount of response noise that minimized the error, averaged over those 1000 corruption patterns, was found numerically by calculating the average error with the same mean responses and corruption patterns but different σr. For each combination of N and M, this resulted in σmin, which is shown in Figure 3B. The actual average error obtained with σr = σmin divided by the error for σr = 0 is shown in Figure 3A, as in Figure 2.

Interestingly, the benefit conferred by response noise depends strongly on the difference between N and M. With M = 10 input stimuli, the effect of response noise is maximized when N = 10 neurons are used to encode them (see Figure 3A), and vice versa, when there are N = 10 neurons in the network, the maximum effect is seen when they encode M = 10 stimuli (see Figure 3C). Results with other numbers (5, 20, and 40 stimuli or neurons) were the same: response noise always had a maximum impact when N = M. This is not unreasonable. When there are many more neurons than stimuli, a moderate amount of synaptic corruption causes only a small error, because there is redundancy in the connectivity matrix. On the other hand, when there are many more input stimuli than neurons, the error is large anyway, because the N neurons cannot possibly span all the required dimensions, M. Thus, at both extremes, the impact of synaptic noise is limited. In contrast, when N = M, there is no redundancy, but the output error can potentially be very small, so the network is most sensitive to alterations in synaptic connectivity. Thus, response noise makes a big difference when the number of responses and the number of independent stimuli encoded are equal or nearly so. In Figures 3A and 3C, the relative error is not zero for N = M, but it is quite small (E min = 0.23, E min /E 0 = 0.004). This is primarily because the error without any response noise, E 0, can be very large. Interestingly, the optimal amount of response noise also seems to be largest when N = M, as suggested by Figures 3B and 3D.

In contrast to previous examples, for all data points in Figure 3, the fluctuations in the synapses and in the firing rates, ξ and η, were drawn from uniform rather than gaussian distributions. As mentioned before, the variances of the underlying distributions should matter, but their shapes should not. Indeed, with the same variances, results for Figure 3 were virtually identical with gaussian or exponential distributions.

A potential concern in this network is that although the variability of the output neuron depends on the interaction between the two types of noise, perhaps the interaction is of little consequence with respect to actual classification performance. The relevant measure for this is the probability of correct classification, pc. This probability is obtained by comparing the distributions of output responses to stimuli in one category versus the other, which is typically done using standard methods from signal detection theory (Dayan & Abbott, 2001). The algorithm underlying the calculation is quite simple: in each trial, the stimulus is assumed to belong to class 1
if the output firing rate is below a threshold; otherwise, the stimulus belongs to class 2. To obtain pc, the results should be averaged over trials and stimuli. Finally, note that an optimal threshold should be used to obtain the highest possible pc.

We performed this analysis on the data in Figure 3. Indeed, pc also depended nonmonotonically on response variability. For instance, for N = M = 10, the values with and without response noise were pc(σr = σmin) = 0.83 and pc(σr = 0) = 0.75, where chance performance corresponds to 0.5. Also, the maximum benefit of response noise occurred for N = M and decreased quickly as the difference between N and M grew, as in Figures 3A and 3C. However, the amount of response noise that maximized pc was typically about one-third of the amount that minimized the mean square error. Thus, the best classification probability for N = M = 10 was pc(σr = 0.13) = 0.91. Maximizing pc is not equivalent to minimizing the mean square error; the two quantities weight differently the bias and variance of the output response (see Haykin, 1999). Nevertheless, response noise can also counteract part of the decrease in pc due to synaptic noise, so its beneficial impact on classification performance is real.

4 Noise Interactions in a Sensorimotor Network

To illustrate the interactions between synaptic and response noise in a more biologically realistic situation, we apply the general approach outlined in section 2 to a well-known model of sensorimotor integration in the brain. We consider the classic coordinate transformation problem in which the location of an object, originally specified in retinal coordinates, becomes independent of gaze angle. This type of computation has been thoroughly studied both experimentally (Andersen, Essick, & Siegel, 1985; Brotchie, Andersen, Snyder, & Goodman, 1995) and theoretically (Zipser & Andersen, 1988; Salinas & Abbott, 1995; Pouget & Sejnowski, 1997) and is thought to be the basis for generating representations of object location relative to the body or the world. Also, the way in which visual and eye-position signals are integrated here is an example of what seems to be a general principle for combining different information streams in the brain (Salinas & Thier, 2000; Salinas & Sejnowski, 2001). Such integration by “gain modulation” may have wide applicability in diverse neural circuits (Salinas, 2004), so it represents a plausible and general situation in which computational accuracy is important.

From the point of view of the phenomenon at hand, the constructive effect of response noise, this example addresses an important issue: whether the noise interaction is still observed when network performance depends on a population of output neurons. In the classification task, performance was quantified through a single neuron’s response, but in this case, it depends on a nonlinear combination of multiple firing rates, so maybe the
impact of response noise washes out in the population average. As shown below, this is not the case.

The sensorimotor network has, as before, a feedforward architecture with two layers. The first layer contains N gain-modulated sensory units, and the second or output layer contains K motor units. Each sensory neuron is connected to all output neurons through a set of feedforward connections, as illustrated in Figure 4B. The sensory neurons are sensitive to two quantities: the location (or direction) of a target stimulus x, which is in retinal coordinates, and the gaze (or eye-position) angle y. The network is designed so that the motor layer generates or encodes a movement in a direction z, which represents the direction of the target relative to the head. The idea is that the profile of activity of the output neurons should have a single peak centered at direction z. The correct (i.e., desired) relationship between inputs and outputs is z = x − y, which is approximately how the angles x and y should be combined in order to generate a head-centered representation of target direction (Zipser & Andersen, 1988; Salinas & Abbott, 1995; Pouget & Sejnowski, 1997). In other words, z is the quantity encoded by the output neurons, and it should relate to the quantities encoded by the sensory neurons through the function z(x, y) = x − y. Many other functions are possible, but as far as we can tell, the choice has little impact on the qualitative effect of response noise.

In this model, the mean firing rate of sensory neuron i is characterized by a product of two tuning functions, fi(x) and gi(y), such that

$$\bar{r}_i(x, y) = r_{max}\, f_i(x)\left(1 - D + D\, g_i(y)\right) + r_B, \qquad (4.1)$$
where rB = 4 spikes per second is a baseline firing rate, rmax = 35 spikes per second, and D is the modulation depth, which is set to 0.9 throughout. The sensory neurons are gain modulated because they combine the information from their two inputs nonlinearly. The amplitude—but not the selectivity—of a visually triggered response, represented by fi(x), depends on the direction of gaze (Andersen et al., 1985; Brotchie et al., 1995; Salinas & Thier, 2000).

Figure 4: Network model of a sensorimotor transformation. In this network, N = 400, K = 25, M = 400. Target and movement directions, x and z, respectively, vary between −25 and 25, whereas gaze angle y varies between −15 and 15. The graphs correspond to a single trial in which x = −10, y = 10, and z = x − y = −20. Neither response noise nor synaptic corruption was included in this example. (A) Firing rates of the 400 gain-modulated input neurons arranged according to preferred stimulus location. (B) Network architecture. (C) Firing rates of the 25 output motor neurons arranged according to preferred target location.

Note that in the expression above, the second index of the mean rate r̄ij has been replaced by parentheses, indicating a dependence on x and y. This is to simplify the notation; the responses can still be arranged in a matrix r if each value of the second index is understood to indicate a particular combination of values of x and y. For example, if the rates were evaluated in a grid with 10 x points and 10 y points, the second index would run from 1 to 100, covering all combinations. Indeed, this is how it is done in the computer.

For simplicity, the tuning curves for different neurons in a given layer are assumed to have the same shape but different preferred locations or center points, which are always between −25 and 25. Visual responses are modeled as gaussian tuning functions of stimulus location x,

$$f_i(x) = \exp\left(-\frac{(x - a_i)^2}{2\sigma_f^2}\right), \qquad (4.2)$$
where ai is the preferred location and σf = 4 is the tuning curve width. The dependence on eye position is modeled using sigmoidal functions of the gaze angle y,

$$g_i(y) = \frac{1}{1 + \exp\left(-(b_i - y)/d_i\right)}, \qquad (4.3)$$
where bi is the center point of the sigmoid and di is chosen randomly between −7 and +7 to make sure that the curves gi(y) have different slopes for different neurons in the array. In each trial of the task, response variability is included by applying a variant of equation 2.5:

$$r_{ij} = \bar{r}_{ij} + \sqrt{\bar{r}_{ij}}\,\eta_{ij}. \qquad (4.4)$$
This makes the variance of the rates proportional to their means, which in general is in good agreement with experimental data (Dean, 1981; Softky & Koch, 1992, 1993; Holt, Softky, Koch, & Douglas, 1996). This choice, however, is not critical (see below). The desired response for each output neuron is also described by a gaussian,

$$F_k(z) = r_{max} \exp\left(-\frac{(z - c_k)^2}{2\sigma_F^2}\right) + r_B, \qquad (4.5)$$
where σ F = 4 and c k is the preferred target direction of motor neuron k. This expression gives the intended response of output unit k in terms of the
encoded quantity z. Keep in mind, however, that the desired dependence on the sensory inputs is obtained by setting z = x − y. When driven by the first-layer neurons, the output rates are still calculated through a weighted sum,

$$R_k(z) = R_k(x, y) = \sum_{i=1}^{N} W_{ki}\, r_i(x, y). \qquad (4.6)$$
This is equivalent to equation 2.1 but with the second index defined implicitly through x and y, as mentioned above. The optimal synaptic connections Wki are determined exactly as before, using equation 2.4.

Typical profiles of activity for input and output neurons are shown in Figures 4A and 4C for a trial with x = −10 and y = 10. The sensory neurons are arranged according to their preferred stimulus location ai, whereas the motor neurons are arranged according to their preferred movement direction ck. For this sample trial, no variability was included; the firing rate values in Figure 4A are scattered under a gaussian envelope (given by equation 4.2) because the gaze-dependent gain factors vary across cells. Also, the output profile of activity is gaussian and has a peak at the point z = −20, which is exactly where it should be given that the correct input-output transformation is z = x − y. With noise, the output responses would be scattered around the gaussian profile and the peak would be displaced.

The error used to measure network performance is, in this case,

$$E_{pop} = \left\langle |z - Z| \right\rangle. \qquad (4.7)$$

This is the absolute difference, averaged over trials and networks, between the desired movement direction z—the actual head-centered target direction—and the direction Z that is encoded by the center of mass of the output activity,

$$Z = \frac{\sum_i (R_i - r_B)^2\, c_i}{\sum_k (R_k - r_B)^2}. \qquad (4.8)$$
Therefore, equation 4.7 gives the accuracy with which the whole motor population represents the head-centered direction of the target, whereas equation 4.8 provides the recipe to read out such output activity. Now the idea is to corrupt the optimal connections and evaluate E pop using various amounts of response noise to determine whether there is an optimum. Relative to the previous examples, the key differences are, first, that the error in equation 4.7 represents a population average, and second, that although the connections are set to minimize the average difference between desired and driven firing rates, the performance criterion is not based directly on it.
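The complete model is compact enough to sketch in full. In the code below (ours, Python with NumPy), the grids of preferred locations, the placement of the sigmoid centers bi, and the clipping of near-zero slopes di are our choices where the text leaves them open; everything else follows equations 4.1 to 4.8, with synaptic corruption by the random weight elimination described next:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 400, 25
r_max, r_B, D, sig_f, sig_F = 35.0, 4.0, 0.9, 4.0, 4.0

a = np.linspace(-25, 25, N)                 # preferred locations a_i
b = np.linspace(-15, 15, N)                 # sigmoid centers b_i (our choice)
d = rng.uniform(-7, 7, N)                   # random slopes d_i
d[np.abs(d) < 0.5] = 0.5                    # avoid near-zero slopes (our choice)
c = np.linspace(-25, 25, K)                 # preferred directions c_k

x = np.repeat(np.linspace(-25, 25, 20), 20) # M = 400 combinations of (x, y)
y = np.tile(np.linspace(-15, 15, 20), 20)

f = np.exp(-(x - a[:, None])**2 / (2 * sig_f**2))         # equation 4.2
g = 1.0 / (1.0 + np.exp(-(b[:, None] - y) / d[:, None]))  # equation 4.3
rbar = r_max * f * (1 - D + D * g) + r_B                  # equation 4.1
F = r_max * np.exp(-((x - y) - c[:, None])**2 / (2 * sig_F**2)) + r_B  # eq. 4.5

sig_r, p_w = 1.0, 0.2
# Correlation matrix for the noise model of equation 4.4:
# <r r^T> = rbar rbar^T + sig_r^2 diag(sum_j rbar_ij).
C = rbar @ rbar.T + sig_r**2 * np.diag(rbar.sum(axis=1))
W = F @ rbar.T @ np.linalg.pinv(C)                        # equation 2.4
W = W * (rng.random(W.shape) > p_w)                       # weight elimination

r = rbar + np.sqrt(rbar) * sig_r * rng.standard_normal(rbar.shape)  # eq. 4.4
R = W @ r                                                 # equation 4.6
Z = ((R - r_B)**2 * c[:, None]).sum(0) / ((R - r_B)**2).sum(0)      # eq. 4.8
print(np.mean(np.abs((x - y) - Z)))                       # E_pop, equation 4.7
```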
Figure 5: Noise interaction for the sensorimotor network depicted in Figure 4. Results are averaged over 100 networks and 100 trials per network. All data are from computer simulations. (A) Average absolute deviation between actual and encoded target locations, E pop , as a function of response noise. Continuous lines are for three probabilities of weight elimination, pW = 0.1, 0.3 and 0.5; the dashed line corresponds to pW = 0. (B) Magnitude of the noise interaction, measured by the relative error E min /E 0 , as a function of the number of input neurons, N, for pW = 0.2. (C) E min and E min /E 0 as functions of pW . (D) Optimal response noise SD, σmin , as a function of pW .
Simulation results for this sensorimotor model are presented in Figure 5. A total of 400 sensory and 25 output neurons were used. These units were tested with all combinations of 20 values of x and 20 values of y, uniformly spaced (thus, M = 400). Synaptic noise was generated by random weight elimination. This means that after having set the connections to their optimal values given by equation 2.4, each one was reset to zero with a probability pW . Thus, on average, a fraction pW of the weights in each network was eliminated. As shown in Figure 5A, when pW > 0, the error between the encoded and the true target direction has a minimum with respect to σr . These error curves represent averages over 100 networks. Interestingly, the
benefit of noise does not decrease when more sensory units are included in the first layer (see Figure 5B). That is, if pW is constant, the proportion of eliminated synapses does not change, so the error caused by synaptic corruption cannot be reduced simply by adding more neurons.

Figure 5C shows the minimum and relative errors as functions of pW. This graph highlights the substantial impact that response noise has on this network: the relative error stays below 0.2 even when about a third of the synapses are eliminated. This is not only because the error without response noise is high, but also because the error with an optimal amount of noise stays low. For instance, with pW = 0.3 and σr = σmin, the typical deviation from the correct target direction is about 2 units, whereas with σr = 0, the typical deviation is about 10. Response noise thus cuts the deviation by about a factor of five, and importantly, the resulting error is still small relative to the range of values of z, which spans 50 units. Also, as observed in the classification task, in general it is better to include response noise even if σr is not precisely matched to the amount of synaptic variability (see Figure 5A).

Figure 5D plots σmin as a function of the probability of synaptic elimination. The optimal amount of response noise increases with pW and reaches fairly high levels. For instance, at a value of 1, which corresponds to pW near 0.15, the variance of the firing rates is equal to their mean, because of equation 4.4. We wondered whether the scaling law of the response noise would make any difference, so we reran the simulations with either additive noise (SD independent of mean) or noise with an SD proportional to the mean, as in equation 2.5. Results in these two cases were very similar: E min and E min /E 0 varied very much as in Figure 5C, and the optimal amount of noise grew monotonically with pW, as in Figure 5D.

5 Noise Interactions in a Recurrent Network

The networks discussed in the previous sections had a feedforward architecture, and in those cases the contribution of response noise to the correlation matrix between neuronal responses could be determined analytically. In contrast, in recurrent networks, the dynamics are more complex and the effects of random fluctuations more difficult to ascertain. To investigate whether response noise can still counteract some of the effects of synaptic variability, we consider a recurrent network with a well-defined function and relatively simple dynamics characterized by attractor states. When the firing rates in this network are initialized at arbitrary values, they eventually stop changing, settling down at certain steady-state points in which some neurons fire intensely and others do not. The optimal weights sought are those that allow the network to settle at predefined sets of steady-state responses, and the error is thus defined in terms of the difference between the desired steady states and the observed ones. As before, response noise is taken into account when the optimal synaptic weights are generated,
although in this case, the correction it introduces (relative to the noiseless case) is an approximation.

The attractor network consists of N continuous-valued neurons, each of which is connected to all other units via feedback synaptic connections (Hertz et al., 1991). With the proper connectivity, such a network can generate, without any tuned input, a steady-state profile of activity with a cosine or gaussian shape (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Salinas, 2003). Such stable “bump”-shaped activity is observed in various neural models, including those for cortical hypercolumns (Hansel & Sompolinsky, 1998), head-direction cells (Zhang, 1996; Laing & Chow, 2001), and working memory circuits (Compte, Brunel, Goldman-Rakic, & Wang, 2000). Below, we find the connection matrix that allows the network to exhibit a unimodal activity profile centered at any point within the array.

5.1 Optimal Synaptic Weights in a Recurrent Architecture. The dynamics of the network are determined by the equation

$$\tau \frac{dr_i}{dt} = -r_i + h\!\left(\sum_j W_{ij}\, r_j\right) + \eta_i, \qquad (5.1)$$
where τ = 10 is the integration time constant, ri is the response of neuron i, and h is the activation function of the cells, which relates total current to firing rate. The sigmoid function h(x) = 1/(1 + exp(−x)) is used, but this choice is not critical. As before, ηi represents the response fluctuations, which are drawn independently for each neuron in every time step. In this case, they are gaussian, with zero mean and a variance σr²/Δt. The variance of ηi is divided by the integration time step Δt to guarantee that the variance of the rate ri remains independent of the time step (van Kampen, 1992).

For our purposes, manipulating this type of network is easier if the equations are expressed in terms of the total input currents to the cells (Hertz, Krogh, & Palmer, 1991; Dayan & Abbott, 2001). If the current for neuron i is ui = Σj Wij rj, then

$$\tau \frac{du_i}{dt} = -u_i + \sum_j W_{ij}\left(h(u_j) + \eta_j\right) \qquad (5.2)$$
is equivalent to equation 5.1. A stationary solution of equation 5.2 without input noise is such that all derivatives become zero. This corresponds to an attractor state α for which

$$u_i^\alpha = \sum_j W_{ij}\, h\!\left(u_j^\alpha\right). \qquad (5.3)$$
Figure 6: Steady-state responses of a recurrent neural network with 20 neurons. Results show the input currents of all units after 1000 ms of simulation time, with responses evolving according to equation 5.2. Each neuron is labeled by an angle between −180 degrees and 180 degrees. (A) Steady-state responses for four sets of initial conditions with peaks near units −90 degrees, 0 degree, 90 degrees, and 180 degrees. The observed activity profiles are indistinguishable from the desired gaussian curves. Neither synaptic nor response noise was included in this example. (B) Steady-state responses with and without noise. The desired activity profile is indicated by the solid line. The dotted line corresponds to the activity observed with noise after 1000 ms of simulation time, having started with an initial condition equal to the desired steady state. Vertical lines indicate the locations of the corresponding centers of mass. The absolute deviation is 34 degrees. Here, σr = 0.3 and pW = 0.02.
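Equation 5.2 can be integrated with a simple Euler scheme; the only subtlety is the noise scaling described in section 5.1, with each sample of ηj drawn with variance σr²/Δt. A minimal sketch (ours, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):
    # Sigmoidal activation relating total current to firing rate.
    return 1.0 / (1.0 + np.exp(-x))

def run_network(W, u0, sig_r, T=1000.0, dt=0.1, tau=10.0):
    # Euler integration of equation 5.2.  Each noise sample eta_j has
    # variance sig_r**2 / dt, so the variance of the rates does not
    # depend on the integration time step.
    u = np.array(u0, dtype=float)
    for _ in range(int(T / dt)):
        eta = (sig_r / np.sqrt(dt)) * rng.standard_normal(u.size)
        u += (dt / tau) * (-u + W @ (h(u) + eta))
    return u
```

With a weight matrix constructed as below (equations 5.5 to 5.7), u0 can be set near a desired attractor and the drift of the center of mass tracked over time.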
The label α is used because the network may have several attractors or sets of fixed points. The desired steady-state currents are denoted as Uiα. These are gaussian profiles of activity such that during steady state α = 1, neuron 1 is the most active (i.e., the gaussian is centered at neuron 1), during steady state α = 2, neuron 2 is the most active, and so on.

Figure 6 illustrates the activity of the network at four steady states in the absence of noise (σW = 0 = σr). To make the network symmetric, the neurons were arranged in a ring, so their activity profiles wrap around. Because of this, each neuron is labeled with an angle. The observed currents ui settle down at values that are almost exactly equal to the desired ones, Uiα. The synaptic connections that achieved this match were found by enforcing the steady-state condition 5.3 for the desired attractors. That is, we minimized

$$E = \frac{1}{N_A} \sum_{\alpha=1}^{N_A} \sum_i \left( U_i^\alpha - \sum_j W_{ij}\, h\!\left(U_j^\alpha\right) \right)^2, \qquad (5.4)$$
where Uiα is a (wrap-around) gaussian function of i centered at α and NA is the number of attractors; in the simulations, NA is always equal to the number of neurons, N. This procedure leads to an expression for the optimal weights equivalent to equation 2.4. Thus, without response noise,

$$W = L\, C^{-1}, \qquad (5.5)$$
where

$$L_{ij} = \frac{1}{N_A} \sum_\alpha U_i^\alpha\, h\!\left(U_j^\alpha\right), \qquad C_{ij} = \frac{1}{N_A} \sum_\alpha h\!\left(U_i^\alpha\right) h\!\left(U_j^\alpha\right). \qquad (5.6)$$
To include the effects of response noise, we add a correction to the diagonal of the correlation matrix, as in the previous cases (see section 3.2). We thus set

$$C_{ij} = \frac{1}{N_A} \sum_\alpha h\!\left(U_i^\alpha\right) h\!\left(U_j^\alpha\right) + \delta_{ij}\, a\, \frac{\sigma_r^2}{2\tau}, \qquad (5.7)$$
where a is a proportionality constant. The rationale for this is as follows. Strictly speaking, equation 5.2 with response noise does not have a steady state. But consider the simpler case of a single variable u with a constant asymptotic value u∞, such that

$$\tau \frac{du}{dt} = -u + u_\infty + \eta. \qquad (5.8)$$
If the trajectory u(t) from t = 0 to t = T is calculated many times, starting from the same initial condition, the distribution of end points u(T) has a well-defined mean and variance, which vary smoothly as functions of T. The mean is always equal to the end point that would be observed without noise, whereas for T much longer than the integration time constant τ, the variance is equal to the variance of the fluctuations on the right-hand side of equation 5.8 divided by 2τ (van Kampen, 1992). These considerations suggest that we minimize

$$E = \frac{1}{N_A} \sum_{\alpha,i} \left\langle \left( U_i^\alpha - \sum_j W_{ij} \left( h\!\left(U_j^\alpha\right) + \sqrt{a}\,\tilde{\eta}_j \right) \right)^2 \right\rangle, \qquad (5.9)$$
where the variance of η̃j is σr²/(2τ). This leads to equation 5.5 with the corrected correlation matrix given by equation 5.7.
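Putting equations 5.5 to 5.7 together, the corrected weights follow directly from the desired attractor profiles. A minimal sketch (ours; the bump amplitude and width are illustrative choices, not taken from the text):

```python
import numpy as np

N = 20                       # neurons; N_A = N attractors
tau, sig_r, a_corr = 10.0, 0.3, 0.5
theta = np.linspace(-180.0, 180.0, N, endpoint=False)

# Desired steady-state currents U[i, alpha]: wrap-around gaussian bumps.
dist = np.abs(theta[:, None] - theta[None, :])
dist = np.minimum(dist, 360.0 - dist)
U = 2.0 * np.exp(-dist**2 / (2.0 * 40.0**2)) - 1.0   # illustrative bump shape

def h(x):
    return 1.0 / (1.0 + np.exp(-x))

H = h(U)                                    # H[i, alpha] = h(U_i^alpha)
L = U @ H.T / N                             # equation 5.6
C = H @ H.T / N + np.eye(N) * a_corr * sig_r**2 / (2.0 * tau)  # equation 5.7
W = L @ np.linalg.pinv(C)                   # equation 5.5

# Residual of the steady-state condition (equation 5.3) for one attractor;
# it is small but nonzero because of the diagonal correction.
print(np.max(np.abs(U[:, 0] - W @ h(U[:, 0]))))
```

The residual illustrates the trade-off: the diagonal correction sacrifices a little steady-state accuracy in exchange for robustness to the response fluctuations.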
5.2 Performance of the Attractor Network. To evaluate the performance of this network, we compare the center of mass of the desired activity profile to that of the observed profile tracked during a period of time. For a particular attractor α, the network is first initialized very close to that desired steady state; then equation 5.2 is run for 1000 ms (100 time constants τ), and the absolute difference between the initial and the current centers of mass is recorded during the last 500 ms. The error for the recurrent networks E rec is defined as the absolute difference averaged over this time period and all attractor states, that is, all values of α. Also, when there is synaptic noise, an additional average over networks is performed. This error function is similar to equation 4.7, except that the circular topology is taken into account. Thus, E rec is the mean absolute difference between desired and observed centers of mass. It is expressed in degrees.

Before exploring the interaction between synaptic and response noise, we used E rec to test whether the noise-dependent correction to the correlation matrix in equation 5.7 was appropriate. To do this, a recurrent network without synaptic fluctuations was simulated multiple times with different values of the parameter a and various amounts of response noise. The desired attractors were kept constant. The resulting error curves are shown in Figure 7A. Each one gives the average absolute deviation between desired and observed centers of mass as a function of σr for a different value of a. The dependence on a was nonmonotonic. The optimal value we found was 0.5, which corresponds to the lowest curve (dashed) in the figure. This curve was well below the one observed without adjusting the synaptic weights. Therefore, the correction was indeed effective.

Figure 7B shows E rec as a function of σr when synaptic noise is also present in the recurrent network. The three solid curves correspond to nets in which synapses were randomly eliminated with probabilities pW = 0.005, 0.015, and 0.025. As with previous network architectures, a nonzero amount of response noise improves performance relative to the case where no response noise is injected. In this case, however, the mean absolute error is already about 25 degrees at the point at which response noise starts making a difference, around pW = 0.005 (see Figure 7C). This is not surprising: these types of networks are highly sensitive to changes in their synapses, so even small mismatches can lead to large errors (Seung, Lee, Reis, & Tank, 2000; Renart, Song, & Wang, 2003). Also, Figure 7C shows that the ratio E min /E 0 does not fall below 0.6, so the benefit of noise is not as large as in previous examples. The effect was somewhat weaker when synaptic variability was simulated using gaussian noise with SD σW instead of random synaptic elimination. Nevertheless, it is interesting that the interaction between synaptic and response noise is observed at all under
Figure 7: Interaction between synaptic and response noise in recurrent networks. (A) Average absolute difference between desired and observed centers of mass as a function of σr. Units are degrees. The different curves are for a = 0, 1.5, 1, and 0.5, from left to right. The lowest curve (dashed) was obtained with a = 0.5, confirming that the synaptic weights are optimized when response noise is taken into account. (B) Average error E_rec as a function of response noise. Continuous lines are for three probabilities of weight elimination, pW = 0.005, 0.015, and 0.025; the dashed line corresponds to pW = 0. Here and in the following panels, a = 0.5. (C) E_min/E_0 (left y-axis) and E_min (right y-axis) as functions of pW. (D) Optimal response noise SD, σmin, as a function of pW for the same data as in C.
these conditions, given that the response dynamics are richer and that the minimization of equation 5.9 may not be the best way to produce the desired steady-state activity.

6 Discussion

6.1 Why Are Synaptic and Response Fluctuations Equivalent? We have investigated the simultaneous action of synaptic and response fluctuations on the performance of neural networks and found an interaction or
equivalence between them: when synaptic noise is multiplicative, its effect is similar to that of response noise. At heart, this is a simple consequence of the product of responses and synaptic weights contained in most neural models, which has the form Σ_j Wj rj. With multiplicative noise in one of the variables, this weighted sum turns into Σ_j Wj (1 + ξj) rj, which is the same whether it is the synapse or the response that fluctuates. In either case, the total stochastic component Σ_j Wj ξj rj scales with the synaptic weights. The same result is obtained with additive response noise. Additive synaptic noise behaves differently, however. It instead leads to a total fluctuation Σ_j ξj rj that is independent of the mean weights. Evidently, in this case, the mean values of the weights have no effect on the size of the fluctuations. Thus, the key requirement for some form of equivalence between the two noise sources is that the synaptic fluctuations must depend on the strength of the synapses.

This condition was applied to the three sets of simulations presented above, which corresponded to the classification of arbitrary response patterns, a sensorimotor transformation, and the generation of multiple self-sustained activity profiles. This selection of problems was meant to illustrate the generality of the observations outlined in the above paragraph. And indeed, although the three problems differed in many respects, the results were qualitatively the same.

We should also point out that in all the simulations, the criterion used to determine the optimality of the synaptic weights was based on a mean square error. But perhaps the noise interaction changes when a different criterion is used. To investigate this, we performed additional simulations of the small 2×1 network in which the optimal synaptic weights were those that minimized a mean absolute deviation; thus, the square in equation 2.2 was substituted with an absolute value. In this case, everything proceeded as before, except that the mean weight values W had to be found numerically. For this, the averages were performed explicitly, and the downhill simplex method was used to search for the best weights (Press, Teukolsky, & Vetterling, 1992). The results, however, were very similar to those in Figure 2A. Although the shapes of the curves were not exactly the same, the relative and minimum errors found with the absolute value criterion varied with σW in much the same way as with the mean-square error criterion. Therefore, our conclusions do not seem to depend strongly on the specific function used to weight the errors and find the best synaptic connection values.

6.2 When Should Response Noise Increase? According to the argument above, the most general way to state our results is this: assuming that neuronal activities are determined by weighted sums, any mechanism that is able to dampen the impact of response noise will automatically reduce the impact of multiplicative synaptic noise as well. Furthermore, we suggest that under some circumstances, it is better to add more response noise
and increase the dampening factor than to ignore the synaptic fluctuations altogether. There are two conditions for this scenario to make sense. First, the network must be highly sensitive to changes in connectivity. This can be seen, for instance, in Figure 3A, which shows that the highest benefit of response noise occurs when the number of neurons matches the number of conditions to be satisfied; it is at this point that the connections need to be most accurate. Second, the fluctuations in connectivity cannot be evaluated directly. That is, why not take into account the synaptic noise in exactly the same way as the response noise when the optimal connections are sought? For example, the average in equation 2.3 could also include an average over networks (synaptic fluctuations), in which case the optimal mean weights would depend not only on σr but also on σW. In the simulations, this could certainly be done and would lead to smaller errors. But we explicitly consider the possibility that either σW is unknown a priori, or there is no separate biophysical mechanism for implementing the corresponding corrections to the synaptic connections.

Condition 2 is not unreasonable. Realistic networks with high synaptic plasticity must incorporate mechanisms to ensure that ongoing learning does not disrupt their previously acquired functionality. Thus, synaptic modification rules need to achieve two goals: establish new associations that are relevant for the current behavioral task and make adjustments to prevent interference from other, future associations. The latter may be particularly difficult to achieve if learning rates change unpredictably with time. It is not clear whether plausible (e.g., local) synaptic modification mechanisms could solve both problems simultaneously (see Hopfield & Brody, 2004), but the results suggest an alternative: synaptic modification rules could be used exclusively to learn new associations based on current information, whereas response noise could be used to indirectly make the connectivity more robust to synaptic fluctuations. Although this mechanism evidently does not solve the problem of combining multiple learned associations, it might alleviate it. Its advantage is that, assuming that neural circuits have evolved to adaptively optimize their function in the face of true noise, simply increasing their response variability would generate synaptic connectivity patterns that are more resistant to fluctuations.

6.3 When Is Synaptic Noise Multiplicative? The condition that noise should be multiplicative means that changes in synaptic weight should be proportional to the magnitude of the weight. Evidently not all types of synaptic modification processes lead to fluctuations that can be statistically modeled as multiplicative noise; for instance, saturation may prevent positive increases, thus restricting the variability of strong synapses. However, synaptic changes that generally increase with initial strength should be reasonably well approximated by the multiplicative model. Random synapse elimination fits this model because if a weak synapse disappears, the change is small, whereas if a strong synapse disappears, the change is large. Thus,
the magnitude of the changes correlates with initial strength. Another procedure that corresponds to multiplicative synaptic noise is this. Suppose the size of the synaptic changes is fixed, so that weights can vary only by ±δw, but suppose also that the probability of suffering a change increases with initial synaptic strength. In this case, all changes are equal, but on average, a population of strong synapses would show higher variability than a population of weak ones. In simulations, the disruption caused by this type of synaptic corruption is indeed lessened by response noise (data not shown).

7 Conclusion

To summarize, the scenario we envision rests on five critical assumptions: (1) the activity of each neuron depends on synaptically weighted sums of its (noisy) inputs, (2) network performance is highly sensitive to changes in synaptic connectivity, (3) synaptic changes unrelated to a function that has already been learned can be modeled as multiplicative noise, (4) synaptic modification mechanisms are able to take into account response noise, so synaptic strengths are adjusted to minimize its impact, but (5) synaptic modification mechanisms do not directly account for future learning. Under these conditions, our results suggest that increasing the variability of neuronal responses would, on average, result in more accurate performance. Although some of these assumptions may be rather restrictive, the diversity of synaptic plasticity mechanisms together with the high response variability observed in many areas of the brain make this constructive noise effect worth considering.

Acknowledgments

This research was supported by NIH grant NS044894.

References

Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 450–458.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108–116.
Brotchie, P. R., Andersen, R. A., Snyder, L. H., & Goodman, S. J. (1995). Head position signals used by parietal neurons to encode locations of visual stimuli. Nature, 375, 232–235.
Carpenter, G. A., & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.
Compte, A., Brunel, N., Goldman-Rakic, P., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Crist, R. E., Li, W., & Gilbert, C. (2001). Learning to see: Experience and attention in primary visual cortex. Nature Neuroscience, 4(4), 519–525.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Dean, A. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys., 70, 223–287.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapse to networks (pp. 499–567). Cambridge, MA: MIT Press.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. Journal of Neurophysiology, 75, 1806–1814.
Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing–based computation networks. Proc. Natl. Acad. Sci. USA, 101, 337–342.
Kilgard, M. P., & Merzenich, M. M. (1998). Plasticity of temporal information processing in the primary auditory cortex. Nature Neuroscience, 1, 727–731.
Laing, C. R., & Chow, C. C. (2001). Stationary bumps in networks of spiking neurons. Neural Computation, 13(7), 1473–1494.
Levin, J. E., & Miller, J. P. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380, 165–168.
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
Murray, A. F., & Edwards, P. J. (1994). Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5(5), 792–802.
Nozaki, D., Mar, D. J., Grigg, P., & Collins, J. J. (1999). Effects of colored noise on stochastic resonance in sensory neurons. Physical Review Letters, 82, 2402–2405.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Renart, A., Song, P., & Wang, X. J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485.
Salinas, E. (2003). Background synaptic activity as a switch between dynamical states in a network. Neural Computation, 15(7), 1439–1475.
Salinas, E. (2004). Context-dependent selection of visuomotor maps. BMC Neuroscience, 5(1), 47.
Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15, 6461–6474.
Salinas, E., & Sejnowski, T. J. (2001). Gain modulation in the central nervous system: Where behavior, neurophysiology and computation meet. Neuroscientist, 2, 539–550.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21.
Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Softky, W. P., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4(5), 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217.
Turrigiano, G. G., & Nelson, S. B. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10, 358–364.
van Kampen, N. G. (1992). Stochastic processes in physics and chemistry. Amsterdam: Elsevier.
Vilar, J. M. G., & Rubi, J. M. (2000). Scaling of noise and constructive aspects of fluctuations. Berlin: Springer-Verlag.
Wang, X., Merzenich, M. M., Sameshima, K., & Jenkins, W. (1995). Remodelling of hand representation in adult cortex determined by timing of tactile stimulation. Nature, 378, 71–75.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16(6), 2112–2126.
Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679–684.
Received July 15, 2005; accepted September 16, 2005.
LETTER
Communicated by Harel Shouval
Strongly Improved Stability and Faster Convergence of Temporal Sequence Learning by Using Input Correlations Only

Bernd Porr
[email protected] Department of Electronics and Electrical Engineering, University of Glasgow, Glasgow, GT12 8LT, Scotland
Florentin Wörgötter
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland, and Bernstein Center for Computational Neuroscience, University of Göttingen, Germany
Currently all important, low-level, unsupervised network learning algorithms follow the paradigm of Hebb, where input and output activity are correlated to change the connection strength of a synapse. As a consequence, however, classical Hebbian learning always carries a potentially destabilizing autocorrelation term, which is due to the fact that every input is reflected, in weighted form, in the neuron's output. This self-correlation can lead to positive feedback, where increasing weights increase the output and vice versa, which may result in divergence. This can be avoided by different strategies like weight normalization or weight saturation, which, however, can cause problems of their own. Consequently, in most cases, high learning rates cannot be used for Hebbian learning, leading to relatively slow convergence. Here we introduce a novel correlation-based learning rule that is related to our isotropic sequence order (ISO) learning rule (Porr & Wörgötter, 2003a), but replaces the derivative of the output in the learning rule with the derivative of the reflex input. Hence, the new rule uses input correlations only, effectively implementing strict heterosynaptic learning. This looks like a minor modification but leads to dramatically improved properties. Elimination of the output from the learning rule removes the unwanted, destabilizing autocorrelation term, allowing us to use high learning rates. As a consequence, we can show mathematically that the theoretical optimum of one-shot learning can be reached under ideal conditions with the new rule. This result is then tested against four different experimental setups, and we show that in all of them, very few (and sometimes only one) learning experiences are needed to achieve the learning goal. As a consequence, the new learning rule is up to 100 times faster and in general more stable than ISO learning.
Neural Computation 18, 1380–1412 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

Probably all existing correlation-based learning algorithms currently rely on Donald Hebb's (1949) famous paradigm that connections between network units should be strengthened if the two connected units are simultaneously active (Oja, 1982; Kohonen, 1988; Linsker, 1988). The Hebb rule can be formalized as

dρj/dt = µ uj f(v),    (1.1)

where ρj is the connection strength and the output is calculated from the weighted sum v = Σ_j ρj uj. The factor µ is called the learning rate. The linear operator f is just the identity, f(v) = v, for classical Hebbian learning (Hebb, 1949), and it is the derivative, f(v) = dv/dt, for differential Hebbian learning (Kosco, 1986). In spite of their success, Hebbian-type learning algorithms can be unstable because of the autocorrelation term present in the learning rule. This can be seen if we replace v in equation 1.1 by the weighted sum. Apart from the cross-correlation terms, we get dρj/dt ∝ µ ρj uj f(uj). Hebbian learning is stable only if this autocorrelation term is zero or can be compensated for by additional measures (Oja, 1982; Bienenstock, Cooper, & Munro, 1982; Miller, 1996b; Porr & Wörgötter, 2003a). In the general case, however, this term leads to an exponentially growing instability and network divergence.

Hebb rules have been employed in a wide variety of unsupervised learning tasks, and previously we have focused on the specific problem of temporal sequence learning (Porr & Wörgötter, 2001, 2003a). In this case, two (or more) signals exist that are correlated to each other, but with certain delays between them. In real life, this can happen, for example, when heat radiation precedes a pain signal when touching a hot surface or when the smell of a prey arrives before the predator is close enough to see it hiding in the shrubs. Such situations occur often during the lifetime of a creature, and in these cases it is advantageous to learn reacting to the earlier stimulus, not having to wait for the later signal. Temporal sequence learning enables the animal to react to the earlier stimulus. Thus, the animal learns an anticipatory action to avoid the late, unwanted stimulus. From a more theoretical perspective, such situations are related to classical or instrumental conditioning, and in early studies, correlation-based, stimulus-substitution models were used to address the problem of how to learn such sequences (Sutton & Barto, 1981). Soon, however, these methods were superseded by reinforcement learning algorithms (Sutton, 1988; Watkins, 1989; Watkins & Dayan, 1992), partly because those algorithms had favorable mathematical properties (Dayan & Sejnowski, 1994) and partly because convergent learning could be achieved in behaving systems (Kaelbling, Littman, &
Moore, 1996). Relations to biophysics, however, seem to exist more to the dopaminergic reward-based learning system (Schultz, Dayan, & Montague, 1997) than to (differential) Hebbian learning through long-term potentiation (LTP) at glutamatergic synapses (Malenka & Nicoll, 1999; for a review, see Wörgötter & Porr, 2005). Therefore, in a series of recent articles, we have tried to show that it is possible to solve reinforcement learning tasks by correlation-based (Hebbian) rules, realizing that such tasks can often be embedded into the framework of sequence learning, which allows for a Hebbian formalism (Porr & Wörgötter, 2003a, 2003b). However, we discovered that the Hebbian learning rule we had designed to address problems of temporal sequence learning produces exactly the same autocorrelation instability, which often prevented convergence.

To solve this problem, in this study we present a novel, heterosynaptic learning rule that allows implementation of fast and stable learning. This learning rule has been derived from isotropic sequence order (ISO) learning (Porr & Wörgötter, 2003a), which belongs to the class of differential Hebbian learning rules (Kosco, 1986). ISO learning, however, suffers from the problem discussed above. It too contains the destabilizing autocorrelation term; only for the limiting case of µ → 0 have we been able to prove that this term vanishes (Porr & Wörgötter, 2003a), and then only when using a set of orthogonal input filters. However, a very simple alteration of ISO learning eliminates its autocorrelation term completely: if we correlate only inputs with each other, this term no longer exists. More specifically, we define an error signal at one of the inputs and correlate this error signal with the other inputs. Consequently, our rule can be used in applications where such an error signal can be identified, which is the case, in particular, in closed-loop feedback control.

In this study, we first derive the convergence properties of input correlation (ICO) learning, showing that one-shot learning is the theoretical limit for the learning rate. As an additional advantage, it will become clear that input filtering does not rely on orthogonal filters at the different inputs. Any input characteristic will suffice as long as the whole system contains an (additional) low-pass filter component. This, however, can also come from the transfer function of the environment in which the learning system is embedded. The advantage of now being able to choose almost arbitrary input filters will, for the first time, also allow approximating far more complex (e.g., nonlinear) output characteristics than was possible with ISO learning.

In the second part of this study, we compare ICO learning with its equivalent differential Hebbian learning rule, the ISO learning rule. This comparison, performed on a simulated and a real benchmark test, will demonstrate that input correlation learning is indeed much faster and more stable than the older ISO learning. Finally, we will present a set of experiments from different application domains showing that one-shot learning can be approached when using the ICO rule. These applications have been
specifically chosen to raise confidence that ICO learning can be applied in a variety of different situations.

2 Input Correlation Learning

2.1 The Neural Circuit. Figure 1A shows the basic components of the neural circuit. In contrast to Porr and Wörgötter (2003a), we employ here the z-transform for the mathematical formalism instead of the Laplace transform. This is because the z-space provides a simple way to express the correlation and thus allows a straightforward proof of convergence and stability (see also the appendix). The learner consists of two inputs, x0 and x1, which are filtered with functions h:

u0 = x0 ∗ h0,
uj = x1 ∗ hj,    (2.1)

where the signal x1 is filtered by a filter bank of N filters, which are indexed by j. The filter functions h1, . . . , hN represent a filter bank with different characteristics so that it is possible to generate complex-shaped responses (Grossberg, 1995). The filtered inputs uk converge onto a single learning unit with weights ρk, and its output is given by

v = Σ_{k=0}^{N} ρk uk.    (2.2)
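As a minimal sketch of this circuit, the following Python fragment filters delta-pulse inputs through a resonator bank and forms the weighted output. All pulse times, signal lengths, and weight values are illustrative choices (they loosely echo the parameters quoted in the figure legends), and the resonator helper anticipates equation 2.3 below.

```python
import numpy as np

# Sketch of equations 2.1 and 2.2 with delta-pulse inputs (values illustrative).
def resonator(f, Q, n_taps=500):
    """h(n) = (1/b) * exp(a*n) * sin(b*n), with a = Re(p) and b = Im(p)."""
    a = -np.pi * f / Q
    b = np.sqrt((2.0 * np.pi * f) ** 2 - a ** 2)
    n = np.arange(n_taps)
    return np.exp(a * n) * np.sin(b * n) / b

T = 2000
x1 = np.zeros(T); x1[100] = 1.0     # predictive input arrives first
x0 = np.zeros(T); x0[125] = 1.0     # reflex input follows T = 25 steps later

h0 = resonator(f=0.01, Q=0.51)
bank = [resonator(f=0.1 / j, Q=0.51) for j in range(1, 6)]    # filter bank h_1..h_5

u0 = np.convolve(x0, h0)[:T]                                  # u0 = x0 * h0
u = [np.convolve(x1, hj)[:T] for hj in bank]                  # uj = x1 * hj
rho = np.zeros(1 + len(bank)); rho[0] = 0.005                 # fixed reflex weight
v = rho[0] * u0 + sum(r * uj for r, uj in zip(rho[1:], u))    # equation 2.2
```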
The output will determine the behavior of the system, but not its learning. To make ICO learning comparable with ISO learning, we will mostly use resonators for h, as in our previous work. We will, however, also employ other filter functions where applicable. In discrete time, the resonator responses are given by

h(n) = (1/b) e^{an} sin(bn)  ↔  H(z) = 1 / [(z − e^p)(z − e^{p*})],    (2.3)

where p* is the complex conjugate of p. Note that z-transformed functions are denoted by capital letters or as ρ(z) in the case of Greek letters. The index for the time steps is n. The real and imaginary parts of p are defined as a = Re(p) = −πf/Q and b = Im(p) = √((2πf)² − a²), respectively, which is the definition for continuous time. The transformation into discrete time is performed by the exponential e^p in equation 2.3, which is called the impulse invariance method. The parameter 0 ≤ f < 0.5 is the frequency of the resonator normalized to a sampling rate of one. The so-called quality
Figure 1: Circuit and weight change. (A) General form of the neural circuit in an open-loop condition. Inputs xk are filtered by resonators with impulse response hk and summed at v with weights ρk. The symbol d/dt denotes the derivative. The amplifier symbol denotes a changeable synaptic weight, ⊗ is a correlator, and Σ is a summation node. The filters h1, . . . , hN form a filter bank to cover a wider range of temporal differences between the inputs. (B) Weight change curve. Shown is the weight change for two identical resonators h0, h1 with Q = 0.51, f = 0.01. The two inputs x0 and x1 receive delta pulses x1(n) = δ(n) and x0(n) = δ(n − T). The temporal difference between the inputs is T. The resulting weight change after infinite time is ∆ρ. (C) Behavior of the weight ρ1 for ICO learning as compared to ISO learning. Pairs of delta pulses are applied as in B. The time between the delta pulses was set to T = 25. The pulse sequence was repeated every 2000 time steps until step 100,000. After step 100,000, only input x1 receives delta pulses. The learning rate was µ = 0.001.
Q > 0.5 of the resonator defines the decay rate. We will mostly employ a very low quality (Q = 0.6), which results in a rapid decay.

2.2 The Learning Rule. The learning rule for the weight change ρj is

dρj/dt = µ uj du0/dt,    j > 0,    (2.4)
where only input signals are correlated with each other. Comparing equation 1.1 with the new learning rule, we see that the output v has been replaced by the input u0. The derivative indicates that the learning rule implements differential learning (Kosco, 1986). Thus, we have differential heterosynaptic learning.

Weight changes can be calculated by correlating the resonator responses of H0 and H1 in the z-domain. In the open-loop case, this is straightforward and differs only formally from the Laplace domain used in Porr and Wörgötter (2003a), yielding the same weight change curves. Figure 1B shows the weight change curve for N = 1, H0 = H1 (for parameters, see the figure legend). Weights increase for T > 0 and decrease for T < 0, which means that a sequence of events x1 → x0 leads to a weight increase at ρ1, whereas the reverse sequence x0 → x1 leads to a decrease. Thus, learning is predictive in relation to the input x0. Weights stabilize if the input x0 is set to a constant value (or if x1 is set to zero). Figure 1C shows the behavior of ICO learning as compared to ISO learning in the open-loop case for the relatively high learning rate of µ = 0.001. Clearly one sees that ISO learning contains an exponential instability, which leads to an upward bend in the straight line and prevents weight stabilization even when setting x0 = 0 at time step 100,000. This is different for ICO learning, which does not contain this instability.

3 ICO Learning Embedded in the Environment

3.1 The Closed-Loop Circuit: General Setup and Learning Goal. ICO learning is designed for a closed-loop system where the output of the learner v feeds back to its inputs xj after being modified by the environment. The resulting structure (see Figure 2), similar to that described in Porr, von Ferber, and Wörgötter (2003), is that of a subsumption architecture, where we start with an inner feedback loop, which is superseded by an outer loop (Brooks, 1991). (For a more detailed discussion of such nested structures, we refer to Porr et al., 2003.)

3.1.1 Feedback Loop. Initially only a stable inner reflex, or feedback loop, exists, which is established by the transfer function of the organism H0, the transfer function of the environment P0, the weight ρ0 ≠ 0, and the (here constant) set point SP. Such a reflex could, for example, be the retraction
Figure 2: ICO learning embedded in its environment. Σ is a linear summation unit. Except for the constant set point SP, the inside of the organism resembles ICO learning, as shown in Figure 1A, but here shown without the filter bank and transformed into the z-domain. D is a disturbance, which is delayed by T time steps. The term z − 1 denotes a derivative in the z-domain. Transfer functions P0 and P1 represent the environment and establish the feedback from the motor output v to the sensor inputs X0 and X1. S0 represents the input before subtracting the set point.
reaction of an animal when touching a hot surface. In such an avoidance scenario, X0 would represent the input to a pain receptor, with a desired state of SP = 0. Hence, a correctly designed reflex will indeed reestablish this desired state, but only in a reactive way, that is, only after the disturbance D has upset the state at X0 for a short while. The delay parameter z^{−T} is introduced here to define the timing relation between the inner, late loop and the outer, early (predictive) loop. Thus, the transfer function H0 establishes a fixed reaction of the organism by transferring sensor inputs into motor actions. The transfer function P0 establishes the environmental feedback from the motor output to the sensor input of the organism.

The goal of the feedback loop is to keep the set point SP at S0 as precisely as possible. In this context, X0 can be understood as an error signal that has to be minimized. Without losing generality, we will set the set point SP for
all theoretical derivations from now on to zero (SP = 0), which means that S0 = X0, and we interpret the sensor input as the error signal.

3.1.2 Learning Goal. We now explain how learning is achieved. Initially the outer loop, formed by H1, P1, is inactive because ρ1 = 0. It receives the disturbance D at sensor input X1 earlier than the inner loop. In our example, one could think of a heat radiation signal that is felt before touching the hot surface. However, a naive system will not react in the right way, withdrawing the limb before touching, as can be seen in very young children, who will hurt themselves in such a situation. Hence, the learning goal for this system is to increase ρ1 such that an appropriate earlier reaction will be elicited after learning. As a consequence, after learning, X0 will, in an ideal case, never leave the set point again. In a way, one could think of this as the reflex being shifted earlier in time. In the general case, there will be a filter bank where every filter has its own corresponding weight ρj, j > 0. In the following sections, we will establish the formalism for treating such closed-loop systems and provide a convergence proof. The main result of this section is that ICO learning approaches one-shot learning in a stable convergence domain provided the inner loop represents a stable feedback controller or, in other words, provided the reflex creates an appropriate and stable reaction. Readers not interested in the mathematical derivations, which rely on the application of some methods from control theory, might consider skipping this section.

3.2 Stability Proof

3.2.1 Responses to a Disturbance. The stability of a feedback system can be evaluated by looking at its impulse response to a disturbance. The actual reaction of the feedback system to a disturbance D can be calculated easily in the z-domain. In the simplest case, the disturbance is a delta pulse, which is just D = 1 in the z-domain. In more complex scenarios (as in the experiments), the disturbance is a random event that we assume to be bounded and stable. Thus, we apply a disturbance D and observe the changes, for example, at the sensor input X0:

X0 = D z^{−T} P0 + X0 H0 ρ0 P0.    (3.1)
We can now solve for X0 and get

X0 = D z^{−T} P0 / (1 − ρ0 P0 H0).    (3.2)
This equation provides the response of the feedback loop to a disturbance D. We demand here that the feedback is designed in a way that X0 is stable and always decays to zero after a disturbance has occurred. (For a general stability analysis of feedback loops, see D'Azzo, 1988.) In addition, we introduce

F = X0 H0 = z^{−T} P0 H0 / (1 − ρ0 P0 H0),    (3.3)
which is the response of the feedback loop at U0 to a delta pulse (D = 1). We will need this term later for the stability analysis.

A pure feedback loop cannot maintain the set point all the time because the reaction to a disturbance D by the feedback loop is always too late. Thus, from the point of view of the feedback, it is desirable to predict the disturbance D to preempt the unwanted triggering of the feedback loop (Palm, 2000). Figure 2 accommodates this in the most general way by a formal delay parameter z^{−T}, which ensures that the input x1 receives the disturbance D earlier than input x0. This establishes a second, predictive pathway, which is inactive at the start of learning (ρ1 = 0). The learning goal is to find a value for ρ1 so that the learner can use the earlier signal at x1 to generate an anticipatory reaction that prevents x0 from deviating from the set point SP. Generally the predictive pathway is set up as a filter bank where the input x1 feeds into different filters that generate the predictive response. The response of the system to a disturbance D with the predictive pathway can be obtained in the same way as demonstrated for the feedback loop:

X0 = (D P1 Σ_{k=1}^{N} ρk Hk + D z^{−T}) P0 / (1 − P0 ρ0 H0).    (3.4)
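A hedged discrete-time sketch of this closed loop is given below. To keep it trivially stable, the environmental transfer functions P0 and P1 are reduced to unit gains and the filters are short moving averages instead of the paper's resonators, with ρ0 chosen so that the loop gain stays below one at all frequencies; all values are our own stand-ins.

```python
import numpy as np

# Discrete-time sketch of the feedback loop of Figure 2 (equations 3.1-3.4).
T_delay, steps = 25, 1500
rho0, rho1 = -0.01, 0.0            # fixed reflex weight; rho1 = 0 before learning
p0 = p1 = 1.0                      # environmental transfer functions as pure gains
h0, h1 = np.ones(20), np.ones(50)  # boxcar stand-ins for the filters H0, H1

d = np.zeros(steps); d[100] = 1.0  # disturbance D: a delta pulse
x0, x1, v = np.zeros(steps), np.zeros(steps), np.zeros(steps)
for n in range(1, steps):
    x1[n] = d[n] + p1 * v[n - 1]                     # early (predictive) input
    x0[n] = (d[n - T_delay] if n >= T_delay else 0.0) + p0 * v[n - 1]  # late reflex input
    k0 = np.arange(min(n + 1, len(h0)))              # causal convolution windows
    k1 = np.arange(min(n + 1, len(h1)))
    u0 = np.dot(h0[k0], x0[n - k0])                  # u0 = x0 * h0
    u1 = np.dot(h1[k1], x1[n - k1])                  # u1 = x1 * h1
    v[n] = rho0 * u0 + rho1 * u1                     # output closes both loops
# After the pulse, x0 decays back to the set point (here 0), as a stable reflex must.
```

With ρ1 = 0, the sketch reproduces the pure reflex of equation 3.2: x0 deviates when the disturbance arrives and only then decays back; learning ρ1 with equation 2.4 would shift the compensating reaction earlier.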
The goal now is to find a distribution of weights ρk so that the condition X0 = 0 is satisfied all the time. In other words, find weights that ensure that the input X0 never deviates from the set point.

3.2.2 Analysis of Stability.

• Learning rule in the z-domain. Stability is achieved if the weights ρj converge to a finite value. We will prove stability in the z-domain, which has two advantages: the derivative can be expressed in a very simple form, and the closed loop can be eliminated. The result also provides absolute values of the weights after a disturbance has occurred. Equation 2.4 can be rewritten in the z-domain as

(z − 1) ρj(z) = µ [(z − 1) U0(z)] Uj(z^{−1}),    (3.5)
where (z − 1) is the derivative. Since the z-transform is not such a commonly used formalism, we refer the reader to the appendix for a detailed description of some of the methods used to arrive at equation 3.5. Note that the weight ρj(z) is the z-transformed version of ρj(t). The change of the weight ρj(z) on the left side is expressed in the same way as the derivative on the right side. This formulation also takes into account that any change of the weight ρj(z) might have an immediate impact on the values of U0 and Uj. Thus, we do not assume here that learning operates at low learning rates µ. At this point, we allow for any learning rate.

• Calculating the weight. To calculate the weight ρj(z), we need the filtered reflex input U0 = X0 H0, which can be directly obtained from equation 3.4. The resulting weight ρj(z) can now be evaluated using

ρj(z) = µ F (D P1 Σ_{k=1}^{N} ρk Hk + D z^{−T}) D^− P1^− Hj^−,    (3.6)
where from now on we abbreviate the time-reversed functions H(z^{−1}) by H^−. Solving for ρj(z) gives

ρj(z) = [ µ F D D^− P1 P1^− Σ_{k=1, k≠j}^{N} ρk(z) Hk Hj^− + z^{−T} µ F D D^− P1^− Hj^− ] / [ 1 − µ F D D^− P1 P1^− Hj Hj^− ],    (3.7)
which is the value of the weight ρj(z) after a disturbance D. To get a better understanding of the equation above, we now restrict ourselves to just one filter in the predictive pathway and set N = 1. In that case, the sum in the numerator vanishes to give

ρ1(z) = [ z^{−T} µ F D D^− P1^− H1^− ] / [ 1 − µ F D D^− P1 P1^− H1 H1^− ] := M/K.    (3.8)
Thus, we have a result that can be analyzed for the stability of weight ρ1(z).

• Stability criterion. A system is bounded-input bounded-output stable if its impulse response and its corresponding transfer function Y satisfy the condition

|Y(e^{iω})| < ∞,    (3.9)

for all frequencies ω. After having understood the case with just one filter, N = 1, we can now generalize to the case N > 1. Thus, we return to equation 3.7. Comparing equation 3.7 with equation 3.8 shows that the stability criteria from the special case also apply to the general case. The denominators are the same in both cases, so that the criterion of equation 3.11 still holds. The only difference is the sum over correlations between different resonators (Hk correlated with Hj). The crucial question here is whether the correlation of these resonator responses is stable. The answer is affirmative because the correlation of one resonator Hk with another
one Hj is just the weight change for the case T = 0 of the learning rule (see Figure 1B). This weight change is stable for the same reason as given above: the correlation of two low-pass-filtered delta pulses is bounded. Thus, ICO learning is also stable for a filter bank embedded in a closed loop. The absolute values of the weights in a filter bank are not easy to understand because of the correlations between the filter functions Hj and Hk. These correlations do not play a role after successful learning because then x0 is constant, and therefore any weight change is suppressed anyway.

4 Applications

This section compares the performance of ICO learning with differential Hebbian (ISO) learning and shows that ICO learning can be applied successfully to different application domains. In sections 4.1 and 4.2, we use a biologically inspired task, which will be performed first as a simulation and then as a real robot experiment where a robot was supposed to retrieve "food disks" (i.e., white paper snippets on the floor). This task is similar to the one described in Verschure, Voegtlin, and Douglas (2003) and to the second experiment in Porr and Wörgötter (2003b). In the simulation, we will compare ISO learning with ICO learning and show that the latter is able to perform one-shot learning under ideal noise-free conditions. The actual robot experiment will show that ICO learning also operates successfully in a physically embodied system where ISO learning fails. Other complex control examples will be presented in the last two experiments using different setups.

4.1 The Simulated Robot. This section presents a benchmark application that compares Hebbian (ISO) learning with our new input correlation (ICO) learning. Figure 3A presents the task, where a simulated robot has to learn to retrieve food disks in an arena. The disks also emit simulated sound signals. Two sets of sensor signals are used. One sensor type (x0) reacts to (simulated) touch and the other sensor type (x1) to the sound. The actual choice of these modalities is not important for the experiment, but this creates a natural situation where sound precedes touch. Hence, learning must use the sound sensors that feed into x1 to generate an anticipatory reaction toward the food disk (Verschure et al., 2003). The circuit diagram is shown in Figure 3B. The reflex reaction is established by the difference of two touch detectors (LD), which cause a steering reaction toward the white disk. Hence, x0 is a transient signal that occurs only while a disk is being touched. As a consequence, x0 is equal to zero if both LDs are not stimulated, which is the trivial case of not touching a disk at all, or when they are stimulated at the same time, which happens during a straight encounter with a disk. The latter situation occurs after successful learning, which leads to the head-on touching of the disks. The reflex has a constant weight ρ0, which always guarantees a stable reaction. The predictive signal
Figure 3: The robot simulation. (A) The robot has a reflex mechanism (1), which elicits a sharp turn as soon as it touches the disk laterally and thereby pulls the robot into the center of the disk. The disk also emits "sound." The robot's task is to use this sound field to find the disk from a distance (2). (B) The robot has two touch detectors (LD), which establish, with the filter H0 and the fixed weight ρ0, the reflex reaction by x0 = LDl − LDr. The difference of the signals from two sound detectors (SD) feeds into a filter bank. The weights ρ1, . . . , ρN are variable and are changed by either ISO or ICO learning. Apart from the reflex reaction at the disk, the robot has a simple retraction mechanism when it collides with a wall ("retraction," not used for learning). The output v is the steering angle of the robot. (C) Basic behavior until the first learning experience. The trace at ∗ continues in D, where the robot has learned to target the disks from a distance. The example here uses ICO learning with µ = 5 · 10^−5. Other parameters: filters were set to f0 = 0.01 for the reflex and fj = 0.1/j, j = 1, . . . , 5, for the filter bank, where Q = 0.51 for all filters. The reflex weight was ρ0 = 0.005.
x1 is generated by using two signals coming from the sound detectors (SD). The signal is simply assumed to give the Euclidean distance (r_{r,l→s}) of the left (l) or right (r) microphone from a sound source (s). The difference of the signals from the left and the right microphone, r_{r→s} − r_{l→s}, is a measure of the azimuth of the sound source to the robot. Successful learning leads to a turning reaction, which balances both sound signals and results ideally in a straight trajectory toward the target disk, ending in a head-on contact.
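A small illustrative sketch of these two sensor signals follows; the microphone geometry, offsets, and coordinates are invented, and only the definitions x0 = LDl − LDr and x1 = r_{r→s} − r_{l→s} are taken from the text.

```python
import numpy as np

# Hypothetical sketch of the simulated robot's sensor signals (geometry invented).
def sensor_signals(robot_xy, heading, source_xy, touch_left, touch_right, mic_offset=5.0):
    normal = np.array([-np.sin(heading), np.cos(heading)])   # left-pointing normal
    mic_l = robot_xy + mic_offset * normal                   # left sound detector
    mic_r = robot_xy - mic_offset * normal                   # right sound detector
    x0 = float(touch_left) - float(touch_right)              # reflex: touch difference
    x1 = (np.linalg.norm(mic_r - source_xy)                  # predictor: distance
          - np.linalg.norm(mic_l - source_xy))               # difference to the source
    return x0, x1

# A source ahead and to the left yields x1 > 0, i.e., a steering cue before any touch.
x0, x1 = sensor_signals(np.array([0.0, 0.0]), 0.0, np.array([50.0, 30.0]), False, False)
```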
After the robot encounters a disk, the disk is removed and randomly placed somewhere else. An example of successful learning is presented in Figures 3C and 3D. The robot first bumps into walls. Eventually it drives through the disk, which provides the robot with the first learning experience. In this example, just one experience has been sufficient for successful learning. The trace in Figure 3D continues the trace from Figure 3C. Such one-shot learning can be achieved with ICO learning but not with ISO learning. This is now tested more systematically by comparing the performance of ISO and ICO learning in a few hundred simulations. We quantify successful and unsuccessful learning for increasing learning rates µ. Learning was considered successful when we received a sequence of four contacts with the disk at a subthreshold value of |x0| < 0.2. We recorded the actual number of contacts until this criterion was reached. Hence, four contacts represent our statistical threshold for deciding between chance and actual successful learning. There were two reasons to choose a threshold of 0.2. First, when x0 is below the threshold, the robot visibly heads for the center of the food disk. Second, the signal x0 has only discrete values because of a discrete arena of 600 × 400, where the robot has a size of 20 × 10. Even if the robot heads perfectly toward the food disk, there will often be a temporal difference between the left and the right sensor because of the discrete representation of both the robot and the round-shaped food disk (diameter 20), leading to a small remaining value of x0 (aliasing effect).

The log-log plots of the number of contacts in Figures 4A and 4B show that both rules follow a power law. The similarity of the curves for small learning rates reflects the mathematical equivalence of both rules for µ → 0. The dependence of failures on the learning rate is quite different for ISO learning as compared to ICO learning. For differential Hebbian (ISO) learning (see Figure 4B), errors increase roughly exponentially up to a learning rate of µ = 10^−4. This behavior reflects errors caused by the autocorrelation terms. Above µ = 10^−4, failures reach a plateau with some statistical variability. For ICO learning (see Figure 4A), failures remain essentially zero up to µ = 0.0002; the learned behavior diverges only above that value. In contrast to the ISO rule, this effect is here due to overlearning, where the learning gain of the predictive pathway is higher than the gain of the feedback loop. Thus, the predictive pathway becomes unstable during the first learning experience. This means that the effective learning rate (see equation 3.12) has exceeded one. The actual learning rate µ is lower because it is multiplied with the gains of the feedback reaction F and the predictive pathway D H1 P1, which depend on the actual experimental setup.

For two different learning rates (µ = 5 · 10^−6 and 5 · 10^−5), the weights ρj, j > 0, and the reflex input x0 are plotted in Figure 5. The data have been taken from four simulations of Figure 4. Thus, success has been measured in the same way as before, requiring |x0| to be below 0.2 for four consecutive learning experiences. At the low learning rate (see Figures 5A–5D), weights
Figure 4: Results from the simulated robot experiment. (A) Results from ICO learning. (B) Results from the ISO learning. Log-log plots show how many contacts with the target were required for successful learning at a given learning rate µ. Histograms show how many times learning was not successful. The bin size was set to 10 experiments, which gives an equal spacing on the log x-axis. Failures are shown on a linear axis.
Figure 5: Comparing ICO and ISO learning in individual simulated robot experiments. (A, B, E, F) Plots of the reflex input x0 at the contacts with the food source. (C, D, G, H) Plots of the weights for two learning rates, µ = 5 · 10^−6 (A–D) and 5 · 10^−5 (E–H), for the two different learning rules, ISO and ICO learning. The inset in G shows steps 6000 to 10,000 plotted with a y-range of −55.72 · 10^−3 to −55.54 · 10^−3. The inset in H shows steps 0 to 20,000 plotted with a y-range of −0.001 to 0.0025.
converge to very similar values for ISO as well as ICO learning. This is not surprising, as for low learning rates, the autocorrelation term in ISO learning is small. However, even for such low learning rates, the weights drift in the ISO learning case. This can be seen in particular between steps 3000 and 7000 in Figure 5D. There are no contacts with food disks, and consequently x0 stays zero. However, the weights drift upward because of nonzero inputs to the filter bank through x1. ICO learning (see Figure 5C) does not show any weight drift, for three reasons. First, a constant input at x0 keeps the weights constant. Second, the predictive input x1 is zero at the moment x0 is triggered. This is the case after successful learning, as seen, for example, in Figure 5C between steps 32,000 and 36,000. Third, the derivative du0/dt of the filtered input x0 is symmetric, so that the weight change is effectively zero. All of these factors contribute to stability. Even in the case that x0 always receives small transients, learning is stable. Transients can occur due to aliasing in the simulation or, in the real robot, due to mechanical imperfections. Such transients trigger unwanted weight change. However, they do not destabilize learning if x0 is understood as an error signal that always counteracts unwanted weight change. For example, a transient at the reflex input x0 causes the robot to learn a steering reaction to the left that is too strong. The next time the robot enters the food disk, the overly strong left turn causes an error signal at x0 that reduces the steering reaction again. Thus, one finds that in these cases, weights will occasionally grow or shrink due to transients in x0. However, the weights will be brought back to their optimal values if x0 carries a proper error signal.

In the experiments with high learning rates (see Figures 5E to 5H), learning is very fast, resulting in stable weights for ICO after just two learning experiences, which appear in Figure 5E as large peaks. After the second peak, the weights undergo only minimal change. In fact, the "almost head-on" contacts (small peaks in x0) between steps 6000 and 10,000 of Figure 5G cause the weights to become more positive again. This is demonstrated in the inset of Figure 5G, which indicates that learning initially caused a slight overshooting of the weights. A different behavior is observed for ISO learning (see Figure 5H). After the second contact with the food disk, the system starts to diverge. The autocorrelation term dominates learning, leading to exponential growth of the weights. After step 22,000, the reflex input x0 is zero, which means that only the autocorrelation terms change the weights. Behaviorally, we observe that the robot first learns the right behavior: driving toward the food disk. This behavior corresponds to negative weights, as seen in Figures 5C, 5D, and 5G. After step 10,000, however, the weights drift to positive values, which behaviorally is an avoidance behavior. This behavior becomes stronger and stronger, so that the robot will never touch the food disk again. This unwanted ongoing learning is due to the movements of the robot, which cause a continuously changing sound signal x1, resulting in a nonvanishing autocorrelation term. Thus, while ICO learning (see
equation 2.4) is stable for both low and high learning rates, its differential Hebbian counterpart, ISO learning, is stable only at low learning rates.

Figure 6: ICO learning simulation with three simultaneously present food disks. See Figure 3 for parameters. (A) Trace of the robot for the whole simulation. The trace before learning is in gray to differentiate it from the learning behavior. Initially, learning is switched off for the first 1000 steps to demonstrate purely reflexive behavior when encountering the disk at R. The following first three learning experiences are numbered 1 to 3. (B) Weight development during learning. The learning rate was set again to µ = 5 · 10^−5.

The benchmark tests provided an ideal condition for learning, where just one food disk was in the arena. This gave a perfect correlation between the proximal and the distal sensor. Having three food disks in the arena at the same time renders learning more difficult (see Figure 6). There is no longer a simple relationship between the reflex input x0 and the predictor x1. The sound fields from the different food disks superimpose onto each other so that the distal information is distorted. However, ICO learning also manages this scenario without any problems. Figure 6A depicts the trace of a run starting just before the first learning experience. Figure 6B shows the corresponding weight development, which is stable as well. Again, ISO learning is not able to perform this task at this high learning rate (data not shown). In summary, the simulations demonstrate that ICO learning is much more stable than the Hebbian ISO learning rule. ICO learning is able to operate with high learning rates approaching one-shot learning under ideal noise-free conditions.

4.2 The Real Robot. In this section we show that the same food disk targeting task can also be solved by a real robot. This is not self-evident because of the complications that arise from the embodiment of the robot and its situatedness in a real environment. (See Ziemke, 2001, for a discussion of the embodiment principle.) In addition, we will show that it is possible to use filters other than resonators in the predictive pathway.
Figure 7: Experiment with a real robot. (A–C) Traces during the run, which lasted 8:46 min. A is taken from the start of the run at 0:06, showing the first reflex reaction; B and C show learned targeting behaviors after 3:45 (14 contacts) and 4:48 (18 contacts), respectively. The development of the weights and the trace x0 is shown in D. The values of the weights for B and C are indicated by arrows. Parameters: The learning rate was set to µ = 0.00002, the reflex weight to ρ0 = 40, and the video input image v(Ξ, Υ) had Ξ = [1, . . . , 96] × Υ = [1, . . . , 64] pixels. The scan line for the reflex was at Υ = 50 and for the predictor at Υ = 2. The reflex x0 and the predictive signal x1 were generated by creating a weighted sum of thresholded gray levels: x_{0,1}(Υ) = Σ_{Ξ=1}^{96} (Ξ − 96/2) Θ(v(Ξ, Υ) − 128), where Θ is the Heaviside function. The predictive input is split up into a filter bank of five filters. The predictive filters have 100, 50, 33, 25, 20 taps, where all coefficients are set to one. The reflex pathway is set up with a resonator set to f0 = 0.01 and Q = 0.51. The camera was a standard pinhole camera with a rather narrow viewing angle of ±35 degrees.
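The following sketch renders the caption's signal generation in Python under stated assumptions: the scan-line formula follows the reconstruction above (with the horizontal pixel index written as xi and a threshold test standing in for the Heaviside function Θ), the test frame is random, and the FIR bank consists of all-ones (boxcar) filters with 100 down to 20 taps.

```python
import numpy as np

# Sketch of the vision signals and FIR filter bank from the caption (illustrative data).
def scanline_signal(frame, row, threshold=128):
    """x(row) = sum_xi (xi - width/2) * Theta(v(xi, row) - threshold)."""
    width = frame.shape[1]
    xi = np.arange(1, width + 1)
    bright = (frame[row - 1, :] >= threshold).astype(float)   # Heaviside step
    return float(np.dot(xi - width / 2.0, bright))

frame = np.random.default_rng(1).integers(0, 256, (64, 96))   # 96 x 64 pixel image
x0 = scanline_signal(frame, row=50)    # reflex scan line, close to the robot
x1 = scanline_signal(frame, row=2)     # predictive scan line, far ahead

# FIR (boxcar) filters with all coefficients set to one simply smear a predictive
# event out over windows of 100 down to 20 samples.
fir_bank = [np.ones(taps) for taps in (100, 50, 33, 25, 20)]
event = np.zeros(400); event[50] = x1
u = [np.convolve(event, h)[:len(event)] for h in fir_bank]
```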
As before, the task of the robot is to target a white disk from a distance. Similar to the simulation, the robot has a reflex reaction that pulls it into the white disk at the moment it drives over the disk. This reflex reaction is achieved by analyzing the bottom scan line of a camera mounted on the robot. The predictive pathway is created in a similar way. A scan line from the top of the image, which views the arena at a greater distance from the robot (hence "in its future"), is fed into a filter bank of five filters. In contrast to the simulation, these filters are set up as finite impulse response (FIR) filters with different numbers of taps, where all coefficients are set to one. Thus, the only thing such a filter does is to smear the input
signal out over time, while the response duration is limited by the number of filter taps. We chose such filters for two reasons. First, in contrast to ISO learning, we do not need orthogonality between the reflex pathway and the predictive pathway. Thus, it is possible to employ different filter functions in the different pathways. This made it possible to solve a problem that exists with this robot setup. Because we used a camera with a rather narrow angle, we had to put the food disk rather centrally in front of the robot. Second, the FIR filters generate step responses that result in a clearly observable behavioral change after learning as soon as the food disk enters the visual field of the robot. Resonator responses are so smooth that the reflex and the learned reaction would look too similar. The reflex behavior before learning is shown in Figure 7A, where the robot drives exactly straight ahead until it encounters the white disk. Only when it sees the disk directly in front of it is a sharp and abrupt turning reaction generated. The learning rate was set to the highest possible value; at higher learning rates, the system started to diverge. Learning takes longer than in the simulation: about 10 contacts with the white disk are needed until a learned behavior can be seen. Examples of successful learning are shown in Figures 7B and 7C. Now the robot's turning reaction sets in from a distance of about 50 cm from the target disk. Thus, the robot has learned anticipatory behavior.

The real robot is subject to complications that do not exist in the simulation. The inertia of the robot, imperfections of the motors, and noise from the camera render learning more difficult than in the simulation. As a consequence, we obtain a nonzero reflex input x0 all the time, as shown in the top trace of Figure 7D. This is also reflected in the weight development. The weights change during the whole experiment, although they do not diverge. Rather, they oscillate around their best value. The experiment can be run for a few hours without divergence. Another reason for weight change is the limited space in the arena. This effect can be drastic if the robot is caught in a corner of the arena. Imagine the robot first encounters a food disk and then directly bumps into a wall. The bump then causes a retraction reaction, which changes the input x0 and therefore the reflex reaction. Consequently, learning is affected by such movements. Another aspect is the human operator, who throws the food disks in front of the robot. If the food disk is thrown in front of the robot too late, the timing between x1 and x0 is different, which also leads to wrong correlations. All additional error sources, like noisy data, impose an upper limit on the learning rate. This limit, however, is not the theoretical one (see equation 3.11) but a practical limit to protect the robot from learning the wrong behavior during its first learning experience (Grossberg, 1987).

4.3 Control Applications. In the next two sections, we will demonstrate that ICO learning can also be used in more conventional control situations.
4.3 Control Applications. In the next two sections, we will demonstrate that ICO learning can also be used in more conventional control situations.
Figure 8: (A) Setup of the mechanical system. The position of the main arm is maintained by a PI controller controlling motor force M with ρ_0 = 6; its position is measured by a potentiometer P, SP = 100°, and the effective equilibrium point (EEP) reached 93.2°. Note that the effective equilibrium point will be identical to the set point only for an ideal controller at infinite gain. A disturbance D is introduced by an orthogonally mounted smaller arm. System parameters: sampling interval 5 ms, μ = 2 × 10⁻⁵, f_0 = 10 Hz, Q_0 = 0.6, Q_1^j = 0.6, f_1^j = 20 Hz/j, j = 1, . . . , 10. (B) Signal traces D, M, and P from one experiment. The inset ρ_1 shows the development of the connection weights. Disturbances are compensated after about four trials, and weights stabilize.
To this end, we note first that a reflex is conceptually very similar to a conventional closed-loop controller, where a set point is maintained by a feedback reaction from the controller as soon as a disturbance is measured. In the next section, we will show anticipatory control of a mechanical arm as well as feedforward compensation of a heat pulse in a temperature-controlled container, such as those commonly used for chemical reactions. Mainly we will try to demonstrate that in these situations, ICO learning also converges very fast, which may make it applicable in more industrial scenarios too.

4.3.1 The Mechanical Arm. To show that ICO learning is also able to operate with a classical PI controller, we have set up another mechanical system. In addition, we show in this example how weights can be kept stable if the input x_0 is too noisy. Recall that weight stabilization occurs as soon as x_0 = 0 (see equation 2.4). To ensure this, we employ a threshold around the SP, creating an interval within which x_0 is set to zero.

For our mechanical arm (see Figures 8A and 8B), a conventional PI controller defines the reflexive feedback loop controlling the arm position P = x_0. The PI controller replaces the resonator H_0 in this case. To stop the weights from fluctuating, we apply a threshold of ±1° around the set point to x_0. Disturbances (D = x_1) arise from the pushing force of a second, smaller arm mounted orthogonally to the main arm. A fast-reacting touch sensor at the contact point measures D. The force D is transient (top trace in Figure 8B), and the small arm is pulled back by a spring. A moderately high learning rate was chosen to demonstrate how the system develops in time. The second trace, M = v, shows the motor signal of the main arm. Close inspection reveals that during learning, this signal first becomes biphasic (see the small inset curve), where the earlier component corresponds to the learned part and the later component to the PI controller’s reaction. At the end of learning, only the first component remains (note the forward shift of M with respect to D). Trace P = x_0 shows the position signal of the main arm. In the control situation, learning was off, and a biphasic reaction is visible with about a 10-degree position deviation (peak to peak). During learning, this deviation is almost fully compensated after four trials. The inset curves ρ_1 at the bottom show that the connection weights have stabilized after the fourth trial. The fifth trial is shown to demonstrate the remaining variability of the system’s reaction.

4.3.2 Temperature Control. Figure 9 shows anticipatory temperature control against heat spikes, which could be potentially damaging in a real plant. A feedback loop with a resonator H_0 guarantees a constant temperature SP in a container. The actual temperature is controlled by an electric heater (κ_v) and a cooling system (φ_v). The system can be considered nonlinear because cooling and heating are achieved by different techniques. The demanding task of learning here is to predict temperature changes caused by
Figure 9: Learning to keep the temperature ϑ in a container C constant against external disturbances. Container volume was 500 ml, the main heat source was provided by a 500 W coil heater (κ_v), and the main cooling source was a pulse-width modulated, valve-controlled water flow through a copper coil (φ_v) with a maximum of 750 ml/min at 17°C. The disturbance heat source (κ_D) received pulses of 1000 W from D. Data acquisition and control were performed with a USB-DUX board. The sampling rate was 1 Hz. The resonator in the feedback loop was set to f_0 = 0.2 Hz, Q_0 = 0.51, and its corresponding weight to ρ_0 = 50. H_1 is a filter bank of resonators with parameters given in the caption of Figure 10.
another heater, κ_D, which is switched on occasionally. In a real application, the predictive signal would come from a second thermometer or some other sensor that is able to predict the deviation from the set point SP. Several temperature experiments have been performed at different set points. In Figure 10A, a high gain and a small μ were used, and learning compensates the over- and undershooting in about 15 trials. Figure 10B shows that with a high gain and a high learning rate, the heat spike is compensated in a single trial, which could represent a vital achievement in a real plant. In this case, compensation of the undershoot takes much longer (not shown). In Figure 10C, a low gain was used, and the system reacts rather slowly. Learning compensates the overshoot after four trials, and the effective equilibrium point is then reached again, which was not the case before learning. In all situations, the weights essentially stabilize and drift only slightly around their equilibrium, because no threshold was used at x_0. These small oscillations are similar to the behavior of the weights in the real robot experiment, which also oscillated around their equilibrium. Furthermore, we note that learning already sets in strongly in the first trial, immediately influencing the output.
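Both control experiments rest on the same mechanism: the ICO update of equation 2.4, driven by the derivative of the filtered reflex input, optionally combined with the dead zone around the set point used for the arm (which forces x_0 = 0 and thereby freezes the weights). The following discrete-time sketch is a minimal illustration; the variable names and the default dead-zone width are our assumptions.

```python
import numpy as np

def dead_zone(x0, theta=1.0):
    """Zero the reflex (error) signal inside +/- theta around the set point.
    With x0 = 0 the ICO rule below leaves the weights unchanged, which is
    how the weights were kept stable for the mechanical arm (theta was
    1 degree there; the value is otherwise illustrative)."""
    return 0.0 if abs(x0) < theta else x0

def ico_step(rho, u, u0_now, u0_prev, mu=1e-5):
    """One discrete-time step of ICO learning (equation 2.4):
    d(rho_j)/dt = mu * u_j * d(u0)/dt, with the derivative of the
    filtered reflex input u0 replaced by a one-step difference.
    `u` holds the filter-bank outputs u_j, j > 0; rho has the same shape.
    """
    du0 = u0_now - u0_prev                  # discrete derivative of u0
    return rho + mu * np.asarray(u) * du0   # purely heterosynaptic update
```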
In Figure 10D, we show how the system reacts when the Hebbian learning rule (ISO learning) is used instead. We observe poor convergence even for a rather small learning rate of μ = 2 × 10⁻¹¹, which is more than 1000 times smaller than those used for Figures 10A to 10C. These findings mirror the results of the simulations performed above. Some compensation occurs, but the weights drift much more. To achieve this specific result, a higher gain had to be used than in the equivalent experiment shown in Figure 10C. With a lower gain, convergence was never reached, probably due to the noise in the signals. It should also be noted that this experiment was the best out of 20 using ISO learning.

5 Discussion

In this letter, we have presented a modification of our earlier ISO learning rule, which leads to a dramatic improvement in convergence speed and stability. Mathematically, we were able to show that under ideal noise-free conditions, ICO learning approaches one-shot learning. We have discussed the relations of these types of differential Hebbian learning rules (Kosko, 1986; Klopf, 1986; Sutton & Barto, 1987; Roberts, 1999) to temporal sequence learning and to reward-based learning, most notably TD and Q learning (Sutton & Barto, 2002), and their embedding in the existing literature, to a great extent in a set of older articles (see, in particular, Wörgötter & Porr, 2005, for a summary). Here we restrict the discussion to the relevant novel features of ICO learning.

We have to discuss the different application domains of ICO versus ISO learning. ICO and ISO learning are identical when using an orthogonal filter set in the limit μ → 0. In this situation, the autocorrelation term of ISO learning vanishes, and convergence is guaranteed for ISO learning as well (Porr et al., 2003). The advantage of ISO learning as compared to ICO learning is its isotropy: all inputs can self-organize into reflex inputs or predictive inputs, depending on their temporal sequence (see Porr & Wörgötter, 2003a, for a discussion of this property). For ICO learning,
Figure 10: Temperature control experiments. Parameters of the filter bank H_1 are Q_1^j = 0.51 and f_1^j = 0.1 Hz/j, with j = 1, . . . , 12 for A, B and j = 1, . . . , 10 for C, D. Experiments with different parameters: (A) SP = 60°C, EEP = 59.2°C, ρ_0 = 250, disturbance pulse duration 10 s, μ = 4 × 10⁻⁸. (B) SP = 70°C, EEP = 68.4°C, ρ_0 = 250, disturbance pulse duration 20 s, μ = 4 × 10⁻⁷. (C) SP = 44.0°C, EEP = 43.5°C, ρ_0 = 150, disturbance pulse duration 12 s, μ = 7.5 × 10⁻⁷. (D) Same as in C, except for a higher feedback gain of ρ_0 = 250 (EEP = 43.9°C) and a lower learning rate of μ = 2 × 10⁻¹¹, but using ISO learning as denoted in the figure. Note that the levels of the input signals at ρ_0 and ρ_j, j > 0, are different. This leads to different absolute values for ρ_0 and ρ_j, j > 0.
one needs to build the predefined subsumption architecture (see Figure 2) into the system from the beginning. This means that we have to set up a feedback system with a desired state and an error signal (x_0 → 0) that drives learning. In the context of technical control applications, this is usually given, so that ICO learning is the preferred choice over ISO learning (D’Azzo, 1988). In biology, however, self-organization is the key aspect. ISO learning has the ability to self-organize which pathways become reflex pathways and which become predictive pathways. Reflex pathways can be superseded by other pathways, which in turn can become reflex pathways. This also means that ISO learning is able to use any input as an error signal, whereas ICO learning can use only x_0 as an error signal. By hierarchically superseding reflex loops, ISO learning is able to self-organize subsumption architectures (Brooks, 1989; Porr et al., 2003), which is not possible with ICO learning.

The filter bank here is used to generate an appropriate behavioral response. In contrast to our older ISO learning, it is possible to use other filter functions, such as step functions, for the filter bank, as demonstrated in the real robot experiment. The only restriction imposed on the filter bank is that it should establish a low-pass characteristic. This characteristic has to be established by the closed loop, not the open loop: the actual filter in the filter bank need not itself possess a low-pass characteristic as long as the closed loop established through the environment does. Filter banks have been employed in other learning algorithms, for example, in TD learning (Sutton & Barto, 2002). In contrast to our learning scheme, the filters there are used only for the critic, not for the actor. In other words, they are used to smear out the conditioned stimulus so that it can be correlated with the unconditioned stimulus.

In terms of synaptic plasticity, ISO and ICO learning differ substantially. While ISO learning can be interpreted as a homosynaptic learning rule (Porr, Saudargiene, & Wörgötter, 2004), ICO learning is strictly heterosynaptic. The neuronal literature on heterosynaptic plasticity normally emphasizes that it is essentially a modulatory process that modifies (conventional) homosynaptic learning (Bliss & Lømo, 1973; Markram, Lübke, Frotscher, & Sakmann, 1997) but cannot lead to plasticity on its own (Ikeda et al., 2003; Bailey, Giustetto, Huang, Hawkins, & Kandel, 2000; Jay, 2003). However, evidence has also been found for a more direct influence of heterosynaptic plasticity in Aplysia siphon sensory cells (Clark & Kandel, 1984), in the amygdala (Humeau, Shaban, Bissière, & Lüthi, 2003), and in the limbic system (Beninger & Gerdjikov, 2004; Kelley, 2004). As a consequence, heterosynaptic learning rules have so far mostly been used to emulate modulatory processes, for example, by the implementation of three-factor learning rules trying to capture dopaminergic influence in the striatum and the cortex (Schultz & Suri, 2001). To our knowledge, ICO is the only learning rule that operates strictly heterosynaptically, which, for network learning and plasticity, might open new avenues as
compared to the well-established Hebb rules (Oja, 1982; Kohonen, 1988; Linsker, 1988; MacKay, 1990; Rosenblatt, 1958; von der Malsburg, 1973; Amari, 1977; Miller, 1996a). The tremendous stability of ICO, which is guaranteed for x_0 = 0 or can be enforced by thresholding x_0, will allow the design of stable nested or chained architectures of several ICO learning units, where the “primary” units in such an architecture are controlled by the feedback neuronal activity of the “secondary” ones. Hence, the secondary neurons in such a setup would provide the x_0 signal by way of an internal feedback loop, which takes the role of, and replaces, the behavioral feedback employed here. Not only does this shed interesting light on neuronal feedback loops like the corticothalamic loops (Alexander, DeLong, & Strick, 1986; Morris, Cochran, & Pratt, 2005), but it might also offer interesting possibilities for novel network architectures, where stability can be built into the system by way of such loops.

Like ISO learning, ICO learning develops a forward model (Palm, 2000) of the reflex reaction established by H_0, ρ_0, and P_0. The forward model is represented by the resonators and weights H_j, ρ_j, j > 0 (Porr et al., 2003). The main advantage of ICO learning over ISO learning is that it is not limited to resonators (H_j) as filters. We have shown here that simple FIR filters can be used for the filter bank instead of resonators; the required low-pass characteristic came from the environment. The FIR filter was, however, just an example. Future research has to explore systematically which linear and nonlinear filters are suitable for ICO learning.

Finding a target with a simulated or real-world device has been studied before. The oldest model, with hand-tuned fixed weights, was employed by Walter (1953), whose tortoise had to find its home cage. To find optimal weights, Paine and Tani (2004) have recently employed a genetic algorithm that is able to solve a T-maze task; their simulated robots needed 63 generations. When it comes to learning, basically two paradigms are employed: reinforcement learning and Hebbian learning. In reinforcement learning, Q-learning seems to be the learning rule of choice. Q-learning generates optimal policies to retrieve a reward, where a policy associates a sensor input with an action. The Q-value evaluates whether or not the policy leads to a reward: the higher the Q-value, the more probable is the future or immediate reward. Q-learning has been successfully applied by Bakker, Linåker, and Schmidhuber (2002) to a T-maze task. The robot has to learn that a road sign at the entrance of the T-maze gives the clue as to whether the reward is in the left or the right arm. To solve this task, the simulated Khepera robot needed 2500 episodes. Thrun (1995) also employs Q-learning to find a target. In contrast to Bakker et al., however, the robot navigates freely in an environment. This task probably comes closest to ours. Successful targeting behavior is established after approximately 20 episodes. Our robot needed approximately 15 contacts with the white disk to find it reliably. However, after 20 episodes, the success rate in Thrun’s experiment
is still poor. A further 80 episodes are needed to bring the success rate up to 90%. Our robot has a comparable success rate of 90% after these 15 contacts, given that the camera can see the disk. The different convergence speeds suggest that Thrun employed a lower learning rate.

The other learning rule that has been employed to solve targeting tasks is Hebbian learning. In Verschure and Voegtlin (1998) and Verschure et al. (2003), the robot has the task of finding targets. Similar to our robot, their robot is equipped with proximal and distal sensors. The proximal sensors trigger reflex reactions, and the task is to use the distal sensors to find the targets from a distance. In contrast to our approach, Verschure and Voegtlin employ Hebbian, not heterosynaptic, learning. In order to limit unbounded weight growth, they modified the Hebbian learning rule: Verschure et al. did this directly by adding a decay term proportional to the weight, whereas in Verschure and Voegtlin, infinite weight growth is counteracted by inhibiting the signals from the distal sensors or, in other words, the conditioned stimuli. Unfortunately, a direct comparison of the performances with our experiment is not possible because it is not clear from Verschure and Voegtlin or from Verschure et al. how many contacts with the target were needed to learn the behavior.

Touzet and Santos (2001) have systematically compared different reinforcement learning algorithms applied to obstacle avoidance. Such systematic approaches are difficult to achieve because of different hardware platforms, different environments, and different ways of documenting the robot runs. Thus, a systematic evaluation of the different learning rules will be the subject of further investigation.

Appendix: Using the z-Transform for the Convergence Proof

We describe in detail how we transformed the learning rule, equation 2.4, into the z-domain. The z-transform of a sampled (or time-discrete) signal x(n) is defined as

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n}.    (A.1)
The capital letter X(z) denotes the z-transform of the original signal x(n). The z-transform is the discrete version of the Laplace transform, which in turn is a generalized version of the Fourier transform. The original signal and its z-transform are equivalent if convergence can be guaranteed (Proakis & Manolakis, 1996). The z-transform has a couple of useful properties that simplify the convergence proof shown in section 3.2.2.
- Convolution: The z-transform can be applied not only to signals but also to filters. Filtering in the time domain means convolution of the signal x(n) with the impulse response h(n) of the filter. In the z-domain, it is just a multiplication with the transformed impulse response:

  x(n) ∗ h(n) ⇔ X(z)H(z).    (A.2)

  For example, equation 2.1 turns into U_j = X_j H_j in the z-domain, where the capital letters indicate the z-transformed functions. Once transformed into the z-domain, equations can be solved by simple algebraic manipulations. For example, equation 3.1 can be solved for X_0 by subtracting X_0 H_0 ρ_0 P_0 from both sides and then dividing the equation by 1 − ρ_0 P_0 H_0.

- Correlation: The correlation of two signals can be derived from the convolution (see equation A.2) by recalling that a correlation is just a convolution where one signal is reversed in time. Time reversal x(−n), which in the z-domain is X(z^{−1}), leads directly to a formula for correlation:

  x(n) ∗ h(−n) ⇔ X(z)H(z^{−1}).    (A.3)

- Derivative: The derivative in z-space can be expressed as an operator (Bronstein & Semendjajew, 1989):

  d/dn ⇔ (z − 1).    (A.4)

With that background, it is now possible to z-transform the learning rule, equation 2.4:

  (d/dn) ρ_j = μ u_j (d/dn) u_0  ⇔  (z − 1) ρ_j = μ U_j(z^{−1}) (z − 1) U_0(z),    (A.5)

which is equation 3.5. Note that the derivative on the right-hand side is not time-reversed because it belongs to U_0.
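As a quick numerical sanity check of properties A.3 and A.4 (our own addition, with arbitrary example signals), numpy confirms that correlation is convolution with a time-reversed kernel and that multiplication by (z − 1) acts as a one-step difference:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 0.5])
h = np.array([1.0, -1.0, 0.5])

# Equation A.3: correlating x with h equals convolving x with h reversed
# in time, i.e., X(z)H(z^-1) in the z-domain.
print(np.allclose(np.correlate(x, h, mode="full"),
                  np.convolve(x, h[::-1])))          # True

# Equation A.4: multiplying X(z) by (z - 1) corresponds in the time
# domain to the one-step difference x(n + 1) - x(n).
print(x[1:] - x[:-1])                                # [ 1.   1.  -1.5]
```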
Acknowledgments

We thank David Murray Smith for fruitful comments on the manuscript. We are also grateful to Thomas Kulvicius and Tao Geng, who constructed the mechanical arm.

References

Alexander, G., DeLong, M., & Strick, P. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9, 357–381.
Amari, S. I. (1977). Neural theory of association and concept-formation. Biol. Cybern., 26(3), 175–185.
Bailey, C. H., Giustetto, M., Huang, Y. Y., Hawkins, R. D., & Kandel, E. R. (2000). Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nat. Rev. Neurosci., 1(1), 11–20.
Bakker, B., Linåker, F., & Schmidhuber, J. (2002). Reinforcement learning in partially observable mobile robot domains using unsupervised event extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, NJ: IEEE.
Beninger, R., & Gerdjikov, T. (2004). The role of signaling molecules in reward-related incentive learning. Neurotoxicity Research, 6(1), 91–104.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232(2), 331–356.
Bronstein, I., & Semendjajew, K. (1989). Taschenbuch der Mathematik (24th ed.). Thun and Frankfurt: Harri Deutsch.
Brooks, R. A. (1989). How to build complete creatures rather than isolated cognitive simulators. In K. VanLehn (Ed.), Architectures for intelligence (pp. 225–239). Hillsdale, NJ: Erlbaum.
Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47, 139–159.
Clark, G. A., & Kandel, E. R. (1984). Branch-specific heterosynaptic facilitation in Aplysia siphon sensory cells. PNAS, 81(8), 2577–2581.
Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Mach. Learn., 14(3), 295–301.
D’Azzo, J. J. (1988). Linear control system analysis and design. New York: McGraw-Hill.
Diniz, P. S. R. (2002). Digital signal processing. Cambridge: Cambridge University Press.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23–63.
Grossberg, S. (1995). A spectral network model of pitch perception. J. Acoust. Soc. Am., 98(2), 862–879.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley-Interscience.
Humeau, Y., Shaban, H., Bissière, S., & Lüthi, A. (2003). Presynaptic induction of heterosynaptic associative plasticity in the mammalian brain. Nature, 426(6968), 841–845.
Ikeda, H., Akiyama, G., Fujii, Y., Minowa, R., Koshikawa, N., & Cools, A. (2003). Role of AMPA and NMDA receptors in the nucleus accumbens shell in turning behaviour of rats: Interaction with dopamine receptors. Neuropharmacology, 44, 81–87.
Jay, T. (2003). Dopamine: A potential substrate for synaptic plasticity and memory mechanisms. Prog. Neurobiol., 69(6), 375–390.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kelley, A. E. (2004). Ventral striatal control of appetitive motivation: Role in ingestive behaviour and reward-related learning. Neuroscience and Biobehavioural Reviews, 27, 765–776.
Klopf, A. H. (1986). A drive-reinforcement model of single neuron function. In J. S. Denker (Ed.), Neural networks for computing: AIP Conference Proceedings. New York: American Institute of Physics.
Kohonen, T. (1988). Self-organization and associative memory (2nd ed.). Berlin: Springer.
Kosko, B. (1986). Differential Hebbian learning. In J. S. Denker (Ed.), Neural networks for computing: AIP Conference Proceedings (pp. 277–282). New York: American Institute of Physics.
Linsker, R. (1988). Self-organisation in a perceptual network. Computer, 21(3), 105–117.
MacKay, D. J. (1990). Analysis of Linsker’s application of Hebbian rules to linear networks. Network, 1, 257–298.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—a decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miller, K. D. (1996a). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 55–78). Berlin: Springer-Verlag.
Miller, K. D. (1996b). Synaptic economics: Competition and cooperation in correlation-based synaptic plasticity. Neuron, 17, 371–374.
Morris, B., Cochran, S., & Pratt, J. (2005). PCP: From pharmacology to modelling schizophrenia. Curr. Opin. Pharmacol., 5(1), 101–106.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Paine, R. W., & Tani, J. (2004). Motor primitive and sequence self-organisation in a hierarchical recurrent neural network. Neural Networks, 17, 1291–1309.
Palm, W. J. (2000). Modeling, analysis and control of dynamic systems. New York: Wiley.
Porr, B., Saudargiene, A., & Wörgötter, F. (2004). Analytical solution of spike-timing dependent plasticity based on synaptic biophysics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Porr, B., von Ferber, C., & Wörgötter, F. (2003). ISO learning approximates a solution to the inverse-controller problem in an unsupervised behavioural paradigm. Neural Comp., 15, 865–884.
Porr, B., & Wörgötter, F. (2001). Temporal Hebbian learning in rate-coded neural networks: A theoretical approach towards classical conditioning. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), Artificial neural networks—ICANN 2001 (pp. 1115–1120). Berlin: Springer.
Porr, B., & Wörgötter, F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864.
Porr, B., & Wörgötter, F. (2003b). Isotropic sequence order learning in a closed loop behavioral system. Roy. Soc. Phil. Trans. Math., Phys. and Eng. Sciences, 361(1811), 2225–2244.
Proakis, J. G., & Manolakis, D. G. (1996). Digital signal processing. Upper Saddle River, NJ: Prentice Hall.
Roberts, P. D. (1999). Temporally asymmetric learning rules: I. Differential Hebbian learning. Journal of Computational Neuroscience, 7(3), 235–246.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev., 65(6), 386–408.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Suri, R. E. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comp., 13(4), 841–862.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Sutton, R. S., & Barto, A. (1987). A temporal-difference model of classical conditioning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 355–378). Mahwah, NJ: Erlbaum.
Sutton, R. S., & Barto, A. G. (2002). Reinforcement learning: An introduction (2nd ed.). Cambridge, MA: MIT Press.
Thrun, S. (1995). An approach to learning mobile robot navigation. Robotics and Autonomous Systems, 15, 301–319.
Touzet, C., & Santos, J. F. (2001). Q-learning and robotics. In IJCNN’99, European Simulation Symposium. Piscataway, NJ: IEEE.
Verschure, P., & Voegtlin, T. (1998). A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11, 1531–1549.
Verschure, P. F. M. J., Voegtlin, T., & Douglas, R. J. (2003). Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425, 620–624.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2), 85–100.
Walter, W. G. (1953). The living brain. London: G. Duckworth.
Watkins, C. J. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Wörgötter, F., & Porr, B. (2005). Temporal sequence learning, prediction and control—a review of different models and their relation to biological mechanisms. Neural Comp., 17, 245–319.
Ziemke, T. (2001). Are robots embodied? In C. Balkenius, J. Zlatev, H. Kozima, K. Dautenhahn, & C. Breazeal (Eds.), Proceedings of the First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. Lund: Lund University.
Received April 13, 2005; accepted September 28, 2005.
LETTER
Communicated by Laurent Itti
An Oscillatory Neural Model of Multiple Object Tracking

Yakov Kazanovich
[email protected]
Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Moscow Region, 142290, Russia
Roman Borisyuk
[email protected]
Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Moscow Region, 142290, Russia, and Centre for Theoretical and Computational Neuroscience, University of Plymouth, Plymouth PL4 8AA, U.K.
An oscillatory neural network model of multiple object tracking is described. The model works with a set of identical visual objects moving around the screen. At the initial stage, the model selects into the focus of attention the subset of objects initially marked as targets. The other objects are used as distractors. The model aims to preserve the initial separation between targets and distractors while the objects are moving. This is achieved by a proper interplay of synchronizing and desynchronizing interactions in a multilayer network, where each layer is responsible for tracking a single target. The results of the model simulation are presented and compared with experimental data. In agreement with experimental evidence, simulations with a larger number of targets have shown higher error rates. Also, the functioning of the model in the case of temporarily overlapping objects is presented.

1 Introduction

Selective visual attention is a mechanism that allows a living organism to extract from the incoming visual information the part that is most important at a given moment and that should be processed in more detail. This mechanism is necessary due to the limited processing capability of the visual system, which precludes the rapid analysis of the whole visual scene. Different types of attention are responsible for implementing different strategies of attention focus formation. Traditional theories characterized attention in spatial terms as a spotlight or a “zoom lens” that could move about the visual field, focusing on whatever fell within that spatial region (Posner, Snyder, & Davidson, 1980; Eriksen & St. James, 1986). More recent theories of attention state that in some cases, the underlying units of selection are discrete objects whose selection into the focus of attention
is independent of whatever location these objects occupy. This type of attention is called object-based attention (Egeth & Yantis, 1997; Scholl, 2001). For object-based attention, the limits of information processing concern the number of objects that can be simultaneously attended.

An important experimental paradigm in the study of object-based attention is multiple object tracking (MOT). In the canonical MOT experiments (Pylyshyn & Storm, 1988; Pylyshyn, 2001; Scholl, 2001), an observer views a display with m simple identical objects (up to 10 to 12 objects such as points, plus signs, or circles). A certain subset of the objects (from 1 to m/2; m is assumed to be even) is briefly flashed to mark them as targets. The other objects are considered distractors. Then all objects begin moving independently and unpredictably about the screen without passing too near each other and without moving off the display. The subjects’ task is to track the subset of targets with their eyes fixed at the center of the screen. At various times during the animation, one of the objects is flashed, and the observer should press a key to indicate whether this object is a target or a distractor. In other studies, the subject had to indicate all the targets using a computer mouse. It has been shown that trained subjects are quite efficient at performing MOT. Though the number of errors increases with an increasing number of targets, even for five targets the performance level was about 85% correct target identifications. It has been argued that the results of the experiments are in agreement with the hypothesis of resource-limited parallel processing and cannot be explained entirely in terms of a serial attention scanning process, since in the latter case, sequential jumps of a single attentional spotlight from one target to another would have to be made with impossible velocities.

A large number of later studies (Yantis, 1992; Blaser, Pylyshyn, & Holcombe, 2000; Scholl & Tremoulet, 2000; Sears & Pylyshyn, 2000; Visvanathan & Mingolla, 2002; Oksama & Hyönä, 2004; Liu et al., 2005) confirmed that the early visual system is able “to individuate and keep track of approximately five visual objects and does so without using an encoding of any of their visual properties” (Pylyshyn, 2001). Visvanathan and Mingolla (2002) investigated MOT in the case when objects are allowed to overlap one another dynamically during a trial. The experiments included conditions with and without depth cues that signal occlusion. The results show that “although the tracking task does become more difficult . . . it does not become impossible, even in the purely two-dimensional case.” When occlusion cues are added to the display, MOT performance returns to the level observed for nonoverlapping objects. Observers are better at tracking targets than identifying them; the identity tends to be lost when targets come close to each other (Pylyshyn, 2004).

In the past few years, attention has become a popular field for neural network modeling. The models of attention can be subdivided into two categories. Connectionist models (Olshausen, Anderson, & Van Essen, 1993; Tsotsos et al., 1995; Moser & Silton, 1998; Grossberg & Raizada, 2000; Itti &
Koch, 2000, 2001) are based on a winner-take-all procedure and are implemented through modification of the weights of connections in a hierarchical neural network. Such models are difficult to use in the case of moving objects, since the networks have to work in the space of the visual field; therefore, for any new position of the objects, the weights of the connections would have to be recomputed. Another type of attention model is represented by oscillatory neural networks (Kazanovich & Borisyuk, 1999; Wang, 1999; Corchs & Deco, 2001; Borisyuk & Kazanovich, 2003, 2004; Katayama, Yano, & Horiguchi, 2004). These are more suitable for object-based attention because the network operates in phase-frequency space, which makes the attention focus invariant to the locations of objects in physical space.

In this letter, we present a neural network model of MOT based on the principles of oscillation synchronization and resonance. As far as we know, this is the first model of MOT and the first example of an oscillatory neural network applied to processing scenes with moving objects. The MOT model design is based on our earlier published attention model with a central oscillator (AMCO) (Kazanovich & Borisyuk, 1999, 2002, 2003; Borisyuk & Kazanovich, 2003, 2004). Each element of AMCO is an oscillator whose dynamics are described by three variables: the phase of oscillations, the amplitude of oscillations, and the natural frequency of the oscillator. The interaction between oscillators is implemented in terms of phase locking, resonance, and adaptation of the natural frequency. AMCO has a star-like architecture of connections. It contains a one-layer network of locally coupled oscillators, the so-called peripheral oscillators (POs), whose dynamics are controlled by a special central oscillator (CO) through global feedforward and feedback connections (Kryukov, 1991). The POs represent the columns of cortical neurons in the primary visual cortex (areas V1–V2) that respond to specific local features of the image. For simplicity, we use the contrast between the intensities of a pixel and the background as such features. The CO plays the role of the central executive of the attention system (Baddeley, 1996; Cowan, 1988). In AMCO, isolated objects are represented by synchronous assemblies of POs, and the focus of attention is formed by those POs that work synchronously with the CO.

In the psychological literature, the central executive is considered a system that is responsible for attentional control of working memory (Baddeley, 1996, 2002, 2003; Shallice, 2002). The question of where the central executive is localized in brain structures remains open. For a long time, the functions of the central executive were attributed mostly to a local region in the prefrontal cortex (D’Esposito et al., 1991; Loose, Kaufmann, Auer, & Lange, 2003), but later studies have shown that the central executive may be represented by a distributed network that includes lateral, orbitofrontal, and medial prefrontal cortices linked with motor control structures (Barbas, 2000; Andres, 2003). Recent neuroimaging data show that
“different executive functions do not only recruit various frontal areas, but also depend upon posterior (mainly parietal) regions” (Collette & Van der Linden, 2002). Daffner et al. (1998) show the relative contribution of the frontal and posterior parietal regions to the differential processing of novel and target stimuli under conditions in which subjects actively directed attention to novel stimuli. The prefrontal cortex may serve as the central node in determining the allocation of attentional resources to novel events, whereas the posterior parietal lobe may provide the neural substrate for the dynamic process of updating one’s internal model of the environment to take a novel event into account. There is some evidence that in addition to neocortical areas, the hippocampus may play an important role in implementing central executive functions: the hippocampus has the final position in the pyramid of cortical convergence zones (Damasio, 1989), participates in controlling the processing of information in most parts of the neocortex (Hölscher, 2003), and coordinates the work of the attention system (Vinogradova, 2001; Herrmann & Knight, 2001; Duncan, 2001).

An important property of a network with the star-like architecture is its relatively small number of connections (of the order of n, where n is the number of elements in the system) in comparison with systems with all-to-all connections (where the number of connections is of the order of n²). This makes systems with a central element biologically plausible and technically feasible even for large n. Kazanovich and Borisyuk (2002) and Borisyuk and Kazanovich (2004) showed that by combining synchronization and resonance in AMCO, it is possible to select objects into the attention focus one by one in the order determined by the saliency of the objects; more salient objects have the advantage of earlier selection.

In section 2 we give a short description of AMCO and show how it can be used to track a single target moving among a set of distractors. The main idea of tracking m targets is to use a network that consists of m interacting copies of AMCO, with each copy tracking one particular target. When implementing this idea, the following problems have to be solved. First, one should prevent the situation where the same target is simultaneously tracked by several AMCO copies. Second, the model should be able to operate when objects intersect during their movements. Since the objects are identical and assumed to be moving randomly and unpredictably, there is no possibility of errorless identification of a target after it has been occluded by a distractor. In this case, the best strategy for the attention system is to keep in the focus of attention either of the two objects that have just separated. Nor is it possible to recall the identities of two target objects after they have occluded each other, but the attention system should be able to track them both after separation. This strategy allows the attention system to hold exactly m isolated objects permanently, which is important in order to prevent the multiplication of attended objects that would otherwise take place.
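To make the star-like architecture concrete before the detailed description in section 2, here is a minimal Kuramoto-style phase sketch of one central oscillator coupled to n peripheral oscillators. It illustrates only the O(n) connection structure: the actual AMCO dynamics (equations A.1–A.4 of the paper’s appendix) also include amplitude dynamics, resonance, and natural-frequency adaptation, and all parameter values here are illustrative.

```python
import numpy as np

def star_step(phi_c, phi_p, omega_c, omega_p,
              w_sync=1.0, w_desync=-0.5, dt=0.01):
    """One Euler step for a star-coupled phase network: n peripheral
    oscillators (phases phi_p) connect only to the central oscillator
    (phase phi_c), so the number of connections grows as O(n) rather
    than O(n^2). The positive feedback weight pulls the CO toward the
    POs; the negative feedforward weight stands in for AMCO's
    desynchronizing CO -> PO connections.
    """
    dphi_c = omega_c + w_sync * np.mean(np.sin(phi_p - phi_c))
    dphi_p = omega_p + w_desync * np.sin(phi_c - phi_p)
    return phi_c + dt * dphi_c, phi_p + dt * dphi_p

# n = 100 peripheral oscillators with random initial phases:
rng = np.random.default_rng(1)
phi_c, phi_p = 0.0, rng.uniform(0, 2 * np.pi, 100)
for _ in range(1000):
    phi_c, phi_p = star_step(phi_c, phi_p, omega_c=5.0, omega_p=5.0)
```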
[Figure 1 schematic: a central oscillator above a layer of peripheral oscillators, which receive the input image.]
Figure 1: AMCO architecture. The hollow arrow shows the assignment of natural frequencies of POs. Black arrows show synchronizing connections that are used to (1) bind the POs coding an object into a synchronous assembly and (2) synchronize an assembly of POs with the CO. The gray arrow shows desynchronizing connections that are used to avoid simultaneous synchronization of the CO with several assemblies of POs.
In section 3 we describe how the interacting AMCO layers are combined in the MOT model. In section 4 the results of computer simulations of MOT are presented. Section 5 contains the discussion.

2 Tracking a Single Target

The architecture of AMCO is shown in Figure 1. The input to the network is a grayscale image on a plane grid of the same size as the grid of POs. Each PO receives an external signal from the pixel whose location on the plane is identical to the location of the PO. We use a conventional coding scheme in which the external signal determines the natural frequency of oscillations (Niebur, Kammen, & Koch, 1991; Kuramoto, 1991). It is assumed that the external signal is formed in the lateral geniculate nucleus and depends on the contrast between the intensities of the pixel and the background. The value of the natural frequency of a PO is given by the formula ω_i = λ(B − I_i), 0 ≤ I_i ≤ B, where ω_i is the natural frequency of the ith PO, I_i is the gray level of the ith pixel, B is the gray level of the background, and λ is a scaling parameter. Thus, the natural frequency is higher if the contrast is higher.
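A small numpy sketch of this frequency-coding step follows; λ is chosen here so that black pixels on a white background reproduce the ω_i = 5 used in the simulations below, but the function name and the 8-bit gray scale are our own assumptions.

```python
import numpy as np

def natural_frequencies(image, background=255.0, lam=5.0 / 255.0):
    """Map gray levels I_i to PO natural frequencies omega_i = lam * (B - I_i),
    0 <= I_i <= B: the higher the contrast against the background, the
    higher the natural frequency. Background pixels get omega = 0 and
    correspond to silent oscillators.
    """
    image = np.asarray(image, dtype=float)
    return lam * (background - image)

# A 25 x 50 field with one black 7 x 7 square on a white background:
img = np.full((25, 50), 255.0)
img[5:12, 10:17] = 0.0
omega = natural_frequencies(img)
print(omega.max())   # 5.0, i.e., 50 Hz with a 100 msec time unit
```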
The POs corresponding to the pixels of objects are called active, and their dynamics are determined by equations A.1 to A.4 in appendix A. The POs corresponding to the pixels of the background are called silent. While an oscillator is silent, it does not participate in the dynamics of the system. The POs have synchronizing local connections with their nearest neighbors. These connections are used to bind the POs representing an isolated object into a coherent assembly, in agreement with the synchronization hypothesis of feature binding (Singer, 1999). The CO has desynchronizing feedforward and synchronizing feedback connections to each PO. The synchronizing connections are used to phase-lock the CO to an assembly of POs. The desynchronizing connections are used to segregate different assemblies of POs in frequency space, preventing simultaneous synchronization of the CO with several assemblies. The interplay between the synchronizing and desynchronizing connections of the CO results in competition among different assemblies of POs for synchronization with the CO. Only one assembly of POs can win this competition; therefore, at each moment only one object can be attended in AMCO (excluding short transitory periods). The POs representing this object work synchronously with the CO, which results in resonance: the amplitudes of these oscillators rapidly increase to a high level. The amplitudes of the other POs (which do not work coherently with the CO) are shut down to a low level. A sufficiently high amplitude of a PO is taken as the criterion that this oscillator is included in the focus of attention.

The amplitude of the CO in AMCO is a constant. Since the CO can be phase-locked only by those POs whose natural frequencies lie in some range around its own natural frequency, the natural frequency of the CO is adapted toward its current oscillation frequency. As a result, the natural frequency of the CO becomes equal to the current frequency of an assembly of POs. Such adaptation allows the CO to “search” for an assembly of POs that is an appropriate candidate for synchronization.

Our previous publications on AMCO have dealt with stationary objects only. Images with moving objects are more difficult to process, since the attention system should be able to react properly to changes in object locations and also to the intersection and separation of objects during their movements. We illustrate the functioning of AMCO in the case of a visual field of size 25 × 50 pixels that contains nine black squares of size 7 × 7 on a white background. All pixels that belong to the squares receive the same illumination and therefore have the same contrast relative to the background. To comply with the timescales typically used in MOT experiments, we have set the time unit equal to 100 msec and used gamma oscillations as the range of working frequencies. In computations, the natural frequencies of all active POs were set to ω_i = 5 (5 oscillations in 100 msec), which corresponds to a frequency of 50 Hz. The amplitudes of active POs have the initial value 2 and vary in the range (0, 11). The threshold for a resonant
Figure 2: Single object tracking. Processing an image with a single target and eight distractors. Attended pixels are black, nonattended pixels of objects are gray, and pixels of the background are white.
amplitude is R = 8.8. If the amplitude of a PO exceeds R, the corresponding pixel is assumed to be included in the focus of attention.

Figure 2 displays movie frames of the dynamics of the image at the moments (in seconds) 0, 0.4, 0.8, . . . (the time interval between frames is four time units). The frames are ordered from left to right and from top to bottom. The top-left frame shows the initial position of the squares. Later
the squares move around, each step in one of four directions (up, down, left, right) chosen at random with probability 0.25. A new direction is chosen at the moments (in seconds) 0.1, 0.2, 0.3, . . . A movement is omitted if it would lead to crossing the border of the visual field.
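The movement rule can be summarized in a few lines; this sketch handles only the border test, leaving out the minimum-separation test between squares used later in section 4, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def move_square(pos, field=(25, 50), size=7):
    """Move a square's top-left corner one pixel up, down, left, or right,
    each direction with probability 0.25; the movement is omitted if it
    would carry the square across the border of the visual field.
    """
    dy, dx = ((-1, 0), (1, 0), (0, -1), (0, 1))[rng.integers(4)]
    y, x = pos[0] + dy, pos[1] + dx
    if 0 <= y <= field[0] - size and 0 <= x <= field[1] - size:
        return (y, x)
    return pos        # movement omitted at the border

pos = (9, 21)
for _ in range(40):   # one new direction every 0.1 sec of model time
    pos = move_square(pos)
```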
Figure 3: Amplitudes of POs at the moments (top) t = 11.6 (second frame in the last row of Figure 2) and (bottom) t = 12.0 (third frame in the last row of Figure 2).
In Figure 2, attended pixels are shown as black, nonattended pixels are gray, and pixels of the background are white. Initially, the squares are regularly distributed in the image. Due to the movements, the distribution of the squares over the field becomes random, and complex objects appear that represent different combinations of overlapping squares. At the initial moment, no object is under attention. After a short lag, attention is automatically focused on a randomly chosen square; in the case shown in Figure 2, it is the square in the middle of the image. The focus of attention is firmly attached to an object while this object moves in isolation from other objects. But quite soon, the attended square crosses a complex object formed by several overlapping squares (this situation is shown in the fourth frame of the first row). Once the attended object is included in a cluster of overlapping objects, attention gradually spreads to the pixels surrounding it. If a complex formed from attended objects later splits into two isolated objects, attention is focused on one of them. Such movement of the attention focus along the image is reminiscent of passing a baton in a relay race, with the only difference being that the baton passing has a probabilistic nature.

One may think that the focus of attention nearly disappears in the second frame of the bottom row of frames and magically reappears in the next frame. In fact, there is no magic here. The amplitudes of the oscillators in the attended area briefly fall slightly below the threshold R (see Figure 3, top), but soon afterward they again exceed the threshold (see Figure 3, bottom). At this moment, the two attended squares become separate, and attention is focused on one of them. Note that in Figure 3, the columns in the diagram whose amplitudes rise abruptly above the surrounding pixels correspond to pixels that the squares have only recently entered.

3 The Model of Multiple Object Tracking

The architecture of the MOT network is shown in Figure 4. This architecture corresponds to the case of three targets (in general, the number of layers is the same as the number of targets); therefore, three layers of AMCO are shown as the components of the model. We consider these layers to be attentional subsystems. Each subsystem should track its own target. For convenience of reference, the layers are indexed by different colors: red, green, and blue. The equations of the dynamics of the network are shown in appendix B for the general case of m targets. The POs that occupy the same location on the plane but belong to different layers form a column. The POs in a column are bound by strong all-to-all synchronizing connections. All POs in a column receive the same external signal from the corresponding pixel of the visual field. As in the case of a single AMCO, the external signal codes the contrast between the intensities of the given pixel and the background. This signal determines the values
[Figure 4 schematic: three AMCO layers (layer 1, red; layer 2, green; layer 3, blue), each with its own CO, linked by synchronizing and desynchronizing connections.]

Figure 4: Architecture of the network for MOT. The layers of the network are indexed by different colors. Each layer of the network is responsible for tracking a single target.
of the natural frequencies of POs; therefore, the natural frequencies of the oscillators in a column are identical, which results in rapid synchronization of POs in a column. The local connections between POs in a layer are restricted to the nearest neighbors. These connections are synchronizing. They are strong enough to synchronize the columns of POs that belong to an isolated object. Thus, a coherent assembly of POs is formed in the network in response to stimulation by an individual object in the visual field. The COs belonging to different layers are bound by desynchronizing connections. Such connections are introduced to prevent the synchronization of different COs with the same synchronous assembly of POs. As a result, nonoverlapping targets are coded in the network by noncoherent oscillatory activity of different assemblies of POs. As in the case of a single target, if an assembly of POs in the kth layer works synchronously with the CO in this layer, the amplitudes of these oscillators go up and exceed the threshold for the resonance. If all POs of
the kth layer corresponding to the pixels of an object are in the resonant state, this is interpreted as meaning that this object is included in the focus of attention of the kth attentional subsystem.

If objects move slowly enough and do not overlap during their movements, the attention focus (after being formed) is rather stable due to the resonance of the POs included in it. Resonant oscillators have a much stronger influence on the CO of their layer, which prevents a jump of attention to an assembly of nonresonant oscillators. If the speed of object movements is high relative to the rate of the processes of synchronization and resonance, attention can spontaneously switch from one object to another. This results in errors in distinguishing between targets and distractors.

Consider what happens to the attention focus if two objects cross each other. If both objects are unattended, no change of the focus of attention takes place: temporarily a complex distractor (composed of two objects) is formed, but this has no effect on objects in the focus of attention. When attended and unattended objects overlap, attention is spread to the whole composite object because all POs belonging to this object will be included in a common assembly of synchronized oscillators. This assembly will work synchronously with the same CO that was synchronous with the attended assembly of POs before the overlap. After becoming separate again, the objects renew their competition for being included in the focus of attention. Due to the desynchronizing influence of the CO on the POs, only one of the two objects is able to win the competition. Since the whole composite object has been temporarily attended, the system has no information to detect which object had been attended before the intersection took place. In this situation, either of the two objects can be newly selected into the focus of attention. The choice is random; therefore, it may lead to an error in target identification with probability 0.5.

If two attended objects overlap, both continue to be in the focus of attention despite the desynchronizing connection between the COs. This is achieved by making the desynchronization weak enough relative to the synchronizing influence that comes to both COs from the common assembly of synchronous POs. When the objects move apart, the desynchronizing connection between the COs renews its influence on these oscillators. As a result, each object will again be tracked by its own AMCO layer. Of course, it is possible that objects, say A and B, that before the intersection have been tracked by, for example, the “red” and “green” layers, will exchange their tracking subsystems: after separation, the “red” layer will be used to track object B, and object A will be tracked by the “green” layer. Whether this exchange happens depends on how extensive the overlap has been and how long it has lasted.

In fact, the interplay between the synchronizing and desynchronizing influences on a CO is even more intricate than the one we have just described. Computational experiments have shown that a constant desynchronizing
interaction between COs cannot ensure the proper behavior of the COs. One type of error appears if the desynchronizing interaction between the COs is too strong. In this case, a CO may lose synchronization with its assembly of POs at the moment when two attended objects overlap. As a result, only one CO will maintain synchronization with the composite object formed by the two previously attended objects. Another type of error may appear if the strength of desynchronization between the COs is too weak. In this case, two COs may maintain synchronization with the same object after the separation of two simultaneously attended objects. In fact, no constant value of the connection strength between the COs allows the system to avoid one or the other of these errors. But the problem can be solved if the interaction between the COs increases after two attended objects overlap. This has been done by using the idea of resonance, but applying it now to the amplitudes of the COs. Therefore, in the MOT model the amplitudes of the COs are no longer constant. A resonant increase of the amplitude of a CO takes place if two attended objects cross each other (see equation B.5 in appendix B). According to the last term in equation B.1, this results in an increase of the strength of desynchronization between the COs that track these objects. Therefore, at the moment of separation of the objects, the strength of desynchronization will be high enough to prevent the situation when both COs track one object, leaving the other outside the focus of attention.

To follow the experimental conditions of MOT, the model should not only track moving objects but should also make a proper choice of the set of targets at the initial phase of MOT. In the experiments, targets are indicated to the observer by a brief flash of light. In the model, the notion of saliency is used to formalize the choice of flashed objects into the focus of attention. It is assumed that flashed objects are more salient than other objects and that this leads to automatic attraction of attention to these objects. In the model, the saliency of a pixel scales the strength of the influence that the corresponding POs exert on the COs. Thus, more salient objects have an advantage in being included in the focus of attention.

The idea of the saliency map was introduced by Koch and Ullman (1985) and has been intensively used in computational models of visual search (Itti, Koch, & Niebur, 1998; Itti & Koch, 2000, 2001; Olshausen et al., 1993). The saliency map is a two-dimensional table that encodes the saliency of objects in the visual environment and determines the priority of their choice into the focus of attention. In the MOT model, the saliency map is formed as a set of parameters s_i that correspond to the pixels of the image and determine the strength of influence of POs on COs, as shown in equation B.1 in appendix B. To reflect the difference in saliency between flashed and nonflashed objects, the saliency s_i takes one of two positive values: a higher value S_flashed corresponds to the pixels of flashed objects, and a lower value S_nonflashed corresponds to the pixels of nonflashed objects. For the pixels of the background, s_i = 0. The value of S_flashed should be several
times higher than S_nonflashed to provide the assemblies of POs that represent flashed targets with a much better chance of winning the competition for synchronization with the COs than the assemblies corresponding to nonflashed distractors. In computer simulations, we put S_flashed = 5 and S_nonflashed = 0.2. When flashing is over, all objects become equally salient. This is reflected in the model by making all values of s_i for the pixels of objects identical. In simulations, s_i = 1.

4 Model Simulation and Comparison with Experimental Data

Two types of MOT model simulations are considered below, corresponding to movements without and with overlap, respectively. Computations of the first type are used to compare the performance of the model with recent experimental data of Oksama and Hyönä (2004). In simulations of the second type, overlapping objects are used to demonstrate that the model follows the procedure described in section 3.

Oksama and Hyönä (2004) experimented with a set of 12 objects. To accelerate computation, we reduced the number of objects to 10, as in the experiments of Pylyshyn and Storm (1988). This does not significantly affect the results. In simulations, objects are black squares of size 7 × 7 pixels on a white background in a field of size 30 × 60 pixels. As in section 2, the pixels of the squares are coded in the network by the natural frequencies of POs equal to 50 Hz. Tracking of k targets (2 ≤ k ≤ 5) implies that a network with k layers is used. The timescale for simulations has been chosen as in section 2.

A single run of the model takes 7.2 sec and consists of three phases. The first phase takes 0.7 sec and is used for marking the targets. During this period, objects are motionless; the only distinction between targets and distractors is their saliency, as explained in section 3. The desirable result at the end of this phase is the focusing of attention on the targets so that each target is attended by one attentional subsystem. An error may appear if two attentional subsystems are focused on the same target or if a distractor is chosen in the focus of attention of one of the subsystems. Computations have shown that the probability of such errors is less than 0.005; therefore, the initial acquisition of targets in the focus of attention is nearly errorless.

The second phase lasts 6 sec. This is when objects move in a random manner. The speed of motion is 1 pixel per 50 msec; that is, every 50 msec, all squares make a movement of 1 pixel length in one of the four directions: up, down, left, or right. For each square, the direction of the movement is chosen randomly and independently with probability 0.25. To prevent collisions, the motion is subject to the restriction that the squares should always be separated by at least one pixel of the background. First, the horizontal or vertical direction of the movement is chosen with probability 0.5. Then, for horizontal movements, the left or right direction is taken with
probability 0.5, and for vertical movements, the up or down direction is taken with probability 0.5. If there is a danger of collision, the direction of the movement is reversed. If the danger of collision exists for both opposite directions, no movement is undertaken at the current moment. The same rules are applied to prevent the squares from crossing the borders of the visual field.

The third phase takes 0.5 sec. During this phase, all movements are stopped. This time is given to the system to resolve an ambiguous situation when several objects are simultaneously attended by an attentional subsystem. Such a situation appears from time to time while objects are moving. When objects are stationary, the time interval of 0.5 sec is long enough for the attentional subsystem to choose which of these objects should be kept in the focus of attention. Other objects are automatically excluded from the focus of attention as a result of the desynchronizing influence of the CO on the POs in the corresponding layer.

At the final moment of the third phase, the number of object identification errors is registered. According to the principles of system design and functioning, at this moment exactly k squares are attended by the network with k layers. Some of the attended squares are targets, but some of them may be distractors due to errors in attention focusing during object movements. Therefore, two types of error may appear: a target is identified as a distractor, or a distractor is identified as a target. According to the strategy implemented in the model, the number of errors is always even: if there is an error in attending a target (a target is missed by all attentional subsystems), this inevitably results in taking a distractor into the focus of attention.

To estimate the performance of the model, we executed 50 runs of the system for each target set size k = 2, 3, 4, 5. The results of computations are shown in Table 1. The analysis of variance test (ANOVA) has been used to check whether the means of 50 trials in the four groups (corresponding to different numbers of targets) differ. The null hypothesis is that all the groups are drawn from the same probability distribution (or from different distributions with the same mean). The result is F = 43.7 with a p-value less than 0.0001; therefore, the results of our simulations do not support the null hypothesis. Further analysis using pairwise T-tests gives the following results: T_23 = 4.6, T_34 = 3.7, T_45 = 3.1. These values do not support the null hypothesis that the mean of group k equals the mean of group (k − 1) for k = 3, 4, 5. Therefore, the alternative hypothesis that the mean in group k is larger than the mean in group (k − 1) is supported.

In experiments, Oksama and Hyönä (2004) estimated human performance by using a probe object that had to be identified by the subject as a target or a distractor. To exclude a bias in guessing, a probe object was selected with probability 0.5 from the set of targets and from the set of distractors. In contrast to Oksama and Hyönä, we have not made direct computational experiments with probe objects but estimated the probability of error in probe identification based on the number of errors in each run.
Table 1: Results of Identification of Targets and Distractors in the MOT Simulation.

                                                  Target Set Size
                                                  2        3        4        5
Number of errors                                  2        38       86       134
Mean number of errors per trial                   0.04     0.76     1.72     2.68
Standard deviation                                0.3      1.1      1.5      1.6
Probability of error checked by probe objects     0.006    0.09     0.179    0.268
Let s be the number of objects (s = 10), k the number of targets (k = 2, 3, 4, 5), and e the number of targets that have been mistakenly identified as distractors in a run of simulations (hence, the same number of distractors has been mistakenly identified as targets). Then the probability of error when checked by a probe object is

$$P = \frac{1}{2}\left(\frac{e}{k} + \frac{e}{s-k}\right) = \frac{0.5\,se}{k(s-k)}.$$
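To make this bookkeeping concrete, here is a minimal sketch in Python (with numpy; function and variable names are ours, not from the original code). The k = 2 column of Table 1 corresponds to a single run with e = 1 among 50 otherwise error-free runs:

```python
import numpy as np

def probe_error_probability(e, k, s=10):
    """P = 0.5 * (e/k + e/(s-k)): chance that a random probe (drawn with
    probability 0.5 from the targets and 0.5 from the distractors) is
    misidentified, given e misidentified targets in the run."""
    return 0.5 * (e / k + e / (s - k))

errors_per_run = [1] + [0] * 49   # two errors in total over 50 runs
P = np.mean([probe_error_probability(e, k=2) for e in errors_per_run])
print(round(P, 3))                # 0.006, the value reported in Table 1
```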
Using this formula, we computed the values of P for each run and averaged these values over all simulation trials. The results are presented in the last row of Table 1 and in Figure 5. For comparison, Figure 5 also shows the accuracy of humans in MOT (Oksama & Hyönä, 2004). Both error patterns in Figure 5 show similar behavior: the probability of error increases with a larger number of targets. The main difference is that the probability of error for the case of two targets is underestimated in simulations in comparison to the experimental data.

To illustrate the functioning of the system in the case when objects are permitted to overlap during their movements, we present a computational experiment in which six identical objects (three targets and three distractors) move in the visual field. Again, objects are black squares of size 7 × 7 pixels on a white background. The size of the field is 19 × 62 pixels. A session of simulation is divided into two phases. The first, short phase (1.2 sec) is used to mark the targets and select them into the attention focus. In this phase, all squares are isolated and motionless. The selection is done exactly in the same way as in the case with nonoverlapping objects. During the second phase, lasting 9.6 sec, objects move according to the same rules as in the case of a single target (see section 2).
Figure 5: Accuracy of probe object identification by the MOT model in comparison with humans. Error rates (probability of errors) are shown as a function of the number of targets tracked (from 2 to 5). Experimental data are taken from Oksama and Hyönä (2004).
The movie frames in Figure 6 illustrate the dynamics of the image and the process of attention focus formation and switching. The frames are ordered from left to right and from top to bottom. The time interval between the frames is four time units (0.4 sec). The top-left frame shows the initial position of objects in the image. The next frame shows a moment during the exposure time when the attention focus with three targets is formed. Later, objects begin their random movements, which lead to the formation of different combinations of overlapping objects.

Colors in Figure 6 do not reflect the colors of the squares in the image (we have already mentioned that all squares in the image are black). Colors are used to show which objects are under the attention of different attentional subsystems (network layers). The color of a pixel depends on the state of the oscillators in the column that corresponds to this pixel. If a PO in the "red" layer is in the resonant state, then the pixel is red. A similar principle of color assignment is applied to green and blue colors. If several POs in a column are in the resonant state, the color of the pixel is a mixture of the basic colors. For example, the pixel is cyan if it is simultaneously attended in the "green" and "blue" layers.
Figure 6: Multiple object tracking. Processing an image with three targets and three distractors. Each pixel is painted in red, green, blue, or a mixture of these colors. A pixel is red/green/blue if a “red”/“green”/“blue” oscillator in the column that corresponds to this pixel is in the resonant state. A pixel is cyan if both oscillators in the “blue” and “green” layers are in the resonant state. A pixel is black if no oscillator in the column is in the resonant state. A pixel is white if it belongs to the background.
Black is used for nonattended pixels of objects, and white is used for the pixels of the background. The intensities of green, red, and blue colors are such that their combination in one pixel results in gray. (In fact, there are no such pixels in Figure 6.)

Consider what happens to the attention focus while the squares are moving. The pair of squares in the middle of the image is always outside the
attention focus. These squares are not attended in either case: when they move as isolated objects or when they temporarily form a composite object. The POs representing the pixels of these objects always work with low amplitude.

The pair of squares in the right part of the image shows the process of attention switching in the case when a complex object is formed due to overlap of attended and nonattended squares. If the time of overlap is short, then after separation of the squares, attention correctly settles on the square that was under attention before. If the area of overlap becomes large and the composite object exists for more than a very short time, attention is spread over the entire composite object. Therefore, after separation of the objects, either of them may be kept in the focus of attention, while the other is excluded from the focus of attention. In fact, the functioning of the MOT model in this case is no different from how AMCO works in the case of a single target. Computational experiments confirm that the much more intricate architecture of the MOT model does not lead to any additional complications in tracking a single target among distractors by a layer of the network.

Finally, let us consider how the pair of squares in the left part of the image is processed by the system. Both squares are marked as targets and selected into the focus of attention. While these squares move separately, the attentional subsystem (network layer) tracking each object does not change; therefore, the colors of the squares (green and blue) in Figure 6 are kept unchanged during this period. The situation changes after a large enough area of overlap between the squares appears. This area is painted in cyan because it is simultaneously attended by two attentional subsystems, "green" and "blue." Due to object movements, the size of the cyan area may increase or decrease; it may disappear if the overlapping area is too small and reappear when the overlapping area becomes large enough. The important fact is that as soon as the squares separate in the image field, each of them is tracked by a single attentional subsystem. The attentional subsystems may exchange targets only if at some moment the overlap of the squares does not allow reliable identification of each square in the composite object.

A thorough quantitative investigation of error rates in MOT with overlapping objects is in progress. Preliminary simulations have shown that error rates essentially depend on the speed of object movements. If the speed is low enough, as in the experiment presented above, the probability of error (except for those errors that are inevitable after strong overlap of a target and a distractor) is rather low.

5 Discussion

Three main ideas are combined in the presented model of attention. First, oscillations and synchronization are used as a key mechanism in attention focus formation. The evidence that oscillatory activity and long-range
phase synchronization play a major role in the attentional control of visual processing has been provided by studies of EEG (Herrmann & Knight, 2001), MEG (Sokolov et al., 1999; Gross et al., 2004), and local field potential recordings (Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Fries, Schroeder, Roelfsema, & Engel, 2002; Niebur, Hsiao, & Johnson, 2002; Fell, Fernandez, Klaver, Elger, & Fries, 2003; Liang, Bressler, Ding, Desimone, & Fries, 2003). In particular, it has been shown that modulation of neural synchrony determines the outcome of an attention-demanding behavioral task (Gross et al., 2004; Tallon-Baudry, 2004).

Second, it is assumed that attention focusing is controlled by a special neural system, the so-called central executive (Baddeley, 1996; Cowan, 1988). In terms of the model, visual attention is normally characterized by synchronous activity of an assembly of neurons that represent the central executive and an assembly of neurons in the primary areas of the visual cortex. MOT represents a special situation when attention is distributed among several isolated objects. It is supposed that in this case, the central executive is split into several subsystems whose activity is desynchronized. Thus, synchronous oscillations are used to label different objects that are simultaneously included in the focus of attention.

Third, resonance is used to formalize the assumption that attention enhances neural activity in attended areas and inhibits responses to nonattended stimuli. The relation between resonance as it is used in the model and attentional modulation of cortical activity should be explained in more detail. Though electrophysiological (Motter, 1993; Roelfsema, Lamme, & Spekreijse, 1998) and functional imaging studies (Somers, Dale, Seifert, & Tootell, 1999; Kanwisher & Wojciulik, 2000; Seifert, Somers, Dale, & Tootell, 2003) have shown that attentional modulation can be found as early as the primary visual cortex, it is known that the strength of attentional effects increases as one moves up the cortical processing hierarchy (Treue, 2003). Moreover, in fMRI studies of activation produced by attentive tracking of moving objects, Culham et al. (1998) found no enhancement in early visual cortex but showed that the signal more than doubles in parietal and frontal areas.

How do these experimental results square with the model? In the model, direct connections are used from POs to COs. This is a radical simplification of the real situation. In fact, there are many intermediate cortical structures in the pathway from the striate cortex to the higher regions occupied by the central executive. By the time the flow of information reaches its final station, the difference in the activity of its attended and unattended components becomes quite clear. Still, this difference is not as large as the difference between resonant and nonresonant oscillations in the model. Therefore, one should not think literally that the amplitude of oscillations in the model is a relevant characteristic of the activity in the cortex as it is observed, for example, in fMRI studies. The amplitude of oscillations should be considered a formal variable that positively correlates with the activity in the cortex and
determines the strength of interaction between cortical oscillators and the central executive.

The theoretical explanation of MOT is based on the idea of preattentive assignment of an index to the objects tracked (Pylyshyn & Storm, 1988; Pylyshyn, 2001). It is supposed that this indexing can occur independently and in parallel at several places in the visual field. In contrast to this theory, in our model indexing is implemented in two stages, which include both preattentional and attentional levels. During the first stage, an oscillatory label is assigned to each object. All information related to a given object is coded in the form of synchronous (coherent) oscillations, while oscillations corresponding to different objects are incoherent. Although the oscillatory label is not constant but varies with time, it provides a reliable mechanism for distinguishing among identical targets. The oscillatory label is used in the second stage, when an attentional subsystem is associated with a single target, the one that is tracked by this subsystem. The number of this subsystem can be considered the index of the target. The difference between attended and unattended objects is realized in the form of synchronization or desynchronization of the activity of assemblies of POs with the COs.

The model has a rigid architecture of connections and predetermined interaction functions and parameters. Real biological systems for MOT must be much more flexible. A question that may arise is how a system with a fixed number of layers can adapt to tracking different numbers of targets. A trivial solution is to assume that the number of active COs is controlled by internal effort and is always set equal to the number of targets tracked. We think that a more plausible solution is a flexible type of interaction among the COs. Suppose that the number k of COs is restricted (according to the experimental evidence, k = 5 is a reasonable upper bound for the number of targets tracked simultaneously), but the type of interaction between the COs can vary depending on the task. If one target should be tracked, all the COs are bound into a synchronous assembly by synchronizing interaction among the COs. In the case of two targets, two assemblies of COs are formed, with synchronizing interaction among the COs within an assembly and desynchronizing interaction between the assemblies. This situation is equivalent to the case of the model with two COs. Similarly, any number of targets up to five can be tracked by grouping the COs into a proper number of assemblies.

The dependence of the number of errors in MOT on the number of targets was the reason for associating MOT with resource-limited parallel processing (Pylyshyn & Storm, 1988). Our model presents an alternative explanation of this phenomenon. Although the processing of information in our model is purely parallel, computer simulations have shown that tracking becomes poorer as the number of targets increases. This is caused by the limited capacity of the phase space in which several central oscillators have to operate simultaneously. Increasing the number of
central oscillators makes it more and more difficult for them to avoid temporary synchronization, which may result in unpredictable jumps of attention to nontarget objects. If the number of targets is below five, the probability of such jumps is very low for stationary objects, but it significantly increases when objects move at high speed. During movements, the processes of synchronization and resonance do not have enough time to proceed fully, and this results in the loss of synchronization between a CO and a previously selected object.

Although we have chosen the parameters of the model in such a way that it closely follows the real-time relations in the experiments with MOT, one should not take this fact too seriously. The model is too simple to be realistic in this respect. The very small number of pixels used for object representation, the restriction of PO interactions to nearest neighbors, and many other features of the model are dictated by the need to complete computations in a reasonable amount of time. In fact, the model is rather flexible in its operation time. Another choice of parameters, or even of the duration of the time unit, can lead to other time relations. Therefore, when comparing the performance of the model with human error rates, we used the experimental data averaged over time periods of 5, 9, and 13 sec. The data for 5 sec in the experiments with humans give lower values than those obtained in our simulations, but the pattern of error probabilities is the same. An improvement of the model's timescale representation is planned. In particular, the results of Oksama and Hyönä (2004) on the nonlinear variation of error rates with the duration of trials are a challenge for future development.

The decision-making strategy used in the model is also oversimplified. The model is forced to always track a fixed number of objects. The experiments show that human strategy is more clever (Pylyshyn & Storm, 1988; Oksama & Hyönä, 2004). If a subject feels that correct identification of an object during tracking is doubtful, he or she is inclined to stop tracking this object and to focus attention on tracking a smaller number of objects. This causes a gradual reduction of human performance when the number of targets exceeds five and is probably the source of the smaller differences in the probability of error between the cases with four and five targets observed in MOT experiments relative to those found in simulations (see Figure 5).

It is known that in MOT experiments, the quality of target tracking can be enhanced by grouping the targets into a virtual polygon and then tracking deformations of the polygon. This grouping can be done spontaneously or according to the instruction given to subjects (Yantis, 1992). These facts can be explained in terms of oscillatory neural networks by assuming that in this case, all targets are combined into a single visual object. The oscillators representing this object are synchronized along virtual borders of the polygon that are formed by some internal effort. As a result, attention is no longer divided, and the central executive operates in a standard manner as a single central oscillator. It is also possible that humans can follow some mixed strategy combining grouping with tracking individual objects. The
strategy of grouping may be an explanation for the cases when more than five targets are tracked successfully. But according to the data of Oksama and Hyönä (2004), if the number of targets exceeds five, the subjects are inclined to ignore some targets and focus attention on a smaller target set.

In designing the MOT model, we intentionally tried to avoid the use of traditional image processing techniques such as shape analysis, connectivity testing, pattern recognition, and others. Therefore, the model can work equally well when objects in the visual field are not identical or even vary in shape. This is important, for example, if object movements take place in 3D space, with the projection of the objects on the retina constantly changing. The model design also reflects the fact that complex procedures of information processing demand more time and are assumed to be implemented by higher cortical structures. It was interesting to investigate whether MOT can be explained in terms of a simple network architecture where the only function of the top-down flow is to control the focus of attention.

Appendix A: Mathematical Description of AMCO

The oscillators comprising AMCO are described as generalized phase oscillators. The state of such an oscillator is described by three explicitly given variables: the phase of oscillations, the amplitude of oscillations, and the natural frequency of oscillations. The dynamics of AMCO are described by the following equations:

$$\frac{d\theta_0}{dt} = 2\pi\omega_0 + \frac{w_0}{n} \sum_{i=1}^{n} s_i a_i\, g(\theta_i - \theta_0) \qquad (A.1)$$

$$\frac{d\theta_i}{dt} = 2\pi\omega_i - a_0 w_1 h(\theta_0 - \theta_i) + w_2 \sum_{j \in N_i} a_j\, p(\theta_j - \theta_i) + \rho \qquad (A.2)$$

$$\frac{da_i}{dt} = \beta\bigl(-a_i + \gamma f(\theta_0 - \theta_i)\bigr) \qquad (A.3)$$

$$\frac{d\omega_0}{dt} = -\alpha\left(2\pi\omega_0 - \frac{d\theta_0}{dt}\right). \qquad (A.4)$$
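To make the dynamics concrete, a minimal forward-Euler integration of equations A.1 to A.4 might look as follows. This is a sketch, not the authors' code: the interaction functions g, h, p, f and all parameter values are illustrative placeholders (their exact forms are given in Borisyuk & Kazanovich, 2004), and the two-dimensional nearest-neighbor coupling is simplified to a one-dimensional ring.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt, steps = 49, 0.001, 5000
w0, w1, w2 = 1.0, 2.0, 1.0
alpha, beta, gamma, sigma = 0.5, 1.0, 1.0, 0.1
g = h = p = np.sin                      # 2*pi-periodic, odd (placeholder forms)
f = lambda x: 0.5 * (1.0 + np.cos(x))   # 2*pi-periodic, even, positive (placeholder)

theta = rng.uniform(0, 2 * np.pi, n)    # PO phases theta_i
a = np.full(n, 0.1)                     # PO amplitudes a_i
s = np.ones(n)                          # all POs active (s_i = 1)
omega = np.full(n, 10.0)                # PO natural frequencies omega_i
theta0, omega0, a0 = 0.0, 10.0, 1.0     # CO phase, natural frequency, fixed amplitude

for _ in range(steps):
    # equation A.1: CO phase driven by the active POs
    dtheta0 = 2 * np.pi * omega0 + (w0 / n) * np.sum(s * a * g(theta - theta0))
    # equation A.2: PO phases; the neighborhood N_i is reduced to a 1D ring here
    nbr = (np.roll(a, 1) * p(np.roll(theta, 1) - theta)
           + np.roll(a, -1) * p(np.roll(theta, -1) - theta))
    dtheta = (2 * np.pi * omega - a0 * w1 * h(theta0 - theta) + w2 * nbr
              + rng.normal(0.0, sigma, n))
    da = beta * (-a + gamma * f(theta0 - theta))       # equation A.3
    domega0 = -alpha * (2 * np.pi * omega0 - dtheta0)  # equation A.4
    theta0 += dt * dtheta0
    theta += dt * dtheta
    a += dt * da
    omega0 += dt * domega0

# Fraction of POs in the resonant state (threshold R = 0.8 * gamma * f_max,
# with f_max = 1 for the placeholder f used above)
print(np.mean(a > 0.8 * gamma * 1.0))
```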
In equations A.1 to A.4, θ_0 is the phase of the CO; θ_i (i = 1, ..., n) are the phases of the POs; dθ_0/dt and dθ_i/dt are the current frequencies of the oscillators; ω_0 is the natural frequency of the CO; ω_i are the natural frequencies of the POs; a_0 is the amplitude of oscillations of the CO (a constant); a_i are the amplitudes
of oscillations of the POs; w_0, w_1, w_2 are constant positive parameters that control the strength of interaction between oscillators; s_i is the parameter that distinguishes between active and silent oscillators: s_i = 1 if the PO is active, otherwise s_i = 0; N_i is the set of active POs in the nearest neighborhood of oscillator i; ρ is a gaussian noise term with mean 0 and standard deviation σ; the functions g, h, p control the interaction between oscillators; f is a function that controls the amplitude of oscillations of the POs and their transition to the resonant state; and α, β, γ are network parameters (positive constants). The values ω_i are determined by the input signal; θ_0, θ_i, ω_0, a_i are internal variables that characterize the state of the network. The functions g, h, p are 2π-periodic, odd, and unimodal in the interval of periodicity; f is 2π-periodic, even, positive, and unimodal in the interval of periodicity. An exact description of these functions and the values of the parameters used in computations can be found in Borisyuk and Kazanovich (2004).

Equations A.1 and A.2 are traditional phase-locking equations. They correspond to the architecture of connections in Figure 1 and control the processes of synchronization and desynchronization in the network. Equation A.1 describes the dynamics of the CO. Equation A.2 describes the dynamics of the POs. The noise ρ in equation A.2 is used as an additional source of desynchronization between assemblies of POs. It helps to randomize the location of different assemblies of POs in phase-frequency space, thus making them distinguishable by the CO.

Equation A.3 describes the dynamics of the amplitude of oscillations of the POs. This equation provides a mechanism for the resonant increase of the amplitude of oscillations. Let the interval of variation of f be (f_min, f_max). The amplitude of a PO increases to the maximum value a_max = γ f_max if the PO works synchronously with the CO. The amplitude takes a low value a_min = γ f_min if the phase of the PO is significantly different from the phase of the CO. We say that a PO is in the resonant state if its amplitude exceeds the threshold R = 0.8 γ f_max. The parameter β determines the rate of amplitude increase and decay.

Equation A.4 describes the adaptation of the natural frequency of the CO. According to this equation, the value of 2πω_0 tends to the current frequency of the CO. Such adaptation allows the CO to "search" for an assembly of POs that is an appropriate candidate for synchronization.

Appendix B: Mathematical Description of the MOT Model

The equations of the MOT model dynamics represent a modification of equations A.1 to A.4 according to the scheme of Figure 4:

$$\frac{d\theta_0^k}{dt} = 2\pi\omega_0^k + \frac{w_0}{n_{res}} \sum_{i=1}^{n} s_i a_i^k\, g\bigl(\theta_i^k - \theta_0^k\bigr) - w_3 \sum_{l=1}^{m} a_0^l\, q\bigl(\theta_0^l - \theta_0^k\bigr) \qquad (B.1)$$
$$\frac{d\theta_i^k}{dt} = 2\pi\omega_i^k - a_0^k w_1 h\bigl(\theta_0^k - \theta_i^k\bigr) + w_2 \sum_{j \in N_i} a_j^k\, p\bigl(\theta_j^k - \theta_i^k\bigr) + \frac{w_4}{m} \sum_{l=1}^{m} a_i^l\, p\bigl(\theta_i^l - \theta_i^k\bigr) + \rho \qquad (B.2)$$

$$\frac{da_i^k}{dt} = \beta\bigl(-a_i^k + \gamma f\bigl(\theta_0^k - \theta_i^k\bigr)\bigr) \qquad (B.3)$$

$$\frac{d\omega_0^k}{dt} = -\alpha\left(2\pi\omega_0^k - \frac{d\theta_0^k}{dt}\right). \qquad (B.4)$$
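As an illustration of the new desynchronizing term, the following fragment sketches the right-hand side of equation B.1 for all m central oscillators at once. Array names and shapes are our own conventions, not the authors': theta holds the PO phases per layer, the saliency values s are shared across layers, and the l = k term of the CO–CO sum vanishes because q is odd.

```python
import numpy as np

def co_phase_rhs(theta0, a0, theta, a, s, omega0, w0, w3, n_res, g, q):
    """Right-hand side of equation B.1 for every CO (a sketch).

    theta0, a0, omega0: shape (m,) -- CO phases, amplitudes, natural frequencies
    theta, a:           shape (m, n) -- PO phases and amplitudes per layer
    s:                  shape (n,) -- saliency map values s_i
    """
    m = theta0.shape[0]
    rhs = np.empty(m)
    for k in range(m):
        drive = (w0 / n_res) * np.sum(s * a[k] * g(theta[k] - theta0[k]))
        desync = w3 * np.sum(a0 * q(theta0 - theta0[k]))  # l = k term is q(0) = 0
        rhs[k] = 2 * np.pi * omega0[k] + drive - desync
    return rhs
```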
In equations B.1 to B.4, upper and lower indices are used to number layers and oscillators within a layer, respectively; k = 1, ..., m, where m is the number of layers (the same as the number of targets in MOT). The normalizing parameter n_res is equal to the current number of resonant oscillators, but not less than 49 (the number of pixels in an object). The last term in equation B.1 describes the interaction between the central oscillators, and the function q determines the type of interaction (the negative sign before this term makes the interaction desynchronizing). The term before the noise in equation B.2 gives the interaction within a column of POs. The parameters w_3, w_4 are positive constants.

The parameters s_i in equation B.1 form a saliency map: s_i > 0 for the pixels of objects, and s_i = 0 for the pixels of the background. During the stage when some objects are flashed, the values of s_i are made higher for the pixels of flashed objects than for the pixels of nonflashed objects. When objects are homogeneously illuminated, the values of s_i are made identical for all pixels of objects.

Equations B.3 and B.4 generalize equations A.3 and A.4 for the amplitudes of POs and the natural frequencies of COs in the case of a multilayer network. The amplitudes of COs vary according to an equation similar to equation B.3,

$$\frac{da_0^k}{dt} = \beta\left(-a_0^k + \gamma_1\, r\!\left(\sum_{l=1,\ l \neq k}^{m} f\bigl(\theta_0^l - \theta_0^k\bigr) + 1\right)\right), \qquad (B.5)$$

where

$$r(x) = \begin{cases} x, & x \le f_{max}, \\ f_{max}, & x > f_{max}. \end{cases}$$
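In code, the saturation r amounts to taking a min with f_max. A sketch of the right-hand side of equation B.5 (names are ours, and the equation itself is reconstructed from the text above):

```python
import numpy as np

def co_amplitude_rhs(a0_k, theta0, k, beta, gamma1, f, f_max):
    """Right-hand side of equation B.5 for the kth CO (a sketch)."""
    others = np.delete(theta0, k)               # phases of the COs with l != k
    x = np.sum(f(others - theta0[k])) + 1.0
    return beta * (-a0_k + gamma1 * min(x, f_max))  # r(x) = min(x, f_max)
```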
The function r is necessary to normalize the variation of the amplitudes of COs. In computations, the values of the resonant amplitude of a CO were about two times higher than the amplitude of a nonresonant CO.

Acknowledgments

This work was supported by the Russian Foundation of Basic Research (Grants 03-04-48482 and 06-04-48806) and by the UK EPSRC (Grant EP/0036364).

References

Andres, P. (2003). Frontal cortex as the central executive: Time to revise our view. Cortex, 39, 871–895.
Baddeley, A. (1996). Exploring the central executive. Quarterly Journal of Experimental Psychology, 49A, 5–28.
Baddeley, A. (2002). Fractionating the central executive. In D. Stuss & R. T. Knight (Eds.), Principles of frontal lobe function (pp. 246–260). New York: Oxford University Press.
Baddeley, A. (2003). Working memory and language: An overview. Journal of Communication Disorders, 36, 189–208.
Barbas, H. (2000). Connections underlying the synthesis of cognition, memory, and emotion in primate prefrontal cortices. Brain Research Bulletin, 52, 319–330.
Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000). Tracking an object through feature space. Nature, 408, 196–199.
Borisyuk, R., & Kazanovich, Y. (2003). Oscillatory neural network model of attention focus formation and control. BioSystems, 71, 29–36.
Borisyuk, R., & Kazanovich, Y. (2004). Oscillatory model of attention-guided object selection and novelty detection. Neural Networks, 17, 899–915.
Collette, F., & Van der Linden, M. (2002). Brain imaging of the central executive component of working memory. Neuroscience and Biobehavioral Reviews, 26, 105–125.
Corchs, S., & Deco, G. (2001). A neurodynamical model for selective visual attention using oscillators. Neural Networks, 14, 981–990.
Cowan, N. (1988). Evolving conceptions of memory storage, selective attention and their mutual constraints within the human information processing system. Psychological Bulletin, 104, 163–191.
Culham, J., Brandt, S. A., Cavanagh, P., Kanwisher, N. G., Dale, A. M., & Tootell, R. (1998). Cortical fMRI activation produced by attentive tracking of moving targets. Journal of Neurophysiology, 80, 2657–2670.
Daffner, K. R., Mesulam, M. M., Scinto, L. F. M., Cohen, L. G., Kennedy, B. P., West, W. C., & Holcomb, P. J. (1998). Regulation of attention to novel stimuli by frontal lobes: An event-related potential study. NeuroReport, 9, 787–791.
Damasio, A. (1989). The brain binds entities and events by multiregional activation from convergent zones. Neural Computation, 1, 123–132.
D'Esposito, M., Detre, J. A., Alsop, D. C., Shin, R. R., Atlas, S., & Grossman, M. (1995). The neural basis of the central executive system of working memory. Nature, 378, 279–281.
Duncan, J. (2001). An adaptive coding model of neural functions in prefrontal cortex. Nature Reviews Neuroscience, 2, 820–829.
Egeth, H., & Yantis, S. (1997). Visual attention: Control, representation, and time course. Annual Review of Psychology, 48, 269–297.
Eriksen, C. W., & St. James, J. D. (1986). Visual attention within and around the field of focal attention: A zoom lens model. Perception and Psychophysics, 40, 225–240.
Fell, J., Fernandez, G., Klaver, P., Elger, C. E., & Fries, P. (2003). Is synchronized neuronal gamma activity relevant for selective attention? Brain Research Reviews, 42, 265–272.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fries, P., Schroeder, J.-H., Roelfsema, P. R., Singer, W., & Engel, A. K. (2002). Oscillatory neuronal synchronization in primary visual cortex as a correlate of stimulus selection. Journal of Neuroscience, 22, 3739–3754.
Gross, J., Schmitz, F., Schnitzler, I., Kessler, K., Shapiro, K., Hommel, B., & Schnitzler, A. (2004). Modulation of long-range neuronal synchrony reflects temporal limitations of visual attention in humans. Proc. Natl. Acad. Sci. (USA), 101, 13050–13055.
Grossberg, S., & Raizada, R. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Research, 40, 1413–1432.
Herrmann, C. S., & Knight, R. T. (2001). Mechanisms of human attention: Event-related potentials and oscillations. Neuroscience and Biobehavioral Reviews, 25, 465–476.
Hölscher, C. (2003). Time, space and hippocampal functions. Reviews in the Neurosciences, 14, 253–284.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2, 194–203.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
Kanwisher, N., & Wojciulik, E. (2000). Visual attention: Insights from brain imaging. Nature Reviews Neuroscience, 1, 91–100.
Katayama, K., Yano, M., & Horiguchi, T. (2004). Neural network model of selective visual attention using Hodgkin–Huxley equation. Biological Cybernetics, 91, 315–325.
Kazanovich, Y. B., & Borisyuk, R. M. (1999). Dynamics of neural networks with a central element. Neural Networks, 12, 441–454.
Kazanovich, Y., & Borisyuk, R. (2002). Object selection by an oscillatory neural network. BioSystems, 67, 103–111.
Kazanovich, Y. B., & Borisyuk, R. M. (2003). Synchronization in oscillator systems with phase shifts. Progress in Theoretical Physics, 110, 1047–1058.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Kryukov, V. I. (1991). An attention model based on the principle of dominanta. In A. V. Holden & V. I. Kryukov (Eds.), Neurocomputers and attention I: Neurobiology, synchronization and chaos (pp. 319–352). Manchester: Manchester University Press.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Liang, H., Bressler, S. L., Ding, M., Desimone, R., & Fries, P. (2003). Temporal dynamics of attention-modulated neuronal synchronization in macaque V4. Neurocomputing, 52–54, 481–487.
Liu, G., Austen, E. L., Booth, K. S., Fisher, B. D., Argue, R., Rempel, M. I., & Enns, J. T. (2005). Multiple-object tracking is based on scene, not retinal, coordinates. Journal of Experimental Psychology, 31, 235–247.
Loose, R., Kaufmann, C., Auer, D. P., & Lange, K. W. (2003). Human prefrontal and sensory cortical activity during divided attention tasks. Human Brain Mapping, 18, 249–259.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919.
Mozer, M. C., & Sitton, M. (1998). Computational modeling of spatial attention. In H. Pashler (Ed.), Attention (pp. 341–393). London: UCL Press.
Niebur, E., Hsiao, S. S., & Johnson, K. O. (2002). Synchrony: A neuronal mechanism for attentional selection? Current Opinion in Neurobiology, 12, 190–194.
Niebur, E., Kammen, D. E., & Koch, C. (1991). Phase-locking in 1-D and 2-D networks of oscillating neurons. In W. Singer & H. Schuster (Eds.), Nonlinear dynamics and neuronal networks (pp. 173–204). Berlin: Vieweg Verlag.
Oksama, L., & Hyönä, J. (2004). Is multiple object tracking carried out automatically by an early vision mechanism independent of higher-order cognition? An individual difference approach. Visual Cognition, 11, 631–671.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719.
Posner, M. I., Snyder, C. R. R., & Davidson, D. J. (1980). Attention and the detection of signals. Journal of Experimental Psychology: General, 109, 160–174.
Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80, 127–158.
Pylyshyn, Z. W. (2004). Some puzzling findings in multiple object tracking (MOT): I. Tracking without keeping track of object identities. Visual Cognition, 11, 801–822.
Pylyshyn, Z. W., & Storm, R. W. (1988). Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial Vision, 3, 179–197.
Roelfsema, P. R., Lamme, V., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395, 376–381.
Scholl, B. J. (2001). Objects and attention: The state of the art. Cognition, 80, 1–46.
Scholl, B. J., & Tremoulet, P. D. (2000). Perceptual causality and animacy. Trends in Cognitive Sciences, 4, 299–309.
Sears, C. R., & Pylyshyn, Z. W. (2000). Multiple object tracking and attentional processing. Canadian Journal of Experimental Psychology, 54, 1–14.
Seifert, A. E., Somers, D. C., Dale, A. M., & Tootell, R. (2003). Functional MRI studies of human visual motion perception: Texture, luminance, attention and aftereffects. Cerebral Cortex, 13, 340–349.
Shallice, T. (2002). Fractionation of the supervisory system. In D. T. Stuss & R. T. Knight (Eds.), Principles of frontal lobe function (pp. 261–277). New York: Oxford University Press.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65.
Sokolov, A., Lutzenberger, W., Pavlova, M., Pressl, H., Braun, C., & Birbaumer, N. (1999). Gamma-band MEG activity to coherent motion depends on task-driven attention. NeuroReport, 10, 1997–2000.
Somers, D. C., Dale, A. M., Seifert, A. E., & Tootell, R. (1999). Functional MRI reveals spatially specific attentional modulation in human primary visual cortex. Proc. Natl. Acad. Sci. (USA), 96, 1663–1668.
Steinmetz, P. N., Roy, A., Fitzgerald, P., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Tallon-Baudry, C. (2004). Attention and awareness in synchrony. Trends in Cognitive Sciences, 8, 523–525.
Treue, S. (2003). Visual attention: The where, what, how and why of salience. Current Opinion in Neurobiology, 13, 428–432.
Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78, 507–545.
Vinogradova, O. S. (2001). Hippocampus as comparator: Role of the two input and two output systems of the hippocampus in selection and registration of information. Hippocampus, 11, 578–598.
Visvanathan, L., & Mingolla, E. (2002). Dynamics of attention in depth: Evidence from multi-element tracking. Perception, 31, 1415–1437.
Wang, D. L. (1999). Object selection based on oscillatory correlation. Neural Networks, 12, 579–592.
Yantis, S. (1992). Multielement visual tracking: Attention and perceptual organization. Cognitive Psychology, 24, 295–340.
Received April 25, 2005; accepted September 15, 2005.
LETTER
Communicated by Gustavo Deco
Analysis of Cluttered Scenes Using an Elastic Matching Approach for Stereo Images

Christian Eckes
[email protected]
Fraunhofer Institute for Media Communications IMK, D-53754 Sankt Augustin, Germany
Jochen Triesch
[email protected]
Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515, U.S.A.
Christoph von der Malsburg
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
We present a system for the automatic interpretation of cluttered scenes containing multiple partly occluded objects in front of unknown, complex backgrounds. The system is based on an extended elastic graph matching algorithm that allows the explicit modeling of partial occlusions. Our approach extends an earlier system in two ways. First, we use elastic graph matching in stereo image pairs to increase matching robustness and disambiguate occlusion relations. Second, we use richer feature descriptions in the object models by integrating shape and texture with color features. We demonstrate that the combination of both extensions substantially increases recognition performance. The system learns about new objects in a simple one-shot learning approach. Despite the lack of statistical information in the object models and the lack of an explicit background model, our system performs surprisingly well for this very difficult task. Our results underscore the advantages of view-based feature constellation representations for difficult object recognition problems.

Neural Computation 18, 1441–1471 (2006)    © 2006 Massachusetts Institute of Technology

1 Introduction

The analysis of complex natural scenes with many partly occluded objects is among the most difficult problems in computer vision (Rosenfeld, 1984). The task is illustrated in Figure 1 and may be defined as follows: Given an input image with multiple, potentially partly occluded objects in front of an unknown complex background, recognize the identities and spatial locations of all known objects present in the scene.
Figure 1: Scene analysis overview. Given a set of object models (examples on the left) and an input image of a complex scene (middle), the task of scene analysis is to compute a high-level interpretation of the scene giving the identities and locations of all known objects, as well as their occlusion relations (right). Our object models take the form of labeled graphs with a gridlike topology.
Clearly, the ability to master this task is a fundamental prerequisite for countless application domains. Unknown backgrounds and partial occlusions of objects will be the order of the day whenever computer vision systems or visually guided robots are deployed in complex, natural environments. In such situations, it may be that segmentation cannot be decoupled from recognition; rather, the two problems have to be treated in an integrated manner. In fact, even today's best "pure" segmentation approaches (e.g., Ren & Malik, 2002) seem to be severely limited in their ability to find object contours when compared to human observers.

The system developed and tested here follows the popular approach of elastic graph matching (EGM), a biologically inspired object recognition approach that represents views of objects as 2D constellations of wavelet-based features (Lades et al., 1993; Triesch & Eckes, 2004). EGM is an example of a class of architectures that represent particular views of objects as 2D constellations of image features (Fischler & Elschlager, 1973). EGM does not rely on initial segmentation but directly matches observed image features to stored model features. It has been used successfully for recognizing objects, faces (including identity, pose, gender, and expression; Wiskott, Fellous, Krüger, & von der Malsburg, 1997), and hand gestures (Triesch & von der Malsburg, 2002), and has already demonstrated its potential for complex scene analysis in an earlier study (Wiskott & von der Malsburg, 1993).

The description of object views as constellations of local features is potentially very powerful for dealing with partial occlusions, because the effect of features missing due to occlusion can be modeled explicitly in this approach. For example, missing features of the object "pig" in Figure 1 can be discounted if it is recognized that they are occluded by the object in front. In Bayesian terms, the missing features are explained away by the presence of the occluding object. In this capacity, our approach stands in contrast to
schemes employing more holistic object representations such as eigenspace approaches (Turk & Pentland, 1991; Murase & Nayar, 1995).

The philosophy behind our current approach is to integrate multiple sources of information within an EGM framework. The motivation for this philosophy is that when dealing with very difficult (vision) problems, one is often well advised to follow the honorable "use all the help you can get" rationale. Recent research in the computer vision and machine learning communities has repeatedly highlighted the often surprising capabilities of systems that integrate many weak (i.e., individually poor) cues or classifiers. Our current system for scene analysis uses two extensions to the basic graph matching technique, both integrating a new source of information into the system. First, we use an extension of EGM to stereo image pairs, where object graphs are matched simultaneously in the left and right images subject to epipolar constraints (Triesch, 1999; Kefalea, 2001). The disparity information estimated during stereo graph matching is used to disambiguate occlusion relations during scene analysis. Second, we utilize richer features in the object description that fuse information from different visual cues (shape and texture, color) (Triesch & Eckes, 1998; Triesch & von der Malsburg, 2001). The combination of both extensions is shown to dramatically increase recognition performance.

The remainder of the letter is organized as follows. In section 2 we describe conventional EGM in the context of scene analysis. Section 3 describes the extension of EGM for stereo image pairs. In section 4 we discuss our extension to richer feature descriptions, so-called compound feature sets. Section 5 presents our method for scene analysis in stereo images. Section 6 covers the results of our experiments. Finally, section 7 discusses the work from a broader perspective and relates it to other approaches in the field.

2 Elastic Graph Matching

2.1 Object Representation. In elastic graph matching (EGM), views of objects are described as 2D constellations of features represented as labeled graphs. In particular, the representation of an object m ∈ {1, ..., M} is an undirected, labeled graph G^m with N_V(m) nodes or vertices (set V^m) and N_E(m) edges (set E^m):

$$G^m = (V^m, E^m) \quad \text{with} \quad V^m = \{n_i^m,\ i = 1, \ldots, N_V(m)\}, \quad E^m = \{e_j^m,\ j = 1, \ldots, N_E(m)\}. \qquad (2.1)$$
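A minimal rendering of this representation as a data structure (Python with numpy; class and helper names are ours, not from the original system) is sketched below, anticipating the node and edge labels described next:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModelGraph:
    """Labeled graph G^m of equation 2.1 (a sketch)."""
    positions: np.ndarray  # (N_V, 2): node positions x_i^m in the training image
    features: np.ndarray   # (N_V, L*D): feature set F(x_i^m, I^m) per node
    edges: list            # pairs (i, i2) of neighboring node indices

    def displacement(self, j):
        i, i2 = self.edges[j]                           # edge e_j^m = (i, i')
        return self.positions[i] - self.positions[i2]   # d_j^m

def grid_edges(rows, cols):
    """Four-neighbor grid topology with nodes indexed row-major."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append((r * cols + c, r * cols + c + 1))
            if r + 1 < rows:
                edges.append((r * cols + c, (r + 1) * cols + c))
    return edges
```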
Each node n_i^m is labeled with its corresponding position x_i^m in the training image I^m and with a feature set extracted from this position, denoted as F(x_i^m, I^m). The feature set provides a local description of the appearance of the object m at location x_i^m. The edges e_j^m = (i, i') connect neighboring nodes n_i^m and n_{i'}^m (i < i') and are labeled with displacement vectors d_j^m = x_i^m − x_{i'}^m
encoding the spatial arrangement of the features. For our experiments, we use graphs with a simple grid-like topology. The distance between nodes in the x- and y-directions is 5 pixels, and each node is connected to its four nearest neighbors (see Figure 1). Thus, the number of nodes in the graph varies with object size. Our object database contains graphs of various sizes, ranging from 30 to 35 nodes covering the small tea box up to 198 nodes for the large stuffed pig. Note that in many applications of EGM, graph topologies are adapted to the particular problem domain (e.g., for faces or hand gestures).

2.2 Gabor Features. The features labeling the graph nodes are typically vectors of responses of Gabor-based wavelets, so-called Gabor jets (Lades et al., 1993). These features have been shown to be more robust with respect to small translations and rotations than raw pixel values, and they are heavily used in biological vision systems (Jones & Palmer, 1987). These wavelets can be thought of as plane waves restricted by a gaussian envelope function. They come in quadrature pairs (with odd and even symmetry corresponding to sine and cosine functions) and can be compactly written as a single complex wavelet:
$$\psi_{\vec k}(\vec x) = \frac{k^2}{\sigma^2}\, e^{-\frac{k^2 x^2}{2\sigma^2}} \left( e^{i \vec k \cdot \vec x} - e^{-\sigma^2/2} \right). \qquad (2.2)$$
For σ, we chose 2π. This choice leads to larger spatial filters than the choice σ = 2, which is sometimes preferred in the literature. Convolving the gray-level component I(x) of an image at a certain position x_0 with a filter kernel ψ_k yields a complex number, denoted by J_k(x_0, I):

$$J_{\vec k}(\vec x_0, I) = \int \psi_{\vec k}(\vec x_0 - \vec x)\, I(\vec x)\, d^2x. \qquad (2.3)$$
The family of filters in equation 2.2 is parameterized by the wave vector k. To construct a set of features for a node of a model graph, different filters are obtained by choosing a discrete set of L × D vectors k with L = 3 different frequency levels and D = 8 different directions. In particular, we use

$$\vec k_{l,d} = \frac{\pi}{2} \left( \frac{1}{\sqrt{2}} \right)^{\!l} \begin{pmatrix} \cos\frac{\pi d}{D} \\[2pt] \sin\frac{\pi d}{D} \end{pmatrix}, \qquad l \in \{0, 1, \ldots, L-1\},\ d \in \{0, 1, \ldots, D-1\}. \qquad (2.4)$$
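The following sketch builds the L × D wave vectors of equation 2.4, samples the kernel of equation 2.2 on a pixel grid, and extracts the magnitude responses at one image position by direct summation (equation 2.3 restricted to a finite window). The window size and all names are our choices; for the magnitudes used here, the flip distinguishing correlation from the convolution in equation 2.3 is immaterial for real-valued images.

```python
import numpy as np

L_LEVELS, D_DIRS, SIGMA = 3, 8, 2 * np.pi

def wave_vectors():
    """The L x D wave vectors k_{l,d} of equation 2.4."""
    ks = []
    for l in range(L_LEVELS):
        for d in range(D_DIRS):
            knorm = (np.pi / 2) * (1 / np.sqrt(2)) ** l
            ang = np.pi * d / D_DIRS
            ks.append(knorm * np.array([np.cos(ang), np.sin(ang)]))
    return ks

def gabor_kernel(k, half=16):
    """Complex Gabor wavelet of equation 2.2 sampled on a (2*half+1)^2 grid."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2 = k @ k
    envelope = (k2 / SIGMA**2) * np.exp(-k2 * (x**2 + y**2) / (2 * SIGMA**2))
    carrier = np.exp(1j * (k[0] * x + k[1] * y)) - np.exp(-SIGMA**2 / 2)
    return envelope * carrier

def extract_jet(image, x0, y0, half=16):
    """Magnitude jet at (x0, y0), as in equation 2.6; image is 2D gray level.
    Assumes the window around (x0, y0) lies fully inside the image."""
    patch = image[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1]
    return np.array([np.abs(np.sum(gabor_kernel(k, half) * patch))
                     for k in wave_vectors()])
```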
The set of all complex filter responses J_{k_{l,d}} is aggregated into a complex vector called a Gabor-wavelet jet or just Gabor jet. It is often convenient to
represent the complex filter responses as magnitude and phase values:

$$J_{\vec k_{l,d}} = |J_{\vec k_{l,d}}|\, \exp\!\left( i \phi_{\vec k_{l,d}} \right). \qquad (2.5)$$
Previous studies have suggested that the information contained in the magnitudes of the filter responses is most informative for recognition purposes (Lades et al., 1993), and Shams and von der Malsburg (2002) have argued that their "population responses contain sufficient information to capture the perceptual essence of images." On this basis, we decided to use only the magnitude information to construct feature sets to be used in the object representations. The feature set F(x_i^m, I^m) for node i of model m is thus a real vector with L × D components:

$$F(x_i^m, I^m) \equiv F_i^m = \left( |J_{\vec k_{0,0}}(x_i^m, I^m)|, \ldots, |J_{\vec k_{L-1,D-1}}(x_i^m, I^m)| \right)^{T}. \qquad (2.6)$$
During recognition, we will attempt to establish correspondences between model features and image features in a matching process. To do so, we need to compare Gabor jets attached to the model nodes with Gabor jets extracted from specific points in the input image. We use the similarity function

$$S_{node}(F, F') = \cos^4\!\angle(F, F') = \left( \frac{F \cdot F'}{\|F\|\,\|F'\|} \right)^{4}, \qquad (2.7)$$
where "·" denotes the dot or inner product. This similarity function measures the angle between the Gabor jets. It is invariant to changes in the length of the Gabor jets, which provides some robustness to changes in local image contrast. The fourth power helps to better discriminate closely matching jets from just average matching jets by making the distribution of similarity values sparser, leading to fewer occurrences of high similarities while spreading out the range of interesting similarity values (those close to 1) (Wiskott & von der Malsburg, 1993). Note that the computationally costly division by ||F|| and ||F'|| can be avoided by using normalized Gabor jets with ||F|| = ||F'|| = 1.

2.3 Elastic Matching Process. The goal of elastic matching is to establish correspondences between model features and image features. Matching individual model features to the image independently of each other suffers from considerable ambiguity. Robust matches can be obtained only by simultaneously matching constellations of many features. EGM performs a topographically constrained feature matching by comparing the entire model graph G^m with parameterized image graphs G^I(α, m). Such image graphs are generated by parameterized transformations T_α that can account for different transformations of the object in the image with respect to the
model, such as translation, scaling, rotation in plane, and (limited) rotation in depth. In particular, the transformation T_α maps the position of a model node x_i^m to a new position T_α(x_i^m). The model feature set attached to a particular node of the object model must be compared to features extracted at the transformed location of that node in the input image. This feature vector is denoted F(T_α(x_i^m), I). The similarity of model feature set i with a feature set extracted at a transformed node location in an input image I is written as

$$S_i(\alpha, m, I) = S_{node}\!\left( F_i^m,\ F\bigl(T_\alpha(x_i^m), I\bigr) \right), \qquad (2.8)$$
where S_node is defined in equation 2.7. The similarity between the entire image graph corresponding to T_α and the model graph is defined as the mean similarity of the corresponding feature sets attached at each node:

$$S_{graph}\!\left( G^m, G^I(\alpha, m) \right) = \frac{1}{N_V(m)} \sum_{i=1}^{N_V(m)} S_i(\alpha, m, I). \qquad (2.9)$$
EGM is an optimization process that maximizes this similarity function to find the best matching image graph over the predefined space of allowed transformations T_α, α = 1, ..., A:

$$\hat{\alpha} = \arg\max_{\alpha}\, S_{graph}\!\left( G^m, G^I(\alpha, m) \right). \qquad (2.10)$$
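A toy instance of this optimization, with T_α restricted to pure translations on a coarse scan grid, might look as follows; the actual system also samples scale and rotation and refines the coarse result. The sketch reuses the hypothetical ModelGraph and extract_jet helpers from above, and all names are ours.

```python
import numpy as np

def s_node(F, Fp):
    """Jet similarity of equation 2.7: fourth power of the cosine."""
    c = F @ Fp / (np.linalg.norm(F) * np.linalg.norm(Fp) + 1e-12)
    return c ** 4

def match_translations(model, image, step=4, half=16):
    """Equations 2.8-2.10 with T_alpha(x) = x + t on a coarse scan grid."""
    h, w = image.shape
    xs = model.positions[:, 0].astype(int)
    ys = model.positions[:, 1].astype(int)
    best, best_t = -1.0, None
    for ty in range(half - ys.min(), h - half - ys.max(), step):
        for tx in range(half - xs.min(), w - half - xs.max(), step):
            # S_graph: mean node similarity at the shifted node positions
            sims = [s_node(jet, extract_jet(image, x + tx, y + ty, half))
                    for x, y, jet in zip(xs, ys, model.features)]
            s = float(np.mean(sims))
            if s > best:
                best, best_t = s, (tx, ty)
    return best_t, best   # arg max and value of equation 2.10
```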
The optimal transformation parameters $\hat{\alpha}$ depend on the particular object model m and the image I under consideration: $\hat{\alpha} = \hat{\alpha}(I, m)$. Note that it is often useful in EGM to add a term to the graph similarity function S_graph that punishes geometric distortions of the matched graph compared to the model graph, but this did not improve recognition performance in our particular application.

In summary, the matching process yields a set of optimal correspondences $\{T_{\hat{\alpha}}(x_i^m)\}$, the local similarity at each node of the matched graph $S_i(\hat{\alpha}, m, I)$, and the global similarity for the complete graph, defined as $\bar{S}(\hat{\alpha}, m, I) = S_{graph}\bigl(G^m, G^I(\hat{\alpha}, m)\bigr)$. These quantities are used in the following scene analysis to recognize the different objects and infer occlusion relations, as will be discussed below (see section 5).

2.4 Remarks on EGM. Before we go into the details, let us review our approach from a broader perspective. Graph matching in computer vision aims at solving the correspondence problem in object recognition and belongs to the family of NP-complete problems, for which no efficient algorithms are known. Heuristics are therefore unavoidable. We follow a template-based approach to detect correspondences by generating (graph) templates for each
object that contains features at the nodes to describe local surface properties. This is a view-based approach to object recognition that preserves the local feature topology and is well supported by view-based theories of human vision (see, e.g., Tarr & Bülthoff, 1998, or Edelman, 1995, 1998; Ullman, 1998). It is related to approaches based on matching deformable models to image features (see, e.g., Duc, Fischer, & Bigun, 1999, for face authentication) and to many other registration methods based on articulated template matching (see the pioneering work of Yuille, 1991).

EGM aims at establishing a correspondence between image graphs and graphs taken from the model domain. In solving this problem, heuristics must be applied, despite some progress in generating model representations optimized for detecting subgraph isomorphisms (see, e.g., Messmer & Bunke, 1998). But how can we solve this problem more efficiently? One possibility is to perform hierarchical graph matching (see, e.g., Buhmann, Lades, & von der Malsburg, 1990) following a coarse-to-fine strategy (for an application in face detection, see Fleuret & Geman, 2001). However, this work focuses on how stereo and color features can be used to analyze complex scenes, so we had to refrain from implementing many of the ideas mentioned above. We have chosen a rather algorithmic version of the matching process in order to obtain a system with real-time or almost real-time capabilities. Without this, no sound judgment on the usefulness of our approach to cue combination can be made, since large-scale cross-runs cannot be avoided.

Our system selects graphs of an extended model domain able to sample translation, rotation, and scale and matches these graphs to the image domain. It is evident that the space of possible translations, scalings, and rotations in plane and depth is too large to be searched exhaustively. Fortunately, since we consider only moderate changes in scale and aspect here, a few simple heuristics can be used to make the matching problem tractable. Typically, a coarse-to-fine strategy is employed, where an initial search coarsely locates the object by evaluating the similarity function on a coarse scan grid and taking the global maximum, while subsequent optimization steps estimate the exact scale, rotation, and any nonrigid deformation parameters. This approach exploits the robustness of the Gabor features to small changes in scale and aspect of the object. We discuss the details of our matching scheme when we introduce the extension to matching in stereo image pairs in section 3.

3 Elastic Graph Matching in Stereo Image Pairs

The discussion so far has considered a single model graph being matched to a single input image. In this section, we extend this approach to matching in stereo image pairs. Stereo vision is traditionally considered in terms of attempting to recover the 3D layout of the scene given two images taken from slightly different viewpoints to produce a dense depth map of the
The aim of scene analysis, however, is just to identify known objects in the scene and establish their locations and depth order. This does not necessarily imply recovering fine-grained three-dimensional shape information. Estimates of the relative distances of the different objects may be sufficient, and the computation of dense disparity maps may not be necessary. For this reason, we attempt to integrate the information from the left and right images at the level of entire object hypotheses rather than at the level of individual image and background features. The stereo problem is essentially a correspondence problem. Elastic graph matching is a successful strategy for solving correspondence problems, and its extension to stereo image pairs is straightforward (Triesch, 1999; Kefalea, 2001). Stereo graph matching searches in the product space of the allowed left and right correspondences for a combined best match; that is, image graphs for the left and right input image are optimized simultaneously. This product space of simultaneous graph transformations in the left and right images, however, is far too large to be searched exhaustively, as it scales quadratically, $O(A^2)$, with the number A of transformations used in monocular matching. However, we can again speed up the search by a coarse-to-fine search heuristic. Also, just as in conventional stereo algorithms, it is possible to reduce the search space by exploiting the epipolar geometry, and in addition, we limit the allowed disparity range between the left and right image graphs to further reduce the search space. We extend the notation from above to distinguish left and right images, model graphs extracted from left and right training images, and node positions. Note that these two model graphs created from the left and right training images may have different numbers of nodes and edges. We denote the transformed node positions during matching as

$$T_{\alpha^p}\big(x_i^{m^p}\big) \quad \text{with } p \in \{R, L\}, \tag{3.1}$$

where $x_i^{m^L}$ denotes the position of the ith node in the left training image for model m. The set $\{T_{\alpha^L}, T_{\alpha^R}\}$ encodes the space of allowed transformations (translation, scaling, and combinations thereof) of model graphs to image graphs. $A = \{1, \ldots, A\}$ is the set of A transformations applied to the model positions. The similarity functions between the left and right image graphs and the left and right model graphs are defined by

$$S_{\text{graph}}\big(G^{m^p}, G^{I^p}(\alpha^p, m^p)\big) = \frac{1}{N_V(m^p)} \sum_{i=1}^{N_V(m^p)} S_{\text{node}}\Big(F_i^{m^p}, F\big(T_{\alpha^p}(x_i^{m^p}), I^p\big)\Big). \tag{3.2}$$

To obtain a combined or fused match similarity, we simply compute the average of the left and right matching similarities, resulting in the following combined similarity function:
$$S_{\text{stereo}}\big(G^{m^L}, G^{m^R}, G^{I^L}(\alpha^L, m^L), G^{I^R}(\alpha^R, m^R)\big) = \frac{1}{2} S_{\text{graph}}\big(G^{m^L}, G^{I^L}(\alpha^L, m^L)\big) + \frac{1}{2} S_{\text{graph}}\big(G^{m^R}, G^{I^R}(\alpha^R, m^R)\big). \tag{3.3}$$
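Equation 3.3 is a simple averaging fusion. A sketch, assuming a monocular routine `graph_similarity` (a hypothetical name) that evaluates equation 3.2 for a given model, image, and transformation:

```python
# Fused stereo similarity of equation 3.3: the average of the left and
# right monocular graph similarities for a transformation pair.
def stereo_similarity(graph_similarity, model_L, model_R,
                      image_L, image_R, alpha_L, alpha_R):
    s_left = graph_similarity(model_L, image_L, alpha_L)
    s_right = graph_similarity(model_R, image_R, alpha_R)
    return 0.5 * s_left + 0.5 * s_right
```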
If the transformation parameters $\alpha^L$ and $\alpha^R$ were optimized independently, then we would have two independent matching processes, which would double the complexity of the search. Such an increase in complexity, however, can be avoided if we exploit the epipolar constraint, which implies a set of allowed combinations of $\alpha^L$ and $\alpha^R$. The epipolar constraint can be formalized as

$$x^R \in E(x^L) \implies T_{\alpha^R}\big(x^{m^R}\big) \in E\big(T_{\alpha^L}(x^{m^L})\big), \tag{3.4}$$

where $E(x^L)$ denotes the epipolar plane defined by the point $x^L$ and the cameras' nodal points. This limits the product space of allowed transformation pairs $(\alpha^L, \alpha^R) \in A \times A$ to a subspace $A_E \subset A \times A$:

$$A_E = \big\{(\alpha^L, \alpha^R) \in A \times A \;\big|\; T_{\alpha^R}\big(x^{m^R}\big) \in E\big(T_{\alpha^L}(x^{m^L})\big)\big\}. \tag{3.5}$$
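For the approximately parallel camera setup used below (see section 3.1), the epipolar constraint reduces to corresponding points lying on (nearly) the same image row, so $A_E$ can be enumerated directly. The following sketch represents transformations simply as 2D graph positions and uses the disparity limits from section 3.1; the sign convention and tolerances are assumptions for illustration.

```python
# Illustrative construction of the constrained pair set A_E of equation 3.5
# for approximately parallel camera axes.
def allowed_pairs(positions, d_x=(3, 40), d_y=9):
    pairs = []
    for p_left in positions:        # candidate graph position in the left image
        for p_right in positions:   # candidate graph position in the right image
            disp_x = p_left[0] - p_right[0]     # assumed sign convention
            disp_y = abs(p_left[1] - p_right[1])
            if d_x[0] <= disp_x <= d_x[1] and disp_y <= d_y:
                pairs.append((p_left, p_right))
    return pairs
```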
Given the definitions above, stereo graph matching aims at optimizing the combined similarity function in the subset of transformations fulfilling the stereo constraints $E$:

$$\hat\alpha \equiv (\hat\alpha^L, \hat\alpha^R) = \arg\max_{(\alpha^L, \alpha^R) \in A_E} S_{\text{stereo}}\big(G^{m^L}, G^{m^R}, G^{I^L}(\alpha^L, m^L), G^{I^R}(\alpha^R, m^R)\big). \tag{3.6}$$

In analogy to the previous section, the matching process computes a set of optimal correspondences for the left and right images $\{T_{\hat\alpha^p}(x_i^{m^p})\}$, the local similarity at each node of the matched graphs $S_i^p(\hat\alpha^p, m^p, I^p)$, and the global similarity for the best stereo match:

$$\bar S(\hat\alpha, m^L, m^R, I^L, I^R) = S_{\text{stereo}}\big(G^{m^L}, G^{m^R}, G^{I^L}(\hat\alpha^L, m^L), G^{I^R}(\hat\alpha^R, m^R)\big). \tag{3.7}$$

Additionally, it will provide a disparity estimate for the optimal match for model m, which we denote by $d(m, \hat\alpha)$ and which we compute from the mean position of the matched left and right model graphs in the image pair:
$$d(m, \hat\alpha) = \big\langle T_{\hat\alpha^R}\big(x_i^{m^R}\big) \big\rangle_i - \big\langle T_{\hat\alpha^L}\big(x_j^{m^L}\big) \big\rangle_j. \tag{3.8}$$
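Equation 3.8 translates directly into code; matched node positions are given as arrays of shape (N, 2), and the names are hypothetical:

```python
# Disparity estimate of equation 3.8: difference of the mean matched node
# positions in the right and left images.
import numpy as np

def disparity(nodes_right, nodes_left):
    return np.mean(nodes_right, axis=0) - np.mean(nodes_left, axis=0)
```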
The x-component of this disparity estimate plays an important role in establishing occlusion relations in our scene analysis algorithm described in section 5.

3.1 Matching Schedule. For the experiments described below, we use a coarse-to-fine search heuristic for the stereo matching process. Since we use approximately parallel camera axes, the epipolar geometry is particularly simple. In a first, coarse matching step, the left and right image graphs are scanned across the image on a grid with 4-pixel spacing in the x- and y-directions. The disparity is allowed to vary in a range of 3 to 40 pixels in the x-direction and 9 pixels in the y-direction. In a second, fine matching step, the graphs are allowed to be scaled by a factor ranging from 0.9 to 1.1 in five discrete steps independently from each other, while at the same time their locations are scanned across a fine grid with 2-pixel spacing in the x- and y-directions. At the same time, the disparity is allowed to be corrected by up to 2 pixels relative to the result from the first matching step. An example of stereo graph matching is shown in Figure 2, where we compare stereo graph matching with two independent matching processes in the left and right images. Due to proper handling of the epipolar constraint, stereo graph matching is able to avoid some matching errors that can occur with two independent matching processes. The matching schedule realizes a greedy optimization strategy and converges to a local maximum of the graph matching similarity of equation 3.7 by definition. It should be noted that there is no standard matching schedule for template-based approaches that ensures convergence to a global maximum. Since the search space is huge, prior knowledge or brute force must be used. A common approach to face detection by Rowley, Baluja, and Kanade (1998) samples all possible translations, scales, and rotations before comparing the feature data extracted from the image with a statistical face model realized as a neural network. Only a more hierarchical and complex approach, such as that of Viola and Jones (2001b), may eventually help to overcome this problem. We refer to Forsyth and Ponce (2002) for a discussion of template-based approaches and their relation to alternative approaches to object recognition.
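A hedged sketch of the two-step schedule just described, assuming a helper `stereo_score(pos_L, pos_R, scale_L, scale_R)` (a hypothetical name) that evaluates the combined similarity of equation 3.3 for the given graph placements and scales; the exact sampling of the search windows is an assumption:

```python
import numpy as np

def coarse_to_fine_match(stereo_score, width, height):
    # Step 1: coarse scan on a 4-pixel grid; x-disparity in [3, 40],
    # y-disparity within 9 pixels (sampled coarsely here).
    candidates = [((xl, yl), (xl - dx, yl + dy))
                  for xl in range(0, width, 4)
                  for yl in range(0, height, 4)
                  for dx in range(3, 41, 4)
                  for dy in (-8, -4, 0, 4, 8)]
    pl, pr = max(candidates, key=lambda c: stereo_score(c[0], c[1], 1.0, 1.0))
    # Step 2: fine 2-pixel grid around the coarse optimum; independent
    # left/right scales from 0.9 to 1.1 in five steps; disparity corrected
    # by up to 2 pixels relative to the coarse result.
    refined = [((pl[0] + ox, pl[1] + oy), (pr[0] + ox + dd, pr[1] + oy), sl, sr)
               for ox in (-2, 0, 2) for oy in (-2, 0, 2)
               for dd in (-2, -1, 0, 1, 2)
               for sl in np.linspace(0.9, 1.1, 5)
               for sr in np.linspace(0.9, 1.1, 5)]
    return max(refined, key=lambda r: stereo_score(r[0], r[1], r[2], r[3]))
```

This greedy two-stage search is what keeps the otherwise quadratic product space tractable in practice.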
Figure 2: Stereo matching versus two independent monocular matching steps: (a) Original stereo image pair. (b) Example result of stereo matching of the model "yellow duck" exploiting the epipolar constraint. (c) Result of two independent matching processes in the left and right images. In the right image, a background region obtained a higher similarity than the correct match. This solution was ruled out by the epipolar constraint in b.
4 Compound Feature Sets

Our second extension to the standard EGM approach outlined so far is the introduction of richer feature combinations at the graph nodes, which we first introduced in Triesch and Eckes (1998). There are several ways of introducing richer and more specific features. A standard approach is to assemble information from a bigger support area around the point of interest, that is, to somehow make the features less local. This approach is employed frequently, and the general idea is often referred to as considering a context patch. Example techniques using this approach are set out in Nelson and Selinger (1998) and Belongie, Malik, and Puzicha (2002). While this indeed leads to more specific features, a potential drawback of this idea has received little attention: the robust extraction of these kinds of features requires the entire context patch to be visible. In the presence of partial occlusions, however, features covering large regions of the object are more prone to being compromised by occlusions. Hence, it seems that just using "larger" features may turn out to be problematic in situations where partial occlusions are prevalent. At this point, more work is needed to assess this trade-off. Our approach in this article is to consider richer kinds of features but keep the support area corresponding to a particular node relatively small. In particular, we are interested in investigating the benefits of additional color features, since color has been shown to be a powerful cue in object recognition (Swain & Ballard, 1991). Since we are using a view-based feature constellation technique, we want to extract local color descriptions that we can use to label the graph nodes rather than using a holistic description of object color in the form of a histogram, as in Swain and Ballard (1991).
We label a node with a set of two feature vectors: a Gabor jet extracted from the gray-level image and a separate feature vector based on local color information. To this end, we use a simple local average of the color around a particular pixel position. We need to choose a suitable color space. We experimented with RGB, HSI, and RY-Y-BY spaces. The last color space is also often called YCbCr, since R-Y (Cr) and B-Y (Cb) are chroma differences and Y is the luma, or brightness, information (e.g., Pratt, 2001; Gonzales & Woods, 1992). The transformations from RGB to RY-Y-BY color space and from RGB to HSI color space are given in the appendix. A color feature vector is extracted at a particular image location by simply averaging the colors of the node's neighborhood covering a region of 5 × 5 pixels. The resulting three-component feature vectors are compared with the following similarity function, which is based on a weighted Euclidean distance:
$$S_r\big(J, J', s_1, s_2, s_3\big) = 1 - \frac{\sum_{c=1}^{3} (s_c J_c - s_c J'_c)^2}{\sum_{c=1}^{3} (255\, s_c)^2}. \tag{4.1}$$
This similarity function guarantees that the similarity falls in the interval [0, 1]. The scale factors $s_c$ are used to enhance certain dimensions of the color space.¹ Note that we make no effort to incorporate color constancy mechanisms, so performance must be expected to suffer in the presence of illumination changes. With the graph nodes now being labeled with a set of feature vectors, or a compound feature set, where each feature has an associated similarity function, we need a way of fusing the individual similarities into one node similarity measure. Let a compound feature set be denoted by $F^C$, whereas an individual feature set is denoted $F_m$:

$$F^C := \{F_m, \; m = 1, \ldots, M\}. \tag{4.2}$$
We compare compound feature sets by a weighted average over the output of the individual similarity functions defined for each feature set $F_m$:

$$S_{\text{compound}}\big(F^C, F'^C\big) = \sum_{m=1}^{M} w_m \cdot S_m\big(F_m, F'_m\big), \quad \text{where } \sum_m w_m = 1. \tag{4.3}$$
¹ We used the relative scaling of 1:1:1 in the RY-Y-BY space and 0.75:1:0 in the HSI space.
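A sketch of the color similarity of equation 4.1 together with the compound fusion of equation 4.3; color vectors are three-component arrays in [0, 255], and the function names are hypothetical:

```python
import numpy as np

def color_similarity(j, j_prime, s=(1.0, 1.0, 1.0)):
    # Equation 4.1: s encodes the per-channel scale factors
    # (1:1:1 for RY-Y-BY, 0.75:1:0 for HSI; see footnote 1).
    s = np.asarray(s, dtype=float)
    num = np.sum((s * np.asarray(j, float) - s * np.asarray(j_prime, float)) ** 2)
    den = np.sum((255.0 * s) ** 2)
    return 1.0 - num / den              # guaranteed to lie in [0, 1]

def compound_similarity(similarities, weights):
    # Equation 4.3: weighted average of the individual feature similarities;
    # the weights are nonnegative and sum to one.
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, similarities))
```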
This similarity function simply replaces $S_{\text{node}}$ in the previous discussion. This strategy avoids early decisions and characterizes this approach as an example of weak fusion. Weights are chosen to optimize performance. Note that overfitting is not a serious problem, since the number of objects in all scenes is 858, much higher than the number of weights used in the system.

5 Scene Analysis

With our extended matching machinery in place, we are now ready to present our algorithm for scene analysis. It operates in two phases. In the first phase, we use the stereo graph matching technique described above to match object models to the input image pairs. In the second phase, we use an iterative method to evaluate matching similarities and disparity information to estimate occlusion relations and to accept or reject (partial) candidate matches. Objects are accepted in a front-to-back order so that possible occlusions can be properly taken into account when an object candidate is considered. The result of this analysis, the scene interpretation, lists all recognized objects and their locations and specifies the occlusion relations between them. One advantage of the graph representation used is that it allows us to explicitly represent partial occlusions of the object. To this end, we introduce a binary label $v_i^{R/L} \in \{0, 1\}$ for each node of each matched object model that indicates whether this node is visible ($v_i = 1$) or occluded ($v_i = 0$). Given this definition, the average similarity of the visible part of the matched graph, here denoted as $\bar S^{\text{vis}}_m$, is defined as

$$\bar S^{\text{vis}}_m = \frac{1}{2} \frac{\sum_i v^L_{m,i}\, S_i^L(\alpha^L, m^L, I^L)}{\sum_i v^L_{m,i}} + \frac{1}{2} \frac{\sum_i v^R_{m,i}\, S_i^R(\alpha^R, m^R, I^R)}{\sum_i v^R_{m,i}}. \tag{5.1}$$
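Equation 5.1 can be transcribed directly; `sim_*` are the per-node similarities of a matched graph, `vis_*` the binary visibility labels, and at least one node per image is assumed visible (all names are illustrative):

```python
# Mean similarity of the visible nodes, averaged over both images (eq. 5.1).
import numpy as np

def visible_similarity(sim_left, vis_left, sim_right, vis_right):
    sim_left, vis_left = np.asarray(sim_left), np.asarray(vis_left)
    sim_right, vis_right = np.asarray(sim_right), np.asarray(vis_right)
    s_l = np.sum(vis_left * sim_left) / np.sum(vis_left)
    s_r = np.sum(vis_right * sim_right) / np.sum(vis_right)
    return 0.5 * s_l + 0.5 * s_r
```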
The visible-part similarity $\bar S^{\text{vis}}_m$ is interesting because it explicitly discounts information from object parts that are estimated to be occluded: the (usually) low similarities of such nodes are explained away by the occlusion. The following algorithm for scene analysis simultaneously estimates the presence or absence of objects, their locations, and the visibility of individual nodes, thereby segmenting the scene into known objects and background regions. It iterates through the following steps (summarized in Figure 3):

1. INITIALIZATION: All image regions are marked as free, and all object models are matched to the image pair as described above. The nodes of all objects are marked as visible: $v^R_{o,i} = 1$ and $v^L_{o,i} = 1$ for all $o, i$.

2. FIND CANDIDATES: From all object models that have not already been accepted or rejected (see below), select a set of candidate objects for recognition. An object enters the set of candidates if two conditions hold:
Algorithm for Scene Analysis:
1. Initialization: mark all image areas as free, then match all object models.
2. Find candidate objects. IF set of candidates is empty THEN END.
3. Determine closest candidate object.
4. Accept or reject closest candidate object.
5. If accepted, mark corresponding image region as occupied and update node visibilities of all remaining object models and re-compute their match similarities.
6. Go to step 2.

Figure 3: Overview of scene analysis algorithm.
a. A sufficient portion of it is estimated to be visible. Since we are considering left and right images at the same time, since the left and right model graphs for an object may have different numbers of nodes, and because different numbers of nodes may be visible in the left and right images, we define the number of visible nodes to be the maximum number of visible nodes in either the left or the right image. This number has to be above a threshold of $\phi = 20$:

$$\max\Big(\sum_i v^R_{o,i}, \; \sum_i v^L_{o,i}\Big) > \phi.$$
Robust recognition and, in particular, correct rejection on the basis of matching similarities can be made only if a sufficient number of similarity measurements are available. This limits the degree of occlusion the system can cope with. The threshold of 20 nodes corresponds to a minimum required visibility of 60% of the surface area for small objects (e.g., the tea box) and down to 10% for the largest objects (e.g., the toy pig). This is plausible, since very small objects are more likely to be overlooked in heavily cluttered scenes than larger objects.

b. The mean similarity of the visible nodes of the left and right graphs, denoted by $\bar S^{\text{vis}}_o$, lies above a threshold $\theta$: $\bar S^{\text{vis}}_o > \theta$.

If the candidate set is empty because no more object hypotheses fulfill these requirements, the algorithm terminates.

3. FIND FRONT-MOST CANDIDATE: The identification of the front-most candidate is based on the fusion of a disparity cue and a node similarity cue. We evaluate all pairwise comparisons between candidates. The candidate judged to be "in front" most often is selected. The disparity cue simply compares the disparities of two candidates to select the closer object. The assumption behind the node similarity cue is that if the matched object graphs for two objects $o$ and $o'$ overlap in the image, then the closer object must occlude the other, and the image region in question should look more similar to the model of the occluding object than to the model of the occluded object. Consequently, we can define the left and right node similarity occlusion indices $C^{R/L}_{oo'}$ as follows:

$$C^{R/L}_{oo'} = \sum_{(i,j)} \Big(S^{R/L}_{o,i} - S^{R/L}_{o',j}\Big) \quad \forall\, (i, j) \text{ with } \big\|x^{R/L}_{o,i} - x^{R/L}_{o',j}\big\| \leq d.$$
The distance threshold d was set to 5 pixels. We also define the average occlusion index $\bar C_{oo'}$ as the mean of the left and right occlusion indices:

$$\bar C_{oo'} = \frac{1}{2}\big(C^R_{oo'} + C^L_{oo'}\big).$$
We fuse the estimates of the disparity and the node similarity cues as follows: if the absolute difference between the disparities is above a threshold of $\theta_d = 5$ pixels, we use the estimate of the disparity cue to determine which of the two objects is in front. Otherwise, we use the estimate of the node similarity cue. This choice reflects our observation that the disparity cue is reliable only for large disparity differences. This is also supported by studies on the use of disparity in biological vision (see, e.g., Harwerth, Moeller, & Wensveen, 1998).
4. ACCEPT/REJECT CANDIDATE: The algorithm now accepts the front-most candidate. If, however, this object graph shows a higher disparity than the maximum disparity of all previously accepted objects, this candidate is rejected, because in this case the analysis indicates that the candidate should have already been selected during an earlier iteration but was not.

5. UPDATE NODE VISIBILITIES: After a candidate has been accepted, we mark free image regions that are covered by the object as occupied. The nodes of other objects that have not been accepted or rejected earlier are now labeled as occluded if they fall in the image region covered by the just-accepted object.

6. GO TO STEP 2.

The proposed algorithm is conceptually simple and effective, but it has some important limitations. Most prominently, the initial matching of all object models does not take occlusions into account. A more elaborate (but slower) scheme would be to rematch all object models after a new candidate has been accepted, properly taking into account the occlusions due to all previously accepted objects. We also do not allow multiple matches of the same object model in different places. Finally, occlusions by objects that are unknown to the system are not handled in this approach. Although local occlusion can also be determined on the basis of local matching similarity (see Wiskott & von der Malsburg, 1993), future research must address this issue more thoroughly.

6 Results

6.1 The Database. We have recorded a database of stereo image pairs of cluttered scenes composed of 18 different objects in front of uniform and complex backgrounds.² The database is available for download (Eckes & Triesch, 2004). A collage of the 18 objects is shown in Figure 4. Note that there is considerable variation in object shape and size (compare the objects "pig" and "cell-phone" in the upper right corner), but there are also objects with nearly identical shapes, differing only in color and surface texture (compare the objects "Oliver Twist" and "Life of Brian" in the upper left corner). The database has two parts. Part A, the training part, is formed by one training image pair of each object in front of a uniform blue cloth background. Part B, the test part, contains image pairs of scenes with simple and complex backgrounds.
² Images were acquired through a pair of SONY XC-999P cameras with 6 mm lenses, mounted on a modified ZEBRA TRC stereo head with a baseline of 30 cm. We used two IC-P-2M frame grabber boards with AM-CLR modules from Imaging Technology.
Figure 4: Object database. A collage of the 18 objects used. See the text for a description.
There are 80 scenes composed of two to five objects in front of the same uniform blue cloth background. The total number of objects in these scenes is 280. The test part also contains 160 scenes composed of two to five objects in front of one of four structured backgrounds. Here, we have a total of 558 objects. Finally, the test part has five scenes without any objects—just the blue cloth and the four complex backgrounds. Thus, the test part has 263 scenes in total. Some example images are shown in Figure 5.

6.2 Evaluation. Every recognition system able to reject unknown objects uses confidence values in order to judge whether a given object has been localized and identified correctly. If the confidence value falls below a certain threshold—the so-called rejection threshold—the corresponding recognition hypothesis is discarded, since it is most likely incorrect. In many applications, missing an object is a much less severe mistake than a so-called false-positive error, such as incorrectly recognizing an object that is not present in the scene at all (e.g., in security applications). Hence, the proper value for the rejection threshold depends on the application. Alternatively, the area between the receiver operating characteristic (ROC) curve and the upper limit of 100% recognition performance measures the total error regardless of the application scenario in mind (see, e.g., Duda & Hart, 1973).
Figure 5: Example scenes with simple and complex backgrounds: (a, b) Two scenes with uniform background. Note the variation in size for the object "pig" and the different color appearance. Image a stems from the left camera and b from the right. (c–f) Examples of scenes with the four complex backgrounds.
The most important parameter of our system is the similarity threshold θ, which we use as the rejection threshold. θ determines whether a model's matching similarity was high enough to allow it to enter the set of candidates during scene analysis (compare section 5). A low threshold θ will result in many false-positive recognitions, while a high θ will cause many false rejections. Depending on the kind of application, one type of error may be more harmful than the other, and θ should be chosen to reflect this trade-off for any particular application.³ An example of the effect of varying θ is given in Figure 6. In the following, we report our results in the form of ROC curves obtained by systematically varying θ for different versions of the system. The parameter θ was varied in all experiments within the interval [0.32, 0.84] in steps of 0.01; for each setting, the corresponding system was tested on the complete database, and the number of correct object recognitions and the number of false-positive recognitions was recorded. Hence, the parameter θ is used as the rejection threshold that determines the working point of the investigated system. All other parameters are left constant at the values given above.
³ One could also learn a threshold that depends on the specific object model m, that is, θ = θ(m), but this was not attempted in this work.
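A minimal sketch of the evaluation protocol: the rejection threshold is swept over [0.32, 0.84] in steps of 0.01, the system is run on the complete database at each setting, and the counts are recorded as one ROC working point. The function `analyze_database` stands in for the full scene analysis and is an assumption:

```python
import numpy as np

def roc_sweep(analyze_database):
    points = []
    for theta in np.arange(0.32, 0.84 + 1e-9, 0.01):
        correct, false_pos = analyze_database(theta)   # run on all scenes
        points.append((theta, correct, false_pos))
    return points   # one (theta, #correct, #false positives) point each
```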
Figure 6: Rejection performance of different systems for an example scene: (L+R) Complex scene with five objects. (a) Results of the stereo recognition using intensity-based Gabor processing (color-blind features) with three different values of θ (θ = 0.43, 0.49, 0.50). A higher rejection threshold leads to fewer false positives but also tends to remove some true positives. (b) Results of the same stereo scene recognition system when raw color Gabor features taken in HSI color space are used instead (θ = 0.67, 0.68, 0.685). Here, a higher rejection threshold is able to suppress false-positive recognitions without influencing already correct recognitions. The additional color information clearly improves the recognition performance. The system still misses the highly occluded gray toy ape in the background, though.
A recognition result is considered correct if the object position estimated by the matching process is within a radius of 7 pixels of the hand-labeled ground truth position. Our main results are shown in Figure 7. It compares the ROC curves for the base system, using only Gabor features and monocular analysis, with our full system, employing stereo graph matching and using compound features incorporating color information. The monocular system analyzed left and right images independently, and the results were simply averaged. Note that for the determination of occlusion relations during scene analysis, the monocular system must rely on the node similarity cue alone, since disparity information is not available. The performance of the stereo graph matching using only intensity-based Gabor wavelets has also been added for comparison.
Figure 7: Comparison by ROC analysis (axes: number of false positives / false-positive rate versus number and percentage of correct object recognitions). Systematic variation of the rejection parameter θ over [0.32, 0.84] in steps of 0.01 and recording of correct object recognitions and false positives lead to this ROC plot for all investigated systems. We have marked data values with peak performance with the corresponding rejection value (see Table 1 for a summary). The three stereo systems using compound feature sets incorporating color information in HSI, RY-Y-BY, or RGB color (stereo-rc lines) clearly outperform the monocular systems. They also outperform the stereo system using only gray-level features (stereo-greylevel line) over a large range of rejection parameters and corresponding false-positive rates.
Table 1 gives an overview of the performance of the different systems. All scenes with complex and simple backgrounds with two to five objects per scene were used in these experiments. We have also investigated which of the two different types of color feature sets, raw color compounds or color Gabor compounds, shows superior performance. The complete results, presented in Eckes (2006), revealed that raw color compound features support a better recognition performance than color Gabor features. This is understandable, since the first type of feature set explicitly encodes the local surface color, whereas the latter type of feature set encodes the color texture of the objects. Hence, we refrain from presenting any results based on the inferior Gabor color features in the following in order to keep the discussion simple.
Table 1: Comparison of Peak Recognition Results.

Matching | Features         | θ    | Error (%) | Correct (Number) | False Positives (Number)
---------|------------------|------|-----------|------------------|-------------------------
Mono     | Gabor            | 0.51 | 55        | 388              | 500
Mono     | Compound (RYYBY) | 0.71 | 47        | 457              | 205
Mono     | Compound (HSI)   | 0.65 | 40        | 510              | 386
Mono     | Compound (RGB)   | 0.71 | 44        | 478              | 372
Stereo   | Gabor            | 0.48 | 20        | 681              | 850
Stereo   | Compound (RYYBY) | 0.68 | 8         | 787              | 544
Stereo   | Compound (HSI)   | 0.66 | 7         | 793              | 293
Stereo   | Compound (RGB)   | 0.72 | 6         | 803              | 292
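As a quick consistency check (not part of the paper), the error column of Table 1 approximately matches $1 - \text{Correct}/858$, using the 858 test-scene objects mentioned in section 4:

```python
# Rough sanity check of Table 1's error column against the correct counts.
for correct, reported in [(388, 55), (681, 20), (787, 8), (803, 6)]:
    error = 100.0 * (1.0 - correct / 858.0)
    print(f"correct={correct}: computed {error:.1f}%, reported {reported}%")
```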
6.3 Discussion. The introduction of the stereo algorithm leads to an increase in peak performance, reducing the error rate from 55% to 20%. This improvement comes at the price of an increase of the false-positive rate from around 10% to 18%. This is more than compensated, however, when color features are added to the system. For a fixed number of false positives, the extended system typically reaches twice as many correct recognitions in comparison to the original one, whereas the choice of color space, RGB, HSI, or RY-Y-BY, makes only a minor difference in the peak performance (see Figure 7 for the analysis of the ROC). The system using raw color compounds from the HSI color space shows the optimal trade-off behavior, since its number of correct object recognitions remains optimal even if the number of false positives is decreased. For very small numbers of false positives (N ≤ 10), however, the original system outperforms our extended version, as the magnification of the ROC in Figure 8 reveals. The ROC curves show an interval of monotonic increase in recognition performance as the number of false positives increases, in accordance with expectations. The recognition systems, however, also show a decrease of correct object recognitions as the number of false positives rises above a certain value. This is due to a competition between false-positive and correct object models matched at the same region in the scene. Table 1 summarizes the peak performance of the different systems and makes it possible to study the effects of stereo graph matching and compound feature sets in isolation. Peak performance was defined as the maximum number of correctly recognized objects regardless of the number of false positives. The corresponding θ values are given in the table. Additional color and stereo information has resulted in an enormous increase in performance. The optimal combination of stereo and color cues enables the system to reduce its error rate by almost a factor of 8.
Figure 8: Comparison by ROC analysis (magnification of Figure 7; axes as in Figure 7). For more than eight false positives (0.1% FPR), the stereo-based color compound systems outperform the original system. Interestingly, the stereo algorithm with intensity-based Gabor features shows worse performance than the original system until more than 54 false positives (1.2% FPR) are allowed.
Interestingly, the use of the stereo algorithm with color-blind features increases the number of false positives significantly, but the combination of both extensions actually reduces the number of false positives dramatically in comparison to the original system. The best performance is achieved with color features represented in the HSI color space. As Figure 7 has shown, the difference between the different color compound features is small when looking at the peak performance, but the stereo system based on HSI color compound features outperforms all other systems, as it keeps the optimal balance between performance and false-positive rate (FPR).

6.4 Efficiency. It takes roughly 30 seconds on a 1 GHz Pentium III with 512 MB RAM to analyze a 256 × 256 stereo image pair if 18 objects have been enrolled. Transformation and feature extraction take 2 seconds, and stereo matching takes 1 to 2 seconds per object. The time for the final scene analysis is negligible. Hence, most time is spent in the matching routines, which scale linearly with the number of objects n, that is, O(n). A recognition system with 21 objects working on image sizes of 128 × 128 performs on the given hardware in almost real time.
Note that the research code is not optimized for speed. A system for face detection, tracking, and recognition based on the same methods, Gabor wavelet preprocessing and graph matching, performs in real time on similar image sizes (see Steffens, Elagin, & Neven, 1998).

7 Discussion

The system we have proposed is able to recognize and segment partially occluded objects in cluttered scenes with complex backgrounds on the basis of known shape, color, and texture. The specific approach we presented here is based on elastic graph matching, an object recognition architecture that models views of objects as 2D constellations of image features represented as labeled graphs. The graph representation is particularly powerful for object recognition under partial occlusions because it naturally allows us to explicitly model occlusions of object parts (Wiskott & von der Malsburg, 1993). The contributions of this article have been the incorporation of richer object descriptions in the form of compound feature sets (Triesch & Eckes, 1998), the extension to graph matching in stereo image pairs (Triesch, 1999; Kefalea, 2001), and the introduction of a scene analysis algorithm utilizing the disparity information provided by the stereo graph matching to disambiguate occlusion relations. The combination of these innovations leads to substantial performance improvements. At a more abstract level, it appears that the major reason for the significant boost in performance is the combined use of different modalities or cues: stereo, color, shape, texture, and occlusion. We are unaware of any other system that has attempted to integrate such a wide range of cues for the purpose of object recognition and scene analysis. The endeavor to integrate a wide range of cues may be motivated by studies of the development of different modalities in infants, indicating that the human visual system tries to exploit every available cue in order to recognize objects (Wilcox, 1999). Note that stereo information is typically used as a low-level cue; that is, information from the left and right images is typically fused at a very early processing stage, while here we fuse the images only at the level of full (or partial) object descriptions. This may not be very plausible from a biological perspective, since fusion of information from the left and right eyes in mammalian vision systems seems to occur quite early. Nevertheless, the results we obtained are promising. Another point worth highlighting is our system's ability to learn about new objects in a one-shot fashion, which is quite valuable. Our system's ability to handle unknown backgrounds is also worth mentioning. That is not to say that the system could not be substantially improved by gathering and utilizing statistical information about objects and backgrounds from many training images. On the contrary, we would expect this to improve the system substantially (e.g., Duc et al., 1999). A statistically well-founded system for object detection was presented in Viola and Jones (2001a, 2001b), in which the type of features as well as the detection classifiers were learned from many training examples.
Its high efficiency and its well-founded learning architecture make it very attractive, although it is unclear how it can be extended to deal with occlusions. However, we view the good performance obtained in our study despite the lack of such statistical descriptions as strong testimony to the power of the chosen object representation. We must admit that our system is shaped by a number of somewhat arbitrary choices concerning, for instance, the sequence of events, the relative weights in the creation of compound features, and the rather algorithmic nature of the system. It is our vision that the coordination of processes that together constitute the analysis of a visual scene will eventually be described as a dynamical system shaped by self-organization on both the timescales of learning and brain state organization. The actual sequence of events may then depend on particular features of a given scene.

In the following, we discuss some related work. The number of articles on object recognition is vast, and the goal of this section is to discuss our work in the light of some specific example approaches. We restrict our discussion to appearance-based object recognition approaches. A number of models of object recognition in the primate visual system have been proposed in the literature. Examples are Fukushima, Miyake, and Ito (1983), Mel (1997), and Deco and Schürmann (2000) (see the overview in Riesenhuber & Poggio, 2000). However, none of these systems tries to explicitly model partial occlusions. Some more closely related work has been proposed in the computer vision literature.

7.1 Modeling Occlusions in Object Recognition. Of particular interest is the work by Perona's group (Burl & Perona, 1996; Moreels, Maire, & Perona, 2004), which also describes objects as 2D constellations of features. Their approach is cast in a probabilistic formulation. However, they restrict their analysis to the use of binary features—features that either are or are not present. Thus, their method must make "hard" decisions at the early level of feature extraction, which is often problematic. A similar approach has also been proposed in Pope and Lowe (1996, 2000) and, more recently, Lowe (2004). Such object recognition models rely on detecting stable key points on the object surface, from which robust features (e.g., extrema in scale-space) are extracted and compared to features stored in memory. The matched features vote for the most likely affine transformation on the basis of a Hough transformation, thereby establishing the correspondences between the training image and the current input scene. The matching is fast and able to detect highly occluded objects in some complicated scenes. To this end, either features are made so specific that detection of a small number of them, say three, is sufficient evidence for the presence of the object (Lowe), or missing features are accounted for by simply specifying a prior probability of any feature missing (Perona). Thus, these systems do not explicitly try to model occlusions due to recognized objects to explain away missing features in partly occluded objects, as our approach does.
7.2 Integrating Segmentation and Recognition. The analysis of complex cluttered scenes is one of the most challenging problems in computer vision. Progress in this area has been comparatively slow, and we feel that one reason is that researchers have usually tried to decompose the problem in the wrong way. A specific problem may have been a frequent insistence on treating segmentation and recognition as separate problems, to be solved sequentially. We believe that this approach is inherently problematic, and this work is to be seen as a first step toward finding integrated solutions to both problems. Similarly, other classic computer vision problems, such as stereo, shape from shading, and optic flow, have been studied mostly in isolation, despite their ill-posed nature. From this perspective, we feel that the greatest progress may result from breaking with this tradition and building vision systems that try to find integrated solutions to several of these problems. This is a challenging endeavor, but we feel that it is time to address it. With respect to segmentation and recognition, there is striking evidence for a very early combination of top-down and bottom-up processing in the human visual system (see, e.g., the work of Peterson and colleagues: Peterson, Harvey, & Weidenbacher, 1991; Peterson & Gibson, 1994; Peterson, 1999; or neurophysiological studies such as Mumford, 1991, or Lee, Mumford, Romero, & Lamme, 1998). In our current system, segmentation is purely model driven. But earlier work has also demonstrated the benefits of integrating bottom-up and model-driven processing for segmentation and object learning (Eckes & Vorbrüggen, 1996; Eckes, 2006). In these studies, even partial object knowledge was able to disambiguate the otherwise notoriously error-prone data-driven segmentation. Such an approach leads to a more robust object segmentation that adapts and refines the object model in a stabilizing feedback loop. An interesting alternative to that approach is the work by Ullman and colleagues (Borenstein, Sharon, & Ullman, 2004; Borenstein & Ullman, 2002, 2004), in which bottom-up segmentation is combined with a learned object representation of a particular object class. The system is able to segment other objects within this category from the background. Images of different horses were used to learn a "patch representation" of the object class "horse," which is then used to recognize subpatterns in horse images. This top-down information is fused with a low-level segmentation by minimizing a cost function, improving the segmentation performance. In contrast to our work, the focus here is on improving segmentation based on prior object knowledge, assuming the object has already been recognized. We believe that the close integration of bottom-up segmentation and recognition processes may also benefit recognition performance, due to the explaining away of occluded object parts and due to the removal of unwanted features likely not belonging to the object, which would otherwise introduce noise into higher-level object representations based on large receptive fields.
Another interesting segmentation architecture able to integrate low-level and high-level segmentation was presented in Tu, Chen, Yuille, and Zhu (2003). Low-level segmentation based on Markov random field (MRF) models of intensity values is combined with high-level object modules for face and text detection. An input image is parsed into low-level regions (based on brightness cues) and object regions by a Metropolis-Hastings dynamics that minimizes the cost function of a combined MRF model. Tu et al. trained prior models for facial and textual regions in the image with the help of AdaBoost from training examples. The outputs of the probabilistic models are combined with an MRF segmentation model. The additive priors in the energy function of the MRF model favor step-wise homogeneous regions in gray-level intensity with smooth borders and not-too-small region sizes. This system has a nice probabilistic formulation, but the data-driven Markov chain Monte Carlo method used for inference tends to be very slow. In addition, the system is unlikely to handle large occlusions, because the object recognizers for text and faces are not very robust to partial occlusions. Combining recognition with attention (e.g., Itti & Koch, 2000) is also a mandatory extension to our system. The recognition process might focus on areas with salient image features first, which would significantly reduce the rather sequential graph matching effort. These features may also help to preselect or activate feature constellations in the model domain and may bias the recognition system to start with only a promising subset of known objects. In general, combining low-level segmentation and attention approaches such as Poggio, Gamble, and Little (1988) with recognition approaches is very promising and biologically highly relevant (see, e.g., Lee & Mumford, 2003). However, we believe that a more dynamical formulation of the matching process (e.g., in the style of the neocognitron; see Fukushima, 2000), in combination with a recurrent and continuously refined low-level segmentation, must be established. Such an integrated model may help us understand how the brain solves the very difficult problem of vision and may also help us develop better machine vision. Despite a number of drawbacks, our method and the methods discussed here are promising examples of what integrated segmentation and object recognition systems might look like. There are still many open questions, and a biologically plausible dynamic formulation is desirable.

8 Conclusion

We have presented a system for the automatic analysis of cluttered scenes of partially occluded objects that obtained good results by integrating a number of visual cues that are typically treated separately. The addition of these cues has increased the performance of the scene analysis considerably, reducing the error rate by roughly a factor of 8. Only the combined use of stereo and color cues was responsible for this unusually large improvement in performance.
Our study of the different systems has shown that all proposed systems based on color and stereo cues outperform the monocular and color-blind recognition system over a large range of rejection settings when more than 20 false positives (0.4% false-positive rate) can be tolerated. Future research should try to incorporate bottom-up segmentation mechanisms and the learning of statistical object descriptions—ideally in an only weakly supervised fashion. A direct comparison of different scene analysis approaches is still difficult at present because there is no standard database available. The FERET test in face recognition (Phillips, Moon, Rizvi, & Rauss, 2000) has demonstrated the benefits of independent databases and benchmarks. We hope that the publication of our data set (Eckes & Triesch, 2004) will help fill this gap and facilitate future work in this area.

Appendix: Color Space Transformations

Let us specify the type of color transformation we have used in this work by following Gonzales and Woods (1992), since there is often confusion about color spaces in both the research literature and in documents on ITU or ISO/IEC standards. The transformation from RGB to RY-Y-BY color space is given by
$$\begin{pmatrix} RY \\ Y \\ BY \end{pmatrix} = \begin{pmatrix} 0.500 & -0.419 & -0.081 \\ 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} + \begin{pmatrix} 127 \\ 0 \\ 127 \end{pmatrix}. \tag{A.1}$$
The transformation from RGB to HSI color space is given by

$$H = \arccos\!\left(\frac{\tfrac{1}{2}\big[(R - G) + (R - B)\big]}{\sqrt{(R - G)^2 + (R - B)(G - B)}}\right)$$

$$S = 1 - 3\,\frac{\min(R, G, B)}{R + G + B}$$

$$I = \frac{1}{3}(R + G + B). \tag{A.2}$$
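The appendix transformations translate directly into code. A sketch assuming RGB values in [0, 255]; note that the arccos-based hue formula yields an angle in [0, π], and the usual correction for B > G is omitted here, as in the text:

```python
import numpy as np

def rgb_to_ryyby(rgb):
    # Equation A.1: linear RGB -> RY-Y-BY (YCbCr-like) transformation.
    m = np.array([[ 0.500, -0.419, -0.081],
                  [ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500]])
    return m @ np.asarray(rgb, dtype=float) + np.array([127.0, 0.0, 127.0])

def rgb_to_hsi(rgb):
    # Equation A.2: RGB -> HSI; the arccos argument lies in [-1, 1].
    r, g, b = (float(v) for v in rgb)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b))
    h = np.arccos(num / den) if den > 0 else 0.0
    s = 1.0 - 3.0 * min(r, g, b) / (r + g + b) if (r + g + b) > 0 else 0.0
    i = (r + g + b) / 3.0
    return h, s, i
```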
Acknowledgments

This work was in part supported by grant 0208451 from the National Science Foundation. We thank the developers of the FLAVOR software environment, which served as the platform for this research.
References

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Trans. PAMI, 24(4), 509–522.
Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentations. In IEEE Workshop on Perceptual Organization in Computer Vision. Washington, DC.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In ECCV2002, 7th European Conference on Computer Vision (pp. 109–122). Berlin: Springer.
Borenstein, E., & Ullman, S. (2004). Learning to segment. In The 8th European Conference on Computer Vision—ECCV 2004, Prague. Berlin: Springer-Verlag.
Buhmann, J., Lades, M., & von der Malsburg, C. (1990). Size and distortion invariant object recognition by hierarchical graph matching. In IJCNN International Conference on Neural Networks (pp. 411–416). San Diego, CA: IEEE.
Burl, M., & Perona, P. (1996). Recognition of planar object classes. In 1996 Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Computer Society.
Deco, G., & Schürmann, B. (2000). A hierarchical neural system with attentional top-down enhancement of the spatial resolution for object recognition. Vision Research, 40(20), 2845–2859.
Duc, B., Fischer, S., & Bigun, J. (1999). Face authentication with Gabor information on deformable graphs. IEEE Transactions on Image Processing, 8(4), 504–516.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Eckes, C. (2006). Fusion of visual cues for segmentation and object recognition. In preparation.
Eckes, C., & Triesch, J. (2004). Stereo cluttered scenes image database (SCS-IDB). Available online at http://greece.imk.fraunhofer.de/publications/data/SCS/SCS.zip.
Eckes, C., & Vorbrüggen, J. C. (1996). Combining data-driven and model-based cues for segmentation of video sequences. In World Conference on Neural Networks (pp. 868–875). Mahwah, NJ: INNS Press and Erlbaum.
Edelman, S. (1995). Representation of similarity in three-dimensional object discrimination. Neural Computation, 7, 408–423.
Edelman, S. (1998). Representation is representation of similarities (Tech. Rep. No. CS96-08, 1996). Jerusalem: Weizmann Science Press.
Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Trans. Computers, 22(1), 67–92.
Fleuret, F., & Geman, D. (2001). Coarse-to-fine face detection. International Journal of Computer Vision, 41(1/2), 85–107.
Forsyth, D. A., & Ponce, J. (2002). Computer vision: A modern approach. Upper Saddle River, NJ: Prentice Hall.
Fukushima, K. (2000). Active and adaptive vision: Neural network models. In S.-W. Lee, H. H. Bülthoff, & T. Poggio (Eds.), Biologically motivated computer vision. Berlin: Springer.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Systems, Man, and Cybernetics, SMC-13(5), 826–834.
Gonzales, R. C., & Woods, R. E. (Eds.). (1992). Digital image processing. Reading, MA: Addison-Wesley.
Harwerth, R., Moeller, M., & Wensveen, J. (1998). Effects of cue context on the perception of depth from combined disparity and perspective cues. Optometry and Vision Science, 75(6), 433–444.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(12), 1489–1506.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 56(6), 1233–1258.
Kefalea, E. (2001). Flexible object recognition for a grasping robot. Unpublished doctoral dissertation, Universität Bonn.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448.
Lee, T. S., Mumford, D., Romero, R., & Lamme, V. A. (1998). The role of the primary visual cortex in higher level vision. Vision Research, 38(15/16), 2429–2454.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture. Neural Computation, 9(4), 777–804.
Messmer, B. T., & Bunke, H. (1998). A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 493–504.
Moreels, P., Maire, M., & Perona, P. (2004). Recognition by probabilistic hypothesis construction. In Eighth European Conference on Computer Vision. Berlin: Springer.
Mumford, D. (1991). On the computational architecture of the neocortex. I. The role of the thalamo-cortical loop. Biological Cybernetics, 65(2), 135–145.
Murase, H., & Nayar, S. K. (1995). Visual learning and recognition of 3-D objects from appearance. Int. J. of Computer Vision, 14(2), 5–24.
Nelson, R. C., & Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15–16), 2469–2488.
Peterson, M. (1999). Knowledge and intention can penetrate early vision. Behavioral and Brain Sciences, 22(3), 389.
Peterson, M. A., & Gibson, B. S. (1994). Must figure-ground organization precede object recognition? An assumption in peril. Psychological Science, 7(5), 253–259.
Peterson, M. A., Harvey, E. M., & Weidenbacher, H. J. (1991). Shape recognition contributes to figure-ground reversal: Which route counts? Journal of Experimental Psychology: Human Perception and Performance, 17(4), 1075–1089.
Phillips, P., Moon, H., Rizvi, S., & Rauss, P. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1090–1104.
Poggio, T., Gamble, E., & Little, J. (1988). Parallel integration of vision modules. Science, 242, 436–440.
Pope, A. R., & Lowe, D. G. (1996). Learning appearance models for object recognition. In J. Ponce, A. Zisserman, & M. Hebert (Eds.), Object representation in computer vision II (pp. 201–219). Berlin: Springer.
Pope, A. R., & Lowe, D. G. (2000). Probabilistic models of appearance for 3-D object recognition. International Journal of Computer Vision, 40(2), 149–167.
Pratt, W. K. (2001). Digital image processing. New York: Wiley.
Ren, X., & Malik, J. (2002). A probabilistic multi-scale model for contour completion based on image statistics. In Proceedings of the Seventh European Conference on Computer Vision (pp. 312–327). Berlin: Springer.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3(Supp.), 1199–1204.
Rosenfeld, A. (1984). Image analysis: Problems, progress, and prospects. Pattern Recognition, 17, 3–12.
Rowley, H., Baluja, S., & Kanade, T. (1998). Neural network–based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 23–38.
Shams, L., & von der Malsburg, C. (2002). The role of complex cells in object recognition. Vision Research, 42, 2547–2554.
Steffens, J., Elagin, E., & Neven, H. (1998). PersonSpotter—fast and robust system for human detection, tracking and recognition. In Proceedings of the Third International Conference on Face and Gesture Recognition (pp. 516–521). Piscataway, NJ: IEEE Computer Society.
Swain, M., & Ballard, D. H. (1991). Color indexing. Int. J. of Computer Vision, 7(1), 11–32.
Tarr, M., & Bülthoff, H. (1998). Image-based object recognition in man, monkey and machine. Cognition, 67(1–2), 1–20.
Triesch, J. (1999). Vision-based robotic gesture recognition. Aachen, Germany: Shaker Verlag.
Triesch, J., & Eckes, C. (1998). Object recognition with multiple feature types. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings ICANN 98 (pp. 233–238). Berlin: Springer.
Triesch, J., & Eckes, C. (2004). Object recognition with deformable feature graphs: Faces, hands, and cluttered scenes. In C. H. Chen (Ed.), Handbook of pattern recognition and computer vision. Singapore: World Scientific.
Triesch, J., & von der Malsburg, C. (2001). A system for person-independent hand posture recognition against complex backgrounds. IEEE Trans. PAMI, 23(12), 1449–1453.
Triesch, J., & von der Malsburg, C. (2002). Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing, 20(13–14), 937–943.
Tu, Z., Chen, X., Yuille, A., & Zhu, S. (2003). Image parsing: Unifying segmentation, detection, and recognition. In Proceedings of the Ninth IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Computer Society.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S. (1998). Three-dimensional object recognition based on the combination of views. Cognition, 67(1–2), 21–44.
Viola, P., & Jones, M. J. (2001a). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Computer Society.
Viola, P., & Jones, M. J. (2001b). Robust real-time object detection (Tech. Rep. CRL 2001/01). Cambridge, MA: Cambridge Research Laboratory.
Wilcox, T. (1999). Object individuation: Infants' use of shape, size, pattern, and color. Infant Behavior and Development, 72(2), 125–166.
Wiskott, L., Fellous, J., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775–779.
Wiskott, L., & von der Malsburg, C. (1993). A neural system for the recognition of partially occluded objects in cluttered scenes: A pilot study. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 935–948.
Yuille, A. L. (1991). Deformable templates for face recognition. Journal of Cognitive Neuroscience, 3(1), 59–70.
Received July 20, 2004; accepted August 10, 2005.
LETTER
Communicated by Joachim Buhmann
Support Vector Machines for Dyadic Data Sepp Hochreiter∗
[email protected] Klaus Obermayer
[email protected] Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, Germany
We describe a new technique for the analysis of dyadic data, where two sets of objects (row and column objects) are characterized by a matrix of numerical values that describe their mutual relationships. The new technique, called potential support vector machine (P-SVM), is a large-margin method for the construction of classifiers and regression functions for the column objects. Contrary to standard support vector machine approaches, the P-SVM minimizes a scale-invariant capacity measure and requires a new set of constraints. As a result, the P-SVM method leads to a usually sparse expansion of the classification and regression functions in terms of the row rather than the column objects and can handle data and kernel matrices that are neither positive definite nor square. We then describe two complementary regularization schemes. The first scheme improves generalization performance for classification and regression tasks; the second scheme leads to the selection of a small, informative set of row support objects and can be applied to feature selection. Benchmarks for classification, regression, and feature selection tasks are performed with toy data as well as with several real-world data sets. The results show that the new method is at least competitive with but often performs better than the benchmarked standard methods for standard vectorial as well as true dyadic data sets. In addition, a theoretical justification is provided for the new approach.

1 Introduction

Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far, most of them were specifically designed for vectorial data.

∗ New affiliation: Institute for Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria.
Neural Computation 18, 1472–1510 (2006)
C 2006 Massachusetts Institute of Technology
Support Vector Machines for Dyadic Data
A)
a b c d α β χ δ ε φ γ
1.3
−2.2 −1.6
7.8
−1.8
−1.1 7.2
2.3
1.2
1.9
−2.9
−2.2
3.7
0.8 −0.6
2.5
9.2
−9.4 −8.3
9.2
−7.7 8.6 −9.7
−7.4
−4.8 0.1 −1.2
0.9
B)
1473
a b c d e f a b c d e f g
0.9 −0.1 −0.8
0.5 0.2
g
−0.5 −0.7
−0.1
0.9
0.6
0.3 −0.7 −0.6
0.3
−0.8
0.6
0.9
0.2 −0.6
0.6
0.5
0.5
0.3
0.2
0.9
0.7
0.1
0.3
0.2 −0.7 −0.6
0.7
0.9 −0.9 −0.5
−0.5 −0.6 0.6
0.1 −0.9
0.9
0.9
−0.7
0.3 −0.5 0.9
0.9
0.3
0.5
Figure 1: (A) Dyadic data. Column objects {a , b, . . .} and row objects {α, β, . . .} are represented by a matrix of numerical values that describe their mutual relationships. (B) Pairwise data. A special case of dyadic data, where row and column objects coincide.
convenient because of the structure imposed by the Euclidean metric. However, there are data sets for which a vector-based description is inconvenient or simply wrong, and representations that consider relationships between objects are more appropriate. In the following, we will study representations of data objects based on matrices. We consider “column” objects, which are the objects to be described, and “row” objects, which are the objects that serve for their description (see Figure 1A). The whole data set can then be represented using a rectangular matrix whose entries denote the relationships between the corresponding row and column objects. We call representations of this form dyadic data (Hofmann & Puzicha, 1998; Li & Loken, 2002; Hoff, 2005). If row and column objects are from the same set (see Figure 1B), the representation is usually called pairwise data, and the entries of the matrix can often be interpreted as the degree of similarity (or dissimilarity) between objects. Dyadic descriptions are more powerful than vector-based descriptions, as vectorial data can always be brought into dyadic form when required. This is often done for kernel-based classifiers or regression functions ¨ (Scholkopf & Smola, 2002; Vapnik, 1998), where a Gram matrix of mutual similarities is calculated before the predictor is learned. A similar procedure can also be used in cases where the row and column objects are from different sets. If both of them are described by feature vectors, a matrix can be calculated by applying a kernel function to pairs of feature vectors—one vector describing a row and the other vector describing a column object. One example for this is the drug-gene matrix of Scherf et al. (2000), which
1474
S. Hochreiter and K. Obermayer
was constructed as the product of a measured drug sample and a measured sample gene matrix and where the kernel function was a scalar product. In many cases, however, dyadic descriptions emerge because the matrix entries are measured directly. Pairwise data representations as a special case of dyadic data can be found for data sets where similarities or distances between objects are measured. Examples include sequential or biophysical similarities between proteins (Lipman & Pearson, 1985; Sigrist et al., 2002; Falquet et al., 2002), chromosome location or co-expression similarities between genes (Cremer et al., 1993; Lu, Wallrath, & Elgin, 1994; Heyer, Kruglyak, & Yooseph, 1999), co-citations of text documents (White & McCain, 1989; Bayer, Smart, & Melaughlin, 1990; Ahlgren, Jarneving, & Rousseau, 2003), or hyperlinks between web pages (Kleinberg, 1999). In general, these measured matrices are symmetric but may not be positive definite, and even if they are for the training set, they may not remain positive definite if new examples are included. Genuine dyadic data occur whenever two sets of objects are related. Examples are DNA microarray data (Southern, 1988; Lysov et al., 1988; Drmanac, Labat, Brukner, & Crkvenjakov, 1989; Bains & Smith, 1988), where the column objects are tissue samples, the row objects are genes, and every sample-gene pair is related by the expression level of this particular gene in this particular sample. Other examples are web documents, where both the column and row objects are web pages and column web pages are described by either hyperlinks to or from the row objects, which give rise to a rectangular matrix.1 Further examples include documents in a database described by word frequencies (Salton, 1968) or molecules described by transferable atom equivalent (TAE) descriptors (Mazza, Sukumar, Breneman, & Cramer, 2001). Traditionally, row objects have often been called “features” and column vectors of the dyadic data matrix have mostly been treated as “feature vectors” that live in a Euclidean vector space, even when the dyadic nature of the data was made explicit (see Graepel, Herbrich, Bollman-Sdorra, & Obermayer, 1999; Mangasarian, ¨ 1998) or the “feature map” method (Scholkopf & Smola, 2002). Difficulties, however, arise when features are heterogeneous, and apples and oranges must be compared. What theoretical arguments would, for example, justify, treating the values of a set of TAE descriptors as coordinates of a vector in an Euclidean space? A nonvectorial approach to pairwise data is to interpret the data matrix as a Gram matrix and to apply support vector machines (SVM) for classification and regression if the data matrix is positive semidefinite (Graepel et al. 1999). For indefinite (but symmetric) matrices, two other nonvectorial approaches have been suggested (Graepel et al., 1999). In the first approach, the data matrix is projected into the subspace spanned by the eigenvectors
1 Note that for pairwise data examples, the linking matrix was symmetric because links were considered bidirectional.
Support Vector Machines for Dyadic Data
1475
with positive eigenvalues. This is an approximation, and one would expect good results only if the absolute values of the negative eigenvalues are small compared to the dominant positive ones. In the second approach, directions of negative eigenvalues are processed by flipping the sign of these eigenvalues. All three approaches lead to positive semidefinite matrices on the training set, but positive semidefiniteness is not ensured if a new test object must be included. An embedding approach was suggested by Herbrich, Graepel, & Obermayer (2000) for antisymmetric matrices, but this was specifically designed for data sets, where the matrix entries denote preference relations between objects. In summary, no general and principled method exists to learn classifiers or regression functions for dyadic data. In order to avoid the shortcomings noted, we suggest considering column and row objects on an equal footing and interpret the matrix entries as the result of a kernel function or measurement kernel, which takes a row object, applies it to a column object, and outputs a number. It will turn out that mild conditions on this kernel suffice to create a vector space endowed with a dot product into which the row and the column objects can be mapped. Using this mathematical argument as a justification, we show how to construct classification and regression functions in this vector space in analogy to the large margin-based methods for learning perceptrons for vectorial data. Using an improved measure for model complexity and a new set of constraints that ensure good performance on the training data, we arrive at a generally applicable method to learn predictors for dyadic data. The new method is called the potential support vector machine (P-SVM). It can handle rectangular matrices as well as pairwise data whose matrices are not necessarily positive semidefinite, but even when the P-SVM is applied to regular Gram matrices, it shows very good results when compared with standard methods. Due to the choice of constraints, the final predictor is expanded into an usually sparse set of descriptive row objects, which is different from the standard SVM expansion in terms of the column objects. This opens up another important application domain: a sparse expansion is equivalent to feature selection (Guyon & Elisseeff, 2003; Hochreiter & Obermayer, 2004b; Kohavi & John, 1997; Blum & Langley, 1997). An efficient implementation of the P-SVM using a modified sequential minimal optimization procedure for learning is described in Hochreiter and Obermayer (2004a).
2 The Potential Support Vector Machine 2.1 A Scale-Invariant Objective Function. Consider a set xi | 1 ≤ i ≤ L ⊂ X of objects that are described by feature vectors x iφ ∈ R N and form a training set Xφ = x 1φ , . . . , x φL . The index φ is introduced because we will later use the kernel trick and assume that the vectors x iφ are images of a map φ that is induced by either a kernel or a measurement
1476
S. Hochreiter and K. Obermayer
function. Assume for the moment a simple binary classification problem, where class membership is indicated by labels yi ∈ {+1, −1}, ·, · denotes the dot product, and a linear classifier parameterized by the weight vector w and the offset b is chosen from the set
sign f (x φ ) | f (x φ ) = w, x φ + b .
(2.1)
Standard SVM techniques select the classifier with the largest margin under the constraint of correct classification on the training set: 1 w2 w,b 2 s.t. yi w, x iφ + b ≥ 1.
min
(2.2)
If the training data are not linearly separable, a large margin is traded against a small training error using a suitable regularization scheme. The maximum margin objective is motivated by bounds on the generalization error using the Vapnik-Chervonenkis (VC) dimension h as a capacity measure (Vapnik, 1998). For the set of all linear defined on Xφ , for classifiers 2 which γ ≥ γmin holds, one obtains h ≤ min R2 /γmin , N + 1 (see Vapnik, ¨ 1998; Scholkopf & Smola, 2002). [·] denotes the integer part, γ is the margin, and R is the radius of the smallest sphere in data space, which contains all the training data. Bounds derived using the fat-shattering dimension ¨ (Shawe-Taylor, Bartlett, Williamson, & Anthony, 1996, 1998; Scholkopf & Smola, 2002), and bounds on the expected generalization error (cf. Vapnik, ¨ 1998; Scholkopf & Smola, 2002) depend on R/γmin in a similar manner. Both the selection of a classifier using the maximum margin principle and the values obtained for the generalization error bounds suffer from the problem that they are not invariant under linear transformations. This problem is illustrated in Figure 2 for a 2D classification problem. In the left figure, both classes are separated by the hyperplane with the largest margin (solid line). In the right figure, the same classification problem is shown, but scaled along the vertical axis by a factor s. Again, the solid line denotes the support vector solution, but when the classifier is scaled back to s = 1 (dashed line in the left figure) it no longer coincides with the original SVM solution. The ratio R2 /γ 2 , which bounds the VC dimension, also depends on the scale factors (see the legend of Figure 2). The situation depicted in Figure 2 occurs whenever the data can be enclosed by a (nonspherical) ¨ ellipsoid (cf. Scholkopf, Shawe-Taylor, Smola, & Williamson, 1999). The considerations of Figure 2 show that the optimal hyperplane is not scale invariant, and predictions of class labels may change if the data are rescaled before learning. Thus, the question arises of which scale factors should be used for classifier selection.
Support Vector Machines for Dyadic Data
1477
Figure 2: (Left) Data points from two classes (triangles and circles) are separated by the hyperplane with the largest margin (solid line). The two support vectors axis (black symbols) are separated by dx along horizontal and by d y along the the vertical axis, 4γ 2 = dx2 + d y2 and R2 / 4γ 2 = R2 / dx2 + d y2 . The dashed line indicates the classification boundary of the classifier shown on the right, scaled along the vertical axis by the factor 1/s. (Right) The same data but scaled along the vertical axis by the factor s. The solid the maximum margin line denotes hyperplane, 4γ 2 = dx2 + s 2 d y2 and R2 / 4γ 2 = R2 / dx2 + s 2 d y2 . For d y = 0 both the margin γ and the term R2 /γ 2 depend on s.
Here we suggest scaling the training data such that the margin γ remains constant while the radius R of the sphere containing all training data ˜ becomes as small as possible. The result is a new sphere with radius R, which leads to a tighter margin-based bound for the generalization error. Optimality is achieved when all directions orthogonal to the normal vector w of the hyperplane γ are scaled to zero and maximal margin with R˜ = mint∈R maxi w, ˆ := w/w. If |t| is ˆ x iφ + t ≤ maxi w, ˆ x iφ , where w small compared to w, ˆ x iφ , for example, if the data are centered at the origin, t can be neglected through above inequality. Unfortunately, this formulation does not lead to an optimization problem that is easy to implement. Therefore, we suggest minimizing the upper bound, ˜2 2 2 2 R , = R˜ 2 w2 ≤ max w, x iφ ≤ w, x iφ = X φw 2 i γ i
(2.3)
where the matrix Xφ := x 1φ , x 2φ , . . . , x φL contains the training vectors x iφ . In the second inequality, the maximum norm is bounded by the Euclidean norm. Its worst-case factor is L, but the bound is tight. The new objective is well defined also for cases where Xφ X φ or/and X φ X φ is singular, and the kernel trick carries over to the new technique.
1478
S. Hochreiter and K. Obermayer
It can be shown that replacing the objective function w2 (see equation 2.2) by the upper bound 2 w Xφ X φ w = Xφ w
(2.4)
˜2
on Rγ 2 , equation 2.3, corresponds to the integration of sphering (whitening) and SVM learning if the data have zero mean. If the data have already been sphered, then the covariance matrix is given by X φ X φ = I, and we recover the classical SVM. If not, minimizing the new objective leads to normal vectors that are rotated toward directions of low variance of the data when compared with the standard maximum margin solution. Note, however, that whitening can easily be performed in input space but becomes nontrivial if the data are mapped to a high-dimensional feature space using a kernel function. In order to test whether the situation depicted in Figure 2 actually occurs in practice, we applied a C-SVM with a radial basis function (RBF) kernel to the UCI data set Breast Cancer from section 3.1.1. The kernel width was chosen small and C large enough so that no error occurred on the training set (to be consistent with Figure 2). Sphering was performed in feature space by replacing the objective function in equation 2.2 with the new objective, equation 2.4. Figure 3 shows the angle
w , w svm sph ψ = arccos wsvm , wsvm wsph , w sph
i j (2.5) svm sph αi α j yi k x , x i, j
= arccos
sph sph svm svm i j i j αi α j yi y j k (x , x ) αi α j k (x , x ) i, j
i, j
between the hyperplane for the sphered (“sph”) and the nonsphered (“svm”) case as a function of the width σ of the RBF kernel. The observed values range between 6 and 50 degrees, providing evidence for the situation shown in Figure 2. The new objective function, equation 2.4, leads to separating hyperplanes that are invariant under linear transformations of the data. As a consequence, neither the bounds nor the performance of the derived classifier depends on how the training data were scaled. But is the new objective function also related to a capacity measure for the model class as the margin is? It is, and Hochreiter, and Obermayer (2004c) have shown that the capacity measure, equation 2.4, emerges when a bound for the generalization error is constructed using the technique of covering numbers. 2.2 Constraints for Complex Features. The next step is to formulate a set of constraints that enforce good performance on the training set and
Support Vector Machines for Dyadic Data
1479
50 45 40
ψ in degree
35 30 25 20 15 10 5 0.5
1
1.5
2
2.5
3
kernel width σ
3.5
4
Figure 3: Application of a C-SVM to the data set Breast Cancer of the UCI Benchmark Repository. The angle ψ (see equation 2.5) between weight vectors wsvm and wsph is plotted as a function of σ (C = 1000). For σ → 0, the data are already approximately sphered in feature space; hence, the objective functions from equations 2.2 and 2.4 lead to similar results.
implement the new idea that row and column objects are both mapped into a Hilbert space within which matrix entries correspond to scalar products and the classification or regression function is constructed. Assume that each matrix entry was measured by a device that determines the projection of an object feature vector x φ onto a direction zω in feature j space. The value K i j of such a complex feature zω for an object x iφ is then given by the dot product K i j = x iφ , zωj .
(2.6)
In analogy to the index φ for x φ , the index ω indicates that we will later assume that the vectors ziω are images in R N of a map ω that is induced by either a kernel or a measurement function. A mathematical foundation of the ansatz equation, 2.6, is given in the Appendix. Assume that we have P devices whose correspondingvectors zω arethe only measurable and accessible directions, and let Zω := z1ω , z2ω , . . . , zωP be the matrix of all complex features. Then we can summarize our (incomplete) knowledge about the objects in Xφ using the data matrix K = X φ Zω .
(2.7)
1480
S. Hochreiter and K. Obermayer
For DNA microarray data, for example, we could identify K with the matrix of expression values obtained by a microarray experiment. For web data, we could identify K with the matrix of ingoing or outgoing hyperlinks. For a document data set, we could identify K with the matrix of word frequencies. Hence we assume that x φ and zω live in a space of hidden causes that are responsible for the different attributes of the objects. The j complex features {zω } span a subspace of the original feature space, but we do not require them to be orthogonal, normalized, or linearly independent. j If we set zω = e j ( jth Cartesian unit vector), that is, Zω = I, K i j = xij and P = N, the “new” description, equation 2.7, is fully equivalent to the “old” description using the original feature vectors x φ . We now define the quality measure for the performance of the classifier or the regression function on the training set. We consider the quadratic loss function, c(yi , f (x iφ )) =
1 2 r , 2 i
ri = f (x iφ ) − yi = w, x iφ + b − yi .
(2.8)
The mean squared error is Remp [ f w, b ] =
L 1 c yi , f (x iφ ) . L
(2.9)
i=1
We now require that the selected classification or regression function minimizes Remp , that is, that ∇w Remp [ f w, b ] =
1 Xφ X φ w + b1 − y = 0 L
(2.10)
and ∂ Remp [ f ] 1
1 = ri = b + w, x iφ − yi = 0, ∂b L i L i
(2.11)
where the labels for all objects in the training set are summarized by a label vector y. Since the quadratic loss function is convex in w and b, only one minimum exists if X φ X φ has full rank. If X φ X φ is singular, then all points of minimal value correspond to a subspace of R N . From equation 2.11, we obtain b=−
L 1 1 w, x iφ − yi = − w X φ − y 1. L L i=1
(2.12)
Support Vector Machines for Dyadic Data
1481
Condition equation 2.10 implies that the directional derivative should be zero along any direction in feature space, including the directions of the complex feature vectors zω . We therefore obtain d Remp f w+tzωj ,b dt
= zωj ∇w Remp [ f w, b ] =
1 j z Xφ X φ w + b1 − y = 0, L ω
(2.13)
and, combining all complex features, 1 1 Zω Xφ X K r = σ = 0. φ w + b1 − y = L L
(2.14) j
Hence, we require that for every complex feature zω , the mixed moments σ j between the residual error ri and the measured values K i j should be zero. 2.3 The Potential Support Vector Machine 2.3.1 The Basic P-SVM. The new objective from equation 2.4 and the new constraints from equation 2.14 constitute a new procedure for selecting a classifier or a regression function. The number of constraints is equal to the number P of complex features, which can be larger or smaller than the number L of data points or the dimension N of the original feature space. Because the mean squared error of a linear function f w,b is a convex function of the parameters w and b, the constraints enforce that Remp is minimal.2 If K has at least rank L (number of training examples), then the minimum even corresponds to r = 0 (cf. equation 2.14). Consequently, overfitting occurs, and a regularization scheme is needed. Before a regularization scheme can be defined, however, the mixed moments σ j must be normalized. The reason is that high values of σ j may be a result of either a high variance of the values of K i j or a high correlation between the residual error ri and the values of K i j . Since we are interested in the latter, the most appropriate measure would be Pearson’s correlation coefficient, L
K i j − K¯ j (ri − r¯ ) σˆ j = , 2 L L 2 ¯ K − K − r ¯ (r ) i j j i i=1 i=1 i=1
2
(2.15)
∗ ∗ Note that w = X is the pseudo( y − b1) fulfills the constraints, where X φ φ
inverse of X φ.
1482
S. Hochreiter and K. Obermayer
1 L 1 L ¯ ri and K j = L i=1 K i j . If every column vector where r¯ = L i=1 K 1 j , K 2 j , . . . , K L j of the data matrix K is normalized to mean zero and variance one, we obtain
σj =
L 1 1
K i j ri = σˆ j √ r − r¯ 12 . L L
(2.16)
i=1
Because r¯ = 0 (see equation 2.11), the mixed moments are now proportional to the correlation coefficient σˆ j with a proportionality constant independent j of the complex feature zω , and σ j can be used instead of σˆ j to formulate the constraints. If the data vectors are normalized, the term K 1 vanishes, and we obtain the basic P-SVM optimization problem:
min w
s.t.
1 2 X w 2 φ K X φ w − y = 0.
(2.17)
The offset b of the classification or regression function is given by equation 2.12, which simplifies after normalization to (see Hochreiter & Obermayer, 2004a)
b=
L 1
yi . L
(2.18)
i=1
We will call this model selection procedure the potential support vector machine (P-SVM), and we will always assume normalized data vectors in the following. 2.3.2 The Kernel Trick. Following the standard support vector way to derive learning rules for perceptrons, we have so far considered linear classifiers only. Most practical classification problems, however, require nonlinear classification boundaries, which makes the construction of a proper feature space necessary. In analogy to the standard SVM, we now invoke the kernel trick. Let x i and z j be feature vectors, which describe the column the and row objects of the data set. We then choose a kernel function k x i , z j and compute the matrix K of relations between column and row objects: K i j = k x i , z j = φ x i , ω z j = x iφ , zωj ,
(2.19)
Support Vector Machines for Dyadic Data
1483
j where x iφ = φ x i and zω = ω z j . In the appendix, it is shown that any L 2 kernel corresponds (for almost all points) to a dot product in a Hilbert space in the sense of equation 2.19 and induces a mapping into a feature space within which a linear classifier can then be constructed. In the following sections, we will therefore distinguish between the actual measurements x i j and z j and the feature vectors x iφ , and zω “induced” by the kernel k. Potential choices for row objects and their vectorial description are z j = x j , P = L (standard construction of a Gram matrix), z j = e j , P = N (elementary features), z j is the jth cluster center of a clustering algorithm applied to the vectors x i (example for a “complex” feature), or z j is the jth vector of a principal component analysis (PCA) or independent component analysis (ICA) preprocessing (another example for a “complex” feature). If the entries K i j of the data matrix are directly measured, the application of the kernel trick needs additional considerations. In the appendix, we show that if the measurement process can be expressed through a kernel k xi , z j , which takes a column object xi and a row object z j and outputs a number, the matrix K of relations between the row and column objects can be interpreted as a dot product, K i j = x iφ , zωj ,
(2.20)
j in some features space, where x iφ = φ xi and zω = ω z j . Note that we distinguish between an object xi and its associated feature vectors x i or x iφ , leading to differences in the definition of k for the cases of vectorial and (measured) dyadic data. Equation 2.20 justifies the P-SVM approach, which was derived for the case of linear predictors and also for measured data. 2.3.3 The P-SVM for Classification. If the P-SVM is used for classification, we suggest a regularization scheme based on slack variables ξ + and ξ − . Slack variables allow small violations of individual constraints if the correct choice of w would lead to a large increase of the objective function otherwise. We obtain min + −
w,ξ ,ξ
s.t.
1 2 Xφ w + C1 ξ + + ξ − 2 + K X φw− y +ξ ≥ 0 − K X φw− y −ξ ≤ 0 0 ≤ ξ +, ξ −
(2.21)
for the primal problem. The above regularization scheme makes the optimization problem robust against outliers. A large value of a slack variable indicates that the
1484
S. Hochreiter and K. Obermayer
particular row object only weakly influences the direction of the classification boundary, because it would otherwise considerably increase the value of the complexity term. This happens in particular for high levels of measurement noise, which leads to large, spurious values of the mixed moments σ j . If the noise is large, the value of C must be small to remove the corresponding constraints via the slack variables ξ . If the strength of the measurement noise is known, the correct value of C can be determined a priori. Otherwise it must be determined using model selection techniques. To derive the dual optimization problem, we evaluate the Lagrangian L, + 1 L = w Xφ X ξ + ξ− φ w + C1 2 K Xφ w − y + ξ + − α+ K Xφ w − y − ξ − + α− − µ+ ξ + − µ− ξ − ,
(2.22)
where the vectors α + ≥ 0, α − ≥ 0, µ+ ≥ 0, and µ− ≥ 0 are the Lagrange multipliers for the constraints in equations 2.21. The optimality conditions require that ∇w L = X φ X φ w − X φ Kα = Xφ X φ w − X φ X φ Zω α = 0,
(2.23)
where α = α + − α − (αi = αi+ − αi− ). Equation 2.23 is fulfilled if w = Zω α.
(2.24)
In contrast to the standard SVM expansion of w into its support vectors x φ , the weight vector w is now expanded into a set of complex features zω , which we will call support features in the following. We then arrive at the dual optimization problem: min α
s.t.
1 α K Kα − y Kα 2 −C1 ≤ α ≤ C1,
(2.25)
which now depends on the data via the kernel or data matrix K only. The dual problem can be solved by a sequential minimal optimization (SMO) technique, which is described in Hochreiter and Obermayer (2004a). The SMO technique is essential if many complex features are used, because the
Support Vector Machines for Dyadic Data
1485
P × P matrix K K enters the dual formulation. Finally, the classification function f has to be constructed using the optimal values of the Lagrange parameters α:
f (x φ ) =
P
α j K (x) j + b,
j=1
where the expansion, equation 2.24 has been used for the weight vector w and b is given by equation 2.18. The classifier based on equation 2.26 depends on the coefficients α j , which were determined during optimization; on b, which can be computed directly; and on the measured values K (x) j for the new object x. The coefficients α j = α +j − α −j can be interpreted as class indicators because they separate the complex features into features that are relevant for class 1 and class −1, according to the sign of α j . Note that if we consider the Lagrange parameters α j as parameters of the classifier, we find that d Remp f w+tzωj ,b dt
= σj =
∂ Remp [ f ] . ∂α j
(2.26)
The directional derivatives of the empirical error Remp along the complex features in the primal formulation correspond to its partial derivatives with respect to the corresponding Lagrange parameter in the dual formulation. One of the most crucial properties of the P-SVM procedure is that the dual optimization problem depends on only K via K K. Therefore, K is not required to be positive semidefinite or square. This allows not only the construction of SVM-based classifiers for matrices K of general shape but also to extend SVM-based approaches to the new class of indefinite kernels operating on the objects’ feature vectors. In the following, we illustrate the application of the P-SVM approach to classification using a toy example. The data set consists of 70 objects x, 28 from class 1 and 42 from class 2, which are described by 2D feature vectors x (open and solid circles in Figure 4). Two pairwise data sets were then generated by applying RBF kernel and the (indefinite) a standard sine-kernel k x i , x j = sin θ x i − x j which leads to an indefinite Gram matrix. Figure 4 shows the classification result. The sine kernel is more appropriate than the RBF kernel for this data set because it is better adjusted to the oscillatory regions of class membership, leading a smaller number of support vectors and a smaller test error. In general, the value of θ has to be selected using standard model selection techniques. A large value of θ leads to a more complex set of classifiers, reduces the classification error on the training set, but increases the error on the test set. Non-Mercer kernels
1486
S. Hochreiter and K. Obermayer
kernel: RBF, SV: 50, σ: 0.1, C: 100 1.2
kernel: sine, SV: 7, θ: 25, C: 100 1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
0 0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
Figure 4: Application of the P-SVM method to a toy classification problem. Objects are described by 2D feature vectors x. Seventy feature vectors were generated: 28 for class 1 (open circles) and 42 for class 2 (solid circles). A Gram matrix was constructed using the kernels RBF k x i , x j = exp − σ12 x i − x j 2 (left) and sine k x i , x j = sin θ x i − x j (right). White and gray indicate regions of class 1 and class 2 membership, respectively. Circled data indicate support vectors (SV). For the parameters, see the figure.
extend the range of kernels that are used and therefore open up a new direction of research for kernel design. 2.3.4 The P-SVM for Regression. The new objective function of equation 2.4 was motivated for a classification problem, but it can also be used to find an optimal regression function in a regression task. In regression, 2 the term X ˆ 2 w2 , w ˆ := w/w, is the product of a term that φ w = X φ w expresses the deviation of the data from the zero isosurface of the regression function and a term that corresponds to the smoothness of the regressor. The smoothness of the regression function is determined by the norm of the weight vector w. If f (x iφ ) = w, x iφ + b and the length of the vectors x φ is bounded by B, then the deviation of f from offset b is bounded by f (x i ) − b = w, x i ≤ w x i ≤ wB, φ φ φ
(2.27)
where the first ≤ follows from the Cauchy-Schwarz inequality; hence, a smaller value of w leads to a smoother regression function. This trade-off between smoothness and residual error is also reflected by equation A.24 in the appendix, which shows that equation 2.4 is the L 2 -norm of the function f . The constraints of vanishing mixed moments carry over to regression problems, with the only modification that the target values yi in equations
Support Vector Machines for Dyadic Data
1487
10 10
9 9
8 8
7
7
6
6
5
5
4
4
SV: 7 σ: 2, C: 20
3
2
0 −10
2
x = −2
1
−8
−6
−4
−2
0
SV: 5 σ: 4, C: 20
3
x = −2
1
2
4
6
8
0 −10
10
−8
−6
−4
−2
0
2
4
6
8
10
10
9
8
7
6
5
4
SV: 14 σ: 2, C: 0.6
3
2
x = −2
1
0 −10
−8
−6
−4
−2
0
2
4
6
8
10
Figure 5: Application of the P-SVM method to a toy regression problem. Objects (small dots), described by the x-coordinate, were generated by randomly choosing a value from [−10, 10] and calculating the y-value from the true function (dashed line) plus gaussian noise N (0, 0.2). One outlier was added at x = 0 (thin arrows). A Gram matrix was generated using an RBF kernel k x i , x j = exp − σ12 x i − x j 2 . The solid lines show the regression result. Circled dots indicate support vectors (SV). For parameters, see the figure. Bold arrows mark x = −2.
2.21 are real rather than binary numbers. The constraints are even more “natural” for regression because the ri are indeed the residuals that a regression function should minimize. We therefore propose to use the primal optimization problem, eqs. 2.21, and its corresponding dual, equations 2.25, also for the regression setting. Figure 5 shows the application of the P-SVM to a toy regression example (pairwise data). Fifty data points are randomly chosen from the true function (dashed line), and independent and identically distributed gaussian noise with mean 0 and standard deviation 0.2 is added to each y-component. One outlier was added at x = 0. The figure shows the PSVM regression results (solid lines) for an RBF kernel and three different combinations of C and σ . The hyperparameter C controls the residual error. A smaller value of C increases the residual error at x = 0 but also the
1488
S. Hochreiter and K. Obermayer
number of support vectors. The width σ of the RBF kernel controls the overall smoothness of the regressor: A larger value of σ increases the error at x = 0 without increasing the number of support vectors (arrows in bold in Figure 5). 2.3.5 The P-SVM for Feature Selection. In this section we modify the P-SVM method for feature selection such that it can serve as a data preprocessing method in order to improve the generalization performance of subsequent classification or regression tasks (see also Hochreiter & Obermayer, 2004b). Due to the property of the P-SVM method to expand w into a sparse set of support features, it can be modified to optimally extract a small number of informative features from the set of row objects. The set of support features can then be used as input to an arbitrary predictor, for example, another P-SVM, a standard SVM, or a K-nearest-neighbor classifier. Noisy measurements can lead to spurious mixed moments; that is, complex features may contain no information about the objects’ attributes but still exhibit finite values of σ j . In order to prevent those features from affecting the classification boundary or the regression function, we introduce a correlation threshold ε and modify the constraints in equations 2.17 according to K X φw− K X φw−
y − 1 ≤ 0, y + 1 ≥ 0.
(2.28)
¨ This regularization scheme is analogous to the ε-insensitive loss (Scholkopf & Smola, 2002). Absolute values of mixed moments smaller than ε are considered to be spurious, and the corresponding features do not influence the weight vector because the constraints remain fulfilled. The value of ε directly correlates with the strength of the measurement noise and can be determined a priori if it is known. If this is not the case, ε serves as a hyperparameter, and its value can be determined using model selection techniques. Note that data vectors have to be normalized (see section 2.3.1) before applying the P-SVM, because otherwise, a global value of ε would not suffice. Combining equation 2.4 and equations 2.28, we then obtain the primal optimization problem,
min w
s.t.
1 2 X w 2 φ K X φw− K X φw−
y + 1≥0 y − 1≤0
(2.29)
Support Vector Machines for Dyadic Data
1489
for P-SVM feature selection. In order to derive the dual formulation, we have to evaluate the Lagrangian, + 1 L = w Xφ X K Xφ w − y + 1 φw− α 2 K Xφ w − y − 1 , + α−
(2.30)
where we have used the notation from section 2.3.3. The vector w is again expressed through the complex features, w = Zω α,
(2.31)
and we obtain the dual formulation of equation 2.29:
min
α + ,α −
s.t.
1 + α − α− K K α+ − α− 2 − y K α + − α − + 1 α + + α − 0 ≤ α+ ,
0 ≤ α− .
(2.32)
The term 1 (α + + α − ) in this dual objective function enforces a sparse expansion of the weight vector w in terms of the support features. This occurs because for large enough values of ε, this term forces all α j toward zero except for the complex features that are most relevant for classification or regression. If K K is singular and w is not uniquely determined, ε enforces a unique solution, which is characterized by the sparsest representation through complex features. The dual problem is again solved by a fast sequential minimal optimization (SMO) technique (see Hochreiter & Obermayer, 2004a). Finally, let us address the relationship between the value of a Lagrange j multiplier α j and the importance of the corresponding complex feature zω for prediction. The change of the empirical error under a change of the j weight vectors by an amount β along the direction of a complex feature zω is given by Remp f w+β zωj ,b − Remp [ f w,b ] = βσ j +
β2 2 β2 |β| β 2 K i j = βσ j + ≤ + , 2L i 2 L 2
(2.33)
1490
S. Hochreiter and K. Obermayer
because the constraints, equation 2.28, ensure that |σ j |L ≤ . If a complex j feature zω is completely removed, then β = −α j and Remp f w−α j zωj ,b − Remp [ f w,b ] ≤
|α j | L
+
α 2j . 2
(2.34)
The Lagrange parameter α j is directly related to the increase in the empirical error when a feature is removed. Therefore, α serves as importance measures for the complex features and allows ranking features according to the absolute value of its components. In the following, we illustrate the application of the P-SVM approach to feature selection using a classification toy example. The data set consists of 50 column objects x, 25 from each class, which are described by 2D-feature vectors x (the open and solid circles in Figure 6). Fifty row objects z were randomly selected by choosing their 2D feature vectors z according to a uniform distribution on [−1.2, 1.2] × [−1.2, 1.2]. K was generated using an RBF kernel k(x i , z j ) = exp(− 2σ1 2 x i − z j 2 ) with σ = 0.2. Figure 6 shows the result of the P-SVM feature selection method with a correlation threshold = 20. Only six features were selected (the crosses in Figure 6), but every group of data points (column objects) is described (and detected) by one or two feature vectors (row objects). The number of selected features depends on ε and σ , which determines how the strength of a complex feature decreases with the distances x i − z j . Smaller ε or larger σ would result in a larger number of support features. 2.4 The Dot Product Interpretation of Dyadic Data. The derivation of the P-SVM method is based on the assumption that the matrix K is a dot product matrix whose elements denote a scalar product between the feature vectors that describe the row and the column objects. If K, however, is a matrix of measured values, the question arises, under which conditions can such a matrix be interpreted as a dot product matrix? In the appendix, we show that the question reduces to these conditions: 1. Can the set X of column objects x be completed to a measure space? 2. Can the set Z of row objects z be completed to a measure space? 3. Can the measurement process be expressed using the evaluation of a measurable kernel k (x, z), which is from L 2 (X , Z)? Conditions 1 and 2 can be easily fulfilled by defining a suitable σ -algebra on the sets. Condition 3 holds if k is bounded and the sets X and Z are compact; it equates the evaluation of a kernel as known from standard SVMs with physical measurements, and physical characteristics of the measurement device determine the properties of the kernel, such as boundedness and continuity. But can a measurement process indeed be expressed through a kernel? There is no full answer to this question from a theoretical viewpoint;
Support Vector Machines for Dyadic Data
1491
1
0.5
SV: 6, σ: 0.1, : 20
0
−0.5
−1 −1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Figure 6: P-SVM method applied to a toy feature selection problem. Column and row objects are described by 2D feature vectors x and z. Fifty row feature vectors, 25 from each class (open/solid circles), were chosen from N ((±1, ±1), 0.1I), where centers are chosen with equal probability. One hundred column feature vectors were uniformly from [−1.2, 1.2]2 . K is con chosen 1 i j 2 . Black crosses indicate P-SVM x − z structed by an RBF kernel exp − 20.2 2 support features.
practical applications have to confirm (or disprove) the chosen ansatz and data model. 3 Numerical Experiments and Applications In this section, we apply the P-SVM method to various kinds of real-world data sets and provide benchmark results with previously proposed methods when appropriate. This section consists of three parts that cover classification, regression, and feature selection. The P-SVM is first tested as a classifier on data sets from the UCI Benchmark Repository, and its performance is compared with results obtained with C- and the ν-SVMs for different kernels. Then we apply the P-SVM to a measured (rather than constructed)
1492
S. Hochreiter and K. Obermayer
pairwise (“protein”) and a genuine dyadic data set (World Wide Web). Second, we apply the P-SVM to regression problems taken from the UCI Benchmark Repository and compare with results obtained with C-support vector regression and Bayesian SVMs. Finally, we describe results obtained for the P-SVM as a feature selection method for microarray data and for the Protein and World Wide Web data sets. 3.1 Application to Classification Problems 3.1.1 UCI Data Sets. In this section we report benchmark results for the data sets Thyroid (5 features), Heart (13 features), Breast Cancer (9 features), and German (20 features) from the UCI Benchmark Repository and for the data set Banana (2 features) taken from R¨atsch, Onoda, ¨ and Muller (2001). All data sets were preprocessed as described in R¨atsch et al., and divided into 100 training and test set pairs. Data sets were generated through resampling, where data points were randomly selected for the training set and the remaining data were used for the test set. We downloaded the original 100 training and test set pairs from http://ida.first.fraunhofer.de/projects/bench/. For the data sets German and Banana, we restricted the training set to the first 200 examples of the original training set in order to keep hyperparameter selection feasible but used the original test sets. Pairwise data sets were generated by constructing the Gram matrix for radial basis function (RBF), polynomial (POL), and Plummer (PLU; see Hochreiter, Mozer, & Obermayer, 2003) kernels, and the Gram matrices were used as input for kernel Fisher discriminant analysis (KFD, Mika, ¨ ¨ R¨atsch, Weston, Scholkopf, & Muller, 1999), C-, ν-, and P-SVM. Because KFD only selects a direction in input space onto which all data points are projected, we must select a classifier for the resulting one-dimensional ¨ classification task. We follow Scholkopf and Smola (2002) and classify a data point according to its posterior probability under the assumption of a gaussian distribution for each label. Hyperparameters (C, ν, and kernel parameters) were optimized using five–fold cross validation on the corresponding training sets. To ensure a fair comparison, the hyperparameter selection procedure was equal for all methods, except that the ν values of the ν-SVM were selected from {0.1, . . . , 0.9}, in contrast to the selection of C, for which a logarithmic scale was used. To test the significance of the differences in generalization performance (percentage of misclassifications), we calculated for what percentage of the individual runs (for a total of 100; see below) our method was significantly better or worse than others. In order to do so, we first performed a test for the difference of two proportions for each training and test set pair (Dietterich, 1998). The difference of two proportions is the difference of the misclassification rates of two models on the test set, where the models are selected on the training set by the two methods that are to be compared. After this test we adjusted the false
Support Vector Machines for Dyadic Data
1493
discovery rate through the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995), which was recently shown to be correct for dependent outcomes of the tests (Benjamini & Yekutieli, 2001). The fact that the tests can be dependent is important because training and test sets overlap for the different training and test set pairs. The false detection rate was controlled at 5%. We counted for each pair of methods the selected models from the first method that perform significantly (5% level) better than the selected models from the second method. Table 1 summarizes the percentage of misclassification averaged over 100 experiments. Despite the fact that C– and ν–SVMs are equivalent, results differ because of the somewhat different model selection results for the hyperparameters C and ν. Best and second-best results are indicated by bold and italic numbers. The significance tests did not reveal significant differences over the 100 runs in generalization performance for the majority of the cases (for details, see http://ni.cs.tu-berlin.de/publications/psvm sup). The P-SVM with RBF and PLU kernels, however, never performed significantly worse for any of the 100 runs of each data sets than their competitors. Note that the significance tests do not tell whether the averaged misclassification rates of one method are significantly better or worse than the rates of another method. They provide information on how many runs out of 100 one method was significantly better than another method. The UCI Benchmark result shows that the P-SVM is competitive to other state-of-the-art methods for a standard problem setting (measurement matrix equivalent to the Gram matrix). Although the P-SVM method never performed significantly worse, it generally required fewer support vectors than other SVM approaches. This was also true for the cases where the P-SVM’s performance was significantly better. 3.1.2 Protein Data Set. The Protein data set (cf. Hofmann & Buhmann, 1997) was provided by M. Vingron and consists of 226 proteins from the globin families. Pairs of proteins are characterized by their evolutionary distance, which is defined as the probability of transforming one amino acid sequence into the other via point mutations. Class labels are provided that denote membership in one of the four families: hemoglobin-α (H-α), hemoglobin-β (H-β), myoglobin (M), and heterogeneous globins (GH). Table 2 summarizes the classification results, which were obtained with the generalized SVM (G-SVM; Graepel et al., 1999; Mangasarian, 1998) and the P-SVM method. Since the G-SVM interprets the columns of the data matrix as feature vectors for the column objects and applies a standard ν-SVM ¨ to these vectors (this is also called feature map; Scholkopf & Smola, 2002), we call the G-SVM simply ν-SVM in the following. The table shows the percentage of misclassification for the four two-class classification problems, one class against the rest. The P-SVM yields better classification results, although a conservative test for significance was not possible due to the small
1494
S. Hochreiter and K. Obermayer
Table 1: Average Percentage of Misclassification for the UCI and the Banana Data Sets.
Thyroid C-SVM ν-SVM KFD P-SVM Heart C-SVM ν-SVM KFD P-SVM Breast Cancer C-SVM ν-SVM KFD P-SVM Banana C-SVM ν-SVM KFD P-SVM German C-SVM ν-SVM KFD P-SVM
RBF
POL
PLU
5.11 (2.34) 5.15 (2.23) 4.96 (2.24) 4.71 (2.06)
4.51 (2.02) 4.04 (2.12) 6.52 (3.18) 9.44 (3.12)
4.96 (2.35) 4.83 (2.03) 5.00 (2.26) 5.08 (2.18)
16.67 (3.51) 16.87 (3.87) 17.82 (3.45) 16.18 (3.66)
18.26 (3.50) 17.44 (3.90) 22.53 (3.37) 16.67 (3.40)
16.33 (3.47) 18.47 (7.81) 17.80 (3.86) 16.54 (3.64)
26.94 (5.18) 27.69 (5.62) 27.53 (4.19) 26.78 (4.58)
26.87 (4.79) 26.69 (4.73) 31.30 (6.11) 26.40 (4.54)
26.48 (4.87) 30.16 (7.83) 27.19 (4.72) 26.66 (4.59)
11.88 (1.11) 11.67 (0.90) 12.32 (1.12) 11.59 (0.96)
12.09 (0.96) 12.72 (2.16) 14.04 (3.89) 14.93 (2.09)
11.81 (1.07) 11.74 (0.98) 12.14 (0.96) 11.52 (0.93)
26.51 (2.60) 27.14 (2.84) 26.58 (2.95) 26.45 (3.20)
27.27 (3.23) 27.13 (3.06) 30.96 (3.23) 25.87 (2.45)
26.88 (3.12) 28.60 (6.27) 26.90 (3.15) 26.65 (2.95)
Notes: The table compares results obtained with the kernel Fisher discriminant analysis (KFD), C-, ν-, and P-SVM for the radial basis function (RBF), exp(− 2σ1 2 x i − x j 2 ), polynomial (POL), (x i , x j + η)δ , and Plummer (PLU), (x i − x j + ρ)−ζ kernels. Results were averaged over 100 experiments with separate training and test sets. For each data set, numbers in bold and italics highlight, respectively, the best and the second-best results and the numbers in parentheses denote standard deviations of the trials. C, ν, and kernel parameters were determined using five-fold cross validation on the training set and usually differed among individual experiments.
number of data. However, the P-SVM selected 180 proteins as support vectors on average, compared to 203 proteins used by the ν-SVM (note that for 10-fold cross validation, 203 is the average training size). Here a smaller number of support vectors is highly desirable because it reduces the computational costs of sequence alignments, which are necessary for the classification of new examples.
Support Vector Machines for Dyadic Data
1495
Table 2: Percentage of Misclassification for the Protein Data Set for Classifiers Obtained with the P- and ν-SVM Methods.
Size ν-SVM ν-SVM ν-SVM P-SVM P-SVM P-SVM
Reg.
H-α
H-β
M
GH
— 0.05 0.1 0.2 300 400 500
72 1.3 1.8 2.2 0.4 0.4 0.4
72 4.0 4.5 8.9 3.5 3.1 3.5
39 0.5 0.5 0.5 0.0 0.0 0.0
30 0.5 0.9 0.9 0.4 0.9 1.3
Notes: Column Reg. lists the values of the regularization parameter (ν for ν-SVM and C for P-SVM). Columns H-α to GH provide the percentage of misclassification for the four problems “one class against the rest” (“size” denotes the number of proteins per class), computed using 10–fold cross validation. The best results for each problem are shown in bold. The data matrix was not positive semidefinite.
3.1.3 World Wide Web Data Set. The World Wide Web data sets consist of 8282 WWW pages collected during the Web−→Kb Project at Carnegie Mellon University in January 1997 from the web sites of the computer science departments of Cornell and the universities of Texas, Washington, and Wisconsin. The pages were manually classified into the categories student, faculty, staff, department, course, project, and other. Every pair (i, j) of pages is characterized by whether page i contains a hyperlink to page j and vice versa. The data are summarized using two binary matrices and a ternary matrix. The first matrix K (“out”) contains a one for at least one outgoing link (i → j) and a zero if no outgoing link exists; the second matrix K (“in”) contains a one for at least one ingoing link ( j → i) and a zero otherwise; and the third, ternary matrix 12 K + K (“sym”) contains a zero if no link exists, a value of 0.5 if only one unidirectional link exists, and a value of 1 if links exist in both directions. In the following, we restricted the data set to pages from the first six classes, which had more than one in- or outgoing link. The data set thus consists of the four subsets Cornell (350 pages), Texas (286 pages), Wisconsin (300 pages), and Washington (433 pages). Table 3 summarizes the classification results for the C- and P-SVM methods. The parameter C for both SVMs was optimized for each crossvalidation trial using another four–fold cross validation on the training set. Significance tests were performed to evaluate the differences in generalization performance using the 10-fold cross-validated paired t-test (Dietterich, 1998). We considered 48 tasks (4 universities, 4 classes, 3 matrices) and checked for each task whether the C-SVM or the P-SVM performed better using a p-value of 0.05. In 30 tasks, the P-SVM had a significantly better performance than the C-SVM, while the C-SVM was
1496
S. Hochreiter and K. Obermayer
Table 3: Percentage of Misclassification for the World Wide Web Data Sets for Classifiers Obtained with the P- and C-SVM Methods. Course Cornell University Size 57 C-SVM (Sym) 11.1 (6.2) C-SVM (Out) 12.6 (3.1) C-SVM (In) 11.1 (4.9) P-SVM (Sym) 12.3 (3.3) P-SVM (Out) 8.6 (3.8) P-SVM (In) 7.1 (4.1) University of Texas Size 52 C-SVM (Sym) 17.2 (9.0) C-SVM (Out) 9.5 (5.1) C-SVM (In) 12.6 (4.9) P-SVM (Sym) 15.8 (5.8) P-SVM (Out) 8.1 (7.6) P-SVM (In) 12.3 (5.6) University of Wisconsin Size 77 C-SVM (Sym) 27.0 (10.0) C-SVM (Out) 19.3 (7.5) C-SVM (In) 22.0 (8.6) P-SVM (Sym) 18.7 (4.5) P-SVM (Out) 12.0 (5.5) P-SVM (In) 13.3 (4.4) University of Washington Size 169 C-SVM (Sym) 19.6 (6.8) C-SVM (Out) 10.6 (4.6) C-SVM (In) 20.3 (6.4) P-SVM (Sym) 17.1 (4.4) P-SVM (Out) 10.6 (5.2) P-SVM (In) 11.8 (5.6)
Faculty
Project
Student
60 19.7 (5.3) 15.1 (6.0) 21.4 (4.3) 17.1 (6.2) 14.3 (6.3) 13.7 (6.6)
52 13.7 (4.0) 10.6 (4.9) 14.6 (5.5) 15.4 (6.3) 8.3 (4.9) 10.9 (5.5)
143 50.0 (11.5) 22.3 (10.3) 48.9 (15.5) 19.1 (5.7) 16.9 (7.9) 17.1 (5.7)
35 22.0 (9.1) 16.5 (5.8) 20.6 (7.8) 13.6 (7.3) 9.8 (3.6) 10.5 (6.3)
29 19.8 (6.7) 20.2 (8.7) 20.9 (5.1) 12.2 (6.2) 9.8 (3.9) 9.4 (4.6)
129 53.5 (11.8) 28.9 (11.7) 16.4 (7.6) 25.5 (6.9) 20.9 (6.7) 13.0 (5.0)
36 22.0 (5.5) 16.0 (3.8) 16.3 (5.8) 15.0 (9.3) 11.3 (6.5) 8.7 (8.2)
22 14.0 (6.4) 10.3 (4.8) 7.7 (4.5) 10.0 (5.4) 7.7 (4.2) 6.3 (7.1)
117 49.3 (11.1) 34.3 (10.5) 24.3 (9.9) 34.3 (8.6) 23.7 (4.8) 13.3 (5.9)
44 18.7 (6.8) 14.1 (3.0) 20.4 (5.3) 13.4 (6.6) 12.7 (2.9) 9.2 (6.2)
39 10.6 (3.5) 14.3 (4.8) 13.8 (4.7) 8.8 (2.1) 6.7 (3.4) 6.7 (2.0)
151 43.6 (8.3) 28.2 (9.8) 38.3 (11.9) 20.3 (6.8) 17.1 (4.3) 14.3 (6.9)
Notes: The percentage of misclassification was measured using 10–fold cross validation. The best and second-best results for each data set and classification task are indicated, respectively, in bold and italics; numbers in parentheses denote standard deviations of the trials.
never significantly better than the P-SVM (for details, see http://ni.cs.tuberlin.de/publications/psvm sup). Classification results are better for the asymmetric matrices “in” and “out” than for the symmetric matrix “sym,” because there are cases for which highly indicative pages (hubs) exist that are connected to one particular class of pages by either in- or outgoing links. The symmetric case blurs the contribution of the indicative pages because ingoing and outgoing links can no longer be distinguished, which leads
Support Vector Machines for Dyadic Data
1497
Table 4: Regression Results for the UCI Data Sets.
SVR BSVR P-SVM
Robot Arm (10−3 )
Boston Housing
Computer Activity
Abalone
5.84 5.89 5.88
10.27 (7.21) 12.34 (9.20) 9.42 (4.96)
13.80 (0.93) 17.59 (0.98) 10.28 (0.47)
0.441 (0.021) 0.438 (0.023) 0.424 (0.017)
Note: The table shows the mean squared error and its standard deviation in parentheses. Best results for each data set are shown in bold. For the Robot Arm data, only one data set was available, and therefore no standard deviation is given.
to poorer performance. Because the P-SVM yields fewer support vectors, online classification is faster than for the C-SVM, and if web pages cease to exist, the P-SVM is less likely to be affected. 3.2 Application to Regression Problems. In this section we report results for the data sets Robot Arm (2 features), Boston Housing (13 features), Computer Activity (21 features), and Abalone (10 features) from the UCI Benchmark Repository. The data preprocessing is described in Chu, Keerthi, and Ong (2004), and the data sets are available as training and test set pairs at http://www.gatsby.ucl.ac.uk/∼chuwei/data. The sizes of the data sets were (training set/test set): Robot Arm: 200/200, 1 set; Boston Housing: 481/25, 100 sets; Computer Activity: 1000/6192, 10 sets; Abalone: 3000/1177, 10 sets. Pairwise data sets were generated by constructing the Gram matrices for RBF kernels of different widths σ , and the Gram matrices were used as input for the three regression methods: C-support vector regression (SVR; ¨ Scholkopf & Smola, 2002), Bayesian support vector regression (BSVR; Chu et al., 2004), and the P-SVM. Hyperparameters (C and σ ) were optimized using n-fold cross-validation (n = 50 for Robot Arm, n = 20 for Boston Housing, n = 4 for Computer Activity, and n = 4 for Abalone). Parameters were first optimized on a coarse 4 × 4 grid and later refined on a 7 × 7 fine grid around the values for C and σ selected in the first step (65 tests per parameter selection). Table 4 shows the mean squared error and the standard deviation of the results. We also performed a Wilcoxon signed rank test to verify the significance for these results (for details, see http://ni.cs.tu-berlin. de/publications/psvm sup), except for the Robot Arm data set, which has only one training and test set pair, and the Boston Housing data set, which contains too few test examples. On Computer Activity, SVR was significantly better (5% threshold) than BSVR, and on both data sets, Computer Activity and Abalone, SVR and BSVR were significantly outperformed by the P-SVM, which in addition used fewer support vectors than its competitors.
1498
S. Hochreiter and K. Obermayer
Table 5: Percentage of Misclassification and Number of Support Features for the Protein Data Set for the P-SVM Method.

ε      H-α         H-β         M           GH
0.2    1.3 (203)   4.9 (203)   0.9 (203)   1.3 (203)
1      2.6 (41)    5.3 (110)   1.3 (28)    4.4 (41)
10     3.5 (10)    8.8 (26)    1.8 (5)     13.3 (7)
20     3.5 (5)     8.4 (12)    4.0 (4)     13.3 (5)
Notes: The total number of features is 226. C was 100. The five columns (left to right) show the values of ε and the results for the four classification problems "one class against the rest," using 10-fold cross validation. The numbers of support features are given in parentheses.
3.3 Application to Feature Selection Problems. In this section we apply the P-SVM to feature selection problems of various kinds, using the correlation threshold regularization scheme (see section 2.3.5). We first reanalyze the Protein and World Wide Web data sets of sections 3.1.3 and 3.1.2 and then report results on three DNA microarray data sets. Further feature selection results can be found in Hochreiter and Obermayer (2005), where results for the Neural Information Processing Systems 2003 feature selection challenge are reported and where the P-SVM was the best stand-alone method for selecting a compact feature set, and in Hochreiter and Obermayer (2004b), where details of the microarray benchmarks are reported.

3.3.1 Protein and World Wide Web Data Sets. In this section we again apply the P-SVM to the Protein and World Wide Web data sets of sections 3.1.2 and 3.1.3. Using both regularization schemes simultaneously leads to a trade-off between a small number of features (a small number of measurements) and a better classification result. Reducing the number of features is beneficial if measurements are costly and if a small increase in prediction error can be tolerated. Table 5 shows the results for the Protein data sets for various values of the regularization parameter ε. C was set to 100 because it gave good results for a wide range of ε values. We chose a minimal ε = 0.2 because it resulted in a classifier where all complex features were support vectors. The size of the training set is 203. Note that C was smaller than in the experiments in section 3.1.2 because large values of ε push the dual variables α toward zero, so that large values of C have no influence. The table shows that classification performance drops if fewer features are considered, but that 5% of the features suffice to obtain a performance that leads only to about 5% misclassifications, compared to about 2% at the optimum. Since every
Table 6: P-SVM Feature Selection and Classification Results (10-Fold Cross Validation) for the World Wide Web Data Set Cornell and the Classification Problem Student Pages Against the Rest.

ε      % cl.   % mis.   # (%) SVs
0.1    84      14       135 (38.6)
0.2    81      12       115 (32.8)
0.3    79      9.7      99 (28.3)
0.4    75      6.9      72 (20.6)
0.5    73      5.5      58 (16.6)
0.6    71      4.8      48 (13.7)
0.7    66      3.9      38 (10.9)
0.8    65      3.1      34 (9.7)
0.9    64      2.7      32 (9.1)
1.0    61      1.4      27 (7.7)
1.1    59      1.0      21 (6.0)
1.4    56      1.0      12 (3.4)
1.6    55      1.0      10 (2.8)
2.0    51      0.6      8 (2.3)

Notes: The columns show (left to right) the value of ε, the percentage of classified pages (cl.), the percentage of misclassifications (mis.), and the number (and percentage) of support vectors. C was obtained by a three-fold cross validation on the corresponding training sets.
feature value has to be determined by a sequence alignment, this saving in computation time is essential for large databases like the Swiss-Prot database (130,000 proteins), for which supplying all pairwise relations is currently impossible. Table 6 shows the corresponding results (10-fold cross validation) for the P-SVM applied to the World Wide Web data set Cornell and for the classification problem "student pages vs. the rest." Only ingoing links (matrix K of section 3.1.3) were used. C was optimized using 3-fold cross validation on the corresponding training sets for each of the 10-fold cross-validation runs. By increasing the regularization parameter ε, the number of web pages that have to be considered in order to classify a new page (the number of support vectors) decreases from 135 to 8. At the same time, the percentage of pages that can no longer be classified because they receive no ingoing link from one of the support vector pages increases. The percentage of misclassification, however, is reduced from 14% for ε = 0.1 to 0.6% for ε = 2.0. With only 8 pages providing ingoing links, more than 50% of the pages could be classified, with only a 0.6% misclassification rate.

3.3.2 DNA Microarray Data Sets. In this section we describe the application of the P-SVM to DNA microarray data. The data were taken from Pomeroy et al. (2002) (Brain Tumor), Shipp et al. (2002) (Lymphoma Tumor), and van't Veer et al. (2002) (Breast Cancer). All three data sets consist of tissue samples from different patients that were characterized by the expression values of a large number of genes. All samples were labeled according to the outcome of a particular treatment (positive or negative), and the task was to predict the outcome of the treatment for a new patient.
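The sweep behind Tables 5 and 6 can be sketched as follows; `psvm_fit` and `cv_error` are hypothetical stand-ins for a P-SVM solver returning dual weights α and for a cross-validation routine, since no reference implementation is assumed here.

```python
import numpy as np

def feature_tradeoff(K, y, epsilons, psvm_fit, cv_error, C=100.0):
    """For each correlation threshold epsilon, record how many support
    features survive and the corresponding cross-validation error."""
    rows = []
    for eps in epsilons:
        alpha = psvm_fit(K, y, C=C, eps=eps)   # dual weights, one per feature
        support = np.flatnonzero(np.abs(alpha) > 0.0)
        rows.append((eps, len(support), cv_error(K[:, support], y)))
    return rows   # e.g., eps = 0.2 ... 20 reproduces the layout of Table 5
```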
We compared the classification performance of the following combinations of a feature selection and a classification method:

1. Expression value of the TrkC gene / one-gene classification
2. SPLASH (Califano, Stolovitzky, & Tu, 1999) / likelihood ratio classifier (LRC)
3. Signal-to-noise statistics (STN) / K-nearest neighbor (KNN)
4. Signal-to-noise statistics (STN) / weighted voting
5. Fisher statistics / weighted voting
6. R2W2 / R2W2
7. P-SVM / ν-SVM
The P-SVM/ν-SVM results are taken from Hochreiter and Obermayer (2004b), where the details concerning the data sets and the gene selection procedure based on the P-SVM can be found; the results for the other combinations were taken from the references cited above. All results are summarized in Table 7. The comparison shows that the P-SVM method outperforms the standard methods.

4 Summary

In this contribution we have described the potential support vector machine (P-SVM) as a new method for classification, regression, and feature selection. The P-SVM selects models using the principle of structural risk minimization. In contrast to standard SVM approaches, it is based on a new objective function and a new set of constraints that lead to an expansion of the classification or regression function in terms of support features. The optimization problem is quadratic, always well defined, suited for dyadic data, and requires neither square nor positive definite Gram matrices. Therefore, the method can also be used without preprocessing with matrices that are measured and with matrices that are constructed from a vectorial representation using an indefinite kernel function. In feature selection mode, the P-SVM allows selecting and ranking features through the support vector weights of its sparse set of support vectors. The sparseness constraint avoids selecting sets of redundant features. In a classification or regression setting, this is an advantage over statistical methods, where redundant features are often kept as long as they provide information about the objects' attributes. Because the dual formulation of the optimization problem can be solved by a fast SMO technique, the new P-SVM can be applied to data sets with many features. The P-SVM approach was compared with several state-of-the-art classification, regression, and feature selection methods. Whenever significance tests could be applied, the P-SVM never performed significantly worse than its competitors, and in many cases, it performed significantly better. But even when no significant improvement in prediction error could be found, the P-SVM needed fewer support features, that is, fewer measurements, for evaluating new data objects.
Table 7: Results for the Three DNA Microarray Data Sets Brain Tumor, Lymphoma, and Breast Cancer.

Brain Tumor

Feature Selection/Classification    #F    E
TrkC gene                           1     33
SPLASH/LRC                          –     25
R2W2                                –     25
STN/voting                          –     23
STN/KNN                             8     22
TrkC & SVM & KNN                    –     20
P-SVM/ν-SVM                         45    7

Lymphoma

Feature Selection/Classification    #F    E
STN/KNN                             8     28
STN/voting                          13    24
R2W2                                –     22
P-SVM/ν-SVM                         18    21

Breast Cancer

Feature Selection/Classification    #F    E     ROC area
Fisher/voting                       70    26    0.88
P-SVM/ν-SVM                         30    15    0.77

Notes: The table shows the leave-one-out classification error E (% misclassifications) and the number #F of selected features. For Breast Cancer, only the minimal value of E for different thresholds was available; therefore, the area under a receiver operating curve is provided in addition. Best results are shown in bold.
Finally, we have suggested a new interpretation of dyadic data, where objects in the real world are not described by vectors. Structures like dot products are induced directly through measurements on object pairs, that is, through relations between objects. This opens up a new field of research where relations between real-world objects determine mathematical structures.

Appendix: Measurements, Kernels, and Dot Products

In this appendix we address the question under what conditions a measurement kernel that gives rise to a measured matrix K can be interpreted as a dot product between feature vectors describing the row and column objects of a dyadic data set. Assume that column objects x (samples) and row objects z (complex features) are from sets X and Z, both of which can be completed by a σ-algebra and a measure µ to measurable spaces. Let (U, B, µ) be a measurable space with σ-algebra B and a σ-additive measure µ on the set U. We consider functions f : U → R on the set U. A function f is called µ-measurable on (U, B) if f^{-1}([a, b]) ∈ B for all a, b ∈ R, and µ-integrable if ∫_U f dµ < ∞. We define

\[
\|f\|_{L^2_\mu} := \Big( \int_U f^2 \, d\mu \Big)^{1/2} \qquad (A.1)
\]

and the set

\[
L^2_\mu(U) := \big\{ f : U \to \mathbb{R} \; ; \; f \text{ is } \mu\text{-measurable and } \|f\|_{L^2_\mu} < \infty \big\}. \qquad (A.2)
\]

L^2_µ(U) is a Banach space with norm ‖·‖_{L^2_µ}. If we define the dot product

\[
\langle f, g \rangle_{L^2_\mu(U)} := \int_U f g \, d\mu, \qquad (A.3)
\]

then the Banach space L^2_µ(U) is a Hilbert space with dot product ⟨·,·⟩_{L^2_µ(U)} and scalar field R. For simplicity, we denote this Hilbert space by L^2(U). L^2(U_1, U_2) is the Hilbert space of functions k with ∫_{U_1} ∫_{U_2} k^2(u_1, u_2) dµ(u_2) dµ(u_1) < ∞, using the product measure µ(U_1 × U_2) = µ(U_1) µ(U_2). With these definitions, we see that H_1 := L^2(Z), H_2 := L^2(X), and H_3 := L^2(X, Z) are Hilbert spaces of L^2 functions with domains Z, X, and X × Z, respectively. The dot product in H_i is denoted by ⟨·,·⟩_{H_i}. Note that for discrete X or Z, the respective integrals can be replaced by sums (using a measure of Dirac delta functions at the discrete points; see Werner, 2000, p. 464, example c).
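For the discrete case mentioned at the end of the paragraph, the L^2_µ dot product reduces to a finite sum; the following sketch (with arbitrary example functions) just makes that reduction explicit.

```python
import numpy as np

# With mu = sum_j delta_{z_j}, the integral <f, g> = int_U f g dmu
# becomes a plain sum over the support points z_j.
z = np.array([0.1, 0.4, 0.7])            # support of the discrete measure
f = np.sin(z)                            # samples of f at the z_j
g = z ** 2                               # samples of g at the z_j
dot_fg = np.sum(f * g)                   # <f, g>_{L^2_mu}, equation A.3
norm_f = np.sqrt(np.sum(f ** 2))         # ||f||_{L^2_mu}, equation A.1
```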
Let us now assume that k ∈ H_3. Then k induces a Hilbert-Schmidt operator T_k,

\[
f(x) = (T_k \alpha)(x) = \int_Z k(x, z)\, \alpha(z)\, d\mu(z), \qquad (A.4)
\]

which maps α ∈ H_1 (a parameterization) to f ∈ H_2 (a classifier). If we set µ(z) = ∑_{j=1}^{P} δ_{z^j}, we recover the P-SVM classification function (without b), equation 2.26,

\[
f(u) = \sum_{j=1}^{P} \alpha_j \, k(u, z^j) = \sum_{j=1}^{P} \alpha_j \, K(u)_j, \qquad (A.5)
\]

where α_j = α(z^j) and δ_{z^j} is the Dirac delta function at location z^j.
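In this discrete form, f is just a weighted sum of kernel evaluations; a minimal numeric sketch (with an arbitrary gaussian kernel chosen only for illustration):

```python
import numpy as np

def k(u, z, sigma=1.0):
    """An example kernel; equation A.5 holds for any measurable k in H3."""
    return np.exp(-(u - z) ** 2 / (2.0 * sigma ** 2))

def f(u, alpha, Z):
    """Equation A.5: f(u) = sum_j alpha_j k(u, z_j) = sum_j alpha_j K(u)_j."""
    K_u = np.array([k(u, zj) for zj in Z])  # the row K(u) of the data matrix
    return float(alpha @ K_u)

alpha = np.array([0.5, -1.0, 0.25])   # dual weights alpha_j = alpha(z_j)
Z = np.array([0.0, 1.0, 2.0])         # row objects (complex features)
print(f(0.7, alpha, Z))
```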
Theorem 1 provides the conditions a kernel must fulfill in order to be interpretable as a dot product between the objects' feature vectors:

Theorem 1 (Singular Value Expansion). Let α be from H_1 and let k be a kernel from H_3 which defines a Hilbert-Schmidt operator T_k : H_1 → H_2:

\[
(T_k \alpha)(x) = f(x) = \int_Z k(x, z)\, \alpha(z)\, dz. \qquad (A.6)
\]

Then

\[
\|f\|^2_{H_2} = \langle T_k^* T_k \alpha, \alpha \rangle_{H_1}, \qquad (A.7)
\]

where T_k^* is the adjoint operator of T_k, and there exists an expansion

\[
k(x, z) = \sum_n s_n \, e_n(z)\, g_n(x) \qquad (A.8)
\]

that converges in the L^2 sense. The s_n ≥ 0 are the singular values of T_k, and e_n ∈ H_1, g_n ∈ H_2 are the corresponding orthonormal functions. For X = Z and a symmetric, positive definite kernel k, we obtain the eigenfunctions e_n = g_n of T_k with corresponding eigenvalues s_n.

Proof. From f = T_k α we obtain

\[
\|f\|^2_{H_2} = \langle T_k \alpha, T_k \alpha \rangle_{H_2} = \langle T_k^* T_k \alpha, \alpha \rangle_{H_1}. \qquad (A.9)
\]
The singular value expansion of T_k is

\[
T_k \alpha = \sum_n s_n \langle \alpha, e_n \rangle_{H_1} g_n \qquad (A.10)
\]

(see Werner, 2000, theorem VI.3.6). The values s_n are the singular values of T_k for the orthonormal systems {e_n} on H_1 and {g_n} on H_2. We define

\[
r_{nm} := \langle T_k e_n, g_m \rangle_{H_2} = \delta_{nm} s_m, \qquad (A.11)
\]

where the last equality results from equation A.10 for α = e_n. The sum

\[
\sum_m r_{nm}^2 = \sum_m \big( \langle T_k e_n, g_m \rangle_{H_2} \big)^2 \le \|T_k e_n\|^2_{H_2} < \infty \qquad (A.12)
\]

converges because of Bessel's inequality (the ≤ sign). Next, we complete the orthonormal system (ONS) {e_n} to an orthonormal basis (ONB) {ẽ_l} by adding an ONB of the kernel ker(T_k) of the operator T_k to the ONS {e_n}. The function α ∈ H_1 possesses a unique representation through this basis: α = ∑_l ⟨α, ẽ_l⟩_{H_1} ẽ_l. We obtain

\[
T_k \alpha = \sum_l \langle \alpha, \tilde e_l \rangle_{H_1} T_k \tilde e_l, \qquad (A.13)
\]

where we used that T_k is continuous. Because T_k ẽ_l = 0 for all ẽ_l ∈ ker(T_k), the image T_k α can be expressed through the ONS {e_n}:

\[
T_k \alpha = \sum_n \langle \alpha, e_n \rangle_{H_1} T_k e_n
= \sum_n \sum_m \langle \alpha, e_n \rangle_{H_1} \langle T_k e_n, g_m \rangle_{H_2} \, g_m
= \sum_{n,m} r_{nm} \langle \alpha, e_n \rangle_{H_1} g_m. \qquad (A.14)
\]

Here we used the fact that {g_m} is an ONB of the range of T_k, and therefore T_k e_n = ∑_m ⟨T_k e_n, g_m⟩_{H_2} g_m. Because the set of functions {e_n g_m} is an ONS in H_3 (which can be completed to an ONB) and ∑_{n,m} r_{nm}^2 < ∞ (cf. equation A.12), the kernel

\[
\tilde k(z, x) := \sum_{n,m} r_{nm} \, e_n(z)\, g_m(x) \qquad (A.15)
\]
is from H_3. We observe that the induced Hilbert-Schmidt operator T_{k̃} equals T_k:

\[
(T_{\tilde k} \alpha)(x) = \sum_{n,m} r_{nm} \langle \alpha, e_n \rangle_{H_1} g_m(x) = (T_k \alpha)(x), \qquad (A.16)
\]

where the first equality follows from equation A.15 and the second from equation A.14. Hence, the kernels k and k̃ are equal except on a set of zero measure: k =_µ k̃. From equation A.11 we obtain ⟨T_k e_l, g_t⟩_{H_2} = δ_{lt} s_l, and from equation A.16 we obtain ⟨T_{k̃} e_l, g_t⟩_{H_2} = r_{lt}; therefore r_{lt} = δ_{lt} s_l. Inserting r_{nm} = δ_{nm} s_n into equation A.15 proves equation A.8 of the theorem. The last statement of the theorem follows from the fact that |T_k| = (T_k^* T_k)^{1/2} = T_k (T_k is positive and self-adjoint), and therefore e_n = g_n (Werner, 2000, proof of theorem VI.3.6).

As a consequence of this theorem, we can for finite Z define a mapping ω of row objects z and a mapping φ of column objects x into a common feature space where k is a dot product:

\[
\varphi(x) := (s_1 g_1(x), s_2 g_2(x), \dots), \quad
\omega(z) := (e_1(z), e_2(z), \dots), \quad
\langle \varphi(x), \omega(z) \rangle = \sum_n s_n e_n(z) g_n(x) = k(z, x). \qquad (A.17)
\]

Note that finite Z ensures that ⟨ω(z), ω(z)⟩ converges. From equation A.14, we obtain for the classification or regression function

\[
f(x) = \sum_n s_n \langle \alpha, e_n \rangle_{H_1} g_n(x). \qquad (A.18)
\]

It is well defined because sets of zero measure vanish through the integration in equation A.4, which is confirmed by the expansion A.18, where the zero measure is absorbed in the terms ⟨α, e_n⟩_{H_1}. To obtain absolute and uniform convergence of the sum for f(x), we must enforce ‖k(x,·)‖²_{H_1} ≤ K², as can be seen in corollary 1:

Corollary 1 (Linear Classification in ℓ²). Let the assumptions of theorem 1 hold, and let ∫_Z (k(x, z))² dz ≤ K² for all x ∈ X. We define w := (⟨α, e_1⟩_{H_1}, ⟨α, e_2⟩_{H_1}, …) and φ(x) := (s_1 g_1(x), s_2 g_2(x), …). Then w, φ(x) ∈ ℓ², where ‖w‖²_{ℓ²} ≤ ‖α‖²_{H_1} and ‖φ(x)‖²_{ℓ²} ≤ K², and the following sum converges absolutely and uniformly:

\[
f(x) = \langle w, \varphi(x) \rangle_{\ell^2} = \sum_n s_n \langle \alpha, e_n \rangle_{H_1} g_n(x). \qquad (A.19)
\]
Proof. First, we show that φ(x) ∈ ℓ²:

\[
\|\varphi(x)\|^2_{\ell^2} = \sum_n (s_n g_n(x))^2 = \sum_n \big( (T_k e_n)(x) \big)^2
= \sum_n \big( \langle k(x, \cdot), e_n \rangle_{H_1} \big)^2
\le \|k(x, \cdot)\|^2_{H_1}
\le \sup_{x \in X} \Big\{ \int_Z (k(x, z))^2 \, dz \Big\} \le K^2, \qquad (A.20)
\]

where we used Bessel's inequality for the first ≤, the supremum over x ∈ X for the second ≤ (the supremum exists because {∫ (k(x, z))² dz} is a bounded subset of R), and the assumption of the corollary for the last ≤. To prove ‖w‖²_{ℓ²} ≤ ‖α‖²_{H_1}, we again use Bessel's inequality:

\[
\|w\|^2_{\ell^2} = \sum_n \big( \langle \alpha, e_n \rangle_{H_1} \big)^2 \le \|\alpha\|^2_{H_1}. \qquad (A.21)
\]

Finally, we prove that the sum

\[
f(x) = \langle w, \varphi(x) \rangle_{\ell^2} = \sum_n s_n \langle \alpha, e_n \rangle_{H_1} g_n(x) \qquad (A.22)
\]

converges absolutely and uniformly. That the sum converges in the L² sense follows directly from the singular value expansion of theorem 1. We now choose an m ∈ N with

\[
\sum_{n=m}^{\infty} \big( \langle \alpha, e_n \rangle_{H_1} \big)^2 \le \Big( \frac{\epsilon}{K} \Big)^2 \qquad (A.23)
\]

for ε > 0 (because of equation A.21, such an m exists), and we apply the Cauchy-Schwarz inequality:

\[
\sum_{n=m}^{\infty} \big| s_n \langle \alpha, e_n \rangle_{H_1} g_n(x) \big|
\le \Big( \sum_{n=m}^{\infty} (s_n g_n(x))^2 \Big)^{1/2} \Big( \sum_{n=m}^{\infty} (\langle \alpha, e_n \rangle_{H_1})^2 \Big)^{1/2}
\le K \cdot \frac{\epsilon}{K} = \epsilon, \qquad (A.24)
\]

where we used inequalities A.20 and A.23. Because m is independent of x, the convergence is absolute and uniform as well.

Equation A.4 or, equivalently, A.19, is a linear classification or regression function in ℓ². We find that the expansion of the classifier f converges absolutely and uniformly, and therefore that f is continuous. This can be seen
because the e_n are eigenfunctions of the compact, positive, self-adjoint operator (T_k^* T_k)^{1/2} and the g_n are isometric images of the e_n (see Werner, 2000, theorem VI.3.6 and the text before theorem VI.4.2). Hence, the orthonormal functions g_n are continuous. This also justifies the analysis in equation 2.27 and the derivatives of the g_m with respect to x.

Relation of the theoretical considerations to the P-SVM. Using µ(x) = ∑_{i=1}^{L} δ_{x^i}, µ(z) = ∑_{j=1}^{P} δ_{z^j}, and α_j := α(z^j), we obtain

\[
\begin{aligned}
f(x) &= \sum_{j=1}^{P} \alpha_j \, k(x, z^j) = \Big\langle \varphi(x), \sum_{j=1}^{P} \alpha_j \, \omega(z^j) \Big\rangle, \\
X_\varphi &= \big( \varphi(x^1), \varphi(x^2), \dots, \varphi(x^L) \big), \quad
Z_\omega = \big( \omega(z^1), \omega(z^2), \dots, \omega(z^P) \big), \\
w &= \sum_{j=1}^{P} \alpha_j \, \omega(z^j) \quad \text{(expansion into support vectors)}, \\
K_{ij} &= \big\langle \varphi(x^i), \omega(z^j) \big\rangle = \sum_n s_n e_n(z^j) g_n(x^i) = k(x^i, z^j), \\
K &= X_\varphi^{\top} Z_\omega, \quad \text{and} \\
\|f\|^2_{H_2} &= \alpha^{\top} K^{\top} K \alpha = \big\| X_\varphi^{\top} w \big\|^2 \quad \text{(the objective function)}. \qquad (A.25)
\end{aligned}
\]
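The finite-dimensional identities in equation A.25 can be checked numerically with a random matrix standing in for a measured K; the SVD plays the role of the singular value expansion in equation A.8.

```python
import numpy as np

rng = np.random.default_rng(0)
L, P = 6, 4
K = rng.normal(size=(L, P))          # stands in for a measured data matrix
alpha = rng.normal(size=P)

# Objective function of equation A.25: alpha^T K^T K alpha = ||K alpha||^2.
assert np.isclose(alpha @ K.T @ K @ alpha, np.linalg.norm(K @ alpha) ** 2)

# Finite analog of equation A.8: K[i, j] = sum_n s_n g_n(x^i) e_n(z^j).
G, s, Et = np.linalg.svd(K, full_matrices=False)
assert np.allclose(K, G @ np.diag(s) @ Et)
```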
Note that in the P-SVM formulation, w is not unique with respect to the zero subspace of the matrix X_φ. Here we obtain the related result that w is not unique with respect to the subspace mapped to the zero function by T_k. Interestingly, we recovered the new objective function, equation 2.4, as the L² norm ‖f‖²_{H_2} of the classification function. This again motivates the use of the new objective function as a capacity measure. We also find that the primal problem of the P-SVM, equation 2.21, corresponds to the formulation in H_2, while the dual, equation 2.25, corresponds to the formulation in H_1. Primal and dual P-SVM formulations can be transformed into each other by the property ⟨T_k α, T_k α⟩_{H_2} = ⟨T_k^* T_k α, α⟩_{H_1}.

Acknowledgments

We thank M. Albery-Speyer, C. Büscher, C. Minoux, R. Sanyal, A. Paus, and S. Seo for their help with the numerical simulations. This work was funded by the Anna-Geissler- and the Monika-Kuntzner-Stiftung, and by the BMWA under project no. 10024213.
References

Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54, 550–560.
Bains, W., & Smith, G. (1988). A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, 135, 303–307.
Bayer, A. E., Smart, J. C., & McLaughlin, G. W. (1990). Mapping intellectual structure of a scientific subfield through author cocitations. Journal of the American Society for Information Science, 41(6), 444–452.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57(1), 289–300.
Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245–271.
Califano, A., Stolovitzky, G., & Tu, Y. (1999). Analysis of gene expression microarrays for phenotype classification. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (pp. 75–85). Menlo Park, CA: AAAI Press.
Chu, W., Keerthi, S. S., & Ong, C. J. (2004). Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 15(1), 29–44.
Cremer, T., Kurz, A., Zirbel, R., Dietzel, S., Rinke, B., Schröck, E., Speicher, M. R., Mathieu, U., Jauch, J., Emmerich, P., Scherthan, H., Ried, T., Cremer, C., & Lichter, P. (1993). Role of chromosome territories in the functional compartmentalization of the cell nucleus. Cold Spring Harbor Symp. Quant. Biol., 58, 777–792.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.
Drmanac, R., Labat, I., Brukner, I., & Crkvenjakov, R. (1989). Sequencing of megabase plus DNA by hybridization: Theory of the method. Genomics, 4, 114–128.
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J., Hofmann, K., & Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Research, 30, 235–238.
Graepel, T., Herbrich, R., Bollmann-Sdorra, P., & Obermayer, K. (1999). Classification on pairwise proximity data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 438–444). Cambridge, MA: MIT Press.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: MIT Press.
Heyer, L. J., Kruglyak, S., & Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 11, 1106–1115.
Hochreiter, S., Mozer, M. C., & Obermayer, K. (2003). Coulomb classifiers: Generalizing support vector machines via an analogy to electrostatic systems. In S. Becker,
S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 545–552). Cambridge, MA: MIT Press.
Hochreiter, S., & Obermayer, K. (2004a). Classification, regression, and feature selection on matrix data (Tech. Rep. No. 2004/2). Berlin: Technische Universität Berlin, Fakultät für Elektrotechnik und Informatik.
Hochreiter, S., & Obermayer, K. (2004b). Gene selection for microarray data. In B. Schölkopf, K. Tsuda, & J.-P. Vert (Eds.), Kernel methods in computational biology (pp. 319–355). Cambridge, MA: MIT Press.
Hochreiter, S., & Obermayer, K. (2004c). Sphered support vector machine (Tech. Rep.). Berlin: Technische Universität Berlin, Fakultät für Elektrotechnik und Informatik.
Hochreiter, S., & Obermayer, K. (2005). Nonlinear feature selection with the potential support vector machine. In I. Guyon, S. Gunn, M. Nikravesh, & L. Zadeh (Eds.), Feature extraction, foundations and applications. Berlin: Springer.
Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469), 286–295.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1–25.
Hofmann, T., & Puzicha, J. (1998). Unsupervised learning from dyadic data (Tech. Rep. No. TR-98-042). Berkeley, CA: International Computer Science Institute.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the Association for Computing Machinery, 46(5), 604–632.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Li, H., & Loken, E. (2002). A unified theory of statistical analysis and inference for variance component models for dyadic data. Statistica Sinica, 12, 519–535.
Lipman, D., & Pearson, W. (1985). Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
Lu, Q., Wallrath, L. L., & Elgin, S. C. R. (1994). Nucleosome positioning and gene regulation. Journal of Cellular Biochemistry, 55, 83–92.
Lysov, Y., Florent'ev, V., Khorlin, A., Khrapko, K., Shik, V., & Mirzabekov, A. (1988). DNA sequencing by hybridization with oligonucleotides. Doklady Academy Nauk USSR, 303, 1508–1511.
Mangasarian, O. L. (1998). Generalized support vector machines (Tech. Rep. No. 98-14). Madison: Computer Sciences Department, University of Wisconsin.
Mazza, C. B., Sukumar, N., Breneman, C. M., & Cramer, S. M. (2001). Prediction of protein retention in ion-exchange systems using molecular descriptors obtained from crystal structure. Anal. Chem., 73, 5457–5461.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing, 9 (pp. 41–48). Piscataway, NJ: IEEE.
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., Allen, J. C., Zagzag, D., Olson, J. M., Curran, T., Wetmore, C., Biegel, J. A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D. N., Mesirov, J. P., Lander, E. S., & Golub, T. R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
Scherf, U., Ross, D. T., Waltham, M., Smith, L. H., Lee, J. K., Tanabe, L., Kohn, K. W., Reinhold, W. C., Myers, T. G., Andrews, D. T., Scudiero, D. A., Eisen, M. B., Sausville, E. A., Pommier, Y., Botstein, D., Brown, P. O., & Weinstein, J. N. (2000). A gene expression database for the molecular pharmacology of cancer. Nature Genetics, 24(3), 236–244.
Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Generalization bounds via eigenvalues of the Gram matrix (Tech. Rep. No. NC2-TR-1999-035). London: Royal Holloway, University of London.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R., & Anthony, M. (1996). A framework for structural risk minimization. In Proceedings of the 9th Annual Conference on Computational Learning Theory (pp. 68–76). New York: Association for Computing Machinery.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., & Golub, T. R. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68–74.
Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., & Bucher, P. (2002). PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinformatics, 3, 265–274.
Southern, E. (1988). United Kingdom patent application GB8810400.
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., & Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Werner, D. (2000). Funktionalanalysis (3rd ed.). Berlin: Springer.
White, H. D., & McCain, K. W. (1989). Bibliometrics. Annual Review of Information Science and Technology, 24, 119–186.
Received March 12, 2004; accepted July 15, 2005.
NOTE
Communicated by Alexandre Pouget
Optimal Neuronal Tuning for Finite Stimulus Spaces W. Michael Brown
[email protected] Computational Biology, Sandia National Laboratories, Albuquerque, NM, 87123, U.S.A.
Alex Bäcker
[email protected] Computational Biology, Sandia National Laboratories, Albuquerque, NM, 87123, and Division of Biology, California Institute of Technology, Pasadena, CA 91125, U.S.A.
The efficiency of neuronal encoding in sensory and motor systems has been proposed as a first principle governing response properties within the central nervous system. We present a continuation of a theoretical study presented by Zhang and Sejnowski, where the influence of neuronal tuning properties on encoding accuracy is analyzed using information theory. When a finite stimulus space is considered, we show that the encoding accuracy improves with narrow tuning for one- and two-dimensional stimuli. For three dimensions and higher, there is an optimal tuning width.

Neural Computation 18, 1511–1526 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

The potential impact of coding efficiency on neuronal response properties within the central nervous system was first proposed by Attneave (1954) and has since been studied using both theoretical and experimental approaches. The issue of optimal neuronal tuning widths has received much attention in recent literature. Empirical examples of both finely tuned receptive fields (Kuffler, 1953; Lee, 1999) and broadly tuned neurons (Georgopoulos, Schwartz, & Kettner, 1986; Knudsen & Konishi, 1978) have been found. Theoretical arguments have also been made for both sharp (Barlow, 1972; Lettvin, Maturana, McCulloch, & Pitts, 1959) and broad (Baldi & Heiligenberg, 1988; Eurich & Schwegler, 1997; Georgopoulos et al., 1986; Hinton, McClelland, & Rumelhart, 1986; Salinas & Abbott, 1994; Seung & Sompolinsky, 1993; Snippe, 1996; Snippe & Koenderink, 1992) tuning curves as a means to increase encoding accuracy. Using Fisher information, Zhang and Sejnowski (1999) offered an intriguing solution where the choice of narrow or broad tuning curves depends on the dimensionality of the stimulus space. They found that for one dimension, the encoding accuracy increases with decreasing tuning width and that for two dimensions, the encoding accuracy is independent of the tuning width. For three dimensions and higher, the results suggest that encoding accuracy should increase with increasing tuning width. The result, which is widely cited in works on neuronal encoding, offers a universal scaling rule for all radially symmetric tuning functions. However, this scaling rule is highly unintuitive in that for greater than three dimensions, it predicts optimal encoding accuracy for infinite tuning widths, that is, tuning widths for which neurons have no discrimination power and all neurons are indistinguishable from each other. In this note, we analyze this effect and show that when a finite stimulus space is considered, there is an optimal tuning width (in terms of Fisher information) for all stimulus dimensionalities.

2 Fisher Information

2.1 Fisher Information for an Infinite Stimulus Space. The Cramér-Rao inequality gives a lower bound for the variance of any unbiased estimator (Cover & Thomas, 1991) and is useful for studying neuronal encoding accuracy in that it represents the minimum mean-squared reconstruction error that can be achieved by any decoding strategy (Seung & Sompolinsky, 1993). Let x = (x_1, x_2, …, x_D) be a vector describing a D-dimensional stimulus. The Cramér-Rao bound is then given by

\[
v(x) \ge J^{-1}(x), \qquad (2.1)
\]
where v is the covariance matrix of a set of unbiased estimators for x, J is the Fisher information matrix, and the matrix inequality is given in the sense that v(x) − J^{-1}(x) must be a nonnegative definite matrix (Cover & Thomas, 1991). For an encoding variable representing neuronal firing rates, the Fisher information matrix for a neuron is given by

\[
J_{ij}(x) = E\left[ \frac{\partial}{\partial x_i} \ln P[n \mid x, \tau] \; \frac{\partial}{\partial x_j} \ln P[n \mid x, \tau] \right], \qquad (2.2)
\]

where E represents the expectation value over the probability distribution P[n | x, τ] for firing n spikes at stimulus x within a time window τ. For multiple neurons with independent spiking, the total Fisher information for N neurons is given by the sum

\[
J(x) = \sum_{a=1}^{N} J^{a}(x). \qquad (2.3)
\]
If the neurons are restricted to radially symmetric tuning functions and distributed identically throughout the stimulus space such that the distributions of estimation errors in each dimension are identical and uncorrelated, the Fisher information matrix becomes diagonal (Zhang & Sejnowski, 1999), and the total Fisher information reduces to

\[
J(x) = \sum_{a=1}^{N} \sum_{i=1}^{D} E\left[ \left( \frac{\partial}{\partial x_i} \ln P_a[n \mid x, \tau] \right)^2 \right]. \qquad (2.4)
\]
For homogeneous Poisson spike statistics,

\[
P[n \mid x, \tau] = \frac{(\tau f(x))^n \exp(-\tau f(x))}{n!}, \qquad (2.5)
\]

where f(x) describes the mean firing rate of the neuron with respect to the stimulus, that is, the neuronal tuning function. Equation 2.4 then becomes

\[
J(x) = \tau \sum_{a=1}^{N} \sum_{i=1}^{D} \frac{1}{f_a(x)} \left( \frac{\partial}{\partial x_i} f_a(x) \right)^2. \qquad (2.6)
\]
For a gaussian tuning function,

\[
f(x) = F \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_i)^2 \right), \qquad (2.7)
\]

where F represents the mean peak firing rate, c = (c_1, c_2, …, c_D) represents the preferred stimulus of the neuron, and σ represents the tuning width parameter for the neuron. Substitution into equation 2.6 gives

\[
J(x) = \tau \sum_{a=1}^{N} \frac{F}{\sigma^4} \sum_{i=1}^{D} (x_i - c_{a,i})^2 \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_{a,i})^2 \right). \qquad (2.8)
\]
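Equations 2.7 and 2.8 translate directly into code; the sketch below (a restatement, not part of the original note) evaluates the total Fisher information at a stimulus for a population of gaussian-tuned Poisson neurons.

```python
import numpy as np

def gaussian_tuning(x, c, F, sigma):
    """Equation 2.7: mean firing rate for preferred stimulus c."""
    return F * np.exp(-np.sum((x - c) ** 2, axis=-1) / (2.0 * sigma ** 2))

def total_fisher(x, centers, F, sigma, tau):
    """Equation 2.8: total Fisher information at x for N neurons whose
    preferred stimuli are the rows of `centers`."""
    r2 = np.sum((x - centers) ** 2, axis=-1)   # ||x - c_a||^2, one per neuron
    return tau * F / sigma ** 4 * np.sum(r2 * np.exp(-r2 / (2.0 * sigma ** 2)))
```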
Assuming that the preferred stimuli of the neurons are uniformly distributed throughout the stimulus space, the average Fisher information per neuron for an infinite stimulus space can be found by replacing the summation with an integral (Zhang & Sejnowski, 1999):

\[
\langle J \rangle = \int_{-\infty}^{\infty} J^{1}(x) \, dx_1 \cdots dx_D, \qquad (2.9)
\]
where J^1(x) represents the Fisher information for a single neuron rather than the total as given in equation 2.4. Under these assumptions, we can make the replacement ξ = x − c:

\[
\langle J \rangle = \int_{-\infty}^{\infty} J^{1}(\xi) \, d\xi_1 \cdots d\xi_D. \qquad (2.10)
\]

For Poisson spike statistics and gaussian tuning (see equation 2.8) with D = 1,

\[
\langle J \rangle = \frac{F\tau}{\sigma^4} \left[ \frac{\sigma^3 \sqrt{2\pi}}{2} \, \mathrm{erf}\!\left( \frac{\sqrt{2}\,\xi}{2\sigma} \right) - \sigma^2 \xi \exp\!\left( -\frac{\xi^2}{2\sigma^2} \right) \right]_{-\infty}^{\infty}. \qquad (2.11)
\]

Here, the gaussian error function is given by

\[
\mathrm{erf}(b) = \frac{2}{\sqrt{\pi}} \int_0^b \exp(-t^2) \, dt. \qquad (2.12)
\]

If we assume that the stimulus space (integration interval) is infinite and that σ is finite, equation 2.11 reduces to

\[
\langle J \rangle_{1D} = \frac{F\tau\sqrt{2\pi}}{\sigma}. \qquad (2.13)
\]

Due to symmetry with respect to different dimensions, equation 2.13 can be generalized to any dimensionality to give the result reported by Zhang and Sejnowski (1999):

\[
\langle J \rangle = D F \tau \, (2\pi)^{D/2} \sigma^{D-2}. \qquad (2.14)
\]
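Equation 2.13 is easy to confirm by direct quadrature of the single-neuron integrand; this sketch uses arbitrary parameter values.

```python
import numpy as np
from scipy.integrate import quad

F, tau, sigma = 1.0, 1.0, 0.3

# Single-neuron 1D integrand of equation 2.10 (xi = x - c):
j1 = lambda xi: tau * F / sigma ** 4 * xi ** 2 * np.exp(-xi ** 2 / (2 * sigma ** 2))

J, _ = quad(j1, -np.inf, np.inf)
print(J, F * tau * np.sqrt(2.0 * np.pi) / sigma)  # equation 2.13: values agree
```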
The Fisher information based on equation 2.14 as a function of tuning width is shown in Figure 1. Although we have shown the Fisher information per neuron as an average, it is actually the exact Fisher information for each neuron because of the assumption of homogeneous tuning widths and an infinite stimulus space. Therefore, the total Fisher information can be found by multiplying the Fisher information per neuron by the number of neurons to get Fisher information across the stimulus space that is independent of x. Equation 2.14 does not describe the influence of tuning width on encoding accuracy in the limit as σ approaches infinity, however, and therefore is unusable when the tuning width is large relative to the stimulus space. This is relevant, since for D > 2, equation 2.14 predicts optimal tuning widths to be infinite. Furthermore, when using neuronal firing rate as an encoding variable, this becomes relevant in that for a finite number of neurons with
Figure 1: Scaling rule for the average Fisher information per neuron as a function of tuning width for different stimulus dimensionalities. Dashed lines represent the Fisher information calculated when the stimulus space is infinite (see equation 2.14), and solid lines represent calculations for a finite stimulus space (see equation 2.15). In this plot, the Fisher information is divided by the peak firing rate (F ) and the time window (τ ).
finite firing rates and a finite decoding time, the range for the stimulus must be finite. That is, with discrete spiking events, there is a range of stimulus space beyond which no Fisher information can be conveyed within a reasonable decoding time.

2.2 Fisher Information for a Finite Stimulus Space. In order to study Fisher information within a finite stimulus space, we begin by considering a stimulus range normalized to lie in the inclusive range between 0 and 1. In this case, the tuning width is expressed in terms of a fraction of the finite stimulus space. If we consider an infinite number of neurons with preferred stimuli evenly distributed across the finite stimulus space, the average Fisher information per neuron is given (for radially symmetric tuning functions) by

\[
\langle J_f \rangle = \int_0^1 dc_1 \cdots dc_D \int_0^1 J^{1}(x, c) \, dx_1 \cdots dx_D, \qquad (2.15)
\]
where J^1(x, c) is the Fisher information at x for the neuron with preferred stimulus c. Here,

\[
J^{1}(x, c) = \tau \frac{F}{\sigma^4} \sum_{i=1}^{D} (x_i - c_i)^2 \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_i)^2 \right). \qquad (2.16)
\]

For the one-dimensional case with Poisson spiking statistics and gaussian tuning (see equation 2.8), the average Fisher information per neuron is given by

\[
\langle J_f \rangle_{1D} = F\tau \left[ \frac{\sqrt{2\pi}}{\sigma} \, \mathrm{erf}\!\left( \frac{\sqrt{2}}{2\sigma} \right) + 4 \exp\!\left( -\frac{1}{2\sigma^2} \right) - 4 \right], \qquad (2.17)
\]

for the two-dimensional case by

\[
\langle J_f \rangle_{2D} = 4F\tau \left[ \pi \, \mathrm{erf}^2\!\left( \frac{\sqrt{2}}{2\sigma} \right)
+ 3\sigma\sqrt{2\pi} \, \mathrm{erf}\!\left( \frac{\sqrt{2}}{2\sigma} \right) \left( \exp\!\left( -\frac{1}{2\sigma^2} \right) - 1 \right)
+ 4\sigma^2 \left( 1 - \exp\!\left( -\frac{1}{2\sigma^2} \right) \right)^2 \right], \qquad (2.18)
\]

and for the three-dimensional case by

\[
\langle J_f \rangle_{3D} = 12\sigma F\tau \left[ \frac{\sqrt{2}}{2} \pi^{3/2} \, \mathrm{erf}^3\!\left( \frac{\sqrt{2}}{2\sigma} \right)
- 4\pi\sigma \, \mathrm{erf}^2\!\left( \frac{\sqrt{2}}{2\sigma} \right) \left( 1 - \exp\!\left( -\frac{1}{2\sigma^2} \right) \right)
- 5\sigma^2\sqrt{2\pi} \, \mathrm{erf}\!\left( \frac{\sqrt{2}}{2\sigma} \right) \left( 2\exp\!\left( -\frac{1}{2\sigma^2} \right) - \exp\!\left( -\frac{1}{\sigma^2} \right) - 1 \right)
+ 4\sigma^3 \left( \exp\!\left( -\frac{3}{2\sigma^2} \right) - 3\exp\!\left( -\frac{1}{\sigma^2} \right) + 3\exp\!\left( -\frac{1}{2\sigma^2} \right) - 1 \right) \right]. \qquad (2.19)
\]

The influence of tuning width on the average Fisher information per neuron is plotted in Figure 1 for the dimensionalities D = 1–3. For one and two dimensions, the average Fisher information, and thus encoding accuracy, increases with decreasing tuning width. For three dimensions, there is an optimal tuning width in terms of Fisher information, given by the σ at which dJ/dσ = 0 (approximately 22.541% of the stimulus space). For higher dimensionalities, optimal tuning widths also exist.
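Because equation 2.15 is defined as an average over the unit hypercube, it can be checked independently of the closed forms above by Monte Carlo integration; the sketch below locates the three-dimensional optimum, which the text places near 22.5% of the stimulus space.

```python
import numpy as np

def avg_fisher_mc(sigma, D=3, F=1.0, tau=1.0, n=200_000, seed=1):
    """Monte Carlo estimate of equation 2.15: x and c uniform on [0, 1]^D."""
    rng = np.random.default_rng(seed)
    r2 = np.sum((rng.random((n, D)) - rng.random((n, D))) ** 2, axis=1)
    return np.mean(tau * F / sigma ** 4 * r2 * np.exp(-r2 / (2.0 * sigma ** 2)))

widths = np.linspace(0.05, 0.6, 56)
J = [avg_fisher_mc(s) for s in widths]
print(widths[int(np.argmax(J))])   # close to the optimum of ~0.225 for D = 3
```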
That optimal tuning widths exist can be seen from the fact that as the tuning width approaches infinity, the derivative of the probability of firing a given number of spikes with respect to a stimulus goes to zero. For the example presented here,

\[
\lim_{\sigma \to \infty} \frac{\partial}{\partial x_k} f(x) = \lim_{\sigma \to \infty} \frac{F (c_k - x_k)}{\sigma^2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_i)^2 \right) = 0 \qquad (2.20)
\]

for any k from 1 to D. In this limit, the tuning function is independent of x, the probability distribution for spike firing (see equation 2.5) is independent of x, and the resulting derivatives give a Fisher information of zero (see equation 2.2). In the limit as the tuning width goes to zero, equation 2.14 becomes valid (the limits of the error functions and exponentials are equivalent in the case where the tuning width is infinitesimal and in the case where the stimulus space is infinite):
\[
\lim_{\sigma \to 0} \langle J_f \rangle
= \lim_{\sigma \to 0} \int_0^1 dc_1 \cdots dc_D \int_0^1 J^{1}(x, c) \, dx_1 \cdots dx_D
= \lim_{\sigma \to 0} \int_{-\infty}^{\infty} J^{1}(\xi) \, d\xi_1 \cdots d\xi_D
= \begin{cases} \infty, & D = 1 \\ 2\pi D F \tau, & D = 2 \\ 0, & D > 2. \end{cases} \qquad (2.21)
\]
Therefore, at least one maximum in the Fisher information must exist for higher dimensionalities. While we expect that the optimal tuning width shifts toward a larger fraction of the stimulus space as the dimensionality is increased, we are unable to find a closed form for equation 2.15 that can prove this. The deviation of the results from equation 2.14 is explained by the fact that an increase in tuning width in equation 2.15 results in an increase in tuning width relative to the stimulus space, an effect that is not easily seen when an infinite stimulus space is considered. Clearly, this deviation should be expected, as the limit given in equation 2.20 contradicts the result in equation 2.14. An infinite tuning width produces a neuronal tuning function that is flat at all but infinite stimuli. A neuron with such a tuning curve cannot discriminate between finite stimuli and therefore contributes zero Fisher information within finite stimulus ranges. This result, which is true for all dimensionalities, tells us that an infinite tuning width can never be optimal, at least under the assumptions presented in this work.

2.3 Determining the Finite Stimulus Range. As stated earlier, when using neuronal firing rate as an encoding variable, there is a physiological limit to the range of stimulus space that can be perceived. This limit is due
Figure 2: Fisher information as a function of stimulus (J x (x)) calculated using equation 2.24. The average Fisher information per neuron calculated using equation 2.22 is found by normalizing the finite stimulus space to lie between 0 and 1 and distributing the preferred stimuli of the neurons between α and 1 − α such that the drop in Fisher information at the corners of the stimulus space does not fall below J cutoff . The shaded regions are not included in the finite stimulus space and therefore are not included in the average. For this plot, σ = 0.2, J cutoff = 8, and α = −0.23. The Fisher information is divided by the peak firing rate (F ) and the time window (τ ).
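The construction in Figure 2 can be reproduced in one dimension with standard quadrature and root finding; with σ = 0.2 and J_cutoff = 8 (and F = τ = 1, an assumption, since the figure normalizes by F and τ), the recovered α is close to the quoted −0.23.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

F, tau, sigma, J_cutoff = 1.0, 1.0, 0.2, 8.0

def j1(x, c):
    """Single-neuron Fisher information for D = 1 (equation 2.16)."""
    return tau * F / sigma ** 4 * (x - c) ** 2 * np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

def J_x(x, a):
    """Equation 2.24: integrate over preferred stimuli c in [a, 1 - a]."""
    return quad(lambda c: j1(x, c), a, 1.0 - a)[0]

# Equation 2.23: choose alpha so the corner value equals J_cutoff.
alpha = brentq(lambda a: J_x(0.0, a) - J_cutoff, -2.0, 0.45)
print(alpha)   # approximately -0.23, matching the Figure 2 caption
```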
to the fact that the number of neurons is finite, the firing rates are finite, and the decoding time is finite. This limit will depend on the number of neurons, the preferred stimuli of the neurons, the mean peak firing rate, and the decoding time. There is no implication, however, that the stimulus space range should equal the range of preferred stimuli for the neurons. It is therefore important to consider the influence of the preferred stimuli range on the optimal tuning width. We can evaluate the effect of the preferred stimuli range on the optimum tuning width by fixing the integration interval for the stimulus space (x) to lie between 0 and 1, and setting the integration interval for the preferred stimuli (c) such that the drop in Fisher information at the corners of the stimulus space does not fall below some threshold value (J cutoff ) (see Figure 2). In this case, the tuning width always represents a fraction of the stimulus space. The range of the preferred stimuli is dependent on both σ and J cutoff ;
however, it will always be centered on the stimulus space and identical in each dimension. The choice of the Fisher information as a threshold to limit the stimulus space is reasonable because there is a finite range of stimulus space within which the mean-squared error will be tolerable based on a finite distribution of preferred stimuli. We can evaluate the effect of J_cutoff on the Fisher information per neuron in terms of σ as

\[
J_\alpha(\sigma, \alpha) = \frac{1}{(1 - 2\alpha)^D} \int_{\alpha}^{1-\alpha} dc_1 \cdots dc_D \int_0^1 J^{1}(x, c) \, dx_1 \cdots dx_D, \qquad (2.22)
\]

such that

\[
J_x(x_1 = 0, \dots, x_D = 0) = J_x(x_1 = 1, \dots, x_D = 1) = J_{\text{cutoff}}, \qquad (2.23)
\]

where J_x represents the Fisher information as a function of the stimulus,

\[
J_x(x) = \int_{\alpha}^{1-\alpha} J^{1}(x, c) \, dc_1 \cdots dc_D, \qquad (2.24)
\]

and α determines the range of the preferred stimuli. For a given J_cutoff and σ, we solve for the range of the preferred stimuli (α to 1 − α) using a numerical evaluation of equation 2.23 such that the Fisher information at the corners of the stimulus space will be equal to J_cutoff. Of course, for each value of σ, there is a maximum J_cutoff for which a solution exists, simply because the Fisher information at any point in the stimulus space is finite. Once α has been determined, equation 2.22 can then be solved analytically to determine the average Fisher information per neuron over the stimulus space, using a range of preferred stimuli that may be larger or smaller than the range of the stimulus space, depending on the value of J_cutoff. Because equation 2.22 represents an average Fisher information per neuron, it is normalized to account for the range of the neuronal distribution. For a finite neuronal distribution range, there is a finite range of stimulus space within which the mean squared error will be tolerable. By determining the average Fisher information per neuron using equation 2.22, the finite stimulus space is always determined by the points at which the Fisher information begins to fall below this threshold, regardless of the tuning width. The results for this evaluation are shown in Figure 3 for the first three dimensionalities. For all cases, an increase in J_cutoff results in an increase in the range of the preferred stimuli (decrease in α); that is, when the cutoff becomes low, the range of the stimulus space must become larger relative to the distribution of the neurons such that the stimulus space includes
Figure 3: Plot of the average Fisher information per neuron for one to three dimensions calculated using equation 2.22 (left) along with the corresponding ranges for the preferred stimuli (right). The plotted values for J cutoff were chosen so that the range of σ is large enough to illustrate differences (as σ increases, the maximum J cutoff that can be achieved decreases). For all dimensionalities, there is an increase in the range of preferred stimuli and a decrease in the average Fisher information per neuron as J cutoff increases. The Fisher information is divided by the peak firing rate (F ) and the time window (τ ).
regions with higher mean squared errors. This increase in the range of the preferred stimuli results in a decrease in the average Fisher information per neuron. This is due to the fact that the total Fisher information within the finite stimulus space (0–1) for a single neuron decreases as c deviates from 0.5. As the range of the preferred stimuli increases, the average Fisher information for each neuron within the stimulus space will decrease. The results quantify the idea that the encoding accuracy within a finite stimulus space of interest can be increased by increasing the range of preferred stimuli around the center of the stimulus space. This increase comes at an energetic cost resulting from an increase in the number of neurons, with diminishing returns due to a decrease in the average Fisher information per neuron within the stimulus space. While a change in the curvature of the average Fisher information as a function of σ can result from a change in the Fisher information cutoff, the Fisher information per neuron always improves with narrow tuning for one and two dimensions. For three dimensions, there will always be an optimal tuning width; however, as shown in Figure 3, the optimal tuning width will shift toward a larger fraction of the stimulus space with an increase in the Fisher information cutoff. Numerical evaluation of the optimal tuning width calculated with J_cutoff ranging from 10^{-5} to 10 results in an increase in the optimal tuning width from 0.21 to 0.41, a decrease in α from 0.48 to −0.5, and a decrease in the optimal average Fisher information per neuron from 8.34 to 1.78.

3 Conclusion

The result in equation 2.14 gives the Fisher information per neuron when an infinite stimulus space is considered. It is implicit in this model that the Fisher information at any point in the stimulus space is constant (independent of the stimulus) due to an infinite range of the preferred stimuli for the neurons (see the substitution in equation 2.10). We have continued this analysis using a finite stimulus space for two reasons. First, equation 2.14 is not accurate when the tuning width is large relative to the stimulus space due to an assumption in the derivation that σ is finite. Therefore, it is convenient to use a finite stimulus space in order to ascertain accurate results at any σ. Second, physiological limits preclude both the case where the stimulus space is infinite and the case where the range of the preferred stimuli is infinite. Using a finite stimulus space and a finite range of preferred stimuli introduces edge effects that are important to consider in models of encoding accuracy simply because these edge effects must exist in animal physiology. In the case where a finite range of preferred stimuli is considered, the Fisher information cannot be independent of the stimulus, even when an infinite number of neurons is considered. This results in limited regions of the stimulus space where the encoding accuracy is tolerable. When using a finite stimulus space, the Fisher information per neuron is not constant
and therefore must be reported as an average. The fact that both of these conditions must be true leaves us with the case where we are interested only in limited regions of the stimulus space and for each neuron are concerned only with the contribution to encoding accuracy that lies within this space. In our model, the limits to the encoding accuracy are governed by the limits to the range of the preferred stimuli. The minimum Fisher information across the finite stimulus space of interest can be increased by increasing the range of the preferred stimuli. The cost of this increase is both an increase in the number of neurons and a decrease in the average Fisher information per neuron within the stimulus space. The optimal tuning width for three dimensions is dependent on the distribution of the preferred stimuli within the stimulus space. The average Fisher information per neuron at a given tuning width will change depending on the desired value for J_cutoff. However, the following rule is universal under the framework of our model: the encoding accuracy will improve with narrow tuning for one and two dimensions, and for higher dimensions there will be at least one optimal tuning width. In general, when a finite number of neurons is considered to encode a finite stimulus space, there should be an optimal tuning width (in terms of encoding accuracy) for any dimension. Although our results show an infinitesimal tuning width to be optimal for one and two dimensions, for a finite number of neurons this cannot be the case, as the tuning curves will become too narrow to cover the stimulus space without gaps (Eurich & Wilke, 2000; Wilke & Eurich, 2002). Therefore, the variance in the encoding accuracy as a function of the tuning width is also important to consider. We have based our work on the model developed by Zhang and Sejnowski (1999), assuming independent spike firing, constant tuning widths, radially symmetric tuning curves, and neuron distributions such that the estimation errors in different dimensions are always identical and uncorrelated. The model is desirable in that it is mathematically simple and therefore useful for studying the effect of dimensionality on encoding accuracy. However, when applied in a biological setting, many other factors have been shown to influence optimal tuning widths. In addition to Fisher information and the variance of encoding accuracy, an objective function for optimal tuning widths should also consider energetic constraints (Bethge, Rotermund, & Pawelzik, 2002), heterogeneity in the tuning widths across distinct stimulus dimensions (Eurich & Wilke, 2000), heterogeneity in the tuning widths within a stimulus dimension (Wilke & Eurich, 2002), noise models (Wilke & Eurich, 2002), covariance of the noise (Karbowski, 2000; Pouget, Deneve, Ducom, & Latham, 1999; Wilke & Eurich, 2002; Wu, Amari, & Nakahara, 2002), nonsymmetric tuning curves (Eurich & Wilke, 2000), decoding time and maximum firing rates (Bethge et al., 2002), hidden dimensions (Eurich & Wilke, 2000), the choice of encoding variable(s) (Eckhorn, Grüsser, Kröller, Pellnitz, & Pöpel, 1976), and biased estimators.
Appendix

The steps in the integration used to derive the average Fisher information per neuron for one dimension (⟨J_f⟩_{1D}) are outlined below (the integrations for higher dimensionalities are similar). Starting from

\[
\langle J_f \rangle_{1D} = \int_0^1 dc \int_0^1 J^{1}(x, c) \, dx, \qquad
\int_0^1 J^{1}(x, c) \, dx = \frac{F\tau}{\sigma^4} \int_0^1 (x - c)^2 \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right) dx,
\]

the inner integral follows from the antiderivative ∫ u² e^{-u²/(2σ²)} du = -σ² u e^{-u²/(2σ²)} + σ³ √(π/2) erf(u/(√2 σ)), evaluated between u = -c and u = 1 - c:

\[
\int_0^1 J^{1}(x, c) \, dx
= \frac{F\tau}{\sigma^2} \left[ (c - 1) \exp\left( -\frac{(1 - c)^2}{2\sigma^2} \right) - c \exp\left( -\frac{c^2}{2\sigma^2} \right) \right]
+ \frac{F\tau\sqrt{2\pi}}{2\sigma} \left[ \mathrm{erf}\!\left( \frac{\sqrt{2}(1 - c)}{2\sigma} \right) + \mathrm{erf}\!\left( \frac{\sqrt{2}\,c}{2\sigma} \right) \right].
\]

Integrating over c, the exponential terms give

\[
\int_0^1 c \exp\left( -\frac{c^2}{2\sigma^2} \right) dc = \sigma^2 \left( 1 - \exp\left( -\frac{1}{2\sigma^2} \right) \right),
\]

and the error-function terms integrate by parts to

\[
\int_0^1 \mathrm{erf}\!\left( \frac{\sqrt{2}\,c}{2\sigma} \right) dc
= \mathrm{erf}\!\left( \frac{\sqrt{2}}{2\sigma} \right) + \frac{\sqrt{2}\,\sigma}{\sqrt{\pi}} \left( \exp\left( -\frac{1}{2\sigma^2} \right) - 1 \right),
\]

with the term in 1 − c contributing identically by symmetry. Collecting all terms yields equation 2.17:

\[
\langle J_f \rangle_{1D} = F\tau \left[ \frac{\sqrt{2\pi}}{\sigma} \, \mathrm{erf}\!\left( \frac{\sqrt{2}}{2\sigma} \right) + 4 \exp\left( -\frac{1}{2\sigma^2} \right) - 4 \right].
\]
Acknowledgments

We thank Shawn Martin at Sandia National Laboratories and the reviewers for their guidance in presenting this work. Support for this work was provided by Sandia National Laboratories' LDRD and the Mathematics, Information, and Computational Sciences Program of the U.S. Department of Energy, and by Caltech's Beckman Institute. Sandia is a multiprogram laboratory operated by Sandia Corp., a Lockheed Martin Company, for the U.S. Department of Energy's National Nuclear Security Administration.

References

Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59(4–5), 313–318.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1(4), 371–394.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When Fisher information fails. Neural Computation, 14(10), 2317–2351.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Eckhorn, R., Grüsser, O. J., Kröller, J., Pellnitz, K., & Pöpel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological Cybernetics, 22(1), 49–60.
Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76(5), 357–363.
Eurich, C. W., & Wilke, S. D. (2000). Multidimensional encoding strategy of spiking neurons. Neural Computation, 12(7), 1519–1529.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In J. L. McClelland (Ed.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Karbowski, J. (2000). Fisher information and temporal correlations for spiking neurons with stochastic dynamics. Physical Review E, 61(4 Pt. B), 4235–4252.
Knudsen, E. I., & Konishi, M. (1978). A neural map of auditory space in the owl. Science, 200(4343), 795–797.
Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16(1), 37–68.
Lee, B. B. (1999). Single units and sensation: A retrospect. Perception, 28(12), 1493–1508.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proceedings of the Institute of Radio Engineers (New York), 47, 1940–1951.
Received September 17, 2004; accepted November 1, 2005.
LETTER
Communicated by Yann Le Cun
A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton
[email protected]
Simon Osindero
[email protected]
Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4

Yee-Whye Teh
[email protected]
Department of Computer Science, National University of Singapore, Singapore 117543

Neural Computation 18, 1527–1554 (2006)
© 2006 Massachusetts Institute of Technology
We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

1 Introduction

Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases.

We describe a model in which the top two hidden layers form an undirected associative memory (see Figure 1) and the remaining hidden layers
Figure 1: The network used to model the joint distribution of digit images and digit labels. [Schematic: a 28 × 28 pixel image feeds a layer of 500 units, then a second layer of 500 units, whose output joins 10 label units in a top layer of 2000 units; the label units could be the top level of another sensory pathway.] In this letter, each training case consists of an image and an explicit class label, but work in progress has shown that the same learning algorithm can be used if the “labels” are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits. The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.
form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:
- There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
- The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
- There is a fine-tuning algorithm that learns an excellent generative model that outperforms discriminative methods on the MNIST database of hand-written digits.
- The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
- The inference required for forming a percept is both fast and accurate.
- The learning algorithm is local. Adjustments to a synapse strength depend on only the states of the presynaptic and postsynaptic neuron.
- The communication is simple. Neurons need only to communicate their stochastic binary states.
Section 2 introduces the idea of a “complementary” prior that exactly cancels the “explaining away” phenomenon that makes inference difficult in directed models. An example of a directed belief network with complementary priors is presented.

Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights.

Section 4 introduces a fast, greedy learning algorithm for constructing multilayer directed networks one layer at a time. Using a variational bound, it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of reweighting each data vector to ensure that the next step learns something new, it re-represents it. The “weak” learner that is used to construct deep directed nets is itself an undirected graphical model.

Section 5 shows how the weights produced by the fast, greedy algorithm can be fine-tuned using the “up-down” algorithm. This is a contrastive version of the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995) that does not suffer from the “mode-averaging” problems that can cause the wake-sleep algorithm to learn poor recognition weights.

Section 6 shows the pattern recognition performance of a network with three hidden layers and about 1.7 million weights on the MNIST set of handwritten digits. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the 10,000-digit official test set. This beats the 1.5% achieved by the best backpropagation nets when they are not handcrafted for this particular application. It is also slightly better than the 1.4% errors reported by Decoste and Schoelkopf (2002) for support vector machines on the same task.

Finally, section 7 shows what happens in the mind of the network when it is running without being constrained by visual input. The network has a full generative model, so it is easy to look into its mind—we simply generate an image from its high-level representations.

Throughout the letter, we consider nets composed of stochastic binary variables, but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly connected neighbors (see appendix A for details).
Figure 2: A simple logistic belief net containing two independent, rare causes that become highly anticorrelated when we observe the house jumping. The bias of −10 on the earthquake node means that in the absence of any observation, this node is e^{10} times more likely to be off than on. If the earthquake node is on and the truck node is off, the jump node has a total input of 0, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of e^{−20}, which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the probability of both happening is e^{−10} × e^{−10} = e^{−20}. When the earthquake node is turned on, it “explains away” the evidence for the truck node.
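The caption's arithmetic can be checked directly. A minimal MATLAB-style sketch (following the conventions of the pseudocode in appendix B), assuming weights of +20 from each cause to the jump node, as the caption implies:

% posterior over the two hidden causes in Figure 2, given that the
% house jumped; weights of +20 from each cause to the jump node are
% assumed, as implied by the caption
logistic = @(x) 1./(1 + exp(-x));
b = [-10 -10];        % biases of the earthquake and truck nodes
bjump = -20;          % bias of the jump node
w = [20 20];          % assumed weights from the two causes to the jump node
post = zeros(2, 2);
for e = 0:1
  for t = 0:1
    prior = exp(e*b(1) + t*b(2));             % unnormalized prior over (e, t)
    like = logistic(bjump + e*w(1) + t*w(2)); % p(jump = 1 | e, t)
    post(e+1, t+1) = prior*like;
  end
end
post = post/sum(post(:)) % single-cause states dominate; both-on is far rarer

Running the sketch reproduces the anticorrelation: almost all of the posterior mass falls on the two single-cause configurations.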
2 Complementary Priors

The phenomenon of explaining away (illustrated in Figure 2) makes inference difficult in directed belief nets. In densely connected networks, the posterior distribution over the hidden variables is intractable except in a few special cases, such as mixture models or linear models with additive gaussian noise. Markov chain Monte Carlo methods (Neal, 1992) can be used to sample from the posterior, but they are typically very time-consuming. Variational methods (Neal & Hinton, 1998) approximate the true posterior with a more tractable distribution, and they can be used to improve a lower bound on the log probability of the training data. It is comforting that learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly, but it would be much better to find a way of eliminating explaining away altogether, even in models whose hidden variables have highly correlated effects on the visible variables. It is widely assumed that this is impossible.

A logistic belief net (Neal, 1992) is composed of stochastic binary units. When the net is used to generate data, the probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the
weights, w_{ij}, on the directed connections from the ancestors:

\[
p(s_i = 1) = \frac{1}{1 + \exp\bigl(-b_i - \sum_j s_j w_{ij}\bigr)},
\tag{2.1}
\]

where b_i is the bias of unit i.
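For concreteness, one top-down step of this generative process might be sketched as follows (MATLAB style, as in appendix B; the layer sizes, weights, and ancestor states are hypothetical):

% one top-down ancestral step in a logistic belief net (a sketch):
% sample a layer of binary units given its ancestors, using equation 2.1
logistic = @(x) 1./(1 + exp(-x));
numancestors = 20; numunits = 10;          % hypothetical layer sizes
w = 0.1*randn(numancestors, numunits);     % directed weights w_ij
b = zeros(1, numunits);                    % biases b_i
ancestors = rand(1, numancestors) > 0.5;   % binary states of the layer above
p = logistic(ancestors*w + b);             % equation 2.1 for every unit i
s = p > rand(1, numunits);                 % stochastic binary states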
If a logistic belief net has only one hidden layer, the prior distribution over the hidden variables is factorial because their binary states are chosen independently when the model is used to generate data. The nonindependence in the posterior distribution is created by the likelihood term coming from the data. Perhaps we could eliminate explaining away in the first hidden layer by using extra hidden layers to create a “complementary” prior that has exactly the opposite correlations to those in the likelihood term. Then, when the likelihood term is multiplied by the prior, we will get a posterior that is exactly factorial. It is not at all obvious that complementary priors exist, but Figure 3 shows a simple example of an infinite logistic belief net with tied weights in which the priors are complementary at every hidden layer (see appendix A for a more general treatment of the conditions under which complementary priors exist).

The use of tied weights to construct complementary priors may seem like a mere trick for making directed models equivalent to undirected ones. As we shall see, however, it leads to a novel and very efficient learning algorithm that works by progressively untying the weights in each layer from the weights in higher layers.

2.1 An Infinite Directed Model with Tied Weights. We can generate data from the infinite directed net in Figure 3 by starting with a random configuration at an infinitely deep hidden layer¹ and then performing a top-down “ancestral” pass in which the binary state of each variable in a layer is chosen from the Bernoulli distribution determined by the top-down input coming from its active parents in the layer above. In this respect, it is just like any other directed acyclic belief net. Unlike other directed nets, however, we can sample from the true posterior distribution over all of the hidden layers by starting with a data vector on the visible units and then using the transposed weight matrices to infer the factorial distributions over each hidden layer in turn. At each hidden layer, we sample from the factorial posterior before computing the factorial posterior for the layer above.² Appendix A shows that this procedure gives unbiased samples because the complementary prior at each layer ensures that the posterior distribution really is factorial.
¹ The generation process converges to the stationary distribution of the Markov chain, so we need to start at a layer that is deep compared with the time it takes for the chain to reach equilibrium.
² This is exactly the same as the inference procedure used in the wake-sleep algorithm (Hinton et al., 1995), but for the models described in this letter no variational approximation is required because the inference procedure gives unbiased samples.
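The bottom-up inference pass just described can be sketched in the same style; the tied weight matrix, the truncation depth, and the clamped data vector below are all hypothetical:

% sampling the factorial posteriors of the infinite net in Figure 3
% (a sketch): clamp data on V0, then apply the transposed tied weights
% layer by layer, sampling each layer before moving up
logistic = @(x) 1./(1 + exp(-x));
numunits = 10; depth = 5;                 % hypothetical sizes (square layers)
w = 0.1*randn(numunits, numunits);        % the tied generative weight matrix
states = cell(1, depth + 1);
states{1} = rand(1, numunits) > 0.5;      % a data vector clamped on V0
for layer = 2:depth + 1
  p = logistic(states{layer-1}*w');       % factorial posterior via W^T
  states{layer} = p > rand(1, numunits);  % sample before going higher
end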
Figure 3: An infinite logistic belief net with tied weights. [Schematic: an infinite stack of alternating layers …, V2, H1, V1, H0, V0, with visible units v_i in the V layers and hidden units h_j in the H layers; downward arrows carry the tied generative weights (W) and upward arrows carry the transposed weights (W^T) used for inference.] The downward arrows represent the generative model. The upward arrows are not part of the model. They represent the parameters that are used to infer samples from the posterior distribution at each hidden layer of the net when a data vector is clamped on V0.
Since we can sample from the true posterior, we can compute the derivatives of the log probability of the data. Let us start by computing the derivative for a generative weight, w_{ij}^{00}, from a unit j in layer H0 to unit i in layer V0 (see Figure 3). In a logistic belief net, the maximum likelihood learning rule for a single data vector, v^0, is

\[
\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \bigl\langle h_j^0 \bigl(v_i^0 - \hat{v}_i^0\bigr) \bigr\rangle,
\tag{2.2}
\]
where ⟨·⟩ denotes an average over the sampled states and v̂_i^0 is the probability that unit i would be turned on if the visible vector was stochastically
reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, V1, from the sampled binary states in the first hidden layer, H0, is exactly the same process as reconstructing the data, so v_i^1 is a sample from a Bernoulli random variable with probability v̂_i^0. The learning rule can therefore be written as

\[
\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \bigl\langle h_j^0 \bigl(v_i^0 - v_i^1\bigr) \bigr\rangle.
\tag{2.3}
\]
The dependence of v_i^1 on h_j^0 is unproblematic in the derivation of equation 2.3 from equation 2.2 because v̂_i^0 is an expectation that is conditional on h_j^0. Since the weights are replicated, the full derivative for a generative weight is obtained by summing the derivatives of the generative weights between all pairs of layers:

\[
\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} =
\bigl\langle h_j^0 \bigl(v_i^0 - v_i^1\bigr) \bigr\rangle +
\bigl\langle v_i^1 \bigl(h_j^0 - h_j^1\bigr) \bigr\rangle +
\bigl\langle h_j^1 \bigl(v_i^1 - v_i^2\bigr) \bigr\rangle + \cdots
\tag{2.4}
\]
All of the pairwise products except the first and last cancel, leaving the Boltzmann machine learning rule of equation 3.1.

3 Restricted Boltzmann Machines and Contrastive Divergence Learning

It may not be immediately obvious that the infinite directed net in Figure 3 is equivalent to a restricted Boltzmann machine (RBM). An RBM has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units. To generate data from an RBM, we can start with a random state in one of the layers and then perform alternating Gibbs sampling. All of the units in one layer are updated in parallel given the current states of the units in the other layer, and this is repeated until the system is sampling from its equilibrium distribution. Notice that this is exactly the same process as generating data from the infinite belief net with tied weights.

To perform maximum likelihood learning in an RBM, we can use the difference between two correlations. For each weight, w_{ij}, between a visible unit i and a hidden unit, j, we measure the correlation ⟨v_i^0 h_j^0⟩ when a data vector is clamped on the visible units and the hidden states are sampled from their conditional distribution, which is factorial. Then, using alternating Gibbs sampling, we run the Markov chain shown in Figure 4 until it reaches its stationary distribution and measure the correlation ⟨v_i^∞ h_j^∞⟩. The gradient of the log probability of the training data is then

\[
\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \bigl\langle v_i^0 h_j^0 \bigr\rangle - \bigl\langle v_i^\infty h_j^\infty \bigr\rangle.
\tag{3.1}
\]
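As a concrete illustration, here is a minimal sketch of this rule with the chain of Figure 4 truncated after one full Gibbs step (the contrastive divergence approximation discussed below); the sizes, learning rate, and data vector are hypothetical, and biases are omitted for brevity:

% one CD-1 update for an RBM (a sketch): truncate the chain of
% equation 3.1 after a single full step of alternating Gibbs sampling
logistic = @(x) 1./(1 + exp(-x));
numvis = 784; numhid = 500; r = 0.1;        % hypothetical sizes and rate
w = 0.01*randn(numvis, numhid);
data = rand(1, numvis) > 0.5;               % one hypothetical data vector
poshidprobs = logistic(data*w);             % up-pass: p(h = 1 | v^0)
poshidstates = poshidprobs > rand(1, numhid);
negdata = logistic(poshidstates*w') > rand(1, numvis);  % down-pass: v^1
neghidprobs = logistic(negdata*w);                      % up again: h^1
posprods = data'*poshidprobs;               % <v h> with the data clamped
negprods = negdata'*neghidprobs;            % <v h> after one full step
w = w + r*(posprods - negprods);            % approximate gradient ascent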
Figure 4: This depicts a Markov chain that uses alternating Gibbs sampling. [Schematic: hidden units j and visible units i are updated in alternation at t = 0, 1, 2, …, ∞; the correlations ⟨v_i^0 h_j^0⟩ and ⟨v_i^∞ h_j^∞⟩ are measured at the first update and at equilibrium.] In one full step of Gibbs sampling, the hidden units in the top layer are all updated in parallel by applying equation 2.1 to the inputs received from the current states of the visible units in the bottom layer; then the visible units are all updated in parallel given the current hidden states. The chain is initialized by setting the binary states of the visible units to be the same as a data vector. The correlations in the activities of a visible and a hidden unit are measured after the first update of the hidden units and again at the end of the chain. The difference of these two correlations provides the learning signal for updating the weight on the connection.
This learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights, and each step of Gibbs sampling corresponds to computing the exact posterior distribution in a layer of the infinite logistic belief net. Maximizing the log probability of the data is exactly the same as minimizing the Kullback-Leibler divergence, KL(P^0 ‖ P_θ^∞), between the distribution of the data, P^0, and the equilibrium distribution defined by the model, P_θ^∞.

In contrastive divergence learning (Hinton, 2002), we run the Markov chain for only n full steps before measuring the second correlation.³ This is equivalent to ignoring the derivatives that come from the higher layers of the infinite net. The sum of all these ignored derivatives is the derivative of the log probability of the posterior distribution in layer Vn, which is also the derivative of the Kullback-Leibler divergence between the posterior distribution in layer Vn, P_θ^n, and the equilibrium distribution defined by the model. So contrastive divergence learning minimizes the difference of two Kullback-Leibler divergences:

\[
\mathrm{KL}\bigl(P^0 \,\|\, P_\theta^\infty\bigr) - \mathrm{KL}\bigl(P_\theta^n \,\|\, P_\theta^\infty\bigr).
\tag{3.2}
\]

³ Each full step consists of updating h given v, then updating v given h.

Ignoring sampling noise, this difference is never negative because Gibbs sampling is used to produce P_θ^n from P^0, and Gibbs sampling always reduces the Kullback-Leibler divergence with the equilibrium distribution. It
is important to notice that P_θ^n depends on the current model parameters, and the way in which P_θ^n changes as the parameters change is being ignored by contrastive divergence learning. This problem does not arise with P^0 because the training data do not depend on the parameters. An empirical investigation of the relationship between the maximum likelihood and the contrastive divergence learning rules can be found in Carreira-Perpinan and Hinton (2005).

Contrastive divergence learning in a restricted Boltzmann machine is efficient enough to be practical (Mayraz & Hinton, 2001). Variations that use real-valued units and different sampling schemes are described in Teh, Welling, Osindero, and Hinton (2003) and have been quite successful for modeling the formation of topographic maps (Welling, Hinton, & Osindero, 2003), for denoising natural images (Roth & Black, 2005), or images of biological cells (Ning et al., 2005). Marks and Movellan (2001) describe a way of using contrastive divergence to perform factor analysis and Welling, Rosen-Zvi, and Hinton (2005) show that a network with logistic, binary visible units and linear, gaussian hidden units can be used for rapid document retrieval. However, it appears that the efficiency has been bought at a high price: When applied in the obvious way, contrastive divergence learning fails for deep, multilayer networks with different weights at each layer because these networks take far too long even to reach conditional equilibrium with a clamped data vector. We now show that the equivalence between RBMs and infinite directed nets with tied weights suggests an efficient learning algorithm for multilayer networks in which the weights are not tied.

4 A Greedy Learning Algorithm for Transforming Representations

An efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. To force each model in the sequence to learn something different from the previous models, the data are modified in some way after each model has been learned. In boosting (Freund, 1995), each model in the sequence is trained on reweighted data that emphasize the cases that the preceding models got wrong. In one version of principal components analysis, the variance in a modeled direction is removed, thus forcing the next modeled direction to lie in the orthogonal subspace (Sanger, 1989). In projection pursuit (Friedman & Stuetzle, 1981), the data are transformed by nonlinearly distorting one direction in the data space to remove all nongaussianity in that direction. The idea behind our greedy algorithm is to allow each model in the sequence to receive a different representation of the data. The model performs a nonlinear transformation on its input vectors and produces as output the vectors that will be used as input for the next model in the sequence.
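For the RBM-stacking instance developed in the rest of this section, the recipe can be sketched as follows; rbm_cd1 is a hypothetical helper standing for one contrastive divergence training sweep (as in the sketch after equation 3.1), and the sizes and data are likewise hypothetical:

% greedy layer-wise stacking (a sketch of the scheme developed below)
logistic = @(x) 1./(1 + exp(-x));
sizes = [784 500 500];                   % hypothetical layer widths
input = double(rand(1000, sizes(1)) > 0.5); % hypothetical binary training data
weights = cell(1, numel(sizes) - 1);
for layer = 1:numel(sizes) - 1
  w = 0.01*randn(sizes(layer), sizes(layer+1));
  for epoch = 1:30                       % 30 epochs, as in section 6.1
    w = rbm_cd1(w, input);               % hypothetical CD-1 training sweep
  end
  weights{layer} = w;
  input = logistic(input*w);             % re-represent the data for the
end                                      % next model in the sequence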
Figure 5: A hybrid network. The top two layers have undirected connections and form an associative memory. The layers below have directed, top-down generative connections that can be used to map a state of the associative memory to an image. There are also directed, bottom-up recognition connections that are used to infer a factorial representation in one layer from the binary activities in the layer below. In the greedy initial learning, the recognition connections are tied to the generative connections.
are directed. The undirected connections at the top are equivalent to having infinitely many higher layers with tied weights. There are no intralayer connections, and to simplify the analysis, all layers have the same number of units.

It is possible to learn sensible (though not optimal) values for the parameters W0 by assuming that the parameters between higher layers will be used to construct a complementary prior for W0. This is equivalent to assuming that all of the weight matrices are constrained to be equal. The task of learning W0 under this assumption reduces to the task of learning an RBM, and although this is still difficult, good approximate solutions can be found rapidly by minimizing contrastive divergence. Once W0 has been learned, the data can be mapped through W0^T to create higher-level “data” at the first hidden layer.

If the RBM is a perfect model of the original data, the higher-level “data” will already be modeled perfectly by the higher-level weight matrices. Generally, however, the RBM will not be able to model the original data perfectly, and we can make the generative model better using the following greedy algorithm:
1. Learn W0 assuming all the weight matrices are tied.

2. Freeze W0 and commit ourselves to using W0^T to infer factorial approximate posterior distributions over the states of the variables in the first hidden layer, even if subsequent changes in higher-level weights mean that this inference method is no longer correct.

3. Keeping all the higher weight matrices tied to each other, but untied from W0, learn an RBM model of the higher-level “data” that was produced by using W0^T to transform the original data.

If this greedy algorithm changes the higher-level weight matrices, it is guaranteed to improve the generative model. As shown in Neal and Hinton (1998), the negative log probability of a single data vector, v^0, under the multilayer generative model is bounded by a variational free energy, which is the expected energy under the approximating distribution, Q(h^0|v^0), minus the entropy of that distribution. For a directed model, the “energy” of the configuration v^0, h^0 is given by

\[
E(\mathbf{v}^0, \mathbf{h}^0) = -\bigl[\log p(\mathbf{h}^0) + \log p(\mathbf{v}^0|\mathbf{h}^0)\bigr],
\tag{4.1}
\]
so the bound is

\[
\log p(\mathbf{v}^0) \ge \sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0)\bigl[\log p(\mathbf{h}^0) + \log p(\mathbf{v}^0|\mathbf{h}^0)\bigr]
- \sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0) \log Q(\mathbf{h}^0|\mathbf{v}^0),
\tag{4.2}
\]

where h^0 is a binary configuration of the units in the first hidden layer, p(h^0) is the prior probability of h^0 under the current model (which is defined by the weights above H0), and Q(·|v^0) is any probability distribution over the binary configurations in the first hidden layer. The bound becomes an equality if and only if Q(·|v^0) is the true posterior distribution.

When all of the weight matrices are tied together, the factorial distribution over H0 produced by applying W0^T to a data vector is the true posterior distribution, so at step 2 of the greedy algorithm, log p(v^0) is equal to the bound. Step 2 freezes both Q(·|v^0) and p(v^0|h^0), and with these terms fixed, the derivative of the bound is the same as the derivative of

\[
\sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0) \log p(\mathbf{h}^0).
\tag{4.3}
\]
So maximizing the bound with respect to the weights in the higher layers is exactly equivalent to maximizing the log probability of a data set in which h^0 occurs with probability Q(h^0|v^0). If the bound becomes tighter, it
is possible for log p(v^0) to fall even though the lower bound on it increases, but log p(v^0) can never fall below its value at step 2 of the greedy algorithm because the bound is tight at this point and the bound always increases.

The greedy algorithm can clearly be applied recursively, so if we use the full maximum likelihood Boltzmann machine learning algorithm to learn each set of tied weights and then we untie the bottom layer of the set from the weights above, we can learn the weights one layer at a time with a guarantee that we will never decrease the bound on the log probability of the data under the model.⁴ In practice, we replace the maximum likelihood Boltzmann machine learning algorithm by contrastive divergence learning because it works well and is much faster. The use of contrastive divergence voids the guarantee, but it is still reassuring to know that extra layers are guaranteed to improve imperfect models if we learn each layer with sufficient patience.

To guarantee that the generative model is improved by greedily learning more layers, it is convenient to consider models in which all layers are the same size so that the higher-level weights can be initialized to the values learned before they are untied from the weights in the layer below. The same greedy algorithm, however, can be applied even when the layers are different sizes.

5 Back-Fitting with the Up-Down Algorithm

Learning the weight matrices one layer at a time is efficient but not optimal. Once the weights in higher layers have been learned, neither the weights nor the simple inference procedure are optimal for the lower layers. The suboptimality produced by greedy learning is relatively innocuous for supervised methods like boosting. Labels are often scarce, and each label may provide only a few bits of constraint on the parameters, so overfitting is typically more of a problem than underfitting. Going back and refitting the earlier models may therefore cause more harm than good. Unsupervised methods, however, can use very large unlabeled data sets, and each case may be very high-dimensional, thus providing many bits of constraint on a generative model. Underfitting is then a serious problem, which can be alleviated by a subsequent stage of back-fitting in which the weights that were learned first are revised to fit in better with the weights that were learned later.

After greedily learning good initial values for the weights in every layer, we untie the “recognition” weights that are used for inference from the “generative” weights that define the model, but retain the restriction that the posterior in each layer must be approximated by a factorial distribution in which the variables within a layer are conditionally independent given
⁴ The guarantee is on the expected change in the bound.
the values of the variables in the layer below. A variant of the wake-sleep algorithm described in Hinton et al. (1995) can then be used to allow the higher-level weights to influence the lower-level ones.

In the “up-pass,” the recognition weights are used in a bottom-up pass that stochastically picks a state for every hidden variable. The generative weights on the directed connections are then adjusted using the maximum likelihood learning rule in equation 2.2.⁵ The weights on the undirected connections at the top level are learned as before by fitting the top-level RBM to the posterior distribution of the penultimate layer.

The “down-pass” starts with a state of the top-level associative memory and uses the top-down generative connections to stochastically activate each lower layer in turn. During the down-pass, the top-level undirected connections and the generative directed connections are not changed. Only the bottom-up recognition weights are modified. This is equivalent to the sleep phase of the wake-sleep algorithm if the associative memory is allowed to settle to its equilibrium distribution before initiating the down-pass. But if the associative memory is initialized by an up-pass and then only allowed to run for a few iterations of alternating Gibbs sampling before initiating the down-pass, this is a “contrastive” form of the wake-sleep algorithm that eliminates the need to sample from the equilibrium distribution of the associative memory.

The contrastive form also fixes several other problems of the sleep phase. It ensures that the recognition weights are being learned for representations that resemble those used for real data, and it also helps to eliminate the problem of mode averaging. If, given a particular data vector, the current recognition weights always pick a particular mode at the level above and ignore other very different modes that are equally good at generating the data, the learning in the down-pass will not try to alter those recognition weights to recover any of the other modes as it would if the sleep phase used a pure ancestral pass. A pure ancestral pass would have to start by using prolonged Gibbs sampling to get an equilibrium sample from the top-level associative memory. By using a top-level associative memory, we also eliminate a problem in the wake phase: independent top-level units seem to be required to allow an ancestral pass, but they mean that the variational approximation is very poor for the top layer of weights.

Appendix B specifies the details of the up-down algorithm using MATLAB-style pseudocode for the network shown in Figure 1. For simplicity, there is no penalty on the weights, no momentum, and the same learning rate for all parameters. Also, the training data are reduced to a single case.
⁵ Because weights are no longer tied to the weights above them, v̂_i^0 must be computed using the states of the variables in the layer above i and the generative weights from these variables to i.
6 Performance on the MNIST Database

6.1 Training the Network. The MNIST database of handwritten digits contains 60,000 training images and 10,000 test images. Results for many different pattern recognition techniques are already published for this publicly available database, so it is ideal for evaluating new pattern recognition methods. For the basic version of the MNIST learning task, no knowledge of geometry is provided, and there is no special preprocessing or enhancement of the training set, so an unknown but fixed random permutation of the pixels would not affect the learning algorithm. For this “permutation-invariant” version of the task, the generalization performance of our network was 1.25% errors on the official test set. The network shown in Figure 1 was trained on 44,000 of the training images that were divided into 440 balanced mini-batches, each containing 10 examples of each digit class.⁶ The weights were updated after each mini-batch.

⁶ Preliminary experiments with 16 × 16 images of handwritten digits from the USPS database showed that a good way to model the joint distribution of digit images and their labels was to use an architecture of this type, but for 16 × 16 images, only three-fifths as many units were used in each hidden layer.

In the initial phase of training, the greedy algorithm described in section 4 was used to train each layer of weights separately, starting at the bottom. Each layer was trained for 30 sweeps through the training set (called “epochs”). During training, the units in the “visible” layer of each RBM had real-valued activities between 0 and 1. These were the normalized pixel intensities when learning the bottom layer of weights. For training higher layers of weights, the real-valued activities of the visible units in the RBM were the activation probabilities of the hidden units in the lower-level RBM. The hidden layer of each RBM used stochastic binary values when that RBM was being trained. The greedy training took a few hours per layer in MATLAB on a 3 GHz Xeon processor, and when it was done, the error rate on the test set was 2.49% (see below for details of how the network is tested).

When training the top layer of weights (the ones in the associative memory), the labels were provided as part of the input. The labels were represented by turning on one unit in a “softmax” group of 10 units. When the activities in this group were reconstructed from the activities in the layer above, exactly one unit was allowed to be active, and the probability of picking unit i was given by

\[
p_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)},
\tag{6.1}
\]

where x_i is the total input received by unit i.
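The pseudocode in appendix B calls a softmax helper of exactly this form; a minimal sketch (the shift by max(x) leaves equation 6.1 unchanged but avoids overflow):

% softmax over a row vector of total inputs x, as in equation 6.1
softmax = @(x) exp(x - max(x)) ./ sum(exp(x - max(x)));
p = softmax([2 1 0 0 0 0 0 0 0 1]) % probabilities over the 10 label units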
Curiously, the learning rules are unaffected by the competition between units in a softmax group, so the
synapses do not need to know which unit is competing with which other unit. The competition affects the probability of a unit turning on, but it is only this probability that affects the learning.

After the greedy layer-by-layer training, the network was trained, with a different learning rate and weight decay, for 300 epochs using the up-down algorithm described in section 5. The learning rate, momentum, and weight decay⁷ were chosen by training the network several times and observing its performance on a separate validation set of 10,000 images that were taken from the remainder of the full training set. For the first 100 epochs of the up-down algorithm, the up-pass was followed by three full iterations of alternating Gibbs sampling in the associative memory before performing the down-pass. For the second 100 epochs, 6 iterations were performed, and for the last 100 epochs, 10 iterations were performed. Each time the number of iterations of Gibbs sampling was raised, the error on the validation set decreased noticeably.

The network that performed best on the validation set was tested and had an error rate of 1.39%. This network was then trained on all 60,000 training images⁸ until its error rate on the full training set was as low as its final error rate had been on the initial training set of 44,000 images. This took a further 59 epochs, making the total learning time about a week. The final network had an error rate of 1.25%.⁹ The errors made by the network are shown in Figure 6. The 49 cases that the network gets correct but for which the second-best probability is within 0.3 of the best probability are shown in Figure 7.

The error rate of 1.25% compares very favorably with the error rates achieved by feedforward neural networks that have one or two hidden layers and are trained to optimize discrimination using the backpropagation algorithm (see Table 1). When the detailed connectivity of these networks is not handcrafted for this particular task, the best reported error rate for stochastic online learning with a separate squared error on each of the 10 output units is 2.95%. These error rates can be reduced to 1.53% in a net with one hidden layer of 800 units by using small initial weights, a separate cross-entropy error function on each output unit, and very gentle learning

⁷ No attempt was made to use different learning rates or weight decays for different layers, and the learning rate and momentum were always set quite conservatively to avoid oscillations. It is highly likely that the learning speed could be considerably improved by a more careful choice of learning parameters, though it is possible that this would lead to worse solutions.
⁸ The training set has unequal numbers of each class, so images were assigned randomly to each of the 600 mini-batches.
⁹ To check that further learning would not have significantly improved the error rate, the network was then left running with a very small learning rate and with the test error being displayed after every epoch. After six weeks, the test error was fluctuating between 1.12% and 1.31% and was 1.18% for the epoch on which the number of training errors was smallest.
Figure 6: The 125 test cases that the network got wrong. Each case is labeled by the network’s guess. The true classes are arranged in standard scan order.
(John Platt, personal communication, 2005). An almost identical result of 1.51% was achieved in a net that had 500 units in the first hidden layer and 300 in the second hidden layer by using “softmax” output units and a regularizer that penalizes the squared weights by an amount carefully chosen using a validation set. For comparison, nearest neighbor has a reported error rate (http://oldmill.uchicago.edu/wilder/Mnist/) of 3.1% if all 60,000 training cases are used (which is extremely slow) and 4.4% if 20,000 are used. This can be reduced to 2.8% and 4.0% by using an L3 norm. The only standard machine learning technique that comes close to the 1.25% error rate of our generative model on the basic task is a support vector machine that gives an error rate of 1.4% (Decoste & Schoelkopf, 2002). But it is hard to see how support vector machines can make use of the domain-specific tricks, like weight sharing and subsampling, which LeCun, Bottou, and Haffner (1998) use to improve the performance of discriminative
Figure 7: All 49 cases in which the network guessed right but had a second guess whose probability was within 0.3 of the probability of the best guess. The true classes are arranged in standard scan order.
neural networks from 1.5% to 0.95%. There is no obvious reason why weight sharing and subsampling cannot be used to reduce the error rate of the generative model, and we are currently investigating this approach. Further improvements can always be achieved by averaging the opinions of multiple networks, but this technique is available to all methods.

Substantial reductions in the error rate can be achieved by supplementing the data set with slightly transformed versions of the training data. Using one- and two-pixel translations, Decoste and Schoelkopf (2002) achieve 0.56%. Using local elastic deformations in a convolutional neural network, Simard, Steinkraus, and Platt (2003) achieve 0.4%, which is slightly better than the 0.63% achieved by the best hand-coded recognition algorithm (Belongie, Malik, & Puzicha, 2002). We have not yet explored the use of distorted data for learning generative models because many types of distortion need to be investigated, and the fine-tuning algorithm is currently too slow.

6.2 Testing the Network. One way to test the network is to use a stochastic up-pass from the image to fix the binary states of the 500 units in the lower layer of the associative memory. With these states fixed, the label units are given initial real-valued activities of 0.1, and a few iterations of alternating Gibbs sampling are then used to activate the correct label unit. This method of testing gives error rates that are almost 1% higher than the rates reported above.
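In the notation of appendix B, this test might be sketched as follows (the trained weights and biases are assumed to be in scope; the number of Gibbs iterations is a hypothetical choice):

% labeling one image by alternating Gibbs sampling in the associative
% memory (a sketch): the pen states are fixed by a stochastic up-pass,
% and the label and top-level units are then updated in alternation
logistic = @(x) 1./(1 + exp(-x));
softmax = @(x) exp(x - max(x)) ./ sum(exp(x - max(x)));
hidstates = logistic(data*vishid + hidrecbiases) > rand(1, numhid);
penstates = logistic(hidstates*hidpen + penrecbiases) > rand(1, numpen);
labprobs = 0.1*ones(1, 10);                 % initial label activities
for iter = 1:10                             % a few Gibbs iterations
  topprobs = logistic(penstates*pentop + labprobs*labtop + topbiases);
  topstates = topprobs > rand(1, numtop);
  labprobs = softmax(topstates*labtop' + labgenbiases);
end
[maxprob, guess] = max(labprobs);           % the network's label guess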
Table 1: Error Rates of Various Learning Algorithms on the MNIST Digit Recognition Task.

Version of MNIST Task | Learning Algorithm | Test Error %
Permutation invariant | Our generative model: 784 → 500 → 500 ↔ 2000 ↔ 10 | 1.25
Permutation invariant | Support vector machine: degree 9 polynomial kernel | 1.4
Permutation invariant | Backprop: 784 → 500 → 300 → 10, cross-entropy and weight-decay | 1.51
Permutation invariant | Backprop: 784 → 800 → 10, cross-entropy and early stopping | 1.53
Permutation invariant | Backprop: 784 → 500 → 150 → 10, squared error and on-line updates | 2.95
Permutation invariant | Nearest neighbor: all 60,000 examples and L3 norm | 2.8
Permutation invariant | Nearest neighbor: all 60,000 examples and L2 norm | 3.1
Permutation invariant | Nearest neighbor: 20,000 examples and L3 norm | 4.0
Permutation invariant | Nearest neighbor: 20,000 examples and L2 norm | 4.4
Unpermuted images; extra data from elastic deformations | Backprop: cross-entropy and early-stopping convolutional neural net | 0.4
Unpermuted de-skewed images; extra data from 2 pixel translations | Virtual SVM: degree 9 polynomial kernel | 0.56
Unpermuted images | Shape-context features: hand-coded matching | 0.63
Unpermuted images; extra data from affine transformations | Backprop in LeNet5: convolutional neural net | 0.8
Unpermuted images | Backprop in LeNet5: convolutional neural net | 0.95
A better method is to first fix the binary states of the 500 units in the lower layer of the associative memory and to then turn on each of the label units in turn and compute the exact free energy of the resulting 510-component binary vector. Almost all the computation required is independent of which label unit is turned on (Teh & Hinton, 2001), and this method computes the exact conditional equilibrium distribution over labels instead of approximating it by Gibbs sampling, which is what the previous method is doing. This method gives error rates that are about 0.5% higher than the ones quoted because of the stochastic decisions made in the up-pass. We can remove this noise in two ways. The simpler is to make the up-pass deterministic by using probabilities of activation in place of
Figure 8: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples.
stochastic binary states. The second is to repeat the stochastic up-pass 20 times and average either the label probabilities or the label log probabilities over the 20 repetitions before picking the best one. The two types of average give almost identical results, and these results are also very similar to using a single deterministic up-pass, which was the method used for the reported results.

7 Looking into the Mind of a Neural Network

To generate samples from the model, we perform alternating Gibbs sampling in the top-level associative memory until the Markov chain converges to the equilibrium distribution. Then we use a sample from this distribution as input to the layers below and generate an image by a single down-pass through the generative connections. If we clamp the label units to a particular class during the Gibbs sampling, we can see images from the model’s class-conditional distributions. Figure 8 shows a sequence of images for each class that were generated by allowing 1000 iterations of Gibbs sampling between samples.

We can also initialize the state of the top two layers by providing a random binary image as input. Figure 9 shows how the class-conditional state of the associative memory then evolves when it is allowed to run freely, but with the label clamped. This internal state is “observed” by performing a down-pass every 20 iterations to see what the associative memory has
Figure 9: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is initialized by an up-pass from a random binary image in which each pixel is on with a probability of 0.5. The first column shows the results of a down-pass from this initial highlevel state. Subsequent columns are produced by 20 iterations of alternating Gibbs sampling in the associative memory.
in mind. This use of the word mind is not intended to be metaphorical. We believe that a mental state is the state of a hypothetical, external world in which a high-level internal representation would constitute veridical perception. That hypothetical world is what the figure shows.

8 Conclusion

We have shown that it is possible to learn a deep, densely connected belief network one layer at a time. The obvious way to do this is to assume that the higher layers do not exist when learning the lower layers, but this is not compatible with the use of simple factorial approximations to replace the intractable posterior distribution. For these approximations to work well, we need the true posterior to be as close to factorial as possible. So instead of ignoring the higher layers, we assume that they exist but have tied weights that are constrained to implement a complementary prior that makes the true posterior exactly factorial. This is equivalent to having an undirected model that can be learned efficiently using contrastive divergence. It can also be viewed as constrained variational learning because a penalty term—the divergence between the approximate and true
posteriors—has been replaced by the constraint that the prior must make the variational approximation exact.

After each layer has been learned, its weights are untied from the weights in higher layers. As these higher-level weights change, the priors for lower layers cease to be complementary, so the true posterior distributions in lower layers are no longer factorial, and the use of the transpose of the generative weights for inference is no longer correct. Nevertheless, we can use a variational bound to show that adapting the higher-level weights improves the overall generative model.

To demonstrate the power of our fast, greedy learning algorithm, we used it to initialize the weights for a much slower fine-tuning algorithm that learns an excellent generative model of digit images and their labels. It is not clear that this is the best way to use the fast, greedy algorithm. It might be better to omit the fine-tuning and use the speed of the greedy algorithm to learn an ensemble of larger, deeper networks or a much larger training set. The network in Figure 1 has about as many parameters as 0.002 cubic millimeters of mouse cortex (Horace Barlow, personal communication, 1999), and several hundred networks of this complexity could fit within a single voxel of a high-resolution fMRI scan. This suggests that much bigger networks may be required to compete with human shape recognition abilities.

Our current generative model is limited in many ways (Lee & Mumford, 2003). It is designed for images in which nonbinary values can be treated as probabilities (which is not the case for natural images); its use of top-down feedback during perception is limited to the associative memory in the top two layers; it does not have a systematic way of dealing with perceptual invariances; it assumes that segmentation has already been performed; and it does not learn to sequentially attend to the most informative parts of objects when discrimination is difficult. It does, however, illustrate some of the major advantages of generative models as compared to discriminative ones:
- Generative models can learn low-level features without requiring feedback from the label, and they can learn many more parameters than discriminative models without overfitting. In discriminative learning, each training case constrains the parameters only by as many bits of information as are required to specify the label. For a generative model, each training case constrains the parameters by the number of bits required to specify the input.
- It is easy to see what the network has learned by generating from its model.
- It is possible to interpret the nonlinear, distributed representations in the deep hidden layers by generating images from them.
- The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore’s law.
Appendix A: Complementary Priors

A.1 General Complementarity. Consider a joint distribution over observables, x, and hidden variables, y. For a given likelihood function, P(x|y), we define the corresponding family of complementary priors to be those distributions, P(y), for which the joint distribution, P(x, y) = P(x|y)P(y), leads to posteriors, P(y|x), that exactly factorize, that is, leads to a posterior that can be expressed as P(y|x) = ∏_j P(y_j|x).

Not all functional forms of likelihood admit a complementary prior. In this appendix, we show that the following family constitutes all likelihood functions admitting a complementary prior,

\[
P(\mathbf{x}|\mathbf{y}) = \frac{1}{\Omega(\mathbf{y})} \exp\Bigl(\sum_j \Phi_j(\mathbf{x}, y_j) + \beta(\mathbf{x})\Bigr)
= \exp\Bigl(\sum_j \Phi_j(\mathbf{x}, y_j) + \beta(\mathbf{x}) - \log \Omega(\mathbf{y})\Bigr),
\tag{A.1}
\]

where Ω(y) is the normalization term. For this assertion to hold, we need to assume positivity of distributions: that both P(y) > 0 and P(x|y) > 0 for every value of y and x. The corresponding family of complementary priors then assumes the form

\[
P(\mathbf{y}) = \frac{1}{C} \exp\Bigl(\log \Omega(\mathbf{y}) + \sum_j \alpha_j(y_j)\Bigr),
\tag{A.2}
\]

where C is a constant to ensure normalization. This combination of functional forms leads to the following expression for the joint,

\[
P(\mathbf{x}, \mathbf{y}) = \frac{1}{C} \exp\Bigl(\sum_j \Phi_j(\mathbf{x}, y_j) + \beta(\mathbf{x}) + \sum_j \alpha_j(y_j)\Bigr).
\tag{A.3}
\]
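Before the formal argument, note that the forward direction can be checked in one line: multiplying equations A.1 and A.2, the log Ω(y) terms cancel, giving

\[
P(\mathbf{y}|\mathbf{x}) \;\propto\; P(\mathbf{x}|\mathbf{y})\,P(\mathbf{y})
= \frac{1}{C}\, e^{\beta(\mathbf{x})} \prod_j \exp\bigl(\Phi_j(\mathbf{x}, y_j) + \alpha_j(y_j)\bigr),
\]

which factorizes over the y_j, so the posterior is exactly of the form ∏_j P(y_j|x).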
To prove our assertion, we need to show that every likelihood function of form equation A.1 admits a complementary prior and vice versa. First, it can be directly verified that equation A.2 is a complementary prior for the likelihood functions of equation A.1. To show the converse, let us assume that P(y) is a complementary prior for some likelihood function P(x|y). Notice that the factorial form of the posterior simply means that the
joint distribution P(x, y) = P(y)P(x|y) satisfies the following set of conditional independencies: y_j ⊥⊥ y_k | x for every j ≠ k. This set of conditional independencies corresponds exactly to the relations satisfied by an undirected graphical model having edges between every hidden and observed variable and among all observed variables. By the Hammersley-Clifford theorem and using our positivity assumption, the joint distribution must be of the form of equation A.3, and the forms for the likelihood function equation A.1 and prior equation A.2 follow from this.

A.2 Complementarity for Infinite Stacks. We now consider a subset of models of the form in equation A.3 for which the likelihood also factorizes. This means that we now have two sets of conditional independencies:

\[
P(\mathbf{x}|\mathbf{y}) = \prod_i P(x_i|\mathbf{y})
\tag{A.4}
\]

\[
P(\mathbf{y}|\mathbf{x}) = \prod_j P(y_j|\mathbf{x}).
\tag{A.5}
\]
This condition is useful for our construction of the infinite stack of directed graphical models. Identifying the conditional independencies in equations A.4 and A.5 as those satisfied by a complete bipartite undirected graphical model, and again using the Hammersley-Clifford theorem (assuming positivity), we see that the following form fully characterizes all joint distributions of interest,

\[
P(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\Bigl(\sum_{i,j} \Psi_{i,j}(x_i, y_j) + \sum_i \gamma_i(x_i) + \sum_j \alpha_j(y_j)\Bigr),
\tag{A.6}
\]

while the likelihood functions take on the form

\[
P(\mathbf{x}|\mathbf{y}) = \exp\Bigl(\sum_{i,j} \Psi_{i,j}(x_i, y_j) + \sum_i \gamma_i(x_i) - \log \Omega(\mathbf{y})\Bigr).
\tag{A.7}
\]
Although it is not immediately obvious, the marginal distribution over the observables, x, in equation A.6 can also be expressed as an infinite directed model in which the parameters defining the conditional distributions between layers are tied together. An intuitive way of validating this assertion is as follows. Consider one of the methods by which we might draw samples from the marginal distribution P(x) implied by equation A.6. Starting from an arbitrary configuration of y, we would iteratively perform Gibbs sampling using, in alternation, the distributions given in equations A.4 and A.5. If we run this Markov chain for long enough, then, under the mild assumption that the chain
mixes properly, we will eventually obtain unbiased samples from the joint distribution given in equation A.6.

Now let us imagine that we unroll this sequence of Gibbs updates in space, such that we consider each parallel update of the variables to constitute states of a separate layer in a graph. This unrolled sequence of states has a purely directed structure (with conditional distributions taking the form of equations A.4 and A.5 and in alternation). By equivalence to the Gibbs sampling scheme, after many layers in such an unrolled graph, adjacent pairs of layers will have a joint distribution as given in equation A.6.

We can formalize the above intuition for unrolling the graph as follows. The basic idea is to unroll the graph “upwards” (i.e., moving away from the data), so that we can put a well-defined distribution over the infinite stack of variables. Then we verify some simple marginal and conditional properties of the joint distribution and thus demonstrate the required properties of the graph in the “downwards” direction.

Let x = x^(0), y = y^(0), x^(1), y^(1), x^(2), y^(2), . . . be a sequence (stack) of variables, the first two of which are identified as our original observed and hidden variable. Define the functions

\[
f(\mathbf{x}', \mathbf{y}') = \frac{1}{Z} \exp\Bigl(\sum_{i,j} \Psi_{i,j}(x_i', y_j') + \sum_i \gamma_i(x_i') + \sum_j \alpha_j(y_j')\Bigr)
\tag{A.8}
\]

\[
f_x(\mathbf{x}') = \sum_{\mathbf{y}'} f(\mathbf{x}', \mathbf{y}')
\tag{A.9}
\]

\[
f_y(\mathbf{y}') = \sum_{\mathbf{x}'} f(\mathbf{x}', \mathbf{y}')
\tag{A.10}
\]

\[
g_x(\mathbf{x}'|\mathbf{y}') = f(\mathbf{x}', \mathbf{y}') / f_y(\mathbf{y}')
\tag{A.11}
\]

\[
g_y(\mathbf{y}'|\mathbf{x}') = f(\mathbf{x}', \mathbf{y}') / f_x(\mathbf{x}'),
\tag{A.12}
\]

and define a joint distribution over our sequence of variables as follows:

\[
P(\mathbf{x}^{(0)}, \mathbf{y}^{(0)}) = f\bigl(\mathbf{x}^{(0)}, \mathbf{y}^{(0)}\bigr)
\tag{A.13}
\]

\[
P(\mathbf{x}^{(i)}|\mathbf{y}^{(i-1)}) = g_x\bigl(\mathbf{x}^{(i)}|\mathbf{y}^{(i-1)}\bigr), \quad i = 1, 2, \ldots
\tag{A.14}
\]

\[
P(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}) = g_y\bigl(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}\bigr), \quad i = 1, 2, \ldots
\tag{A.15}
\]
We verify by induction that the distribution has the following marginal distributions:

\[
P(\mathbf{x}^{(i)}) = f_x\bigl(\mathbf{x}^{(i)}\bigr), \quad i = 0, 1, 2, \ldots
\tag{A.16}
\]

\[
P(\mathbf{y}^{(i)}) = f_y\bigl(\mathbf{y}^{(i)}\bigr), \quad i = 0, 1, 2, \ldots
\tag{A.17}
\]
For i = 0 this is given by definition of the distribution in equation A.13. For i > 0, we have:

\[
P(\mathbf{x}^{(i)}) = \sum_{\mathbf{y}^{(i-1)}} P\bigl(\mathbf{x}^{(i)}|\mathbf{y}^{(i-1)}\bigr) P\bigl(\mathbf{y}^{(i-1)}\bigr)
= \sum_{\mathbf{y}^{(i-1)}} \frac{f\bigl(\mathbf{x}^{(i)}, \mathbf{y}^{(i-1)}\bigr)}{f_y\bigl(\mathbf{y}^{(i-1)}\bigr)}\, f_y\bigl(\mathbf{y}^{(i-1)}\bigr)
= f_x\bigl(\mathbf{x}^{(i)}\bigr)
\tag{A.18}
\]
and similarly for P(y^(i)). Now we see that the following conditional distributions also hold true:

\[
P\bigl(\mathbf{x}^{(i)}|\mathbf{y}^{(i)}\bigr) = P\bigl(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\bigr) \big/ P\bigl(\mathbf{y}^{(i)}\bigr) = g_x\bigl(\mathbf{x}^{(i)}|\mathbf{y}^{(i)}\bigr)
\tag{A.19}
\]

\[
P\bigl(\mathbf{y}^{(i)}|\mathbf{x}^{(i+1)}\bigr) = P\bigl(\mathbf{y}^{(i)}, \mathbf{x}^{(i+1)}\bigr) \big/ P\bigl(\mathbf{x}^{(i+1)}\bigr) = g_y\bigl(\mathbf{y}^{(i)}|\mathbf{x}^{(i+1)}\bigr).
\tag{A.20}
\]
So our joint distribution over the stack of variables also leads to the appropriate conditional distributions for the unrolled graph in the “downwards” direction. Inference in this infinite graph is equivalent to inference in the joint distribution over the sequence of variables, that is, given x^(0), we can obtain a sample from the posterior simply by sampling y^(0)|x^(0), x^(1)|y^(0), y^(1)|x^(1), . . . . This directly shows that our inference procedure is exact for the unrolled graph.

Appendix B: Pseudocode for Up-Down Algorithm

We now present MATLAB-style pseudocode for an implementation of the up-down algorithm described in section 5 and used for back-fitting. (This method is a contrastive version of the wake-sleep algorithm; Hinton et al., 1995.) The code outlined below assumes a network of the type shown in Figure 1 with visible inputs, label nodes, and three layers of hidden units. Before applying the up-down algorithm, we would first perform layer-wise greedy training as described in sections 3 and 4.

% UP-DOWN ALGORITHM
% the data and all biases are row vectors.
% the generative model is: lab <--> top <--> pen --> hid --> vis
% the number of units in layer foo is numfoo
% weight matrices have names fromlayer tolayer (e.g., vishid)
% "rec" is for recognition biases and "gen" is for generative biases.
% for simplicity, the same learning rate, r, is used everywhere.
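% helper functions assumed by the pseudocode below (minimal sketches,
% consistent with how logistic and softmax are used in this appendix)
logistic = @(x) 1./(1 + exp(-x));
softmax = @(x) exp(x - max(x)) ./ sum(exp(x - max(x)));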
% PERFORM A BOTTOM-UP PASS TO GET WAKE/POSITIVE PHASE
% PROBABILITIES AND SAMPLE STATES
wakehidprobs = logistic(data*vishid + hidrecbiases);
wakehidstates = wakehidprobs > rand(1, numhid);
wakepenprobs = logistic(wakehidstates*hidpen + penrecbiases);
wakepenstates = wakepenprobs > rand(1, numpen);
waketopprobs = logistic(wakepenstates*pentop + targets*labtop + topbiases);
waketopstates = waketopprobs > rand(1, numtop);
% POSITIVE PHASE STATISTICS FOR CONTRASTIVE DIVERGENCE
poslabtopstatistics = targets' * waketopstates;
pospentopstatistics = wakepenstates' * waketopstates;

% PERFORM numCDiters GIBBS SAMPLING ITERATIONS USING THE TOP LEVEL
% UNDIRECTED ASSOCIATIVE MEMORY
negtopstates = waketopstates; % to initialize loop
for iter=1:numCDiters
  negpenprobs = logistic(negtopstates*pentop' + pengenbiases);
  negpenstates = negpenprobs > rand(1, numpen);
  neglabprobs = softmax(negtopstates*labtop' + labgenbiases);
  negtopprobs = logistic(negpenstates*pentop + neglabprobs*labtop + topbiases);
  negtopstates = negtopprobs > rand(1, numtop);
end;

% NEGATIVE PHASE STATISTICS FOR CONTRASTIVE DIVERGENCE
negpentopstatistics = negpenstates'*negtopstates;
neglabtopstatistics = neglabprobs'*negtopstates;
% STARTING FROM THE END OF THE GIBBS SAMPLING RUN, PERFORM A
% TOP-DOWN GENERATIVE PASS TO GET SLEEP/NEGATIVE PHASE
% PROBABILITIES AND SAMPLE STATES
sleeppenstates = negpenstates;
sleephidprobs = logistic(sleeppenstates*penhid + hidgenbiases);
sleephidstates = sleephidprobs > rand(1, numhid);
sleepvisprobs = logistic(sleephidstates*hidvis + visgenbiases);
% PREDICTIONS
psleeppenstates = logistic(sleephidstates*hidpen + penrecbiases);
psleephidstates = logistic(sleepvisprobs*vishid + hidrecbiases);
pvisprobs = logistic(wakehidstates*hidvis + visgenbiases);
phidprobs = logistic(wakepenstates*penhid + hidgenbiases);

% UPDATES TO GENERATIVE PARAMETERS
hidvis = hidvis + r*wakehidstates'*(data - pvisprobs);
visgenbiases = visgenbiases + r*(data - pvisprobs);
penhid = penhid + r*wakepenstates'*(wakehidstates - phidprobs);
hidgenbiases = hidgenbiases + r*(wakehidstates - phidprobs);
% UPDATES TO TOP LEVEL ASSOCIATIVE MEMORY PARAMETERS
labtop = labtop + r*(poslabtopstatistics - neglabtopstatistics);
labgenbiases = labgenbiases + r*(targets - neglabprobs);
pentop = pentop + r*(pospentopstatistics - negpentopstatistics);
pengenbiases = pengenbiases + r*(wakepenstates - negpenstates);
topbiases = topbiases + r*(waketopstates - negtopstates);

% UPDATES TO RECOGNITION/INFERENCE APPROXIMATION PARAMETERS
hidpen = hidpen + r*(sleephidstates'*(sleeppenstates - psleeppenstates));
penrecbiases = penrecbiases + r*(sleeppenstates - psleeppenstates);
vishid = vishid + r*(sleepvisprobs'*(sleephidstates - psleephidstates));
hidrecbiases = hidrecbiases + r*(sleephidstates - psleephidstates);

Acknowledgments

We thank Peter Dayan, Zoubin Ghahramani, Yann Le Cun, Andriy Mnih, Radford Neal, Terry Sejnowski, and Max Welling for helpful discussions and the referees for greatly improving the manuscript. The research was supported by NSERC, the Gatsby Charitable Foundation, CFI, and OIT. G.E.H. is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning.

References

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.
Carreira-Perpinan, M. A., & Hinton, G. E. (2005). On contrastive divergence learning. In R. G. Cowell & Z. Ghahramani (Eds.), Artificial Intelligence and Statistics, 2005 (pp. 33–41). Fort Lauderdale, FL: Society for Artificial Intelligence and Statistics.
Decoste, D., & Schoelkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46, 161–190.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Friedman, J., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817–823.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. (1995). The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158–1161.
LeCun, Y., Bottou, L., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America, A, 20, 1434–1448.
Marks, T. K., & Movellan, J. R. (2001). Diffusion networks, product of experts, and factor analysis. In T. W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Proc. Int. Conf. on Independent Component Analysis (pp. 481–485). San Diego.
Mayraz, G., & Hinton, G. E. (2001). Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 189–197.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.
Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., & Barbano, P. (2005). Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9), 1360–1371.
Roth, S., & Black, M. J. (2005). Fields of experts: A framework for learning image priors. In IEEE Conf. on Computer Vision and Pattern Recognition (pp. 860–867). Piscataway, NJ: IEEE.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6), 459–473.
Simard, P. Y., Steinkraus, D., & Platt, J. (2003). Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR) (pp. 958–962). Los Alamitos, CA: IEEE Computer Society.
Teh, Y., & Hinton, G. E. (2001). Rate-coded restricted Boltzmann machines for face recognition. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 908–914). Cambridge, MA: MIT Press.
Teh, Y., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.
Welling, M., Hinton, G., & Osindero, S. (2003). Learning sparse topographic representations with products of Student-t distributions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1359–1366). Cambridge, MA: MIT Press.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488). Cambridge, MA: MIT Press.
Received June 8, 2005; accepted November 8, 2005.
LETTER
Communicated by Kechen Zhang
Optimal Tuning Widths in Population Coding of Periodic Variables

Marcelo A. Montemurro
[email protected]
Stefano Panzeri
[email protected]
Faculty of Life Sciences, University of Manchester, Manchester M60 1QD, U.K.
We study the relationship between the accuracy of a large neuronal population in encoding periodic sensory stimuli and the width of the tuning curves of individual neurons in the population. By using general simple models of population activity, we show that when considering one or two periodic stimulus features, a narrow tuning width provides better population encoding accuracy. When encoding more than two periodic stimulus features, the information conveyed by the population is instead maximal for finite values of the tuning width. These optimal values are only weakly dependent on model parameters and are similar to the width of tuning to orientation or motion direction of real visual cortical neurons. A very large tuning width leads to poor encoding accuracy, whatever the number of stimulus features encoded. Thus, optimal coding of periodic stimuli is different from that of nonperiodic stimuli, which, as shown in previous studies, would require infinitely large tuning widths when coding more than two stimulus features.

Neural Computation 18, 1555–1576 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

The width of the tuning curves of individual neurons to sensory stimuli plays an important role in determining the nature of a neuronal population code (Pouget, Deneve, Ducom, & Latham, 1999; Zhang & Sejnowski, 1999). On the one hand, narrow tuning makes single neurons highly informative about a specific range of stimulus values. On the other hand, coarse tuning increases the size of the population activated by the stimulus, but at the price of making individual neurons less precisely tuned. An important question is whether there is an optimal value of tuning width that allows a most effective trade-off between these two partly conflicting requirements of high encoding accuracy by single neurons and engagement of a large population. When considering the encoding of nonperiodic stimulus variables, such as the Cartesian coordinates of position in space, Zhang and Sejnowski (1999) demonstrated that under very general conditions, the information about the stimulus conveyed by the population
scales as σ^{D−2}, where σ stands for the width of the tuning curve and D is equal to the number of encoded stimulus features. Thus, extremely narrow tuning curves are better for encoding one nonperiodic stimulus feature, whereas extremely coarse tuning curves are better for encoding more than two nonperiodic features (Zhang & Sejnowski, 1999). However, many important stimulus variables are described by periodic quantities. Examples of such variables are the direction of motion and the orientation of a visual stimulus. Thus, it is crucial to investigate how the population accuracy in encoding periodic stimuli depends on the tuning width. Here, we address this problem, and we find that under general conditions and in marked contrast with the case of nonperiodic stimuli, the population accuracy in encoding periodic stimuli decreases quickly for large tuning widths, whatever the stimulus dimensionality. For stimulus dimensions D > 2, there is a finite optimal value of the tuning curve width for which the population conveys maximal information. If D is in the range 2 to 6, the optimal tuning widths predicted by the model are similar in magnitude to those observed in visual cortex. This suggests that the tuning widths of visual cortical neurons are efficient at transmitting information about a small number of periodic stimulus features.

2 Model of Encoding Periodic Variables

We consider a population made up of a large number N of neurons. The response of each individual neuron is quantified by the number of spikes fired in a certain poststimulus time window. Thus, the overall neuronal population response is represented as an N-dimensional spike count vector. We assume that the neurons are tuned to a small number D of periodic stimulus variables, such as the orientation or the direction of motion of a visual object. The stimulus variable will be described as an angular vector θ = (θ_1, . . . , θ_D) of dimension D. A real cortical neuron may also encode sensory variables that are not periodic, such as retinal position or speed of motion. However, for simplicity here, we will focus entirely on periodic variables. The neuronal tuning curves, which quantify the mean spike count of each neuron to the presented D-dimensional stimulus, are all taken to be identical in shape but having their maxima at different angles. Thus, each neuron is uniquely identified by its preferred angle φ. For concreteness, we choose the following multidimensional circular normal form for the tuning function,

f(θ − φ) ≡ b + f_0(θ − φ) = b + m ∏_{i=1}^{D} exp[ (cos(ν(θ_i − φ_i)) − 1) / (νσ)² ],   (2.1)
where b is the baseline (nonstimulus-induced) firing. The stimulus-modulated part of the tuning curve f_0(θ − φ) depends on σ, the width of tuning, and on m, the response modulation. As in Zhang and Sejnowski (1999), we assume that the preferred angles of the neurons are distributed across the population uniformly over the D-dimensional cuboid region [−π/ν, π/ν]^D. The parameter ν sets the period of the tuning function, which is equal to (2π)/ν. For example, ν = 1 corresponds to a period equal to 2π and would describe a motion direction angle, whereas ν = 2 corresponds to a period equal to π and would describe an orientation angle. For simplicity, we assume that different stimulus dimensions are mapped in a separable way. Thus, the stimulus-dependent part of the multidimensional tuning curve in equation 2.1 is written as a product of a one-dimensional circular normal function over the different stimulus dimensions:

g(θ_i − φ_i) = exp[ (cos(ν(θ_i − φ_i)) − 1) / (νσ)² ].   (2.2)
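As a concrete illustration of equation 2.2 (our addition; the parameter values are illustrative), the following NumPy sketch evaluates the circular normal factor and its gaussian approximation, which nearly coincide for small σ and diverge for large σ, as Figure 1 shows:

import numpy as np

def circ_normal(theta, phi, sigma, nu=2.0):
    # one-dimensional circular normal factor, equation 2.2 (angles in radians)
    return np.exp((np.cos(nu * (theta - phi)) - 1.0) / (nu * sigma) ** 2)

def gauss(theta, phi, sigma):
    # nonperiodic gaussian approximation, valid for small sigma
    return np.exp(-((theta - phi) ** 2) / (2.0 * sigma ** 2))

theta = np.linspace(-np.pi / 2, np.pi / 2, 181)
for sigma_deg in (20.0, 40.0):
    s = np.deg2rad(sigma_deg)
    gap = np.max(np.abs(circ_normal(theta, 0.0, s) - gauss(theta, 0.0, s)))
    print(f"sigma = {sigma_deg:.0f} deg: max |circular - gaussian| = {gap:.3f}")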
This tuning function has been used in neural coding models (see, e.g., Pouget, Zhang, Deneve, & Latham, 1998 and Sompolinsky, Yoon, Kang, & Shamir, 2001). It was chosen here because, unlike the most commonly used gaussian models, it is a genuinely periodic function of the stimulus, and it accurately fits experimental tuning curves for both orientation-sensitive (Swindale, 1998) and direction-sensitive (Kohn & Movshon, 2004) neurons in primary visual cortex. In Figure 1 we plot two examples of the one-dimensional circular normal distribution in equation 2.2, compared with their respective gaussian approximations. If σ is smaller than ≈20 degrees, the circular normal function closely resembles a gaussian tuning curve (see Figure 1). For tuning widths much larger than ≈20 degrees, the shape of the circular tuning functions differs substantially from the gaussian. The above assumption that the multidimensional tuning curve is just a product of one-dimensional tuning curves of individual features has been used extensively in population coding models. In addition to being mathematically convenient for its simplicity, the multiplicative form of the multidimensional tuning function describes well the tuning of neurons in higher visual cortical areas to, for example, complex multidimensional rotations (Logothetis, Pauls, & Poggio, 1995) or to local features describing complex boundary configurations (Pasupathy & Connor, 2001). We assume that the neurons in the population are uncorrelated at fixed stimulus. Thus, the stimulus-conditional probability of population response P(r|θ) is a product of the spike count distribution of each individual neuron. Although it is a simplification, this assumption is useful because it makes our model mathematically tractable, and it is sufficient to account in most cases for the majority of information transmitted by real neuronal populations (Nirenberg, Carcieri, Jacobs, & Latham, 2001; Petersen,
Figure 1: Comparison of periodic and nonperiodic tuning functions for an orientation-selective neuron coding one stimulus variable. The curves show mean firing rates as a function of the difference between the presented and the preferred stimulus of the neuron. Solid lines: periodic tuning function given by equation 2.2 for ν = 2 (orientation selectivity) and for two values of the tuning width σ. Dashed lines: nonperiodic (gaussian) tuning function for the same tuning widths. The difference between the periodic and nonperiodic tuning functions becomes apparent for large tuning widths and for angles away from the preferred one.
Panzeri, & Diamond, 2001; Oram, Hatsopoulos, Richmond, & Donoghue, 2001). However, in section 8 we introduce a specific model that takes into account cross-neuronal correlations and demonstrates that the conclusions obtained with the uncorrelated assumption are still valid in that correlated case. Following Zhang and Sejnowski (1999), we choose a general model of the activity of the single neuron in the population by requiring that the probability that the neuron with preferred angle φ emits r spikes in response to stimulus θ is an arbitrary function of the mean spike count only: P(r |θ) = S(r, f (θ − φ)).
(2.3)
In this article, some specific cases of single-neuron models that satisfy this assumption are studied in detail. We shall also derive scaling rules of the encoding efficiency as a function of σ and D that are valid for any model of single-neuron firing that satisfies equation 2.3.

3 Fisher Information

The ability of the population to encode accurately a particular stimulus value can be quantified by means of Fisher information (Cover & Thomas, 1991). When the stimulus is a D-dimensional periodic variable, Fisher information is a D × D matrix, J, whose elements i, j are measured in units of deg^{−2} and are defined as follows:

J_{i,j}(θ) = − ∫ dr P(r|θ) ∂² log P(r|θ) / ∂θ_i ∂θ_j.   (3.1)
Fisher information provides a good measure of stimulus encoding because it sets a limit on the accuracy with which a particular stimulus value can be reconstructed from a single observation of the neuronal population activity. In fact, it satisfies the following generalized Cramér-Rao matrix inequality (Cover & Thomas, 1991),

Σ ≥ J^{−1},   (3.2)

where Σ is the covariance matrix of the D-dimensional error made by any unbiased estimation method reconstructing the stimulus from the neuronal population activity. Since the neurons fire independently, the population Fisher information is simply given by the sum over all neurons of the single-neuron Fisher information (Cover & Thomas, 1991). The latter, denoted as J^{(neuron)}_{i,j}(θ − φ), has the following expression:

J^{(neuron)}_{i,j}(θ − φ) = − ∫ dr S(r, f(θ − φ)) ∂² log S(r, f(θ − φ)) / ∂θ_i ∂θ_j.   (3.3)
By computing explicitly the derivatives in the above expression, one can rewrite the single-neuron Fisher information in equation 3.3 as follows:

J^{(neuron)}_{i,j}(θ − φ) = f_0²(θ − φ) [ ∫ dr S'(r, f(θ − φ))² / S(r, f(θ − φ)) ] × sin(ν(θ_i − φ_i)) sin(ν(θ_j − φ_j)) / (ν²σ⁴),   (3.4)
where S'(r, z) = ∂S(r, z)/∂z, and we require the integral over responses in the above equation to be convergent, so that the single-neuron Fisher information is finite. Since the number of neurons in the population is assumed to be large and since a neuron is uniquely identified by the center φ of its tuning curve, we can compute the population Fisher information by approximating the sum over the neurons by an integral over the preferred angles φ,

J_{i,j}(θ) = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ J^{(neuron)}_{i,j}(θ − φ),   (3.5)
where ∫_{−π/ν}^{π/ν} d^D φ denotes the angular integration over the D-dimensional cuboid region [−π/ν, π/ν]^D. Since the term in square brackets in equation 3.4 is an even function of each of the angular variables, the integral in equation 3.5 will be nonzero only when i = j. It is also clear that because of symmetry over index permutations, the population Fisher information J_{i,j}(θ) is proportional to the identity matrix. Moreover, since the integrand in equation 3.5 is a periodic function integrated over its whole period, the Fisher information in the equation does not depend on the value of the angular stimulus variable. Thus, dropping index notation and the θ-dependency in the argument, we will denote by J the diagonal element of the Fisher information matrix and call it simply Fisher information.

4 Poisson Model

We begin our analysis of population coding of periodic stimuli by considering a neuronal firing model satisfying equation 2.3: we assume that single-neuron statistics at fixed stimulus is described by a Poisson process with mean given by the neuronal tuning function, equation 2.1:

P(r|θ) = [f(θ − φ)]^r exp(−f(θ − φ)) / r!   (4.1)
The Fisher information conveyed by the Poisson neuronal population is as follows,

J = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ [ (f_0²(θ − φ) / f(θ − φ)) sin²(ν(θ_1 − φ_1)) / (ν²σ⁴) ],   (4.2)
where the term in square brackets is the single-neuron Fisher information, and θ1 and φ1 denote the projections of θ and φ along the first stimulus dimension. In the following, we will study how the Poisson population Fisher information depends on the model parameters.
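A direct numerical evaluation of equation 4.2 for D = 1 is straightforward; the sketch below is our addition, using ν = 2 and the illustrative m = 5 of Figure 2, with σ in radians (so J/N comes out in rad^{−2} rather than the deg^{−2} of the text):

import numpy as np
from scipy.integrate import quad

def poisson_fisher_per_neuron(sigma, m=5.0, b=0.0, nu=2.0):
    # J/N from equation 4.2 with D = 1; theta is set to 0 without loss
    # of generality, since J does not depend on the stimulus value.
    def integrand(phi):
        f0 = m * np.exp((np.cos(nu * phi) - 1.0) / (nu * sigma) ** 2)
        return (f0 ** 2 / (b + f0)) * np.sin(nu * phi) ** 2 / (nu ** 2 * sigma ** 4)
    val, _ = quad(integrand, -np.pi / nu, np.pi / nu)
    return nu / (2.0 * np.pi) * val

for sigma_deg in (10.0, 30.0, 60.0):
    print(sigma_deg, poisson_fisher_per_neuron(np.deg2rad(sigma_deg)))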
Figure 2: (A) Fisher information per neuron J /N as a function of the tuning curve width σ for a population of orientation-selective “Poisson” neurons, for stimulus dimensions D = 1, 2, 3, and 4. Solid lines correspond to the periodic stimulus Fisher information J and were calculated with equation 4.3. For plotting, the parameter m (which has only a trivial multiplicative effect) was fixed to 5. Dotted lines correspond to the nonperiodic Fisher information J np and were calculated using gaussian tuning curves of width σ (Zhang & Sejnowski, 1999). (B) Fisher information per neuron J /N for a population of gaussian independent neurons, for different dimensions of the periodic stimulus variable. Parameters were as follows: m = 5; b = 0.5, α = 1, and β = 1.
4.1 Analytical Solution with No Baseline Firing. First, we consider the case in which there is no baseline firing: b = 0. In this case, it is possible to integrate exactly equation 4.2 and obtain the following analytical solution for J,

J = (Nm / σ²) K_1(ν²σ²) K_0^{D−1}(ν²σ²),   (4.3)
where

K_n(x) = e^{−1/x} I_n(1/x),   (4.4)
and I_n(x) stands for the nth-order modified Bessel function of the first kind. As in the nonperiodic case (Zhang & Sejnowski, 1999), N and m affect the Fisher information in equation 4.3 only as trivial multiplicative factors. Thus, we can focus on the dependence of Fisher information on σ and D. Figure 2A compares the periodic-stimulus Fisher information J (normalized in the plot as Fisher information per neuron J/N) to that obtained in the nonperiodic case (which we will denote as J_np, and has
a very simple σ-dependence: J_np ∝ σ^{D−2}; see Zhang & Sejnowski, 1999). It is apparent that the periodic-stimulus Fisher information J depends on σ in a more complex way than the nonperiodic one J_np. While for D = 1 there is no qualitative difference between the periodic and the nonperiodic case (with both J and J_np being divergent at σ = 0), significant differences appear for D ≥ 2. If D = 2, J is not constant with σ, but it has a maximum at σ = 0 and then decays rapidly. If D > 2, in sharp contrast with J_np, J exhibits a maximum at finite σ. The optimal values of σ that maximized J were 26.6, 34.1, 39.9, and 44.9 degrees for D = 3, 4, 5, and 6, respectively. The dependence of J on σ and D and its relation with J_np can be understood by comparing their respective expressions and taking into account the properties of the functions K_n(ν²σ²), as follows. For small σ, K_n(ν²σ²) scales to zero as σ. Thus, the small-σ scaling of the periodic-stimulus model is identical to that of Zhang and Sejnowski (1999). This is because as σ → 0, the periodic tuning function tends to a gaussian and σ can be rescaled away from the angular integrals as in the nonperiodic case. For large σ, K_0(ν²σ²) increases monotonically toward 1, whereas K_1(ν²σ²) (which has a maximum at νσ ≈ 0.8) decreases toward zero as 1/σ². Thus, J decreases rapidly to zero as 1/σ⁴ for any D. This is very different from the nonperiodic case, in which J_np grows to infinity for large σ if D > 2 (Zhang & Sejnowski, 1999). The occurrence of a finite-σ maximum of J can also be understood in terms of the K_n functions. If D is 1 or 2, then the dominant factor is 1/σ², which leads to a maximum of J at σ = 0. For larger (but finite) D, the term K_0^{D−1} becomes more and more important, and thus the maximum of J is shifted toward larger σ. However, unlike in the nonperiodic case, since K_0 saturates at 1 and K_1 goes to zero for σ → ∞, this maximum must be reached at a finite σ value. An infinite optimal σ value can only be reached in the D → ∞ limit, where K_0^{D−1} is dominant and the other terms can be neglected. It is interesting to compare the optimal values of tuning width obtained with our model with the widths of orientation tuning curves observed in visual cortex (summarized in Table 1). The width of orientation tuning of V1 neurons is typically in the range of 17 to 21 degrees. In higher cortical visual areas, the tuning curves get progressively broader: σ is approximately 26 degrees in MT and 38 degrees in V4. Thus, tuning widths of cortical neurons are neither too narrow nor too wide and are similar in magnitude to the optimal ones obtained from information-theoretic principles when considering multidimensional periodic stimuli. What is the advantage of using tuning widths in this intermediate range of 15 to 40 degrees observed in cortex? Our model results suggest that unlike very narrow or very wide tuning widths, tuning widths in this intermediate range are rather efficient at conveying information over a range of stimulus dimensions. Intermediate tuning widths
Table 1: Typical Values of Tuning Widths to Either Orientation (O) or Motion Direction (D) of Neurons in Different Visual Cortex Areas.

Species   Area   Stimulus   σ [deg]
Ferret    V1     O          15–17∗
Cat       V1     O          14.6†
Macaque   V1     O          19–22∗
Macaque   MT     O          24–27∗
Macaque   V4     O          38†
Macaque   V1     D          25–29∗
Macaque   MT     D          35–40∗

Notes: These values were taken or derived from published reports. † = the original data were reported as the standard deviation σ of the experimental tuning curve. ∗ = the original published values were given as full widths at half height, and we then converted into σ using equation 2.1 with D = 1. Since the conversion depends on the value of the baseline firing b, we report the converted σ assuming a baseline firing ranging from 10% of the response modulation m (lower σ value) to zero baseline firing (upper σ value). Sources: Usrey, Sceniak, and Chapman (2003); Henry, Dreher, and Bishop (1974); Albright (1984); McAdams and Maunsell (1999).
(e.g., σ = 25 degrees) would not be efficient for D = 1. However, they would be highly efficient at encoding a handful of periodic stimulus features (e.g., D = 2, · · · , 5). On the other hand, using a small width of tuning, say, σ = 5 degrees, would be more efficient for D = 1, only marginally more efficient for D = 2, but very inefficient for D > 3. Using a large σ of 90 degrees would lead to poor information transmission for any value of D below ≈15. Therefore, the advantage of intermediate tuning widths is that they offer a highly efficient information transmission across a range of stimulus dimensions, provided that the population encodes more than one stimulus feature. The results obtained for orientation tuning are easily extended to coding of direction stimulus variables (i.e., ν = 1). It is easy to see that apart from an overall multiplicative factor ν 2 , the Fisher information in equation 4.3 depends on σ only through the product νσ (Notice that this is true not only for the Poisson model solution in equation 4.3, but also for the general Fisher information in equation 3.5). Thus, the dependence of the information on σ for direction-selective populations will be the same as that obtained for orientation-selective neurons, with an overall rescaling of σ by 2. Therefore, for any D, optimal tuning widths for direction-selective populations are exactly twice those corresponding to orientation-selective populations. Table 1 shows that in both V1 and MT, the motion direction tuning widths are larger than the orientation tuning widths, by a factor of ≈1.5. Cortical direction tuning widths are hence also efficient for D > 1.
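The quoted optima are easy to reproduce from equation 4.3 with a grid search; in the sketch below (our addition) K_n is computed with SciPy's exponentially scaled Bessel function, since K_n(x) = e^{−1/x} I_n(1/x) = ive(n, 1/x):

import numpy as np
from scipy.special import ive   # ive(n, z) = iv(n, z) * exp(-|z|)

def K(n, x):
    return ive(n, 1.0 / x)      # equation 4.4, computed stably

nu = 2.0                        # orientation selectivity
sigma = np.deg2rad(np.linspace(1.0, 89.0, 4000))
x = (nu * sigma) ** 2
for D in (3, 4, 5, 6):
    J = K(1, x) * K(0, x) ** (D - 1) / sigma ** 2   # equation 4.3, N and m dropped
    print(D, round(float(np.rad2deg(sigma[np.argmax(J)])), 1))
    # expected to be close to 26.6, 34.1, 39.9, 44.9 degrees, respectively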
4.2 Effect of Baseline Firing. If there is baseline firing, that is, b > 0, it is not possible to express J by a simple analytical formula such as equation 4.3. However, we can gain insight into the effect of baseline firing by obtaining approximate solutions for the limiting cases of very small and very large b, as follows. Here we will focus on how robust the optimal σ values are that were found above in the D > 2 case. When b is very small, we can expand the integrand in equation 4.2 in powers of b/f_0(θ − φ).¹ Keeping up to first order in b/f_0(θ − φ), we obtain:

J = (Nm / σ²) [ K_1(ν²σ²) K_0^{D−1}(ν²σ²) − b/(2m) ].   (4.5)
The first term corresponds to the Fisher information, equation 4.3, for the case b = 0, which has a maximum at a certain value of tuning curve width, which we indicate by σ*. The second term is a perturbative correction that decreases monotonically when increasing σ. Consequently, for D > 2, the effect of a small baseline firing is to increase slightly the optimal σ with respect to the σ* values obtained for b = 0. How much can the optimal σ values increase when increasing the baseline firing? This can be established by considering the opposite limit, b → ∞. In this case, f_0(θ − φ)/b ≪ 1 for all angles, and we can expand the integrand in equation 4.2 in powers of f_0(θ − φ)/b. Up to first order in b^{−1}, we find:

J = (Nm² / (bσ²)) [ K_1(ν²σ²/2) K_0^{D−1}(ν²σ²/2) − (m/b) K_1(ν²σ²/3) K_0^{D−1}(ν²σ²/3) ].   (4.6)

The first contribution is the asymptotic behavior for b → ∞ and has a maximum at σ = √2 σ*. It can be shown that the correction term tends to decrease the optimal value of σ from its √2 σ* asymptotic large-b value. This suggests that the maximal effect of baseline firing on optimal σ values consists in an increase of a factor √2 with respect to the optimal value for b = 0, and this maximal effect is reached when baseline firing dominates. We have verified this prediction by integrating numerically equation 4.2 for different values of b. We found that (data not shown) for D = 3 and 4, the optimal σ values were located, for any b, between σ* and √2 σ*. Thus, optimal tuning width values are robust to the introduction of baseline firing.
¹ This expansion is valid if b/f_0(θ − φ) ≪ 1 for all angles. This condition can be met if σ is nonzero and b is sufficiently small.
5 Gaussian Model

We consider next another model of single-cell firing, the gaussian model, which describes well the statistics of spike counts of visual neurons when spikes are counted over sufficiently long windows (Abbott, Rolls, & Tovée, 1996; Gershon, Wiener, Latham, & Richmond, 1998). This model, unlike the Poisson one, permits considering the effect of autocorrelations between the spikes emitted by the same neuron. The statistics of single-neuron spike counts for the gaussian model are defined as follows:

P(r|θ) = (1 / (√(2π) ψ(ζ))) exp( −(r − f(θ − φ))² / (2ψ²(ζ)) ),   (5.1)
where the standard deviation of spike counts, ψ, is an arbitrary function of the mean: ζ = f(θ − φ). Under these assumptions, it is easy to show that the population Fisher information J has the following expression,

J = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ [ f_0²(θ − φ) (1/ψ²(ζ) + 2ψ'²(ζ)/ψ²(ζ)) sin²(ν(θ_1 − φ_1)) / (ν²σ⁴) ],   (5.2)
where the term in square brackets is the single-neuron Fisher information, and ψ'(ζ) = ∂ψ(ζ)/∂ζ. Since the variance of spike counts of cortical neurons is well described by a power law function of the mean spike count (Tolhurst, Movshon, & Thompson, 1981; Gershon et al., 1998), from now on we will assume that ψ is a power law function of the tuning curve:

ψ = α^{1/2} f^{β/2}(θ − φ).   (5.3)
In this case, equation 5.2 becomes

J = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ f_0²(θ − φ) ( 1/(α f^β(θ − φ)) + β²/(2 f²(θ − φ)) ) sin²(ν(θ_1 − φ_1)) / (ν²σ⁴).   (5.4)
For cortical neurons, the parameters α and β are typically close to 1, both distributed within the range 0.8 to 1.4 (Gershon et al., 1998; Dayan & Abbott, 2001). If all spike times are independent, then the spiking process is Poisson, and α and β would both be 1. Deviations of α and β from 1 require the presence of autocorrelations. Therefore, the study of how J depends on α and
β allows an understanding of how autocorrelations influence population coding. The dependence of the gaussian model Fisher information on σ and D is plotted in Figure 2B. For this plot, we chose m = 5. The parameters α and β were both set to 1, so that, as for the Poisson process, the variance of spike counts equals the mean. In this case, the scaling of Fisher information is essentially identical to that of the Poisson model in Figure 2A. We investigated numerically the dependence of J on α and β. We found that the shape of Fisher information plotted in Figures 2A and 2B is conserved across the entire range analyzed. In particular, J scaled to zero for large σ for any D and scaled as σ^{D−2} for small σ. For D = 1, J was always divergent at σ = 0. For D = 2, the maximum was always at σ = 0. For D > 2, J always had a maximum at finite values of σ. The position of the maximum varied slightly as a function of α and β. Results for the position of the maximum for D = 3 and D = 4 are reported in Figure 3. The position of the maxima was almost unchanged when varying α. They varied within less than 3 degrees for D = 3 and 5 degrees for D = 4 when β varied within the typical cortical range 0.8 to 1.4. The similarity between the gaussian model Fisher information, equation 5.4, and the Poisson model Fisher information, equation 4.3, can be explained by noting that if m is large and β < 2, then the second additive term within parentheses in equation 5.4 can be neglected and the gaussian model Fisher information has the following approximated solution:
J ≈ (N m^{2−β} / (α(2 − β)σ²)) K_1(ν²σ²/(2 − β)) K_0^{D−1}(ν²σ²/(2 − β)).   (5.5)
This expression is (apart from an overall multiplicative factor) identical to the population Fisher information for the Poisson model, equation 4.3, with an overall rescaling of the arguments of the K functions by a factor 2 − β. Therefore, if the exponent β of the power law variance-mean relationship, equation 5.3, is approximately 1 (as for real cortical neurons), the optimal values of the gaussian model in the large-m case are almost identical to the ones obtained for the Poisson population. It is worth noting that the α- and β-dependence of J arising from the large-m approximation in equation 5.5 are compatible with the intermediate-m numerical results of Figure 3, which showed that the optimal σ values depend very mildly on α and decrease monotonically with β at fixed D. If m is not very large, then the second additive term within parentheses in equation 5.4 contributes to the gaussian Fisher information. However, we have verified that under a wide range of parameters, this contribution is less dependent on σ than the other one and does not shift the maxima or alter the dependence of J on σ and D in a prominent way.
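In the large-m approximation of equation 5.5, the K arguments are rescaled by 2 − β, so the optimum shifts by exactly √(2 − β) relative to the Poisson value; the finite-m results of Figure 3 vary more mildly. A short grid-search check of equation 5.5 (our addition, same approach as the earlier sketch):

import numpy as np
from scipy.special import ive

nu, D = 2.0, 3
sigma = np.deg2rad(np.linspace(1.0, 89.0, 4000))
for beta in (0.8, 1.0, 1.2, 1.4):
    x = (nu * sigma) ** 2 / (2.0 - beta)            # rescaled K argument
    J = ive(1, 1.0 / x) * ive(0, 1.0 / x) ** (D - 1) / sigma ** 2
    print(beta, round(float(np.rad2deg(sigma[np.argmax(J)])), 1))

The decrease of the optimum with β at fixed D matches the trend reported in Figure 3; α enters only as a multiplicative factor and leaves the optimum unchanged.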
Figure 3: Optimal tuning width, σopt (corresponding to a maximum of Fisher information J ), for a population of orientation-selective “gaussian” neurons, as a function of the parameters α and β defining the power law spike count mean variance relationship, equation 5.3. (A) The number of encoded stimulus variables is D = 3. Here β is kept fixed while α is varied. (B) Here, α is kept fixed, β is varied, and D = 3. (C) Now D is 4; β is kept fixed, while α is varied. (D) Again D is 4. Here, α is kept fixed, and β is varied.
Thus, we conclude that the values of optimal tuning widths obtained with the Poisson model are robust to changes in model details such as the introduction of autocorrelations parameterized by α and β, as long as these parameters remain within the realistic cortical range. 6 General Multiplicative Noise Model To further check the robustness of the above conclusions, we introduced a more general model of single-neuron firing: the multiplicative noise model. This model, unlike the Poisson and the gaussian models, has the advantage of not assuming a particular functional form for the variability of neuronal responses at fixed stimulus. It assumes that the variability of spike counts in response to any stimulus is generated by an arbitrary stochastic process
modulated by an arbitrary function ψ of the mean spike rate. In this case, the spike rate of each neuron is given by the following equation,

r = f(θ − φ) + ε ψ(ζ) z,   (6.1)

where z is an arbitrary stochastic process of zero mean and unit variance with distribution Q(z), ψ(ζ) ≡ ψ(f(θ − φ)), and ε is a parameter that modulates the overall strength of the response variability. Under these assumptions, the single neuron's spike count probability is

P(r|θ) = (1 / (ε ψ(ζ))) Q( (r − f(θ − φ)) / (ε ψ(ζ)) ),   (6.2)

and the single-neuron Fisher information, equation 3.3, has the following form:

J^{(neuron)}_{i,i}(θ) = [ f_0²(θ − φ) sin²(ν(θ_i − φ_i)) / (ε² ψ²(ζ) ν²σ⁴) ] × [ T_0(Q) + 2ε ψ'(ζ) T_1(Q) + ε² ψ'²(ζ) T_2(Q) ].   (6.3)
The coefficients T_i(Q) are a function of the noise distribution only and are defined as follows:

T_0(Q) = ∫ Q'²(z)/Q(z) dz;   T_1(Q) = ∫ Q'(z) dz + ∫ Q'²(z) z / Q(z) dz;
T_2(Q) = 1 + ∫ Q'²(z) z² / Q(z) dz + 2 ∫ Q'(z) z dz.   (6.4)
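For a standard normal Q, the coefficients can be evaluated by quadrature (a sketch, our addition); since Q'(z) = −zQ(z), one expects T_0 = 1, T_1 = 0, and T_2 = 2, which recovers the gaussian-model bracket of equation 5.2 when ε = 1:

import numpy as np
from scipy.integrate import quad

Q = lambda z: np.exp(-z * z / 2.0) / np.sqrt(2.0 * np.pi)
dQ = lambda z: -z * Q(z)                 # derivative of the standard normal

T0 = quad(lambda z: dQ(z) ** 2 / Q(z), -10, 10)[0]
T1 = quad(dQ, -10, 10)[0] + quad(lambda z: dQ(z) ** 2 * z / Q(z), -10, 10)[0]
T2 = (1.0 + quad(lambda z: dQ(z) ** 2 * z ** 2 / Q(z), -10, 10)[0]
          + 2.0 * quad(lambda z: dQ(z) * z, -10, 10)[0])
print(round(T0, 6), round(T1, 6), round(T2, 6))   # approx 1.0, 0.0, 2.0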
Although equation 6.3 permits the computation of the population Fisher information for any multiplicative noise model, in the following we will concentrate on examining two interesting limiting cases: very low and very high noise strengths. In examining these two cases, we will assume that (as for the gaussian model) the variance of the noise ψ² is in a power law relationship with the mean, equation 5.3.

6.1 Low Noise Limit. We first consider the low noise limit ε ≪ 1. In this case, responses are almost deterministic, and single neurons convey information by stimulus-induced changes in the mean spike rate (Brunel & Nadal, 1998). In this limit, the population Fisher information J can be calculated by keeping the leading order in ε only in the single-neuron
Fisher information, equation 6.3, and then integrating it over the preferred stimuli, obtaining:

J = (N m^{2−β} T_0(Q) / (α(2 − β)σ²)) K_1(ν²σ²/(2 − β)) K_0^{D−1}(ν²σ²/(2 − β)).   (6.5)
The dependence of the low-noise Fisher information on σ and D is thus affected only by β, with all other model parameters entering only as an overall multiplicative factor. Equation 6.5 is identical (apart from an overall multiplicative factor) to the large-m approximation of the gaussian model Fisher information, equation 5.5. It is also identical (apart from a rescaling of the argument of K_n) to the Poisson model exact solution in equation 4.3.

6.2 High Noise Limit. We considered next the case of very noisy neurons: ε ≫ 1. In this limit, information is transmitted entirely by stimulus-modulated changes of the variance ψ². (If the variance were not stimulus dependent, then information would be zero in the high noise limit; Samengo & Treves, 2001). Taking the ε → ∞ limit of the single-neuron Fisher information, integrating it over the preferred angles, and taking into account equation 5.3, we obtain the following expression for the population Fisher information:

J = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ f_0²(θ − φ) [ T_2(Q) / (α f^β(θ − φ)) ] sin²(ν(θ_1 − φ_1)) / (ν²σ⁴).   (6.6)
In this limit, J is thus independent of the noise strength and depends on the details of the single-neuron model only through a multiplicative factor T2 (Q). It can be seen that its expression is similar to the first (and dominant) additive term of the gaussian solution, equation 5.4. Because of this similarity, when integrating numerically equation 6.6, we found that the dependence of the high-noise population Fisher information on σ and D is remarkably consistent with that obtained for the Poisson and gaussian models (see Figure 2), and that the changes in the position of the maxima when varying α and β were again similar to those reported in Figure 3 for the gaussian model (data not shown). In summary, the dependence on σ and D of the information transmitted by a population of neurons described by the general multiplicative model behaves in a way consistent with the results obtained above in both the high-noise and the low-noise limit. 7 General Scaling Limits for Large and Small σ After having analyzed three different classes of single-neuron firing models, in this section we switch back to the most general firing model in equation
2.3, in which the single-neuron statistics is an arbitrary function of the mean spike rate; we consider its small- and large-σ scaling. We will derive that for any such firing model, Fisher information scales in a universal way as σ^{D−2} for small σ and goes to zero as 1/σ⁴ for large σ. Thus, for D > 2, Fisher information reaches a maximum for a finite value of the tuning width, whatever the firing model considered.

7.1 Small σ Scaling. When σ ≪ 1, the exponential in equation 2.1 gives a nonzero contribution to the tuning function only when θ − φ ∼ 0. In this regime, we can take a Taylor expansion of the cosines in the exponent of equation 2.1, obtaining the following,

f(θ − φ) ≈ b + m exp( −|θ − φ|² / (2σ²) ) ≡ G( |θ − φ|² / σ² ),   (7.1)
where in the above G(|θ − φ|²/σ²) is the standard gaussian tuning function. The population Fisher information becomes

J = (Nν^D / (2π)^D) ∫_{−π/ν}^{π/ν} d^D φ Ã( |θ − φ|²/σ² ) (θ_1 − φ_1)²/σ⁴,   (7.2)
where, following Zhang and Sejnowski (1999), the function Ã is defined as follows:

Ã( |θ − φ|²/σ² ) = exp( −|θ − φ|²/σ² ) ∫ dr S'(r, G(|θ − φ|²/σ²))² / S(r, G(|θ − φ|²/σ²)).   (7.3)
By introducing new integration variables ξ_i = (θ_i − φ_i)/σ, one can rewrite equation 7.2 as follows:

J = σ^{D−2} (Nν^D / (2π)^D) ∫_{−π/(νσ)}^{π/(νσ)} d^D ξ Ã(|ξ|²) ξ_1².   (7.4)
By taking the small-σ limit of the above expression, one gets

J = σ^{D−2} (Nν^D / (2π)^D) ∫_{−∞}^{+∞} d^D ξ Ã(|ξ|²) ξ_1².   (7.5)
If the improper integral above converges, then the periodic Fisher information scales as σ^{D−2} for small σ, coinciding with the universal scaling rule for nonperiodic stimuli found by Zhang and Sejnowski (1999). It is important
to note that for a given neuronal model defined by a probability function S, equation 7.5 is the Fisher information obtained if the tuning curve was gaussian with variance σ and the stimulus variable ξ was nonperiodic (Zhang & Sejnowski, 1999). Thus, the small-σ scaling of the periodic-stimulus Fisher information J is ∝ σ^{D−2} whenever the Fisher information of the corresponding gaussian nonperiodic tuning model is well defined (see Wilke & Eurich, 2001, for cases and parameters in which the gaussian model nonperiodic Fisher information is not well defined).

7.2 Large σ Scaling. When σ ≫ 1, the argument of the exponentials in equation 2.1 is very small. Thus, the following expansion will be valid:
f(θ − φ) ≈ b + m + m Σ_{i=1}^{D} [cos(ν(θ_i − φ_i)) − 1] / (νσ)² ≈ f(0) + O(1/σ²).   (7.6)

Consequently, the population Fisher information becomes

J ≈ (m²/σ⁴) ∫ d^D φ ∫ dr [ S'(r, m + b)² / S(r, m + b) ] sin²(ν(θ_1 − φ_1)) / ν².   (7.7)
Thus, for large σ, Fisher information goes to zero as 1/σ⁴ for any stimulus dimensionality.

8 Correlations Between Neurons

The analysis above considered a population of independent neurons. In this section, we address the effect of correlated variability among the populations on the position of the optimal values of the tuning curve width. For simplicity, we shall consider that the firing statistics are governed by a multivariate gaussian distribution as follows,

P(r|θ) = (1 / √((2π)^N |C|)) exp( −(1/2) (r − f)^T C^{−1} (r − f) ),   (8.1)
where C is the population correlation matrix and f stands for a column vector whose components are the neuron tuning functions, that is, f ≡ [f(θ − φ^(1)), . . . , f(θ − φ^(N))]^T, φ^(k) being the preferred stimulus of the kth neuron. The correlation matrix is defined as follows,

C^(kl) = [ δ_kl + (1 − δ_kl) q ] ψ(ζ^(k)) ψ(ζ^(l)),   (8.2)
where ζ^(p) ≡ f(θ − φ^(p)), and (to ensure that the correlation matrix C is positive definite) the cross-correlation strength parameter q is allowed to vary in the range 0 ≤ q < 1. The Fisher information for this gaussian-correlated model is given by the following expression (Abbott & Dayan, 1999):

J_{ij}(θ) = (∂f^T/∂θ_i) C^{−1} (∂f/∂θ_j) + (1/2) Tr[ C^{−1} (dC/dθ_i) C^{−1} (dC/dθ_j) ].   (8.3)
By inserting the correlation matrix definition given by equation 8.2 into equation 8.3, and taking the continuous limit for N ≫ 1 (Abbott & Dayan, 1999; Wilke & Eurich, 2001), one arrives at the following expression:

J_{ij}(θ) = (Nν^{D−2} / ((2π)^D (1 − q) σ⁴)) ∫_{−π/ν}^{π/ν} d^D φ f_0²(θ − φ) ( 1/ψ²(ζ) + (2 − q) ψ'²(ζ)/ψ²(ζ) ) × sin(ν(θ_i − φ_i)) sin(ν(θ_j − φ_j)).   (8.4)
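Equation 8.3 can also be evaluated directly for a finite correlated population, a useful cross-check of the continuum result 8.4. The sketch below is our addition; it uses D = 1, ν = 2, α = β = 1, and illustrative values for the remaining parameters:

import numpy as np

N, nu, q = 60, 2.0, 0.3
m, b = 5.0, 0.5
sigma = np.deg2rad(30.0)
phi = np.linspace(-np.pi / nu, np.pi / nu, N, endpoint=False)
delta = 0.0 - phi                           # theta = 0; J is theta-independent

f0 = m * np.exp((np.cos(nu * delta) - 1.0) / (nu * sigma) ** 2)
f = b + f0
df = -f0 * np.sin(nu * delta) / (nu * sigma ** 2)   # derivative of f w.r.t. theta

psi = np.sqrt(f)                            # equation 5.3 with alpha = beta = 1
dpsi = df / (2.0 * np.sqrt(f))
M = q + (1.0 - q) * np.eye(N)               # delta_kl + (1 - delta_kl) q
C = M * np.outer(psi, psi)                  # equation 8.2
dC = M * (np.outer(dpsi, psi) + np.outer(psi, dpsi))

Cinv = np.linalg.inv(C)
J = df @ Cinv @ df + 0.5 * np.trace(Cinv @ dC @ Cinv @ dC)   # equation 8.3
print("J per neuron (rad^-2):", J / N)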
Note that as for the uncorrelated model discussed in section 2, the only nonzero elements of the Fisher information matrix are the diagonal ones; these elements are all identical and do not depend on the value of angular stimulus variable. Thus, dropping index and θ-dependency notation, we will again simply denote by J the diagonal element of the Fisher information matrix in equation 8.4. The expression of J for the correlated model in equation 8.4 is almost identical to the one for the uncorrelated gaussian model in equation 5.2, the only difference being a q -dependent relative weight of the two additive terms within parentheses in equation 8.4. Since, as explained in section 5, the first additive term in parentheses is the prominent one in shaping the σ - and D-dependence of J, the correlated model J behaves very much like the uncorrelated-gaussian-model J in equation 5.2. In particular, the cross-correlation parameter q affects the optimal σ values in a very marginal way. Thus, we expect that the only appreciable effect of the cross-correlation strength q is to shift slightly the position of the maximum for D > 2. The variations of the optimal σ values of the correlated model as a function of q (obtained integrating numerically equation 8.4) are reported in Figure 4 for D = 3 and D = 4. ψ(ζ ) was again chosen according to equation 5.3 with α = 1 and β = 1. It is apparent that the values of the optimal tuning widths found for the uncorrelated model are virtually unchanged throughout the entire allowed range of cross-correlation strength q . 9 Discussion Determining how the encoding accuracy of a population depends on the tuning width of individual neurons is crucial for understanding the
Figure 4: Optimal tuning widths for a population of orientation-selective neurons with gaussian firing statistics in the presence of uniform cross correlations, as a function of the cross-correlation strength parameter q . The two cases D = 3 (lower curve) and D = 4 (upper curve) are separately reported.
transformation of the sensory representation across different levels of the cortical hierarchy (Hinton, McClelland, & Rumelhart, 1986; Zohary, 1992). Generalizing previous results obtained for nonperiodic stimuli (Zhang & Sejnowski, 1999), here we have determined how encoding accuracy of periodic stimuli depends on the tuning width. Although we found no universal scaling rule, the dependence of the encoding accuracy of periodic stimuli on the width of tuning was remarkably consistent across neural models and model parameters. This indicates that the key properties of encoding of periodic variables are general. The encoding accuracy of angular variables differs significantly from that of nonperiodic stimuli. The two major differences are that (1) whatever the number of stimulus features D, very large tuning widths are inefficient for encoding a finite number of periodic variables, and (2) for D > 2, intermediate values of tuning widths (within the range observed in cortex) provide the population with the largest representational capacity. These differences suggest that population coding of periodic stimuli may be influenced by
computational constraints that differ from those influencing the coding of nonperiodic stimuli. As for the nonperiodic case (Zhang & Sejnowski, 1999), the neural population information about periodic stimuli depends crucially on D, the number of stimulus features being encoded. This number is not known exactly; therefore, it is difficult to derive precisely the optimal tuning widths in each cortical area. However, some evidence indicates that neurons may encode a small number of stimulus features, and in many cases the number of encoded features is more than one. For example, visual neurons extract a few features out of a rich dynamic stimulus (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Touryan, Lau, & Dan, 2002), and a small number of different stimulus maps are often found to coexist over the same area of cortical tissue (Swindale, 2004). Our results show that in this regime in which more than one (but no more than a few) periodic features are encoded, tuning widths within the range observed in cortex are efficient at transmitting information. We showed that the optimal tuning width for population coding increases with the number of periodic features being encoded. Neurophysiological data in Table 1 show a progressive increase of tuning widths across the cortical hierarchy, consistent with the idea that higher visual areas encode complex objects described by a combination of several stimulus parameters (Pasupathy & Connor, 2001), thus requiring larger tuning widths for efficient coding. In summary, our results demonstrate that tuning curves of intermediate widths offer computational advantages when considering the encoding of periodic stimuli.

Acknowledgments

We thank I. Samengo and R. Petersen for interesting discussions. This research was supported by the Human Frontier Science Program, the Royal Society, and Wellcome Trust 066372/Z/01/Z.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comp., 11, 91–101.
Abbott, L. F., Rolls, E. T., & Tovée, M. J. (1996). Representational capacity of face coding in monkeys. Cerebral Cortex, 6, 498–505.
Albright, T. D. (1984). Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophysiol., 52, 1106–1130.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brunel, N., & Nadal, J. P. (1998). Mutual information, Fisher information and population coding. Neural Computation, 10, 1731–1757.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. J. Neurophysiol., 79, 1135–1144.
Henry, G. H., Dreher, B., & Bishop, P. O. (1974). Orientation specificity of cells in cat striate cortex. Journal of Neurophysiology, 37, 1394–1409.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Kohn, A., & Movshon, A. (2004). Adaptation changes the direction tuning of macaque MT neurons. Nature Neuroscience, 7, 764–772.
Logothetis, N., Pauls, J., & Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
McAdams, C. J., & Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci., 19(1), 431–441.
Nirenberg, S., Carcieri, S. M., Jacobs, A., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Oram, M. W., Hatsopoulos, N., Richmond, B., & Donoghue, J. (2001). Excess synchrony in motor cortical neurons provides redundant direction information with that from coarse temporal measures. J. Neurophysiol., 86, 1700–1716.
Pasupathy, A., & Connor, C. E. (2001). Shape representation in area V4: Position-specific tuning for boundary conformation. J. Neurophysiol., 86, 2505–2519.
Petersen, R. S., Panzeri, S., & Diamond, M. (2001). Population coding of stimulus location in rat somatosensory cortex. Neuron, 32, 503–514.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. (1999). Narrow versus wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Comp., 10, 373–401.
Samengo, I., & Treves, A. (2001). Representational capacity of a set of independent neurons. Physical Review E, 63, 011910.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64, 051904.
Swindale, N. V. (1998). Orientation tuning curves: Empirical description and estimation of parameters. Biological Cybernetics, 78, 45–56.
Swindale, N. V. (2004). How different feature spaces may be represented in cortical maps. Network, 15, 217–242.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Experimental Brain Research, 41, 414–419.
Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22, 10811–10818.
Usrey, W. M., Sceniak, M. P., & Chapman, B. (2003). Receptive fields and response properties of neurons in layer 4 of ferret visual cortex. J. Neurophysiol., 89, 1003–1015.
Wilke, S. D., & Eurich, C. W. (2001). Representational accuracy of stochastic neural populations. Neural Comp., 14, 155–189.
Zhang, K., & Sejnowski, T. (1999). Neuronal tuning: To sharpen or to broaden? Neural Computation, 11, 75–84.
Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.
Received February 7, 2005; accepted November 3, 2005.
LETTER
Communicated by Michael Hasselmo
How Inhibitory Oscillations Can Train Neural Networks and Punish Competitors

Kenneth A. Norman∗
[email protected] Ehren Newman∗
[email protected] Greg Detre
[email protected] Sean Polyn
[email protected] Department of Psychology, Princeton University, Princeton, NJ 08544, U.S.A.
We present a new learning algorithm that leverages oscillations in the strength of neural inhibition to train neural networks. Raising inhibition can be used to identify weak parts of target memories, which are then strengthened. Conversely, lowering inhibition can be used to identify competitors, which are then weakened. To update weights, we apply the Contrastive Hebbian Learning equation to successive time steps of the network. The sign of the weight change equation varies as a function of the phase of the inhibitory oscillation. We show that the learning algorithm can memorize large numbers of correlated input patterns without collapsing and that it shows good generalization to test patterns that do not exactly match studied patterns.

1 Introduction

The idea that memories compete to be retrieved is one of the most fundamental axioms of neural processing. According to this view, retrieval success is a function of the amount of input that the target memory receives, relative to other, competing memories. If the target memory receives substantially more input than competing memories, it will be retrieved quickly and accurately; if support for the target memory is low relative to competing memories, the target memory will be retrieved slowly or not at all. This view implies that, to maximize subsequent retrieval success, the memory system can enact two distinct kinds of changes. The more obvious way to improve retrieval is to strengthen the target memory. However, it should also be possible to improve retrieval by weakening competing memories.

∗The first two authors contributed equally to this research.
Neural Computation 18, 1577–1610 (2006)
© 2006 Massachusetts Institute of Technology
Over the past decade, the idea that competitors are punished has received extensive empirical support. Studies by Michael Anderson and others have demonstrated the following regularity: if a memory receives input from the retrieval cue but the memory is not ultimately retrieved, then the memory is weakened. Furthermore, this weakening appears to be proportional to the amount of input that a competitor receives (e.g., Anderson, 2003). Put simply, the more that a memory competes (so long as it is not ultimately retrieved), the more it is punished. We review some illustrative findings in section 1.1. However, despite the obvious functional utility of punishing competitors and the large body of psychological research indicating that competitor punishment occurs, extant computational models of memory have not directly addressed the issue of competitor punishment. In section 2, we present a theory of how the brain can exploit regular oscillations in neural inhibition to punish competing memories and strengthen weak parts of target memories. In section 3, we explore the functional properties of our oscillation-based learning algorithm: How well can it store patterns, relative to other learning algorithms that do not explicitly incorporate competitor punishment, and how does increasing overlap between patterns affect the learning algorithm's ability to store item-specific features of individual patterns? The ultimate goal of this work is to show how explicitly incorporating competitor punishment mechanisms into neural learning algorithms can improve the algorithms' ability to memorize overlapping patterns and improve our understanding of how inhibitory oscillations (e.g., theta oscillations) contribute to learning.

1.1 Data Indicating Competitor Punishment. The phenomenon of competitor punishment is nicely illustrated by data from Michael Anderson's retrieval-induced forgetting (RIF) paradigm (for a comprehensive overview of RIF results, see Anderson, 2003). In the RIF paradigm, participants are given a list of category exemplar pairs (e.g., Fruit-Apple, Fruit-Kiwi, and Fruit-Pear) one at a time and are told to memorize the pairs. Immediately after viewing the pairs, participants are given a practice phase where they practice retrieving a subset of the items on the list (e.g., they are given Fruit-Pe and must say "pear"). After a delay (e.g., 20 minutes), participants' memory for all of the items on the study list is tested. Practicing Fruit-Pe improves recall of "pear" but hurts recall of competing items (e.g., "apple"). Importantly, Anderson and Spellman (1995) found that reduced recall of "apple" is evident even when subjects are given independent cues that were not presented during the practice stage (e.g., Red-A ). This finding indicates that the Apple representation itself, and not just the Fruit-Apple connection, has been weakened. Anderson, Bjork, and Bjork (1994) also found that practicing Fruit-Pe results in more punishment for strong associates of fruit ("apple") than weak associates of fruit
How Inhibitory Oscillations Can Train
1579
(“kiwi”). Intuitively, strong associates compete more than weak associates, so they suffer more competitor punishment. The fact that impaired recall of the competitor (relative to baseline) lasts on the order of tens of minutes and possibly longer suggests that impaired recall is due to changes in synaptic weights, as opposed to carryover of activation states from the retrieval practice phase. Anderson’s retrieval-induced forgetting experiments are a particularly well-characterized example of competitor punishment. Importantly, though, they are not the only example of this dynamic. To illustrate the generalized nature of this phenomenon, we briefly review findings from other domains that can be understood in terms of competitor punishment:
• Metaphor comprehension. Glucksberg, Newsome, and Goldvarg (2001) showed that word meanings that are not applicable to the current sentence suffer lasting inhibition. For example, after reading, "My lawyer is a shark," participants were slower to evaluate sentences that reference the concept "swim" (e.g., "Geese are good swimmers"). Glucksberg et al. argue that when participants read, "My lawyer is a shark," less appropriate meanings of shark (e.g., "swim") compete with more appropriate meanings (e.g., "vicious"). "Swim" receives input but loses the competition, so it is punished.

• Task switching. Mayr and Keele (2000) found that after switching from task A to task B, it is more difficult to switch back to task A than to switch to a new task (task C). This can be explained in terms of task A's competing with task B during the initial switch. Because task A competes but loses (i.e., participants eventually succeed in switching to task B), the neural representation of task A is punished, thereby making it more difficult to reactivate later.

• Negative priming. In negative priming tasks, participants have to process one object and ignore other objects on each trial (e.g., participants might be instructed to name the green object and ignore objects that are not green). Objects that were ignored sometimes reappear as target objects on subsequent trials. Studies have found that participants are slower to process objects that were ignored on previous trials compared to objects that were not ignored (see Fox, 1995, for a review). This can be explained in terms of the idea that nontarget objects compete with target objects. By attending to the target color, participants bias the competition such that the target object wins the competition. Because the representations of nontarget objects receive strong input from the stimulus array but lose the competition, these representations are punished.

• Cognitive dissonance reduction. Freedman (1965) showed that if a child is given a mild threat not to play with a toy (and then does not play with the toy), the child ends up liking the toy less. In contrast, if a
child is given a strong threat not to play with a toy, there is no attitude change. This paradigm can be understood in terms of a competition between wanting to play with the toy ("play") and not wanting to play with the toy ("don't play"). In the mild threat condition, "don't play" just barely wins over "play"; "play" is a losing competitor, so it is punished, resulting in reduced liking. In the strong threat condition, "don't play" wins easily over "play"; there is not strong competition, so "play" is not punished.

The ubiquitous presence of competitor punishment in psychology spurred us to develop a learning mechanism that can account for this dynamic. Our goal is to come up with a neural network learning algorithm that punishes representations if and only if they compete (and lose) during retrieval. Also, the learning algorithm should be able to efficiently train new patterns into the network in a manner that supports subsequent recall of these patterns. Ultimately, we believe that these goals are synergistic: pushing away competitors during training should improve the accuracy of recall at test.

2 The Learning Algorithm

In this section, we present a learning algorithm for rate-coded neural networks that can punish competitors as well as train new patterns into a network. The algorithm depends critically on changes in the strength of neural inhibition. By way of background, we first review how inhibition works in our model. Then we provide an overview of how the learning algorithm works. Finally, we provide a more detailed account of how the learning algorithm exploits changes in neural inhibition to punish competitors and strengthen weak memories.

2.1 The Role of Inhibition

Neural systems need some way of controlling excitatory neural activity so this activity does not spread across the entire system, causing a seizure. In keeping with O'Reilly and Munakata (2000), we argue that inhibitory interneurons act to control excitatory activity. Inhibitory interneurons accomplish this goal by sampling the overall amount of excitatory activity within a particular region via diffuse input projections and sending back a commensurate amount of inhibition via diffuse output projections. In this manner, inhibitory interneurons act to enforce a set point on the amount of excitatory activity. Increasing the strength of inhibition leads to a decrease in the overall amount of excitatory activity, and reducing the strength of inhibition leads to an increase in the overall amount of excitatory activity. In terms of this framework, an excitatory neuron will be active if the amount of excitation that it receives is sufficient to counteract the amount of inhibition that it receives. These active units make up the representation of the input pattern. Units that
receive a substantial amount of excitatory input, but not enough to counteract the effects of inhibition, can be thought of as competitors. Given processing noise, it is possible that these competing units could be recalled in place of target units.

2.2 Précis of the Learning Algorithm

The learning algorithm utilizes the Contrastive Hebbian Learning (CHL) weight change equation (Ackley, Hinton, & Sejnowski, 1985; Hinton & Sejnowski, 1986; Hinton, 1989; Movellan, 1990). CHL learning involves contrasting a more desirable state of network activity (sometimes called the plus state) with a less desirable state of network activity (sometimes called the minus state). The CHL equation adjusts network weights to strengthen the more desirable state of network activity (so it is more likely to occur in the future) and weaken the less desirable state of network activity (so it is less likely to occur in the future):

$$dW_{ij} = \epsilon\bigl(X_i^{+} Y_j^{+} - X_i^{-} Y_j^{-}\bigr). \tag{2.1}$$
In the above equation, $X_i$ is the activation of the presynaptic (sending) unit, and $Y_j$ is the activation of the postsynaptic (receiving) unit. The plus and minus superscripts refer to plus-state and minus-state activity, respectively. $dW_{ij}$ is the change in weight between the sending and receiving units, and $\epsilon$ is the learning rate parameter.

Our algorithm uses changes in the strength of neural inhibition to generate plus and minus patterns to feed into the CHL equation. To memorize a pattern of activity, we start by soft-clamping the target pattern onto the network. Clamp strength was tuned such that, given a normal level of inhibition, all of the target features (and only those features) are active. This pattern serves as the plus state for learning. We then create two distinct kinds of minus patterns by raising and lowering inhibition, respectively. Raising inhibition distorts the target pattern by making it harder for target units to stay on. Lowering inhibition distorts the target pattern by making it easier for nontarget units to be active.

Next, the learning algorithm updates weights by two separate CHL-based comparisons. First, it applies CHL to the difference in network activity given normal versus high inhibition. Second, it applies CHL to the difference in network activity given normal versus low inhibition.

2.2.1 Comparing Normal versus High Inhibition

At a functional level, the normal versus high-inhibition comparison strengthens weak parts of the target pattern by increasing their connectivity with other parts of the target pattern. Raising inhibition acts as a kind of stress test on the target pattern. If a target unit is receiving relatively little collateral support from other target units, such that its net input is just above threshold, raising inhibition will trigger a decrease in the activation of that unit. However, if a target unit is receiving strong collateral support, such that its net input is far above threshold, it will be relatively unaffected by this manipulation.

The CHL equation (applied to normal versus high inhibition) strengthens units that turn off when inhibition is raised, by increasing weights from other active units. These weight changes ensure that a target unit that drops out on a given trial will receive more input the next time that cue is presented. If the same pattern is presented repeatedly, eventually the input to that unit will increase to the point where it no longer drops out in the high-inhibition condition. At this point, the unit should be well connected to the rest of the target representation, making it possible for the network to complete that unit, and no further strengthening will occur.

2.2.2 Comparing Normal versus Low Inhibition

The normal versus low-inhibition comparison punishes competing units by reducing their connectivity with target units. As discussed earlier, competing units can be defined as nontarget units that (given normal levels of inhibition) receive almost enough net input to come on, but not enough input to be active in the final, settled state of the network. If a nontarget unit is located just below threshold, then lowering inhibition will cause that unit to become active. However, if a nontarget unit is far below threshold (i.e., it is not receiving strong input), it will be relatively unaffected by this manipulation.

The CHL equation (applied to normal versus low inhibition) weakens units that turn on when inhibition is lowered, by reducing weights from other active units. These weight changes ensure that a unit that competes on one trial will receive less input the next time that cue is presented. If the same cue is presented repeatedly, eventually the input to that unit will diminish to the point where it no longer activates in the low-inhibition condition. At this point, the unit is no longer a competitor, so no further punishment occurs.

2.3 Implementing the Algorithm Using Inhibitory Oscillations

The fact that the learning algorithm involves changes in the strength of inhibition led us to consider how the algorithm relates to neural theta oscillations (rhythmic changes in local field potential at a frequency of approximately 4 to 8 Hz in humans). Theta oscillations depend critically on changes in the firing of inhibitory interneurons (Buzsaki, 2002; Toth, Freund, & Miles, 1997), and there are several data points indicating that theta oscillations might play a role in learning (e.g., Seager, Johnson, Chabot, Asaka, & Berry, 2002; Huerta & Lisman, 1996). In section 4, we assess the correspondence between our algorithm and theta in more detail. At this point, the key insights are that continuous inhibitory oscillations are widespread in the brain, and these oscillations might serve as a neural substrate for our learning algorithm.

The version of the learning algorithm described in the previous section (where inhibition is set to normal, higher than normal, or lower than normal) is useful for expository purposes, but the discrete nature of the inhibitory states conflicts with the continuous nature of theta oscillations. To remedy this, we devised an implementation of the learning algorithm that oscillates inhibition in a continuous sinusoidal fashion (from higher than normal to lower than normal). Also, instead of changing weights by comparing normal versus high inhibition and normal versus low inhibition, we change weights by comparing network activity on successive time steps.

With regard to the CHL algorithm, the key intuition is that at each point in the inhibitory oscillation, the network is either moving toward its target state (i.e., the pattern of network activity when inhibition is at its normal level) or away from its target state, toward a less desirable state where there is either too little activity (in the case of high inhibition) or too much activity (in the case of low inhibition). Consider the pattern of activity at time t and the pattern of activity at time t + 1. If inhibition is moving toward its normal level, then the activity pattern at time t + 1 will be closer to the target state than the activity pattern at time t. In this situation, we will use the CHL equation to adapt weights to make the pattern of activity at time t more like the pattern at time t + 1. However, if inhibition is moving away from its normal level, then the activity pattern at time t + 1 will be farther from the target state than the activity pattern at time t. In this situation, we will use the CHL equation to adapt weights to make the pattern of activity at time t + 1 more like the pattern at time t. These rules are formalized in equation 2.2:
$$dW_{ij} =
\begin{cases}
\epsilon\bigl(X_i(t+1)\,Y_j(t+1) - X_i(t)\,Y_j(t)\bigr) & \text{if inhibition is returning to its normal value} \\
\epsilon\bigl(X_i(t)\,Y_j(t) - X_i(t+1)\,Y_j(t+1)\bigr) & \text{if inhibition is moving away from its normal value.}
\end{cases} \tag{2.2}$$
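To make the update concrete, here is a minimal sketch of equation 2.2 in Python with NumPy, assuming rate-coded activations stored as vectors; the function and variable names are our own illustration, not those of the original implementation:

```python
import numpy as np

def chl_step(W, acts_prev, acts_next, returning_to_normal, lrate=0.05):
    """One time step of the oscillating CHL update (equation 2.2).

    W[i, j] is the weight from unit i to unit j; acts_prev and acts_next
    are the unit activations at times t and t + 1. The sign of the update
    flips with the phase of the inhibitory oscillation: when inhibition is
    moving back toward its normal level, the t + 1 state is treated as the
    more desirable one.
    """
    coprod_next = np.outer(acts_next, acts_next)  # X_i(t+1) * Y_j(t+1)
    coprod_prev = np.outer(acts_prev, acts_prev)  # X_i(t) * Y_j(t)
    sign = 1.0 if returning_to_normal else -1.0
    return lrate * sign * (coprod_next - coprod_prev)

# In the simulations described below (section 2.5), these per-step updates
# are accumulated over the trial and applied to W only at the end of the trial.
```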
Note that the two equations are identical except for a change in sign. These equations collectively serve the same functions as the normal versus high-inhibition and normal versus low-inhibition comparisons described earlier: competitors are punished when the network moves between normal and low inhibition and back again, and weak parts of the target representation are strengthened when the network moves between normal and high inhibition and back again. However, instead of changing weights by comparing snapshots taken at disparate points in time, equation 2.2 achieves the same goal by comparing temporally adjacent network states. Figure 1 summarizes the learning algorithm.

[Figure 1 shows, for each phase of the inhibitory oscillation (normal to high to normal to low to normal), the change in activation of target and competitor units, the sign of the learning rate, and the resulting weight change.]

Figure 1: Summary of the combined learning algorithm, showing how target and competitor activity change during different phases of the inhibitory oscillation and how these changes in activity affect learning. Moving from normal to high back to normal inhibition serves to identify and strengthen weak parts of the target pattern. Moving from normal to low back to normal inhibition serves to identify and punish competitors.

2.4 Network Architecture and Biological Relevance

Although we think the oscillating learning algorithm may be applicable to multiple brain structures, the work described here focuses on applying the algorithm to a neocortical network architecture. McClelland, McNaughton, and O'Reilly
(1995) and many others (e.g., Hinton & Ghahramani, 1997; Grossberg, 1999) have argued that the goal of neocortical processing is to gradually develop an internal model of the structure of the environment that allows the network to generate predictions about unseen input features. According to the Complementary Learning Systems model developed by McClelland et al. (1995), one of the defining features of cortical learning is that the cortex assigns similar representations to similar inputs. This property allows the network to generalize to new patterns based on their similarity to previously encountered patterns. McClelland et al. contrast this with hippocampal learning, which (according to their model) involves assigning distinct representations to input patterns regardless of their similarity; this property allows the hippocampus to do one-trial memorization but hurts its ability to generalize to new patterns based on similarity (see also Marr, 1971; McNaughton & Morris, 1987; Rolls, 1989; Hasselmo, 1995; Norman & O'Reilly, 2003).

To implement a model of cortical learning, our initial simulations used a simple two-layer network, comprising an input-output layer and a hidden layer. The network is shown in Figure 2. The input-output layer was used to present patterns to the network. The hidden layer was allowed to self-organize. Every input-output unit was connected to every input-output unit (including itself) and to every hidden unit. All of the synaptic connections were bidirectional and modifiable according to the dictates of the learning algorithm.
[Figure 2 depicts a two-layer network: a hidden layer above an input-output layer, connected by bidirectional weights.]

Figure 2: Diagram of the network used in our simulations. Patterns were presented to the lower part of the network (the input-output layer). The upper part of the network (the hidden layer) was allowed to self-organize. Every unit in the input-output layer was connected to every input-output unit (including itself) and to every hidden unit via modifiable, bidirectional weights. All of the simulations described in the article used an 80-unit input-output layer. The hidden layer contained 40 units except when specifically noted otherwise.
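For concreteness, here is a minimal sketch of how the connectivity in Figure 2 might be laid out; the layer sizes follow the caption, but the single-matrix data layout, the initial weight range, and the treatment of hidden-to-hidden connections are our own assumptions, not details of the original PDP++/Leabra implementation:

```python
import numpy as np

N_IO = 80      # input-output layer size (Figure 2 caption)
N_HID = 40     # default hidden layer size

rng = np.random.default_rng(0)

# One weight matrix over the concatenated layers [input-output | hidden].
n = N_IO + N_HID
W = rng.uniform(0.0, 0.5, size=(n, n))   # arbitrary initial range (assumption)
W = 0.5 * (W + W.T)                      # bidirectional, here kept symmetric

# The text specifies connections among input-output units (including
# self-connections) and between the input-output and hidden layers; it does
# not mention hidden-to-hidden connections, so this sketch zeroes that block.
W[N_IO:, N_IO:] = 0.0
```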
This architecture is capable of completing missing pieces of input patterns via input-layer recurrent connections and backprojections from the hidden layer. More important, the hidden layer gives the network the ability to adaptively re-represent the inputs to facilitate this process of pattern completion. An important goal of the simulations below is to assess whether the oscillating algorithm, applied to this simple cortical architecture, can simultaneously meet the following desiderata for cortical learning (McClelland et al., 1995; O'Reilly & Norman, 2002):

• The network should assign similar hidden-layer representations to similar inputs.

• After repeated exposure to a set of patterns, the network should be able to fill in missing pieces of those patterns, even if the patterns are highly correlated (insofar as real-world input patterns show a high degree of correlation; Simoncelli & Olshausen, 2001).

• The network should be able to generalize to input patterns that resemble (but do not exactly match) trained patterns.
2.5 General Simulation Methods

The simulation was implemented using a modified version of O'Reilly's Leabra algorithm (O'Reilly & Munakata, 2000). Apart from a small number of changes listed below (most importantly, relating to the weight update algorithm and how we added an oscillating component to inhibition), all other aspects of the algorithm used here were identical to Leabra. (For a more detailed description of the Leabra algorithm, see O'Reilly & Munakata, 2000.) As per the Leabra algorithm, we explicitly simulated only excitatory units and excitatory connections between these units; we did not explicitly simulate inhibitory interneurons.

Excitatory activity was controlled by means of a k-winner-take-all (kWTA) inhibitory mechanism (O'Reilly & Munakata, 2000; Minai & Levy, 1994). The kWTA algorithm sets the amount of inhibition for each layer to a value such that at most k units in that layer show activation values above .25 (fewer than k units will be active if excitatory input does not exceed the leak current, which exerts a constant drag on unit activation). According to this algorithm, all of the units in a layer receive the same amount of inhibitory input on a given time step, but the amount of inhibition can vary across layers. The kWTA algorithm can be viewed as a shortcut that captures the "set-point" role of inhibitory interneurons while reducing computational overhead (relative to explicitly simulating the interneurons). The kWTA algorithm also makes it easy to specify the desired amount of activity in a layer by changing the k model parameter. The k parameter was set to k = 8 in both the input-output and hidden layers, except when specified otherwise.

To implement the inhibitory oscillation required for the learning algorithm, we used the following procedure. First, at each time step, we used the kWTA algorithm to compute a baseline (normal) level of inhibition. Then we added an oscillating component to the baseline inhibition value. The oscillating component was added only to the input-output layer, not the hidden layer. We limited the oscillation to the input-output layer because we wanted to build the simplest possible architecture that exhibits the desired learning dynamic. We found that adding oscillations to the hidden layer increases the complexity of the model's behavior, but it does not substantially affect learning performance in either a positive or negative fashion (see appendix C for a concrete demonstration of this point). The magnitude of the oscillating component was varied in a sinusoidal fashion from min to max (where min and max are negative and positive numbers, respectively).¹ (A code sketch of this kWTA-plus-oscillation scheme appears after the description of the comparison algorithms below.)

¹ We chose values for min and max according to the following criteria: min has to be low enough to allow competitors to activate during the low-inhibition phase, but not so low that the network becomes epileptic; max has to be high enough such that poorly supported target units turn off during the high-inhibition phase, but not so high that well-supported target units turn off also.

At the start of each training trial, the target pattern was soft-clamped onto the input-output layer. Over the course of a trial, inhibition was oscillated once from its normal value to the high-inhibition value, then back to normal, then down to the low-inhibition value, then back to normal. The onset of the inhibitory oscillation was delayed 20 time steps from the onset of the stimulus. This delay ensures that activity will reach its equilibrium state (corresponding to the retrieved memory) prior to the start of the oscillation. The period of the inhibitory oscillation was set to 80 time steps. This number was chosen because it gives the network enough time for changes in inhibition to lead to changes in activation, but no more time than was necessary. In principle, we could oscillate inhibition multiple times per stimulus. However, given the way that we calculated weight updates (see below), the effects of multiple inhibitory cycles could be simulated perfectly by staying with one oscillation per stimulus and increasing the learning rate. For a summary of key model parameters relating to the inhibitory oscillation (and other aspects of the model as well), see appendix A.

At each time step (starting at the beginning of the inhibitory oscillation), weight updates were calculated using equation 2.2. However, these weight updates were not applied until the end of the trial. This policy makes it easier to analyze network behavior because weight changes cannot feed back and influence patterns of activation within a trial.

3 Simulations

In the following simulations, we explore the oscillating algorithm's ability to meet the desiderata outlined in section 2.4. In particular, we are interested in the algorithm's ability to support omnidirectional pattern completion, that is, its ability to recall any piece of a pattern when given the rest of the pattern as a cue. The use of the term omnidirectional sets this kind of pattern completion apart from asymmetric forms of pattern completion where, for example, the first half of the pattern can cue recall of the second half, but not vice versa.

To illustrate the strengths and weaknesses of the oscillating algorithm, we compare it to O'Reilly's Leabra algorithm (O'Reilly, 1996; O'Reilly & Munakata, 2000). Leabra consists of two parts. The core of Leabra is a CHL-based error-driven learning rule, which we will refer to as Leabra-Error. In contrast to the oscillating algorithm, which uses changes in the strength of inhibition to generate patterns for CHL, the Leabra-Error algorithm learns by comparing the following two phases:
• A minus phase, where some features of the to-be-learned pattern are omitted, and the network has to fill in the missing features

• A plus phase, where the entire to-be-learned pattern is clamped on
The level of inhibition is kept constant across both phases. By comparing minus and plus patterns using CHL, the network learns to minimize the discrepancy between its "guess" about missing features and the actual pattern.

The full version of Leabra complements Leabra-Error with a simple Hebbian learning rule that (during the plus phase) strengthens weights between sending and receiving units when they are both active and weakens weights when the receiving unit is active but the sending unit is not. This Hebbian rule was developed by Grossberg (1976), who called it instar learning; O'Reilly and Munakata (2000) describe the same algorithm using the name CPCA Hebbian Learning.² Simulations conducted by O'Reilly (see, e.g., O'Reilly, 2001; O'Reilly & Munakata, 2000) have demonstrated that adding small amounts of CPCA Hebbian Learning to Leabra-Error boosts the learning performance of Leabra-Error by forcing it to represent meaningful input features in the hidden layer. As recommended by O'Reilly and Munakata (2000), our Leabra comparison simulations used a small proportion of CPCA Hebbian Learning (such that weight changes were 99% driven by Leabra-Error and 1% by CPCA Hebb).³ Finally, to compare the form of CHL inherent in Leabra-Error to the form of CHL inherent in the oscillating algorithm more directly, we also ran simulations using the Leabra-Error rule on its own (without any CPCA Hebbian Learning).

Bias weight learning was turned off in the Leabra and Leabra-Error simulations in order to better match the oscillating-algorithm simulations (which did not include bias weight learning). In graphs of simulation results, error bars indicate the standard error of the mean, computed across simulated participants. When error bars are not visible, this is because they are too small relative to the size of the symbols on the graph (and thus are covered by the symbols).
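Returning to the simulation methods of section 2.5, here is a minimal sketch of the kWTA-plus-oscillation inhibition scheme, under our own simplifying assumptions (a single layer of rate-coded units); the delay and period values mirror those quoted in the text, but min_osc and max_osc are illustrative placeholders for the unspecified min and max parameters:

```python
import numpy as np

def kwta_inhibition(net_input, k=8):
    """Simplified kWTA shortcut: pick a layer-wide inhibition level between
    the k-th and (k+1)-th largest net inputs, so at most k units exceed it."""
    srt = np.sort(net_input)[::-1]
    return 0.5 * (srt[k - 1] + srt[k])

def oscillation(step, onset=20, period=80, min_osc=-0.2, max_osc=0.2):
    """Sinusoidal component added to the baseline inhibition of the
    input-output layer. Zero before the delayed onset; one full cycle takes
    `period` time steps and runs normal -> high -> normal -> low -> normal.
    """
    if step < onset:
        return 0.0
    phase = 2.0 * np.pi * (step - onset) / period
    # A positive component raises inhibition; sin() rises first, matching
    # the normal -> high -> normal -> low -> normal trajectory. The two
    # half-cycles are scaled separately so the peaks hit max_osc and min_osc.
    amp = max_osc if np.sin(phase) >= 0 else -min_osc
    return amp * np.sin(phase)

# Total inhibition applied to every unit in the layer on a given time step:
#   g = kwta_inhibition(net_input) + oscillation(step)
```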
² It is important to emphasize that CPCA Hebbian Learning and Contrastive Hebbian Learning are completely different algorithms: the latter algorithm operates based on differences between two activation states, whereas the former algorithm operates based on single activity snapshots.

³ O'Reilly and Munakata (2000) found that higher proportions of Hebbian learning hurt performance by causing the network to overfocus on prototypical features and to underfocus on lower-frequency features. Pilot simulation work, not published here, confirms that this was true in our Leabra simulations as well.
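To make the contrast in footnote 2 concrete, here is a sketch of the single-snapshot CPCA/instar Hebbian update in the form given by O'Reilly and Munakata (2000), dW_ij = ε·y_j·(x_i − w_ij); the code framing and names are our own:

```python
import numpy as np

def cpca_hebbian_update(W, x, y, lrate=0.005):
    """CPCA/instar Hebbian update (operates on one activity snapshot,
    unlike CHL, which contrasts two activity states).

    W[i, j] is the weight from sending unit i (activation x[i]) to receiving
    unit j (activation y[j]). Active receivers move their incoming weights
    toward the current input pattern:
        dW[i, j] = lrate * y[j] * (x[i] - W[i, j])
    """
    dW = lrate * y[np.newaxis, :] * (x[:, np.newaxis] - W)
    return W + dW
```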
3.1 Simulation 1: Omnidirectional Pattern Completion as a Function of Input Pattern Overlap and Test Pattern Noise

In this simulation, we explore the oscillating algorithm's ability to memorize both correlated and uncorrelated patterns. When given a large number of correlated input patterns, some self-organizing learning algorithms have a tendency to overrepresent shared features and underrepresent item-specific features, leading to poor recall of item-specific features (e.g., Norman & O'Reilly, 2003, discuss this problem as it applies to CPCA Hebbian Learning). In this section, we show that the oscillating algorithm is not subject to this problem. To the contrary, we show that the oscillating algorithm meets all three of the desiderata outlined in section 2.4:

• The oscillating algorithm outperforms both Leabra and Leabra-Error at recalling individuating features of highly correlated input patterns, in terms of both asymptotic capacity and training speed.

• The oscillating algorithm shows good generalization to test cues that do not exactly match stored patterns.

• The oscillating algorithm learns representations that reflect the similarity structure of the input space.
3.1.1 Methods

• Input pattern creation. We gave the network 200 binary input patterns to learn. Each pattern had 8 (out of 80) units active. To generate the patterns, we started with a single prototype pattern and then distorted the prototype by randomly turning off some number of (prototype) units and turning on an equivalent number of (nonprototypical) units. By varying the number of "flipped bits," we were able to vary the average overlap between input patterns. There were three overlap conditions: 57% average overlap (achieved by flipping two bits), 28% average overlap (achieved by flipping four bits), and 11% average overlap (achieved by flipping all eight bits). We call the last condition the unrelated pattern condition because the patterns do not possess any central tendency. In creating the patterns (for all of the levels of bit flipping noted), we implemented a minimum pairwise distance constraint, such that every input pattern differed from every other input pattern by at least two (out of eight) active bits. (A code sketch of this generation procedure appears at the end of section 3.1.1.)

• Training and testing. All three algorithms were repeatedly presented with the 200-pattern training set until learning reached asymptote. After each epoch of training, we tested pattern completion by measuring the network's ability to recall a single nonprototypical feature from each pattern, given all of the other features of that pattern as a retrieval cue. In the simulations reported here, recall was marked as correct if the activation of the correct unit was larger than the activation of all of the other (noncued) input-output units.⁴

To assess the model's ability to generalize to test cues that do not exactly match studied patterns, we distorted retrieval cues by adding gaussian noise to the pattern that was clamped onto the network. Specifically, each unit's external input value was adjusted by a value sampled from a zero-mean gaussian distribution. These input values, once adjusted by noise, remained fixed throughout the trial. We manipulated the amount of noise at test by adjusting the variance of the noise distribution.

• Applying Leabra to omnidirectional pattern completion. For our Leabra and Leabra-Error simulations, we constructed minus-phase patterns by randomly blanking out four of the eight units in the input pattern, thereby forcing the network to guess the correct values of these units. In the plus phase, we clamped the full eight-unit pattern onto the input layer. Every time that a pattern was presented, a different (randomly selected) set of four units was blanked. Otherwise, if the same four units were blanked each time, the learning algorithm would learn to recall those four units but not any of the other units.⁵

• Learning rates. Based on pilot simulations, we selected .0005 as our default learning rate for Leabra and Leabra-Error. Simulations using this learning rate yielded asymptotic capacity that was almost identical to the capacity achieved with lower learning rates, and training time was within acceptable bounds. For the oscillating algorithm, we were able to achieve near-peak performance with much higher learning rates. We found that a learning rate of .05 for the oscillating algorithm yielded the best combination of high final capacity and (relatively) short training time. The number of training epochs for each algorithm/learning-rate combination was adjusted to ensure that training lasted long enough to reach asymptote. Large differences in learning rates were mirrored by commensurately large differences in training duration (e.g., the oscillating algorithm simulations with learning rate .05 took 250 epochs to reach asymptote; in contrast, Leabra simulations with learning rate .0005 took 10,000 epochs to reach asymptote).
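As promised above, here is a minimal sketch of the pattern-generation procedure described under "Input pattern creation"; the rejection-sampling loop is our own way of enforcing the minimum pairwise distance constraint, since the text does not specify how the constraint was implemented:

```python
import numpy as np

def make_patterns(n_patterns=200, n_units=80, n_active=8, n_flip=2,
                  min_diff=2, seed=0):
    """Generate binary patterns by distorting a single prototype.

    n_flip:   how many of the prototype's active units to replace
              (2 -> ~57% average overlap, 4 -> ~28%, 8 -> ~11% "unrelated")
    min_diff: every pair of patterns must differ in at least this many
              active bits (enforced by rejection sampling; an assumption)
    """
    rng = np.random.default_rng(seed)
    proto_on = rng.choice(n_units, size=n_active, replace=False)
    patterns = []
    while len(patterns) < n_patterns:
        on = set(proto_on)
        off = [u for u in range(n_units) if u not in on]
        # Turn off n_flip prototype units and turn on n_flip others.
        for u in rng.choice(list(on), size=n_flip, replace=False):
            on.remove(u)
        on.update(rng.choice(off, size=n_flip, replace=False))
        p = np.zeros(n_units)
        p[list(on)] = 1.0
        # Keep the candidate only if it satisfies the distance constraint.
        if all(np.sum((p == 1) & (q == 0)) >= min_diff for q in patterns):
            patterns.append(p)
    return np.array(patterns)
```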
⁴ We have also run pattern completion simulations using a more restrictive recall criterion, whereby recall was marked "correct" if the activation of the correct unit was more than .5 and none of the incorrect units had activation more than .5. All of the advantages of the oscillating algorithm (relative to other algorithms) that are shown in simulation 1 and simulation 2 are also present when we use this more restrictive recall criterion.

⁵ This is not the only way to train a Leabra network so it supports pattern completion. However, it is the most effective method that we were able to find. Any conclusions that we reach about Leabra and Leabra-Error are restricted to the particular variants that we used in our simulations and may not apply to simulations where other methods are used to generate partial patterns for the minus phase.
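For concreteness, here is a sketch of the two recall-scoring criteria (the default criterion from the main text and the stricter criterion from footnote 4); the function and argument names are our own:

```python
import numpy as np

def recall_correct(acts, target, cued, strict=False):
    """Score one pattern completion trial.

    Default criterion (main text): the target unit must be more active than
    every other noncued input-output unit. Strict criterion (footnote 4):
    target activation > .5 and every incorrect unit at .5 or below.
    """
    noncued = np.setdiff1d(np.arange(len(acts)), np.append(cued, target))
    if strict:
        return acts[target] > 0.5 and np.all(acts[noncued] <= 0.5)
    return acts[target] > np.max(acts[noncued])
```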
[Figure 3 comprises three panels plotting the number of patterns learned (y-axis, 0 to 200) against test pattern noise (x-axis, 0 to 16 × 10⁻²) for the oscillating algorithm, Leabra, and Leabra-Error.]

Figure 3: This figure shows the number of patterns (out of 200) successfully recalled at the end of training by each algorithm, as a function of the amount of overlap between input patterns ((A) unrelated patterns; (B) correlated patterns, 28% overlap; (C) correlated patterns, 57% overlap) and the amount of noise applied to retrieval cues at test. Leabra and Leabra-Error outperform the oscillating algorithm given low input pattern overlap and low levels of test pattern noise. However, for higher levels of input pattern overlap and test pattern noise, the oscillating algorithm outperforms Leabra and Leabra-Error.
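For concreteness, here is a sketch of how a retrieval cue can be distorted with gaussian noise of the kind plotted on the x-axes of Figure 3; as described in section 3.1.1, the noise is applied to the external inputs and then held fixed for the whole trial, and the noise level is the variance of the distribution (the function and variable names are our own):

```python
import numpy as np

def noisy_cue(pattern, cued_units, noise_var=0.06, rng=None):
    """Build a test cue: clamp the cued units of the pattern, then distort
    every unit's external input with zero-mean gaussian noise whose variance
    is the "test pattern noise" level. The noisy inputs remain fixed
    throughout the trial."""
    rng = rng or np.random.default_rng()
    cue = np.zeros_like(pattern)
    cue[cued_units] = pattern[cued_units]   # the target unit is left uncued
    cue += rng.normal(0.0, np.sqrt(noise_var), size=cue.shape)
    return cue
```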
3.1.2 Results

• Capacity. Figure 3 shows the asymptotic number of patterns learned for the oscillating algorithm, Leabra, and Leabra-Error. For unrelated input patterns and low levels of test pattern noise, the oscillating algorithm learns approximately 150 of 200 patterns, but it does less well than both Leabra and Leabra-Error. However, for higher levels of test pattern noise and higher levels of input pattern overlap, the relative position of the algorithms reverses, and the oscillating algorithm performs substantially better than Leabra and Leabra-Error. Appendix C shows that the oscillating algorithm's advantage for highly overlapping inputs is still obtained when inhibition is oscillated in the hidden layer (in addition to the input-output layer).
[Figure 4 plots hidden layer overlap (y-axis, 0 to 1) against input pattern overlap (x-axis: unrelated, 28%, 57%) for the oscillating algorithm, Leabra, and Leabra-Error.]

Figure 4: This figure plots, for the oscillating algorithm, Leabra, and Leabra-Error, the average pairwise overlap between patterns in the hidden layer (at the end of training), as a function of input-pattern overlap. Hidden-layer overlap is lower for the oscillating algorithm than for Leabra and Leabra-Error.
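The overlap values plotted in Figure 4 are cosine similarities between activity patterns, as footnote 6 below explains; here is a minimal sketch of that measure, with our own function name:

```python
import numpy as np

def mean_pairwise_overlap(patterns):
    """Average cosine overlap across all pairs of activity patterns.

    Ranges from 0 (no overlap) to 1 (maximal overlap), as in footnote 6.
    `patterns` has shape (n_patterns, n_units).
    """
    norms = np.linalg.norm(patterns, axis=1, keepdims=True)
    unit = patterns / norms                   # normalize each pattern
    sims = unit @ unit.T                      # cosine similarity matrix
    iu = np.triu_indices(len(patterns), k=1)  # count each pair once
    return sims[iu].mean()
```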
• Hidden representations. The oscillating algorithm's superior performance for high levels of input pattern overlap and test pattern noise stems from its ability to maintain reasonable levels of pattern separation on the hidden layer, even when inputs are very similar. Figure 4 plots the average pairwise overlap between patterns in the hidden layer (at the end of training) as a function of input overlap.⁶ The figure shows that all three algorithms maintain good pattern separation in the hidden layer given low input overlap, but as input overlap increases, hidden overlap increases much more sharply in the Leabra and Leabra-Error simulations than in the simulations using the oscillating algorithm. The high level of hidden layer overlap in the Leabra and Leabra-Error simulations facilitates recall of shared features but makes it difficult for the network to recall the unique features of individual patterns. This problem is especially severe given high levels of test pattern noise. When hidden representations are located close together, this increases the odds that, given a noisy input pattern, the network will slip out of the correct attractor into a neighboring attractor.

⁶ Both input overlap and hidden overlap were operationalized using a cosine distance measure; this measure ranges from zero (no overlap) to one (maximal overlap).

The oscillating algorithm's good pattern separation in the high-overlap condition is due in large part to its ability to punish competitors. If the representations of two patterns (call them pattern A and pattern B) get too close to each other, then pattern A will start appearing as a competitor (during the low-inhibition phase) during study of pattern B, and vice versa. Assuming that both A and B are presented a large number of times at training, the ensuing competitor punishment will have the effect of pushing apart the hidden layer representations of A and B so they no longer compete with one another.

Another factor that contributes to the oscillating algorithm's good performance is its ability to focus learning on features that are not already well learned. Given a large number of correlated patterns, the oscillating algorithm stops learning about prototypical features relatively early in training (once their representation is strong enough to resist increased inhibition) and focuses instead on learning idiosyncratic features of individual items (which are less able to resist increased inhibition). Reducing learning of prototypical features, relative to item-specific features, improves pattern separation and (through this) pattern completion performance.

While the oscillating algorithm shows more pattern separation than Leabra and Leabra-Error, it still possesses the key property that it assigns similar hidden representations to similar stimuli. In this respect, the oscillating algorithm (applied to this two-layer cortical network) differs strongly from hippocampal architectures that automatically assign distinct representations to stimuli (e.g., Norman & O'Reilly, 2003). To quantify the oscillating algorithm's tendency to use similarity-based representations, we computed the correlation (across all pairs of patterns) between input-layer overlap and hidden-layer overlap. Figure 5 plots this input-hidden similarity score for the oscillating algorithm, Leabra, and Leabra-Error as a function of input pattern overlap. The average similarity score for the oscillating algorithm is approximately .5. For the values of input pattern overlap plotted here, the similarity scores for the oscillating algorithm are higher than the scores for Leabra-Error but lower than the scores for Leabra.

[Figure 5 plots the input-hidden similarity score (y-axis, 0 to 1) against input pattern overlap (x-axis: unrelated, 28%, 57%) for the oscillating algorithm, Leabra, and Leabra-Error.]

Figure 5: This figure plots, for the oscillating algorithm, Leabra, and Leabra-Error, the models' tendency to assign similar hidden representations to similar input patterns. See the text for more detail on how this similarity score was computed. Similarity scores for the oscillating algorithm are higher than similarity scores associated with Leabra-Error and lower than similarity scores associated with Leabra.

The observed difference between Leabra and the oscillating algorithm can be viewed in terms of a simple trade-off: Leabra learns representations that are true to the structure of the input space, but (given similar input patterns) this fidelity leads to high hidden layer overlap that hurts recall. The oscillating algorithm gives up a small amount of this fidelity, but as a result of this sacrifice, it is much better able to recall given high levels of input pattern overlap and noisy test cues. Furthermore, we should emphasize that the trade-off observed here is a direct consequence of the limited size of the hidden layer. When hidden layer size is increased, the oscillating algorithm is able to utilize the extra hidden units to simultaneously boost similarity scores and capacity; we demonstrate this point in appendix B.

• Training speed. Finally, in addition to measuring capacity, we can also evaluate how quickly the various algorithms reach their asymptotic capacity. Across all of the conditions described above, the oscillating algorithm learns more quickly than Leabra and Leabra-Error. To illustrate this point, we selected two conditions where asymptotic capacity was approximately matched between the oscillating algorithm and either Leabra or Leabra-Error. Specifically, we compared the oscillating algorithm and Leabra-Error for unrelated input patterns with .06 test pattern noise; we also compared the oscillating algorithm and Leabra for input patterns with 57% overlap and zero test pattern noise. For each of these conditions, we plotted the time course of learning across epochs, for a variety of Leabra and Leabra-Error learning rate values, in Figure 6.
[Figure 6 contains two panels plotting the number of patterns learned (y-axis, 0 to 160) against epochs of training (x-axis, log scale from 1e+0 to 1e+5). Left panel: unrelated input patterns, test pattern noise .06 (oscillating algorithm, lr .05, versus Leabra-Error at lr .03, .01, and .0005). Right panel: 57% average input overlap, no test pattern noise (oscillating algorithm, lr .05, versus Leabra at lr .03, .01, and .0005).]

Figure 6: Training speed for the oscillating algorithm, Leabra, and Leabra-Error. The left panel plots the time course of training for the oscillating algorithm and Leabra-Error for unrelated input patterns and .06 test pattern noise; the right panel plots the time course of training for the oscillating algorithm and Leabra for 57% input overlap and zero test pattern noise. In both panels, the oscillating algorithm training curves are located to the left of the Leabra and Leabra-Error training curves, across a wide variety of Leabra and Leabra-Error learning rate values. Error bars were omitted from the graph for visual clarity. For all of the points shown here, the standard error was less than 3.5.
In Figure 6, the oscillating algorithm training curves lie to the left of the Leabra and Leabra-Error training curves across the full range of Leabra and Leabra-Error learning rates that we tested (ranging from .0005 to .03). While it is possible to increase the initial speed of learning in Leabra and Leabra-Error by increasing the learning rate parameter, this also has the effect of lowering the asymptotic capacity of the Leabra and Leabra-Error networks to below the asymptotic capacity of the oscillating algorithm. The Leabra and Leabra-Error variants used here learn more slowly than the oscillating algorithm because they learn about only a subset of intraitem associations on each trial. For example, if the top four units are blanked during the minus phase in Leabra-Error, the network will learn how to complete from the bottom four units to the top four units, but it will not learn how to complete from the top four to the bottom four. Thus, it takes multiple passes through the study set (blanking a different set of units each time) for Leabra-Error to strengthen all of the different connections that are
required to support omnidirectional pattern completion. In contrast, the oscillating algorithm has the ability to learn about the whole pattern on each trial.

3.2 Simulation 2: Three-Layer Autoencoder

In this simulation, we set out to replicate and extend the results of simulation 1 using a different network architecture: the three-layer autoencoder (Ackley et al., 1985). These networks consist of an input layer that is connected to a hidden layer, which in turn is connected to an output layer. During training, the to-be-learned pattern is presented (in its entirety) to the input layer, and activity is allowed to propagate through the network. The goal of autoencoder learning is to adjust network weights so the network is able to reconstruct a copy of the input pattern on the output layer. The main difference between the autoencoder architecture and the two-layer architecture used in simulation 1 is that, in the autoencoder architecture, there are no direct connections between the input units that receive external input (from the retrieval cue) at test and the to-be-retrieved output unit; everything has to go through the hidden layer. Thus, the autoencoder architecture constitutes a more stringent test of whether a learning algorithm can develop information-preserving hidden representations that support pattern completion.

We compared the oscillating algorithm's ability to learn patterns in the autoencoder architecture to Leabra and Leabra-Error. Also, the way the network was structured (with distinct input and output layers) made it possible for us to explore two additional comparison algorithms: Almeida-Pineda recurrent backpropagation and standard (nonrecurrent) backpropagation. The results of this simulation replicate the key finding from simulation 1: the oscillating algorithm outperforms the comparison algorithms at omnidirectional pattern completion when both input pattern overlap and test pattern noise are high. However, unlike in simulation 1, the oscillating algorithm also outperforms the comparison algorithms in tests with unrelated input patterns and low levels of test pattern noise. We attribute this latter finding to the fact that the oscillating algorithm automatically strengthens weak connections between target units. In contrast, error-driven algorithms like backpropagation and Leabra-Error strengthen connections between target units only if this is needed to reduce error at training.

3.2.1 Autoencoder Methods

• Network architecture. To implement an autoencoder architecture, we added an output layer to the network, so the network was composed of an 80-unit input layer, a 40-unit hidden layer, and an 80-unit output layer. The input layer had a full bidirectional projection to the hidden layer, and the hidden layer had a full bidirectional projection to the output layer. To maximize comparability with our initial simulations, every input unit was directly connected to every other input unit, and every output unit was directly connected to every other output unit (the one crucial difference was that, in the autoencoder simulations, the input units were not directly connected to the output units). We used the same connection parameters as in simulation 1; all connections were modifiable. The one exception to the scheme outlined here was the backpropagation autoencoder, which had only feedforward connections (from the input layer to the hidden layer and from the hidden layer to the output layer).

• Training and testing. All of the algorithms were given 150 patterns to memorize. The training set was repeatedly presented (in a different order each time) until learning reached asymptote. The details of training for each algorithm are presented below. After each training epoch, we tested pattern completion. On each test trial, we left out one feature from the input pattern and measured how well the network was able to recall the missing feature on the output layer. As with all of the simulations described earlier, we tested recall only of item-specific features (i.e., features that were not part of the prototype pattern), and we scored recall as correct based on whether the to-be-recalled output unit was more active than all of the other (nontarget) units in the output layer. Finally, as in simulation 1, we manipulated the level of input pattern overlap and also explored how distorting test cues (by adding gaussian noise to the test patterns) affected pattern completion performance.⁷

⁷ A given amount of test pattern distortion had less of an effect in simulation 2 than in simulation 1 because in simulation 2 we were distorting only the input layer pattern, whereas in simulation 1 we were applying the distortion to a shared input-output layer (so the distortion had a direct effect on output activity, in addition to an indirect effect via the distorted input pattern). To compensate for this difference, we used a wider range of test pattern noise values in simulation 2 than in simulation 1.

• Methods for oscillating algorithm simulations. We trained the oscillating algorithm version of the autoencoder by simultaneously presenting the same pattern to both the input and output layers. During training, inhibition was oscillated on the input and output layers but not on the hidden layer. The input layer and output layer oscillation parameters were identical to each other and identical to the parameters that we used in simulation 1.

• Methods for Leabra and Leabra-Error simulations. The Leabra and Leabra-Error autoencoder simulations used a two-phase design. In the minus phase, the complete target pattern was clamped onto the input layer (but not the output layer) and the network was allowed to settle. In the plus phase, the complete target pattern was clamped onto both the input and output layers and the network was allowed to settle. Otherwise, the details of the Leabra and Leabra-Error simulations were the same as in simulation 1.

• Methods for recurrent and nonrecurrent backpropagation simulations. For our simulations using the Almeida-Pineda (A-P) recurrent backpropagation algorithm (Almeida, 1989; Pineda, 1987), we used the rbp++ program contained in the PDP++ neural network software package.⁸ For our simulations using the (nonrecurrent) backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986), we used the bp++ program contained in the PDP++ software package. For both sets of simulations (rbp++ and bp++), we used the default learning parameters built into the software package (e.g., momentum = .9), except we changed the learning rate (as described below) and, as with all of the other simulations in this article, turned off bias weight learning.

⁸ This software can be downloaded from Randy O'Reilly's PDP++ web site at the University of Colorado: http://psych.colorado.edu/∼oreilly/PDP++/PDP++.html.

• Learning rates. We used a learning rate of .0005 for all four comparison algorithms. The oscillating algorithm simulations used a learning rate of .05. We allowed each algorithm to run until learning reached asymptote (10,000–20,000 epochs for Leabra, Leabra-Error, and A-P recurrent backpropagation; 100,000 epochs for feedforward backpropagation; 250 epochs for the oscillating algorithm).⁹

⁹ We also ran backpropagation simulations with larger learning rates, and the results were qualitatively similar to the results obtained here.

3.2.2 Results of Autoencoder Simulations

[Figure 7 comprises three panels ((A) unrelated input patterns; (B) 57% average input overlap; (C) 28% average input overlap) plotting the number of patterns learned (y-axis, 0 to 140) against test pattern noise (x-axis, 0 to 30 × 10⁻²) for the oscillating algorithm, A-P backpropagation, backpropagation, Leabra, and Leabra-Error.]

Figure 7: Results of three-layer autoencoder simulations where we manipulated input pattern overlap and test pattern noise. The oscillating algorithm performs better than the other algorithms in all conditions except for high input overlap, low test pattern noise (where the oscillating algorithm's performance is comparable to backpropagation). In general, the oscillating algorithm's advantage is accentuated for high levels of test pattern noise.

The results of our autoencoder capacity simulations are shown in Figure 7. The oscillating algorithm outperforms the comparison algorithms in every condition, with the sole exception of high input overlap, low test pattern noise (where the oscillating algorithm's performance is comparable to backpropagation). As in simulation 1, the oscillating algorithm's advantage over other algorithms is larger for high test pattern noise than for low test pattern noise. The most notable difference between simulation 1 and simulation 2 is that, in this simulation, the oscillating algorithm outperforms the comparison algorithms given low input pattern overlap and no test pattern noise (whereas the opposite is true in simulation 1). The fact that the oscillating algorithm shows better pattern completion than the four comparison algorithms can be explained as follows:
• Pattern completion performance is a direct function of the strength of links between features within a pattern.

• Standard autoencoder training (as embodied by the four comparison algorithms) does not force the network to learn strong associations between input features.
In order to minimize reconstruction error, autoencoders need to represent all of the input features somewhere in the hidden layer, but they do not have to link these features. In many situations, the autoencoder can minimize reconstruction error by representing features individually and then
reconstructing the input on a feature-by-feature basis. This strategy leads to poor pattern completion performance. In contrast, the oscillating learning algorithm places strong pressure on the network to form direct associations between the features of to-be-memorized patterns (because units need collateral support from other units in order to withstand the "stress test" of increased inhibition). These strong links result in good pattern completion performance.

4 Discussion

The research presented here shows how oscillations in the strength of neural inhibition can facilitate learning. Specifically, lowering inhibition can be used to identify competing memories so they can be punished, and raising
inhibition can be used to identify weak parts of memories so they can be strengthened. The specific weight change equation (CHL) that we use in the oscillating algorithm is not novel. Rather, the novel claim is that changes in the strength of inhibition can be used to generate minus (i.e., less desirable) patterns to feed into the CHL equation. In this section, we provide a brief overview of the primary virtues of the oscillating learning algorithm relative to the other algorithms considered here. Then we discuss how the oscillating algorithm relates to neural data on oscillations and learning and how the oscillating algorithm relates to the BCM rule (Bienenstock, Cooper, & Munro, 1982).

4.1 Functional Properties of the Learning Algorithm

In section 3, we showed that the oscillating algorithm (applied to a cortical network architecture) meets the desiderata for cortical learning outlined earlier: good completion of overlapping patterns (after repeated exposure to those patterns), good generalization to retrieval cues that do not exactly match studied patterns, and good correspondence between the structure of the input patterns and the hidden representations (i.e., similar input patterns tend to get assigned similar hidden representations).

We attributed the oscillating algorithm's good performance for overlapping inputs and noisy test cues to its ability to punish competing memories. Whenever memories start to blend together, they start to compete with one another at retrieval, and the competitor punishment mechanism pushes them apart. In this manner, the oscillating algorithm retains good pattern separation in the hidden layer (see Figure 4) even when inputs overlap strongly. As discussed earlier, this extra separation is not without costs (e.g., it incrementally degrades the hidden layer's ability to represent the structure of the input space, compared to Leabra), but the costs are small relative to the following benefit: maintaining good separation between representations helps to ensure that memories can be accurately stored and accessed even in difficult situations (e.g., when there are many similar memories stored in the system, and the cue only slightly favors one memory over the others).

The fact that the oscillating algorithm outperforms all of the comparison algorithms in simulation 2 (pattern completion with an autoencoder architecture), even for unrelated input patterns, points to another key property of the algorithm: it automatically probes for weak parts of the attractor (by raising inhibition) and strengthens these weak parts. This automatic probing and strengthening ensures that the network will be able to pattern-complete from one arbitrary subpart of the pattern to another, regardless of whether that particular partial pattern has been encountered before. In contrast, the other algorithms that we examined (e.g., Leabra-Error) show a large performance hit when the partial patterns used to cue retrieval at test do not exactly match the patterns used to cue retrieval (during the minus phase) at training.
4.2 Relating the Oscillating Algorithm to Neural Theta Oscillations. We think that neural theta oscillations (and theta-dependent learning processes) may serve as the neural substrate of the oscillating learning algorithm. Theta oscillations have been observed in humans in both neocortex and the hippocampus. Raghavachari et al. (2001) found that cortical theta was gated by stimulus presentation during a memory experiment, and Rizzuto et al. (2003) found that theta phase is reset by stimulus onset. Both findings indicate that theta oscillations are present at the right time to support stimulus memorization.

Other findings point to a more direct link between theta and synaptic plasticity. In a recent study, Seager et al. (2002) found that eyeblink conditioning occurred more quickly when animals were trained during periods of high versus low hippocampal theta power (see Berry & Seager, 2001, for a review of similar studies). Also, Huerta and Lisman (1996) induced theta oscillations in a hippocampal slice preparation and found that the direction of plasticity (long-term potentiation versus long-term depression) depends on the phase of theta (see also Holscher, Anwyl, & Rowan, 1997, for a similar result in anesthetized animals, and Hyman, Wyble, Goyal, Rossi, & Hasselmo, 2003, for a similar result in behaving animals). The finding that LTP is obtained during one phase of theta and LTD is obtained during another phase fits very well with the oscillating algorithm’s postulate that one part of the inhibitory oscillation (going from normal to high to normal inhibition) is primarily concerned with strengthening target memories, and the other part of the oscillation (going from normal to low to normal inhibition) is primarily concerned with weakening competitors. Although this result is very encouraging, more work is needed to explore the mapping between the LTP/LTD findings and our model. The mapping is not straightforward because the studies noted above used the local field potential to index theta, and it is unclear how much the local field potential is driven by excitatory versus inhibitory neurons.

One could reasonably ask why we think the oscillation in our algorithm relates to theta oscillations as opposed to, say, alpha or gamma oscillations. Functionally, the frequency of the oscillation in our algorithm is bounded by two constraints. First, the oscillation has to be fast enough such that the oscillation completes at least one full cycle (and ideally more) when a stimulus is presented. This rules out slow oscillations (less than 1 Hz). Also, if the oscillation is too fast relative to the speed of spreading activation in the network, competitors will not have a chance to activate during the low-inhibition phase. This constraint rules out very fast oscillations. Thus, although we are not certain that theta is the correct frequency, the functional constraints outlined here and the findings relating theta to learning (outlined above) make this a possibility worth pursuing.

4.3 Applications to Hippocampal Architectures. Although this article has focused on cortical network architectures, we also think that our ideas
about theta (that theta can help to selectively strengthen weak target units and punish competitors) may be applicable to hippocampal architectures. At this time, there are several theories (other than ours) regarding how theta oscillations might contribute to hippocampal processing. For example, Hasselmo, Bodelon, and Wyble (2002) argue that theta oscillations help tune hippocampal dynamics for encoding versus retrieval, such that dynamics are optimized for encoding during one phase of theta and dynamics are optimized for retrieval during another phase of theta. Hasselmo et al.’s model varies the relative strengths of different excitatory projections as a function of theta (to foster encoding versus retrieval) but does not vary inhibition. In contrast, our model varies the strength of inhibition but does not vary the strength of excitatory inputs. At this time, it is unclear how our model relates to Hasselmo et al.’s model. We do not see any direct contradictions between our model and Hasselmo et al.’s model (insofar as they manipulate different model parameters as a function of theta), so it seems possible that the two models could be merged, but further simulation work is needed to address this question.
4.4 Relating the Oscillating Algorithm to BCM. In this section, we briefly review another algorithm (the BCM algorithm: Bienenstock et al., 1982) that can be viewed as implementing competitor punishment. Like the CPCA Hebbian Learning rule, the BCM algorithm is set up to learn about clusters of correlated features. The main difference between BCM and CPCA Hebbian Learning relates to the circumstances under which synaptic weakening (LTD) occurs. CPCA Hebbian Learning reduces synaptic weights when the receiving unit is active but the sending unit is not. In contrast, BCM reduces synaptic weights from active sending units when the receiving unit’s activation is above zero but below its average level of activation. Thus, BCM actively pushes away input patterns from weakly activated hidden units. This form of synaptic weakening can be construed as a form of competitor punishment: if a memory receives enough input to activate its hidden representation but not enough to fully activate that representation, that memory is weakened. In contrast, if a memory does not receive enough input to activate its hidden representation, the memory is not affected. The main functional difference between competitor punishment in BCM versus the oscillating algorithm is that BCM can punish competitors only if their representations show above-zero (but below-average) activation. In contrast, the oscillating algorithm actively probes for competitors (by lowering inhibition) and is therefore capable of punishing competitors even if they are not active given normal levels of inhibition. This “active probing” mechanism should result in much more robust competitor punishment. Importantly, BCM’s form of competitor punishment and the oscillating algorithm’s form of competitor punishment are not mutually exclusive. It is possible that combining the algorithms would result in better performance
than either algorithm taken in isolation. We will explore ways of integrating BCM with the oscillating learning algorithm in future research.
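For readers who want the contrast above in concrete terms, here is a minimal sketch of a BCM-style update next to the sliding threshold that drives it; the function names, the learning rate, and the squared-activity form of the threshold average are our assumptions rather than details taken from Bienenstock et al. (1982).

```python
import numpy as np

def bcm_update(w, x, y, theta, lrate=0.01):
    """BCM-style update for one receiving unit: for active senders x,
    weights grow when postsynaptic activity y exceeds the sliding
    threshold theta, and shrink when 0 < y < theta (the regime the
    text construes as competitor punishment)."""
    return w + lrate * x * y * (y - theta)

def update_threshold(theta, y, tau=0.005):
    """One common choice for the BCM sliding threshold: a slow running
    average of squared postsynaptic activity."""
    return (1 - tau) * theta + tau * y ** 2
```

In this form, synaptic weakening occurs only in the above-zero, below-threshold band; the oscillating algorithm instead exposes competitors by lowering inhibition and punishes them through the CHL minus phase.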
4.5 Applying the Oscillating Algorithm to Psychological Data. In this article, we have focused on functional properties of the learning algorithm (e.g., its capacity for learning patterns, given different levels of overlap). Another way to constrain the model is to explore its ability to simulate detailed patterns of psychological data. In one line of research, we have used the model to account for several key findings relating to retrieval-induced forgetting (Anderson, 2003; see section 1.1 for more discussion of this phenomenon). For example, the model can explain the finding that forgetting of competing items is cue independent (Anderson & Spellman, 1995), the finding that competitor punishment effects are larger when subjects are asked to retrieve the target versus when they are just shown the target (Anderson, Bjork, & Bjork, 2000), and the finding that the amount of competitor punishment is proportional to the strength of the competitor (Anderson et al., 1994). This modeling work is described in detail in Norman, Newman, and Detre (2006).
5 Conclusions

The research presented here started with a psychological puzzle: How can we account for data showing that competitors are punished? In the course of addressing this issue, we found that competitor punishment mechanisms can boost networks’ ability to learn highly overlapping patterns (by ensuring that hidden representations do not collapse together). We also observed that the changing-inhibition aspect of our algorithm bears a strong resemblance to neural theta oscillations. As such, this research may end up speaking to the longstanding puzzle of how theta oscillations contribute to learning. The challenge now is to follow up the admittedly preliminary results presented here with a more detailed assessment of how the basic principles of the oscillating algorithm (competitor punishment via decreased inhibition and selective strengthening via increased inhibition) can shed light on psychological, neural, and functional issues.
Appendix A: Model Parameters

A.1 Basic Network Parameters. At the beginning of each simulation, all of the weights were initialized to random values from the uniform distribution centered on .5 with range = .4. The initial weight values were symmetric, such that the initial weight from unit i to unit j was equivalent to the initial weight from unit j to unit i. This symmetry was maintained
through learning because the weight update equations are symmetric. Other model parameters were as follows:

Parameter                    Value
stm gain                     0.4
input/output layer dtvm      0.2
hidden layer dtvm            0.15
i kwta pt                    0.325
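A minimal sketch of the weight initialization just described (NumPy-based; the function name and seeding convention are ours):

```python
import numpy as np

def init_symmetric_weights(n_units, center=0.5, width=0.4, seed=None):
    """Draw weights uniformly from [center - width/2, center + width/2]
    and mirror one triangle so that w[i, j] == w[j, i], matching the
    symmetric initialization described in appendix A.1. The diagonal is
    left at zero (no self-connections assumed)."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(center - width / 2, center + width / 2,
                    size=(n_units, n_units))
    w = np.triu(w, k=1)  # keep the upper triangle...
    return w + w.T       # ...and mirror it into the lower triangle
```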
Apart from the parameters mentioned above, all other parameters shared by the oscillating learning algorithm and Leabra were set to their Leabra default values.

A.2 Oscillation Parameters. The oscillating component of inhibition was varied from min = −1.21 to max = 1.96. As per equation 2.2, the sign of the learning rate was shifted from positive to negative depending on whether inhibition was moving toward its normal (midpoint) value or away from its normal value. The network was given 20 time steps to settle into a stable state before the onset of the inhibitory oscillation. Figure 8 shows how inhibition was oscillated on each trial and how the sign of the learning rate was changed as a function of the phase of the inhibitory oscillation.

Figure 8: Illustration of how inhibition was oscillated on each trial. At each time step, the inhibitory oscillation component depicted on this graph was added to the value of inhibition computed by the kWTA algorithm. The graph also shows how the sign of the learning rate was set to a positive value when the inhibitory oscillation was moving toward its midpoint, and it was set to a negative value when the inhibitory oscillation was moving away from its midpoint. (Horizontal axis: time elapsed in time steps; left axis: learning rate; right axis: inhibitory oscillation.)

Appendix B: Effects of Hidden Layer Size

In this simulation, we show that the oscillating algorithm can take advantage of additional hidden layer resources to store more patterns and to more accurately represent the structure of the input space. Specifically, we explored the effect of increased hidden layer size (120 versus 40 units) on the oscillating algorithm, Leabra, and Leabra-Error. The hidden layer k value was adjusted as a function of hidden layer size to ensure that (on average) 20% of the hidden units would be active for both the 120-hidden-unit simulations and the 40-hidden-unit simulations. Input patterns had 57% average overlap, and we used a test pattern noise value of .04.

Figure 9 shows the effects of hidden layer size on capacity and on the fidelity of the network’s representations (as indexed by our “similarity score” measure). There are two important results. First, increasing hidden layer size boosts the number of patterns learned by the oscillating algorithm. The effect of increasing hidden layer size is numerically larger for the oscillating algorithm than for Leabra and Leabra-Error, so the capacity advantage for the oscillating algorithm is preserved in the 120-hidden-unit condition. The second important result is that for all three algorithms, our input-hidden similarity metric (as described in section 3.1.2) is substantially larger for the large network simulations: For the oscillating algorithm, the
similarity score is .576 for the 40-unit simulation and .788 for the 120-unit simulation.
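As a rough illustration of the appendix A.2 schedule depicted in Figure 8, the sketch below generates one trial's inhibitory oscillation and learning-rate sign. The sinusoidal shape, the asymmetric scaling used to reach the stated min/max, and the choice to zero the learning rate during settling are our assumptions.

```python
import numpy as np

def oscillation_schedule(n_steps=100, settle=20, osc_min=-1.21, osc_max=1.96,
                         lrate_mag=0.03):
    """Sketch of the A.2 schedule: after `settle` steps, an oscillatory
    component spanning [osc_min, osc_max] is added to the kWTA-computed
    inhibition. The learning rate is +lrate_mag while the oscillation
    moves toward its midpoint and -lrate_mag while it moves away."""
    t = np.arange(n_steps)
    phase = 2 * np.pi * (t - settle) / (n_steps - settle)
    raw = np.sin(phase)
    # scale the positive and negative half-cycles separately so the
    # oscillation reaches osc_max above and osc_min below the midpoint
    osc = np.where(raw >= 0, osc_max, -osc_min) * raw
    osc[t < settle] = 0.0
    toward_mid = np.sign(osc) != np.sign(np.gradient(osc))
    lrate = np.where(toward_mid, lrate_mag, -lrate_mag)
    lrate[t < settle] = 0.0  # assumption: no learning during settling
    return osc, lrate
```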
Appendix C: Effects of Hidden Layer Oscillations

For simplicity, the oscillating algorithm simulations oscillated inhibition in the input-output layer but not the hidden layer. In this simulation, we show that the same qualitative pattern of results is obtained when inhibition is simultaneously oscillated in both the input-output layer and in the hidden layer.
Figure 9: (A) Capacity scores (number of patterns learned out of 200) for the oscillating algorithm, Leabra, and Leabra-Error, given 57% input pattern overlap and .04 test pattern noise, as a function of hidden layer size (40 versus 120 units). Increasing hidden layer size increases the number of patterns learned by the oscillating algorithm, and the oscillating algorithm continues to perform well relative to Leabra and Leabra-Error. (B) Input-hidden similarity scores (see simulation 1 for how these were calculated) given 57% input pattern overlap. All three algorithms show better similarity scores for the larger network.
Figure 10: Capacity scores (number of patterns learned out of 200) for the oscillating algorithm with and without hidden layer oscillations, given 57% input pattern overlap, as a function of test pattern noise (×10⁻²). Results for Leabra and Leabra-Error are included for comparison purposes. The same qualitative pattern of results is present both with and without hidden layer oscillations.
We selected hidden layer oscillation parameters such that over the course of training, the effect of the inhibitory oscillation on network activity (operationalized as the difference in average network activation from the peak of the inhibitory oscillation to the trough of the oscillation) was approximately equated for the input-output layer and the hidden layer.10 The simulation used input patterns with 57% overlap.

The results of this simulation are shown in Figure 10. The oscillating algorithm continues to outperform Leabra and Leabra-Error at learning highly overlapping patterns, even with the addition of oscillations in the hidden layer. We did not attempt to fine-tune the performance of the model once we added hidden oscillations, so the detailed pattern of results obtained here (e.g., the fact that the network performed slightly better without hidden oscillations) should not be viewed as reflecting parameter-independent properties of the model. Rather, these results constitute an existence proof that the oscillating algorithm advantages that we find in our “default parameter” simulations (without hidden layer oscillations) can also be observed in simulations with comparably sized hidden layer and input-output-layer oscillations.

Footnote 10: To equate the average effect of the inhibitory oscillation on activity in the input-output layer versus the hidden layer, we ended up using a much smaller-sized oscillation in the hidden layer than in the input-output layer: hidden oscillation min = −0.18 and max = 0.10; for the input oscillation, we used our standard max = 1.96 and a slightly smaller than usual min = −1.11; we set the learning rate to .03. The input layer inhibitory oscillation needs to be large in order to offset the strong (excitatory) external input coming into the target units. Hidden units do not receive this strong external input, so less inhibition is required to deactivate these units during the high-inhibition phase.

Acknowledgments

This research was supported by NIH grant R01MH069456, awarded to K.A.N.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Almeida, L. B. (1989). Backpropagation in nonfeedforward networks. In I. Aleksander (Ed.), Neural computing. London: Kogan Page.
Anderson, M. C. (2003). Rethinking interference theory: Executive control and the mechanisms of forgetting. Journal of Memory and Language, 49, 415–445.
Anderson, M. C., Bjork, E. L., & Bjork, R. A. (2000). Retrieval-induced forgetting: Evidence for a recall-specific mechanism. Memory and Cognition, 28, 522.
Anderson, M. C., Bjork, R. A., & Bjork, E. L. (1994). Remembering can cause forgetting: Retrieval dynamics in long-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 5, 1063–1087.
Anderson, M. C., & Spellman, B. A. (1995). On the status of inhibitory mechanisms in cognition: Memory retrieval as a model case. Psychological Review, 102, 68.
Berry, S. D., & Seager, M. A. (2001). Hippocampal theta oscillations and classical conditioning. Neurobiology of Learning and Memory, 76, 298–313.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(2), 32–48.
Buzsaki, G. (2002). Theta oscillations in the hippocampus. Neuron, 33, 325–340.
Fox, E. (1995). Negative priming from ignored distractors in visual selection. Psychonomic Bulletin and Review, 2, 145–173.
Freedman, J. L. (1965). Long-term behavioral effects of cognitive dissonance. Journal of Experimental Social Psychology, 1, 145–155.
Glucksberg, S., Newsome, M. R., & Goldvarg, G. (2001). Inhibition of the literal: Filtering metaphor-irrelevant information during metaphor comprehension. Metaphor and Symbolic Activity, 16, 277–293.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention, and grouping by the laminar circuits of visual cortex. Spatial Vision, 12, 163–186.
Hasselmo, M. E. (1995). Neuromodulation and cortical function: Modeling the physiological basis of behavior. Behavioural Brain Research, 67, 1–27.
Hasselmo, M. E., Bodelon, C., & Wyble, B. P. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Computation, 14, 793–818.
Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society (London) B, 352, 1177–1190.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing, Vol. 1: Foundations (pp. 282–317). Cambridge, MA: MIT Press.
Holscher, C., Anwyl, R., & Rowan, M. J. (1997). Stimulation on the positive phase of hippocampal theta rhythm induces long-term potentiation that can be depotentiated by stimulation on the negative phase in area CA1 in vivo. Journal of Neuroscience, 17, 6470.
Huerta, P. T., & Lisman, J. E. (1996). Synaptic plasticity during the cholinergic theta-frequency oscillation in vitro. Hippocampus, 6, 58–61.
Hyman, J. M., Wyble, B. P., Goyal, V., Rossi, C. A., & Hasselmo, M. E. (2003). Stimulation in hippocampal region CA1 in behaving rats yields long-term potentiation when delivered to the peak of theta and long-term depression when delivered to the trough. Journal of Neuroscience, 23, 11725–11731.
Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society (London) B, 262, 23–81.
Mayr, U., & Keele, S. (2000). Changing internal constraints on action: The role of backward inhibition. Journal of Experimental Psychology: General, 1, 4–26.
McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McNaughton, B. L., & Morris, R. G. M. (1987). Hippocampal synaptic enhancement and information storage within a distributed memory system. Trends in Neurosciences, 10(10), 408–415.
Minai, A. A., & Levy, W. B. (1994). Setting the activity level in sparse random networks. Neural Computation, 6, 85–99.
Movellan, J. R. (1990). Contrastive Hebbian learning in the continuous Hopfield model. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 10–17). San Mateo, CA: Morgan Kaufmann.
Norman, K. A., Newman, E. L., & Detre, G. J. (2006). A neural network model of retrieval-induced forgetting (Tech. Rep. No. 06-1). Princeton, NJ: Princeton University, Center for the Study of the Brain, Mind, and Behavior.
Norman, K. A., & O’Reilly, R. C. (2003). Modeling hippocampal and neocortical contributions to recognition memory: A complementary-learning-systems approach. Psychological Review, 4, 611–646.
O’Reilly, R. C. (1996). The Leabra model of neural interactions and learning in the neocortex. Unpublished doctoral dissertation, Carnegie Mellon University.
O’Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1242.
O’Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
O’Reilly, R. C., & Norman, K. A. (2002). Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences, 12, 505–510.
Pineda, F. J. (1987). Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 18, 2229–2232.
Raghavachari, S., Kahana, M. J., Rizzuto, D. S., Caplan, J. B., Kirschen, M. P., Bourgeois, B., Madsen, J. R., & Lisman, J. E. (2001). Gating of human theta oscillations by a working memory task. Journal of Neuroscience, 9, 3175–3183.
Rizzuto, D. S., Madsen, J. R., Bromfield, E. B., Schulze-Bonhage, A., Seelig, D., Aschenbrenner-Scheibe, R., & Kahana, M. J. (2003). Reset of human neocortical oscillations during a working memory task. Proceedings of the National Academy of Sciences, 13, 7931–7936.
Rolls, E. T. (1989). Functions of neuronal networks in the hippocampus and neocortex in memory. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches (pp. 240–265). San Diego, CA: Academic Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing, Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Seager, M. A., Johnson, L. D., Chabot, E. S., Asaka, Y., & Berry, S. D. (2002). Oscillatory brain states and learning: Impact of hippocampal theta-contingent training. Proceedings of the National Academy of Sciences, 99, 1616–1620.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 193–216.
Toth, K., Freund, T. F., & Miles, R. (1997). Disinhibition of rat hippocampal pyramidal cells by GABAergic afferents from the septum. Journal of Physiology, 500, 463–474.
Received November 8, 2004; accepted December 13, 2005.
LETTER
Communicated by Christopher Moore
Temporal Decoding by Phase-Locked Loops: Unique Features of Circuit-Level Implementations and Their Significance for Vibrissal Information Processing

Miriam Zacksenhouse
[email protected] Sensory-Motor Integration Laboratory, Technion Institute of Technology, Haifa, Israel
Ehud Ahissar
[email protected] Department of Neurobiology, Weizmann Institute, Rehovot, Israel
Rhythmic active touch, such as whisking, evokes a periodic reference spike train along which the timing of a novel stimulus, induced, for example, when the whiskers hit an external object, can be interpreted. Previous work supports the hypothesis that the whisking-induced spike train entrains a neural implementation of a phase-locked loop (NPLL) in the vibrissal system. Here we extend this work and explore how the entrained NPLL decodes the delay of the novel, contact-induced stimulus and facilitates object localization. We consider two implementations of NPLLs, which are based on a single neuron or a neural circuit, respectively, and evaluate the resulting temporal decoding capabilities. Depending on the structure of the NPLL, it can lock in either a phase- or co-phase-sensitive mode, which is sensitive to the timing of the input with respect to the beginning of either the current or the next cycle, respectively. The co-phase-sensitive mode is shown to be unique to circuit-based NPLLs. Concentrating on temporal decoding in the vibrissal system of rats, we conclude that both the nature of the information processing task and the response characteristics suggest that the computation is sensitive to the co-phase. Consequently, we suggest that the underlying thalamocortical loop should implement a circuit-based NPLL.

1 Introduction

One of the major computational tasks facing the vibrissal somatosensory system is to determine the angle of the vibrissa on contact with an external obstacle. The vibrissal system receives external sensory input from the trigeminal neurons whose response patterns include both whisking-locked spikes and contact-induced spikes (Szwed, Bagdasarian, & Ahissar, 2003). The whisking-locked spike train provides a periodic reference input at the whisking frequency. The contact-induced activity represents the

Neural Computation 18, 1611–1636 (2006)
C 2006 Massachusetts Institute of Technology
timing of the novel event of interest. When whisking frequency is consistent across cycles, the resulting computational task is equivalent to decoding the temporal delay or phase shift of a novel input with respect to a reference periodic input (Ahissar & Zacksenhouse, 2001), a basic computational task shared by other active sensory tasks, including vision (Ahissar & Arieli, 2001). At the algorithmic level (Marr, 1982), it was suggested that this computation can be performed by phase-locked loops (PLL) (Ahissar & Vaadia, 1990; Ahissar, Haidarliu, & Zacksenhouse, 1997; Ahissar & Zacksenhouse, 2001). PLLs are (electronic) circuits that can lock to the frequency of their external input and perform important processing tasks, including frequency tracking and demodulation (Gardner, 1979). One of the major motivations for this hypothesis is based on the implementation level. Specifically, neuronal implementations of circuit-based PLLs (NPLLs), like the one shown in Figure 1 and detailed in section 2, require neuronal oscillators whose frequencies can be controlled by the input rate (rate-controlled oscillator, RCO) (Ahissar et al., 1997; Ahissar, 1998). Thus, the evidence that over 10% of the individual neurons in the somatosensory cortex can operate as controllable neural oscillators (Ahissar et al., 1997; Ahissar & Vaadia, 1990; Amitai, 1994; Flint, Maisch, & Kriegstein, 1997; Lebedev & Nelson, 1995; Nicolelis, Baccala, Lin, & Chapin, 1995; Silva, Amitai, & Connors, 1991) provided the initial motivation and further support for the hypothesis that these neurons function as RCOs in circuit-based NPLLs. Other requirements for implementing PLLs in the vibrissal system and agreement with the model predictions have also been demonstrated:
- The frequencies of the local cortical oscillators can be increased by local glutamatergic excitation (Ahissar et al., 1997).
- These oscillators can track the whisker frequency (Ahissar et al., 1997).
- The whisker frequency is encoded in the latency of the response of thalamic neurons (Ahissar, Sosnik, & Haidarliu, 2000; Sosnik, Haidarliu, & Ahissar, 2001).
- Thalamic neurons respond after (and not before, as would be expected from relay neurons) cortical neurons (Nicolelis et al., 1995), as predicted by a thalamocortical PLL (Ahissar et al., 1997).
While these investigations focused on the response of the vibrissal system to the reference, whisking-induced input, the response to the novel contact-induced input was not investigated in detail. The purpose of this article is to investigate and demonstrate how NPLLs respond to the novel contact-induced input and assess the resulting temporal decoding capabilities. Furthermore, we address the issue of whether circuit-based NPLLs provide any computational advantages over single-neuron implementations (Hoppensteadt, 1986). Specifically, we distinguish between two
locking modes, which are sensitive to either the phase or co-phase of the input (the normalized delay of the input with respect to the preceding or succeeding oscillatory event, respectively). It is shown that single neurons can implement only phase-sensitive NPLLs, while circuit-based NPLLs can implement both. In the context of the vibrissal thalamocortical system, both the response characteristics and the nature of the information processing task suggest that the computation should be sensitive to the co-phase and thus should be implemented by circuit-based NPLLs.

Section 2 develops a mathematical model of NPLLs and describes four possible variants and their respective characteristics. Section 3 investigates the temporal decoding capabilities provided by the different NPLLs and evaluates them with respect to the temporal decoding task performed by the vibrissal system. Section 4 investigates the response characteristics of cortical oscillators and determines which of the four NPLL variants they implement. The information processing capabilities of NPLLs implemented by single neural oscillators and by neural circuits are discussed in section 5, considering both temporal decoding and temporal pattern generation.

2 Mathematical Modeling of PLLs

Different neuronal implementations of the well-known electronic PLLs (Gardner, 1979) are possible (Ahissar, 1998), including, for example, the neuronal circuit of Figure 1. The instantaneous frequency of the neural oscillator depends on its intrinsic frequency and the rate of its input (rate-controlled oscillator, RCO). The input to the RCO is generated by an ensemble of phase-detecting neurons, grouped together as a PD, whose ensemble output rate depends on the delay between the external spike and the RCO-evoked spike. When the NPLL locks to the external input, the instantaneous frequency of the internal RCO tracks the instantaneous frequency of the external input, and the deviation from the intrinsic frequency is encoded in the output rate of the PD (Ahissar et al., 1997; Ahissar, 1998).

2.1 Phase Models. The activity of a neural oscillator may involve a single spike or a burst of spikes, which repeat periodically. It is natural to describe the periodic activity as a function of a phase variable (Rand, Cohen, & Holmes, 1986). By normalizing the phase of the oscillator $\theta_{osc}(t)$ to a unit interval, that is, $\theta_{osc} \in [0, 1]$, it describes the fraction of the elapsed cycle. When the phase reaches the unit level, it resets to zero, and the oscillator generates a single spike or a burst of spikes. The phase of a free oscillator varies at a constant rate, whose inverse determines its intrinsic period $\tau_{osc}$, so $\dot{\theta}_{osc} = \tau_{osc}^{-1}$ (Zacksenhouse, 2001).

The input to the oscillator affects the rate at which the phase changes and thus the period of the oscillator. In general, the effect may depend on the complete history of the input.
Figure 1: Neuronal phase-locked loop (NPLL). (A) Schematic illustration of a NPLL, which includes a phase detector (PD) and a rate-controlled oscillator (RCO). (B) Schematic illustration of a particular PD, the subthreshold-activated correlation PD: input events, marked by upward arrows, arrive from either an external source or the internal oscillator and produce subthreshold activation of fixed strength and duration. The subthreshold activations are summed and evoke a fixed-rate response when the threshold is crossed. Thus, the PD responds when the subthreshold activations from both the internal and external sources overlap. Other implementations are discussed in the text and depicted in Figures 2 and 3.
However, here we assume that upon completing an oscillatory cycle and generating a spike in the case of a neural oscillator, the oscillator is reset independent of its history. Thus, the period is assumed to vary only as a function of the phase of the input during the current cycle. Specifically, the instantaneous frequency during the nth cycle is $\dot{\theta}_{osc} = \tau_{osc}^{-1} + h(\theta_{osc} \mid \{\eta_k\}_{k=N(t_n)+1}^{N(t)})$, where $\eta_k$ is the time of occurrence of the kth input event, $N(t)$ is the number of input events that occurred up to time t (the counting process; Snyders, 1975), and $t_n$ is the time of occurrence of the nth oscillatory event (and the start of the nth cycle). The function $h(\theta_{osc} \mid \{\eta_k\}_{k=N(t_n)+1}^{N(t)})$ describes the effect of the input events that occur during the nth cycle and depends in general on their time of occurrence and the phase of the oscillator.

The above effect may be simplified in two extreme but very important cases: pulse-coupled oscillators and rate-controlled oscillators. In the first case, the effect of an isolated input event, usually from a single source, is short compared with the inter-event interval, and in the extreme assumed instantaneous. In the second case, the effects from different input events, coming usually from different sources, are highly overlapping, so the rate rather than the timing of the individual events determines the overall effect.

2.1.1 Pulse-Coupled Oscillators. The instantaneous effect of the input is described by $\dot{\theta}_{osc} = \tau_{osc}^{-1} \pm f(\theta_{osc})\,\delta(t - \eta_{N(t)})$, where $f(\theta_{osc})$ is known as the phase-response curve (PRC) (Perkel, Schulman, Bullock, Moore, & Segundo, 1964; Kawato & Suzuki, 1978; Winfree, 1980; Yamanishi, Kawato, & Suzuki, 1980; Zacksenhouse, 2001). Upon integration,

$$\theta_{osc} = t/\tau_{osc} \pm \sum_{k=N(t_n)+1}^{N(t)} f(\theta_{osc}(\eta_k)),$$

and the perturbed period $\tau_p(n)$ is given by

$$\tau_p(n) = \tau_{osc}\left[1 \mp \sum_{k=N(t_n)+1}^{N(t_n+\tau_p)} f(\theta_{osc}(\eta_k))\right].$$

When only one input event occurs during the oscillatory cycle, the modified period is

$$\tau_p(n) = \tau_{osc}\left[1 \mp f(\varphi(n))\right], \tag{2.1}$$

where $\varphi(n) = (\eta_{N(t_n)+1} - t_n)/\tau_{osc}$ is the phase of the oscillator at the time of occurrence of that input event.

As will be further discussed in section 4, a single pulse-coupled oscillator is equivalent to a PLL. However, a PLL may also be implemented by a
(neural) circuit that includes an RCO, whose characteristics are detailed below.

2.1.2 Rate-Controlled Oscillator (RCO). For simplicity, we assume that the RCO response depends on its input spike rate $r(t)$, independent of its phase, so $\dot{\theta}_{RCO} = \tau_{RCO}^{-1} \pm h_{RCO}(r(t))$, where $\tau_{RCO}$ denotes the intrinsic period of the RCO and $h_{RCO}$ describes the effect of the input rate on the instantaneous frequency. Upon integration, the perturbed period is given by

$$\tau_p(n) = \tau_{RCO}\left[1 \mp \int_{t_n}^{t_n+\tau_p} h_{RCO}(r(t))\,dt\right].$$

This can be expressed in terms of the lumped rate parameter, $R(n)$, which describes the integrated effect of the input to the RCO during its nth cycle on the duration of that cycle, as

$$\tau_p(n) = \tau_{RCO}\left[1 \mp R(n)\right], \qquad R(n) = \int_{t_n}^{t_n+\tau_p} h_{RCO}(r(t))\,dt. \tag{2.2}$$

Thus, the lumped rate parameter, $R(n)$, describes the integrated effect of the input during the nth cycle, with the net effect of either shortening or lengthening the period, as denoted by the ∓ sign, respectively. These effects are usually associated with excitatory and inhibitory inputs, respectively, although intrinsic currents may cause the reverse effect (Jones, Pinto, Kaper, & Koppel, 2000; Pinto, Jones, Kaper, & Koppel, 2003). In the linear case, that is, linear $h_{RCO}$, the lumped parameter $R(n)$ is proportional to the total number of spikes that occur during the nth cycle. The RCO can be implemented by an integrate-and-fire neuron as analyzed and simulated in Zacksenhouse (2001). When the integrate-and-fire RCO is embedded in an inhibitory PLL, the lumped rate parameter is an approximately linear function of the total number of spikes (Zacksenhouse, 2001, equation A8).

An RCO that is embedded in a PLL receives its input from a PD, whose response characteristics are analyzed next.

2.2 Phase Detectors. The PD receives input from two sources, the external input and the internal RCO, and converts the interval between them into an output spike rate. For unique decoding, the conversion should be monotonic, with either a decreasing or increasing response (Ahissar, 1998). In particular, the PD may compute the correlation between the two inputs and respond maximally when the interval is zero (correlation-based PD, or Corr-PD). Alternatively, the PD may compute the time difference between its two inputs and respond minimally when the interval is zero (difference-based PD, or Diff-PD) (Kleinfeld, Berg, & O’Connor, 1999).
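As a concrete (if simplistic) illustration of the RCO dynamics above, the sketch below Euler-integrates the phase of equation 2.2; the step size, the linear choice of $h_{RCO}$, and the function name are our assumptions rather than part of the model.

```python
def rco_step(theta, rate, tau_rco, dt=1.0, h=lambda r: 0.05 * r, sign=+1):
    """One Euler step of the RCO phase, theta_dot = 1/tau_rco +/- h(rate).
    sign=+1 lets input speed the oscillator up (shortening the period);
    sign=-1 slows it down, as with net inhibitory input. Returns
    (theta, fired): the oscillator emits an event and resets when the
    phase reaches 1 (overshoot is ignored for simplicity)."""
    theta = theta + dt * (1.0 / tau_rco + sign * h(rate))
    fired = theta >= 1.0
    if fired:
        theta = 0.0
    return theta, fired
```

Summing h(rate)·dt over one cycle approximates the lumped rate parameter $R(n)$ of equation 2.2.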
2.2.1 Input Representation. The mathematical formulation of the computation performed by the PD depends on the representation of its input signals, which may be analog, binary, or discrete (Gardner, 1979). Analog signals are described by waveforms (usually sinusoidal) that vary with the phase of the cycle. Binary signals are described by rectangular waveforms whose onset is taken to be the origin. Discrete signals consist of discrete events that occur once per cycle, at a particular phase, which is taken to be the origin. In the context of neural implementations, discrete signals may describe the spike trains from single oscillating neurons, binary signals may describe the spike trains from bursting neurons, and analog signals may describe the average firing rate from a population of neurons and postsynaptic potentials.

The main difference between these representations is the information they provide (or do not provide) about the phase variable. Analog signals may provide continuous indication of the phase, while binary and discrete signals provide information only at a specific phase (the origin). (It is noted that binary and discrete signals may be derived from each other and are essentially equivalent. In particular, the zero crossing of the rectangular wave is a discrete signal, and discrete signals may be used to generate rectangular waves using a memory device that is switched on for a fixed duration whenever a discrete event occurs.)

The nature of operation of the PD is directly related to the nature of its inputs. The phase between analog inputs is detected using multiplier circuits, or mixers (Gardner, 1979), which operate as correlation-based PDs. The phase between binary or discrete signals is detected using logical devices (Gardner, 1979), which may operate as either correlation-based PDs (e.g., binary AND-gate) or difference-based PDs (e.g., binary Exclusive-OR gate) (Ahissar, 1998). Considering the spiking nature of neuronal signaling, we adopt the discrete, or equivalently the binary, representations in this work. These representations facilitate the unified investigation and comparison of PLLs with correlation-based and difference-based PDs. We use the term event to reflect either an event in a discrete representation or the rising edge of a rectangular signal in a binary representation.

In the following mathematical formulation, significant phase variables are defined by normalizing the corresponding time intervals with respect to the intrinsic period of the RCO, $\tau_{RCO}$. In particular, each input event is localized with respect to the preceding and the succeeding RCO events, as shown in Figure 2. The normalized intervals between the kth input event and the preceding or succeeding RCO events are referred to as the phase, ϕ(k), and co-phase, ψ(k), respectively. The normalized intervals since the last input or RCO events are denoted by $\theta_i$ and $\theta_o$, respectively. Equation 2.2 describes how the period of the RCO is perturbed by the input it receives from the PD.
Figure 2: Phase relationships between input events, marked by upward arrows ending at the horizontal axis, and RCO-generated events, marked by upward arrows originating at the horizontal axis, and the corresponding response of a subthreshold-activated PD. (A, B) Cases in which the input is lagging or leading the RCO events, respectively. The horizontal axes gauge time normalized by the period of the intrinsic RCO, $\tau_{RCO}$, so they correspond directly to the phase. The normalized intervals between the kth input event and the preceding or succeeding RCO events are referred to as the phase, ϕ(k), and co-phase, ψ(k), respectively. Each event causes a subthreshold activity during a window of normalized duration $\theta_w$, which, for clarity, are marked for only one pair of events in each panel. The resulting responses evoked by that pair of events are depicted on the short axes, for a correlation-based and a difference-based PD, respectively.
The external input is composed of two spike trains: (1) a reference spike train described by a free oscillator with period $\tau_{ip}$ and normalized input period $\zeta = \tau_{ip}/\tau_{RCO}$, and (2) a novel spike whose timing with respect to the reference spike train carries the information to be detected by the PLL.

2.2.2 Correlation-Based PD. Two types of simple correlation-based PDs are analyzed in detail and shown to have similar response characteristics, which are then abstracted to characterize the response of general correlation-based PDs.

Simple example of a threshold-activated PD. This is the case depicted in Figure 1B. Each of the inputs to the PD, coming from either the external input or the internal RCO, produces excitatory subthreshold activation for a normalized duration $\theta_W$. The correlation-based PD responds at a constant rate when these activities overlap, as shown schematically in Figure 2. Consequently, the instantaneous output rate of the PD is given by

$$r_{Corr}(\theta_i, \theta_o) = r_0\, U(\theta_W - \theta_i)\, U(\theta_W - \theta_o), \tag{2.3}$$

where $r_0$ is the constant output rate and $U(\cdot)$ is the unit function (a function that is 1 when its argument is positive and 0 elsewhere). In general, the duration of the subthreshold activation may depend on whether the input is coming from the external input or the internal RCO, but for simplicity of notation, this difference is ignored, and a single $\theta_W$ is used.

As derived in equation 2.2, the lumped effect of the PD on the period of the RCO depends on the lumped rate parameter R. Assuming a linear case (and, without loss of generality, a unit gain), the lumped rate parameter is given by integrating the instantaneous rate given in equation 2.3. Consequently, an input event that occurs at a phase ϕ(k) and co-phase ψ(k) would result in a lumped rate parameter $R_{Corr}(k)$ of

$$R_{Corr}(k) = \begin{cases} r_0(\theta_W - \varphi(k)) & \text{if } \varphi(k) < \theta_W \\ r_0(\theta_W - \psi(k)) & \text{if } \psi(k) < \theta_W \\ 0 & \text{otherwise.} \end{cases} \tag{2.4}$$
The duration of the subthreshold activation $\theta_W$ is assumed to be short enough that at most one of the first two conditions holds during regular operation (i.e., $\theta_W < \tau_{RCO}/2$). In the vibrissal system, the duration of individual reference signals, that is, whisking-locked responses of individual first-order trigeminal neurons, is indeed shorter than half of the whisking cycle and is usually confined to the protraction (forward movement of the whiskers) period (Szwed et al., 2003). When the RCO’s period is locked to the whisking period, the above relationship would hold. Considering the order of input and RCO events that cause the PD to respond, we refer to the first and second cases as input lagging and leading, respectively.
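For reference, equation 2.4 amounts to the following few lines (a sketch; the function name is ours):

```python
def r_corr(phase, cophase, theta_w, r0=1.0):
    """Lumped rate of the linear correlation-based PD (equation 2.4):
    maximal when input and RCO events coincide, decreasing linearly
    with whichever of phase/co-phase falls inside the window."""
    if phase < theta_w:       # input lagging the RCO event
        return r0 * (theta_w - phase)
    if cophase < theta_w:     # input leading the RCO event
        return r0 * (theta_w - cophase)
    return 0.0                # activations do not overlap
```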
Figure 3: Gated PD. The external input evokes a burst of spikes, which is gated by the PD. The onset of the gating is determined by the intrinsic oscillator (RCO).
Simple example of a gated PD. In this case, the external input is composed of a burst of spikes (Szwed et al., 2003), which is gated by the PD, as shown in Figure 3. The onset of the burst is determined by the external input, while the onset of the gating is determined by the RCO. The burst lasts for a normalized duration of $\theta_{burst}$ and, for simplicity, is assumed to have a constant rate of $r_{ip}$. The duration of the gating window is $\theta_{gate}$ and for simplicity is assumed to equal the duration of the input burst, so $\theta_{gate} = \theta_{burst} \equiv \theta_W$. Thus, the instantaneous output rate of the gated PD is the same as that for the constant rate PD, with $r_0 = r_{ip}$, and equation 2.4 describes the resulting lumped rate parameter due to an isolated external event.

General correlation-based PD. According to equation 2.4, the response of a correlation-based PD decreases linearly with the relevant phase variable (the phase or the co-phase when the input is lagging or leading, respectively). In general, the response may be nonlinear, but its derivative should characteristically be negative:

$$R_{Corr}(k) = \begin{cases} g(\varphi(k)) & \text{if } \varphi(k) < \theta_W \text{ (input lagging)} \\ g(\psi(k)) & \text{if } \psi(k) < \theta_W \text{ (input leading)} \\ 0 & \text{otherwise,} \end{cases} \quad \text{where } \frac{dg(x)}{dx} \le 0. \tag{2.5}$$
Equations 2.4 and 2.5 specify the response of a linear or nonlinear correlation-based PD, respectively, to a pair of input and RCO events.

2.2.3 Difference-Based PD. Two types of difference-based PDs, analogous to the ones considered above, are analyzed in detail and shown to have similar response characteristics, which are abstracted to characterize the response of general difference-based PDs.

Simple example of a threshold-activated PD. Here the external input events evoke superthreshold activation of fixed strength and duration, while the RCO events evoke inhibitory activation of a similar strength and duration. The difference-based PD responds at a constant rate when the overall activation is superthreshold, that is, when an external event but not an RCO event occurred during the last window $\theta_W$, as shown schematically in Figure 2. Consequently, the instantaneous output rate is given by

$$r_{Diff}(\theta_i, \theta_o) = r_0\, U(\theta_W - \theta_i)\, U(\theta_o - \theta_W). \tag{2.6}$$
In the linear case considered in the context of the correlation-based PD, the lumped rate parameter is given by

$$R_{Diff}(k) = \begin{cases} r_0\,\varphi(k) & \text{if } \varphi(k) < \theta_W \\ r_0\,\psi(k) & \text{if } \psi(k) < \theta_W \\ r_0\,\theta_W & \text{otherwise.} \end{cases} \tag{2.7}$$

The window is assumed short enough that at most one of the first two conditions, corresponding to input lagging or leading, respectively, holds during regular operation.

Simple example of a gated PD. In this case, the external input involves a burst of spikes, which is relayed by the PD except for the duration of the gate, which blocks the PD response. Using the parameters defined above, the instantaneous output rate of the gated PD is the same as that for the constant rate PD, and equation 2.7 describes the resulting lumped rate parameter due to an isolated external event.

General difference-based PD. According to equation 2.7, the response of a difference-based PD increases linearly with the phase or co-phase in the respective working regions. In general, the response of a difference-based PD may be nonlinear, but its derivative should characteristically be positive:

$$R_{Diff}(k) = \begin{cases} g(\varphi(k)) & \text{if } \varphi(k) < \theta_W \text{ (input lagging)} \\ g(\psi(k)) & \text{if } \psi(k) < \theta_W \text{ (input leading)} \\ g(\theta_W) & \text{otherwise,} \end{cases} \quad \text{where } \frac{dg(x)}{dx} \ge 0. \tag{2.8}$$

Equations 2.7 and 2.8 specify the response of a linear or nonlinear difference-based PD, respectively, to a pair of input and RCO events.
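The linear difference-based response of equation 2.7 is the mirror image of the correlation-based sketch given earlier (again, the function name is ours):

```python
def r_diff(phase, cophase, theta_w, r0=1.0):
    """Lumped rate of the linear difference-based PD (equation 2.7):
    minimal when input and RCO events coincide, increasing linearly
    with the relevant phase variable and saturating at r0 * theta_w."""
    if phase < theta_w:       # input lagging
        return r0 * phase
    if cophase < theta_w:     # input leading
        return r0 * cophase
    return r0 * theta_w       # RCO event outside the window
```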
Table 1: PLL Characterization.

- Correlation-based PD, ePLL (ζ ≤ 1): input lagging; relevant phase variable, phase ϕ; steady phase $\varphi_\infty = g^{-1}[(1-\zeta)]$; linear case $\varphi_\infty = (r_0\theta_w - 1 + \zeta)/r_0$; as ζ → 1, ϕ∞ ↑.
- Correlation-based PD, iPLL (ζ ≥ 1): input leading; relevant phase variable, co-phase ψ; steady co-phase $\psi_\infty = g^{-1}[(\zeta-1)]$; linear case $\psi_\infty = (r_0\theta_w + 1 - \zeta)/r_0$; as ζ → 1, ψ∞ ↑.
- Difference-based PD, ePLL (ζ ≤ 1): input leading; relevant phase variable, co-phase ψ; steady co-phase $\psi_\infty = g^{-1}[(1-\zeta)]$; linear case $\psi_\infty = (1-\zeta)/r_0$; as ζ → 1, ψ∞ ↓.
- Difference-based PD, iPLL (ζ ≥ 1): input lagging; relevant phase variable, phase ϕ; steady phase $\varphi_\infty = g^{-1}[(\zeta-1)]$; linear case $\varphi_\infty = (\zeta-1)/r_0$; as ζ → 1, ϕ∞ ↓.
2.3 PLL Stable Response. During stable 1:1 phase entrainment to a periodic, external input, the response of the PD (and thus of the PLL) is sensitive to either the phase or the co-phase depending on the type of the PD and its connection to the RCO. A PLL in which the PD connection to the RCO is excitatory is referred to as an excitatory PLL (ePLL), while a PLL in which the PD is connected to the RCO via an inhibitory interneuron is referred to as an inhibitory PLL (iPLL) (Ahissar, 1998). The following theorem characterizes the operation of the different PLLs and is summarized in Table 1.

Theorem 1: PLL Characterization. During stable 1:1 entrainment of ePLL/iPLL, the input is lagging or leading, respectively, when the PD is correlation based, and leading or lagging, respectively, when the PD is difference based. Within the working range (i.e., ζ ≤ 1 or ζ ≥ 1), as the input period approaches the intrinsic period of the RCO, the corresponding phase variable, that is, the steady-state phase ϕ∞ for lagging input or the steady-state co-phase ψ∞ for leading input, increases when the PD is correlation based and decreases when the PD is difference based.

Proof. Considering an interval of time during which the input is consistently lagging (i.e., ϕ(k) < θW for all the input events in the interval), the phase of the (k + 1)th input is related to the phase of the kth input by $\varphi(k+1) = \varphi(k) + \zeta - \tau_p(k)/\tau_{RCO}$. According to equation 2.2 and either equation 2.5 or 2.8, the perturbed period is given by

$$\tau_p(k)/\tau_{RCO} = 1 \mp g(\varphi(k)), \tag{2.9}$$

regardless of the type of the PD, so

$$\varphi(k+1) = \varphi(k) + \zeta - 1 \pm g(\varphi(k)). \tag{2.10}$$

When the input is consistently leading (i.e., ψ(k) < θW), the co-phase of the (k + 1)th input is related to the co-phase of the kth input by $\psi(k+1) = \psi(k) - \zeta + \tau_p/\tau_{RCO}$. According to equation 2.2 and either equation 2.5 or 2.8, the perturbed period is given by

$$\tau_p(k)/\tau_{RCO} = 1 \mp g(\psi(k)), \tag{2.11}$$

regardless of the type of the PD, so

$$\psi(k+1) = \psi(k) - \zeta + 1 \mp g(\psi(k)). \tag{2.12}$$

Equations 2.10 and 2.12 imply that the equilibrium condition is specified by $\varphi_\infty = g^{-1}[\pm(1-\zeta)]$ when the input is lagging and by $\psi_\infty = g^{-1}[\pm(1-\zeta)]$ when the input is leading. Furthermore, the stability condition is given by $-2 < \pm \left.\frac{dg(x)}{dx}\right|_{\varphi_\infty} < 0$ when the input is lagging and by $-2 < \mp \left.\frac{dg(x)}{dx}\right|_{\psi_\infty} < 0$ when the input is leading. For the correlation-based PD, $dg/dx \le 0$, so the ePLL stabilizes with lagging input at $\varphi_\infty = g^{-1}[(1-\zeta)]$ and the iPLL stabilizes with leading input at $\psi_\infty = g^{-1}[\zeta - 1]$. For the difference-based PD, $dg/dx \ge 0$, so the opposite holds.

The period of the input to an ePLL is shorter than the intrinsic period of the RCO, so ζ ≤ 1. As the frequency of the input approaches the intrinsic frequency of the RCO, ζ ↑ 1, so (1 − ζ) ↓ 0. The period of the input to an iPLL is longer than the intrinsic period of the RCO, so ζ ≥ 1. As the frequency of the input approaches the intrinsic frequency of the RCO, ζ ↓ 1, so (ζ − 1) ↓ 0. For a correlation-based PD, $dg/dx \le 0$, and so both the steady phase ϕ∞ and the steady co-phase ψ∞ increase as the frequency of the input approaches the intrinsic frequency of the RCO from below or above for the ePLL/iPLL, respectively. For the difference-based PD, $dg/dx \ge 0$, so both the steady phase ϕ∞ and the steady co-phase ψ∞ decrease as the frequency of the input approaches the intrinsic frequency of the RCO from below or above for the ePLL/iPLL, respectively.

The linear operating curves of the different PLLs, given in Table 1, are depicted in Figure 4 in terms of the absolute delay as a function of the input period. The different panels depict the effect of the nominal rate $r_0$, and the parallel curves within each panel depict the effect of the intrinsic period $\tau_{RCO}$. It is noted that the operating range increases as the nominal rate increases. However, according to the proof of the PLL characterization theorem, $r_0$ should be less than 2 to ensure stability, so only the top panels depict stable (top left) and marginally stable (top right) operating curves.
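The proof's update rule is easy to check numerically. The sketch below iterates equation 2.10 for a lagging input; the starting phase and iteration count are arbitrary choices of ours, and the example assumes the input stays within the PD window.

```python
def settle_phase(zeta, g, sign=+1, phi0=0.05, n_iter=100):
    """Iterate equation 2.10, phi(k+1) = phi(k) + zeta - 1 +/- g(phi(k)),
    toward the steady-state phase (assumes the input stays lagging,
    i.e., phi remains inside the PD window theta_w)."""
    phi = phi0
    for _ in range(n_iter):
        phi = phi + zeta - 1 + sign * g(phi)
    return phi

# Example (linear correlation-based ePLL): g(x) = r0 * (theta_w - x)
# with r0 = 1, theta_w = 0.4, zeta = 0.9 converges to phi_inf
# = (r0*theta_w - 1 + zeta)/r0 = 0.3, the linear-case entry in Table 1.
print(settle_phase(0.9, lambda x: 1.0 * (0.4 - x)))  # approx. 0.3
```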
Figure 4: Linear steady-state curves describing the absolute delay between the external input and internal oscillatory events as a function of the input period for four types of PLLs (ePLL and iPLL, each with a Corr-PD or a Diff-PD). The curves shift in parallel as the intrinsic period of the RCOs is increased from 80 to 120 msec in steps of 10 msec, as indicated by the arrows in the bottom left panel. The nominal rate $r_0$ is 1 (top left panel), 2 (top right panel), 5 (bottom left panel), and 10 (bottom right panel).
Based on the PLL characterization theorem, we can classify the PLLs into two groups according to whether they are sensitive to the phase or co-phase of the input relative to the intrinsic oscillator. The phase-sensitive PLLs include (a1) the ePLL with correlation-based PD and (a2) the iPLL with difference-based PD, while the co-phase-sensitive PLLs include (b1) the iPLL with a correlation-based PD and (b2) the ePLL with a difference-based PD.

3 Temporal Decoding

3.1 Vibrissal Temporal Decoding Task. The entrainment of the PLL by a periodic input prepares the PLL to properly decode a novel input. In order to clarify this subtle issue, we consider in more detail the encoding of object location in the vibrissal system. The location of the object is encoded in the firing pattern of neurons in the trigeminal ganglion and probably also in the brainstem. In particular, the firing patterns of trigeminal neurons, which provide the external input to the vibrissal system, include two components (Kleinfeld et al., 1999; Szwed et al., 2003): (1) a reference signal composed of spikes at a preferred phase of the whisking cycle and (2) a contact-induced signal composed of spikes that are evoked on contact with an external object. The first component is periodic at the whisking period. The second component is the novel input whose time of occurrence relative to the reference signal (the first component) has to be decoded.

Table 2: Total Response Rt of Phase-Locked NPLLs to a Novel Input. (Leading/lagging refer to the novel input relative to the RCO event, with co-phase ψn and phase ϕn, respectively.)

General PD:
- Correlation-based ePLL: leading, g(ϕ∞) + g(ψn); lagging, g(ϕ∞) + g(ϕn).
- Correlation-based iPLL: leading, g(ψ∞) + g(ψn); lagging, g(ψ∞) + g(ϕn).
- Difference-based ePLL: leading, g(ψ∞) + g(ψn); lagging, g(ψ∞) + g(ϕn).
- Difference-based iPLL: leading, g(ϕ∞) + g(ψn); lagging, g(ϕ∞) + g(ϕn).

Linear PD:
- Correlation-based ePLL: leading, r0[2θw − (ϕ∞ + ψn)]; lagging, r0[2θw − (ϕ∞ + ϕn)].
- Correlation-based iPLL: leading, r0[2θw − (ψ∞ + ψn)]; lagging, r0[2θw − (ψ∞ + ϕn)].
- Difference-based ePLL: leading, r0(ψ∞ + ψn); lagging, r0(ψ∞ + ϕn).
- Difference-based iPLL: leading, r0(ϕ∞ + ψn); lagging, r0(ϕ∞ + ϕn).
3.2 Effect of Novel Input. The external input to the PD is composed of two components: the reference, periodic input, and the novel input. We make the simple and physiologically appropriate assumption that the PD’s response to each of these components is the same and independent of each other, so the total response of the PD is the sum of the individual responses. For the gated PD described in section 2.2.3, for example, once the gate is opened by the RCO, the PD relays the bursts of activity that it receives from either or both of the external inputs. Using equations 2.4, 2.5, 2.7, and 2.8 for the response to either component of the external input, the total response Rt of the different NPLLs can be derived as summarized in Table 2 and depicted in Figure 5. As evident from Table 2 and Figure 5, the total PD response varies monotonically with the delay between the novel input and the oscillatory event (i.e., ϕn or ψn ) as long as the novel input is confined to either always lead or always lag the RCO event. However, in order to provide sensory decoding, the response should vary monotonically with the delay between the novel input and the reference events. The relevant decoding ranges are specified by the temporal detection theorem stated and proven in the next section.
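Under the summation assumption just stated, the total response of, say, an entrained correlation-based NPLL can be sketched by reusing r_corr from the section 2.2.2 sketch; phi_ref denotes the steady phase of the reference input, and the function name is ours.

```python
def total_response_corr(phi_ref, phase_nov, cophase_nov, theta_w, r0=1.0):
    """Total PD response (Table 2, correlation-based columns): the
    responses to the reference input (at its steady phase phi_ref) and
    to the novel input are assumed to sum independently."""
    r_ref = r0 * (theta_w - phi_ref)                          # g(phi_ref)
    r_nov = r_corr(phase_nov, cophase_nov, theta_w, r0)       # novel term
    return r_ref + r_nov
```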
Figure 5: Total PD response as a function of the phase ϕn (increasing to the right) and co-phase ψn (increasing to the left) of a novel input to a PLL that is entrained by a periodic reference input with the indicated phase ϕ∞ or co-phase ψ∞ relationship. The left/right pairs of panels depict the total response of correlation-/difference-based PDs embedded in ePLLs (upper panels) and iPLLs (bottom panels). The arrows below the axes indicate the reference input, while the arrows above the axes indicate the RCO events. The roman numerals indicate the corresponding zone of the novel input, as listed in Table 3. The solid/dashed lines indicate the response when the novel input lags/leads the reference input.
3.3 PLL Temporal Decoding Capabilities.

Theorem 2: PLL Temporal Detection. During 1:1 stable phase locking to a periodic external input, a PLL can monotonically decode a novel input when it has a fixed order with respect to both the reference input and the RCO events. The resulting decoding ranges are specified in Table 3.

Proof. The output of the PD varies monotonically with the phase of the novel input along the RCO cycle as long as the order between them is fixed (second column of Table 3). The phase difference between the novel input
Table 3: Monotonic Decoding Ranges.

Novel Input/      Novel Input/   Zone          Correlation-Based PD      Difference-Based PD
Reference Input   RCO Events     (Figure 5)    ePLL         iPLL         ePLL         iPLL
Leading           Leading        I             θW           θW − ψ∞      θW − ψ∞      θW
Leading           Lagging        II            ϕ∞           0            0            ϕ∞
Lagging           Leading        III           0            ψ∞           ψ∞           0
Lagging           Lagging        IV            θW − ϕ∞      θW           θW           θW − ϕ∞
and the reference input varies monotonically with the phase of the novel input along the cycle of the RCO as long as the order between them is fixed (first column of Table 3). Hence, the response of the PD varies monotonically with the phase difference between the novel input and the reference input when the order of the novel input with respect to both the reference input and the RCO events is fixed, as specified by each row. Finally, the relevant ranges with the specified phase relationships between the novel input and both the reference input (first column of Table 3) and the RCO events (second column of Table 3) follow directly from the steady-state phase and co-phase of the reference input with respect to the RCO events.

It is apparent that the decoding range depends on whether the PLL is phase or co-phase sensitive. When the order between the novel input and the reference input is determined by the nature of the temporal decoding task, it is possible to distinguish between two decoding modes: (1) a narrow but monotonic, and thus unambiguous, decoding range (e.g., a correlation-based ePLL decoding a novel input that lags the reference input over the range θw − ϕ∞; bottom row of the third column in Table 3; see also the top-left panel of Figure 5), and (2) a wide but partially ambiguous detection range (e.g., a correlation-based iPLL decoding a novel input that lags the reference input over the range θw + ψ∞; bottom two rows of the fourth column in Table 3; see also the bottom-left panel of Figure 5). The ambiguity stems from the fact that the order of the novel input with respect to the RCO events is not constrained in this case. Thus, the temporal detection theorem provides a design criterion for selecting the PLL that best matches the requirements of a given temporal decoding task.

The sensory information is encoded in the phase difference δ between the novel input and the reference input and can be expressed in terms of the phase of the novel input with respect to the closest RCO event and the phase of the reference input with respect to the same RCO event, as specified in Table 4. The following PLL temporal decoding theorem specifies how this informative phase difference may be determined from the response of the PD.
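Before turning to the decoding theorem, the design criterion of Table 3 can be captured in a small look-up. The sketch below is our own illustration (names are hypothetical); it returns the monotonic decoding range for each zone of Figure 5 given the gating window and the steady-state phase or co-phase.

```python
# Our illustration of Table 3: monotonic decoding ranges for zones I-IV.
# Phase-sensitive variants (correlation-based ePLL, difference-based
# iPLL) are parameterized by phi_inf; co-phase-sensitive variants
# (correlation-based iPLL, difference-based ePLL) by psi_inf.

def decoding_ranges(theta_w, steady, co_phase_sensitive):
    """Return {zone: decoding range}; `steady` is phi_inf or psi_inf."""
    if co_phase_sensitive:
        return {'I': theta_w - steady, 'II': 0.0,
                'III': steady, 'IV': theta_w}
    return {'I': theta_w, 'II': steady,
            'III': 0.0, 'IV': theta_w - steady}
```

For a novel input that always lags the reference input (zones III and IV together), a co-phase-sensitive variant offers a combined range of ψ∞ + θw versus θw − ϕ∞ for a phase-sensitive one, which is exactly the wide-but-ambiguous versus narrow-but-unambiguous trade-off described above.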
Table 4: Phase Difference δ Between the Novel Input and the Reference Input.

Novel Input/      Novel Input/     Phase-Sensitive   Co-Phase-Sensitive
Reference Input   RCO Events       PLLs              PLLs
Leading           Leading (ψn)     ϕ∞ + ψn           ψn − ψ∞
Leading           Lagging (ϕn)     ϕ∞ − ϕn           NA
Lagging           Leading (ψn)     NA                ψ∞ − ψn
Lagging           Lagging (ϕn)     ϕn − ϕ∞           ψ∞ + ϕn
Table 5: Parameters of the Relationship δ = a + b R∞/r0 + c Rt/r0 Specifying the Phase Difference δ Between the Novel Input and the Reference Input as a Function of the Steady-State PD Response (R∞) and Total Response (Rt) of the PD.

Novel Input/      Novel Input/     Correlation-Based PD                 Difference-Based PD
Reference Input   RCO Events       ePLL              iPLL               ePLL              iPLL
Leading           Leading (ψn)     a=2; b=0; c=−1    a=0; b=2; c=−1     a=0; b=−2; c=1    a=0; b=0; c=1
Leading           Lagging (ϕn)     a=0; b=−2; c=1    NA                 NA                a=0; b=2; c=−1
Lagging           Leading (ψn)     NA                a=0; b=−2; c=1     a=0; b=2; c=−1    NA
Lagging           Lagging (ϕn)     a=0; b=2; c=−1    a=2; b=0; c=−1     a=0; b=0; c=1     a=0; b=−2; c=1
Theorem 3: PLL Temporal Decoding. Consider a PLL that is phase-locked to a periodic reference signal, and denote by R∞ the steady-state response of its PD. A novel input induces an additional response, so the total response of the PD is given by Rt. The phase difference δ between the novel input and the reference input may be determined by δ = a + b R∞/r0 + c Rt/r0, with the parameters given in Table 5 for the specific PLL variant.

Proof. The PLL decoding theorem follows directly from Tables 3 and 4 after expressing the steady-state phase or co-phase in terms of the steady-state response R∞ using equations 2.4 and 2.7.

It is noted that in some cases, the computation involves the steady-state PD response R∞. This may be made available by PLLs that do not receive the novel input and thus continue to respond at R∞ even when the novel input appears. However, when operating in the regime for which the specific
PLL variant has the maximum decoding range (as specified in Table 3), the steady-state PD response is not required. Specifically, when the novel input lags both the RCO event and the reference event, the phase difference δ may be directly inferred from the PD response of the co-phase-sensitive PLLs (e.g., an iPLL with a correlation-based PD) after an appropriate offset.
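Theorem 3 reduces decoding to a table of coefficients. The following sketch is ours, not the authors'; the dictionary keys and function name are hypothetical, and the a = 2 entries presuppose the article's normalization of phases relative to the gating window (reading a = 2 as 2θw with θw = 1 is our assumption).

```python
# Hypothetical decoder for Theorem 3: delta = a + b*R_inf/r0 + c*Rt/r0,
# with (a, b, c) copied from Table 5. Keys are
# (pd_type, loop_type, novel_vs_reference, novel_vs_rco); None marks the
# NA cases, in which the variant cannot decode that configuration.

TABLE5 = {
    ('corr', 'ePLL', 'lead', 'lead'): ( 2,  0, -1),
    ('corr', 'ePLL', 'lead', 'lag'):  ( 0, -2,  1),
    ('corr', 'ePLL', 'lag',  'lead'): None,
    ('corr', 'ePLL', 'lag',  'lag'):  ( 0,  2, -1),
    ('corr', 'iPLL', 'lead', 'lead'): ( 0,  2, -1),
    ('corr', 'iPLL', 'lead', 'lag'):  None,
    ('corr', 'iPLL', 'lag',  'lead'): ( 0, -2,  1),
    ('corr', 'iPLL', 'lag',  'lag'):  ( 2,  0, -1),
    ('diff', 'ePLL', 'lead', 'lead'): ( 0, -2,  1),
    ('diff', 'ePLL', 'lead', 'lag'):  None,
    ('diff', 'ePLL', 'lag',  'lead'): ( 0,  2, -1),
    ('diff', 'ePLL', 'lag',  'lag'):  ( 0,  0,  1),
    ('diff', 'iPLL', 'lead', 'lead'): ( 0,  0,  1),
    ('diff', 'iPLL', 'lead', 'lag'):  ( 0,  2, -1),
    ('diff', 'iPLL', 'lag',  'lead'): None,
    ('diff', 'iPLL', 'lag',  'lag'):  ( 0, -2,  1),
}

def decode_delta(key, R_inf, R_t, r0):
    """Phase difference between novel and reference input (Theorem 3)."""
    coeffs = TABLE5[key]
    if coeffs is None:
        return None          # not decodable for this variant/configuration
    a, b, c = coeffs
    return a + b * R_inf / r0 + c * R_t / r0
```

Consistent with the remark above, both co-phase-sensitive variants have b = 0 in the lagging/lagging row, so the steady-state response R∞ drops out exactly where those variants enjoy their maximum decoding range.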
3.4 Significance for Vibrissal Temporal Decoding.

3.4.1 Decoding Range. The whisking-locked reference signal is evoked upon the onset of the protraction phase of whisking, that is, the phase of forward movement, while the contact-induced signal is evoked later during protraction, upon contact with the object (Szwed et al., 2003). Thus, the contact-induced novel input lags the whisking-induced reference input. According to the temporal detection theorem, the decoding ranges that can be achieved in this case by the different PLL variants are specified in the last two rows of Table 3. In particular, the phase-sensitive PLLs (i.e., the ePLL with correlation-based PD and the iPLL with difference-based PD; zone IV, solid curves in the upper-left and lower-right panels of Figure 5) provide an unambiguous but narrow decoding range, while the co-phase-sensitive PLLs (i.e., the iPLL with correlation-based PD and the ePLL with difference-based PD; zones III and IV, solid curves in the lower-left and upper-right panels of Figure 5) provide a wide but partially ambiguous decoding range.

In the latter case, the response is ambiguous when the novel input lags the reference input by less than twice the co-phase ψ∞, that is, when the contact with the object occurs relatively close to the preferred phase of the whisking cycle. However, the response is still informative, since it provides an approximate indication of the phase of the novel input; furthermore, the ambiguity can be resolved by considering the responses of a population of PLLs that receive reference signals produced at different preferred phases (Ahissar, 1998). Hence, it can be concluded that in the case of vibrissal temporal decoding, the widest detection and decoding ranges are obtained with co-phase-sensitive PLLs, for which the input leads the intrinsic oscillator during stable entrainment, in agreement with the observed oscillatory delay (Nicolelis et al., 1995; Ahissar et al., 1997).

The two co-phase-sensitive PLLs, that is, the iPLL with a correlation-based PD and the ePLL with a difference-based PD, differ in their input operating ranges, which include input periods that are longer or shorter than the intrinsic period of the oscillator, respectively (see Table 1, first row). Recordings from whisking-range oscillatory neurons in the somatosensory cortex indicate that they track mainly frequencies below their spontaneous frequency (Ahissar et al., 1997). Thus, given the above theorems, these observations suggest that the somatosensory cortex participates in the implementation of iPLLs with correlation-based PDs.
3.4.2 Frequency Modulation Experiments. Additional support for the above conclusion may be drawn from frequency modulation experiments in which the whiskers are stimulated by air puffs whose frequency is modulated by a slowly varying sinusoidal signal. The responses of neurons along the paralemniscal pathway (one of the two major vibrissal sensory pathways) followed the oscillatory input with varying latencies and spike counts. As the frequency of the input varied sinusoidally between 3 and 7 Hz at a modulating frequency of 0.5 Hz, so did the latency and the spike count of these neurons. However, while the former varied in phase with the frequency of the stimulus, the latter varied in antiphase (Ahissar et al., 2000; Ahissar, Sosnik, Bagdasarian, & Haidarliu, 2001). In particular, the latencies and spike counts of cortical neurons in layer 5a, which receive input from thalamic neurons in the medial division of the posterior nucleus (POm) along the paralemniscal pathway, were inversely related, as plotted in Figure 6 (connected stars).

Under the hypothesis that the thalamocortical loops in the vibrissal paralemniscal system implement NPLLs, the thalamic neurons should act as PDs (Ahissar et al., 1997). According to section 2.2.2, the observed relationship in neurons of layer 5a indicates that the thalamic neurons that drive these cortical neurons behave as correlation-based PDs. Thus, considering the four possible PLL variants, the observed latencies and the inverse relationship are consistent only with the assumption that the paralemniscal pathway implements iPLLs with correlation-based PDs, in agreement with our previous conclusion. Indeed, simulations of an iPLL with a correlation-based PD demonstrate a similar relationship, as shown in Figure 6. The circles in Figure 6 depict the relationship between the spike count and the latency of the response of the simulated PD to a frequency-modulated spike train input. To facilitate comparison, the linear fit to the measured data is marked by a dashed line, demonstrating good agreement with the simulated results.
4 Single Neural Oscillators

A single neuron can also be modeled as a PLL (Hoppensteadt, 1986). However, as indicated by equation 2.1, single neural oscillators are sensitive to the phase at which the input events occur, not to the co-phase. This is indeed the basis for characterizing neural oscillators using phase-response curves. In particular, a single neural oscillator, described by equation 2.1, is equivalent to a PLL with lagging input as described by equation 2.9, where f(ϕ) = g(ϕ). However, a single neural oscillator cannot operate as an NPLL with leading input, as described by equation 2.11, since its dynamics depend only on the phase, never the co-phase, of the input.

As discussed above, the PLL temporal detection theorem suggests that single oscillators, which can be sensitive only to the phase but not to the co-phase, would provide a narrow decoding range when the novel input
Figure 6: Average spike count versus average latency of paralemniscal cortical neurons recorded from layer 5a of the barrel cortex in experiments with frequency-modulated (FM) stimuli (Ahissar et al., 2000; Ahissar et al., 2001). Data points are marked with stars, each representing the latency and spike count for one cycle, averaged across 36 repetitions of the same FM sequence (first cycle excluded). Results from consecutive cycles are connected by a solid line. The dashed line is a linear fit to the data with a slope of −0.08, and the circles are generated from a simulated iPLL with correlation-based PD in response to a frequency-modulated input spike train.
lags the reference input. Hence, we conclude that single oscillators are not optimal for temporal decoding of the whisking-induced signals.

Since single neurons can be sensitive only to the phase, and not the co-phase, of the external input, the PLL characterization theorem implies that they can implement either an ePLL with a correlation-based PD or an iPLL with a difference-based PD. To be able to track frequencies below their spontaneous frequency (Ahissar et al., 1997), single neurons should operate as iPLLs with difference-based PDs.

5 Summary and Discussion

5.1 Temporal Decoding Tasks. In the context of neural information processing, temporal decoding refers to the ability to respond in a way that is sensitive to the temporal pattern of neural activity, not just its average
spike rate. This article addresses a specific temporal decoding task: one that is sensitive to the phase of an information-carrying signal (a spike train) relative to a periodic reference signal (a periodic spike train). Phase-decoding capabilities facilitate the interpretation of neural activity evoked during active touch or active vision (Ahissar & Arieli, 2001). In such active processes, the controlled movements of the sensory organs evoke the reference spike train, while the sensed features of the environment evoke the information-carrying signal. During whisking, for example, the sensory organs are the flexible whiskers, which scan the environment rhythmically, and the relevant feature is the position of an object in that environment. The angle of contact, and thus the relative angular position of the object, can be inferred from the phase along the whisking cycle at which the contact occurred.

5.2 Temporal Decoding Capabilities of PLLs. PLLs are well-developed electronic circuits designed to track periodic signals over a wide frequency range with good noise-rejection performance. The output of the internal oscillator reproduces a cleaned version of the original signal, while the PD followed by a low-pass filter demodulates the input signal. Similarly, neuronal PLLs may be used to track the period of the input spike train and encode its variations in the output of the PD, the number of spikes per cycle. In this mode, the sensitivity of the PLL may be defined as the change in the output of the PD induced by a small change in the period of the input. Previous work (Zacksenhouse, 2001) indicates that the sensitivity of the iPLL is relatively constant compared with the sensitivity of single-neuron oscillators.

By tracking the frequency of the input, PLLs can also be used to detect the relative phase of a novel input, and thus accomplish the phase-decoding task, which is critical for the interpretation of active sensation, as discussed above. Specifically, the PD decodes the phase of the novel input with respect to the periodic activity of the internal oscillator. However, the internal oscillator of an entrained PLL is phase-locked to the reference input, and so the PLL indirectly decodes the phase of the novel input with respect to that reference spike train. The performance of PLLs with respect to this task is the focus of this article.

The four PLL variants, involving correlation- and difference-based PDs with either inhibitory (iPLL) or excitatory (ePLL) connections, operate in two locking modes, which are sensitive to either the phase or the co-phase of the input. In particular, the iPLL with correlation-based PD and the ePLL with difference-based PD are sensitive to the co-phase of the input, and thus establish a unique response pattern that cannot be produced by single-neuron oscillators. The operating range over which the timing of the novel input may be decoded with respect to the reference signal has been determined and provides a design criterion for selecting the PLL variant that best matches the requirements of a given task.
5.3 Circuit-Based versus Single-Neuron PLLs. The relationship between the operation of a single neuron and that of a phase-locked loop was suggested and extensively explored in Hoppensteadt (1986). The cell body is modeled as a voltage-controlled oscillator (VCO, equivalent to the RCO here), and the synaptic effect as a monotonically increasing nonlinear function of the combined effect of the VCO and the external input. The resulting model was shown here to be equivalent to either an iPLL with a difference-based PD or an ePLL with a correlation-based PD.

5.4 Temporal Decoding in the Vibrissal System. The hypothesis that temporal decoding in the vibrissal system is facilitated by neural circuits implementing PLLs has received substantial support from a range of observations: (1) the existence of neural oscillators in the relevant range of frequencies (Ahissar et al., 1997); (2) the existence of PD-like neurons, which exhibit frequency-dependent gated outputs (Ahissar et al., 2000); (3) phase locking of oscillators and PD-like neurons to a range of input frequencies (Ahissar et al., 1997); (4) monotonic direct relationships between input frequencies and locking phases in both oscillators and PD-like neurons (Ahissar et al., 2000); (5) monotonic inverse relationships between input frequencies and locking spike counts in PD-like neurons (Ahissar et al., 2000), which depend on the length of the stimulus burst (Ahissar et al., 2001; Sosnik et al., 2001); and (6) an estimated sensitivity that agrees with the observed marginal stability (Zacksenhouse, 2001). In this article, we provide additional theoretical support for this hypothesis based on the decoding ranges of the different PLL variants. The nature of the temporal decoding task facing the vibrissal system suggests that the widest detection and decoding ranges may be achieved with the co-phase-sensitive PLLs, which cannot be implemented by single neural oscillators. Furthermore, new observations from frequency-modulation experiments are described that support the hypothesis that the vibrissal system implements iPLLs with correlation-based PDs.

5.5 Control Capabilities of PLLs. Neural oscillators play an important role not only in decoding but also in generating temporal patterns (Zacksenhouse, 2001). In particular, networks of coupled oscillators are assumed to generate the patterns of rhythmic movements that underlie a diverse range of rhythmic tasks, including locomotion (Nishii, Uno, & Suzuki, 1994; Rand et al., 1986) and chewing (Rowat & Selverston, 1993). These networks can generate their patterns of activity even in the absence of any sensory feedback and are thus referred to as central pattern generators (CPGs). However, feedback may still play an important role in these tasks (Ekeberg, 1993; Grillner et al., 1995), in particular in tasks that involve open-loop unstable dynamical systems. The closed-loop control of such tasks, and in particular the control of yoyo playing with oscillatory units, revealed additional advantages of PLLs over single-neuron oscillators (Jin & Zacksenhouse, 2002, 2003). In
this application, the neural oscillator determines when to start the upward movement and receives a once-per-cycle input at a characteristic phase of the movement, when the yoyo reaches the bottom of its flight. As discussed here, single neural oscillators, or single-cell PLLs, may establish only input-lagging phase relationships, while neural network PLLs may also establish a unique input-leading phase relationship. The latter was demonstrated to have critical control advantages, which are essential in the context of yoyo playing (Jin & Zacksenhouse, 2002, 2003). Thus, the unique temporal detection characteristics of circuit PLLs also provide control capabilities beyond those of directly coupled neural oscillators.
Acknowledgments

This work was supported by the United States-Israel Binational Science Foundation Grant #2003222, the MINERVA Foundation, the Human Frontiers Science Programme, and the Fund for the Promotion of Research at the Technion. E.A. holds the Helen Diller Family Professorial Chair of Neurobiology.
References

Ahissar, E. (1998). Temporal-code to rate-code conversion by neuronal phase-locked loops. Neural Comput., 10, 597–650.
Ahissar, E., & Arieli, A. (2001). Figuring space by time. Neuron, 22, 185–201.
Ahissar, E., Haidarliu, S., & Zacksenhouse, M. (1997). Decoding temporally encoded sensory input by cortical oscillations and thalamic phase comparators. Proc. Natl. Acad. Sci. USA, 94, 11633–11638.
Ahissar, E., Sosnik, R., Bagdasarian, K., & Haidarliu, S. (2001). Temporal frequency of whisker movement. II. Laminar organization of cortical representations. J. Neurophysiol., 86, 354–367.
Ahissar, E., Sosnik, R., & Haidarliu, S. (2000). Transformation from temporal to rate coding in somatosensory thalamocortical pathway. Nature, 406, 302–305.
Ahissar, E., & Vaadia, E. (1990). Oscillatory activity of single units in a somatosensory cortex of an awake monkey and their possible role in texture analysis. Proc. Natl. Acad. Sci. USA, 87, 8935–8939.
Ahissar, E., & Zacksenhouse, M. (2001). Temporal and spatial coding in the rat vibrissal system. Prog. in Brain Res., 130, 75–87.
Amitai, Y. (1994). Membrane potential oscillations underlying firing patterns in neocortical neurons. Neuroscience, 63, 151–161.
Ekeberg, O. (1993). A neuro-mechanical model of undulatory swimming. Biol. Cybern., 69, 363–374.
Flint, A. C., Maisch, U. S., & Kriegstein, A. R. (1997). Postnatal development of low [Mg2+] oscillations in neocortex. J. Neurophysiol., 78, 1990–1996.
Gardner, F. M. (1979). Phaselock techniques (2nd ed.). New York: Wiley.
Grillner, S., Deliagina, T., Ekeberg, O., El Manira, A., Hill, R. H., Lansner, A., Orlovsky, G. N., & Wallen, P. (1995). Neural networks that co-ordinate locomotion and body orientation in the lamprey. Trends Neurosci., 18, 270–279.
Hoppensteadt, F. C. (1986). An introduction to the mathematics of neurons. Cambridge: Cambridge University Press.
Jin, H., & Zacksenhouse, M. (2002). Necessary condition for simple oscillatory neural control of robotic yoyo. In Int. Joint Conf. on Neural Networks, World Cong. on Intell. Comp. (IJCNN-WCCI'02) (pp. 1427–1432). Honolulu, HI.
Jin, H., & Zacksenhouse, M. (2003). Oscillatory neural control of dynamical systems. IEEE Trans. Neural Networks, 14(2), 317–325.
Jones, S. R., Pinto, D. J., Kaper, T. J., & Kopell, N. (2000). Alpha-frequency rhythms desynchronize over long cortical distances: A modeling study. J. Comp. Neurosci., 9, 271–291.
Kawato, M., & Suzuki, R. (1978). Biological oscillators can be stopped—topological study of a phase response curve. Biol. Cybern., 30, 241–248.
Kleinfeld, D., Berg, R. W., & O'Connor, S. M. (1999). Anatomical loops and their electrical dynamics in relation to whisking by rat. Somatosensory and Motor Res., 16(2), 69–88.
Lebedev, M. A., & Nelson, R. J. (1995). Rhythmically firing (20–50 Hz) neurons in monkey primary somatosensory cortex: Activity patterns during initiation of vibratory-cued hand movements. J. Comp. Neurosci., 2, 313–334.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Freeman.
Nicolelis, M. A. L., Baccala, L. A., Lin, R. C. S., & Chapin, J. K. (1995). Sensorimotor encoding by synchronous neural ensemble activity at multiple levels of the somatosensory system. Science, 268, 1353–1358.
Nishii, J., Uno, Y., & Suzuki, R. (1994). Mathematical models for the swimming pattern of a lamprey I and II. Biol. Cybern., 72, 1–9, 11–18.
Perkel, D. H., Schulman, J. H., Bullock, T. H., Moore, G. P., & Segundo, J. P. (1964). Pacemaker neurons: Effects of regularly spaced synaptic input. Science, 145, 61–63.
Pinto, D. J., Jones, S. R., Kaper, T. J., & Kopell, N. (2003). Analysis of state-dependent transitions in frequency and long-distance coordination in a model oscillatory cortical circuit. J. Comp. Neurosci., 15, 283–298.
Rand, R. H., Cohen, A. H., & Holmes, P. J. (1986). Systems of coupled oscillators as models of central pattern generators. In A. H. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates. New York: Wiley.
Rowat, P. F., & Selverston, A. I. (1993). Modeling the gastric mill central pattern generator of the lobster with a relaxation-oscillator network. J. Neurophysiol., 70(3), 1030–1053.
Silva, L. R., Amitai, Y., & Connors, B. W. (1991). Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science, 251, 432–435.
Sosnik, R., Haidarliu, S., & Ahissar, E. (2001). Temporal frequency of whisker movement. I. Representations in brainstem and thalamus. J. Neurophysiol., 86, 339–353.
Snyder, D. L. (1975). Random point processes. New York: Wiley.
Szwed, M., Bagdasarian, K., & Ahissar, E. (2003). Encoding of vibrissal active touch. Neuron, 40, 621–630.
Winfree, A. T. (1980). Geometry of biological time. Berlin: Springer.
Yamanishi, J., Kawato, M., & Suzuki, R. (1980). Two coupled oscillators as a model for the coordinated finger tapping by both hands. Biol. Cybern., 37, 219–225.
Zacksenhouse, M. (2001). Sensitivity of basic oscillatory mechanisms for pattern generation and detection. Biol. Cybern., 85(4), 301–311.
Received September 10, 2004; accepted September 29, 2005.
LETTER
Communicated by Mark Ungless
Representation and Timing in Theories of the Dopamine System Nathaniel D. Daw
[email protected] UCL, Gatsby Computational Neuroscience Unit, London, WC1N 3AR, U.K.
Aaron C. Courville
[email protected] Carnegie Mellon University, Robotics Institute and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.
David S. Touretzky
[email protected] Carnegie Mellon University, Computer Science Department and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.
Although the responses of dopamine neurons in the primate midbrain are well characterized as carrying a temporal difference (TD) error signal for reward prediction, existing theories do not offer a credible account of how the brain keeps track of past sensory events that may be relevant to predicting future reward. Empirically, these shortcomings of previous theories are particularly evident in their account of experiments in which animals were exposed to variation in the timing of events. The original theories mispredicted the results of such experiments due to their use of a representational device called a tapped delay line. Here we propose that a richer understanding of history representation and a better account of these experiments can be given by considering TD algorithms for a formal setting that incorporates two features not originally considered in theories of the dopaminergic response: partial observability (a distinction between the animal's sensory experience and the true underlying state of the world) and semi-Markov dynamics (an explicit account of variation in the intervals between events). The new theory situates the dopaminergic system in a richer functional and anatomical context, since it assumes (in accord with recent computational theories of cortex) that problems of partial observability and stimulus history are solved in sensory cortex using statistical modeling and inference and that the TD system predicts reward using the results of this inference rather than raw sensory data. It also accounts for a range of experimental data, including the experiments involving programmed temporal variability and other previously unmodeled dopaminergic response phenomena,
which we suggest are related to subjective noise in animals' interval timing. Finally, it offers new experimental predictions and a rich theoretical framework for designing future experiments.

Neural Computation 18, 1637–1677 (2006)
1 Introduction

The responses of dopamine neurons in the primate midbrain are well characterized by a temporal difference (TD) (Sutton, 1988) reinforcement learning (RL) theory, in which neuronal spiking is supposed to signal error in the prediction of future reward (Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). Although such theories have been influential, a key computational issue remains: How does the brain keep track of those sensory events that are relevant to predicting future reward, when the rewards and their predictors may be separated by long temporal intervals?

The problem traces to a disconnect between the physical world and the abstract formalism underlying TD learning. The formalism is the Markov process, a model world that proceeds stochastically through a series of states, sometimes delivering reward. The TD algorithm learns to map each state to a prediction about the reward expected in the future. This is possible because, in a Markov process, future states and rewards are conditionally independent of past events, given only the current state. There is thus no need to remember past events: the current state contains all information relevant to prediction.

This assumption is problematic when it comes to explaining experiments on dopamine neurons, which often involve delayed contingencies. In a typical experiment, a monkey learns that a transient flash of light signals that, after a 1-second delay, a drop of juice will be delivered. Because of this temporal gap, the animal's immediate sensory experiences (gap, flash, gap, juice) do not by themselves correspond to the states of a Markov process. This example also demonstrates that these issues of memory are tied up with issues of timing—in this case, marking the passage of the 1-second interval.

Existing TD theories of the dopamine system address these issues using variations on a device called a tapped delay line (Sutton & Barto, 1990), which redefines the state to include a buffer of previous sensory events within some time window. If the window is large enough to encompass relevant history, which is assumed in the dopamine theories, then the augmented states form a Markov process, and TD learning can succeed. Clearly, this approach fudges an issue of selection: How can the brain adaptively decide which events should be remembered, and for how long? In practice, the tapped delay line is also an awkward representation for predicting events whose timing can vary. As a result, the theory incorrectly predicted the firing of dopamine neurons in experiments in which the timing of events was
varied (Hollerman & Schultz, 1998; Fiorillo & Schultz, 2001). This problem has received only limited attention (Suri & Schultz, 1998, 1999; Daw, 2003).

In this letter, we take a deeper look at these issues by adopting a more appropriate formalism for the experimental situation. In particular, we propose modeling the dopamine response using a TD algorithm for a partially observable semi-Markov process (also known as a hidden semi-Markov model), which generalizes the Markov process in two ways. This richer formalism incorporates variability in the timing between events (semi-Markov dynamics; Bradtke & Duff, 1995) and a distinction between the sensory experience and the underlying but only partially observable state (Kaelbling, Littman, & Cassandra, 1998).

The established theory of RL with partial observability offers an elegant approach to maintaining relevant sensory history. The idea is to use Bayesian inference, with a statistical description ("world model") of how the hidden process evolves, to infer a probability distribution over the likely values of the unobservable state. If the world model is correct, this inferred state distribution incorporates all relevant history (Kaelbling et al., 1998) and can itself be used in place of the delay line as a state representation for TD learning.

Applied to theories of the dopamine system, this viewpoint casts new light on a number of issues. The system is viewed as making predictions using an inferred state representation rather than raw sensory history. This reframes the problem of representing adequate stimulus history in the computationally more familiar terms of learning an appropriate world model. It also situates the dopamine neurons in a broader anatomical and functional context, since predominant models of sensory cortex envision it performing the sort of world modeling and hidden state inference we require (Doya, 1999; Rao, Olshausen, & Lewicki, 2002). Combined with a number of additional assumptions (notably, about the relative strength of positive and negative error representation in the dopamine response; Niv, Duff, & Dayan, 2005; Bayer & Glimcher, 2005), the new model accounts for puzzling results on the responses of dopamine neurons when event timing is varied; further, armed with this account of temporal variability, we consider the effect of noise in internal timing processes and show that this can address other experimental phenomena. Previous models can be viewed as approximations to the new one under appropriate limits.

The rest of the letter is organized as follows. In section 2 we discuss previous models and how they cope with temporal variability. We realize our own account of the system in several stages. We begin in section 3 with a general overview of the pieces of the model. In section 4, we develop and simulate a fully observable semi-Markov TD model, and in the following section we generalize it to the partially observable case. As a limiting case of the complete, partially observable model, the simpler model is appropriate for analyzing the complete model's behavior in many situations. After presenting results about the behavior of each model, we discuss to what
extent its predictions are upheld experimentally. Finally, in section 6, we conclude with more general discussion.

2 Previous Models

In this section, we review the TD algorithm and its use in models of the dopamine response, focusing on the example of a key, problematic experiment. These models address unit recordings of dopamine neurons in primates performing a variety of appetitive conditioning tasks (for review, see Schultz, 1998). These experiments can largely be viewed as variations on Pavlovian trace conditioning, a procedure in which a transient cue such as a flash of light is followed after some interval by a reinforcer, regardless of the animal's actions. In fact, some of the experiments were conducted using delay conditioning (in which the initial stimulus is not punctate but rather prolonged to span the gap between stimulus onset and reward) or involved instrumental requirements (that is, the cue signaled the monkey to perform some behavioral response, such as a key press, to obtain a reward). For most of the data considered here, no notable differences in dopamine behavior have been observed between these methodological variations, to the extent that comparable experiments have been done. Thus, here, and in common with much prior modeling work on the system, we will neglect action selection and stimulus persistence and idealize the tasks as Pavlovian trace conditioning.

2.1 The TD Algorithm. The TD theory of the dopamine response (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997) involves modeling the experimental situation as a Markov process and drawing on the TD algorithm for reward prediction in such processes (Sutton, 1988). Such a process comprises a set S of states, a transition function T, and a reward function R. The process has discrete dynamics: at each time step t, a real-valued reward rt and a successor state st+1 ∈ S are drawn. The distributions over rt and st+1 are specified by the functions T and R and depend only on the value of the current state st. In modeling the experimental situation, the process time steps are taken to correspond to short, constant-length blocks of real time, and the state corresponds to some representation, internal to the animal, of relevant experimental stimuli. We can define a value function mapping states to expected cumulative discounted future reward,
Vs ≡ E[ Σ_{τ=t}^{t_end} γ^{τ−t} rτ | st = s ],    (2.1)
where the expectation is taken with respect to stochasticity in the state transitions and reward magnitudes, tend is the time the current trial ends, and γ is a parameter controlling the steepness of temporal discounting. The goal of the TD algorithm is to use samples of states st and rewards rt to learn an approximation Vˆ to the true value function V. If such an estimate were correct, it would satisfy

Vˆ st = E[rt + γ Vˆ st+1 | st ],    (2.2)
which is just the value function definition rewritten recursively. The TD learning rule is based on this relation: given a sample of a pair of adjacent states and an intervening reward, the TD algorithm nudges the estimate Vˆ st toward rt + γ Vˆ st+1. The change in Vˆ st is thus proportional to the TD error,

δt = rt + γ Vˆ st+1 − Vˆ st ,    (2.3)
with values updated as Vˆ st ← Vˆ st + ν · δt for learning rate ν. In this article, we omit consideration of eligibility traces, as appear in the TD-λ algorithm (Sutton, 1988; Houk et al., 1995; Sutton & Barto, 1998). These would allow error at time t to directly affect states encountered some time steps before, an elaboration that can speed up learning but does not affect our general argument.

2.2 TD Models of the Dopamine Response.

2.2.1 Model Specification. The TD models of the dopamine response assume that dopamine neurons fire at a rate proportional to the prediction error δt added to some constant background activity level, so that positive δt corresponds to neuronal excitation and negative δt to inhibition. They differ in details of the value function definition (e.g., whether discounting is used) and in how the state is represented as a function of the experimental stimuli. Here we roughly follow the formulation of Montague et al. (1996; Schultz et al., 1997), on which most subsequent work has built. The state is taken to represent both current and previous stimuli, represented using tapped delay lines (Sutton & Barto, 1990). Specifically, assuming the task involves only a single stimulus, the state st is defined as a binary vector whose ith element is one if the stimulus was last seen at time t − i and zero otherwise. For multiple stimuli, the representation is the concatenation of several such history vectors, one for each stimulus. Importantly, reward delivery is not represented with its own delay line. In fact, reward delivery is assumed to have no effect on the state representation. As illustrated in Figure 1, stimulus delivery sets off a cascade of internal states, whose progression, once per time step, tracks the time since stimulus
Figure 1: The state space for a tapped delay line model of a trace conditioning experiment. The stimulus initiates a cascade of states that mark time relative to it. If the interstimulus interval is deterministic, the reward falls in one such state. ISI: interstimulus interval; ITI: intertrial interval.
delivery. These time steps are taken to correspond to constant slices of real time, of duration perhaps 100 ms. The value is estimated linearly as the dot product of the state vector with a weight vector: Vˆ st = st · wt. For the case of a single stimulus, this is equivalent to a table maintaining a separate value for each of the marker states shown in Figure 1.

2.2.2 Account for Basic Findings. The inclusion of the tapped delay line enables the model to mimic basic dopamine responses in trace conditioning (Montague et al., 1996; Schultz et al., 1997). Dopamine neurons burst to unexpected rewards or reward-predicting signals, when δt is positive, and pause when an anticipated reward is omitted (and δt is negative). The latter is a timed response and occurs in the model because value is elevated in the marker states intervening between stimulus and reward. If reward is omitted, the difference γ Vˆ st+1 − Vˆ st in equation 2.3 is negative at the state where the reward was expected, so negative error is seen when that state is encountered without reward.

2.2.3 Event Time Variability. This account fails to predict the response of dopamine neurons when the timing of rewards is varied from trial to trial (Hollerman & Schultz, 1998; see also Fiorillo & Schultz, 2001). Figure 2 (left) shows the simulated response when a reward is expected 1 second after the stimulus but instead delivered 0.5 second early (top) or late (bottom) in occasional probe trials. The noteworthy case is what follows early reward. Experiments (Hollerman & Schultz, 1998) show that the neurons burst to the early reward but do not subsequently pause at the time reward was originally expected. In contrast, because reward arrival does not affect the model's stimulus representation, the delay line will still subsequently arrive in the state in which reward is usually received. There, when the reward is not delivered again, the error signal will be negative, predicting (contrary to experiment) a pause in dopamine cell firing at the time the reward was originally expected.
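Both the basic account and the early-reward misprediction are easy to reproduce numerically. The sketch below is our Python reconstruction, not code from the original work; the learning rate, trial length, and ten-step stimulus-reward interval are arbitrary illustrative choices.

```python
import numpy as np

# Tabular sketch (ours) of the tapped-delay-line model. With a single
# stimulus, the one-hot delay-line state makes the linear value estimate
# V = s . w equivalent to a lookup table indexed by the number of time
# steps since the stimulus (the marker states of Figure 1).

gamma, lr = 0.95, 0.1     # discount and learning rate (our choices)
T, reward_step = 20, 10   # trial length; reward 10 steps after stimulus
V = np.zeros(T + 1)       # one value per marker state; V[T] is terminal

def run_trial(V, r_step, learn=True):
    """One trial; returns the TD error of equation 2.3 at each step."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if t == r_step else 0.0
        deltas[t] = r + gamma * V[t + 1] - V[t]
        if learn:
            V[t] += lr * deltas[t]   # TD value update
    return deltas

for _ in range(1000):                # train with reward at the usual time
    run_trial(V, reward_step)

# Early-reward probe trial, without learning: the delay line keeps
# running after the reward, so it still reaches the usual reward state.
probe = run_trial(V, reward_step - 5, learn=False)
print(probe[reward_step - 5])  # positive burst to the early reward
print(probe[reward_step])      # spurious negative error (Figure 2, left)
```

Terminating the probe trial at reward delivery would implement the reset device discussed below and would eliminate the second, spurious negative error, as in Figure 2 (right).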
Figure 2: TD error (simulated dopamine response) in two tapped delay line models (γ = 0.95) of a conditioning task in which reward was delivered early (top) or late (bottom) in occasional probe trials. (Middle: reward delivered at normal time.) (Left) In the standard model, positive error is seen to rewards arriving at unexpected times and negative error is seen on probe trials at the time reward had been originally expected. (Right) In a modified version of the model in which reward resets the delay line, no negative error is seen following an early reward.
It might seem that this problem could be solved simply by adding a second delay line to represent the time since reward delivery, so that the model could learn not to expect a second reward after an early one. However, in the experiment discussed here, the model might not have had the opportunity for such learning. Since early rewards were delivered only in occasional probe trials, value predictions were presumably determined by experience with the reward occurring at its normal time. Further, even given extensive experience with early rewards, two tapped delay lines plus a linear value function estimator could never learn the appropriate discrimination, because (as is easy to verify) the expected future value at different points in the task is a nonlinear function of the two-delay-line state representation.

A number of authors have proposed fixing this misprediction by assuming that reward delivery resets the representational system, for example, by clearing the delay line representing time since the stimulus (Suri & Schultz, 1998, 1999; Brown, Bullock, & Grossberg, 1999). This operation negates all pending predictions and avoids negative TD error when they fail to play out. Figure 2 (right) verifies that this device eliminates the spurious inhibition after an early reward. However, it is unclear from the original work under what circumstances such a reset is justified or appropriate, and doubtful that this simple, ad hoc rule generalizes properly to other situations. We return to these considerations in the discussion. Here, we investigate a more
systematic approach to temporal representation in such experiments, based on the view of stimulus history taken in work on partially observable Markov processes (Kaelbling et al., 1998). On this view, a (in principle, learnable) world model is used to determine relevant stimulus history. Therefore, we outline an appropriate family of generative models for reinforcer and stimulus delivery in conditioning tasks: the partially observable semi-Markov process.

3 A New Model: A Broad Functional Framework

In this article, we specify a new TD model of the dopamine system incorporating semi-Markov dynamics and partial observability. Our theory envisions the dopaminergic value learning system as part of a more extensive framework of interacting learning systems than had been previously considered in dopaminergic theories. Figure 3 lays out the pieces of the model; those implemented in this article are shown in black. The idea of the theory is to address prediction in the face of partial observability by using a statistical model of the world's contingencies to infer a probability distribution over the world's (unobservable) state and then to use this inferred representation as a basis for learning to predict values using a TD algorithm. Thus, we have:
- A model learning system that learns a forward model of state transitions, state dwell times, and observable events. Similar functions are often ascribed to cortical areas, particularly prefrontal cortex (Owen, 1997; Daw, Niv, & Dayan, 2005).
Figure 3: Schematic of the interaction between multiple learning systems suggested in this article: model learning, state estimation (sensory cortex), value prediction (ventral striatum), action selection (dorsal striatum), and the TD error signal (dopamine, serotonin?). Those discussed in detail here are shown in black.
- A state estimation system that infers the world's state (and related latent variables) using sensory observations and the world model. This broadly corresponds to cortical sensory processing systems (Rao et al., 2002).
- A value prediction system that uses a TD error signal to map this inferred state representation to a prediction of future reward. This portion of the system works similarly to previous TD models of the dopamine system, except that we assume semi-Markov rather than Markov state transition dynamics. We associate this aspect of the model with the dopamine neurons and their targets (Schultz, 1998). Additionally, as discussed below, information about negative errors or aversive events, which may be missing from the dopaminergic error signal, could be provided by other systems such as serotonin (Daw, Kakade, & Dayan, 2002).
In this article, since we are studying asymptotic dopamine responding, we assume the correct model has already been learned and do not explicitly perform model learning (though we have studied it elsewhere in the context of theories of conditioning behavior; Courville & Touretzky, 2001; Courville, Daw, Gordon, & Touretzky, 2003; Courville, Daw, & Touretzky, 2004). Model fitting can be performed by variations on the expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977); a version for hidden semi-Markov models is presented by Guedon and Cocozza-Thivent (1990). Here we would require an online version of these methods (as in Courville & Touretzky, 2001).

RL is ultimately concerned with action selection in Markov decision processes, and it is widely assumed that the dopamine system is involved in control as well as prediction. In RL approaches such as actor-critic (Sutton, 1984), value prediction in a Markov process (as studied here) is a subproblem useful for learning action selection policies. Hence we assume that there is also:
- An action selection system that uses information from the TD error (or perhaps the learned value function) to learn an action selection policy. Traditionally, this is associated with the dopamine system's targets in the dorsal striatum (O'Doherty et al., 2004).
As we are focused on dopamine responses in Pavlovian tasks, we do not address policy learning in this article.

4 A New Model: Semi-Markov Dynamics

We build our model in two stages, starting with a model that incorporates semi-Markov dynamics but not partial observability. This simplified model is useful for both motivating the description of the complete model and studying its behavior, since the simplified model is easier to analyze and
represents a good approximation to the complete model's behavior under certain conditions.

4.1 A Fully Observable Semi-Markov Model. A first step toward addressing issues of temporal variability in dopamine experiments is to adopt a formalism that explicitly models such variability. Here we generalize the TD models presented in section 2 to use TD in a semi-Markov process, which adds temporal variability to the state transitions. In a semi-Markov process, state transitions occur as in a Markov process, except that they occur irregularly: the dwell time for each state visit is randomly drawn from a distribution associated with the state. In addition to transition and reward functions (T and R), semi-Markov models contain a function D specifying the dwell time distribution for each state. The process is known as semi-Markov because, although the identities of successor states obey the Markov conditional independence property, the probability of a transition at a particular instant depends not just on the current state but also on the time that has already been spent there. We model rewards and stimuli as instantaneous events occurring on the transition into a state.

We require additional notation. It can at times be useful to index random variables either by their time t or by a discrete index k that counts state transitions. The time spent in state sk is dk, drawn conditional on sk from the distribution specified by the function D. If the system entered that state at time τ, delivering reward rk, then we can also write that st = sk for all τ ≤ t < τ + dk and rt = rk for t = τ, while rt = 0 for τ < t < τ + dk.

It is straightforward to adapt standard reinforcement learning algorithms to this richer formal framework, a task first tackled by Bradtke and Duff (1995). Our formulation is closer to that of Mahadevan, Marchalleck, Das, and Gosavi (1997; Das, Gosavi, Mahadevan, & Marchalleck, 1999). We use the value function
(4.1)
where the expectation is now taken additionally with respect to randomness in the dwell time dk . There are two further changes to the formulation here. First, for bookkeeping purposes, we omit the reward rk received on entering state sk from that state’s value. Second, in place of the exponentially discounted value function of equation 2.2, we use an average reward formulation, in which ρ ≡ limn→∞ (1/n) · t+n−1 rτ is the average reward τ =t per time step. This represents a limit of the exponentially discounted case as the discounting factor γ → 1 (for details, see Tsitsiklis & Van Roy, 2002) and has some useful properties for modeling dopamine responses (Daw & Touretzky, 2002; Daw et al., 2002). Following that work, we will henceforth assume that the value function is infinite horizon, that is, when written in
Figure 4: The state space for a semi-Markov model of a trace conditioning experiment. States model intervals of time between events, which vary according to the distributions sketched in the insets. Stimulus and reward are delivered on state transitions. ISI: interstimulus interval; ITI: intertrial interval. Here, the ISI is constant, while the ITI is drawn from an exponential distribution.
the unrolled form of equation 2.1 as a sum of rewards, the sum does not terminate on a trial boundary but rather continues indefinitely.

In TD for semi-Markov processes (Bradtke & Duff, 1995; Mahadevan et al., 1997; Das et al., 1999), value updates occur irregularly, whenever there is a state transition. The error signal is

δk = rk+1 − ρk · dk + Vsk+1 − Vsk ,    (4.2)

where ρk is now subscripted because it must be estimated separately (e.g., by the average reward over the last n states, ρk = Σ_{k′=k−n+1}^{k+1} rk′ / Σ_{k′=k−n}^{k} dk′).

4.2 Connecting This Theory to the Dopamine Response. Here we discuss a number of issues related to simulating the dopamine response with the algorithm described in the previous section.

4.2.1 State Representation. To connect equation 4.2 to the firing of dopamine neurons, we must relate the states sk to the observable events. In the present, fully observable case, we take them to correspond one to one. In this model, a trace conditioning experiment has a very simple structure, consisting of two states that capture the intervals of time between events (see Figure 4; compare Figure 1). The CS is delivered on entry into the state labeled ISI (for interstimulus interval), while the reward is delivered on entering the ITI (intertrial interval) state. This formalism is convenient for reasoning about situations in which interevent intervals can vary, since such variability is built into the model. Late or early rewards, for instance, just correspond to longer or shorter times spent in the ISI state.

Although the model assumes input from a separate timing mechanism—in order to measure the elapsed interval dk between events used in the update equation—the passage of time does not by itself have any effect on
the modeled dopamine signal. Instead, TD error is triggered only by state transitions, which are here taken to be always signaled by external events. Thus, this simple scheme cannot account for the finding that dopamine neurons pause when reward is omitted (Schultz et al., 1997). (It would instead assume the world remains in the ISI state, with zero TD error, until another event occurs, signaling a state transition and triggering learning.) In section 5, we handle these cases by using the assumption of partial observability to infer a state transition in the absence of a signal; however, when states are signaled reliably, that model will reduce to this one. We thus investigate this model in the context of experiments not involving reward omission.

4.2.2 Scaling of Negative Error. Because the background firing rates of dopamine neurons are low, excitatory responses have a much larger magnitude (measured by spike count) than inhibitory responses thought to represent the same absolute prediction error (Niv, Duff, et al., 2005). Recent work quantitatively comparing the firing rate to estimated prediction error confirms this observation and suggests that the dopamine response to negative error is rescaled or partially rectified (Bayer & Glimcher, 2005; Fiorillo, Tobler, & Schultz, 2003; Morris, Arkadir, Nevet, Vaadia, & Bergman, 2004). This fact can be important when mean firing rates are computed by averaging dopamine responses over trials containing both positive and negative prediction errors, since the negative errors will be underrepresented (Niv, Duff, et al., 2005). To account for this situation, we assume that dopaminergic firing rates are proportional to δ + ψ, positively rectified, where ψ is a small background firing rate. We average this rectified quantity over trials to simulate the dopamine response. The pattern of direct proportionality, with rectification beneath a small level of negative error, is consistent with the experimental results of Bayer and Glimcher (2005).

Note that we assume that values are updated based on the complete error signal, with the missing information about negative errors reported separately to targets (perhaps by serotonin; Daw et al., 2002). An alternative possibility (Niv, Duff, et al., 2005) is that complete negative error information is present in the dopaminergic signal, though scaled differently, and targets are properly able to decode such a signal. There are as yet limited data to support or distinguish among these suggestions, but the difference is not material to our argument here. This article explores the implications for dopaminergic recordings of the asymmetry between bursts and pauses. Such asymmetry is empirically well demonstrated and distinct from speculation as to how dopamine targets might cope with it.

4.2.3 Interval Measurement Noise. We will in some cases consider the effects of internal timing noise on the modeled dopamine signal. In the model, time measurements enter the error signal calculation through the estimated dwell time durations dk. Following behavioral studies (Gibbon,
1977), we assume that for a constant true duration, these vary from trial to trial with a standard deviation proportional to the length of the true interval. We take the noise to be normally distributed.

4.3 Results: Simulated Dopamine Responses in the Model. Here we present simulations demonstrating the behavior of the model in various conditions. We consider unsignaled and signaled reward and the effect of externally imposed or subjective variability in the timing of events. Finally, we discuss experimental evidence relating to the model's predictions.

4.3.1 Results: Free Reward Delivery. The simplest experimental finding about dopamine neurons is that they burst when animals receive random, unsignaled reward (Schultz, 1998). The semi-Markov model's explanation for this effect is different from the usual TD account. This "free reward" experiment can be modeled as a semi-Markov process with a single state (see Figure 5, bottom right). Assuming Poisson delivery of rewards with magnitude r, mean rate λ, and mean interreward interval θ = 1/λ, the dwell times are exponentially distributed. We examine the TD error using equation 4.2. The state's value V̂ is arbitrary (since it only appears subtracted from itself in the error signal), and ρ = r/θ asymptotically. The TD error on receiving a reward of magnitude r after a delay d is thus

δ = r − ρd + V̂ − V̂    (4.3)
  = r(1 − d/θ),    (4.4)
which is positive if d < θ and negative if d > θ, as illustrated in Figure 5 (left). That is, the TD error is relative to the expected delay θ: rewards occurring earlier than usual have higher value than expected, and conversely for later-than-average rewards. Figure 5 (right top) confirms that the semi-Markov TD error averaged over multiple trials is zero. However, due to the partial rectification of inhibitory responses, excitation dominates in the average over trials of the simulated dopamine response (see Figure 5, right middle), and net phasic excitation is predicted.

4.3.2 Results: Signaled Reward and Timing Noise. When a reward is reliably signaled by a stimulus that precedes it, dopaminergic responses famously transfer from the reward to the stimulus (Schultz, 1998). The corresponding semi-Markov model is the two-state model of Figures 4 and 6a. (We assume the stimulus timing is randomized.) As in free-reward tasks, the model predicts that the single-trial response to an event can vary from positive to negative depending on the interval preceding it, and, if sufficiently variable, the response averaged over trials will skew excitatory due to partial rectification of negative errors.
Figure 5: TD error to rewards delivered freely at Poisson intervals, using the semi-Markov TD model of equation 4.2. The state space consists of a single state (illustrated bottom right), with reward delivered on entry. (Left) Error in a single trial ranges from strongly positive through zero to strongly negative (top to bottom), depending on the time since the previous reward. Traces are aligned with respect to the previous reward. (Right) Error averaged over trials, aligned on the current reward. Right top: Mean TD error over trials is zero. Right middle: Mean TD error over trials with negative errors partially rectified (simulated dopamine signal) is positive. Mean interreward interval: 5 sec; reward magnitude: 1; rectification threshold: −0.1.
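The logic of this simulation is simple enough to sketch in code. The snippet below is our own minimal reconstruction, not the authors' simulation code: it draws exponential interreward delays, applies equations 4.3 and 4.4, and compares the raw trial average with the average after negative errors are clipped at the figure's −0.1 rectification threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0        # mean interreward interval (sec), as in Figure 5
r = 1.0            # reward magnitude
threshold = -0.1   # rectification level for negative TD error
n_trials = 100_000

# Poisson reward delivery: exponentially distributed interreward delays
d = rng.exponential(theta, size=n_trials)

# Equations 4.3-4.4: TD error at each reward, with rho = r / theta
delta = r * (1.0 - d / theta)

# Simulated dopamine response: negative errors partially rectified
rectified = np.maximum(delta, threshold)

print(f"mean TD error:           {delta.mean():+.4f}")      # ~ 0 (right top)
print(f"mean rectified response: {rectified.mean():+.4f}")  # > 0 (right middle)
```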
Analogous to rewards, this is evident here also for cues, whose occurrence (but not timing) in this task is signaled by the reward in the previous trial. As shown in Figure 6a, the model thus predicts the transfer of the (trial-averaged) response from the wholly predictable reward to the temporally unpredictable stimulus. (Single-trial cue responses covary with the preceding intertrial interval in a manner exactly analogous to reward responses in Figure 5 and are not illustrated separately.) Variability in the stimulus-reward interval has analogous effects. If the stimulus-reward interval is jittered slightly (see Figure 6b), there is no effect on the average simulated dopamine response. This is because, in contrast to the situations considered thus far, minor temporal jitter produces only small negative and positive prediction errors, which fail to reach the threshold
of rectification and thus cancel each other out in the average. But if the variability is substantial, then a response is seen on average (see Figure 6c), because large variations in the interstimulus interval produce large positive and negative variations in the single-trial prediction error, exceeding the rectification threshold. Responding is broken out separately by delay in Figure 7. In general, the extent to which rectification biases the average dopaminergic response to be excitatory depends on how often, and by how much, negative TD error exceeds the rectification threshold. This in turn depends on the amount of jitter in the term −ρ_k · d_k in equation 4.2, with larger average rewards ρ and more sizable jitter in the interreward intervals d promoting a net excitatory response.

Figure 6: Semi-Markov model of experiments in which reward delivery is signaled. Tasks are illustrated as semi-Markov state spaces next to the corresponding simulated dopaminergic response. When the stimulus-reward interval is (a) deterministic or (b) only slightly variable, excitation is seen to the stimulus but not the reward. (c) When the stimulus-reward interval varies appreciably, excitation is seen in the trial-averaged reward response as well. (ITI changed between conditions to match average trial lengths.) (a) Mean ITI: 5 sec, ISI: 1 sec; (b) mean ITI: 4.75 sec, ISI: 1–1.5 sec uniform; (c) mean ITI: 3 sec, ISI: 1–5 sec uniform; reward magnitude: 1; rectification threshold: −0.1.
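A back-of-the-envelope version of this argument (and of the timing-noise effect shown later in Figure 8) can be checked numerically. With converged values, the single-trial error at a reward arriving after an ISI of duration d reduces to δ = ρ(E[d] − d), a consequence of equation 4.2 once learning has equilibrated; that equilibrium is an assumption of this sketch, and the parameters are borrowed from Figures 6 and 8.

```python
import numpy as np

rng = np.random.default_rng(1)
r, threshold, n = 1.0, -0.1, 200_000

def mean_rectified_reward_error(d, mean_iti):
    """Trial-averaged rectified TD error at reward, given per-trial ISI
    samples d. Assumes converged values, so delta = rho * (E[d] - d)."""
    rho = r / (mean_iti + d.mean())      # asymptotic average reward per second
    delta = rho * (d.mean() - d)
    return np.maximum(delta, threshold).mean()

# Programmed jitter (Figure 6b vs. 6c): small jitter stays inside the
# rectification band and cancels; large jitter yields net excitation.
print(mean_rectified_reward_error(rng.uniform(1.0, 1.5, n), mean_iti=4.75))  # ~ 0
print(mean_rectified_reward_error(rng.uniform(1.0, 5.0, n), mean_iti=3.0))   # > 0

# Subjective timing noise (Figure 8): gaussian measurement of a deterministic
# ISI with constant coefficient of variation (Gibbon, 1977).
cv = 0.5
print(mean_rectified_reward_error(rng.normal(1.0, cv * 1.0, n), mean_iti=3.0))  # small
print(mean_rectified_reward_error(rng.normal(6.0, cv * 6.0, n), mean_iti=3.0))  # larger
```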
Figure 7: Semi-Markov TD error to rewards occurring at different delays after a stimulus (earlier to later, top to bottom); same task as Figure 6c. (Left) Task illustrated as a semi-Markov state space; rewards arrive at random, uniformly distributed intervals after the stimulus. (Right) Model predicts a decline in reward response with delay, with inhibition for rewards later than average. Parameters as in Figure 6c.
Finally, a parallel effect can be seen when we consider the additional effect of subjective time measurement noise. We repeat the conditioning experiment with a deterministic programmed ISI but add variability due to modeled subjective noise in timing. Figure 8 demonstrates that this noise has negligible effect when the delay between stimulus and reward is small, but for a longer delay, the modeled dopaminergic response to the reward reemerges and cannot be trained away. This is because timing noise has a constant coefficient of variation (Gibbon, 1977) and is thus more substantial for longer delays.

Figure 8: Effect of timing noise on modeled dopaminergic response to signaled reward depends on the interval between stimulus and reward (ISI). (Left) For ISI = 1 sec, error variation due to timing noise is unaffected by rectification and response to reward is minimal. (Right) For ISI = 6 sec, error variation from timing noise exceeds rectification level, and response to reward emerges. Mean ITI: 3 sec; reward magnitude: 1; rectification threshold: −0.1; coefficient of variation of timing noise: 0.5.

4.4 Discussion: Data Bearing on These Predictions. We have shown that the semi-Markov model explains dopaminergic responses to temporally unpredictable free rewards and reward predictors (Schultz, 1998). Compared to previous models, this account offers a new and testably different pattern of explanation for these excitatory responses. On stimulus or reward receipt, the "cost" −ρ_k · d_k of the delay preceding it is subtracted from the phasic dopamine signal (see equation 4.2). Because of this subtraction, the model predicts that the single-trial phasic dopamine response should decline as the intertrial or interstimulus interval preceding it increases. This prediction accords with results (albeit published only in abstract form) of Fiorillo and Schultz (2001) from a conditioning experiment in which the stimulus-reward interval varied uniformly between 1 and 3 seconds. The theory further predicts that the response to later-than-average rewards should actually become negative; the available data are ambiguous on this point.¹ However, suggestively, all published dopamine neuron recordings exhibit noticeable trial-to-trial variability in excitatory responses (e.g., to temporally unpredictable stimuli), including many trials with a response at or below baseline. The suggestion that this variability reflects the preceding interevent interval could be tested with reanalysis of the data.

¹ The publicly presented data from Fiorillo and Schultz (2001) include spike counts for different stimulus-reward delays, supporting the conclusion that the mean response never extends below baseline. However, the accompanying spike rasters suggest that this conclusion may depend on the length of the time window over which the spike counts are taken.

These phenomena are not predicted by the original tapped delay line model. This is because, unlike the semi-Markov model, it assesses the cost of a delay not all at once in the phasic response to an event (so that variability in the delay impacts the event response) but instead gradually during the interval preceding it, on the passage through each marker state. (In particular, at each time step, the error includes a term of −ρ in the average reward formulation, or in the exponentially discounted version a related penalty arising from discounting; Daw & Touretzky, 2002.) On that account, rewards or reward predictors arriving on a Poisson or uniformly distributed random schedule should generally excite neurons regardless of their timing,
and the ubiquitous response variability must be attributed to unmodeled factors. The new theory also predicts that dopamine neurons should not respond when small amounts of temporal jitter precede an event, but that an excitatory response should emerge, on average, when the temporal variability is increased. This is consistent with the available data. In the experiment discussed above involving 1 to 3 second interstimulus intervals, Fiorillo and Schultz (2001) report net excitation to reward. Additionally, in an experiment involving a sequence of two stimuli predicting reward, neurons were excited by the second stimulus only when its timing varied (Schultz, Apicella, & Ljungberg, 1993). There is also evidence for tolerance of small levels of variability. In an early study (Ljungberg, Apicella, & Schultz, 1992), dopamine neurons did not respond to rewards ("no task" condition) or stimuli ("overtrained" condition) whose timing varied somewhat. Finally, the model predicts similar effects of subjective timing noise, and unpublished data support the model's prediction that it should be impossible to train away the dopaminergic response to a reward whose timing is deterministically signaled by a sufficiently distant stimulus (C. Fiorillo, personal communication, 2002). Thus, insofar as data are available, the predictions of the theory discussed in this section appear to be borne out. A number of these phenomena would be difficult to explain using a tapped delay line model. The major gap in the theory as presented so far is the lack of an account for experiments involving reward omission. We now show how to treat these as examples of partial observability. The model discussed so far follows as a limiting case of the resulting, more complex model whenever the world's state is directly observable.

5 A New Model: Partial Observability

Here we extend the model described in the previous section to include partial observability. We specify the formalism and discuss algorithms for state inference and value learning. Next, we present simulation results and analysis for several experiments involving temporal variability and reward omission. Finally, we discuss how the model's predictions fare in the light of available data.

5.1 A Partially Observable Semi-Markov Model. Partial observability results from relaxing the one-to-one correspondence between states and observables that was assumed above. We assume that there is a set O of possible observations (which we take, for presentational simplicity, as each instantaneous and binary) and that reward is simply a special observation. The state evolves as before, but it is not observable; instead, each state is associated with a multinomial distribution over O, specified by an observation function O. Thus, when the process enters state s_k, it emits an
accompanying observation o_k ∈ O according to a multinomial conditioned on the state. One observation in O is the null observation, ∅. If no state transition occurs at time t, then o_t = ∅. That is, nonempty observations can occur only on state transitions. Crucially, the converse is not true: state transitions can occur silently. This is how we model omitted reward in a trace conditioning experiment. We wish to find a TD algorithm for value prediction in this formalism. Most of the terms in the error signal of equation 4.2 are unavailable, because the states they depend on are unobservable. In fact, it is not even clear when to apply the update, since the times of state transitions are themselves unobservable. Extending a standard approach to partial observability in Markov processes (Chrisman, 1992; Kaelbling et al., 1998) to the semi-Markov case, we assume that the animal learns a model of the hidden process (that is, the functions T, O, and D determining the conditional probabilities of state transitions, observations, and dwell time durations). Such a model can be used together with the observations to infer estimates of the unavailable quantities. Note that given such a model, the values of the hidden states could in principle be computed offline using value iteration. (Since the hidden process is just a normal semi-Markov process, partial observability does not affect the solution.) We return to this point in the discussion; here, motivated by evidence of dopaminergic involvement in error-driven learning, we present an online TD algorithm for learning the same values by sampling, assisted by the model. The new error signal has a form similar to the fully observable case:

δ_{s,t} = β_{s,t} (r_{t+1} − ρ_t · E[d_t] + E[V̂_{s_{t+1}}] − V̂_s).    (5.1)
We discuss the differences, from left to right. First, the new error signal is state as well as time dependent. We compute an error signal for each state s at each time step t, on the hypothesis that the process transitioned out of s at t. The error signal is weighted by the probability that this is actually the case:

β_{s,t} ≡ P(s_t = s, φ_t = 1 | o_1, . . . , o_{t+1}),    (5.2)
where φ_t is a binary indicator that takes the value one if the state transitioned between times t and t + 1 (self-transitions count) and zero otherwise. Note that this determination is conditioned on observations made through time t + 1; this is chosen to parallel the one-time-step backup in the TD algorithm. β can be tracked using a version of the standard forward-backward recursions for hidden Markov models; equations are given in the appendix. The remaining changes in the error signal of equation 5.1 are the expected dwell time E[d_t] and the expected value of the successor state E[V̂_{s_{t+1}}]. These are computed from the observations using the model, again conditioned on the hypothesis that the system left state s at time t:

E[d_t] ≡ Σ_{d=1}^{∞} d · P(d_t = d | s_t = s, φ_t = 1, o_1, . . . , o_{t+1})    (5.3)

E[V̂_{s_{t+1}}] ≡ Σ_{s'∈S} V̂_{s'} P(s_{t+1} = s' | s_t = s, φ_t = 1, o_{t+1}).    (5.4)
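In code, the error computation itself is a direct transcription of equations 5.1 and 5.4; the real machinery lies in inferring β and E[d_t], which is deferred to the appendix. A minimal sketch, with array conventions of our own choosing:

```python
import numpy as np

def expected_next_value(V, T, O, s, o_next):
    """Equation 5.4: expected successor value given that the process left
    state s and then emitted observation o_next. T[s, s'] is the transition
    probability; O[s', o] is the emission probability on entering s'."""
    w = T[s] * O[:, o_next]        # joint weight of each candidate successor
    return V @ (w / w.sum())       # posterior-weighted value estimate

def td_error(beta_s_t, r_next, rho, E_dwell, E_next_V, V_s):
    """Equation 5.1: TD error for state s at time t, weighted by the inferred
    probability beta_s_t that the process transitioned out of s at t."""
    return beta_s_t * (r_next - rho * E_dwell + E_next_V - V_s)
```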
Expressions for computing these quantities using the inference model are given in the appendix. Assuming the inference model is correct (i.e., that it accurately captures the process generating the observations), this TD algorithm for value learning is exact in that it has the same fixed point as value iteration using the inference model. The proof is sketched in the appendix. Note also that in the fully observable limit (i.e., when s, d, and φ can be inferred with certainty), the algorithm reduces exactly to the semi-Markov rule of equation 4.2. Simulations (not reported here) demonstrate that in general, when the posterior distributions over these parameters are relatively well specified (i.e., when uncertainty is moderate), this algorithm behaves similarly to the semi-Markov algorithm described in section 4. The main differences come about in cases of considerable state uncertainty, as when reward is omitted.

We have described the TD error computation for learning values in a partially observable semi-Markov process. It may be useful to review how the computation actually proceeds. At each time step, the system receives a (possibly empty) observation or reward, and the representational system uses this to update its estimate of the state departure distribution β and other latent quantities. The TD learning system uses these estimates to compute the TD error δ, which is reported by the dopamine system (perhaps assisted by the serotonin system). Stored value estimates V̂ are updated to reduce the error, and the cycle repeats.

5.2 Connecting This Theory to the Dopamine Response. In order to finalize the specification of the model, we discuss several further issues about simulating the dopamine response with the partially observable semi-Markov algorithm.

5.2.1 Vector vs. Scalar Error Signals. As already mentioned, equation 5.1 is a vector rather than a scalar error signal, since it contains an error for each state's value. Previous models have largely assumed that the dopamine response reports a scalar error signal, supported by experiments showing striking similarity between the responses of most dopaminergic neurons (Schultz, 1998). However, there is some variability between neurons.
Notably, only a subset (55–75%) of neurons displays any particular sort of phasic response (Schultz, 1998). Also, several studies report sizable subsets of neurons showing qualitatively different patterns of responding in the same situation (e.g., excitation versus inhibition; Schultz & Romo, 1990; Mirenowicz & Schultz, 1996; Waelti, Dickinson, & Schultz, 2001; Tobler, Dickinson, & Schultz, 2003, though see Ungless, Magill, & Bolam, 2004). We suggest that dopamine neurons might code a vector error signal like equation 5.1 in some distributed manner and that this might account for response variation between neurons. Absent data from experiments designed to constrain such a hypothesis, we illustrate for this article the dopamine response as a scalar, cumulative error over all the states:

δ_t = Σ_{s∈S} δ_{s,t}.    (5.5)
This quantity may be interpreted in terms of either a vector or scalar model of the dopamine signal. If dopamine neurons uniformly reported this scalar signal, then values could be learned by apportioning the state-specific error according to β_{s,t} at targets (with V̂_s updated proportionally to δ_t · β_{s,t} / Σ_{s'∈S} β_{s',t}). This is a reasonable approximation to the full algorithm so long as there is only moderate state uncertainty and works well in our experience (simulations not reported here). The simplest vector signal would have different dopamine neurons reporting the state-dependent error δ_{s,t} for different states; the scalar error δ_t could then be viewed as modeling the sum or average over a large group of neurons. It is noteworthy that articles on dopamine neuron recordings predominantly report data in terms of such averages, accompanied by traces of a very few individual neurons. However, such a sparsely coded vector signal is probably unrealistic given the relative homogeneity reported for individual responses. A viable compromise might be a more coarsely coded vector scheme in which each dopamine neuron reports the cumulative TD error over a random subset of states. For large enough subsets (e.g., more than 50% of states per neuron), such a scheme would capture both the overall homogeneity and the limited between-neuron variability of single-unit responses. In this case, the aggregate error signal from equation 5.5 would represent both the average over neurons and, roughly, a typical single-neuron response.

5.2.2 World Models and Asymptotic Model Uncertainty. As already mentioned, because our focus is on the influence of an internal model on asymptotic dopamine responding rather than on the process of learning such a model, for each task we take as given a fixed world model based on the actual structure that generated the task events. For instance, for trace-conditioning experiments, the model is based on Figure 4. Importantly, however, we assume that animals never become entirely certain about the world's precise contingencies; each model is thus systematically altered to include asymptotic uncertainty in its distributions over dwell times, transitions, and observations. This variance could reflect irreducible sensor noise (e.g., time measurement error) and persistent uncertainty due to assumed nonstationarity in the contingencies being modeled (Kakade & Dayan, 2000, 2002). We represent even deterministic dwell-time distributions as gaussians with nonzero variance. Similarly, the multinomials describing observation and state transition probabilities attribute nonzero probability even to anomalous events (such as state self-transitions or reward omission). The effects of these modifications on the semi-Markov model of trace conditioning are illustrated in Figure 9.

Figure 9: State space for semi-Markov model of trace conditioning experiment, with asymptotically uncertain dwell time distributions and observation models (entering ISI emits stim 96%, reward 2%, nothing 2%; entering ITI emits reward 96%, stim 2%, nothing 2%). For simplicity, analogous noise in the transition probabilities (small chance of self-transition) is not illustrated.

5.3 Results: Simulated Dopaminergic Responding in the Model. We first consider the basic effect of reward omission. Figure 10 (left top) shows the effect on a trace-conditioning task, with the inferred state distribution illustrated by a shaded bar under the trace. As time passes without reward, inferred probability mass leaks into the ITI state (shown as the progression from black to white in the bar, blown up on the inset), accompanied by negative TD error. Because the model's judgment that the reward was omitted occurs progressively, the predicted dopaminergic inhibition is slightly delayed and prolonged compared to the expected time of reward delivery. Repeated reward omission also reduces the value predictions attributed to preceding stimuli, which in turn has an impact on the dopaminergic responses to the stimuli and to the subsequent rewards when they arrive. Figure 11 shows how, asymptotically, the degree of partial reinforcement trades off relative responding between stimuli and rewards.
Figure 10: Simulated dopamine responses from partially observable semi-Markov model of trace conditioning, with reward omitted or delivered at an unexpected time. (Left top) Reward omission inhibits response, somewhat after the time reward was expected. (Left bottom) State space of inference model. (Right) When reward is delivered earlier (top) or later (bottom) than expected, excitation is seen to reward. No inhibition is seen after early rewards, at the time reward was originally expected. Shaded bars show inferred state (white: ITI, black: ISI). Mean ITI: 5 sec; ISI: 1 sec; reward magnitude: 1; rectification threshold: −0.1; probability of anomalous event in inference model: 2%; CV of dwell time uncertainty: 0.08.
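The inference model of Figure 9 and the progressive omission judgment just described can be made concrete in a few lines. This is a hypothetical encoding of our own (state and observation indices, time step, and helper names are ours, not the authors' code); it writes down the model's arrays and then, tracking a single ISI dwell in isolation, pits the dwell-time survival probability against the 2% chance of a silent transition:

```python
import numpy as np

ISI, ITI = 0, 1
NOTHING, STIM, REWARD = 0, 1, 2
dt = 0.05   # discretization time step (sec); our choice

# Transitions: states mostly alternate, with a 2% anomalous self-transition
T = np.array([[0.02, 0.98],
              [0.98, 0.02]])

# Emissions on state entry: stimulus marks ISI entry, reward marks ITI entry,
# each leaving 2% for anomalous observations (cf. Figure 9)
O = np.array([[0.02, 0.96, 0.02],    # enter ISI: nothing / stim / reward
              [0.02, 0.02, 0.96]])   # enter ITI

def dwell_pmf(mean, cv=0.08, t_max=8.0):
    """Discretized gaussian dwell-time distribution; never fully deterministic."""
    d = np.arange(dt, t_max + dt, dt)
    p = np.exp(-0.5 * ((d - mean) / (cv * mean)) ** 2)
    return p / p.sum()

D = {ISI: dwell_pmf(1.0), ITI: dwell_pmf(5.0)}   # ISI ~ 1 s, ITI ~ 5 s

# Reward omission: with only null observations, inferred mass leaks from ISI
# into ITI (silent exit) as the dwell-time survival probability collapses
pmf = D[ISI]
still_isi = 1.0 - np.cumsum(pmf)                  # P(dwell not yet over)
left_silently = np.cumsum(pmf) * O[ITI, NOTHING]  # exited, but emission was null
p_iti = left_silently / (still_isi + left_silently)

for q in (0.9, 1.0, 1.1, 1.2, 1.5):
    i = int(round(q / dt)) - 1
    print(f"t = {q:.1f} s: P(ITI | no reward yet) = {p_iti[i]:.3f}")
```

Consistent with the prose above, the printed posterior rises only gradually after the 1-second expected reward time, which is why the simulated inhibition is delayed and prolonged.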
Hollerman and Schultz (1998) generalized the reward omission experiment to include probe trials in which the reward was delivered a half-second early or late. Figure 10 (right) shows the behavior of the partially observable semi-Markov model on this task. In accord with the findings discussed in section 2, the semi-Markov model predicts no inhibition at the time the reward was originally expected. As shown by the shaded bar underneath the trace, this is because the model assumes that the early reward has signaled an early transition into the ITI state, where no further reward is expected. While the inference model judges such an event unlikely, it is a better explanation for the observed data than any other path through the state space.

5.4 Discussion: Data Bearing on These Predictions. The partially observable model behaves like the fully observable model for the experiments reported in the previous section (simulations not shown) and additionally accounts for dopamine responses when reward is omitted (Schultz et al., 1997) or delivered at unexpected times (Hollerman & Schultz, 1998). The results are due to the inference model making judgments about the likely progression of the hidden state when observable signals are absent
or unusual. Since such judgments unfold progressively with the passage of time, simulated dopaminergic pauses are both later and longer than bursts. This difference is experimentally verified by reports of population duration and latency ranges (Schultz, 1998; Hollerman & Schultz, 1998) and is unexplained by delay line models.

Figure 11: Simulated dopamine responses from partially observable semi-Markov model of trace conditioning with different levels of partial reinforcement. (Left) State space for inference. (Right) As chance of reinforcement increases, phasic responding to the reward decreases while responding to the stimulus increases. Shaded bars show inferred state (white: ITI, black: ISI). Mean ITI: 5 sec; ISI: 1 sec; reward magnitude: 1; rectification threshold: −0.1; probability of anomalous event in inference model: 2%; CV of dwell time uncertainty: 0.08.
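The direction of the trade-off in Figure 11 follows from the error definition alone. Neglecting timing and rectification terms (a deliberate oversimplification; the figure shows the full simulation), the asymptotic prediction carried by the stimulus is roughly p · r under reinforcement probability p, leaving an error of about p · r at the stimulus and r(1 − p) at reward:

```python
r = 1.0  # reward magnitude
# Hedged back-of-the-envelope check: as the reinforcement probability p
# rises, the stimulus error grows while the reward error shrinks.
for p in (0.04, 0.25, 0.50, 0.75, 0.96):
    print(f"p = {p:.2f}: stim error ~ {p * r:.2f}, reward error ~ {r * (1 - p):.2f}")
```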
However, in order to obtain pause durations similar to experimental data, it was necessary to assume fairly low levels of variance in the inference model's distribution over the interstimulus interval. The fact that this variance (a CV of 0.08) was much smaller than the level of timing noise suggested by behavioral experiments (0.3–0.5 according to Gibbon, 1977, or even 0.16 reported more recently by Gallistel, King, & McDonald, 2004) goes against the natural assumption that the uncertainty in the inference model captures the noise in internal timing processes. It is possible that because of the short, 1-second interevent durations used in the recordings, the monkeys are operating in a more accurate timing regime that seems behaviorally to dominate for subsecond intervals (see Gibbon, 1977). In this respect, it is interesting to compare the results of Morris et al. (2004), who recorded dopamine responses after slightly longer (1.5 and 2 seconds) deterministic trace intervals and reported noticeably more prolonged inhibitory responses. A fuller treatment of these issues would likely require both a more realistic theory that includes spike generation and as yet unavailable experimental analysis of the trial-to-trial variability in dopaminergic pause responses.

The new model shares with previous TD models the prediction that the degree of partial reinforcement should trade off phasic dopaminergic responding between a stimulus and its reward (see Figure 11). This is well verified experimentally (Fiorillo et al., 2003; Morris et al., 2004). One of these studies (Fiorillo et al., 2003) revealed an additional response phenomenon: a tonic "ramp" of responding between stimulus and reward, maximal for 50% reinforcement. Such a ramp is not predicted by our model (in fact, we predict very slight inhibition preceding reward, related to the possibility of an early, unrewarded state transition), but we note that such ramping is seen only in delay, and not trace, conditioning (Fiorillo et al., 2003; Morris et al., 2004). Therefore, it might reflect some aspect of the processing of temporally extended stimuli that our theory (which incorporates only instantaneous stimuli) does not yet contemplate. Alternatively, Niv, Duff, et al. (2005) suggest that the ramp might result from trial averaging over errors on the progression between states in a tapped-delay line model, due to the asymmetric nature of the dopamine response. This explanation can be directly reproduced in our semi-Markov framework by assuming that there is at least one state transition during the interstimulus interval, perhaps related to the persistence of the stimulus (simulations not reported; Daw, 2003). Finally, we could join the authors of the original study (Fiorillo et al., 2003) in assuming that the ramp is an entirely separate signal multiplexed with the prediction error signal. (Note, however, that although they associate the ramp with "uncertainty," this is not the same kind of uncertainty that we mainly discuss in this article. Our uncertainty measures posterior ignorance about the hidden state; Fiorillo et al. are concerned with known stochasticity or "risk" in reinforcer delivery.)
6 Discussion

We have developed a new account of the dopaminergic response that builds on previous ones in a number of directions. The major features in the model are a partially observable state and semi-Markov dynamics; these are accompanied by a number of further assumptions (including asymptotic uncertainty, temporal measurement noise, and the rectification of negative TD error) to produce new or substantially different explanations for a range of dopaminergic response phenomena. We have focused particularly on a set of experiments that exercise both the temporal and hidden state aspects of the model—those involving the state uncertainty that arises when an event can vary in its timing or be altogether omitted.

6.1 Comparing Different Models. The two key features of our model, partial observability and semi-Markov dynamics, are to a certain extent separable and each interesting in its own right. We have already shown how a fully observable semi-Markov model with interesting properties arises as a limiting case of our model when, as in many experimental situations, the observations are unambiguous. Another interesting relative, which has yet to be explored, would include partial observability and state inference but not semi-Markov dynamics. One way to construct such a model is to note that any discrete-time semi-Markov process of the sort described here has a hidden Markov model that is equivalent to it for generating observation sequences. This can be obtained by subdividing each semi-Markov state into a series of discrete time marker states, each lasting one time step (see Figure 12; a code sketch of the construction follows the caption).

Figure 12: Markov model equivalent to the semi-Markov model of a trace conditioning experiment from Figure 4. The ISI and ITI states are subdivided into a series of substates that mark the passage of time. Stimuli and rewards occur only on transitions from one set of states into the other; dwell time distributions are encoded by the transition probabilities from each marker state.
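As promised above, the subdivision can be expressed as a small transformation: each semi-Markov state's dwell-time distribution becomes a chain of one-step marker states whose exit probabilities are the hazard rates of that distribution. A sketch under our own naming conventions:

```python
import numpy as np

def markerize(dwell_pmf):
    """Convert a dwell-time pmf over {1, 2, ...} steps into per-marker
    continuation probabilities for an equivalent chain of one-step states
    (cf. Figure 12). Marker i exits the chain with the hazard rate
    pmf[i] / P(dwell >= i + 1 steps remain possible), else advances."""
    pmf = np.asarray(dwell_pmf, dtype=float)
    survival = 1.0 - np.concatenate(([0.0], np.cumsum(pmf[:-1])))  # P(dwell > i)
    hazard = pmf / np.maximum(survival, 1e-12)
    return 1.0 - hazard

# A geometric (memoryless) dwell yields a constant continuation probability,
# recovering an ordinary Markov self-transition:
print(markerize([0.5, 0.25, 0.125, 0.0625, 0.0625]))  # [0.5 0.5 0.5 0.5 0.0]
```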
State inference and TD are straightforward in this setting (in particular, standard TD can be performed directly on the state posteriors, which are Markov; Kaelbling et al., 1998). Moreover, the fully observable limit of this model (equivalently, the model obtained by subdividing states in our fully observable semi-Markov model) is just a version of the familiar tapped delay line model, in which a series of marker states marks time from each event. Since each new event launches a new cascade of marker states and stops the old one, this model includes a "reset" device similar to those discussed in section 2. An interesting avenue for exploration would be intermediate hybrids, in which long delays are subdivided more coarsely into a series of a few temporally extended, semi-Markov states. A somewhat similar division of long delays into a series of several, temporally extended (internal) substates of different length is a feature of some behavioral theories of timing (Killeen & Fetterman, 1988; Machado, 1997), in part because animals engage in different behaviors during different portions of a timed interval.

Although the Markov and semi-Markov formalisms are equivalent as generative models, the TD algorithms for value learning in each are qualitatively different, because of their different approaches for "backing up" reward predictions over a long delay. The semi-Markov algorithms do so directly, while Markov algorithms employ a series of intermediate marker states. There is thus an empirical question as to which sort of algorithm best corresponds to the dopaminergic response. A key empirical signature of the semi-Markov model would be its prediction that when reward or stimulus timing can vary, later-than-average reward-related events should inhibit dopamine neurons (e.g., see Figure 7). Markov models generally predict no such inhibition, but instead excitation that could shrink toward but not below baseline. In principle, this question could be addressed with trial-by-trial reanalysis of any of a number of extant data sets, since the new model predicts the same pattern of dopaminergic excitation and inhibition as a function of the preceding interval in a number of experiments, including ones as simple as free reward delivery (see Figure 5). Unfortunately, the one study directly addressing this issue was published only as an abstract (Fiorillo & Schultz, 2001). Though the original interpretation followed the Markov account, the publicly presented data appear ambiguous. (See the discussion in section 4 and note 1.)

A signature of many (though not all; Pan, Schmidt, Wickens, & Hyland, 2005) tapped delay line algorithms would be phasic dopaminergic activity during the period between stimulus and reinforcer, reflecting value backing up over intermediate marker states during initial acquisition of a stimulus-reward association. No direct observations have been reported of such intermediate activity, though this may not be determinative since the activity would be subtle and fleeting. As already mentioned, Niv, Duff, et al. (2005) have suggested that the tonic ramp observed in the dopamine response by Fiorillo et al. (2003) might reflect the average over trials of such a response. Should
this hypothesis be upheld by a trial-by-trial analysis of the data, it would be the best evidence for Markov over semi-Markov TD. A final consideration regarding the trade-off between Markov and semi-Markov approaches is that, as Niv, Duff, and Dayan (2004) point out, Markov models are quite intolerant of timing noise. Our results suggest that semi-Markov models are more robustly able to account for dopaminergic response patterns in the light of presumably noisy internal timing.

It is worth discussing the contributions of two other features of our model that differ from the standard TD account. Both have also been studied separately in previous work. First, the asymmetry between excitatory and inhibitory dopamine responses has appreciable effects only when averaging over trials with different prediction errors. Thus, it is crucial in the present semi-Markov account, where most of the burst responses we simulate originate from the asymmetric average over errors that differ from trial to trial due to differences in event timing. Delay line models do not generally predict similar effects of event timing on error, and so in that context, asymmetric averaging has mainly been important in understanding experiments in which error varies due to differential reward delivery (as in Figure 7; Niv, Duff, et al., 2005). In contrast, the use of an average-reward TD rule (rather than the more traditional discounted formulation) plays a more cosmetic role in this work. In the average reward formulation, trial-to-trial variability in delays d_k affects the prediction error (see equation 4.2) through the term −ρ_k · d_k; in a discounted version, analogous effects would occur due to long delays being more discounted (as γ^{d_k}). One advantage of the average reward formulation is that it is invariant to dilations or contractions of the timescale of events, which may be relevant to behavior (discussed below). Previous work on average-reward TD in the context of delay line models has suggested that this formulation might shed light on tonic dopamine, dopamine-serotonin interactions, and motivational effects on response vigor (Daw & Touretzky, 2002; Daw et al., 2002; Niv, Daw, & Dayan, 2005).

6.2 Behavioral Data. Our model is concerned with the responses of dopamine neurons. However, insofar as dopaminergically mediated learning may underlie some forms of both classical and instrumental conditioning (e.g., Parkinson et al., 2002; Faure, Haberland, Condé, & Massioui, 2005), the theory suggests connections with behavioral data as well. Much more work will be needed to develop such connections fully, but we mention a few interesting directions here. Our theory generalizes and provides some justification (in limited circumstances) for the "reset" hypothesis that had previously been proposed, on an ad hoc basis, to correct the TD account of the Hollerman and Schultz (1998) experiment (Suri & Schultz, 1998, 1999; Brown et al., 1999). In our theory, "reset" (here, an inferred transition into the ITI state) occurs after reward because this is consistent with the inference model for that experiment. But
in other circumstances, for instance, those in which a stimulus is followed by a sequence of more than one reward, information about the stimulus remains predictively relevant after the first reward, and our theory (unlike its predecessors) would not immediately discard it. Dopamine neurons have not been recorded in such situations, but behavioral experiments offer some clues. Animals can readily learn that a stimulus predicts multiple reinforcers; in classical conditioning, this has been repeatedly, though indirectly, demonstrated by showing that adding reinforcers to or removing them from a sequence has effects on learning (upward and downward unblocking; Holland & Kenmuir, 2005; Holland, 1988; Dickinson & Mackintosh, 1979; Dickinson, Hall, & Mackintosh, 1976). No such learning would be possible in the tapped delay line model if the first reward reset the representation. Because it has somewhat different criteria for what circumstances trigger a reset, the Suri and Schultz (1998, 1999) model would also have problems learning a task known variously as feature-negative occasion setting or sequential conditioned inhibition (Yin, Barnet, & Miller, 1994; Bouton & Nelson, 1998; Holland, Lamoureux, Han, & Gallagher, 1999). However, we should note that a different sort of behavioral experiment does support a reward-triggered reset in one situation. In an instrumental conditioning task requiring animals to track elapsed time over a period of about 90 seconds, reward arrival seems to reset animals' interval counts (Matell & Meck, 1999). It is unclear how to reconcile this last finding with the substantial evidence from classical conditioning. Gallistel and Gibbon (2000) have argued that behavioral response acquisition in classical conditioning experiments is timescale invariant in the sense that contracting or dilating the speed of all events does not affect the number of stimulus-reward pairings before a response is seen. This would not be true for responses learned by simple tapped delay line TD models (since doubling the speed of events would halve the number of marker states and thereby speed acquisition), but this property naturally holds for our fully observable semi-Markov model (and would hold for the partially observable model if the unspecified model learning phase were itself timescale invariant). There is also behavioral evidence that may relate to our predictions about the relationship between the dopamine response and the preceding interval. The latency to animals' behavioral responses across many instrumental tasks is correlated with the previous interreinforcement interval, with earlier responding after shorter intervals ("linear waiting"; Staddon & Cerutti, 2003). Given dopamine's involvement in response vigor (e.g., Dickinson, Smith, & Mirenowicz, 2000; Satoh, Nakai, Sato, & Kimura, 2003; McClure, Daw, & Montague, 2003; Niv, Daw, et al., 2005), this effect may reflect enhanced dopaminergic activity after shorter intervals, as our model predicts. However, the same reasoning applied to reward omission (after which dopaminergic responding is transiently suppressed) would incorrectly predict slower responding. In fact, animals respond earlier following reward
omission (Staddon & Innis, 1969; Mellon, Leak, Fairhurst, & Gibbon, 1995); we thus certainly do not have a full account of the many factors influencing behavioral response latency, particularly on instrumental tasks. Particularly given these complexities, we stress that our theory is not intended as a theory of behavioral timing. To the contrary, it is adjunct to such a theory: it assumes an extrinsic timing signal. We have investigated how information about the passage of time can be combined with sensory information to make inferences about future reward probabilities and drive dopaminergic responding. We have explored the resulting effect on the dopaminergic response of one feature common to most behavioral timing models—scalar timing noise (Gibbon, 1977; Killeen & Fetterman, 1988; Staddon & Higa, 1999)—but we are otherwise rather agnostic about the timing substrate. One prominent timing theory, BeT (Killeen & Fetterman, 1988), assumes animals time intervals by probabilistically visiting a series of internally generated behavioral states of extended duration (see also LeT; Machado, 1997). While these behavioral "states" may seem to parallel the semi-Markov states of our account, it is important to recall that in our theory, the states are not actual internal states of the animal but rather are notional features of the animal's generative model of events in the external world. That generative model, plus extrinsic information about the passage of time, is used to infer a distribution over the external state. One concrete consequence of this difference can be seen in our simulations, in which even though the world is assumed to change abruptly and discretely, the internal representation is continuous and can sometimes change gradually (unlike the states of BeT).

Finally, both behavioral and neuroscientific data from instrumental conditioning tasks suggest that depending on the circumstances, animals seem to evaluate potential actions using either TD-style model-free or dynamic-programming-style model-based RL methods (Dickinson & Balleine, 2002; Daw, Niv, & Dayan, in press). This apparent heterogeneity of control is puzzling and relates to a potential objection to our framework. Specifically, our assumption that animals maintain a full world model may seem to make redundant the use of TD to learn value estimates, since the world model itself already contains the information necessary to derive value estimates (using dynamic programming methods such as value iteration). This tension may be resolved by considerations of efficiency and of balancing the advantages and disadvantages of both RL approaches. Given that the world model is in principle learned online and thus subject to ongoing change, it is computationally more frugal and often not appreciably less accurate to use TD to maintain relatively up-to-date value estimates rather than to repeat laborious episodes of value iteration. This simple idea is elaborated in RL algorithms like prioritized sweeping (Moore & Atkeson, 1993) and Dyna-Q (Sutton, 1990). In fact, animals behaviorally exhibit characteristics of each sort of planning under circumstances to which that method is computationally well suited (Daw et al., 2005).
6.3 Future Theoretical Directions. A number of authors have previously suggested schemes for combining a world model with TD theories of the dopamine system (Dayan, 2002; Dayan & Balleine, 2002; Suri, 2001; Daw et al., in press; Daw et al., 2005; Smith, Becker, & Kapur, 2005). However, all of this work concerns the use of a world model for planning actions. The present account explores a separate, though not mutually exclusive, function of a world model: for state inference to address problems of partial observability. Our work thus situates the hypothesized dopaminergic RL system in the context of a broader view of the brain's functional anatomy, in which the subcortical RL systems receive a refined, inferred sensory representation from cortex (similar frameworks have been suggested by Doya, 1999, 2000, and by Szita & Lorincz, 2004). Our Bayesian, generative view on this sensory processing is broadly consistent with recent theories of sensory cortex (Lewicki & Olshausen, 1999; Lewicki, 2002). Such theories may suggest how to address an important gap in the present work: we have not investigated how the hypothesized model learning and inference functions might be implemented in brain tissue. In the area of cortical sensory processing, such implementational questions are an extremely active area of research (Gold & Shadlen, 2002; Deneve, 2004; Rao, 2004; Zemel, Huys, Natarajan, & Dayan, 2004). Also, in the context of a generative model whose details are closer to our own, it has been suggested that a plausible neural implementation might be possible using sampling rather than exact inference for the state posterior (Kurth-Nelson & Redish, 2004). Here, we intend no claim that animal brains are using the same methods as we have to implement the calculations; our goal is rather to identify the computations and their implications for the dopamine response.

The other major gap in our presentation is that while we have discussed reward prediction in partially observable semi-Markov processes, we have not explicitly treated action selection in the analogous decision processes. It is widely presumed that dopaminergic value learning ultimately supports the selection of high-valued actions, probably by an approach like actor-critic algorithms (Sutton, 1984; Sutton & Barto, 1998), which use TD-derived value predictions to influence a separate process of learning action selection policies. There is suggestive evidence from functional anatomy that distinct dopaminergic targets in the ventral and dorsal striatum might subserve these two functions (Montague et al., 1996; Voorn, Vanderschuren, Groenewegen, Robbins, & Pennartz, 2004; Daw et al., in press; but see also Houk et al., 1995; Joel, Niv, & Ruppin, 2002, for alternative views). This idea has also recently received more direct support from fMRI (O'Doherty et al., 2004) and unit recording (Daw, Touretzky, & Skaggs, 2004) studies. We do not envision the elaborations in the present theory as substantially changing this picture. That said, hidden state deeply complicates the action selection problem in Markov decision processes (i.e., partially observable MDPs or POMDPs; Kaelbling et al., 1998). The difficulty, in a nutshell, is that the optimal action when the state is uncertain may be different
from what would be the optimal action if the agent were certainly in any particular state (e.g., for exploratory or information-gathering actions)—and, in general, the optimal action varies with the state posterior, which is a continuous-valued quantity unlike the manageably discrete state variable that determines actions in a fully observable MDP. Behavioral and neural recording experiments on monkeys in a task involving a simple form of state uncertainty suggest that animals indeed use their degree of state uncertainty to guide their behavior (Gold & Shadlen, 2002). Since the appropriate action can vary continuously with the state posterior, incorporating action selection in the present model would require approximating the policy (and value) as functions of the full belief state, preferably nonlinearly.

6.4 Future Experimental Directions. Our work enriches previous theories of dopaminergic responding by identifying two important theoretical issues that should guide future experiments: the distinction between Markov and semi-Markov backup and partial observability. We have already discussed how the former issue suggests new experimental analyses; similarly, issues of hidden state are ripe for future experiment. Partial observability suggests a particular question: whether dopaminergic neurons report an aggregate error signal or separate error signals tied to different hypotheses about the world's state (see section 5.2). This could be studied more or less directly, for instance, by placing an animal in a situation where ambiguous reward expectancy (e.g., one reward, or two, or three) resolved into a situation where the intermediate reward was expected with certainty. Under a scalar error code, dopamine neurons should not react to this event; under a vector error code, different neurons should report both positive and negative error. More generally, it would be interesting to study how dopaminergic neurons behave in many situations of sensory ambiguity (as in noisy motion discriminations, e.g., Gold & Shadlen, 2002, where much is known about how cortical neurons track uncertainty but there is no indication how, if at all, the dopaminergic system is involved). The present theory and the antecedent theory of partially observable Markov decision processes suggest a framework by which many such experiments could be designed and analyzed.

Appendix

Here we present and sketch derivations for the formulas for inference in the partially observable semi-Markov model. Inference rules for similar hidden semi-Markov models have been described by Levinson (1986) and Guedon and Cocozza-Thivent (1990). We also sketch the proof of the correctness of the TD algorithm for the model.
Below, we use abbreviated notation for the transition, dwell time, and observation functions. We write the conditional transition probabilities as T_{s',s} ≡ P(s_k = s | s_{k−1} = s'); the conditional dwell time distributions as D_{s,d} ≡ P(d_k = d | s_k = s); and the observation probabilities as O_{s,o} ≡ P(o_k = o | s_k = s). These functions, together with the observation sequence, are given; the goal is to infer posterior distributions over the unobservable quantities needed for TD learning.

The chief quantity necessary for the TD learning rule of equation 5.1 is β_{s,t} = P(s_t = s, φ_t = 1 | o_1 . . . o_{t+1}), the probability that the process left state s at time t. To perform the one time step of smoothing in this equation, we use Bayes' theorem on the subsequent observation:

β_{s,t} = [P(o_{t+1} | s_t = s, φ_t = 1) · P(s_t = s, φ_t = 1 | o_1 . . . o_t)] / P(o_{t+1} | o_1 . . . o_t).    (A.1)
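For a nonempty observation o_{t+1}, a transition at t is certain, so the denominator of equation A.1 is just the numerator summed over states, and the update is a one-line normalization. A sketch (ours; the null-observation case requires the additional no-transition term and is omitted here):

```python
import numpy as np

def beta_from_alpha(alpha_t, T, O, o_next):
    """Equation A.1 for a nonempty next observation. alpha_t[s] is the
    conditional P(s_t = s, phi_t = 1 | o_1..o_t); the likelihood term is
    P(o_{t+1} | s_t = s, phi_t = 1) = sum over s' of T[s, s'] O[s', o_next]."""
    lik = T @ O[:, o_next]
    beta = lik * alpha_t
    return beta / beta.sum()  # valid because a nonempty obs implies a transition
```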
In equation A.1 and several equations below, we have made use of the Markov property: the conditional independence of s_{t+1} and o_{t+1} from the previous observations and states given the predecessor state s_t. In semi-Markov processes (unlike Markov processes), this property holds only at a state transition, that is, when φ_t = 1. The first term of the numerator of equation A.1 can be computed by integrating over s_{t+1} in the model: P(o_{t+1} | s_t = s, φ_t = 1) = Σ_{s'∈S} T_{s,s'} O_{s',o_{t+1}}. Call the second term of the numerator of equation A.1 α_{s,t}. Computing it requires integrating over the possible durations of the stay in state s:

α_{s,t} = P(s_t = s, φ_t = 1 | o_1 . . . o_t)    (A.2)

= Σ_{d=1}^{∞} P(s_t = s, φ_t = 1, d_t = d | o_1 . . . o_t)    (A.3)

= Σ_{d=1}^{∞} [P(o_{t−d+1} . . . o_t | s_t = s, φ_t = 1, d_t = d, o_1 . . . o_{t−d}) · P(s_t = s, φ_t = 1, d_t = d | o_1 . . . o_{t−d})] / P(o_{t−d+1} . . . o_t | o_1 . . . o_{t−d})    (A.4)

= Σ_{d=1}^{∞} [O_{s,o_{t−d+1}} D_{s,d} P(s_{t−d+1} = s, φ_{t−d} = 1 | o_1 . . . o_{t−d})] / P(o_{t−d+1} . . . o_t | o_1 . . . o_{t−d}),    (A.5)
where the sum need not actually be taken out to infinity, but only until the last time a nonempty observation was observed (where a state transition must have occurred). The derivation makes use of the fact that the observation o_t is empty with probability one except on a state transition. Thus, under the hypothesis that the system dwelled in state s from time t − d + 1 through time t, the probability of the sequence of null observations during that period equals just the probability of the first, O_{s,o_{t−d+1}}.
Integrating over predecessor states, the quantity P(s_{t−d+1} = s, φ_{t−d} = 1 | o_1 . . . o_{t−d}), the probability that the process entered state s at time t − d + 1, equals

Σ_{s'∈S} T_{s',s} · P(s_{t−d} = s', φ_{t−d} = 1 | o_1 . . . o_{t−d}) = Σ_{s'∈S} T_{s',s} · α_{s',t−d}.    (A.6)
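Equations A.5 and A.6 together define a forward pass. The sketch below implements it in joint (unnormalized) form, with alpha[s, t] standing for P(s_t = s, φ_t = 1, o_1 . . . o_t); dividing each column by the observation likelihood would recover the conditional α of the text. The conventions (observation index 0 for the null observation, an explicit distribution over the initially entered state) are our own:

```python
import numpy as np

def forward_alpha(T, O, D, obs, init):
    """Joint-form alpha recursion (cf. equations A.2-A.6).

    T[s_prev, s]: transition probabilities; O[s, o]: emission on entering s;
    D[s, d-1]: P(dwell = d steps | s); obs[t]: observation index (0 = null);
    init[s]: probability that the process enters state s at time 0.
    """
    S, N = T.shape[0], len(obs)
    alpha = np.zeros((S, N))   # alpha[s, t] ~ P(s_t = s, phi_t = 1, o_1..o_t)
    enter = np.zeros((S, N))   # enter[s, t] ~ P(enter s at t, o_1..o_t)
    for t in range(N):
        for s in range(S):
            if t == 0:
                enter[s, t] = init[s] * O[s, obs[0]]
            else:
                enter[s, t] = O[s, obs[t]] * (T[:, s] @ alpha[:, t - 1])
            total = 0.0
            for d in range(1, min(t + 1, D.shape[1]) + 1):  # stay spans t-d+1..t
                if d > 1 and obs[t - d + 2] != 0:
                    break  # the interior of a stay must be all null observations
                total += D[s, d - 1] * enter[s, t - d + 1]
            alpha[s, t] = total
    return alpha
```

On long observation sequences one would rescale these quantities periodically to avoid numerical underflow, tracking the scale factors separately.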
Thus, α can be computed recursively, and prior values of α back to the time of the last nonempty observation can be cached, allowing dynamic programming analogous to the Baum-Welch procedure for hidden Markov models (Baum, Petrie, Soules, & Weiss, 1970). Finally, the normalizing factors in the denominators of equations A.5 and A.1 can be computed by similar recursions, after integrating over the state occupied at t − d (see equation A.5) or t (see equation A.1) and the value of φ at those times. Though we do not make use of this quantity in the learning rules, the belief state over state occupancy, B_{s,t} = P(s_t = s | o_1 . . . o_t), can also be computed by a recursion on α exactly analogous to equation A.2, substituting P(d_t ≥ d | s_t = s) for D_{s,d}.

The two expectations in the TD learning rule of equation 5.1 are:

E[V̂_{s_{t+1}}] = Σ_{s'∈S} V̂_{s'} P(s_{t+1} = s' | s_t = s, φ_t = 1, o_{t+1})    (A.7)

= Σ_{s'∈S} V̂_{s'} [T_{s,s'} O_{s',o_{t+1}} / Σ_{s''∈S} T_{s,s''} O_{s'',o_{t+1}}]    (A.8)

and

E[d_t] = Σ_{d=1}^{∞} d · P(d_t = d | s_t = s, φ_t = 1, o_1 . . . o_{t+1})    (A.9)

= Σ_{d=1}^{∞} d · P(d_t = d | s_t = s, φ_t = 1, o_1 . . . o_t)    (A.10)

= Σ_{d=1}^{∞} d · P(s_t = s, d_t = d, φ_t = 1 | o_1 . . . o_t) / α_{s,t},    (A.11)
where the sum can again be truncated at the time of the last nonempty observation, and P(s_t = s, d_t = d, φ_t = 1 | o_1 . . . o_t) is computed as on the right-hand side of equation A.2. The proof that the TD algorithm of equation 5.1 has the same fixed point as value iteration is sketched below. We assume the inference model correctly matches the process generating the samples. With each TD update, V̂_s is nudged toward some target value with some step size β_{s,t}. It is easy
to show that, analogous to the standard stochastic update situation with constant step sizes, the fixed point is the average of the targets, weighted by their probabilities and their step sizes. Fixing some arbitrary t, the update targets and associated step sizes β are functions of the observations o_1, . . . , o_{t+1}, which are, by assumption, samples generated with probability P(o_1, . . . , o_{t+1}) by a semi-Markov process whose parameters match those of the inference model. The fixed point is

V̂_s = Σ_{o_1...o_{t+1}} [P(o_1 . . . o_{t+1}) · β_{s,t} · (r(o_{t+1}) − ρ_t · E[d_t] + E[V̂_{s_{t+1}}])] / Σ_{o_1...o_{t+1}} [P(o_1 . . . o_{t+1}) · β_{s,t}],    (A.12)
where we have written the reward $r_{t+1}$ as a function of the observation, $r(o_{t+1})$, because rewards are just a special case of observations in the partially observable framework. The expansions of $\beta_{s,t}$, $E[d_t]$, and $E[V_{s_{t+1}}]$ are all conditioned on $P(o_1, \ldots, o_{t+1})$, where this probability from the inference model is assumed to match the empirical probability appearing in the numerator and denominator of this expression. Thus, we can marginalize out the observations in the sums on the numerator and denominator, reducing the fixed-point equation to

$$V_s = \sum_{s' \in S} T_{s,s'} \left( \sum_{o \in O} \left[ O_{s',o} \cdot r(o) \right] + V_{s'} \right) - \rho_t \cdot \sum_{d} d \cdot D_{s,d}, \tag{A.13}$$
which (assuming $\rho_t = \rho$) is Bellman's equation for the value function, and is also by definition the same fixed point as value iteration.

Acknowledgments

This work was supported by National Science Foundation grants IIS-9978403 and DGE-9987588. A.C. was funded in part by a Canadian NSERC PGS B fellowship. N.D. is funded by a Royal Society USA Research Fellowship and the Gatsby Foundation. We thank Sham Kakade, Yael Niv, and Peter Dayan for their helpful insights, and Chris Fiorillo, Wolfram Schultz, Hannah Bayer, and Paul Glimcher for helpfully sharing with us, often prior to publication, their thoughts and experimental observations.

References

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141. Bouton, M. E., & Nelson, J. B. (1998). Mechanisms of feature-positive and feature-negative discrimination learning in an appetitive conditioning paradigm. In N. A. Schmajuk & P. C. Holland (Eds.), Occasion setting: Associative learning and cognition in animals (pp. 69–112). Washington, DC: American Psychological Association. Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 393–400). Cambridge, MA: MIT Press. Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19(23), 10502–10511. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92) (pp. 183–188). San Jose, CA: AAAI Press. Courville, A. C., Daw, N. D., Gordon, G. J., & Touretzky, D. S. (2003). Model uncertainty in classical conditioning. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. Courville, A. C., Daw, N. D., & Touretzky, D. S. (2004). Similarity and discrimination in classical conditioning: A latent variable account. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Courville, A. C., & Touretzky, D. S. (2001). Modeling temporal structure in classical conditioning. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 3–10). Cambridge, MA: MIT Press. Das, T., Gosavi, A., Mahadevan, S., & Marchalleck, N. (1999). Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45, 560–574. Daw, N. D. (2003). Reinforcement learning models of the dopamine system and their behavioral implications. Unpublished doctoral dissertation, School of Computer Science, Carnegie Mellon University. Daw, N. D., Kakade, S., & Dayan, P. (2002). Opponent interactions between serotonin and dopamine. Neural Networks, 15, 603–616. Daw, N. D., Niv, Y., & Dayan, P. (in press). Actions, values, policies, and the basal ganglia. In E. Bezard (Ed.), Recent breakthroughs in basal ganglia research. New York: Nova Science. Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience. Daw, N. D., & Touretzky, D. S. (2002). Long-term reward prediction in TD models of the dopamine system. Neural Computation, 14, 2567–2583. Daw, N., Touretzky, D., & Skaggs, W. (2004). Contrasting neuronal correlates between dorsal and ventral striatum in the rat. In Cosyne04 Computational and Systems Neuroscience Abstracts, Vol. 1.
Dayan, P. (2002). Motivated reinforcement learning. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 11– 18). Cambridge, MA: MIT Press. Dayan, P., & Balleine, B. W. (2002). Reward, motivation and reinforcement learning. Neuron, 36, 285–298. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38. Deneve, S. (2004). Bayesian inference in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Dickinson, A., & Balleine, B. (2002). The role of learning in motivation. In C. R. Gallistel (Ed.), Stevens’ handbook of experimental psychology (3rd ed.), Vol. 3: Learning, motivation and emotion (pp. 497–533). New York: Wiley. Dickinson, A., Hall, G., & Mackintosh, N. J. (1976). Surprise and the attenuation of blocking. Journal of Experimental Psychology: Animal Behavior Processes, 2, 313–322. Dickinson, A., & Mackintosh, N. J. (1979). Reinforcer specificity in the enhancement of conditioning by posttrial surprise. Journal of Experimental Psychology: Animal Behavior Processes, 5, 162–177. Dickinson, A., Smith, J., & Mirenowicz, J. (2000). Dissociation of Pavlovian and instrumental incentive learning under dopamine antagonists. Behavioral Neuroscience, 114, 468–483. Doya, K. (1999). What are the computations in the cerebellum, the basal ganglia, and the cerebral cortex? Neural Networks, 12, 961–974. Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10, 732–739. Faure, A., Haberland, U., Cond´e, F., & Massioui, N. E. (2005). Lesion to the nigrostriatal dopamine system disrupts stimulus-response habit formation. Journal of Neuroscience, 25, 2771–2780. Fiorillo, C. D., & Schultz, W. (2001). The reward responses of dopamine neurons persist when prediction of reward is probabilistic with respect to time or occurrence. In Society for Neuroscience Abstracts, 27, 827.5. Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902. Gallistel, C. R., & Gibbon, J. (2000). Time, rate and conditioning. Psychological Review, 107(2), 289–344. Gallistel, C. R., King, A., & McDonald, R. (2004). Sources of variability and systematic error in mouse timing behavior. Journal of Experimental Psychology: Animal Behavior Processes, 30, 3–16. Gibbon, J. (1977). Scalar expectancy theory and Weber’s law in animal timing. Psychological Review, 84, 279–325. Gold, J. I., & Shadlen, M. N. (2002). Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36, 299– 308. Guedon, Y., & Cocozza-Thivent, C. (1990). Explicit state occupancy modeling by hidden semi-Markov models: Application of Derin’s scheme. Computer Speech and Language, 4, 167–192.
Holland, P. C. (1988). Excitation and inhibition in unblocking. Journal of Experimental Psychology: Animal Behavior Processes, 14, 261–279. Holland, P. C., & Kenmuir, C. (2005). Variations in unconditioned stimulus processing in unblocking. Journal of Experimental Psychology: Animal Behavior Processes, 31, 155–171. Holland, P. C., Lamoureux, J. A., Han, J., & Gallagher, M. (1999). Hippocampal lesions interfere with Pavlovian negative occasion setting. Hippocampus, 9, 143–157. Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1, 304–309. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press. Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134. Kakade, S., & Dayan, P. (2000). Acquisition in autoshaping. In S. A. Solla, T. K. Leen, & K. R. Muller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press. Kakade, S., & Dayan, P. (2002). Acquisition and extinction in autoshaping. Psychological Review, 109, 533–544. Killeen, P. R., & Fetterman, J. G. (1988). A behavioral theory of timing. Psychological Review, 95, 274–295. Kurth-Nelson, Z., & Redish, A. (2004). µagents: Action-selection in temporally dependent phenomena using temporal difference learning over a collective belief structure. Society for Neuroscience Abstracts, 30, 207.1. Levinson, S. E. (1986). Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech and Language, 1, 29–45. Lewicki, M. S. (2002). Efficient coding of natural sounds. Nature Neuroscience, 5, 356–363. Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 16, 1587–1601. Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67, 145–163. Machado, A. (1997). Learning the temporal dynamics of behavior. Psychological Review, 104, 241–265. Mahadevan, S., Marchalleck, N., Das, T., & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. In Proceedings of the 14th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Matell, M. S., & Meck, W. H. (1999). Reinforcement-induced within-trial resetting of an internal clock. Behavioural Processes, 45, 159–171. McClure, S. M., Daw, N. D., & Montague, P. R. (2003). A computational substrate for incentive salience. Trends in Neurosciences, 26, 423–428.
Mellon, R. C., Leak, T. M., Fairhurst, S., & Gibbon, J. (1995). Timing processes in the reinforcement-omission effect. Animal Learning and Behavior, 23, 286– 296. Mirenowicz, J., & Schultz, W. (1996). Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature, 379, 449–451. Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947. Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130. Morris, G., Arkadir, D., Nevet, A., Vaadia, E., & Bergman, H. (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133–143. Niv, Y., Daw, N. D., & Dayan, P. (2005). How fast to work: Response vigor, motivation, and tonic dopamine. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Niv, Y., Duff, M. O., & Dayan, P. (2004). The effects of uncertainty on TD learning. In Cosyne04—Computational and Systems Neuroscience Abstracts, vol. 1. Niv, Y., Duff, M. O., & Dayan, P. (2005). Dopamine, uncertainty, and TD learning. Behavioral and Brain Functions, 1, 6. O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452–454. Owen, A. M. (1997). Cognitive planning in humans: Neuropsychological, neuroanatomical and neuropharmacological perspectives. Progress in Neurobiology, 53, 431–450. Pan, W. X., Schmidt, R., Wickens, J., & Hyland, B. (2005). Dopamine cells respond to predicted events during classical conditioning: Evidence for eligibility traces in the reward-learning network. Journal of Neuroscience, 25, 6235–6242. Parkinson, J. A., Dalley, J. W., Cardinal, R. N., Bamford, A., Fehnert, B., Lachenal, G., Rudarakanchana, N., Halkerston, K., Robbins, T. W., & Everitt, B. J. (2002). Nucleus accumbens dopamine depletion impairs both acquisition and performance of appetitive Pavlovian approach behaviour: Implications for mesoaccumbens dopamine function. Behavioral Brain Research, 137, 149–163. Rao, R. P. N. (2004). Hierarchical Bayesian inference in networks of spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (2002). Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press. Satoh, T., Nakai, S., Sato, T., & Kimura, M. (2003). Correlated coding of motivation and outcome of decision by dopamine neurons. Journal of Neuroscience, 23, 9913– 9923. Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900–913.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schultz, W., & Romo, R. (1990). Dopamine neurons of the monkey midbrain: Contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63, 607–624. Smith, A. J., Becker, S., & Kapur, S. (2005). A computational model of the functional role of the ventral-striatal D2 receptor in the expression of previously acquired behaviors. Neural Computation, 17, 361–395. Staddon, J. E. R., & Cerutti, D. T. (2003). Operant conditioning. Annual Reviews of Psychology, 54, 115–144. Staddon, J. E. R., & Higa, J. J. (1999). Time and memory: Towards a pacemaker-free theory of interval timing. Journal of the Experimental Analysis of Behavior, 71, 215–251. Staddon, J. E., & Innis, N. K. (1969). Reinforcement omission on fixed-interval schedules. Journal of the Experimental Analysis of Behavior, 12, 689–700. Suri, R. E. (2001). Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model. Experimental Brain Research, 140, 234–240. Suri, R. E., & Schultz, W. (1998). Learning of sequential movements with dopamine-like reinforcement signal in neural network model. Experimental Brain Research, 121, 350–354. Suri, R. E., & Schultz, W. (1999). A neural network with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91, 871–890. Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts. Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216–224). San Mateo, CA: Morgan Kaufmann. Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). Cambridge, MA: MIT Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Szita, I., & Lorincz, A. (2004). Kalman filter control embedded into the reinforcement learning framework. Neural Computation, 16, 491–499. Tobler, P., Dickinson, A., & Schultz, W. (2003). Coding of predicted reward omission by dopamine neurons in a conditioned inhibition paradigm. Journal of Neuroscience, 23, 10402–10410. Tsitsiklis, J. N., & Van Roy, B. (2002). On average versus discounted reward temporal-difference learning. Machine Learning, 49, 179–191. Ungless, M. A., Magill, P. J., & Bolam, J. P. (2004). Uniform inhibition of dopamine neurons in the ventral tegmental area by aversive stimuli. Science, 303, 2040–2042. Voorn, P., Vanderschuren, L. J., Groenewegen, H. J., Robbins, T. W., & Pennartz, C. M. (2004). Putting a spin on the dorsal-ventral divide of the striatum. Trends in Neuroscience, 27, 468–474.
Waelti, P., Dickinson, A., & Schultz, W. (2001). Dopamine responses comply with basic assumptions of formal learning theory. Nature, 412, 43–48. Yin, H., Barnet, R. C., & Miller, R. R. (1994). Second-order conditioning and Pavlovian conditioned inhibition: Operational similarities and differences. Journal of Experimental Psychology: Animal Behavior Processes, 20, 419–428. Zemel, R., Huys, Q., Natarajan, R., & Dayan, P. (2004). Probabilistic computation in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Received February 24, 2005; accepted September 29, 2005.
LETTER
Communicated by Richard Zemel
Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression D. L. Shrestha
[email protected] D. P. Solomatine
[email protected] UNESCO-IHE Institute for Water Education, Westvest 7, Delft, The Netherlands
The application of the boosting technique to regression problems has received relatively little attention in contrast to research aimed at classification problems. This letter describes a new boosting algorithm, AdaBoost.RT, for regression problems. Its idea is to filter out the examples whose relative estimation error is higher than the preset threshold value and then to follow the AdaBoost procedure. Thus, it requires selecting the suboptimal value of the error threshold to demarcate examples as poorly or well predicted. Some experimental results using the M5 model tree as a weak learning machine for several benchmark data sets are reported. The results are compared to other boosting methods, bagging, artificial neural networks, and a single M5 model tree. The preliminary empirical comparisons show higher performance of AdaBoost.RT for most of the considered data sets.

Neural Computation 18, 1678–1710 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

Recently many researchers have investigated various techniques combining the predictions from multiple predictors to produce a single predictor. The resulting predictor is generally more accurate than an individual one. The ensemble of predictors is often called a committee machine (or mixture of experts). In a committee machine, an ensemble of predictors (often referred to as a weak learning machine or simply a machine) is generated by means of a learning process; the overall predictions of the committee machine are the combination of the individual committee members' predictions (Tresp, 2001). Figure 1 presents the basic scheme of a committee machine. Each machine (1 through T) is trained using training examples sampled from the given training set. A filter is employed when different machines are to be fed with different subsets (denoted as type A) of the training set; in this case, the machines can be run in parallel. Flows of type B appear when machines pass unclassified data subsets to the subsequent machines, thus making a hierarchical committee machine. The individual outputs yi for each example from each machine are combined to produce the overall output y of the ensemble.

[Figure 1: Block diagram of a committee machine. Flows of type A distribute data between the machines. Flows of type B appear when machines pass unclassified data subsets to the subsequent machines, thus making a hierarchy.]

Bagging (Breiman, 1996a, 1996b) and boosting (Schapire, 1990; Freund & Schapire, 1996, 1997) are the two popular committee machines that combine the outputs from different predictors to improve overall accuracy. Several studies of boosting and bagging in classification have demonstrated that these techniques are generally more accurate than the individual classifiers. Boosting can be used to reduce the error of any "weak" learning machine that consistently generates classifiers that need be only a little bit better than random guessing (Freund & Schapire, 1996). Boosting works by repeatedly running a given weak learning machine on different distributions of training examples and combining their outputs. At each iteration, the distributions of training examples depend on the performance of the machine in the previous iteration. The method to calculate the distribution of the training examples is different for various boosting methods; the outputs of the different machines are combined using voting in the case of classification problems or weighted average or median in
case of regression ones. There are different versions of boosting algorithm for classification and regression problems, and they are covered in detail in the following section. This letter introduces a new boosting scheme, AdaBoost.RT, for regression problems initially outlined by the authors in 2004 (Solomatine & Shrestha, 2004). It is based on the following idea. Typically in a regression, it is not possible to predict the output exactly as in classification, as real-valued variables may exhibit chaotic behavior, local nonstationarity, and high levels of noise (Avnimelech & Intrator, 1999). Some discrepancy between the predicted and the observed value is typically inevitable. A possibility here is to use insensitive margins to differentiate correct predictions from incorrect ones, as is done, for example, in support vector machines (Vapnik, 1995). The discrepancy (prediction error) for each example allows us to distinguish whether an example is well (acceptable) or poorly predicted. Once we define the measure of such a discrepancy, it is straightforward to follow the AdaBoost procedure (Freund & Schapire, 1997) by modifying the loss function that fits regression problems better. AdaBoost.RT addresses some of the issues associated with the existing boosting schemes covered in section 2 by introducing a number of novel features. First, AdaBoost.RT uses the so-called absolute relative error threshold φ to project training examples into two classes (poorly and well-predicted examples) by comparing the prediction error (absolute relative error) with the threshold φ. (The reasons to use absolute relative error instead of absolute error, as is done in many boosting algorithms, are discussed in section 3.) Second, the weight updating parameter is computed differently from in the AdaBoost.R2 algorithm (Drucker, 1997) to give more emphasis to the harder examples when error rate is very low. Third, the algorithm does not have to stop when the error rate is greater than 0.5, as happens in some other algorithms. This makes it possible to run a user-defined number of iterations, and in many cases the performance is improved. Last, the outputs from individual machines are combined by weighted average, while most of the methods consider the median, and this appears to give better performance. This letter covers a number of experiments with AdaBoost.RT using model trees (MTs) as weak learning machine, comparing it to other boosting algorithms, bagging, and artificial neural networks (ANNs). Finally, conclusions are drawn. 2 The Boosting Algorithms The original boosting approach is boosting by filtering and is described by Schapire (1990). It was motivated by the PAC (probably approximately correct) learning theory (Valiant, 1984; Kearns & Vazirani, 1994). Boosting by filtering requires a large number of training examples, which is not feasible in many real-life cases. This limitation can be overcome by using
another boosting algorithm, AdaBoost (Freund & Schapire 1996, 1997) in several versions. In boosting by subsampling, the fixed training size and training examples are used, and they are resampled according to a given probability distribution during training. In boosting by reweighting, all the training examples are used to train the weak learning machine, with weights assigned to each example. This technique is applicable only when the weak learning machine can handle the weighted examples. Originally boosting schemes were developed for binary classification problems. Freund and Schapire (1997) extended AdaBoost to a multiclass case, versions of which they called AdaBoost.M1 and AdaBoost.M2. Recently several applications of boosting algorithms for classification problems have been reported (e.g., Quinlan, 1996; Drucker, 1999; Opitz & Maclin, 1999). Application of boosting to regression problems has received attention as well. Freund and Schapire (1997) extended AdaBoost.M2 to boosting regression problems and called it AdaBoost.R. It solves regression problems by reducing them to classification ones. Although experimental work shows that the AdaBoost.R can be effective by projecting the regression data into classification data sets, it suffers from the two drawbacks. First, it expands each example in the regression sample into many classification examples, and the number grows linearly in the number of boosted iterations. Second, the loss function changes from iteration to iteration and even differs between examples in the same iteration. In the framework of AdaBoost.R, Ridgeway, Madigan, and Richardson (1999) performed experiments by projecting regression problems into classification ones on a data set of infinite size. Breiman (1997) proposed the arc-gv (arcing game value) algorithm for regression problems. Drucker (1997) developed the AdaBoost.R2 algorithm, which is an ad hoc modification of AdaBoost.R. He conducted some experiments with AdaBoost.R2 for regression problems and obtained good results. Avnimelech and Intrator (1999) extended the boosting algorithm to regression problems by introducing the notion of weak and strong learning and appropriate equivalence between them. They introduced so-called big errors with respect to threshold γ , which has to be chosen initially. They constructed triplets of weak learning machines and combined them to reduce the error rate using the median of the outputs of the three machines. Using the framework of Avnimelech and Intrator (1999), Feely (2000) introduced the big error margin (BEM) boosting technique. Namee, Cunningham, Byrne, and Corrigan (2000) compared the performance of AdaBoost.R2 with the BEM technique. Recently many researchers (for example, Friedman, Hastie, & Tibshirani, 2000; Friedman, 2001; Duffy & Helmbold, 2000; Zemel & Pitassi, 2001; Ridgeway, 1999) have viewed boosting as a “gradient machine” that optimizes a particular loss function. Friedman et al. (2000) explained the AdaBoost algorithm from a statistical perspective. They showed that the
AdaBoost algorithm is a Newton method for optimizing a particular exponential loss function. Although all these methods involve diverse objectives and optimization approaches, they are all similar except for the one considered by Zemel and Pitassi (2001). In this latter approach, a gradient-based boosting algorithm was derived, which forms new hypotheses by modifying only the distribution of the training examples. This is in contrast to the former approaches, where the new regression models (hypotheses) are formed by changing the distribution of the training examples and modifying the target values. The following subsections describe briefly a boosting algorithm for classification problem: AdaBoost.M1 (Freund & Schapire, 1997). The reason is that our new boosting algorithm is its direct extension for regression problems. The threshold- or margin-based boosting algorithms for regression problems including AdaBoost.R2, which are similar to our algorithm, are described as well. 2.1 AdaBoost.M1. The AdaBoost.M1 algorithm (see appendix A) works as follows. The first weak learning machine is supplied with a training set of m examples with the uniform distribution of weights so that each example has an equal chance to be selected in the first training set. The performance of the machine is evaluated by computing the classification error rate εt as the ratio of incorrect classifications. Knowing εt , the weight updating parameter denoted by βt is computed as follows: βt = εt /(1 − εt ).
(2.1)
As $\varepsilon_t$ is constrained on [0, 0.5], $\beta_t$ is constrained on [0, 1]. $\beta_t$ is a measure of confidence in the predictor. If $\varepsilon_t$ is low, then $\beta_t$ is also low, and low $\beta_t$ means high confidence in the predictor. To compute the distribution for the next machine, the weight of each example is multiplied by $\beta_t$ if the previous machine classifies this example correctly (this reduces the weight of the example), or the weight remains unchanged otherwise. The weights are then normalized to make their set a distribution. The process is repeated until the preset number of machines has been constructed or $\varepsilon_t \ge 0.5$. Finally, the weight denoted by $W$ is computed using $\beta_t$ as

$$W = \log(1/\beta_t). \tag{2.2}$$
W is used to weight the outputs of the machines when the overall output is calculated. Notice that W becomes larger when εt is low, and, consequently, more weight is given to the machine when combining the outputs from individual machines.
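As a small illustration (ours, not the paper's code), one AdaBoost.M1 reweighting step following equations 2.1 and 2.2 can be written as follows, where `correct` is a boolean mask marking the examples the current machine classified correctly.

```python
# Sketch of one AdaBoost.M1 update; names and signature are illustrative.
import numpy as np

def m1_reweight(D, correct, eps):
    beta = eps / (1.0 - eps)                  # equation 2.1: low eps -> low beta
    D_new = D * np.where(correct, beta, 1.0)  # shrink weights of easy examples
    D_new /= D_new.sum()                      # renormalize to a distribution
    W = np.log(1.0 / beta)                    # equation 2.2: the machine's weight
    return D_new, W
```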
The essence of the boosting algorithm is that “easy” examples that are correctly classified by most of the previous weak machines will be given a lower weight, and “hard” examples that often tend to be misclassified will get a higher weight. Thus, the idea of boosting is that each subsequent machine focuses on such hard examples. Freund and Schapire (1997) derived the exponential upper bound for the error rate of the resulting ensemble, and it is smaller than that of the single machine (weak learning machine). It does not guarantee that the performance on an independent test set will be improved; however, if the weak hypotheses are “simple” and the total number of iterations is “not too large,” then the difference between the training and test errors can also be theoretically bounded. As reported by Freund and Schapire (1997), AdaBoost.M1 has a disadvantage in that it is unable to handle weak learning machines with an error rate greater than 0.5. In this case, the value of βt will exceed unity, and consequently, the correctly classified examples will get a higher weight and W becomes negative. As a result, the boosting iterations have to be terminated. Breiman (1996c) describes a method by resetting all the weights of the examples to be equal and restart if either εt is not less than 0.5 or εt becomes 0. Following the revision described by Breiman (1996c), Opitz and Maclin (1999) used a very small positive value of W (e.g., 0.001) rather than a negative or 0 when εt is larger than 0.5. They reported results slightly better than those achieved by the standard AdaBoost.M1. 2.2 AdaBoost.R2. The AdaBoost.R2 (Drucker, 1997) boosting algorithm is an ad hoc modification of AdaBoost.R (Freund & Schapire, 1997), which is an extension of AdaBoost.M2 for regression problems. Drucker’s method followed the spirit of the AdaBoost.R algorithm by repeatedly using a regression tree as a weak learning machine followed by increasing the weights of the poorly predicted examples and decreasing the weights of the wellpredicted ones. Similar to the error rate in classification, he introduced the average loss function to measure the performance of the machine; it is given by
$$\bar{L}_t = \sum_{i=1}^{m} L_t(i)\, D_t(i), \tag{2.3}$$
where $L_t(i)$ is one of three candidate loss functions (see appendix B), all of which are constrained on [0, 1]. The definition of $\beta_t$ remains unchanged. However, unlike projecting the regression data into classification in AdaBoost.R, the reweighting procedure is formulated in such a way that the poorly predicted examples get higher weights and the well-predicted examples get lower weights:

$$D_{t+1}(i) = \frac{D_t(i)\, \beta_t^{(1 - L_t(i))}}{Z_t}. \tag{2.4}$$

In this way, all the weights are updated according to the exponential loss function of $\beta_t$, so that the weight of an example with lower loss (smaller error) is strongly reduced, thus reducing the probability that this example will be picked up for the next machine. Finally, the outputs from each machine are combined using the weighted median. Figure 2 presents the relation between $\beta_t$ or $W$ and $\bar{L}_t$.

[Figure 2: Weight-updating parameter (left y-axis) or weight of the machine (right y-axis) versus error rate $\varepsilon$ (AdaBoost.RT) or average loss $\bar{L}_t$ (AdaBoost.R2). BetaR2 and BetaRT represent the weight-updating parameters for AdaBoost.R2 and AdaBoost.RT, respectively. WR2 and WRT represent the weights of the machine for AdaBoost.R2 and AdaBoost.RT, respectively.]

Similar to AdaBoost.M1, AdaBoost.R2 has a drawback in that it is unable to handle weak learning machines with an error rate greater than 0.5. Furthermore, the algorithm is sensitive to noise and outliers, as the reweighting formulation is proportional to the prediction error. This algorithm also has an advantage: it does not have any parameter that has to be calibrated,
which is not the case with the other boosting algorithms described in this article. 2.3 Boosting Method of Avnimelech and Intrator (1999). Avnimelech and Intrator (1999) were the first to introduce the threshold-based boosting algorithm for regression problems that used an analogy between classification errors and large errors in regression. The idea of using threshold for a big error is similar to the ε-insensitive loss function used in support vector machines (Vapnik, 1995) where the loss function of the examples having error less than ε is zero. The examples whose prediction errors are greater than the big error margin γ are filtered out, and in subsequent iterations, weak learning machine concentrates on them. The fundamental concept of this algorithm is that the mean squared error, which is often significantly greater than the square median of the error, can be reduced by reducing the number of large errors. Unlike other boosting algorithms, the method of Avnimelech and Intrator is an ensemble of only three weak learning machines. Depending on the distribution of training examples, their method has three versions (see appendix C). Initially the training examples have to be split into three sets, Set 1, Set 2, and Set 3, in such a way that the Set 1 should be smaller than the other two sets. The first machine is trained on Set 1 (constituting a portion of the original training set, e.g., 15%). The training set for the second machine is composed of all examples on which the first machine has a big error on Set 2 and the same number of examples sampled from Set 2 on which the first machine has a small error. Note that big error should be defined with respect to threshold γ , which has to be chosen initially. In Boost1, the training data set for the last machine consists of only examples on which exactly one of the previous machines had a big error. But in Boost2, the training data set for the last machine is composed of the data set constructed for Boost1 plus examples on which previous machines had big errors of different signs. The authors further modified this version and called it Modified Boost2, where the training data set for the last machine is composed of the data set constructed for Boost2, plus examples on which both previous machines had big errors, but there is a big difference between the magnitudes of the errors. The final output is the median of the outputs of the three machines. They proved that the error rate could be reduced from ε to 3ε 2 − 2ε 3 or even less. One of problems of this method is selection of the optimal value of threshold γ for big errors. Theoretically, the optimal value may be the lowest value for which there exists a γ -weak learner. A γ -weak learner is such a learner for which there exists some ε < 0.5, and for any given distribution D and δ > 0, it is capable of finding function f D with probability 1-δ such that D(i) < ε. There are, however, practical considerations i: f (xi )−yi > γ such as limitation of data sets and the limited number of machines in the ensemble, which makes it difficult to choose the desired value of γ in
advance. Furthermore, an issue concerning splitting the training data into three subsets may arise. Since all of the training examples are not used to construct the weak learning machine, waste of data is a crucial issue for small data sets. Furthermore, similar to AdaBoost.R2, the algorithm does not work when the error rate of the first weak learning machine on Set 2 is greater than 0.5. It should also be noted that the use of an absolute error to identify big errors is not always the right measure in many practical applications when the variability of the target values is very high (an explanation follows in section 3).

2.4 BEM Boosting Method. The big error margin (BEM) boosting method (Feely, 2000) is quite similar to the AdaBoost.R2 method. It is based on an approach of Avnimelech and Intrator (1999). Similar to their method, the prediction error is compared with the preset threshold value called BEM, and the corresponding example is classified as either well or poorly predicted. The BEM method (see appendix D) counts the number of correct and incorrect predictions by comparing the prediction error with the preset BEM, which has to be chosen initially. The prediction is considered to be incorrect if the absolute value of the prediction error is greater than BEM. Otherwise the prediction is considered to be correct. Counting the number of correct or incorrect predictions allows for computing the distribution of the training examples using the so-called UpFactor and DownFactor:

$$\text{Upfactor}_t = m/\text{errCount}_t, \qquad \text{Downfactor}_t = 1/\text{Upfactor}_t. \tag{2.5}$$
Using these values, the numbers of copies of each training example to be included for the subsequent machine in the ensemble are calculated. So if the example is correctly predicted by the preceding machine in the ensemble, then the number of copies of this example for the next machine is given by the number of its copies in the current training set multiplied by the DownFactor. Similarly, for an incorrectly predicted example, the number of its copies to be presented to the subsequent machine is the one multiplied by the UpFactor. Finally, the outputs from individual machine are combined to give the final output. Similar to the method of Avnimelech and Intrator (1999), this method requires tuning of the value of BEM in order to achieve the optimum results. The way the training examples’ distribution is represented is different from that in other boosting algorithms. In the BEM boosting method, the distribution represents how many times an example will be included in the training set rather than the probability that it will appear. Furthermore, it has another drawback: the size of the training data set is changing for each machine and may increase to an impractical value after a few iterations. In
addition, this method has a problem encountered in the boosting method of Avnimelech and Intrator (1999) due to the use of absolute error measure when the variability of the target values is very high. In the new algorithm presented in the next section, an attempt has been made to resolve some of the problems noted here.
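For illustration, a minimal sketch (our construction, not Feely's code) of the copy-count update of equation 2.5 described above, where `copies` holds how many times each example currently appears in the training set and `abs_err` its absolute prediction error:

```python
# Hypothetical BEM resampling step; names are illustrative assumptions.
import numpy as np

def bem_update(copies, abs_err, bem):
    m = copies.sum()                           # current training-set size
    err_count = int((abs_err > bem).sum())     # incorrectly predicted examples
    up = m / err_count                         # UpFactor of equation 2.5
    down = 1.0 / up                            # DownFactor of equation 2.5
    scaled = np.where(abs_err > bem, copies * up, copies * down)
    # rounding may drop rarely copied, well-predicted examples entirely
    return np.rint(scaled).astype(int)
```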
3 Methodology

This section describes a new boosting algorithm for regression problems, AdaBoost.RT (R stands for regression and T for threshold).

AdaBoost.RT Algorithm

1. Input:
- Sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where output $y \in \mathbb{R}$
- Weak learning algorithm Weak Learner
- Integer $T$ specifying number of iterations (machines)
- Threshold $\phi$ $(0 < \phi < 1)$ for demarcating correct and incorrect predictions

2. Initialize:
- Machine number or iteration $t = 1$
- Distribution $D_t(i) = 1/m$ for all $i$
- Error rate $\varepsilon_t = 0$

3. Iterate while $t \le T$:
- Call Weak Learner, providing it with distribution $D_t$
- Build the regression model: $f_t(x) \to y$
- Calculate the absolute relative error for each training example as $ARE_t(i) = \left| \frac{f_t(x_i) - y_i}{y_i} \right|$
- Calculate the error rate of $f_t(x)$: $\varepsilon_t = \sum_{i: ARE_t(i) > \phi} D_t(i)$
- Set $\beta_t = \varepsilon_t^n$, where $n$ = power coefficient (e.g., linear, square, or cubic)
- Update distribution $D_t$ as $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } ARE_t(i) \le \phi \\ 1 & \text{otherwise} \end{cases}$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution
- Set $t = t + 1$

4. Output the final hypothesis as the weighted average of the machines' outputs, each machine weighted by $\log(1/\beta_t)$.
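For concreteness, the following is a hypothetical end-to-end sketch of this procedure, with a scikit-learn regression tree standing in for the M5 model tree weak learner; the resampling scheme, the guard against a zero error rate, and all names are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical AdaBoost.RT sketch; a regression tree stands in for the M5 tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaboost_rt(X, y, T=10, phi=0.05, n=2, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    machines, weights = [], []
    for _ in range(T):                         # iterate while t <= T
        idx = rng.choice(m, size=m, p=D)       # sample according to D_t
        f = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
        are = np.abs((f.predict(X) - y) / y)   # absolute relative error
        eps = max(D[are > phi].sum(), 1e-6)    # error rate, guarded against zero
        beta = eps ** n                        # beta_t = eps_t^n
        D = np.where(are <= phi, D * beta, D)  # demote well-predicted examples
        D /= D.sum()                           # renormalize to a distribution
        machines.append(f)
        weights.append(np.log(1.0 / beta))     # machine weight, as in section 2.1
    w = np.asarray(weights)

    def predict(Xq):                           # weighted average of the outputs
        preds = np.array([f.predict(Xq) for f in machines])
        return w @ preds / w.sum()

    return predict
```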
∀K , i = j
0, i = j
,
(6.1)
where RMSE K ,i is the RMSE of the ith machine (rows’ headers in the first column in Table 5), RMSE K , j is the root mean squared error of the jth machine (columns’ header in the first row), Nt is the total number of independent runs for all data sets, and K = 1, 2, ..., Nt. The following relation holds: QSMi, j + QSMj,i = 1
∀ i, j
and i = j.
(6.2)
For example, the value of 77 appearing in the second row and the first column means that AdaBoost.RT beats MT 77% on average. Table 5 demonstrates that AdaBoost.RT is better than all other machines considered 69% of times, and 62%, 47%, and 24% for AdaBoost.R, BEM the boosting method, and the method of Avnimelech and Intrator, respectively. If one is interested to analyze the relative performance of the algorithms more precisely, then a measure reflecting the relative performance is needed.
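A short sketch (assumed array shapes, not from the paper) of how the entries of equation 6.1 can be computed from a matrix of per-run RMSE values:

```python
# rmse[K, i] holds the RMSE of machine i on independent run K.
import numpy as np

def qualitative_scoring_matrix(rmse):
    qsm = (rmse[:, None, :] > rmse[:, :, None]).mean(axis=0)  # i beats j per run
    np.fill_diagonal(qsm, 0.0)      # zero diagonal, as in equation 6.1
    # note: exact ties would break the QSM_ij + QSM_ji = 1 identity of eq. 6.2
    return 100.0 * qsm              # percentages, as reported in Table 5
```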
Table 6: Quantitative Scoring Matrix for Different Machines.

Machine     MT      RT      R2      AI      BEM    Bagging   ANN     Total
MT           0.0   -13.7   -12.0     7.5    -4.3    -6.9     -5.8    -35.2
RT          13.7     0.0     1.4    19.5     8.7     7.5      5.9     56.7
R2          12.0    -1.4     0.0    17.6     7.9     5.9      4.2     46.2
AI          -7.5   -19.5   -17.6     0.0   -10.1   -13.7    -10.8    -79.2
BEM          4.3    -8.7    -7.9    10.1     0.0    -1.6     -4.2     -8.0
Bagging      6.9    -7.5    -5.9    13.7     1.6     0.0      0.3      9.2
ANN          5.8    -5.9    -4.2    10.8     4.2    -0.3      0.0     10.3

Note: Bold type indicates the total value for each learner.
For this reason, we used the so-called quantitative scoring matrix, or, simply, scoring matrix (Solomatine & Shrestha, 2004), as shown in Table 6. It shows the average relative performance (in %) of one technique (either single machine or committee machine) over another one for all the data sets. The element of the scoring matrix $SM_{i,j}$ should be read as the average performance of the $i$th machine (header row in Table 6) over the $j$th machine (header column) and is calculated as follows:

$$SM_{i,j} = \begin{cases} \dfrac{1}{N} \displaystyle\sum_{k=1}^{N} \dfrac{RMSE_{k,j} - RMSE_{k,i}}{\max(RMSE_{k,j},\, RMSE_{k,i})}, & i \neq j \\ 0, & i = j, \end{cases} \tag{6.3}$$
where N is the number of data sets. The value of 13.7 appearing in the second row and the first column in Table 6 means that the performance of AdaBoost.RT over the MT is 13.7% higher if averaged over all seven data sets considered. By summing up all the elements’ values row-wise, one can determine the overall score of each machine. It can be clearly observed from Tables 5 and 6 that AdaBoost.RT on average scores highest, which implies that on the data sets used, AdaBoost.RT is comparatively better than the other techniques considered. 6.7 Comparison with Previous Research. In spite of the attention that boosting receives in classification problems, relatively few comparisons between boosting techniques for regression exist. Drucker (1997) performed some experiments using AdaBoost.R2 with the Friedman#1 and Boston Housing data sets. He obtained an RMSE of 1.69 and 3.27 for the Friedman#1 and Boston Housing data sets, respectively. Our results are consistent with his results; however, there are certain procedural differences in experimental settings. He used 200 training examples, 40 pruning samples, and 5000 test examples per run and per 10 runs to determine the best loss functions. In our experiments, we used only 990 training examples and 510
Table 7: Comparison of Performance (RMSE) of Boosting Algorithms on Laser Data Set (Test Data) Using ANN as a Weak Learning Machine.

Boosting Algorithm    RMSE    NMSE
ANN                   4.07    0.008042
AdaBoost.RT           2.48    0.002768
AdaBoost.R2           2.84    0.003700
AI                    3.13    0.004402
BEM                   3.42    0.005336

Notes: The results are averaged for 10 different runs. Bold type indicates the minimum value of RMSE among different learners.
test examples, and the number of iterations was only 10 against his 100. We did not optimize the loss functions; moreover, the weak learning machine is completely different, as we used the M5 model tree (potentially more accurate), whereas he used a regression tree. We also conducted experiments with the Laser data set using ANNs as a weak learning machine to compare with the previous work of Avnimelech and Intrator (1999). The architecture of neural networks is the same as described before. The hidden layer consisted of six nodes. The number of epochs was limited to 500. The results, reported in Table 7, show that AdaBoost.RT is superior compared to all other boosting methods. The RMSE of AdaBoost.RT is 2.48 as compared to 2.84 for AdaBoost.R2, 3.13 for Avnimelech and Intrator’s method, and 3.42 for the BEM method. The normalized mean squared error (NMSE: the mean squared error divided by the variance across the target value) was also calculated. Avnimelech and Intrator obtained an NMSE value of 0.0047, while ours was 0.0044 using their method on the Laser data set. There were some procedural differences in the experimental settings: they used 5 different partitions of 8000 training examples and 2000 test examples, whereas we used 10 different partitions of 7552 training examples and 2518 test examples. In spite of these differences, it can be said that our result is consistent with theirs. 7 Conclusions Experiments with a new boosting algorithm, AdaBoost.RT, for regression problems were presented and analyzed. Unlike several recent boosting algorithms for regression problems that follow the idea of gradient descent, AdaBoost.RT builds the regression model by simply changing the distribution of the sample and is a variant of AdaBoost.M1 modified for regression. The training examples are projected into two classes by comparing the accuracy of prediction with the preset relative error threshold. The idea of using an error threshold is analogous to the insensitivity range used, for example,
in support vector machines. The loss function is computed using relative error rather than absolute error; in our view this is justified in many real-life applications. Unlike most of the other boosting algorithms, the AdaBoost.RT algorithm does not have to stop when the error rate is greater than 0.5. The modified weight-updating parameter not only ensures that the value of the machine weight is nonnegative but also gives relatively higher emphasis to the harder examples. The boosting method of Avnimelech and Intrator (1999) suffers from data shortage, and the BEM boosting method (Feely, 2000) is time-consuming when handling large data sets. In this sense, AdaBoost.RT would be a preferred option. If compared to AdaBoost.R2, however, AdaBoost.RT needs a parameter to be selected initially, and this introduces additional complexity that has to be handled as in other threshold-based boosting algorithms (for example, those of Avnimelech and Intrator and BEM). The algorithmic structure of AdaBoost.RT is such that it updates the weight-updating parameter (used to calculate the probability to be chosen for the next machine) by the same value. This feature ensures that the outliers (noisy examples) do not dominate the training sets for the subsequent machines. The experimental results demonstrated that AdaBoost.RT outperforms a single machine (i.e., a weak learning machine, for which an M5 model tree was used) for all of the data sets considered. The two-tail sign test also indicates that AdaBoost.RT is better than a single machine with a confidence level higher than 99%. Compared with the other boosting algorithms, it was observed that AdaBoost.RT surpasses them on most of the data sets considered. Qualitative and quantitative performance measures (scoring matrices) also give an indication of the higher accuracy of AdaBoost.RT. However, for a more accurate and statistically significant comparison, more experiments are needed. An obvious next step would be to automate the choice of a (sub)optimal value for the threshold depending on the characteristics of the data set and to test other functional relationships for the weight-updating parameters.

Appendix A: AdaBoost.M1 Algorithm

1. Input:
- Sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where labels $y \in Y = \{1, \ldots, k\}$
- Weak learning algorithm Weak Learner
- Integer $T$ specifying number of iterations (machines)

2. Initialize:
- Machine number or iteration $t = 1$
- Distribution $D_t(i) = 1/m$ for all $i$
- Error rate $\varepsilon_t = 0$

3. Iterate while $\varepsilon_t < 0.5$ and $t \le T$:
- Call Weak Learner, providing it with distribution $D_t$
- Get back a hypothesis $h_t: X \to Y$
- Calculate the error rate of $h_t$: $\varepsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)$
- Set $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$
- Update distribution $D_t$ as $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } h_t(x_i) = y_i \\ 1 & \text{otherwise} \end{cases}$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution
- Set $t = t + 1$

4. Output the final hypothesis: $h_{\text{fin}}(x) = \arg\max_{y \in Y} \sum_{t: h_t(x) = y} \log\frac{1}{\beta_t}$
Appendix B: AdaBoost.R2 Algorithm

1. Input:
- Sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where output $y \in \mathbb{R}$
- Weak learning algorithm Weak Learner
- Integer $T$ specifying number of iterations (machines)

2. Initialize:
- Machine number or iteration $t = 1$
- Distribution $D_t(i) = 1/m$ for all $i$
- Average loss $\bar{L}_t = 0$

3. Iterate while $\bar{L}_t < 0.5$ and $t \le T$:
- Call Weak Learner, providing it with distribution $D_t$
- Build the regression model: $f_t(x) \to y$
- Calculate the loss for each training example as $l_t(i) = |f_t(x_i) - y_i|$
- Calculate the loss function $L_t(i)$ for each training example using one of three functional forms: linear, $L_t(i) = l_t(i)/\text{Denom}_t$; square law, $L_t(i) = \left[ l_t(i)/\text{Denom}_t \right]^2$; or exponential, $L_t(i) = 1 - \exp(-l_t(i)/\text{Denom}_t)$, where $\text{Denom}_t = \max_{i=1,\ldots,m} l_t(i)$
- Calculate an average loss: $\bar{L}_t = \sum_{i=1}^{m} L_t(i) D_t(i)$
- Set $\beta_t = \bar{L}_t/(1 - \bar{L}_t)$
- Update distribution $D_t$ as $D_{t+1}(i) = D_t(i)\, \beta_t^{(1 - L_t(i))}/Z_t$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution
- Set $t = t + 1$

4. Output the final hypothesis (the weighted median):

$$f_{\text{fin}}(x) = \inf\left\{ y \in Y : \sum_{t: f_t(x) \le y} \log\frac{1}{\beta_t} \ge \frac{1}{2} \sum_{t} \log\frac{1}{\beta_t} \right\}$$
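A compact sketch (ours, not the paper's code) of the three candidate loss functions listed above, each constrained to [0, 1] by the normalizer $\text{Denom}_t$:

```python
# Illustrative AdaBoost.R2 loss functions; names are assumptions.
import numpy as np

def r2_loss(pred, y, kind="linear"):
    l = np.abs(pred - y)              # l_t(i) = |f_t(x_i) - y_i|
    denom = l.max()                   # Denom_t = max_i l_t(i)
    if kind == "linear":
        return l / denom
    if kind == "square":
        return (l / denom) ** 2
    return 1.0 - np.exp(-l / denom)   # exponential form
```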
Appendix C: Boosting Method of Avnimelech and Intrator (1999)

1. Split the training examples into three sets—Set 1, Set 2, and Set 3—in such a way that Set 1 is smaller than the other two sets.
2. Train the first machine on Set 1.
3. Assign to the second machine all the examples from Set 2 on which the first machine has a big error and a similar number of examples from Set 2 on which it does not have a big error, and train the second machine on it.
4. Assign the training examples to the third machine according to the following different versions:
- Boost1: All examples on which exactly one of the first two machines has a big error.
- Boost2: Data set constructed for Boost1 plus all examples on which both machines have a big error but these errors have different signs.
- Modified Boost2: Data set constructed for Boost2 plus all examples on which both machines have a big error, but there is a "big" difference between the magnitude of the errors.
5. Combine the outputs of the three machines using the median as the final prediction from the ensemble.

Appendix D: BEM Boosting Method (Feely, 2000)

1. Input:
- Sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where output $y \in \mathbb{R}$
- Weak learning algorithm Weak Learner
- Integer $T$ specifying number of iterations
- BEM for demarcating correct and incorrect predictions

2. Initialize:
- Machine number or iteration $t = 1$
- Distribution $D_t(i) = 1/m$ for all $i$
- Error count $\text{errCount}_t = 0$

3. Iterate while $t \le T$:
- Call Weak Learner, providing it with distribution $D_t$
- Build the regression model: $f_t(x) \to y$
- Calculate the absolute error $AE_t(i)$ for each training example
- Calculate the error count of $f_t(x)$ over the whole training data: $\text{errCount}_t = \#\{i : AE_t(i) > \text{BEM}\}$
- Calculate the Upfactor and Downfactor as $\text{Upfactor}_t = m/\text{errCount}_t$ and $\text{Downfactor}_t = 1/\text{Upfactor}_t$
- Update distribution $D_t$ as $D_{t+1}(i) = D_t(i) \times \begin{cases} \text{Upfactor}_t & \text{if } AE_t(i) > \text{BEM} \\ \text{Downfactor}_t & \text{otherwise} \end{cases}$
- Sample new training data according to the distribution $D_{t+1}$
- Set $t = t + 1$

4. Combine outputs from the individual machines.

References

Avnimelech, R., & Intrator, N. (1999). Boosting regression estimators. Neural Computation, 11(2), 499–520. Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html. Breiman, L. (1996a). Stacked regressions. Machine Learning, 24(1), 49–64. Breiman, L. (1996b). Bagging predictors. Machine Learning, 24(2), 123–140. Breiman, L. (1996c). Bias, variance, and arcing classifiers (Tech. Rep. 460). Berkeley: Statistics Department, University of California. Breiman, L. (1997). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1518. Cherkassky, V., & Ma, Y. (2004). Comparison of loss functions for linear regression. In Proc. of the International Joint Conference on Neural Networks (pp. 395–400). Piscataway, NJ: IEEE. Drucker, H. (1997). Improving regressors using boosting techniques. In D. H. Fisher, Jr. (Ed.), Proc. of the 14th International Conference on Machine Learning (pp. 107–115). San Mateo, CA: Morgan Kaufmann. Drucker, H. (1999). Boosting using neural networks. In A. J. C. Sharkey (Ed.), Combining artificial neural nets (pp. 51–77). London: Springer-Verlag. Duffy, N., & Helmbold, D. P. (2000). Leveraging for regression. In Proc. of the 13th Annual Conference on Computational Learning Theory (pp. 208–219). San Francisco: Morgan Kaufmann. Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman and Hall. Feely, R. (2000). Predicting stock market volatility using neural networks. Unpublished B.A. (Mod.) dissertation, Trinity College Dublin. Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proc. of the 13th International Conference on Machine Learning (pp. 148–156). San Francisco: Morgan Kaufmann.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1), 1–82. Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374. Kearns, M., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press. Namee, B. M., Cunningham, P., Byrne, S., & Corrigan, O. I. (2000). The problem of bias in training data in regression problems in medical decision support. Pádraig Cunningham's Online publications, TCD-CS-2000-58. Available online at http://www.cs.tcd.ie/Padraig.Cunningham/online-pubs.html. Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198. Quinlan, J. R. (1992). Learning with continuous classes. In Proc. of the 5th Australian Joint Conference on AI (pp. 343–348). Singapore: World Scientific. Quinlan, J. R. (1996). Bagging, boosting and C4.5. In Proc. of the 13th National Conference on Artificial Intelligence (pp. 725–730). Menlo Park, CA: AAAI Press. Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172–181. Ridgeway, G., Madigan, D., & Richardson, T. (1999). Boosting methodology for regression problems. In Proc. of the 7th International Workshop on Artificial Intelligence and Statistics (pp. 152–161). San Francisco: Morgan Kaufmann. Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227. Solomatine, D. P., & Dulal, K. N. (2003). Model tree as an alternative to neural network in rainfall-runoff modelling. Hydrological Science Journal, 48(3), 399–411. Solomatine, D. P., & Shrestha, D. L. (2004). AdaBoost.RT: A boosting algorithm for regression problems. In Proc. of the International Joint Conference on Neural Networks (pp. 1163–1168). Piscataway, NJ: IEEE. Tresp, V. (2001). Committee machines. In Y. H. Hu & J.-N. Hwang (Eds.), Handbook for neural network signal processing. Boca Raton, FL: CRC Press. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer. Weigend, A. S., & Gershenfeld, G. (1993). Time series prediction: Forecasting the future and understanding the past. In Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis. Menlo Park, CA: Addison-Wesley. Witten, I. H., & Frank, E. (2000). Data mining. San Francisco: Morgan Kaufmann. Zemel, R., & Pitassi, T. (2001). A gradient-based boosting algorithm for regression problems. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Received November 16, 2004; accepted October 21, 2005.
LETTER
Communicated by Carter Wendelken
A Connectionist Computational Model for Epistemic and Temporal Reasoning Artur S. d’Avila Garcez
[email protected] Department of Computing, City University, London EC1V 0HB, UK
Luís C. Lamb
[email protected] Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre RS, 91501-970, Brazil
The importance of the efforts to bridge the gap between the connectionist and symbolic paradigms of artificial intelligence has been widely recognized. The merging of theory (background knowledge) and data learning (learning from examples) into neural-symbolic systems has indicated that such a learning system is more effective than purely symbolic or purely connectionist systems. Until recently, however, neural-symbolic systems were not able to fully represent, reason, and learn expressive languages other than classical propositional and fragments of first-order logic. In this article, we show that nonclassical logics, in particular propositional temporal logic and combinations of temporal and epistemic (modal) reasoning, can be effectively computed by artificial neural networks. We present the language of a connectionist temporal logic of knowledge (CTLK). We then present a temporal algorithm that translates CTLK theories into ensembles of neural networks and prove that the translation is correct. Finally, we apply CTLK to the muddy children puzzle, which has been widely used as a testbed for distributed knowledge representation. We provide a complete solution to the puzzle with the use of simple neural networks, capable of reasoning about knowledge evolution in time and of knowledge acquisition through learning.

Neural Computation 18, 1711–1738 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

The importance of the efforts to bridge the gap between the connectionist and symbolic paradigms of artificial intelligence has been widely recognised (Ajjanagadde, 1997; Cloete & Zurada, 2000; Shastri, 1999; Sun, 1995; Sun & Alexandre, 1997). The merging of theory (background knowledge) and data learning (learning from examples) into neural networks has been shown to provide a learning system that is more effective than purely symbolic or purely connectionist systems, especially when data are noisy
(Towell & Shavlik, 1994). This contributed decisively to the growing interest in developing neural-symbolic learning systems, hybrid systems based on neural networks that are capable of learning from examples and background knowledge (d'Avila Garcez, Broda, & Gabbay, 2002). Typically, translation algorithms from a symbolic to a connectionist representation, and vice versa, are employed to provide a neural implementation of a logic, a sound logical characterization of a neural system, or a hybrid learning system that brings together features from connectionism and symbolic artificial intelligence (AI).

As argued in Browne and Sun (2001), if connectionism is an alternative paradigm to artificial intelligence, neural networks must be able to compute symbolic reasoning in an efficient and effective way. Further, it is argued that in hybrid systems, usually the connectionist component is fault tolerant, while the symbolic component may be brittle and rigid. We tackle this problem of the symbolic component by offering a principled way to compute, represent, and learn propositional temporal and epistemic reasoning within connectionist models.

Until recently, neural-symbolic systems were not able to fully represent, compute, and learn expressive languages other than propositional and fragments of first-order, classical logic (Cloete & Zurada, 2000). However, in d'Avila Garcez, Lamb, Broda, and Gabbay (2004) and d'Avila Garcez, Lamb, and Gabbay (2002), a new approach to knowledge representation and reasoning in neural-symbolic systems based on neural network ensembles was proposed. This new approach shows that nonclassical, modal logics can be effectively represented in artificial neural networks. Learning in the connectionist modal logic system is achieved by training each network in the ensemble, which corresponds to the current knowledge of an agent within a possible world.

In this letter, following the formalization of connectionist modal logics (CML) proposed in d'Avila Garcez, Lamb, et al. (2002), d'Avila Garcez, Lamb, Broda, and Gabbay (2003), and d'Avila Garcez et al. (2004), we show that temporal and epistemic logics, by means of temporal and epistemic logic programming fragments, can be effectively represented in, and combined with, artificial neural networks (d'Avila Garcez & Lamb, 2004). This is done by providing a translation algorithm from temporal logic theories to the initial architecture of a neural network. A theorem then shows that the given temporal theory and the network are equivalent in the sense that the network computes the fixed point of the theory. We then validate the connectionist temporal logic of knowledge (CTLK) system by applying it to a distributed time and knowledge representation problem known as the muddy children puzzle (Fagin, Halpern, Moses, & Vardi, 1995).

As an extension of CML that includes temporal operators, CTLK provides a combined (multimodal) connectionist system of knowledge and time. This allows the modeling of evolving situations such as changing environments or possible worlds, and the construction of a connectionist
model for reasoning about the temporal evolution of knowledge. These features, combined with the computational power of neural networks, lead us toward a rich neural-symbolic learning system (d'Avila Garcez, Broda, et al., 2002), where various forms of nonclassical reasoning are naturally represented, derived, and learned. Hence, the approach presented here extends the representation power of artificial neural networks beyond the classical level.

There has been a considerable amount of work in symbolic AI using nonclassical logics. It is important that these are investigated within the neural computation paradigm.1 For instance, applications in AI and computer science have made extensive use of decidable modal logics, including the analysis and model checking of distributed and multi-agent systems, program verification and specification, and hardware model checking. In the case of temporal and epistemic logics, these have found a large number of applications, notably in game theory and in models of knowledge and interaction in multi-agent systems (Pnueli, 1977; Fagin et al., 1995; Gabbay, Hodkinson, & Reynolds, 1994). Therefore, this work contributes toward the representation of such expressive, highly successful logical languages in neural networks.

Our long-term aim is to contribute to the challenge of representing expressive symbolic formalisms within learning systems. We are proposing a methodology for the representation of several nonclassical logics in artificial neural networks. Such expressive logics have been successfully used in computer science, and we believe that connectionist approaches should take them into account by means of adequate computational models catering for reasoning, knowledge representation, and learning. According to Valiant (2003), the two most fundamental phenomena of intelligent cognitive behavior are the ability to learn from experience and the ability to reason from what has been learned:

We are seeking a semantics of knowledge that can computationally support the basic phenomena of intelligent behaviour. It should support integrated algorithms for learning and reasoning that are computationally tractable and have some nontrivial scope. Another requirement is that it has a principled way of ensuring that the knowledge-base from which reasoning is done is robust, in the sense that errors in the deductions are at least controllable. (Valiant, 2003)
Taking the requirements put forward by Valiant into consideration, this article provides a robust connectionist computational model for epistemic and temporal reasoning. Knowledge is represented by a symbolic language, while deduction and learning are carried out by a connectionist engine. 1 It is well known that modal logics correspond, in terms of expressive power, to a two-variable fragment of first-order logic (Vardi, 1997). Further, as the two-variable fragment of predicate logic is decidable, this explains why modal logics are so robustly decidable and amenable to multiple applications.
The remainder of this letter is organized as follows. In section 2, we recall some useful preliminary concepts and present the CML system. In section 3, we present the CTLK system, introduce the temporal algorithm, which translates temporal logic programs into artificial neural networks, and prove that the translation is correct. In section 4, we use CML and CTLK to tackle the muddy children puzzle and compare the solutions provided by each system. In section 5, we conclude and discuss directions for future work.

2 Preliminaries

CTLK uses ensembles of connectionist inductive learning and logic programming (C-ILP) neural networks (d'Avila Garcez, Broda, et al., 2002; d'Avila Garcez & Zaverucha, 1999). C-ILP networks are single hidden-layer networks that can be trained with backpropagation (Rumelhart, Hinton, & Williams, 1986). In C-ILP, a translation algorithm (given later) maps a logic program P into a single hidden-layer neural network N such that N computes the least fixed point of P. This provides a massively parallel model for computing the stable model semantics of P (Gelfond & Lifschitz, 1991). In addition, N can be trained with examples using backpropagation, having P as background knowledge. The knowledge acquired by training can then be extracted (d'Avila Garcez, Broda, & Gabbay, 2001), closing the learning cycle (as in Towell & Shavlik, 1994).

Let us exemplify how the C-ILP translation algorithm works. Each rule rl of program P is mapped from the input layer to the output layer of N through one neuron (Nl) in the single hidden layer of N. Thresholds and weights are such that the hidden layer computes a logical and of the input layer, while the output layer computes a logical or of the hidden layer. Hence, the translation algorithm from P to N has to implement the following conditions: (C1) the input potential of a hidden neuron (Nl) can only exceed Nl's threshold (θl), activating Nl, when all the positive antecedents of rule rl are assigned truth-value true while all the negative antecedents of rl are assigned false; and (C2) the input potential of an output neuron (A) can only exceed A's threshold (θA), activating A, when at least one hidden neuron Nl that is connected to A is activated.

Example 1 (C-ILP). Consider the logic program P = {r1 : B ∧ C ∧ ¬D → A; r2 : E ∧ F → A; r3 : B}. The translation algorithm derives the network N of Figure 1, setting weights (W) and thresholds (θ) in such a way that conditions C1 and C2 above are satisfied. Note that if N has to be fully connected, any other link (not shown in Figure 1) should receive weight zero initially. Each input and output neuron of N is associated with an atom of P. As a result, each input and output vector of N can be associated with an interpretation for P, so that an atom (e.g., A) is true iff its corresponding neuron (neuron A) is activated. Note also that each hidden neuron Nl corresponds
to a rule rl of P. In order to compute a stable model, output neuron B should feed input neuron B such that N is used to iterate the fixed-point operator TP of P (d'Avila Garcez, Lamb, et al., 2002). This is done by transforming N into a recurrent network Nr, containing feedback connections from the output to the input layer of N, all with fixed weights Wr = 1. As a result, the activation of output neuron B feeds back into the activation of input neuron B, which allows the network to compute chains such as A → B and B → C. In the case of P above, given any initial activation to the input layer of Nr (the network of Figure 1 recurrently connected), it always converges to the following stable state: A = false, B = true, C = false, D = false, E = false, and F = false.

Figure 1: Sketch of a neural network N that represents logic program P.

In CML, a (one-dimensional) ensemble of C-ILP neural networks is used to represent modalities such as necessity and possibility. In CTLK, a two-dimensional C-ILP ensemble is used to represent the evolution of knowledge through time. In both cases, each C-ILP network can be seen as representing a (learnable) possible world that contains information about the knowledge held by an agent in a distributed system.
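To make this computation concrete, the following minimal sketch (our illustration, not the authors' implementation) replays the recurrent iteration for the program P of example 1. It replaces the semilinear units by the Boolean and/or behavior that conditions C1 and C2 enforce, which is all that matters for the fixed-point computation:

    # Sketch of the recurrent C-ILP computation for
    # P = {r1: B AND C AND NOT D -> A; r2: E AND F -> A; r3: B}.
    ATOMS = ["A", "B", "C", "D", "E", "F"]
    RULES = [(["B", "C"], ["D"], "A"),   # (positive body, negative body, head)
             (["E", "F"], [], "A"),
             ([], [], "B")]              # the fact r3: B has an empty body

    def tp(state):
        """One feedforward pass of N = one application of the operator TP."""
        out = dict.fromkeys(ATOMS, False)
        for pos, neg, head in RULES:
            # hidden neuron Nl: logical AND of the rule body (condition C1)
            body = all(state[a] for a in pos) and not any(state[a] for a in neg)
            # output neuron: logical OR over its hidden neurons (condition C2)
            out[head] = out[head] or body
        return out

    state = dict.fromkeys(ATOMS, False)   # any initial activation works
    while (nxt := tp(state)) != state:    # feedback connections, weight Wr = 1
        state = nxt
    print(state)   # stable state: B = True, every other atom False

Running the loop from any initial assignment reaches the stable state quoted above, mirroring the convergence of the recurrent network Nr.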
2.1 Connectionist Modal Logic. Modal logic began with the analysis of concepts such as necessity and possibility under a philosophical logic perspective (Chagrov & Zakharyaschev, 1997). Modal logic was found to be appropriate to study mathematical necessity (in the logic of provability), time, knowledge, and other modalities (Fagin et al., 1995; Chagrov & Zakharyaschev, 1997). A main feature of modal logics is the use of Kripke's possible world semantics, a fundamental abstraction in the semantics of modal logics (Broda, Gabbay, Lamb, & Russo, 2004; Chagrov & Zakharyaschev, 1997; Fagin et al., 1995; Gabbay, 1996). In such a semantics, a proposition is necessarily true in a world if it is true in all worlds possible in relation to that world, whereas it is possibly true in a world if it is true in at least one world that is possible in relation to that same world. This is expressed in the semantics formalization by the use of a binary accessibility relation between possible worlds.

The language of basic propositional modal logic extends the language of propositional logic with the necessity (□) and possibility (♦) operators. The □ modality is also known as a universal operator, since it expresses the idea of a proposition being "universally" true, or true under all interpretations. The ♦ modality is seen as an existential operator, in the sense that it "quantifies" a formula that is possibly true, that is, true in some interpretation. We start with a simple example to illustrate intuitively how an ensemble of neural networks is used for modeling nonclassical reasoning with propositional modal logic. We will see that C-ILP ensembles are appropriate to represent the □ and ♦ modalities in a connectionist setting.

In order to reason with modal operators (temporal, epistemic, spatial, or multimodal), a variety of proof systems for modal logics have been developed over the years (Broda et al., 2004; Gabbay, 1996). Further, logic programming approaches to deal with intensional (Orgun & Wadge, 1992) and modal operators (Orgun & Ma, 1994; Farinas del Cerro & Herzig, 1995) have also been extensively developed, leading to implementations of symbolic modal reasoning. In some of these reasoning mechanisms, formulas are labeled by the worlds (or states) in which they hold, facilitating the modal reasoning process, an approach we adopt here.

Consider Figure 2. It shows an ensemble of three C-ILP neural networks labeled ω1, ω2, ω3, which might communicate in different ways. We look at ω1, ω2, and ω3 as possible worlds. Input and output neurons may now represent □L, ♦L, or L, where L is a literal (an atom or a negation of an atom), □ denotes modal necessity (□L is also read "box L," and it means that L is necessarily true), and ♦ denotes modal possibility (♦L is also read "diamond L," and it means that L is possibly true). □A will be true in a world ωi if A is true in all worlds ωj to which ωi is related. Similarly, ♦A will be true in a world ωi if A is true in some world ωj to which ωi is related. As a result, if neuron □A is activated at a world (network) ω1, denoted by ω1 : □A, and ω1 is related to worlds (networks) ω2 and ω3, then neuron A must be activated in ω2 and ω3. Similarly, if neuron ♦A is activated
in ω1, then a neuron A must be activated in an arbitrary network related to ω1.2 It is also possible to make use of connectionist modal logic (CML) ensembles to compute that □A holds at a possible world, say ωi, whenever A holds at all possible worlds related to ωi, by connecting the output neurons of the related networks to a hidden neuron in ωi, which connects to an output neuron labeled □A. Dually for ♦A, whenever A holds at some possible world related to ωi, we connect the output neuron representing A to a hidden neuron in ωi that connects to an output neuron labeled ♦A.3

Figure 2: The ensemble of networks that represents modal program P.

2 In practice, the easiest way to implement ♦A will be to create a network N in which A is set up, as described later in our translation algorithm.
3 These rules are actually instances of the rules known as □-Elimination, □-Introduction, ♦-Elimination, and ♦-Introduction (Broda et al., 2004). In a connectionist setting, □-Elimination deals with neurons labeled by formulas of the type ωi : □α. Intuitively, the modal necessity is "eliminated" in the sense that the rule allows the inference of α from □α at all worlds, say ωj, related to ωi via the accessibility relation. In the case of ♦-Introduction, in a connectionist setting, the ♦ is "introduced," in the sense that from a formula ωj : α and a relation R(ωi, ωj), one can infer a formula ♦α at the world ωi. The cases of □-Introduction and ♦-Elimination are analogous.

Due to the simplicity of each network in the CML ensemble, when it comes to learning, we can use backpropagation on each network to learn the local knowledge in each possible world. The way we should connect the different networks in the ensemble is then given by the meaning of □ and ♦, which is not supposed to change, and the accessibility relation. Let us give an example of the use of CML.

Example 2 (CML). Let P = {ω1 : r → □q, ω1 : ♦s → r, ω2 : s, ω3 : q → ♦p, R(ω1, ω2), R(ω1, ω3)}. We start by applying the C-ILP translation algorithm (given later), which creates three neural networks to represent the worlds ω1, ω2, and ω3 (see Figure 2). Then we apply the modalities algorithm (also given later) in order to connect the networks. Hidden neurons labeled by {M, ∨, ∧} are created using the modalities algorithm. The remaining neurons are all created using the translation algorithm. For clarity, unconnected input and output neurons are not shown in Figure 2. Taking ω1, □q is connected to q in both ω2 and ω3 so that whenever □q is active, q is also active. Taking ω2, since s is a fact in ω2, and ω1 is related to ω2, neuron s must be connected to a neuron ♦s in ω1 such that, whenever s is active, ♦s also is. Dually, ♦s is connected to s, and for □q, neurons q in ω2 and ω3 are connected to neuron □q in ω1, so that if q is active in both ω2 and ω3, then □q is active in ω1.

The modalities algorithm will perform the translation of modal logic programs into neural networks; it reflects the underlying meaning of the □ and ♦ modalities as formally interpreted in Kripke models, according to the definitions below.

Definition 1 (Kripke models). A Kripke model for a modal language L is a tuple M = ⟨Ω, R, v⟩, where Ω is a set of possible worlds, v is a mapping that assigns to each propositional letter of L a subset of Ω, and R is a binary relation over Ω. We say that a modal formula α is true at a possible world ω of a model M, written (M, ω) |= α, if one of the following satisfiability conditions holds.

Definition 2 (satisfiability of modal formulas). Let L be a modal language, and let M = ⟨Ω, R, v⟩ be a Kripke model. The satisfiability relation |= is uniquely defined as follows:
i. (M, ω) |= p iff ω ∈ v(p) for a propositional letter p
ii. (M, ω) |= ¬α iff (M, ω) ⊭ α
iii. (M, ω) |= α ∧ β iff (M, ω) |= α and (M, ω) |= β
iv. (M, ω) |= α ∨ β iff (M, ω) |= α or (M, ω) |= β
v. (M, ω) |= α → β iff (M, ω) ⊭ α or (M, ω) |= β
vi. (M, ω) |= □α iff for all ω1 ∈ Ω, if R(ω, ω1) then (M, ω1) |= α
vii. (M, ω) |= ♦α iff there is a ω1 such that R(ω, ω1) and (M, ω1) |= α.

In order to reason with CML, we have to define an algorithm that translates a class of extended modal programs into ensembles of neural networks. Formal definitions of extended modal programs are as follows.

Definition 3 (modal literal). A modal atom is of the form MA where M ∈ {□, ♦} and A is an atom. A modal literal is of the form ML where L is a literal.

Definition 4 (modal logic program). A modal program is a finite set of clauses of the form a1, . . . , an → an+1 where ai (1 ≤ i ≤ n) is either an atom or a modal atom, and an+1 is an atom.

We define extended modal programs as modal programs extended with modalities □ and ♦ in the head of clauses, and negation in the body of clauses. In addition, each clause is labeled by the possible world in which it holds, as in Gabbay's labeled deductive systems (Broda et al., 2004; Gabbay, 1996).

Definition 5 (extended modal logic program). An extended modal program is a finite set of clauses C of the form ωi : l1, . . . , ln → ln+1, where li (1 ≤ i ≤ n) is either a literal or a modal literal and ln+1 is either an atom or a modal atom, and a finite set of relations R(ωi, ωj) between worlds ωi and ωj in C, where ωi is a label representing a world in which the associated clause holds. For example, P = {ω1 : r → □q, ω1 : ♦s → r, ω2 : s, ω3 : q → ♦p, R(ω1, ω2), R(ω1, ω3)} is an extended modal program.

Definition 6 (modal immediate consequence operator MTP). Let P = {P1, . . . , Pk} be an extended modal program, where each Pi is the set of modal clauses that hold in a world ωi (1 ≤ i ≤ k). Let BP denote the set of atoms occurring in P, and let I be an interpretation for P. Let a be either an atom or a modal atom. The mapping MTP : 2^BP → 2^BP in ωi is defined as follows: MTP(I) = {a ∈ BP | either i or ii or iii or iv or v below holds}:
i. l1, . . . , ln → a is a clause in P and {l1, . . . , ln} ⊆ I.
ii. a is of the form ωi : A, ωi is a particular world associated with A, and there is a world ωk such that R(ωi, ωk), and ωk : l1, . . . , lm → ♦A is a clause in P with {l1, . . . , lm} ⊆ I.
iii. a is of the form ωi : ♦A and there exists a world ωj such that R(ωi, ωj), and ωj : l1, . . . , lm → A is a clause in P with {l1, . . . , lm} ⊆ I.
iv. a is of the form ωi : □A and for each world ωj such that R(ωi, ωj), ωj : l1, . . . , lo → A is a clause in P with {l1, . . . , lo} ⊆ I.
v. a is of the form ωi : A and there exists a world ωk such that R(ωk, ωi), and ωk : l1, . . . , lo → □A is a clause in P with {l1, . . . , lo} ⊆ I.

3 Connectionist Temporal Logic of Knowledge

Temporal logic and its combination with other modalities such as knowledge and belief operators have long been the subject of intensive investigation (Fagin et al., 1995; Gabbay, Kurucz, Wolter, & Zakharyaschev, 2003; Hintikka, 1962). Temporal logic has evolved from philosophical logic to become one of the main logical systems used in computer science and artificial intelligence (Pnueli, 1977; Fagin et al., 1995; Gabbay et al., 1994). It has been shown to be a powerful formalism for the modeling, analysis, verification, and specification of distributed systems (Fagin et al., 1995; Halpern, van der Meyden, & Vardi, 2004). Further, in logic programming, several approaches to deal with temporal and modal reasoning have been developed, leading to application in databases, knowledge representation, and the specification of systems (see, e.g., Farinas del Cerro & Herzig, 1995; Orgun & Ma, 1994; Orgun & Wadge, 1994).

In this section, we show how temporal logic programs can be expressed in a connectionist setting in conjunction with a knowledge operator. We do so by extending CML into a connectionist temporal logic of knowledge (CTLK), which allows the specification of knowledge evolution through time in network ensembles. In what follows, we present a temporal algorithm, which translates temporal logic programs into artificial neural networks, and a theorem showing that the temporal theory and the network ensemble are equivalent, and therefore that the translation is correct. Let us start by presenting a simple example.

Example 3 (next time operator). One of the typical axioms of temporal logics of knowledge is Ki○α → ○Kiα (Halpern et al., 2004), where ○ denotes the next time temporal operator. This means that what agent i knows today about tomorrow (Ki○α), she still knows tomorrow (○Kiα). In other words, this axiom states that an agent does not forget what she knew. This can be represented in an ensemble of C-ILP networks with the use of a network that represents the agent's knowledge today, a network that represents the agent's knowledge tomorrow, and the appropriate connections between networks. Clearly, an output neuron K○α of a network that represents agent i at time t needs to be connected to an output neuron Kα of a network that represents agent i at time t + 1 in such a way that, whenever K○α is activated, Kα is also activated. This is illustrated in Figure 3,
where the black circle denotes a neuron that is always activated, and the activation value of output neuron K○α is propagated to output neuron Kα. Weights must be such that Kα is also activated.

Figure 3: Simple example of connectionist temporal reasoning.

Generally the idea behind a connectionist temporal logic is to have (instead of a single ensemble) a number n of ensembles, each representing the knowledge held by a number of agents at a given time point t. Figure 4 illustrates how this dynamic feature can be combined with the symbolic features of the knowledge represented in each network, allowing not only the analysis of the current state (possible world or time point) but also the analysis of how knowledge changes through time.

Figure 4: Evolving knowledge through time.

3.1 The Language of CTLK. In order to reason over time and represent knowledge evolution, we combine temporal logic programming (Orgun & Ma, 1994) and the knowledge operator Ki into a connectionist temporal logic of knowledge (CTLK). The implementation of Ki is analogous to that of □; we treat Ki as a universal modality as done in Fagin et al. (1995). This will become clearer when we apply a temporal operator and Ki to the muddy children puzzle in section 4.

Definition 7 (connectionist temporal logic). The language of CTLK contains:
1. A set {p, q, r, . . .} of primitive propositions
2. A set of agents A = {1, . . . , n}
3. A set of connectives Ki (i ∈ A), where Ki p reads "agent i knows p"
4. The temporal operator ○ (next time)
5. A set of extended modal logic clauses of the form t : ML1, . . . , MLn → MLn+1, where t is a label representing a discrete time point in which the associated clause holds, M ∈ {□, ♦}, and Lj (1 ≤ j ≤ n + 1) is a literal.

We consider the case of a linear flow of time. As a result, the semantics of CTLK requires that we build models in which possible states form a linear temporal relationship. Moreover, to each time point, we associate the set of formulas holding at that point by a valuation map. The definitions are as follows.
Definition 8 (time line). A time line T is a sequence of ordered points, each one corresponding to a natural number.

Definition 9 (temporal model). A model M is a tuple M = (T, R1, . . . , Rn, π), where (i) T is a (linear) time line; (ii) Ri (i ∈ A) is an agent accessibility relation over T; and (iii) π is a map associating with each propositional variable p of CTLK a set π(p) of time points in T.

The truth conditions for CTLK's well-formed formulas are then defined by the following satisfiability relation:

Definition 10 (satisfiability of temporal formulas). Let M = ⟨T, Ri, π⟩ be a temporal model for CTLK. The satisfiability relation |= is uniquely defined as follows:
i. (M, t) |= p iff t ∈ π(p)
ii. (M, t) |= ¬α iff (M, t) ⊭ α
iii. (M, t) |= α ∧ β iff (M, t) |= α and (M, t) |= β
iv. (M, t) |= α ∨ β iff (M, t) |= α or (M, t) |= β
v. (M, t) |= α → β iff (M, t) ⊭ α or (M, t) |= β
vi. (M, t) |= □α iff for all u ∈ T, if R(t, u) then (M, u) |= α
vii. (M, t) |= ♦α iff there exists a u such that R(t, u) and (M, u) |= α
viii. (M, t) |= ○α iff (M, t + 1) |= α
ix. (M, t) |= Ki α iff for all u ∈ T, if Ri(t, u) then (M, u) |= α

Since every clause is labeled by a time point t ranging from 1 to n, if ○A holds at time point n, our time line will have n + 1 time points; otherwise, it will contain n time points. Results provided by Brzoska (1991), Orgun and Wadge (1992, 1994), and Farinas del Cerro and Herzig (1995) for temporal extensions of logic programming apply directly to CTLK. The following definitions will be needed to express the computation of CTLK in neural networks.

Definition 11 (temporal clause). A clause of the form t : L1, . . . , Lo → Lo+1 is called a CTLK temporal clause, which holds at time point t, where Lj (1 ≤ j ≤ o + 1) is either a literal, a modal literal, or of the form Ki Lj (i ∈ A).

Definition 12 (temporal immediate consequence operator TP). Let P = {P1, . . . , Pk} be a CTLK temporal logic program (i.e., a finite set of CTLK temporal clauses). The mapping TPi : 2^BP → 2^BP at time point ti (1 ≤ i ≤ k) is defined as follows: TPi(I) = {L ∈ BP | either i or ii or iii below holds}: (i) there exists a clause ti−1 : L1, . . . , Lm → ○L in Pi−1 and {L1, . . . , Lm} is satisfied by an interpretation J for Pi−1;4 (ii) L is qualified by ○, there exists a clause ti+1 : L1, . . . , Lm → L in Pi+1, and {L1, . . . , Lm} is satisfied by an interpretation J for Pi+1; (iii) L ∈ MTPi(I). A global temporal immediate consequence operator can be defined as TP(I1, . . . , Ik) = ⋃_{j=1..k} {TPj}.

3.2 The CTLK Algorithm. In this section, we present an algorithm to translate temporal logic programs into (two-dimensional) neural network ensembles. We consider temporal clauses and make use of CML's modalities algorithm and of C-ILP's translation algorithm, both reproduced below. The temporal algorithm is concerned with how to represent the next time connective ○ and the knowledge operator K, which may appear in clauses of the form ti : Ka L1, . . . , Kb Lo → Kc Lo+1, where a, b, c, . . . are agents and 1 ≤ i ≤ n. In such clauses, we extend a normal clause of the form L1, . . . , Lo → Lo+1 to allow the quantification of each literal with a knowledge operator indexed by different agents {a, b, c, . . .} varying from 1 to m. We also label the clause with a time point ti in our timescale varying from 1 to n, and we allow the use of the next time operator ○ on the left-hand side of the knowledge operator.5 For example, the clause t1 : Kj α, Kk β → ○Kj γ states that if agent j knows α and agent k knows β at time t1, then agent j knows γ at time t2.

The CTLK algorithm is presented below, where Nk,t will denote a C-ILP neural network for agent k at time t. Let q denote the number of clauses occurring in P. Let ol denote the number of literals in the body of clause l. Let µl denote the number of clauses in P with the same consequent, for each clause l. Let h(x) = 2/(1 + e^(−βx)) − 1, where β ∈ (0, 1). Let Amin be the minimum activation for a neuron to be considered "active" (or true), Amin ∈ (0, 1). Set Amin > (MAXP(o1, . . . , oq, µ1, . . . , µq) − 1)/(MAXP(o1, . . . , oq, µ1, . . . , µq) + 1). Let W (respectively, −W) be the weight of connections associated with positive (respectively, negative) literals. Set W ≥ (2/β) · (ln(1 + Amin) − ln(1 − Amin))/(MAXP(o1, . . . , oq, µ1, . . . , µq) · (Amin − 1) + Amin + 1).6
4 Notice that this definition implements a credulous approach in which every agent is assumed to be truthful, and therefore every agent believes not only in what he knows about tomorrow but also in what he is informed by other agents about tomorrow. A more skeptical approach could be implemented by restricting the derivation of ○A to interpretations in Pi only.
5 Notice that according to definition 10, if ○A is true at time t and t is the last time point n, the CTLK algorithm will create n + 1 points, as described here.
6 These values for Amin and W are obtained from the proof of the correctness of the C-ILP translation algorithm.
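As a quick illustration of this parameter setup (our sketch; the example values of ol and µl below are only illustrative), the bounds on Amin and W can be computed directly:

    from math import log

    def cilp_parameters(body_sizes, mu, beta=0.5):
        """Pick Amin and W satisfying the constraints quoted above.
        body_sizes holds the o_l of each clause; mu holds the mu_l."""
        maxp = max(body_sizes + mu)
        amin_low = (maxp - 1) / (maxp + 1)      # lower bound on Amin
        amin = (amin_low + 1) / 2               # any value in (amin_low, 1)
        w_low = (2 / beta) * (log(1 + amin) - log(1 - amin)) \
            / (maxp * (amin - 1) + amin + 1)    # lower bound on W
        return amin, 1.01 * w_low               # take W just above the bound

    # e.g., for the program of example 1: o = (3, 2, 0), mu = (2, 2, 1)
    print(cilp_parameters([3, 2, 0], [2, 2, 1]))   # approx. (0.75, 7.86)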
Temporal Algorithm

For each time point t (1 ≤ t ≤ n) in P, for each agent k (1 ≤ k ≤ m) in P, do:

1. For each clause l in P containing ○Kk Li in the body:
a. Create an output neuron ○Kk Li in Nk,t (if it does not exist yet);
b. Create an output neuron Kk Li in Nk,t+1 (if it does not exist yet);
c. Define the thresholds of ○Kk Li and Kk Li as θ = (1 + Amin) · (1 − µl) · W/2;
d. Set h(x) as the activation function of output neurons ○Kk Li and Kk Li;
e. Add a hidden neuron L∨ to Nk,t, and set the step function as the activation function of L∨;
f. Connect Kk Li in Nk,t+1 to L∨ and set the connection weight to 1;
g. Set the threshold θ∨ of L∨ such that −mAmin < θ∨ < Amin − (m − 1);7
h. Connect L∨ to ○Kk Li in Nk,t and set the connection weight to WM such that WM > h⁻¹(Amin) + µl W + θ.

2. For each clause in P containing ○Kk Li in the head:
a. Create an output neuron ○Kk Li in Nk,t (if it does not exist yet);
b. Create an output neuron Kk Li in Nk,t+1 (if it does not exist yet);
c. Define the thresholds of ○Kk Li and Kk Li as θ = (1 + Amin) · (1 − µl) · W/2;
d. Set h(x) as the activation function of ○Kk Li and Kk Li;
e. Add a hidden neuron L○ to Nk,t+1, and set the step function as the activation function of L○;
f. Connect ○Kk Li in Nk,t to L○ and set the connection weight to 1;
g. Set the threshold θ○ of L○ such that −1 < θ○ < Amin;8
h. Connect L○ to Kk Li in Nk,t+1 and set the connection weight to WM such that WM > h⁻¹(Amin) + µl W + θ.

3. Call the modalities algorithm.

7 A maximum number of m agents could be making use of L∨.
8 A maximum number of 1 agent will be making use of L○.

Modalities Algorithm

1. Rename each literal MLj in P by a new literal not occurring in P of the form L□j if M = □, or L♦j if M = ♦;
2. Call the C-ILP translation algorithm;
3. For each output neuron L♦j in Nk,t, do:
a. Add a hidden neuron LMj to an arbitrary network N;
b. Set the step function as the activation function of LMj;
c. Connect L♦j in Nk,t to LMj and set the connection weight to 1;
d. Set the threshold θM of LMj such that −1 < θM < Amin;
e. Create an output neuron Lj in N, if it does not exist yet;
f. Connect LMj to Lj in N and set the connection weight to WM.
4. For each output neuron L□j in Nk,t, do:
a. Add a hidden neuron LMj to each Nk,t+1 such that Rk(t, t + 1);
b. Set the step function as the activation function of LMj;
c. Connect L□j in Nk,t to LMj and set the connection weight to 1;
d. Set the threshold θM of LMj such that −1 < θM < Amin;
e. Create output neurons Lj in Nk,t+1, if they do not exist yet;
f. Connect LMj to Lj in Nk,t+1 and set the connection weight to WM.
5. For each output neuron Lj in Nk,t, do:
a. Add a hidden neuron L∨j to Nk,t−1 such that Rk(t − 1, t);
b. Set the step function as the activation function of L∨j;
c. For each output neuron L♦j in Nk,t−1, do:
i. Connect Lj in Nk,t to L∨j and set the connection weight to 1;
ii. Set the threshold θ∨ of L∨j such that −mAmin < θ∨ < Amin − (m − 1);
iii. Create an output neuron L♦j in Nk,t−1 if it does not exist yet;
iv. Connect L∨j to L♦j in Nk,t−1 and set the connection weight to WM.
d. Add a hidden neuron L∧j to Nk,t−1 such that Rk(t − 1, t);
e. Set the step function as the activation function of L∧j;
f. For each output neuron L□j in Nk,t−1, do:
i. Connect Lj in Nk,t to L∧j and set the connection weight to 1;
ii. Set the threshold θ∧ of L∧j such that m − (1 + Amin) < θ∧ < mAmin;
iii. Create an output neuron L□j in Nk,t−1 if it does not exist yet;
iv. Connect L∧j to L□j in Nk,t−1 and set the connection weight to WM.

C-ILP Translation Algorithm

1. For each clause l of P of the form L1, . . . , Lo → Lo+1, do:9
a. Create input neurons L1, . . . , Lo and output neuron Lo+1 in N (if they do not exist yet);
b. Add a neuron Nl to the hidden layer of N;
c. Connect each neuron Li (1 ≤ i ≤ o) in the input layer to the neuron Nl in the hidden layer. If Li is a positive literal, then set the connection weight to W; otherwise, set the connection weight to −W;
d. Connect Nl in the hidden layer to neuron Lo+1 in the output layer, and set the connection weight to W;
e. Set h(x) as the activation function of Nl and Lo+1;
f. Define the threshold of Nl in the hidden layer as ((1 + Amin) · (ol − 1)/2)W;
g. Define the threshold of neuron Lo+1 in the output layer as ((1 + Amin) · (1 − µl)/2)W.

9 Here Li can be of the form Kj Li or ¬Kj Li.

Theorem 3 below shows that the network ensemble N obtained from the temporal algorithm is equivalent to the original CTLK program P in the sense that N computes the temporal immediate consequence operator TP of P (see definition 12). The theorem makes use of theorems 1 and 2, which follow.

Theorem 1 (correctness of C-ILP; d'Avila Garcez, Broda, et al., 2002). For each definite logic program P, there exists a feedforward neural network N with exactly one hidden layer and semilinear neurons such that N computes the fixed-point operator TP of P.

Theorem 2 (correctness of CML; d'Avila Garcez, Lamb, et al., 2002). For any extended modal program P, there exists an ensemble of feedforward neural networks N with a single hidden layer and semilinear neurons, such that N computes the modal fixed-point operator MTP of P.

Theorem 3 (correctness of CTLK). For any CTLK program P, there exists an ensemble of single hidden-layer neural networks N such that N computes the temporal fixed-point operator TP of P.

Proof. We need to show that Kk Li is active in Nt+1 if and only if either (1) there exists a clause of P of the form ML1, . . . , MLo → Kk Li such that ML1, . . . , MLo are satisfied by an interpretation (input vector), or (2) ○Kk Li is active in Nt. Case 1 follows from theorem 1. The proof of case 2 follows from theorem 2, as the algorithm for ○ is a special case of the algorithm for ♦ in which a more careful selection of world (i.e., t + 1) is made when applying the ♦-Elimination rule.

4 Case Study: The Muddy Children Puzzle

In this section, we apply the CTLK system to the muddy children puzzle, a classic example of reasoning in multi-agent environments. We also compare the CTLK solution with a previous (connectionist modal logic-based)
solution, which uses snapshots in time instead of a time flow. We start by stating the puzzle as described in Fagin et al. (1995). A number n of (truthful and intelligent) children are playing in a garden. A certain number of children k (k ≤ n) have mud on their faces. Each child can see if the others are muddy but not himself or herself. Now, consider the following situation: a caretaker announces that at least one child is muddy (k ≥ 1) and asks, Does any of you know if you have mud on your own face?10

To help understand the puzzle, let us consider the cases in which k = 1, k = 2, and k = 3. If k = 1 (only one child is muddy), the muddy child answers yes at the first instance since she cannot see any other muddy child. All the other children answer no at the first instance. If k = 2, suppose children 1 and 2 are muddy. In the first instance, all children can only answer no. This allows 1 to reason as follows: if 2 had said yes the first time, she would have been the only muddy child. Since 2 said no, she must be seeing someone else muddy; and since I cannot see anyone else muddy apart from 2, I myself must be muddy! Child 2 can reason analogously and also answers yes the second time. If k = 3, suppose children 1, 2, and 3 are muddy. Every child can only answer no the first two times. Again, this allows child 1 to reason as follows: if 2 or 3 had said yes the second time, they would have been the only two muddy children. Thus, there must be a third person with mud. Since I can see only 2 and 3 with mud, this third person must be me! Children 2 and 3 can reason analogously to conclude as well that yes, they are muddy.

The above cases clearly illustrate the need to distinguish between an agent's individual knowledge and common knowledge about the world in a particular situation. For example, when k = 2, after everybody says no in the first round, it becomes common knowledge that at least two children are muddy. Similarly, when k = 3, after everybody says no twice, it becomes common knowledge that at least three children are muddy, and so on. In other words, when it is common knowledge that there are at least k − 1 muddy children, after the announcement that nobody knows if they are muddy or not, then it becomes common knowledge that there are at least k muddy children, for if there were k − 1 muddy children, all of them would know that they had mud on their faces. Notice that this reasoning process can start only once it is common knowledge that at least one child is muddy, as announced by the caretaker.

4.1 Distributed Knowledge Representation. In this section, we formalize the muddy children puzzle using CTLK. For comparison, we start by reproducing the CML formalization presented in d'Avila Garcez et al. (2004) and d'Avila Garcez, Lamb, et al. (2002). Typically, the way to represent the knowledge of a particular agent is to express the idea that an agent knows a fact α if the agent considers that α is true at every world that the agent
10 Of course, if k > 1, they already know that there are muddy children among them.
sees as possible. In such a formalization, a Kj modality that represents the knowledge of an agent j is used analogously to a □ modality as defined in section 2.1. In addition, we use pi to denote that proposition p is true for agent i. For example, Kj pi means that agent j knows that p is true for agent i. We omit the subscript j of K whenever it is clear from the context. We use pi to say that child i is muddy and qk to say that at least k children are muddy (k ≤ n).

Let us consider the case in which three children are playing in the garden (n = 3). Rule r11 below states that when child 1 knows that at least one child is muddy and that neither child 2 nor child 3 is muddy, then child 1 knows that she herself is muddy. Similarly, rule r21 states that if child 1 knows that there are at least two muddy children and she knows that child 2 is not muddy, then she must also be able to know that she herself is muddy, and so on. The rules for children 2 and 3 are interpreted analogously.

Snapshot Rules for Agent (Child) 1
r11 : K1 q1 ∧ K1 ¬p2 ∧ K1 ¬p3 → K1 p1
r21 : K1 q2 ∧ K1 ¬p2 → K1 p1
r31 : K1 q2 ∧ K1 ¬p3 → K1 p1
r41 : K1 q3 → K1 p1

Each set of rules rml (1 ≤ l ≤ n, m ∈ N+) is implemented in a C-ILP network. Figure 5 shows the implementation of rules r11 to r41 (for agent 1). In addition, it contains p1 and Kq1, Kq2, and Kq3, all represented as facts. Note the difference between p1 (child 1 is muddy) and Kp1 (child 1 knows she is muddy). Facts are highlighted in gray in Figure 5. This setting complies with a presentation of the puzzle given in Fagin et al. (1995), in which snapshots of the knowledge evolution along time rounds are taken in order to logically deduce the solution of the problem without the addition of a time variable.

Figure 5: The implementation of rules {r11, . . . , r41}.

In contrast with p1 and Kqk (1 ≤ k ≤ 3), K¬p2 and K¬p3 must be obtained from agents 2 and 3, respectively, whenever agent 1 does not see mud on their foreheads. Figure 6 illustrates the interaction between three agents in the muddy children puzzle. The arrows connecting C-ILP networks implement the fact that when a child is muddy, the other children can see that. For clarity, only the rules rm1, corresponding to neuron K1 p1, are shown in Figure 5. Analogously, the rules for K2 p2 and K3 p3 would be represented in similar C-ILP networks. This is indicated in Figure 6 by neurons highlighted in black. In addition, Figure 6 shows only positive information about the problem. Recall that negative information such as ¬p1, K¬p1, K¬p2 is to be added explicitly to the network, as shown in Figure 5. This completes the connectionist representation of the snapshot solution to the muddy children puzzle.
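As a sanity check on the reasoning that the networks are meant to reproduce, the round-counting argument of the puzzle can be replayed in a few lines. This is a toy sketch of the epistemic argument, not of the networks:

    def answer_round(n, k):
        """Round in which the k muddy children (out of n) first answer yes."""
        common = 1                    # the caretaker announces q1
        for rnd in range(1, n + 1):
            if common > k - 1:        # a muddy child sees only k - 1 muddy faces
                return rnd
            common += 1               # after a unanimous no, the common-knowledge
                                      # bound on the number of muddy children rises
        return None

    assert [answer_round(3, k) for k in (1, 2, 3)] == [1, 2, 3]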
4.2 Temporal Knowledge Representation. The addition of a temporal variable to the muddy children puzzle allows one to reason about knowledge acquired after each time round. For example, assume as before that there are three muddy children playing in a garden. First, they all answer no when asked if they know whether they are muddy. Moreover, as each muddy child can see the other children, they will reason as previously described, and answer no the second time round, reaching the correct conclusion in time round 3. This solution requires, at each round, that the C-ILP networks be expanded with the knowledge acquired from reasoning about what is seen and what is heard by each agent. This clearly requires each agent to reason about time. The snapshot solution should then be seen as representing the knowledge held by the agents at an arbitrary time t.
Figure 6: Interaction among agents in the muddy children puzzle.
The knowledge held by the agents at time t + 1 would then be represented by another set of C-ILP networks, appropriately connected to the original set of networks. Let us consider again the case in which k = 3. There are alternative ways of modeling this, but one possible representation is as follows:

Temporal Rules for Agent (Child) 1
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q3

Temporal Rules for Agent (Child) 2
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K2 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K2 q3

Temporal Rules for Agent (Child) 3
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K3 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K3 q3

In addition, the snapshot rules are still necessary here to assist each agent's reasoning at any particular time point. Finally, the interaction between the agents, as depicted in Figure 6, is also necessary to model the fact that each child will know that another child is muddy when they see each other, analogously to the □ modality. This can be represented as ti : p1 → K2 p1, ti : p1 → K3 p1 for time i = 1, 2, 3, and analogously for p2 and p3. Together with ti : ¬p2 → K1 ¬p2, ti : ¬p3 → K1 ¬p3, also for time i = 1, 2, 3 and analogously for K2 and K3, this completes the formalization.

The rules above, the temporal rules, and the snapshot rules for Agent (Child) 1 are described, following the temporal algorithm, in Figure 7, where dotted lines indicate negative weights and solid lines indicate positive weights. The network of Figure 7 provides a complete solution to the muddy children puzzle. It is worth noting that each network remains a simple single hidden-layer neural network that can be trained with the use of standard backpropagation or another off-the-shelf learning algorithm.

Figure 7: An agent's knowledge evolution in time in the muddy children puzzle.

4.3 Learning. The merging of theory (background knowledge) and data learning (learning from examples) in neural networks has been shown to provide a learning system that is more effective than purely symbolic or purely connectionist systems, especially when data are noisy (Towell & Shavlik, 1994). The temporal algorithm introduced here allows one to perform theory and data learning in neural networks when the theory includes temporal knowledge.
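To see the effect of the ○ connections numerically, the following sketch traces one such link with illustrative values (ours, not taken from the article), assuming a 0/1 step activation for the hidden neuron L○: once ○K1q2 is derived at t1, K1q2 is forced above Amin at t2.

    import math

    def h(x, beta=1.0):
        """Bipolar semilinear activation h(x) = 2/(1 + e^(-beta*x)) - 1."""
        return 2.0 / (1.0 + math.exp(-beta * x)) - 1.0

    A_MIN, MU_L, W = 0.6, 1, 4.0                # illustrative values only
    THETA = (1 + A_MIN) * (1 - MU_L) * W / 2    # output threshold (0 here)
    W_M = 6.0                                   # > h^-1(A_MIN) + MU_L*W + THETA = 5.39

    ok_q2_t1 = 0.9                              # activation of ○K1q2 in N1,t1 (> A_MIN)
    l_circ = 1 if ok_q2_t1 > 0.0 else 0         # hidden L○, threshold in (-1, A_MIN)
    kq2_t2 = h(W_M * l_circ - THETA)            # output K1q2 in N1,t2
    print(round(kq2_t2, 3), kq2_t2 > A_MIN)     # 0.995 True: q2 is known at t2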
In this section, we use the temporal algorithm introduced above and standard backpropagation to compare learning from data only and learning from theory and data with temporal background knowledge. Since we show a relationship between temporal and epistemic logics and artificial neural network ensembles, we should also be able to learn epistemic and temporal knowledge in the ensemble (and, indeed, to perform knowledge extraction of revised temporal and epistemic rules after learning, but this is left as future work).

We train two ensembles of C-ILP neural networks to compute a solution to the muddy children puzzle. To one of them, we add temporal and epistemic background knowledge in the form of a single rule, t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q2, by applying the temporal algorithm. To the other, we do not add any rule. We then compare the average accuracies of the two ensembles. We consider, in particular, the case in which Agent 1 is to decide whether he is muddy at time t2. Each training example expresses the knowledge held by Agent 1 at t2, according to the truth values of atoms K1 ¬p2, K1 ¬p3, K1 q1, K1 q2, and K1 q3. As a result, we have 32 examples containing all possible combinations of truth values for input neurons K1 ¬p2, K1 ¬p3, K1 q1, K1 q2, and K1 q3, where input value 1 indicates truth value true, while input −1 indicates truth value false. For each example, we are concerned with whether Agent 1 will know that he is muddy, that is, whether output neuron K1 p1 is active. For example, if the inputs are all false (input vector [−1, −1, −1, −1, −1]), then Agent 1 should not know whether he is muddy (K1 p1 is false). If, however, K1 q2 is true and either K1 ¬p2 or K1 ¬p3 is true, then Agent 1 should be able to recognize that indeed he is muddy (K1 p1 is true). This allows us to create the 32 training examples.

From the description of the muddy children puzzle, we know that at t2, K1 q2 should be true (i.e., K1 q2 is a fact). This information can be derived from the temporal rule given as background knowledge above, but not from the training examples. Although the background knowledge can be changed by the training examples, it places a bias on certain combinations (in this case, the examples in which K1 q2 is true), and this may produce better performance, typically when the background knowledge is correct. This effect has been observed, for instance, in Towell and Shavlik (1994), in experiments on DNA sequence analysis, in which background knowledge is expressed by production rules. The set of examples is noisy, and background knowledge counteracts the noise and reduces the chances of overfitting.

We have evaluated the two C-ILP ensembles using eight-fold cross-validation, so that each time, four examples were left for testing. We have used a learning rate η = 0.2, a term of momentum α = 0.1, activation function h(x) = tanh(x), and bipolar inputs in {−1, 1}. For each training task, the training set was presented to the network for 10,000 epochs. For both ensembles, the networks reached a training set error smaller than 0.01
before 10,000 epochs had elapsed. In other words, all the networks have been trained successfully. As for the networks' test set performance, the results corroborate the importance of exploiting any available background knowledge. For the first ensemble, in which the networks were trained with no background knowledge, an average test set accuracy of 81.25% was obtained. For the second ensemble, to which the temporal rule has been added, an average test set accuracy of 87.5% was obtained: a noticeable difference in performance, considering there is a single rule in the background knowledge. In both cases, exactly the same training parameters have been used.

The experiments above illustrate that the merging of temporal background knowledge and data learning may provide a system that is more effective than a purely connectionist system. The focus of this article has been on the theory of neural-symbolic systems, their expressiveness, and their reasoning capabilities. More extensive experiments to validate the system proposed here would be useful and will be carried out in connection with knowledge extraction and using applications containing continuous attributes.
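For reference, the 32 training examples can be generated mechanically from the snapshot rules r11 to r41. The sketch below is a plausible reconstruction of that data set (ours, not the authors' files); the input order and the bipolar encoding follow the description above:

    from itertools import product

    # inputs, in order: K1(not p2), K1(not p3), K1q1, K1q2, K1q3
    def k1p1(np2, np3, q1, q2, q3):
        """Target K1p1 according to the snapshot rules r11 to r41."""
        return ((q1 and np2 and np3)      # r11
                or (q2 and np2)           # r21
                or (q2 and np3)           # r31
                or q3)                    # r41

    examples = [([1 if b else -1 for b in bits], 1 if k1p1(*bits) else -1)
                for bits in product([False, True], repeat=5)]
    assert len(examples) == 32
    print(examples[0])   # ([-1, -1, -1, -1, -1], -1): all false, so K1p1 is false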
5 Conclusions and Future Work

Connectionist temporal and modal logics provide neural-symbolic learning systems with the ability to make use of more expressive representation languages. In his seminal paper, Valiant (1984) argues for rich logic-based knowledge representation mechanisms in learning systems. The connectionist model proposed here addresses such a need, while complying with important principles of connectionism such as massive parallelism and learning. A very important feature of our system is the temporal dimension, which can be combined with an epistemic dimension for knowledge, beliefs, desires, and intentions. This article provides the first account of how to integrate such dimensions in a neural-symbolic system. We have illustrated this by providing a full solution to the muddy children puzzle, where agents can reason about knowledge evolution in time.

Although a number of multimodal systems have been proposed for distributed knowledge representation (e.g., combining knowledge and time (Gabbay et al., 2003; Halpern et al., 2004) and combining beliefs, desires, and intentions (Rao & Georgeff, 1998)), little attention has been paid to the integration of a learning component for knowledge acquisition. This work contributes to bridging this gap by allowing the knowledge representation to be integrated in a neural learning system.

One could also think of the system presented here as a massively distributed system where each ensemble (or set of ensembles) can be seen as a neural-symbolic processor. This would open several interesting research avenues. For instance, one could investigate how to reason about protocols and actions in this
neural-symbolic distributed system or how to train the processors in order to learn how to preserve the security of such systems. The connectionist temporal and knowledge logic presented here allows the representation of a variety of properties, such as knowledge-based specifications, in the style of Fagin et al. (1995). These specifications are frequently represented using temporal and modal logics, but without a learning feature, which comes naturally in CTLK.

As future work, we aim to investigate knowledge extraction of modalities from trained artificial neural network ensembles in which not only discrete data are used, as in the muddy children example, but also continuous attributes are relevant. In addition, since the models of the modal logic S4 can be used to model intuitionistic modal logics, we may also have a system that can combine reasoning about time and learn intuitionistic theories. This is an interesting result, as neural-symbolic systems can be used to "think" constructively, in the sense of Brouwer (Gabbay et al., 2003).

In summary, we believe that the connectionist temporal and epistemic computational model presented here opens several interesting research avenues in the domain of neural-symbolic integration, allowing the distributed representation, computation, and learning of expressive knowledge representation formalisms.

Acknowledgments

We thank Gary Cottrell and two anonymous referees for several constructive comments that led to the improvement of this article. A.G. is partly supported by the Nuffield Foundation, UK. L.L. is partly supported by the Brazilian Research Council CNPq and by the CAPES and FAPERGS Foundations.

References

Ajjanagadde, V. (1997). Rule-based reasoning in connectionist networks. Unpublished doctoral dissertation, University of Minnesota.
Broda, K., Gabbay, D. M., Lamb, L. C., & Russo, A. (2004). Compiled labelled deductive systems: A uniform presentation of non-classical logics. Baldock, UK: Research Studies Press/Institute of Physics Publishing.
Browne, A., & Sun, R. (2001). Connectionist inference models. Neural Networks, 14, 1331–1355.
Brzoska, C. (1991). Temporal logic programming and its relation to constraint logic programming. In Proc. International Symposium on Logic Programming (pp. 661–677). Cambridge, MA: MIT Press.
Chagrov, A., & Zakharyaschev, M. (1997). Modal logic. Oxford: Clarendon Press.
Cloete, I., & Zurada, J. M. (Eds.). (2000). Knowledge-based neurocomputing. Cambridge, MA: MIT Press.
A Connectionist Model for Temporal Reasoning
1737
d’Avila Garcez, A. S., Broda, K., & Gabbay, D. M. (2001). Symbolic knowledge extraction from trained neural networks: A sound approach. Artificial Intelligence, 125, 155–207. d’Avila Garcez, A. S., Broda, K., & Gabbay, D. M. (2002). Neural-symbolic learning systems: Foundations and applications. Berlin: Springer-Verlag. d’Avila Garcez, A. S., & Lamb, L. C. (2004). Reasoning about time and knowledge ¨ in neural-symbolic learning systems. In S. Thrun, L. Saul, & B. Scholkopf (Eds.), Advances in neural information processing systems, 16 (pp. 921–928). Cambridge, MA: MIT Press. d’Avila Garcez, A. S., Lamb, L. C., Broda, K., & Gabbay, D. M. (2003). Distributed knowledge representation in neural-symbolic learning systems: A case study. In Proceedings of AAAI International FLAIRS Conference (pp. 271–275). Menlo Park, CA: AAAI Press. d’Avila Garcez, A. S., Lamb, L. C., Broda, K., & Gabbay, D. M. (2004). Applying connectionist modal logics to distributed knowledge representation problems. International Journal on Artificial Intelligence Tools, 13(1), 115– 139. d’Avila Garcez, A. S., Lamb, L. C., & Gabbay, D. M. (2002). A connectionist inductive learning system for modal logic programming. In Proc. ICONIP’02 (pp. 1992– 1997). Singapore: IEEE Press. d’Avila Garcez, A. S., & Zaverucha, G. (1999). The connectionist inductive learning and logic programming system. Applied Intelligence Journal [Special issue] 11(1), 59–77. Fagin, R., Halpern, J., Moses, Y., & Vardi, M. (1995). Reasoning about knowledge. Cambridge, MA: MIT Press. Farinas del Cerro, L., & Herzig, A. (1995). Modal deduction with applications in epistemic and temporal logics. In D. M. Gabbay, C. J. Hogger, & J. A. Robinson (Eds.), Handbook of logic in artificial intelligence and logic programming (Vol. 4, pp. 499–594). New York: Oxford University Press. Gabbay, D. M. (1996). Labelled deductive systems (Vol 1). New York: Oxford University Press. Gabbay, D. M., Hodkinson, I., & Reynolds, M. (1994). Temporal logic: Mathematical foundations and computational aspects (Vol. 1). New York: Oxford University Press. Gabbay, D., Kurucz, A., Wolter, F., & Zakharyaschev, M. (2003). Many-dimensional modal logics: Theory and applications. Dordrecht, Elsevier. Gelfond, M., & Lifschitz, V. (1991). Classical negation in logic programs and disjunctive databases. New Generation Computing, 9, 365–385. Halpern, J. Y., van der Meyden, R., & Vardi, M. Y. (2004). Complete axiomatizations for reasoning about knowledge and time. SIAM Journal on Computing, 33(3), 674– 703. Hintikka, J. (1962). Knowledge and belief. Ithaca, NY: Cornell University Press. Orgun, M. A., & Ma, W. (1994). An overview of temporal and modal logic programming. In Proceedings of International Conference on Temporal Logic, ICTL’94 (pp. 445–479). Berlin: Springer. Orgun, M. A., & Wadge, W. W. (1992). Towards a unified theory of intensional logic programming. Journal of Logic Programming, 13(4), 413–440.
1738
A. d’Avila Garcez and L. Lamb
Orgun, M. A., & Wadge, W. W. (1994). Extending temporal logic programming with choice predicates non-determinism. Journal of Logic and Computation, 4(6), 877– 903. Pnueli, A. (1977). The temporal logic of programs. In Proceedings of 18th IEEE Annual Symposium on Foundations of Computer Science (pp. 46–57). Piscataway, NJ: IEEE Computer Society Press. Rao, A. S., & Georgeff, M. P. (1998). Decision procedures for BDI logics. Journal of Logic and Computation, 8(3), 293–343. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Shastri, L. (1999). Advances in SHRUTI: A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence Journal [Special issue] 11(1), 79–108. Sun, R. (1995). Robust reasoning: Integrating rule-based and similarity-based reasoning. Artificial Intelligence, 75(2), 241–296. Sun, R., & Alexandre, F. (1997). Connectionist symbolic integration. Hillsdale, NJ: Erlbaum. Towell, G. G., & Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70(1), 119–165. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. Valiant, L. G. (2003). Three problems in computer science. Journal of the ACM, 50(1), 96–99. Vardi, M. Y. (1997). Why is modal logic so robustly decidable. In N. Immerman & P. Kolaitis (Eds.), Descriptive complexity and finite models (pp. 149–184). Providence, RI: American Mathematical Society.
Received August 23, 2004; accepted October 5, 2005.
ARTICLE
Communicated by Liam Paninski
Multivariate Information Bottleneck Noam Slonim
[email protected] Department of Physics and the Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, U.S.A.
Nir Friedman
[email protected] School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
Naftali Tishby
[email protected] School of Computer Science and Engineering, and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
The information bottleneck (IB) method is an unsupervised model independent data organization technique. Given a joint distribution, p(X, Y), this method constructs a new variable, T, that extracts partitions, or clusters, over the values of X that are informative about Y. Algorithms that are motivated by the IB method have already been applied to text classification, gene expression, neural code, and spectral analysis. Here, we introduce a general principled framework for multivariate extensions of the IB method. This allows us to consider multiple systems of data partitions that are interrelated. Our approach utilizes Bayesian networks for specifying the systems of clusters and which information terms should be maintained. We show that this construction provides insights about bottleneck variations and enables us to characterize the solutions of these variations. We also present four different algorithmic approaches that allow us to construct solutions in practice and apply them to several real-world problems. 1 Introduction The volume of available data in a variety of domains has grown rapidly over the past few years. Examples include the consistent growth in the amount of Preliminary and partial versions of this work appeared in Proc. of the 17th conf. on Uncertainty in artificial Intelligence (UAI-17), 2001, and in Proc. of Neural Information Processing Systems (NIPS-14), 2002. Neural Computation 18, 1739–1789 (2006)
C 2006 Massachusetts Institute of Technology
1740
N. Slonim, N. Friedman, and N. Tishby
online text and the dramatic increase in the available genomic information. As a result, there is a crucial need for data analysis methods. A major goal in this context is the development of unsupervised data representation methods to reveal the inherent hidden structure in a given body of data. Such methods include various dimension-reduction, geometric embedding, and statistical modeling approaches. Arguably the most fundamental class of such methods are clustering techniques. At first, the clustering problem seems intuitively clear: similar elements should be assigned to the same cluster and dissimilar ones to different clusters. However, formalizing this notion in a well-defined way is not obvious. Nonetheless, such a formulation is essential in order to obtain an objective interpretation of the results. Indeed, while numerous clustering methods exist, many of them are driven by an algorithm rather than by a clear optimization principle, making their results hard to interpret. Clustering methods can be roughly divided into two categories. The first is based on a choice of some distance or distortion measure among the data points, which presumably reflects some background knowledge about the data. Such a measure implicitly represents the important attributes of the data and is as important as the choice of the data representation itself. For most problems, a proper choice of the distance measure can be the main practical difficulty and through which much of the arbitrariness of the results can enter. Another class of methods is based on statistical assumptions on the origin of the data. Such assumptions enable the design of a statistical model, such as a mixture of gaussians, where the model parameters are then estimated based on the given data. Roughly speaking, these approaches represent different frameworks for thinking about the data. For instance, in the statistical approach, one usually thinks of the given data as a sample taken from an underlying distribution, whereas in the distance-based approach, such an assumption is typically not required. Distance-based methods can be further divided into central and pairwise clustering methods. In the latter, one uses the direct distances between the points to partition the data, using various (often graph-theoretical) algorithms. In central clustering, one is provided with a distance function that can be applied to new points and generate cluster centroids by minimizing the expected distance to the data points. Central clustering is more closely related to the statistical modeling, whereas the results of the pairwise methods are often more difficult to interpret. Quite separated from this hierarchy of clustering techniques, the information bottleneck (IB) method (Tishby, Pereira, & Bialek, 1999; Slonim, 2002) stems from a rather different—information theoretic—perspective. The basic idea is surprisingly simple. While choosing the distance measure is notoriously difficult, often one can naturally specify another relevant variable that should be predicted by the obtained clusters. This specification determines which relevant underlying structure we desire. An illustrative example is the problem of speech recognition. While it is not obvious what
Multivariate Information Bottleneck
1741
the proper distance measure is among speech utterances, in many applications one would agree that the corresponding text is the relevant variable here. Moreover, this scheme allows different valid ways to cluster the same data. For example, one could also define the relevant variable as the identity of the speaker rather than the text and obtain an entirely different, yet equally valid, quantization of the signal. A common data type that calls for this type of analysis is co-occurrence data, such as verbs and direct objects in sentences (Pereira, Tishby, & Lee, 1993), words and documents (Baker & McCallum, 1998; Hofmann, 1999; Slonim & Tishby, 2000), tissues and gene expression patterns (Tishby & Slonim, 2000), or galaxies and spectral components (Slonim, Somerville, Tishby, & Lahav, 2001). In most such cases, there is no obvious “correct” measure of similarity between the data items. Thus, one would like to rely purely on the joint statistics of the co-occurrences and organize the data such that the relevant information among the variables is captured in the best possible way. 1.1 The Contributions of This Work. The main contribution of this article is in providing a general formulation for a multivariate extension of the IB method. This extension allows us to consider cases where the clustering is relevant with respect to several variables and multiple systems of clusters are constructed simultaneously. For example, in symmetric, or two-sided, clustering, we want to find two systems of clusters that are informative about each other. A possible application is relating documents to words, where we seek clustering of documents according to word usage and a corresponding clustering of words (Slonim & Tishby, 2000). In parallel clustering we attempt to build several systems of clusters of one variable in order to capture independent aspects of the information it conveys about another, relevant, variable. A possible example is the analysis of gene expression data, where multiple independent distinctions about tissues are relevant for the expression of genes. Furthermore, it is possible to think of more complicated scenarios, where there are more than two observed variables. For example, given many input variables that represent a visual stimulus, we might want to recover a smaller set of features that are most informative about an output variable that here represents the firing pattern of a neuron. A most general formulation should consider the compression of different subsets of the observed variables, while maximizing the information about other predefined subsets. The multivariate IB principle, as suggested in this work, provides such a principled general formulation. To address this type of problem within the IB framework, we use the concept of multi-information, a natural extension of the pairwise concept of mutual information. Our approach further utilizes the theory of probabilistic graphical models such as Bayesian networks for specifying the trade-off terms: which variables to compress and which information terms
1742
N. Slonim, N. Friedman, and N. Tishby
should be maintained. In particular, we use one Bayesian network, denoted as G in , to specify a set of variables that are compressed versions of the observed variables: each new variable compresses its parents in the network. A second network, G out , specifies the relations, or dependencies, that should be maintained or predicted: each variable should be predicted by its parents in the network. We formulate the general principle as a trade-off between the multi-information each network carries, where we want to minimize the information maintained by G in and at the same time maximize the information maintained by G out . We show that as with the original IB principle, it is possible to analytically characterize the form of the optimal solutions to the general multivariate trade-off principle. The original IB principle yielded practical algorithms that could be applied to a variety of real-world problems. Here, we show that all these algorithms are naturally extended to the new conceptual framework and illustrate their applicability on several real-world data: text processing applications, gene expression data analysis, and protein sequence analysis. 2 The Information Bottleneck Method In the original IB principle (Tishby et al., 1999), the relevance of one variable, X, with respect to another one, Y, is quantified in terms of the mutual information (Cover and Thomas, 1991), I (X; Y) =
x,y
p(x, y) log
p(x, y) . p(x) p(y)
This functional is symmetric and nonnegative, and it equals zero if and only if the variables are independent. It measures how many bits are needed on average to convey the information X has about Y and vice versa. The IB method is thus based on the availability of two variables, with their (assumed given) joint distribution, p(X, Y), where X is the variable we want to compress and Y is the variable we would like to predict. Using a correspondence between this task and the core problems of Shannon’s (1948) information theory, Tishby et al. (1999) formulated it as a trade-off between two types of information terms. The idea is to seek partitions of X values that preserve as much mutual information as possible about Y while losing as much information as possible about the original representation, X. Thus, among all the distinctions made by X, one tries to maintain only those that are most relevant to predict Y. Finding optimal representations is posed as a construction of an auxiliary variable, T, that represents soft partitions of X values through q (T | X), such that I (T; X) is minimized while I (T; Y) is maximized. Notice that throughout this article, we denote by q (·) the unknown distributions that involve the representation parameters, T, and by p(·) the distributions that are given as input and do not change during the analysis.
Multivariate Information Bottleneck
1743
Since T is a compressed representation of X, its distribution should be completely determined given X alone; that is, q (T | X, Y) = q (T | X), or q (X, Y, T) = p(X, Y)q (T | X).
(2.1)
An equivalent formulation is to require the following Markovian independence relation, known as the IB Markovian relation: T ↔ X ↔ Y.
(2.2)
Tishby et al. (1999) formulated this optimization problem as the minimization of the following IB functional, L[q (T | X)] = I (T; X) − β I (T; Y),
(2.3)
where β is a positive Lagrange multiplier and the minimization takes place over all the normalized q (T | X) distributions. The distributions q (T) and q (Y | T) that are further involved in this functional must satisfy the probabilistic consistency conditions,
q (x, t) = x p(x)q (t | x) q (y | t) = q 1(t) x q (x, y, t) = q 1(t) x p(x, y)q (t | x),
q (t) =
x
(2.4)
where the IB Markovian relation is used in the last equality. As shown in Tishby et al. (1999), every stationary point of the IB functional must satisfy q (t | x) =
q (t) exp (−β DK L [ p(y | x)||q (y | t)]) , Z(x, β)
(2.5)
where DK L [ p||q ] = E p [log qp ] is the familiar Kullback-Leibler (KL) divergence (Cover & Thomas, 1991) and Z(x, β) is a normalization (partition) function. The three equations in equations 2.4 and 2.5 must be satisfied selfconsistently at all the stationary points of L. The form of the optimal solution in equation 2.5 is reminiscent of the general form of the rate distortion optimal solution (Cover & Thomas, 1991). However, the effective IB distortion, d I B (X, T) = DK L [ p(Y | X)||q (Y | T)], emerges here from the trade-off principle rather than being chosen arbitrarily. The self-consistent condition in equation 2.5, together with the marginalization constraints in equation 2.4, can be turned into an iterative algorithm, similar to expectation maximization (EM) and the Blahut-Arimoto algorithm in information theory (Cover & Thomas, 1991). In particular, finding {q (T | X), q (T), q (Y | T)} that correspond to a (local) stationary point of L,
1744
N. Slonim, N. Friedman, and N. Tishby
involves the following iterations: q (m) (t) =
p(x)q (m) (t | x)
(2.6)
x
q (m) (y | t) = q (m+1) (t | x) =
1 q (m) (t)
p(x, y)q (m) (t | x)
x
(m)
q (t) (m) e −β DK L [ p(y|x)||q (y|t)] , Z(m+1) (x, β)
whose general convergence was proved in Tishby et al. (1999). 3 Multivariate Extensions of the IB Method The original formulation of the IB principle concentrated on the tradeoff between the compression of one variable, X, and the information this compression maintains about another, relevant variable, Y. We now describe a more general formulation for a multivariate extension of the IB trade-off principle. The motivation is to provide a similar framework for higherdimensional data and with different probabilistic dependencies, as found in ample examples of real-world problems. This extension combines the theory of probabilistic graphical models, such as Bayesian networks, and a multivariate extension of the mutual information concept, known as multiinformation. 3.1 Bayesian Networks and Multi-Information. A Bayesian network over a set of random variables X ≡ {X1 , . . . , Xn } is a directed acyclic graph (DAG), G, in which vertices are annotated by names of random variables. For each variable Xi , we denote by PaGXi the set of parents of Xi in G. We say that a distribution p(X) is consistent with G if it can be factored in the form p(X1 , . . . , Xn ) =
p(Xi | PaGXi ),
i
and use the notation p |= G to denote that. The information that the random variables {X1 , . . . , Xn } ∼ p(X1 , . . . , Xn ) share about each other is given by (see, e.g., Studeny & Vejnarova, 1998), I[ p(X)] = I(X) = DK L [ p(X1 , . . . , Xn )|| p(X1 ) · · · p(Xn )] p(X1 , . . . , Xn ) = E p(X1 ,...,Xn ) log . p(X1 ) · · · p(Xn )
Multivariate Information Bottleneck
1745
This multi-information measures the average number of bits that can be gained by a joint compression of all the variables versus independent compression. Like the mutual information, it is nonnegative and equals zero if and only if all the variables are independent. When the variables have known independence relations, the multiinformation can be simplified as follows. Proposition 1. Let X = {X1 , . . . , Xn } ∼ p(X ), and let G be a Bayesian network structure over X such that p |= G. Then I[ p(X )] = I(X ) =
I (Xi ; Pa GXi ).
i
That is, the multi-information is the sum of “local” mutual information terms between each variable and its parents. Notice that even if p(X) is not consistent with G, this sum is well defined, but it may capture only part of the real multi-information. Hence, we introduce the following definition. Definition 1. Let X = {X1 , . . . , Xn } ∼ p(X ), and let G be a Bayesian network structure over X . The multi-information in p(X ) with respect to G is defined as: I G [ p(X )] =
I (Xi ; Pa GXi ),
(3.1)
i
where each of the local mutual information terms is calculated using the marginal distributions of p(X ). If p |= G, then I[ p(X)] = I G [ p(X)], but in general, I[ p(X)] ≥ I G [ p(X)] (see below). In this case, we often want to know how close p is to the distribution class that is consistent with G. We define this notion through the M-projection of p on that class: DK L [ p||G] = min DK L [ p||q ]. q |=G
(3.2)
The following proposition specifies the form of the distribution q for which the minimum is attained. Proposition 2.
Let p(X ) be a distribution, and let G be a DAG over X . Then,
DK L [ p||G] = DK L [ p||q ∗ ],
(3.3)
where q ∗ is given by n q ∗ (X ) = i= p(Xi | Pa GXi ). 1
(3.4)
1746
N. Slonim, N. Friedman, and N. Tishby
Thus, q ∗ is equivalent to the factorization of p using the conditional independencies implied by G. This proposition extends the Csisz´ar-Tusn´ady (1984) lemma that refers to the special case of n = 2 (see also Cover & Thomas, 1991, p. 365). Next, we provide two possible interpretations of the M-projection distance, DK L [ p||G], in terms of the structure of G. Proposition 3. Let X = {X1 , . . . , Xn } ∼ p(X ), and let G be a Bayesian network structure over X . Assume that the order X1 , . . . , Xn is consistent with the DAG G (i.e., Pa GXi ⊆ {X1 , . . . , Xi−1 }). Then DK L [ p||G] =
I (Xi ; {X1 , . . . , Xi−1 } \ Pa GXi | Pa GXi )
i
= I[ p(X )] − I G [ p(X )]. As an immediate corollary, we have I[ p(X)] ≥ I G [ p(X)], as mentioned earlier. Thus, DK L [ p||G] can be expressed as a sum of local conditional mutual information terms, where each term corresponds to a possible violation of a Markov independence assumption with respect to G. If every Xi is independent of {X1 , . . . , Xi−1 } \ PaGXi } given {PaGXi }, as implied by G, then DK L [ p||G] becomes zero. As these conditional independence assumptions are more extremely violated in p, the corresponding DK L [ p||G] will increase. Recall that the Markov independence assumptions with respect to a given order are necessary and sufficient to require the factored form of distributions consistent with G (Pearl, 1988). Therefore, we see that DK L [ p||G] = 0 if and only if p is consistent with G. An alternative interpretation of this measure is given in terms of multiinformation terms. Specifically, we see that DK L [ p||G] can be written as the difference between the real multi-information, I[ p(X)] = I(X), and the multi-information when p is forced to be consistent with G, I G [ p(X)], which in particular cannot be larger. Hence, we can think of DK L [ p||G] as the residual information between the variables that is not captured by the dependencies implied by G. 3.2 Multi-Information Bottleneck Principle. Let us consider a set of random variables, X = {X1 , . . . , Xn }, distributed according to p(X). We assume that p(X) is known and forms the input to our analysis, where finite sample issues are discussed in section 7. Given p(X), we specify a set of partition variables T = {T1 , . . . , Tk }, such that for each subset of X that we would like to compress, we specify a corresponding subset of the compression variables T. Recall that in the original IB, the IB Markovian relation, T ↔ X ↔ Y, defines a solution space with all the distributions over {X, Y, T} with q (X, Y, T) = p(X, Y)q (T | X). Analogously, we define the solution space in our case through a set of IB
Multivariate Information Bottleneck
1747
Markovian independence relations that imply that each compression variable, Tj ∈ T, is completely determined given the variables it represents. This is achieved by introducing a DAG, G in , over X ∪ T where the variables in T are leaves. G in is defined such that p(X) is consistent with its structure restricted to X. The edges from X to T define what compresses what, and the independencies implied by G in correspond to the required set of IB Markovian independence relations. In particular, since we require that every Tj is a leaf, every Tj is independent of all the other variables, given the variables in 1 it compresses, PaG Tj . The multivariate IB solution space thus consists of all the distributions q (X, T) |= G in , for which q (X, T) = p(X)
k
in q (Tj | PaG Tj ),
(3.5)
j=1
where the free parameters correspond to the stochastic mappings q (Tj | in PaG Tj ), and the other unknown q (·) distributions are determined by marginalization over q (X, T) using the Markovian structure of G in . Analogous to the original IB formulation, the information that we would like to minimize is now given by I G in , where I G in = I(X, T) since q (X, T) |= G in , that is, this is the real multi-information in q (X, T). Minimizing this quantity attempts to make the T variables as independent of the X variables as possible. Once G in is defined, we need to specify the relevant information that we wish to preserve. We do that by specifying another DAG, G out , which determines what predicts what. Specifically, for every Tj , we define the variables it should preserve information about as its children in G out . Thus, using definition 1, we may think of I G out as a measure of how much information the variables in T maintain about their target variables. This suggests that we should maximize I G out . The multivariate IB functional can now be written as L(1) [q (X, T)] = I G in [q (X, T)] − βI G out [q (X, T)],
(3.6)
where the minimization is done subject to the normalization constraints on the partition distributions, and β is the positive Lagrange multiplier controlling the trade-off.2 Notice that this functional is a direct generalization of the 1 It is possible to apply a similar derivation where the T variables are not required to be leaves in G in . This might be useful in various situations where, for example, Tj is used to design a better code for Tj . A relevant example is the relay channel (Cover & Thomas, 1991, Chap. 14). 2 Since I G out typically consists of several mutual information terms, in principle it is possible to define a separate Lagrange multiplier for each of these terms. This might be
1748
N. Slonim, N. Friedman, and N. Tishby
(a)
(b)
Figure 1: The source (left) and target networks for the original IB principle. The target network for the multivariate IB principle is presented in the middle panel. The target network for the structural principle is described in the right panel.
original IB functional. Again, we are interested in the competition between the complexity of the representation, measured through the compression multi-information, I G in , and the accuracy of the relevant predictions provided by this representation, as quantified by the multi-information I G out . For β → 0, the focus is on the compression term, I G in , yielding a trivial solution in which every Tj consists effectively of one value to which all the in values of PaG Tj are mapped, where all the distinctions among these values, relevant or not, are lost. If β → ∞, we concentrate on maintaining the relevant information terms as high as possible. This yields a trivial solution in at the opposite extreme, where every Tj is a one-to-one map of PaG Tj with no loss of relevant information. The interesting cases are, of course, in between, where β takes positive finite values. Example 1. Consider the application of the multivariate principle with G in (a ) (a ) and G out of Figure 1. G in specifies that T compresses X and G out specifies that G in T should preserve information about Y. For these choices, I = I (T; X) + I (X; Y) and I G out = I (T; Y). The resulting functional is L(1) = I (X; Y) + I (T; X) − β I (T; Y), where I (X; Y) is constant and can be ignored. Thus, we end up with a functional equivalent to that of the original IB functional.
3.3 A Structural Variational Principle. We now describe an alternative and closely related variational principle, which provides more insight useful if, for example, the preservation of one information term is of greater importance than the preservation of the others.
Multivariate Information Bottleneck
1749
into the relationship of IB to generative models with maximum-likelihood estimation. As before, we trade between two complementary goals. On the one hand, we want to compress the observed variables, that is, to minimize I G in . On the other hand, instead of maximizing I G out , we now utilize the compression variables to drive q (X, T) toward a desired structure, G out , that represents which dependencies and independencies we would like to impose. Let us consider again the two-variable case shown in Figure 1, where G in specifies that T is a compressed version of X. Ideally, T should preserve all the information about Y. This is equivalent to the situation where T separates X from Y (Cover & Thomas, 1991, Chap. 2), that is, X ↔ T ↔ Y, (b) as specified by G out of Figure 1. Thus, we wish to construct q (T | X) such that the specified independencies in G out are satisfied as much as possible. Notice that in general, G in and G out are incompatible. Except for trivial cases, we cannot achieve both sets of independencies simultaneously (Pereira et al. 1993; Slonim & Weiss, 2002). Instead, we aim to come as close as possible to achieving this by a trade-off between the two. We formalize this by requiring that q (X, T) be closely approximated by its closest distribution among all distributions consistent with G out . As previously discussed, a natural information theoretic measure of this discrepancy is DK L [q ||G]. Thus, the functional that we want to minimize is L(2) [q (X, T)] = I G in [q (X, T)] + γ DK L [q (X, T)||G out ],
(3.7)
where γ is, again, a positive Lagrange multiplier. We will refer to this functional as the structural multivariate IB functional. The range of γ is between 0, in which case we have the trivial—maximally compressed—solution, and ∞, in which we strive to make q as consistent as possible with G out . Notice that the γ → ∞ limit is different in nature from the β → ∞ limit, since forcing the dependency structure is different from preserving all the information, as we see next. (b)
Example 2. Consider again the example of Figure 1 with G in and G out . In this case, we have I G in = I (X; Y) + I (T; X) and I G out = I (T; X) + I (T; Y). From DK L [q ||G out ] = I G in − I G out (see propostion 3) we obtain L(2) = I (T; X) − γ I (T; Y) + (1 + γ )I (X; Y), where the last (constant) term can be ignored. Hence, we end up with the original IB functional. Thus, we can think of the original IB problem as finding a compression T of X that results in a joint distribution, q (X, Y, T), that is as close as possible to the DAG where X and Y are independent given T.
1750
N. Slonim, N. Friedman, and N. Tishby
From proposition 3 we obtain L(2) = I G in + γ (I G in − I G out ) = (1 + γ )I G in − γ I G out , γ . In which is similar to the functional L(1) under the transformation β = 1+γ this transformation, the range γ ∈ [0, ∞) corresponds to the range β ∈ [0, 1). Notice that when β = 1, we have L(1) = DK L [q ||G out ], which is the extreme case of L(2) . Thus, from a mathematical perspective, L(2) is a special case of L(1) with the restriction β ≤ 1. As we saw, the two principles require different versions of G out to reconstruct the original IB functional. More generally, for a given principle, different choices of G out yield different optimization problems. Alternatively, given G out , the two principles yield different optimization problems. In the previous example, we saw that these two effects can compensate for each other. In other words, using the structural variational principle with a different choice of G out ends up with the same optimization problem, which in this case is equivalent to the original IB problem. To further understand the relation between the two principles, we consider the range of solutions for extreme values of β and γ . When β → 0 and γ → 0, in both formulations we simply minimize I G in . In the other limit, the two principles differ. When β → ∞, in L(1) we simply maximize I G out . (a ) Here, applying L(1) with G out corresponds to maximizing I (T; Y). However, (b) applying L(1) with G out corresponds to maximizing I (T; X) + I (T; Y); thus, information about X will be preserved even if it is irrelevant to Y. When γ → ∞, in L(2) we simply minimize DK L [q ||G out ], that is, minimize the violations of conditional independencies implied by G out (see (b) proposition 3). For G out , this minimizes I (X; Y | T) = I (X; Y) − I (T; Y) (where we used the structure of G in and proposition 3), hence, this is (a ) equivalent to maximizing I (T; Y). For G out , when γ → ∞, we minimize I (X; Y | T) = I (X; Y) + I (T; X) − I (T; Y). Thus, unlike the application of (a ) L(1) to G out , we cannot ignore the term I (T; X). To summarize, we can say that L(1) focuses on the edges that are present in G out , while L(2) focuses on the edges that are missing or, more precisely, on the conditional independencies implied by their absence. Thus, although both principles can be applied to any choice of G out , some choices make more sense for L(1) than for L(2) , and vice versa.
3.4 Examples: IB Variations 3.4.1 Parallel IB. In Figure 2A we consider a simple extension of the original IB, where we introduce k compression variables, {T1 , . . . , Tk }, instead of one. Similar to the original IB problem, we want {T1 , . . . , Tk } to preserve the information X maintains about Y, as specified by the DAG
Multivariate Information Bottleneck
1751
(a)
(b)
(a)
(b)
(a)
(b)
Figure 2: Possible source and target networks for the parallel, symmetric, and triplet IB examples.
(a )
G out in the same panel. We call this example the parallel IB, as {T1 , . . . , Tk } compress X in parallel. (a ) Here, I G in = I (X; Y) + kj=1 I (Tj ; X) and I G out = I (T1 , . . . , Tk ; Y); thus, La(1) =
k
I (Tj ; X) − β I (T1 , . . . , Tk ; Y).
(3.8)
j=1
That is, we attempt to minimize the information between X and every Tj while maximizing the information all the Tj ’s preserve together about Y. From the structure of G in , we can also obtain k j=1
I (Tj ; X) = I (T1 , . . . , Tk ; X) + I(T1 , . . . , Tk ),
(3.9)
1752
N. Slonim, N. Friedman, and N. Tishby
where I(T1 , . . . , Tk ) is the multi-information of all the compression variables. Thus, La(1) = I (T1 , . . . , Tk ; X) + I(T1 , . . . , Tk ) − β I (T1 , . . . , Tk ; Y).
(3.10)
In other words, we aim to find {T1 , . . . , Tk } that compress X, preserve the information about Y, and remain independent of each other as much as possible. Recall that using L(2) , we aim at minimizing violation of indepen(b) dencies in G out . This suggests that the DAG G out of Figure 2A captures our intuitions above. In this DAG, X and Y are independent given (b)
every Tj and all the Tj ’s are independent of each other. Here, I G out = I (T1 , . . . , Tk ; X) + I (T1 , . . . , Tk ; Y), and using equation 3.9, we have (2)
Lb =
k
I (Tj ; X) + γ (I(T1 , . . . , Tk ) − I (T1 , . . . , Tk ; Y)),
j=1
which is reminiscent of equation 3.10. 3.4.2 Symmetric IB. Another natural extension of the original IB is the symmetric IB. Here, we want to compress X into TX and Y into TY such that TX extracts the information X contains about Y while TY extracts the information Y contains about X. The DAG G in of Figure 2B captures the (a ) form of the compression. For G out in the same panel, we have La(1) = I (TX ; X) + I (TY ; Y) − β I (TX ; TY ).
(3.11)
Thus, on one hand, we attempt to compress, and on the other hand, we attempt to make TX and TY as informative about each other as possible. Notice that if TX is informative about TY , then it is also informative about Y. Second, we use the structural principle, L(2) , for which we are interested in approximating the conditional independencies implied by G out . This sug(b) gests that G out of Figure 2B represents our desired target model. Here, both TX and TY are sufficient to separate X from Y, while being dependent on each other. Thus, we obtain La(2) = I (TX ; X) + I (TY ; Y) − γ I (TX ; TY ).
(3.12)
As in example 1, we see that by using the structural variational principle with a different G out , we end up with the same optimization problem as by using L(1) . Other alternative specifications of G out that are interesting in this
Multivariate Information Bottleneck
1753
context (Friedman, Mosenzon, Slonim, & Tishby, 2001) are omitted here for brevity. 3.4.3 Triplet IB. A challenging task in the analysis of sequence data, such as DNA and protein sequences or natural language text, is to identify features that are relevant for predicting another symbol in the sequence. Typically these features are different for forward prediction versus backward prediction. For example, the textual features that predict the next word to be information are clearly different from those that predict the previous word to be information. Here, we address this issue by extracting features of both types such that their combination is highly informative about a symbol between other known symbols. The DAG G in of Figure 2C is one way of capturing the form of the compression, where we denote by X p , Y, Xn the previous, current, and next symbol in a given sequence, respectively. Here, Tp compresses X p , while Tn compresses Xn . For the choice of G out , we consider again two alternatives. First, we simply require that the combination of Tp and Tn will maximally preserve the information that X p and Xn hold about the current symbol, Y. (a ) This is specified by G out in the same panel for which we obtain La(1) = I (Tp ; X p ) + I (Tn ; Xn ) − β I (Tp , Tn ; Y).
(3.13) (b)
Second, we use the structural principle, L(2) , with G out of Figure 2C. Here, Tp and Tn are independent, and both are needed to make Y independent of X p and Xn . Hence, the resulting Tp and Tn partitions provide compact, independent, and informative evidence regarding the value of Y. This specification yields (2)
Lb = I (Tp ; X p ) + I (Tn ; Xn ) − γ I (Tp , Tn ; Y),
(3.14)
which is equivalent to equation 3.13. We will term this example the triplet IB. 4 Characterization of the Solution As shown in Tishby et al. (1999), it is possible to implicitly characterize the form of the optimal solutions to the original IB functional. Here, we provide a similar characterization to the multivariate IB case. Specifically, we want in to describe the distributions q (Tj | PaG Tj ) that optimize the trade-off defined by each of the two principles. We present this characterization for L(1) . A similar analysis for L(2) is straightforward. We first need some additional notational shorthands. We denote by G out in U j = PaG the Tj the (X) variables that Tj should compress, by V Xi = Pa Xi
1754
N. Slonim, N. Friedman, and N. Tishby G
variables that should maintain information about Xi , and by VTj = PaTjout the variables that should maintain information about Tj . We also denote −j −j V Xi = V Xi \ {Tj } , and similarly, VT = VT \ {Tj }. To simplify the presentation, we also assume that U j ∩ VTj = ∅.3 In addition, we use the notation
E p(·|u j ) DK L p Y | Z, u j p Y | Z, t j
p z | u j DK L p Y | z, u j p Y | z, t j = z
p Y | Z, u j = E p(Y,Z|u j ) log
, p Y | Z, t j where Y is a random variable and Z is a set of random variables. Notice that this term implies averaging over all values of Y and Z using the conditional distribution p(Y, Z | u j ). In particular, if Y or Z intersects with U j , then only the values that are consistent with u j have positive weights in this averaging. Also, notice that if Z is empty, this term reduces to the standard DK L [ p(Y | u j )|| p(Y | t j )]. The main result of this section is as follows. Theorem 1. Assume that p(X ), G in , G out , and β are given and that q (X , T ) |= G in . The conditional distributions {q (Tj | U j )}kj=1 are a stationary point of L(1) [q (X , T )] = I G in [q (X , T )] − βI G out [q (X , T )] if and only if q (t j | u j ) =
q (t j ) e −βd(t j ,u j ) , ZTj (u j , β)
(4.1)
where ZTj (u j , β) is a normalization function, and
d(t j , u j ) ≡
−j
−j
E q (·|u j ) [DK L [q (Xi | V Xi , u j )||q (Xi | V Xi , t j )]]
i:Tj ∈V X i
+
−j
−j
E q (·|u j ) [DK L [q (T | V T , u j )||q (T | V T , t j )]]
:Tj ∈V T
+DK L [q (V Tj | u j )||q (V Tj | t j )].
(4.2)
The first sum is over all Xi such that Tj participates in predicting Xi . The second sum is over all Tl such that Tj participates in predicting Tl . The last term is related to cases where Tj should be predicted by some VTj = ∅. This theorem provides an implicit set of equations for q (Tj | U j ) through the multivariate relevant distortion d(Tj , U j ), which in turn depends on 3
This is, in fact, the standard situation, since U j ∩ T = ∅ and typically VT j ⊂ T.
Multivariate Information Bottleneck
1755
those unknown distributions. This distortion measures the degree of proximity of the conditional distributions in which U j is involved to those where we replace U j with its compact representative, Tj . For example, if some cluster t j ∈ T j behaves more similarly to u j ∈ U j than another cluster, t j ∈ T j , we have d(t j , u j ) < d(t j , u j ), which implies q (t j | u j ) > q (t j | u j ). In other words, if t j is a good representative of u j the corresponding membership probability, q (t j | u j ), is increased accordingly. As in the original IB problem, equation 4.1 must be solved selfconsistently with the equations for the other distributions that involve Tj , which emerge through marginalization over q (X, T) using the conditional independencies implied by G in . Notice that when β is small, the q (Tj | U j ) are diffused since β reduces the differences between the distortions for different values of Tj . When β → ∞, all the probability mass will be assigned to the value t j with the smallest distortion, that is, the above stochastic mapping will become deterministic. (a )
4.1 Examples. For G in and G out of Figure 1, it is easy to verify that equation 4.2 amounts to d(T, X) = DK L [ p(Y | X)||q (Y | T)], in full analogy to equation 2.5, as required. (a ) For the parallel IB case of G out in Figure 2A, we have d(Tj , X) = E q (·|X) [DK L [q (Y | T− j , X||q (Y | T− j , Tj )]],
(4.3)
where we used the notation T− j = T \ {Tj }. Notice that due to the structure of G in , q (Y | T− j , X) = p(Y | X). (a ) For the symmetric IB case of G out in Figure 2B, we obtain d(TX , X) = E p(·|X) [DK L [q (TY | X)||q (TY | TX )]]
(4.4)
and a symmetric expression for d(TY , Y). Thus, TX attempts to make predictions similar to those of X about TY . (a ) Last, for the triplet IB case of G out in Figure 2C, we have d(Tp , X p ) = E q (·|Xp ) [DK L [q (Y | Tn , X p )||q (Y | Tn , Tp )]].
(4.5)
Thus, q (tp | x p ) increases when the predictions about Y given by tp are similar to those given by x p (when averaging over Tn ). The distortion term for Tn is defined analogously. 5 Multivariate IB Algorithms Similar to the original IB functional, the multivariate IB functional is not convex with respect to all of its arguments simultaneously. Except for trivial cases, it always has multiple minima. Since theorem 1 provides necessary
1756
N. Slonim, N. Friedman, and N. Tishby
conditions for internal minima of the functional, they can be used to find such solutions. However, as in many other optimization problems, different heuristics can also be employed to construct solutions, with relative advantages and disadvantages. Here, we show that the four algorithmic approaches proposed for the original IB problem (Slonim, 2002) can be extended into the multivariate case. We concentrate on the variational principle L(1) . Deriving the same algorithms for L(2) is straightforward. 5.1 Iterative Optimization Algorithm: Multivariate iIB. We start with the case where β is fixed. Following Tishby et al. (1999), we apply the fixedpoint equations in equation 4.1, alternately with the equations for the other distributions that involve some T variables. Given the intermediate solution of the algorithm at the mth iteration, {q (m) (Tj | U j )}kj =1 , we find q (m) (Tj ) and d (m) (Tj , U j ) out of q (m) (X, T) = p(X)
k
q (m) (Tj | U j )
(5.1)
j =1
and then update (m+1) (t j | u j ) ← q
q
(m+1)
q (m) (t j ) (m+1) (u j ,β) j
ZT
(t j | u j ) ← q
(m)
e −βd
(m)
(t j ,u j )
, (5.2)
(t j | u j ), ∀ j = j.
In Figure 3 we present pseudocode for this iterative algorithm, which we will term multivariate iIB. As an example, consider the case of the symmetric IB. Given {q (m) (TX | X), (m) q (TY | Y)}, we find q (m) (TX ), q (m) (TY | X) and q (m) (TY | TX ) out of q (m) (X, Y, TX , YY ) = p(X, Y)q (m) (TX | X)q (m) (TY | Y), from which we obtain d (m) (TX , X) = DK L [q (m) (TY | X)||q (m) (TY | TX )]. Next, we update q (m+1) (tx | x) ←
(m) q (m) (tx ) e −βd (tx ,x) , Z(m+1) (x,β)
q (m+1) (t | y) ← q (m) (t | y). y y
(5.3)
In the next iteration, we find a new version for q (TY | Y) while q (TX | X) is kept still. We repeat these updates until convergence to a stationary point. We note that proving convergence in general, for any choice of β and any choice of G in and G out , is more involved than for the original IB problem due to the complex structure of equation 4.2. Nonetheless, in all our experiments, on real and synthetic data, the algorithm always converged to a (locally) optimal solution.
Multivariate Information Bottleneck
1757
Figure 3: Pseudocode of the multivariate iterative IB algorithm (multivariate iIB). JS denotes the Jensen-Shannon divergence (see equation 5.8). In principle, we repeat this procedure for different initializations and choose the solution that minimizes L = I G in − βI G out .
5.2 Deterministic Annealing Algorithm: Multivariate dIB. It is often desirable to explore a hierarchy of solutions at different β values. Thus, we now present a multivariate deterministic annealing-like procedure that extends the original approach in Tishby et al. (1999). In deterministic annealing, we iteratively increase β and then adapt the solution at the previous β value to the new one (Rose, 1998). Recall that for β → 0, the solution consists of essentially one cluster per Tj . As β is increased, at some critical point the values of some Tj diverge and show different behaviors. Successive increments of β will reach additional bifurcations that we wish to identify. Thus, for each Tj , we end up with a bifurcating hierarchy that traces the sequence of solutions at different β values. To detect these bifurcations, we adopt the method suggested in Tishby et al. (1999). Given the solution from the previous β value, we construct an initial problem in which every Tj value is duplicated. Let t j and trj be
1758
N. Slonim, N. Friedman, and N. Tishby
Figure 4: Pseudocode of the multivariate deterministic annealing-like algorithm (multivariate dIB). JS denotes the Jensen-Shannon divergence (see equation 5.8). NeGTj our denotes the neighbors of Tj in G out (parents/direct descendants). f (β, εβ ) is a simple function used to increase β based on its current value and on some scaling parameter εβ .
two such duplications of t j ∈ T j . Then we set q ∗ (t j | u j ) = q (t j |
u j ) 12 + α ˆ (t j , u j ) and q ∗ (trj | u j ) = q (t j | u j ) 12 − α ˆ (t j , u j ) , where ˆ (t j , u j ) is a randomly drawn noise term and α > 0 is a (small) scale parameter. Thus, each copy is a slightly perturbed version of t j . For large enough β, this perturbation suffices for the two copies to diverge; otherwise, they collapse to the same solution. Specifically, given this initial point, we apply the multivariate iIB algorithm. After convergence, if t j and trj are sufficiently different, we declare that t j has split and incorporate t j and trj into the hierarchy we construct for Tj . Finally, we increase β and repeat the whole process. We will term this algorithm multivariate dIB. A pseudocode is given in Figure 4.
Multivariate Information Bottleneck
1759
There are several difficulties with this algorithm. The parameters b ( j) that are involved in detecting the bifurcations often should scale with β. Further, one may need to tune the rate of increasing β; otherwise, cluster splits might be skipped. Last, the duplication process is stochastic in nature and involves additional parameters. Some of these issues were addressed rigorously for the original IB problem (Parker, Gedeon, & Dimitrov, 2002). Extending this work in our context seems like a natural direction for future research. 5.3 Agglomerative Algorithm: Multivariate aIB. The agglomerative IB algorithm was introduced in Slonim & Tishby (1999) as a simple and approximated algorithm for the original IB problem. It employs a greedy agglomerative clustering technique to find a hierarchical clustering tree in a bottom-up fashion and was found to be useful for various problems (Slonim, 2002). We now present a multivariate extension of this algorithm. To this aim, it is more convenient to consider the problem of maximizing Lmax [q (X, T)] = I G out [q (X, T)] − β −1 · I G in [q (X, T)],
(5.4)
which is clearly equivalent to minimizing L(1) (see equation 3.6). Our algorithm starts with the most fine-grained solution, Tj = U j . Thus, every u j is solely assigned to a unique singleton cluster, t j ∈ T j , and the assignment probabilities, {q (Tj | U j )}kj=1 are deterministic—either 0 or 1 (that is, “hard” clustering). Nonetheless, the following analysis holds for the general case of soft clustering as well. Given the singleton initialization, we reduce the cardinality of one Tj by agglomerating, or merging, two of its values, t j and trj , into a single value t¯ j . Formally, this is defined through q (¯t j | u j ) = q (t j | u j ) + q (trj | u j ).
(5.5)
The corresponding conditional merger distribution is defined through z = {π,z , πr,z } =
q (t j | z) q (trj | z) , . q (¯t j | z) q (¯t j | z) q (t )
q (tr )
(5.6)
Note that if Z = ∅, then z = = { q (t¯jj ) , q (t¯jj ) }, that is, these are the relative weights of each of the clusters that participate in the merger (Slonim & Tishby, 1999). The basic question in an agglomerative process is which pair to merge bef aft at each step. Let Tj and Tj denote the random variables that correspond to Tj , before and after a merger in Tj , respectively. Then our merger cost is
1760
N. Slonim, N. Friedman, and N. Tishby
given by Lmax (t j , trj ) = Lmax − Lmax , bef
bef
aft
(5.7)
aft
bef
aft
where Lmax and Lmax are calculated based on Tj and Tj , respectively. The greedy procedure evaluates all the potential mergers, for every Tj , and applies the one that minimizes Lmax (t j , trj ). This is repeated until all the T variables degenerate into trivial clusters. The resulting set of hierarchies describes a range of solutions at all the different resolutions. A direct calculation of all the potential merger costs using equation 5.7 is typically unfeasible. However, as in Slonim & Tishby (1999), one may calculate Lmax (t j , trj ) while examining only the distributions that involve t j and trj directly. An essential concept in this derivation is the Jensen-Shannon (JS) divergence. Specifically, the JS divergence between two probability distributions, p1 and p2 , with respect to the positive weights, = {π1 , π2 }, π1 + π2 = 1, is given by JS [ p1 , p2 ] = π1 DK L [ p1 || p¯ ] + π2 DK L [ p2 || p¯ ],
(5.8)
where p¯ = π1 p1 + π2 p2 . The JS is nonnegative and upper bounded, and it equals zero if and only if p1 = p2 . It is also symmetric, but it does not satisfy the triangle inequality. Comparing the JS between two empirical distributions of two samples with some predefined threshold is asymptotically the optimal way to determine whether both samples came from a single source (Gutman, 1989; Schreibman, 2000). Theorem 2.
Let t j , trj ∈ T j be two clusters. Then,
Lmax (t j , trj ) = q (¯t j ) · d A(t j , trj ),
(5.9)
where
d A(t j , trj ) ≡
i:Tj ∈V Xi
+
−j
E q (·|t¯ j ) [JS
:Tj ∈V T
−j VX i
E q (·|t¯ j ) [JS
−j
[q (Xi | V Xi , t j ), q (Xi | V Xi , trj )]]
−j VT
−j
−j
[q (T | V T , t j ), q (T | V T , trj )]]
+ JS [q (V Tj | t j ), q (V Tj | trj )] − β −1 · JS [q (U j | t j ), q (U j | trj )].
(5.10)
That is, the merger cost is a multiplication of the “weight” of the merger components, q (¯t j ), with their “distance,” d A(t j , trj ). Notice that due to the JS
Multivariate Information Bottleneck
1761
Figure 5: Pseudocode of the multivariate agglomerative IB algorithm (multivariate aIB).
properties, this “distance” is symmetric, but it is not a metric. It is small for pairs that give similar predictions about the variables that Tj should predict, and have different predictions, or minimum overlap about the variables that Tj should compress. Equation 5.10 is clearly analogous to equation 4.2. While for the multivariate iIB the optimization is governed by the KL divergences between data and cluster centroids, here it is controlled through JS divergences, which are related to the likelihood that the two merged clusters have a common source. For brevity, in the rest of this section, we focus on the simpler hard clustering case, for which JS [q (U j | t j ), q (U j | trj )] = H[], where H is Shannon’s entropy. A pseudocode of the general procedure is given in Figure 5. Examples. obtain
(a)
For the original IB problem (G in and G out in Figure 1), we
Lmax (tl , tr ) = q (t¯) · (JS [q (Y | tl ), q (Y | tr )] − β −1 H[]),
(5.11)
which is consistent with the original aIB algorithm (Slonim & Tishby, 1999).
1762
N. Slonim, N. Friedman, and N. Tishby (a )
For the parallel IB (G in and G out in Figure 2A), we have Lmax (t j , trj ) = q (¯t j ) · (E q (·|t¯ j ) [JST− j [q (Y | T− j , t j ), q (Y | T− j , trj )]] − β −1 H[]),
(5.12)
where again we used T− j = T \ {Tj }. (a ) For the symmetric IB (G in and G out in Figure 2B), we obtain Lmax (tX , tXr ) = q (¯t X ) · (JS [q (TY | tX ), q (TY | tXr )] − β −1 H[]),
(5.13)
and an analogous expression for TY . 5.4 Sequential Optimization Algorithm: Multivariate sIB. An agglomerative approach is relatively computationally demanding. If we start from Tj = U j , the time complexity is O( kj=1 | U j |3 | V j |) (where | V j | denotes the complexity of calculating a single merger in Tj ), while the space complexity is O( kj=1 | U j |2 ), that is, unfeasible for large data sets. Moreover, it is not guaranteed to extract even locally optimal solutions. Recently, we suggested a framework for casting an agglomerative clustering algorithm into a sequential optimization algorithm, which is guaranteed to converge to a stable solution in much better time and space complexity (Slonim, Friedman, & Tishby, 2002). Next, we describe how to apply this idea in our context. The sequential procedure maintains for each Tj a flat partition with Mj (hard) clusters. At each step we draw a u j ∈ U j out of its current cluster (denoted here t j (u j )) and represent it as a new singleton cluster. Using equation 5.9, we now merge u j into a cluster t new such that j t new = argmint j ∈T j Lmax ({u j }, t j ), to obtain a (possibly new) partition Tjnew , j with the appropriate cardinality. Since this step can only increase the (upper-bounded) functional Lmax , we are guaranteed to converge to a stable solution. It is easy to verify that our time complexity is O( · j | U j T j V V j |), where is the number of iterations we should perform until convergence is attained. Since typically · |T j | | U j |2 , we get a significant run-time improvement. Moreover, we dramatically improve our memory consump tion toward O( j |T j |2 ). We will term this algorithm multivariate sIB. A pseudocode is given in Figure 6. 6 Illustrative Applications In this section we consider a few illustrative applications of the general methodology. In practice we do not have access to the true joint distribution, p(X), but only a finite sample drawn out of this distribution. Here, a
Multivariate Information Bottleneck
1763
Figure 6: Pseudocode of the multivariate sequential IB algorithm (multivariate sIB). In principle, we repeat this procedure for different initializations and choose the solution that maximizes Lmax = I G out − β −1 I G in .
pragmatic approach was taken where we estimated p(X) through a simple normalization of the empirical counts. Our results seem satisfactory even in extreme undersampling situations, and we leave the theoretical analysis of finite sample effects on our methodology for future research. An appealing property of an information-theoretic approach, and the IB framework in particular, is that it can be applied to data sets of very different natures in exactly the same manner. There is no need to tailor the algorithms to the given data or to define a domain-specific distortion measure. To demonstrate this point, we apply our method to a variety of data types, including natural language text, protein sequences, and gene expression data (see appendix B for preprocessing and implementation details). In all cases, the quality of the results can be assessed on similar grounds, in terms of compression versus preservation of relevant information.
6.1 Parallel IB Applications. We consider the specification of G_in and G_out^(a) of Figure 2A. The problem thus is to partition X into k sets of clusters, T = {T_1, ..., T_k}, that minimize the information they maintain about X, maximize the information they hold about Y, and remain as independent of each other as possible. For various technical reasons, the sIB algorithm is most suitable here. Specifically, we first apply sIB with k = 1, which is equivalent to solving the original IB problem. Given this solution, denoted T_1, we apply sIB again with k = 2, while T_1 is kept fixed. That is, given T_1, we look for T_2 such that I(T_1, T_2; Y) − β^{-1}(I(T_1, T_2; X) + I(T_1; T_2)) is maximized. Next, we hold T_1 and T_2 fixed while looking for T_3, and so forth. Loosely speaking, in T_1 we aim to extract the first principal partition of the data. In T_2 we seek a second, approximately independent, principal partition, and so on.

6.1.1 Parallel sIB for Style Versus Topic Text Clustering. A well-known concern in cluster analysis is that there might be more than one meaningful way to partition a given body of data. For example, text documents might admit two dichotomies: by their topics and by their writing styles. Next, we construct such an example and solve it using our parallel IB approach.

We selected four books: The Beasts of Tarzan and The Gods of Mars by E. R. Burroughs and The Jungle Book and Rewards and Fairies by R. Kipling. Thus, in addition to the partition according to writing style, there is a possible topic partition of the "jungle" topic versus all the rest. We split each book into "documents" consisting of 200 successive words each. We defined p(w, d) as the number of occurrences of the word w in the document d, normalized by the total number of words in the corpus, and applied parallel sIB to cluster the documents into two partitions, T_1 and T_2, of two clusters each. Since this setting already implies significant compression, we used β^{-1} = 0 and concentrated on maximizing I(T_1, T_2; W). In Table 1, we see that T_1 shows almost perfect correlation with an authorship partitioning, while T_2 is correlated with a topical partitioning. Moreover, I(T_1; T_2) ≈ 0.001 nats; that is, these two partitions are practically independent. In addition, with only four clusters, I(T_1, T_2; W) ≈ 0.3 nats, which is about 13% of the original (empirical) information, I(D; W).

6.1.2 Parallel sIB for Gene Expression Data Analysis. As our second example, we used the mRNA gene expression measurements of approximately 6800 human genes in 72 samples of leukemia (Golub et al., 1999). These data include independent annotations of their components, including the type of leukemia (ALL versus AML), the type of cells, the donating hospital, and others. We normalized the measurements of the genes in each sample independently to get an estimated joint distribution, p(S, G), over samples and genes (where p(S) is uniform).
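Both joint distributions used in this section, p(D, W) over documents and words and p(S, G) over samples and genes, are simple normalized count matrices. A possible construction for the document-word case (illustrative names only, not the authors' code):

```python
from collections import Counter
import numpy as np

def build_joint(documents):
    # documents: list of token lists. Returns (p, vocab) where
    # p[d, w] is the count of word w in document d divided by the
    # total number of words in the corpus, so p sums to 1.
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    p = np.zeros((len(documents), len(vocab)))
    for d, doc in enumerate(documents):
        for w, c in Counter(doc).items():
            p[d, index[w]] = c
    return p / p.sum(), vocab
```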
Table 1: Results for Parallel sIB Applied to Style Versus Topic Text Clustering.

                                     T1,a   T1,b   T2,a   T2,b
The Beasts of Tarzan (Burroughs)      315      2    315      2
The Gods of Mars (Burroughs)          407      0      1    406
The Jungle Book (Kipling)               0    255    254      1
Rewards and Fairies (Kipling)           0    367     42    325

Notes: Each entry indicates the number of "documents" in a given cluster and class. For example, the first cluster of the first partition, T1,a, includes 315 "documents" taken from the book The Beasts of Tarzan.
Table 2: Results for Parallel sIB Applied to Gene Expression Measurements of Leukemia Samples (Golub et al., 1999).

            T1,a   T1,b   T2,a   T2,b   T3,a   T3,b   T4,a   T4,b
AML           23      2     14     11     12     13     13     12
ALL            0     47     37     10      9     38     22     25
B-cell         0     38     37      1      6     32     20     18
T-cell         0      9      0      9      3      6      2      7
Average PS  0.64   0.72   0.71   0.66   0.53   0.76   0.70   0.69

Note: Each entry indicates the number of samples in a given cluster and class. Note that T-cell/B-cell annotations are available only for samples annotated as ALL type. The last row indicates the average "prediction strength" score (Golub et al., 1999) in each cluster.
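The information values quoted throughout these examples (e.g., I(T_1; T_2) ≈ 0.001 nats for the style/topic partitions, or I(T; G) below) are ordinary mutual informations of joint tables. A small computational sketch, assuming the joint is a normalized 2D numpy array:

```python
import numpy as np

def mutual_information(p):
    # I(A;B) = sum_{a,b} p(a,b) log[ p(a,b) / (p(a) p(b)) ], in nats.
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (pa * pb)[mask])))
```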
Given this joint distribution, we applied the parallel sIB to cluster the samples into four clustering hierarchies consisting of two clusters each (again, with β^{-1} = 0). In Table 2 we present the four partitions. T_1 almost perfectly matches the AML versus ALL annotation, while T_2 is correlated with the B-cell versus T-cell split. For T_3 we note that the average "prediction strength" score (see Golub et al., 1999) is very different between the two clusters. For T_4 we were not able to find any clear correlation with the available annotations, suggesting that this partition overfits the data or that further meaningful partitions of these data are not expressed in the provided annotations. In terms of information, I(T; G) preserves about 54% of the original information, I(S; G) ≈ 0.23 nats.

6.2 Symmetric IB Applications. Here, we illustrate the applicability of all our algorithms for G_in and G_out^(a) of Figure 2B.

6.2.1 Symmetric dIB and iIB for Word-Topic Clustering. We start with a simple text processing example, constructed out of the 20-news-groups data (Lang, 1995). These data consist of approximately 20,000 documents distributed among 20 different topics. We defined p(w, c) as the number of occurrences of a word w in all documents of topic c, normalized by the total number of words in the corpus, and applied the symmetric dIB algorithm.
Figure 7: Application of the symmetric dIB to the 20-news-groups data. The learned hierarchy of topic clusters, T_c, is presented after four splits. The numerical value inside each ellipse denotes the bifurcation β value. Notice that at the early stages, the algorithm is inconclusive regarding the assignment of the electronics topic, which demonstrates that the hierarchy obtained by the dIB algorithm is not necessarily a tree. Below every topic cluster, t_c, we present its most probable word cluster (t_w* = argmax_{t_w ∈ T_w} q(t_w | t_c)) through its five most probable words, sorted by q(w | t_w*).
We start with β^{-1} = 0 and gradually "anneal" the system to extract a hierarchy of word clusters, T_w, and a corresponding hierarchy of topic clusters, T_c. The obtained T_c partitions were typically hard; hence, this hierarchy can be presented as a simple tree-like structure (see Figure 7). For every t_c, we find its most probable word cluster (t_w* = argmax_{t_w} q(t_w | t_c)) and present it in the same figure. Evidently there is a strong semantic relation in every such pair. The word clusters also exploited the soft clustering ability to deal with words that are relevant to several topics, as illustrated in Table 3. In terms of information, after four splits, |T_w| = 14, |T_c| = 9, and I(T_w; T_c) = 0.6 nats, which is about 70% of the original information, I(W; C).

We further applied the symmetric iIB algorithm to the same data. For purposes of comparison, the input parameters were set as in the dIB result: |T_w| = 14, |T_c| = 9, and β ≈ 22.72. We performed 100 different random initializations, 8 of which converged to a better minimum of L than the one
Table 3: Results for "Soft" Word Clustering Using Symmetric dIB over the 20-News-groups Data.

W            q(tw | w)   tc*                q(tc* | tw)
war          0.92        politics           0.44
             0.06        religion-mideast   0.34
             0.02        religion-mideast   0.93
killed       0.86        politics           0.44
             0.08        religion-mideast   0.34
             0.06        religion-mideast   0.93
evidence     0.77        religion-mideast   0.34
             0.23        politics           0.44
price        0.74        hardware           0.31
             0.26        sport              0.35
speed        0.99        hardware           0.31
             0.01        sport              0.35
application  0.58        hardware           0.31
             0.42        windows            0.84

Notes: The first column indicates the word, that is, the W value. The second column presents q(tw | w) for all the clusters for which this probability was nonzero. tc* denotes the topic cluster that maximizes q(tc | tw); it is represented in the third column by the joint topic of its members (see Figure 7). The last column presents the probability of this topic cluster given tw.
found by dIB (see Figure 8). Thus, by tracking the changes in the solution as β increases, the dIB approach succeeds in finding a relatively good solution and also provides more detail by describing a hierarchy of solutions. Nonetheless, if one is interested in a flat partition, applying iIB with a sufficient number of initializations will probably yield a better optimum.

6.2.2 Symmetric sIB and aIB for Protein Sequence Analysis. As a second example we used a subset of five protein classes taken from the PRINTS database (Attwood et al., 2000) (see Table 4). All five classes share a common protein structural unit, known as the glutathione S-transferase (GST) domain. A well-established database (the Pfam database, http://www.sanger.ac.uk/Pfam) has chosen not to model these groups separately due to the high sequence similarity between them. Nonetheless, our unsupervised symmetric IB algorithms extract clusters that are well correlated with these groups.

We represented each protein as a count vector over the different 4-mers of amino acids, or features, present in these data. We defined p(f | r) as the relative frequency of a feature f in a protein r and further defined p(r) as uniform. Given these data, we applied the symmetric aIB and sIB (with β^{-1} = 0) to extract protein clusters, T_R, and feature clusters, T_F, such that I(T_R; T_F) is maximized.
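The 4-mer representation is easy to reproduce. A minimal sketch (the toy sequence in the comment is invented for illustration):

```python
from collections import Counter

def four_mer_profile(sequence):
    # Relative frequencies p(f | r) of the overlapping 4-mers of amino
    # acids in one protein sequence r.
    kmers = [sequence[i:i + 4] for i in range(len(sequence) - 3)]
    return {k: c / len(kmers) for k, c in Counter(kmers).items()}

# four_mer_profile("MAEKLTQ") -> {'MAEK': 0.25, 'AEKL': 0.25,
#                                 'EKLT': 0.25, 'KLTQ': 0.25}
```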
Figure 8: Application of symmetric dIB and symmetric iIB to the 20-news-groups data. The iIB results over 100 different initializations are sorted with respect to L = I(T_w; W) + I(T_c; C) − 22.72 · I(T_w; T_c). In eight cases, iIB converged to a better minimum.
Table 4: Data Set Details of the Protein GST Domain Test.

Class   Family Name           Number of Proteins
c1      GST—no class label    298
c2      S crystalline         29
c3      Alpha class GST       40
c4      Mu class GST          32
c5      Pi class GST          22
For the sIB results, with 10 protein clusters and 10 feature clusters, we obtain I(T_R; T_F) = 1.1 nats (∼30% of the original information), and the algorithm almost perfectly recovers the manual biological partitioning of the proteins (see Table 5). For each t_R, we identify its most probable feature cluster (t_F* = argmax_{t_F} q(t_F | t_R)) and present in Table 6 its most probable features, which apparently are good indicators for the biological class that is correlated with t_R.
Table 5: Results for Applying Symmetric sIB to the GST Protein Data Set with |T_R| = 10, |T_F| = 10.

Class/
Cluster   tR1   tR2   tR3   tR4   tR5   tR6   tR7   tR8   tR9   tR10
c1        107    49    47    42    30    17     4     1     1      0
c2          0     0     0     0     0     0    29     0     0      0
c3          0     0     0     0     0     0     0    39     0      1
c4          0     0     0     0     0     2     0     0    30      0
c5          0     7     0     0     0     0     0     0     1     14
Errors      0     7     0     0     0     2     4     1     2      1

Notes: Each entry indicates the number of proteins in a given cluster and class. The last row indicates the number of "errors" for each cluster, defined as the number of proteins in the cluster that are not labeled by the cluster's most dominant label.
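The "errors" row is a direct computation over the class-by-cluster contingency table; a short sketch with assumed names:

```python
import numpy as np

def cluster_errors(counts):
    # counts: 2D array, rows = classes, columns = clusters. Errors per
    # cluster = members not carrying the cluster's dominant label.
    per_cluster = counts.sum(axis=0) - counts.max(axis=0)
    return per_cluster, int(per_cluster.sum())
```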
Table 6: Results for Symmetric sIB: Indicative Features for GST Protein Classes.

TR          tF*    q(tF*|tR)   Feature   q(f|tF*)   c1     c2     c3     c4     c5
tR7 (c2)    tF10   0.91        RYLA      0.022      0.08   1.00   0.00   0.19   0.00
                               GRGR      0.020      0.04   0.72   0.58   0.00   0.00
                               NGRG      0.019      0.04   0.72   0.43   0.00   0.00
tR9 (c4)    tF8    0.89        FPNL      0.025      0.06   0.00   0.00   1.00   0.00
                               AILR      0.018      0.03   0.00   0.03   0.88   0.64
                               SNAI      0.017      0.03   0.00   0.03   0.94   0.46
tR10 (c5)   tF2    0.85        LDLL      0.019      0.04   0.00   0.08   0.00   0.59
                               SFAD      0.017      0.04   0.00   0.03   0.00   0.64
                               FETL      0.017      0.04   0.00   0.03   0.00   0.59
tR8 (c3)    tF1    0.85        FPLL      0.018      0.01   0.00   0.90   0.00   0.77
                               YGKD      0.017      0.03   0.00   0.88   0.00   0.50
                               AAGV      0.016      0.03   0.45   0.78   0.00   0.00
tR5 (c1)    tF9    0.83        TLVD      0.015      0.13   0.00   0.00   0.00   0.00
                               WESR      0.015      0.12   0.00   0.00   0.00   0.00
                               EFLK      0.015      0.00   0.00   0.00   0.19   0.00
tR4 (c1)    tF5    0.80        IPVL      0.010      0.11   0.00   0.00   0.00   0.00
                               ARFW      0.010      0.11   0.00   0.00   0.00   0.00
                               KIPV      0.009      0.09   0.00   0.00   0.00   0.05

Notes: The left column indicates the index of the cluster in T_R and, in parentheses, the most dominant protein class in it. Given this protein cluster, the second column indicates its most probable feature cluster, tF* = argmax_{tF} q(tF | tR). The next column indicates the probability of this feature cluster given the protein cluster; results are presented only when this value is greater than 0.8, indicating high coupling between the two clusters. We further sort all features by q(f | tF*) and present the top three features in the next column. The last five columns indicate for each of these features its relative frequency in all five classes (estimated as the number of proteins that contain the feature, normalized by the total number of proteins in the class). Clearly, the extracted features are correlated with the biological class associated with tR.
Figure 9: Application of the symmetric aIB to the GST protein data set. The learned protein cluster hierarchy, T_R, is presented from |T_R| = 10 and below. In each cluster, the number of proteins from every class is indicated. For example, in the extreme upper right cluster, there are 39 proteins from the class c3 and a single protein from the unlabeled class, c1. After completing the experiments, we found that 36 of the unlabeled (c1) proteins were recently labeled as Omega class. This class is denoted by c6 in the figure. Notice that all its members were clustered in the three left-most clusters.
Thus, our unsupervised analysis finds clusters that highly correlate with a manual partitioning of the proteins and simultaneously extracts features (subsequences of amino acids) that are good indicators for each such class.

Last, we apply the symmetric aIB to the same data. For comparison purposes, we consider the solution at |T_R| = |T_F| = 10. Here, I(T_R; T_F) = 0.9 nats, clearly inferior to the sIB result. However, the differences are mainly in the feature clusters, T_F; the protein clusters obtained by aIB strongly correlate with the corresponding sIB solution and thus also correlate with the protein class labels. In Figure 9 we present the T_R hierarchy. Notice that many of our clusters correspond to the "unlabeled" c1 class and thus presumably correlate with additional, yet unknown, subclasses in the GST domain. In fact, after completing our experiments, it was brought to our attention that one such new class was recently defined in a different database, the InterPro database (Apweiler et al., 2000). Thirty-six proteins of this new Omega class were present in our data. In Figure 9 we see
that all these proteins were assigned automatically to a single branch in our hierarchy.

6.3 Triplet IB Application. We consider the specification of G_in and G_out^(a) of Figure 2C and use the triplet sIB algorithm on a simple text processing example.

6.3.1 Triplet sIB for Natural Language Processing. Our data consisted of seven Tarzan books by E. R. Burroughs, from which we obtained a sequence of about 600,000 words. We defined three random variables, W_p, W, and W_n, corresponding to the previous, the current, and the next word in the sequence, respectively. For simplicity, we defined W as the set of the 10 most frequent words in our data that are not neutral ("stop") words. Hence, we considered only word triplets in which the middle word was one of these 10 and defined p(w_p, w, w_n) as the relative frequency of a triplet {w_p − w − w_n} among all these triplets.

Given p(w_p, w, w_n), we applied the triplet sIB algorithm to construct two systems of clusters, T_p for the first word in the triplets and T_n for the last word in the triplets, with |T_p| = 10, |T_n| = 10, and β^{-1} = 0, such that I(T_p, T_n; W) is maximized. In 50 different random initializations, the obtained solution always preserved more than 90% of the original information, I(W_p, W_n; W) = 1.6 nats, although the dimensions of the compressed distribution q(T_p, W, T_n) are more than 200 times smaller than those of the original matrix, p(W_p, W, W_n). The best solution preserved about 94% of the original information, and we concentrate on this solution in what follows. For every w ∈ W, a manual examination of the two clusters that maximize q(t_p, w, t_n) indicates that they consist of word pairs that are indicative of the word between them, reflecting how T_p and T_n preserve the information about W (data not shown).

To validate the predictive power of T_p and T_n about W, we examined another book by E. R. Burroughs (The Son of Tarzan), which was not used while estimating p(W_p, W, W_n) and constructing T_p and T_n. In this book, for every occurrence of one of the 10 words in W, we try to predict it using its two neighbors, w_p and w_n. Specifically, w_p and w_n correspond to two clusters, t_p ∈ T_p and t_n ∈ T_n; thus, we predict the in-between word to be ŵ = argmax_w q(w | t_p, t_n). For comparison, we also predict the in-between word from the complete joint statistics, ŵ = argmax_w p(w | w_p, w_n), and while using a single neighbor, ŵ = argmax_w p(w | w_p) and ŵ = argmax_w p(w | w_n). In Table 7 we present the precision and recall (Sebastiani, 2002) for all these prediction schemes. In spite of the significant compression implied by T_p and T_n, the averaged precision of their predictions is similar to that obtained using the original complete joint statistics. In terms of recall, predictions from the triplet IB clusters are even superior to those using the original W_p, W_n variables, since using q(T_p, W, T_n) allows us to make predictions even for triplets that were not included in our training data.
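The prediction rule itself is a single argmax over the compressed joint; a sketch with assumed data structures:

```python
import numpy as np

def predict_middle(w_p, w_n, cluster_p, cluster_n, q_joint, words):
    # cluster_p / cluster_n: dicts mapping neighbor words to their
    # cluster indices t_p in T_p and t_n in T_n.
    # q_joint[t_p, w, t_n]: the compressed joint q(T_p, W, T_n).
    t_p, t_n = cluster_p[w_p], cluster_n[w_n]
    # w_hat = argmax_w q(w | t_p, t_n), i.e., argmax_w q(t_p, w, t_n)
    return words[int(np.argmax(q_joint[t_p, :, t_n]))]
```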
Table 7: Precision and Recall Results for Triplet sIB.

                          Precision                        Recall
W               Tp,Tn   Wp,Wn    Wp      Wn      Tp,Tn   Wp,Wn    Wp      Wn
apeman (33)       5.9     7.4     4.3     1.5     24.2    30.3    81.8     3.0
apes (78)        43.3    25.6    93.6    11.4     16.7    14.1    37.2     6.4
eyes (177)       82.6    80.7    58.0    65.3     32.2    28.3    49.2    18.1
girl (240)       43.3    30.0     0.0    37.5      5.4     1.3     0.0     1.3
great (219)      91.7    92.0    58.0    91.0     50.2    47.5    21.5    55.7
jungle (241)     49.3    53.7     0.0    37.6     27.4    24.1     0.0    18.3
tarzan (48)      41.3    66.7    30.9     7.7     39.6    25.0    60.4    47.9
time (145)       70.4    82.2    70.6    31.1     47.6    25.5    53.1    34.5
two (148)        41.0    92.3    84.6    91.7     10.8     8.1     7.4    14.9
way (101)        59.6    80.8    61.3    61.3     27.7    20.8    18.8    18.8
Microaveraged    53.3    55.4    28.2    34.3     27.9    22.2    22.8    22.5
Notes: The left column indicates the word w ∈ W and, in parentheses, its number of occurrences in the test sequence. The next column presents the precision of the predictions using the triplet sIB cluster statistics, that is, q(W | T_p, T_n). The third column presents the precision using the original joint statistics, that is, p(W | W_p, W_n). The next two columns present the precision using only one word neighbor for the prediction, that is, p(W | W_p) and p(W | W_n), respectively. The last four columns indicate the recall of the predictions under these four prediction schemes. The last row presents the microaveraged precision and recall.
We note that this type of application might be useful in tasks like speech recognition and optical character recognition, where it is not feasible to use the original joint distribution due to its high dimensionality. The triplet IB clusters provide a reasonable alternative that is dramatically less demanding. For biological sequence data, the analysis demonstrated here might be useful for gaining further insight into the data properties.

7 Discussion and Future Research

7.1 An Information-Theoretic Perspective. The traditional approach to the analysis and modeling of empirical data is through generative models and maximum likelihood parameter estimation. However, complex data sets rarely come with their "correct" parametric model. Thus, the choice of the parametric class often involves nonobvious assumptions about the data. Shannon's (1948) information theory represents a radically different approach, which is concerned with two fundamental trade-offs. The first is between compression and distortion and is known as rate distortion theory or (lossy) source coding. The second is between reliable error correction and its cost and is known as the capacity-cost trade-off or channel coding (Cover & Thomas, 1991).
These two problems are dual components of one larger problem: the trade-off between distortion and cost, that is, the minimum average cost required for communication with at most a given average distortion. Shannon was able to break this problem, in the point-to-point communication setting, into a rate-distortion trade-off on one hand and a capacity-cost trade-off on the other. In the former, one minimizes mutual information subject to an average distortion constraint, while in the latter, one maximizes information subject to an average cost constraint. The optimal solution to both problems requires only the specification of the distortion or cost functions, without any assumptions about the nature of the underlying distributions, although these distributions must be accessible. This is in sharp contrast to the statistical modeling approach. Nonetheless, in this work, we use a fundamental concept from the statistical modeling literature, Bayesian networks, to extend the information-theoretic trade-off paradigm.

In the original IB, we balance between losing irrelevant distinctions made by X and maintaining those that are relevant about Y. The first part, minimizing I(T; X), is analogous to rate distortion, while the second part, maximizing I(T; Y), is analogous to channel coding. Tishby et al. (1999) formulated both parts in a single principle and characterized its general solution. Here, we described a natural extension of these ideas, which allows the consideration of more complicated scenarios. This extension required an elevation of the point-to-point communication problem that lies behind the original IB to a network communication setting, a notoriously more difficult and largely unsolved problem. Nonetheless, as we demonstrated, such an extension is both viable and natural. In particular, the source-channel separation that exists in the original IB no longer holds, nor is it needed, as G_in and G_out replace the source and channel terms of the original IB, respectively. It is thus possible that the current work will provide a way around the difficult multiterminal information theory problem, which remains unsolved to date (Cover & Thomas, 1991).

7.2 Finite Data and Sample Complexity Issues. An immediate obstacle for our information-theoretic approach is the assumption that the joint distribution, p(X), is known, as typically we are given only a sample out of this distribution. Several ideas were suggested in this context for the original IB problem and are worth mentioning.

First, finite sample effects become more severe as the complexity of the new representation, T, increases. That is, overfitting is more evident as the number of clusters increases (Pereira et al., 1993; Still & Bialek, 2003). Thus, in particular, one should apply with great care agglomerative algorithms that begin with substantial overfitting for small samples. Having said that, it is worth noting that the agglomerative IB algorithm is supported through the work of Gutman (1989). Specifically, theorem 1 in that work implies
that the JS divergence, used here to determine which clusters to merge, is the optimal criterion for deciding whether two empirical samples came from a single source.

Next, given the empirical evidence, it might be useful to use the least informative distribution as the input to the IB analysis. The notion of least informative distributions, under expectation constraints, is related to maximum entropy methods and has been used recently for dimension reduction (Globerson & Tishby, 2004). One can prove generalization sample complexity bounds for the IB problem within this framework. Finally, it was shown that one can introduce a corrected IB functional in which finite sample effects are taken directly into account for the relevant information term (Still & Bialek, 2003; Atwal & Slonim, 2005). Extending these works in our context is clearly important yet out of the scope of this work.

7.3 Future Research

7.3.1 Choosing the Number of Clusters. A natural question in cluster analysis is the estimation of the "correct" number of clusters. It is important to bear in mind, though, that this question might have more than one proper answer, for example, if there is a natural hierarchical structure in p(X) such that different resolutions convey multiple important insights. The number of clusters we extract is related to the trade-off parameter β: low β values imply a relatively low resolution, while high β values suggest that a large number of clusters should be employed. The deterministic annealing multivariate IB algorithm seems to be most relevant here, since it automatically adjusts the resolutions of the different clustering systems as β is increased. For the original IB problem, a recent rigorous treatment characterizes the maximal β value that can be employed for given data before overfitting effects take place (Still & Bialek, 2003). Extending this work to our context is a natural direction for future research.

7.3.2 Relation to Other Methods and Parametric Multivariate IB. The possible connections with other data analysis methods merit further investigation. The general structure of the multivariate iIB algorithm is reminiscent of EM (Dempster, Laird, & Rubin, 1977). Moreover, there are strong relations between the original IB problem and maximum likelihood estimation for mixture models (Slonim & Weiss, 2002). Hence, it is natural to look for further relationships between generative models and different multivariate IB problems. In particular, formulating new problems within the multivariate IB framework might suggest new generative models that are worth exploring.

Other connections are, for example, to other dimensionality reduction techniques, such as independent component analysis (ICA) (Bell & Sejnowski, 1995). The parallel IB provides an ICA-like decomposition with
an important distinction: in contrast to ICA, it aims at preserving information about predefined aspects of the data, specified through the choice of the relevant variables in G_out.

A parametric variant of our framework might be useful in different situations (Elidan & Friedman, 2003). This issue seems to be better addressed by the structural multivariate IB principle, L^(2). In this formulation, we aim to minimize the KL divergence between q(X, T) and the target class defined by G_out. If we further require a particular parametric form over this class, minimizing this KL corresponds to finding the q(X, T) with minimum violation of the conditional independencies implied by G_out and with the appropriate parametric form. In particular, this means that the number of free parameters can be drastically reduced, avoiding possible redundant solutions.

7.3.3 Multivariate Relevance-Compression Function. In the original IB problem, the trade-off in the IB functional is quantified by the relevance-compression function (also known as the information curve). Given p(X, Y), this concave function bounds the maximal achievable I(T; Y) for any level of compression, I(T; X) (Gilad-Bachrach, Navot, & Tishby, 2003). It is intimately related to the rate distortion function and the capacity-cost function and in a sense unifies them, as discussed earlier. It depends solely on p(X, Y) and characterizes the structure in this joint distribution: the clearer this structure is, the steeper this curve becomes.

Analogously, given p(X), we may consider the multivariate relevance-compression function as the two-dimensional optimal curve of the maximally attainable I^{G_out} for any level of I^{G_in}. From the variational principle, L^(1), it follows that the slope of this curve is β^{-1}. Thus, assuming that this curve is differentiable, it must be downward concave, as in the original IB case. Importantly, though, this function depends on the specification of G_in and G_out. That is, given p(X), there are many different such functions that characterize the structure in this joint distribution in multiple ways.

7.3.4 How to Specify G_in and G_out. The underlying assumption in our formulation is that G_in and G_out are provided as part of the problem setup. However, specifying these two networks might be far from trivial. For example, in the parallel IB case, where T = {T_1, ..., T_k}, setting k can be seen as a model selection task, and certainly not an easy one. Thus, an important issue is to develop automatic methods for specifying both networks. Possible guidance can come from the multivariate relevance-compression function; specifically, it seems plausible to prefer specifications that yield steeper relevance-compression curves. This issue clearly deserves further investigation.

7.4 Conclusion. Our formulation corresponds to a rich family of optimization problems that are all unified under the same information-theoretic
framework. In particular, it allows one to extract structure from data in many different ways. In this work, we focused on three examples: parallel IB, symmetric IB, and triplet IB. However, we believe that this is only the tip of the iceberg. An immediate corollary of our analysis is that the general term of clustering conceals a broad family of many distinct problems that deserve special consideration. To the best of our knowledge, the multivariate IB framework described in this work is the first successful attempt to define these subproblems, solve them, and demonstrate their importance.

Appendix A: Proofs

Proof of Proposition 1. Using the multi-information definition in equation 3.1 and the fact that p(X) |= G, we get

$$\mathcal{I}(\mathbf{X}) = E_{p(\mathbf{x})}\left[\log\frac{p(\mathbf{x})}{p(x_1)\cdots p(x_n)}\right] = E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} p(x_i)}\right] = \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{p(x_i)}\right] = \sum_{i=1}^{n} I(X_i; \mathbf{Pa}_{X_i}^G).$$
Proof of Proposition 2.

$$\begin{aligned}
D_{KL}[p \| G] &= \min_{q \models G} E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)}\right] \\
&= \min_{q \models G}\left\{ E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] + E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} q(x_i \mid \mathbf{pa}_{X_i}^G)}\right]\right\} \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] + \min_{q \models G} \sum_{i=1}^{n} \sum_{x_i, \mathbf{pa}_{X_i}^G} p(\mathbf{pa}_{X_i}^G)\, p(x_i \mid \mathbf{pa}_{X_i}^G) \log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{q(x_i \mid \mathbf{pa}_{X_i}^G)} \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] + \min_{q \models G} \sum_{i=1}^{n} \sum_{\mathbf{pa}_{X_i}^G} p(\mathbf{pa}_{X_i}^G)\, D_{KL}\!\left[p(x_i \mid \mathbf{pa}_{X_i}^G) \,\big\|\, q(x_i \mid \mathbf{pa}_{X_i}^G)\right],
\end{aligned}$$

and since the right term is nonnegative and equals zero if and only if we choose q(x_i | pa_{X_i}^G) = p(x_i | pa_{X_i}^G), we get the desired result.

Proof of Proposition 3. We use proposition 2:

$$\begin{aligned}
D_{KL}[p \| G] &= \min_{q \models G} E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)}\right] = E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] = \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid x_1, \ldots, x_{i-1})}{p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= \sum_{i=1}^{n} I\!\left(X_i; \{X_1, \ldots, X_{i-1}\} \setminus \mathbf{Pa}_{X_i}^G \,\middle|\, \mathbf{Pa}_{X_i}^G\right),
\end{aligned}$$

where we used the consistency of the order X_1, ..., X_n with the order of the DAG G. To prove the second part of the proposition, we note that

$$\begin{aligned}
D_{KL}[p \| G] &= E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1, \ldots, x_n)}{\prod_{i=1}^{n} p(x_i)}\right] - E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} p(x_i)}\right] \\
&= \mathcal{I}(\mathbf{X}) - \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{p(x_i)}\right] = \mathcal{I}(\mathbf{X}) - \sum_{i=1}^{n} I(X_i; \mathbf{Pa}_{X_i}^G).
\end{aligned}$$
Proof of Theorem 1. The basic idea is to find the stationary points of L^(1) subject to the normalization constraints. Thus, we add Lagrange multipliers and use definition 1 to get the Lagrangian

$$\tilde{\mathcal{L}}[q(\mathbf{X}, \mathbf{T})] = \sum_{\ell=1}^{k} I(T_\ell; \mathbf{U}_\ell) - \beta\left[\sum_{i=1}^{n} I(X_i; \mathbf{V}_{X_i}) + \sum_{\ell=1}^{k} I(T_\ell; \mathbf{V}_{T_\ell})\right] + \sum_{u_\ell, t_\ell} \lambda(u_\ell)\, q(t_\ell \mid u_\ell), \tag{A.1}$$

where we drop terms that depend only on the observed variables X. To differentiate L̃ with respect to a specific parameter q(t_j | u_j), we use the following two lemmas. In the proofs of these two lemmas, we assume that q(X, T) |= G_in and that the T variables are all leaves in G_in.

Lemma 1. Under the above normalization constraints, for every event a over X ∪ T (that is, a is some assignment to some subset of X ∪ T), we have

$$\frac{\partial q(a)}{\partial q(t_j \mid u_j)} = q(u_j)\, q(a \mid t_j, u_j). \tag{A.2}$$

Proof. Let Z denote all the random variables in X ∪ T such that their values are not set by the event a. In the following, the notation ∑_{z,a} q(x, t) means that the sum is only over the variables in Z (where the others are set through a). Since q(X, T) |= G_in with the T variables as leaves, q(x, t) = p(x) ∏_{ℓ=1}^k q(t_ℓ | u_ℓ), so

$$\frac{\partial q(a)}{\partial q(t_j \mid u_j)} = \frac{\partial}{\partial q(t_j \mid u_j)} \sum_{z,a} q(\mathbf{x}, \mathbf{t}) = \frac{\partial}{\partial q(t_j \mid u_j)} \sum_{z,a} p(\mathbf{x}) \prod_{\ell=1}^{k} q(t_\ell \mid u_\ell) = \sum_{z,a} p(\mathbf{x})\, \frac{\partial}{\partial q(t_j \mid u_j)} \prod_{\ell=1}^{k} q(t_\ell \mid u_\ell).$$

Clearly, the derivatives are nonzero only for terms in which T_j = t_j and U_j = u_j. For each such term, the derivative of the product is simply ∏_{ℓ=1, ℓ≠j}^k q(t_ℓ | u_ℓ). Dividing and multiplying every such term by q(t_j | u_j), we obtain

$$\frac{\partial q(a)}{\partial q(t_j \mid u_j)} = \frac{1}{q(t_j \mid u_j)} \sum_{z \setminus \{t_j, u_j\},\, a,\, t_j,\, u_j} p(\mathbf{x}) \prod_{\ell=1}^{k} q(t_\ell \mid u_\ell) = \frac{q(a, t_j, u_j)}{q(t_j \mid u_j)} = q(u_j)\, q(a \mid t_j, u_j).$$

Using this lemma, we get:
Lemma 2. For every Y, Z ⊆ X ∪ T,

$$\frac{\partial I(\mathbf{Y}; \mathbf{Z})}{\partial q(t_j \mid u_j)} = q(u_j)\left[\sum_{\mathbf{y}, \mathbf{z}} q(\mathbf{y}, \mathbf{z} \mid t_j, u_j) \log\frac{q(\mathbf{y} \mid \mathbf{z})}{q(\mathbf{y})} - 1\right]. \tag{A.3}$$

Proof.

$$\begin{aligned}
\frac{\partial I(\mathbf{Y}; \mathbf{Z})}{\partial q(t_j \mid u_j)} &= \sum_{\mathbf{y}, \mathbf{z}} \log\frac{q(\mathbf{y} \mid \mathbf{z})}{q(\mathbf{y})}\, \frac{\partial q(\mathbf{y}, \mathbf{z})}{\partial q(t_j \mid u_j)} + \sum_{\mathbf{y}, \mathbf{z}} \frac{\partial q(\mathbf{y}, \mathbf{z})}{\partial q(t_j \mid u_j)} \\
&\quad - \sum_{\mathbf{y}, \mathbf{z}} q(\mathbf{z} \mid \mathbf{y})\, \frac{\partial q(\mathbf{y})}{\partial q(t_j \mid u_j)} - \sum_{\mathbf{y}, \mathbf{z}} q(\mathbf{y} \mid \mathbf{z})\, \frac{\partial q(\mathbf{z})}{\partial q(t_j \mid u_j)}.
\end{aligned}$$
Applying lemma 1 to each of these derivatives, we get the desired result.

We can now differentiate each mutual information term that appears in L̃ of equation A.1. Note that we can ignore terms that do not depend on the value of T_j, since these are constants with respect to q(t_j | u_j). Therefore, by taking the derivative and equating it to zero, we get

$$\begin{aligned}
\log q(t_j \mid u_j) = \log q(t_j) - \beta\Bigg[&\sum_{i: T_j \in \mathbf{V}_{X_i}} \sum_{\mathbf{v}_{X_i}^{-j},\, x_i} q(\mathbf{v}_{X_i}^{-j} \mid u_j)\, q(x_i \mid \mathbf{v}_{X_i}^{-j}, u_j) \log\frac{p(x_i)}{q(x_i \mid \mathbf{v}_{X_i}^{-j}, t_j)} \\
&+ \sum_{\ell: T_j \in \mathbf{V}_{T_\ell}} \sum_{\mathbf{v}_{T_\ell}^{-j},\, t_\ell} q(\mathbf{v}_{T_\ell}^{-j} \mid u_j)\, q(t_\ell \mid \mathbf{v}_{T_\ell}^{-j}, u_j) \log\frac{q(t_\ell)}{q(t_\ell \mid \mathbf{v}_{T_\ell}^{-j}, t_j)} \\
&+ \sum_{\mathbf{v}_{T_j}} q(\mathbf{v}_{T_j} \mid u_j) \log\frac{q(\mathbf{v}_{T_j})}{q(\mathbf{v}_{T_j} \mid t_j)}\Bigg] + c(u_j), \tag{A.4}
\end{aligned}$$

where c(u_j) is a term that depends only on u_j. To get the desired KL form, we add and subtract

$$\sum_{\mathbf{v}_{X_i}^{-j},\, x_i} q(\mathbf{v}_{X_i}^{-j} \mid u_j)\, q(x_i \mid \mathbf{v}_{X_i}^{-j}, u_j) \log\frac{q(x_i \mid \mathbf{v}_{X_i}^{-j}, u_j)}{p(x_i)} \tag{A.5}$$

for every term in the first outer summation. Note again that this is possible since we can absorb into c(u_j) every expression that depends only on u_j. A similar transformation applies to the other two summations on the right-hand side of equation A.4. Hence, we end up with

$$\begin{aligned}
\log q(t_j \mid u_j) = \log q(t_j) - \beta\Bigg[&\sum_{i: T_j \in \mathbf{V}_{X_i}} E_{q(\cdot \mid u_j)}\!\left[D_{KL}\!\left[q(X_i \mid \mathbf{v}_{X_i}^{-j}, u_j) \,\big\|\, q(X_i \mid \mathbf{v}_{X_i}^{-j}, t_j)\right]\right] \\
&+ \sum_{\ell: T_j \in \mathbf{V}_{T_\ell}} E_{q(\cdot \mid u_j)}\!\left[D_{KL}\!\left[q(T_\ell \mid \mathbf{v}_{T_\ell}^{-j}, u_j) \,\big\|\, q(T_\ell \mid \mathbf{v}_{T_\ell}^{-j}, t_j)\right]\right] \\
&+ D_{KL}\!\left[q(\mathbf{V}_{T_j} \mid u_j) \,\big\|\, q(\mathbf{V}_{T_j} \mid t_j)\right]\Bigg] + c(u_j). \tag{A.6}
\end{aligned}$$
Finally, taking the exponent and applying the normalization constraints for each distribution q(t_j | u_j) completes the proof.

Proof of Theorem 2. The following lemma allows drawing the connection between T_j and the other variables after every merger.

Lemma 3. Let Y, Z ⊂ X ∪ T. Then

$$q(\mathbf{z}, \bar{t}_j) = q(\mathbf{z}, t_j^l) + q(\mathbf{z}, t_j^r), \tag{A.7}$$

and

$$q(\mathbf{y} \mid \mathbf{z}, \bar{t}_j) = \pi_{l,\mathbf{z}} \cdot q(\mathbf{y} \mid \mathbf{z}, t_j^l) + \pi_{r,\mathbf{z}} \cdot q(\mathbf{y} \mid \mathbf{z}, t_j^r). \tag{A.8}$$

Proof. We use the following notations: W = Z ∩ U_j, Z^{-W} = Z \ {W}, U_j^{-W} = U_j \ {W}. Note that in principle, it might be that W = ∅. For every z, t̄_j we have

$$q(\mathbf{z}, \bar{t}_j) = q(\mathbf{z})\, q(\bar{t}_j \mid \mathbf{z}) = q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w} \mid \mathbf{z})\, q(\bar{t}_j \mid \mathbf{z}^{-w}, \mathbf{w}, \mathbf{u}_j^{-w}) = q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w} \mid \mathbf{z})\, q(\bar{t}_j \mid \mathbf{u}_j),$$

where in the last step we used the structure of G_in and the fact that Z^{-W} ∩ U_j = ∅. Using equation 5.5, we find that

$$q(\mathbf{z}, \bar{t}_j) = q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w} \mid \mathbf{z})\left[ q(t_j^l \mid \mathbf{u}_j) + q(t_j^r \mid \mathbf{u}_j) \right] = q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w} \mid \mathbf{z})\left[ q(t_j^l \mid \mathbf{z}^{-w}, \mathbf{w}, \mathbf{u}_j^{-w}) + q(t_j^r \mid \mathbf{z}^{-w}, \mathbf{w}, \mathbf{u}_j^{-w}) \right],$$

where again we used the structure of G_in. Since Z = Z^{-W} ∪ {W}, we get

$$q(\mathbf{z}, \bar{t}_j) = q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}}\left[ q(\mathbf{u}_j^{-w}, t_j^l \mid \mathbf{z}) + q(\mathbf{u}_j^{-w}, t_j^r \mid \mathbf{z}) \right] = q(\mathbf{z}, t_j^l) + q(\mathbf{z}, t_j^r),$$

as required. To prove the second part, we first note that if q(z, t̄_j) = 0, then both sides of equation A.8 are trivially equal; thus, we assume that this is not the case. Then, for every y, z, t̄_j, we have

$$q(\mathbf{y} \mid \mathbf{z}, \bar{t}_j) = \frac{q(\mathbf{y}, \mathbf{z}, \bar{t}_j)}{q(\mathbf{z}, \bar{t}_j)} = \frac{q(\mathbf{y}, \mathbf{z}, t_j^l) + q(\mathbf{y}, \mathbf{z}, t_j^r)}{q(\mathbf{z}, \bar{t}_j)} = \frac{q(t_j^l \mid \mathbf{z})}{q(\bar{t}_j \mid \mathbf{z})}\, q(\mathbf{y} \mid \mathbf{z}, t_j^l) + \frac{q(t_j^r \mid \mathbf{z})}{q(\bar{t}_j \mid \mathbf{z})}\, q(\mathbf{y} \mid \mathbf{z}, t_j^r).$$

Hence, from equation 5.6, we get the desired form.

Next, we need the following simple lemma. Recall that we denote by T_j^{bef}, T_j^{aft} the random variables that correspond to T_j before and after the merger, respectively. Let V = V^{-j} ∪ T_j be a set of random variables that includes T_j, and let V^{bef} = V^{-j} ∪ T_j^{bef}, and similarly for V^{aft}. Let Y be a set of random variables such that T_j ∉ Y. Using these notations, we have:

Lemma 4. The reduction of the mutual information I(Y; V) due to the merger {t_j^l, t_j^r} ⇒ t̄_j is given by

$$\Delta I(\mathbf{Y}; \mathbf{V}) \equiv I(\mathbf{Y}; \mathbf{V}^{bef}) - I(\mathbf{Y}; \mathbf{V}^{aft}) = q(\bar{t}_j) \cdot E_{q(\cdot \mid \bar{t}_j)}\!\left[ JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y} \mid t_j^l, \mathbf{v}^{-j}),\, q(\mathbf{Y} \mid t_j^r, \mathbf{v}^{-j}) \right] \right].$$
Proof. Using the chain rule for mutual information (Cover & Thomas, 1991, p. 22), we get

$$\Delta I(\mathbf{Y}; \mathbf{V}) = I(\mathbf{V}^{-j}; \mathbf{Y}) + I(T_j^{bef}; \mathbf{Y} \mid \mathbf{V}^{-j}) - I(\mathbf{V}^{-j}; \mathbf{Y}) - I(T_j^{aft}; \mathbf{Y} \mid \mathbf{V}^{-j}) = I(T_j^{bef}; \mathbf{Y} \mid \mathbf{V}^{-j}) - I(T_j^{aft}; \mathbf{Y} \mid \mathbf{V}^{-j}).$$

From equation 5.5, we find that

$$\Delta I(\mathbf{Y}; \mathbf{V}) = \sum_{\mathbf{v}^{-j}} q(\mathbf{v}^{-j})\, \Delta I(\mathbf{v}^{-j}),$$

where we used the notation

$$\begin{aligned}
\Delta I(\mathbf{v}^{-j}) = &\sum_{\mathbf{y}} q(t_j^l, \mathbf{y} \mid \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^l, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \mathbf{v}^{-j})} + \sum_{\mathbf{y}} q(t_j^r, \mathbf{y} \mid \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^r, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \mathbf{v}^{-j})} \\
&- \sum_{\mathbf{y}} q(\bar{t}_j, \mathbf{y} \mid \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \mathbf{v}^{-j})}.
\end{aligned}$$

Using lemma 3 (with Z = Y ∪ V^{-j}), we obtain

$$q(\bar{t}_j, \mathbf{y} \mid \mathbf{v}^{-j}) = q(t_j^l, \mathbf{y} \mid \mathbf{v}^{-j}) + q(t_j^r, \mathbf{y} \mid \mathbf{v}^{-j}).$$

Setting this in the previous equation, we get

$$\begin{aligned}
\Delta I(\mathbf{v}^{-j}) &= \sum_{\mathbf{y}} q(t_j^l, \mathbf{y} \mid \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^l, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j})} + \sum_{\mathbf{y}} q(t_j^r, \mathbf{y} \mid \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^r, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j})} \\
&= q(\bar{t}_j \mid \mathbf{v}^{-j}) \cdot \pi_{l,\mathbf{v}^{-j}} \sum_{\mathbf{y}} q(\mathbf{y} \mid t_j^l, \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^l, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j})} + q(\bar{t}_j \mid \mathbf{v}^{-j}) \cdot \pi_{r,\mathbf{v}^{-j}} \sum_{\mathbf{y}} q(\mathbf{y} \mid t_j^r, \mathbf{v}^{-j}) \log\frac{q(\mathbf{y} \mid t_j^r, \mathbf{v}^{-j})}{q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j})}.
\end{aligned}$$

However, using lemma 3 again, we see that

$$q(\mathbf{y} \mid \bar{t}_j, \mathbf{v}^{-j}) = \pi_{l,\mathbf{v}^{-j}} \cdot q(\mathbf{y} \mid t_j^l, \mathbf{v}^{-j}) + \pi_{r,\mathbf{v}^{-j}} \cdot q(\mathbf{y} \mid t_j^r, \mathbf{v}^{-j}).$$

Therefore, using the JS definition in equation 5.8, we get

$$\Delta I(\mathbf{v}^{-j}) = q(\bar{t}_j \mid \mathbf{v}^{-j}) \cdot JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y} \mid t_j^l, \mathbf{v}^{-j}),\, q(\mathbf{Y} \mid t_j^r, \mathbf{v}^{-j}) \right].$$

Setting this back in the expression for ΔI(Y; V), we get

$$\Delta I(\mathbf{Y}; \mathbf{V}) = \sum_{\mathbf{v}^{-j}} q(\mathbf{v}^{-j})\, q(\bar{t}_j \mid \mathbf{v}^{-j}) \cdot JS_{\mathbf{v}^{-j}}[\cdots] = q(\bar{t}_j) \cdot E_{q(\cdot \mid \bar{t}_j)}\!\left[ JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y} \mid t_j^l, \mathbf{v}^{-j}),\, q(\mathbf{Y} \mid t_j^r, \mathbf{v}^{-j}) \right] \right].$$

Using this lemma, we now prove the theorem. Note that the only information terms in L = I^{G_out} − β^{-1} I^{G_in} that change due to a merger in T_j are those that involve T_j. Therefore,
$$\Delta\mathcal{L}(t_j^l, t_j^r) = \sum_{i: T_j \in \mathbf{V}_{X_i}} \Delta I(X_i; \mathbf{V}_{X_i}) + \sum_{\ell: T_j \in \mathbf{V}_{T_\ell}} \Delta I(T_\ell; \mathbf{V}_{T_\ell}) + \Delta I(T_j; \mathbf{V}_{T_j}) - \beta^{-1} \Delta I(T_j; \mathbf{U}_j). \tag{A.9}$$
Applying lemma 4 to each of these information terms, we get the desired form.

Appendix B: Implementation and Preprocessing Details

In this appendix we describe the details of the implementation and the preprocessing applied in our examples. In several cases, in order to avoid too high a dimensionality, we apply feature selection by information gain before the clustering. Specifically, given a joint distribution, p(X, Y), we sort all X values by their contribution to I(X; Y), namely p(x) ∑_y p(y | x) log [p(y | x) / p(y)], and select only the top-sorted values for further analysis. A sketch of this selection step is given below.
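The sketch assumes the joint p(X, Y) is a normalized 2D numpy array with X values as rows; function and variable names are illustrative, not from the authors' implementation.

```python
import numpy as np

def top_values_by_information_gain(p, n_keep):
    # Returns the indices of the n_keep rows (X values) with the largest
    # contribution p(x) * sum_y p(y|x) log[ p(y|x) / p(y) ] to I(X;Y).
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # p / (px * py) = p(y|x) / p(y); set it to 1 where p(x,y) = 0
        # so that the corresponding log term contributes nothing.
        ratio = np.where(p > 0, p / (px * py), 1.0)
    contrib = (p * np.log(ratio)).sum(axis=1)
    return np.argsort(contrib)[::-1][:n_keep]
```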
Parallel sIB for Style Versus Topic Text Clustering

- All books were downloaded from the Gutenberg Project, http://promo.net/pg/.
- Uppercase characters were lowered, digits were united into one symbol, and nonalphanumeric characters were ignored.
- Each book was split into "documents" consisting of 200 successive words each, ending up with 1346 documents and 15,528 distinct words.
- After normalization, we got an estimated joint distribution, p(D, W), where p(D) is uniform and each entry indicates the probability that a random word position is equal to w ∈ W while the document is d ∈ D.
- We applied the parallel sIB algorithm to this p(D, W) with T = {T_1, T_2}, |T_j| = 2, and 5 different initializations per T_j.
Parallel sIB for Gene Expression Data Analysis

- We used the gene expression measurements of approximately 6800 human genes in 72 samples of leukemia (Golub et al., 1999).
- We removed about 1500 genes that were not expressed in the data and normalized the measurements of the remaining 5288 genes in each sample independently, to get an estimated joint distribution p(S, G) over samples and genes, with uniform p(S).
- We sorted all genes by their contribution to I(S; G) and selected the 500 most informative ones.
- After renormalization of the measurements in each sample, we ended up with an estimated joint distribution, p(S, G), with |S| = 72, |G| = 500, and p(S) = 1/|S|.
- We applied the parallel sIB algorithm to this p(S, G) with T = {T_1, ..., T_4}, |T_j| = 2, and 5 different initializations per T_j.
Symmetric dIB and iIB for Word-Topic Clustering

- We used the 20-news-groups corpus (Lang, 1995), which contains about 20,000 documents and messages distributed among 20 discussion groups, or topics.
- We removed all file headers, lowered uppercase characters, united digits into one symbol, and ignored nonalphanumeric characters.
- We further removed stop words and words with only one occurrence, ending up with a counts matrix of |D| = 19,997 documents versus |W| = 74,000 unique words.
- By summing the word counts of all the documents in each topic and applying a simple normalization, we extracted an estimated joint distribution, p(W, C), of words versus topics with |C| = 20.
- We sorted all words by their contribution to I(W; C) and selected the 200 most informative ones. After renormalization, we ended up with a joint distribution with |W| = 200, |C| = 20.
- We applied the symmetric dIB algorithm to this joint distribution. Increasing β was done through β^{(r+1)} = (1 + ε_β)β^{(r)}, with ε_β = 0.001 and β^{(0)} = ε_β. The split detection parameter was set to 1/β; that is, as β increases, the algorithm becomes more liberal in declaring cluster splits. The scaling factor for the stochastic duplication was α = 0.005. However, before the first split, we used α = 0.95, so as to avoid the attractor of the trivial fixed point, q(T_j | U_j) = q(T_j). (A toy illustration of the β schedule follows this list.)
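The following is a toy restatement of the geometric β schedule quoted above, not the full dIB split-detection logic:

```python
def beta_schedule(eps_beta=0.001, n_steps=5):
    # beta^(0) = eps_beta; beta^(r+1) = (1 + eps_beta) * beta^(r).
    beta = eps_beta
    for _ in range(n_steps):
        yield beta
        beta *= 1.0 + eps_beta

# list(beta_schedule()) -> [0.001, 0.001001, 0.001002001, ...]
```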
Symmetric sIB and aIB for Protein Sequence Analysis

- Each protein was represented as a count vector over all the 38,421 different 4-mers of amino acids present in the data.
- After normalizing the counts for each protein independently, we got a joint distribution p(R, F) with p(R) = 1/|R|.
- We sorted all features by their contribution to I(R; F) and selected the top 2,000.
- After renormalization, we ended up with a joint distribution p(R, F) with |R| = 421, |F| = 2,000, and p(R) = 1/|R|.
- In the sIB algorithm, we used for the initialization the strategy described in Slonim and Tishby (2000). We randomly initialize only T_F and optimize it using the original sIB algorithm (Slonim et al., 2002), such that I(T_F; R) is maximized. Given this T_F, we randomly initialize T_R and again use the original sIB algorithm to optimize it such that I(T_R; T_F) is maximized. We use these two solutions as the initialization of the symmetric sIB algorithm and continue by the general framework described in Figure 6 until convergence is attained. We repeat this procedure 100 times and select the solution that maximizes I(T_R; T_F).
Triplet sIB for Natural Language Processing

- The seven Tarzan books, available from the Gutenberg Project (http://promo.net/pg/), were Tarzan and the Jewels of Opar, Tarzan of the Apes, Tarzan the Terrible, Tarzan the Untamed, The Beasts of Tarzan, The Jungle Tales of Tarzan, and The Return of Tarzan.
- We followed the same preprocessing as in section 6.1.1, ending up with a sequence of 580,806 words taken from a vocabulary of 19,458 distinct words.
- The 10 most frequent words in the above books that are not stop words (i.e., the W values) were apeman, apes, eyes, girl, great, jungle, tarzan, time, two, and way.
- After removing triplets with fewer than three occurrences, we had 672 different triplets with a total of 4,479 occurrences. The number of distinct first words was 90, and the number of distinct last words was 233. Thus, after a simple normalization, we had an estimated joint distribution, p(W_p, W, W_n), with |W_p| = 90, |W| = 10, |W_n| = 233.
- As in the symmetric IB, we first randomly initialize T_p and optimize it using the original sIB algorithm (Slonim et al., 2002), such that I(T_p; W) is maximized. Similarly, we find T_n such that I(T_n; W) is maximized. Using these initializations and the general scheme described in Figure 6, we optimize both systems of clusters until they converge to a local maximum of I(T_p, T_n; W). We repeat this procedure for 50 different initializations to extract different locally optimal solutions.
- In the predictions over the new sequence (The Son of Tarzan), for each of the 10 words in W, we define the following quantities: A_1(w) is the number of w occurrences correctly predicted as w (true positives); A_2(w) is the number of words incorrectly predicted as w (false positives); A_3(w) is the number of w occurrences incorrectly not predicted as w (false negatives). The precision and recall for w are then defined as Prec(w) = A_1(w) / (A_1(w) + A_2(w)) and Rec(w) = A_1(w) / (A_1(w) + A_3(w)), and the microaveraged precision and recall are defined by (Sebastiani, 2002), as in equation B.1 (see the sketch after this list):

$$\langle Prec \rangle = \frac{\sum_w A_1(w)}{\sum_w \left[ A_1(w) + A_2(w) \right]}, \qquad \langle Rec \rangle = \frac{\sum_w A_1(w)}{\sum_w \left[ A_1(w) + A_3(w) \right]}. \tag{B.1}$$
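A small computational sketch of equation B.1, with an assumed data layout:

```python
def micro_precision_recall(counts):
    # counts: dict mapping each word w to (A1, A2, A3) =
    # (true positives, false positives, false negatives).
    a1 = sum(v[0] for v in counts.values())
    a2 = sum(v[1] for v in counts.values())
    a3 = sum(v[2] for v in counts.values())
    return a1 / (a1 + a2), a1 / (a1 + a3)
```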
Appendix C: Notations

Capital letters, {X, Y, T}, denote names of random variables. Lowercase letters, {x, y, t}, denote specific values taken by these variables. p(X) denotes the probability distribution function, and p(x) denotes the specific (scalar) value p(X = x), the probability that the assignment to the random variable X is the specific value x. We further use X ∼ p(X) to denote that X is distributed according to p(X). Probability distributions that are given as input and do not change during the analysis are denoted by p(·), while probability distributions that involve changeable parameters are denoted by q(·).

Calligraphic notations, {X, Y, T}, denote the spaces to which the values of the random variables belong. Thus, X is the set of all possible values (or assignments) of X. The notation ∑_x means summation over all x ∈ X, and |X| stands for the cardinality of X. For simplicity, we limit the discussion to discrete random variables with a finite number of possible values.

Sets of random variables are denoted by boldface capital letters {X, T}, and specific values taken by those sets by boldface lowercase letters {x, t}.
The boldface calligraphic notation T stands for the set of all possible assignments to T.

Acknowledgments

Insightful discussions with Ori Mosenzon are greatly appreciated. We thank Gill Bejerano, who prepared the GST proteins data set and brought to our attention the existence of the new Omega class in these data. We thank Esther Singer and Michal Rosen-Zvi for comments on previous drafts of this article. This work was supported in part by the Israel Science Foundation (ISF), the Israeli Ministry of Science, and the US-Israel Bi-National Science Foundation. N.S. was also supported by an Eshkol fellowship. N.F. was also supported by an Alon fellowship and the Harry and Abe Sherman Senior Lectureship in Computer Science. Experiments reported in this work were run on equipment funded by an ISF Basic Equipment Grant.

References

Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M. D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N. J., Oinn, T. M., Pagni, M., Servant, F., Sigrist, C. J., & Zdobnov, E. M. (2000). InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145–1150.

Attwood, T. K., Croning, M. D. R., Flower, D. R., Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J., & Wright, W. (2000). PRINTS-S: The database formerly known as PRINTS. Nucl. Acids Res., 28, 225–227.

Atwal, G. S., & Slonim, N. (2005). Information bottleneck with finite samples. Unpublished manuscript.

Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Csiszár, I., & Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, suppl. 1, 205–237.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Elidan, G., & Friedman, N. (2003). The information bottleneck expectation maximization algorithm. In Proc. of the 19th Conf. on Uncertainty in Artificial Intelligence (UAI-19). San Mateo, CA: Morgan Kaufmann.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. of Uncertainty in Artificial Intelligence (UAI-17). San Mateo, CA: Morgan Kaufmann.

Gilad-Bachrach, R., Navot, A., & Tishby, N. (2003). An information theoretic tradeoff between complexity and accuracy. In The Sixteenth Annual Conference on Learning Theory (COLT). New York: Springer.

Globerson, A., & Tishby, N. (2004). The minimum information principle in discriminative learning. In C. Meek, M. Chickering, & J. Halpern (Eds.), Uncertainty in artificial intelligence. Banff, Canada: AUAI Press.

Golub, T., Slonim, D., Tamayo, P., Huard, C. M., Caasenbeek, J. M., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., & Lander, E. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Gutman, M. (1989). Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory, 35(2), 401–408.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57). New York: ACM.

Lang, K. (1995). Learning to filter netnews. In Proc. of the 12th International Conf. on Machine Learning (ICML). San Mateo, CA: Morgan Kaufmann.

Parker, E. A., Gedeon, T., & Dimitrov, A. G. (2002). Annealing and the rate distortion problem. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 969–976). Cambridge, MA: MIT Press.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco: Morgan Kaufmann.

Pereira, F. C., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics. New York: ACM.

Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.

Schriebman, A. (2000). Stochastic modeling for efficient computation of information theoretic quantities. Unpublished doctoral dissertation, Hebrew University, Jerusalem, Israel. Available online at http://www.cs.huji.ac.il/labs/learning/Theses/theses list.html.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27, 379–423, 623–656.

Slonim, N. (2002). The information bottleneck: Theory and applications. Unpublished doctoral dissertation, Hebrew University, Jerusalem, Israel.

Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. In Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM.
Slonim, N., Somerville, R., Tishby, N., & Lahav, O. (2001). Objective classification of galaxy spectra using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, 323, 270–284.

Slonim, N., & Tishby, N. (1999). Agglomerative information bottleneck. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 617–623). Cambridge, MA: MIT Press.

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 208–215). New York: ACM.

Slonim, N., & Weiss, Y. (2002). Maximum likelihood and the information bottleneck. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Still, S., & Bialek, W. (2003). How many clusters? Physics/030101.

Studeny, M., & Vejnarova, J. (1998). The multi-information function as a tool for measuring stochastic dependence. In M. I. Jordan (Ed.), Learning in graphical models (pp. 261–298). Dordrecht: Kluwer.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication and Computation.

Tishby, N., & Slonim, N. (2000). Data clustering by Markovian relaxation and the information bottleneck method. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Received February 25, 2005; accepted September 15, 2005.
LETTER
Communicated by Manfred Opper
Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors

Mark Girolami
[email protected]
Simon Rogers
[email protected]
Department of Computing Science, University of Glasgow, Glasgow, Scotland
It is well known in the statistics literature that augmenting binary and polychotomous response models with gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favor of gaussian process (GP) priors over functions, and employing variational approximations to the full posterior, we obtain efficient computational methods for GP classification in the multiclass setting.¹ The model augmentation with additional latent variables ensures full a posteriori class coupling while retaining the simple a priori independent GP covariance structure from which sparse approximations, such as multiclass informative vector machines (IVM), emerge in a natural and straightforward manner. This is the first time that a fully variational Bayesian treatment for multiclass GP classification has been developed without having to resort to additional explicit approximations to the nongaussian likelihood term. Empirical comparisons with exact analysis via Markov chain Monte Carlo (MCMC) and with Laplace approximations illustrate the utility of the variational approximation as a computationally economic alternative to full MCMC; it is also shown to be more accurate than the Laplace approximation.

1 Introduction

Albert and Chib (1993) were the first to show that by augmenting binary and multinomial probit regression models with a set of continuous latent variables y_k, corresponding to the kth response value, where y_k = m_k + ε, ε ∼ N(0, 1), and m_k = ∑_j β_{kj} x_j, an exact Bayesian analysis can be performed by Gibbs sampling from the parameter posterior. As an example, consider binary probit regression on target variables t_n ∈ {0, 1};
1 Matlab code to allow replication of the reported results is available online at http://www.dcs.gla.ac.uk/people/personal/girolami/pubs 2005/VBGP/index.htm.
Neural Computation 18, 1790–1817 (2006)
C 2006 Massachusetts Institute of Technology
the probit likelihood for the nth data sample taking unit value (t_n = 1) is P(t_n = 1|x_n, β) = Φ(β^T x_n), where Φ is the standardized normal cumulative distribution function (CDF). This can be obtained by the following marginalization, ∫ P(t_n = 1, y_n|x_n, β) dy_n = ∫ P(t_n = 1|y_n) p(y_n|x_n, β) dy_n, and by definition P(t_n = 1|y_n) = δ(y_n > 0), so we see that the required marginal is simply the normalizing constant of a left-truncated univariate gaussian, so that P(t_n = 1|x_n, β) = ∫ δ(y_n > 0) N_{y_n}(β^T x_n, 1) dy_n = Φ(β^T x_n). The key observation here is that working with the joint distribution P(t_n = 1, y_n|x_n, β) = δ(y_n > 0) N_{y_n}(β^T x_n, 1) provides a straightforward means of Gibbs sampling from the parameter posterior, which would not be the case if the marginal term, Φ(β^T x_n), was employed in defining the joint distribution over data and parameters. This data augmentation strategy can be adopted in developing efficient methods to obtain binary and multiclass gaussian process (GP) (Williams & Rasmussen, 1996) classifiers, as will be presented in this article. With the exception of Neal (1998), where a full Markov chain Monte Carlo (MCMC) treatment of GP-based classification is provided, all other approaches have focused on methods to approximate the problematic form of the posterior,2 which allow analytic marginalization to proceed. Laplace approximations to the posterior were developed in Williams and Barber (1998), and lower- and upper-bound quadratic likelihood approximations were considered in Gibbs and MacKay (2000). Variational approximations for binary classification were developed in Seeger (2000), where a logit likelihood was considered, and mean field approximations were applied to probit likelihood terms in Opper and Winther (2000) and Csato, Fokue, Opper, Schottky, and Winther (2000). Additionally, incremental (Quinonero-Candela & Winther, 2003) or sparse approximations based on assumed density filtering (ADF; Csato & Opper, 2002), informative vector machines (IVM; Lawrence, Seeger, & Herbrich, 2003), and expectation propagation (EP; Minka, 2001; Kim, 2005) have been proposed. With the exceptions of Williams and Barber (1998), Gibbs and MacKay (2000), Seeger and Jordan (2004), and Kim (2005), the focus of most recent work has largely been on the binary GP classification problem. Seeger and Jordan (2004) developed a multiclass generalization of the IVM employing a multinomial-logit softmax likelihood. However, considerable representational effort is required to ensure that the scaling of computation and storage required of the proposed method matches that of the original IVM with linear scaling in the number of classes. In contrast, by adopting the probabilistic representation of Albert and Chib (1993), we will see that GP-based K-class classification and efficient sparse approximations (IVM generalizations with scaling linear in the number of classes) can be realized by optimizing a strict lower
2 The likelihood is nonlinear in the parameters due to either the logistic or probit link functions required in the classification setting.
Figure 1: Graphical representation of the conditional dependencies within the general multinomial probit regression model with gaussian process priors.
bound of the marginal likelihood of a multinomial probit regression model that requires the solution of K computationally independent GP regression problems while still operating jointly (statistically) on the data. We will also show that the accuracy of this approximation is comparable to that obtained via MCMC. The following section introduces the multinomial probit regression model with GP priors.

2 Multinomial Probit Regression

Define the data matrix as X = [x_1, . . . , x_N]^T, which has dimension N × D, and the N × 1–dimensional vector of associated target values as t, where each element t_n ∈ {1, . . . , K}. The N × K matrix of GP random variables m_nk is denoted by M. We represent the N × 1–dimensional columns of M by m_k and the corresponding K × 1–dimensional rows by m_n. The N × K matrix of auxiliary variables y_nk is represented as Y, where the N × 1–dimensional columns are denoted by y_k and the corresponding K × 1–dimensional rows as y_n. The M × 1 vector of covariance kernel hyperparameters for each class is denoted by ϕ_k, and associated hyperparameters ψ_k and α_k complete the model.3 The graphical representation of the conditional dependency structure in the auxiliary variable multinomial probit regression model with GP priors in the most general case is shown in Figure 1.

3 Prior Probabilities

From the graphical model in Figure 1, a priori we can assume class-specific GP independence and define model priors such that m_k|X, ϕ_k ∼ GP(ϕ_k) = N_{m_k}(0, C_{ϕ_k}), where the matrix C_{ϕ_k} of dimension N × N defines

3 This is the most general setting; however, it is more common to employ a single and shared GP covariance function across classes.
the class-specific GP covariance.4 Typical examples of such GP covariance functions are radial basis–style functions such that the i, jth element of each C_{ϕ_k} is defined as exp{−Σ_{d=1}^{M} ϕ_kd (x_id − x_jd)²}, where in this case M = D; however, there are many other forms of covariance functions that may be employed within the GP function prior (see, e.g., MacKay, 2003). As in Albert and Chib (1993), we employ a standardized normal noise model such that the prior on the auxiliary variables is y_nk|m_nk ∼ N_{y_nk}(m_nk, 1) to ensure appropriate matching with the probit function. Rather than having this variance fixed, it could also be made an additional free parameter of the model and would therefore yield a scaled probit function. For the presentation here, we restrict ourselves to the standardized model and consider extensions to a scaled probit model as possible further work. The relationship between the additional latent variables y_n (denoting the nth row of Y) and the targets t_n as defined in multinomial probit regression (Albert & Chib, 1993) is adopted here:

\[ t_n = j \quad \text{if} \quad y_{nj} = \max_{1 \le k \le K} \{ y_{nk} \}. \tag{3.1} \]
This has the effect of dividing R^K (y space) into K nonoverlapping K-dimensional cones C_k = {y : y_k > y_i, k ≠ i}, where R^K = ∪_k C_k, and so each P(t_n = i|y_n) can be represented as δ(y_ni > y_nk ∀ k ≠ i). We then see that, similar to the binary case, where the probit function emerges from explicitly marginalizing the auxiliary variable, the multinomial probit takes the form given below (details are given in appendix A):

\[ P(t_n = i \mid \mathbf{m}_n) = \int \delta(y_{ni} > y_{nk}\ \forall\, k \neq i) \prod_{j=1}^{K} p(y_{nj} \mid m_{nj})\, d\mathbf{y} = \int_{C_i} \prod_{j=1}^{K} p(y_{nj} \mid m_{nj})\, d\mathbf{y} = E_{p(u)}\Big\{ \prod_{j \neq i} \Phi(u + m_{ni} - m_{nj}) \Big\}, \]
where the random variable u is standardized normal, p(u) = N(0, 1). A hierarchic prior on the covariance function hyperparameters is employed such that each hyperparameter has, for example, an independent exponential distribution, ϕ_kd ∼ Exp(ψ_kd), and a gamma distribution is placed on the mean values of the exponential, ψ_kd ∼ Γ(σ_k, τ_k), thus forming a conjugate pair. Of course, as detailed in Girolami and Rogers (2005), a
4 The model can be defined by employing K − 1 GP functions and an alternative truncation of the gaussian over the variables y_nk; however, for the multiclass case, we define a GP for each class.
more general form of covariance function can be employed that will allow the integration of heterogeneous types of data, which takes the form of a weighted combination of base covariance functions. The associated hyper-hyperparameters α = {σ_{k=1,...,K}, τ_{k=1,...,K}} can be estimated via type II maximum likelihood or set to reflect some prior knowledge of the data. Alternatively, vague priors can be employed such that, for example, each σ_k = τ_k = 10^{−6}. Defining the parameter set as Θ = {Y, M} and the hyperparameters as Ξ = {ϕ_{k=1,...,K}, ψ_{k=1,...,K}}, the joint likelihood takes the form

\[ p(\mathbf{t}, \Theta, \Xi \mid X, \boldsymbol{\alpha}) = \prod_{n=1}^{N} \sum_{i=1}^{K} \delta(y_{ni} > y_{nk}\ \forall\, k \neq i)\, \delta(t_n = i) \times \prod_{k=1}^{K} p(y_{nk} \mid m_{nk})\, p(\mathbf{m}_k \mid X, \boldsymbol{\varphi}_k)\, p(\boldsymbol{\varphi}_k \mid \boldsymbol{\psi}_k)\, p(\boldsymbol{\psi}_k \mid \boldsymbol{\alpha}_k). \tag{3.2} \]
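To make the central likelihood identity concrete, the expectation over u can be approximated by simple Monte Carlo. The following is a minimal sketch (NumPy/SciPy); the function name and sample count are illustrative choices, not from the paper:

```python
import numpy as np
from scipy.stats import norm

def multinomial_probit_prob(m, i, n_samples=10000, rng=None):
    """Monte Carlo estimate of P(t = i | m) = E_u prod_{j != i} Phi(u + m_i - m_j),
    where u ~ N(0, 1) and m is the K-vector of GP function values."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(n_samples)          # samples of the latent u
    K = len(m)
    others = [j for j in range(K) if j != i]
    # Phi(u + m_i - m_j) for every sample and every j != i, then product over j
    probs = norm.cdf(u[:, None] + m[i] - np.asarray(m)[others][None, :])
    return probs.prod(axis=1).mean()

# Sanity check: the K class probabilities sum to (approximately) one.
m = np.array([0.3, -0.5, 1.2])
print(sum(multinomial_probit_prob(m, i) for i in range(3)))  # ~1.0
```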
4 Gaussian Process Multiclass Classification

We now consider both exact and approximate Bayesian inference for GP classification with multiple classes employing the multinomial probit regression model.

4.1 Exact Bayesian Inference: The Gibbs Sampler. The representation of the joint likelihood (see equation 3.2) is particularly convenient in that samples can be drawn from the full posterior over the model parameters (given the hyperparameter values), p(Θ|t, X, Ξ, α), using a Gibbs sampler in a very straightforward manner with scaling per sample of O(KN³). Full details of the Gibbs sampler are provided in appendix D, and this sampler will be employed in the experimental section.

4.2 Approximate Bayesian Inference: The Laplace Approximation. The Laplace approximation of the posterior over GP variables, p(M|t, X, Ξ, α) (where Y is marginalized), requires finding the mode of the unnormalized posterior. Approximate Bayesian inference for GP classification with multiple classes employing a multinomial logit (softmax) likelihood has been developed previously by Williams and Barber (1998). Due to the form of the multinomial logit likelihood, a Newton iteration to obtain the posterior mode will scale at best as O(KN³). Employing the multinomial probit likelihood, we find that each Newton step will scale as O(K³N³). Details are provided in appendix E.

4.3 Approximate Bayesian Inference: A Variational and Sparse Approximation. Employing a variational Bayes approximation (Beal, 2003; Jordan, Ghahramani, Jaakkola, & Saul, 1999; MacKay, 2003) by using an approximating ensemble of factored posteriors such that p(Θ|t, X, Ξ, α) ≈ ∏_i Q(Θ_i) = Q(Y)Q(M) for multinomial probit regression is more appealing from a computational perspective, as a sparse representation, with scaling O(KNS²) (where S is the number of samples entering the model and S ≪ N), can be obtained in a straightforward manner, as will be shown in the following sections. The lower bound5 (see, e.g., Beal, 2003; Jordan et al., 1999; MacKay, 2003) on the marginal likelihood, log p(t|X, Ξ, α) ≥ E_{Q(Θ)}{log p(t, Θ|X, Ξ, α)} − E_{Q(Θ)}{log Q(Θ)}, is maximized by distributions that take the unnormalized form Q(Θ_i) ∝ exp⟨E_{Q(Θ\i)}{log P(t, Θ|X, Ξ, α)}⟩, where Q(Θ\i) denotes the ensemble distribution with the ith component of Θ removed.

5 The bound follows from the application of Jensen's inequality; for example, log p(t|X) = log ∫ Q(Θ) [p(t, Θ|X)/Q(Θ)] dΘ ≥ ∫ Q(Θ) log [p(t, Θ|X)/Q(Θ)] dΘ.

Details of the required posterior components are given in appendix A. The approximate posterior over the GP random variables takes a factored form such that
\[ Q(\mathbf{M}) = \prod_{k=1}^{K} Q(\mathbf{m}_k) = \prod_{k=1}^{K} \mathcal{N}_{\mathbf{m}_k}(\tilde{\mathbf{m}}_k, \boldsymbol{\Sigma}_k), \tag{4.1} \]
where the shorthand tilde notation denotes posterior expectation, f̃(a) = E_{Q(a)}{f(a)}, and so the required posterior mean for each k is given as m̃_k = Σ_k ỹ_k, where Σ_k = C_{ϕ_k}(I + C_{ϕ_k})^{−1} (see appendix A for full details). We will see that each row, y_n, of Y will have posterior correlation structure induced, ensuring that the appropriate class-conditional posterior dependencies will be induced in M̃. It should be stressed here that while there are K a posteriori independent GP processes, the associated K-dimensional posterior means for each of the N data samples induce posterior dependencies between the K columns of M̃ due to the posterior coupling over each of the auxiliary variables y_n. We will see that this structure is particularly convenient in obtaining sparse approximations (Lawrence et al., 2003) for the multiclass GP in particular. Due to the multinomial probit definition of the dependency between each element of y_n and t_n (see equation 3.1), the posterior for the auxiliary variables follows as

\[ Q(\mathbf{Y}) = \prod_{n=1}^{N} Q(\mathbf{y}_n) = \prod_{n=1}^{N} \mathcal{N}_{\mathbf{y}_n}^{t_n}(\tilde{\mathbf{m}}_n, \mathbf{I}), \tag{4.2} \]

where N_{y_n}^{t_n}(m̃_n, I) denotes a conic truncation of a multivariate gaussian such that if t_n = i, where i ∈ {1, . . . , K}, then the ith dimension has the largest value. The required posterior expectations ỹ_nk for all k ≠ i and ỹ_ni follow as

\[ \tilde{y}_{nk} = \tilde{m}_{nk} - \frac{E_{p(u)}\big\{ \mathcal{N}_u(\tilde{m}_{nk} - \tilde{m}_{ni}, 1)\, \tilde{\Phi}_u^{n,i,k} \big\}}{E_{p(u)}\big\{ \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nk})\, \tilde{\Phi}_u^{n,i,k} \big\}} \tag{4.3} \]

\[ \tilde{y}_{ni} = \tilde{m}_{ni} - \sum_{j \neq i} \big( \tilde{y}_{nj} - \tilde{m}_{nj} \big), \tag{4.4} \]
where Φ̃_u^{n,i,k} = ∏_{j≠i,k} Φ(u + m̃_ni − m̃_nj) and p(u) = N_u(0, 1). The expectations with respect to p(u), which appear in equation 4.3, can be obtained by quadrature or straightforward sampling methods. If we also consider the set of hyperparameters, Ξ, in this variational treatment, then the approximate posterior for the covariance kernel hyperparameters takes the form of

\[ Q(\boldsymbol{\varphi}_k) \propto \mathcal{N}_{\tilde{\mathbf{m}}_k}(\mathbf{0}, \mathbf{C}_{\varphi_k}) \prod_{d=1}^{M} \mathrm{Exp}(\varphi_{kd} \mid \tilde{\psi}_{kd}), \]

and the required posterior expectations can be estimated employing importance sampling. Expectations can be approximated by drawing S samples such that each ϕ_kd^s ∼ Exp(ψ̃_kd), and so

\[ \tilde{f}(\boldsymbol{\varphi}_k) \approx \sum_{s=1}^{S} f(\boldsymbol{\varphi}_k^s)\, w(\boldsymbol{\varphi}_k^s), \quad \text{where} \quad w(\boldsymbol{\varphi}_k^s) = \frac{\mathcal{N}_{\tilde{\mathbf{m}}_k}\big(\mathbf{0}, \mathbf{C}_{\varphi_k^s}\big)}{\sum_{s'=1}^{S} \mathcal{N}_{\tilde{\mathbf{m}}_k}\big(\mathbf{0}, \mathbf{C}_{\varphi_k^{s'}}\big)}. \tag{4.5} \]
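A minimal sketch of this importance-sampling step is given below (NumPy/SciPy). The RBF covariance constructor, jitter term, and sample count are illustrative assumptions; samples of ϕ_k are drawn from the current exponential posterior approximation and weighted by the zero-mean GP marginal evaluated at the current posterior mean m̃_k, as in equation 4.5:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_cov(X, phi):
    """RBF covariance with per-dimension inverse squared length scales phi."""
    d = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-np.tensordot(d, phi, axes=([2], [0])))

def posterior_mean_phi(X, m_tilde, psi_tilde, n_samples=500, jitter=1e-6, rng=None):
    """Importance-sampling estimate of the posterior mean of phi_k (equation 4.5)."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    # Draw phi^s ~ Exp(psi_tilde) for each of the M covariance hyperparameters
    phi_s = rng.exponential(psi_tilde, size=(n_samples, len(psi_tilde)))
    log_w = np.array([
        multivariate_normal.logpdf(m_tilde, mean=np.zeros(N),
                                   cov=rbf_cov(X, p) + jitter * np.eye(N))
        for p in phi_s
    ])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # self-normalized importance weights
    return w @ phi_s                  # posterior mean estimate of phi_k
```

Working with log densities and subtracting the maximum before exponentiation avoids the numerical underflow that direct evaluation of the gaussian weights would otherwise cause for even moderate N.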
This form of importance sampling within a variational Bayes procedure has been employed previously in Lawrence, Milo, Niranjan, Rashbass, and Soullier (2004). Clearly the scaling of the above estimator per sample is similar to that required in the gradient-based methods that search for optima of the marginal likelihood as employed in GP regression and classification (e.g., MacKay, 2003). Finally, we have that each Q(ψ_kd) = Γ_{ψ_kd}(σ_k + 1, τ_k + ϕ̃_kd), and the associated posterior mean is simply ψ̃_kd = (σ_k + 1)/(τ_k + ϕ̃_kd).

4.4 Summarizing Variational Multiclass GP Classification. We can summarize what has been presented by the following iterations that, in the general case, for all k and d, will optimize the bound on the marginal likelihood (explicit expressions for the bound are provided in appendix C):

\[ \tilde{\mathbf{m}}_k \leftarrow \mathbf{C}_{\tilde{\varphi}_k}\big(\mathbf{I} + \mathbf{C}_{\tilde{\varphi}_k}\big)^{-1}(\tilde{\mathbf{m}}_k + \mathbf{p}_k) \tag{4.6} \]
\[ \tilde{\boldsymbol{\varphi}}_k \leftarrow \sum_s \boldsymbol{\varphi}_k^s\, w(\boldsymbol{\varphi}_k^s) \tag{4.7} \]
\[ \tilde{\psi}_{kd} \leftarrow \frac{\sigma_k + 1}{\tau_k + \tilde{\varphi}_{kd}}, \tag{4.8} \]

where each ϕ_kd^s ∼ Exp(ψ̃_kd), w(ϕ_k^s) is defined as previously, and p_k is the kth column of the N × K matrix P whose elements p_nk are defined by the rightmost terms in equations 4.3 and 4.4: for t_n = i, then for all k ≠ i,

\[ p_{nk} = - \frac{E_{p(u)}\big\{ \mathcal{N}_u(\tilde{m}_{nk} - \tilde{m}_{ni}, 1)\, \tilde{\Phi}_u^{n,i,k} \big\}}{E_{p(u)}\big\{ \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nk})\, \tilde{\Phi}_u^{n,i,k} \big\}} \quad \text{and} \quad p_{ni} = - \sum_{j \neq i} p_{nj}. \]
These iterations can be viewed as obtaining K one-against-all binary classifiers; however, most important, they are not statistically independent of each other but are a posteriori coupled via the posterior mean estimates of each of the auxiliary variables y_n. The computational scaling will be linear in the number of classes and cubic in the number of data points, O(KN³). It is worth noting that if the covariance function hyperparameters are fixed, then the costly matrix inversion requires being computed only once. The Laplace approximation will require a matrix inversion for each Newton step when finding the mode of the posterior (Williams & Barber, 1998).

4.4.1 Binary Classification. Previous variational treatments of GP-based binary classification include Seeger (2000), Opper and Winther (2000), Gibbs and MacKay (2000), Csato and Opper (2002), and Csato et al. (2000). It is, however, interesting to note in passing that for binary classification, the outer plate in Figure 1 is removed, and further simplification follows, as only K − 1, that is, one, set of posterior mean values requires being estimated: the posterior expectations m̃ = Σ ỹ, with Σ = C_ϕ(I + C_ϕ)^{−1}, now operate on N × 1–dimensional vectors m̃ and ỹ. The posterior Q(y) is now a product of truncated univariate gaussians, so the expectation for the latent variables y_n has an exact analytic form. For a unit variance gaussian truncated below zero if t_n = 1 and above zero if t_n = −1, the required posterior mean ỹ has elements that can be obtained by the following analytic expression derived from straightforward results for corrections to the mean of a gaussian due to truncation: ỹ_n = m̃_n + t_n N_{m̃_n}(0, 1)/Φ(t_n m̃_n).6 So the following iteration will guarantee an increase in the bound of the marginal likelihood,

\[ \tilde{\mathbf{m}} \leftarrow \mathbf{C}_{\varphi}\big(\mathbf{I} + \mathbf{C}_{\varphi}\big)^{-1}(\tilde{\mathbf{m}} + \mathbf{p}), \tag{4.9} \]

where each element of the N × 1 vector p is defined as p_n = t_n N_{m̃_n}(0, 1)/Φ(t_n m̃_n).

6 For t = +1, ỹ = ∫_0^{+∞} y N_y(m̃, 1)/{1 − Φ(−m̃)} dy = m̃ + N_{m̃}(0, 1)/Φ(m̃), and for t = −1, ỹ = ∫_{−∞}^0 y N_y(m̃, 1)/Φ(−m̃) dy = m̃ − N_{m̃}(0, 1)/Φ(−m̃).
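The binary case therefore reduces to a remarkably compact loop. A minimal sketch follows (NumPy/SciPy), assuming a fixed covariance matrix C and targets in {−1, +1}; the function name and iteration count are illustrative:

```python
import numpy as np
from scipy.stats import norm

def binary_vb_gp(C, t, n_iters=50):
    """Iterate equation 4.9: m <- C (I + C)^{-1} (m + p), with
    p_n = t_n N(m_n; 0, 1) / Phi(t_n m_n) and targets t_n in {-1, +1}."""
    N = C.shape[0]
    A = C @ np.linalg.inv(np.eye(N) + C)    # fixed while the hyperparameters are fixed
    m = np.zeros(N)
    for _ in range(n_iters):
        p = t * norm.pdf(m) / norm.cdf(t * m)   # truncated-gaussian mean correction
        m = A @ (m + p)
    return m
```

Note that, as remarked above, the matrix A need only be computed once when the covariance function hyperparameters are held fixed.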
4.5 Variational Predictive Distributions. The predictive distribution, P(t_new = k|x_new, X, t), for a new sample x_new follows from results for standard GP regression.7 The N × 1 vector C_{ϕ_k}^new contains the covariance function values between the new point and those contained in X, and c_{ϕ_k}^new denotes the covariance function value for the new point and itself. So the GP posterior p(m_new|x_new, X, t) is a product of K gaussians, each with mean and variance

\[ \tilde{m}_k^{\mathrm{new}} = \tilde{\mathbf{y}}_k^{T}\big(\mathbf{I} + \mathbf{C}_{\varphi_k}\big)^{-1}\mathbf{C}_{\varphi_k}^{\mathrm{new}} \]
\[ \sigma_k^{2,\mathrm{new}} = c_{\varphi_k}^{\mathrm{new}} - \big(\mathbf{C}_{\varphi_k}^{\mathrm{new}}\big)^{T}\big(\mathbf{I} + \mathbf{C}_{\varphi_k}\big)^{-1}\mathbf{C}_{\varphi_k}^{\mathrm{new}}. \]

Using the following shorthand, ν_k^new = √(1 + σ_k^{2,new}), it is then straightforward (details in appendix B) to obtain the predictive distribution over possible target values as

\[ P(t_{\mathrm{new}} = k \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = E_{p(u)}\Big\{ \prod_{j \neq k} \Phi\Big( \frac{u\, \nu_k^{\mathrm{new}} + \tilde{m}_k^{\mathrm{new}} - \tilde{m}_j^{\mathrm{new}}}{\nu_j^{\mathrm{new}}} \Big) \Big\}, \]

where, as before, u ∼ N_u(0, 1). The expectation can be obtained numerically employing sample estimates from a standardized gaussian. For the binary case, the standard result follows:

\[ P(t_{\mathrm{new}} = 1 \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = \int \delta(y_{\mathrm{new}} > 0)\, \mathcal{N}_{y_{\mathrm{new}}}\big(\tilde{m}^{\mathrm{new}}, (\nu^{\mathrm{new}})^2\big)\, dy_{\mathrm{new}} = 1 - \Phi\Big( -\frac{\tilde{m}^{\mathrm{new}}}{\nu^{\mathrm{new}}} \Big) = \Phi\Big( \frac{\tilde{m}^{\mathrm{new}}}{\nu^{\mathrm{new}}} \Big). \]
5 Sparse Variational Multiclass GP Classification The dominant O(N3 ) scaling of the matrix inversion required in the posterior mean updates in GP regression has been the motivation behind a large body of literature focusing on reducing this cost via reduced rank approximations (Williams & Seeger, 2001) and sparse online learning (Csato & Opper, 2002; Quinonero-Candela & Winther, 2003) where assumed density filtering (ADF) forms the basis of online learning and sparse approximations for GPs. Likewise in Lawrence et al. (2003) the informative vector machine (IVM) (refer to Lawrence, Platt, & Jordan, 2005, for comprehensive details) is proposed, which employs informative point selection criteria (Seeger, Williams, & Lawrence, 2003) and ADF updating of the approximations of
7 Conditioning on Ỹ, ϕ̃, ψ̃, and α is implicit.
the GP posterior parameters. Only binary classification has been considered in Lawrence et al. (2003), Csato and Opper (2002), and Quinonero-Candela and Winther (2003), and it is clear from Seeger and Jordan (2004) that extension of ADF-based approximations such as IVM to the multiclass problem is not at all straightforward when a multinomial-logit softmax likelihood is adopted. However, we now see that sparse GP-based classification for multiple classes (multiclass IVM) emerges as a simple by-product of online ADF approximations to the parameters of each Q(m_k) (multivariate gaussian). The ADF approximations when adding the nth data sample, selected at the lth of S iterations, for each of the K GP posteriors, Q(m_k), follow simply from details in Lawrence et al. (2005) as given below:

\[ \boldsymbol{\Sigma}_{k,n} \leftarrow \mathbf{C}_{\varphi_k}^{n} - \mathbf{M}_k^{T}\mathbf{M}_{k,n} \tag{5.1} \]
\[ \mathbf{s}_k \leftarrow \mathbf{s}_k - \frac{1}{1 + s_{kn}}\, \mathrm{diag}\big( \boldsymbol{\Sigma}_{k,n} \boldsymbol{\Sigma}_{k,n}^{T} \big) \tag{5.2} \]
\[ \mathbf{M}_k^{l} \leftarrow \frac{1}{\sqrt{1 + s_{kn}}}\, \boldsymbol{\Sigma}_{k,n}^{T} \tag{5.3} \]
\[ \tilde{\mathbf{m}}_k \leftarrow \tilde{\mathbf{m}}_k + \frac{\tilde{y}_{nk} - \tilde{m}_{nk}}{1 + s_{kn}}\, \boldsymbol{\Sigma}_{k,n}. \tag{5.4} \]
Each ỹ_nk − m̃_nk = p_nk as defined in section 4.4 and can be obtained from the current stored approximate values of each m̃_n1, . . . , m̃_nK via equations 4.3 and 4.4; Σ_{k,n}, an N × 1 vector, is the nth column of the current estimate of each Σ_k; likewise, C_{ϕ_k}^n is the nth column of each GP covariance matrix. All elements of each M_k and m̃_k are initialized to zero, while each s_k has initial unit values. Of course, there is no requirement to explicitly store each N × N–dimensional matrix Σ_k; only the S × N matrices M_k and N × 1 vectors s_k require storage and maintenance. We denote indexing into the lth row of each M_k by M_k^l, and the nth element of each s_k by s_kn, which is the estimated posterior variance. The efficient Cholesky factor updating as detailed in Lawrence et al. (2005) will ensure that for N data samples, K distinct GP priors, and a maximum of S samples included in the model, where S ≪ N, at most O(KSN) storage and O(KNS²) compute scaling will be realized. As an alternative to the entropic scoring heuristic of Seeger et al. (2003) and Lawrence et al. (2003), we suggest that an appropriate criterion for point inclusion assessment will be the posterior predictive probability of a target value given the current model parameters for points that are currently not included in the model, that is, P(t_m|x_m, {m̃_k}, {Σ_k}), where the subscript m indexes such points. From the results of the previous section, this is equal
to Pr(y_m ∈ C_{t_m=k}), which is expressed as

\[ E_{p(u)}\Big\{ \prod_{j \neq k} \Phi\Big( \frac{u\, \nu_k^{m} + \tilde{m}_{mk} - \tilde{m}_{mj}}{\nu_j^{m}} \Big) \Big\}, \tag{5.5} \]

where k is the value of t_m and ν_j^m = √(1 + s_jm), and so the data point with the smallest posterior target probability should be selected for inclusion. This scoring criterion requires no additional storage overhead, as all m̃_k and s_k are already available, and it can be computed for all m not currently in the model in, at most, O(KN) time.8 Intuitively, points in regions of low target posterior certainty, that is, class boundaries, will be the most influential in updating the approximation of the target posteriors. And so the inclusion of points with the most uncertain target posteriors will yield the largest possible translation of each updated m̃_k into the interior of their respective cones C_k. Experiments in the following section will demonstrate the effectiveness of this multiclass IVM.

6 Experiments

6.1 Illustrative Multiclass Toy Example. Ten-dimensional data vectors, x, were generated such that if t = 1, then 0.5 > x_1² + x_2² > 0.1; for t = 2, then 1.0 > x_1² + x_2² > 0.6; and for t = 3, then [x_1, x_2]^T ∼ N(0, 0.01 I), where I denotes an identity matrix of appropriate dimension. Finally, x_3, . . . , x_10 are all distributed as N(0, 1). Both of the first two dimensions are required to define the three class labels, with the remaining eight dimensions being irrelevant to the classification task. Each of the three target values was sampled uniformly, thus creating a balance of samples drawn from the three target classes. Two hundred forty draws were made from the above distribution, and the sample was used in the proposed variational inference routine, with a further 4620 points being used to compute a 0-1 loss class prediction error. A common radial basis covariance function of the form exp{−Σ_d ϕ_d |x_id − x_jd|²} was employed, and vague hyperparameters, σ = τ = 10⁻³, were placed on the length-scale hyperparameters ψ_1, . . . , ψ_10. The posterior expectations of the auxiliary variables ỹ were obtained from equations 4.3 and 4.4, where the gaussian integrals were computed using 1000 samples drawn from p(u) = N(0, 1). The variational importance sampler employed 500 samples drawn from each Exp(ψ̃_d) in estimating the corresponding posterior means ϕ̃_d for the covariance function parameters. Each M and Y was initialized randomly, and ϕ̃ had unit initial values.
8 Assuming constant time to approximate the expectation.
Figure 2: (a) Convergence of the lower bound on the marginal likelihood for the toy data set considered. (b) Evolution of estimated posterior means for the inverse squared length scale parameters (precision parameters) in the RBF covariance function. (c) Evolution of out-of-sample predictive performance on the toy data set.
In this example, the variational iterations ran for 50 steps, where each step corresponds to the sequential posterior mean updates of equations 4.6 to 4.8. The value of the variational lower bound was monitored during each step, and as would be expected, a steady convergence in the improvement of the bound can be observed in Figure 2a. The development of the estimated posterior mean values for the covariance function parameters ϕ̃_d, Figure 2b, shows automatic relevance detection (ARD) in progress (Neal, 1998), where the eight irrelevant features are effectively removed from the model. From Figure 2c we can see that the development of the predictive performance (out of sample) follows that of the lower bound (see Figure 2a), achieving a predictive performance of 99.37% at convergence. As a comparison to our multiclass GP classifier, we use a directed acyclic graph (DAG) SVM (Platt, Cristianini, & Shawe-Taylor, 2000) (assuming equal class distributions, the scaling9 is O(N³K⁻¹)) on this example. By employing the values of the posterior mean values of the covariance function length scale parameters (one for each of the 10 dimensions) estimated by the proposed variational procedure in the RBF kernel of the DAG SVM, a predictive performance of 99.33% is obtained. So on this data set, the proposed GP classifier has comparable performance, under 0-1 loss, to the DAG SVM. However, the estimation of the covariance function parameters is a natural part of the approximate Bayesian inference routines employed in GP classification. There is no natural method of obtaining estimates of the 10 kernel parameters for the SVM without resorting to cross validation (CV), which, in the case of a single parameter, is feasible but rapidly becomes infeasible as the number of parameters increases.

9 This assumes the use of standard quadratic optimization routines.
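The toy data generating process described above is easy to reproduce. A minimal sketch follows (NumPy); rejection sampling of the annulus classes is one of several reasonable implementations, and the function name is illustrative:

```python
import numpy as np

def sample_toy(n, rng=None):
    """Ten-dimensional toy data: classes 1 and 2 are annuli in (x1, x2),
    class 3 is a tight gaussian blob; dimensions 3-10 are N(0, 1) noise."""
    rng = np.random.default_rng() if rng is None else rng
    X, t = np.empty((n, 10)), np.empty(n, dtype=int)
    bounds = {1: (0.1, 0.5), 2: (0.6, 1.0)}   # bounds on x1^2 + x2^2
    for i in range(n):
        c = rng.integers(1, 4)                 # classes sampled uniformly
        if c == 3:
            x12 = rng.normal(0.0, 0.1, size=2)   # N(0, 0.01 I)
        else:
            lo, hi = bounds[c]
            while True:                          # rejection-sample the annulus
                x12 = rng.uniform(-1.0, 1.0, size=2)
                if lo < x12 @ x12 < hi:
                    break
        X[i, :2] = x12
        X[i, 2:] = rng.standard_normal(8)      # eight irrelevant dimensions
        t[i] = c
    return X, t
```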
6.2 Comparing Laplace and Variational Approximations to Exact Inference via Gibbs Sampling. This section provides a brief empirical comparison of the variational approximation, developed in previous sections, to a full MCMC treatment employing the Gibbs sampler detailed in appendix D. In addition, a Laplace approximation is considered in this short comparative study. Variational approximations provide a strict lower bound on the marginal likelihood, and this bound is one of the approximation's attractive characteristics. However, it is less well understood how much the parameters obtained from such approximations differ from those obtained using exact methods. Preliminary analysis of the asymptotic properties of variational estimators is provided in Wang and Titterington (2004). A recent experimental study of EP and Laplace approximations to binary GP classifiers has been undertaken by Kuss and Rasmussen (2005), and it is motivating to consider a similar comparison for the variational approximation in the multiple-class setting. Kuss and Rasmussen observed that the marginal and predictive likelihoods, computed over a wide range of covariance kernel hyperparameter values, were less well preserved by the Laplace approximation than the EP approximation when compared to that obtained by MCMC. We therefore consider the predictive likelihood obtained by the Gibbs sampler and compare this to the variational and Laplace approximations of the GP-based classifiers. The toy data set from the previous section is employed, and as in Kuss and Rasmussen (2005), a covariance kernel of the form s exp{−ϕ Σ_d (x_id − x_jd)²} is adopted. Both s and ϕ are varied in the range (log scale) −1 to +5, and at each pair of hyperparameter values, a multinomial probit GP classifier is induced using (1) MCMC via the Gibbs sampler, (2) the proposed variational approximation, and (3) a Laplace approximation of the probit model. For the Gibbs sampler, after a burn-in of 2000 samples, the following 1000 samples were used for inference purposes, and the predictive likelihood (probability of target values in the test set) and test error (0-1 error loss) were estimated from the 1000 post-burn-in samples as detailed in appendix D. We first consider a binary classification problem by merging classes 2 and 3 of the toy data set into one class. The first thing to note from Figure 3 is that the predictive likelihood response under the variational approximation preserves, to a rather good degree, the predictive likelihood response obtained when using Gibbs sampling across the range of hyperparameter values. However, the Laplace approximation does not do as good a job in replicating the levels of the response profile obtained using MCMC over the range of hyperparameter values considered. This finding is consistent with the results of Kuss and Rasmussen (2005). The Laplace approximation to the multinomial probit model has O(K³N³) scaling (see appendix E), which limits its application to situations where the number of classes is small. For this reason, in the following experiments we instead consider the multinomial logit Laplace approximation
Figure 3: Isocontours of predictive likelihood for the binary classification problem: (a) Gibbs sampler, (b) variational approximation, (c) Laplace approximation.
Figure 4: Isocontours of predictive likelihood for multiclass classification problem: (a) Gibbs sampler, (b) variational approximation, (c) Laplace approximation.
(Williams & Barber, 1998). In Figure 4 the isocontours of predictive likelihood for the toy data set in the multiclass setting under various hyperparameter settings are provided. As with the binary case, the variational multinomial probit approximation provides predictive likelihood response levels that are good representations of those obtained from the Gibbs sampler. The Laplace approximation for the multinomial logit suffers from the same distortion of the contours as does the Laplace approximation for the binary probit; in addition, the information in the predictions is lower. We note, as in Kuss and Rasmussen (2005), that for s = 1 (log s = 0), the Laplace approximation compares reasonably with results from both MCMC and variational approximations. In the following experiment, four standard multiclass data sets (Iris, Thyroid, Wine, and Forensic Glass) from the UCI Machine Learning Data
1804
M. Girolami and S. Rogers
Table 1: Results of Comparison of Gibbs Sampler, Variational, and Laplace Approximations When Applied to Several UCI Data Sets.
                        Laplace            Variational        Gibbs Sampler
Toy Data
  Marginal likelihood   −169.27 ± 4.27     −232.00 ± 17.13    −94.07 ± 11.26
  Predictive error      3.97 ± 2.00        3.65 ± 1.95        3.49 ± 1.69
  Predictive likelihood −98.90 ± 8.22      −72.27 ± 9.25      −73.44 ± 7.67
Iris
  Marginal likelihood   −143.87 ± 1.04     −202.98 ± 1.37     −45.27 ± 6.17
  Predictive error      3.88 ± 2.00        4.08 ± 2.16        4.08 ± 2.16
  Predictive likelihood −10.43 ± 1.12      −7.35 ± 1.27       −7.26 ± 1.40
Thyroid
  Marginal likelihood   −158.18 ± 1.94     −246.24 ± 1.63     −68.82 ± 8.29
  Predictive error      4.73 ± 2.36        3.86 ± 2.04        3.94 ± 2.02
  Predictive likelihood −19.01 ± 2.55      −14.62 ± 2.70      −14.47 ± 2.39
Wine
  Marginal likelihood   −152.22 ± 1.29     −253.90 ± 1.52     −68.65 ± 6.19
  Predictive error      2.95 ± 2.16        2.65 ± 1.87        2.78 ± 2.07
  Predictive likelihood −14.57 ± 1.29      −10.16 ± 1.47      −10.47 ± 1.41
Forensic Glass
  Marginal likelihood   −275.11 ± 2.87     −776.79 ± 5.75     −268.21 ± 5.46
  Predictive error      36.54 ± 4.74       32.79 ± 4.57       34.00 ± 4.62
  Predictive likelihood −90.38 ± 3.25      −77.60 ± 3.91      −79.86 ± 4.80
Note: Best results for Predictive likelihood are highlighted in bold.
Repository,10 along with the toy data previously described, are used. For each data set, a random 60% training and 40% testing split was used to assess the performance of each of the classification methods being considered, and 50 random splits of each data set were used. For the toy data set, 50 random train and test sets were generated. The hyperparameters for an RBF covariance function taking the form of exp{−Σ_d ϕ_d (x_id − x_jd)²} were estimated employing the variational importance sampler, and these were then fixed and employed in all the classification methods considered. The marginal likelihood for the Gibbs sampler was estimated by using 1000 samples from the GP prior. For each data set and each method (multinomial logit Laplace approximation, variational approximation, and Gibbs sampler), the marginal likelihood (lower bound in the case of the variational approximation), predictive error (0-1 loss), and predictive likelihood were measured. The results, given as the mean and standard deviation over the 50 data splits, are listed in Table 1. The predictive likelihood obtained from the multinomial logit Laplace approximation is consistently, across all data sets, lower than that of the

10 Available online at http://www.ics.uci.edu/∼mlearn/MPRepository.html.
variational approximation and the Gibbs sampler. This indicates that the predictions from the Laplace approximation are less informative about the target values than both other methods considered. In addition, the variational approximation yields predictive distributions that are as informative as those provided by the Gibbs sampler; however, the 0-1 prediction errors obtained across all methods do not differ as significantly. Kuss and Rasmussen (2005) made a similar observation for the binary GP classification problem when Laplace and EP approximations were compared to MCMC. It will be interesting to further compare EP and variational approximations in this setting. We have observed that the predictions obtained from the variational approximation are in close agreement with those of MCMC, while the Laplace approximation suffers from some inaccuracy. This has also been reported for the binary classification setting in Kuss and Rasmussen (2005).

6.3 Multiclass Sparse Approximation. A further 1000 samples were drawn from the toy data generating process already described, and these were used to illustrate the sparse GP multiclass classifier in operation. The posterior mean values of the shared covariance kernel parameters estimated in the previous example were employed here, and so the covariance kernel parameters were not estimated. The predictive posterior scoring criterion proposed in section 5 was employed in selecting points for inclusion in the overall model. To assess how effective this criterion is, random sampling was also employed to compare the rates of convergence of both inclusion strategies in terms of predictive 0-1 loss on a held-out test set of 2385 samples. A maximum of S = 50 samples was to be included in the model, defining a 95% sparsity level. In Figure 5a the first two dimensions of the 1000 samples are plotted, with the three different target classes denoted by symbols. The isocontours of constant target posterior probability at a level of one-third (the decision boundaries) for each of the three classes are shown by the solid and dashed lines. What is interesting is that the 50 included points (circled) all sit close to, or on, the corresponding decision boundaries, as would be expected given the selection criteria proposed. These can be considered as a probabilistic analog to the support vectors of an SVM. The rates of 0-1 error convergence using both random and informative point sampling are shown in Figure 5b. The procedure was repeated 20 times using the same data samples, and the error bars show one standard deviation over these repeats. It is clear that on this example at least, random sampling has the slowest convergence, and the informative point inclusion strategy achieves less than 1% predictive error after the inclusion of only 30 data points. Of course, we should bridle our enthusiasm by recalling that the estimated covariance kernel parameters are already supplied. Nevertheless, multiclass IVM makes Bayesian GP inference on large-scale problems with multiple classes feasible, as will be demonstrated in the following example.
Figure 5: (a) Scatter plot of the first two dimensions of the 1000 available data samples. Each class is denoted by ×, +, or •, and the decision boundaries denoted by the contours of target posterior probability equal to one-third are plotted by solid and dashed lines. The 50 points selected based on the proposed criterion are circled, and it is clear that these sit close to the decision boundaries. (b) The averaged predictive performance (percentage predictions correct) over 20 random starts (dashed line denotes random sampling and solid line denotes informative sampling) is shown, with the slowest converging plot characterizing what is achieved under a random sampling strategy.
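A minimal sketch of the inclusion rule itself follows (NumPy/SciPy), assuming the current per-point posterior means m̃ (N × K) and variances s (N × K) are maintained as in equations 5.1 to 5.4; the helper name, sample count, and the use of a Python set for the active indices are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def next_inclusion(m_tilde, s, t, active, n_samples=2000, rng=None):
    """Return the index of the not-yet-included point whose true class has the
    smallest posterior predictive probability (the score of equation 5.5)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(n_samples)
    best, best_score = None, np.inf
    for m in range(m_tilde.shape[0]):
        if m in active:
            continue
        k = t[m]                               # true class of candidate point m
        nu = np.sqrt(1.0 + s[m])               # nu_j^m = sqrt(1 + s_jm)
        z = (u[:, None] * nu[k] + m_tilde[m, k] - m_tilde[m][None, :]) / nu[None, :]
        cdf = norm.cdf(z)
        cdf[:, k] = 1.0                        # exclude j = k from the product
        score = cdf.prod(axis=1).mean()        # Pr(y_m falls in cone C_{t_m})
        if score < best_score:
            best, best_score = m, score
    return best
```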
6.4 Large-Scale Example of Sparse GP Multiclass Classification. The Isolet data set11 comprises 6238 examples of letters from the alphabet spoken in isolation by 30 individual speakers, with each letter represented by 617 features. An independent collection of 1559 spoken letters is available for classification test purposes. The best reported test performance over all 26 classes of letter was 3.27% error, achieved using 30-bit error-correcting codes with an artificial neural network. Here we employ a single RBF covariance kernel with a common inverse length scale of 0.001 (further fine tuning is of course possible), and a maximum of 2000 points from the available 6238 are to be employed in the sparse multiclass GP classifier. As in the previous example, data are standardized; both random and informative sampling strategies were employed, with the results given in Figure 6 illustrating the superior convergence of an informative sampling strategy. After including 2000 of the available 6238 samples in the model, under the informative sampling strategy, a test error rate of 3.52% is achieved. We are unaware of any multiclass GP classification method that has been applied to such a large-scale problem in terms of both the number of data samples available and the number of classes.

11 The data set is available online from http://www.ics.uci.edu/mlearn/databases/isolet.
Figure 6: (a) The predictive likelihood computed on held-out data for both random sampling (solid line with + markers) and informative sampling (solid line with markers). The predictive likelihood is computed once every 50 inclusion steps. (b) The predictive performance (percentage predictions correct) achieved for both random sampling (solid line with + markers) and informative sampling (solid line with markers).
Qi, Minka, Picard, and Ghahramani (2004) have presented an empirical study of ARD when employed to select basis functions in relevance vector machine (RVM) (Tipping, 2000) classifiers. It was observed that a reliance on the marginal likelihood alone as a criterion for model identification ran the risk of overfitting the available data sample by producing an overly sparse representation. The authors then employ an approximation to the leave-one-out error, which emerges from the EP iterations, to counteract this problem. For Bayesian methods that rely on optimizing in-sample marginal likelihood (or an appropriate bound), great care has to be taken when setting the convergence tolerance, which determines when the optimization routine should halt. However, in the experiments we have conducted, this phenomenon did not appear to be such a problem, with the exception of one data set, discussed in the following section.

6.5 Comparison with Multiclass SVM. To briefly compare the performance of the proposed approach to multiclass classification with a number of multiclass SVM methods, we consider the recent study of Duan and Keerthi (2005). In that work, four forms of multiclass classifier were considered: WTAS (one-versus-all SVM with winner-takes-all class selection), MWVS (one-versus-one SVM with a maximum-votes class selection strategy), PWCP (one-versus-one SVM with probabilistic outputs employing pairwise coupling; see Duan & Keerthi, 2005, for details), and PWCK (kernel logistic regression with pairwise coupling of binary outputs). Five multiclass data sets from the UCI Machine Learning Data Repository were employed: ABE (16 dimensions and 3 classes), a subset of the Letters data
Table 2: SVM and Variational Bayes GP Multiclass Classification Comparison.
        WTAS         MWVS         PWCP         PWCK         VBGPM        VBGPS
SEG     9.4 ± 0.5    7.9 ± 1.2    7.9 ± 1.2    7.5 ± 1.2    *7.8 ± 1.5   11.5 ± 1.2
DNA     10.2 ± 1.3   9.9 ± 0.9    8.9 ± 0.8    9.7 ± 0.7    74.0 ± 0.3   13.3 ± 1.3
ABE     1.9 ± 0.8    1.9 ± 0.6    1.8 ± 0.6    1.8 ± 0.6    *1.8 ± 0.8   2.4 ± 0.8
WAV     17.2 ± 1.4   17.8 ± 1.4   16.4 ± 1.4   15.6 ± 1.1   25.2 ± 1.2   *15.6 ± 0.7
SAT     11.1 ± 0.6   11.0 ± 0.7   10.9 ± 0.4   11.2 ± 0.6   12.0 ± 0.4   12.1 ± 0.4
Note: The asterisk highlights the cases where the proposed GP-based multiclass classifiers were part of the best-performing set.
set using the letters A, B, and E; DNA (180 dimensions and 3 classes); SAT (36 dimensions and 6 classes; Satellite Image); SEG (18 dimensions and 7 classes; Image Segmentation); and WAV (21 dimensions and 3 classes; Waveform). For each of these, Duan and Keerthi (2005) created 20 random partitions into training and test sets for three different sizes of training set, ranging from small to large. Here we consider only the smallest training set sizes. In Duan and Keerthi (2005), thorough and extensive cross validation was employed to select the (single) length-scale parameter of the gaussian kernel and the associated regularization parameters used in each of the SVMs. The proposed importance sampler is employed to obtain the posterior mean estimates for both single and multiple length scales, VBGPS (variational Bayes gaussian process classification, single length scale) and VBGPM (variational Bayes gaussian process classification, multiple length scales), for a common GP covariance shared across all classes. We monitor the bound on the marginal likelihood and consider that convergence has been achieved when less than a 1% increase in the bound is observed, for all data sets except ABE, where a 10% convergence criterion was employed due to a degree of overfitting being observed after this point. In all experiments, data were standardized to have zero mean and unit variance. The percentage test errors averaged over each of the 20 data splits (mean ± standard deviation) are reported in Table 2. For each data set, the classifiers that obtained the lowest prediction error and whose performances were indistinguishable from each other at the 1% significance level using a paired t-test are highlighted in bold. An asterisk highlights the cases where the proposed GP-based multiclass classifiers were part of the best-performing set. We see that in three of the five data sets, performance equal to the best-performing SVMs is achieved by one of the GP-based classifiers without recourse to any cross validation or in-sample tuning, with comparable performance being achieved for SAT and DNA. The performance of VBGPM is particularly poor on DNA, and this is possibly due to the large number (180) of binary features.
7 Conclusion and Discussion

The main novelty of this work has been to adopt the data augmentation strategy employed in obtaining an exact Bayesian analysis of binary and multinomial probit regression models for GP-based multiclass (of which binary is a specific case) classification. While a full Gibbs sampler can be straightforwardly obtained from the joint likelihood of the model, approximate inference employing a factored form for the posterior is appealing from the point of view of computational effort and efficiency. The variational Bayes procedures developed provide simple iterations due to the inherent decoupling effect of the auxiliary variable between the GP components related to each class. The scaling is still dominated by an O(N³) term due to the matrix inversion required in obtaining the posterior mean for the GP variables and the repeated computing of multivariate gaussians required for the weights in the importance sampler. However, with the simple decoupled form of the posterior updates, we have shown that ADF-based online and sparse estimation yields a full multiclass IVM that has linear scaling in the number of classes and the number of available data points, and this is achieved in a straightforward manner. An empirical comparison with full MCMC suggests that the variational approximation proposed is superior to a Laplace approximation. Further ongoing work includes an investigation into the possible equivalences between EP and variational-based approximate inference for the multiclass GP classification problem as well as developing a variational treatment of GP-based ordinal regression (Chu & Ghahramani, 2005).

Appendix A

A.1 Q(M). We employ the shorthand Q(ϕ) = ∏_k Q(ϕ_k) in the following. Consider the Q(M) component of the approximate posterior. We have

\[ Q(\mathbf{M}) \propto \exp\Big\langle E_{Q(\mathbf{Y})Q(\boldsymbol{\varphi})}\Big\{ \sum_n \sum_k \log p(y_{nk} \mid m_{nk}) + \sum_k \log p(\mathbf{m}_k \mid \boldsymbol{\varphi}_k) \Big\} \Big\rangle \]
\[ \propto \exp\Big\langle E_{Q(\mathbf{Y})Q(\boldsymbol{\varphi})}\Big\{ \sum_k \log \mathcal{N}_{\mathbf{y}_k}(\mathbf{m}_k, \mathbf{I}) + \log \mathcal{N}_{\mathbf{m}_k}(\mathbf{0}, \mathbf{C}_{\varphi_k}) \Big\} \Big\rangle \]
\[ \propto \prod_k \mathcal{N}_{\tilde{\mathbf{y}}_k}(\mathbf{m}_k, \mathbf{I})\, \mathcal{N}_{\mathbf{m}_k}\Big(\mathbf{0}, \big(\widetilde{\mathbf{C}_{\varphi_k}^{-1}}\big)^{-1}\Big), \]

and so we have

\[ Q(\mathbf{M}) = \prod_{k=1}^{K} Q(\mathbf{m}_k) = \prod_{k=1}^{K} \mathcal{N}_{\mathbf{m}_k}(\tilde{\mathbf{m}}_k, \boldsymbol{\Sigma}_k), \]
where Σ_k = (I + (C_{ϕ_k}^{−1})~)^{−1} and m̃_k = Σ_k ỹ_k. Now each element of C_{ϕ_k}^{−1} is a nonlinear function of ϕ_k, and so, if considered appropriate, a first-order approximation can be made to the expectation of the matrix inverse such that (C_{ϕ_k}^{−1})~ ≈ C_{ϕ̃_k}^{−1}, in which case Σ_k = C_{ϕ̃_k}(I + C_{ϕ̃_k})^{−1}.

A.2 Q(Y).

\[ Q(\mathbf{Y}) \propto \exp\Big\langle E_{Q(\mathbf{M})}\Big\{ \sum_n \log p(t_n \mid \mathbf{y}_n) + \log p(\mathbf{y}_n \mid \mathbf{m}_n) \Big\} \Big\rangle \]
\[ \propto \exp\Big\{ \sum_n \log p(t_n \mid \mathbf{y}_n) + \log \mathcal{N}_{\mathbf{y}_n}(\tilde{\mathbf{m}}_n, \mathbf{I}) \Big\} \]
\[ \propto \prod_n \mathcal{N}_{\mathbf{y}_n}(\tilde{\mathbf{m}}_n, \mathbf{I})\, \delta(y_{ni} > y_{nk}\ \forall\, k \neq i)\, \delta(t_n = i). \]
Each y_n is then distributed as a truncated multivariate gaussian such that for t_n = i, the ith dimension of y_n is always the largest, and so we have

\[ Q(\mathbf{Y}) = \prod_{n=1}^{N} Q(\mathbf{y}_n) = \prod_{n=1}^{N} \mathcal{N}_{\mathbf{y}_n}^{t_n}(\tilde{\mathbf{m}}_n, \mathbf{I}), \]
where N_{y_n}^{t_n}(·, ·) denotes a K-dimensional gaussian truncated such that the dimension indicated by the value of t_n is always the largest. The posterior expectation of each y_n is now required. Note that

\[ Q(\mathbf{y}_n) = Z_n^{-1} \prod_k \mathcal{N}_{y_{nk}}(\tilde{m}_{nk}, 1), \]

where Z_n = Pr(y_n ∈ C) and C = {y_n : y_nj < y_ni, j ≠ i}. Now

\[ Z_n = \Pr(\mathbf{y}_n \in C) = \int_{-\infty}^{+\infty} \mathcal{N}_{y_{ni}}(\tilde{m}_{ni}, 1) \prod_{j \neq i} \int_{-\infty}^{y_{ni}} \mathcal{N}_{y_{nj}}(\tilde{m}_{nj}, 1)\, dy_{nj}\, dy_{ni} = E_{p(u)}\Big\{ \prod_{j \neq i} \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nj}) \Big\}, \]
where u is a standardized gaussian random variable such that p(u) = N_u(0, 1). For all k ≠ i, the posterior expectation follows as

\[ \tilde{y}_{nk} = Z_n^{-1} \int_{-\infty}^{+\infty} y_{nk} \prod_{j=1}^{K} \mathcal{N}_{y_{nj}}(\tilde{m}_{nj}, 1)\, dy_{nj} \]
\[ = Z_n^{-1} \int_{-\infty}^{+\infty} \int_{-\infty}^{y_{ni}} y_{nk}\, \mathcal{N}_{y_{nk}}(\tilde{m}_{nk}, 1)\, \mathcal{N}_{y_{ni}}(\tilde{m}_{ni}, 1) \prod_{j \neq i,k} \Phi(y_{ni} - \tilde{m}_{nj})\, dy_{nk}\, dy_{ni} \]
\[ = \tilde{m}_{nk} - Z_n^{-1} E_{p(u)}\Big\{ \mathcal{N}_u(\tilde{m}_{nk} - \tilde{m}_{ni}, 1) \prod_{j \neq i,k} \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nj}) \Big\}. \]
The required expectation for the ith component follows as

\[ \tilde{y}_{ni} = Z_n^{-1} \int_{-\infty}^{+\infty} y_{ni}\, \mathcal{N}_{y_{ni}}(\tilde{m}_{ni}, 1) \prod_{j \neq i} \Phi(y_{ni} - \tilde{m}_{nj})\, dy_{ni} \]
\[ = \tilde{m}_{ni} + Z_n^{-1} E_{p(u)}\Big\{ u \prod_{j \neq i} \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nj}) \Big\} \]
\[ = \tilde{m}_{ni} + \sum_{k \neq i} \big( \tilde{m}_{nk} - \tilde{y}_{nk} \big). \]

The final expression in the above follows from noting that for a random variable u ∼ N(0, 1) and any differentiable function g(u), E{u g(u)} = E{g′(u)}, in which case

\[ E_{p(u)}\Big\{ u \prod_{j \neq i} \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nj}) \Big\} = \sum_{k \neq i} E_{p(u)}\Big\{ \mathcal{N}_u(\tilde{m}_{nk} - \tilde{m}_{ni}, 1) \prod_{j \neq i,k} \Phi(u + \tilde{m}_{ni} - \tilde{m}_{nj}) \Big\}. \]
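These expectations are easy to verify numerically. The following sketch (NumPy/SciPy) compares a brute-force rejection-sampling estimate of the truncated-gaussian mean against the analytic Monte Carlo expressions above; the sample counts and test values are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, i, K = np.array([0.2, -0.4, 0.9]), 2, 3     # posterior means, observed class

# Brute force: sample y ~ N(m, I) and keep draws that fall in the cone C_i
Y = rng.standard_normal((400000, K)) + m
Y = Y[np.argmax(Y, axis=1) == i]
print(Y.mean(axis=0))                          # empirical E[y | y in C_i]

# Analytic Monte Carlo forms from appendix A, using u ~ N(0, 1)
u = rng.standard_normal(400000)
phi = norm.cdf(u[:, None] + m[i] - m[None, :]) # Phi(u + m_i - m_j) for all j
noti = [j for j in range(K) if j != i]
Z = phi[:, noti].prod(axis=1).mean()           # Z_n
y = m.copy()
for k in noti:
    rest = [j for j in noti if j != k]
    num = (norm.pdf(u, loc=m[k] - m[i]) * phi[:, rest].prod(axis=1)).mean()
    y[k] = m[k] - num / Z
y[i] = m[i] + np.sum(m[noti] - y[noti])        # the final identity above
print(y)                                       # agrees with the brute force
```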
A.3 Q(ϕ_k). For each k, we obtain the posterior component

\[ Q(\boldsymbol{\varphi}_k) \propto \exp\Big\langle E_{Q(\mathbf{m}_k)Q(\boldsymbol{\psi}_k)}\big\{ \log p(\mathbf{m}_k \mid \boldsymbol{\varphi}_k) + \log p(\boldsymbol{\varphi}_k \mid \boldsymbol{\psi}_k) \big\} \Big\rangle = Z_k^{-1}\, \mathcal{N}_{\tilde{\mathbf{m}}_k}(\mathbf{0}, \mathbf{C}_{\varphi_k}) \prod_d \mathrm{Exp}_{\varphi_{kd}}(\tilde{\psi}_{kd}), \]
where Z_k is the corresponding normalizing constant for each posterior, which is unobtainable in closed form. As such, the required expectations can be obtained by importance sampling.

A.4 Q(ψ_k). The final posterior component required is

\[ Q(\boldsymbol{\psi}_k) \propto \exp\Big\langle E_{Q(\boldsymbol{\varphi}_k)}\big\{ \log p(\boldsymbol{\varphi}_k \mid \boldsymbol{\psi}_k) + \log p(\boldsymbol{\psi}_k \mid \boldsymbol{\alpha}_k) \big\} \Big\rangle \]
\[ \propto \prod_d \mathrm{Exp}_{\tilde{\varphi}_{kd}}(\psi_{kd})\, \Gamma_{\psi_{kd}}(\sigma_k, \tau_k) = \prod_d \Gamma_{\psi_{kd}}(\sigma_k + 1, \tau_k + \tilde{\varphi}_{kd}), \]

and the required posterior mean values follow as ψ̃_kd = (σ_k + 1)/(τ_k + ϕ̃_kd).
Appendix B

The predictive distribution for a new point x_new can be obtained by first marginalizing the associated GP random variables such that

\[ p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = \int p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{m}_{\mathrm{new}})\, p(\mathbf{m}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{m}_{\mathrm{new}} \]
\[ = \prod_{k=1}^{K} \int \mathcal{N}_{y_k^{\mathrm{new}}}(m_k^{\mathrm{new}}, 1)\, \mathcal{N}_{m_k^{\mathrm{new}}}\big(\tilde{m}_k^{\mathrm{new}}, \sigma_k^{2,\mathrm{new}}\big)\, dm_k^{\mathrm{new}} = \prod_{k=1}^{K} \mathcal{N}_{y_k^{\mathrm{new}}}\big(\tilde{m}_k^{\mathrm{new}}, (\nu_k^{\mathrm{new}})^2\big), \]

where the shorthand ν_k^new = √(1 + σ_k^{2,new}) is employed. Now that we have the predictive posterior for the auxiliary variable y_new, the appropriate conic truncation of this spherical gaussian yields the required distribution P(t_new = k|x_new, X, t) as follows. Using the shorthand P(t_new = k|y_new) = δ(y_k^new > y_i^new ∀ i ≠ k) δ(t_new = k) ≡ δ_{k,new}, then

\[ P(t_{\mathrm{new}} = k \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = \int P(t_{\mathrm{new}} = k \mid \mathbf{y}_{\mathrm{new}})\, p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{y}_{\mathrm{new}} \]
\[ = \int_{C_k} p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{y}_{\mathrm{new}} = \int \delta_{k,\mathrm{new}} \prod_{j=1}^{K} \mathcal{N}_{y_j^{\mathrm{new}}}\big(\tilde{m}_j^{\mathrm{new}}, (\nu_j^{\mathrm{new}})^2\big)\, dy_j^{\mathrm{new}} \]
\[ = E_{p(u)}\Big\{ \prod_{j \neq k} \Phi\Big( \frac{u\, \nu_k^{\mathrm{new}} + \tilde{m}_k^{\mathrm{new}} - \tilde{m}_j^{\mathrm{new}}}{\nu_j^{\mathrm{new}}} \Big) \Big\}. \]
This is the probability that the auxiliary variable y_new is in the cone C_k, so

\[ \sum_{k=1}^{K} P(t_{\mathrm{new}} = k \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = \sum_{k=1}^{K} \int_{C_k} p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{y}_{\mathrm{new}} = \int_{\mathbb{R}^K} p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{y}_{\mathrm{new}} = 1, \]

thus yielding a properly normalized posterior distribution over classes 1, . . . , K.

Appendix C

The variational bound conditioned on the current values of ϕ_k, ψ_k, α_k (assuming these are fixed values) can be obtained in the following manner using the expansion of the relevant components of the lower bound:
\[ \sum_k \sum_n E_{Q(\mathbf{M})Q(\mathbf{Y})}\big\{ \log p(y_{nk} \mid m_{nk}) \big\} \tag{C.1} \]
\[ + \sum_k E_{Q(\mathbf{M})}\big\{ \log p(\mathbf{m}_k \mid X, \boldsymbol{\varphi}_k) \big\} \tag{C.2} \]
\[ - \sum_k E_{Q(\mathbf{m}_k)}\big\{ \log Q(\mathbf{m}_k) \big\} \tag{C.3} \]
\[ - \sum_n E_{Q(\mathbf{y}_n)}\big\{ \log Q(\mathbf{y}_n) \big\}. \tag{C.4} \]
Expanding each component in turn obtains

\[ -\frac{NK}{2} \log 2\pi - \frac{1}{2} \sum_k \sum_n \Big( \widetilde{y^2}_{nk} + \widetilde{m^2}_{nk} - 2\, \tilde{y}_{nk} \tilde{m}_{nk} \Big) \tag{C.5} \]
\[ -\frac{NK}{2} \log 2\pi - \frac{1}{2} \sum_k \log |\mathbf{C}_{\varphi_k}| - \frac{1}{2} \sum_k \tilde{\mathbf{m}}_k^{T} \mathbf{C}_{\varphi_k}^{-1} \tilde{\mathbf{m}}_k - \frac{1}{2} \sum_k \mathrm{trace}\big( \mathbf{C}_{\varphi_k}^{-1} \boldsymbol{\Sigma}_k \big) \tag{C.6} \]
\[ +\frac{NK}{2} \log 2\pi + \frac{NK}{2} + \frac{1}{2} \sum_k \log |\boldsymbol{\Sigma}_k| \tag{C.7} \]
\[ +\frac{N}{2} \log 2\pi + \frac{1}{2} \sum_k \sum_n \Big( \widetilde{y^2}_{nk} + \tilde{m}_{nk}^2 - 2\, \tilde{y}_{nk} \tilde{m}_{nk} \Big) + \sum_n \log Z_n. \tag{C.8} \]
Combining and manipulating equations C.5 to C.8 gives the following expression for the lower bound,

\[ -\frac{NK}{2} \log 2\pi + \frac{N}{2} \log 2\pi + \frac{NK}{2} - \frac{1}{2} \sum_k \mathrm{trace}\{\boldsymbol{\Sigma}_k\} - \frac{1}{2} \sum_k \tilde{\mathbf{m}}_k^{T} \mathbf{C}_{\varphi_k}^{-1} \tilde{\mathbf{m}}_k - \frac{1}{2} \sum_k \mathrm{trace}\big( \mathbf{C}_{\varphi_k}^{-1} \boldsymbol{\Sigma}_k \big) - \frac{1}{2} \sum_k \log |\mathbf{C}_{\varphi_k}| + \frac{1}{2} \sum_k \log |\boldsymbol{\Sigma}_k| + \sum_n \log Z_n, \]

where each Z_n = E_{p(u)}{ ∏_{j≠i} Φ(u + m̃_ni − m̃_nj) }.
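A sketch of evaluating this bound for fixed hyperparameters is given below (NumPy), assuming for simplicity a covariance C and posterior covariance Σ shared across the K classes and log Z_n values precomputed by the Monte Carlo estimator used for equations 4.3 and 4.4; the function name is illustrative:

```python
import numpy as np

def lower_bound(m_tilde, Sigma, C, logZ):
    """Direct transcription of the combined bound of appendix C.
    m_tilde: N x K posterior means; Sigma: shared N x N posterior covariance;
    C: shared N x N GP prior covariance; logZ: length-N vector of log Z_n."""
    N, K = m_tilde.shape
    Cinv = np.linalg.inv(C)
    _, logdetC = np.linalg.slogdet(C)
    _, logdetS = np.linalg.slogdet(Sigma)
    quad = sum(m_tilde[:, k] @ Cinv @ m_tilde[:, k] for k in range(K))
    return (-0.5 * N * K * np.log(2 * np.pi) + 0.5 * N * np.log(2 * np.pi)
            + 0.5 * N * K
            - 0.5 * K * np.trace(Sigma)            # sum_k trace(Sigma_k)
            - 0.5 * quad                           # sum_k m_k' C^{-1} m_k
            - 0.5 * K * np.trace(Cinv @ Sigma)     # sum_k trace(C^{-1} Sigma_k)
            - 0.5 * K * logdetC + 0.5 * K * logdetS
            + logZ.sum())
```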
Appendix D

Details of the Gibbs sampler required to obtain samples from the posterior p(Θ|t, X, Ξ, α) now follow. From the definition of the joint likelihood (see equation 3.2), it is straightforward to see that the conditional distribution for each y_n|m_n will be a truncated gaussian defined in the cone C_{t_n}, centered at m_n with identity covariance, and denoted by N_y^{t_n}(m_n, I). The distribution for each m_k|y_k is multivariate gaussian with covariance Σ_k = C_{ϕ_k}(I + C_{ϕ_k})^{−1} and mean Σ_k y_k. Thus, the Gibbs sampler, for each n and k, takes the simple form

\[ \mathbf{y}_n^{(i)} \mid \mathbf{m}_n^{(i-1)} \sim \mathcal{N}_{\mathbf{y}}^{t_n}\big(\mathbf{m}_n^{(i-1)}, \mathbf{I}\big) \]
\[ \mathbf{m}_k^{(i+1)} \mid \mathbf{y}_k^{(i)} \sim \mathcal{N}_{\mathbf{m}}\big(\boldsymbol{\Sigma}_k \mathbf{y}_k^{(i)}, \boldsymbol{\Sigma}_k\big), \]

where the superscript (i) denotes the ith sample drawn. The dominant scaling will be O(KN³) per sample draw. With the multinomial probit likelihood for a new data point defined as

\[ P(t_{\mathrm{new}} = k \mid \mathbf{m}_{\mathrm{new}}) = E_{p(u)}\Big\{ \prod_{j \neq k} \Phi\big(u + m_k^{\mathrm{new}} - m_j^{\mathrm{new}}\big) \Big\}, \]

the predictive distribution12 is then obtained from

\[ P(t_{\mathrm{new}} = k \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t}) = \int P(t_{\mathrm{new}} = k \mid \mathbf{m}_{\mathrm{new}})\, p(\mathbf{m}_{\mathrm{new}} \mid \mathbf{x}_{\mathrm{new}}, X, \mathbf{t})\, d\mathbf{m}_{\mathrm{new}}. \]

A Monte Carlo estimate of the above required marginal posterior expectation can be obtained by drawing samples from the full posterior

12 Conditioning on Ξ and α is implicit.
distribution, p(Θ|t, X, Ξ, α), using the above sampler. Then for each Θ^(i) sampled, an additional set of samples m_k^{new,s} is drawn, such that for each k, m_k^{new,s}|y_k^(i) ∼ N_m(µ_k^{new,i}, σ_k^{2,new}), where µ_k^{new,i} = (y_k^(i))^T (I + C_{ϕ_k})^{−1} C_{ϕ_k}^new, and the associated variance is σ_k^{2,new} = c_{ϕ_k}^new − (C_{ϕ_k}^new)^T (I + C_{ϕ_k})^{−1} C_{ϕ_k}^new. The approximate predictive distribution can then be obtained by the following Monte Carlo estimate:

\[ \frac{1}{N_{\mathrm{samps}}} \sum_{s=1}^{N_{\mathrm{samps}}} E_{p(u)}\Big\{ \prod_{j \neq k} \Phi\big(u + m_k^{\mathrm{new},s} - m_j^{\mathrm{new},s}\big) \Big\}. \]
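A minimal sketch of one sweep of this Gibbs sampler follows (NumPy). A shared covariance across classes is assumed for brevity, and the cone-truncated draw uses simple rejection sampling, which is adequate for small K but not the most efficient choice; function names are illustrative:

```python
import numpy as np

def truncated_draw(mean, k, rng):
    """Rejection-sample y ~ N(mean, I) subject to y_k = max_j y_j."""
    while True:
        y = rng.standard_normal(len(mean)) + mean
        if np.argmax(y) == k:
            return y

def gibbs_sweep(M, t, Sigma, chol_Sigma, rng):
    """One sweep: Y | M from cone-truncated gaussians, then M | Y from
    N(Sigma y_k, Sigma) for each class k (shared Sigma assumed here)."""
    N, K = M.shape
    Y = np.empty((N, K))
    for n in range(N):
        Y[n] = truncated_draw(M[n], t[n], rng)
    for k in range(K):
        mean = Sigma @ Y[:, k]
        M[:, k] = mean + chol_Sigma @ rng.standard_normal(N)  # N(mean, Sigma)
    return M, Y
```

Here chol_Sigma = np.linalg.cholesky(Sigma) would be computed once outside the sweep, since Σ is fixed for fixed covariance hyperparameters.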
An additional Metropolis-Hastings subsampler can be employed within the above Gibbs sampler to draw samples from the posterior p(Θ, Ξ|t, X, α) if the covariance function hyperparameters are to be integrated out.

Appendix E

The Laplace approximation requires the Hessian matrix of second-order derivatives of the joint log likelihood with respect to each m_n. The derivatives of the noise component, log P(t_n = k|m_n) = log E_{p(u)}{∏_{j≠k} Φ(u + m_nk − m_nj)}, follow, where we denote expectation with respect to a gaussian truncated in the cone C_k as E_{N_y^k}{·}:

\[ \frac{\partial}{\partial m_{ni}} \log P(t_n = k \mid \mathbf{m}_n) = \frac{1}{P(t_n = k \mid \mathbf{m}_n)} \int_{C_k} (y_{ni} - m_{ni})\, \mathcal{N}_{\mathbf{y}_n}(\mathbf{m}, \mathbf{I})\, d\mathbf{y} = E_{\mathcal{N}_y^k}\{y_{ni}\} - m_{ni} \]

and

\[ \frac{\partial^2}{\partial m_{nj}\, \partial m_{ni}} \log P(t_n = k \mid \mathbf{m}_n) = E_{\mathcal{N}_y^k}\{y_{ni} y_{nj}\} - E_{\mathcal{N}_y^k}\{y_{ni}\}\, E_{\mathcal{N}_y^k}\{y_{nj}\} - \delta_{ij}. \]

This then defines an NK × NK–dimensional Hessian matrix that, unlike the Hessian of the multinomial logit counterpart, cannot be decomposed into a diagonal plus multiplicative form (refer to Williams & Barber, 1998, for details), due to the cross-diagonal elements E_{N_y^k}{y_ni y_nj}, and so the required matrix inversions of the Newton step and those required to obtain the predictive covariance will operate on a full NK × NK matrix.

Acknowledgments

This work is supported by Engineering and Physical Sciences Research Council grants GR/R55184/02 & EP/C010620/1. We are grateful to
Chris Williams, Jim Kay, and Joaquin Quiñonero-Candela for motivating discussions regarding this work. In addition, the comments and suggestions made by the anonymous reviewers helped to improve the manuscript significantly.

References

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.
Beal, M. (2003). Variational algorithms for approximate Bayesian inference. Unpublished doctoral dissertation, University College London.
Chu, W., & Ghahramani, Z. (2005). Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6, 1019–1041.
Csato, L., Fokue, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 252–257). Cambridge, MA: MIT Press.
Csato, L., & Opper, M. (2002). Sparse online gaussian processes. Neural Computation, 14, 641–668.
Duan, K., & Keerthi, S. (2005). Which is the best multi-class SVM method? An empirical study. In N. C. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems (pp. 278–285). Seaside, CA.
Gibbs, M. N., & MacKay, D. J. C. (2000). Variational gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6), 1458–1464.
Girolami, M., & Rogers, S. (2005). Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning (pp. 241–248). New York: ACM.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233.
Kim, H. C. (2005). Bayesian and ensemble kernel classifiers. Unpublished doctoral dissertation, Pohang University of Science and Technology. Available online at http://home.postech.ac.kr/∼grass/publication/.
Kuss, M., & Rasmussen, C. E. (2005). Assessing approximate inference for binary gaussian process classification. Journal of Machine Learning Research, 6, 1679–1704.
Lawrence, N. D., Milo, M., Niranjan, M., Rashbass, P., & Soullier, S. (2004). Reducing the variability in cDNA microarray image processing by Bayesian inference. Bioinformatics, 20(4), 518–526.
Lawrence, N. D., Platt, J. C., & Jordan, M. I. (2005). Extensions of the informative vector machine. In J. Winkler, N. D. Lawrence, & M. Niranjan (Eds.), Proceedings of the Sheffield Machine Learning Workshop. Berlin: Springer-Verlag.
Lawrence, N. D., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Thrun, S. Becker, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Neal, R. (1998). Regression and classification using gaussian process priors. In A. P. Dawid, M. Bernardo, J. O. Berger, & A. F. M. Smith (Eds.), Bayesian statistics 6 (pp. 475–501). New York: Oxford University Press.
Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655–2684.
Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press.
Qi, Y., Minka, T. P., Picard, R. W., & Ghahramani, Z. (2004). Predictive automatic relevance determination by expectation propagation. In R. Greiner & D. Schuurmans (Eds.), Proceedings of the Twenty-First International Conference on Machine Learning. New York: ACM.
Quinonero-Candela, J., & Winther, O. (2003). Incremental gaussian processes. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Seeger, M. (2000). Bayesian model selection for support vector machines, gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 603–609). Cambridge, MA: MIT Press.
Seeger, M., & Jordan, M. I. (2004). Sparse gaussian process classification with multiple classes (Tech. Rep. 661). Berkeley: Department of Statistics, University of California.
Seeger, M., Williams, C. K. I., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. N.p.: Society for Artificial Intelligence and Statistics.
Tipping, M. (2000). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Wang, B., & Titterington, D. M. (2004). Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values (Tech. Rep. No. 04-02). Glasgow: Department of Statistics, University of Glasgow.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351.
Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 598–604). Cambridge, MA: MIT Press.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.
Received July 1, 2005; accepted November 8, 2005.
LETTER
Communicated by Youshen Xia
A Novel Neural Network for a Class of Convex Quadratic Minimax Problems Xing-Bao Gao
[email protected] College of Mathematics and Information Science, Shaanxi Normal University, Xi’an, Shaanxi 710062, China
Li-Zhi Liao
[email protected] Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Based on the inherent properties of convex quadratic minimax problems, this article presents a new neural network model for a class of convex quadratic minimax problems. We show that the new model is stable in the sense of Lyapunov and will converge to an exact saddle point in finite time by defining a proper convex energy function. Furthermore, global exponential stability of the new model is shown under mild conditions. Compared with the existing neural networks for the convex quadratic minimax problem, the proposed neural network has finite-time convergence, a simpler structure, and lower complexity. Thus, the proposed neural network is more suitable for parallel implementation by using simple hardware units. The validity and transient behavior of the proposed neural network are illustrated by some simulation results. 1 Introduction In this letter, we are interested in the following convex quadratic minimax problem: min{max{ f (x, y)}}, x∈U
(1.1)
y∈V
where f (x, y) =
1 1 T x Hx + h T x − x T Qy − yT Sy − s T y, 2 2
(1.2)
H ∈ Rm×m , S ∈ Rn×n , Q ∈ Rm×n , h ∈ Rm , and s ∈ Rn are given with H and S being symmetric and positive semidefinite, U = {x ∈ Rm | a i ≤ xi ≤ b i , Neural Computation 18, 1818–1846 (2006)
C 2006 Massachusetts Institute of Technology
A Novel Neural Network
1819
i = 1, 2, · · · , m}, V = {y ∈ Rn | c j ≤ y j ≤ d j , j = 1, 2, · · · , n}, and some −a i (or −c j , b i , d j ) could be +∞. Minimax problems provide a useful reformulation of optimality conditions and also arise in a variety of engineering and economic contexts, including game theory, military scheduling, and automatic control. In particular, problem 1.1 includes:
r r r r
(Piecewise) linear programming (H = 0 and S = 0) Linear programming (H = 0, S = 0, and Q = 0) Quadratic programming (S = 0 and H = 0) Linear and quadratic programming with bound constraints
In many engineering and scientific applications, real-time online solutions of minimax problems are desired. However, traditional algorithms (Fukushima, 1992; He, 1996, 1999; Rockafellar, 1987; Solodov & Tseng, 1996; Tseng, 2000) are not suitable for a real-time online implementation on the computer because the computing time required for a solution is greatly dependent on the dimension and the structure of the problem and the complexity of the algorithm used. One promising approach to handle these problems with high dimension and dense structure is to employ artificial neural network–based circuit implementation. Because of the dynamic nature of optimization and the potential of electronic implementation, neural networks can be implemented physically by designated hardware such as application-specific integrated circuits, where the optimization procedure is truly done in parallel. Therefore, the neural network approach can solve optimization problems in running times that are orders of magnitude much faster than conventional optimization algorithms executed on general-purpose digital computers. It is of great interest to develop some neural network models that could provide a real-time online solution. In recent years, the neural network approach for solving optimization problems has been studied by many researchers, and many good results have been achieved (Bouzerdorm & Pattison, 1993; Friesz, Bernstein, Mehta, Tobin, & Ganjlizadeh, 1994; Gao, 2003, 2004; Gao & Liao, 2003; Gao, Liao, & Xue, 2004; Han, Liao, Qi, & Qi, 2001; He & Yang, 2000; Xia, 2004; Xia & Feng, 2004; Xia, Feng, & Wang, 2004; Xia & Wang, 1998, 2000, 2001). Since the condition of the saddle point of equation 1.2 can be formulated as the following linear variational inequality LVI(M, q , C), to find a vector z∗ ∈ C such that (z − z∗ )T (Mz∗ + q ) ≥ 0,
∀z ∈ C,
(1.3)
where M ∈ Rk×k , q ∈ Rk , and C ⊆ Rk is a nonempty closed convex set (see remark 1), problem 1.1 can be solved by using the models in Gao et al., (2004), He and Yang (2000), and Xia and Wang (1998, 2000). In particular,
1820
X.-B. Gao and L.-Z. Liao
Gao et al. (2004) proposed the following neural network, d dt
x − PU x − Hx − h + Qy x , = −λ y y − PV y − Sy − s − QT x
(1.4)
where λ > 0 is a scaling constant, PU : Rm → U is the projection operator defined by PU (u) = arg min u − v, v∈U
· is the Euclidean norm, and PV : Rn → V is the projection operator defined similar to PU . Gao et al. (2004) also provided several simple and feasible sufficient conditions to ensure the asymptotical stability of equation 1.4. Although model 1.4 has a one-layer structure and is exponentially stable for any initial point in U × V when matrices H and S are positive definite, its convergence is not very satisfactory since it may not be stable and does not have a finite-time convergence when matrices H and S are only positive semidefinite. For example, for the following problem, min max(xy), x
y
where x and y ∈ R1 , model 1.4 can be simplified as dx = λy, dt dy = −λx. dt
(1.5)
(1.6)
It is easy to see that equation 1.6 is divergent. The models proposed by He and Yang (2000) and Xia and Wang (1998) have good stability performance; however, because the model in He and Yang (2000) is not suitable for parallel implementation due to the choice of the varying parameter and the model in Xia and Wang (1998) has a complex structure, further simplification can be achieved. Therefore, it is necessary to build a new neural network for equation 1.1 with a lower complexity and good stability and convergence results. Based on the above considerations, in this article, we will propose a new neural network model for solving problem 1.1 by means of sufficient and necessary conditions of the saddle point of equation 1.2, define a convex energy function by introducing a convex function, and prove that the proposed neural network is stable in the sense of Lyapunov and will converge to an exact saddle point in finite time when matrices H and S are only positive semidefinite. Furthermore, global exponential stability of the new model is also shown when matrices H and S are positive definite. Compared
A Novel Neural Network
1821
with the existing neural networks and some conventional numerical methods, the new model has a lower complexity and finite-time convergence, and its asymptotical stability requires only the positive semidefiniteness of matrices H and S. Thus, the new model is very simple and more suitable for the hardware implementation. The solution of problem 1.1 is closely related to the saddle point of f (x, y). A point (x ∗ , y∗ ) ∈ U × V is said to be a saddle point of f (x, y) if f (x ∗ , y) ≤ f (x ∗ , y∗ ) ≤ f (x, y∗ ),
∀(x, y) ∈ U × V.
(1.7)
Throughout this letter, we assume that the set K ∗ = {(x, y) ∈ U × V | (x, y) is a saddle point of f (x, y)} = ∅ and there exists a finite point (x ∗ , y∗ ) ∈ K ∗ . Obviously, if (x ∗ , y∗ ) ∈ K ∗ is a saddle point of f (x, y), then it must be a solution of problem 1.1. Therefore, it would be sufficient to find a saddle point of f (x, y) for problem 1.1. For the convenience of later discussions, it is necessary to introduce the following definition: Definition 1. A neural network is said to have finite-time convergence to one of its equilibrium points z∗ if there exists a time τ0 such that the output trajectory z(t) of this network reaches z∗ for t ≥ τ0 (see Xia et al., 2004). In our following discussions, we let · denote the Euclidean norm, In denote the identity matrix of order n, ∇ϕ(x) = (∂ϕ(x)/∂ x1 , ∂ϕ(x)/ ∂ x2 ,. . ., ∂ϕ(x)/∂ xn )T ∈ Rn denote the gradient vector of the differentiable function ϕ(x) at x. For any vector u ∈ Rn , uT denotes its transpose. For any n × n real symmetric matrix M, λmin (M) and λmax (M) denote its minimum and maximum eigenvalues, respectively. A basic property of the projection mapping on a closed convex set U ⊆ Rm is (Kinderlehrer & Stampacchia, 1980) [w − PU (w)]T [PU (w) − p] ≥ 0,
∀w ∈ Rm , p ∈ U.
(1.8)
The rest of the letter is organized as follows. In section 2, a neural network model for problem 1.1 is proposed. The stability and convergence of the proposed network are analyzed in section 3. The simulation results of our proposed neural network are reported in section 4. Finally, some concluding remarks are drawn in section 5. 2 A Neural Network Model In this section, a neural network for solving problem 1.1 is presented, and the comparisons with the existing neural networks and some conventional numerical methods are discussed. First, we provide a necessary and
1822
X.-B. Gao and L.-Z. Liao
sufficient condition for the saddle point of f (x, y) in equation 1.2. This result provides the theoretical foundation for us to design the neural network for problem 1.1. Theorem 1.
(x ∗ , y∗ ) ∈ K ∗ if and only if
(x − x ∗ )T (Hx ∗ + h − Qy∗ ) ≥ 0 , (y − y∗ )T (Sy∗ + s + QT x ∗ ) ≥ 0 ,
∀x ∈ U, ∀y ∈ V.
(2.1)
Proof. From equation 1.7 and theorem 3.3.3 in Bazaraa, Sherali, and Shetty (1993), this can be easily proved. Remark 1. Theorem 1 indicates that z∗ = ((x ∗ )T , (y∗ )T )T ∈ K ∗ if and only if it is a solution of a monotone LVI(M, q , C) defined in equation 1.3 with k = m + n, M=
H −Q , QT S
q=
h , s
and C = U × V.
(2.2)
From equations 1.8 and 2.1, we can easily establish the following result, which shows that a saddle point (x ∗ , y∗ ) of f (x, y) in equation 1.2 is the projection of some vector on U × V. Lemma 1.
(x ∗ , y∗ ) ∈ K ∗ if and only if
x ∗ = PU (x ∗ − Hx ∗ − h + Qy∗ ), y∗ = PV (y∗ − Sy∗ − s − QT x ∗ ),
(2.3)
where PU (x) = [(PU (x))1 , (PU (x))2 , . . . , (PU (x))m ]T and (PU (x))i = min{b i , ma x{xi , a i }} for i = 1, 2, . . . , m, PV (y) = [(PV (y))1 , (PV (y))2 , t, (PV (y))n ]T , and (PV (y)) j = min{d j , ma x{y j , c j }} for j = 1, 2, . . . , n. Lemma 1 indicates that a saddle point (x ∗ , y∗ ) of f (x, y) in equation 1.2 can be obtained by solving equation 2.3. Based on the above results, we propose the following dynamical system for a neural network model to solve problem 1.1: d dt
s − QT x x 2 x − PU x − Hx − h + QPV y − Sy − , = −λ y y − PV y − Sy − s − QT x (2.4)
where λ > 0 is a scaling constant.
A Novel Neural Network
1823
q q
q
q
q q
ha1 q h11 + ? . + s q . h12j + .. + . q . h1m h q q h21 + a2 q ? s + q . h22j + q + . q . h2m .. ham q q .hm1+ ? q + s q . hm2j + + q q .. hmm
q11 + - ? + ? s q12+j -P (·) -ˆ j + U λ + q1n> q21 + ? ? + s q22+ j j -ˆ PU (·) .. q2n+ λ + >. .. . qm1 + + ? s qm2+? -ˆ j + j-PU (·) .. qmn λ + >.
q11 + s q q . 21j + + .. q .. qm1+ a6 . s 1 q q12 + s q . q22+ j q q q .. qm2+ + a6 q .. .. . .q1n s2 + q s q2nj + .. qmn + + q. . a6 q . . s
s11 - q W s s12+j - -λ q - y1 j -P (·) V s1n + -q6 6 s21 - W s s22+j -λ q - y2 - q js2n - + PV (·) - q6 .. 6 .. . sn1 . W s sn2+ j-λ - jq q y snn - + PV (·) ... n -6 6
q
n
q
q
q
q - x1
.q . . x2 .. . q xm
q
Figure 1: The architecture of network 2.4.
As a result of lemma 1, we have the following result, which describes the relationship between an equilibrium point of equation 2.4 and a saddle point of f (x, y) in equation 1.2. Lemma 2. 2.4. Proof.
(x T , yT )T ∈ K ∗ if and only if (x, y) is an equilibrium point of network
From lemma 1 and equations 2.3 and 2.4, the result is trivial.
Lemma 2 also illustrates that a saddle point (x ∗ , y∗ ) of f (x, y) in equation 1.2 is the projection of some vector on U × V and can be obtained by the equilibrium point of equation 2.4. The architecture of neural network 2.4 is shown in Figure 1, where vectors x and y are the network’s outputs, vectors h = (h 1 , h 2 , . . . , h m )T and s = (s1 , s2 , . . . , sn )T are the external inputs, the projection operators PU (·) and PV (·) could be implemented by some piecewise activation functions
1824
X.-B. Gao and L.-Z. Liao
(Bouzerdorm & Pattison, 1993), and other parameters are defined by H = (h i j )m×m , Q = (q i j )m×n , S = (si j )n×n , and λˆ = 2λ. According to Figure 1, the circuit realizing the proposed neural network 2.4 consists of m + n integrators, m + n linear piecewise activation functions, (m + n)2 weighted connections for H, S, Q, and QT , and (m + n)2 + 2(m + n) adders. Thus, it can be implemented by using simple hardware units. For the convenience of later discussions, we denote z = (x T , yT )T ∈ Rm+n and
v = PV y − Sy − s − QT x , u = PU (x − Hx − h + Qv).
(2.5)
It should be noted that the definition of u in equation 2.5 requires the value of v. Then the proposed neural network 2.4 can be written as dz = −λF (z) = −λ dt
2(x − u) . y−v
(2.6)
To show the advantages of the proposed neural network 2.4, we compare it with four existing neural network models and some conventional numerical methods. First, we look at model 1.4 proposed by Gao et al. (2004). The function F (z) in equation 2.6 and the right-hand side of equation 1.4 are totally different since u = PU (x − Hx − h + Qv) = PU (x − Hx − h + Qy). It is easy to see that the complexity of the above model is about the same as that of proposed neural network 2.4, yet the stability conditions are different. When matrices H and S are only positive semidefinite, theorem 3 ensures that neural network 2.4 is stable in the sense of Lyapunov and converges to a saddle point in finite time, but model 1.4 may not be stable and may not converge even when the initial point z0 lies in U × V (see examples 2–5 in section 4). Thus the stability and finite-time convergence conditions of model 1.4 are stronger than that of 2.4. To clarify this issue further, we consider problem 1.5; then model 2.4 can be written as dx dt
= 2λ(y − x),
dy dt
= −λx.
Obviously this system differs from model 1.6, and is stable and convergent, but system 1.6 is divergent. Second, we compare the proposed neural network 2.4 with the models proposed by He and Yang (2000) and Xia and Wang (1998). The model
A Novel Neural Network
1825
proposed by Xia and Wang (1998) for problem 1.1 is defined as dz = −λ Im+n + MT e(z) dt
(2.7)
where λ > 0 is a scaling constant, M and q are defined in equation 2.2, and e(z) = z − PU×V (z − Mz − q ).
(2.8)
In terms of the model complexity, it is easy to see that the total multiplications/divisions and additions/subtractions per iteration for equation 2.7 are 2(m + n)2 + m + n and 2(m + n)2 + 2(m + n), respectively. But the total multiplications/divisions and additions/subtractions per iteration for neural network 2.4 are (m + n)2 + 2m + n and (m + n)2 + 2(m + n), respectively. Thus the asymptotic complexity of model 2.4 is about half of model 2.7. Furthermore, for problem 1.1, the model proposed by He and Yang (2000) is dz = λ{PU×V [z − θ α(z)(MT e(z) + Mz + q )] − z}, dt
(2.9)
where λ > 0 is a scaling constant, θ ∈ (0, 2), M, q , and e(z) are defined in equations 2.2 and 2.8, respectively, and α(z) = e(z)2 /(Im+n + MT )e(z)2 . It is easy to see that this model is not suitable for parallel implementation due to the choice of the varying parameter α(z) and requires computing two projections and terms e(z), MT e(z), and α(z) per iteration. Even though the proposed neural network 2.4 has a two-layer structure, it is required to compute only one projection and term F (z) in equation 2.6 per iteration. Since the complexity of F (z) is about the same as that of e(z), the proposed neural network has a low complexity. Therefore, model 2.4 is simpler than models 2.7 and 2.9 and reduces the model complexity in implementation. In addition, no result for the finite-time convergence for models 2.7 and 2.9 is available in the literature, and the stability of model 2.9 requires that the initial point z0 lies in U × V, yet theorem 3 ensures that neural network 2.4 is stable and convergent in finite time for any initial point z0 ∈ Rm+n . Third, we compare the proposed neural network 2.4 with the model proposed by Gao (2004). Model 2.4 is designed to solve convex quadratic minimax problems, while the model in Gao (2004) is developed to solve nonlinear convex programming problems. Thus, the energy functions and theoretical results of the two models are different. In particular, model 2.4 is globally exponentially stable when matrices H and S are positive definite (see theorem 4), but the model in Gao (2004) has no exponential stability result. Moreover, for a convex quadratic problem (problem 1.1) with S = 0 and V = {y ∈ Rm |y ≥ 0}), even though the two models are the same, the finite-time convergence results are different. Model 2.4 is stable
1826
X.-B. Gao and L.-Z. Liao
and converges to a saddle point in finite time (see theorem 3), but no result for the finite-time convergence of the model in Gao (2004) is available in the literature. Finally, we compare the proposed neural network 2.4 with two typically numerical methods: a modified projection-type method (Solodov & Tseng, 1996) and a forward-backward splitting method (Tseng, 2000). For problem 1.1, a modified projection-type method proposed by Solodov and Tseng (1996) is defined as zk+1 = zk − θ γ (zk )N−1 (Im+n + MT )e(zk ),
(2.10)
where θ ∈ (0, 2), N is an (m + n) × (m + n) symmetric positive-definite matrix, M and e(z) are defined in equations 2.2 and 2.8, respectively, and γ (z) = e(z)2 /[e(z)T (Im+n + M)N−1 (Im+n + MT )e(z)]. It is easy to see that this method is not suitable for parallel implementation due to the choice of the varying parameter γ (zk ), and its asymptotic complexity is about two times that of model 2.4 even when N = Im+n . Furthermore, for problem 1.1, a forward-backward splitting method proposed by Tseng (2000) is defined as
z¯ k = PU×V zk − θ Mzk + q , zk+1 = PU×V z¯ k + θ M zk − z¯ k ,
(2.11)
where θ is a positive constant and M and q are defined in equation 2.2. This method can be viewed as a prediction-correction method and requires two projections per iteration, and its asymptotic complexity is about two times that of model 2.4. In addition, besides the positive semidefiniteness requirement for matrix M, the parameters N and θ are key to the convergence of method 2.10, and method 2.11 is globally convergent only when θ < ν/M with 0 < ν < 1. On the other hand, the stability and convergence of model 2.4 require only the positive semidefiniteness of matrices H and S, and model 2.4 has finite-time convergence without condition θ < ν/M with 0 < ν < 1. Thus, model 2.4 is simpler than methods 2.10 and 2.11, avoids the difficulty of choosing the network parameters, and requires weaker convergence condition. 3 Stability Analysis In this section, we study some theoretical properties for model 2.4. First, we prove the following lemma, which will be very useful in our later discussion. Lemma 3. ϕ(z) =
Let 1 (y − Sy − s − QT x2 − y − Sy − s − QT x − v2 ), 2
(3.1)
A Novel Neural Network
1827
where v is defined in equation 2.5. Then the following is true: i. ϕ(z) is continuously differentiable and convex on Rm+n . ii. For any z, z = ((x )T , (y )T )T ∈ Rm+n , the following inequality holds: ϕ(z) ≤ ϕ(z ) + (z − z )T ∇ϕ(z ) + (z − z )T W(z − z )/2, where W=
QQT −Q(In − S) −(In − S)QT (In − S)2
=
−Q −QT In − S . In − S (3.2)
Proof. set ,
From equation 1.8, we can easily verify that for any closed convex
P ( p) − P (w)2 ≤ ( p − w)T [P ( p) − P (w)] ≤ p − w2 ,
∀ p, w ∈ Rn . (3.3)
Let ϕ1 (z) = y − Sy − s − QT x2 /2 and ϕ2 (z) = y − Sy − s − QT x − v2 /2, where v is defined in equation 2.5. Then ϕ(z) = ϕ1 (z) − ϕ2 (z). i. Obviously ϕ2 (z) is a compound function of the two functions: ψ(w) = w − PV (w)2 /2 and w = y − Sy − s − QT x. According to lemma 3.7 in Smith, Friesz, Bernstein, and Suo (1997), the function ψ(w) is continuously differentiable and ∇ψ(w) = w − PV (w). Thus, their compound function ϕ2 (z) is differential with respect to z, and ∇ϕ2 (z) =
−Q y − Sy − s − QT x − v . (In − S) y − Sy − s − QT x − v
(3.4)
Therefore, ϕ(z) defined in equation 3.1 is also continuously differentiable and ∇ϕ(z) =
−Qv (In − S)v
.
(3.5)
1828
X.-B. Gao and L.-Z. Liao
For any z, z ∈ Rm+n , let v = PV (y − Sy − s − QT x ). Then from equation 3.5, we have (z − z )T [∇ϕ(z) − ∇ϕ(z )] = (v − v )T [(In − S)(y − y ) −QT (x − x )] ≥ v − v 2 , where the last step follows by setting p = y − Sy − s − QT x and w = y − Sy − s − QT x on the left-hand side of equation 3.3. Thus, ϕ(z) is convex on Rm+n by theorem 3.4.5 in Ortega and Rheinboldt (1970). ii. Similar to the proof of lemma 3i, we can prove that ϕ2 (z) is also convex on Rm+n by equation 3.4 and the right-hand side of equation 3.3. Thus, ∀z, z ∈ Rm+n ; we have 1 ϕ1 (z) = ϕ1 (z ) + (z − z )T ∇ϕ1 (z ) + (z − z )T W(z − z ), 2 and ϕ2 (z) ≥ ϕ2 (z ) + (z − z )T ∇ϕ2 (z ) from theorem 3.3.3 in Bazaraa et al. (1993). Therefore, lemma 3ii holds from ϕ(z) = ϕ1 (z) − ϕ2 (z). Remark 2. If V is a closed convex cone, for example, V = {y ∈ Rn |y ≥ 0}, then ϕ(z) = v2 /2 (v is defined in equation 2.5) is continuously differentiable on Rn . However, this may not be true for a general closed convex set V. For example, let V = {x ∈ R1 | − 1 ≤ x ≤ 1}; then 1, if x > 1, v 2 = [PV (x)]2 = x 2 , if −1 ≤ x ≤ 1, 1, if x < −1, and if x > 1, 2x − 1, if −1 ≤ x ≤ 1, 2ϕ(x) = x 2 − [x − PV (x)]2 = x 2 , −2x − 1, if x < −1. Thus [PV (x)]2 = x 2 − [x − PV (x)]2 , and [PV (x)]2 is not differentiable on (−∞, +∞). From lemma 3i, we can define the function 1 G(z, z∗ ) = [(x − x ∗ )T H(x − x ∗ ) + 3(y − y∗ )T S(y − y∗ )] 2 1 + z − z∗ 2 + ϕ(z) − ϕ(z∗ ) − (z − z∗ )T ∇ϕ(z∗ ), 2
(3.6)
A Novel Neural Network
1829
where z∗ ∈ K ∗ is finite and ϕ(z) is defined in equation 3.1. Then we have the following result, which explores some properties for G(z, z∗ ). Lemma 4. Let G(z, z∗ ) be the function in equation 3.6 and W be the matrix in equation 3.2. Then the following is true: i. G(z, z∗ ) is continuously differentiable and convex on Rm+n and 1 µ1 z − z∗ 2 ≤ G(z, z∗ ) ≤ z − z∗ 2 , 2 2
∀z ∈ Rm+n ,
(3.7)
where µ1 = 1 + λmax (W) + max{λmax (H), 3λmax (S)} ≥ 1. ii. ∇G(z, z∗ )T F (z) ≥ F (z)2 /2, ∀z ∈ Rm+n . iii. ∇G(z, z∗ )T F (z) ≥ 2µ2 G(z, z∗ ), {λmin (H), λmin (S)}/µ1 .
∀z ∈ Rm+n ,
where
µ2 = 2min
Proof. i. Obviously, µ1 ≥ 1 by the semidefiniteness of matrices W, H, and S. From lemma 3i, we know that ϕ(z) is continuously differentiable and convex. Thus, G(z, z∗ ), defined in equation 3.6, is also continuously differentiable and convex on Rm+n , and ϕ(z) ≥ ϕ(z∗ ) + (z − z∗ )T ∇ϕ(z∗ ),
∀z ∈ Rm+n
from theorem 3.3.3 in Bazaraa et al. (1993). Therefore, the first inequality in equation 3.7 holds by equation 3.6 since matrices H and S are positive semidefinite. On the other hand, ∀z ∈ Rm+n , we have from lemma 3ii that 1 ϕ(z) ≤ ϕ(z∗ ) + (z − z∗ )T ∇ϕ(z∗ ) + λmax (W)z − z∗ 2 . 2 Thus, G(z, z∗ ) ≤
≤
1 λmax (H)x − x ∗ 2 + 3λmax (S)y − y∗ 2 2 + (1 + λmax (W))z − z∗ 2 µ1 z − z∗ 2 , 2
∀z ∈ Rm+n .
ii. From equations 2.3, 2.5, 3.5, and 3.6, we have ∗
∇G(z, z ) =
(Im + H)(x − x ∗ ) − Q(v − y∗ ) (In + 3S)(y − y∗ ) + (In − S)(v − y∗ )
.
1830
X.-B. Gao and L.-Z. Liao
Then ∀z ∈ Rm+n ; it is straightforward to have ∇G(z, z∗ )T F (z) = 2(x − u)T [(Im + H)(x − x ∗ ) − Q(v − y∗ )] + (y − v)T [2(In + S)(y − y∗ ) − (In − S)(y − v)] = 2(u − x ∗ )T [x − u − H(x − x ∗ ) + Q(v − y∗ )] + 2(x − x ∗ )T H(x − x ∗ ) + 2x − u2 + y − v2 + (y − v)T S(y − v) + 2(y − y∗ )T S(y − y∗ ) + 2(v − y∗ )T [y − v − S(y − y∗ ) − QT (x − x ∗ )] = 2x − u2 + y − v2 + 2(x − x ∗ )T H(x − x ∗ ) + (y − v)T S(y − v) + 2(y − y∗ )T S(y − y∗ ) + 2(u − x ∗ )T (x − u − Hx − h + Qv) + 2(u − x ∗ )T (Hx ∗ + h − Qy∗ ) + 2(v − y∗ )T (y − v − Sy − s − QT x) + 2(v − y∗ )T (Sy∗ + s + QT x ∗ ) ≥
1 F (z)2 + 2(x − x ∗ )T H(x − x ∗ ) 2
+ 2(y − y∗ )T S(y − y∗ ) + (y − v)T S(y − v),
(3.8)
where the last step follows from equation 2.1 (u ∈ U, v ∈ V), (x − Hx − h + Qv − u)T (u − x ∗ ) ≥ 0, by setting w = x − Hx − h + Qv and p = x ∗ ∈ U in equation 1.8, and (y − Sy − s − QT x − v)T (v − y∗ ) ≥ 0 by setting w = y − Sy − s − QT x and p = y∗ ∈ V in equation 1.8, respectively. Therefore, lemma 4ii holds by equation 3.8 since matrices H and S are positive semidefinite. iii. If min{λmin (H), λmin (S)} = 0, then µ2 = 0 from the definition of µ2 , and this result can be obtained by lemma 4ii. If min{λmin (H), λmin (S)} > 0, then µ2 > 0. From equation 3.8 and the right-hand side of equation 3.7, we have ∇G(z, z∗ )T F (z) ≥ 2[λmin (H)x − x ∗ 2 + λmin (S)]y − y∗ 2 ≥ 2 min{λmin (H), λmin (S)}z − z∗ 2 ≥ 2µ2 G(z, z∗ ),
∀z ∈ Rm+n .
A Novel Neural Network
1831
The results in lemma 4 are very important and pave the way for the stability results of neural network 2.4. In particular, neural network 2.4 has the following basic property: Theorem 2. For any z0 ∈ Rm+n , there exists a unique and continuous solution z(t) of neural network 2.4 for all t ≥ 0 with z(0) = z0 . Proof.
From equation 3.3, we have for any closed convex set
P ( p) − P (w) ≤ p − w,
∀ p, w ∈ Rn .
Thus, for any z, z ∈ Rm+n , by equation 2.5 and the above inequality, we have u − u ≤ (Im − H)(x − x ) + Q(v − v ) ≤ Im − H · x − x + Q · v − v and v − v ≤ In − S · y − y + Q · x − x , where v = PV (y − Sy − s − QT x ) and u = PU (x − Hx − h + Qv ). From the above two inequalities and F (z) − F (z ) ≤ 2(x − x + u − u ) +y − y + v − v ,
∀z, z ∈ Rm+n ,
we can see that F (z) is Lipschitz continuous on Rm+n . Thus, the result can be established from theorem 1 in Han et al. (2001). The results of lemma 2 and theorem 2 indicate that neural network model 2.4 is well defined. Now we are in the position to prove the following stability results for this model. Theorem 3. Neural network 2.4 is stable in the sense of Lyapunov, and for any z0 ∈ Rm+n , its trajectory will reach a saddle point of f (x, y) within a finite time when the scaling parameter λ is large enough. In particular, if problem 1.1 has a unique solution, then neural network 2.4 is globally asymptotically stable. Proof. From theorem 2, ∀z0 ∈ Rm+n , let z(t) be the unique and continuous solution of neural network 2.4 for all t ≥ 0 with z(0) = z0 .
1832
X.-B. Gao and L.-Z. Liao
For the function G(z, z∗ ) defined in equation 3.6, we have from lemma 4ii that d λ G(z(t), z∗ ) = −λ∇G(z(t), z∗ )T F (z(t)) ≤ − F [z(t)]2 ≤ 0, dt 2
∀t ≥ 0. (3.9)
From equation 3.9, we know that G(z(t), z∗ ) is monotonically nonincreasing on [0, +∞). Then z(t) − z∗ 2 ≤ 2G(z(t), z∗ ) ≤ 2G(z0 , z∗ ),
∀t ≥ 0
from the first inequality in equation 3.7. Thus, the set {z(t) | t ≥ 0} is bounded. Therefore, there exist a limit point zˆ and a sequence {tn } with 0 < t1 < t2 < . . . < tn < tn+1 < . . . and tn → +∞ as n → +∞ such that lim z(tk ) = zˆ .
(3.10)
k→+∞
On the other hand, ∀s ≥ 0, we have from equation 3.9 that λ 2
s
F [z(t)]2 dt ≤ G(z0 , z∗ ) − G(z(s), z∗ ) ≤ G(z0 , z∗ ).
0
Thus,
+∞
F [z(t)]2 dt ≤
0
2 G(z0 , z∗ ) < +∞. λ
This implies that lim F [z(t)] = 0. Therefore, F (ˆz) = 0 from equation 3.10, t→+∞
so zˆ ∈ K ∗ from lemma 2. By replacing zˆ with z∗ in G(z, z∗ ), we can prove that G(z, zˆ ) ≥ z − zˆ 2 /2 for all z ∈ Rm+n and G(z(t), zˆ ) is monotonically nonincreasing on [0, +∞). From the continuity of G(z, zˆ ), it follows that ∀ε > 0, ∃ δ > 0 such that G(z, zˆ )
0 and δ > 0 such that F [z(t)] ≥ δ for all t ∈ [0, τ ). It follows from equation 3.9 that G[z(t), z∗ ] ≤ G(z0 , z∗ ) −
λ 2
t
F [z(s)]2 ds ≤ G(z0 , z∗ ) −
0
λδτ , 2
∀t ≥ τ.
Therefore, z(t) − z∗ 2 ≤ µ1 z0 − z∗ 2 − λδτ,
∀t ≥ τ
from equation 3.7. Let λ = µ1 z0 − z∗ 2 /(δτ ) in the above inequality; then z(t) − z∗ ≡ 0 for all t ≥ τ . This implies that z(t) ≡ z∗ for all t ≥ τ . In particular, if problem 1.1 has a unique solution z∗ , then K ∗ = {z∗ } since K ∗ = ∅, and for each z0 ∈ Rm+n , the trajectory z(t) with z(0) = z0 will approach z∗ . So neural network 2.4 is globally asymptotically stable at z∗ . Remark 3. Compared with the existing finite-time convergence result for model 1.4 in Xia and Feng (2004) (see theorem 3 in Xia & Feng, 2004), theorem 3 for neural network 2.4 does not require the additional condition that the initial point z0 satisfies (z0 − z∗ )T M(z0 − z∗ ) = 0 or [e(z0 )]T Me(z0 ) = 0, where z∗ ∈ K ∗ and e(z) is defined in equation 2.8. Unlike the existing finite-time convergence result for model 1.4 in Xia (2004) (see remark 1 in Xia, 2004), theorem 3 holds without requiring the positive definiteness of matrix H or S (see examples 3–5 in section 4). When matrices H and S are positive definite, we have the following exponential stability result for neural network 2.4. Theorem 4. If matrices H and S are positive definite, then neural network 2.4 is globally exponentially stable at the unique saddle point of f (x, y). Proof. From the hypothesis of this theorem and K ∗ = ∅, there exists a unique saddle point z∗ ∈ K ∗ for f (x, y). From theorem 2, let z(t) be the unique solution of system 2.4 with z(0) = z0 ∈ Rm+n for all t ≥ 0. Since matrices H and S are positive definite, we have λmin (H) > 0 and λmin (S) > 0. Thus, µ2 > 0. For the function G(z, z∗ ) defined in equation 3.6, it follows from lemma 4iii that d G(z(t), z∗ ) ≤ −2λµ2 G(z(t), z∗ ), dt
∀t ≥ 0.
1834
X.-B. Gao and L.-Z. Liao
Thus, G(z(t), z∗ ) ≤ G(z0 , z∗ )e −2λµ2 t ,
∀t ≥ 0.
This and lemma 4i imply that z(t) − z∗ ≤
2G(z0 , z∗ )e −λµ2 t ≤
√
µ1 z0 − z∗ e −λµ2 t ,
∀t ≥ 0.
Remark 4. Obviously, model 2.4 can be applied to solve a class of linear variational inequality problem LVI(M, q , C) defined in equation 1.3 with M, q , and C being defined in equation 2.2 and H and S being symmetric (see examples 2 and 5 in section 4). Remark 5. The results obtained in this article hold for any closed convex subset of Rn . In particular, the common cases for U or V are (1) n {x ∈ Rn |x ≤ c} (ball or l2 norm constraint) and (2) {x ∈ Rn | i=1 |xi | ≤ d} (l1 norm constraint). 4 Illustrative Examples In this section, five examples are provided to illustrate the theoretical results achieved in section 3 and the simulation results of the dynamical system 2.4. The simulation is conducted in Matlab, and the ODE solver used is ODE45 which is a nonstiff medium-order method. Example 1 shows the effectiveness of the proposed neural network 2.4 for problem 1.1 with infinite number of saddle points. 4.1 Example 1. s = (0, 0)T , H=
Consider problem 1.1 with U = V = {x ∈ R2 |x ≥ 0}, h =
4 −2 −2 1
,
Q=
−1 1 , 1 −2
and
S=
1 −2 −2 4
.
This problem has an infinite number of saddle points x ∗ = (0, 0)T and y∗ = (2r, r )T (r ≥ 0). From the analysis in section 3, we use neural network 2.4 to solve this problem; all simulation results show that neural network 2.4 always converges to one of its equilibrium points. For example, let λ = 100. Figure 2 shows the convergence behavior of the error z(t) − z∗ based on neural network 2.4 with 20 random initial points. Example 2 shows that the proposed neural network 2.4 can be applied to solve a class of linear variational inequality problems.
A Novel Neural Network
1835
3
2.5
||z(t)−z*||
2
1.5
1
0.5
0
0
0.01
0.02
0.03
0.04
0.05 t
0.06
0.07
0.08
0.09
0.1
Figure 2: Convergence behavior of the error z(t) − z∗ based on neural network 2.4 with 20 random initial points in example 1.
4.2 Example 2. Consider the linear variational inequality problem LVI(M, q , C) defined in equation 1.3 with C = {z ∈ R4 | − 8 ≤ zi ≤ 9, i = 1, 2, 3, 4}, 0.1 0.1 0.5 −0.5 0.1 0.1 −0.5 0.5 M= −0.5 0.5 0.2 0.1 , 0.5 −0.5 0.1 0.05
1 −1 and q = 1. −1
This problem has a unique solution z∗ = (1, −1, −2/3, 4/3)T . Let U = V = {x ∈ R2 | − 8 ≤ xi ≤ 9, i = 1, 2}, h = s = (1, −1)T , H=
0.1 0.1 , 0.1 0.1
Q=
−0.5 0.5 , 0.5 −0.5
and
S=
0.2 0.1 . 0.1 0.05
Then model 2.4 can be applied to solve this problem from remark 4. All our simulation results show that neural network 2.4 is asymptotically stable at z∗ . For example, let λ = 100. Figure 3 displays the convergence behavior
1836
X.-B. Gao and L.-Z. Liao 4
3.5
3
||z(t)−z*||
2.5
2
1.5
1
0.5
0
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
t
Figure 3: Convergence behavior of the error z(t) − z∗ based on neural network 2.4 with 20 random initial points in example 2.
of the error z(t) − z∗ based on neural network 2.4 with 20 random initial points. It should be mentioned that neural network 1.4 cannot be used to solve this problem. In fact, Figure 4 shows model 1.4 with initial point (2, 2, 2, 2)T ∈ R4 and λ = 100 is not stable, where the error z(t) − z∗ approaches 0.0178632. Example 3 shows that the proposed neural network 2.4 can be applied to solve large-scale problems. 4.3 Example 3. Consider problem 1.1 with U = {x ∈ R2n | − 1 ≤ x ≤ 1}, V = {y ∈ Rn | − 1 ≤ y ≤ 1}, h = (0, 0, . . . , 0)T ∈ R2n , s = −(1, 1, . . . , 1)T ∈ Rn , S being an n × n zero matrix, 1 0 ... 0 1 −1 0 . . . 0 0 1 0 ... 0 −1 2 −1 . . . 0 0 0 1 ... 0 0 −1 2 . . . 0 0 , and Q = 0 1 . . . 0 . H= . . . . . . . . . . .. .. .. . . .. .. .. .. .. .. 0 0 0 . . . 2 −1 0 0 ... 1 0 0 0 . . . −1 1 2n×2n 0 0 . . . 1 2n×n
A Novel Neural Network
1837
4
3
2 z (t) 4 1 z1(t)
0 z (t) 3 −1
z (t) 2
−2
−3
0
1
2
3 t
4
5
6
Figure 4: Transient behavior of model 1.4 in example 2.
This problem has a unique saddle point x ∗ = (0.5, 0.5, . . . , 0.5)T ∈ R2n and y∗ = (0, 0, . . . , 0)T ∈ Rn . We use neural network 2.4 to solve this problem; all simulation results show that this neural network is asymptotically stable at z∗ . For example, let λ = 100. Figures 5a and 5b show the trajectories of the first 20 components of x and y of neural network 2.4 with 6 random initial points z0 for n = 1500 and 2000, respectively. Next, we compare the proposed model 2.4 with other methods. For simplicity, we let N = Im+n in method 2.10. Figure 6 shows that model 1.4 with initial point (0, 0, . . . , 0)T ∈ R18 and λ = 10 is not stable, where the error z(t) − z∗ approaches 1.0327. Tables 1 and 2 report the numerical results with two different initial points obtained by methods 2.4, 2.7, 2.9, 2.10, and 2.11, respectively, where “Iter”. represents the iterative number; t f denotes the time that the stopping criterion dz /λ < 10−5 is met for dt models 2.4, 2.7, and 2.9; and the stopping rule for methods 2.10 and 2.11 is zk+1 − zk < 10−5 . From Tables 1 and 2, we can see that the proposed method not only provides a better solution but also has a faster convergence than methods 2.7, 2.9, 2.10, and 2.11, except for method 2.10 with θ = 1. Example 4 illustrates that the proposed neural network 2.4 has a faster convergence than other methods.
1838
X.-B. Gao and L.-Z. Liao 2.5
2
1.5
1 x ∼x 1 20 0.5 y1∼ y20
0 −0.5 −1 −1.5 −2 −2.5
0
0.02
0.04
0.06
0.08
0.1 t
0.12
0.14
0.16
0.18
0.2
0.14
0.16
0.18
0.2
(a) n = 1500 3
2
1 x ∼x 1 20 y1∼ y20 0
−1
−2
−3
0
0.02
0.04
0.06
0.08
0.1 t
0.12
(b) n = 2000 Figure 5: Transient behavior of model 2.4 in example 3.
A Novel Neural Network
1839
1 x1∼ x12
0.8
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
y ∼y 1 6
0
1
2
3 t
4
5
6
Figure 6: Transient behavior of model 1.4 in example 3. Table 1: Numerical Results of Example 3 with Initial Point −(1, 1, . . . , 1) ∈ R1800 . Parameters
Iter.
CPU (sec.)
z(t f ) − z∗
λ = 100, t f = 0.0931 λ = 100, t f = 0.1803 λ = 100, θ = 1.8, t f = 0.2631 λ = 100, θ = 1, t f = 0.2516 θ = 1.8 θ = 1 (best θ value) θ = 0.2 θ = 0.2475, ν = 0.99 θ = 0.15, ν = 0.6
81 121 117 89 106 28 105 246 604
3.110 9.641 8.797 6.672 3.578 0.985 3.563 8.291 20.172
5.34 × 10−6 4.08 × 10−6 6.80 × 10−6 1.22 × 10−5 6.35 × 10−6 1.09 × 10−5 5.81 × 10−5 2.43 × 10−5 4.46 × 10−5
Method 2.4 2.7 2.9 2.10
2.11
Example 4. Consider problem 1.1 with U = {x ∈ R4 |xi ≥ 0, i = 1, . . . , 4}, V = {y ∈ R4 |yi ≥ 0, i = 1, 2}, H and S are 4 × 4 zero matrix, h = −(6, 6, 5, 5)T , s = −(0, 0, 10, 5)T , and 1 −2 1 0 −1 30 0 1 Q= 0 0 1 0. 0 001
1840
X.-B. Gao and L.-Z. Liao
Table 2: Numerical Results of Example 3 with Initial Point (2, 2, . . . , 2) ∈ R1800 . Parameters
Iter.
CPU (sec.)
z(t f ) − z∗
λ = 100, t f = 0.0949 λ = 100, t f = 0.1559 λ = 100, θ = 1.8, t f = 0.1568 λ = 100, θ = 1, t f = 0.2599 θ = 1.8 θ = 1 (best θ value) θ = 0.2 θ = 0.2475, ν = 0.99 θ = 0.15, ν = 0.6
73 113 85 89 109 30 113 240 591
3.062 8.203 6.968 6.453 3.688 1.047 3.828 8.032 19.766
5.33 × 10−6 4.08 × 10−6 6.80 × 10−6 1.22 × 10−5 6.12 × 10−6 7.98 × 10−6 5.76 × 10−5 2.42 × 10−5 4.51 × 10−5
Method 2.4 2.7 2.9 2.10
2.11
Then this problem has a unique solution x ∗ = (10, 5, 0, 0)T and y∗ = (0, 0, −6, −6)T . From the analysis in section 3, neural network 2.4 can be applied to solve this problem. In this case, it becomes dz d = dt dt
T x 2 x − PU x − h + QPV T y − Q x − s , = −λ y y − PV y − Q x − s
(4.1)
where x ∈ R4 and y ∈ R4 . All simulation results show that neural network 4.1 is always asymptotically stable at (x ∗ , y∗ ). For example, let λ = 1000. Figure 7 shows that the convergence behavior of the error z(t) − z∗ based on equation 4.1 with 20 random initial points. Next, we compare the proposed model 2.4 with other methods. For simplicity, we let N = Im+n in method 2.10. As a matter of fact, Figure 8 shows that model 1.4 with initial point (2, 2, . . . , 2)T ∈ R8 and λ = 100 is not stable, where the error z(t) − z∗ approaches 1.38113. Tables 3 and 4 report the numerical results obtained by methods 2.4, 2.7, 2.9, 2.10, and 2.11 with two different initial points, respectively, where Iter., t f , and the stopping criterion are the same as for example 3. From Tables 3 and 4, we can see that the proposed method not only provides a better solution but also has a faster convergence than other methods. Example 5 shows that the conditions of theorem 3 in Xia and Feng (2004) are not enough to ensure the finite-time convergence of model 1.4. Example 5. Consider the linear variational inequality problem LVI(M, q , C) defined in equation 1.3 with C = R3 :
1 −1 −1 M = −1 1 −1 , 1 1 0
0 and q = 0 . 1
A Novel Neural Network
1841
16
14
12
||z(t)−z*||
10
8
6
4
2
0
0
0.005
0.01
0.015
t
Figure 7: Convergence behavior of the error z(t) − z∗ based on equation 4.1 with 20 random initial points in example 4. Table 3: Numerical Results of Example 4 with Initial Point −(2, 2, . . . , 2)T ∈ R8 . Parameters
Iter.
CPU (sec.)
z(t f ) − z∗
λ = 1000, t f = 0.0187 λ = 1000, t f = 0.0490 λ = 1000, θ = 1.8, t f = 0.4575 λ = 1000, θ = 1, t f = 0.9082 θ = 1.8 θ = 0.9 (best θ value) θ = 0.2 θ = 0.0329, ν = 0.99 θ = 0.0199, ν = 0.6
177 53,585 745 1281 19,229 2290 4140 20,661 45,813
0.062 13.829 0.312 0.516 1.297 0.171 0.266 1.656 3.672
3.45 × 10−6 5.98 × 10−6 1.60 × 10−5 4.03 × 10−5 1.25 × 10−4 4.76 × 10−4 1.69 × 10−3 3.04 × 10−4 5.01 × 10−4
Method 2.4 2.7 2.9 2.10
2.11
This problem has a unique solution z∗ = (−0.5, −0.5, 0)T . From remark 4, this problem can be solved by model 2.4. In this case, model 2.4 becomes dz d = dt dt
2[Hx − Q y − QT x − s ] x , = −λ y QT x + s
(4.2)
1842
X.-B. Gao and L.-Z. Liao
40
30
20 x
1
10 x2 0
x3,x4,y1,y2 y ,y 3 4
−10
−20
−30
−40
−50
0
0.5
1
1.5
2
2.5 t
3
3.5
4
4.5
5
Figure 8: Transient behavior of model 1.4 in example 4. Table 4: Numerical Results of Example 4 with Initial Point (2, 2, . . . , 2)T ∈ R8 . Parameters
Iter.
CPU (sec.)
z(t f ) − z∗
λ = 1000, t f = 0.0190 λ = 1000, t f = 0.0539 λ = 1000, θ = 1.8, t f = 0.5038 λ = 1000, θ = 1, t f = 0.8924 θ = 1.8 θ = 0.9 (best θ value) θ = 0.2 θ = 0.0329, ν = 0.99 θ = 0.0199, ν = 0.6
141 58,965 785 1245 19,026 2319 5292 20,746 46,007
0.047 13.563 0.312 0.500 1.234 0.156 0.360 1.671 3.719
5.99 × 10−6 7.61 × 10−6 1.22 × 10−5 4.41 × 10−5 1.26 × 10−4 3.85 × 10−4 1.39 × 10−3 3.04 × 10−4 5.02 × 10−4
Method 2.4 2.7 2.9 2.10
2.11
where x ∈ R2 , y ∈ R1 , s = 1, H=
1 −1 , −1 1
and
Q=
1 . 1
All simulation results show that neural network 4.2 is always asymptotically stable at z∗ . For example, let λ = 100. Figure 9 displays the convergence
A Novel Neural Network
1843
3.5
3
2.5
||z(t)−z*||
2
1.5
1
0.5
0
0
0.05
0.1
0.15
0.2
0.25
t
Figure 9: The convergence behavior of the error z(t) − z∗ based on equation 4.2 with 20 random initial points in example 5.
behavior of the error z(t) − z∗ based on equation 4.2 with 20 random initial points. It should be mentioned that model 1.4 cannot be used to solve this problem. When applied to this problem, model 1.4 becomes
dz1 dt
= λ(z2 − z1 + z3 ),
dz2
= λ(z1 − z2 + z3 ),
dt dz3 dt
= −λ(z1 + z2 + 1).
It is easy to verify that the solution of equation 4.3 is √
√ 0 z1 (t) = 2z3 sin(ω1 t) − 2ω2 cos(ω1 t) − 1 + z10 − z20 e −2λt /2, √
√ 0 2z3 sin(ω1 t) − 2ω2 cos(ω1 t) − 1 − z10 − z20 e −2λt /2, z2 (t) = z3 (t) = z30 cos(ω1 t) + ω2 sin(ω1 t),
(4.3)
1844
X.-B. Gao and L.-Z. Liao
√ √ where ω1 = 2λ and ω2 = − 2(z10 + z20 + 1)/2. Obviously equation 4.3 is divergent and has no finite-time convergence when |ω2 | + |z30 | > 0. However, for any z0 ∈ R3 with |ω2 | + |z30 | > 0 and (z0 − z∗ )T M(z0 − z∗ ) + [e(z0 )]T Me(z0 ) > 0 (e(z) is defined in equation 2.8), for example, z0 = (−3, 0, 0)T , the conditions of theorem 3 in Xia and Feng (2004) are satisfied. Thus, the conditions of theorem 3 in Xia and Feng (2004) are not enough to ensure the finite-time convergence of model 1.4 when the model’s trajectory z(t) with z(0) = z0 ∈ U × V does not converge to one of its equilibrium point z∗ . From the above examples and their simulation results, we have the following remark. Remark 6. (i) For model 1.4 the simulation results show that it is stable for example 1, yet unlike model 2.4, its stability and convergence might not be guaranteed when initial point z0 ∈ U × V (see Friesz et al, 1994; Gao, 2003, 2004; Gao et al., 2004; Xia, 2004; Xia & Feng, 2004; Xia & Wang, 2001), even the matrices H and S are positive semidefinite (see examples 2–5). (ii) For method 2.10, computational results for example 3 show that method 2.10 is better than others when θ = 1, yet unlike model 2.4, it is not suitable for parallel implementation due to the choice of the varying parameter γ (z) as mentioned in model 2.9, and its performance depends on the choices of parameters N and θ . (iii) Since H is positive semidefinite and S = 0 in examples 3 to 5, the existing finite-time convergence result for model 1.4 in Xia (2004) cannot be applied to these examples (see remark 1 in Xia, 2004). 5 Conclusion In this letter, we have proposed a new neural network for solving a class of convex quadratic minimax problems by means of its inherent properties. We have shown that the new model is stable in the sense of Lyapunov and converges to an exact saddle point in finite time when matrices H and S are positive semidefinite. Furthermore, the global exponential stability of the proposed neural network is also obtained under certain conditions. Compared with the existing neural networks and two typically numerical methods, the proposed neural network has finitetime convergence, a simpler structure, and lower complexity. Thus, the proposed neural network is more suitable for hardware implementation. Since the new network can be applied directly to solve a class of linear variational inequality problems and a broad set of classes of optimization problems, it has great application potential. Illustrative examples confirm the theoretical results and demonstrate that our new model is reliable and attractive.
A Novel Neural Network
1845
Acknowledgments We are very grateful to the two anonymous reviewers for their comments and constructive suggestions on earlier versions of this article. The research was supported in part by grants from Hong Kong Baptist University, the Research Grant Council of Hong Kong, and NSFC Grant of China No. 10471083. References Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming theory and algorithms (2nd ed.). New York: Wiley. Bouzerdorm, A., & Pattison, T. R. (1993). Neural network for quadratic optimization with bound constraints. IEEE Trans. Neural Networks, 4, 293–304. Friesz, T. L., Bernstein, D. H., Mehta, N. J., Tobin, R. L., & Ganjlizadeh, S. (1994). Day-to-day dynamic network disequilibria and idealized traveler information systems. Operations Research, 42, 1120–1136. Fukushima, M. (1992). Equivalent differentiable optimization problems and descent method for asymmetric variational inequality problems. Mathematical Programming, 53, 99–110. Gao, X. B. (2003). Exponential stability of globally projected dynamic systems. IEEE Trans. Neural Networks, 14, 426–431. Gao X. B. (2004). A novel neural network for nonlinear convex programming. IEEE Trans. Neural Networks, 15, 613–621. Gao, X. B., & Liao, L.-Z. (2003). A neural network for monotone variational inequalities with linear constraints. Physics Letters A, 307, 118–128. Gao, X. B., Liao, L.-Z., & Xue, W. M. (2004). A neural network for a class of convex quadratic minimax problems with constraints. IEEE Trans. Neural Networks, 15, 622–628. Han, Q. M., Liao, L.-Z., Qi, H. D., & Qi, L. Q. (2001). Stability analysis of gradientbased neural networks for optimization problems. J. Global Optim., 19, 363– 381. He, B. S. (1996). Solution and application of a class of general linear variational inequalities. Science in China, series A, 39, 395–404. He, B. S. (1999). Inexact implicit methods for monotone general variational inequalities. Mathematical Programming, 86, 199–217. He, B. S., & Yang, H. (2000). A neural-network model for monotone linear asymmetric variational inequalities. IEEE Trans. Neural Networks, 11, 3–16. Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press. Ortega, T. M., & Rheinboldt, W. C. (1970). Iterative solution of nonlinear equation in several variables. New York: Academic Press. Rockafellar, R. T. (1987). Linear-quadratic programming and optimal control. SIAM J. Control Optim., 25, 781–814. Smith, T. E., Friesz, T. L., Bernstein, D. H., & Suo, Z. G. (1997). A comparative analysis of two minimum-norm projective dynamics and their relationship to variational
1846
X.-B. Gao and L.-Z. Liao
inequalities. In M. C. Ferris, J. S. Pang (Eds.), Complementarity and variational problems: State of art (pp. 405–424). Philadelphia: SIAM. Solodov, M. V., & Tseng, P. (1996). Modified projection-type methods for monotone variational inequalities. SIAM J. Control Optim., 27, 1814–830. Tseng, P. (2000). A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim., 38, 431–446. Xia, Y. S. (2004). An extended projection neural network for constrained optimization. Neural Computation, 16, 863–883. Xia, Y. S., & Feng, G. (2004). On convergence rate of projection neural networks. IEEE Trans. Automatic Control, 49, 91–96. Xia, Y. S., Feng, G., & Wang, J. (2004). A recurrent neural network with exponential convergence for solving convex quadratic program and related linear piecewise equations. Neural Networks, 17, 1003–1015. Xia, Y. S., & Wang, J. (1998). A general methodology for designing globally convergent optimization neural networks. IEEE Trans. Neural Networks, 9, 1331–1343. Xia, Y. S., & Wang, J. (2000). On the stability of globally projected dynamical systems. J. Optim. Theory Appl., 106, 129–160. Xia, Y. S., & Wang, J. (2001). Global asymptotic and exponential stability of a dynamic neural system with asymmetric connection weights. IEEE Trans. Automatic Control, 46, 635–638.
Received August 9, 2004; accepted June 14, 2005.
LETTER
Communicated by Emilio Salinas
Dynamic Gain Changes During Attentional Modulation Arun P. Sripati
[email protected] Department of Electrical and Computer Engineering, Zanvyl-Krieger Mind Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Kenneth O. Johnson
[email protected] Department of Neuroscience, Zanvyl-Krieger Mind Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Attention causes a multiplicative effect on firing rates of cortical neurons without affecting their selectivity (Motter, 1993; McAdams & Maunsell, 1999a) or the relationship between the spike count mean and variance (McAdams & Maunsell, 1999b). We analyzed attentional modulation of the firing rates of 144 neurons in the secondary somatosensory cortex (SII) of two monkeys trained to switch their attention between a tactile pattern recognition task and a visual task. We found that neurons in SII cortex also undergo a predominantly multiplicative modulation in firing rates without affecting the ratio of variance to mean firing rate (i.e., the Fano factor). Furthermore, both additive and multiplicative components of attentional modulation varied dynamically during the stimulus presentation. We then used a standard conductance-based integrate-and-fire model neuron to ascertain which mechanisms might account for a multiplicative increase in firing rate without affecting the Fano factor. Six mechanisms were identified as biophysically plausible ways that attention could modify the firing rate: spike threshold, firing rate adaptation, excitatory input synchrony, synchrony between all inputs, membrane leak resistance, and reset potential. Of these, only a change in spike threshold or in firing rate adaptation affected model firing rates in a manner compatible with the observed neural data. The results indicate that only a limited number of biophysical mechanisms can account for observed attentional modulation. 1 Introduction Recent experiments have shown that attention can modify neuronal responses to relevant stimuli (e.g., Moran and Desimone, 1985; Motter, 1993; Hsiao, O’Shaughnessy, & Johnson, 1993). The analysis presented here was Neural Computation 18, 1847–1867 (2006)
C 2006 Massachusetts Institute of Technology
1848
A. Sripati and K. Johnson
motivated by the effect of attention on neuronal firing in visual area V4 (McAdams & Maunsell, 1999a, 1999b). When attention was directed into receptive fields of these neurons, their firing rates were scaled multiplicatively (almost doubled) without affecting orientation selectivity, and the ratio of spike count variance to mean (i.e., the Fano factor) was unaffected. Our aim was to examine whether similar attentional effects on firing rate are seen in other sensory areas besides visual cortex. A finding that the effects are similar will support the idea that the mechanisms of attention are common throughout cortex. Our second motivation was to understand the unusual nature of the change in gain: multiplying a random variable (in this case, spike count) by a factor g increases its mean and standard deviation by g, but its variance increases by g 2 . In contrast, data from V4 cortex (McAdams & Maunsell, 1999b) indicate that the Fano factor is unaffected, which suggests that the effect of attention is not a simple change in gain. We reasoned that the biophysical mechanisms might be inferred by investigating which of them could produce a gain change without affecting the Fano factor. We analyzed the responses of 144 neurons in the somatosensory cortex (SII) of two monkeys trained to switch their attention between a tactile pattern recognition task and a visual dimming detection task (Steinmetz et al., 2000; Hsiao et al., 1993). Because the tactile stimuli (letters) used in this study did not fall along a continuum, we were unable to compute the change in gain experienced by a single neuron. Instead, we measured the extent of attentional modulation over the entire population of neurons and separated the effect of attention into multiplicative and additive components. In addition, the temporal modulation of the multiplicative and additive components during the trial was examined by repeating this procedure over short time intervals across the trial. We then examined six mechanisms that affect the mean firing rate and variability in a standard integrate-and-fire cortical neuron model (Troyer & Miller, 1997; Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000), with the aim of finding which mechanisms could account for the observed data. Candidate mechanisms were chosen based on their biophysical feasibility and their ability to be modulated rapidly over the timescale required for attention (hundreds of milliseconds). To simplify our analysis, we assumed that the model neuron receives inputs with fixed firing rates and that attentional modulation acts either by modifying an intrinsic biophysical parameter (e.g., spike threshold) or changing the input synchrony. The results indicate that attentional modulation alters either the spike threshold or firing rate adaptation. 2 Methods 2.1 Neurophysiology and Behavioral Tasks. Two macaque monkeys, M1 and M2, were trained to perform both tactile and visual discrimination
Dynamic Gain Changes During Attentional Modulation
1849
tasks (Hsiao et al., 1993; Steinmetz et al., 2000). In the visual task, three white squares appeared on a computer screen, and after a random interval, one of the two outer squares dimmed slightly. To obtain a reward, the monkey was required to detect the dimming and indicate which of the squares dimmed by turning a switch. In the tactile task, the monkey discriminated raised capital letters (6.0 mm high) embossed on a rotating drum (Johnson & Phillips, 1988) that were scanned from proximal to distal across the center of a fingerpad (at 15 mm/sec). To obtain a reward, the animal responded when the letter scanning across the fingerpad matched a target letter displayed on a computer screen. For monkey M1, the target letter remained constant throughout the collection of data from a single neuron. For monkey M2, the task was made more difficult by changing the target letter randomly after each response. Single-unit recordings were obtained from the SII cortex of both monkeys using standard methods. These data have been analyzed for attentional modulation of the firing rate (Hsiao et al., 1993) and synchrony (Steinmetz et al., 2000). 2.2 Statistical Methods. To estimate the 95% confidence intervals shown in Figures 3 and 4, we used a bootstrap method. A random subset of neurons was drawn with replacement, and the slope and intercept were computed as a function of time for that subset. Nonparametric estimates of the variance in the computed slope and intercepts were obtained by repeating the procedure several times. To estimate the significance of the trends in the gain modulation, gaussian random vectors were generated using the means and variances estimated from the bootstrap data. As a result, these random vectors have no temporal trends. We then performed a multivariate analysis-of-variance (manova1, Matlab 7.0) between the bootstrap data and the randomly generated data to determine the significance of the observed trends in gain. 2.3 Conductance-Based Integrate-and-Fire Model. We used a conductance-based integrate-and-fire model as a simplified model of neuronal firing. This model reproduces the Poisson-like variability exhibited by cortical neurons (Salinas & Sejnowski, 2000; Shadlen & Newsome, 1998; Troyer & Miller, 1997). Parameters of this model were examined for their ability to account for the experimental data. The model is driven by excitatory and inhibitory inputs that arrive randomly. Each input spike triggers an exponentially decaying change in conductance that leads to the characteristic postsynaptic current—excitatory postsynaptic current (EPSC) or inhibitory postsynaptic current (IPSC)—at the soma. The membrane potential in the model is governed by a linear differential equation (2.1), driven by postsynaptic currents and by intrinsic adaptation and leak currents. When the membrane potential crosses the threshold Vθ , a spike is registered and the potential is held at a reset
1850
A. Sripati and K. Johnson
Table 1: Conductance Changes Accompanying Each Type of Spike in Equation 2.1. Current (Spike Type) SRA (output spike at t = t out j ) AMPA (excitatory spike at t = t jE ) GABA (inhibitory spike at t = t jI )
Conductance Change Due to Spike t−t out j , t > t out gSRA = g¯ SRA exp − τSRAj j t−t jE j gAMPA = g¯ AMPA exp − τAMPA , t > t jE j
gGABA = t−t I t−t I g¯ GABA exp − 1 j − exp − 2 j , t > t jI D τGABA
τGABA
j
Note: D is adjusted to make gGABA = g¯ GABA at t = 0.
potential, Vreset , for a time equal to the refractory period: C
dV = −gL (V − EL ) − gSRA (V − E K ) − gAMPA (V − EAMPA ) dt −gGABA (V − E GABA ).
(2.1)
In the above equation, C is the membrane capacitance, and g L is the conductance due to leak channels. The membrane potential is influenced by three types of currents: a current due to fast excitatory synapses (AMPA), a current due to fast inhibitory synapses (GABA), and a spike rate adaptation current (SRA). The conductances gSRA , gAMPA , and gGABA are the sums of the conductances evoked by past output spikes and excitatory and inhibitory input spikes, respectively (see Table 1). The arrival of the inputs is stochastic, and as a result, the membrane potential executes a random walk toward the threshold (Troyer & Miller, 1997). In the baseline condition, the model produces a firing rate of 40 spikes per second when driven by 160 excitatory inputs and 40 inhibitory inputs firing at 40 spikes per second. Input synchrony was controlled as described below. Model parameters for the baseline condition are listed in the appendix and are based on in vitro electrophysiology (McCormick, Connors, Lighthall, & Prince, 1985). 2.4 Input Synchrony. We used a method in which the synchrony of the inputs could be systematically modulated independent of their firing rates (Salinas & Sejnowski, 2000). Briefly, each input is modeled as a random walk to threshold. When the random walk reaches threshold, a spike is registered, and the random walk is reset. Synchrony between spikes is controlled using the degree of correlation between the random walks. However, changing the correlation between random walks also affects the firing rate. Therefore, we adjusted the standard deviation of individual steps in the random walks to maintain a constant firing rate (Salinas & Sejnowski,
2000). Synchrony between excitatory-excitatory (E-E), excitatory-inhibitory (E-I), and inhibitory-inhibitory (I-I) input pairs was controlled using the corresponding correlations φ_E, φ_EI, and φ_I between the random walks.

3 Results

3.1 Observed Changes in Population Firing. We recorded the responses of 178 neurons from the somatosensory cortex (SII) across two monkeys, M1 and M2. Eighty percent (144/178) of these neurons were selected for further analysis based on the following criteria. First, at least two trials were required in both attended and unattended conditions (176/178). Among these neurons, we found that poorly isolated cells had a very high spike count variance (due to huge variations in the number of recorded spikes from trial to trial); these cells were eliminated by requiring the variance of the spike count to be less than 10 times the mean (144/176). All cells had a firing rate greater than 5 impulses per second. Our analysis was restricted to a 2.6 second time window during which the target stimulus was scanned across the skin. The target letter contacted the skin at a time t = 1.4 seconds after trial onset.

Because the raised letter stimuli used in this study did not fall along any continuum, we were unable to compute the change in gain experienced by a single neuron. Instead, we measured the extent of attentional modulation over the entire population of neurons. We separated the effect of attention into multiplicative and additive components by plotting the spike count for each neuron in the attended condition versus the spike count in the unattended condition. In the plot, a deviation in slope from unity would indicate a multiplicative modulation (i.e., gain change) due to attention over the entire population of neurons, whereas a constant shift (positive or negative y-intercept) would indicate an additive modulation.

Since neurons in SII cortex are known to exhibit both increases and decreases in firing rates with changes in attention (Hsiao et al., 1993), we separated the neurons into three categories in M1 and M2 (t-test for unequal variances for firing rates, p < 0.05): (1) neurons that did not show any significant modulation during the first 600 ms after the stimulus onset—52% (31/60) in M1 and 27% (23/84) in M2; (2) neurons that increased their firing rates during this period—27% (16/60) in M1 and 44% (37/84) in M2; and (3) neurons that decreased their firing rate during this period—22% (13/60) in M1 and 29% (24/84) in M2. Neurons from the second and third categories were selected for analysis of dynamic changes in gain and variability.

Figure 1 shows the change in firing rates with a shift in attention during a 200 ms time period that resulted in the maximum gain change for neurons in monkeys M1 (see Figure 1A) and M2 (see Figure 1B) from the second category (i.e., those that increased their firing rates in the tactile task).
[Figure 1 here: firing rate in the tactile task versus firing rate in the visual task (spikes/s), with the identity line y = x and fitted lines y = 1.86x − 3.77 (A, monkey M1) and y = 1.70x + 0.85 (B, monkey M2).]
Figure 1: Effect of attention on the firing rate during the 200 ms period of maximum gain change. (A) Monkey M1. Spikes are counted between t = 1.8 and 2.0 seconds. (B) Monkey M2. Spikes are counted between t = 0.8 and 1.0 seconds. Data in both panels are from neurons that increased their firing significantly during the peak firing epoch (p < 0.05, t-test). Best-fitting slopes, with 95% confidence intervals, were 1.85 ± 0.23 (M1) and 1.70 ± 0.11 (M2). The corresponding y-intercepts were 3.77 ± 4.26 (M1) and 0.85 ± 2.18 (M2).
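The decomposition in Figure 1 amounts to an ordinary least-squares fit of attended against unattended spike counts. A minimal sketch, including the bootstrap over neurons from section 2.2 (we return percentile intervals, whereas the paper describes estimating the variance of the resampled fits):

```python
import numpy as np

def attention_gain(att, unatt, n_boot=1000, seed=0):
    """Slope (multiplicative) and intercept (additive) of attended versus
    unattended spike counts, as in Figure 1, with bootstrap 95% intervals
    over neurons (section 2.2). att, unatt: per-neuron spike counts."""
    att, unatt = np.asarray(att), np.asarray(unatt)
    slope, intercept = np.polyfit(unatt, att, 1)
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(att), len(att))  # resample neurons with replacement
        boot.append(np.polyfit(unatt[idx], att[idx], 1))
    lo, hi = np.percentile(np.asarray(boot), [2.5, 97.5], axis=0)
    return (slope, intercept), (lo, hi)            # lo/hi each hold (slope, intercept)
```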
Thus, for example, a neuron in monkey M1 with a firing rate of 10 impulses per second in the visual (unattended) task will have a firing rate of 14.8 impulses per second in the tactile task, with the multiplicative component contributing most of the increase. We then plotted spike count variance against the mean spike count in the same time period for the unattended and attended conditions (see Figure 2).
[Figure 2 here: log-log plots of spike count variance versus spike count for monkeys M1 (A) and M2 (B).]
Figure 2: Effect of attention on the variability of neuronal responses. (A) Monkey M1. (B) Monkey M2. Plots show the spike count variance versus spike count for neurons whose spike rate was increased by the focus of attention (data from neurons as in Figure 1). Crosses correspond to neurons in the attended condition, and dots correspond to the unattended condition. Fitted powers (with 95% confidence intervals): M1: attended, 1.0 ± 0.13; unattended, 1.1 ± 0.11. M2: attended, 1.2 ± 0.09; unattended, 1.1 ± 0.08. Coefficients (with 95% confidence intervals): M1: attended, 1 ± 0.24; unattended, 1.4 ± 0.21. M2: attended, 1.2 ± 0.16; unattended, 1.5 ± 0.13.
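The fitted powers and coefficients in the caption describe a power law, variance = a · mean^b. One conventional way to estimate it, assumed here because the fitting procedure is not spelled out in the text, is linear regression in log-log coordinates:

```python
import numpy as np

def fit_power_law(mean_count, var_count):
    """Fit var = a * mean^b by least squares in log-log coordinates,
    one standard way to obtain the powers and coefficients of Figure 2."""
    b, log_a = np.polyfit(np.log(mean_count), np.log(var_count), 1)
    return np.exp(log_a), b
```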
The mean firing rate in the attended condition (M1, 20.4; M2, 22.6 spikes/s) was more than 1.5 times larger than in the unattended condition (M1, 12.9; M2, 12.8 spikes/s). Furthermore, the relationship between the mean spike count and the spike count variance is unaffected by the attentional state (i.e., the regressions in Figure 2 are not significantly different from one another). These results agree well with attentional effects observed in the visual cortex (see Figure 4 of McAdams & Maunsell, 1999b).

3.2 Dynamic Modulation of Additive and Multiplicative Components. We examined the temporal variations of additive and multiplicative modulation by repeating this procedure for successive 200 ms windows across the duration of the trial. Additive and multiplicative effects were modulated dynamically during the trial in both monkeys (see Figures 3 and 4). Although trends in both additive and multiplicative modulations were significant (MANOVA, p < 0.0005; see section 2), the effects were predominantly multiplicative in both monkeys (except among the neurons in monkey M1 that experienced a decrease in firing rate). The time course of multiplicative and additive modulation was different in the two monkeys. In monkey M1, multiplicative gain reached a maximum 0.6 seconds before stimulus onset, whereas the additive component peaked roughly 0.2 seconds after stimulus onset. In monkey M2, maximum multiplicative gain occurred 0.5 seconds after stimulus onset, and additive gain peaked 0.3 seconds after stimulus onset. Similar modulations were observed among the subpopulation of neurons that decreased their firing rates. Despite the changes in multiplicative and additive modulation across the trial, we did not observe significant modulations in the Fano factors in any subpopulation of neurons.

3.3 Mechanisms of Gain Change in a Model Neuron. The analysis of SII data shows that in general, many neurons exhibit a multiplicative gain change when the animal shifts its focus of attention (see Figure 1), and that the Fano factor is unaffected by the gain change (see Figure 2). We performed computer simulations on a model cortical neuron to identify which mechanisms could account for a multiplicative gain change without affecting the Fano factor. With independent Poisson inputs, the standard balanced conductance-based model generates Poisson-like output with a Fano factor of 1.0 (Shadlen & Newsome, 1998). To replicate the Fano factor of 2.0 seen in our data, we increased the level of input synchrony and set φ_E = φ_I = 0.2. This corresponds to a 20% common drive for all inputs, a correlation that is commonly seen in neuronal data (e.g., Britten, Shadlen, Newsome, & Movshon, 1992; Shadlen, Britten, Newsome, & Movshon, 1996). We confirmed that our simulations did not depend on the exact Fano factor chosen by repeating them for other levels of input synchrony (data not shown).
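The synchrony manipulation of section 2.4 can be sketched as follows, after Salinas and Sejnowski (2000): each input is a random walk to threshold whose steps share a common gaussian term with correlation φ. The step size σ below is an arbitrary choice of ours, we omit the calibration loop the authors used to adjust σ so that the firing rate stays constant as φ changes, and the sketch treats a single pool, whereas the paper controlled φ_E, φ_EI, and φ_I separately.

```python
import numpy as np

def correlated_walk_inputs(n_inputs, phi, rate, T=1000.0, dt=0.05,
                           n_theta=40.0, n_reset=20.0, sigma=3.0, seed=0):
    """Threshold-crossing random walks whose steps share a common gaussian
    term with pairwise correlation phi (after Salinas & Sejnowski, 2000).
    Returns a list of spike-time arrays (ms), one per input."""
    rng = np.random.default_rng(seed)
    # Drift per step that covers the reset-to-threshold distance at the
    # target rate; noise makes the realized rate only approximately `rate`.
    drift = (n_theta - n_reset) * rate * dt / 1000.0
    x = np.full(n_inputs, n_reset)
    spikes = [[] for _ in range(n_inputs)]
    for k in range(int(T / dt)):
        shared = rng.standard_normal()          # common term, correlation phi
        private = rng.standard_normal(n_inputs)
        x += drift + sigma * (np.sqrt(phi) * shared + np.sqrt(1.0 - phi) * private)
        for i in np.flatnonzero(x >= n_theta):
            spikes[i].append(k * dt)            # register a spike and reset
        x[x >= n_theta] = n_reset
    return [np.array(s) for s in spikes]
```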
[Figure 3 here: time courses of the multiplicative component (dimensionless) and the additive component (spikes/s) for monkey M1, panels A and B; time axis 0–2 seconds, stimulus onset at 1.4 seconds.]
Figure 3: Multiplicative and additive components of attentional modulation observed in monkey M1 within two subpopulations. (A) Neurons that increased their firing rates during the first 600 ms of the stimulus. (B) Neurons from monkey M1 that significantly decreased their firing rates during the tactile (attended) task. Dotted lines indicate 95% confidence intervals computed using bootstrap.
We selected parameters of the model neuron such that the output firing rate would approximately equal that of the individual inputs over a physiological range (0–100 spikes/s). We then analyzed six putative mechanisms, each corresponding to a single parameter in the model (e.g., firing threshold), in two steps. First, the mechanism was used to double the firing rate from 40 to 80 spikes per second while leaving the input firing rates
[Figure 4 here: time courses of the multiplicative component (dimensionless) and the additive component (spikes/s) for monkey M2, panels A and B; time axis 0–2 seconds.]
Figure 4: Multiplicative and additive components of attentional modulation for monkey M2 neurons, plotted analogously to the monkey M1 data (see Figure 3).
unchanged. Second, we examined the ability of the mechanism to modulate the slope of the input-output relationship without affecting the Fano factor (see Figure 5). A multiplicative modulation in the input-output relationship corresponds to a gain change across a population of neurons, each receiving inputs firing at different rates. Thus, we were able to determine whether a given mechanism produced an effect on the firing rate that was consistent with the data.
[Figure 5 here: panels A and C, output rate (spikes/s) versus excitatory input rate (spikes/s); panels B and D, firing variance (spikes²/s²) versus firing rate (spikes/s) on log-log axes.]
Figure 5: Two types of gain change observed in the conductance-based integrate-and-fire model neuron. (A) Input-output relationship for two different values of threshold. The solid line corresponds to the baseline value (−54 mV) and the dotted line to a lowered threshold (−55 mV). (B) Spike count variance plotted against output rate for each of these data points (dots: baseline; crosses: lowered threshold). The solid line represents the relationship (y = x) expected of a Poisson process. (C, D) Corresponding plots for gain produced by increasing the membrane time constant by four times to 80 ms.
Note that although the analysis is shown for populations of neurons that undergo an increase in gain, it is equally applicable to a population of neurons that experience a decrease in gain. Table 2 summarizes the effect of each mechanism on the slope of the input-output relationship and the Fano factor. Figure 5 illustrates the two types of gain change produced in the model neuron. A slight change in threshold produced a multiplicative gain change without affecting the Fano factor (see Figures 5A and 5B). On the other hand, changing the membrane time constant had an additive effect on the input-output relationship, without affecting the Fano factor.
Table 2: Parameter Changes Required to Double the Output Firing Rate and Their Effects on Gain and Fano Factor.

Parameter | % Change in Parameter Needed | Fano Factor at 40 spikes/s After Change | Multiplicative Gain (Slope at 40 spikes/s) | Consistent with Data?
1. Reset potential | Not possible | NA | NA | No
2. Synchrony of E inputs | +50% | 3.85 | 1.8X | No
3. Synchrony between all input pairs | Not possible | NA | NA | No
4. Membrane time constant | +300% | 1.5 | 1X | No
5. Adaptation conductance | −100% | 2.0 | 2X | Yes
6. Threshold | −1.8% | 2.0 | 2X | Yes

Note: Missing values (NA) indicate that it was not possible to double the firing rate using that parameter.
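Purely as an illustration of the two-step test (not the authors' protocol), the sketches above can be chained to estimate the input-output slope under a candidate parameter change. The rate triplet, simulation length, crude linear fit, and the 1 mV threshold shift below are our choices:

```python
import numpy as np

# Reuses correlated_walk_inputs and simulate_lif from the sketches above;
# alpha = 1.7 sets the inhibitory input rate, as in the appendix.
def io_slope(V_theta=-54.0, gbar_SRA=3.5, rates=(20.0, 40.0, 60.0), T=5000.0):
    out = []
    for r in rates:
        exc = np.concatenate(correlated_walk_inputs(160, 0.2, r, T=T))
        inh = np.concatenate(correlated_walk_inputs(40, 0.2, 1.7 * r, T=T, seed=1))
        spk = simulate_lif(exc, inh, T=T, V_theta=V_theta, gbar_SRA=gbar_SRA)
        out.append(1000.0 * len(spk) / T)       # output rate in spikes/s
    slope, _ = np.polyfit(rates, out, 1)        # crude fit of the I/O curve
    return slope

# Multiplicative gain of a 1 mV threshold shift relative to baseline:
gain = io_slope(V_theta=-55.0) / io_slope()
```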
We found that two of the six mechanisms had only weak effects on the firing rate but adverse effects on output variability; these were the reset potential and synchrony between all input pairs (see Figure 6). Changes in synchrony between all input pairs result in a cancellation of the effects of excitatory and inhibitory inputs; this effect has been discussed in detail by Salinas and Sejnowski (2000). On the other hand, changes in excitatory input synchrony produced a multiplicative gain change but a disproportionate increase in output variability (see Figures 7A and 7B and Table 2). Changes in inhibitory input synchrony produced effects that were identical to those produced by changes in excitatory input synchrony (data not shown). Finally, changes in firing rate adaptation produced a multiplicative change in firing while leaving the Fano factor unaffected (see Figures 7C and 7D). The results above also indicate that each constraint on attentional modulation can be violated independently: the membrane time constant did not affect the Fano factor but produced an additive increase in firing, whereas excitatory synchrony produced a gain change but had a large effect on the Fano factor. In summary, of the six mechanisms considered here, only threshold and firing rate adaptation modulated firing rates in a manner consistent with a shift in attention.

4 Discussion

Understanding how neural representations are affected by attentional state is a topic of intense investigation (for a review, see Reynolds & Chelazzi, 2004). However, it is unclear how attentional state affects neural coding
[Figure 6 here: panel A, output rate (spikes/s) versus reset potential (mV, −56 to −54); panel C, output rate versus the correlations φ_E, φ_I; panels B and D, firing variance versus firing rate on log-log axes.]
Figure 6: Effect of reset potential and synchrony between all pairs of inputs on output firing. (A) Output rate versus reset potential. (B) Output spike count variance versus output firing rate for these data. (C, D) The corresponding figures for increases in synchrony between all pairs of inputs (E-E, I-I, E-I). Input synchrony was increased without affecting the input firing rates (see section 2).
and what mechanisms underlie the modulation. Our analysis shows that attention multiplicatively modulates the population firing in a dynamic manner during a trial without affecting the firing variability; in addition, the unique nature of the observed gain change limits the possible mechanisms underlying attentional modulation.

4.1 Dynamic Gain Modulation. Many studies have shown that a shift in the focus of attention modulates neuronal responses over timescales of hundreds of milliseconds (Hsiao et al., 1993; Luck, Chelazzi, Hillyard, & Desimone, 1997; McAdams & Maunsell, 1999b). We observed a similar time course for the gain of neuronal responses (see Figures 3 and 4), which rose to a peak of about 1.5 times and decreased to 1.0 times over a period of
[Figure 7 here: panels A and C, output rate (spikes/s) versus excitatory input rate (spikes/s); panels B and D, firing variance (spikes²/s²) versus firing rate (spikes/s) on log-log axes.]
Figure 7: Effect of spike rate adaptation and excitatory input synchrony on output firing. (A) Change in the input-output relationship produced by an increase in synchrony from zero (solid) to φ_E = 0.2 (dotted). (B) Output spike count variance versus output firing rate for the corresponding data points. Changing excitatory input synchrony produced a Fano factor of 3.85 at an output rate of 40 spikes per second. (C, D) Corresponding plots for the spike rate adaptation conductance g_SRA. The baseline model firing rate was elevated from 40 to 80 spikes per second by reducing the adaptation conductance to zero.
about 500 ms. We found different time courses for gain in the two monkeys, which could be due to task-related differences. In monkey M1, the target letter remained constant throughout, whereas in monkey M2, the target letter changed after each trial. Thus, the attentional focus of M1 tended to wax and wane since the animal knew that target letters never followed each other in succession. In contrast, the attentional focus of M2 remained consistently high throughout all tactile trials. Therefore, we hypothesized that the peak in gain observed in monkey M1 prior to stimulus onset may be related to stimulus expectation.
Despite the dynamic modulation of gain, the relationship between spike count variance and mean was unaffected by attentional state. Because the firing rates of neurons varied during the trial, we selected a short interval of 200 ms over which the rate remained relatively constant. Although the Fano factor is known to depend on the counting duration (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997), we found little variation over the short timescales required for the analysis (0.1–0.5 s). The observation that spike count variance is proportional to the mean is thus supported not only across stimulus conditions (Shadlen & Newsome, 1998) but also across variations in attentional state. Therefore, the discharge variability may well be an intrinsic property rather than a function of the input.

4.2 Mechanisms Underlying Attentional Modulation. The main result is that the observed change in gain and variability with attentional state limits the possible mechanisms that can produce this modulation. We selected six biophysical mechanisms that we hypothesized could account for the neural data. Although we did not systematically test whether the effect could be due to some combination of these mechanisms, it is likely that attention functions through one or more of these possibilities. Out of the six mechanisms considered, only changes in spike threshold or in firing rate adaptation produced a gain modulation without affecting the Fano factor. Primarily, these parameters affect the distance between the threshold and the steady-state membrane potential to bring about a multiplicative gain change. The following mechanisms were found inconsistent with the observed data: changes in (1) reset potential, (2) synchrony between all input pairs, (3) synchrony between excitatory or inhibitory inputs, and (4) membrane time constant. The first two mechanisms affect the variability of the membrane potential about the steady state and produce an increase in the discharge variability. Changing the membrane time constant had an additive effect on firing rate because of the diminishing contribution of the leak current with increasing input firing rates.

Biophysical mechanisms underlying attentional modulation have been investigated in relatively few studies. Reynolds, Chelazzi, and Desimone (1999) proposed that attention may increase the effective synaptic strength, although how this might occur biophysically is unclear. Niebur and Koch (1994) first suggested that input correlations may play a role in attentional modulation. Furthermore, input synchrony has a multiplicative effect on firing rate (Salinas & Sejnowski, 2000; Tiesinga, Fellous, Salinas, Jose, & Sejnowski, 2004). However, input synchrony dramatically affects the output variability, as reported here as well as in both computational and in vitro studies (Stevens & Zador, 1998; Salinas & Sejnowski, 2000; Feng & Brown, 2000; Harsch & Robinson, 2000). Finally, Chance and Abbott (2002) have shown in vitro that balanced synaptic input can have a multiplicative effect on firing rate (see below).
Although the observation that attention modulates neuronal gain is relatively recent (McAdams & Maunsell, 1999a), gain modulation itself is a widespread phenomenon (Salinas & Thier, 2000). Simulations using biophysical neuronal models have shown that balanced synaptic input has a modulatory effect on the overall gain and variability (Burkitt, Meffin, & Grayden, 2003). Finally, the impact of Poisson inputs on the output firing rate and variability may differ depending on whether an integrate-and-fire model or a Hodgkin-Huxley model is used (Tiesinga, Jose, & Sejnowski, 2000). Further in vitro studies using dynamic clamp methods need to be performed to develop a statistical description of the output as a function of the inputs (cf. Chance & Abbott, 2002).

4.3 The Neuron Model. We used a conductance-based integrate-and-fire model for cortical neurons with parameters derived from in vitro electrophysiology (McCormick et al., 1985). While the essential features of the integration of synaptic EPSPs and IPSPs on the soma are present in the model, the nonlinear dynamics of spiking are lumped together into a threshold (Koch, 1999). The assumption of a fixed threshold is justified because a substantial part of the discharge variability can be attributed to irregular synaptic input (Calvin & Stevens, 1968; Mainen & Sejnowski, 1995; Nowak, Sanchez-Vives, & McCormick, 1997).

We used balanced input to produce high discharge variability in the baseline condition (Shadlen & Newsome, 1998). Due to the balanced input, the membrane potential rapidly achieves a steady-state value after an action potential; the subsequent random walk to the threshold produces a Poisson-like discharge variability (Troyer & Miller, 1997). Cortical neurons are not far from an approximate balance; for example, receptive fields of area 3b neurons in the somatosensory cortex show roughly equal strengths of inhibition and excitation, with a median ratio close to 0.8 (DiCarlo, Johnson, & Hsiao, 1998). Our results were independent of balance as long as input synchrony was adjusted to yield the requisite discharge variability at baseline. This is important since changing input balance affects response selectivity, whereas a gain change leaves the selectivity unaffected (e.g., McAdams & Maunsell, 1999a). Therefore, we did not consider input balance to be a putative mechanism, since attention affects only the gain of a neuron's response without changing its selectivity.

We made several assumptions regarding the nature of the inputs to the model. First, we used input correlations and balanced synaptic input to obtain the high discharge variability seen in our data (Salinas & Sejnowski, 2000; Shadlen & Newsome, 1998). Second, we assumed inputs to be unaffected by attentional state. Although attentional effects increase progressively along the sensory pathway (Hsiao et al., 1993; Reynolds & Chelazzi, 2004), it is unclear whether attentional modulation in lower cortical areas has a feedforward effect on higher cortical areas. Therefore, we considered the simplified situation in which input firing rates are unaffected
by attentional state. We also sought to distinguish the impact of synchrony from that of the firing rate in order to identify how each affects gain and variability. It is unclear whether increasing the gain in a population of neurons will affect their synchrony. It is likely that the underlying mechanisms are constrained further by a network model that reproduces attentional effects on gain and variability reported here, as well as on pair-wise synchrony reported previously (Steinmetz et al., 2000).

4.4 Biophysical Feasibility. Each of the above mechanisms was selected based on its ability to be modulated in a biophysically feasible manner, over durations of hundreds of milliseconds.

4.4.1 Reset Potential. The reset potential is determined by the balance between sodium and potassium conductances. This can be modulated by a specific potassium channel conductance (e.g., the M-type K+ channel; Hille, 2001).

4.4.2 Input Synchrony. Increased gamma-band oscillations can lead to increased synchrony, as has been observed recently (Fries, Reynolds, Rorie, & Desimone, 2001). However, we have found that a change in the synchrony alone (without affecting the firing rates) increases the firing variability far beyond physiological levels. Our results are consistent with in vitro experiments (Stevens & Zador, 1998; Harsch & Robinson, 2000) as well as with other theoretical and modeling studies (Salinas & Sejnowski, 2000; Feng & Brown, 2000).

4.4.3 Membrane Time Constant. The leak resistance can be modulated by a change in the properties of K-channels. For example, the M-type potassium current can be affected at short timescales and can be modulated by acetylcholine inputs (Hille, 2001).

4.4.4 Adaptation Conductance. Calcium-triggered potassium channels are responsible for adaptation in the firing rates of neocortical neurons (McCormick et al., 1985; Hille, 2001). These channels can be modulated by second messenger effects; for example, acetylcholine or norepinephrine inputs can lead to a change in channel properties over hundreds of milliseconds (Hille, 2001). There is evidence that cholinergic modulation is related to the level of attention (Kodama, Honda, Watanabe, & Hikosaka, 2002; Davidson & Marrocco, 2000). Computational studies also suggest a possible role for feedback modulation in controlling attention, which can be controlled by the extent of firing rate adaptation (Wilson, Krupa, & Wilkinson, 2000; Rothenstein, Martinez-Trujillo, Treue, Tsotsos, & Wilson, 2002).

4.4.5 Threshold. A constant depolarizing current can affect the threshold for production of an action potential (Johnston & Wu, 1995). This can be
achieved by a tonic input from other cortical areas. This input need not even be stimulus dependent. A similar observation has been made by Chance and Abbott (2002), who find that the level of balanced background synaptic input to neurons in vitro offsets the stimulus-driven current, bringing the steady state closer to threshold, and produces a multiplicative gain change. They too report that the output variability (measured by the coefficient of variation) is unaffected, which is consistent with our findings. An increase in tonic input is consistent with the findings of Luck et al. (1997), in which the spontaneous firing rates of neurons in V2 and V4 were found to increase by 30% to 40% when attention was directed into the receptive field.

5 Conclusions

Our results show that in the secondary somatosensory cortex, neurons experience a multiplicative increase in firing rates during attentional modulation while preserving the Fano factor. These effects are similar to those seen in visual area V4 (McAdams & Maunsell, 1999b) and suggest that the mechanisms of attention are common to both sensory areas. While there have been many studies of the effects of attention on perception and the underlying neural responses (Kastner & Ungerleider, 2000), the mechanisms that underlie the modulation of neuronal responses at the single-neuron level are unknown. In this study, we employ a deductive approach to identifying the mechanisms responsible for attentional modulation at the single-neuron level and conclude that attention causes changes in spike threshold or in firing rate adaptation. Further studies are required to understand how changes in these mechanisms account for other effects of selective attention, such as the modulation of synchrony with attentional state (Steinmetz et al., 2000).

Appendix

Model neuron parameters at baseline (i.e., output rate = 40 spikes/s, Fano factor = 2.0):

Conductances (in nS): ḡ_SRA = 3.5; ḡ_AMPA = 2.01; ḡ_GABA = 27.85; g_L = 25

Time constants (in ms): τ_m = 20; τ_refrac = 1.72; τ_SRA = 100; τ_AMPA = 5; τ_GABA^1 = 5.6; τ_GABA^2 = 0.28
Reversal potentials (in mV): E_K = −80; E_L = −74; E_AMPA = 0; E_GABA = −61; V_rest = V_reset = −60; V_θ = −54

Numbers of inputs: M_E = 160 excitatory, M_I = 40 inhibitory

Input rates (in spikes/s): excitatory rate λ_E = 40; inhibitory rate λ_I = αλ_E, with α = 1.7

Parameters for all random walks: N_reset = 20, N_θ = 40, N_rest = 0

Input synchrony: φ_E = φ_I = 0.2

The subthreshold equation was numerically integrated with a time step of 0.05 ms.

Acknowledgments

We thank Steven Hsiao, Ernst Niebur, and Alfredo Kirkwood for helpful discussions. This work was supported by NIH grants NS34086 and NS18787.

References

Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4765.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motor neuron inter-spike intervals. J. Neurophysiol., 31, 574–587.
Chance, F. S., & Abbott, L. F. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Davidson, M. C., & Marrocco, R. T. (2000). Local infusion of scopolamine into intraparietal cortex slows covert orienting in rhesus monkeys. J. Neurophysiol., 83, 1536–1549.
DiCarlo, J. J., Johnson, K. O., & Hsiao, S. S. (1998). Structure of receptive fields in area 3b of primary somatosensory cortex in the alert monkey. J. Neurosci., 18(7), 2626–2645.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12, 671–692.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Harsch, A., & Robinson, H. P. C. (2000). Postsynaptic variability of firing in rat neocortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192.
Hille, B. (2001). Ion channels of excitable membranes (3rd ed.). Sunderland, MA: Sinauer.
Hsiao, S. S., O'Shaughnessy, D. M., & Johnson, K. O. (1993). Effects of selective attention on spatial form processing in monkey primary and secondary somatosensory cortex. J. Neurophysiol., 70(1), 444–447.
Johnson, K. O., & Phillips, J. R. (1988). A rotating drum stimulator for scanning embossed patterns and textures across the skin. J. Neurosci. Methods, 22, 221–231.
Johnston, D., & Wu, S. M. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Kastner, S., & Ungerleider, L. G. (2000). Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience, 23, 315–341.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Kodama, T., Honda, Y., Watanabe, M., & Hikosaka, K. (2002). Release of neurotransmitters in the monkey frontal cortex is related to level of attention. Psychiatry and Clinical Neurosciences, 56, 341–342.
Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2 and V4 of macaque visual cortex. J. Neurophysiol., 77, 24–42.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
McAdams, C. J., & Maunsell, J. H. R. (1999a). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. Journal of Neuroscience, 19, 431–441.
McAdams, C. J., & Maunsell, J. H. R. (1999b). Effects of attention on the reliability of individual neurons in the monkey visual cortex. Neuron, 23, 765–773.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. J. Comput. Neurosci., 1, 141–158.
Nowak, L. G., Sanchez-Vives, M. V., & McCormick, D. A. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cerebral Cortex, 7, 487–501.
Reynolds, J. H., & Chelazzi, L. (2004). Attentional modulation of visual processing. Annual Review of Neuroscience, 27, 611–647.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19, 1736–1753.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rothenstein, A. L., Martinez-Trujillo, J. C., Treue, S., Tsotsos, J. K., & Wilson, H. R. (2002). Modeling attentional effects in cortical areas MT and MST of the macaque monkey through feedback loops. Society for Neuroscience Abstracts, 559.7.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. J. Neurosci., 16, 1486–1510.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217.
Tiesinga, P. H. E., Fellous, J.-M., Salinas, E., Jose, J. V., & Sejnowski, T. J. (2004). Synchronization as a mechanism for attentional modulation. Neurocomputing, 58, 641–646.
Tiesinga, P. H. E., Jose, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Physical Review E, 62, 8413–8419.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Wilson, H. R., Krupa, B., & Wilkinson, F. (2000). Dynamics of perceptual oscillation in form vision. Nature Neuroscience, 3(2), 170–176.
Received February 3, 2005; accepted June 28, 2005.
LETTER
Communicated by Odelia Schwartz
On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields

Pietro Berkes
[email protected]
Laurenz Wiskott
[email protected]
Institute for Theoretical Biology, Humboldt University Berlin, D-10115 Berlin, Germany
In this letter, we introduce some mathematical and numerical tools to analyze and interpret inhomogeneous quadratic forms. The resulting characterization is in some aspects similar to that given by experimental studies of cortical cells, making it particularly suitable for application to second-order approximations and theoretical models of physiological receptive fields. We first discuss two ways of analyzing a quadratic form by visualizing the coefficients of its quadratic and linear term directly and by considering the eigenvectors of its quadratic term. We then present an algorithm to compute the optimal excitatory and inhibitory stimuli—those that maximize and minimize the considered quadratic form, respectively, given a fixed energy constraint. The analysis of the optimal stimuli is completed by considering their invariances, which are the transformations to which the quadratic form is most insensitive, and by introducing a test to determine which of these are statistically significant. Next we propose a way to measure the relative contribution of the quadratic and linear term to the total output of the quadratic form. Furthermore, we derive simpler versions of the above techniques in the special case of a quadratic form without linear term. In the final part of the letter, we show that for each quadratic form, it is possible to build an equivalent two-layer neural network, which is compatible with (but more general than) related networks used in some recent articles and with the energy model of complex cells. We show that the neural network is unique only up to an arbitrary orthogonal transformation of the excitatory and inhibitory subunits in the first layer.

1 Introduction

Recent research in neuroscience has seen an increasing number of extensions of established linear techniques to their nonlinear equivalent in both experimental and theoretical studies. This is the case, for example, for spatiotemporal receptive field estimates in physiological studies (see
Simoncelli, Pillow, Paninski, & Schwartz, 2004, for a review) and information-theoretical models like principal component analysis (PCA) (Schölkopf, Smola, & Müller, 1998) and independent component analysis (ICA) (see Jutten & Karhunen, 2003, for a review). Additionally, new nonlinear unsupervised algorithms have been introduced, for example, slow feature analysis (SFA) (Wiskott & Sejnowski, 2002). The study of the resulting nonlinear functions can be a difficult task because of the lack of appropriate tools to characterize them qualitatively and quantitatively.

During a recent project concerning the self-organization of complex cell receptive fields in the primary visual cortex (V1) (Berkes & Wiskott, 2002, 2005b; see section 2), we developed some of these tools to analyze quadratic functions in a high-dimensional space. Because of the complexity of the methods, we describe them here in a separate letter. The resulting characterization is in some aspects similar to that given by physiological studies, making it particularly suitable to be applied to the analysis of nonlinear receptive fields.

We are going to focus on the analysis of the inhomogeneous quadratic form

g(x) = (1/2) x^T H x + f^T x + c,   (1.1)
where x is an N-dimensional input vector, H an N × N matrix, f an N-dimensional vector, and c a constant. Although some of the mathematical details of this study are specific to quadratic forms only, it should be straightforward to extend most of the methods to other nonlinear functions while preserving the same interpretations. In other contexts, it might be more useful to approximate the function under consideration by a quadratic form using a Taylor expansion up to the second order and then apply the algorithms described here.

In experimental studies, quadratic forms occur naturally as a second-order approximation of the receptive field of a neuron in a Wiener expansion (Marmarelis & Marmarelis, 1978; van Steveninck & Bialek, 1988; Lewis, Henry, & Yamada, 2002; Schwartz, Chichilnisky, & Simoncelli, 2002; Touryan, Lau, & Dan, 2002; Rust, Schwartz, Movshon, & Simoncelli, 2004; Simoncelli et al., 2004). Quadratic forms were also used in various theoretical articles, either explicitly (Hashimoto, 2003; Bartsch & Obermayer, 2003) or implicitly in the form of neural networks (Hyvärinen & Hoyer, 2000, 2001; Körding, Kayser, Einhäuser, & König, 2004). The analysis methods used in these studies are discussed in section 10.

Table 1 lists some important terms and variables used throughout the article. We will refer to (1/2) x^T H x as the quadratic term, to f^T x as the linear term, and to c as the constant term of the quadratic form. Without loss of generality, we assume that H is a symmetric matrix, since if necessary we can substitute
Table 1: Definitions of Some Important Terms.

N — number of dimensions of the input space
⟨·⟩_t — mean over time of the expression between the two brackets
x — input vector
g, g̃ — the considered inhomogeneous quadratic form and its restriction to a sphere
H, h_i — N × N matrix of the quadratic term of the inhomogeneous quadratic form (see equation 1.1) and ith row of H (i.e., H = (h_1, . . . , h_N)^T); H is assumed to be symmetric
v_i, μ_i — ith eigenvector and eigenvalue of H, sorted by decreasing eigenvalues (i.e., μ_1 ≥ μ_2 ≥ . . . ≥ μ_N)
V, D — the matrix of the eigenvectors V = (v_1, . . . , v_N) and the diagonal matrix of the eigenvalues, so that V^T H V = D
f — N-dimensional vector of the linear term of the inhomogeneous quadratic form (see equation 1.1)
c — scalar value of the constant term of the inhomogeneous quadratic form (see equation 1.1)
x+, x− — optimal excitatory and inhibitory stimuli, with ‖x+‖ = ‖x−‖ = r
H in equation 1.1 by the symmetric matrix (1/2)(H + H^T) without changing the values of the function g. We define μ_1, . . . , μ_N to be the eigenvalues of H with eigenvectors v_1, . . . , v_N, sorted in decreasing order μ_1 ≥ μ_2 ≥ . . . ≥ μ_N. V = (v_1, . . . , v_N) denotes the matrix of the eigenvectors and D the diagonal matrix of the corresponding eigenvalues, so that V^T H V = D. Furthermore, ⟨·⟩_t indicates the mean over time of the expression included in the angle brackets.

In the next section we introduce the model system that we use for illustration throughout this letter. Section 3 describes two ways of analyzing a quadratic form by visualizing the coefficients of its quadratic and linear term directly and by considering the eigenvectors of its quadratic term. We then present in section 4 an algorithm to compute the optimal excitatory and inhibitory stimuli—the stimuli that maximize and minimize a quadratic form, respectively, given a fixed energy constraint. In section 5 we consider the invariances of the optimal stimuli, which are the transformations to which the function is most insensitive, and in the following section we introduce a test to determine which of these are statistically significant. In section 7 we discuss two ways to determine the relative contribution of the different terms of a quadratic form to its output. Furthermore, in section 8 we consider the techniques described above in the special case of a quadratic form without the linear term. In the end, we present in section 9 a two-layer neural network architecture equivalent to a given quadratic form. The letter concludes with a discussion of the relation of our approach to other studies in section 10.
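Stated in code, equation 1.1 is a one-liner (a trivial sketch; the function name is ours):

```python
import numpy as np

def g(x, H, f, c):
    """Evaluate the inhomogeneous quadratic form of equation 1.1."""
    return 0.5 * x @ H @ x + f @ x + c
```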
2 Model System

To illustrate the analysis techniques presented here, we use the quadratic forms presented in Berkes and Wiskott (2002) in the context of a theoretical model of self-organization of complex-cell receptive fields in the primary visual cortex (see also Berkes & Wiskott, 2005b). In this section, we summarize the settings and main results of this example system.

We generated image sequences from a set of natural images by moving an input window over an image by translation, rotation, and zoom and subsequently rescaling the collected stimuli to a standard size of 16 × 16 pixels. For efficiency reasons, the dimensionality of the input vectors x was reduced from 256 to 50 input dimensions and whitened using principal component analysis (PCA). We then determined quadratic forms (also called functions or units in the following) by applying SFA to the input data. SFA is an implementation of the temporal slowness principle (see Wiskott & Sejnowski, 2002, and references there). Given a finite-dimensional function space, SFA extracts the functions that, applied to the input data, return output signals that vary as slowly as possible in time (as measured by the variance of the first derivative) under the constraint that the output signals have zero mean and unit variance and are decorrelated. The functions are sorted by decreasing slowness. For analysis, the quadratic forms are projected back from the 50 first principal components to the input space. Note that the rank of the quadratic term after the transformation is the same as before, and it thus has only 50 eigenvectors. A sketch of this pipeline in code follows at the end of this section.

The units receive visual stimuli as an input and can be interpreted as nonlinear receptive fields. They were analyzed with the algorithms presented here and with sine-grating experiments similar to the ones performed in physiology and were found to reproduce many properties of complex cells in V1—not only the primary ones, that is, response to edges and phase-shift invariance (see sections 4 and 5), but also a range of secondary ones such as direction selectivity, nonorthogonal inhibition, end inhibition, and side inhibition. This model system is complex enough to require an extensive analysis and is representative of the application domain considered here, which includes second-order approximations and theoretical models of physiological receptive fields.
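The pipeline summarized above can be sketched as follows. This is our own illustration, not the authors' implementation: PCA whitening, expansion into all monomials of degree 1 and 2, and SFA solved as a generalized eigenvalue problem on the covariance matrices of the expanded signal and of its temporal derivative.

```python
import numpy as np
from scipy.linalg import eigh

def quadratic_sfa(X, n_pca=50):
    """Rough sketch of the model pipeline in section 2. X: array (T, n_pixels)
    of vectorized patches in temporal order; T must exceed the expanded
    dimensionality (1,325 for n_pca = 50) for the problem to be well posed."""
    X = X - X.mean(axis=0)
    # PCA whitening to n_pca dimensions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:n_pca].T * (np.sqrt(len(X)) / s[:n_pca])   # unit covariance
    # All monomials of degree 1 and 2 of the whitened components.
    i, j = np.triu_indices(n_pca)
    E = np.hstack([Z, Z[:, i] * Z[:, j]])
    E = E - E.mean(axis=0)
    dE = np.diff(E, axis=0)                                # temporal derivative
    # Slowest functions: minimize the derivative variance under unit variance
    # and decorrelation, i.e., the smallest generalized eigenvectors.
    delta, W = eigh(dE.T @ dE / len(dE), E.T @ E / len(E))
    return W, delta            # columns of W, slowest first (ascending delta)
```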
3 Visualization of Coefficients and Eigenvectors

One way to analyze a quadratic form is to look at its coefficients. The coefficients f_1, . . . , f_N of the linear term can be visualized and interpreted directly. They give the shape of the input stimulus that maximizes the linear part given a fixed norm.
[Figure 1 here: the linear term f and nine rows h_j (j = 52, 56, 60, 116, 120, 124, 180, 184, 188) of the quadratic term for units 4 and 28.]
Figure 1: Some of the quadratic form coefficients of two functions learned in the model system. The top plots show the coefficients of the linear term f, reshaped to match the two-dimensional shape of the input. The bottom plots show the coefficients of nine of the rows h j of the quadratic term. The crosses indicate the spatial position of the corresponding reference index j.
The quadratic term can be interpreted as a sum over the inner products of the jth row h_j of H with the vector of the products x_j x_i between the jth variable x_j and all other variables:

x^T H x = Σ_{j=1}^N x_j (h_j^T x) = Σ_{j=1}^N h_j^T (x_j x_1, x_j x_2, . . . , x_j x_N)^T.   (3.1)
In other words, the response of the quadratic term is formed by the sum of N linear filters h_j, which respond to all combinations of the jth variable with the other ones. If the input data have a two-dimensional spatial arrangement, as in our model system, the interpretation of the rows can be made easier by visualizing them as a series of images (by reshaping the vector h_j to match the structure of the input) and arranging them according to the spatial position of the corresponding variable x_j.
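Such a visualization takes only a few lines; a sketch (ours, with arbitrary row indices and the 16 × 16 patch size of the model system):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_quadratic_form(H, f, rows=(52, 56, 60), side=16):
    """Display the linear term f and selected rows h_j of H as images,
    in the spirit of Figure 1."""
    fig, axes = plt.subplots(1, len(rows) + 1, figsize=(2.5 * (len(rows) + 1), 2.5))
    axes[0].imshow(f.reshape(side, side), cmap="gray")
    axes[0].set_title("linear term f")
    for ax, j in zip(axes[1:], rows):
        ax.imshow(H[j].reshape(side, side), cmap="gray")
        ax.set_title(f"h_{j}")        # the paper marks pixel j with a cross
    for ax in axes:
        ax.set_xticks([]); ax.set_yticks([])
    plt.show()
```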
[Figure 2 here: eigenvalues for unit 4 are 6.56, 6.54, 4.72, 4.64, 3.81, . . . , −2.89, −3.69, −3.74, −4.96, −5.00, −6.52, −7.93, −8.08, −11.72, −12.03; for unit 28, 12.23, 12.10, 6.38, 6.24, 6.04, . . .]
Figure 2: Eigenvectors of the quadratic term of two functions learned in the model system, sorted by decreasing eigenvalues as indicated above each eigenvector.
In Figure 1 we show some of the coefficients of two units learned in the model system. In both, the linear term looks unstructured. The absolute values of its coefficients are small in comparison to those of the quadratic term, so that its contribution to the output of the functions is very limited (cf. section 7). The row vectors h_j of unit 4 have a localized distribution of their coefficients; they respond only to combinations of the corresponding variable x_j and its neighbors. The filters h_j are shaped like a four-leaf clover and centered on the variable itself. Pairs of opposed leaves have positive and negative values, respectively. This suggests that the unit responds to stimuli oriented in the direction of the two positive leaves and is inhibited by stimuli with an orthogonal orientation, which is confirmed by successive analysis (cf. later in this section and section 4). In unit 28 the appearance of h_j depends on the spatial position of x_j. In the bottom half of the receptive field, the interaction of the variables with their close neighbors along the vertical orientation is weighted positively, with a negative flank on the sides. In the top half, the rows have similar coefficients but with reversed polarity. As a consequence, the unit responds strongly to vertical edges in the bottom half, while vertical edges in the top half result in strong inhibition. Edges extending over the whole receptive field elicit only a weak total response. Such a unit is said to be end inhibited.

Another possibility for visualizing the quadratic term is to display its eigenvectors. The output of the quadratic form to one of the (normalized) eigenvectors equals half of the corresponding eigenvalue, since (1/2) v_i^T H v_i = (1/2) v_i^T (μ_i v_i) = (1/2) μ_i. The first eigenvector can be interpreted as the stimulus that among all input vectors with norm 1 maximizes the output of the quadratic term. The jth eigenvector maximizes the quadratic term in the subspace that excludes the previous j − 1 ones. In Figure 2 we show the eigenvectors of the two functions previously analyzed in Figure 1. In unit 4, the first eigenvector looks like a Gabor wavelet (i.e., a sine grating multiplied by a gaussian). The second eigenvector has the same form except for a 90 degree phase shift of the sine grating. Since the two eigenvalues have almost the same magnitude, the response of the quadratic term
is similar for the two eigenvectors and also for linear combinations with constant norm 1. For this reason, the quadratic term of this unit has the main characteristics of complex cells in V1: a strong response to an oriented grating with an invariance to the phase of the grating. The last two eigenvectors, which correspond to the stimuli that minimize the quadratic term, are Gabor wavelets with orientation orthogonal to the first two. This means that the output of the quadratic term is inhibited by stimuli at an orientation orthogonal to the preferred one. A similar interpretation can be given in the case of unit 28, although in this case, the first and the last two eigenvectors have the same orientation but occupy two different halves of the receptive field. This confirms that unit 28 is end inhibited. A direct interpretation of the remaining eigenvectors in the two functions is difficult (see also section 8), although the magnitude of the eigenvalues shows that some of them elicit a strong response. Moreover, the interaction of the linear and quadratic terms in forming the overall output of the quadratic form is not considered here but cannot generally be neglected. The methods presented in the following sections often give a more direct and intuitive description of quadratic forms.

4 Optimal Stimuli

Another characterization of a nonlinear function can be borrowed from neurophysiological experiments, where it is common practice to characterize a neuron by the stimulus to which the neuron responds best (for an overview, see Dayan & Abbott, 2001). Analogously, we can compute the optimal excitatory stimulus of g, the input vector x+ that maximizes g given a fixed norm ‖x+‖ = r.¹ Note that x+ depends qualitatively on the value of r: if r is very small, the linear term of the equation dominates, so that x+ ≈ f, while if r is very large, the quadratic part dominates, so that x+ equals the first eigenvector of H (see also section 8). We usually choose r to be the mean norm of all input vectors, since we want x+ to be representative of the typical input. In the same way, we can also compute the optimal inhibitory stimulus x−, which minimizes the response of the function.
1 The fixed norm constraint corresponds to a fixed energy constraint (Stork & Levinson, 1982) used in experiments involving the reconstruction of the Wiener kernel of a neuron (Dayan & Abbott, 2001). During physiological experiments in the visual system, one sometimes uses stimuli with fixed contrast instead. The optimal stimuli under these two constraints may be different. For example, with fixed contrast, one can extend a sine grating indefinitely in space without changing its intensity, while with fixed norm, its maximum intensity is going to dim as the extent of the grating increases. The fixed contrast constraint is more difficult to enforce analytically (e.g., because the surface of constant contrast is not bounded).
The problem of finding the optimal excitatory stimulus under the fixed energy constraint can be mathematically formulated as follows:

maximize g(x) = (1/2) x^T H x + f^T x + c  under the constraint  x^T x = r².   (4.1)

This problem is known as the trust region subproblem and has been extensively studied in the context of numerical optimization, where a nonlinear function is minimized by successively approximating it by an inhomogeneous quadratic form, which is in turn minimized in a small neighborhood. Numerous studies have analyzed its properties, in particular in the numerically difficult case where H is near to singular (see Fortin, 2000, and references there). We make use of some basic results and extend them where needed.

If the linear term is equal to zero (i.e., f = 0), the problem can be easily solved (the solution is simply the first eigenvector scaled to norm r; see section 8). In the following, we consider the more general case where f ≠ 0. We can use a Lagrange formulation to find the necessary conditions for the extremum:

x^T x = r²   (4.2)

and

∇[g(x) − (1/2) λ x^T x] = 0   (4.3)
⇔ Hx + f − λx = 0   (4.4)
⇔ x = (λI − H)^{-1} f,   (4.5)

where we inserted the factor 1/2 for mathematical convenience. According to theorem 3.1 in Fortin (2000), if an x that satisfies equation 4.5 is a solution to equation 4.1, then (λI − H) is positive semidefinite (i.e., all eigenvalues are greater than or equal to 0). This imposes a strong lower bound on the range of possible values for λ. Note that the matrix (λI − H) has the same eigenvectors v_i as H, with eigenvalues (λ − μ_i). For (λI − H) to be positive semidefinite, all eigenvalues must be nonnegative, and thus λ must be greater than or equal to the largest eigenvalue μ_1:

μ_1 ≤ λ.   (4.6)

An upper bound for λ can be found by considering an upper bound for the norm of x. First, we note that the matrix (λI − H)^{-1} is symmetric and has the same eigenvectors as H, with eigenvalues 1/(λ − μ_i). We also know that ‖Av‖ ≤ ‖A‖ ‖v‖ for every matrix A and vector v. ‖A‖ is here the spectral norm of A, which for symmetric matrices is simply the largest absolute eigenvalue. With this we find an upper bound for λ:

r = ‖x‖   (4.7)
  = ‖(λI − H)^{-1} f‖   (4.8)
  ≤ ‖(λI − H)^{-1}‖ ‖f‖   (4.9)
  = max_i {1/(λ − μ_i)} ‖f‖   (4.10)
  = [1/(λ − μ_1)] ‖f‖   (by equation 4.6)   (4.11)
⇔ λ ≤ ‖f‖/r + μ_1.   (4.12)

The optimization problem, equation 4.1, is thus reduced to a search over λ on the interval [μ_1, ‖f‖/r + μ_1] until x defined by equation 4.5 fulfills the constraint ‖x‖ = r (see equation 4.2). Vector x and its norm can be efficiently computed for each λ by expanding f in the basis of the eigenvectors v_i:

x = (λI − H)^{-1} f   (4.13)
  = (λI − H)^{-1} Σ_i v_i (v_i^T f)   (4.14)
  = Σ_i (λI − H)^{-1} v_i (v_i^T f)   (4.15)
  = Σ_i [1/(λ − μ_i)] v_i (v_i^T f)   (4.16)

and

‖x‖² = Σ_i [1/(λ − μ_i)]² (v_i^T f)²,   (4.17)
where the terms v_i^T f and (v_i^T f)² are constant for each quadratic form and can be computed in advance. The last equation also shows that the norm of x is monotonically decreasing in the considered interval, so that there is exactly one solution and the search can be efficiently performed by a bisection method. x− can be found in the same way by maximizing the negative of g. The pseudocode of an algorithm that implements all the considerations above can be found in Berkes and Wiskott (2005a). A Matlab version can be downloaded online from the authors' home pages (http://itb.biologie.hu-berlin.de/{~berkes,~wiskott}).
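A sketch of the search (written after the derivation above, not the authors' Matlab code) fits in a few lines; it assumes f ≠ 0 and ignores the numerically delicate nearly singular case discussed by Fortin (2000):

```python
import numpy as np

def optimal_stimulus(H, f, r, n_iter=200):
    """Compute x+ = argmax g(x) subject to ||x|| = r by bisection over lambda
    on the interval [mu_1, ||f||/r + mu_1] (equations 4.5-4.17)."""
    mu, V = np.linalg.eigh(H)             # eigendecomposition; order irrelevant
    coef2 = (V.T @ f) ** 2                # (v_i^T f)^2, precomputed constants
    lo = mu.max()                         # equation 4.6
    hi = np.linalg.norm(f) / r + mu.max() # equation 4.12
    for _ in range(n_iter):
        lam = 0.5 * (lo + hi)
        norm2 = np.sum(coef2 / (lam - mu) ** 2)      # equation 4.17
        if norm2 > r ** 2:
            lo = lam    # ||x|| too large: ||x(lambda)|| decreases with lambda
        else:
            hi = lam
    return V @ ((V.T @ f) / (lam - mu))              # equation 4.16

# x- follows by maximizing -g: x_minus = optimal_stimulus(-H, -f, r).
```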
[Figure 3 here: optimal excitatory and inhibitory stimuli of units 1–30 of the model system.]
Figure 3: Optimal stimuli of some of the units in the model system. x+ looks like a Gabor wavelet in almost all cases, in agreement with physiological data. x− is usually structured and is also similar to a Gabor wavelet, which suggests that inhibition plays an important role.
If the matrix H is negative definite (i.e., all its eigenvalues are negative), there is a global maximum that may not lie on the sphere, which might be used in substitution for x+ if it lies in a region of the input space that has a high probability of being reached (the criterion is quite arbitrary, but the region could be chosen to include, for example, 75% of the input data with highest density). The gradient of the function disappears at the global extremum, so that it can be found by solving a simple linear equation system:

∇g(x) = Hx + f = 0   (4.18)
⇔ x = −H^{-1} f.   (4.19)
In the same way, a positive definite matrix H has a negative global minimum, which might be used as a substitute for x−. In Figure 3 we show the optimal stimuli of some of the units in the model system. In almost all cases, x+ looks like a Gabor wavelet, in agreement with physiological data for neurons of the primary visual cortex (Pollen & Ronner, 1981; Adelson & Bergen, 1985; Jones & Palmer, 1987). The functions respond best to oriented stimuli having the same frequency as x+. x− is usually structured as well and also looks like a Gabor wavelet, which suggests that inhibition plays an important role. x+ can be used to compute the position and size of the receptive fields as well as the preferred orientation and frequency of the units for subsequent experiments. Note that although x+ is the stimulus that elicits the strongest response in the function, it is not necessarily representative of the class of stimuli that give the most important contribution to its output. This
depends on the distribution of the input vectors. If x+ lies in a low-density region of the input space, it is possible that other kinds of stimuli drive the function more often. In that case, they might be considered more relevant than x+ to characterize the function. A symptom of this effect would be an output of the function at its optimal stimulus that lies far outside the range of normal activity. This means that x+ can be an atypical, artificial input that pushes the function into an uncommon state. A similar effect has also been reported in a physiological study comparing the responses of neurons to natural stimuli and to artificial stimuli such as sine gratings (Baddeley et al., 1997). The characterization of a neuron or a nonlinear function as a feature detector via the optimal stimulus is thus at least incomplete (see also MacKay, 1985). However, the optimal stimuli remain extremely informative in practice.

5 Invariances

Since the considered functions are nonlinear, the optimal stimuli do not provide a complete description of their properties. We can gain additional insight by studying a neighborhood of x+ and x−. An interesting question is to which transformations of x+ or x− the function is invariant. This is similar to the common interpretation of neurons as detectors of a specific feature of the input that are invariant to a local transformation of that feature. For example, complex cells in the primary visual cortex are thought to respond to oriented bars and to be invariant to a local translation.

In this section, we consider the function g̃ defined as g restricted to the sphere S of radius r, since as in section 4, we want to compare input vectors having fixed energy. Notice that although g̃ and g take the same values on S (i.e., g̃(x) = g(x) for each x ∈ S), they are two distinct mathematical objects. For example, the gradient of g̃ at x+ is zero because x+ is by definition a maximum of g̃. On the other hand, the gradient of g at the same point is Hx+ + f, which is in general different from zero.

Strictly speaking, there is no invariance at x+, since it is a maximum and the output of g̃ decreases in all directions (except in the special case where the linear term is zero and the first two or more eigenvalues are equal). On the other hand, at a general, noncritical point x∗ (i.e., a point where the gradient does not vanish), the rate of change in any direction w is given by its inner product with the gradient, ∇g̃(x∗) · w. For all vectors orthogonal to the gradient (which span an N − 2–dimensional space), the rate of change is thus zero. Note that this is not merely a consequence of the fact that the gradient is a first-order approximation of g̃. By the implicit function theorem (see, e.g., Walter, 1995, theorem 4.5), in each open neighborhood U of a noncritical point x∗, there is an N − 2–dimensional level surface {x ∈ U ⊂ S | g̃(x) = g̃(x∗)}, since the domain of g̃ (the sphere S) is an N − 1–dimensional surface and its range (R) is one-dimensional. Each noncritical point thus belongs to an N − 2–dimensional surface where
Figure 4: Definition of invariance. This figure shows a contour plot of g˜ (x) on the surface of the sphere S in a neighborhood of x+ . Each general point x∗ on S lies on an N − 2–dimensional level surface (as indicated by the closed lines) where the output of the function g˜ does not change. The only interesting direction in x∗ is the one of maximal change, as indicated by the gradient ∇ g˜ (x∗ ). On the space orthogonal to it, the rate of change is zero. In x+ the function has a maximum, and its output decreases in all directions. There is thus no strict invariance. Considering the second derivative, however, we can identify the directions of minimal change. The arrows in x+ indicate the direction of the invariances (see equation 5.9) with a length proportional to the corresponding second derivative.
the value of g̃ stays constant. This is a somewhat surprising result: for an optimal stimulus, there does not exist any invariance (except in some degenerate cases); for a general suboptimal stimulus, there exist many invariances. This shows that although it might be useful to observe, for example, that a given function that maps images to real values is invariant to stimulus rotation, one should keep in mind that at a generic point, there is a large number of other transformations to which the function is equally invariant but that would lack an easy interpretation.

The strict concept of invariance is thus not useful for our analysis, since at the extrema we have no invariances at all, while at a general point, they are the typical case, and the only interesting direction is the one of maximal change, as indicated by the gradient. At the extremum x+, however, since the output changes in all directions, we can relax the definition of invariance and look for the transformation to which the function changes as little as possible, as indicated by the direction with the smallest absolute value of the second derivative (see Figure 4). (At a noncritical point, this weak definition of invariance still does not help. If the quadratic form that represents the second derivative has positive as well as negative eigenvalues, there is still an N − 3–dimensional surface where the second derivative is zero.)
Figure 5: Invariances. (a) To compute the second derivative of the quadratic form on the surface of the sphere, one can study the function along special paths on the sphere, known as geodesics. The geodesics of a sphere are great circles. (b) This plot illustrates how the invariances are visualized. Starting from the optimal stimulus (top), we move on the sphere in the direction of an invariance until the response of the function drops below 80% of the maximal output or α reaches 90 degrees. In the figure, two invariances of unit 4 are visualized. The one on the left represents a phase shift invariance and preserves more than 80% of the maximal output until 90 degrees (the output at 90 degrees is 99.6% of the maximum). The one on the right represents an invariance to orientation change, with an output that drops below 80% at 55 degrees.
To study the invariances of the function g in a neighborhood of its optimal stimulus while respecting the fixed energy constraint, we have defined the function g̃ as the function g restricted to S. This is particularly relevant here, since we want to analyze the derivatives of the function, that is, its change under small movements. Any straight movement in space is going to leave the surface of the sphere. We must therefore be able to define movements on the sphere itself. This can be done by considering a path ϕ(t) on the surface of S such that ϕ(0) = x+ and then studying the change of g along ϕ. By doing this, however, we add the rate of change of the path (i.e., its acceleration) to that of the function. Of all possible paths, we must take the ones that have as little acceleration as possible, namely just the acceleration needed to stay on the surface. Such a path is called a geodesic. The geodesics of a sphere are great circles, and our paths are thus defined as

ϕ(t) = cos(t/r) · x+ + sin(t/r) · r w  (5.1)
for each direction w in the tangential space of S at x+ (i.e., for each w orthogonal to x+), as shown in Figure 5a. The 1/r factor in the cosine and sine arguments normalizes the path such that (d/dt)ϕ(0) = w with ‖w‖ = 1. For the first derivative of g̃ along ϕ, we obtain by straightforward calculations (with (g̃ ∘ ϕ)(t) := g̃(ϕ(t)))

(d/dt)(g̃ ∘ ϕ)(t) = (d/dt)[ (1/2) ϕ(t)^T H ϕ(t) + f^T ϕ(t) + c ] = . . .  (5.2)

 = −(1/r) sin(t/r) cos(t/r) x+^T H x+ + cos(2t/r) x+^T H w + sin(t/r) cos(t/r) r w^T H w − (1/r) sin(t/r) f^T x+ + cos(t/r) f^T w,  (5.3)
and for the second derivative,

(d^2/dt^2)(g̃ ∘ ϕ)(t) = −(1/r^2) cos(2t/r) x+^T H x+ − (2/r) sin(2t/r) x+^T H w + cos(2t/r) w^T H w − (1/r^2) cos(t/r) f^T x+ − (1/r) sin(t/r) f^T w.  (5.4)

At t = 0 we have

(d^2/dt^2)(g̃ ∘ ϕ)(0) = w^T H w − (1/r^2)(x+^T H x+ + f^T x+),  (5.5)

that is, the second derivative of g̃ at x+ in the direction of w is composed of two terms: w^T H w corresponds to the second derivative of g in the direction of w, while the constant term −(1/r^2) · (x+^T H x+ + f^T x+) depends on the curvature of the sphere 1/r^2 and on the gradient of g at x+ orthogonal to the surface of the sphere,

∇g(x+) · x+ = (Hx+ + f)^T x+  (5.6)
 = x+^T H x+ + f^T x+.  (5.7)
To find the direction in which g̃ decreases as little as possible, we only need to minimize the absolute value of the second derivative (see equation 5.5). This is equivalent to maximizing the first term w^T H w in equation 5.5, since the second derivative at x+ is always negative (because x+ is a maximum of g̃) and the second term is constant. w is orthogonal to x+, and thus the maximization must be performed in the space tangential to the sphere at x+. This can be done by computing a basis b_2, . . . , b_N of the tangential space (e.g., using Gram-Schmidt orthogonalization on x+, e_1, . . . , e_{N−1}, where e_i is the canonical basis of R^N) and replacing the matrix H by

H̃ = B^T H B,  (5.8)

where B = (b_2, · · · , b_N). The direction of the smallest second derivative corresponds to the eigenvector ṽ_1 of H̃ with the largest positive eigenvalue. The eigenvector must then be projected back from the tangential space into the original space by a multiplication with B:

w_1 = B ṽ_1.  (5.9)
The remaining eigenvectors, corresponding to eigenvalues of decreasing value, are also interesting, as they point in orthogonal directions in which the function changes with a gradually increasing rate of change. To visualize the invariances, we move x+ (or x−) along a path on the sphere in the direction of a vector w_i according to

x(α) = cos(α) · x+ + sin(α) · r w_i  (5.10)

for α ∈ [−90°, 90°], as illustrated in Figure 5b. At each point, we measure the response of the function to the new input vector and stop when it drops below 80% of the maximal response. In this way, we generate for each invariance a movie like those shown in Figure 6 for some of the optimal stimuli (the corresponding animations are available at the authors' home pages). Each frame of such a movie contains a nearly optimal stimulus. Using this analysis, we can systematically scan a neighborhood of the optimal stimuli, starting from the transformations to which the function is most insensitive up to those that lead to a great change in response. Note that our definition of invariance applies only locally to a small neighborhood of x+. The path followed in equation 5.10 goes beyond such a neighborhood and is appropriate only for visualization.

The pseudocode of an algorithm that computes and visualizes the invariances of the optimal stimuli can be found in Berkes and Wiskott (2005a). A Matlab version can be downloaded from the authors' home pages.

6 Significant Invariances

The procedure described above finds for each optimal stimulus a set of N − 1 invariances, ordered by the degree of invariance (i.e., by increasing magnitude of the second derivative). We would like to know which of these are statistically significant. An invariance can be defined as significant
[Figure 6 panels: (a) Unit 4, Inv. 1 – phase shift; (b) Unit 6, Inv. 3 – position change; (c) Unit 13, Inv. 4 – size change; (d) Unit 14, Inv. 5 – frequency change; (e) Unit 9, Inv. 3 – orientation change; (f) Unit 6, Inv. 5 – curvature change.]
Figure 6: Selected invariances for some of the optimal excitatory stimuli shown in Figure 3. The central patch of each plot represents the optimal stimulus of a unit, while the ones on the sides are produced by moving it in one (left patch) or the other (right patch) direction of the eigenvector corresponding to the invariance. In this image, we stopped before the output dropped below 80% of the maximum to make the interpretation of the invariances easier. The relative output of the function in percent and the angle of displacement α (see equation 5.10) are given above the patches. The animations corresponding to these invariances are available at the authors’ home pages.
if the function changes exceptionally little (less than chance level) in that direction, which can be measured by the value of the second derivative: the smaller its absolute value, the slower the function will change. To test for significance, we compare the second derivatives of the invariances of the quadratic form under consideration with those of random inhomogeneous quadratic forms that are equally adapted to the statistics of the input data. We therefore constrain the random quadratic forms to produce an output with the same mean and variance as the output of the analyzed ones when applied to the input stimuli. Without loss of generality, we assume here zero mean and unit variance. These constraints are compatible with the ones usually imposed on the functions learned by many theoretical models. Because of this normalization, the distribution of the random quadratic forms depends on the distribution of the input data.

To understand how to efficiently build random quadratic forms under these constraints, it is useful to think in terms of a dual representation of the problem. A quadratic form over the input space is equivalent to a linear function over the space of the input expanded to all
monomials of degree one and two using the function Φ((x_1, . . . , x_n)^T) := (x_1 x_1, x_1 x_2, x_1 x_3, . . . , x_n x_n, x_1, . . . , x_n)^T, that is,

(1/2) x^T H x + f^T x + c = q^T Φ(x) + c,  (6.1)

with q := ((1/2)h_11, h_12, h_13, . . . , (1/2)h_nn, f_1, . . . , f_n)^T, where h_ij are the entries of H.
We can whiten the expanded input data Φ(x) by subtracting its mean ⟨Φ(x)⟩_t and transforming it with a whitening matrix S. In this new coordinate system, each vector of norm 1 applied to the input data via the scalar product fulfills the unit-variance and zero-mean constraints by construction. We can thus choose a random vector q′ of length 1 in the whitened, expanded space and derive the corresponding quadratic form in the original input space:

q′^T (S(Φ(x) − ⟨Φ(x)⟩_t)) = (S^T q′)^T (Φ(x) − ⟨Φ(x)⟩_t)  (6.2)
 = q^T (Φ(x) − ⟨Φ(x)⟩_t)  (6.3)
 = (1/2) x^T H x + f^T x − q^T ⟨Φ(x)⟩_t  (6.4)
 = (1/2) x^T H x + f^T x + c,  (6.5)

where we defined q := S^T q′ in equation 6.3, used equation 6.1 in equation 6.4, and set c := −q^T ⟨Φ(x)⟩_t in equation 6.5, with H and f appropriately defined according to equation 6.1.
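To make the construction concrete, here is a minimal sketch, assuming the expansion uses the monomials x_i x_j with i ≤ j (one common ordering; the exact ordering of Φ is not essential) and a symmetric (ZCA) whitening matrix:

    import numpy as np

    def random_quadratic_form(X):
        # X: input vectors as rows. Returns H, f, c of a random quadratic
        # form with zero-mean, unit-variance output on X (eqs. 6.1-6.5).
        n = X.shape[1]
        iu = np.triu_indices(n)                              # pairs i <= j
        Phi = np.hstack([X[:, iu[0]] * X[:, iu[1]], X])      # degree-2 and degree-1 monomials
        mean = Phi.mean(axis=0)
        d, E = np.linalg.eigh(np.cov(Phi, rowvar=False))
        S = E @ np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12))) @ E.T   # whitening matrix
        q1 = np.random.randn(Phi.shape[1])
        q1 /= np.linalg.norm(q1)                             # random unit vector q'
        q = S.T @ q1                                         # equation 6.2
        # Unpack q into symmetric H, linear term f, and constant c (eq. 6.1).
        H = np.zeros((n, n))
        H[iu] = q[:len(iu[0])]
        H = H + H.T                                          # doubles the diagonal: h_ii = 2 q_ii
        f = q[len(iu[0]):]
        c = -q @ mean                                        # equation 6.4
        return H, f, c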
We can then compute the optimal stimuli and the second derivatives of the invariances of the obtained random quadratic form. To make sure that we get independent measurements, we keep only one second derivative, chosen at random, for each random function. This operation, repeated over many quadratic forms, allows us to determine a distribution of the second derivatives of the invariances and a corresponding confidence interval. Figure 7a shows the distribution of 50,000 independent second derivatives of the invariances of random quadratic forms and the distribution of the second derivatives of all invariances of the first 50 units learned in the model system. The dashed line indicates the 95% confidence interval
Figure 7: Significant invariances. (a) Distribution of 50,000 independently drawn second derivatives of the invariances of random quadratic forms and distribution of the second derivatives of all invariances of the first 50 units learned in the model system. The dashed line indicates the 95% confidence interval as derived from the random quadratic forms. The distribution in the model system is more skewed toward small second derivatives and has a clear peak near zero. Twenty-eight percent of all invariances were classified as significant. (b) Number of significant invariances for each of the first 50 units learned in the model system (the functions were sorted by decreasing slowness; see section 2). The number of significant invariances tends to decrease with decreasing slowness.
derived from the former distribution. The latter is more skewed toward small second derivatives and has a clear peak near zero. Twenty-eight percent of all invariances were classified as significant.

Figure 7b shows the number of significant invariances for each individual quadratic form in the model system. Each function has 49 invariances, since the rank of the quadratic term is 50 (see section 2). The plot shows that the number of significant invariances decreases with increasing ordinal number (the functions are ordered by slowness, the first ones being the slowest). Forty-six units out of 50 have three or more significant invariances. The first invariance, which corresponds to a phase shift invariance, was always classified as significant, which confirms that the units behave like complex cells. Note that since the eigenvalues of a quadratic form are not independent of each other, the method presented here permits statements only about the significance of individual invariances, and not about the joint probability distribution of two or more invariances.

7 Relative Contribution of the Quadratic, Linear, and Constant Term

The inhomogeneous quadratic form has a quadratic, a linear, and a constant term. It is sometimes of interest to know their relative contribution to the output. The answer to this question depends on the considered input. For example, the quadratic term dominates for large input vectors,
Figure 8: Relative contribution of the quadratic, linear, and constant term. (a) The absolute value of the output of the quadratic, linear, and constant term in x+ for each of the first 50 units in the model system. In all but the first 2 units, the quadratic term has a larger output. The subplot shows a magnified version of the contribution of the terms for the first 10 units. (b) Histogram of the mean of the logarithm of the ratio between the activity of the linear and the quadratic term in the model system when applied to 90,000 test input vectors. A negative value means that the quadratic term dominates, and a positive value means the linear term dominates. In all but 4 units (units 1, 7, 8, and 24), the quadratic term is greater on average.
while the linear or even the constant term dominates for input vectors with a small norm. A first possibility is to look at the contribution of the individual terms at a particular point. A privileged point is, for example, the optimal excitatory stimulus, especially if the quadratic form can be interpreted as a feature detector (cf. section 4). Figure 8a shows for each function in the model system the absolute value of the output of all terms with x+ as input. In all functions except the first two, the activity of the quadratic term is greater than that of the linear and of the constant term. The first function basically computes the mean pixel intensity, which explains the dominance of the linear term. The second function is dominated by a constant term from which a quadratic expression very similar to the squared mean pixel intensity is subtracted. As an alternative, we can consider the ratio between the linear and the quadratic term, averaged over all input stimuli:
⟨ log( |f^T x| / |(1/2) x^T H x| ) ⟩_t = ⟨ log |f^T x| − log |(1/2) x^T H x| ⟩_t.  (7.1)
The logarithm ensures that a given ratio (e.g., linear/quadratic = 3) has the same weight as the inverse ratio (e.g., linear/quadratic = 1/3) in the mean.
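As a sketch, this measure can be estimated from data as follows (input vectors as rows of X; the helper name is ours, not from the original):

    import numpy as np

    def mean_log_ratio(H, f, X):
        # Mean over stimuli of log|f^T x| - log|x^T H x / 2|  (equation 7.1).
        lin = np.abs(X @ f)
        quad = np.abs(0.5 * np.einsum('ni,ij,nj->n', X, H, X))
        return np.mean(np.log(lin) - np.log(quad))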
A negative result means that the quadratic term dominates, while for a positive value, the linear term dominates. Figure 8b shows the histogram of this measure for the functions in the model system. In all but 4 units (units 1, 7, 8, and 24), the quadratic term is on average greater than the linear one.

8 Quadratic Forms Without the Linear Term

In the case of a quadratic form without the linear term,

g(x) = (1/2) x^T H x + c,  (8.1)
the mathematics of sections 4 and 5 becomes much simpler. The quadratic form is now centered at x = 0, and the direction of maximal increase corresponds to the eigenvector v_1 with the largest positive eigenvalue. The optimal excitatory stimulus x+ with norm r is thus

x+ = r v_1.  (8.2)
Similarly, the eigenvector v_N corresponding to the largest negative eigenvalue points in the direction of x−. The second derivative, equation 5.5, at x+ becomes in this case

(d^2/dt^2)(g̃ ∘ ϕ)(0) = w^T H w − (1/r^2) x+^T H x+  (8.3)
 = w^T H w − v_1^T H v_1   (by equation 8.2)  (8.4)
 = w^T H w − µ_1.  (8.5)
The vector w is by construction orthogonal to x+ and therefore lies in the space spanned by the remaining eigenvectors v_2, . . . , v_N. Since µ_1 is the maximum value that w^T H w can assume for vectors of length 1, it is clear that equation 8.5 is always negative (as it should be, since x+ is a maximum) and that its absolute value is successively minimized by the eigenvectors v_2, . . . , v_N in this order. The value of the second derivative on the sphere in the direction of v_i is given by

(d^2/dt^2)(g̃ ∘ ϕ)(0) = v_i^T H v_i − µ_1  (8.6)
 = µ_i − µ_1.  (8.7)
In the same way, the invariances of x− are given by v N−1 , . . . , v1 with second derivative values (µi − µ N ).
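In this case the whole analysis collapses to a single eigendecomposition; a minimal sketch, under the same assumptions as the earlier code:

    import numpy as np

    def analyze_pure_quadratic(H, r):
        mu, V = np.linalg.eigh(H)      # ascending: mu[0] = mu_N, mu[-1] = mu_1
        x_plus = r * V[:, -1]          # equation 8.2
        x_minus = r * V[:, 0]
        # Second derivatives along v_2, ..., v_N at x+ (equation 8.7),
        # ordered from least to most negative... i.e., most invariant first.
        d2_plus = mu[-2::-1] - mu[-1]
        return x_plus, x_minus, d2_plus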
Since, as shown in Figure 8a, in the model system the linear term is mostly small in comparison with the quadratic one, the first and last eigenvectors of our units are expected to be very similar to their optimal stimuli. This can be verified by comparing Figures 2 and 3. Moreover, successive eigenvectors are almost equal to the directions of the most relevant invariances (compare, for example, unit 4 in Figure 2 and Figure 5b). This does not have to be the case in general. For example, the data in Lewis et al. (2002) show that cochlear neurons in the gerbil ear have a linear as well as a quadratic component. In such a situation, the algorithms must be applied in their general formulation.

9 Decomposition of a Quadratic Form in a Neural Network

As also noticed by Hashimoto (2003), for each quadratic form there exists an equivalent two-layer neural network, which can be derived by rewriting the quadratic form using its eigenvector decomposition:

g(x) = (1/2) x^T H x + f^T x + c  (9.1)
 = (1/2) x^T V D V^T x + f^T x + c  (9.2)
 = (1/2) (V^T x)^T D (V^T x) + f^T x + c  (9.3)
 = Σ_{i=1}^N (µ_i/2) (v_i^T x)^2 + f^T x + c.  (9.4)
The network has a first layer formed by a set of N linear subunits s_k(x) = v_k^T x, followed by a quadratic nonlinearity weighted by the coefficients µ_k/2. The output neuron sums the contributions of all subunits plus the output of a direct linear connection from the input layer (see Figure 9a). Since the eigenvalues can be negative, some of the subunits give an inhibitory contribution to the output. It is interesting to note that in an algorithm that learns quadratic forms, the number of inhibitory subunits in the equivalent neural network is not fixed but is a learned feature. As an alternative, one can scale the weights v_i by √(|µ_i|/2) and specify which subunits are excitatory and which are inhibitory according to the sign of µ_i, since

g(x) = Σ_{i=1}^N (µ_i/2) (v_i^T x)^2 + f^T x + c   (by equation 9.4)  (9.5)
 = Σ_{i=1, µ_i>0}^N ( √(µ_i/2) v_i^T x )^2 − Σ_{i=1, µ_i<0}^N ( √(|µ_i|/2) v_i^T x )^2 + f^T x + c.  (9.6)
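For illustration, a minimal sketch of the equivalent network of equation 9.4 (a direct translation, not taken from the original code):

    import numpy as np

    def quadratic_form_network(H, f, c):
        # Two-layer network: N linear subunits v_i^T x, squared and weighted
        # by mu_i/2, plus a direct linear connection f and a bias c.
        mu, V = np.linalg.eigh(H)
        def g(x):
            s = V.T @ x                                   # first layer: subunit outputs
            return 0.5 * np.sum(mu * s**2) + f @ x + c    # quadratic nonlinearity + linear path
        return g

    # Sanity check against direct evaluation of the quadratic form:
    rng = np.random.default_rng(0)
    H = rng.standard_normal((8, 8)); H = (H + H.T) / 2
    f, c, x = rng.standard_normal(8), 0.3, rng.standard_normal(8)
    g = quadratic_form_network(H, f, c)
    assert np.isclose(g(x), 0.5 * x @ H @ x + f @ x + c)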
⟨|v_wn^n|⟩ < ∞  iff  β > n(Qe + Qi),  (3.18)
where "iff" stands for "if and only if" and the index wn indicates that we consider the white noise case. Note that no symmetry argument applies for odd n, since the asymptotic limits differ for ∞ and −∞ according to equation 3.17. For the mean, this implies that

⟨|v_wn|⟩ < ∞  iff  β > Qe + Qi;  (3.19)
otherwise, the integral diverges. In general, the power-law tail of the density is a hint that (for white noise at least) we face the problem of rare strong deviations of the voltage that are due to the specific properties of the model (multiplicative gaussian noise). Because of equation 3.12, similar conditions (differing by a prefactor of 1/2 on the respective right-hand sides) also apply for the finiteness of the mean and variance of the original solution, equation 2.12, proposed by R&D. For the mean value of this solution, one obtains the condition

⟨|v_RD|⟩ < ∞  iff  β > (Qe + Qi)/2,  (3.20)
which should hold true in the general colored noise case but does not agree with the condition in equation 3.19 even in the white noise case.
From the extended expression we obtain

⟨|v_RD,ext|⟩ < ∞  iff  β > Qe/(1 + βτe) + Qi/(1 + βτi).  (3.21)
Note that equation 3.21 agrees with equation 3.19 only in the white noise case (i.e., for τe, τi → 0). Below we will show that equation 3.19 gives the correct condition for a finite mean value in the general case of arbitrary correlation times, too. Since for finite τe, τi the two conditions, equations 3.19 and 3.21, differ, we can already conclude that the equation 2.15 that led to condition 3.21 cannot be the exact solution of the original problem.

4 Additive Colored Noise

Setting the multiplicative colored noise sources to zero, R&D obtain an expression for the marginal density in the case of additive colored noise only (cf. equations 3.7-3.9 in R&D),

ρ_add,RD(V) = N exp[ −a g_L C_m (V − E_L − I_0/(g_L a))^2 / (σ_I^2 τ_I) ],  (4.1)
which corresponds, in our notation and in terms of the shifted variable v, to

ρ̃_add,RD(v) = N exp[ −βv^2/Q_I ].  (4.2)
Evidently, once more a factor 2 is missing in the white noise case (where the process v(t) itself becomes an OUP), since for an OUP we should have ρ ∼ exp[−βv^2/(2Q_I)]. However, an additional dependence on the correlation time is also missing. For additive noise only, the original problem given in equation 2.1 reduces to

v̇ = −βv + y_I,  (4.3)
ẏ_I = −(1/τ_I) y_I + (√(2Q_I)/τ_I) ξ_I(t).  (4.4)
This system is mathematically similar to the gaussian approximation or effective-time-constant approximation, equation 2.25, in which likewise no multiplicative noise is present. The density function for the voltage is well known; for clarity, we show here how to calculate it.
The system, equations 4.3 and 4.4, obeys the two-dimensional Fokker-Planck equation

∂_t P(v, y_I, t) = ∂_v [(βv − y_I) P(v, y_I, t)] + ∂_{y_I} [ (y_I/τ_I) P(v, y_I, t) + (Q_I/τ_I^2) ∂_{y_I} P(v, y_I, t) ].  (4.5)
The stationary problem (∂_t P_0(v, y_I) = 0) is solved by the ansatz P_0(v, y) ∼ exp[Av^2 + Bvy + Cy^2], yielding the solution for the full probability density,

P_0(v, y_I) = N exp[ −(τ_I(1 + βτ_I)/(2Q_I)) ( y_I^2 − 2βv y_I − 2c v^2 ) ],   c = −β(1 + βτ_I)/(2τ_I).  (4.6)
Integrating over y_I yields the correct marginal density,

ρ_add(v) = √( β(1 + βτ_I)/(2πQ_I) ) exp[ −(βv^2/(2Q_I)) (1 + βτ_I) ],  (4.7)
which is in disagreement with equation 4.2 and hence also with equation 4.1. From the correct solution given in equation 4.7, we also see what happens in the limit of infinite τ_I at fixed noise intensity Q_I: the exponent tends to minus infinity except at v = 0, or, put differently, the variance of the distribution tends to zero, and we end up with a δ function at v = 0. This limit makes sense (cf. note 1) but is not reflected at all in the original solution, equation 2.12, given by R&D. We can also rewrite the solution in terms of the white noise solution in the case of vanishing multiplicative noise:

ρ_add(v) = ρ_wn(v, Qe = 0, Qi = 0, Q_I/[1 + βτ_I]).  (4.8)
Thus, for additive noise, what has been assumed by R&D in the case of multiplicative noise is indeed true: the density in the general colored noise case is given by the white noise density with a rescaled noise intensity Q′_I = Q_I/[1 + βτ_I] (or, equivalently, a rescaled correlation time τ′_I = 2τ_I/[1 + βτ_I] in equation 4.2 with Q_I = σ_I^2 τ_I).

We cannot perform the limit of only additive noise in the extended expression, equation 2.15, proposed by R&D, because this solution was meant for the case of only multiplicative noise. If, however, we generalize that expression to the case of additive and multiplicative colored noises, we can consider the limit of only additive noise in this expression. This is done by taking the original solution by R&D, equation 2.12, and replacing not only the correlation times of the multiplicative noises τ_{e,i} by the effective ones τ′_{e,i} but also that of the additive noise τ_I by an effective correlation time,

τ′_I = 2τ_I/(1 + βτ_I).  (4.9)
If we now take the limit Qe = Qi = 0, we obtain the correct density,

ρ_RD,ext,add(v) = ρ_wn(v, Qe = 0, Qi = 0, Q_I/[1 + βτ_I]),  (4.10)
as becomes evident on comparing the right-hand sides of equations 4.10 and 4.8. Finally, we note that the case of additive noise is the only limit that does not pose any condition on the finiteness of the moments.

5 Static Multiplicative Noises Only (Limit of Large τe,i)

Here we assume for simplicity σ̃_I = 0 and consider multiplicative noise with fixed variances σ̃_{e,i}^2 only. If the noise sources are much slower than the internal timescale of the system, that is, if 1/(βτe) and 1/(βτi) are practically zero, we can neglect the time derivative in equation 2.19. This means that the voltage adapts instantaneously to the multiplicative ("static") noise sources, which is strictly justified only for βτe, βτi → ∞. If τe, τi attain large but finite values (βτe, βτi ≫ 1), the formula derived below will be an approximation that works better the larger these values are. Because of the slowness of the noise sources compared to the internal timescale, we call the resulting expression the "static-noise" theory for simplicity. This does not imply that the total system (membrane voltage plus noise sources) is not in the stationary state: we assume that any initial condition of the variables has decayed on a timescale t much larger than τe,i (see note 4). For a simulation of the density, this has the practical implication that we should choose a simulation time much larger than any of the involved correlation times.

Setting the time derivative in equation 2.19 to zero, we can determine at which position the voltage variable will be for a given quasi-static pair of (ye, yi) values, yielding

v = (ye Ve + yi Vi)/(β + ye + yi).  (5.1)
4. In the strict limit of βτe, βτi → ∞, this would imply that t goes to infinity faster than the correlation times τe,i do.
This sharp position corresponds to a δ peak of the probability density,

δ( v − (ye Ve + yi Vi)/(β + ye + yi) ) = ( |yi(Vi − Ve) − βVe| / (v − Ve)^2 ) δ( ye + (βv + yi(v − Vi))/(v − Ve) )  (5.2)
(here we have used δ(ax) = δ(x)/|a|). This peak has to be averaged over all possible values of the noise, that is, integrated over the two gaussian distributions, in order to obtain the marginal density:

ρ_static(v) = ⟨δ(v − v(t))⟩
 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (dye dyi / (2π σ̃e σ̃i)) ( |yi(Vi − Ve) − βVe| / (v − Ve)^2 ) δ( ye + (βv + yi(v − Vi))/(v − Ve) ) × exp[ −ye^2/(2σ̃e^2) − yi^2/(2σ̃i^2) ].  (5.3)
(5.4)
where erf(z) is the error function (Abramowitz & Stegun, 1970) and the functions µ(v) and ν(v) are given by µ(v) = ν(v) =
σ˜ e2 (v − Ve )2 + σ˜ i2 (v − Vi )2 β2 2 2 σ˜ e Ve (v − Ve ) + σ˜ i2 Vi (v − Vi ) 2σ˜ e2 σ˜ i2 (Ve − Vi )2
(5.5) .
(5.6)
If one of the expressions by R&D, equation 2.12 or 2.15, were the correct solution, it should converge for σ_I = 0 and τe,i → ∞ to the formula for the static case, equation 5.4. In general, this is not the case, since the functional structures of the white-noise solution and of the static-noise approximation are quite different. There is, however, one limit case in which the extended expression yields the same (although trivial) function. If we fix the noise intensities Qe,i and let the correlation times go to infinity, the variances will go to zero, and the static noise density, equation 5.4, approaches a δ peak at v = 0. Although the extended expression, equation 2.15, has a different functional dependence on system parameters and voltage, the same thing happens in the extended expression for τe,i → ∞, because the effective noise intensities Q′_{e,i} = Q_{e,i}/(1 + βτe,i) approach zero in this limit. The white noise solution at vanishing noise intensities is, however, also
a δ peak at v = 0. Hence, in the limit of large correlation time at fixed noise intensities, both the static noise theory, equation 5.4, and the extended expression yield the probability density of a noise-free system and therefore agree. For fixed variance, where a nontrivial large-τ limit of the probability density exists, the static noise theory and the extended expression by R&D differ, as we will also verify numerically.

A final remark concerns the asymptotic behavior of the static noise solution, equation 5.4. The asymptotic expansions for v → ±∞ show that the density goes like |v|^{−2} in both limits. Hence, in this case, we cannot obtain a finite variance of the membrane voltage at all (the integral ∫dv v^2 ρ_static(v) will diverge). The mean may be finite, since the coefficients of the v^{−2} term are symmetric in v. The estimation in the following section, however, will demonstrate that this is valid only strictly in the limit τe,i → ∞ but not at any large but finite value of τe,i. So the mean may diverge for large but finite τe,i.

6 Mean Value of the Voltage for Arbitrary Values of the Correlation Times

By inspection of the limit cases, we have already seen that the moments do not have to be finite for an apparently sensible choice of parameters. For the white noise case, it was shown that the mean of the voltage is finite only if β > Qe + Qi. Next, we show by direct analytical solution of the stochastic differential equation, equation 2.19, involving the colored noise sources, equation 2.20, that this condition (i.e., equation 3.19) holds in general, and thus a divergence of the mean is obtained for β < Qe + Qi.

For only one realization of the process, equation 2.19, the driving functions ye(t), yi(t), and yI(t) can be regarded as just time-dependent parameters in a linear differential equation. The solution is then straightforward (see also Richardson, 2004, for the special case of only multiplicative noise):

v(t) = v_0 exp[ −βt − ∫_0^t du (ye(u) + yi(u)) ] + ∫_0^t ds (Ve ye(s) + Vi yi(s) + yI(s)) e^{−β(t−s)} exp[ −∫_s^t du (ye(u) + yi(u)) ].  (6.1)
The integrated noise processes w_{e,i}(s, t) = ∫_s^t du y_{e,i}(u) in the exponents are independent gaussian processes with variance

⟨w_{e,i}^2(s, t)⟩ = 2Q_{e,i} ( t − s − τ_{e,i} + τ_{e,i} e^{−(t−s)/τ_{e,i}} ).  (6.2)
For a gaussian variable w, we know that ⟨e^w⟩ = e^{⟨w^2⟩/2} (Gardiner, 1985). Using this relation for the integrated noise processes together with equation 6.2, and expressing the average ⟨y_{e,i}(s) exp[−∫_s^t du y_{e,i}(u)]⟩ by a derivative of the exponential with respect to s, we find an integral expression for the mean value,

⟨v(t)⟩ = v_0 e^{(Qe+Qi−β)t} exp[−τe fe(t) − τi fi(t)] − ∫_0^t ds {Ve fe(s) + Vi fi(s)} e^{(Qe+Qi−β)s − τe fe(s) − τi fi(s)},  (6.3)

where f_{e,i}(s) = Q_{e,i}(1 − exp[−s/τ_{e,i}]). The stationary mean value corresponding to the stationary density is obtained from this expression in the asymptotic limit t → ∞. We want to draw attention to the fact that this mean value is finite under exactly the same condition as in the white noise case:

⟨|v|⟩ < ∞  iff  β > Qe + Qi.  (6.4)
First, this is so because otherwise the exponent (Qe + Qi − β)t in the first line is positive and the exponential diverges for t → ∞. Furthermore, if β < Qe + Qi, the exponential in the integrand diverges at large s. In terms of the original parameters of R&D, the condition for a finite stationary mean value of the voltage reads

⟨|v|⟩ < ∞  iff  g_L a + ge0 + gi0 > (σe^2 τe + σi^2 τi)/(a Cm).  (6.5)
Note that this condition also depends on a and Cm, and not only on the synaptic parameters. R&D use as standard parameter values (Rudolph & Destexhe, 2003, p. 2589) ge0 = 0.0121 µS, gi0 = 0.0573 µS, σe = 0.012 µS, σi = 0.0264 µS, τe = 2.728 ms, τi = 10.49 ms, a = 34,636 µm², and Cm = 1 µF/cm². They state that the parameters have been varied in numerical simulations from 0% to 260% relative to these standard values, covering more than "the physiological range observed in vivo" (Rudolph & Destexhe, 2003). Inserting the standard values into the relation, equation 6.5, yields

0.0851 µS > 0.0221 µS.  (6.6)
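This arithmetic, and the doubled-σi case discussed next, can be checked with a few lines (a sketch; the unit handling is ours, and small rounding differences to the quoted 0.0851/0.0852 µS may occur):

    # Finiteness condition (6.5) with R&D's standard parameters.
    # Conductances in uS, times in ms; a*Cm in nF, and 1 nF = 1 uS*ms.
    gL, a_cm2 = 0.0452e-3, 34636e-8        # leak conductance (S/cm^2), area (cm^2)
    ge0, gi0 = 0.0121, 0.0573              # mean conductances (uS)
    se, si = 0.012, 0.0264                 # standard deviations (uS)
    te, ti = 2.728, 10.49                  # correlation times (ms)
    aCm = a_cm2 * 1e3                      # a*Cm in nF, for Cm = 1 uF/cm^2

    lhs = gL * a_cm2 * 1e6 + ge0 + gi0     # gL*a converted to uS
    for si_val in (si, 2 * si):            # standard and doubled inhibitory std. dev.
        rhs = (se**2 * te + si_val**2 * ti) / aCm
        status = "finite" if lhs > rhs else "diverges"
        print(f"sigma_i = {si_val:.4f} uS: {lhs:.4f} uS vs {rhs:.4f} uS -> mean {status}")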
So in this case, the mean will be finite. However, using twice the standard value for the inhibitory noise standard deviation, σi = 0.0528 µS (corresponding to 200% of the standard value), with all other parameters as before, leads to a diverging mean, because we obtain 0.0852 µS on the right-hand side of equation 6.5, while the left-hand side is unchanged. This means,
even in the parameter regime that R&D studied, that the model predicts an infinite mean value of the voltage. A stronger violation of equation 6.5 will be observed on either increasing the standard deviations σe,i and/or the correlation times τe,i or decreasing the mean conductances ge,i. We also note that for higher moments, and especially for the variance, the condition for finiteness will be even more restrictive, as can be concluded from the limit cases investigated before.

The stationary mean value at arbitrary correlation times can be inferred from equation 6.3 by taking the limit t → ∞. Assuming the relation, equation 6.4, holds true, we can neglect the first term involving the initial condition v_0 and obtain

⟨v⟩ = −∫_0^∞ ds {Ve fe(s) + Vi fi(s)} exp[(Qe + Qi − β)s − τe fe(s) − τi fi(s)].  (6.7)

We can also use equation 6.7 to recover the white noise result for the mean as found, for instance, in Richardson (2004) by taking τe,i → 0. In this case, we can integrate equation 6.7 and obtain

⟨v⟩_wn = −{Ve Qe + Vi Qi} ∫_0^∞ ds exp[(Qe + Qi − β)s]
 = −(Ve Qe + Vi Qi)/(β − Qe − Qi).  (6.8)
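Equation 6.7 is a one-dimensional integral that standard quadrature handles directly; a minimal sketch (the values Ve = 1.5, Vi = −0.5 are examples borrowed from Figure 6a):

    import numpy as np
    from scipy.integrate import quad

    def mean_v(beta, Qe, Qi, Ve, Vi, tau_e, tau_i):
        # Stationary mean of the shifted voltage, equation 6.7.
        assert beta > Qe + Qi, "mean diverges otherwise (equation 6.4)"
        fe = lambda s: Qe * (1 - np.exp(-s / tau_e))
        fi = lambda s: Qi * (1 - np.exp(-s / tau_i))
        integrand = lambda s: (Ve * fe(s) + Vi * fi(s)) * np.exp(
            (Qe + Qi - beta) * s - tau_e * fe(s) - tau_i * fi(s))
        val, _ = quad(integrand, 0, np.inf)
        return -val

    # White-noise check, equation 6.8: tau -> 0 recovers -(Ve*Qe+Vi*Qi)/(beta-Qe-Qi).
    print(mean_v(1.0, 0.2, 0.3, 1.5, -0.5, 1e-6, 1e-6),
          -(1.5 * 0.2 - 0.5 * 0.3) / (1.0 - 0.2 - 0.3))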
Because of the similarity of the R&D solution to the white noise solution (cf. equation 3.12), we can also infer that the mean value of the former density is

⟨v⟩_RD = −(Ve Qe + Vi Qi)/(2β − Qe − Qi).  (6.9)
Note the different prefactor of β in the denominator, which is due to the factor 1/2 in the noise intensities of the solution, equation 2.12, by R&D. Finally, we can also easily determine the mean value for the extended expression by R&D (Rudolph & Destexhe, 2005), since this solution is also equivalent to the white noise solution with rescaled noise intensities. Using the noise intensities Q′_{e,i} from equation 3.15, we obtain

⟨v⟩_RD,ext = −( Ve Q′e(τe) + Vi Q′i(τi) ) / ( β − Q′e(τe) − Q′i(τi) )
 = −( Ve Qe(1 + βτi) + Vi Qi(1 + βτe) ) / ( β(1 + βτi)(1 + βτe) − Qe(1 + βτi) − Qi(1 + βτe) ).  (6.10)
We will verify numerically that this expression is not equal to the exact solution, equation 6.7. One can, however, show that for small to medium values of the correlation times τe,i and weak noise intensities, these differences are not drastic. If we expand both equation 6.3 and equation 6.10 for small noise intensities Qe, Qi (assuming for the former that the products Qe τe, Qi τi are small, too), the resulting expressions agree to first order and also agree with a recently derived weak-noise result for filtered Poissonian shot noise given by Richardson and Gerstner (2005, cf. eq. D.3):

⟨v⟩_RD,ext ≈ ⟨v⟩ ≈ −( Ve Qe(1 + βτi) + Vi Qi(1 + βτe) ) / ( β(1 + βτi)(1 + βτe) ) + O(Qe^2, Qi^2).  (6.11)
The higher-order terms differ, and that is why a discrepancy between both expressions can be seen at nonweak noise.

The results for the mean value obtained in this section are useful in two respects. First, we can check whether trajectories indeed diverge for parameters for which the relation, equation 6.4, is violated. Second, the exact solution for the stationary mean value and the simple expressions resulting from the different solutions proposed by R&D can be compared in order to reveal their range of validity. This is done in the next section.

7 Comparison to Simulations

Here we compare the different formulas for the probability density of the membrane voltage and its mean value to numerical simulations for different values of the correlation times, restricting ourselves to the case of multiplicative noise only. For the simulations, we followed a single realization v(t) using a simple Euler procedure. The probability density at a certain voltage is then proportional to the time spent by the realization in a small region around this voltage. Decreasing the time step Δt or increasing the simulation time did not change our results.

We will first discuss the original expression, equation 2.12, proposed by R&D and the analytical solutions for the limit cases of white and static multiplicative noise, equations 3.8 and 5.4, respectively; later we examine the validity of the new extended expression. Finally, we also check the stationary and time-dependent mean values of the membrane voltage and discuss how well these simple statistical characteristics are reproduced by the different theories, including our exact result, equation 6.3.

To check the validity of the different expressions, we first use a dimensionless parameter set with β = 1 but also the original parameter set used by R&D (2003). In both cases, we consider variations of the correlation times over three orders of magnitude (standard values are varied between 10% and 1000%). Note that the latter choice goes beyond the range originally considered by R&D (2003), where parameter variations were limited to the range 0% to 260%.
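A minimal sketch of such an Euler scheme (our reconstruction of the shifted-voltage dynamics consistent with equations 4.3 and 5.1; a much shorter simulation time than in the figures is used by default):

    import numpy as np

    def simulate_voltage(beta, Qe, Qi, Ve, Vi, tau_e, tau_i, dt=1e-3, T=1e3, seed=0):
        rng = np.random.default_rng(seed)
        n = int(T / dt)
        v = ye = yi = 0.0
        out = np.empty(n)
        ae = np.sqrt(2 * Qe * dt) / tau_e     # OU noise amplitudes per step
        ai = np.sqrt(2 * Qi * dt) / tau_i
        for k in range(n):
            v += dt * (-beta * v + ye * (Ve - v) + yi * (Vi - v))
            ye += -dt * ye / tau_e + ae * rng.standard_normal()
            yi += -dt * yi / tau_i + ai * rng.standard_normal()
            out[k] = v
        return out   # a histogram of `out` estimates the stationary density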
7.1 Probability Density of the Membrane Voltage—Original Expression by R&D. In a first set of simulations, we ignore the physical dimensions of all parameters and pick rather arbitrary but simple values (β = 1, Qi = 0.75, Qe = 0.075). Keeping the ratio of the correlation times (τi = 5τe) and the values of the noise intensities Qe, Qi fixed, we vary the correlation times. In Figure 1, simulation results are shown for τe = 10^{−2}, 10^{−1}, 1, and 10. We recall that with fixed noise intensities, according to the result by R&D given in equation 2.12, the probability density should not depend on τe at all. It is obvious, however, in Figure 1a that the simulation data depend strongly on the correlation times, in contrast to what is predicted by equation 2.12. The difference between the original theory by R&D and the simulations is smallest for an intermediate correlation time (τe = 1). In contrast to the general discrepancy between simulations and equation 2.12, the white noise formula, equation 3.8, and the formula from the static noise theory (cf. the solid and dotted lines in Figure 1b) agree well with the simulations at τe = 0.01 (circles) and τe = 10 (diamonds), respectively. The small differences between simulations and theory decrease as we go to smaller or larger correlation times, respectively, as expected.

R&D also present results of numerical simulations (Rudolph & Destexhe, 2003), which seem to agree fairly well with their formula. In order to give a sense of the reliability of these data, we have repeated the simulations for one parameter set in Rudolph and Destexhe (2003, Fig. 2b). These data are shown in Figure 2 and compared to R&D's original solution, equation 2.12. For this specific parameter set, the agreement is indeed relatively good, although there are differences between the formula and the simulation results in the location of the maximum as well as at the flanks of the density. These differences do not vanish on extending the simulation time or decreasing the time step; hence, the curve according to equation 2.12 does not seem to be an exact solution but at best a good approximation.

The disagreement becomes significant if the correlation times are changed by one order of magnitude (see Figure 3) (in this case, we keep the variances of the noises constant, as R&D have done, rather than the noise intensities as in Figure 1). The asymptotic formulas for either vanishing (see Figure 3a) or infinite (see Figure 3b) correlation times derived in this article do a much better job in these limits. Note that the large correlation time used in Figure 3b is outside the range considered by R&D (2003). Regardless of the fact that the correlation times we have used in Figures 3a and 3b are possibly outside the physiological range, an analytical solution should also cover these cases.

Regarding the question of whether the correlation time is short (close to the white noise limit), long (close to the static limit), or intermediate (as seems to be the case for the original parameter set of Figure 2b in Rudolph & Destexhe, 2003), it is not the absolute value of τe,i,I that matters
Figure 1: Probability density of the shifted voltage compared to results of numerical simulations. (a) The density according to equation 2.12 (theory by R&D) is compared to simulations at different correlation times as indicated (τi = 5τe). Since the noise intensities are fixed, the simulated densities at different τe should all fall onto the solid line given by equation 2.12, which is not the case. (b) The simulations at small (τe = 0.01) and large (τe = 10) correlation times are compared to our expressions found in the limit cases of white and static noise: equations 3.8 and 5.4, respectively. Note that in the constant-intensity scaling, equation 5.4 depends implicitly on τe,i since the variances change as σ_{e,i}^2 = Q_{e,i}/τ_{e,i}. Parameters: β = 1, Qe = 0.075, Qi = 0.75, QI = 0, Δt = 0.001, and simulation time T = 10^5.
but the product βτe,i,I. Varying one or more of the parameters gL, ge0, gi0, a, or Cm can push the dynamics into one of the limit cases without the necessity of changing τe,i,I.
Figure 2: Probability density of the membrane voltage corresponding to the parameters in Figure 2b of Rudolph and Destexhe (2003): gL = 0.0452 mS/cm², a = 34,636 µm², Cm = 1 µF/cm², EL = −80 mV, Ee = 0 mV, Ei = −75 mV, σe = 0.012 µS, σi = 0.0264 µS, ge0 = 0.0121 µS, gi0 = 0.0573 µS; additive-noise parameters (σI, I0) are all zero; we used a time step of Δt = 0.1 ms and a simulation time of 100 s.
7.2 Probability Density of the Membrane Voltage—Extended Expression by R&D. So far we have not considered the extended expression (R&D, 2005) with the effective correlation times. Plotting the simulation data shown in Figures 1a and 3 against this new formula gives a very good, although not perfect, agreement (cf. Figures 4a and 5a). Note, for instance, in Figure 4a that the height of the peak for τe = 1 and the location of the maximum for τe = 0.1 are slightly underestimated by the new theory.

Since most of the data look similar to gaussians, we may also check whether they are described by the ETC theory (cf. equation 2.25). This is shown in Figures 4b and 5b and reveals that for the two parameter sets studied so far, the noise intensities are reasonably small, such that the ETC formula gives an approximation almost as good as the extended expression by R&D. One exception to this is shown in Figure 4b: at small correlation times where the noise is effectively white (τe = 0.1), the ETC formula fails, since the noise variances become large. For τe = 0.01, the disagreement is even worse (not shown). In this range, the extended expression captures the density better, in particular its nongaussian features (e.g., the asymmetry in the density).

Since the agreement of the extended expression with numerical simulations was so far very good, one could argue that it represents the exact solution to the problem and that the small differences are merely due to numerical inaccuracy. We will check whether the extended expression is the exact solution in two ways. First, we know how the density behaves if both multiplicative noises are very slow (βτe, βτi ≫ 1), namely, according to equation 5.4.
Figure 3: Probability density of the membrane voltage for different orders of magnitude of the correlation times τe, τi. Parameters as in Figure 2 except for the correlation times, which were chosen one order of magnitude smaller (a) or larger (b).
We thus have an additional check of whether the extended expression, equation 2.15, is exact, by comparing it not only to numerical simulation results but also to the static noise theory. Second, we have derived an exact integral expression, equation 6.7, for the stationary mean value, so we can compare the stationary mean value according to the extended expression by R&D (given in equation 6.10) to the exact expression and to numerical simulations.

To check the extended expression against the static noise theory, we have to choose parameter values for which βτe and βτi are much larger than one; at the same time, the noise variances should be sufficiently large.
Figure 4: Probability density of the membrane voltage for simulation data and parameters as in Figure 1a. The extended expression, equation 2.15 (a), and the effective time constant approximation, equation 2.25 (b), are compared to results of numerical simulations.
We compare both theories, equations 2.15 and 5.4, once for the system of equations 2.19 and 2.20 with simplified parameters at strong noise (Qe = Qi = 1) and large correlation times (βτe,i = 20) (see Figure 6a), and once for the original system (see Figure 6b). For the latter, increases in βτe,i can be achieved by increasing either gL, ge0, gi0 or the synaptic correlation times τe,i. We do both: we increase ge0 to tenfold the standard value by R&D (i.e., ge0 = 0.0121 µS → ge0 = 0.121 µS) and also multiply the standard values of the correlation times by roughly three (i.e., τe = 2.728 ms, τi = 10.45 ms → τe = 7.5 ms, τi = 30 ms); additionally, we choose a larger standard deviation for the inhibitory conductance than
Figure 5: Probability density of membrane voltage for simulation data and parameters as in Figures 2 and 3. The extended expression, equation 2.15 (a), and the effective time constant approximation, equation 2.25 (b), are compared to results of numerical simulations.
in R&D’s standard parameter set (σi = 0.0264 µS → σi = 0.045 µS). For these parameters, we have βτe ≈ 4.2 and βτi ≈ 16.8, so we may expect a reasonable agreement between static noise theory and the true probability density of the voltage obtained by simulation. Indeed, for both parameter sets, the static noise theory works reasonably well. For the simulation of the original system (see Figure 6b), we also checked that the agreement is significantly enhanced (agreement within line width) by using larger correlation times (e.g., τe = 20 ms, τi = 100 ms) as can be expected. Compared to the static noise theory, the extended expression by R&D shows stronger although not large deviations. There are differences in the location and height of the maximum of the densities for
Figure 6: Probability density of the membrane voltage for long correlation times; static noise theory (equation 5.4, solid lines) and extended expression by R&D (equation 2.15, dashed lines) against numerical simulations (symbols). (a) Density of the shifted variable v with Qe = Qi = 3, β = 1, τe = τi = 20, Ve = 1.5, Vi = −0.5. Here, the mean value is infinite. In the simulation, we implemented reflecting boundaries, affecting the density only in its tails (not shown in the figure). (b) Density for the original voltage variable with gL = 0.0452 mS/cm², a = 34,636 µm², Cm = 1 µF/cm², EL = −80 mV, Ee = 0 mV, Ei = −75 mV, σe = 0.012 µS, σi = 0.045 µS, ge0 = 0.121 µS, gi0 = 0.0574 µS, τe = 7.5 ms, τi = 30 ms. Here the mean value is finite. Inset: Same data on a logarithmic scale.
both parameter sets; prominent also is the difference between the tails of the densities (see the Figure 6b inset). Hence, there are parameters that are not completely outside the physiological range, for which the extended expression yields only an approximate description and for which the static
noise theory works better than the extended expression by R&D. This is in particular the case for strong and long-correlated noise.

7.3 Mean Value of the Membrane Voltage. The second way to check the expressions by R&D is to compare their mean values to the exact expression for the stationary mean, equation 6.7. We do this for the transformed system, equations 2.19 and 2.20, with dimensionless parameters. In Figure 7, the stationary mean value is shown as a function of the correlation time τe of the excitatory conductance. In the two panels, we keep the noise intensities Qe and Qi fixed; the correlation time of inhibition is small (in Figure 7a) or medium (in Figure 7b) compared to the intrinsic timescale (1/β = 1). We choose noise intensities Qi = 0.3 and Qe = 0.2, so that the mean value is finite because equation 6.4 is satisfied.

In Figure 7a the disagreement between the extended expression by R&D (dash-dotted line) and the exact solution (thick solid line) is apparent for medium values of the correlation time. To verify this additionally, we also compare to numerical simulation results. The latter agree with our exact theory for the mean value within the numerical error of the simulation. We also plot two limits that may help to explain why the new theory by R&D works in this special case at very small and very large values of τe. At small values, both noises are effectively white, and we have already discussed that in this case, the extended expression for the probability density, equation 2.15, approaches the correct white noise limit. Hence, the first moment should also be correctly reproduced in this limit. On the other hand, going to large correlation time τe at fixed noise intensity Qe means that the effect of the colored noise ye(t) on the dynamics vanishes. Hence, in this limit, we obtain the mean value of a system that is driven by only one white noise (i.e., yi(t)). This limit is also correctly described by R&D's new theory, since the effective noise intensity Q′e = 2Qe/[1 + βτe] vanishes for τe → ∞ if Qe is fixed. However, for medium values of τe, the new theory predicts a larger mean value than the true one. The mean value, equation 6.9, of the original solution, equation 2.12 (dotted lines in Figure 7), does not depend on the correlation time τe at all.

If the second correlation time τi is of the order of the effective membrane time constant 1/β (see Figure 7b), the deviations between the mean value of the extended expression and the exact solution are smaller but extend over all values of τe. In this case, the new solution does not approach the correct one in either of the limit cases, τe → 0 or τe → ∞. The overall deviations between the mean according to the extended expression and the exact one are small. Also, for both panels, the differences in the mean are small compared to the standard deviations of the voltage. Thus, the expression equation 6.10, corresponding to the extended expression, can be regarded as a good approximation for the mean value.

Finally, we illustrate the convergence or divergence of the mean if the condition equation 6.4 is obeyed or violated, respectively. First, we choose
Comment on “Characterization of Subthreshold Voltage Membranes”
a 0.2
white noise limit R&D extended R&D original exact solution Simulation results white-noise limit with Qe=0
0.1
in a.u.
1923
0 -0.1 -0.2 -0.3 10
-2
10
-1
10
0
10
1
10
2
τe in a.u.
in a.u.
b 0.1
R&D extended R&D original exact solution
0 -0.1 -0.2 -0.3 10
-2
10
-1
10
0
10
1
τe in a.u. Figure 7: Stationary mean value of the shifted voltage (in arbitrary units) versus correlation time (in arbitrary units) of the excitatory conductance. Noise intensities Qe = 0.2, Qi = 0.3, Q I = 0, and β = 1 are fixed in all panels. Correlation time of the inhibitory conductance: τi = 10−2 (a) and τi = 1 (b). Shown are the exact analytical result, equation 6.7 (solid line); the mean value according to the original solution, equation 6.9 (dotted line); and the mean value according to the extended expression, equation 6.10 (dash-dotted line). In panel a , we also compare to the mean value of the white noise solution for Qe = 0.2, Qi = 0.3 (thin solid line) and for Qe = 0, Qi = 0.3 (dashed line), as well as to numerical simulation results (symbols).
the original system and the standard set of parameters by Rudolph and Destexhe (2003) and simulate a large number of trajectories in parallel. All of these are started at the same value (V = 0) and each with independent noise sources, the initial values of which are drawn from the stationary
B. Lindner and A. Longtin
in Volt
1924
0.1
Theory σi=0.0264 mS Theory σi=0.0660 mS Sims σi=0.0264 mS Sims σi=0.0660 mS
0
-0.1 -4 10
-3
10
10
-2
t in seconds Figure 8: Time-dependent mean value of the original voltage variable (in volts) as a function of time (in seconds) for the initial value V(t = 0) = 0 V and different values of the inhibitory conductance standard deviation σi ; numerical simulations of equations 2.19 and 2.20 (circles) and theory according to equation 6.3 (solid lines). For all curves, ge0 = 0.0121 µS, gi0 = 0.0573 µS, σe = 0.012 µS, τe = 2.728 ms, τi = 10.49 ms, a = 34,636 µm2 , and Cm = 1 µF/cm2 . For the dashed line (theory) and the gray squares (simulations), we choose σi = 0.0264 µS; hence, in this case, parameters correspond to the standard parameter set by Rudolph and Destexhe (2003). For the solid line (theory) and the black circles, we used σi = 0.066 µS corresponding to the 250% of the standard value by R&D. At the standard parameter set, the mean value saturates at a finite level, in the second case, the mean diverges and goes beyond 100 mV within 31 ms. Simulations were carried out for 106 voltage trajectories using an adaptive time step (always smaller than 0.01 ms) that properly took into account the trajectories that diverge the strongest. The large number of trajectories was required in order to get a reliable estimate of the time-dependent mean value in the case of strong noise (σi = 0.066 µS) where voltage fluctuations are quite large.
gaussian densities. In an experiment, this corresponds exactly to fixing the voltage of the neuron via voltage clamp and then to let the voltage freely evolve under the influence of synaptic input (that has not been affected by the voltage clamp). In Figure 8 we compare the time-dependent average of all trajectories to our theory, equation 6.3 (in terms of the original variable and parameters). For R&D’s standard parameters, the mean value reaches after a relaxation of roughly 20 ms a finite value (V≈ −65 mV). The time course of the mean value is well reproduced by our theory, as it should be. Increasing one of the noise standard deviations to 2.5-fold of its standard value (σi = 0.0264 µS → 0.066 µS), which is still in the range inspected by
Comment on “Characterization of Subthreshold Voltage Membranes”
1925
R&D, results in a diverging mean.5 Again the theory (solid line) is confirmed by the simulation results (black circles). Starting from zero voltage, the voltage goes beyond 100 mV within 31 ms. In contrast to this, the mean value of the extended expression is finite (the condition equation 3.21 is obeyed) and the mean value formula for this density, equation 6.10, yields a stationary mean voltage of −66 mV. Thus, in the general colored noise case, the extended expression cannot be used to decide whether the moments of the membrane voltage will be finite. We note that the divergence of the mean is due to a small number of strongly deviating voltage trajectories in the ensemble over which we average. This implies that the divergence will not be seen in a typical trajectory and that a large ensemble of realizations and a careful simulation of the rare strong deviations (adaptive time step) are required to confirm the diverging mean predicted by the theory. Thus, although the linear model with multiplicative gaussian noise is thought to be a simple system compared to nonlinear spike generators with Poissonian input noise, its careful numerical simulation may be much harder than that of the latter type of model. 8 Conclusions We have investigated the formula for the probability density of the membrane voltage driven by multiplicative and/or additive (conductance and/or current noise) proposed by R&D in their original article. Their solution deviates from the numerical simulations in all three limits we have studied (white noise driving, colored additive noise, and static multiplicative noise). The deviation is significant over extensive parameter ranges. The extended expression by R&D (2005), however, seems to provide a good approximation to the probability density of the system for a large range of parameters. In the appendix we show where errors have been made in the derivation of the Fokker-Planck equation on which both the original and extended expressions are based. Although there are serious flaws in the derivation, we have seen that the new formula (obtained by an ad hoc introduction of effective correlation times in the original solution) gives a very good reasonable approximation to the probability density for weak noise. What could be the reason for this good agreement? The best, though still phenomenological, reasoning for the solution, equation 2.15, is as follows. First, an approximation to the probability
5 These parameter values were not considered by R&D to be in the physiological range. We cannot, however, exclude that other parameter variations (e.g., decreasing the leak conductance or increasing the synaptic correlation times) will not lead to a diverging mean for parameters in the physiological range.
1926
B. Lindner and A. Longtin
density should work in the solvable white noise limit: lim ρappr (v, Qe , Qi , τe , τi ) = ρwn (v, Qe , Qi ).
τe ,τi →0
(8.1)
Second, we know that at weak multiplicative noise of arbitrary correlation time, the effective time constant approximation will be approached: ρappr (v, Qe , Qi , τe , τi ) = ρETC (v, Qe , Qi , τe , τi ), (Qe , Qi small).
(8.2)
The latter density given in equation 2.25 can be expressed by the white noise density with rescaled noise intensities (note that the variance in the ETC approximation given in equation 2.26 has this property); furthermore, it is close to the density for white multiplicative noise if the noise is weak: ρETC (v, Qe , Qi , τe , τi )
= (Qe ,Qi
small)
ρETC (v, Qe /(1 + βτe ), Qi /(1 + βτi ), 0, 0),
≈
ρ(v, Qe /(1 + βτe ), Qi /(1 + βτi ), 0, 0)
=
ρwn (v, Qe /(1 + βτe ), Qi /(1 + βτi )).
(8.3)
Hence, using this equation together with equation 8.1, one arrives at ρappr (v, Qe , Qi , τe , τi ) ≈ ρwn (v, Qe /(1 + βτe ), Qi /(1 + βτi )).
(8.4)
This approximation, which also obeys equation 8.1, is the extended expression by R&D. It is expected to function in the white noise and the weak noise limits and can be regarded as an interpolation formula between these limits. We have seen that for stronger noise and large correlation times (i.e., in a parameter regime where neither of the above assumptions of weak or uncorrelated noise holds true), this density and its mean value disagree with numerical simulation results as well as with our static noise theory. Regarding the parameter sets for which we checked the extended expression for the probability density, it is remarkable that the differences to numerical simulations were not stronger. Two issues remain. First, we have shown that the linear model with gaussian conductance fluctuations can show a diverging mean value. Certainly, for higher moments, as, for instance, the variance, the restrictions on parameters will be even more severe than that for the mean value (this can be concluded from the tractable limit cases we have considered). As demonstrated in the case of the stationary mean value, the parameter regime for such a divergence cannot be determined using the different solutions proposed by R&D. Of course, a real neuron can be driven by a strong synaptic input without showing a diverging mean voltage—the divergence of moments found
Comment on “Characterization of Subthreshold Voltage Membranes”
1927
above is just due to the limitations of the model. One such limitation is the diffusion approximation on which the model is based. Applying this approximation, the synaptically filtered spike train inputs have been replaced by OUPs. In the original model with spike train input, it is well known that the voltage cannot go below the lowest reversal potential E i or above the excitatory reversal potential E e if no current (additive) noise is present (see, e.g., L´ansky´ & L´ansk´a, 1987, for the case of unfiltered Poissonian input). In this case, we do not expect a power law behavior of the probability density at large values of the voltage. Another limitation of the model considered by R&D is that no nonlinear spike-generating mechanism has been included. In particular, the mechanism responsible for the voltage reset after an action potential would prevent any power law at strong, positive voltage. Thus, we see that at strong, synaptic input, the shot-noise character of the input and nonlinearities in the dynamics cannot be neglected and even determine whether the mean of the voltage is finite. The second issue concerns the consequences of the diffusion approximation for the validity of the achieved results. Even if we assume a weak noise such that all the lower moments like mean and variance will be finite, is there any effect of the shot-noise character of the synaptic input that is not taken into account properly by the diffusion approximation? Richardson and Gerstner (2005) have recently addressed this issue and shown that the shot-noise character will affect the statistics of the voltage and that its contribution is comparable to that resulting from the multiplicativity of the noise. Thus, for a consistent treatment, one should either include both features (as done by Richardson and Gerstner, 2005, in the limit of weak synaptic noise) or none (corresponding to the effective timescale approximation; cf. Richardson & Gerstner, 2005). Summarizing, we believe that the use of the extended expression by R&D is restricted to parameters obeying β Qe + Qi .
(8.5)
This restriction is consistent with (1) the diffusion approximation on which the model is based, (2) a qualitative justification of the extended expression by R&D as given above, and (3) the finiteness of the stationary mean and variance. For parameters that do not obey the condition equation 8.5, one should take into account the shot-noise statistics of the synaptic drive. Recent perturbation results were given by Richardson and Gerstner (2005) assuming weak noise; we note that the small parameter in this theory is (Qe + Qi )/β and therefore exactly equal to the small parameter in equation 8.5. The most promising result in our letter seems to be the exact solution for the time-dependent mean value, a statistical measure that can be easily determined in an experiment and might tell us a lot about the synaptic
1928
B. Lindner and A. Longtin
dynamics and its parameters. The only weakness of this formula is that it is still based on the diffusion approximation, that is, on the assumption of gaussian conductance noise. One may, however, overcome this limitation by repeating the calculation for synaptically filtered shot noise. Appendix: Analysis of the Derivation of the Fokker-Planck Equation Here we show where in the derivation of the Fokker-Planck equation by R&D errors have been made. Let us first note that although R&D use a so-called Ito rule, there is no difference between the Ito and Stratonovich interpretations of the colored noise–driven membrane dynamics. Since the noise processes possess a finite correlation time, the Ito-Stratonovich dilemma occurring in systems driven by white multiplicative noise is not an issue here. To comprehend the errors in the analytical derivation of the FokkerPlanck equation in R&D, it suffices to consider the case of only additive OU noise. For clarity we will use our own notation: the OUP is denoted by yI (t), and we set h I = 1 (the latter function is used in R&D for generality). R&D give a formula for the differential of an arbitrary function F (v(t)) in equation B.9. 1 d F (v(t)) = ∂v F (v(t))dv + ∂v2 F (v(t))(dv)2 . 2
(A.1)
R&D use the membrane equation in its differential form, which for vanishing multiplicative noises reads dv = f (v)dt + dw I ,
(A.2)
where the drift term is f (v) = −βv and w I is the integrated OU process yI : wI =
t
ds yI (s).
(A.3)
0
Inserting equation A.2 into equation A.1, we obtain 1 d F (v(t)) = ∂v F (v(t)) f (v(t))dt + ∂v F (v(t))dw I + ∂v2 F (v(t))(dw I )2 . 2
(A.4)
This should correspond to equation B.10 in R&D for the case of zero multiplicative noise. However, our formula differs from equation B.10 in one important respect: R&D have replaced (dw I )2 by 2α I (t)dt using their Ito
Comment on “Characterization of Subthreshold Voltage Membranes”
1929
rule,6 equation A.13a. Dividing by dt, averaging, and using the fact that for finite τ I dw I (t)/dt = yI (t), we arrive at dF (v(t)) 1 2 (dw I )2 = ∂v F (v(t)) f (v(t)) + ∂v F (v(t))yI (t) + ∂ F (v(t)) . dt 2 v dt (A.5) This should correspond to equation B.12 in R&D (again for the case of vanishing multiplicative noise) but is not equivalent to the latter equation for two reasons. First, R&D set the second term on the right-hand side to zero, reasoning that the mean value yI (t) is zero (they also use an argument about h {e,i,I } , which is irrelevant in the additive noise case considered here). Evidently if yI (t) is a colored noise, it will be correlated to its values in the past y(t ) with t < t. The voltage v(t) and any nontrivial function F (v(t)) is a functional of and therefore correlated to yI (t ) with t < t. Consequently, there is also a correlation between yI (t) and F (v(t)), and thus ∂v F (v(t))yI (t) = ∂v F (v(t))yI (t) = 0.
(A.6)
Hence, setting the second term (which actually describes the effect of the noise on the system) to zero is wrong.7 This also applies to the respective terms due to the multiplicative noise. Second, the last term on the right-hand side of equation A.5 was treated as a finite term in the limit t → ∞. According to R&D’s equation A.13a (for i = j), equation 3.2, and equation 3.3, limt→∞ (dw I )2 = limt→∞ 2α I (t)dt = σ˜ I2 τ I dt and, thus (dw 2I )/dt → σ˜ I2 τ I as t → ∞. However, the averaged variance of dw I = yI (t)dt is (dw I )2 = yI (t)2 (dt)2 = σ˜ I2 (dt)2 and therefore the last term in equation A.5 is of first order in dt (since (dw I )2 /dt = yI (t)2 dt ∼ dt) and vanishes. This is the second error in the derivation. We note that the limit in equation 3.3 is not correctly carried out. Even if we follow R&D in using their relations, equation A.13a, together with the correct relation, equation A.10a (instead of the white noise formula, equation A.12a), we obtain that for finite τ I , the mean squared increment
6 Note that R&D use α (t) for two different expressions: according to equation B.8 for I σ˜ I2 [τ I (1 − exp(−t/τ I )) − t] + w 2I (t)/(2τ I ) but also according to equation 3.2 for the average of this stochastic quantity. 7 For readers still unconvinced of equation A.6, a simple example will be useful. Let F (v(t)) = v 2 (t)/2. Then
∂v F (v(t))yI (t) = v(t)yI (t). In the stationary state, this average can be calculated as dvdyI vyI P0 (v, yI ) using the density equation 4.6. This yields v(t)yI (t) = Q I /[1 + βτ I ], which is finite for all finite values of the noise intensity Q I and correlation time τ I . Note that this line of reasoning is valid only for truly colored noise (τ I > 0); the white noise case has to be treated separately.
1930
B. Lindner and A. Longtin
(dw I )2 is zero in linear order in dt for all times t, which is in contradiction to equation 3.3 in R&D. This incorrect limit stems from using the white noise formula, equation (A.12a) which R&D assume to go from equation 3.2 to equation 3.3 in R&D (2003). The use of equation A.12a is justified by R&D by the steady-state limit t → ∞ with t/τ I 1. However, t → ∞ with t/τ I 1 does not imply that τ I → 0 and that one can use equation A.12a, which holds true only for τ I → 0. In other words, a steady-state limit does not imply a white noise limit. We now show that keeping the proper terms in equation A.5 does not lead to a useful equation for the solution of the original problem. After applying what was explained above, equation A.5 reads correctly, dF (v(t)) = ∂v F (v(t)) f (v(t)) + ∂v F (v(t))yI (t) . dt
(A.7)
Because of the correlation between v(t) and yI (t), we have to use the full two-dimensional probability density to express the averages:
∂v F (v(t)) f (v(t)) =
dyI (∂v F (v)) f (v)P(v, yI , t)
dv
=
∂v F (v(t))yI (t) =
dv(∂v F (v)) f (v)ρ(v, t)
dv
dyI (∂v F (v))yI P(v, yI , t).
(A.8)
Inserting these relations into equation A.7, performing an integration by part, and setting F (v) = 1 leads us to ∂t ρ(v, t) = −∂v ( f (v)ρ(v, t)) − ∂v
dyI yI P(v, y, t) ,
(A.9)
which is not a closed equation for ρ(v, t) or a Fokker-Planck equation. The above equation with f (v) = −βv can be also obtained by integrating the two-dimensional Fokker-Planck equation, equation 4.5, over yI . In conclusion, by neglecting a finite term and assuming a vanishing term to be finite, R&D have effectively replaced one term by the other; the colored noise drift term is replaced by a white noise diffusion term, the latter with a prefactor that corresponds to only half of the noise intensity. This amounts to a white noise approximation of the colored conductance noise, although with a noise intensity that is not correct in the white noise limit of the problem.
Comment on “Characterization of Subthreshold Voltage Membranes”
1931
Acknowledgments This research was supported by NSERC Canada and a Premiers Research Excellence Award from the government of Ontario. We also acknowledge an anonymous reviewer for bringing to our attention the Note by R&D (2005), which at that time had not yet been published. References Abramowitz, M., & Stegun, I. A. (1970). Handbook of mathematical functions. New York: Dover. Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic currents dynamics. J. Theor. Biol., 195, 87–95. Gardiner, C. W. (1985). Handbook of stochastic methods. Berlin: Springer-Verlag. H¨anggi, P., & Jung, P. (1995). Colored noise in dynamical systems. Adv. Chem. Phys., 89, 239–326. Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximation for neuronal activity including synaptic reversal potentials. J. Theor. Neurobiol., 2, 127–153. Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: SpringerVerlag. ´ P., & Smith, C. E. (1994). Synaptic transmission in a diffusion L´ansk´a, V., L´ansky, model for neural activity. J. Theor. Biol., 166, 393–406. ´ P., & L´ansk´a, V. (1987). Diffusion approximation of the neuronal model with L´ansky, synaptic reversal potentials. Biol. Cybern., 56, 19–26. Richardson, M. J. E. (2004). Effects of synaptic conductance on the voltage distribution and firing rate of spiking neurons. Phys. Rev. E, 69, 051918. Richardson, M. J. E., & Gerstner, W. (2005). Synaptic shot noise and conductance fluctuations affect the membrane voltage with equal significance. Neural Comp., 17, 923–948. Risken, H. (1984). The Fokker-Planck equation. Berlin: Springer. Rudolph, M., & Destexhe, A. (2003). Characterization of subthreshold voltage fluctuations in neuronal membranes. Neural Comp., 15, 2577–2618. Rudolph, M., & Destexhe, A. (2005). An extended analytical expression for the membrane potential distribution of conductance-based synaptic noise. Neural Comp., 17, 2301–2315. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. Tuckwell, H. C. (1989). Stochastic processes in the neuroscience. Philadelphia: Society for Industrial and Applied Mathematics.
Received February 3, 2005; accepted October 5, 2005.
LETTER
¨ Communicated by Klaus-Robert Muller
Kernel Projection Classifiers with Suppressing Features of Other Classes Yoshikazu Washizawa∗
[email protected] Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan
Yukihiko Yamashita
[email protected] Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan
We propose a new classification method based on a kernel technique called suppressed kernel sample space projection classifier (SKSP), which is extended from kernel sample space projection classifier (KSP). In kernel methods, samples are classified after they are mapped from an input space to a high-dimensional space called a feature space. The space that is spanned by samples of a class in a feature space is defined as a kernel sample space. In KSP, an unknown input vector is classified to the class of which projection norm onto the kernel sample space is maximized. KSP can be interpreted as a special type of kernel principal component analysis (KPCA). KPCA is also used in classification problems. However, KSP has more useful properties compared with KPCA, and its accuracy is almost the same as or better than that of KPCA classifier. Since KSP is a single-class classifier, it uses only self-class samples for learning. Thus, for a multiclass classification problem, even though there are very many classes, the computational cost does not change for each class. However, we expect that more useful features can be obtained for classification if samples from other classes are used. By extending KSP to SKSP, the effects of other classes are suppressed, and useful features can be extracted with an oblique projection. Experiments on two-class classification problems, indicate that SKSP shows high accuracy in many classification problems. 1 Introduction Kernel-based methods for pattern recognition, such as support vector machines (SVMs), kernel principal component analysis (KPCA), and the ∗ Current address: Brain Science Institute, RIKEN, 2-1 Hirosawa, wako-shi, Saitama 351-0198, Japan
Neural Computation 18, 1932–1950 (2006)
C 2006 Massachusetts Institute of Technology
Kernel Projection Classifiers
1933
¨ kernel Fisher discriminant (KFD), achieve high accuracy (Muller, Mika, ¨ R¨atsch, Tsuda, & Scholkopf, 2001; Vapnik, 1998). In particular, SVMs are widely used for classification and regression problems. In kernel-based methods, an input pattern is classified after it has been mapped to a feature space F whose dimension is very high or infinite. Let be a mapping from an N-dimensional input space to a feature space: :
R N → F.
(1.1)
Instead of calculating directly, one can introduce a kernel function k that satisfies k(x, y) = (x), (y),
(1.2)
where ·, · denotes the inner product. If we use the Mercer kernel function ¨ (Scholkopf, Mika, et al., 1999), a exists that satisfies equation 1.2. Mercer kernel functions, which are widely used, are as follows. Linear : Polynomial : Gaussian :
k(x, y) = x, y
(1.3)
k(x, y) = (x, y + 1) x − y2 k(x, y) = exp − . 2σ 2 d
(1.4) (1.5)
When equation 1.3 is the kernel function, is an identity operator. When the polynomial function, equation 1.4, and the gaussian kernel function, equation 1.5, are used, an N-dimensional input vector is mapped into a ( N+d Cd − 1)-dimensional or an infinite-dimensional Hilbert space, respectively. SVMs can be calculated only from the inner product of learning samples or an unknown input vector. Thus, kernel methods involve exchanging inner products in the feature space for kernel functions. Using a mapping , SVMs can achieve high accuracy for many types of classification and regression problems. However, SVMs are two-class classifiers, and a quadratic optimization problem has to be solved in the learning stage, so the computational cost is very high. Thus, for multiclass classification problems, the computational cost increases with the number of classes. A kernel-based method called kernel PCA (KPCA), which is extended from PCA or the Karhunen-Lo`eve transform (KLT), has been proposed ¨ ¨ (Scholkopf, Smola, & Muller, 1998). Subspace methods utilizing PCA or KLT are used widely for pattern recognition or classification problems (Watanabe & Pakvasa, 1973; Oja, 1983). KPCA has also been applied to classification problems in work by Tsuda (1999), Maeda and Murase (1999), and Murata
1934
Y. Washizawa and Y. Yamashita
and Onoda (2001). We call such methods a KPCA classifier. KPCA classifiers perform better than conventional subspace methods. Although they do not exceed kernel two-class classifiers (e.g., SVM or KFD) in accuracy, they have an advantage from the viewpoint of the computational cost of a multiclass classification problem because they are single-class classifiers. Note that SVMs and KFDs are two-class classifiers, and they require all samples belonging to all classes in the learning stage. A single-class classifier of each class needs only samples belonging to a class in the learning stage. Thus, a KPCA classifier is suitable for solving multiclass classification problems. Moreover, if the number of classes changes, two-class classifiers of all classes have to be reconstructed. In contrast, single-class classifiers require only the newly added classes to be constructed. Subspace methods in the input space need to reduce the dimension of the space that expresses the features of each class because spaces that are spanned by samples of various classes overlap. However, in the feature space, the overlap with other classes is small. Thus, for subspace methods in a feature space, reducing the dimension of the space is unnecessary (see section 4). Thus, the kernel sample space classifier (KSP) was proposed as a kernel subspace method that does not require reducing the dimension of the space for each class. We describe the details of KSP in section 2. Although KSP can be interpreted as a special case of KPCA, it is constructed by calculating an inverse matrix, whereas KPCA needs the solution of an eigenvalue problem. Thus, KSP can incorporate incremental learning with updates of an inverse matrix. Moreover, a gaussian elimination method can be applied to obtain an inverse matrix. This is useful in applications that need many reconstructions, such as leave-one-out cross validation or time-series data. KSP has almost the same or higher performance than the KPCA classifier (Washizawa & Yamashita, 2004). In this letter, we propose a suppressed kernel sample projection classifier (SKSP) that is extended from KSP. KSP and KPCA can extract the features that are included in not fewer than two classes. Such common features among classes are not only useless for classification; they are also harmful as noise. Therefore, such features have to be suppressed. We introduce a suppression trick to inhibit them. For the criterion of SKSP, we add a term that suppresses such common features by using some of the samples belonging to other classes in the learning stage. Section 2 describes characterization by an optimality criterion and introduces regularization. Section 3 defines a SKSP criterion as an extension of this viewpoint of KSP and provides its solution. Section 4 shows experimental results of two-class classification problems to compare the performance of SKSP with other classification methods. Finally section 5 discusses the advantages of SKSP and shows the differences between it and other methods.
Kernel Projection Classifiers
1935
Figure 1: Kernel sample space projection.
2 Kernel Sample Space Projection 2.1 Definition of KSP. Let i be a set of vectors that belong to class i and f 1i , f 2i , . . . , and f Li i be samples in i , where L i is the number of samples in class i. We define an operator Si as Si = f 1i f 2i . . . f Li i .
(2.1)
R(Si ) is a kernel sample space of class i where R(A) denotes the range of A. Generally, the dimension of the feature space is much larger than the number of samples L i . Therefore, a kernel sample space is sparse and expresses features of the class. In KSP, the similarity between an unknown input vector and a class i is evaluated by the norm of its projection onto the kernel sample space of class i. An unknown input vector f x is classified according to the maximum norm (see Figure 1). Let K Si be a Gram matrix in F of class i as K Si = Si∗ Si i i k f1 , f1 .. = i. i k f L i , f1
· · · k f 1i , f Li i .. .. , . i. i · · · k f Li , f Li
(2.2)
(2.3)
where A∗ is the adjoint operator of A. The projection operator PR(Si ) and the projection norm of ( f x ) onto R(Si ) are expressed as †
PR(Si ) = Si K Si Si∗
(2.4)
1936
Y. Washizawa and Y. Yamashita †
PR(Si ) ( f x )2 = h( f x ), K Si h( f x ),
(2.5)
¨ where h(x) = Si∗ (x) is the empirical kernel map defined by Scholkopf, Mika, et al. (1999) and A† is a Moore-Penrose generalized inverse operator that satisfies AA† A = A, A† AA† = A† , (AA† )∗ = AA† , and(A† A)∗ = A† A. If A is a nonsingular operator, A† = A−1 . Although ( f x ) is a vector in a high-dimensional space or in an infinite Hilbert space, Si∗ ( f x ) is an L i -dimensional real vector, that is, Si∗ ( f x ) = (k( f 1i , f x ) k( f 2i , f x ), . . . , k( f Li i , f x )) , and we can calculate it directly. An unknown input vector f x is classified into class i if and only if PR(Si ) ( f x )2 > PR(S j ) ( f x )2 ∀ j = i.
(2.6)
2.2 Properties of KSP. KSP is characterized by the following optimality criterion. Accordingly, a regularized KSP and a suppressed KSP are defined, min :
J [Xi ] =
Li 1 ( f si ) − Xi ( f si )2 Li
(2.7)
s=1
subject to :
N (Xi ) ⊃ R(Si )⊥ ,
(2.8)
where N (A) is the null space of A. To obtain the solution of KPCA, one has to solve the eigenvalue problem whose size is equal to the number of samples. However, in KSP, we obtain a solution by simply calculating the inverse problem of the size of the number of samples. If we use Sherman-Morrison-Woodbury’s theorem and its extension in Rohde (1965), incremental learning is easily achieved in KSP. As we mention further in the next section, Tikhonov’s regularization is introduced to KSP in order to avoid the overfitting problem, whereas KPCA is interpreted as a truncated singular value decomposition (TSVD) from the viewpoint of its regularization. † If the Gram matrix K Si is nonsingular, K Si = K S−1 . Since the problem we i have to solve is not the generalized inverse but the inverse problem, we can obtain a solution more easily. If a gaussian kernel function is used, the Gram matrix is always nonsingular unless equal samples exist. If polynomial kernel function is used and the dimension of its feature space is large enough, the Gram matrix is considered to be nonsingular unless equal samples exist. 2.3 Regularization of KSP. Generally, a set of learning samples will include outliers or noisy samples. Thus, the generalization capability of classifiers may not be high, even if they can classify finite learning samples correctly. This is the overfitting or overlearning problem, and it can be avoided by using regularization or model selection. For example, Cortes
Kernel Projection Classifiers
1937
¨ and Vapnik (1995) and R¨atsch, Onoda, and Muller (2001) introduced a soft margin for SVM and AdaBoost technique. In KSP, the learned patterns are always classified correctly as long as the Gram matrix is nonsingular. Thus, the overfitting problem may occur when it is used. The overfitting problem occurs when the classifier has a decision boundary that is too complex. If the classifier has a discriminant function, the complexity of its decision boundaries is measured using the variation of the function with respect to a very small variation in the input vector. Let be the former variation, δ f x be a small variation of f x , and d : R N → R be a discriminant function. Then is expressed as =
(d( f x + δ f x ) − d( f x ))/d( f x ) . δ f x / f x
(2.9)
For KSP, K SP is expressed as K SP = ≤
PR(Si ) ( f x + δ f x ) − PR(Si ) ( f x ) fx · δ f x PR(Si ) ( f x ) PR(Si ) ( f x + δ f x ) − PR(Si ) ( f x ) fx · . δ f x PR(Si ) ( f x )
(2.10)
To suppress K SP directly is difficult because is nonlinear. Instead of K SP , we introduce K SP , which is a variation of the feature with respect to a very small variation δ( f x ) of ( f x ). Then we have K SP =
PR(Si ) (( f x ) + δ( f x )) − PR(Si ) ( f x ) ( f x ) · δ( f x ) PR(Si ) ( f x )
≤
PR(Si ) (( f x ) + δ( f x )) − PR(Si ) ( f x ) ( f x ) · δ( f x ) PR(Si ) ( f x )
=
PR(Si ) δ( f x ) ( f x ) · . δ( f x ) PR(Si ) ( f x )
(2.11)
(2.12)
Suppression of has been discussed in reference to ill-posed problems and regularization (Groetsch, 1993). Most ill-posed problems are caused fx ) by the first part of , d( f x +δδf fxx)−d( . If d( f ) = Af , the maximum value of the first part of is given by its operator norm, which is defined as A = sup f =1 Af . Tikhonov’s regularization avoids ill-posed problems (Tikhonov & Arsenin,1977). It suppresses the Frobenius norm, which is defined as A2 = tr[A∗ A]. Since A ≤ A2 , we can avoid ill-posed problems by suppressing the Frobenius norm.
1938
Y. Washizawa and Y. Yamashita
For KSP, from equation 2.12, we add a term for suppressing X2 to equation 2.7 by using Tikhonov regularization. We define the regularized KSP as follows. Definition 1 (Regularized KSP). Regularized KSP is defined by the solution of the following optimization problem,
min : subject to :
J [Xi ] =
Li i 1 f − Xi f i 2 + µXi 2 , s s 2 L i s=1
N (X) ⊃ R(Si )⊥ ,
(2.13) (2.14)
where µ > 0 is a regularization parameter. Theorem 1.
A solution of regularized KSP is
P˜ R(Si ) = Si (K Si + µL i I )−1 Si∗ ,
(2.15)
where I denotes the identity matrix. This theorem is easily proven from theorem 3 (see section 3.3). 3 Suppressed KSP 3.1 Definition of Suppressed KSP. In KSP, an orthogonal projection can extract the features of each category. Thus, the projection norm of an unknown input vector ( f x ) onto R(Si ) stands for the similarity between f x and class i. However, KSP may extract features that belong to more than one class. Such features cannot be used for classification, since they may be as harmful as noise. They can be suppressed using an oblique projection because its direction is determined by a set of samples that belong to other classes. Let i be a set of vectors that do not belong to class i. Let g1i , g2i . . . and g iMi be samples in i and Ti = g1i g2i . . . g iMi
(3.1)
Ui = [Si Ti ]
(3.2)
K Ui = Ui∗ Ui .
(3.3)
Next, we introduce the following criterion for SKSP.
Kernel Projection Classifiers
1939
Definition 2 (Suppressed KSP). Suppressed KSP (SKSP) is defined by the solution of following optimization problem:
min :
J [Xi ] =
subject to:
Li Mi
i 1 f − Xi f i 2 + α Xi g i 2 s s t L i s=1 Mi t=1
(3.4)
N (Xi ) ⊃ R(Ui )⊥ ,
(3.5)
where α is a parameter that controls the effect of the suppression of other classes. However if K Ui is nonsingular (i.e., all of samples are linearly independent), α is vanished from the solution. After introducing regularization to SKSP, α appears in its solution. The concept behind the criterion is based on the least-squares estimation (Luenberger, 1969) and the relative Karhunen-Lo`eve transform (Yamashita & Ogawa, 1996). The additional term Xi (gti )2 is used to minimize the Mt by Xi . From this term, the features in both features extracted from {gti }t=1 i Ls i Mt { f s }s=1 and {gt }t=1 are suppressed because the similarity between an unknown input pattern f x and class i is obtained from Xi ( f x )2 . Criterion 3.7 cannot determine where X maps a vector out of R(Ui ). Constraint 3.5 removes the component orthogonal to R(Ui ). We provide a solution to this criterion in the form of the next theorem. Theorem 2. Let I L i 0 L i Mi Di = , 0 Mi L i 0 Mi Mi
(3.6)
where Ia is the identity matrix whose size is a and 0 a b is the a × b matrix of which
all elements are zero. If K Ui is nonsingular, the SKSP operator PR(S is given as i)
PR(S = Ui Di K U−i1 Ui∗ . i)
(3.7)
3.2 Properties of SKSP. Proposition 1.
is a projection operator onto R(Si ). If K Ui is nonsingular, PR(S i)
Proposition 2.
If K Ui is nonsingular,
PR(S P = PR(S , i ) R(Ui ) i)
(3.8)
= PR(S , PR(Ui ) PR(S i) i)
(3.9)
1940
Y. Washizawa and Y. Yamashita
Figure 2: Suppressed kernel sample space projection.
where PR(Ui ) is the orthogonal projection operator onto R(Ui ). Proposition 3.
v = 0 for all v ∈ R(Ti ). If K Ui is nonsingular, PR(S i)
Figure 2 shows a sketch of SKSP. From propositions 1, 2, and 3, PR(S ( f x ) i) can be considered as follows. First, ( f x ) is orthogonally projected onto R(Ui ), after it is projected onto R(Si ) along R(Ti ). The similarity between f x
and i against i is given as PR(S ( f x ). If i = φ, PR(S = PR(Si ) . Thus, i) i) SKSP is an extension of KSP. In actual problems, KSP can extract necessary features by itself. Thus, we do not have to use all samples of other classes; only the samples that are similar to i have to be included in i . Since the similarity of an input vector is evaluated using the projection norm onto R(Si ), it is sufficient that samples whose projection norms are large are included in i .
3.3 Regularized SKSP. We also introduce a Tikhonov’s regularization term to SKSP as well as KSP. Definition 3 (Regularized SKSP). min :
J [Xi ] =
Li i 1 f − Xi f i 2 s s L i s=1
+
Mi α Xi g i 2 + µXi 2 2 t Mi t=1
subject to: N (Xi ) ⊃ R(Ui )⊥ ,
(3.10) (3.11)
Kernel Projection Classifiers
1941
where µ > 0 is the regularization parameter. Theorem 3. Let L i I L i 0 L i ,Mi ˜i = D . i 0 Mi ,L i M I α Mi
(3.12)
A solution of regularized SKSP is
˜ i )−1 Ui∗ . = Ui Di (K Ui + µ D P˜ R(S i)
(3.13)
Let ˜ i, K˜ Ui = K Ui + µ D
h i ( f x ) = Ui∗ ( f x ) = k f x , f 1i , . . . , k f x , f Li i , k f x , g1i , . . . , k f x , g iMi . Then the similarity between an unknown input vector f x and i against i is given as P˜
2 = h i ( f x ), K˜ −1 Di K U Di K˜ −1 h i ( f x ) . i Ui Ui
R(Si ) ( f x )
(3.14)
4 Experiments We used several practical data sets that were used in R¨atsch et al. (2001), ¨ ¨ Mika, R¨atsch, Weston, Scholkopf, and Muller (1999) and R¨atsch (2001).1 It consists of 13 binary classification problems, and each data set consists of 100 or 20 realizations. We used gaussian kernel function (see equation 1.5) in our approach. The parameters of the kernel function, the regularization, and α were obtained with a fivefold cross validation. We extracted training samples from the first five realizations of each data set. For each of these realizations, a five-fold cross-validation procedure gives the best model among several parameters. About 10 values were tested for each parameter in two stages; first a global search was done to find a good guess for the parameter; the guess become more precise in the second stage. The model parameters were obtained as the median of five estimations and were used throughout the training on all 100 realizations of the data set. If there were identical samples, we added only one of them to the learning set because the kernel Gram matrix is singular if identical samples exist. The kernel sample space was not changed by this operation. The 1 The data sets were downloaded online from http://ida.first.fraunhofer.de/projects/ bench/benchmarks.htm.
1942
Y. Washizawa and Y. Yamashita
Table 1: Mean Test Error Rates and Their Standards Deviations. Data set
SKSP
KSP
AB Reg
SVM
KFD
Banana Breast Cancer Diabetes Flare Solar German Heart Image Ringnorm Splice Thyroid Titanic Twonorm Waveform
10.5 ± 0.4 25.0 ± 4.9
11.3 ± 0.6 27.0 ± 4.1
10.9 ± 0.4 26.5 ± 4.5
11.5 ± 0.7 26.0 ± 4.7
10.8 ± 0.5 24.5 ± 4.6
23.1 ± 1.6 44.8 ± 1.8 23.2 ± 2.2 15.8 ± 2.2 2.6 ± 0.5 15.5 ± 1.9 14.2 ± 0.7 3.9 ± 2.2 27.3 ± 8.8 2.3 ± 0.1 10.7 ± 0.7
26.5 ± 1.8 46.8 ± 4.7 25.5 ± 2.4 18.8 ± 3.5 3.2 ± 0.6 41.4 ± 17.8 12.2 ± 0.9 3.9 ± 2.3 31.1 ± 13.5 2.8 ± 0.1 11.1 ± 0.4
23.8 ± 1.8 34.2 ± 2.2 24.7 ± 2.4 16.5 ± 3.5 2.7 ± 0.6 1.6 ± 0.1 9.5 ± 0.7 4.6 ± 2.2 22.6 ± 1.2 2.7 ± 0.2 9.8 ± 0.8
23.5 ± 1.73 32.4 ± 1.8 23.6 ± 2.1 16.0 ± 3.3 3.0 ± 0.6 1.7 ± 0.1 10.9 ± 0.7 4.8 ± 2.2 22.4 ± 1.0 3.0 ± 0.2 9.9 ± 0.4
23.2 ± 1.6 33.2 ± 1.7 23.7 ± 2.2 16.1 ± 3.4 4.8 ± 0.6 1.5 ± 0.1 10.5 ± 0.6 4.2 ± 2.1 23.3 ± 2.1 2.6 ± 0.2 9.9 ± 0.4
Number of bold Number of italic
7
1
2
2
2
1
0
3
3
6
Notes: AB Reg: Regularized AdaBoost. KFD: kernel Fisher discriminant. The best result is written in boldface, and the second best is italicized.
Table 2: Comparison of Run Times (Seconds).
Learning stage Recognition stage
KSP
SKSP
SVM
0.19 0.13
0.53 2.02
3.40 0.04
suppression set i of class i consists of all samples in the other class since the learning sets of these problems are not large. Table 1 lists mean test error rates and their standard deviations. The underlined results show the significance by t-test with 5% significant level between SKSP and SVM. The results except for KSP and SKSP are taken from the articles mentioned above. Furthermore, we compared the runtimes of SKSP, KSP, and SVM. Table 2 shows the computational cost in the learning stage and the recognition stage of one realization of the German data set. We used SVM Toolbox for Matlab, provided by A. Schwaighofer, to compare run times. For KSP and SKSP, we used ordinary Matlab code. We used a computer with a Pentium 4 3-GHz CPU and 2-Gbyte memory for computations. To show that the reduction of dimension in feature space is not so valid in many cases, we provide another experimental result. We present the
Kernel Projection Classifiers
1943
50 banana breast-cancer diabetis flare-solar german heart image ringnorm splice thyroid titanic twonorm waveform
Error rate [%]
40
30
20
10
0 0
5
10
15 Rank of KPCA
20
25
30
Figure 3: Error rates of KPCA classifier with respect to its rank.
error rates of the KPCA classifier with respect to its rank in Figure 3. Kernel parameters that achieved the best result were chosen in each rank from several preset parameters. The SKSP classifier had the lowest error rates in many problems. Thus, we can say that SKSP can suppress the effect of overlapping features in the other class and can extract important features. In KSP and SKSP, the error rates of the Ringnorm data set were high. Data of each class are sampled from an isotropic normal distribution. The means of two classes are almost the same, and their variances are different. Since any pair of random variables of an isotropic normal distribution is independent and their kernel covariance operator vanishes (Bach & Jordan, 2002), it is difficult to extract features by a subspace. According to Figure 3, KPCA cannot extract their features either. Figure 3 shows that error rates are stable and low in higher rank in many classification problems. These results prove the concept of KSP is valid. The sharp increases of error rates in some problems are due to ill-posedness. It can be suppressed by regularization introduced to KSP and SKSP. 5 Discussion Here, we compare SKSP with other classification methods and clarify their differences.
1944
Y. Washizawa and Y. Yamashita
The remarkable difference between SKSP and SVM or KFD is that SKSP is a quadratic discriminant function, while SVM and KFD are linear discriminant functions in the feature space. In the case of classification in the input space (not using a kernel method), quadratic classifiers generally perform better than linear ones because they have more degrees of freedom. When the kernel methods are used, we cannot know the advantages of the quadratic classifier because all discriminant functions are nonlinear. However, the advantages of SKSP were shown by our experiments. In SVM, a separating hyperplane is determined by a few samples called support vectors (SVs). The separating hyperplane depends on only samples around the boundary and does not depend on other samples or their distribution. If there are noisy samples or outliers in the learning sample, the separating hyperplane will be deteriorated by them because they become SVs with high probability. Thus, SVM is not fundamentally robust even if regularizing methods, such as soft margin (Cortes & Vapnik, 1995) ¨ or ν-SVM (Scholkopf, Bartlett, Smola, & Williamson, 1999), are used. In general, a classifier has a trade-off between robustness and sparseness in relation to the number of samples. The computational cost of SVM in the recognition stage is low because its solution is sparse. Let k and s be the computational costs of calculating a kernel function and a multiplication, respectively. Let L and L SV be the number of learning samples and SVs, respectively. Then main computational costs in the recognition stage are given as: SVM:
(k + s)L SV
KFD:
(k + s)L
SKSP:
s L 2 + (k + s)L .
Note that s < k and L SV < L. Only SKSP has a second-order term with respect to L, because it is a quadratic discriminant in the feature space. However, as mentioned in section 3, not all samples belonging to other classes have to be included. Hence, we can decrease L and the computational cost. SVM calculation takes a lot of time for the learning stage because a quadratic optimization problem must be solved. KFD involves solving an eigenvalue problem or a quadratic optimization problem transformed from it (Mika et al., 2000). On the other hand, the SKSP solution is given as a closed form with an inverse operation and multiplications. Moreover, as stated above, not all the samples belonging to other classes have to be used. The steps for calculating the inverse and multiplication are O(L 3 ). Generally, matrix inverse problems are easier than quadratic optimization problems. Thus, the computational cost of KSP or SKSP is lower than SVM in the learning stage.
Kernel Projection Classifiers
1945
Moreover, when there is a huge number of learning samples, we can easily introduce the multi template method to SKSP or KSP because they are not two-class classifiers. In the multitemplate method, subclasses in a class are prepared and an input vector is classified into one of the subclasses. On the other hand, to employ this method for two-class classifiers (e.g., SVM or KFD) is useless because they must use all samples of all classes. Appendix We prepare the following lemmas and corollaries for the proofs of theorems 2 and 3. Lemma 1 (Israel & Greville, 1974). Let H1 , H2 , H3 , and H4 be Hilbert spaces and A ∈ B(H3 , H4 ), B ∈ B(H1 , H2 ), C ∈ B(H1 , H4 ), where B(H, H ) is a bounded linear operator from H to H . Assume that R(A), R(B), and R(C) are closed. Then the operator equation, AXB = C,
(A.1)
has a solution X ∈ B(H2 , H3 ) when R(A) ⊃ R(C) and N (B) ⊂ N (C). A general form of a solution is given by X = A† C B † + Y − A† AY B B † ,
(A.2)
where Y is an arbitrary operator in B(H2 , H3 ). Corollary 1. Let A ∈ B(H1 , H2 ), B ∈ B(H1 , H3 ). If R(A) and R(B) are closed, an operator equation, A = XB,
(A.3)
has a solution when N (B) ⊂ N (A). Proof of Proposition 1.
P = Ui Di K U−1i Ui∗ Ui Di K U−1i Ui∗ PR(S i ) R(Si )
= Ui Di Di K U−1i Ui∗
= PR(S . i)
(A.4)
1946
Y. Washizawa and Y. Yamashita
Proof of Proposition 2.
PR(S P = Ui Di K U−1i Ui∗ Ui K U−1i Ui∗ = PR(S i ) R(Ui ) i)
PR(Ui ) PR(S = Ui K U−1i Ui∗ Ui Di K U−1i Ui∗ = Ui Di K U−1i Ui∗ = PR(S . i) i)
Proof of Proposition 3.
v is expressed as v =
Mi j=1
v j (g j ) = Ui v ,
where . . 0 ν1 ν2 . . . ν Mi ) . v = (0 . Li
Then
u = Ui Di K Ui −1 Ui∗ v = Ui Di K Ui −1 Ui∗ Ui v = Ui Di v = 0. PR(S i)
Proofs of Theorems 2 and 3. Here we omit the symbol i for a class i for brevity. When we let µ = 0 in equation 3.10, it corresponds to equation 3.4. Thus, we let µ ≥ 0. Since u − PR(U) v ≤ u − v for ∀ u ∈ R(U), ∀ v ∈ H, X can be expressed as X = U B. From corollary 1, X in definition 1 can be expressed as X = CU ∗ . Then we can let X = U AU ∗ ,
(A.5)
where A is a real matrix of which size is (L + M). Equation 3.10 yields that J=
L M 1 α ( f s ) − U AU ∗ ( f s )2 + U AU ∗ (gt )2 + µU AU ∗ 22 L M s=1
t=1
(A.6) =
L 1 k( f s , f s ) − 2( f s ), U AU ∗ ( f s ) + U AU ∗ ( f s ), U AU ∗ ( f s ) L s=1
α U AU ∗ (gt ), U AU ∗ (gt ) + µtr[U A∗ U ∗ U AU ∗ ] M M
+
t=1
=
1 L
L
k( f s , f s ) − 2U ∗ ( f s ), AU ∗ ( f s ) + AU ∗ ( f s ), K U AU ∗ ( f s )
s=1
α AU ∗ (gt ), K U AU ∗ (gt ) + µtr[A∗ K U AK U ]. M M
+
t=1
Kernel Projection Classifiers
1947
Note that U ∗ ( f s ) and U ∗ (gt ) are the sth and (L + t)th column of K U , ˜ as equation 3.6 and equation 3.12, respecrespectively. We define D and D tively. Then J is expressed as 1 tr[K U D − 2(K U D)∗ AK U D + (K U D)∗ A∗ K U AK U D] L α + tr[(K U D)∗ A∗ K AK U D] + µtr[A∗ K U AK U ] M 1 2 ∗ −1 ∗ ˜ = tr K U D − DK U AK U D + K U A K U AK U D + µA K U AK U , L L
J=
(A.7) ˜ ∗ = D, ˜ and K U∗ = K U . Hence, the variation of J with respect since D∗ = D, D to A is given as ˜ −1 + K U A∗ K U (δ A)K U D ˜ −1 δ J = tr K U (δ A)∗ K U AK U D 2 ∗ ∗ − DK U (δ A)K U D + µ(δ A) K U AK U + µ(δ A)K U A K U L 1 ˜ −1 K U − K U DK U + µK U AK U = tr (δ A)∗ K U AK U D L 1 −1 ∗ ˜ +(δ A) K U D K U AK U − K U DK U + µK U A K U L 1 ˜ −1 K U − K U DK U + µK U AK U . = 2tr (δ A)∗ K U AK U D L
(A.8)
J is minimum when ˜ D ˜ −1 K U = K U A(K U + µ D)
1 K U DK U . L
(A.9)
In the case of µ > 0, from lemma 1, 1 † † † † K K U DK U K U + W − K U K U WK U K U L U 1 † † D − W KU KU , = W + KU KU L
˜ D ˜ −1 = A(K U + µ D)
(A.10)
1948
Y. Washizawa and Y. Yamashita
where W is arbitrary operator. Let W = W − ˜ D ˜ −1 = A(K U + µ D)
1 L
D. It follows that
1 † † D + W − K U K U W K U K U . L
Then we have 1 ˜ U + µ D) ˜ −1 + W D(K ˜ −1 ˜ U + µ D) A= D D(K L † † ˜ ˜ −1 −K U K U W K U K U D(K U + µ D) .
Since
1 L
(A.11)
˜ = D, X, which minimizes J , is given as DD
X = U AU ∗ ˜ −1 U ∗ + UW D(K ˜ −1 U ∗ ˜ U + µ D) = U D(K U + µ D) †
†
˜ −1 U ∗ . ˜ U + µ D) −U K U K U W K U K U D(K Since K U = U ∗ U, R(K U ) = R(U ∗ ). Then we have †
KU KU U∗ = U∗, †
U K U K U = U. It is clear that ˜ −1 U ∗ + µI ) = (U ∗ U + µ D) ˜ D ˜ −1 U ∗ , U ∗ (U D so that ˜ −1 U ∗ = D ˜ −1 U ∗ (U D ˜ −1 U ∗ + µI )−1 . (K U + µ D)
(A.12)
Then we have † ˜ † ∗ −1 ˜ −1 ∗ ˜ −1 ∗ K U K U D(K U + µ D) U = K U K U U (U D U + µI )
˜ −1 U ∗ + µI )−1 = U ∗ (U D ˜ U + µ D) ˜ −1 U ∗ . = D(K Hence, equation A.12 yields that ˜ −1 U ∗ + UW D(K ˜ −1 U ∗ ˜ U + µ D) X = U D(K U + µ D) ˜ −1 U ∗ ˜ U + µ D) −UW D(K ˜ −1 U ∗ . = U D(K U + µ D)
(A.13)
Kernel Projection Classifiers
1949
In the case of µ = 0, if K U is nonsingular, equation A.9 yields A= DK U−1 ,
(A.14)
X = U DK U−1 U ∗ .
(A.15)
References Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. Groetsch, C. W. (1993). Inverse problems in the mathematical sciences. Braunschweig: Vieweg. Israel, A. B., & Greville, T. N. E. (1974). Generalized inverse: Theory and applications. New York: Wiley. Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley. Maeda, E., & Murase, H. (1999). Multi-category classification by kernel based nonlinear subspace method. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Vol. 2, pp. 1025–1028). Piscataway, NJ: IEEE Press. ¨ ¨ Mika, S., R¨atsch, G., Weston, J., Scholkopf, B., & Muller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). Piscataway, NJ: IEEE. ¨ ¨ Mika, S., R¨atsch, G., Weston, J., Scholkopf, B., Smola, A., & Muller, K. (2000). Invariant feature extraction and classification in kernel spaces. In S. A. Solla, T. K. Leen, & ¨ K.-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 526– 532). Cambridge, MA: MIT Press. ¨ ¨ Muller, K.-R., Mika, S., R¨atsch, G., Tsuda, K., & Scholkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201. Murata, H., & Onoda, T. (2001). Applying kernel based subspace classification to a non-intrusive monitoring for household electoric appliances. In G. Dorffiner, H. Bischof, & K. Hornik (Eds.), International Conference on Artificial Neural Networks (ICANN) (pp. 692–698). Berlin: Springer-Verlag. Oja, E. (1983). Subspace methods of pattern recognition. New York: Wiley. R¨atsch, G., (2001). Robust boosting via convex optimization. Unpublished doctoral dissertation, University of Potsdam. ¨ R¨atsch, G., Onoda, T., & Muller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320. Rohde, C. A. (1965). Generalized inverses of partitioned matrices. Journal of Soc. Indust. Appl. Math., 13, 1033–1035. ¨ Scholkopf, B., Bartlett, P., Smola, B., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
1950
Y. Washizawa and Y. Yamashita
¨ ¨ Scholkopf, B., Mika, S., Burges, C., Knirsch, P., Muller, K.-R., R¨atsch, G., & Smola, A. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017. ¨ ¨ Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of ill-posed problems. Silver Spring, MD: V. H. Winston and Sons. Tsuda, K. (1999). Subspace classifier in the Hilbert space. Pattern Recognition Letters, 20(5), 513–519. Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley. Washizawa, Y., & Yamashita, Y. (2004). Kernel sample space projection classifier for pattern recognition. In 17th International Conference on Pattern Recognition (Vol. 2, pp. 435–438). Los Alamitos, CA: IEEE CS Press. Watanabe, S., & Pakvasa, N. (1973, February). Subspace method in pattern recognition. Proc. 1st Int. J. Conf on Pattern Recognition, Washington DC (pp. 25–32). New York: IEEE Press. Yamashita, Y., & Ogawa, H. (1996). Relative Karhunen-Lo`eve transform. IEEE Transactions on Signal Processing, 44(2), 1031–1033.
Received March 16, 2004; accepted December 16, 2005.
LETTER
Communicated by Peter Latham
Implications of Neuronal Diversity on Population Coding Maoz Shamir
[email protected] Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Haim Sompolinsky halm@fiz.huji.ac.il Racah Institute of Physics and Center for Neural Computation, Hebrew University of Jerusalem, Jerusalem 91904, Israel
In many cortical and subcortical areas, neurons are known to modulate their average firing rate in response to certain external stimulus features. It is widely believed that information about the stimulus features is coded by a weighted average of the neural responses. Recent theoretical studies have shown that the information capacity of such a coding scheme is very limited in the presence of the experimentally observed pairwise correlations. However, central to the analysis of these studies was the assumption of a homogeneous population of neurons. Experimental findings show a considerable measure of heterogeneity in the response properties of different neurons. In this study, we investigate the effect of neuronal heterogeneity on the information capacity of a correlated population of neurons. We show that information capacity of a heterogeneous network is not limited by the correlated noise, but scales linearly with the number of cells in the population. This information cannot be extracted by the population vector readout, whose accuracy is greatly suppressed by the correlated noise. On the other hand, we show that an optimal linear readout that takes into account the neuronal heterogeneity can extract most of this information. We study analytically the nature of the dependence of the optimal linear readout weights on the neuronal diversity. We show that simple online learning can generate readout weights with the appropriate dependence on the neuronal diversity, thereby yielding efficient readout. 1 Introduction In many areas of the central nervous system, information on specific stimulus features is coded by the average firing rates of a large population of neurons (see Hubel & Wiesel, 1962; Georgopoulos Schwartz & Kettner, 1982; Razak & Fuzessery, 2002; Coltz, Johnson, & Ebner, 2000). Recently Yoon and Sompolinsky (1999) and Sompolinsky, Yoon, Kang, and Shamir (2001) have Neural Computation 18, 1951–1986 (2006)
C 2006 Massachusetts Institute of Technology
shown that information coded by the mean responses of a neural population is greatly suppressed in the presence of the experimentally observed positive correlations. Thus, in the presence of noise correlations, there exists a finite amount of noise in the neural representation of the stimulus. This noise cannot be overcome by increasing the population size. However, central to the analysis of Sompolinsky et al. was the assumption of a homogeneous population of neurons. Empirical observations show a considerable measure of heterogeneity in the response properties of different neurons (Ringach, Shapley, & Hawken, 2002). Here we study the possible role of neuronal diversity in coding for information. We address two questions. First, what is the effect of heterogeneity on the information capacity? In particular, we are interested in the scaling of the information capacity with the number of cells in the population. Second, how does neuronal diversity affect biological readout mechanisms?

We address these questions in the context of a statistical model for the responses of a population of N neurons coding for an angular variable, θ, which we term the stimulus, such as the direction of arm movement during a simple reaching task or the orientation of a grating stimulus. Below we define the statistical model of the neural responses and review the main results for a homogeneous population of neurons. We use the Fisher information (see, e.g., Thomas & Cover, 1991; Kay, 1993) and the accuracy of biologically plausible readout mechanisms as measures of information capacity. The first question is addressed in section 2, where the Fisher information of a diverse population of neurons is studied. In section 3 we investigate the efficiency of linear readout mechanisms. First, we study the population vector (see Georgopoulos, Schwartz, & Kettner, 1986) readout; then we investigate the optimal linear estimator (see Salinas & Abbott, 1994) and show a different scaling of their efficiencies with the population size in the presence of correlations. Finally we summarize our results in section 4 and discuss further extensions of our theory.

1.1 The Statistical Model. We consider a system of N neurons coding for an angle, θ ∈ [−π, π). Let r_i denote the activity of the ith neuron during a single trial; r_i can be thought of as the number of spikes the ith neuron has fired within a specific time interval around the onset of a stimulus, θ. Assuming the firing rates of the neurons are sufficiently high, we model the probability distribution of the neural responses during a single trial, given the stimulus θ, according to a multivariate gaussian distribution:

P(\mathbf{r} \mid \theta) = \frac{1}{Z} \exp\!\left[ -\frac{1}{2} (\mathbf{r} - \mathbf{m}(\theta))^T C^{-1} (\mathbf{r} - \mathbf{m}(\theta)) \right].   (1.1)
Here m_i(θ), the tuning curve of neuron i, is the mean activity of the ith neuron averaged over many trials with the same stimulus, θ; C is the firing rate covariance matrix; X^T denotes the transpose of X; and Z is a normalization constant. We model the single-neuron tuning curve by a unimodal bell-shaped function of θ with a maximum at the neuron's preferred direction,

m_i(\theta) = m(\varepsilon_i, \phi_i - \theta),   (1.2)
where φ_i is the preferred direction of neuron i. We take the preferred directions of the neurons to be evenly spaced on the ring: \phi_k = -\pi\frac{N+1}{N} + \frac{2\pi}{N}k. Each ε_i is a set of parameters characterizing the tuning curve of the ith neuron and representing its deviation from homogeneity. For example, ε_i can quantify how much sharper or broader the tuning curve of neuron i is than the stereotypical tuning curve, or it can represent the ratio between the tuning amplitude of the ith neuron and the tuning amplitude of the stereotypical tuning curve. Here, for simplicity, we shall assume that ε_i is a scalar. It is important to note that ε_i is a number that characterizes the tuning curve of neuron i; it is a property of the ith neuron and does not change from trial to trial (we ignore effects of plasticity and learning). However, different neurons in the population may have different values for their ε parameters, reflecting the heterogeneity of the population. Different neural populations can be characterized by different realizations of their set of {ε_i}_{i=1}^N. We shall assume the {ε_i}_{i=1}^N are independent and identically distributed random variables that are independent of the stimulus, that is, P({ε_i}) = \prod_{i=1}^{N} p(ε_i).

We distinguish between two sources of randomness in this model. One source is the "warm fluctuations," represented by the trial-to-trial variability of the neural responses, equation 1.1. The second is the "quenched disorder," represented by the heterogeneity of the tuning curves of the different neurons, reflected by the distribution of the {ε_i}. Throughout this article, we will be interested in calculating quantities that involve spatial averaging over the entire population. The value of such a quantity will depend on the specific realization of the neural heterogeneity, the {ε_i}, and will fluctuate from one realization of the neuronal heterogeneity to another. We can characterize such a quantity by its statistics with respect to the quenched randomness. Although for local quantities the quenched fluctuations may be considerable, they are uncorrelated spatially; hence, the quenched fluctuations of quantities that involve spatial averaging, relative to their means, will decrease to zero in the large N limit. This property of a random variable with vanishing standard deviation to mean ratio, in the large N limit, is called self-averaging. Note that the value of a self-averaging quantity in a typical system will be equal to its mean across different systems. The practical implication of this property is that one can replace a self-averaging quantity by its quenched mean for large systems. Hence, instead of computing self-averaged quantities for a specific realization of the neuronal heterogeneity, we can calculate the average of this quantity over different realizations of the heterogeneity.

We denote by angular brackets averaging with respect to the trial-to-trial fluctuations of the neural responses, given a specific stimulus:
\langle X \rangle = \int d\mathbf{r}\, X\, P(\mathbf{r} \mid \theta).

This averaging is done with a fixed set of parameters, {ε_i}, reflecting the fact that the single-neuron tuning curves are fixed and unchanged across many different trials. Averaging over the quenched disorder is denoted by double angular brackets, \langle\langle X(\{\varepsilon_i\}) \rangle\rangle = \int \prod_{i=1}^{N} d\varepsilon_i\, p(\varepsilon_i)\, X(\{\varepsilon_i\}). Fluctuations with respect to the distribution of the neural responses in a given system are denoted by δ, that is, \delta X \equiv X - \langle X \rangle. We use Δ to denote quenched fluctuations: \Delta X \equiv X - \langle\langle X \rangle\rangle. It is convenient to write the tuning curve of neuron i as the sum of its quenched average, ⟨⟨m_i(θ)⟩⟩, plus a fluctuation Δm_i(θ):

m_i(\theta) = f(\phi_i - \theta) + \Delta m_i(\theta)   (1.3)

f(\phi_i - \theta) \equiv \langle\langle m_i(\theta) \rangle\rangle.   (1.4)

Note that in the last equation, we used equation 1.2 and the fact that the statistics of the {ε_i} are independent of the neuronal preferred directions and the stimulus. Similar to the single-cell tuning curve, we model f(θ) by a smooth bell-shaped function that peaks at θ = 0. Specifically, in our numerical simulations, we used the following average tuning curve,

f(\theta) = (f_{\max} - f_{\mathrm{ref}}) \exp\!\left[ \frac{\cos(\theta) - 1}{\sigma^2} \right] + f_{\mathrm{ref}},   (1.5)

where σ, (f_max − f_ref), and f_ref are the tuning width, the tuning amplitude, and a reference value for the stereotypical average tuning curve f(θ), respectively. For a given stimulus, the quenched fluctuations of the tuning curves, {Δm_i(θ)} (see equation 1.3), are a set of zero-mean independent random variables with respect to the statistics of the quenched disorder. An important quantity for the calculation of the Fisher information is the derivative of the ith tuning curve with respect to θ, m_i' = ∂m_i/∂θ. The quenched fluctuations of the tuning curve, {Δm_i(θ)}, are also smooth periodic functions of θ, as a difference of such functions. Using the independence of the {ε_i} and equation 1.2, one obtains

m_i'(\theta) = f'(\phi_i - \theta) + \Delta m_i'(\theta)   (1.6)

\langle\langle \Delta m_i'(\theta) \rangle\rangle = \int p(\varepsilon_i)\, \frac{\partial \Delta m_i(\varepsilon_i, \theta)}{\partial \theta}\, d\varepsilon_i = \frac{d}{d\theta} \int p(\varepsilon_i)\, \Delta m_i(\varepsilon_i, \theta)\, d\varepsilon_i = 0   (1.7)

\langle\langle \Delta m_i'(\theta)\, \Delta m_j'(\theta) \rangle\rangle = \delta_{ij}\, K(\phi_i - \theta),   (1.8)
where K(φ_i − θ) is the variance of the tuning curve derivative of a neuron with preferred direction φ_i, given a stimulus θ, with respect to the quenched disorder. We further assume that the quenched fluctuations of the tuning curve derivatives, {Δm_i'}, follow gaussian statistics. In section 2, where we
study the Fisher information of a heterogeneous population of neurons, we use equations 1.6 to 1.8 to define the gaussian statistics of the quenched fluctuations of tuning curves. For small quenched fluctuations, one can expand the tuning curves, equation 1.2, in powers of ε_i and approximate

m_i(\theta) = f(\phi_i - \theta) + \varepsilon_i\, g(\phi_i - \theta),   (1.9)

where g(\theta) = \partial m(\varepsilon = 0, \theta)/\partial\varepsilon. A simple example, where the approximation of equation 1.9 is exact, is the case where

m_i(\theta) = f(\phi_i - \theta)(1 + \varepsilon_i).   (1.10)
We term this model the amplitude diversity model. In section 3, where we address the question of readout, we use the specific form of equation 1.10 for the tuning curves in order to make the analysis analytically tractable. This model, equation 1.10, is also used for all of the numerical results presented in this article. We further assume, in section 3 and in all of the numerical results, that the {ε_i} are independent and identically distributed (i.i.d.) gaussian random variables with zero mean and variance κ,

\langle\langle \varepsilon_i \rangle\rangle = 0 \quad \forall i   (1.11)

\langle\langle \varepsilon_i \varepsilon_j \rangle\rangle = \delta_{ij}\, \kappa \quad \forall i, j.   (1.12)

In this case, \Delta m_i' = -\varepsilon_i f'(\phi_i - \theta) and

\langle\langle (\Delta m_i'(\theta))^2 \rangle\rangle = K(\phi_i - \theta) = \kappa\, |f'(\phi_i - \theta)|^2.   (1.13)
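For readers who want the intermediate step, equation 1.13 follows in one line from equations 1.10 to 1.12 (a derivation sketch we add here; it is not in the original text), writing f' for the derivative of f with respect to its argument:

\begin{align*}
m_i'(\theta) &= \frac{d}{d\theta}\big[ f(\phi_i - \theta)(1 + \varepsilon_i) \big] = -f'(\phi_i - \theta)(1 + \varepsilon_i), \\
\Delta m_i'(\theta) &= m_i'(\theta) - \langle\langle m_i'(\theta) \rangle\rangle = -\varepsilon_i\, f'(\phi_i - \theta), \\
\langle\langle (\Delta m_i'(\theta))^2 \rangle\rangle &= \langle\langle \varepsilon_i^2 \rangle\rangle\, |f'(\phi_i - \theta)|^2 = \kappa\, |f'(\phi_i - \theta)|^2,
\end{align*}

which is equation 1.13.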
We assume the response covariance of two neurons i and j, C_ij, is independent of the stimulus, θ, and depends only on the functional distance between the two neurons, that is, the difference in their preferred directions. A decrease in the response covariance of different neurons with the increase in their functional distance has been reported in cortical areas (see, e.g., Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998; van Kan, Scobey, & Gabor, 1985; Mastronarde, 1983). Specifically, in all of our numerical results, we have used exponentially decaying correlations,

C_{ij} = C(\phi_i - \phi_j) = a\left[ \delta_{ij} + c\,(1 - \delta_{ij}) \exp\!\left( -\frac{|\phi_i - \phi_j|}{\rho} \right) \right],   (1.14)
where a is the variance of the single-neuron response, c and ρ are the correlation strength and correlation length, respectively, and |φ_i − φ_j| ∈ [0, π] is the functional distance between neurons i and j. Note that in
this simple model, we did not incorporate any measure of diversity in the higher-order statistics of the neuronal responses. Hence, the Fourier modes of the system are eigenvectors of the covariance matrix. We assume that in the biologically relevant regime, every neuron is correlated with a substantial fraction of the entire population. Mathematically, this means that as we consider the limit of large N, both ρ and c remain finite. In this regime, the eigenvalues of the covariance matrix, C, scale linearly with the size of the system, N. For all numerical results presented in this article, we used the amplitude diversity model for the neuronal tuning curves, equation 1.10, with the following parameters: ρ = 1, c = 0.4, a = 40 [sec⁻¹], σ = 1/√2, f_max = 60 [sec⁻¹], f_ref = 20 [sec⁻¹], and κ = 0.25, unless stated otherwise. Note that these parameters are given as rates, that is, counts per second. In order to obtain the spike count statistics in a given time interval T, the tuning curves and correlation strength were scaled by a factor of T; unless stated otherwise, we used T = 0.5 [sec].

1.2 The Fisher Information. Throughout this article, we will be interested in studying the efficiency of different estimators, θ̂(r), of the stimulus θ. We define the efficiency of an estimator as the inverse of its average quadratic estimation error, 1/⟨(θ̂ − θ)²⟩. It is convenient to distinguish between two sources of estimation error: the bias, b = ⟨θ̂⟩ − θ, and the trial-to-trial variability, ⟨(δθ̂)²⟩. The Fisher information (see, e.g., Thomas & Cover, 1991; Kay, 1993) is given by

J = \left\langle \left( \frac{\partial \log P(\mathbf{r} \mid \theta)}{\partial \theta} \right)^{2} \right\rangle.   (1.15)
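To make the setup concrete, the following is a minimal numerical sketch (our illustration, not code from the paper) that assembles the amplitude diversity tuning curves (equations 1.5 and 1.10), the correlation matrix (equation 1.14), and gaussian single-trial responses (equation 1.1), using the parameters quoted above; the population size N = 200 and the random seed are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Stated simulation parameters (rates in 1/sec); T converts rates to spike counts.
N, rho, c, a = 200, 1.0, 0.4, 40.0
sigma, f_max, f_ref, kappa, T = 1/np.sqrt(2), 60.0, 20.0, 0.25, 0.5

# Evenly spaced preferred directions, phi_k = -pi*(N+1)/N + 2*pi*k/N.
phi = -np.pi*(N + 1)/N + 2*np.pi*np.arange(1, N + 1)/N

def f(theta):
    # Average tuning curve, equation 1.5.
    return (f_max - f_ref)*np.exp((np.cos(theta) - 1)/sigma**2) + f_ref

# One quenched realization of the amplitude diversity, equations 1.10 to 1.12.
eps = rng.normal(0.0, np.sqrt(kappa), N)

def mean_response(theta):
    # Tuning curves of the amplitude diversity model, scaled to spike counts.
    return T*f(phi - theta)*(1.0 + eps)

# Exponentially decaying correlations, equation 1.14, scaled by T.
dist = np.abs(phi[:, None] - phi[None, :])
dist = np.minimum(dist, 2*np.pi - dist)          # functional distance in [0, pi]
C = T*a*(np.eye(N) + c*(1 - np.eye(N))*np.exp(-dist/rho))

def trial(theta):
    # One multivariate gaussian trial, equation 1.1.
    return rng.multivariate_normal(mean_response(theta), C)

r = trial(0.0)                                   # single-trial population response

The same construction is reused, with small variations, in the sketches that follow.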
From the Cramér-Rao inequality, the square estimation error of any readout θ̂(r) is bounded by

\langle (\hat\theta - \theta)^2 \rangle = \langle (\delta\hat\theta)^2 \rangle + b(\theta)^2   (1.16)

\langle (\delta\hat\theta)^2 \rangle \ge \frac{(1 + b')^2}{J},   (1.17)

where b' = db/dθ. The Fisher information of this model is given by

J = \mathbf{m}'^{T} C^{-1} \mathbf{m}',   (1.18)

where \mathbf{m}' = d\mathbf{m}/d\theta. Note that the Fisher information has the form of a squared signal-to-noise ratio, where the signal is the sensitivity of the neural responses to small changes in the stimulus, m', and the squared noise is represented by the correlation matrix, C.

Figure 1: The Fisher information and population vector efficiency of a homogeneous population of neurons (efficiency in deg⁻² versus N). The Fisher information is shown by the solid line as a function of the number of cells in the population, N. The analytical result for the population vector efficiency, equation 1.26, is shown by the dashed line. The open circles show the results of a numerical calculation averaging the population vector estimation error over 2000 trials estimating θ = 0. In this figure, κ = 0 was used for a homogeneous population.

1.2.1 Fisher Information of an Isotropic Population. In the limit of an isotropic population, K(θ) = 0 (see equation 1.8), the signal, m' = f', resides in a low-dimensional subspace of the neural responses, spanned by the slowly varying collective modes of the system. However, due to the correlations, both the squared signal and noise will scale linearly with the size of the system, yielding J = O(1) even in the large N limit (see Sompolinsky et al., 2001, for further discussion). Figure 1 shows the Fisher information of an isotropic system, κ = 0 in the amplitude diversity model (solid line), as a function of the population size, N. As can be seen from the figure, the Fisher information of an isotropic system saturates to a finite value in the limit of large N. In contrast, in a diverse population of neurons, the signal, m', will have an O(√N) projection on a subspace spanned by eigenvectors
of C corresponding to eigenvalues of O(1). This effect of the neuronal diversity is studied in section 2.

1.3 Linear Readout. A linear readout is an estimator of the form ẑ = Σ_i w_i r_i, where w is a fixed weights vector that defines the readout. It is convenient to adopt complex notation for the stimulus. We denote by z = e^{iθ} a two-dimensional unit vector in the complex plane pointing in the direction of the stimulus, θ. Similarly, the estimator, ẑ = x̂ + iŷ, will represent a two-dimensional vector in the complex plane with θ̂ = arg(ẑ). One can measure the performance of such a readout either by the efficiency of angular estimation, ⟨(θ̂ − θ)²⟩⁻¹, or by the Euclidean distance between z and ẑ. In this work, we employ both measures. We shall call ⟨(θ̂ − θ)²⟩⁻¹ the efficiency of the estimator and ⟨|ẑ − z|²⟩ the Euclidean error. Let E(w) be the Euclidean error of a linear readout with a weights vector w,

E(\mathbf{w}) = \int \frac{d\theta}{2\pi} \langle |\hat z - z|^2 \rangle = \mathbf{w}^{\dagger} Q \mathbf{w} - \mathbf{w}^{\dagger} U - U^{\dagger} \mathbf{w} + 1   (1.19)

Q = \int \frac{d\theta}{2\pi} \langle \mathbf{r}\mathbf{r}^{T} \rangle   (1.20)

U = \int \frac{d\theta}{2\pi} \langle \mathbf{r} \rangle\, e^{i\theta},   (1.21)
where X† denotes the conjugate transpose of X. It is important to note that, being a function of the neural responses, the linear estimator, ẑ = r^T w, is a random variable that fluctuates from trial to trial with a probability distribution that depends on the stimulus, θ (see equation 1.1). The Euclidean error, E(w), defined in equation 1.19, incorporates two averaging steps. First is averaging the Euclidean error that results from the trial-to-trial fluctuations of the linear readout, ⟨|ẑ − z|²⟩, for a given stimulus angle, θ. The second is averaging the Euclidean error over the different possible stimuli by integrating over the stimulus angle, θ, assuming a uniform prior. The optimal linear estimator (Salinas & Abbott, 1994) is defined by the set of linear weights, w_ole, that minimizes E. The optimal linear estimator weights are given by w_ole = Q⁻¹U (see also equations 3.4 to 3.6), and the average quadratic estimation error of the optimal linear estimator is given by E(w_ole) = 1 − U†Q⁻¹U. Below we present the main results for the optimal linear estimator efficiency in a correlated homogeneous population of neurons and in an uncorrelated heterogeneous population. The study of the optimal linear estimator performance in a heterogeneous population of correlated cells is the focus of section 3.

1.3.1 Linear Readout in a Correlated Homogeneous Population of Neurons. In the case of a homogeneous population of neurons, K(θ) = 0 (see equation 1.8), the optimal linear estimator, ẑ_ole = r^T w_ole, is given by
(see Shamir & Sompolinsky, 2004)

\hat z_{pv} = \frac{\tilde f^{(1)*}}{N\left( |\tilde f^{(1)}|^2 + \tilde c_1 \right)} \sum_{j=1}^{N} e^{i\phi_j} r_j,   (1.22)

where we have used the following definitions for the Fourier transforms:

\tilde f^{(n)} = \frac{1}{N} \sum_{j=1}^{N} f(\phi_j)\, e^{in\phi_j} = \int \frac{d\phi}{2\pi} f(\phi)\, e^{in\phi}   (1.23)

\tilde c_n = \frac{1}{N^2} \sum_{j,k=1}^{N} C_{jk}\, e^{in(\phi_j - \phi_k)} = \int \frac{d\phi}{2\pi} C(\phi)\, e^{in\phi}.   (1.24)
Note that N c̃_n is the eigenvalue of the correlation matrix, C. The asterisk in the numerator of the right-hand side of equation 1.22, f̃^(1)*, denotes the complex conjugate of the Fourier transform f̃^(1). However, since the average tuning curve, f(θ), is a real, even function of the stimulus, θ, its Fourier transforms are purely real. We shall omit the conjugate notation for real-valued terms hereafter. This linear estimator, equation 1.22, is the population vector readout. In this case, one can show that the population vector is unbiased with respect to its argument and that its average quadratic error, E(PV), is given by

E(PV) = \frac{1}{1 + |\tilde f^{(1)}|^2 / \tilde c_1},   (1.25)
which is of O(1) even in the limit of large N. Assuming small angular estimation errors, one can expand θ̂ in the fluctuations of x̂ and ŷ and study the efficiency of the population vector angular estimation. For a homogeneous population, this error will result only from the trial-to-trial fluctuations and will be independent of the stimulus in its magnitude. The efficiency of the population vector, in this case, is given by (see Sompolinsky et al., 2001; details of the calculation of the population vector efficiency also appear in appendix B)

\frac{1}{\langle (\delta\hat\theta_{pv})^2 \rangle} = \frac{2|\tilde f^{(1)}|^2}{\tilde c_1} < J.   (1.26)
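The saturation predicted by equation 1.26 is easy to reproduce numerically. The sketch below (our illustration; trial counts and seed are arbitrary) simulates gaussian trials for a homogeneous population and estimates θ with the population vector; since the prefactor in equation 1.22 is a positive real scalar, it does not affect the estimated angle:

import numpy as np

rng = np.random.default_rng(1)
rho, c, a = 1.0, 0.4, 40.0
sigma, f_max, f_ref, T = 1/np.sqrt(2), 60.0, 20.0, 0.5

def pv_efficiency(N, trials=2000, theta=0.0):
    phi = -np.pi*(N + 1)/N + 2*np.pi*np.arange(1, N + 1)/N
    m = T*((f_max - f_ref)*np.exp((np.cos(phi - theta) - 1)/sigma**2) + f_ref)
    dist = np.abs(phi[:, None] - phi[None, :])
    dist = np.minimum(dist, 2*np.pi - dist)
    C = T*a*(np.eye(N) + c*(1 - np.eye(N))*np.exp(-dist/rho))
    L = np.linalg.cholesky(C)                     # to sample equation 1.1
    r = m + (L @ rng.standard_normal((N, trials))).T
    z_pv = r @ np.exp(1j*phi)                     # equation 1.22, up to a real scale
    err = (np.angle(z_pv) - theta)**2
    return 1.0/err.mean()                         # efficiency, in rad^-2

for N in (100, 200, 400):
    print(N, pv_efficiency(N))

The printed efficiencies (in rad⁻², whereas the figures quote deg⁻²) level off as N grows, in line with equation 1.26 and Figure 1.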
Thus, in the biologically relevant regime for the correlations, c > 0, c = O(1), ρ = O(1), both the numerator and denominator of equation 1.26 are O(1) in N, and the efficiency of the population vector saturates to a finite limit for large N. This can be seen in Figure 1, which shows the population vector efficiency as a function of the number of cells in the population
Figure 2: Linear readout efficiency in an uncorrelated population of neurons (OLE and PV efficiency in deg⁻² versus N). The population vector efficiency, ⟨(θ̂_pv − θ)²⟩⁻¹ (solid line), is shown as a function of the population size, N. The inverse of the population vector bias, ⟨⟨b²_pv⟩⟩⁻¹, and of the population vector variance, ⟨(δθ̂_pv)²⟩⁻¹, are shown by the open circles and boxes, respectively. The efficiency of the optimal linear estimator, ⟨(θ̂_ole − θ)²⟩⁻¹, is represented by the dashed line, and the inverse of its average variance, ⟨(δθ̂_ole)²⟩⁻¹, by the open triangles. The dotted line shows the population vector efficiency in an uncorrelated homogeneous neural population. The estimators' efficiency was calculated numerically by averaging over 400 different realizations of the neuronal diversity for each point. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the readout for 500 trials of estimating θ = 0. In this figure, c = 0 was used. For the population vector efficiency in a homogeneous neural population (dotted line), κ = 0 was used.
(dashed line and open circles). For large systems, the population vector efficiency reaches a size-independent limit.

1.3.2 Linear Readout in a Heterogeneous Population of Uncorrelated Neurons. In a diverse population of neurons, the population vector readout is no longer the optimal linear estimator. Moreover, the population vector estimator is biased. Figure 2 shows the efficiency of the population vector for angular estimation, ⟨(θ̂_pv − θ)²⟩⁻¹, in an uncorrelated (c = 0)
heterogeneous (κ > 0 in the amplitude diversity model, equation 1.10) population of neurons (solid line). The inverse of the population vector bias, ⟨⟨b²_pv⟩⟩⁻¹, is shown by the open circles. The inverse of the population vector variance is shown by the open squares. For comparison, we plot the efficiency of the population vector in a homogeneous population (κ = 0) by the dotted line. From the figure, one can see that in a heterogeneous population of neurons, the efficiency of the population vector is decreased, relative to the homogeneous case, due to the added bias term that scales as ⟨⟨b²_pv⟩⟩ ∝ 1/N; note that the population vector variance remains the same (compare open boxes and dotted line). The dashed line in Figure 2 shows the efficiency of the optimal linear estimator, and the triangles show the inverse of its variance. The performance of the optimal linear estimator is superior to that of the population vector for two reasons. First, the optimal linear estimator is practically unbiased (compare the dashed line and open triangles). Second, the variance of the optimal linear estimator is smaller than the variance of the population vector. This results from the fact that the population vector extracts information only from the slowly varying collective modes of the system, whereas the optimal linear estimator can extract information from the higher-order modes as well, thus increasing its signal-to-noise ratio. However, the efficiency of both readouts scales linearly with the size of the system. Thus, in the case of an uncorrelated population, there is no qualitative difference between the two readouts. Below we show that in the presence of correlations, the neuronal diversity produces a qualitative effect on both the information capacity of the system (section 2) and the efficiency of different readouts (section 3).

2 The Fisher Information of a Diverse Correlated Population

The Fisher information, equation 1.18, of this model with K(θ) > 0 (see equation 1.8) is given by

J = \sum_{i,j=1}^{N} (f_i' + \Delta m_i')\, C^{-1}_{ij}\, (f_j' + \Delta m_j')
  = \sum_{i,j=1}^{N} f_i'\, C^{-1}_{ij}\, f_j' + 2 \sum_{i,j=1}^{N} \Delta m_i'\, C^{-1}_{ij}\, f_j' + \sum_{i,j=1}^{N} \Delta m_i'\, C^{-1}_{ij}\, \Delta m_j'.   (2.1)
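A direct numerical transcription of equation 2.1 makes the contrast between κ = 0 and κ > 0 visible immediately. The sketch below (ours; seed and population sizes are arbitrary) evaluates J = m'ᵀC⁻¹m' for one realization of the diversity:

import numpy as np

rng = np.random.default_rng(2)
rho, c, a = 1.0, 0.4, 40.0
sigma, f_max, f_ref, T = 1/np.sqrt(2), 60.0, 20.0, 0.5

def fisher(N, kappa, theta=0.0):
    phi = -np.pi*(N + 1)/N + 2*np.pi*np.arange(1, N + 1)/N
    x = phi - theta
    # Derivative of the average tuning curve, equation 1.5, with respect to theta.
    fp = T*(f_max - f_ref)*np.exp((np.cos(x) - 1)/sigma**2)*np.sin(x)/sigma**2
    eps = rng.normal(0.0, np.sqrt(kappa), N)      # one realization of the diversity
    mp = (1.0 + eps)*fp                           # m' in the amplitude diversity model
    dist = np.abs(phi[:, None] - phi[None, :])
    dist = np.minimum(dist, 2*np.pi - dist)
    C = T*a*(np.eye(N) + c*(1 - np.eye(N))*np.exp(-dist/rho))
    return mp @ np.linalg.solve(C, mp)            # equation 1.18 / 2.1, in rad^-2

for N in (100, 200, 400):
    print(N, fisher(N, 0.0), fisher(N, 0.25))     # kappa=0 saturates; kappa>0 grows ~ N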
The Fisher information of a specific system, that is, for a given realization of the {Δm_i'}, is a random variable that fluctuates from one realization of the neuronal diversity to another with respect to the quenched distribution of the {Δm_i'}. The statistics of the Fisher information can be characterized by its moments. We find (see appendix A) that to a leading order in N,

\langle\langle J \rangle\rangle = N \bar K d + J_{\mathrm{homog}}   (2.2)
J_{\mathrm{homog}} = \mathbf{f}'^{T} C^{-1} \mathbf{f}' = O(1)   (2.3)

\langle\langle (\Delta J)^2 \rangle\rangle = 2N \overline{K^2}\, d^2 + O(1),   (2.4)

where d is the diagonal element of the inverse of the correlation matrix, d = [C⁻¹]_ii; J_homog is the Fisher information of a homogeneous population (κ = 0), which is of O(1) in the presence of correlations; and

\bar K = \int \frac{d\theta}{2\pi} K(\theta)   (2.5)

\overline{K^2} = \int \frac{d\theta}{2\pi} K(\theta)^2.   (2.6)
From equations 2.2 and 2.4, one finds that for large systems, the sample-to-sample fluctuations of the Fisher information are small relative to its mean, √⟨⟨(ΔJ)²⟩⟩ / ⟨⟨J⟩⟩ = O(1/√N). Hence, for large populations, the Fisher information of a typical system will be equal to its mean across many samples. This property of the Fisher information is an example of self-averaging. Figure 3a shows the mean and Figure 3b the variance of the Fisher information as a function of the population size N. The mean and variance of J were computed by averaging over 400 realizations of the neuronal populations (open circles) in the amplitude diversity model with κ = 0, 0.05, 0.1, 0.25 from bottom to top. The analytical results, equations 2.2 and 2.4, are shown by the solid lines as a function of the population size. Note that κ = 0 is the case of an isotropic population of neurons. In this case, the Fisher information saturates to a finite value in the limit of large N. In contrast, for κ > 0, the Fisher information of the system increases linearly with the population size (top lines), with a slope that is linear in κ. Interestingly, after averaging over the heterogeneity of the population, the statistics of the Fisher information are independent of the stimulus, θ. Hence, for all θ ∈ [−π, π), the Fisher information, J(θ), will be equal to its quenched average up to O(√N) corrections. Figure 4 shows the stimulus fluctuations of the Fisher information for a typical realization of the neuronal diversity in a system of N = 1000 neurons in the amplitude diversity model, equation 1.10, with κ = 0.25. From the figure, one can see that the Fisher information is a smooth function of θ and is equal to ⟨⟨J⟩⟩ up to small fluctuations of O(√N). For local discrimination tasks, linear readout can extract all the information coded by the first-order statistics of the neural responses (see Shamir & Sompolinsky, 2004). Below we study the efficiency of the linear readout for global estimation tasks.
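Equation 2.2 can also be checked in closed loop: in the amplitude diversity model, equation 1.13 gives K(θ) = κ|f'(θ)|², so K̄ is just κ times the mean of |f'|² over the ring, and d is read off the inverse covariance matrix. A sketch of that check (ours; the sample count is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
N, rho, c, a = 400, 1.0, 0.4, 40.0
sigma, f_max, f_ref, kappa, T = 1/np.sqrt(2), 60.0, 20.0, 0.25, 0.5

phi = -np.pi*(N + 1)/N + 2*np.pi*np.arange(1, N + 1)/N
fp = T*(f_max - f_ref)*np.exp((np.cos(phi) - 1)/sigma**2)*np.sin(phi)/sigma**2
dist = np.abs(phi[:, None] - phi[None, :])
dist = np.minimum(dist, 2*np.pi - dist)
Cinv = np.linalg.inv(T*a*(np.eye(N) + c*(1 - np.eye(N))*np.exp(-dist/rho)))

J_homog = fp @ Cinv @ fp                          # equation 2.3
Kbar = kappa*np.mean(fp**2)                       # equations 1.13 and 2.5
predicted = N*Kbar*np.mean(np.diag(Cinv)) + J_homog   # equation 2.2

samples = []
for _ in range(200):                              # quenched average over realizations
    mp = (1.0 + rng.normal(0.0, np.sqrt(kappa), N))*fp
    samples.append(mp @ Cinv @ mp)
print(np.mean(samples), predicted)                # the two should agree closely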
Figure 3: The mean (a) and variance (b) of the Fisher information with respect to the quenched statistics, shown as a function of the system size, N (mean in deg⁻²; variance in deg⁻⁴). The open circles show statistics of the Fisher information as calculated numerically by averaging the Fisher information of 400 different realizations of the neuronal diversity. The statistics were calculated in the amplitude diversity model with κ = 0, 0.05, 0.1, 0.25 from bottom to top. The solid lines in a show the analytical result for the average Fisher information, equations 2.2 and 2.3. The solid lines in b show the leading, O(N), term of equation 2.4 for the Fisher information variance, 2N \overline{K^2} d^2.
Figure 4: The Fisher information of a specific system (J(θ) in deg⁻² versus θ in degrees). The Fisher information of a single system is plotted (solid line) as a function of the stimulus, θ, for a given typical realization of the neuronal diversity, the {ε_i}. The quenched average of the Fisher information, ⟨⟨J⟩⟩, is shown by the dashed line for comparison. The dotted lines show ⟨⟨J⟩⟩ ± 2√⟨⟨(ΔJ)²⟩⟩. In this figure, we used N = 1000.
3 Linear Readout

We now turn to the efficiency of linear readouts in a correlated heterogeneous population of neurons. For simplicity we restrict the discussion to the amplitude diversity model, equation 1.10. We first study the efficiency of the population vector readout. As mentioned above, in a homogeneous population of neurons, the population vector yields the optimal linear estimator. However, although the Fisher information in a diverse population grows linearly with the size of the system, the population vector efficiency saturates to a finite limit, as in the homogeneous case. In contrast, we show that the optimal linear estimator efficiency scales linearly with the number of cells in the population, N.

3.1 The Population Vector Readout. The efficiency of the population vector readout depends on the specific realization of the {ε_i}. We therefore characterize the population vector efficiency by its statistics with respect to the quenched disorder. Calculation of the quenched statistics of the
population vector readout (see appendix B) reveals that the population vector bias is typically of order 1/√N, with ⟨⟨b_pv⟩⟩ = 0 and variance

\langle\langle b_{pv}^2 \rangle\rangle = \frac{\kappa}{2N}\, \frac{\widetilde{f^2}^{(0)} - \widetilde{f^2}^{(2)}}{|\tilde f^{(1)}|^2} = O\!\left( \frac{\kappa}{N} \right),   (3.1)

where we have used the following notation for the Fourier transform of f²(θ):

\widetilde{f^2}^{(n)} = \int \frac{d\varphi}{2\pi} f^2(\varphi)\, e^{in\varphi}.   (3.2)
Note that after the quenched averaging, the statistics of the bias and variance of the population vector readout are independent of the stimulus. Analysis of the population vector trial-to-trial fluctuations (see appendix B) shows that the variance of the population vector estimator, ⟨(δθ̂_pv)²⟩, is a self-averaging quantity of order 1, with

\langle (\delta\hat\theta_{pv})^2 \rangle = \frac{\tilde c_1}{2|\tilde f^{(1)}|^2} = O(1),   (3.3)
which is the same as the population vector efficiency in the homogeneous case, equation 1.26. Figure 5 shows the average efficiency of the population vector in a diverse population (circles) in terms of one over the average square angular estimation error, ⟨(θ̂ − θ)²⟩⁻¹. The population vector efficiency was calculated by averaging over 400 realizations of the neuronal diversity. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the population vector for 200 trials of estimating θ = 0. The analytical results of substituting equations 3.1 and 3.3 into equation 1.16 are shown by the overlapping solid line. Comparing Figures 1 and 5 and equations 1.26 and 3.3, it is easy to see that the efficiency of the population vector readout is almost unaffected by the neuronal diversity in the limit of large N. Thus, although the Fisher information of a diverse population scales linearly with the size of the system, the efficiency of the population vector saturates to a finite limit as N grows, in the presence of correlations. Similarly, we find that to a leading order in N, the neuronal diversity has no effect on the population vector performance in terms of the Euclidean distance measure, equation 1.19. Hence, the result of equation 1.25 still holds up to a correction of O(1/√N) (see appendix B), yielding O(1) error even in the large N limit. These results raise the question of whether it is possible to obtain a linear estimator that will be able to use the neuronal diversity in order to overcome the correlated noise and obtain an efficiency that scales linearly with the size of the system. As discussed above, in
Figure 5: Linear readout average efficiency (estimator's efficiency in deg⁻² versus N). The average efficiency of the population vector (open circles) and of the optimal linear estimator (open boxes) is plotted as a function of the size of the system. The efficiency of these readouts was calculated numerically by averaging over 400 different realizations of the neuronal diversity for each point. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the readout over 200 trials of estimating θ = 0. The Fisher information is shown in the upper dashed line for comparison. The bottom solid line shows the analytical result for the population vector error, equations 1.16, 3.3, and 3.1.
an isotropic population of neurons, the population vector is the optimal linear estimator. However, in a diverse population, the population vector is no longer optimal. Below we study the efficiency of the optimal linear estimator in a heterogeneous population of neurons.

3.2 The Optimal Linear Estimator. The optimal linear estimator weights are given by (see equations 1.19–1.21)

\mathbf{w}_{ole} = Q^{-1} U   (3.4)

Q_{ij} = C_{ij} + (1 + \varepsilon_i)(1 + \varepsilon_j) \int \frac{d\theta}{2\pi} f(\phi_i - \theta)\, f(\phi_j - \theta)   (3.5)

U_j = (1 + \varepsilon_j)\, \tilde f^{(1)} e^{i\phi_j}.   (3.6)
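Equations 3.4 to 3.6 translate directly into a few lines of linear algebra. The sketch below (ours; the grid resolution, seed, and N are arbitrary choices) evaluates the θ-integral in Q on a uniform grid and solves for the weights:

import numpy as np

rng = np.random.default_rng(4)
N, rho, c, a = 200, 1.0, 0.4, 40.0
sigma, f_max, f_ref, kappa, T = 1/np.sqrt(2), 60.0, 20.0, 0.25, 0.5

phi = -np.pi*(N + 1)/N + 2*np.pi*np.arange(1, N + 1)/N
eps = rng.normal(0.0, np.sqrt(kappa), N)
thetas = np.linspace(-np.pi, np.pi, 512, endpoint=False)

def f(x):
    # Average tuning curve, equation 1.5, scaled to spike counts.
    return T*((f_max - f_ref)*np.exp((np.cos(x) - 1)/sigma**2) + f_ref)

F = f(phi[:, None] - thetas[None, :])             # f(phi_i - theta) on the grid
dist = np.abs(phi[:, None] - phi[None, :])
dist = np.minimum(dist, 2*np.pi - dist)
C = T*a*(np.eye(N) + c*(1 - np.eye(N))*np.exp(-dist/rho))

Q = C + np.outer(1 + eps, 1 + eps)*(F @ F.T)/len(thetas)   # equation 3.5
f1 = np.mean(f(thetas)*np.exp(1j*thetas))         # Fourier coefficient, equation 1.23
U = (1 + eps)*f1*np.exp(1j*phi)                   # equation 3.6

w_ole = np.linalg.solve(Q, U)                     # equation 3.4
E_ole = 1 - np.real(np.conj(U) @ w_ole)           # Euclidean error, E = 1 - U^dag Q^-1 U
print(E_ole)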
The efficiency of the optimal linear estimator, in terms of one over the average quadratic estimation error, ⟨(θ̂ − θ)²⟩⁻¹, is shown in Figure 5 (boxes). The optimal linear estimator efficiency was calculated by averaging over 400 realizations of the neuronal diversity. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the optimal linear estimator over 200 trials of estimating θ = 0. From the figure, one can see that while the population vector efficiency (solid line and open circles) saturates to a finite limit, the optimal linear estimator efficiency scales linearly with the population size, N. Obtaining a complete analytical expression for the optimal linear estimator and its efficiency is not an easy task, mainly due to the difficulty of inverting the random matrix Q. However, for large populations, one can expand the optimal linear estimator to a leading order in 1/N, as presented in the following section.

3.2.1 The Zeroth Approximation of the Optimal Linear Estimator. To a leading order in 1/N, the optimal linear estimator weights are given by (see appendix C)
w_{ole,j} = w_j^{(0)} + O(N^{-3/2})   (3.7)

w_j^{(0)} = \frac{1}{N\kappa\, \tilde f^{(1)}}\, \varepsilon_j\, e^{i\phi_j}.   (3.8)
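The dependence of w^(0) on the diversity {ε_j} is the whole point: a neuron's readout weight is proportional to its own amplitude fluctuation. As a quick continuation of the previous sketch (it assumes eps, phi, f1, kappa, N, and w_ole from that block are still in scope, so run it immediately after), one can compare w^(0) with the exact optimal weights through the overlap plotted in Figure 6:

import numpy as np

# Zeroth-order weights, equation 3.8 (continuation of the preceding sketch).
w0 = eps*np.exp(1j*phi)/(N*kappa*f1)

# Normalized overlap Real{ w_ole^dag w0 / (|w_ole| |w0|) }, as in Figure 6.
overlap = np.real(np.conj(w_ole) @ w0)/(np.linalg.norm(w_ole)*np.linalg.norm(w0))
print(overlap)                                    # approaches 1 as N grows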
Figure 6 shows the average overlap between the optimal linear estimator and the zeroth approximation readout weights, \mathrm{Real}\{ \mathbf{w}_{ole}^{\dagger} \mathbf{w}^{(0)} / (\|\mathbf{w}_{ole}\| \|\mathbf{w}^{(0)}\|) \}. As can be seen from the figure, the overlap approaches the value of 1 as N grows. Hence, for large N, the zeroth approximation converges to the optimal linear estimator, \mathbf{w}^{(0)} \to \mathbf{w}_{ole} as N → ∞. We find (see appendix D) that the average Euclidean error, equation 1.19, of the zeroth approximation fluctuates from sample to sample with mean and standard deviation of the order of 1/N. Assuming small angular estimation errors, we can study the angular estimation efficiency of the zeroth approximation. Our calculations (see appendix D) show that the bias of the zeroth approximation is zero, on average, with fluctuations that are of order 1/√N:
\langle\langle b \rangle\rangle = 0   (3.9)

\langle\langle (\Delta b)^2 \rangle\rangle = \frac{1}{N}\left( 1 + \frac{1}{2\kappa} \right) \frac{\widetilde{f^2}^{(0)} - \widetilde{f^2}^{(2)}}{|\tilde f^{(1)}|^2}.   (3.10)
Figure 6: The average overlap between the optimal linear estimator and the zeroth approximation of the optimal linear estimator weights, Real{w_ole† w^(0) / (‖w_ole‖ ‖w^(0)‖)}, is shown as a function of the system size N. Note that we find the imaginary part of the average overlap to be on the order of 10⁻³ and to decrease rapidly with N (results not shown). Every point on the graph shows the overlap averaged over 400 realizations of the neuronal diversity.
The trial-to-trial fluctuations of the zeroth approximation obey the following statistics (see appendix D),

\langle\langle \langle (\delta\hat\theta)^2 \rangle \rangle\rangle = \frac{a}{2N\kappa\, |\tilde f^{(1)}|^2}   (3.11)

\langle\langle \left( \Delta \langle (\delta\hat\theta)^2 \rangle \right)^2 \rangle\rangle = \frac{\tilde B^{(0)} + \frac{1}{2}\tilde B^{(2)}}{2(N\kappa)^2\, |\tilde f^{(1)}|^4},   (3.12)
where a is the single-cell response variance (see equation 1.14) and B(θ) ≡ C²(θ). Thus, the efficiency of the zeroth-order approximation scales linearly with the size of the system, yielding an estimation error on the order of 1/√N. Figure 7 shows the efficiency of this readout as a function of the population size N; the open circles show the numerical evaluation of the efficiency, and the solid line shows the analytical results of substituting
Figure 7: The efficiency of the zeroth-order approximation of the optimal linear estimator as a function of the population size, N. The efficiency is shown in terms of one over the average quadratic angular estimation error, ⟨(θ̂ − θ)²⟩⁻¹ (in deg⁻²). The analytical results of equations 1.16, 3.10, and 3.11 are shown by the solid line. The open circles show the numerical estimation of the efficiency. For comparison, we present the optimal linear estimator efficiency (dashed line), as calculated numerically. The numerical calculation of the readout efficiencies was done by averaging over 400 different realizations of the neuronal diversity and over 200 trials of simulating the neuronal stochastic responses to stimulus θ = 0 for each realization.
equations 3.11 and 3.10 into equation 1.16. For comparison, the optimal linear estimator efficiency is shown by the dashed line. As can be seen from the figure, the efficiency of the zeroth approximation scales linearly with the size of the system, even in the presence of correlated noise. Nevertheless, its performance is considerably inferior to that of the optimal linear estimator. This result contrasts with the high degree of similarity between w^(0) and w_ole (see Figure 6), emphasizing the importance of the higher-order corrections to the optimal linear estimator readout. We find that by also incorporating the first-order corrections, we can retrieve most of the efficiency of the optimal linear estimator. However, the first-order corrections are nonlocal in space and involve global averages of the neuronal diversity across the entire population (results not shown). On the other hand, the analysis of the
zeroth-order approximation efficiency is sufficient to prove the linear scaling of the optimal linear estimator efficiency with the population size. The problem of fine-tuning the optimal linear estimator weights is addressed in section 4.

4 Summary and Discussion

The efficiency of population codes has been the subject of considerable theoretical effort. Early studies investigated the efficiency of the code using the theoretical concept of the Fisher information and by quantifying the accuracy of simple readout mechanisms (Paradiso, 1988; Seung & Sompolinsky, 1993). Assuming the trial-to-trial fluctuations in the responses of different cells have zero correlation, these studies showed that the coding efficiency of the population grows linearly with the number of cells. Zohary et al. (1994) showed the possible detrimental effect of nonzero cross-correlations on the coding efficiency of the population. On the other hand, differing results were presented by Abbott and Dayan (1999), who claimed that correlations do not have a detrimental effect on the coding efficiency. Abbott and Dayan considered several models for the cross-correlations. One model incorporated a short-range correlation structure. In terms of the current study, this corresponds to scaling the correlation length, ρ, inversely with the number of cells in the population, ρ ∼ 1/N (see equation 1.14); hence, in the large N limit, this model is very similar to a model without correlations. Another interesting model studied by Abbott and Dayan is one with uniform correlations;1 it corresponds to the other extreme of taking the limit of very large correlation length, ρ → ∞ (see equation 1.14). Uniform correlations generate large collective fluctuations in the neural responses. However, these collective fluctuations are limited to the uniform direction, (1, 1, . . . , 1), which contains no information about the stimulus identity. Thus, signal and noise in this model are completely segregated into orthogonal subspaces of the phase space of neural responses. Due to this segregation, one can treat this model as one of an independent population of neurons and obtain qualitatively similar results. Qualitatively different results are obtained in the intermediate regime of ρ = O(1). In this regime, a considerable fraction of neuronal pairs shows significant correlations. Moreover, correlations are stronger for pairs of neurons with closer preferred directions. These properties of the ρ = O(1) regime are in agreement with experimental findings (see,